
Searching for patterns among the dead



We wanted to investigate the data in aggregate. Could the data reveal changes in the wave of violence sweeping the country? Which places are more likely to have victims, and has that changed over time? What is the most common cause of death? As useful as it is, the Quinto Elemento website is too slow and inflexible to allow aggregate searches such as these. Moreover, the raw data itself is in an execrable state, an indication of the utter lack of care on the part of authorities for tackling the problem of so many missing people. To deal with these problems we created an instance of Datasette, a Python program that enables rapid, customizable searching of large databases, as well as visualization of the data. We also applied simple machine learning algorithms to overcome the worst of the data inconsistencies. In another post we'll provide a brief tutorial on these techniques for anyone who wishes to do the same. Here, we'll review the few interesting observations they allowed us to tease out.

Click any graphic to see the SQL query that produced it.

Incomplete data

The data is in such poor shape that it is difficult to believe it was created by professional forensic services. Each entry in the database lacks at least one piece of data, and many have no data at all. Almost 18% of the entries have no date of discovery, 27% don't indicate the sex of the deceased, 55% omit the place of discovery, and 54% don't even show the cause of death, which you would think would be the principal product of a forensic investigation.
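Percentages like these can be computed with a short SQL query against the SQLite file that Datasette serves. The following is a minimal sketch, not the query behind the figures above: the table name `bodies`, the column names and the four sample rows are all invented for illustration, and it assumes that a missing value is stored either as NULL or as a blank string.

```python
import sqlite3

# Hypothetical schema and sample rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bodies (state TEXT, found_date TEXT, sex TEXT, place TEXT, cause TEXT)"
)
conn.executemany(
    "INSERT INTO bodies VALUES (?, ?, ?, ?, ?)",
    [
        ("State A", "2010-05-01", "M", "roadside", "gunshot wound"),
        ("State A", None, "", None, None),
        ("State B", "2012-01-15", None, "", "  "),
        ("State B", "2014-07-30", "F", None, "thoracic trauma"),
    ],
)

ALLOWED_COLUMNS = {"found_date", "sex", "place", "cause"}


def percent_missing(conn, column):
    """Return the percentage of rows where `column` is NULL or blank."""
    # Column names cannot be bound as SQL parameters, so check them
    # against a fixed list before building the query string.
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    query = f"""
        SELECT 100.0 * SUM(CASE WHEN {column} IS NULL OR TRIM({column}) = ''
                                THEN 1 ELSE 0 END) / COUNT(*)
        FROM bodies
    """
    return conn.execute(query).fetchone()[0]


missing = {col: percent_missing(conn, col) for col in sorted(ALLOWED_COLUMNS)}
for col, pct in missing.items():
    print(f"{col}: {pct:.0f}% missing")
```

Treating whitespace-only strings as missing matters with data this untidy, since an empty cell may have been exported as a blank string rather than a true NULL.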

There are many inconsistencies among the data that does exist, especially concerning the cause of death. Some of the causes are veritable narratives...

Others are nothing more than "traffic accident" or "decapitation", as if the forensic exam was keeping an impatient examiner from an important lunch date. Some entries describe the medical cause of death, for example "thoracic trauma"; others describe the implement that caused it, whether "knife", "bullet" or "train".

Wave of violence

Taking the data in this state, we can see that the number of cases of unidentified bodies is increasing every year.

Given the number of mass graves left by the drug war, and the daily toll of shootout victims in Mexico, you would expect the yearly number of unidentified bodies to closely match the number of homicides. It's close, but not quite.

Source: Inegi

There is a dip in the number of homicides between 2011 and 2015 which is not seen in the constant upward trajectory of the unidentified bodies. What's going on? Were narcos getting away with a large number of undetected murders during those years? The data is in such bad shape that it's tempting to throw up our hands and assume we can't draw any conclusion about how it relates to trends in Mexico's wave of violence. But not just yet. Let's see what else Datasette can show us first.

The relationship between unidentified bodies and violence is a little clearer when we divide the data by state.

Here we can see that the states with the largest number of unidentified bodies are the State of Mexico, Baja California (especially Tijuana), Mexico City and Chihuahua, all well known for their narcoviolence. The pattern is not 100% trustworthy. Even though Guanajuato has been among the most violent states in recent years, it actually has relatively few cases of unidentified bodies. Still, there is enough correlation to suggest the historical trend for unidentified bodies should follow the trend for homicide.

Source: Inegi

Machine learning and the cause of death

As mentioned earlier, not all the unidentified bodies died of violence. Searching through the data shows other causes of death, such as traffic accidents, abortions, and health problems. Perhaps these other causes of death are clouding the picture. If we could separate the violent from the non-violent deaths, we might see the trend we would expect. As mentioned above, there is no systematization in the reporting of cause of death. Datasette shows 3,658 different causes listed.

Even among what appears to the untrained eye to be similar causes of death, there are many textual differences which make it impossible to draw aggregate conclusions.

In theory it should be possible to go through the entries and classify each by hand, but this would take hours or days. Instead, we decided to apply a bit of machine learning. We put the database into another Python program called SpaCy, which offers a variety of language-related functions, including identifying parts of speech and finding textual similarity between different documents. It is extremely simple to operate, at least at an introductory level, even for non-machine learning experts. In all, it took us 30 minutes to train SpaCy to distinguish between several broad categories of cause of death: "violence", "health", "accident", "abortion" and "unknown". Words such as "weapon", "hit", or "projectile" suggested a violent cause of death. Expressions like "heart attack" or "hypertension" might suggest a health-related cause of death. In another post we will explain how we used SpaCy for the technically-minded who would like to do the same. For now, it is enough to point out that we did not provide SpaCy with a list of keywords to distinguish among the categories. We only showed it a few examples of each category and let it draw its own conclusions. The process is not exact and we spent very little time refining it. However, plotting only cases categorized as violent by SpaCy...

...we now find the same tendency as the historical trend in violence, with the same decline in the middle of the period, just as we would have expected. There's no great revelation here, but it clearly shows that the machine learning technique was able to separate violent from non-violent deaths in a pool of highly heterogeneous texts, with very little effort on our part.
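To give a feel for the idea of classifying by example rather than by keyword list, here is a minimal sketch. It is not the SpaCy pipeline described above: it swaps in a much cruder technique, word-overlap (bag-of-words) cosine similarity from the standard library, so that it runs without any model download. The labeled examples, the `classify` function and the 0.2 threshold are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical labeled examples; the ones actually used are not listed in this post.
EXAMPLES = {
    "violence": [
        "gunshot wound to the head",
        "stabbed with a knife",
        "hit by firearm projectile",
    ],
    "health": ["heart attack", "hypertension and kidney failure"],
    "accident": ["traffic accident", "run over by a train"],
}


def bag_of_words(text):
    """Count lowercase alphabetic tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, examples=EXAMPLES, threshold=0.2):
    """Return the label of the most similar example, or "unknown"."""
    vec = bag_of_words(text)
    best_label, best_score = "unknown", 0.0
    for label, samples in examples.items():
        for sample in samples:
            score = cosine(vec, bag_of_words(sample))
            if score > best_score:
                best_label, best_score = label, score
    return best_label if best_score >= threshold else "unknown"


for cause in ["multiple gunshot wound in thorax", "acute heart attack", "not determined"]:
    print(cause, "->", classify(cause))
```

Word overlap cannot tell that "bullet" and "projectile" are related, which is exactly the kind of gap a language model's similarity measure is meant to close. The sketch only shows the shape of the approach: a handful of examples per category, a similarity score, and a fallback to "unknown" when nothing is close enough.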

With more effort, it could be used to further categorize types of violent death or other patterns that might be interesting to a researcher. Are certain weapons more common in some states? Can executions be distinguished from improvised assaults? How frequent are cases of dismemberment that we read so much about in the media?

Data quality under the microscope

We can also use Datasette to explore the quality of the data itself. Quinto Elemento has written about a "forensic crisis" in Mexico. Poor forensic data is a symptom of this. But which states are doing the worst job? Using SQL queries we can see the percentage of incomplete data by state. In the following figures, we can see the percentage of entries by state which lack data of different types. The higher the bar, the greater the share of cases missing that type of data.
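A per-state breakdown of this kind comes down to a GROUP BY query. The sketch below is an illustration rather than the query behind the figures: the table `bodies`, its columns, the placeholder state names and the sample rows are all assumptions.

```python
import sqlite3

# Hypothetical table and invented sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bodies (state TEXT, found_year INTEGER, place TEXT)")
conn.executemany(
    "INSERT INTO bodies VALUES (?, ?, ?)",
    [
        ("State A", None, None),
        ("State A", None, "highway"),
        ("State B", 2012, "river bank"),
        ("State B", None, ""),
        ("State B", 2015, "field"),
        ("State B", 2016, None),
    ],
)

# In SQLite a comparison evaluates to 1 or 0, so SUM() counts the matching rows.
QUERY = """
    SELECT state,
           ROUND(100.0 * SUM(found_year IS NULL) / COUNT(*), 1) AS pct_no_year,
           ROUND(100.0 * SUM(place IS NULL OR TRIM(place) = '') / COUNT(*), 1)
               AS pct_no_place
    FROM bodies
    GROUP BY state
    ORDER BY pct_no_year DESC
"""

rows = conn.execute(QUERY).fetchall()
for state, pct_no_year, pct_no_place in rows:
    print(f"{state}: {pct_no_year}% without year, {pct_no_place}% without place")
```

Because Datasette runs on SQLite, a SELECT statement like this one could also be run from its SQL query page, with the table and column names swapped for the real ones.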

The number of entries lacking the year of discovery...

Guanajuato and the State of Mexico did not record the year of discovery for any unidentified bodies.

Place of discovery...

Very few states bothered to record the place of discovery...

Autopsy...

Here is a fundamental inconsistency. More than half of the states report having completed autopsies on 100% of unidentified bodies (states with no bar). However, many of these are the same states that don't record a cause of death.

Among those states reporting causes of death, which did the best job? We'll take the number of characters as an indicator of quality. The longer the description of cause of death, the higher the quality. We'll choose 50 characters as the dividing line between poor and high quality and plot the number of high quality descriptions as a percentage of the total per state.

Oaxaca seems to be the most diligent in describing the cause of death, followed by Nuevo Leon, Coahuila and Puebla.


Our team

Conrad Fox is an award-winning journalist and producer. He has reported from Mexico, Central America and Haiti, with his work appearing on the CBC, NPR, BBC World Service and other outlets. He is a dynamic and creative instructor, and has taught journalism, English, robotics, soccer, physics and anthropology in classrooms, lecture halls, boardrooms, mud huts, playing fields and online. He is also a developer, specialized in Python and the web.

Rosi Rodriguez is a social anthropologist with specializations in education, migration and the environment. She has spent more than 30 years involved in education of children and adults, in formal and informal settings. She was co-founder of the first indigenous-led community development organization in Southern Veracruz, bringing together dozens of communities through workshops, events and educational activities. She is still happiest seated in front of a palapa with a group of neighbours, sharing thoughts and experiences.