This post is about what can’t be found.
It Started with A Stroke
Someone close to me recently suffered a traumatic brain injury – a series of them. First a stroke, then a seizure cut quality of life down to a fraction of what it was only months before. I wanted to hunt down data that might tell me what to look for, what tests matter, and what factors and features predict recovery or relapse. And I hit a wall.
The Raw Data Isn’t There
Having worked in clinical research before, I knew that I had to see the data and work with it. Frankly, no one should trust the published numbers without confirming how the analysis was run and what sample it was run on.
The more I read, the more I realized a series of joins was going to be necessary. There simply isn’t a single dataset that covers the factors I need. It also makes sense to test any predictive statistic from one set (likelihood of stroke in set A) against the morbidity targets of another (did set B have a stroke or not?). And I ran into a jam.
The data isn’t available. I wasted $70 US on some crappy p-values and carbon-copy methods sections (they weren’t using innovative procedures, just running the tests I needed to look at). This is manufactured ignorance in the science community. Reproduction is vital to scientific integrity, and that starts with at least being able to come up with the same graph from the same data. There is no way to really know how much finessing some grad student did to produce a graph, or whether peer reviewers actually took the time to produce pivot tables from the right sample populations or review any models used in assessment or bias checks. That’s crap. How bad is the problem?
It’s Bad. Real Bad
To find out just how much data isn’t available, I turned to the sidebar of Mendeley.com. If you haven’t checked it out, I strongly suggest you do. They are part of Elsevier, a paywalled research publisher (where I wasted all that money), but they are helping pioneer data ethics and repeatable analyses in research by guilt-tripping scientists into sharing their tidy data for brownie points.
With our developer access granted, let’s take a quick peek at where the information is rendered normally.
Mendeley offers a rather attractive search page and their database lookups are fast:
I searched for “stroke prediction adult” and received this:
Sorting on data repositories yields everything from PDFs of the article to pictures of tables. That number is useless, despite the name. Tabular datasets are nothing but the cex or similarly formatted tables in the paper. Not the raw data. The raw data is key if we are to (1) repeat the analysis for validity testing and (2) combine the data with other datasets for group testing and comparison. You cannot easily compare summary findings from one paper to another, as the models and analyses are often very different. The tables are effectively useless for machine learning downstream.
The only filter that regularly provides some sort of data or repository access portal is “File Set”, and that’s not perfect. Even assuming 100% of file sets are in fact the full data used in the presented paper, only 1112 of the 57114 published works pulled in this search (~2%) likely have their datasets available for review on Mendeley.
Most academic research done at United States universities and published in journals is in some part funded by the government. We paid for the raw data and don’t get access to it.
Can you email the authors and ask for the data? Sure, but they may not give it to you, for whatever reason they come up with or because the data is lost, buried, crap, or holds findings they are trying to commercialize. Is there more data available in repositories? Absolutely, but that doesn’t excuse authors for not publishing their subset, pull method, and code used in analysis. I mention subsets here specifically because in pulling data from an API or merging data from multiple surveys, one of the easiest ways to introduce an error in downstream computation is to mess up the join, search, or renaming. With the subsets in hand, the community could try to validate that the information was constructed correctly.
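For anyone who wants to check the arithmetic, here is a minimal sketch of the ratio using the two counts from the search above. The function name `share_with_data` is my own for illustration, not anything Mendeley provides.

```python
def share_with_data(file_sets: int, total_results: int) -> float:
    """Return the fraction of search results that are tagged as a file set."""
    if total_results <= 0:
        raise ValueError("total_results must be positive")
    return file_sets / total_results


# Counts reported by the "stroke prediction adult" search described above.
share = share_with_data(1112, 57114)
print(f"{share:.2%}")  # prints 1.95%
```

This is an upper bound on availability, since it assumes every file set really is the full dataset behind its paper.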
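To make the join worry concrete, here is a rough sketch of the kind of key check a published subset would allow. It is not anyone’s actual pipeline: the function `check_join_keys` and the patient identifiers are hypothetical, made up purely for illustration.

```python
from collections import Counter


def check_join_keys(left_keys, right_keys):
    """Summarize key problems that commonly corrupt a join between two tables."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    return {
        # Duplicated keys silently multiply rows in a join.
        "duplicated_in_left": sorted(k for k, n in left_counts.items() if n > 1),
        "duplicated_in_right": sorted(k for k, n in right_counts.items() if n > 1),
        # Unmatched keys are silently dropped by an inner join.
        "only_in_left": sorted(set(left_counts) - set(right_counts)),
        "only_in_right": sorted(set(right_counts) - set(left_counts)),
    }


# Hypothetical identifiers from two sources that are meant to be merged.
survey_ids = ["p01", "p02", "p03", "p03"]
outcome_ids = ["p02", "p03", "p04"]

report = check_join_keys(survey_ids, outcome_ids)
for issue, keys in report.items():
    print(issue, keys)
```

Without the authors’ subset there is nothing to run a check like this against, which is the whole problem.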
Next Step - Collection
In the next article, we’ll build the fetch class. To help key in on searches for the query, we can visit trends.google.com and use keywords from the top 100 causes of morbidity in the US. Trends will spit back the most common queries, I hope, which can be passed to Mendeley’s catalog lookup and downloaded. That’ll be the collection pipeline v1 for available medical knowledge. Time to get started!
Any tricks, tips, or want to help? Bug me on the About page.