Attention conservation notice: These are just loose thoughts on a topic that I’ve been thinking about for a few years. It’s important to point out that I’m far from an expert in either machine learning or medicine. Also, I’m not a doctor.
When I was applying to med school four years ago, I often wondered why machine learning wasn’t used more commonly in medicine. I thought to myself, if I can get a statistical recommendation of what movie I will most likely enjoy from Netflix, why can’t I get a statistical recommendation for what treatment will most likely help me at the doctor?
As I went through the first two years of med school and memorized countless facts about physiology and pathophysiology, I occasionally did some more research on the topic. For example, when we were learning to interpret ECGs, I read about trials showing that the best computer programs for ECG interpretation nearly matched the accuracy of cardiologists. And after we learned about all of the antibiotics with which you treat each of the major infections, I read about trials showing that a computer-based system outperformed experts in choosing an antimicrobial to treat meningitis.
Lest you think that these are new capacities that medicine hasn’t had a chance to catch up with yet, let’s be clear about the publication dates on those articles: the ECG comparison was published in 1991, and the infectious disease treatment comparison was published in 1979. That’s long enough for one of the authors of the latter paper, Robert Blum, to spend a whole career practicing emergency medicine.
Indeed, over the past 20-40 years, due to improvements in computing and statistics, our capacity to do accurate machine learning seems to have gotten way, way better. Computers now beat humans at checkers and chess. Computers can drive cars in six U.S. states. As a sign of the times, software recruiting agencies (who would prefer candidates with a more diverse set of interests) report that there is an “epidemic” of interest in machine learning among college graduates.
So why do healthcare professionals still get paid in part for their ability to look at an image and make a judgment call about what is going on in it, or do one of the other countless things in medicine that could theoretically also be done by a computer?
A cynical explanation is that healthcare providers such as doctors are blocking the application of machine learning to medicine out of self-interest. But for a number of reasons, I think that’s misguided. Medicine is not perfect, but I think almost all healthcare providers really do want to give better healthcare to their patients. To wit, many of the articles written about new types of machine learning in medicine are written by healthcare providers who would like to see improvements in clinical practice. Moreover, healthcare providers don’t ultimately call the shots — hospital and reimbursement administrators do. And it seems that most of them would be quite happy — for good reason — to find ways to reliably decrease cost while maintaining quality of care, even if it required decreasing the hours of, or even firing, some employees.
Instead, I think the main reason that machine learning hasn’t been applied very broadly in medicine is one that I had a vague feeling about but couldn’t really articulate well until I read Sculley et al.’s 2015 paper, “Hidden Technical Debt in Machine Learning Systems.”
Technical debt is a wonderful phrase that nicely conveys the idea that in software development you can often make quick progress by cutting corners to ship your code (for example, ignoring edge cases or skipping exhaustive unit tests), and that these shortcuts will eventually need to be paid off in more hours of development down the road.
In their article, Sculley et al. describe how technical debt is particularly pernicious in machine learning because, in addition to the software side, which can decay as in any system, machine learning also has a data side on which the models are trained. This data, too, is subject to decay and a host of other problems as time goes on. The authors point out a number of specific ways in which machine learning algorithms can decay over time; here are the four that I think are most relevant to medicine:
- Entanglement: This refers to the phenomenon of a shift in one input variable being liable to affect other variables in the system. Since no two input variables are ever truly independent, changing the distribution of one variable is likely to dramatically alter the predictive ability of a pre-specified machine learning algorithm. In medicine, this is a huge problem, because disease dynamics and correlations between medical variables can change rapidly. To take infectious disease as an obvious example, consider the recent increase in chikungunya virus in the U.S. Or consider the impact of a particularly severe flu season, with or without an effective vaccine. And as treatments, diagnosis rates, or risk factors for chronic diseases change, their distributions are also liable to become radically different over short timeframes. Because nearly everything in medicine is correlated with everything else, the CACE principle of Sculley et al., “Changing Anything Changes Everything,” very much applies to any machine learning algorithm in medicine.
- Unstable Data Dependencies. The input data sources for any machine learning algorithm in medicine are likely to be messy and capricious. An important example of this is electronic medical records, especially the notes that healthcare providers write about patient encounters. These notes are highly unstructured and nearly impossible to parse, which is why it is a common summer research project for med students to read through a bunch of charts, extract relevant data from each patient, and put it into a spreadsheet. Any NLP approach here will need to be updated for different locations, times, and healthcare providers. And this problem of messy, unstable data is likely to remain difficult for a long time, because the strong privacy requirements around health data make it less likely to be open and thus standardized.
- Feedback Loops. Let’s imagine that a machine learning algorithm started to be used for diagnosis or management of a set of patients. How would this change the input data streams and/or how would it change the output of other machine learning algorithms in medicine? Theoretically, this could and probably should be addressed via a simulation-based “sandbox” that contains the full ecosystem of algorithms in a given health system along with simulated patient encounters. I don’t know of any such system, however.
- Abstraction Debt. The lack of abstract ways of understanding the operations of most machine learning algorithms is striking. Indeed, many machine learning competitions are won by teams that use ensembles of methods, none of which can be understood well even individually. Especially in the transition period when machine learning methods become more commonly used in medicine, it’s going to be crucial to figure out useful abstractions so that the people actually applying the algorithms to individual patients can understand them, and can thus make sure they are working as expected and debug any problems.
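To make the entanglement point concrete, here is a toy sketch (pure NumPy, with a fabricated “flu” cohort; none of the numbers are clinical). A simple logistic-regression model is trained in a world where cough is rare outside of flu; then the distribution of that single input shifts, and the unchanged model quietly loses accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_cohort(n, p_flu, p_cough_given_no_flu):
    """Fabricated toy cohort: flu status drives both temperature and cough."""
    flu = rng.random(n) < p_flu
    # Temperatures overlap between groups, so the model leans on cough too.
    temp = np.where(flu, rng.normal(37.9, 0.6, n), rng.normal(37.2, 0.6, n))
    cough = np.where(flu, rng.random(n) < 0.9, rng.random(n) < p_cough_given_no_flu)
    X = np.column_stack([temp - 37.0, cough.astype(float)])
    return X, flu.astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.3, steps=5000):
    """Logistic regression by plain gradient descent (no ML library needed)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        w -= lr * Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
    return w

def accuracy(w, X, y):
    Xb = np.column_stack([np.ones(len(X)), X])
    return float(np.mean((sigmoid(Xb @ w) > 0.5) == (y == 1)))

# Train while cough is rare outside of flu...
X_train, y_train = make_cohort(5000, p_flu=0.3, p_cough_given_no_flu=0.2)
w = fit_logreg(X_train, y_train)

# ...then evaluate after one input's distribution shifts (say, allergy season
# makes cough common in non-flu patients). The model itself is untouched.
X_iid, y_iid = make_cohort(5000, p_flu=0.3, p_cough_given_no_flu=0.2)
X_shift, y_shift = make_cohort(5000, p_flu=0.3, p_cough_given_no_flu=0.8)

acc_iid = accuracy(w, X_iid, y_iid)
acc_shift = accuracy(w, X_shift, y_shift)
print(f"accuracy, same distribution: {acc_iid:.3f}")
print(f"accuracy, after cough shift: {acc_shift:.3f}")
```

Note that in a deployed system the drop would be silent: no code changed, and only ongoing validation against freshly sampled data would catch it.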
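And to give a flavor of why those chart-review projects are hard to automate, here is a minimal sketch of the brittle pattern-matching that note extraction usually involves. The note text, field names, and regular expressions below are all hypothetical; a different hospital’s (or even a different author’s) abbreviations would break them:

```python
import re

# A fabricated note snippet; real notes are far messier and less consistent.
note = (
    "62 yo M w/ HTN, DM2. BP 142/88, HR 76. "
    "EF 45% on echo. Started lisinopril 10 mg daily."
)

# One hand-written pattern per field of interest.
patterns = {
    "systolic_bp": r"BP\s+(\d{2,3})/\d{2,3}",
    "diastolic_bp": r"BP\s+\d{2,3}/(\d{2,3})",
    "ejection_fraction": r"EF\s+(\d{1,3})\s*%",
}

def extract(note_text):
    """Pull numeric fields out of free text; None when a pattern misses."""
    out = {}
    for field, pat in patterns.items():
        m = re.search(pat, note_text)
        out[field] = int(m.group(1)) if m else None
    return out

print(extract(note))
# {'systolic_bp': 142, 'diastolic_bp': 88, 'ejection_fraction': 45}
```

Every new site, author, or documentation habit means another round of patterns to write and maintain, which is exactly the unstable-dependency problem.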
None of these four factors has been robustly addressed by any of the machine learning approaches to medicine that I’ve seen. And what type of system would solve them? For starters, here’s a quick wish list of features for a robust and healthy machine learning algorithm that would be used in medicine:
1. Consistently higher sensitivity and/or specificity than a realistic, collaborating team of humans.
2. Validated to work on novel, quality-controlled, gold-standard data sets sampled from current patient populations every so often (e.g., every year).
3. Stably accurate over a number of years, both retrospectively and prospectively.
4. Methods published in the open and peer-reviewable. To me, it doesn’t have to be published in a journal; a paper on a pre-print server would work too. The key is to allow for post-publication peer review.
5. Open-source, in part so that it can be picked apart for possibly unstable predictor variables.
6. Alongside being open-source, its results must be reproducible. For example, this means that methods which rely on random seeds must use enough samples that the starting point is irrelevant.
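On the last item, here is a minimal sketch of what seed-robust reporting could look like; `noisy_evaluation` is a hypothetical stand-in for any pipeline whose measured result jitters with its random seed (initialization, sampling, shuffling):

```python
import random
import statistics

def noisy_evaluation(seed):
    """Stand-in for a seed-dependent pipeline: a made-up accuracy with jitter."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0.0, 0.01)

# Reporting a single seed can hide the variability entirely...
single = noisy_evaluation(seed=42)

# ...so run enough seeds that the starting point no longer matters,
# and report the spread alongside the mean.
scores = [noisy_evaluation(seed=s) for s in range(30)]
mean, sd = statistics.mean(scores), statistics.stdev(scores)
print(f"single seed: {single:.3f}")
print(f"30 seeds: {mean:.3f} +/- {sd:.3f}")
```

Reporting the mean and spread over many seeds, rather than one lucky run, is what lets someone else reproduce the result.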
Although none of these wishlist items are easy or common, to me, the hardest part of this is likely #2: the data collection and validation. How are we going to assemble high-quality, pseudo-anonymized, gold-standard data sets that can be used by, at a minimum, authorized researchers? And how can we make sure that this happens year-in and year-out? That won’t be easy.
That said, long-term, I’m quite bullish on machine learning in medical diagnosis and management. I think it will happen, and I think it will be a great advance that improves well-being and saves lives. Indeed, as machine learning’s technical debt problems are adequately addressed in other domains, societal confidence in, and motivation for, applying it to medicine will grow. Face recognition systems (e.g., on Facebook) and self-driving cars both come to mind here.
To summarize, machine learning has long had a lot of potential in medicine. But in order for it to be really utilized, we’ll need to develop much better data sets and algorithms that can overcome the problems of technical debt that would otherwise accumulate. Until then, I guess we med students will just have to keep on memorizing countless facts for our exams.
Footnotes:
1. Though to be fair, since I’m in med school, I’m obviously biased on this point. I feel like I’m adjusting for this, but maybe I’m not adjusting enough.
2. Because if your ML algorithm can’t beat humans, then what’s the point?
Sculley D, Holt G, Golovin D, et al. Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems. 2015.
Willems JL, Abreu-Lima C, Arnaud P, et al. The diagnostic performance of computer programs for the interpretation of electrocardiograms. N Engl J Med. 1991;325(25):1767-73.
Yu VL, Fagan LM, Wraith SM, et al. Antimicrobial selection by a computer. A blinded evaluation by infectious diseases experts. JAMA. 1979;242(12):1279-82.