Last month, a team from Google Research published a paper on the results of a field test of a novel deep-learning model to detect diabetic retinopathy from images of patients’ eyes. The paper, titled “A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy,” is based on a partnership with the Ministry of Public Health in Thailand, which enabled field research in 11 rural clinics across the provinces of Pathum Thani and Chiang Mai.
TechCrunch wasted no time in summarizing the study: “Google medical researchers humbled when AI screening tool falls short in real-life testing.” The article goes on to summarize the failures of the system in practice — from the lack of dedicated screening rooms that could be darkened to take high-quality images, to inconsistent broadband connectivity, to patients’ concerns about having to follow up at a hospital. But I believe this coverage misses the mark on three important points, which should be of prime concern to people actually working to deploy medical AI in the field.
Research Is Not Engineering
First, there is a difference between research and engineering, and research studies like this one should be heralded for the progress they enable. According to Google Health, “This is one of the first published studies examining how a deep learning system is used in patient care.” We need more studies about medical AI deployments published at the Conference on Human Factors in Computing Systems — and these studies need to describe things the way they are. Unlike a startup going to market that must spin whatever happens as a success story, research work is only about uncovering the truth.
Implying that such studies are failures not only misrepresents their goal and achievement but also contributes to the issues of nonreproducible research and “science by press release” that plague today’s science. If you’re one of the many people trying to apply deep learning for medical imaging in practice, then you’ll find this paper to be a gem.
AI Success = Science + Engineering + Process Change
Second, there must be an understanding of what it takes to get an AI system from idea to production. Assuming that a basic scientific breakthrough makes a system ready for wide use would have caused the invention of the steam engine to receive press coverage like this: “Scientists humbled to find we’re nowhere near a robust national railway system.” This is how cars were originally covered in the media, so there’s nothing new under the sun with this happening again with AI.
Extending the analogy of cars, here are the three workstreams that must come together for medical AI systems to become an effective everyday reality:
1. Science: We need to develop highly accurate data science algorithms for specific problems, as Google did with its original deep learning models for detecting diabetic retinopathy. In the analogy to cars, this would be like the invention of the internal combustion engine.
2. Engineering: We need to develop ways to productize these inventions at high quality, high scale, safely and cheaply. In the analogy to cars, we need to invent the equivalents of the mass production line, hand brakes, electric starters, air conditioners, airbags and headrests. In the AI space, think MLOps, explainability, bias detection and model governance (as a start). This is the area of the ecosystem where I personally work and specialize.
3. Process change: We need to develop the human-centered processes that enable people to use these innovations effectively and safely. In the analogy to cars, think splitting the public space between roads and sidewalks, establishing driver licensing, public education, safety standards and pollution regulation. In medical AI, we’ve barely started on this, which makes the recent Google field study an important baby step.
It’s important for practitioners to know that real success — helping real patients, in the field, at scale, safely — requires all three of these aspects to work together. It’s important for media coverage to educate people about this.
Once More With Feeling: Health Care AI Models Do Not Generalize
The third insight from this new study stems from the major differences among the 11 clinics that took part in it — from how the physical rooms at each clinic were laid out to the personalities and backgrounds of the nurses who worked there. As a result, the trained model could not operate successfully in each of these distinct environments.
This is such a well-known phenomenon in medical AI that it no longer requires academic validation. Medical AI models generally perform poorly across locations. This not only applies to models deployed in Thailand versus Nigeria but also models deployed in two clinics that are 5 kilometers apart and serve essentially the same population. This happens in both first-world and third-world countries and across just about every medical specialty that’s taken the time to measure it.
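Measuring this starts with refusing to pool results across sites. As a minimal, hypothetical sketch (the record format and site identifiers are my own assumptions, not anything from the Google study), here is how one might compute sensitivity and specificity per clinic, so that site-level gaps become visible instead of being averaged away:

```python
from collections import defaultdict

def per_site_metrics(records):
    """records: iterable of (site_id, y_true, y_pred) with binary labels.
    Returns {site_id: (sensitivity, specificity)}, one pair per clinic,
    so a model that works at site A but fails at site B can't hide
    behind a single pooled accuracy number."""
    tallies = defaultdict(lambda: [0, 0, 0, 0])  # TP, FN, TN, FP per site
    for site, y_true, y_pred in records:
        t = tallies[site]
        if y_true == 1:
            t[0 if y_pred == 1 else 1] += 1   # TP or FN
        else:
            t[2 if y_pred == 0 else 3] += 1   # TN or FP
    out = {}
    for site, (tp, fn, tn, fp) in tallies.items():
        sens = tp / (tp + fn) if tp + fn else float("nan")
        spec = tn / (tn + fp) if tn + fp else float("nan")
        out[site] = (sens, spec)
    return out
```

The design choice here is simply to key every tally by clinic; any per-group slicing of standard metrics accomplishes the same thing.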
As a result: If you have a successfully deployed model in one location (or 10), you do not have an accurate model that’s ready for the next clinic. Continuously tuning and monitoring AI models is part of the engineering work underway in the “Science + Engineering + Process Change” trifecta. At this point in time, I expect every sound medical AI field deployment to be addressing this issue.
Turning medical AI from aspiration into a reality that improves humanity’s well-being is going to be a long ride. It will take us all of the first half of the 21st century — and that’s if we’re efficient about it. Maybe this isn’t original, but it may be the adventure of a generation.