Testing AI in medicine is a mess. Here's how it should be done

Hundreds of medical algorithms have been approved on the basis of limited clinical data. Scientists are debating who should test these tools and how best to do it.

When Devin Singh was a paediatric resident, he attended to a young child who had gone into cardiac arrest in the emergency department after a prolonged wait to see a doctor. “I remember doing CPR on this patient and feeling that kiddo slip away,” he says. Devastated by the child’s death, Singh remembers wondering whether a shorter waiting time could have prevented it.

The incident convinced him to combine his paediatric expertise with his other speciality — computer science — to see whether artificial intelligence (AI) might help to cut waiting times. Using emergency-department triage data from the Hospital for Sick Children (SickKids) in Toronto, Canada, where Singh currently works, he and his colleagues built a collection of AI models that provide potential diagnoses and indicate which tests will probably be required. “If we can predict, for example, that a patient has a high likelihood of appendicitis and needs an abdominal ultrasound, we can automate ordering that test almost instantly after a patient arrives, rather than having them wait 6–10 hours to see a doctor,” he says.
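To make the idea concrete, here is a minimal, hypothetical sketch of how such a system might work: one binary classifier per test type, trained on retrospective triage records, that auto-orders a test only when its predicted probability clears a strict threshold. Everything here (the feature set, the synthetic data, the threshold value) is an assumption for illustration, not the SickKids team's actual implementation.

```python
# Hypothetical sketch (not the published SickKids pipeline): a per-test
# classifier trained on retrospective triage data, with a high-precision
# threshold so only confident predictions trigger automatic test ordering.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in triage features: e.g. age, heart rate, temperature, pain score,
# coded chief complaint. Real models would use far richer inputs.
X = rng.normal(size=(5000, 5))
# Synthetic label: "an abdominal ultrasound was eventually ordered".
y = (X[:, 3] + 0.5 * X[:, 4] + rng.normal(scale=0.5, size=5000) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)

# Auto-order only when the predicted probability clears a strict threshold;
# everyone else waits for clinician review as usual.
THRESHOLD = 0.9  # assumed value; in practice tuned on a validation set
proba = model.predict_proba(X_test)[:, 1]
auto_order = proba >= THRESHOLD
print(f"Auto-ordered for {auto_order.mean():.1%} of test-set visits")
```

In practice, the threshold would be tuned on held-out data to keep false auto-orders rare, because the clinical cost of an unnecessary test differs from that of a missed one.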

A study using retrospective data from more than 77,000 emergency-department visits to SickKids suggested that these models would expedite care for 22.3% of visits, speeding up results by nearly 3 hours for each person requiring medical tests [1]. The success of an AI algorithm in a study such as this, however, is only the first step in verifying whether such an intervention would help people in real life.

Properly testing AI systems for use in a medical setting is a complex, multiphase process. But relatively few developers are publishing the results of such analyses. Only 65 randomized controlled trials of AI interventions were published between 2020 and 2022, a review shows [2]. Meanwhile, regulators such as the US Food and Drug Administration (FDA) have approved hundreds of AI-powered medical devices for use in hospitals and clinics.

“Health-care organizations are seeing many approved devices that don’t have clinical validation,” says David Ouyang, a cardiologist at Cedars-Sinai Medical Center in Los Angeles, California. Some hospitals opt to test such equipment themselves.

And although researchers know what an ideal clinical trial for an AI-based intervention should look like [3], in practice, testing these technologies is challenging. Implementation depends on how well health-care professionals interact with the algorithms: a perfectly good tool will fail if humans ignore its suggestions. AI programs can be particularly sensitive to differences between the populations whose data they were trained on and the ones they're aiming to help. Moreover, it's not yet clear how best to inform patients and their families about these technologies and ask for their consent to use their data for testing the devices.

Some hospitals and health-care systems are experimenting with ways to use and evaluate AI systems in medicine. And as more AI tools and companies enter the market, groups are coming together to seek consensus on which kinds of assessment work best and provide the most rigour.

Who is testing medical AI systems?
