An artificial intelligence (AI) tool can convert the text of doctors’ notes summarizing patients’ hospital visits into accurate lay language, a new study found.
The research focuses on discharge notes used to capture patients’ health status in the medical record as they are discharged from the hospital. Effective summaries are essential for patient safety during these transitions in care, but they are filled with technical language and abbreviations that are hard to understand and that increase patient anxiety, say the study authors.
To address the problem, NYU Langone Health has been testing the capabilities of generative AI, which develops likely options for the next word in any sentence based on how billions of people use words in context on the internet. A result of this next-word prediction is that generative AI chatbots have become good at replying to questions in realistic, simple language and at producing clear summaries of complex texts. However, AI programs, which work based on probabilities instead of actually thinking, may produce inaccurate summaries, and are therefore meant to assist, not replace, human providers.
To explore generative AI, NYU Langone in March 2023 received access to GPT-4, the latest tool from OpenAI, the company that created the famous ChatGPT chatbot. NYU Langone licensed one of the first “private instances” of the tool, which allowed hundreds of its frontline clinicians to experiment with AI-based solutions to clinical problems using real patient data, while still adhering to federal standards that protect patient privacy.
One of the first studies by researchers using GPT-4 looked at how well the tool could convert the text in 50 patient discharge notes into patient-friendly language. Specifically, running discharge notes through generative AI dropped the reports from an 11th-grade reading level on average to a 6th-grade level, the gold standard for patient education materials.
The team also ranked the AI discharge report translations using the Patient Education Materials Assessment Tool (PEMAT), which generates a percentage score based on 19 factors that represent the ability of patients to understand any piece of reading material. GPT-4 translation raised PEMAT understandability scores to 81 percent, up from the 13 percent score achieved by the original doctor-written discharge reports from the medical record.
The research team designed the study to look at AI performance by itself as a scientific question: How far could it go independently when translating discharge reports?
“GPT-4 worked well alone, with some gaps in accuracy and completeness, but did more than well enough to be highly effective when combined with physician oversight, the way it would be used in the real world,” said senior study author Jonah Feldman, MD, medical director of clinical transformation and informatics within NYU Langone’s Medical Center Information Technology (MCIT) Department of Health Informatics. “One focus of the study was on how much work physicians must do to oversee the tool, and the answer is very little. Such tools could reduce patient anxiety even as they save each provider hours each week in medical paperwork, a major source of burnout.”
To measure the accuracy of the AI tool’s translations, the authors also asked two physicians to rate the AI-generated discharge summaries on a six-point scale. The reviewing physicians awarded 54 percent of the AI-generated discharge notes the best-possible accuracy rating. They also found that 56 percent of notes created by AI were entirely complete. These results must be considered in context, say the authors. For instance, they say, the results signify that even at the current performance level, providers would not have to make a single change in more than half of the AI summaries reviewed.
Dr. Feldman notes that generative AI tools are sensitive, and asking a question of the tool in two subtly different ways may yield divergent answers. The skill required to frame the questions asked of chatbots in a way that elicits the desired response, called prompt engineering, combines intuition and experimentation. Physicians and nurses, with their deep understanding of individual cases and nuanced medical contexts, are best positioned to engineer prompts, say the authors, and they can do this without learning to write computer code.
Within weeks, the research team intends to launch a program asking patients waiting to be discharged whether AI-generated reports are clear and helpful after physician review. By the summer, the team expects to launch a pilot program to provide lay language discharge summaries that have been generated by GPT-4 and reviewed by physicians to patients on a larger scale.
“Having more than half of the AI reports generated being accurate and complete is an amazing start,” said first study author Jonah Zaretsky, MD, associate chief of medicine at NYU Langone Hospital—Brooklyn. “Even at the current level of performance, which we expect to improve shortly, the scores achieved by the AI tool suggest that it can be taught to recognize subtleties.”
Along with Dr. Feldman and Dr. Zaretsky, NYU Langone study authors were Jonathan S. Austrian, MD, and , from the MCIT Department of Health Informatics; Saul B. Blecker, MD, from the Departments of and ; Yunan Zhao, from the Department of Population Health; Jeong Min Kim, MD, and Samuel Baskharoun, MD, from the at NYU Grossman Long Island School of Medicine; and Ravi Gupta, MD, from Long Island Community Hospital, which is affiliated with NYU Langone.
Media Inquiries
Greg Williams
Phone: 212-404-3500
Gregory.Williams@NYULangone.org