The promise and pitfalls of large language models in clinical applications

Since the unveiling of ChatGPT almost a year ago, the health care sector has been abuzz with anticipation and intrigue surrounding the capabilities and potential applications of large language models (LLMs) within the medical sphere. As senior health care executives know all too well, our industry is no stranger to technological advances. But as with every promising tool, it’s essential to recognize its inherent limitations—especially when patient care and safety are at stake.

The appeal of LLMs is understandable: they can rapidly ingest massive datasets like electronic health records (EHRs) and generate human-like text summarizing patient information. This has led many to contemplate using LLMs for clinical documentation, prior authorization, predictive analytics, and even diagnostic assistance. ChatGPT’s launch accelerated this hype, with health care executives racing to pilot these tools.

Risks of bias and inaccuracy loom large

Yet tempering this enthusiasm is a consensus among leaders that these models are not without their flaws. Central to these concerns are the tendency of LLMs to “hallucinate” (that is, to produce factually incorrect or irrelevant information) and their vulnerability to bias. Without the ability to distinguish fact from fiction, LLMs risk conveying harmful misinformation to clinicians.

Bias in health care is a multi-faceted problem, and introducing a tool that’s inherently prone to replicating (or even amplifying) these biases is a significant concern. The initial thought might be to feed LLMs a vast amount of EHR data to improve accuracy. However, this strategy is fraught with pitfalls. EHRs, while a treasure trove of patient data, are often laden with biases themselves. If EHRs serve as the primary training data, the LLM could propagate those biases, leading to skewed or inaccurate suggestions and analyses.

Moreover, the black-box nature of LLM outputs exacerbates this problem. If a clinician receives a recommendation from an LLM, they cannot easily trace it back to the underlying data or reasoning. This lack of transparency is problematic, particularly when making crucial clinical decisions.

Integrating relevancy filters to refine outputs

However, not all hope is lost. One emerging solution is the integration of a clinical relevancy engine with LLMs. By pairing the vast knowledge capacity of an LLM with a system that filters outputs based on clinical relevance, we can harness the power of these models more safely and effectively. Take, for instance, a scenario where a clinician uses an LLM to scan a patient’s chart for symptoms related to left-sided chest pain. Without a relevancy filter, the model might produce a plethora of unrelated findings, muddying the clinician’s assessment. But with a clinical relevancy engine, these findings would be refined to only the most pertinent symptoms, enhancing both the speed and accuracy of the clinician’s analysis.
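To make the chest pain scenario concrete, here is a minimal sketch of what a relevancy filter might look like in Python. Everything in it is an assumption made for illustration: the `Finding` type, the `RELEVANT_CONCEPTS` map, and the `filter_relevant` function are hypothetical names, and the concept list is a hand-written placeholder rather than clinically validated content. A real clinical relevancy engine would draw on a curated medical knowledge base, not a small dictionary.

```python
from dataclasses import dataclass

# Hypothetical, hand-written relevance map for illustration only.
# It is NOT clinically validated; a real engine would use curated content.
RELEVANT_CONCEPTS = {
    "left-sided chest pain": {
        "shortness of breath",
        "diaphoresis",
        "nausea",
        "palpitations",
    },
}


@dataclass(frozen=True)
class Finding:
    """One item an LLM extracted from the chart (hypothetical structure)."""
    concept: str
    source: str  # where in the chart the finding was found


def filter_relevant(findings, presenting_problem):
    """Keep only findings whose concept is mapped to the presenting problem."""
    allowed = RELEVANT_CONCEPTS.get(presenting_problem.lower(), set())
    return [f for f in findings if f.concept.lower() in allowed]


# Example chart findings, as an unfiltered LLM scan might return them.
chart_findings = [
    Finding("Shortness of breath", "ED triage note"),
    Finding("Ingrown toenail", "podiatry visit"),
    Finding("Diaphoresis", "nursing note"),
    Finding("Seasonal allergies", "problem list"),
]

relevant = filter_relevant(chart_findings, "Left-sided chest pain")
for finding in relevant:
    print(f"{finding.concept} ({finding.source})")
```

In this sketch the unrelated findings (the toenail and the allergies) are dropped, and an unrecognized presenting problem returns an empty list rather than a guess. That choice is deliberate: when the filter has no knowledge of a problem, showing nothing is safer than showing everything.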

Concerns around compliance and patient safety

The growth in interest in LLMs within the medical community is undeniable. However, data indicating instances of hallucinations, biases, and other inaccuracies have been a source of concern. Especially in the realm of medical compliance, where precision is paramount, there’s unease about LLMs’ potential for error. LLMs often summarize medical charts inconsistently, omitting critical details. Generative AI cannot reliably identify concepts like diagnoses or quality measures. Summaries produced by LLMs, if inaccurate or incomplete, can pose significant risks, both in terms of patient care and compliance. In this domain, even a 90% accuracy rate might be deemed insufficient.

The future role of LLMs in clinical settings

So where does this leave the future of LLMs in health care? The key is developing structured training frameworks that instill clinical relevance and accountability. LLMs must be taught to associate EHR concepts with validated medical ontologies and terminologies. Natural language capabilities should be paired with the ability to reference authoritative coded data sources.

Advanced techniques can also enhance transparency. For instance, LLMs could highlight which sections of the medical chart informed their generated text. Confidence scores would help clinicians weigh the reliability of AI-generated insights.
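As a rough illustration of how source highlighting and confidence scores might reach a clinician, consider the following Python sketch. It assumes that each generated statement arrives with a list of chart sections and a confidence value between 0 and 1; the `SummaryStatement` type, the `needs_review` and `render` functions, and the 0.8 threshold are all hypothetical, not features of any particular product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SummaryStatement:
    """One sentence of an AI-generated summary (hypothetical structure)."""
    text: str
    source_sections: List[str]  # chart sections that informed the sentence
    confidence: float           # assumed to be a score between 0 and 1


def needs_review(stmt, threshold=0.8):
    """Flag statements with no traceable source or a low confidence score.

    The 0.8 default is an arbitrary placeholder, not a recommended value.
    """
    return not stmt.source_sections or stmt.confidence < threshold


def render(stmt, threshold=0.8):
    """Format a statement with its provenance and a review flag."""
    sources = ", ".join(stmt.source_sections) or "no source found"
    flag = "REVIEW" if needs_review(stmt, threshold) else "ok"
    return f"[{flag}] {stmt.text} (confidence {stmt.confidence:.2f}; from: {sources})"


statements = [
    SummaryStatement("Patient reports chest pain for two days.", ["HPI"], 0.93),
    SummaryStatement("Patient has a history of hypertension.", [], 0.95),
    SummaryStatement("Patient takes a daily aspirin.", ["Medication list"], 0.55),
]

for stmt in statements:
    print(render(stmt))
```

Note that the second statement is flagged even though its confidence score is high, because it cannot be traced to any section of the chart. Treating missing provenance as a reason for review reflects the concern that a confident but unsupported statement is exactly what a hallucination looks like.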

While LLMs like ChatGPT present a tantalizing opportunity for health care advancement, it’s crucial for senior health care executives to approach their integration with caution and discernment. Ensuring the safe and effective implementation of LLMs in clinical settings will require more than vast datasets; it will take a thoughtful, multi-faceted approach that prioritizes patient safety and care quality above all else.

With rigorous, tailored training approaches, LLMs may one day provide valuable assistance to human clinicians. Until generative AI can reliably separate facts from “hallucinations,” these tools remain a source of considerable clinical risk.

David Lareau is a health care executive.