Natural language processing in health care: The breakthrough we've been waiting for?

Natural language processing might seem a bit arcane and technical — the type of thing that software engineers talk about deep into the night, but of limited usefulness for practicing docs and their patients.

Yet software that can “read” physicians’ and nurses’ notes may prove to be one of the seminal breakthroughs in digital medicine. Exhibit A, from the world of medical research: a recent study linked the use of proton pump inhibitors to subsequent heart attacks. It did this by plowing through 16 million notes in electronic health records. While legitimate epidemiologic questions can be raised about the association (more on this later), the technique may well be a game-changer.

Let’s start with a little background.

One of the great tensions in health information technology centers on how to record data about patients. This used to be simple. At the time of Hippocrates, the doctor chronicled the patient’s symptoms in prose. The chart was, in essence, the physician’s journal. Medical historian Stanley Reiser describes the case of a gentleman named Apollonius of Abdera, who lived in the 5th century BCE. The physician’s note read:

There were exacerbations of the fever; the bowels passed practically nothing of the food taken; the urine was thin and scanty. No sleep. . . . About the fourteenth day from his taking to bed, after a rigor, he grew hot; wildly delirious, shouting, distress, much rambling, followed by calm; the coma came on at this time.

The cases often ended with a grim coda. In the case of Apollonius, it read: “Thirty-fourth day. Death.”

Over the next two thousand years, more data elements were added: vital signs, physical examination findings, and, eventually, results of blood tests and radiology studies. Conventions were crafted to document these elements in shorthand, everything from “VSS” (vital signs stable) to “no HSM” (no hepatosplenomegaly) to the little boxes in which the components of the basic metabolic panel (sodium, potassium, etc.) are recorded.

This all worked out reasonably well, if often illegibly, until two major forces took root. The first was the change in the audience for the doctor’s note. Over the past 50 years — as medicine became a massive portion of the GDP and concerns were raised about its quality and safety – a number of stakeholders became very interested in what the doctor was doing and thinking. These “strangers at the bedside” (the title of a 1991 book by Columbia historian David Rothman) included government officials, regulators, accreditors, payers, quality measurers, and malpractice attorneys. Their need to judge and value the physician’s work naturally centered on the doctor’s note.

The second change, of course, has been the digital transformation of medicine. Over the past seven years, driven by $30 billion in federal incentive payments, the adoption of electronic health records has skyrocketed, from 10 percent in doctors’ offices and hospitals in 2008 to about 75 percent today. Now, all the parties with a keen interest in the physician’s note no longer had to sift through illegible scrawl. Fine. But they really wanted to know a limited set of facts, which could most efficiently be inventoried by forcing the doctor to fill in various templates and check dozens of boxes.

This, of course, became an instant source of conflict, since physicians continue to be trained and socialized to think in stories. But the payer wants to know if the doctor recorded at least nine review of systems elements. The quality measurer wants to know if the doctor documented smoking cessation counseling. And so on.

None of these “strangers” would say that they were trying to turn the doctors’ documentation into a dehumanized, desiccated sea of checkboxes. In fact, most would undoubtedly endorse the need for a narrative description of the patient’s concerns, even if such description was of no use to them. All they needed for their purposes, many would say, is for the doctor to record “just one more thing.”

But the cumulative impact of a few dozen “one more things” has turned doctors into glorified (or not) data entry clerks who spend nearly half their time clicking boxes to satisfy these outside parties. And, despite the hope that forcing the doctor to record patient information as discrete, analyzable data elements would make the note crisper and easier to use, precisely the opposite has occurred. Today’s notes are bloated, filled with copied & pasted gobbledygook, and oftentimes worthless as a clinical aid.

A popular, if expensive, solution to the data entry burden has been the hiring of scribes. Good ones do far more than simply transcribe the doctor’s words into the EHR. They are checking boxes, interpreting, anticipating, at times even prompting the doctor or patient to take certain actions or answer key questions. But the fundamental problem remains: whether the boxes are being checked by scribes or physicians, the boxes can’t go away as long as all of the “strangers” demand structured data to meet their needs.

Enter natural language processing, the ability of a computer program to understand human language in written form. If a computer could just “read” and understand the words in a doctor’s note, or so the thinking goes, the need for all the checkboxes would melt away. (Of course, this is a bit simplistic, in that a perfect NLP system would still need to monitor the physician’s and patient’s words and prompt both to record key elements demanded by the billers and the quality measurers. While there might not need to be a checkbox called “documented smoking cessation counseling,” if the software was looking for it and didn’t detect it, it would need to prompt the doctor by “saying” something like, “I didn’t hear you mention anything about smoking. Is the patient smoking?”)

If robust natural language processing were up and humming, there would be another benefit: the chance to turn the medical record into a vast opportunity for learning. That was the subject of last week’s study. The Stanford investigators used advanced language processing technology to search through 16 million notes (of nearly 3 million patients) for evidence that patients had taken a proton-pump inhibitor (PPI; they’re used for ulcers and heartburn, and are now the third most prescribed drug category in the U.S.) and then, later in the chart, for evidence that the patient suffered a heart attack or died of a cardiac-related complication.

Searching for these words and concepts is tougher than it looks. It’s not hard for a computer to search for pantoprozole or Nexium, particularly when it can look these up on a patient’s medication list. But to get at cardiac complications, the software has to find myocardial infarction, MI, CHF, low ejection fraction, EF<35%, angina, cath, L main disease, and more. It then must be sure that these complications occurred after the patient began the PPI. Then it must cope with linguistic tricksters like negation and family history: “The patient had an MI last month” counts, but “The patient was seen in the ER and ruled out for MI” and “The patient’s brother had an MI” do not.

Finally, there’s the problem of context. When a cardiologist says “depression,” she is likely discussing a deviation in an ECG tracing that might indicate heart disease. When a psychiatrist says it, she is probably referring to a mood disorder. Ditto the cardiologist’s “MS” (mitral stenosis) versus the neurologist’s (multiple sclerosis).

In prior work, the Stanford researchers demonstrated that their software was accurate and rarely duped by such problems – the rate of false positives and negatives was impressively low (accuracy of 89 percent; positive predictive value of 81 percent). In the current study, by using this technique they found that a history of PPI use was associated with a significant bump (by 16 percent) in the adjusted rate of MI’s, and a two-fold increase in cardiac-related deaths.

Of course, there are all kinds of potential problems associated with this type of analysis. First, there are the familiar hazards of data dredging: if you sift through a huge dataset, you’ll find many associations by chance alone, particularly if there was no physiologic plausibility for the link and you were metaphorically throwing the data against the wall to see what stuck. There are statistical techniques to counteract this problem; in essence, they lower the threshold (from the usual p<0.05) for accepting an association as real.

But there’s more. For example, there’s the tricky matter of sorting out causality vs. association. Perhaps all those people popping Purple Pills for “heartburn” were really having angina. This would make it appear that the PPI’s were causing heart disease when in fact they were just a marker for it.

The authors worked hard to address these limitations, and partly did so. They found the PPI-cardiac association in two different datasets: one from an academic practice setting (Stanford’s health care system), and another from users of the Practice Fusion outpatient EHR system, which is primarily used in small community practices. They adjusted for severity of illness, and demonstrated the absence of an association between the use of another type of acid blocker (H2 blockers) and cardiac events. They also found that the cardiac risk did not depend on whether the patient was on the antiplatelet drug clopidogrel (prior studies had pointed to a drug interaction between PPIs and clopidogrel as the mechanism for a higher risk of cardiac problems). Finally, there is apparently a bit of biologic plausibility for the association, with some evidence that PPIs can deplete the vasodilating compound, nitric oxide.

That said, if I needed to take a PPI for reflux esophagitis, I wouldn’t stop it on the basis of this study alone. But plenty of people are on PPIs who shouldn’t be. One study found that more than half of outpatients on PPIs had no indication for them; another found that 69 percent of patients were inappropriately prescribed PPIs at hospital discharge. A third study found that, of 55 patients discharged from the hospital on PPIs for what should have been a brief course, only one of them was instructed by a clinician to discontinue the drug during the next six months. The Stanford paper offers one more reason to be thoughtful about when to prescribe PPIs, and when to stop them.

But the reason I wanted to highlight this study goes well beyond the clinical content. Natural language processing is already used in other industries, and is part of the magic behind Siri (when she’s in a good mood) and similar “intelligent” language recognition and question-answering tools. Think about a world in which a patient’s electronic notes, going back many years, can be mined for key risk factors or other historical elements, without the need to constrain the search to structured data fields like prescription lists or billing records. (Conflict alert: I’m an advisor to a company, QPID Health, which builds such a tool.)

Or imagine a world in which big data approaches can review the clinical notes of a given patient, and say, “patients like this did better on drug A than on drug B” — an analytic feat not significantly harder than what Amazon does when it says that “Customers like you also liked this book.” Or picture a world in which the use of certain drugs or other approaches (exercise, chocolate, yoga, whatever) could be associated with positive (fewer cancers) or negative (heart attacks) clinical outcomes through NLP-aided analysis. Using a Harry Potter analogy, Michael Lauer of the National Heart, Lung, and Blood Institute called such approaches “Magic in the Muggle world.”

Now that most patients’ data are stored in digital records, natural language software might just allow us to turn such data into actionable insights. That would be a welcome but ironic twist. Computers have wrenched us out of our world of narrative notes and placed us into an increasingly regimented, dehumanized world of templates and checklists. Wouldn’t it be lovely if computers could liberate us from the checkboxes, allowing us to get back to the business of talking to our patients and describing their findings, and our thinking, in prose?

Bob Wachter is a professor of medicine, University of California, San Francisco. He coined the term “hospitalist” and is one of the nation’s leading experts in health care quality and patient safety. He is author of Understanding Patient Safety, Second Edition and The Digital Doctor: Hope, Hype, and Harm at the Dawn of Medicine’s Digital Age. He blogs at Wachter’s World, where this article originally appeared.

Image credit: Shutterstock.com