Unstructured and semi-structured data is, by its nature, largely text based. Even numbers are often surrounded by words, as in an article about a company’s stock. This section describes some of the tools that translate what is often a bunch of gobbledygook into meaningful information, which many of the tools described in the next chapter can then use.
Natural language processing
Electronic health records (EHRs) have been with us for quite some time. Some cite legitimate security concerns and thorny system issues as reasons that their adoption rate has been sluggish in the United States. Setting aside the veracity of these claims for a moment, there’s little doubt that we would benefit from the widespread adoption of EHRs. Karen Bell, director of the Office of Health IT Adoption at the U.S. Department of Health and Human Services, said as much in an interview in September 2008. Bell noted that “health care . . . problem(s) could be solved, or at least drastically reduced, by electronic health records, which allow data to be easily shared among physicians, pharmacies, and hospitals. Such systems help coordinate a patient’s care, eliminating duplicate testing and conflicting prescriptions, and ultimately cutting costs. But despite the benefits, only 15 to 18 percent of U.S. physicians have adopted electronic health records.”
EHRs don’t happen overnight, even with government-provided subsidies like those announced by the Obama administration. Countries like Denmark with high EHR adoption rates didn’t magically move from 0 to 100 percent. But even if EHR adoption hits 100 percent, is digitizing medical data the best that we can do here? Not even close. After the data is put into a usable and accessible format, we can get to the good stuff.
Consider natural language processing (NLP), a technology that can, among other things, produce readable summaries of chunks of text. Basic applications of NLP include analyzing social media posts and newspaper articles and, as the Parliament of Canada and the European Union have done, translating governmental proceedings into all official languages. But this is just the tip of the iceberg. NLP can do much, much more, including deciphering doctors’ notes and other unstructured information generated during patient visits. NLP can take EHRs to an entirely different level.
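To make the idea of deciphering a doctor’s note a little more concrete, here is a minimal sketch in Python. It is a toy illustration and not how any production clinical NLP system works: the symptom list, the negation cues, the sample note, and the names `NoteSummary` and `summarize_note` are all hypothetical, and the plain keyword matching stands in for the trained models and clinical vocabularies a real tool would rely on.

```python
import re
from dataclasses import dataclass, field

# Hypothetical toy vocabulary; a real system would use a clinical terminology
# and a trained model rather than a hand-written keyword list.
SYMPTOM_TERMS = ["fever", "cough", "fatigue", "shortness of breath"]
NEGATION_CUES = ["no", "denies", "without"]


@dataclass
class NoteSummary:
    present: list = field(default_factory=list)  # symptoms the note asserts
    negated: list = field(default_factory=list)  # symptoms the note rules out


def summarize_note(note: str) -> NoteSummary:
    """Tag each known symptom in a free-text note as present or negated.

    A symptom counts as negated when a negation cue appears earlier in the
    same sentence. This is a deliberately crude rule.
    """
    summary = NoteSummary()
    # Treat periods and semicolons as sentence boundaries.
    for sentence in re.split(r"[.;]\s*", note.lower()):
        for term in SYMPTOM_TERMS:
            match = re.search(r"\b" + re.escape(term) + r"\b", sentence)
            if not match:
                continue
            before = sentence[: match.start()]
            is_negated = any(
                re.search(r"\b" + re.escape(cue) + r"\b", before)
                for cue in NEGATION_CUES
            )
            (summary.negated if is_negated else summary.present).append(term)
    return summary


# Made-up sample note, for illustration only.
note = "Patient reports fever and dry cough for three days. Denies shortness of breath."
result = summarize_note(note)
print(result.present)  # ['fever', 'cough']
print(result.negated)  # ['shortness of breath']
```

Even this crude version turns a sentence of free text into fields that could be counted across many patients. Its limits are just as instructive: a note reading “no fever but persistent cough” would wrongly mark the cough as negated, which is the kind of subtlety real NLP tools have to handle.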
While turning unstructured data into something useful may not get your juices flowing, many people feel passionately about the subject. Count among them tech-savvy doctors like Jaan Sidorov and Kevin Pho, the web’s top social media influencer in health care and medicine according to Klout. In an article on KevinMD (Pho’s site), Sidorov cites statistics showing that an astonishing 80 percent of clinical documentation existing in health care today is unstructured. Yet that information is largely ignored, sometimes:
. . . referred to as ‘the text blob’ and is buried within electronic health records (EHRs). The inherent problem with ‘the text blob’ is that locked within it lies an extraordinary amount of key clinical data—valuable information that can and should be leveraged to make more informed clinical decisions, to ultimately improve patient care and reduce healthcare costs. To date, however, because it consists of copious amounts of text, the healthcare industry has struggled to unlock meaning from ‘the text blob’ without intensive, manual analysis or has chosen to forego extracting the value completely.
Sidorov goes on to tell the story of NLP-based applications that accurately read and analyze text from doctors’ visits. In one instance, an application spotted diseases with an accuracy rate north of 90 percent based solely upon doctors’ text-based descriptions (in other words, before any lab testing). NLP’s impact on medicine and the treatment of disease is similar to that of Google Flu Trends. Imagine trends discovered via NLP that allow doctors to proactively contact and treat their patients after they have exhibited similar symptoms, without having staff comb through hundreds of patient records. And Google is hardly alone. Consider DataSift, a company that uses NLP to turn the Twitter firehose and other unstructured social data into structured, digestible, and valuable information. In mid-November 2012, the company received $15 million in venture funding.
Examples like these suggest that NLP can be both more effective and less expensive than traditional methods of disease detection. Upon reading this, you should be asking yourself several questions:
- Isn’t this similar to speed bump?
- Why aren’t more health care organizations using NLP?
- When will Sidorov and Pho start the technology equivalents of medical Fight Clubs?
Phil Simon is the author of Too Big to Ignore: The Business Case for Big Data (Wiley and SAS Business Series). A recognized technology expert, he advises companies on how to optimize their use of technology.