June 7, 2011 | Vas Vasiliadis

Natural Language Processing (NLP) has the potential to dramatically influence the way in which clinical care and medical research are conducted. Pilot studies have shown that NLP engines such as MedLEE and MetaMap, combined with community-defined medical ontologies like the Unified Medical Language System (UMLS), can help identify disease risks and environmental factors in patient clinical narratives more accurately. In addition to helping improve disease diagnosis, NLP can help automate the analysis of patient narratives over extended visits and across a wide sampling of subjects.

While more research is needed, the opportunities are well understood. Yet NLP has not seen widespread adoption in clinical settings. This is in part because successful application of off-the-shelf NLP engines often requires deep familiarity with their function and design.

Employing NLP on a massive scale would also require researchers to possess expertise in systems design, data management, provisioning of computational resources, security, and Health Insurance Portability and Accountability Act (HIPAA) compliance. For example, NLP execution systems would need to ensure that researchers have acquired appropriate software licenses as well as the necessary credentials for accessing patient data. Moreover, research is held back in part because few mechanisms exist today for sourcing patient narratives and de-identifying patient data for use in open studies.

At the Computation Institute we aim to address these medical NLP challenges in at least two ways. First, we aim to greatly improve the usability of NLP engines by developing an easy-to-use web environment for executing them. By delivering functionality via the web browser, we can remove the need for familiarity with the idiosyncrasies of NLP engine execution. The web will also allow us to more easily introduce value-added tools that enable researchers to experiment with different medical ontologies, vary input parameters to NLP engines, run batch studies, and analyze results across any number of patients or sets of narratives, as well as curate and share results among select colleagues.

Second, we aim to look to cloud computing to address the core computational and data management requirements for NLP execution and supporting functions. Our first foray into this world has proved quite promising from both a performance perspective and a cost perspective. The figure below illustrates a study we conducted in scaling the use of MetaMap across multiple Elastic Compute Cloud (EC2) virtual machine (VM) instances in parallel. We processed 1,000 narratives of about 3 KB each, producing roughly 1 million UMLS mappings, and indexed the results with a service built on Solr. With EC2, we found that scaling up performance was as easy as adding a new VM and distributing MetaMap load accordingly. The only factor limiting performance was our ability to index the results. However, our Solr deployment was quite simple, running on a single VM, and there are well-known mechanisms for scaling Solr on EC2 that we might explore.
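To make the idea of "adding a new VM and distributing load accordingly" concrete, here is a minimal sketch of one way to split a set of narratives into near-equal batches, one per worker. It is an illustration under stated assumptions, not a description of how the study was actually run: the round-robin split is an assumed strategy, local threads stand in for EC2 instances, and `annotate` is a hypothetical placeholder that does not call MetaMap.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_workers):
    """Split items into n_workers round-robin batches of near-equal size."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [items[i::n_workers] for i in range(n_workers)]


def annotate(narrative):
    """Hypothetical stand-in for one NLP engine call.

    A real worker would invoke the NLP engine here; this placeholder
    only counts whitespace-separated tokens.
    """
    return len(narrative.split())


def process_batch(batch):
    """Process every narrative assigned to a single worker."""
    return [annotate(narrative) for narrative in batch]


def run(narratives, n_workers):
    """Process all narratives, one batch per worker (threads stand in for VMs)."""
    batches = partition(narratives, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves batch order, so results[i] belongs to batches[i].
        return list(pool.map(process_batch, batches))


if __name__ == "__main__":
    sample = ["patient reports chest pain", "no known allergies", "follow up in two weeks"]
    print(run(sample, n_workers=2))
```

With a split like this, adding a worker only changes `n_workers`; each batch shrinks and the per-worker load stays balanced to within one document.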

We also compared price and performance across different EC2 instance sizes. The results, presented in the table below with prices for the US East region, show how cost effective EC2 can be. For example, processing a single document on a micro or small instance costs no more than about 5 hundredths of a cent ($0.0005). This demonstrates the potential to provide wide-scale NLP on demand with very low cost overhead to end users.


                          Micro     Small     Large     XLarge
Cores                     Up to 2   1         4         8
Memory (GB)               0.613     1.7       7.5       15
Architecture (bit)        64        32        64        64
Storage (GB)              EBS       160       850       1690
Instance cost ($/hr)      0.02      0.085     0.34      0.68
Documents per hour        66.8      170.0     270.4     391.1
Price per document ($)    0.0003    0.0005    0.0012    0.0017
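
The per-document price is simply the hourly instance cost divided by the measured throughput. The short sketch below redoes that arithmetic from the table's own figures; the function and variable names are illustrative. Because the hourly cost is in dollars, the quotient is in dollars too, and it matches the table's per-document figures to within rounding when those are read as dollars (for example, $0.0003 is 0.03 cents).

```python
# Hourly price (USD) and measured throughput, copied from the table above.
INSTANCES = {
    "Micro": {"cost_per_hour": 0.02, "docs_per_hour": 66.8},
    "Small": {"cost_per_hour": 0.085, "docs_per_hour": 170.0},
    "Large": {"cost_per_hour": 0.34, "docs_per_hour": 270.4},
    "XLarge": {"cost_per_hour": 0.68, "docs_per_hour": 391.1},
}


def cost_per_document(cost_per_hour, docs_per_hour):
    """Return the cost in dollars of processing one document."""
    return cost_per_hour / docs_per_hour


def cost_table(instances):
    """Map each instance size to its per-document cost in dollars."""
    return {name: cost_per_document(**spec) for name, spec in instances.items()}


if __name__ == "__main__":
    for name, dollars in cost_table(INSTANCES).items():
        # Show the same amount in dollars and in cents.
        print(f"{name:>6}: ${dollars:.4f} per document ({dollars * 100:.3f} cents)")
```

Note that the larger instances process more documents per hour but cost more per document, since throughput grows more slowly than the hourly price.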


This is exciting stuff and we are only just getting started. As our aim is to stand up a service for researchers to experiment with our work, one of our next steps will be to start leveraging Globus Online services to manage user access and to provide more robust data import and export capabilities. I hope to be blogging on that topic in the near future. To learn more about our current efforts, feel free to read "Scalability and Cost of a Cloud-based Approach to Medical NLP", recently accepted at "The 24th International Symposium on Computer-Based Medical Systems (CBMS)", in Bristol, UK.