-
Identification of pneumonia and influenza deaths using the death certificate pipeline
Background:
Death records are a rich source of data that can be used to assist with public health surveillance and/or decision support. However, to use this type of data for such purposes, it has to be transformed into a coded format to make it computable. Because the cause of death in the certificates is reported as free text, encoding the data is currently the single largest barrier to using death certificates for surveillance. Therefore, the purpose of this study was to demonstrate the feasibility of using a pipeline, composed of a detection rule and a natural language processor, for the real-time encoding of death certificates, using the identification of pneumonia and influenza cases as an example, and to demonstrate that its accuracy is comparable to existing methods.
Results:
A Death Certificates Pipeline (DCP) was developed to automatically code death certificates and identify pneumonia and influenza cases. The pipeline used MetaMap to code death certificates from the Utah Department of Health for the year 2008. The output of MetaMap was then accessed by detection rules, which flagged pneumonia and influenza cases based on the Centers for Disease Control and Prevention (CDC) case definition. The output from the DCP was compared with the current method used by the CDC and with a keyword search. Recall, precision (positive predictive value), and F-measure with respect to the CDC method were calculated for the two other methods considered here. The two techniques compared with the CDC method showed the following recall/precision results: DCP: 0.998/0.98 and keyword searching: 0.96/0.96. The F-measures were 0.99 and 0.96, respectively (DCP and keyword searching). Both the keyword search and the DCP can run interactively with modest computer resources, but the DCP shows superior performance.
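The reported F-measures follow from the recall and precision figures above. The sketch below is illustrative only and is not the study's code: it assumes the balanced F-measure (harmonic mean of precision and recall), and the helper names `f_measure` and `evaluate` are hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, reference):
    """Score a set of flagged record IDs against a reference set of IDs.

    Returns (recall, precision, F-measure). The reference set plays the
    role of the comparison standard (here, the CDC method).
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return recall, precision, f_measure(precision, recall)


# Recall/precision pairs quoted in the Results paragraph.
for name, recall, precision in [("DCP", 0.998, 0.98), ("keyword", 0.96, 0.96)]:
    print(name, round(f_measure(precision, recall), 2))
```

Rounded to two decimals, this gives the 0.99 and 0.96 quoted for the DCP and the keyword search.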
Conclusion:
The pipeline proposed here for coding death certificates and detecting cases is feasible and can be extended to other conditions. This method provides an alternative that allows for coding free-text death certificates in real time, which may increase their utilization not only in the public health domain but also among biomedical researchers and developers.
Trial Registration:
This study did not involve any clinical trials.
-
Recognition of medication information from discharge summaries using ensembles of classifiers
Background:
Extraction of clinical information such as medications or problems from clinical text is an important task of clinical natural language processing (NLP). Rule-based methods are often used in clinical NLP systems because they are easy to adapt and customize. Recently, supervised machine learning methods have proven to be effective in clinical NLP as well. However, combining different classifiers to further improve the performance of clinical entity recognition systems has not been investigated extensively. Combining classifiers into an ensemble classifier presents both challenges and opportunities to improve performance in such NLP tasks.
Methods:
We investigated ensemble classifiers that used different voting strategies to combine outputs from three individual classifiers: a rule-based system, a support vector machine (SVM) based system, and a conditional random field (CRF) based system. Three voting methods were proposed and evaluated using the annotated data sets from the 2009 i2b2 NLP challenge: simple majority, local SVM-based voting, and local CRF-based voting.
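Of the three voting methods, simple majority voting is the easiest to sketch. The example below is a minimal illustration and not the authors' system: it assumes each classifier emits one label per token, and the label names, the `majority_vote` helper, and the fallback to `"O"` when no label wins a strict majority are all assumptions.

```python
from collections import Counter


def majority_vote(predictions, fallback="O"):
    """Combine per-token labels from several classifiers by simple majority.

    predictions: list of label sequences, one per classifier, all the same
    length. A label is accepted only if more than half of the classifiers
    agree on it; otherwise the (assumed) fallback label is used.
    """
    voted = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        voted.append(label if count > len(labels) / 2 else fallback)
    return voted


# Hypothetical outputs of the rule-based, SVM, and CRF systems for 3 tokens.
rule_based = ["B-MED", "O", "O"]
svm_based = ["B-MED", "B-DOSE", "O"]
crf_based = ["O", "B-DOSE", "B-FREQ"]

print(majority_vote([rule_based, svm_based, crf_based]))
```

The local SVM-based and CRF-based voting methods would replace the counting step with a trained model over the individual outputs; they are not shown here.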
Results:
Evaluation on 268 manually annotated discharge summaries from the i2b2 challenge showed that the local CRF-based voting method achieved the best F-score of 90.84% (94.11% Precision, 87.81% Recall) for 10-fold cross-validation. We then compared our systems with the first-ranked system in the challenge by using the same training and test sets. Our system based on majority voting achieved a better F-score of 89.65% (93.91% Precision, 85.76% Recall) than the previously reported F-score of 89.19% (93.78% Precision, 85.03% Recall) by the first-ranked system in the challenge.
Conclusions:
Our experimental results using the 2009 i2b2 challenge datasets showed that ensemble classifiers that combine individual classifiers into a voting system could achieve better performance than a single classifier in recognizing medication information from clinical text. This suggests that simple, easily implemented strategies such as majority voting have the potential to significantly improve clinical entity recognition.
-
Leveraging H1N1 infection transmission modeling with proximity sensor microdata
Background:
The contact networks between individuals can have a profound impact on the evolution of an infectious outbreak within a network. The impact of the interaction between contact network and disease dynamics on infection spread has been investigated using both synthetic and empirically gathered micro-contact data, establishing the utility of micro-contact data for epidemiological insight. However, the infection models tied to empirical contact data were highly stylized and were not calibrated or compared against temporally coincident infection rates, or omitted critical non-network based risk factors such as age or vaccination status.
Methods:
In this paper we present an agent-based simulation model firmly grounded in disease dynamics, incorporating a detailed characterization of the natural history of infection, and 13 weeks' worth of micro-contact and participant health and risk factor information gathered during the 2009 H1N1 flu pandemic.
Results:
We demonstrate that the micro-contact data-based model yields results consistent with the case counts observed in the study population, derive novel metrics based on the logarithm of the time degree for evaluating individual risk based on contact dynamic properties, and present preliminary findings pertaining to the impact of internal network structures on the spread of disease at an individual level.
Conclusions:
Through the analysis of detailed output of Monte Carlo ensembles of agent-based simulations we were able to recreate many possible scenarios of infection transmission using an empirically grounded dynamic contact network, providing a validated and grounded simulation framework and methodology. We confirmed recent findings on the importance of contact dynamics, and extended the analysis to new measures of the relative risk of different contact dynamics. Because exponentially more time spent with others correlates with a linear increase in infection probability, we conclude that network dynamics have an important, but not dominant, impact on infection transmission for H1N1 in our study population.
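The closing claim can be read as a log-linear relation. As an illustrative sketch only (the functional form and the constants $a$ and $b$ are assumptions, not quantities reported in this abstract), suppose infection probability $p$ grows linearly in the logarithm of total contact time $T$:

```latex
p(T) \approx a + b \log T
\qquad\Longrightarrow\qquad
p(kT) - p(T) \approx b \log k \quad \text{for any } k > 0 .
```

Under this reading, multiplying contact time by a factor $k$ adds only a fixed increment $b \log k$ to the infection probability, which is consistent with contact dynamics being important but not dominant.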
-
Identification of methicillin-resistant Staphylococcus aureus within the Nation's Veterans Affairs Medical Centers using natural language processing
Background:
Accurate information is needed to direct healthcare systems' efforts to control methicillin-resistant Staphylococcus aureus (MRSA). Assembling complete and correct microbiology data is vital to understanding and addressing the multiple drug-resistant organisms in our hospitals.
Methods:
Herein, we describe a system that securely gathers microbiology data from the Department of Veterans Affairs (VA) network of databases. Using natural language processing methods, we applied an information extraction process to extract organisms and susceptibilities from the free-text data. We then validated the extraction against independently derived electronic data and expert annotation.
Results:
We estimate that the collected microbiology data are 98.5% complete and that methicillin-resistant Staphylococcus aureus was extracted accurately 99.7% of the time.
Conclusions:
Applying natural language processing methods to microbiology records appears to be a promising way to extract accurate and useful nosocomial pathogen surveillance data. Both scientific inquiry and the data's reliability will be dependent on the surveillance system's capability to compare from multiple sources and circumvent systematic error. The dataset constructed and methods used for this investigation could contribute to a comprehensive infectious disease surveillance system or other pressing needs.
-
Studying the potential impact of automated document classification on scheduling a systematic review update
Background:
Systematic Reviews (SRs) are an essential part of evidence-based medicine, providing support for clinical practice and policy on a wide range of medical topics. However, producing SRs is resource-intensive, and progress in the research they review leads to SRs becoming outdated, requiring updates. Although the question of how and when to update SRs has been studied, the best method for determining when to update is still unclear, necessitating further research.
Methods:
In this work we study the potential impact of a machine learning-based automated system for providing alerts when new publications become available within an SR topic. Some of these new publications are especially important, as they report findings that are more likely to initiate a review update. To this end, we have designed a classification algorithm to identify articles that are likely to be included in an SR update, along with an annotation scheme designed to identify the most important publications in a topic area. Using an SR database containing over 70,000 articles, we annotated articles from 9 topics that had received an update during the study period. The algorithm was then evaluated in terms of the overall correct and incorrect alert rate for publications meeting the topic inclusion criteria, as well as in terms of its ability to identify important, update-motivating publications in a topic area.
Results:
Our initial approach, based on our previous work in topic-specific SR publication classification, identifies over 70% of the most important new publications, while maintaining a low overall alert rate.
Conclusions:
We performed an initial analysis of the opportunities and challenges in aiding the SR update planning process with an informatics-based machine learning approach. Alerts could be a useful tool in the planning, scheduling, and allocation of resources for SR updates, providing an improvement in timeliness and coverage for the large number of medical topics needing SRs. While the performance of this initial method is not perfect, it could be a useful supplement to current approaches to scheduling an SR update. Approaches specifically targeting the types of important publications identified by this work are likely to improve results.
-
Evaluating the impact of patients' online access to doctors' visit notes: designing and executing the OpenNotes project
Background:
Providers and policymakers are pursuing strategies to increase patient engagement in health care. Increasingly, online sections of medical records are viewable by patients, though seldom are clinicians' visit notes included. We designed a one-year multi-site trial of online patient-accessible office visit notes, OpenNotes. We hypothesized that patients and primary care physicians (PCPs) would want it to continue and that OpenNotes would not lead to significant disruptions to doctors' practices.
Methods/Design:
Using a mixed methods approach, we designed a quasi-experimental study in 3 diverse healthcare systems in Boston, Pennsylvania, and Seattle. Two sites had existing patient internet portals; the third used an experimental portal. We targeted 3 key areas where we hypothesized the greatest impacts: beliefs and attitudes about OpenNotes, use of the patient internet portals, and patient-doctor communication. PCPs in the 3 sites were invited to participate in the intervention. Patients who were registered portal users of participating PCPs were given access to their PCPs' visit notes for one year. PCPs who declined participation in the intervention and their patients served as the comparison groups for the study. We applied the RE-AIM framework to our design in order to capture as comprehensive a picture as possible of the impact of OpenNotes. We developed pre- and post-intervention surveys for online administration addressing attitudes and experiences based on interviews and focus groups with patients and doctors. In addition, we tracked use of the internet portals before and during the intervention.
Results:
PCP participation varied from 19% to 87% across the 3 sites; a total of 114 PCPs enrolled in the intervention with their 22,000 patients who were registered portal users. Approximately 40% of intervention and non-intervention patients at the 3 sites responded to the online survey, yielding a total of approximately 38,000 patient surveys.
Discussion:
Many primary care physicians were willing to participate in this "real world" experiment testing the impact of OpenNotes on their patients and their practices. Results from this trial will inform providers, policy makers, and patients who contemplate such changes at a time of exploding interest in transparency, patient safety, and improving the quality of care.
-
Measuring diversity in medical reports based on categorized attributes and international classification systems
Background:
Narrative medical reports do not use standardized terminology and often provide insufficient information for statistical processing and medical decision making. The objectives of the paper are to propose a method for measuring diversity in medical reports written in any language, to compare diversities in narrative and structured medical reports, and to map attributes and terms to selected classification systems.
Methods:
A new method based on a general concept of f-diversity is proposed for measuring diversity of medical reports in any language. The method is based on categorized attributes recorded in narrative or structured medical reports and on international classification systems. Values of categories are expressed by terms. Using SNOMED CT and ICD 10, we map attributes and terms to predefined codes. We use f-diversities of the Gini-Simpson and Number of Categories types to compare diversities of narrative and structured medical reports. The comparison is based on attributes selected from the Minimal Data Model for Cardiology (MDMC).
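The two diversity types named above can be illustrated with their textbook forms. This is a sketch under stated assumptions, not the paper's f-diversity framework: it uses the standard Gini-Simpson index (one minus the sum of squared category proportions) and a plain count of distinct categories, and the attribute values and function names are hypothetical.

```python
from collections import Counter


def number_of_categories(values):
    """Count the distinct categories observed for one attribute."""
    return len(set(values))


def gini_simpson(values):
    """Standard Gini-Simpson index: 1 - sum of squared category proportions.

    0.0 means every report uses the same category; values closer to 1.0
    mean the observations are spread over many categories.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one observation is required")
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical values of one attribute in narrative vs. structured reports.
narrative = ["smoker", "ex-smoker", "non-smoker", "smokes", "non-smoker"]
structured = ["yes", "no", "no", "yes", "no"]

for name, values in [("narrative", narrative), ("structured", structured)]:
    print(name, number_of_categories(values), round(gini_simpson(values), 3))
```

In this toy data the free-text attribute has more categories and a higher index than its structured counterpart, which is the kind of contrast the comparison is designed to measure.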
Results:
We compared diversities of 110 Czech narrative medical reports and 1119 Czech structured medical reports. Selected categorized attributes of MDMC had mostly different numbers of categories and used different terms in narrative and structured reports. We found more than 60% of MDMC attributes in SNOMED CT. We showed that attributes in narrative medical reports had greater diversity than the same attributes in structured medical reports. Further, we replaced each value of category (term) used for attributes in narrative medical reports by the closest term and the category used in MDMC for structured medical reports. We found that relative Gini-Simpson diversities in structured medical reports were significantly smaller than those in narrative medical reports, except for the "Allergy" attribute.
Conclusions:
Terminology in narrative medical reports is not standardized. Therefore it is nearly impossible to map values of attributes (terms) to codes of known classification systems. A high diversity in narrative medical reports terminology leads to more difficult computer processing than in structured medical reports and some information may be lost during this process. Setting a standardized terminology would help healthcare providers to have complete and easily accessible information about patients that would result in better healthcare.
-
Clinical software development for the Web: lessons learned from the BOADICEA project
Background:
In the past 20 years, society has witnessed the following landmark scientific advances: (i) the sequencing of the human genome, (ii) the distribution of software by the open source movement, and (iii) the invention of the World Wide Web. Together, these advances have provided a new impetus for clinical software development: developers now translate the products of human genomic research into clinical software tools; they use open-source programs to build them; and they use the Web to deliver them. Whilst this open-source component-based approach has undoubtedly made clinical software development easier, clinical software projects are still hampered by problems that traditionally accompany the software process. This study describes the development of the BOADICEA Web Application, a computer program used by clinical geneticists to assess risks to patients with a family history of breast and ovarian cancer. The key challenge of the BOADICEA Web Application project was to deliver a program that was safe, secure and easy for healthcare professionals to use. We focus on the software process, problems faced, and lessons learned. Our key objectives are: (i) to highlight key clinical software development issues; (ii) to demonstrate how software engineering tools and techniques can facilitate clinical software development for the benefit of individuals who lack software engineering expertise; and (iii) to provide a clinical software development case report that can be used as a basis for discussion at the start of future projects.
Results:
We developed the BOADICEA Web Application using an evolutionary software process. Our approach to Web implementation was conservative and we used conventional software engineering tools and techniques. The principal software development activities were: requirements, design, implementation, testing, documentation and maintenance. The BOADICEA Web Application has now been widely adopted by clinical geneticists and researchers. BOADICEA Web Application version 1 was released for general use in November 2007. By May 2010, we had >1200 registered users based in the UK, USA, Canada, South America, Europe, Africa, Middle East, SE Asia, Australia and New Zealand.
Conclusions:
We found that an evolutionary software process was effective when we developed the BOADICEA Web Application. The key clinical software development issues identified during the BOADICEA Web Application project were: software reliability, Web security, clinical data protection and user feedback.
-
CDAPubMed: a browser extension to retrieve EHR-based biomedical literature
Background:
Over the last few decades, the ever-increasing output of scientific publications has led to new challenges to keep up to date with the literature. In the biomedical area, this growth has introduced new requirements for professionals, e.g., physicians, who have to locate the exact papers that they need for their clinical and research work amongst a huge number of publications. Against this backdrop, novel information retrieval methods are even more necessary. While web search engines are widespread in many areas, facilitating access to all kinds of information, additional tools are required to automatically link information retrieved from these engines to specific biomedical applications. In the case of clinical environments, this also means considering aspects such as patient data security and confidentiality or structured contents, e.g., electronic health records (EHRs). In this scenario, we have developed a new tool to facilitate query building to retrieve scientific literature related to EHRs.
Results:
We have developed CDAPubMed, an open-source web browser extension to integrate EHR features in biomedical literature retrieval approaches. Clinical users can use CDAPubMed to: (i) load patient clinical documents, i.e., EHRs based on the Health Level 7-Clinical Document Architecture Standard (HL7-CDA), (ii) identify relevant terms for scientific literature search in these documents, i.e., Medical Subject Headings (MeSH), automatically driven by the CDAPubMed configuration, which advanced users can optimize to adapt to each specific situation, and (iii) generate and launch literature search queries to a major search engine, i.e., PubMed, to retrieve citations related to the EHR under examination.
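Step (iii), turning identified MeSH terms into a PubMed search, can be sketched as follows. This is not CDAPubMed's implementation: the helper names and the example terms are hypothetical, and the sketch assumes PubMed's `[MeSH Terms]` field tag and the public `https://pubmed.ncbi.nlm.nih.gov/?term=` search URL.

```python
from urllib.parse import urlencode

# Public PubMed search page; the query goes in the "term" parameter.
PUBMED_SEARCH_URL = "https://pubmed.ncbi.nlm.nih.gov/"


def build_query(mesh_terms):
    """AND together MeSH headings, each restricted to the MeSH Terms field."""
    return " AND ".join(f'"{term}"[MeSH Terms]' for term in mesh_terms)


def build_search_url(mesh_terms):
    """Return a PubMed search URL for the combined MeSH query."""
    return PUBMED_SEARCH_URL + "?" + urlencode({"term": build_query(mesh_terms)})


# Hypothetical terms identified in one clinical document.
terms = ["Breast Neoplasms", "Female", "Middle Aged"]
print(build_query(terms))
print(build_search_url(terms))
```

Each added term narrows the query, which is why searches built from several patient features return far fewer citations than a single broad heading.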
Conclusions:
CDAPubMed is a platform-independent tool designed to facilitate literature searching using keywords contained in specific EHRs. CDAPubMed is visually integrated, as an extension of a widespread web browser, within the standard PubMed interface. It has been tested on a public dataset of HL7-CDA documents, returning significantly fewer citations since queries are focused on characteristics identified within the EHR. For instance, compared with more than 200,000 citations retrieved by breast neoplasm, fewer than ten citations were retrieved when ten patient features were added using CDAPubMed. This is an open source tool that can be freely used for non-profit purposes and integrated with other existing systems.
-
The use of regional platforms for managing electronic health records for the production of regional public health indicators in France
Background:
In France, recent developments in healthcare system organization have aimed at strengthening decision-making and action in public health at the regional level. Firstly, the 2004 Public Health Act, by setting 100 national and regional public health targets, introduced an evaluative approach to public health programs at the national and regional levels. Meanwhile, the implementation of regional platforms for managing electronic health records (EHRs) has also been under assessment to coordinate the deployment of this important instrument of care within each geographic area. In this context, the development and implementation of a regional approach to epidemiological data extracted from EHRs are an opportunity that must be seized as soon as possible. Our article addresses certain design and organizational aspects so that the technical requirements for such use are integrated into regional platforms in France. The article is based on the organization of the Rhone-Alpes regional health platform.
Discussion:
Different tools being deployed in France allow us to consider the potential of these regional platforms for epidemiology and public health (implementation of a national health identification number and a national information system interoperability framework). The deployment of the Rhone-Alpes regional health platform began in the 2000s in France. By August 2011, 2.6 million patients were identified in this platform. A new development step is emerging because regional decision-makers need to measure healthcare efficiency. To pool heterogeneous information contained in various independent databases, the format, norm and content of the metadata have been defined. Two types of databases will be created according to the nature of the data processed, one for extracting structured data, and the second for extracting non-structured and de-identified free-text documents.
Summary:
Regional platforms for managing EHRs could constitute an important data source for epidemiological surveillance in the context of epidemic alerts, but also in monitoring a number of indicators of infectious and chronic diseases for which no data are yet available in France.