Extraction and Analysis of Information from Structured and Unstructured Clinical Records
Richard Power (Open University)
Henk Harkema, Andrea Setzer, Ian Roberts, Rob Gaizauskas, Mark Hepple (University of Sheffield)
Jeremy Rogers (University of Manchester)
AHM 2005 Text Mining Workshop, 29/9/2005
Overview
• Background
• Information Extraction
• Information Integration
Background: CLEF
• Clinical e-Science Framework
• Objective:
  • To develop a high-quality, secure and interoperable information repository, derived from operational electronic patient records, to enable ethical and user-friendly access to patient information in support of clinical care and biomedical research
• Duration, funding, participants:
  • 2003 – 2005 (CLEF), 2005 – 2007 (CLEF-Services)
  • Funded by the Medical Research Council (MRC)
  • Six universities, Royal Marsden Hospital, industrial partners engaged through CLEF Industrial Forum meetings
Sheffield NLP & CLEF
• Information Extraction
  • Analyzing clinical narratives to extract medically relevant entities and events, and their properties and relationships
• Information Importation
  • Importing extracted information into the CLEF repository
• Information Integration
  • Combining extracted information with structured information (i.e., non-narrative data) already in the repository in order to build a summary of the patient's conditions and treatment over time
Medical IE
• Standard Information Extraction tasks:
  • Entity/event extraction & relationship extraction
• Additional challenges:
  • Cross-document event co-reference
    • Same event mentioned in multiple documents; many documents provide only partial descriptions of events
  • Modality of information
    • Negation: "I cannot feel any lump in her right supraclavicular fossa"
    • Uncertainty: "I just wonder if there is an outside possibility that she might have mediastinal fibrosis to account for her symptomology"
  • Temporality of information
Entities, Events & Relationships
• Entities, events:
  • Problem: melanoma, swelling, …
    • Present/absent
    • Clinical course: getting worse, getting better, no change
  • Intervention: amputation, chemotherapy, …
    • Status: planned, booked, started, completed, …
  • Investigation: CT scan, ultrasound, …
    • Status: planned, booked, started, completed, …
    • Goal: treat, cure, palliate
  • Drug: Atenolol, antibiotics, …
  • Locus: abdomen, blood, …
    • Laterality: left, right
Entities, Events & Relationships
• Relationships:
  • Location of problem: (problem, locus)
    • hip pain
    • lesions in her liver
  • Finding of investigation: (investigation, problem)
    • An ECG examination revealed atrial fibrillation
    • CT scan of her thorax and abdomen shows progressive disease
  • Target of intervention: (intervention, locus)
    • radiotherapy to back
    • breast radiotherapy
  • Further relationships
IE Approach
• Pipeline of processing modules
  • Pre-processing:
    • Tokenization, sentence splitting
  • Lexical & terminological processing:
    • Morphological analysis, term look-up, term parsing
  • Syntactic & semantic processing:
    • Sentence-based syntactic and semantic analysis
  • Discourse processing & IE pattern application:
    • Integration of semantic representations into the discourse model
    • Application of patterns to collect the information to be extracted
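A minimal sketch of the first two pipeline stages, assuming a toy dictionary in place of the real terminological database (the function and variable names here are illustrative, not the actual CLEF modules):

```python
import re

def tokenize(text):
    """Pre-processing: split text into sentences, then into word tokens."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s.split() for s in sentences]

def term_lookup(tokens, termino):
    """Lexical stage: tag each token with its semantic type if the
    terminological resource knows it, otherwise None."""
    return [(t, termino.get(t.lower())) for t in tokens]

def run_pipeline(text, termino):
    """Run the (truncated) pipeline sentence by sentence."""
    return [term_lookup(sentence, termino) for sentence in tokenize(text)]

# Toy stand-in for the Termino database.
termino = {"melanoma": "problem", "chemotherapy": "intervention"}
analysis = run_pipeline("Chemotherapy was started. The melanoma improved.", termino)
```

Later stages (syntactic/semantic analysis and discourse integration) would consume these tagged sentences rather than raw text.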
Terminology Processing
• Termino: a large-scale terminological resource to support term processing for information extraction, retrieval, and navigation
• Termino contains a database holding large numbers of terms imported from various existing terminological resources, including UMLS
• Efficient recognition of terms in text is achieved through finite-state recognizers compiled from the contents of the database
• The results of lexical look-up in Termino can feed into further term-processing components, e.g., the term parser
Terminology Processing
• Termino for CLEF
  • Imported 160,000 terms from UMLS drawn from semantic types such as pharmacologic substances, anatomical structures, therapeutic procedures, diagnostic procedures, …
• Term grammars
  • Rules for combining terms identified by term look-up in Termino into longer terms
  • Example: locations in the lung
    • location_np → latitude_adj area_noun
    • latitude_adj: upper, middle, lower, mid, basal
    • area_noun: zone, region, area, field, lung, lobe
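The slide's term-grammar rule can be sketched as a simple pattern matcher over tagged tokens; this is an illustrative reimplementation of the rule, not the compiled finite-state machinery Termino actually uses:

```python
# The rule from the slide: location_np -> latitude_adj area_noun
LATITUDE_ADJ = {"upper", "middle", "lower", "mid", "basal"}
AREA_NOUN = {"zone", "region", "area", "field", "lung", "lobe"}

def find_location_nps(tokens):
    """Combine each adjacent latitude_adj + area_noun pair into a
    longer location_np term, returning (start, end, text) spans."""
    spans = []
    for i in range(len(tokens) - 1):
        if tokens[i].lower() in LATITUDE_ADJ and tokens[i + 1].lower() in AREA_NOUN:
            spans.append((i, i + 2, " ".join(tokens[i:i + 2])))
    return spans

spans = find_location_nps("shadowing in the left lower lobe".split())
```

A finite-state recognizer compiled from the term database does the same matching in a single pass over the text, which is what makes large-scale look-up efficient.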
Information Extraction Patterns
• IE patterns inspect syntactic and semantic analyses and assert properties of entities and relationships between entities
• Example: finding of investigation
  • "CT scan of her thorax shows progressive disease"
  • IE pattern:
    invest_finding(I, P) if
      investigation(I), problem(P), show_event(S),
      lsubj(S, I), lobj(S, P).
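The Prolog-style pattern above can be mimicked in Python by querying a set of semantic predicates; the tuple encoding of the discourse model below is hypothetical, chosen only to make the pattern's logic concrete:

```python
# Hypothetical semantic analysis of "CT scan of her thorax shows
# progressive disease", as (predicate, arg...) facts.
facts = {
    ("investigation", "e1"),
    ("problem", "e2"),
    ("show_event", "e3"),
    ("lsubj", "e3", "e1"),   # e1 is the logical subject of e3
    ("lobj", "e3", "e2"),    # e2 is the logical object of e3
}

def holds(pred, *args):
    return (pred, *args) in facts

def invest_finding():
    """invest_finding(I, P) if investigation(I), problem(P),
    show_event(S), lsubj(S, I), lobj(S, P)."""
    entities = {f[1] for f in facts}
    return [
        (i, p)
        for s in entities if holds("show_event", s)
        for i in entities if holds("investigation", i) and holds("lsubj", s, i)
        for p in entities if holds("problem", p) and holds("lobj", s, p)
    ]
```

Applied to these facts, the pattern links the CT-scan investigation to the progressive-disease problem.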
Information Extraction Patterns
• Finding patterns
  • Hand-crafted patterns
  • "Redundancy" approach:
    • given a patient for whom a relationship between two particular entities is known to exist (e.g., we know the patient has a tumor in his lung), …
    • find all sentences in all notes of this patient that contain these two entities, …
    • and assume these sentences express the same relationship
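The redundancy approach sketched above amounts to a simple co-occurrence search over a patient's notes; a minimal sketch, with naive string matching standing in for proper entity recognition:

```python
def redundancy_sentences(notes, entity_a, entity_b):
    """Given a known relationship between two entities (e.g. 'tumor'
    and 'lung'), collect every sentence in the patient's notes that
    mentions both, on the assumption that each such sentence expresses
    the same relationship."""
    hits = []
    for note in notes:
        for sentence in note.split("."):
            low = sentence.lower()
            if entity_a in low and entity_b in low:
                hits.append(sentence.strip())
    return hits

notes = ["There is a tumor in his left lung. He reports a cough.",
         "The lung tumor has not progressed."]
hits = redundancy_sentences(notes, "tumor", "lung")
```

The collected sentences can then serve as positive training examples for learning new extraction patterns.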
Information Integration
• Combining structured information in the repository with information extracted from narratives into a coherent overview of the patient's condition and treatment over time
• Issues in Information Integration:
  • Ambiguity: given an event extracted from a narrative, to which event in the structured data does it correspond?
  • Fragmentation & duplication: Information Extraction over narrative data produces a collection of potentially fragmented and duplicated descriptions of medical events which need to be sorted out
• Investigation of the contribution of temporal information found within narratives to Information Integration
Linking extracted and structured events
• Reduce ambiguity through use of:
  • Medical information: type of event, relationships, …
  • Temporal information: time stamps, temporal expressions, verbal tense & aspect, …
• Example (figure):
  • Events in narratives: (1) "Chest X-RAY arranged for next week." (2000-05-16); (2) "The chest X-RAY performed …" (2000-05-24)
  • Events in structured data: (1) MRI, abdomen, 2000-05-23; (2) X-RAY, chest, 2000-05-23; (3) X-RAY, chest, 2000-05-26; (4) X-RAY, chest, 2000-07-19
Constraint Satisfaction
• Ambiguity reduction as a Constraint Satisfaction problem
  • Each narrative event is associated with a time domain, i.e., the set of possible dates on which the event could have taken place
  • Temporal and medical information extracted from narratives is formulated as a set of constraints on the time domain of the narrative event
  • Use Constraint Logic Programming tools to resolve the time domains of narrative events
  • If the resolved time domain of a narrative event contains the date of a structured event, link the narrative event to the structured event
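The core idea can be illustrated without a CLP solver by representing a time domain as an explicit set of dates and intersecting it with the structured events' dates; the event identifiers and dates below are illustrative:

```python
from datetime import date, timedelta

def daterange(start, end):
    """Inclusive set of candidate dates from start to end."""
    return {start + timedelta(days=n) for n in range((end - start).days + 1)}

# Narrative event: "Chest X-RAY arranged for next week" in a letter
# dated 2000-05-16; the "next week" constraint resolves its time
# domain to the following calendar week.
domain = daterange(date(2000, 5, 22), date(2000, 5, 28))

# Candidate structured events of the same type (chest X-RAYs),
# keyed by hypothetical identifiers.
structured = {
    "s2": date(2000, 5, 23),
    "s3": date(2000, 5, 26),
    "s4": date(2000, 7, 19),
}

# Link the narrative event to every structured event whose date
# falls inside the resolved time domain.
links = sorted(k for k, d in structured.items() if d in domain)
```

Here the July X-RAY is ruled out by the temporal constraint, while the two May X-RAYs both remain possible targets, showing why residual ambiguity can survive.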
Evaluation
• Evaluation of the effectiveness of temporal constraints in Information Integration
  • Link each narrative event to the set of potentially matching events of the same type in the structured data according to medical constraints
  • Measure how well the application of temporal constraints narrows down this initial set of "structured" candidates
• We used a semi-automated pipeline to produce an idealised version of what a fully automatic system would provide as input to the CSP component
  • Results must be viewed in the light of this idealised input
Data and Gold Standard
• Confined to investigation events
• Patient notes of 5 patients analysed and annotated (large overhead of manual annotation)
• 446 documents, of which 94 contain 152 investigation events
• Manually created gold standard linking each narrative event to the structured events of the same type and identifying the correct targets
Annotating Temporal Information
• We annotate times, events (i.e., investigations) and the temporal relations holding between these
• The annotation scheme used is a subset of the TimeML annotation scheme
• Example: "We have arranged an MRI scan for next week." The scan event is related to the time expression "next week" by a "during" relation
Evaluation: Recall & Precision
• We want to quantify the impact of using temporal constraints to reduce the ambiguity of mapping narrative events to structured events
• Ideally, temporal constraints should greatly reduce ambiguity by eliminating incorrect candidates from the set of possible targets in the structured data, but not eliminate the true target
• Global evaluation measures:
  • Recall: proportion of correct targets recognised as possible targets
  • Precision: proportion of recognised possible targets that are correct
• We applied both metrics before and after application of temporal constraints in the CSP and compared the results
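The two global measures can be sketched directly from their definitions, pooling candidate and true target sets over all narrative events (the toy event data below is illustrative):

```python
def recall_precision(events):
    """events: list of (candidate_targets, true_targets) set pairs,
    one pair per narrative event. Returns global recall and precision
    pooled over all events."""
    correct_found = sum(len(cands & true) for cands, true in events)
    total_true = sum(len(true) for _, true in events)
    total_cands = sum(len(cands) for cands, _ in events)
    return correct_found / total_true, correct_found / total_cands

# Two narrative events: each keeps its true target among the
# surviving candidates, so recall is perfect but precision suffers
# from the extra candidates.
events = [({"s1", "s2", "s3"}, {"s2"}), ({"s4", "s5"}, {"s5"})]
r, p = recall_precision(events)
```

Applying the temporal constraints should shrink the candidate sets, raising precision while ideally leaving recall at 1.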
Evaluation: Strict & Liberal Accuracy
• The limitation of the Recall and Precision metrics is that they score over the overall data set, i.e. over all events for all 5 patients
• If even a small number of events retain a large number of possible targets, the overall precision score will be low, even though most events are close to being correctly resolved
• Consequently, we developed two "accuracy"-based scores (liberal and strict), which quantify for each narrative event the extent to which it is correctly resolved, and then average across all narrative events
  • Liberal score for a single event: 1 if at least one true target is correctly preserved, 0 otherwise
  • Strict score for a single event: proportion of recognised possible targets that are correct
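The per-event scores average differently from the pooled metrics, which is the point of introducing them; a sketch over the same (candidate_targets, true_targets) representation, with illustrative toy data:

```python
def liberal_strict(events):
    """events: list of (candidate_targets, true_targets) set pairs,
    one per narrative event. Score each event, then average across
    all events."""
    # Liberal: 1 if at least one true target survives, else 0.
    liberal = sum(1 if cands & true else 0 for cands, true in events) / len(events)
    # Strict: per-event proportion of surviving candidates that are correct.
    strict = sum(len(cands & true) / len(cands) for cands, true in events) / len(events)
    return liberal, strict

events = [({"s1", "s2", "s3"}, {"s2"}), ({"s4", "s5"}, {"s5"})]
lib, st = liberal_strict(events)
```

Because strict accuracy averages per event, one badly resolved event with many residual candidates no longer dominates the score the way it does for pooled precision.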
Discussion
• The results show a substantial amount of ambiguity at the start, which is reduced by the application of temporal constraints, as best shown by the strict accuracy score
• A large degree of ambiguity remains, but …
  • Use of temporal information is conservative
    • E.g., a "past" narrative event is linked to all structured events dated before the date of the letter, but could heuristically be linked to the one structured event dated immediately before the date of the letter
  • We have not yet exploited additional medical information, e.g., the locus of an investigation, nor additional temporal information, e.g., temporal relationships between events
Conclusions & Future Work
• Information Extraction
  • Essential functionality implemented
  • Extending coverage of the system
  • Evaluating performance
• Information Integration
  • Initial assessment of the approach
  • Automating the processing pipeline
  • Extending the method to other events