360 likes | 502 Views
Intelligent Information Directory System for Clinical Documents. Qinghua Zou 6/3/2005. Dr. Wesley W. Chu (Advisor). Keyword Search Problems Hard to compose good keywords Lack an outlook of the content Interchangeable words. When searching clinical reports. Intelligent Directory System.
E N D
Intelligent Information Directory System for Clinical Documents Qinghua Zou 6/3/2005 Dr. Wesley W. Chu (Advisor)
Keyword Search Problems Hard to compose good keywords Lack an outlook of the content Interchangeable words When searching clinical reports
Intelligent Directory System • 1. Overview • 2. Extracting Key Concepts • 3. Mining Topics • 4. Building Directories • 5. Searching • 6. Conclusion
2. Concept Extraction • 2.1 Introduction • 2.2 Our approach: IndexFinder • Index Phase (Offline) • Search Phase (Real Time) • 2.3 Experiments • 2.4 Summary
Clinical Texts • Extract key info. • Standard terms 2.1 Motivation • Clinical texts are valuable in medical practice • Search relevant reports • Search similar patients • What is key information? • UMLS provides • key medical concepts • Our Goal • Extract UMLS concepts from clinical texts
UMLS NLP Parser Noun phrases Mapping ip dp i1 • lambs • oats UMLS Concepts lambs i0 vp v0 dp will eat oats 2.1 Previous Approaches Free text
2.1 Problems of Previous Approaches • Concepts cannot be discovered if they are not in a single noun phrase. • E.g. In “second, third, and fourth ribs”, “Secondrib” can not be discovered. • Difficult to scale to large text computing. • Natural language processing requires significant computing resources
Our approach: UMLSfree text Free text NLP Parser Index phase (offline) UMLS 2GB Noun phrases Indexing Mapping Concepts Index Data ~80MB UMLS Extracting Filtering concepts Search phase (real time) Free text 2.2 Our Approach: IndexFinder Previous: free textUMLS Suppose UMLS contains only “Lung cancer” We would discard all words in the text except “lung” and “cancer”.
2.2 Our Approach: What’s New? • Knowledge-based approach • Using the compact index data without using any database system • Permuting words in a sentence to generate UMLS concept candidates. • Using filters to eliminate irrelevant concepts.
2.2 Concept Candidates Generation Assumptions • Knowledge base provides a phrase table. • Each phrase (concept) is a set of words. • An input text T is represented as a set of words. Goal • Combining words in T to generate concept candidates Example • T={D,E,F} • Answer: 5
2.2 Search Phase: Filtering Use filters to eliminate irrelevant concepts • Syntactic filter: • Word combination is limited within a sentence. • Semantic filter: • Filter out irrelevant concepts using semantic types (e.g. body part, disease, treatment, diagnose). • Filter out general concepts using the ISA relationship and keep the more specific ones.
MetaMap IndexFinder 2.3 Experiment Comparison with MetaMap [3] Input:A small mass was found in the left hilum of the lung.
2.4 Summary • An efficient method that maps from UMLS to free text for extracting concepts without using any database system. • Syntactic and semantic filters are used to eliminate irrelevant candidates. • IndexFinder is able to find more specific concepts than NLP approaches. • IndexFinder is scalable and can be operated in real time.
3. Mining Topics: SmartMiner • 3.1 Introduction • 3.2 Search Space • 3.3 SmartMiner • 3.4 Experiment • 3.5 Summary
3.1 Introduction • A Topic (assumption) • a set of concepts • a frequent pattern • Finding topics by data mining • Frequent patterns, or • Maximal frequent patterns • Require efficient data mining
Dataset id: item set a, b, c, d, e, 1: a b c d e 2: a b c d 3: b c d 4: b e 5: c d e ab, ac, ad, bc, bd, be, cd, ce, de, abc, abd, acd, bcd, cde, abcd MinSup=2 3.1 Data Mining Problem What itemsets are frequent itemsets (FI)? Maximal frequent itemset(MFI): No superset is frequent. MFI abcd, be, cde
3.1 Why MFI not FI? • Mining FI is infeasible when there exists long FI. • E.g, Suppose we have a 20-item frequent set a1 a2 …a20. All of its subset are frequent, i.e., 220=1,048,576 • Mining MFI is fast and we can generate all the FI.
3.1 Previous work • Superset checking. • A study shows that CPU spends 40% time for superset checking. • Search tree is too large • A large number of support counting • Need more efficient method
simplify Ø:abcde :abcde What is the space of ? ab:cd 3.2 Search space Given 5 items: a, b, c, d, e. What is the search space? Ø, a, b, c, d, e, ab, ac, ad, ae, bc, …, abcde We use “head:tail” to denote the space as: ab, abc, abd, abcd
3.2 Space decomposition For a space :abcde, if abcg is frequent, • Then, the known space • any subset of abc is frequent • known space is :abc • The unknown space are: • Any itemsets contain d or e. • d:abceande:abc • :abcde = d:abce + e:abc + :abc
A1 A1 B1 B2 … Bn B1 B’ … Creating B2 before exploring B1 Creating B’ after exploring B1 … … 3.3 The basic idea Using information from B to prune the space at B’ (b) SmartMiner Strategy (a) Previous approach SmartMiner takes advantages of the information from previous steps.
3.3 The tail information • For the space :abcde, if we know abcf, abcg and abfg are frequent, then we project them to the space. • abcfabc. • abcgabc. • abfgab. • Thus • Tinf(abcf,abcg, abfg|:abcde)={abc}
3.5 Summary • SmartMiner uses tail information to guide the mining, efficient since • A smaller search tree. • No superset checking. • Reduces the number of support counting.
4. Building Directories • 4.1 Introduction • 4.2 Knowledge Hierarchies • 4.3 User Specification • 4.4 Directory Generation • 4.5 Integration various directories • 4.6 Summary
Three Inputs Topics Key Content Knowledge trees Meaningful User specs Customized 4.1 Introduction
4.2 Knowledge Hierarchies • UMLS concept hierarchies • PA: parent-child relationship • RA: rather-than relationship • Problems • A concept: several parents, different granularity • [lung cancer] [Neoplasms, Respiratory Tract] • [lung cancer] [Neoplasms, Respiratory System] • A concept: hundreds of paths to roots • [lung cancer]: 233 different paths in UMLS by PA
4.2 Select Proper Hierarchies • Set source preference order, e.g • [disease]: ICD9>SNOMED>MeSH • [body part]: SNOMED>ICD9 • Select proper granularity • C: a set of concepts; n: a path node • Score function for selecting the node n • S(n)=|{ci| cin, ci in C}| • Expert review
4.3 User Specifications • A good directory ~ usage pattern • User spec usage pattern • User may have different specs • A spec: a series of knowledge names • [disease] + [body part], or • [body part] + [disease] • Build a directory for a spec by the ordering
4.4 Directory GenerationAn example User spec 1: d + p [disease] + [body part] User spec 2: p + d [body part] + [disease]
4.4 ~ An example d + p 1 1 1 1 p + d 1 1 1 1
For each Di, get all dir paths to Di A Di is tree: XML Key words can associate with tree nodes Query: xpath Exist redundant information 4.5 Integration various directories
4.5 simplified model • Keep only the first level knowledge trees • For //d6//p6, we use XPath query //doc[//d6 and //p6] • Size smaller, require some computation
4.6 Summary • Build directory by • Topics • Knowledge hierarchies • User specifications • Mapping directories to XML • By collecting directory paths for each document • Leverage on existing XML technologies