Infrastructure for Semantic Expansion and Curation of the RadLex Ontology
Rebecca Hazen & Alexander van Esbroeck
Northwestern University
Dr. David Channin, Mentor
Background
• RadLex: Radiology Lexicon
• Reduce variation and improve clarity in radiology reports
• 11,962 terms over 12 categories
Establishing the need…
• Missing many terms
  • Imaging Observations
  • Imaging Observation Characteristics
• Committee-dependent development process
• Manual, time-consuming, expensive
• Larger lexicons are harder to manage
• Difficult to sustain
Proposed Solution
• Develop an automatic term extraction system
• Focusing on Imaging Observations and Imaging Observation Characteristics
• Accelerate the expansion of RadLex
• Decrease the demands on committees
• Propose lists of strong candidates for inclusion
• Reduce development costs
Processing System Description
• Collect free full-text articles from medical journals
• Identify new terms using LexEVS and NLP techniques
• Create ranked lists of imaging observations and characteristics
Processing System Overview
[Diagram. Components: LexEVS, Article Finder, Candidate Term Identification, Context Processing. Labels: Concepts/Relationships, Article Text, Data/Annotations. Output: Ranked Lists of Imaging Observations & Characteristics.]
LexEVS
• Developed by NCI, NIH, caBIG, Mayo Clinic
• Designed to fulfill a community need for standards in storing, accessing, managing and distributing controlled vocabularies
• Combination of LexBIG, LexGrid, EVS
• Programmable interfaces for accessing and distributing controlled vocabularies
• Provides a common API
UIMA Architecture
• Framework for processing large collections of documents
• Processing modules can be connected into pipelines
Article Finder
• Locates and retrieves scientific articles
• Searches PubMed
• Returns free full-text, English, HTML articles
• Removes tags and extracts the article text
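The tag-removal step can be illustrated with a minimal sketch. The slides do not say how the Article Finder strips markup; the version below is an assumption that uses only Python's standard `html.parser`, and the names `TextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the article text of an HTML page with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real article page would also need navigation menus and reference lists filtered out, which this sketch does not attempt.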
Articles Processed
• 1,128 Documents
• Search pattern: {Imaging|CT|MR|PET|X-ray|US|angiography|tomography} findings [Title]
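One plausible reading of the search pattern above is a set of title searches, one per modality, each for the phrase "<modality> findings". The sketch below expands the pattern into PubMed-style query strings under that assumption; the function names are hypothetical, and the exact query the authors submitted is not given in the slides.

```python
# Modalities listed in the search pattern on the slide.
MODALITIES = [
    "Imaging", "CT", "MR", "PET", "X-ray", "US", "angiography", "tomography",
]


def build_title_queries(modalities=MODALITIES):
    """Return one title-restricted phrase query per modality."""
    return [f'"{m} findings"[Title]' for m in modalities]


def build_combined_query(modalities=MODALITIES):
    """Join the per-modality queries into a single OR query."""
    return "(" + " OR ".join(build_title_queries(modalities)) + ")"
```

Restricting to free full-text, English articles would need additional filters on top of this query.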
Candidate Phrase Identification
• Identifies a list of candidate phrases from the articles
• Tokenizer
• Part-of-speech Tagger
• Linguistic Filter
• Extracts sequences of words matching a specific pattern
• Example: “Increased renal enhancement” (“-ed” verb, adjective, noun)
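A linguistic filter of this kind can be sketched as a regular expression over part-of-speech tags. The slides give only one example pattern, so the pattern used here (any run of past participles, adjectives and nouns that ends in a noun) is an assumption, as are the Penn Treebank tag names and the function names. The input is assumed to be already tokenized and tagged.

```python
import re


def tag_class(tag):
    """Map a Penn Treebank tag to a one-letter class."""
    if tag == "VBN":            # past participle, the "-ed" verb
        return "V"
    if tag.startswith("JJ"):    # adjective
        return "A"
    if tag.startswith("NN"):    # noun
        return "N"
    return "O"                  # anything else breaks a phrase


# Assumed pattern: modifiers and nouns in any order, ending in a noun.
PATTERN = re.compile(r"[VAN]*N")


def candidate_phrases(tagged, min_len=2):
    """Return phrases of at least min_len words whose tags match PATTERN.

    tagged is a list of (word, tag) pairs for one sentence.
    """
    classes = "".join(tag_class(tag) for _, tag in tagged)
    phrases = []
    for m in PATTERN.finditer(classes):
        if m.end() - m.start() >= min_len:
            words = [word for word, _ in tagged[m.start():m.end()]]
            phrases.append(" ".join(words))
    return phrases
```

Because each token maps to exactly one character, match offsets in the class string are also token offsets, which keeps the filter short.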
LexEVS Annotator
• Use LexEVS to access vocabularies
• RadLex 2.0; NCI Thesaurus; HL7; CTCAE
• Determine if phrases exist in RadLex as a single concept
• Retrieve vocabulary metadata
• What is that…
• Annotate the document
• Build database of annotations
• Develop inclusion/exclusion criteria
Context Processing
• Find “indicator” words that are associated with existing RadLex terms
• Assign weights to those words as a function of the number of RadLex terms with which they are associated
Example text: Focal confluent fibrosis can occur in the cirrhotic liver as a hepatic mass in approximately 14% of cases [ ]. This fibrosis is accompanied by atrophy of the affected liver parenchyma and retraction of the overlying liver capsule (Figure 9).
Context Processing
• Use those “indicator” words to identify new phrases
• Score new phrases as a function of the strength of their association with the “indicator” words
Example text: Less extensive findings included interlobular septal thickening. Interlobular septal thickening was seen in 32 patients (89%). A luminal mass was considered to be present if there was a soft-tissue mass in the lumen that arose from the bowel wall.
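The two context-processing steps can be sketched as follows. The slides do not give the weighting or scoring functions, so this is an assumption modeled on the context factor of the NC-value method in reference [2]: a word's weight is the fraction of known terms it co-occurs with, and a candidate's score is the weighted count of its context words. Sentence-level co-occurrence, the stop-word handling, and all function names are also assumptions.

```python
from collections import Counter, defaultdict


def context_words(tokens, phrase, stop_words):
    """Return the non-stop words around phrase in tokens, or None if absent."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            rest = tokens[:i] + tokens[i + n:]
            return [w for w in rest if w not in stop_words]
    return None


def indicator_weights(sentences, known_terms, stop_words=frozenset()):
    """Weight each word by the share of known terms it appears with."""
    terms_by_word = defaultdict(set)
    seen_terms = set()
    for tokens in sentences:
        for term in known_terms:
            words = context_words(tokens, term.split(), stop_words)
            if words is None:
                continue
            seen_terms.add(term)
            for w in words:
                terms_by_word[w].add(term)
    total = len(seen_terms)
    return {w: len(ts) / total for w, ts in terms_by_word.items()}


def context_score(candidate, sentences, weights, stop_words=frozenset()):
    """Sum indicator weights over every context word of the candidate."""
    counts = Counter()
    for tokens in sentences:
        words = context_words(tokens, candidate.split(), stop_words)
        if words is not None:
            counts.update(words)
    return sum(n * weights.get(w, 0.0) for w, n in counts.items())
```

In this sketch a candidate such as "thickening" scores well when it appears next to words like "seen" that also accompany existing RadLex terms.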
Phrase Ranking
• Calculate a termhood¹ value for each phrase
• Termhood is based on a combination of:
  • Nesting
  • Context Scores
  • Length
  • Orthography
  • Stop List
¹ “Termhood” refers to the likelihood that a candidate is a real term [2]
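The nesting and length ingredients correspond to the C-value measure of reference [2]. The sketch below computes C-value from phrase frequencies; it is not necessarily the authors' termhood formula, since the slides do not say how the five ingredients are combined. The function names are hypothetical.

```python
import math


def contains(longer, shorter):
    """True if word list shorter occurs contiguously in a strictly longer list."""
    n = len(shorter)
    if len(longer) <= n:
        return False
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_values(freq):
    """Compute C-value for each candidate phrase.

    freq maps each candidate phrase to its corpus frequency.
    C(a) = log2(|a|) * f(a)                       if a is not nested
    C(a) = log2(|a|) * (f(a) - mean f(b), b ⊃ a)  otherwise
    where |a| is the number of words in a.
    """
    words = {p: p.split() for p in freq}
    scores = {}
    for a in freq:
        longer = [b for b in freq if contains(words[b], words[a])]
        f = float(freq[a])
        if longer:
            # Discount occurrences that are explained by longer candidates.
            f -= sum(freq[b] for b in longer) / len(longer)
        scores[a] = math.log2(len(words[a])) * f
    return scores
```

Single-word candidates score zero under this formula because log2(1) = 0; the measure in [2] is defined for multi-word terms.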
Term Splitting
• Phrases typically consist of an observation accompanied by one or more characteristics of that observation
• Term splitting splits phrases into component characteristics and observations
• Based on frequency ratios
• Makes two new ranked lists
Candidate Term: “mediastinal soft tissue infiltration”
  • mediastinal
  • soft tissue
  • infiltration
Results
• # imaging observations
• # imaging observation characteristics
• % precision
• Precision is defined as ….
Conclusions
• LexEVS is a powerful tool for exploiting a variety of controlled vocabularies
• Automatic term extraction can identify new imaging observations and observation characteristics
• Adjusting context and processing can lead to other kinds of terms
• Broader searches for articles will lead to larger collections of terms
Future Work
• Use syntactic structure to improve extraction
• Automatic identification of relationships
• Infrastructure for distributed editing
• Semantic Wiki
Selected References
1. Langlotz CP. RadLex: a new method for indexing online educational materials. Radiographics. 2006 Nov-Dec;26(6):1595-7.
2. Frantzi K, Ananiadou S, Mima H. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries. 2000;3(2):115-130.
3. Baneyx A, Charlet J, Jaulent M. Building an ontology of pulmonary diseases with natural language processing tools using textual corpora. International Journal of Medical Informatics. 2007;76(2-3):208-215.
4. Zhou L, Tao Y, Cimino J, Chen E, Liu H, Lussier Y, Hripcsak G, Friedman C. Terminology model discovery using natural language processing and visualization techniques. Journal of Biomedical Informatics. 2006;39(6):626-636.
5. Church K, Hanks P. Word association norms, mutual information, and lexicography. Computational Linguistics. 1990;16(1):22-29.
6. Snow R, Jurafsky D, Ng A. Learning syntactic patterns for automatic hypernym discovery. Advances in Neural Information Processing Systems. 2005;17:1297-1304.