Progress on the ODIE Toolkit: Research, Results, and Future Developments

NCBO Seminar Series
March 3, 2010
Progress on the ODIE Toolkit
Rebecca Crowley, University of Pittsburgh

Agenda
• Quick overview of project and people
• Progress on project and software
• Demo of ODIE version 1.0
• What’s planned for version 1.1
• Sample of results from research projects
• Manuscripts and collaborations
• Future of the project

Two Tasks ~ One problem
• Information Extraction: Use ontologies to create structured data from unstructured clinical data
• Ontology Enrichment: Use clinical documents as a source of concepts and relationships to enrich and validate ontologies
[Diagram labels: Ontology, Text]

Specific Aims
• Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including: Named Entity Recognition; Co-reference Resolution; Discourse Reasoning and Attribute Value Extraction
• Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including: Concept Discovery; Concept Clustering; Taxonomic positioning
• Specific Aim 3: Develop reusable software for performing information extraction and ontology development leveraging existing NCBO tools and compatible with NCBO architecture.
• Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit.
• Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.

Domain
• Will attempt to develop general tools whenever possible
• Priorities for evaluation of components: radiology and pathology reports; NCIT as well as clinically relevant OBO ontologies (e.g. RadLex); cancer domains (including hematologic oncology)

Product Goals
• Toolkit for developers of NLP applications and ontologies
• Support interaction and experimentation
• Foster cycle of enrichment and extraction needed to advance development of NLP systems
• Package systems at the conclusion of working with ODIE (not yet)
• Ontology enrichment as opposed to de novo development
• Human-machine collaboration as opposed to fully automated learning

Users/Workflow ODIE is intended for: • users who want to use NCBO ontologies to perform various NLP tasks (+/- may need to add concepts locally to achieve sufficient performance) • users who want to enrich ontologies using concepts derived from documents (very early in process of ontology development)

People and Organization
• Software: Develop and implement architecture and UI; create framework for using results of research; implement work of research groups
• Coreference Resolution: Develop annotation scheme; create Reference Standard; consider and test existing algorithms; design, implement & test new algorithms
• Ontology Enrichment: Study and compare methods for ontology enrichment; design methods for evaluation

Progress since last NCBO Talk
• ODIE 1.0 released with UIMA pipeline (12/14/09)
• Now releasing on NCBO G-Forge Site (2X/yr)
• Y2 ODIE Face to Face held in Pittsburgh with participation of all three groups
• 3 submitted manuscripts, 3 more in preparation
• Nearing release of ODIE 1.1 (expected 3/10)

What’s new in ODIE 1.0?
• Load and use any ontology from NCBO BioPortal or an .owl/.obo file
• Run any UIMA pipeline on a set of documents
• Install a PEAR or work directly with an Analysis Engine Descriptor
• Visualize annotated documents
• No configuration support in this version; users must configure the pipeline using the descriptor file
• If the pipeline uses certain ODIE configuration parameters, additional UI features are exposed
• Two new statistical methods for enrichment
• Analyze up to 250 documents at a time (with 1.5 GB RAM)

ODIE v1.0 System Requirements
Recommended System
• Windows or Linux OS
• Intel Core 2 Duo @ 1.5 GHz+ or equivalent
• 1.5 GB RAM
• 1 GB disk space; additional space required as you add more ontologies
• Internet access for connecting to BioPortal

System Architecture

ODIE Download/Info
• GForge Site: https://bmir-gforge.stanford.edu/gf/project/odie/
• User Forums: https://bmir-gforge.stanford.edu/gf/project/odie/forum/
• ODIE on NCBO Tools Page: http://bioontology.org/ODIE
• ODIE Installer: http://caties.cabig.upmc.edu/ODIE/odieinstaller.exe
• User Manual: http://caties.cabig.upmc.edu/ODIE/odiev1_0manual.doc

ODIE 1.0 Demo

Lexical Syntactic Pattern (LSP) Uncovered NPs and known NEs fit precompiled hyponymy pattern

LSP Implementation
• Used the GATE Gazetteer and the Java Annotation Patterns Engine (JAPE)
• Year One work ported to the new UIMA environment
• Provided UIMA wrappers for GATE Processing Resources
• GATE annotations flow generically in and out of the UIMA CAS
• Patterns taken from the literature, including Hearst, Snow, and Charniak
• New patterns derived from hand inspection of clinical corpora by Kaihong Liu

Mutual Information Measure (Church) Terms always appear together

Mutual Information (Church) Implementation
• Term pairs are scored using $I(x,y) = \log_2 \frac{f(x,y,w) \cdot N}{f(x) \cdot f(y)}$
• $f(x,y,w)$ is defined as the frequency at which two terms co-occur within a window of $w$; we used a window size of 4
• Pairs must have a frequency of at least 3
• $I(x,y)$ is normalized to the range 0.0 to 1.0, and suggestions are presented in descending order
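The scoring on this slide can be sketched in Python. This is a minimal illustration, not the ODIE implementation: the function name is hypothetical, N is assumed to be the token count of the corpus, "co-occur in a window" is assumed to mean that y follows x within the next w tokens, and min-max scaling is assumed for the normalization step, which the slide does not specify.

```python
import math
from collections import Counter

def mutual_information_scores(tokens, window=4, min_pair_freq=3):
    """Rank term pairs by I(x,y) = log2(f(x,y,w) * N / (f(x) * f(y)))."""
    n = len(tokens)                      # N: assumed to be the corpus size in tokens
    term_freq = Counter(tokens)          # f(x)
    pair_freq = Counter()                # f(x,y,w)
    for i, x in enumerate(tokens):
        # Assumption: y co-occurs with x if it follows x within `window` tokens.
        for y in tokens[i + 1 : i + 1 + window]:
            if x != y:
                pair_freq[(x, y)] += 1
    raw = {
        pair: math.log2(f_xy * n / (term_freq[pair[0]] * term_freq[pair[1]]))
        for pair, f_xy in pair_freq.items()
        if f_xy >= min_pair_freq         # pairs must occur at least 3 times
    }
    if not raw:
        return []
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0              # avoid dividing by zero when all scores tie
    # Assumption: min-max scaling maps the scores onto the range 0.0 to 1.0.
    scored = [(pair, (score - lo) / span) for pair, score in raw.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `mutual_information_scores(tokens)` on a tokenized corpus returns `((x, y), score)` tuples with the strongest associations first, which mirrors the descending order in which suggestions are presented.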

Similarity Measure (Lin) Words are used in similar contexts

Similarity Measure (Lin) Implementation
• Based on the Minipar broad-coverage parser, which provides word-level dependency triples such as (copd, s, involves)
• Used the Minipar executable as wrapped in the GATE 5.0 distribution
• First calculate the mutual information of each triple across the corpus: $I(w_1,r,w_2) = \log \frac{\|w_1,r,w_2\| \times \|*,r,*\|}{\|w_1,r,*\| \times \|*,r,w_2\|}$
• Define $T(w)$ as the set of pairs $(r,w')$ such that $I(w,r,w')$ is positive
• Compute the similarity of two nouns $w_1$ and $w_2$ as a quotient: the numerator sums $I(w_1,r,w') + I(w_2,r,w')$ over the pairs in $T(w_1) \cap T(w_2)$, and the denominator adds the sum of $I(w_1,r,w')$ over $T(w_1)$ to the sum of $I(w_2,r,w')$ over $T(w_2)$
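A minimal Python sketch of this similarity computation follows. It assumes the dependency triples have already been produced by a parser and are passed in as plain tuples; the function name is hypothetical, and the logarithm follows Lin's published definition of the information measure. It is an illustration, not the ODIE code.

```python
import math
from collections import Counter

def lin_similarity(triples, w1, w2):
    """Similarity of two words computed from (word, relation, word') triples."""
    full = Counter(triples)                             # ||w, r, w'||
    by_rel = Counter(r for _, r, _ in triples)          # ||*, r, *||
    head_rel = Counter((w, r) for w, r, _ in triples)   # ||w, r, *||
    rel_tail = Counter((r, t) for _, r, t in triples)   # ||*, r, w'||

    def info(w, r, t):
        # Mutual information of one triple across the corpus.
        return math.log(
            full[(w, r, t)] * by_rel[r] / (head_rel[(w, r)] * rel_tail[(r, t)])
        )

    def features(w):
        # T(w): the (r, w') pairs whose information with w is positive.
        return {
            (r, t): info(w, r, t)
            for (h, r, t) in full
            if h == w and info(w, r, t) > 0
        }

    t1, t2 = features(w1), features(w2)
    shared = t1.keys() & t2.keys()                      # T(w1) intersect T(w2)
    numerator = sum(t1[f] + t2[f] for f in shared)
    denominator = sum(t1.values()) + sum(t2.values())
    return numerator / denominator if denominator else 0.0
```

Two words that share no positively informative contexts score 0.0, and a word compared with itself scores 1.0, so the value can be used directly to rank candidate terms against known concepts.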

All Techniques
• Only uncovered noun phrase terminology that has a method-scored relationship with a known named entity is elevated to an ODIE Suggestion
• All methods need the cTAKES Chunker for noun phrase (NP) discovery and the ODIE IndexFinder NER for named entity (NE) discovery
• NP and NE annotations are shared across all methods
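The elevation rule shared by all methods can be illustrated with a short Python sketch. Everything here is hypothetical naming: `Suggestion`, `elevate_suggestions`, and the tuple layout of `scored_pairs` are assumptions for illustration, and "uncovered" is read as "a noun phrase that the NER step did not match to a known entity".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Suggestion:
    noun_phrase: str    # uncovered term proposed for the ontology
    named_entity: str   # known entity it is related to
    method: str         # which enrichment method scored the pair
    score: float

def elevate_suggestions(scored_pairs, noun_phrases, named_entities):
    """Keep scored pairs that link an uncovered noun phrase to a known entity."""
    known = set(named_entities)
    # Assumption: a noun phrase is "uncovered" if NER did not recognize it.
    uncovered = set(noun_phrases) - known
    suggestions = []
    for method, a, b, score in scored_pairs:
        # A pair qualifies in either orientation.
        for phrase, entity in ((a, b), (b, a)):
            if phrase in uncovered and entity in known:
                suggestions.append(Suggestion(phrase, entity, method, score))
    return sorted(suggestions, key=lambda s: s.score, reverse=True)
```

Pairs in which both terms are already known, or in which neither is, are dropped, so every suggestion is anchored to an existing ontology concept.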

Next Release
• ODIE v1.1 will be released in late March.
• New features:
• All OE methods will include multi-word terms
• Co-reference visualization
• Additional charts and statistics for NER analyses
• Easier installation with zero configuration
• Ontology placement for new concepts
• Exporting proposal ontologies as OWL or CSV files

Coref Visualization (simple)

Coref Visualization (advanced)

Research Project 1: Ontology Enrichment
• Survey of OE methods
• Evaluation of utility of LSP
• Methodology to study OE utility
• Evaluation of statistical methods
• Concept Discovery
Ontology Enrichment group: Study and compare methods for ontology enrichment; design methods for evaluation
Kaihong Liu, Rebecca Crowley, Wendy Chapman, Kevin Mitchell

From Liu and Crowley, submitted 2/09

Review of methods – Linguistic
• Lexico-Syntactic Pattern (LSP) matching
• Assumption: syntactic regularities within a specialized corpus could indicate a particular semantic relationship between two nouns
• Hearst first explored this method for hypernym discovery
• Example: COMPATIBLE WITH BENIGN ECCRINE NEOPLASIA, SUCH AS NODULAR HIDROADENOMA
• Cue phrases: “such as”, “including”, “especially”, “other”
Hearst, M. A. (1992). "Automatic acquisition of hyponyms from large text corpora." Proc. of COLING

LSP Patterns
• The presence of certain “lexico-syntactic patterns” can indicate a particular semantic relationship between two nouns
• Example: DIFFERENTIAL DIAGNOSIS INCLUDES, BUT IS NOT LIMITED TO, SPINDLE CELL NEOPLASM OF PERINEURIAL ORIGIN (SUCH AS SCHWANNOMA) AND SPINDLE CELL MALIGNANT MELANOMA
• “such as” indicates a hyponym relationship between two noun phrases
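As a rough illustration of how such a pattern yields term pairs, the Python sketch below uses plain regular expressions. This is a stand-in, not the GATE/JAPE rules: the function name is hypothetical, the noun phrases are assumed to arrive already chunked, and only cue phrases in which the broader term comes first are handled ("other" is left out because its direction is reversed).

```python
import re

# Cue phrases from the slides in which the broader term precedes the cue.
FORWARD_CUES = ("such as", "including", "especially")

def find_hyponym_pairs(sentence, noun_phrases):
    """Return (broader, cue, narrower) triples for noun phrases flanking a cue."""
    pairs = []
    for cue in FORWARD_CUES:
        for broader in noun_phrases:
            for narrower in noun_phrases:
                if broader == narrower:
                    continue
                # Allow an optional comma or opening parenthesis before the cue.
                pattern = (
                    re.escape(broader) + r"\s*,?\s*\(?\s*"
                    + re.escape(cue) + r"\s+" + re.escape(narrower)
                )
                if re.search(pattern, sentence, flags=re.IGNORECASE):
                    pairs.append((broader, cue, narrower))
    return pairs
```

Given the example sentence and the three noun phrases it contains, the only pair that matches is the perineurial neoplasm phrase as the broader term and SCHWANNOMA as the narrower one; SPINDLE CELL MALIGNANT MELANOMA is joined by "AND" rather than a cue phrase, so it produces no pair.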

Evaluation of ontology suggestions
Questions asked of the extraction output:
• How many terms extracted by LSP are medically meaningful terms (MMTs)?
• How many of the extracted MMTs are not in the ontology and can therefore be new concept candidates?
• How many of the relationships between the MMTs are not in the ontology and can therefore be added to the ontology?
Two-step process:
• Step 1: Domain expert annotation
• Step 2: Ontology curator judging

Step 1: Domain expert annotation
• Input: two sets of data, one for pathologists and one for radiologists
• Annotation task: mark medically meaningful terms (MMTs) that can stand alone, before the LSP and after the LSP
• The terms before and after the LSP have to be related

Example
COMPATIBLE WITH BENIGN ECCRINE NEOPLASIA, SUCH AS NODULAR HIDROADENOMA.
• Labeled parts: term that precedes the LSP, the LSP, term that follows the LSP
• Output: list of paired terms
• Calculate: total # of MMTs, # of MMTs per LSP

Step 2: Ontologist judging
Domain expert annotation output is passed to the ontology curators.
For each term:
• Is the concept in the ontology?
• If not, should it be added to the ontology?
• If not, what is the reason?
For each pair of terms:
• What is the relationship between them?
• Does this relationship exist in the ontology?
• If not, should it be added to the ontology?
• If not, what is the reason?

Evaluation metrics
• Concept suggestion rate (CSR)
• Concept acceptance rate (CAR)
• Concept relationship suggestion rate (CRSR)
• Concept relationship acceptance rate (CRAR)

Ontology curators • NCIT curator: Dr. Nicholas Sioutos • RadLex curator: Dr. David Channin

Results – LSP distribution: number of sentences containing lexico-syntactic patterns

Results - MMT yield using LSP method 1 to 2 MMTs per LSP instance

Results – Ontology Concept Suggestion Rate and Ontology Concept Acceptance Rate

Results – Ontology Concept Relationship Suggestion Rate and Ontology Concept Relationship Acceptance Rate

Results – Relationship Distribution

Research Project 2: Coreference Resolution
• Annotation schema development and implementation in Knowtator
• Detailed guidelines document
• Annotated corpus (~100K tokens; double annotations and consensus)
• Prototype released as part of ODIE
• First manuscript submitted, second one underway
• Anticipate public release of corpus and guidelines
Coreference Resolution group: Develop annotation scheme; create Reference Standard; consider and test existing algorithms; design, implement & test new algorithms
Wendy Chapman, Guergana Savova, Melissa Castine

Manuscripts
Submitted:
• Liu K and Crowley RS. Natural Language Processing Methods and Systems for Biomedical Ontology Learning (Review). Submitted to JBI
• Liu K, Chapman WW, Savova GK, Chute C, Sioutos N, Crowley RS. Effectiveness of Lexico-Syntactic Pattern Matching for Ontology Enrichment with Clinical Documents. Submitted to MIM
• Savova GK, Chapman WW, Zheng J. Anaphoric relations in the clinical narrative: corpus creation. Submitted to JAMIA
Planned (next 3 months):
• Chavan G, Mitchell K, Liu K, Savova GK, Chapman WW, Chute C, Crowley RS. ODIE – A workbench for cyclic entity recognition and ontology enrichment. Planned for AMIA 2010 submission

Future of the project • Continued releases for remainder of grant • New coreference algorithms • Additional OE algorithms and modifications • Better integration with BioPortal • Planning to apply for competitive renewal (Dec ’10)
