
Agenda

NCBO Seminar Series March 3, 2010 Progress on the ODIE Toolkit Rebecca Crowley, University of Pittsburgh. Agenda. Quick overview of project and people Progress on project and software Demo of ODIE version 1.0 What’s planned for version 1.1 Sample of results from research projects


Presentation Transcript


  1. NCBO Seminar Series, March 3, 2010. Progress on the ODIE Toolkit. Rebecca Crowley, University of Pittsburgh

  2. Agenda • Quick overview of project and people • Progress on project and software • Demo of ODIE version 1.0 • What’s planned for version 1.1 • Sample of results from research projects • Manuscripts and collaborations • Future of the project

  3. Two Tasks ~ One problem • Information Extraction: Use ontologies to create structured data from unstructured clinical data • Ontology Enrichment: Use documents as a source of concepts and relationships to enrich and validate the ontology (Diagram labels: Ontology, Text)

  4. Specific Aims • Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including: Named Entity Recognition; Co-reference Resolution; Discourse Reasoning and Attribute Value Extraction • Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including: Concept Discovery; Concept Clustering; Taxonomic positioning • Specific Aim 3: Develop reusable software for performing information extraction and ontology development, leveraging existing NCBO tools and compatible with NCBO architecture • Specific Aim 4: Enhance the National Cancer Institute Thesaurus Ontology using the ODIE toolkit • Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers

  5. Domain • Will attempt to develop general tools whenever possible • Priorities for evaluation of components: radiology and pathology reports; NCIT as well as clinically relevant OBO ontologies (e.g. RadLex); cancer domains (including hematologic oncology)

  6. Product Goals • Toolkit for developers of NLP applications and ontologies • Support interaction and experimentation • Foster the cycle of enrichment and extraction needed to advance development of NLP systems • Package systems at the conclusion of working with ODIE (not yet) • Ontology enrichment as opposed to de novo development • Human-machine collaboration as opposed to fully automated learning

  7. Users/Workflow ODIE is intended for: • users who want to use NCBO ontologies to perform various NLP tasks (+/- may need to add concepts locally to achieve sufficient performance) • users who want to enrich ontologies using concepts derived from documents (very early in process of ontology development)

  8. People and Organization • Software: Develop and implement architecture and UI; create framework for using results of research; implement work of research groups • Coreference Resolution: Develop annotation scheme; create Reference Standard; consider and test existing algorithms; design, implement & test new algorithms • Ontology Enrichment: Study and compare methods for ontology enrichment; design methods for evaluation

  9. Progress since last NCBO Talk • ODIE 1.0 released with UIMA pipeline (12/14/09) • Now releasing on NCBO G-Forge Site (2X/yr) • Y2 ODIE Face to Face held in Pittsburgh with participation of all three groups • 3 submitted manuscripts, 3 more in preparation • Nearing release of ODIE 1.1 (expected 3/10)

  10. What’s new in ODIE 1.0? • Load and use any ontology from NCBO BioPortal or an .owl/.obo file • Can run any UIMA pipeline on a set of documents; can install a PEAR or work directly with an Analysis Engine Descriptor • Visualize annotated documents • No configuration support in this version: users must configure the pipeline using the descriptor file. If the pipeline uses certain ODIE configuration parameters, additional UI features are exposed • Two new statistical methods for enrichment • Analyze up to 250 documents at a time (with 1.5GB RAM)

  11. ODIE v1.0 System Requirements • Recommended system: Windows or Linux OS • Intel Core2 Duo @ 1.5 GHz+ or equivalent • 1.5GB RAM • 1GB disk space; additional space required as you add more ontologies • Internet access for connecting to BioPortal

  12. System Architecture

  13. ODIE Download/Info GForge Site: https://bmir-gforge.stanford.edu/gf/project/odie/ User Forums: https://bmir-gforge.stanford.edu/gf/project/odie/forum/ ODIE on NCBO Tools Page: http://bioontology.org/ODIE ODIE Installer: http://caties.cabig.upmc.edu/ODIE/odieinstaller.exe User Manual: http://caties.cabig.upmc.edu/ODIE/odiev1_0manual.doc

  14. ODIE 1.0 Demo

  15. Lexical Syntactic Pattern (LSP) Uncovered NPs and known NEs fit precompiled hyponymy pattern

  16. LSP Implementation • Used GATE Gazetteer and Java Annotations Processing Engine (JAPE) • Year One work ported to new UIMA environment • Provided UIMA Wrappers for GATE Processing Resources • GATE Annotations flow generically in and out of the UIMA CAS • Patterns taken from the literature, including Hearst, Snow, Charniak • New patterns derived from hand inspection of clinical corpora by Kaihong Liu

  17. Mutual Information Measure (Church) Terms always appear together

  18. Mutual Information (Church) Implementation • Term pairs are scored with I(x,y) = log2( (f(x,y,w) * N) / (f(x) * f(y)) ) • f(x,y,w) is the frequency with which two terms co-occur in a window of w; we used window size 4 • Pairs must have a frequency of at least 3 • I(x,y) is normalized to the range 0.0 to 1.0 and suggestions are presented in descending order
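The scoring on this slide can be sketched in a few lines of Python. This is only an illustration, not the ODIE code: it assumes N is the total token count of the corpus, that "window of w" means y occurs within the next w tokens after x, and that the normalization is a min-max rescaling (the slide states the 0.0 to 1.0 range but not the method). The function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(tokens, window=4, min_pair_freq=3):
    """Score ordered term pairs with I(x,y) = log2(f(x,y,w) * N / (f(x) * f(y))).

    Assumptions (not stated on the slide): N is the token count, and a pair
    (x, y) co-occurs when y appears within `window` tokens after x.
    """
    n = len(tokens)
    term_freq = Counter(tokens)
    pair_freq = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            pair_freq[(x, y)] += 1

    scores = {}
    for (x, y), f_xy in pair_freq.items():
        if f_xy >= min_pair_freq:  # slide: pairs need frequency of at least 3
            scores[(x, y)] = math.log2(f_xy * n / (term_freq[x] * term_freq[y]))
    return scores


def normalize(scores):
    """Min-max rescale scores to [0.0, 1.0]; the rescaling method is an assumption."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {pair: ((s - lo) / span if span else 1.0) for pair, s in scores.items()}


def ranked(scores):
    """Return (pair, score) tuples in descending score order."""
    return sorted(normalize(scores).items(), key=lambda item: item[1], reverse=True)
```

Calling `ranked(mutual_information(tokens))` on a tokenized corpus yields candidate pairs in the descending order the slide describes.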

  19. Similarity Measure (Lin) Words are used in similar contexts

  20. Similarity Measure (Lin) Implementation • Based upon the Minipar broad-coverage parser, which provides word-level triples like (copd,s,involves) • Used the Minipar executable wrapped with the GATE 5.0 distribution • First calculate mutual information for each triple across the corpus as I(w1,r,w2) = log( ( ||w1,r,w2|| x ||*,r,*|| ) / ( ||w1,r,*|| x ||*,r,w2|| ) ) • Define T(w) as the set of pairs (r,w’) such that I(w,r,w’) is positive • Compute the similarity of two nouns w1 and w2 as the quotient of the I(w,r,w’) values summed over T(w1) intersect T(w2) and the I(w,r,w’) values summed over T(w1) and T(w2) individually
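A minimal Python sketch of this computation follows. It is an illustration under stated assumptions, not the ODIE implementation: the dependency triples are taken as given (in ODIE they come from Minipar), the information value uses the natural logarithm of the count ratio as in Lin's published definition, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def triple_information(triples):
    """I(w1,r,w2) = log(||w1,r,w2|| * ||*,r,*|| / (||w1,r,*|| * ||*,r,w2||))."""
    full = Counter(triples)
    rel = Counter(r for _, r, _ in triples)             # ||*,r,*||
    head = Counter((w1, r) for w1, r, _ in triples)     # ||w1,r,*||
    tail = Counter((r, w2) for _, r, w2 in triples)     # ||*,r,w2||
    return {
        (w1, r, w2): math.log(count * rel[r] / (head[(w1, r)] * tail[(r, w2)]))
        for (w1, r, w2), count in full.items()
    }


def feature_sets(info):
    """T(w): the (r, w') pairs whose information value with w is positive."""
    t = defaultdict(dict)
    for (w1, r, w2), value in info.items():
        if value > 0:
            t[w1][(r, w2)] = value
    return t


def lin_similarity(t, w1, w2):
    """Information shared by T(w1) and T(w2), divided by their total information."""
    f1, f2 = t.get(w1, {}), t.get(w2, {})
    shared = f1.keys() & f2.keys()
    numerator = sum(f1[k] + f2[k] for k in shared)
    denominator = sum(f1.values()) + sum(f2.values())
    return numerator / denominator if denominator else 0.0
```

Two words that occur with exactly the same informative (relation, word) contexts score 1.0, and words with no shared contexts score 0.0.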

  21. All Techniques • Only uncovered Noun Phrase terminology that has a method-scored relationship with a known Named Entity is elevated to an ODIE Suggestion • All methods need the cTAKES Chunker for Noun Phrase discovery and the ODIE IndexFinder NER for Named Entity discovery • NP and NE annotations are shared across all methods
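The filtering rule on this slide can be sketched as follows. This is a hypothetical illustration: the data shapes (a dict of scored term pairs plus plain lists of noun phrases and named entities) and the function name are assumptions, not ODIE's actual interfaces.

```python
def odie_suggestions(scored_pairs, noun_phrases, named_entities):
    """Keep only pairs linking an uncovered noun phrase to a known named entity.

    scored_pairs: {(term_a, term_b): score} from any enrichment method.
    Returns (uncovered_np, known_ne, score) tuples, highest score first.
    """
    known = set(named_entities)
    uncovered = set(noun_phrases) - known  # NPs the ontology does not cover yet
    suggestions = []
    for (a, b), score in scored_pairs.items():
        if a in uncovered and b in known:
            suggestions.append((a, b, score))
        elif b in uncovered and a in known:
            suggestions.append((b, a, score))
    return sorted(suggestions, key=lambda s: s[2], reverse=True)
```

Pairs in which both terms are already known, or both are uncovered, produce no suggestion, which matches the rule that a suggestion must be anchored to an existing ontology concept.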

  22. Next Release • ODIE v1.1 will be released in late March. • New Features • All OE methods will include multi-word terms • Co-reference Visualization • Additional charts and statistics for NER analyses • Easier installation with zero configuration. • Ontology placement for new concepts • Exporting proposal ontologies as OWL or CSV files.

  23. Coref Visualization (simple)

  24. Coref Visualization (advanced)

  25. Research Project 1: Ontology Enrichment • Survey of OE methods • Evaluation of utility of LSP • Methodology to study OE utility • Evaluation of statistical methods • Concept Discovery • Group charge: Study and compare methods for ontology enrichment; design methods for evaluation • People: Kaihong Liu, Rebecca Crowley, Wendy Chapman, Kevin Mitchell

  26. From Liu and Crowley, submitted 2/09

  27. Review of methods – Linguistic • Lexico-Syntactic Pattern (LSP) matching • Assumption: syntactic regularities within a specialized corpus could indicate a particular semantic relationship between two nouns • Hearst first explored this method for hypernym discovery • Example: COMPATIBLE WITH BENIGN ECCRINE NEOPLASIA, SUCH AS NODULAR HIDROADENOMA • “Such as”, “including”, “especially”, “other” Hearst, M. A. (1992). "Automatic acquisition of hyponyms from large text corpora." Proc. of ACL

  28. LSP Patterns The presence of certain “lexico-syntactic patterns” can indicate a particular semantic relationship between two nouns. Example: DIFFERENTIAL DIAGNOSIS INCLUDES, BUT IS NOT LIMITED TO, SPINDLE CELL NEOPLASM OF PERINEURIAL ORIGIN (SUCH AS SCHWANNOMA) AND SPINDLE CELL MALIGNANT MELANOMA. “such as” indicates a hyponym relationship between two noun phrases
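The pattern idea can be illustrated with a small Python sketch. It is not the GATE/JAPE implementation described earlier: it assumes the noun phrases have already been found by a chunker and are passed in sentence order, it covers only the cues in which the hypernym comes first ("such as", "including", "especially"), and the function name is hypothetical.

```python
import re

# Cue phrases where the preceding NP is the hypernym and the following NP the hyponym.
CUES = [r"such as", r"including", r"especially"]
# The text between two adjacent NPs must be nothing but a cue plus punctuation.
GAP = re.compile(r"^[\s,(]*(?:" + "|".join(CUES) + r")[\s,]*$", re.IGNORECASE)


def hyponym_pairs(sentence, noun_phrases):
    """Return (hypernym, hyponym) pairs for adjacent NPs joined by a cue phrase.

    noun_phrases: NP strings in sentence order (assumed output of a chunker).
    """
    lowered = sentence.lower()
    spans, cursor = [], 0
    for np_text in noun_phrases:
        start = lowered.find(np_text.lower(), cursor)
        if start == -1:
            continue  # NP not found after the previous one; skip it
        end = start + len(np_text)
        spans.append((start, end))
        cursor = end

    pairs = []
    for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
        if GAP.match(sentence[e1:s2]):
            pairs.append((sentence[s1:e1], sentence[s2:e2]))
    return pairs
```

Applied to the slide's example with the noun phrases SPINDLE CELL NEOPLASM OF PERINEURIAL ORIGIN and SCHWANNOMA, the gap " (SUCH AS " matches and the pair is returned with the neoplasm as hypernym. Patterns such as "Y and other X", where the order is reversed, would need a separate rule.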

  29. Evaluation of ontology suggestions Extraction Output Two step process • How many terms extracted by LSP are medically meaningful (MMT)? • How many of MMTs extracted are not in the ontology, therefore, can be new concept candidates? • How many of the relationships between the MMTs are not in the ontology, therefore, can be added to the ontology? Step 1: Domain expert annotation Step 2: Ontology curator judging

  30. Step 1: Domain expert annotation • Input: two sets of data, one for pathologists and one for radiologists • Annotation task: mark medically meaningful terms (MMTs) that can stand alone before the LSP and after the LSP • The terms before and after the LSP have to be related

  31. Example COMPATIBLE WITH BENIGN ECCRINE NEOPLASIA, SUCH AS NODULAR HIDROADENOMA. • Labels: term that precedes the LSP; LSP; term that follows the LSP • Output: list of paired terms • Calculate: total # of MMTs, # of MMTs per LSP

  32. Step 2: Ontologist judging • Input: domain expert annotation output, judged by ontology curators • For each term: Is the concept in the ontology? If not, should it be added into the ontology? If not, what is the reason? • For each pair of terms: What is the relationship between them? Does this relationship exist in the ontology? If not, should it be added into the ontology? If not, what is the reason?

  33. Evaluation metrics • Concept suggestion rate (CSR) = • Concept acceptance rate (CAR) = • Concept relationship suggestion rate (CRSR) = • Concept relationship acceptance rate (CRAR) =

  34. Ontology curators • NCIT curator: Dr. Nicholas Sioutos • RadLex curator: Dr. David Channin

  35. Results - LSP distribution result Number of sentences containing lexico-syntactic patterns

  36. Results - MMT yield using LSP method 1 to 2 MMTs per LSP instance

  37. Results – Ontology Concept Suggestion Rate and Ontology Concept Acceptance Rate

  38. Results – Ontology Concept Relationship Suggestion Rate Ontology Concept Relationship Acceptance Rate

  39. Results – Relationship Distribution

  40. Research Project 2: Coreference Resolution • Annotation schema development and implementation in Knowtator • Detailed guidelines document • Annotated corpus (~ 100K tokens; double-annotations and consensus) • Prototype released as part of ODIE • First manuscript submitted, second one underway • Anticipate public release of corpus and guidelines • Group charge: Develop annotation scheme; create Reference Standard; consider and test existing algorithms; design, implement & test new algorithms • People: Wendy Chapman, Guergana Savova, Melissa Castine

  41. Manuscripts • Submitted: Liu K and Crowley RS. Natural Language Processing Methods and Systems for Biomedical Ontology Learning (Review). Submitted to JBI. • Submitted: Liu K, Chapman WW, Savova GK, Chute C, Sioutos N, Crowley RS. Effectiveness of Lexico-Syntactic Pattern Matching for Ontology Enrichment with Clinical Documents. Submitted to MIM. • Submitted: Savova GK, Chapman WW, Zheng J. Anaphoric relations in the clinical narrative: corpus creation. Submitted to JAMIA. • Planned (next 3 months): Chavan G, Mitchell K, Liu K, Savova GK, Chapman WW, Chute C, Crowley RS. ODIE – A workbench for cyclic entity recognition and ontology enrichment. Planned for AMIA 2010 submission.

  42. Future of the project • Continued releases for remainder of grant • New coreference algorithms • Additional OE algorithms and modifications • Better integration with BioPortal • Planning to apply for competitive renewal (Dec ’10)
