Using ontologies for text processing
Overview
• Thesis: Ontologies (or even more elaborated knowledge-bases) are required to solve the lexical ambiguity problem
• Describe the lexical ambiguity problem and its central importance in natural language processing
• Demonstrate how GO, combined with Direct Memory Access Parsing, provides a simple solution to some instances of this problem
• Argue no alternative is likely to work as well
Lexical Ambiguity
• A word (character string) means different things in different contexts
• How can a program disambiguate (tell which is meant)?
• Widespread problem even in “simple” bioNLP
  • DNA vs. mRNA vs. protein [Hatzivassiloglou et al. 2001]
  • Gene symbol vs. non-gene acronym [Pustejovsky et al. 2001], [Chang et al. 2002], [Liu and Friedman 2003], [Schwartz and Hearst 2003]
  • Gene/product vs. any other noun [Tanabe and Wilbur, 2002]
A particular example
• “Hunk” can be a
  • Cell type: human natural killer
  • Gene: hormonally upregulated Neu-associated kinase
  • Medical abbreviation: radiographic/orthopedic joint classification system
  • Non-technical English: a large lump, piece, or portion
All occur in Medline documents…. (e.g. “hunk of metal” in article on ambulance design)
How do ontologies help?
• The idea that knowledge is relevant to understanding words in context is controversial only among linguists, but…
• The Direct Memory Access Parsing (DMAP) technique [Martin, 1991] [Fitzgerald, 2000] demonstrates the power of knowledge-based methods for disambiguation
• GO & similar efforts make DMAP (or other knowledge-based methods) practical today
What is DMAP?
• Conceptual parser
• Maps from text to conceptual representations organized in packaging and abstraction hierarchies (like GO)
• In contrast to: pure syntactic parsers, pattern matching and machine learning systems
• Conceptual representations include lexical patterns that specify how to recognize the concept in text
• Patterns consist of text literals and/or references to other concepts
• Organized around concepts, not words; no independent lexicon.
• Recognition creates expectations for related concepts
A real example
“…Hunk expression is restricted to subsets of cells…” [Gardner et al. 2000]
ID: cell-type-HUNK
  IS-A: cell-type
  lex: human natural killer
       HUNK
ID: gene-expression
  slots: expressed-item: gene
         mechanism: expression
  lex: (gene) (expression)
ID: GO-0006350
  lex: transcription
       expression
ID: gene-26559
  IS-A: gene
  lex: hormonally upregulated Neu-associated kinase
RESULTS HUNK hormonally upregulated neu tumor-associated kinase
DMAP output with and without context
Hunk alone: ambiguous
(parse '(Hunk))
e-gene-26559 begin: 1 end: 1
e-cell-type-HUNK begin: 1 end: 1
Hunk expression: not ambiguous
(parse '(Hunk expression))
c-gene-expression-1 begin: 1 end: 2
  expressed-item: e-gene-26559 begin: 1 end: 1
  mechanism: GO:0006350 begin: 2 end: 2
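The contrast between the two parses can be imitated with a very small recognizer. What follows is only a sketch under stated assumptions: it is a bottom-up matcher run to a fixed point, not DMAP's expectation-driven mechanism, and the `KB` table, `Ref` class, and function names are hypothetical. The concept IDs and lexical entries are borrowed from the slides. Spans use 0-based, end-exclusive offsets, unlike the 1-based `begin`/`end` values in the DMAP output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Reference to another concept inside a lexical pattern."""
    concept: str


# Hypothetical miniature knowledge base: id -> (parent id, lexical patterns).
# A pattern is a list of lowercase text literals and/or Ref objects.
KB = {
    "gene": (None, []),
    "cell-type": (None, []),
    "go-process": (None, []),
    "gene-26559": ("gene", [["hunk"]]),
    "cell-type-HUNK": ("cell-type", [["hunk"]]),
    "GO:0006350": ("go-process", [["expression"], ["transcription"]]),
    "gene-expression": (None, [[Ref("gene"), Ref("GO:0006350")]]),
}


def is_a(cid, ancestor):
    """Walk up the abstraction hierarchy from cid looking for ancestor."""
    while cid is not None:
        if cid == ancestor:
            return True
        cid = KB[cid][0]
    return False


def match(pattern, tokens, start, found):
    """Return every end offset at which pattern can finish when begun at start."""
    ends = {start}
    for item in pattern:
        nxt = set()
        for pos in ends:
            if isinstance(item, Ref):
                # A reference is satisfied by any already recognized concept
                # that begins here and is a kind of the referenced concept.
                nxt |= {e for (c, b, e) in found
                        if b == pos and is_a(c, item.concept)}
            elif pos < len(tokens) and tokens[pos] == item:
                nxt.add(pos + 1)
        ends = nxt
    return ends


def recognize(text):
    """Collect (concept, begin, end) spans until no new span appears."""
    tokens = text.lower().split()
    found = set()
    changed = True
    while changed:
        changed = False
        for cid, (_, patterns) in KB.items():
            for pattern in patterns:
                for start in range(len(tokens)):
                    for end in match(pattern, tokens, start, found):
                        if (cid, start, end) not in found:
                            found.add((cid, start, end))
                            changed = True
    return found


def readings(text):
    """Concepts that account for the whole input."""
    n = len(text.split())
    return sorted(c for (c, b, e) in recognize(text) if b == 0 and e == n)


print(readings("Hunk"))
print(readings("Hunk expression"))
```

With only the word “Hunk”, both the gene and the cell type cover the input. With “Hunk expression”, only the gene-expression concept does, because its pattern asks for something that is a gene, and the cell-type reading cannot fill that slot.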
DMAP can handle much more complex constructions
“Hunk is expressed in mouse epithelial cells during cell proliferation.”
c-localized-gene-expression
  expressed-item: e-gene-26559
  mechanism: GO:0006350
  where: c-epithelial-cell
    taxon: ncbi_10090
  when: GO:0008283
But uses our enriched knowledge-base, not just GO
Even just DMAP/GO is a big win
• Recall 7,042 ambiguous symbols for 9,723 genes
• Straightforward to disambiguate symbols that map to 2 or more genes when:
  • Each ambiguous gene referent has GO annotations, and
  • There is no overlap between the annotations for the genes
• 3,333 of the symbols (for 4,715 of the genes) have this feature – nearly half the problem is solved!
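The two conditions above amount to a set-disjointness test over GO annotations. The sketch below is illustrative only: the gene names and their annotations are invented, and it assumes the GO terms mentioned near the symbol have already been recognized in the text.

```python
from itertools import combinations

# Hypothetical annotations; the gene names are placeholders, not real genes.
ANNOTATIONS = {
    "gene-A": {"GO:0006350"},
    "gene-B": {"GO:0016477"},
    "gene-C": {"GO:0016477", "GO:0008283"},
}


def separable(genes, annotations):
    """True when every candidate gene is annotated and no two share a GO term."""
    sets = [annotations.get(g, set()) for g in genes]
    if len(sets) < 2 or not all(sets):
        return False
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))


def disambiguate(genes, context_terms, annotations):
    """Return the one gene whose annotations overlap the context, else None."""
    context = set(context_terms)
    hits = [g for g in genes if annotations.get(g, set()) & context]
    return hits[0] if len(hits) == 1 else None


print(separable(["gene-A", "gene-B"], ANNOTATIONS))
print(disambiguate(["gene-A", "gene-B"], {"GO:0006350"}, ANNOTATIONS))
```

For scale, 3,333 of 7,042 symbols is about 47%, and 4,715 of 9,723 genes is about 48%.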
Compare the alternatives
• Statistical or machine learning approaches
  • Must avoid being fooled by word “cells” in example
  • Scalability: need statistics for many covariates of every ambiguous word; doesn’t exploit the abstraction hierarchy
• Full syntactic parse doesn’t disambiguate at all!
• Cascaded FST’s, pattern-matching, etc.
  • Where is source of knowledge for these?
• Much DMAP lexical information can be taken directly from GO (and LocusLink, etc.)
Acknowledgments
• Philip V. Ogren
• Daniel J. McGoldrick
• Christoffer S. Crosby
• Jens Eberlein
• George K. Acquaah-Mensah
• I/NET’s (http://inetmi.com) CM / CMP software
• Support from Wyeth Genetics Institute, NIAAA
http://compbio.uchsc.edu
Attachment ambiguity
• These findings suggest that FAK functions in the regulation of cell migration and cell proliferation. (Gilmore and Romer 1996:1209)
• What does FAK do?
• ALMOST RIGHT:
  • FAK functions in the regulation of cell migration
  • FAK functions in cell proliferation
• RIGHT:
  • FAK functions in the regulation of cell migration
  • FAK functions in the regulation of cell proliferation
Attachment ambiguity
GO-0016477 isA go-process
  lex: cell migration
GO-0008283 isA go-process
  lex: cell proliferation
GO-0042127 isA go-process
  lex: regulation of cell proliferation
       regulation of ((go-process) and)* cell proliferation
GO-0030334
  lex: regulation of cell migration
       regulation of ((go-process) and)* cell migration
Attachment ambiguity
(parse '(These findings suggest that FAK functions in the regulation of cell migration and cell proliferation))
GO:0030334 begin: 9 end: 12
GO:0042127 begin: 9 end: 15
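The pattern “regulation of ((go-process) and)* cell proliferation” can be approximated with a regular expression to show why both readings are recovered. This is a stand-in for illustration, not how DMAP matches patterns. The two tables and the function name are hypothetical, and word positions are counted from 1 to line up with the `begin`/`end` values above.

```python
import re

# Lexical entries for the two go-process concepts named on the slides.
GO_PROCESS = {
    "GO:0016477": "cell migration",
    "GO:0008283": "cell proliferation",
}

# "regulation of X" concepts, keyed by GO id, with the process they regulate.
REGULATION = {
    "GO:0030334": "cell migration",
    "GO:0042127": "cell proliferation",
}


def find_regulation(sentence):
    """Return (go_id, begin, end) tuples using 1-based, inclusive word positions."""
    any_process = "|".join(re.escape(p) for p in GO_PROCESS.values())
    results = []
    for go_id, target in REGULATION.items():
        # Zero or more "<go-process> and " conjuncts may precede the target.
        pattern = (rf"\bregulation of (?:(?:{any_process}) and )*"
                   rf"{re.escape(target)}\b")
        for m in re.finditer(pattern, sentence):
            begin = len(sentence[:m.start()].split()) + 1
            end = begin + len(m.group().split()) - 1
            results.append((go_id, begin, end))
    return sorted(results, key=lambda r: r[2])


sentence = ("These findings suggest that FAK functions in the regulation "
            "of cell migration and cell proliferation.")
for hit in find_regulation(sentence):
    print(hit)
```

Because the optional conjunct is part of the concept's own pattern, “regulation of” distributes over both conjoined processes, which is the reading labeled RIGHT.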
What do we have so far?
• Gene Ontology
• UMLS
• MeSH
• …
What more do we need?
• Family
• Location
  • Macroanatomical
  • Subcellular localization
• Structure
• Function
• Disease associations
• Protein/protein interactions
• …
Where can we get it?
• GO definitions
• UMLS definitions
• MeSH notes
• Biomedical literature
If you don’t like DMAP….
• full syntactic parse first
• cascaded FST’s
• “a little syntax, a little semantics”
• machine learning
• pattern-matching
All can benefit from ontology/KB