Integrated Annotation for Biomedical IE

Mining the Bibliome: Information Extraction from the Biomedical Literature. NSF ITR grant EIA-0205448 (5-year grant, now 1.5 years from start). University of Pennsylvania Institute for Research in Cognitive Science (IRCS).

Presentation Transcript


  1. Integrated Annotation for Biomedical IE
     Mining the Bibliome: Information Extraction from the Biomedical Literature
     NSF ITR grant EIA-0205448
     • 5-year grant, now 1.5 years from start
     • University of Pennsylvania Institute for Research in Cognitive Science (IRCS)
     • subcontract to Children’s Hospital of Philadelphia (CHOP)
     • cooperation with GlaxoSmithKline (GSK)
     Biolink

  2. Two Areas of Exploration
     • Genetic variation in malignancy (CHOP)
       Genomic entity X is varied by process Y in malignancy Z
       “Ki-ras mutations were detected in 17.2% of the adenomas.”
       Entities: Gene, Variation*, Malignancy* (*relations among sub-components)
     • Cytochrome P450 inhibition (GSK)
       Compound X inhibits CYP450 protein Y to degree Z
       “Amiodarone weakly inhibited CYP3A4-mediated activities with Ki = 45.1 μM”
       Entities: Cyp450, Substance, quant-name, quant-value, quant-units
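The two relation templates on this slide can be sketched as simple record types. This is only an illustration: the class names `VariationEvent` and `InhibitionEvent` are hypothetical, the field names are taken from the entity lists above, and the frequency "17.2%" is left out because the slide does not list it as an entity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariationEvent:
    """Genomic entity X is varied by process Y in malignancy Z (hypothetical record type)."""
    gene: str
    variation: str
    malignancy: str


@dataclass(frozen=True)
class InhibitionEvent:
    """Compound X inhibits CYP450 protein Y to degree Z (hypothetical record type)."""
    substance: str
    cyp450: str
    quant_name: str
    quant_value: float
    quant_units: str


# The two example sentences from the slide, filled in by hand.
ki_ras = VariationEvent(gene="Ki-ras", variation="mutations", malignancy="adenomas")
amiodarone = InhibitionEvent(
    substance="Amiodarone",
    cyp450="CYP3A4",
    quant_name="Ki",
    quant_value=45.1,
    quant_units="μM",
)

print(ki_ras)
print(amiodarone)
```

Keeping the quantity as three separate fields mirrors the slide's quant-name, quant-value, and quant-units entities, so each can be annotated as its own text span.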

  3. Approach
     • Build hand-annotated corpora in order to train automated analyzers
     • Mutual constraint of form and content:
       • parsing helps overcome diversity and complexity of relational expressions
       • entity types and relations help constrain parsing
     • Shallow semantics integrated with syntax
       • entity types, standardized reference, co-reference
       • predicate-argument relations
     • Requires significant changes in both syntactic and semantic annotation
     • Benefits:
       • automated analysis works better
       • patterns for “fact extraction” are simpler

  4. Project Goals
     • Create and publish corpora integrating different kinds of annotation:
       • Part of Speech tags
       • Treebanking (labelled constituent structure)
       • Entities and relations (relevant to oncology and enzyme inhibition projects)
       • Predicate/argument relations, co-reference
     • Integration: textual entity-mentions ≈ syntactic constituents
     • Develop IE tools using the corpus
     • Integrate IE with existing bioinformatics databases
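The integration goal (entity mentions ≈ syntactic constituents) can be illustrated by checking whether an entity's token span coincides with some constituent's span. This is a minimal sketch, not the project's merging tool: the nested-tuple tree encoding, the function names, and the toy tree are all assumptions made for illustration.

```python
def constituent_spans(tree, start=0):
    """Return (end, spans) for a tree encoded as (label, children).

    A leaf is (POS, word); an internal node is (label, [child, ...]).
    spans is a set of (label, start, end) token spans, end exclusive.
    """
    label, children = tree
    if isinstance(children, str):  # leaf: consumes exactly one token
        return start + 1, set()
    spans = set()
    pos = start
    for child in children:
        pos, child_spans = constituent_spans(child, pos)
        spans |= child_spans
    spans.add((label, start, pos))
    return pos, spans


def aligns(entity_span, spans):
    """True if the entity's (start, end) matches some constituent exactly."""
    return any((s, e) == entity_span for _, s, e in spans)


# Toy tree (an assumption, not taken from the corpus): "the breast cancer antigen"
toy_tree = ("NP", [
    ("DT", "the"),
    ("NML", [("NN", "breast"), ("NN", "cancer")]),
    ("NN", "antigen"),
])

_, spans = constituent_spans(toy_tree)
print(aligns((1, 3), spans))  # "breast cancer" vs. the NML node
print(aligns((2, 4), spans))  # "cancer antigen" crosses a bracket
```

A mention that fails this check would point to either an entity boundary or a bracketing that needs review.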

  5. Project Workflow (recently revised to a flat pipeline)
     [Workflow diagram; box labels: Entity Annotation, Treebanking, Tokenization and POS Annotation, Merged Representation]

  6. Integration Issues (1)
     • Modifications to Penn Treebank guidelines (for tokenization, POS tagging, treebanking)
       • to deal with biomedical text
       • to allow for syntactic/semantic integration
       • to be correct!
     • Example: Prenominal Modifiers
       old way: the/DT breast/NN cancer-associated/JJ autoimmune/JJ antigen/NN
                (NP ...)
       new way: the/DT breast/NN cancer/NN -/- associated/VBN autoimmune/JJ antigen/NN
                (NML ...) (ADJP ...) (NML ...)* (NP ...)
                *implicit
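The tokenization change in the prenominal-modifier example (splitting "cancer-associated" into "cancer", "-", "associated") can be sketched as below. This is a deliberately naive illustration that splits at every hyphen; the function name is hypothetical, and the actual revised guidelines presumably treat some hyphenated forms (gene names, for instance) differently.

```python
import re


def split_hyphens(text):
    """Whitespace-tokenize, then split each word at hyphens, keeping the hyphen as a token."""
    tokens = []
    for word in text.split():
        # The capturing group keeps the "-" delimiter in the result.
        tokens.extend(part for part in re.split(r"(-)", word) if part)
    return tokens


print(split_hyphens("the breast cancer-associated autoimmune antigen"))
# → ['the', 'breast', 'cancer', '-', 'associated', 'autoimmune', 'antigen']
```

With the hyphenated word split, "breast cancer" becomes a contiguous token sequence that an entity annotation and a treebank constituent can both cover.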

  7. Integration Issues (2)
     • Coordinated entities
       • “point mutations at codons 12, 13 or 61 of the human K-, H- and N-ras genes”
     • Wordfreak allows for discontinuous entities
     • Treebank guidelines modified, e.g.:
       (NP (NOM-1 codons) 12) , (NP (NOM-1 *P* ) 13) or (NP (NOM-1 *P* ) 61)
     • Modification works recursively
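One way to picture a discontinuous entity is as a tuple of token indices rather than a single start/end pair. The sketch below expands the coordinated phrase "codons 12, 13 or 61" into three mentions that share the head "codons", in the spirit of the *P* placeholders above. The representation and function names are assumptions for illustration, not Wordfreak's data model.

```python
tokens = ["codons", "12", ",", "13", "or", "61"]


def expand_shared_head(head_index, conjunct_indices):
    """Pair one shared head token with each conjunct, giving one mention per conjunct."""
    return [(head_index, i) for i in conjunct_indices]


def surface(tokens, mention):
    """Join the tokens of a (possibly discontinuous) mention into a readable string."""
    return " ".join(tokens[i] for i in mention)


mentions = expand_shared_head(head_index=0, conjunct_indices=[1, 3, 5])
for mention in mentions:
    print(mention, surface(tokens, mention))
```

Only the first mention, (0, 1), is contiguous; the other two skip over intervening tokens, which is why a plain start/end span is not enough here.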

  8. Entity Annotation

  9. Treebanking

  10. Tagger Development (1)
      POS tagger retrained 2/10
      (Tokenizer also retrained; new tokenizer used in both cases)

  11. Tagger Development (2)
      (Precision & recall from 10-fold cross-validation, exact string match)
      Taggers are being integrated into the annotation process.
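For reference, exact-match precision and recall for an entity tagger can be computed as in the sketch below. The function name and the toy gold/predicted spans are made up for illustration and are not the project's data or results; in a 10-fold setup, these scores would be computed on each held-out fold.

```python
def precision_recall(gold, predicted):
    """Exact-match scoring: a prediction counts only if span and type both match a gold entity."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical): entities as (start_char, end_char, type).
gold = {(0, 6, "gene"), (7, 16, "variation"), (40, 48, "malignancy")}
predicted = {(0, 6, "gene"), (7, 15, "variation")}  # second span is off by one character

p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```

Under exact matching, the off-by-one span earns no partial credit: it counts against both precision and recall.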

  12. References
      • Project homepage: http://ldc.upenn.edu/myl/ITR
      • Annotation info: http://www.cis.upenn.edu/~mamandel/annotators/
      • Wordfreak: http://www.sf.net/projects/wordfreak
      • Taggers: http://www.cis.upenn.edu/datamining/software_dist/biosfier/
      • Integration analysis (entities and treebanking): http://www.cis.upenn.edu/~skulick/biomerge.html
      • LAW: http://www.sf.net/projects/law
