Integrated Annotation for Biomedical IE

Mining the Bibliome: Information Extraction from the Biomedical Literature
NSF ITR grant EIA-0205448
• 5-year grant, now 1.5 years from start
• University of Pennsylvania Institute for Research in Cognitive Science (IRCS)
• subcontract to Children’s Hospital of Philadelphia (CHOP)
• cooperation with GlaxoSmithKline (GSK)
Biolink

Two Areas of Exploration
• Genetic variation in malignancy (CHOP)
  Genomic entity X is varied by process Y in malignancy Z
  "Ki-ras mutations were detected in 17.2% of the adenomas."
  Entities: Gene, Variation*, Malignancy* (*relations among sub-components)
• Cytochrome P450 inhibition (GSK)
  Compound X inhibits CYP450 protein Y to degree Z
  "Amiodarone weakly inhibited CYP3A4-mediated activities with Ki = 45.1 μM"
  Entities: Cyp450, Substance, quant-name, quant-value, quant-units
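The two relation templates above can be pictured as simple records. This is a minimal sketch, not the project's annotation schema: the class and field names are hypothetical, and only the entity types and the two example sentences come from the slide. Variation and Malignancy are flattened to plain strings here, although the slide notes that they have internal sub-components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneVariationEvent:
    """Genomic entity X is varied by process Y in malignancy Z (hypothetical layout)."""
    gene: str
    variation: str   # flattened; the slide says Variation has sub-components
    malignancy: str  # flattened; the slide says Malignancy has sub-components


@dataclass(frozen=True)
class Cyp450Inhibition:
    """Compound X inhibits CYP450 protein Y to degree Z (hypothetical layout)."""
    substance: str      # entity type: Substance
    cyp450: str         # entity type: Cyp450
    quant_name: str     # entity type: quant-name
    quant_value: float  # entity type: quant-value
    quant_units: str    # entity type: quant-units


# The two example sentences from the slide, filled in by hand.
variation = GeneVariationEvent(gene="Ki-ras", variation="mutations", malignancy="adenomas")
inhibition = Cyp450Inhibition(
    substance="Amiodarone",
    cyp450="CYP3A4",
    quant_name="Ki",
    quant_value=45.1,
    quant_units="μM",
)
```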

Approach
• Build hand-annotated corpora to train automated analyzers
• Mutual constraint of form and content:
  • parsing helps overcome diversity and complexity of relational expressions
  • entity types and relations help constrain parsing
• Shallow semantics integrated with syntax
  • entity types, standardized reference, co-reference
  • predicate-argument relations
• Requires significant changes in both syntactic and semantic annotation
• Benefits:
  • automated analysis works better
  • patterns for “fact extraction” are simpler

Project Goals
• Create and publish corpora integrating different kinds of annotation:
  • Part of Speech tags
  • Treebanking (labelled constituent structure)
  • Entities and relations (relevant to oncology and enzyme inhibition projects)
  • Predicate/argument relations, co-reference
• Integration: textual entity-mentions ≈ syntactic constituents
• Develop IE tools using the corpus
• Integrate IE with existing bioinformatics databases
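The integration goal (entity mentions ≈ syntactic constituents) suggests a simple consistency check: flag any entity mention whose token span does not coincide with a constituent span. The sketch below is only an illustration under stated assumptions. The function name is hypothetical, spans are end-exclusive token offsets, and the toy spans are invented rather than taken from the corpus.

```python
def unaligned_entities(entity_spans, constituent_spans):
    """Return the entity spans that match no constituent span exactly."""
    constituents = set(constituent_spans)
    return [span for span in entity_spans if span not in constituents]


# Invented toy data: (start, end) token offsets, end-exclusive.
constituent_spans = [(0, 7), (1, 3), (1, 5)]
entity_spans = [(1, 3), (2, 5)]

mismatches = unaligned_entities(entity_spans, constituent_spans)
# (1, 3) coincides with a constituent; (2, 5) does not and is flagged.
```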

Project Workflow (recently revised to a flat pipeline)
[Workflow diagram; box labels: Entity Annotation, Treebanking, Tokenization and POS Annotation, Merged Representation]

Integration Issues (1)
• Modifications to Penn Treebank guidelines (for tokenization, POS tagging, treebanking)
  • to deal with biomedical text
  • to allow for syntactic/semantic integration
  • to be correct!
• Example: Prenominal Modifiers
  old way: the breast cancer-associated autoimmune antigen
           DT  NN     JJ                JJ         NN
           (NP ...)
  new way: the breast cancer - associated autoimmune antigen
           DT  NN     NN     - VBN        JJ         NN
           (NML ...) (ADJP ...) (NML ...)* (NP ...)
  *implicit
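The tokenization change in the example can be illustrated with a short sketch. This is not the project's tokenizer (the slides say the real one was trained); it is a regular-expression approximation that assumes every punctuation character, including a hyphen, becomes its own token. The function names are hypothetical, and the POS tags are copied from the slide.

```python
import re


def tokenize_old(text):
    """Whitespace tokenization: 'cancer-associated' stays one token."""
    return text.split()


def tokenize_new(text):
    """Split runs of word characters from punctuation, so '-' is its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


phrase = "the breast cancer-associated autoimmune antigen"

old_tokens = tokenize_old(phrase)
new_tokens = tokenize_new(phrase)

# Tag sequences as given on the slide.
old_tags = "DT NN JJ JJ NN".split()
new_tags = "DT NN NN - VBN JJ NN".split()

old_tagged = list(zip(old_tokens, old_tags))
new_tagged = list(zip(new_tokens, new_tags))
```

With the hyphen split off, "cancer" can be tagged NN and grouped with "breast", which is what allows an entity mention such as "breast cancer" to line up with a constituent.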

Integration Issues (2)
• Coordinated entities
  • “point mutations at codons 12, 13 or 61 of the human K-, H- and N-ras genes”
  • Wordfreak allows for discontinuous entities
  • Treebank guidelines modified, e.g.:
    (NP (NOM-1 codons) 12) , (NP (NOM-1 *P*) 13) or (NP (NOM-1 *P*) 61)
  • Modification works recursively
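The shared-head idea behind the `NOM-1` / `*P*` notation can be sketched as expanding one head token across several coordinated items, yielding discontinuous mentions. This is an illustrative sketch only: the function name and the index-based representation are assumptions, not the Wordfreak or Treebank data model.

```python
def expand_shared_head(tokens, head_index, item_indices):
    """Pair a shared head token with each coordinated item.

    Returns (text, token_indices) pairs; the indices may be discontinuous.
    """
    mentions = []
    for item_index in item_indices:
        span = tuple(sorted((head_index, item_index)))
        text = " ".join(tokens[i] for i in span)
        mentions.append((text, span))
    return mentions


tokens = ["codons", "12", ",", "13", "or", "61"]
mentions = expand_shared_head(tokens, head_index=0, item_indices=[1, 3, 5])
# "codons 13" and "codons 61" skip over intervening tokens,
# so their token spans are discontinuous.
```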

Entity Annotation

Treebanking

Tagger Development (1)
• POS tagger retrained 2/10 (results table not preserved)
• Tokenizer also retrained; the new tokenizer was used in both cases

Tagger Development (2)
• Precision and recall from 10-fold cross-validation, exact string match
• Taggers are being integrated into the annotation process.
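Exact-match scoring can be sketched as follows: a predicted entity counts as correct only if an identical gold entity exists. This is a minimal illustration, not the project's evaluation script. The function name, the `(start, end, type)` representation, and the toy data are assumptions. In 10-fold cross-validation the same computation would be run on each held-out fold.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over two collections of entities."""
    gold_set, predicted_set = set(gold), set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Invented toy data: (start, end, entity_type) triples.
gold = [(0, 2, "gene"), (5, 6, "malignancy"), (9, 10, "variation")]
predicted = [(0, 2, "gene"), (5, 7, "malignancy")]

precision, recall = precision_recall(gold, predicted)
# The second prediction has the wrong end offset, so under exact matching
# it counts as an error even though it overlaps a gold entity.
```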

References
• Project homepage: http://ldc.upenn.edu/myl/ITR
• Annotation info: http://www.cis.upenn.edu/~mamandel/annotators/
• Wordfreak: http://www.sf.net/projects/wordfreak
• Taggers: http://www.cis.upenn.edu/datamining/software_dist/biosfier/
• Integration analysis (entities and treebanking): http://www.cis.upenn.edu/~skulick/biomerge.html
• LAW: http://www.sf.net/projects/law
