Integrated Annotation for Biomedical IE
Mining the Bibliome: Information Extraction from the Biomedical Literature
NSF ITR grant EIA-0205448
• 5-year grant, now 1.5 years from start
• University of Pennsylvania Institute for Research in Cognitive Science (IRCS)
• subcontract to Children’s Hospital of Philadelphia (CHOP)
• cooperation with GlaxoSmithKline (GSK)
Biolink

Two Areas of Exploration
• Genetic variation in malignancy (CHOP)
  Genomic entity X is varied by process Y in malignancy Z
  Example: Ki-ras mutations were detected in 17.2% of the adenomas.
  Entities: Gene, Variation*, Malignancy* (*relations among sub-components)
• Cytochrome P450 inhibition (GSK)
  Compound X inhibits CYP450 protein Y to degree Z
  Example: Amiodarone weakly inhibited CYP3A4-mediated activities with Ki = 45.1 μM
  Entities: Cyp450, Substance, quant-name, quant-value, quant-units

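The two relation templates above can be pictured as simple records. The following is a minimal sketch, not the project's actual data model: the class and field names are hypothetical, chosen only to mirror the entity types listed on the slide, and the sub-component structure of Variation and Malignancy is flattened to a single string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariationEvent:
    """Hypothetical record for: genomic entity X is varied by process Y in malignancy Z."""
    gene: str
    variation: str   # flattened; the slide notes Variation has sub-components
    malignancy: str  # flattened; the slide notes Malignancy has sub-components


@dataclass(frozen=True)
class InhibitionEvent:
    """Hypothetical record for: compound X inhibits CYP450 protein Y to degree Z."""
    substance: str
    cyp450: str
    quant_name: Optional[str] = None
    quant_value: Optional[float] = None
    quant_units: Optional[str] = None


# The two example sentences from the slide, filled in by hand.
oncology_example = VariationEvent(
    gene="Ki-ras", variation="mutations", malignancy="adenomas"
)
inhibition_example = InhibitionEvent(
    substance="Amiodarone",
    cyp450="CYP3A4",
    quant_name="Ki",
    quant_value=45.1,
    quant_units="μM",
)
```

The quantity fields are optional in this sketch on the assumption that not every sentence states a degree of inhibition.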
Approach
• Build hand-annotated corpora in order to train automated analyzers
• Mutual constraint of form and content:
  • parsing helps overcome diversity and complexity of relational expressions
  • entity types and relations help constrain parsing
• Shallow semantics integrated with syntax
  • entity types, standardized reference, co-reference
  • predicate-argument relations
• Requires significant changes in both syntactic and semantic annotation
• Benefits:
  • automated analysis works better
  • patterns for “fact extraction” are simpler

Project Goals
• Create and publish corpora integrating different kinds of annotation:
  • Part of Speech tags
  • Treebanking (labelled constituent structure)
  • Entities and relations (relevant to oncology and enzyme inhibition projects)
  • Predicate/argument relations, co-reference
  • Integration: textual entity-mentions ≈ syntactic constituents
• Develop IE tools using the corpus
• Integrate IE with existing bioinformatics databases

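The integration goal, entity mentions corresponding to syntactic constituents, can be checked mechanically once both annotations share token offsets. The sketch below is an illustration under stated assumptions: the nested-tuple tree format, the function names, and the toy bracketing of the example sentence are all hypothetical and are not taken from the project's tools or treebank.

```python
from typing import List, Tuple, Union

# A tree is (label, children); each child is a token string or another tree.
Tree = Tuple[str, List[Union[str, "Tree"]]]
Span = Tuple[int, int]  # half-open token offsets: [start, end)


def collect_spans(tree: Tree, start: int, spans: List[Tuple[str, int, int]]) -> int:
    """Append (label, start, end) for every constituent; return the end offset."""
    label, children = tree
    pos = start
    for child in children:
        if isinstance(child, str):
            pos += 1  # a token occupies exactly one position
        else:
            pos = collect_spans(child, pos, spans)
    spans.append((label, start, pos))
    return pos


def mention_is_constituent(mention: Span, spans: List[Tuple[str, int, int]]) -> bool:
    """True if some constituent covers exactly the mention's tokens."""
    return any((s, e) == mention for _, s, e in spans)


# Toy bracketing (hypothetical) of "Ki-ras mutations were detected".
toy_tree: Tree = (
    "S",
    [("NP", ["Ki-ras", "mutations"]), ("VP", ["were", ("VP", ["detected"])])],
)
toy_spans: List[Tuple[str, int, int]] = []
collect_spans(toy_tree, 0, toy_spans)

aligned = mention_is_constituent((0, 2), toy_spans)     # "Ki-ras mutations"
misaligned = mention_is_constituent((1, 3), toy_spans)  # "mutations were"
```

A mention that fails this check marks a place where the entity and treebank annotations disagree about phrase boundaries.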
Project Workflow (recently revised to a flat pipeline)
[Diagram. Boxes shown: Tokenization and POS Annotation; Entity Annotation; Treebanking; Merged Representation]

Integration Issues (1)
• Modifications to Penn Treebank guidelines (for tokenization, POS tagging, treebanking)
  • to deal with biomedical text
  • to allow for syntactic/semantic integration
  • to be correct!
• Example: Prenominal Modifiers
  old way:
    the/DT breast/NN cancer-associated/JJ autoimmune/JJ antigen/NN
    (NP ...) spanning the whole phrase
  new way:
    the/DT breast/NN cancer/NN -/- associated/VBN autoimmune/JJ antigen/NN
    (NML ...) (ADJP ...) (NML ...)* inside (NP ...) spanning the whole phrase
    * implicit

Integration Issues (2)
• Coordinated entities
  • “point mutations at codons 12, 13 or 61 of the human K-, H- and N-ras genes”
  • Wordfreak allows for discontinuous entities
  • Treebank guidelines modified, e.g.:
    (NP (NOM-1 codons) 12) , (NP (NOM-1 *P* ) 13) or (NP (NOM-1 *P* ) 61)
  • Modification works recursively

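One way to picture a discontinuous entity is as a set of token indices rather than a single start/end span. The following is a minimal sketch under that assumption; the representation and the helper names are hypothetical and do not describe Wordfreak's internal format.

```python
from typing import List, Sequence

# Tokens from the start of the slide's example phrase.
tokens: List[str] = ["point", "mutations", "at", "codons", "12", ",", "13", "or", "61"]

# Each mention is a sorted tuple of token indices (hypothetical representation).
# All three share the head token "codons" at index 3.
codon_mentions = [(3, 4), (3, 6), (3, 8)]


def mention_text(toks: Sequence[str], idxs: Sequence[int]) -> str:
    """Join the tokens a mention covers, skipping everything in between."""
    return " ".join(toks[i] for i in idxs)


def is_discontinuous(idxs: Sequence[int]) -> bool:
    """True if there is a gap between consecutive covered tokens."""
    return any(b - a > 1 for a, b in zip(idxs, idxs[1:]))


mention_strings = [mention_text(tokens, m) for m in codon_mentions]
```

The shared head index plays the same role as the co-indexed NOM-1 and *P* placeholders in the modified treebank bracketing: each coordinated item points back to the one overt occurrence of "codons".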
Entity Annotation

Treebanking

Tagger Development (1)
• POS tagger retrained 2/10
• Tokenizer also retrained; new tokenizer used in both cases

Tagger Development (2)
• Precision and recall from 10-fold cross-validation, exact string match
• Taggers are being integrated into the annotation process.

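For reference, exact-match precision and recall can be computed as below. This is a generic sketch of the metric, not the project's evaluation script: the mention format (start offset, end offset, type) and the sample mentions are made up for illustration.

```python
from typing import Set, Tuple

Mention = Tuple[int, int, str]  # (start offset, end offset, entity type); hypothetical format


def precision_recall(gold: Set[Mention], predicted: Set[Mention]) -> Tuple[float, float]:
    """Exact-match scoring: a prediction counts only if span and type both match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up mentions for illustration only.
gold_mentions = {(0, 6, "Gene"), (7, 16, "Variation")}
predicted_mentions = {(0, 6, "Gene"), (7, 12, "Variation"), (20, 28, "Malignancy")}

precision, recall = precision_recall(gold_mentions, predicted_mentions)
```

Under exact matching, the prediction (7, 12, "Variation") earns no credit even though it overlaps a gold mention, which makes this a strict measure.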
References
• Project homepage: http://ldc.upenn.edu/myl/ITR
• Annotation info: http://www.cis.upenn.edu/~mamandel/annotators/
• Wordfreak: http://www.sf.net/projects/wordfreak
• Taggers: http://www.cis.upenn.edu/datamining/software_dist/biosfier/
• Integration analysis (entities and treebanking): http://www.cis.upenn.edu/~skulick/biomerge.html
• LAW: http://www.sf.net/projects/law