Biomedical Information Extraction
Outline • Intro to biomedical information extraction • PASTA [Demetriou and Gaizauskas] • Biomedical named entities • Name variability [Cohen, Dolbey, Acquaah-Mensah, and Hunter] • Name tagging [Tanabe and Wilbur]
PASTA • [Demetriou and Gaizauskas] • Protein Active Site Template Acquisition
Extraction Tasks • Terminological Tagging • “entities” • Template Filling • “relationships”
Terminology Tagging • protein • species • residue • site • region • secondary structure • supersecondary structure • quaternary structure • base • atom • non-protein compound • interaction
Template Filling
protein := NAME: string
species := NAME: string
in_species := PROTEIN: protein, SPECIES: species
residue := NAME: string, SITE/FUN: string, SEC_STRUCT: string, QUAT_STRUCT: string, REGION: string, INTERACTION: string
in_protein := RESIDUE: residue, PROTEIN: protein
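One way to picture these templates is as typed records. The sketch below is only an illustration, not PASTA's actual representation: the class names, the optional slots, and the example values are assumptions made here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Protein:
    name: str


@dataclass(frozen=True)
class Species:
    name: str


@dataclass(frozen=True)
class InSpecies:
    # Relationship template: which species a protein belongs to.
    protein: Protein
    species: Species


@dataclass(frozen=True)
class Residue:
    name: str
    # Assumption: slots other than NAME may be left unfilled.
    site_fun: Optional[str] = None
    sec_struct: Optional[str] = None
    quat_struct: Optional[str] = None
    region: Optional[str] = None
    interaction: Optional[str] = None


@dataclass(frozen=True)
class InProtein:
    # Relationship template: which protein a residue is found in.
    residue: Residue
    protein: Protein


# Hypothetical filled templates (placeholder values, not extracted data).
protein = Protein(name="protein X")
species = Species(name="species Y")
residue = Residue(name="residue Z", site_fun="active site")
filled = [InSpecies(protein, species), InProtein(residue, protein)]
```

Entity templates (protein, species, residue) hold strings, while relationship templates (in_species, in_protein) point at other templates, which is what lets a filled template link a residue to its protein and the protein to its species.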
PASTA Architecture • Text Preprocessing • Title, author, abstract • Tokenization, sentence boundaries
PASTA Architecture • Terminological Processing • Morphological analysis • biochemical morphemes “-ase” • Lexical lookup • token lookup in databases • token grammatical class tagging • Terminology parsing • create multi-token terms, rule-based parsing using grammatical tags
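The first two terminological steps (morphological analysis and lexical lookup) can be sketched roughly as follows. This is a toy illustration under stated assumptions: the `LEXICON` dictionary stands in for the database lookup, and the tag names and the length cutoff are invented here rather than taken from PASTA.

```python
# Hypothetical mini-lexicon standing in for lookup in term databases.
LEXICON = {
    "lysozyme": "protein",
    "human": "species",
}


def tag_token(token):
    """Return a candidate terminology class for a token, or None."""
    lower = token.lower()
    # Lexical lookup: an exact match in the lexicon wins.
    if lower in LEXICON:
        return LEXICON[lower]
    # Morphological analysis: the biochemical suffix "-ase" suggests an enzyme.
    # The length check is an assumption to skip very short words.
    if len(lower) > 4 and lower.endswith("ase"):
        return "protein_candidate"
    return None


def tag_tokens(tokens):
    """Pair each token with its candidate class."""
    return [(token, tag_token(token)) for token in tokens]


tagged = tag_tokens(["Human", "lysozyme", "and", "a", "kinase"])
```

A suffix test alone over-generates: ordinary words such as "database" or "phase" also end in "-ase", which is one reason the later terminology-parsing and filtering stages are needed.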
PASTA Architecture • Syntactic and Semantic Processing • Part-of-speech tags • Phrase structure • Compositional semantics • Discourse Processing • Semantic representations incorporated into discourse model of concept hierarchy and inference rules
PASTA Architecture • Template Extraction • Scan discourse model for template instances, check slots, build template
PASTAWeb • Index • document -> terminology, template • terms -> templates from multiple documents • IE tools need to be incorporated into effective interfaces for biology researchers
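The term-to-document direction of such an index is essentially an inverted index. The following is a minimal sketch, not PASTAWeb's implementation; the function name and the document identifiers are hypothetical.

```python
from collections import defaultdict


def build_index(doc_terms):
    """Map each term (lowercased) to the set of documents that mention it."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term.lower()].add(doc_id)
    return index


# Hypothetical extracted terminology for two documents.
index = build_index({
    "doc1": ["Lysozyme", "active site"],
    "doc2": ["lysozyme"],
})
```

Lowercasing is the only normalization applied here, so any other spelling variant of the same name would still land under a separate key.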
Indexing Problem • Variations in expression of same protein name
Contrast and Variability • [Cohen, Dolbey, Acquaah-Mensah, and Hunter] • Named Entities • location vs. identification • Variability • somatotropin • rat somatotropin • growth hormone
Variability • Non-contrast (synonyms) • tumor protein homolog vs tumour protein homologue • Contrast (diffonyms?) • ACE1 vs ACE2
Transformations • Remove first character • Remove first word • Remove last character • Remove last word • Replace sequence of vowels with one letter • Replace hyphen with space • Remove parenthesized material • Convert to lowercase
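Each transformation is a small string function; two names can then be compared after applying the same transformation to both. The sketch below is an assumption-laden illustration: the function names are invented here, and the single letter used for vowel sequences ("V") is an arbitrary choice.

```python
import re


def remove_first_char(name):
    return name[1:]


def remove_first_word(name):
    parts = name.split(None, 1)
    return parts[1] if len(parts) > 1 else ""


def remove_last_char(name):
    return name[:-1]


def remove_last_word(name):
    parts = name.rsplit(None, 1)
    return parts[0] if len(parts) > 1 else ""


def collapse_vowels(name):
    # Replace every run of vowels with one placeholder letter.
    return re.sub(r"[aeiou]+", "V", name, flags=re.IGNORECASE)


def hyphen_to_space(name):
    return name.replace("-", " ")


def remove_parenthesized(name):
    return re.sub(r"\s*\([^)]*\)", "", name).strip()


def to_lowercase(name):
    return name.lower()


def same_after(transform, a, b):
    """True if two distinct names become identical under a transformation."""
    return a != b and transform(a) == transform(b)
```

For example, `same_after(remove_last_char, "ACE1", "ACE2")` and `same_after(collapse_vowels, "tumor", "tumour")` both hold, although the first pair names different genes and the second pair are spelling variants.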
Experiment • Collect groups of synonym gene names • Get mouse, rat, and human genes from LocusLink • Group OFFICIAL GENE NAME, PREFERRED GENE NAME, OFFICIAL SYMBOL, PREFERRED SYMBOL, PRODUCT, PREFERRED PRODUCT, ALIAS SYMBOL, ALIAS PROT entries together as synonyms
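Results • LMW (remove first word), RMC (remove last character), RMW (remove last word) identify contrastive variability • Contrasts likely marked at name boundaries • VS (vowel sequences), HYPH (hyphens), CASE (lowercasing), PM (parenthesized material) identify non-contrastive variability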
Pattern Heuristics • Equivalence of vowel sequences • Optionality of hyphens • Optionality of parenthesized material • Case insensitivity
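Taken together, the four heuristics suggest a normalization key under which non-contrastive variants collide while contrastive names stay apart. This is a sketch of that idea only, not the authors' procedure; the `normalize` function and the choice of "a" as the stand-in vowel are assumptions.

```python
import re


def normalize(name):
    """Build a matching key that ignores non-contrastive variation."""
    name = re.sub(r"\([^)]*\)", " ", name)   # parenthesized material is optional
    name = name.replace("-", " ")            # hyphens are optional
    name = name.lower()                      # case is ignored
    name = re.sub(r"[aeiou]+", "a", name)    # vowel sequences are equivalent
    return " ".join(name.split())            # tidy leftover whitespace


key_a = normalize("Tumour protein")
key_b = normalize("tumor protein")
```

Here `key_a == key_b`, whereas `normalize("ACE1")` and `normalize("ACE2")` remain different because the contrast sits in the final character, which none of these heuristics touches.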
Tagging Genes and Proteins • [Tanabe and Wilbur] • ABGene • Trained on MEDLINE abstracts • Tested on PUBMED full texts
ABGene • Transformation-based tagger • False-positive and false-negative filters • Compound term recovery • Document ranking
Transformation-Based Tagging • Learns sequence of transformation rules of the form • A -> B / C • greedily, based on number of errors corrected in training data tags • Applies rules sequentially to tag new text
Gene Transformations • GENE added as additional POS tag • NNP -> GENE / gene fgoodleft • * -> GENE / hassuf -A • * -> GENE / haspref c- • NNP -> GENE / prev1or2wd genes • NNP -> GENE / nextbigram ( GENE • VBG -> JJ / nexttag GENE
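Applying learned rules of the form A -> B / C can be sketched as below. This shows rule application only, not the greedy learning step, and it is not ABGene's code: the `Rule` class, the reading of `prev1or2wd` as "one of the two preceding words", the `prev1or2tag` rule, and the gene-like tokens are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str]]  # (word, tag) pairs


@dataclass
class Rule:
    from_tag: str                             # "*" matches any tag
    to_tag: str
    condition: Callable[[Tagged, int], bool]  # the context C


def prev1or2wd(target):
    """Context: one of the two preceding words equals `target`."""
    def cond(tagged, i):
        return any(w.lower() == target for w, _ in tagged[max(0, i - 2):i])
    return cond


def prev1or2tag(target):
    """Context (hypothetical rule): one of the two preceding tags equals `target`."""
    def cond(tagged, i):
        return any(t == target for _, t in tagged[max(0, i - 2):i])
    return cond


def apply_rules(tagged, rules):
    """Apply rules one after another; later rules see earlier rules' output."""
    tagged = list(tagged)
    for rule in rules:
        # Assumption: contexts are checked against the tags as they stood
        # before this rule started firing.
        snapshot = list(tagged)
        for i, (word, tag) in enumerate(snapshot):
            if rule.from_tag in ("*", tag) and rule.condition(snapshot, i):
                tagged[i] = (word, rule.to_tag)
    return tagged


rules = [
    Rule("NNP", "GENE", prev1or2wd("genes")),
    Rule("NNP", "GENE", prev1or2tag("GENE")),
]
sentence = [("the", "DT"), ("genes", "NNS"), ("Abc1", "NNP"),
            ("and", "CC"), ("Xyz2", "NNP")]
result = apply_rules(sentence, rules)
```

The first rule retags "Abc1" because "genes" precedes it; the second rule then retags "Xyz2" because a GENE tag now appears within two positions to its left. The order of the rules matters, since the second correction depends on the first.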
Results • Precision up to 0.74 • Recall up to 0.64 • depending on score threshold
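The dependence on the score threshold is the usual trade-off: a higher threshold keeps fewer, more reliable predictions. The sketch below uses invented scores and gold labels purely to show the computation; none of the numbers come from the ABGene evaluation.

```python
def precision_recall(scored, gold, threshold):
    """Precision and recall of the items scoring at or above the threshold."""
    predicted = {item for item, score in scored.items() if score >= threshold}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical scored candidates and gold-standard gene names.
scored = {"a": 0.9, "b": 0.6, "c": 0.2}
gold = {"a", "c"}

strict = precision_recall(scored, gold, threshold=0.8)
lenient = precision_recall(scored, gold, threshold=0.1)
```

With these toy values the strict threshold gives perfect precision but finds only half of the gold items, while the lenient threshold finds all of them at lower precision.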
Problems in Full Text • Terms that do not appear in abstracts • restriction enzyme site, lab protocol kits, primers, vectors, supply companies, chemical reagents • Figures and tables
Summary • Common thread in biomedical information extraction: normalization is hard!