
Extracting biological names and relations from texts

Extracting biological names and relations from texts. Ting-Yi Sung 宋定懿 Bioinformatics Program, TIGP Institute of Information Science Academia Sinica 2004/12/16. Motivation. To automatically extract information from natural language text.


Presentation Transcript


  1. Extracting biological names and relations from texts Ting-Yi Sung 宋定懿 Bioinformatics Program, TIGP Institute of Information Science Academia Sinica 2004/12/16

  2. Motivation • To automatically extract information from natural language text. • The need arises from the rapid accumulation of biomedical literature. • Expedite survey efforts • Support database curation (automatically associate papers with database records)

  3. Targets of Information Extraction • Protein-Protein interaction/binding/inhibition • Protein-Small Molecules • Gene-Gene regulation • Gene-Gene Product interaction • Gene-Drug relation • Protein-Subcellular location • Amino Acid-Protein relation • Example relationships between genes and drugs: • The gene is the drug target • The gene confers resistance to the drug • The gene metabolizes the drug

  4. Information Extraction Tasks • Identify target named entities • Identify relations among named entities • Identify relations among events and named entities • Associate results with existing database records

  5. Outline • NER (named entity recognition) in biomedical domain • Challenges in biomedical NER • State of progress in NER • Abbreviation disambiguation • Future works

  6. What is NER? • NER • Named Entity Recognition • Including two tasks • Identification of proper names in text • Classification of proper names in text • Newswire Domain • Person, Location, Organization • Biomedical Domain • Protein, DNA, RNA, Body Part, Cell Type, Lipid, etc.

  7. Example of NER - Biomedical • [Figure: example text annotated with the entity classes Protein, Tissue, and Disease]

  8. NER in biomedical domain • BioNER aims to recognize the following names • First Priority • Protein name, DNA name, RNA name • Second Priority • cell type, other organic compound, cell line, lipid, multi-cell, virus, cell component, body part, tissue, amino acid monomer, polynucleotide, mono-cell, inorganic, peptide, nucleotide, atom, other artificial source, carbohydrate, organic

  9. The Overall Spectrum • BioNER is only the starting point of biological information extraction • A whole suite of NLP techniques is needed to handle relations and events in literature mining • Techniques developed for BioNER should be adaptable to problems in later stages, e.g. NE relation recognition

  10. Intrinsic Features of BioNER • Unknown words • Long compound words • Variations of expressions • Nested NEs

  11. Unknown Words • Words containing a hyphen, digit, letter, Greek letter, or Roman numeral. • Alpha B1 • Adenylyl cyclase 76E • Latent membrane protein 1 • 4’-mycarosyl isovaleryl-CoA transferase • oligodeoxyribonucleotide • 18-deoxyaldosterone • Abbreviations and acronyms • IL, TECd, IFN, TPA

  12. Long Compound words • interleukin 1 (IL-1)-responsive kinase • interleukin 1-responsive kinase • epidermal growth factor receptor • SH2 domain containing tyrosine kinase Syk • SH2 domain (GENIA example)

  13. Various expressions of the same NE • Spelling variation • N-acetylcysteine, N-acetyl-cysteine, NAcetylCysteine • Word permutation • beta-1 integrin, integrin beta-1 • Ambiguous expressions • epidermal growth factor receptor, EGF receptor, EGFR • c-jun, c-Jun, c jun
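The spelling variants on this slide can be folded together with a simple normalization step. The following is a minimal sketch, not a method from the talk: the function name `normalize_name` is hypothetical, and it only handles case, hyphen, and whitespace differences. Word permutations (beta-1 integrin vs. integrin beta-1) and abbreviations (EGFR) would need separate handling.

```python
import re

def normalize_name(name: str) -> str:
    """Drop hyphens and whitespace and lowercase, so spelling variants compare equal."""
    return re.sub(r"[\s\-]+", "", name).lower()

# The three spellings from the slide collapse to a single key.
variants = ["N-acetylcysteine", "N-acetyl-cysteine", "NAcetylCysteine"]
keys = {normalize_name(v) for v in variants}
```

A normalized key like this is typically used only for lookup; the original surface form is kept for display and annotation.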

  14. Various expressions: the name explains its function • the Ras guanine nucleotide exchange factor Sos • the Ras guanine nucleotide releasing protein Sos • the Ras exchanger Sos • the GDP-GTP exchange factor Sos • Sos(mSos), a GDP/GTP exchange protein for Ras

  15. Various expressions: The name includes preposition and/or conjunction (ambiguity of dependencies) • p85 alpha subunit of PI 3-kinase • SH2 and SH3 domains of Src • NF-AT1 , AP-1 , and NF-kB sites • E2F1 and -3 • Residues 432, 435, 437, 438, and 440

  16. Nested Named Entity • An NE embedded in another NE. • IL-2: protein • IL-2 gene: gene • CBP/p300 associated factor: protein • CBP/p300 associated factor binding promoter: DNA

  17. Outline • NER (named entity recognition) in biomedical domain • Challenges in biomedical NER • State of progress in NER • Abbreviation disambiguation • Future works

  18. Challenges of NER • Unknown word identification • Named entity boundary detection • Class disambiguation

  19. Challenges • Unknown word identification • t (10;11) (p13; q14) • DNA methyltransferase • 73 kDa protein • interleukin 1 (IL-1)-responsive kinase (an NE may contain an abbreviation within it) • Some unknown words occur very few times in the corpus, so they are hard to recognize.

  20. Challenges (cont’d) • NE boundary detection • A boundary word can be a regular English word, an unknown word, a Roman numeral, or a digit. • MHC Class II • latent protein 1 (the left boundary is an adjective) • cyclin-like UDG gene product • Conjunction (and, or, …) • alpha- and beta-globin • human and mouse gene

  21. Challenges (cont’d) • Classification of abbreviations • NF-AT • Full name: nuclear factor of activated T cells • Class: Protein • HTLV-I • Full name: Human T cell lymphotropic virus I • Class: Virus • TCDD • Full name: 2,3,7,8-tetrachlorodibenzo-p-dioxin • Class: Other Organic • GRE • Full name: glucocorticoid response element • Class: DNA

  22. Outline • NER (named entity recognition) in biomedical domain • Challenges in biomedical NER • State of progress in NER • Abbreviation disambiguation • Future works

  23. State-of-the-art Systems on NER: Two evaluation contests • BioCreative 2004 (March) • Critical Assessment of Information Extraction Systems in Biology • Task 1: Entity extraction • Target: genes (or proteins, where there is ambiguity) • 10,000 sentences from Medline as training data, and 5,000 sentences as testing data • BioNLP 2004 (August) • GENIA Corpus as training data and 404 abstracts as testing data • Target: 5 classes: protein, DNA, RNA, cell line, and cell type. • Both use exact match scoring.
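Exact match scoring means a predicted entity counts as correct only if its boundaries and class both agree with the gold annotation. The sketch below illustrates the idea under simplifying assumptions: entities are hypothetical `(start, end, class)` tuples, and the official contest scorers may differ in detail.

```python
def exact_match_scores(gold, predicted):
    """Precision, recall, and F1 where only identical (start, end, class) tuples match."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# The second prediction is off by one token on the right, so it earns no credit.
gold = [(0, 2, "protein"), (5, 6, "DNA")]
pred = [(0, 2, "protein"), (5, 7, "DNA")]
p, r, f = exact_match_scores(gold, pred)
```

Under this scheme a boundary error is penalized twice, once as a false positive and once as a false negative, which is part of why boundary detection matters so much for the reported F-scores.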

  24. BioNLP 2004 Datasets

  25. Current Methods • Machine Learning • HMM, SVM, ME (Maximum Entropy), CRF (Conditional Random Field) • Hybrid methods • Dictionary Based • Approximate String matching algorithm • Naming Rules • Dynamic Programming
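The slide lists approximate string matching and dynamic programming among dictionary-based methods without naming a specific algorithm. As one common instance, here is a sketch of Levenshtein edit distance used to look up a mention in a small dictionary. The helper `closest_entry` and the threshold `max_distance` are illustrative assumptions, not part of any system described in the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def closest_entry(mention, dictionary, max_distance=2):
    """Return the nearest dictionary entry within max_distance edits, else None."""
    if not dictionary:
        return None
    best = min(dictionary, key=lambda e: edit_distance(mention.lower(), e.lower()))
    if edit_distance(mention.lower(), best.lower()) <= max_distance:
        return best
    return None

names = ["N-acetylcysteine", "integrin beta-1"]
match = closest_entry("N-acetyl-cysteine", names)
```

A linear scan like this is only practical for small dictionaries; a gazetteer with over a million entries would need indexing or blocking before the distance computation.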

  26. Features for Machine Learning Methods • Morphological Features • Orthographical Features • POS Features • Genia POS tagger • Semantic Trigger Features • Head-noun Features • NF-kappaB consensus site • IL-2 gene

  27. Morphological Features

  28. Orthographical Features

  29. Head Nouns
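The feature tables for these slides did not survive in the transcript. As a rough illustration of what orthographical features look like, the sketch below computes a few token-level flags. The feature names and the short Greek-letter list are assumptions chosen for the example, not the feature set used in the talk.

```python
import re

# Hypothetical, deliberately short list of spelled-out Greek letters.
GREEK_NAMES = ("alpha", "beta", "gamma", "delta", "kappa")

def orthographic_features(token: str) -> dict:
    """Surface-form flags of the kind fed to an ME or CRF tagger."""
    lower = token.lower()
    return {
        "init_cap": token[:1].isupper(),
        "all_caps": token.isalpha() and token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        "roman_numeral": re.fullmatch(r"[IVX]+", token) is not None,
        "contains_greek": any(name in lower for name in GREEK_NAMES),
    }

features = {tok: orthographic_features(tok)
            for tok in ["IL-2", "NF-kappaB", "II", "kinase"]}
```

The substring test for Greek letters is crude (it would also fire on "alphabet"), which is acceptable for a sketch but would need tightening in practice.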

  30. Additional features used by Manning’s group: local features • Clues within a sentence • Include: • Previous NEs • Abbreviations: an abbr., a long form, neither • Parenthesis-matching • etc.

  31. External resources used by Manning’s group • Motivation • Contextual clues do not provide sufficient evidence for confident classification. • May be vulnerable to incompleteness, noise, and ambiguity. • Web • Least vulnerable to incompleteness, highly vulnerable to noise. • Prepare patterns for each class • For genes: X gene, X antagonist, X mutation • For RNA: X mRNA, … • For proteins: X ligation, … • Features: web-protein, web-RNA, O-web, … • Does not work well in BioNLP Task.

  32. External resources (2) • Gazetteers (dictionaries) • Are arguably subject to all three, and yet have been used successfully in some systems. • Compiled a list of gene names from databases (e.g. Locus Link) and GO, plus the data from BioCreative Tasks 1A and 1B. • Filtering • Single-character entries, e.g., ‘A’, ‘1’; entries containing only digits, or only symbols and digits, e.g., ‘37’, ‘3-1’ • Entries containing only words that can be found in an English dictionary (CELEX), e.g., ‘abnormal’, ‘brain tumor’ • 1,731,581 entries • Larger context
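The gazetteer filtering rules on this slide can be sketched as a single predicate. This is an illustration only: `ENGLISH_WORDS` is a tiny stand-in for the CELEX dictionary, and the function name `keep_entry` is hypothetical.

```python
# Tiny stand-in for CELEX; a real filter would load the full word list.
ENGLISH_WORDS = {"abnormal", "brain", "tumor"}

def keep_entry(entry: str, english_words=ENGLISH_WORDS) -> bool:
    """Return False for gazetteer entries that the slide's rules filter out."""
    entry = entry.strip()
    if len(entry) <= 1:                        # single character: 'A', '1'
        return False
    if not any(ch.isalpha() for ch in entry):  # digits/symbols only: '37', '3-1'
        return False
    words = entry.lower().split()
    if all(w in english_words for w in words): # ordinary English: 'brain tumor'
        return False
    return True

raw = ["A", "37", "3-1", "abnormal", "brain tumor", "IL-2"]
kept = [e for e in raw if keep_entry(e)]
```

The third rule trades recall for precision: genuine gene names that happen to be ordinary English words are dropped along with the noise.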

  33. State-of-the-art approaches • Machine learning + Post-processing • Our method (BioKDD2004) • Maximum entropy • Post-processing • Boundary extension • Re-classification

  34. Zhou et al. approach • HMM + SVM • Post-processing • Rule-based: used to resolve nested named entities. • Top 1 in the NLPBA Task, F=72.5%

  35. Manning et al. method • Machine learning: • ME Markov model • Local features • External resources and larger context • Post-processing • To correct gene’s boundary (mainly for BioCreative Task) • Top 1 in BioCreative, F= 83.2% • Top 2 in NLPBA Task, F=70.1%

  36. Our Method Overview • Training phase: knowledge input → construct boundary word lists and dictionary; training data → mapping features (using the dictionary and boundary word lists) → ME learning • Testing phase: testing data → ME → post-processing (boundary extension, then re-classification) → NEs

  37. Experimental Results:

  38. Post-Processing • Nested Named Entity • Ex: CIITA mRNA • Nested Annotation: <RNA><DNA>CIITA</DNA> mRNA</RNA> • ME sometimes only recognizes CIITA as DNA • 16.57% of NEs in GENIA 3.02 contain one or more shorter NEs [Zhang, 2003] • Post-processing method • Boundary Extension • Re-classification

  39. Boundary Extension (1) • Boundary extension for nested NEs • Extend the R-boundary repeatedly if the NE is followed by another NE, a head noun, or an R-boundary word with a valid POS tag. • Extend the left boundary repeatedly if the NE is preceded by an L-boundary word with a valid POS tag.

  40. Example • ICAM-1 surface protein • ME result: ICAM-1/1U surface/unknown protein/unknown (1: protein, U: single) • Boundary extension • surface: in R-boundary word list, valid POS tag • Extension: ICAM-1 surface • protein: in R-boundary word list, valid POS tag • Extension: ICAM-1 surface protein
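The extension rule and the ICAM-1 example can be sketched as a loop over neighbouring tokens. This is a simplified illustration, assuming hypothetical word lists and a hypothetical set of valid POS tags; it omits the cases where the NE is followed by another NE or a head noun.

```python
# Hypothetical boundary word lists and POS tag set, for illustration only.
R_BOUNDARY = {"surface", "protein", "gene", "mRNA"}
L_BOUNDARY = {"human"}
VALID_POS = {"NN", "NNS", "JJ"}

def extend_boundaries(tokens, pos_tags, start, end):
    """The NE occupies tokens[start:end]; return the extended (start, end)."""
    # Extend right while the next token is an R-boundary word with a valid tag.
    while (end < len(tokens) and tokens[end] in R_BOUNDARY
           and pos_tags[end] in VALID_POS):
        end += 1
    # Extend left while the previous token is an L-boundary word with a valid tag.
    while (start > 0 and tokens[start - 1] in L_BOUNDARY
           and pos_tags[start - 1] in VALID_POS):
        start -= 1
    return start, end

tokens = ["ICAM-1", "surface", "protein", "expression"]
pos_tags = ["NN", "NN", "NN", "NN"]  # assumed tags
start, end = extend_boundaries(tokens, pos_tags, 0, 1)
extended = " ".join(tokens[start:end])
```

The loop stops at "expression" because it is not in the R-boundary list, which matches the two extension steps shown on the slide.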

  41. Boundary extension (2) • Boundary extension for NEs containing brackets or slashes • NE := NE + ( + NE + ) + {NE or head noun or R-boundary word with valid POS tag} • NE := NE + / + NE ( + / + NE ) + { NE or head noun or R-boundary word with valid POS tag} • Example • granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene • ME result: granulocyte-macrophage colony-stimulating factor, GM-CSF • Extension: granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene

  42. Re-classification • Use dictionary lookup • Use R-boundary word • CIITA mRNA: RNA class • granulocyte-macrophage colony-stimulating factor ( GM-CSF ) gene: DNA class
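Re-classification by R-boundary word can be sketched as a lookup on the last token of the extended NE. The two mappings below come from the slide's examples (mRNA gives RNA, gene gives DNA); the function name and the fallback to the original class are assumptions, and the dictionary-lookup variant is not shown.

```python
# Mappings taken from the two examples on the slide.
R_BOUNDARY_CLASS = {"mRNA": "RNA", "gene": "DNA"}

def reclassify(ne_tokens, current_class):
    """Use the final token's class if it is a known R-boundary word, else keep the class."""
    return R_BOUNDARY_CLASS.get(ne_tokens[-1], current_class)

# ME tagged CIITA as DNA; after extension to "CIITA mRNA" the class becomes RNA.
ciita_class = reclassify(["CIITA", "mRNA"], "DNA")
gmcsf_class = reclassify(
    ["granulocyte-macrophage", "colony-stimulating", "factor",
     "(", "GM-CSF", ")", "gene"],
    "protein",
)
```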

  43. Experimental Results: NE Identification • BE-1: boundary extension for nested NEs • BE-2: boundary extension for brackets and slashes • BE-3: with human name filter

  44. Experimental Results: NE Recognition • RC-1: re-classification using dictionary lookup • RC-2: re-classification using R-boundary words

  45. Experimental Results: GENIA v3.02 (10-fold CV) • Recently, Zhou improved the F-measure of his HMM model to 0.712 by combining it with an SVM

  46. Error Analysis • GENIA inconsistent annotation • IL-2 gene expression • <DNA>IL-2 gene</DNA> expression • <othername><DNA>IL-2 gene</DNA> expression</othername> • Conjunction • Human and mouse gene • Boundary detection error (boundary not in boundary word file) • Squirrel, manic, bursal…

  47. Error Analysis • Abbreviation classification • The orthographical form fits into at least two classes. • Protein: SOS1, FLICE, GAG • Other Organic: CD336 • False negative • A number of errors are due to low-frequency words or words not encountered in the training data. • False positive • Ellipsis: • Many inflammatory cytokine genes including TNF, IL-1, and IL-6

  48. Outline • NER (named entity recognition) in biomedical domain • Challenges in biomedical NER • Current methods and our method • State of progress in NER • Future works

  49. Manning’s conclusion (I): Key factor for low performance • Task difficulty does not appear to be the primary factor leading to low performance. • BioCreative: 1 class, BioNLP: 5 classes • Key factor: quality of the training and evaluation data • Higher inconsistency in the annotation of the BioNLP data. • Two of the authors independently reviewed 50 of the system’s errors; 34-35 were attributed to annotation. • The authors do not think the annotation inconsistencies are due to biological subtleties.
