
Text Mining Applications for Literature Curation

Text Mining Applications for Literature Curation. Kimberly Van Auken WormBase Consortium Textpresso Gene Ontology Consortium. WormBase: A Database for C. elegans and Other Nematodes. www.wormbase.org. Curating Diverse Data Types . Aggregation Behavior. Which worms aggregate


Presentation Transcript


  1. Text Mining Applications for Literature Curation Kimberly Van Auken WormBase Consortium Textpresso Gene Ontology Consortium

  2. WormBase: A Database for C. elegans and Other Nematodes www.wormbase.org

  3. Curating Diverse Data Types Aggregation Behavior Which worms aggregate with other worms and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics

  4. Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics

  5. Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics • Strain information: • August 1, 1972 • Pineapple field in Hawaii

  6. Curating Diverse Data Types Aggregation Behavior Which worms aggregate with other worms (Phenotype) and what contributes to the behavior? Bendesky et al., 2012, PLoS Genetics

  7. Curating Diverse Data Types Aggregation Behavior Which worms aggregate with other worms (Phenotype) and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics • Worm Phenotype Ontology (WPO): Bordering (WBPhenotype:0001820) • Life stage ontology, e.g., L3 larval stage • Assay, e.g., food source

  8. Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)? Bendesky et al., 2012, PLoS Genetics

  9. Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)? Bendesky et al., 2012, PLoS Genetics • Gene: npr-1 • Variation: ad609 (T(83)->I and T(144)->A) • Gene Ontology for npr-1: • Biological Process: feeding behavior • Molecular Function: neuropeptide receptor activity • Cellular Component: integral to plasma membrane

  10. Literature Curation Workflow PubMed keyword search – ‘elegans’ Full text paper acquisition Data type flagging and entity recognition Detailed curation/Fact extraction

  11. Finding Papers: Daily, automated PubMed searches using keyword ‘elegans’ Download citation XML Article type Curator actions PMID Title Authors Abstract Journal
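The daily search and citation download described on this slide could be sketched as follows. This is a minimal illustration, not WormBase's actual script: it assumes NCBI's E-utilities ESearch endpoint is used for the keyword query, and the function names, the `retmax` value, and the placeholder record in `SAMPLE_XML` are all hypothetical. No network request is made; the code only builds the query URL and parses citation XML of the kind PubMed returns.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term="elegans", days_back=1):
    """Build an ESearch URL for PubMed records entered in the last `days_back` days."""
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": days_back,   # look back this many days
        "datetype": "edat",     # filter on the Entrez entry date
        "retmax": 500,          # hypothetical cap on returned PMIDs
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def parse_citations(xml_text):
    """Pull out the citation fields listed on the slide from PubMed XML."""
    records = []
    for article in ET.fromstring(xml_text.strip()).iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        art = citation.find("Article")
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": art.findtext("ArticleTitle"),
            "journal": art.findtext("Journal/Title"),
            "abstract": art.findtext("Abstract/AbstractText", default=""),
            "authors": [a.findtext("LastName") for a in art.iter("Author")],
            "article_types": [p.text for p in art.iter("PublicationType")],
        })
    return records


# Placeholder record for illustration only; not a real citation.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <Journal><Title>Placeholder Journal</Title></Journal>
        <ArticleTitle>Placeholder title about C. elegans</ArticleTitle>
        <Abstract><AbstractText>Placeholder abstract.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName></Author>
          <Author><LastName>Roe</LastName></Author>
        </AuthorList>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

if __name__ == "__main__":
    print(build_esearch_url())
    for record in parse_citations(SAMPLE_XML):
        print(record["pmid"], record["title"], record["article_types"])
```

The parsed fields (PMID, title, authors, abstract, journal, article type) are what a curator would see when deciding whether to accept a paper for curation.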

  12. Literature Curation Workflow – Full Text Acquisition • Fully manual step • Done for all papers we select • Electronic copies stored in curation database

  13. Data Type Flagging/Triage • Data Type Flagging/Triage: • General classification of papers • What types of experiments are in a paper? • e.g. RNAi phenotypes, Variation phenotypes, Expression patterns, Physical interactions

  14. Data Type Flagging Methods • Main pipeline: • Support Vector Machines (SVMs) • Other methods: • Textpresso category searches • hidden Markov models • Pattern matching scripts

  15. Support Vector Machines: Document Classification • Machine learning models • Use positive and negative gold standard sets of papers to train (e.g., papers with/without RNAi experiments) • Positives: 100s, Negatives: 1000s • Resulting model classifies all new papers as negative or positive (high, medium, low confidence)
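To make the negative/positive (high, medium, low) output concrete, here is a small sketch of how a trained linear SVM could score a paper and how that score might be mapped to a confidence band. The slides do not say how the bands are derived, so the thresholds, the weights, and the function names below are assumptions made for illustration; a real model would learn its weights from the gold standard sets.

```python
# Hypothetical weights standing in for a trained linear SVM (w and b).
# Real weights would be learned from the positive and negative training papers.
WEIGHTS = {"rnai": 1.5, "knockdown": 0.5, "crystal": -1.0, "structure": -0.5}
BIAS = -0.25


def decision_score(tokens, weights=WEIGHTS, bias=BIAS):
    """Linear SVM decision function w . x + b over bag-of-words counts."""
    return sum(weights.get(token, 0.0) for token in tokens) + bias


def flag(score, medium=0.5, high=1.0):
    """Map a decision score to (label, confidence). Thresholds are assumed."""
    if score <= 0:
        return ("negative", None)
    if score >= high:
        return ("positive", "high")
    if score >= medium:
        return ("positive", "medium")
    return ("positive", "low")


if __name__ == "__main__":
    for text in ("rnai knockdown caused lethality",
                 "knockdown of expression",
                 "crystal structure of the receptor"):
        score = decision_score(text.split())
        print(text, score, flag(score))
```

The distance from the decision boundary is one common proxy for confidence; a calibrated probability would be another option.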

  16. Data Type Flagging – Support Vector Machines SVMs trained for ten different data types: • Antibody • Genetic Interactions • Physical Interactions • Gene Expression • Regulation of Gene Expression • Variation Phenotypes • Overexpression Phenotypes • RNAi Phenotypes • Variation Sequence Change • Gene Structure Correction See: Fang R, et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 13(1):16

  17. Curation from Support Vector Machine Results • SVM results lead directly to manual curation: • e.g. RNAi Phenotypes • Results from SVMs are processed further: • e.g. Variation Sequence Change Pattern Matching Script – regular expressions New variations (entity recognition) e.g. mg366, ju43, e1360
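A pattern-matching step of this kind could look like the sketch below. The regular expression is an assumption based on the example names on the slide (one to three lowercase letters followed by digits); the actual WormBase script is not shown in the presentation, and the `known` set is a hypothetical stand-in for variations already in the database.

```python
import re

# Assumed pattern: 1-3 lowercase letters followed by digits, e.g. mg366, ju43, e1360.
VARIATION_RE = re.compile(r"\b[a-z]{1,3}\d+\b")


def find_new_variations(text, known):
    """Return candidate variation names in `text` that are not in `known`.

    Order of first appearance is preserved and duplicates are dropped.
    """
    candidates = []
    for name in VARIATION_RE.findall(text):
        if name not in known and name not in candidates:
            candidates.append(name)
    return candidates


if __name__ == "__main__":
    sentence = "npr-1(ad609) animals were compared with mg366, ju43 and e1360 mutants."
    print(find_new_variations(sentence, known={"ad609"}))
```

A pattern this loose will also pick up unrelated tokens that happen to fit the shape, which is one reason to run it only on papers the SVM has already flagged.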

  18. Data Type Flagging – Textpresso Full text of articles Terms, phrases, entities – semantically tagged Keyword or category search Match within sentence or entire paper Wnt Pathway HIV Nematodes S. cerevisiae RegulonDB ….many others C. elegans Mouse D. melanogaster Neuroscience Arabidopsis Dicty www.textpresso.org

  19. Textpresso Categories • Pre-existing dictionaries, vocabularies: • Gene names • ChEBI (Chemical Entities of Biological Interest) • PATO • Sequence Ontology (SO) • Manually constructed by curators using language from published literature: • Sequence similarity – orthologous, conserved • Localization assays – GFP, antibody, fluorescence • Experimental verbs – required, regulates, exhibits

  20. Data Type Flagging - Textpresso Category Searches • Data Type: C. elegans Human Disease Homologs • Three-category Textpresso search: • C. elegans gene • ‘Ortholog’, ‘Homolog’, ‘Similar’, ‘Model’ • Human disease “We map this defect in dauer response to a mutation in the scd-2 gene, which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) homolog, a proto-oncogene receptor tyrosine kinase.”
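The three-category search amounts to requiring one term from each category within the same sentence. The sketch below imitates that logic with plain Python; it does not use Textpresso itself, and the gene-name pattern, the tiny disease list, and the naive sentence splitter are all assumptions for illustration.

```python
import re

# Stand-ins for Textpresso categories (assumed, deliberately tiny).
GENE_RE = re.compile(r"\b[a-z]{3,4}-\d+\b")            # e.g. scd-2, npr-1
HOMOLOGY_TERMS = ("ortholog", "homolog", "similar", "model")
DISEASE_TERMS = ("lymphoma", "diabetes", "alzheimer")  # hypothetical list


def split_sentences(text):
    """Naive splitter; it would wrongly break after abbreviations like 'C.'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def three_category_hits(text):
    """Return (sentence, genes) pairs where all three categories co-occur."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        genes = GENE_RE.findall(sentence)
        has_homology = any(term in lowered for term in HOMOLOGY_TERMS)
        has_disease = any(term in lowered for term in DISEASE_TERMS)
        if genes and has_homology and has_disease:
            hits.append((sentence, genes))
    return hits


if __name__ == "__main__":
    text = (
        "We map this defect in dauer response to a mutation in the scd-2 gene, "
        "which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) "
        "homolog, a proto-oncogene receptor tyrosine kinase. "
        "The scd-2 mutants were then grown on plates."
    )
    for sentence, genes in three_category_hits(text):
        print(genes, sentence)
```

Only the first sentence is returned: the second mentions a gene but neither a homology term nor a disease.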

  21. Literature Curation Workflow PubMed keyword search – ‘elegans’ Full text paper acquisition Data type flagging and entity recognition Detailed curation/Fact extraction

  22. Textpresso: Semi-Automated Fact Extraction • Genetic Interactions: Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its ability to dominantly (but weakly) suppress sep-1 (e2406ts), but recessively suppress sep-1(ax110) (supplementary material Table S1). • Physical Interactions – after SVM document classifier: Remarkably, only AIN-1 coimmunoprecipitated HA-tagged CePAB-1 (Figure 3A and B, lane 7). • Gene Ontology – Cellular Component Curation: During embryogenesis, PAN-1 protein is uniformly distributed throughout the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules (Fig. 2K, N).

  23. Textpresso: Semi-Automated GO Cellular Component Curation Textpresso Component Gene Products Suggested GO Annotations Textpresso Search Results See: Van Auken KM, Jaffery J, Chan J, Müller HM, Sternberg PW. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) cellular component curation. BMC Bioinformatics. 10:228.
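The step from a matched sentence to suggested GO annotations could be sketched roughly as below. This is an illustration of the idea only, not the published pipeline: the protein-name pattern and the three-entry component dictionary are assumptions, and the GO identifiers should be checked against the current ontology before any real use.

```python
import re

# Tiny, assumed component dictionary mapping sentence terms to GO identifiers.
COMPONENT_TO_GO = {
    "cytoplasm": "GO:0005737",
    "nucleus": "GO:0005634",
    "p granule": "GO:0043186",
}
# Assumed convention: protein names are the upper-case form, e.g. PAN-1.
PROTEIN_RE = re.compile(r"\b[A-Z]{3,4}-\d+\b")


def suggest_annotations(sentence):
    """Pair every protein in the sentence with every component term it mentions.

    These are suggestions for a curator to accept or reject, not final annotations.
    """
    lowered = sentence.lower()
    proteins = sorted(set(PROTEIN_RE.findall(sentence)))
    return [
        (protein, term, go_id)
        for protein in proteins
        for term, go_id in COMPONENT_TO_GO.items()
        if term in lowered
    ]


if __name__ == "__main__":
    sentence = (
        "During embryogenesis, PAN-1 protein is uniformly distributed throughout "
        "the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 "
        "mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules "
        "(Fig. 2K, N)."
    )
    for suggestion in suggest_annotations(sentence):
        print(suggestion)
```

On the PAN-1 sentence this yields both a cytoplasm and a P granule suggestion. The second comes from a negated statement ("no obvious concentration"), which shows why the approach is semi-automated: a curator still has to review each suggestion.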

  24. Future Directions • Textpresso, other methods (HMMs) applied to additional data types • e.g. GO Biological Process curation (Phenotypes) • Focusing triage and fact extraction on novel findings • How best to integrate existing knowledge into curation pipelines to focus curator effort on new experimental results? • e.g. Commonly used molecular markers

  25. Literature Annotation Tool – Tracking Evidence WB, GO Common Annotation Framework, BioCreative

  26. Summary • Text Mining Applications for Literature Curation: • Paper approval and full text acquisition • Data type flagging and entity recognition • Fact extraction – record evidence • All steps of our pipeline incorporate some form of semi- or fully automated approaches: • Scripts for downloads, pattern matching • Support Vector Machines for document classification • Textpresso for flagging and fact extraction • (Hidden Markov Models for flagging, fact extraction)

  27. The WormBase Consortium, Textpresso WormBase - Caltech Textpresso - Caltech Hans-Michael Muller Yuling Li James Done Former member: Arun Rangarajan Paul Sternberg Juancarlos Chan Wen Chen Chris Grove Ranjana Kishore Raymond Lee Cecilia Nakamura Daniela Raciti Gary Schindelman Kimberly Van Auken Daniel Wang Xiaodong Wang Karen Yook Former member: Ruihua Fang WormBase – OICR, Toronto Lincoln Stein Abigail Cabunoc Todd Harris JD Wong WormBase – Washington University John Spieth Tamberlyn Bieri Phil Ozersky CGC – Oxford University, Oxford, UK WormBase – EBI, Sanger, Hinxton, UK Jonathan Hodgkin Richard Durbin Paul Kersey Matt Berriman Paul Davis Michael Paulini Kevin Howe Mary Ann Tuli Gary Williams

  28. Hidden Markov Models: Semi-Automated GO Molecular Function Curation • For each sentence, HMM yields: • True positive score • False positive score • For each sentence, curator assigns: • Fully curatable (entity + indication of enzymatic activity) • Positive (experiment was performed, result but no entity) • False Positive (not about enzymatic activity at all)
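One way to hold the per-sentence HMM scores alongside the curator's judgment is sketched below. The slide does not say how the two scores are combined or presented, so ranking by the difference between them is purely an assumption, as are the class names and the example scores.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CuratorLabel(Enum):
    """The three curator judgments listed on the slide."""
    FULLY_CURATABLE = "fully curatable"
    POSITIVE = "positive"
    FALSE_POSITIVE = "false positive"


@dataclass
class ScoredSentence:
    text: str
    tp_score: float                       # HMM true positive score
    fp_score: float                       # HMM false positive score
    label: Optional[CuratorLabel] = None  # filled in by the curator


def review_queue(sentences):
    """Order sentences so the most promising ones are reviewed first.

    Assumed heuristic: rank by tp_score minus fp_score, highest first.
    """
    return sorted(sentences, key=lambda s: s.tp_score - s.fp_score, reverse=True)


if __name__ == "__main__":
    # Scores below are invented for illustration.
    queue = review_queue([
        ScoredSentence("Sentence A", tp_score=0.25, fp_score=0.75),
        ScoredSentence("Sentence B", tp_score=0.875, fp_score=0.125),
        ScoredSentence("Sentence C", tp_score=0.5, fp_score=0.5),
    ])
    queue[0].label = CuratorLabel.FULLY_CURATABLE
    for item in queue:
        print(item.text, item.tp_score - item.fp_score, item.label)
```

Storing the curator's label next to the scores would also make it possible to reuse the reviewed sentences as training data later.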
