Text Mining Applications for Literature Curation Kimberly Van Auken WormBase Consortium Textpresso Gene Ontology Consortium
WormBase: A Database for C. elegans and Other Nematodes www.wormbase.org
Curating Diverse Data Types Aggregation Behavior Which worms aggregate with other worms and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics
Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics
Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics • Strain information: • August 1, 1972 • Pineapple field in Hawaii
Curating Diverse Data Types Aggregation Behavior Which worms aggregate with other worms (Phenotype) and what contributes to the behavior? Bendesky et al., 2012, PLoS Genetics
Curating Diverse Data Types Aggregation Behavior Which worms aggregate with other worms (Phenotype) and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics • Worm Phenotype Ontology (WPO): Bordering • (WBPhenotype:0001820) • Life stage ontology, e.g., L3 larval stage • Assay, e.g., food source
Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)? Bendesky et al., 2012, PLoS Genetics
Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)? Bendesky et al., 2012, PLoS Genetics • Gene: npr-1 • Variation: ad609 (T(83)->I and T(144)->A) • Gene Ontology for npr-1: • Biological Process: feeding behavior • Molecular Function: neuropeptide receptor activity • Cellular Component: integral to plasma membrane
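The data types highlighted on these slides (strain, phenotype, molecular basis) could be gathered into one record per paper. The following is only a sketch of such a record: the class and field names are hypothetical and are not WormBase's actual schema, while the values are the ones shown on the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CuratedPaper:
    """Hypothetical container for facts curated from one paper."""
    reference: str
    strain_notes: List[str]
    phenotype: str
    phenotype_id: str
    gene: str
    variation: str
    go_terms: Dict[str, str] = field(default_factory=dict)


# Values copied from the slides; the structure itself is an assumption.
record = CuratedPaper(
    reference="Bendesky et al., 2012, PLoS Genetics",
    strain_notes=["August 1, 1972", "Pineapple field in Hawaii"],
    phenotype="Bordering",
    phenotype_id="WBPhenotype:0001820",
    gene="npr-1",
    variation="ad609",
    go_terms={
        "biological_process": "feeding behavior",
        "molecular_function": "neuropeptide receptor activity",
        "cellular_component": "integral to plasma membrane",
    },
)
print(record.gene, record.phenotype_id)
```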
Literature Curation Workflow PubMed keyword search – ‘elegans’ Full text paper acquisition Data type flagging and entity recognition Detailed curation/Fact extraction
Finding Papers: Daily, automated PubMed searches using keyword ‘elegans’ Download citation XML Article type Curator actions PMID Title Authors Abstract Journal
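A minimal sketch of the search-and-download step, assuming the NCBI E-utilities `esearch` endpoint and the standard PubMed citation XML layout. The function names, the one-day window, and the sample record (placeholder PMID, title, and author) are illustrative assumptions, not the actual WormBase script. No network call is made here; the code only builds the query URL and parses an XML string.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(keyword="elegans", days=1):
    """Build a PubMed query URL for records entered in the last `days` days."""
    params = {"db": "pubmed", "term": keyword, "reldate": days, "datetype": "edat"}
    return ESEARCH + "?" + urlencode(params)


def parse_citations(xml_text):
    """Extract the fields curators see: PMID, title, authors, abstract, journal."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        art = citation.find("Article")
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": art.findtext("ArticleTitle"),
            "journal": art.findtext("Journal/Title"),
            "abstract": " ".join(
                (t.text or "") for t in art.findall("Abstract/AbstractText")
            ),
            "authors": [a.findtext("LastName")
                        for a in art.findall("AuthorList/Author")],
        })
    return records


# Placeholder record, not a real citation.
SAMPLE = """<PubmedArticleSet><PubmedArticle><MedlineCitation>
<PMID>00000000</PMID>
<Article><Journal><Title>Example Journal</Title></Journal>
<ArticleTitle>A placeholder C. elegans title</ArticleTitle>
<Abstract><AbstractText>Placeholder abstract.</AbstractText></Abstract>
<AuthorList><Author><LastName>Doe</LastName></Author></AuthorList>
</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""

records = parse_citations(SAMPLE)
print(build_search_url())
print(records[0]["pmid"], records[0]["title"])
```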
Literature Curation Workflow – Full Text Acquisition • Fully manual step • Done for all papers we select • Electronic copies stored in curation database
Data Type Flagging/Triage • General classification of papers • What types of experiments are in a paper? • e.g. RNAi phenotypes, Variation phenotypes, Expression patterns, Physical interactions
Data Type Flagging Methods • Main pipeline: • Support Vector Machines (SVMs) • Other methods: • Textpresso category searches • hidden Markov models • Pattern matching scripts
Support Vector Machines: Document Classification • Machine learning models • Use positive and negative gold standard sets of papers to train (e.g., papers with/without RNAi experiments) • Positives: 100s, Negatives: 1000s • Resulting model classifies all new papers as negative or positive (high, medium, low confidence)
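To make the idea concrete, here is a toy linear SVM trained by sub-gradient descent on the hinge loss, using bag-of-words features. It is a sketch only: the four training "papers", the omission of a bias term, and the confidence thresholds are all assumptions made for illustration, and the production pipeline (Fang et al., 2012) is not reproduced here.

```python
def featurize(text):
    """Bag-of-words features: the set of lowercase tokens."""
    return set(text.lower().split())


def train_linear_svm(docs, labels, epochs=50, lr=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss (no bias term)."""
    w = {}
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            x = featurize(text)
            margin = y * sum(w.get(t, 0.0) for t in x)
            for t in w:                      # regularization shrinkage
                w[t] *= (1 - lr * lam)
            if margin < 1:                   # hinge loss is active
                for t in x:
                    w[t] = w.get(t, 0.0) + lr * y
    return w


def classify(w, text, high=1.0, medium=0.5):
    """Return a label and the raw score; thresholds are hypothetical."""
    score = sum(w.get(t, 0.0) for t in featurize(text))
    if score <= 0:
        return "negative", score
    if score >= high:
        conf = "high"
    elif score >= medium:
        conf = "medium"
    else:
        conf = "low"
    return "positive (%s)" % conf, score


# Toy gold-standard sets: +1 = paper with RNAi experiments, -1 = without.
DOCS = ["rnai knockdown phenotype", "rnai feeding lethal",
        "antibody staining expression", "gfp expression neurons"]
LABELS = [1, 1, -1, -1]

model = train_linear_svm(DOCS, LABELS)
print(classify(model, "rnai phenotype"))
print(classify(model, "gfp staining"))
```

Real training sets are heavily imbalanced (hundreds of positives against thousands of negatives), which is one reason a graded confidence for positives is more useful to curators than a bare yes/no.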
Data Type Flagging – Support Vector Machines SVMs trained for ten different data types: • Antibody • Genetic Interactions • Physical Interactions • Gene Expression • Regulation of Gene Expression • Variation Phenotypes • Overexpression Phenotypes • RNAi Phenotypes • Variation Sequence Change • Gene Structure Correction See: Fang R, et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 13(1):16
Curation from Support Vector Machine Results • SVM results lead directly to manual curation: • e.g. RNAi Phenotypes • Results from SVMs are processed further: • e.g. Variation Sequence Change Pattern Matching Script – regular expressions New variations (entity recognition) e.g. mg366, ju43, e1360
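A pattern-matching step of this kind might look like the sketch below. The regular expression (one to three lowercase letters followed by digits) is an assumption inferred from the example names mg366, ju43, and e1360; it is not the actual WormBase script, and a pattern this loose will also pick up unrelated tokens such as "p53", so the output still needs filtering and curator review.

```python
import re

# Assumed allele-name shape: 1-3 lowercase letters followed by digits.
ALLELE = re.compile(r"\b[a-z]{1,3}\d+\b")


def find_new_variations(text, known):
    """Return candidate variation names in `text` that are not already known."""
    found = []
    for name in ALLELE.findall(text):
        if name not in known and name not in found:
            found.append(name)
    return found


sentence = "We isolated mg366 and ju43, and compared them with e1360."
print(find_new_variations(sentence, known={"e1360"}))
```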
Data Type Flagging – Textpresso Full text of articles Terms, phrases, entities – semantically tagged Keyword or category search Match within sentence or entire paper Wnt Pathway HIV Nematodes S. cerevisiae RegulonDB ….many others C. elegans Mouse D. melanogaster Neuroscience Arabidopsis Dicty www.textpresso.org
Textpresso Categories • Pre-existing dictionaries, vocabularies: • Gene names • ChEBI (Chemical Entities of Biological Interest) • PATO • Sequence Ontology (SO) • Manually constructed by curators using language from published literature: • Sequence similarity – orthologous, conserved • Localization assays – GFP, antibody, fluorescence • Experimental verbs – required, regulates, exhibits
Data Type Flagging - Textpresso Category Searches • Data Type: C. elegans Human Disease Homologs • Three-category Textpresso search: • C. elegans gene • ‘Ortholog’, ‘Homolog’, ‘Similar’, ‘Model’ • Human disease “We map this defect in dauer response to a mutation in the scd-2 gene, which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) homolog, a proto-oncogene receptor tyrosine kinase.”
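The three-category search can be pictured as requiring at least one term from every category within the same sentence. The sketch below uses tiny hand-made term lists as stand-ins for Textpresso's real category dictionaries; the dictionary keys, the terms chosen, and the function name are hypothetical, and simple substring matching stands in for Textpresso's semantic tagging.

```python
# Tiny stand-in dictionaries; real Textpresso categories are far larger.
CATEGORIES = {
    "c_elegans_gene": ["scd-2"],
    "homology": ["ortholog", "homolog", "similar", "model"],
    "human_disease": ["lymphoma"],
}


def matches_all_categories(sentence, categories=CATEGORIES):
    """True if the sentence contains at least one term from every category."""
    lowered = sentence.lower()
    return all(
        any(term in lowered for term in terms)
        for terms in categories.values()
    )


hit = ("We map this defect in dauer response to a mutation in the scd-2 gene, "
       "which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) "
       "homolog, a proto-oncogene receptor tyrosine kinase.")
miss = "The scd-2 gene is expressed in neurons."

print(matches_all_categories(hit), matches_all_categories(miss))
```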
Literature Curation Workflow PubMed keyword search – ‘elegans’ Full text paper acquisition Data type flagging and entity recognition Detailed curation/Fact extraction
Textpresso: Semi-Automated Fact Extraction • Genetic Interactions: Interestingly, pph-5(tm2979) behaved similarly to pph-5(av101) in its ability to dominantly (but weakly) suppress sep-1(e2406ts), but recessively suppress sep-1(ax110) (supplementary material Table S1). • Physical Interactions – after SVM document classifier: Remarkably, only AIN-1 coimmunoprecipitated HA-tagged CePAB-1 (Figure 3A and B, lane 7). • Gene Ontology – Cellular Component Curation: During embryogenesis, PAN-1 protein is uniformly distributed throughout the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules (Fig. 2K, N).
Textpresso: Semi-Automated GO Cellular Component Curation Textpresso Component Gene Products Suggested GO Annotations Textpresso Search Results See: Van Auken KM, Jaffery J, Chan J, Müller HM, Sternberg PW. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) cellular component curation. BMC Bioinformatics. 10:228.
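As a rough illustration of how search results might become suggested annotations, the sketch below pairs each gene product found in a sentence with each cellular component term found in the same sentence. The protein-name pattern, the short component list, and the function name are assumptions for this example and do not reproduce the method of Van Auken et al. (2009).

```python
import re

# Assumed shape of C. elegans protein names: uppercase letters, hyphen, digits.
PROTEIN = re.compile(r"\b[A-Z]{2,4}-\d+\b")
# Hypothetical mini-vocabulary of cellular component terms.
COMPONENT_TERMS = ["cytoplasm", "P granule", "nucleus"]


def suggest_annotations(sentence):
    """Pair every protein with every component term in the sentence."""
    proteins = sorted(set(PROTEIN.findall(sentence)))
    lowered = sentence.lower()
    components = [c for c in COMPONENT_TERMS if c.lower() in lowered]
    return [(p, c) for p in proteins for c in components]


sentence = ("During embryogenesis, PAN-1 protein is uniformly distributed "
            "throughout the cytoplasm of the germline and somatic blastomeres, "
            "as seen for pan-1 mRNA (Fig. 2A), with no obvious concentration "
            "of PAN-1 in the P granules (Fig. 2K, N).")

suggestions = suggest_annotations(sentence)
print(suggestions)
```

Note that the sentence reports no concentration of PAN-1 in P granules, yet simple co-occurrence still proposes that pairing. This is why the approach is semi-automated: suggestions are candidates for a curator to accept or reject.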
Future Directions • Textpresso, other methods (HMMs) applied to additional data types • e.g. GO Biological Process curation (Phenotypes) • Focusing triage and fact extraction on novel findings • How best to integrate existing knowledge into curation pipelines to focus curator effort on new experimental results? • e.g. Commonly used molecular markers
Literature Annotation Tool – Tracking Evidence WB, GO Common Annotation Framework, BioCreative
Summary • Text Mining Applications for Literature Curation: • Paper approval and full text acquisition • Data type flagging and entity recognition • Fact extraction – record evidence • All steps of our pipeline incorporate some form of semi- or fully automated approaches: • Scripts for downloads, pattern matching • Support Vector Machines for document classification • Textpresso for flagging and fact extraction • (Hidden Markov Models for flagging, fact extraction)
The WormBase Consortium, Textpresso WormBase - Caltech Textpresso - Caltech Hans-Michael Muller Yuling Li James Done Former member: Arun Rangarajan Paul Sternberg Juancarlos Chan Wen Chen Chris Grove Ranjana Kishore Raymond Lee Cecilia Nakamura Daniela Raciti Gary Schindelman Kimberly Van Auken Daniel Wang Xiaodong Wang Karen Yook Former member: Ruihua Fang WormBase – OICR, Toronto Lincoln Stein Abigail Cabunoc Todd Harris JD Wong WormBase – Washington University John Spieth Tamberlyn Bieri Phil Ozersky CGC – Oxford University, Oxford, UK WormBase – EBI, Sanger, Hinxton, UK Jonathan Hodgkin Richard Durbin Paul Kersey Matt Berriman Paul Davis Michael Paulini Kevin Howe Mary Ann Tuli Gary Williams
Hidden Markov Models: Semi-Automated GO Molecular Function Curation • For each sentence, HMM yields: • True positive score • False positive score • For each sentence, curator assigns: • Fully curatable (entity + indication of enzymatic activity) • Positive (experiment was performed and a result reported, but no entity) • False Positive (not about enzymatic activity at all)