Text Mining Applications for Literature Curation

Text Mining Applications for Literature Curation Kimberly Van Auken WormBase Consortium Textpresso Gene Ontology Consortium

WormBase: A Database for C. elegans and Other Nematodes www.wormbase.org

Curating Diverse Data Types Aggregation Behavior Which worms aggregate with other worms and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics

Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics

Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics • Strain information: • August 1, 1972 • Pineapple field in Hawaii

Curating Diverse Data Types Aggregation Behavior Which worms aggregate with other worms (Phenotype) and what contributes to the behavior? Bendesky et al., 2012, PLoS Genetics

Curating Diverse Data Types Aggregation Behavior Which worms aggregate with other worms (Phenotype) and what contributes to that behavior? Bendesky et al., 2012, PLoS Genetics • Worm Phenotype Ontology (WPO): Bordering (WBPhenotype:0001820) • Life stage ontology, e.g., L3 larval stage • Assay, e.g., food source

Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)? Bendesky et al., 2012, PLoS Genetics

Curating Diverse Data Types Aggregation Behavior Which worms (Strain) aggregate with other worms (Phenotype) and what contributes to that behavior (Molecular Basis)? Bendesky et al., 2012, PLoS Genetics • Gene: npr-1 • Variation: ad609 (T(83)->I and T(144)->A) • Gene Ontology for npr-1: • Biological Process: feeding behavior • Molecular Function: neuropeptide receptor activity • Cellular Component: integral to plasma membrane

Literature Curation Workflow PubMed keyword search – ‘elegans’ Full text paper acquisition Data type flagging and entity recognition Detailed curation/Fact extraction

Finding Papers: Daily, automated PubMed searches using keyword ‘elegans’ Download citation XML Article type Curator actions PMID Title Authors Abstract Journal
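The daily search and citation download described on this slide could be sketched roughly as follows. This is a minimal illustration, not the WormBase script: it assumes the public NCBI E-utilities ESearch endpoint, and the function names, the one-day window, and the sample record (placeholder PMID, title, author) are hypothetical. No network call is made; the parser runs on an inline sample that follows the general shape of PubMed citation XML.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_search_url(keyword="elegans", days=1):
    """Build an ESearch URL for records entered in the last `days` days."""
    params = {"db": "pubmed", "term": keyword, "reldate": days,
              "datetype": "edat", "retmax": 1000}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

def parse_citations(xml_text):
    """Pull PMID, title, authors, abstract and journal out of citation XML."""
    records = []
    for art in ET.fromstring(xml_text.strip()).iter("PubmedArticle"):
        cit = art.find("MedlineCitation")
        article = cit.find("Article")
        authors = [
            f"{a.findtext('ForeName', '')} {a.findtext('LastName', '')}".strip()
            for a in article.iter("Author")
        ]
        records.append({
            "pmid": cit.findtext("PMID"),
            "title": article.findtext("ArticleTitle"),
            "authors": authors,
            "abstract": " ".join(t.text or "" for t in article.iter("AbstractText")),
            "journal": article.findtext("Journal/Title"),
        })
    return records

# Placeholder record (not a real citation), used only to exercise the parser.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>An example C. elegans paper</ArticleTitle>
        <Abstract><AbstractText>Placeholder abstract.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

if __name__ == "__main__":
    print(build_search_url())
    print(parse_citations(SAMPLE_XML))
```

The parsed fields correspond to the columns listed on the slide (PMID, Title, Authors, Abstract, Journal); article type and curator actions would be added downstream.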

Literature Curation Workflow – Full Text Acquisition • Fully manual step • Done for all papers we select • Electronic copies stored in curation database

Data Type Flagging/Triage • General classification of papers • What types of experiments are in a paper? • e.g. RNAi phenotypes, Variation phenotypes, Expression patterns, Physical interactions

Data Type Flagging Methods • Main pipeline: • Support Vector Machines (SVMs) • Other methods: • Textpresso category searches • hidden Markov models • Pattern matching scripts

Support Vector Machines: Document Classification • Machine learning models • Use positive and negative gold standard sets of papers to train (e.g., papers with/without RNAi experiments) • Positives: 100s, Negatives: 1000s • Resulting model classifies all new papers as negative or positive (high, medium, low confidence)
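To make the train-then-classify idea concrete, here is a toy sketch of a linear SVM document classifier in plain Python. It is not the pipeline described in Fang et al.: the six training "papers", the bag-of-words features, the subgradient-descent trainer, and the confidence thresholds are all assumptions chosen for illustration, and real training sets are far larger (100s of positives, 1000s of negatives).

```python
from collections import Counter

# Hypothetical gold-standard sets: +1 = has RNAi experiments, -1 = does not.
DOCS = [
    "rnai knockdown caused embryonic lethality",
    "dsrna feeding produced sterile animals",
    "rnai feeding caused slow growth",
    "crystal structure reveals conserved fold",
    "genome sequence assembled from reads",
    "crystal structure refined from reads",
]
LABELS = [1, 1, 1, -1, -1, -1]

def featurize(text):
    """Bag-of-words counts over lowercase whitespace tokens."""
    return Counter(text.lower().split())

def score(w, b, x):
    """Signed distance-like score of document x under the linear model."""
    return sum(w.get(tok, 0.0) * n for tok, n in x.items()) + b

def train_linear_svm(docs, labels, epochs=10, eta=0.1, lam=0.01):
    """Minimize L2-regularized hinge loss by subgradient descent."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(docs, labels):
            violated = y * score(w, b, x) < 1   # inside the margin?
            for tok in w:                       # regularization shrink
                w[tok] *= 1 - eta * lam
            if violated:                        # hinge-loss subgradient step
                for tok, n in x.items():
                    w[tok] = w.get(tok, 0.0) + eta * y * n
                b += eta * y
    return w, b

def classify(w, b, text):
    """Map the score to negative or positive with an assumed confidence tier."""
    s = score(w, b, featurize(text))
    if s <= 0:
        return "negative"
    if s >= 1.0:
        return "positive (high)"
    return "positive (medium)" if s >= 0.5 else "positive (low)"

if __name__ == "__main__":
    w, b = train_linear_svm([featurize(d) for d in DOCS], LABELS)
    print(classify(w, b, "rnai knockdown caused sterile animals"))
    print(classify(w, b, "crystal structure from genome sequence"))
```

The confidence tiers here are simply cut-offs on the distance from the decision boundary; how the production models assign high, medium, and low confidence is not stated on the slide.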

Data Type Flagging – Support Vector Machines SVMs trained for ten different data types: • Antibody • Genetic Interactions • Physical Interactions • Gene Expression • Regulation of Gene Expression • Variation Phenotypes • Overexpression Phenotypes • RNAi Phenotypes • Variation Sequence Change • Gene Structure Correction See: Fang R, et al. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics. 13(1):16

Curation from Support Vector Machine Results • SVM results lead directly to manual curation: • e.g. RNAi Phenotypes • Results from SVMs are processed further • e.g. Variation Sequence Change Pattern Matching Script – regular expressions New variations (entity recognition) e.g. mg366, ju43, e1360
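A pattern-matching step of this kind might look like the sketch below. It assumes that variation names follow the usual C. elegans convention of a one- to three-letter laboratory prefix followed by a number (as in mg366, ju43, e1360); the regular expression, the function name, and the example sentence are illustrative, not the actual WormBase script.

```python
import re

# Assumed pattern: 1-3 lowercase letters (lab prefix) followed by digits.
ALLELE = re.compile(r"\b[a-z]{1,3}\d+\b")

def find_new_variations(text, known):
    """Return candidate variation names in text that are not already known."""
    seen, new = set(), []
    for match in ALLELE.finditer(text):
        name = match.group()
        if name not in known and name not in seen:
            seen.add(name)
            new.append(name)
    return new

if __name__ == "__main__":
    sentence = ("The mg366 allele and ju43 were isolated in a screen; "
                "e1360 was described previously.")
    # Hypothetical set of variations already in the database.
    print(find_new_variations(sentence, known={"e1360"}))
```

A pattern this loose will also pick up unrelated tokens of the same shape (for example, a protein name such as p53), which is one reason to run it only on papers the SVM has already flagged and to leave the final call to a curator.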

Data Type Flagging – Textpresso Full text of articles Terms, phrases, entities – semantically tagged Keyword or category search Match within sentence or entire paper Wnt Pathway HIV Nematodes S. cerevisiae RegulonDB ….many others C. elegans Mouse D. melanogaster Neuroscience Arabidopsis Dicty www.textpresso.org

Textpresso Categories • Pre-existing dictionaries, vocabularies: • Gene names • ChEBI (Chemical Entities of Biological Interest) • PATO • Sequence Ontology (SO) • Manually constructed by curators using language from published literature: • Sequence similarity – orthologous, conserved • Localization assays – GFP, antibody, fluorescence • Experimental verbs – required, regulates, exhibits

Data Type Flagging - Textpresso Category Searches • Data Type: C. elegans Human Disease Homologs • Three-category Textpresso search: • C. elegans gene • ‘Ortholog’, ‘Homolog’, ‘Similar’, ‘Model’ • Human disease “We map this defect in dauer response to a mutation in the scd-2 gene, which, we show, encodes the nematode anaplastic lymphoma kinase (ALK) homolog, a proto-oncogene receptor tyrosine kinase.”
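The three-category, within-sentence match can be illustrated with a small stand-in. This sketch does not use Textpresso itself: the gene-name pattern, the homology keywords, and the three-term disease lexicon are hypothetical substitutes for Textpresso's curated categories.

```python
import re

# Stand-ins for the three categories (all assumptions for this sketch).
GENE = re.compile(r"\b[a-z]{3,4}-\d+\b")                     # e.g. scd-2
HOMOLOGY = re.compile(r"\b(ortholog|homolog|similar|model)", re.IGNORECASE)
DISEASE_TERMS = ("lymphoma", "diabetes", "alzheimer")        # toy lexicon

def split_sentences(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def match_categories(sentence):
    """Collect the terms from each category found in one sentence."""
    lower = sentence.lower()
    return {
        "gene": GENE.findall(sentence),
        "homology": HOMOLOGY.findall(sentence),
        "disease": [term for term in DISEASE_TERMS if term in lower],
    }

def flag_sentences(text):
    """Keep sentences in which all three categories are matched."""
    hits = []
    for sentence in split_sentences(text):
        cats = match_categories(sentence)
        if all(cats.values()):
            hits.append((sentence, cats))
    return hits

if __name__ == "__main__":
    text = ("We map this defect in dauer response to a mutation in the scd-2 "
            "gene, which, we show, encodes the nematode anaplastic lymphoma "
            "kinase (ALK) homolog, a proto-oncogene receptor tyrosine kinase. "
            "The scd-2 gene is expressed in neurons.")
    for sentence, cats in flag_sentences(text):
        print(cats)
```

Requiring all three categories in the same sentence keeps the second example sentence, which names a gene but neither homology nor disease, out of the results.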

Literature Curation Workflow PubMed keyword search – ‘elegans’ Full text paper acquisition Data type flagging and entity recognition Detailed curation/Fact extraction

Textpresso: Semi-Automated Fact Extraction • Genetic Interactions: Interestingly, pph-5 (tm2979) behaved similarly to pph-5 (av101) in its ability to dominantly (but weakly) suppress sep-1 (e2406ts), but recessively suppress sep-1(ax110) (supplementary material Table S1). • Physical Interactions – after SVM document classifier: Remarkably, only AIN-1 coimmunoprecipitated HA-tagged CePAB-1 (Figure 3A and B, lane 7). • Gene Ontology – Cellular Component Curation: During embryogenesis, PAN-1 protein is uniformly distributed throughout the cytoplasm of the germline and somatic blastomeres, as seen for pan-1 mRNA (Fig. 2A), with no obvious concentration of PAN-1 in the P granules (Fig. 2K, N).

Textpresso: Semi-Automated GO Cellular Component Curation Textpresso Component Gene Products Suggested GO Annotations Textpresso Search Results See: Van Auken KM, Jaffery J, Chan J, Müller HM, Sternberg PW. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) cellular component curation. BMC Bioinformatics. 10:228.
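The flow on this slide, from search results to component terms and gene products to suggested annotations, might be sketched as below. This is an illustration only, not the method of Van Auken et al. (2009): the protein-name pattern, the phrase-to-term mapping, and the function name are assumptions, and GO terms are given by name rather than identifier.

```python
import re

# Assumed convention: C. elegans protein names are written in capitals (PAN-1).
PROTEIN = re.compile(r"\b[A-Z]{2,4}-\d+\b")

# Hypothetical mapping from sentence wording to GO cellular component names.
COMPONENT_TERMS = {
    "cytoplasm": "cytoplasm",
    "p granule": "P granule",
    "nucleus": "nucleus",
}

def suggest_annotations(sentence):
    """Pair every protein in the sentence with every component term found."""
    proteins = sorted(set(PROTEIN.findall(sentence)))
    lower = sentence.lower()
    components = [go for phrase, go in COMPONENT_TERMS.items() if phrase in lower]
    return [(protein, go) for protein in proteins for go in components]

if __name__ == "__main__":
    sentence = ("During embryogenesis, PAN-1 protein is uniformly distributed "
                "throughout the cytoplasm of the germline and somatic "
                "blastomeres, as seen for pan-1 mRNA (Fig. 2A), with no obvious "
                "concentration of PAN-1 in the P granules (Fig. 2K, N).")
    for suggestion in suggest_annotations(sentence):
        print(suggestion)
```

On this sentence the sketch suggests both cytoplasm and P granule for PAN-1, although the text says PAN-1 is not concentrated in P granules. Suggestions of this kind are why the step is semi-automated: the curator accepts or rejects each one.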

Future Directions • Textpresso, other methods (HMMs) applied to additional data types • e.g. GO Biological Process curation (Phenotypes) • Focusing triage and fact extraction on novel findings • How best to integrate existing knowledge into curation pipelines to focus curator effort on new experimental results? • e.g. Commonly used molecular markers

Literature Annotation Tool – Tracking Evidence WB, GO Common Annotation Framework, BioCreative

Summary • Text Mining Applications for Literature Curation: • Paper approval and full text acquisition • Data type flagging and entity recognition • Fact extraction – record evidence • All steps of our pipeline incorporate some form of semi- or fully automated approaches: • Scripts for downloads, pattern matching • Support Vector Machines for document classification • Textpresso for flagging and fact extraction • (Hidden Markov Models for flagging, fact extraction)

The WormBase Consortium, Textpresso WormBase - Caltech Textpresso - Caltech Hans-Michael Muller Yuling Li James Done Former member: Arun Rangarajan Paul Sternberg Juancarlos Chan Wen Chen Chris Grove Ranjana Kishore Raymond Lee Cecilia Nakamura Daniela Raciti Gary Schindelman Kimberly Van Auken Daniel Wang Xiaodong Wang Karen Yook Former member: Ruihua Fang WormBase – OICR, Toronto Lincoln Stein Abigail Cabunoc Todd Harris JD Wong WormBase – Washington University John Spieth Tamberlyn Bieri Phil Ozersky CGC – Oxford University, Oxford, UK WormBase – EBI, Sanger, Hinxton, UK Jonathan Hodgkin Richard Durbin Paul Kersey Matt Berriman Paul Davis Michael Paulini Kevin Howe Mary Ann Tuli Gary Williams

Hidden Markov Models: Semi-Automated GO Molecular Function Curation • For each sentence, HMM yields: • True positive score • False positive score • For each sentence, curator assigns: • Fully curatable (entity + indication of enzymatic activity) • Positive (experiment was performed, result but no entity) • False Positive (not about enzymatic activity at all)
