Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation
WormBase Literature Curators
Textpresso
SAB 2008

How does data get into WormBase?
Institution: Sanger Institute
SUBMITTED FROM PAGE: http://www.wormbase.org/db/seq/gbrowse/elegans/
COMMENT TEXT: Dear WormBase, I think that WormBase may be missing a gene between Y50E8A.6 and Y50E8A.7......
• User submission (email, web forms)
• First-pass curation

Current first-pass curation pipeline
Publication → Flagging/Triage → Curation

User submissions: first-pass flagging/triage
• Growing desire among biocurators for user submissions
• The authors are the first people to know what data are in a paper
• TAIR: partnered with Plant Physiology (February 2008)
  • Web interface for data submission
  • Voluntary; link included in acceptance letter
  • Submitter email
  • Paper identifier
  • Locus name
  • Term/descriptor, method

Data extraction: Textpresso
• Full-text searching
• Keywords and/or categories
Müller, Kenny, and Sternberg. PLoS Biology, November 2004.

Textpresso: What data types?
• Paper-entity association: pattern matching
  • Transgenes (Wen): WBPaper00031242: gqIs3, gqIs35, oxIs12
• Fact extraction: specialized categories
  • Genetic interactions (Andrei): eor-2(op166) suppresses HSN death in the strong tra-1(e1099) background, but not noticeably in the weaker tra-1(e1076) background.
  • GO cellular component curation (Kimberly): ...positions of these neurons are indicated with circles and localizations of GAR-3::YFP on the cell membranes are denoted by arrows.

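The paper-entity association described above can be illustrated with a small pattern-matching sketch. This is not Textpresso's implementation; it assumes transgene names follow the form seen in the slide's examples (a short lowercase lab prefix, then "Is" or "Ex", then a number), and the paper text and the second paper ID are invented for illustration.

```python
import re

# Assumed naming pattern, inferred from the examples gqIs3, gqIs35, oxIs12:
# 1-3 lowercase letters, then "Is" or "Ex", then digits.
TRANSGENE_RE = re.compile(r"\b[a-z]{1,3}(?:Is|Ex)\d+\b")

def associate_transgenes(papers):
    """Map each paper ID to the sorted, de-duplicated transgene names in its text."""
    associations = {}
    for paper_id, text in papers.items():
        names = sorted(set(TRANSGENE_RE.findall(text)))
        if names:  # papers with no match are left out
            associations[paper_id] = names
    return associations

# Invented text; only the paper ID and transgene names come from the slide.
papers = {
    "WBPaper00031242": "Strains carrying gqIs3, gqIs35 or oxIs12 were imaged.",
    "HypotheticalPaper1": "No transgenic lines were used in this study.",
}
print(associate_transgenes(papers))
```

A regular expression this simple will miss or over-match some names, which is one reason a manual accuracy check on the results is worthwhile.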
Textpresso-mediated CC curation: from sentences to annotations

Textpresso: How much data?
• Transgenes:
  • 1,100 new paper-transgene connections
  • 250 new transgenes
  • Checked manually: 95% accuracy
  • Ultimately, connections will go directly into database
• Genetic Interactions: 1,875 (1/2007 to 5/2008)
  • ~5,600 total interactions
  • Keeping current with new papers
• GO Cellular Component Annotations: 515 (1/2007 to 5/2008)
  • 2-3X rate prior to categories
  • Nearly complete
  • Keeping up with new data (1-2 hours/week)

Textpresso: Other data types
How else can we use Textpresso?
• Other data types: Molecular Function Assays, Gene Product Interactions
• Pilot: GO molecular function annotations for protein kinase activity
  • keyword: phosphorylate
  • category: C. elegans proteins
  • 13 new GO annotations/hour
  • Extension of this: protein modifications (not yet captured in WB)
• Pilot: Gene product interactions for WB and BIND
  • keywords: physically interact
  • category: C. elegans proteins
  • 310 matches in 237 documents
  • 22 physical interactions (top 15 papers)

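The keyword-plus-category searches in the pilots above can be sketched as a sentence filter: keep a sentence only if it contains the keyword and at least one term from the chosen category. This is a minimal illustration, not Textpresso's search engine. The category lexicon, the protein names FOO-1 and BAR-2, and the sample sentences are hypothetical placeholders.

```python
import re

# Hypothetical miniature lexicon; a real category would hold many more terms.
CATEGORIES = {
    "C. elegans proteins": {"FOO-1", "BAR-2"},
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.
    It would wrongly split abbreviations such as 'C. elegans'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def contains_term(sentence, term):
    """True if the term occurs as a whole token (not inside a longer name)."""
    pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
    return re.search(pattern, sentence) is not None

def search(text, keyword, category):
    """Return sentences containing the keyword and at least one category term."""
    terms = CATEGORIES[category]
    hits = []
    for sentence in split_sentences(text):
        has_keyword = keyword.lower() in sentence.lower()
        has_term = any(contains_term(sentence, t) for t in terms)
        if has_keyword and has_term:
            hits.append(sentence)
    return hits

# Invented sentences, used only to exercise the filter.
text = (
    "FOO-1 can phosphorylate BAR-2 in vitro. "
    "Kinases phosphorylate many substrates. "
    "BAR-2 is found at the membrane."
)
print(search(text, "phosphorylate", "C. elegans proteins"))
```

Only the first sentence satisfies both conditions. Requiring both the keyword and a category term narrows the hits that a curator then has to read.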
Textpresso for triage: Classifying text based on content
• Multiple levels:
  • Organismal triage: C. elegans, Drosophila
  • Identify, prioritize information-rich papers
  • Flag for specific data types
• Multiple strategies (using existing first-pass papers as training set):
  • Machine learning: SVM (Support Vector Machine), word frequency analysis
  • Hand-crafted categories
  • Combine SVM and categories
  • Supplement with word weighting, contextual analyses

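As a rough illustration of the word-frequency strategy listed above, the sketch below learns a weight per word from papers already flagged (or not flagged) for a data type, then scores new text by summing those weights. It is a simple smoothed log-odds scorer standing in for the idea, not an SVM and not the curators' actual classifier. The training sentences, function names, and threshold are all assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def train_word_weights(positive_docs, negative_docs):
    """Weight each word by its add-one-smoothed log-odds of appearing in
    flagged (positive) versus unflagged (negative) training text."""
    pos = Counter(w for d in positive_docs for w in tokenize(d))
    neg = Counter(w for d in negative_docs for w in tokenize(d))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }

def score(text, weights):
    """Sum the weights of known words; unseen words contribute nothing."""
    return sum(weights.get(w, 0.0) for w in tokenize(text))

def flag(text, weights, threshold=0.0):
    """Flag the text for the data type if its score exceeds the threshold."""
    return score(text, weights) > threshold

# Invented stand-ins for first-pass papers flagged for one data type.
flagged = ["the mutation suppresses the phenotype",
           "the allele enhances the mutant phenotype"]
unflagged = ["the protein was purified from extracts",
             "the antibody was raised against the protein"]

weights = train_word_weights(flagged, unflagged)
print(flag("this allele suppresses the phenotype", weights))
print(flag("the antibody was purified", weights))
```

In a combined strategy, scores like these could be one input alongside hand-crafted category matches, with the threshold tuned against papers that curators have already triaged.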
...and making curation statistics more transparent to users.
• Users could search for curation status of any paper
• Users could search for curation status of a given data type
• Each database release would report newly curated papers
• Each database release would document increases in data-type curation

WormBase Literature Curation
First Pass, Genetic Interactions: Andrei Petcherski, Caltech
Gene Symbols, Alleles, Sequence Features, Mapping Data: Mary Ann Tuli, Sanger
Gene Regulation, PWMs: Xiaodong Wang, Caltech; Erich Schwarz, Caltech
Expression Patterns, Antibodies, Transgenes: Wen Chen, Caltech
Gene Function: Concise Descriptions, Gene Ontology: Ranjana Kishore, Caltech; Erich Schwarz, Caltech; Kimberly Van Auken, Caltech
Anatomy Ontology, Cell Function: Raymond Lee, Caltech
Microarrays, SAGE: Igor Antoshechkin, Caltech
Mutant Phenotypes (RNAi and Alleles): Igor Antoshechkin, Caltech; Jolene Fernandez, Caltech; Raymond Lee, Caltech; Gary Shindelman, Caltech; Karen Yook, Caltech
Curation Tools, Database: Juancarlos Chan, Caltech
Sequence, Gene Structures: Sanger, Wash U
Authors, Papers: Cecilia Nakamura, Daniel Wang