
Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation



Presentation Transcript


  1. Improving Curation Efficiency: User Contributions and Textpresso-Based Semi-Automation
     WormBase Literature Curators
     Textpresso
     SAB 2008

  2. How does data get into WormBase?
     Institution: Sanger Institute
     SUBMITTED FROM PAGE: http://www.wormbase.org/db/seq/gbrowse/elegans/
     COMMENT TEXT: Dear WormBase, I think that WormBase may be missing a gene between Y50E8A.6 and Y50E8A.7......
     • User submission (email, web forms)
     • First-pass curation

  3. Current first-pass curation pipeline
     Publication -> Flagging/Triage -> Curation

  4. User submissions: first-pass flagging/triage
     • Growing desire among biocurators for user submissions
     • The authors are the first people to know what data are in a paper
     • TAIR – partnered with Plant Physiology
       web interface for data submission (February 2008)
       voluntary, link included in acceptance letter
       Submitter email
       Paper identifier
       Locus name
       Term/descriptor, method
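The submission fields listed on this slide can be pictured as a small record type. The sketch below is only an illustration: the class name, the field names, the validation rules, and the sample values are all assumptions, not TAIR's or WormBase's actual form schema.

```python
from dataclasses import dataclass, fields


@dataclass
class Submission:
    """One author-submitted first-pass record (hypothetical field names)."""
    submitter_email: str
    paper_identifier: str
    locus_name: str
    term: str
    method: str


def validation_errors(sub):
    """Return a list of problems; an empty list means the record looks usable."""
    # Every field is required in this sketch; report the empty ones in field order.
    errors = [f"{f.name} is empty" for f in fields(sub)
              if not getattr(sub, f.name).strip()]
    # Very loose email check, enough to catch obvious typos.
    if sub.submitter_email.strip() and "@" not in sub.submitter_email:
        errors.append("submitter_email has no '@'")
    return errors


# Placeholder values, not real submissions.
good = Submission("author@example.org", "PAPER-0001", "LOCUS-1",
                  "example term", "example method")
bad = Submission("author.example.org", "PAPER-0001", "LOCUS-1",
                 "example term", "")
print(validation_errors(good))
print(validation_errors(bad))
```

Returning a list of problems rather than raising on the first one lets a web form show every issue to the author in a single round trip.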

  5. User-submitted first-pass flags - WormBase

  6. User data-submission forms: Expression Pattern

  7. Data extraction: Textpresso
     • Full-text searching
     • Keywords and/or categories
     Müller, Kenny, and Sternberg. PLoS Biology, November, 2004.

  8. Textpresso: What data types?
     • Paper – entity association: pattern matching
       Transgenes (Wen): WBPaper00031242 – gqIs3, gqIs35, oxIs12
     • Fact extraction: specialized categories
       Genetic interactions (Andrei): eor-2(op166) suppresses HSN death in the strong tra-1(e1099) background, but not noticeably in the weaker tra-1(e1076) background.
       GO cellular component curation (Kimberly): ...positions of these neurons are indicated with circles and localizations of GAR-3::YFP on the cell membranes are denoted by arrows.
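The paper-entity association by pattern matching described on this slide can be sketched with a regular expression. This is a minimal illustration, not Textpresso's actual code: the pattern (a short lowercase lab prefix, then "Is" or "Ex", then digits) is an assumption modeled on the transgene names shown on the slide, and the sample sentence is made up.

```python
import re
from collections import defaultdict

# Assumed naming pattern: 1-3 lowercase letters, "Is" or "Ex", then digits.
TRANSGENE_RE = re.compile(r"\b[a-z]{1,3}(?:Is|Ex)\d+\b")


def associate_transgenes(papers):
    """Map each paper ID to the sorted, de-duplicated transgene names in its text."""
    links = defaultdict(set)
    for paper_id, text in papers.items():
        links[paper_id].update(TRANSGENE_RE.findall(text))
    return {pid: sorted(names) for pid, names in links.items()}


# Made-up sentence; only the paper ID and transgene names come from the slide.
papers = {
    "WBPaper00031242": "Strains carrying gqIs3, gqIs35 and oxIs12 were "
                       "crossed; gqIs3 was outcrossed.",
}
result = associate_transgenes(papers)
print(result)
```

Collecting matches into a set per paper means a transgene mentioned many times yields a single paper-transgene connection.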

  9. Textpresso-mediated CC curation: from sentences to annotations

  10. Textpresso: How much data?
      • Transgenes: 1,100 new paper-transgene connections
        250 new transgenes checked manually – 95% accuracy
        ultimately, connections will go directly into database
      • Genetic Interactions: 1,875 (1/2007 – 5/2008)
        ~5,600 total interactions
        keeping current with new papers
      • GO Cellular Component Annotations: 515 (1/2007 – 5/2008)
        2-3X rate prior to categories
        nearly complete
        keeping up with new data (1-2 hours/week)

  11. Textpresso: Other data types
      How else can we use Textpresso?
      Other data types: Molecular Function Assays, Gene Product Interactions
      • Pilot: GO molecular function annotations for protein kinase activity
        keyword: phosphorylate
        category: C. elegans proteins
        13 new GO annotations/hour
        Extension of this: protein modifications – not yet captured in WB
      • Pilot: Gene product interactions for WB and BIND
        keywords: physically interact
        category: C. elegans proteins
        310 matches in 237 documents
        22 physical interactions – top 15 papers
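The two pilots combine a keyword with a category. A rough sketch of that kind of search is below, assuming a category can be approximated by a flat set of names and a keyword by a case-insensitive prefix match. The function names, the sentence splitter, the placeholder protein names (PROT-A, PROT-B), and the sample text are all hypothetical; this does not reproduce Textpresso's real category lexicon or query engine.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def search(text, keyword, category_terms):
    """Return (sentence, matched terms) for sentences that contain the
    keyword as a word prefix and at least one term from the category."""
    keyword_re = re.compile(r"\b" + re.escape(keyword), re.IGNORECASE)
    hits = []
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence))
        found = sorted(tokens & category_terms)
        if keyword_re.search(sentence) and found:
            hits.append((sentence, found))
    return hits


# Placeholder category standing in for "C. elegans proteins".
PROTEINS = {"PROT-A", "PROT-B"}
text = ("PROT-A phosphorylates PROT-B in vitro. "
        "Animals were grown at 20 degrees. "
        "PROT-B is expressed in neurons.")
hits = search(text, "phosphorylate", PROTEINS)
for sentence, terms in hits:
    print(terms, "|", sentence)
```

Requiring both conditions in the same sentence is what narrows a full-text search down to candidate statements a curator can check quickly.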

  12. Textpresso for triage: Classifying text based on content
      • Multiple levels:
        • Organismal triage – C. elegans, Drosophila
        • Identify, prioritize information-rich papers
        • Flag for specific data types
      • Multiple strategies (using existing first-pass papers as training set):
        • Machine learning – SVM (Support Vector Machine)
          Word frequency analysis
        • Hand-crafted categories
        • Combine SVM and categories
        • Supplement with word weighting, contextual analyses
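As a toy illustration of the word-frequency strategy with a training set of already-flagged papers, the sketch below learns a log-odds weight per word and scores new text by summing those weights. It is deliberately not an SVM, and it is not the curators' pipeline: the smoothing scheme, the function names, and the four training snippets are assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def train_word_weights(flagged, unflagged):
    """Log-odds weight per word, with add-one smoothing on both classes."""
    pos = Counter(w for doc in flagged for w in tokenize(doc))
    neg = Counter(w for doc in unflagged for w in tokenize(doc))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }


def score(text, weights):
    """Sum the weights of known words; a positive total suggests flagging."""
    return sum(weights.get(w, 0.0) for w in tokenize(text))


# Made-up snippets standing in for first-pass papers flagged (or not)
# for one data type, here genetic interactions.
flagged = ["mutation suppresses the phenotype",
           "allele enhances the defect in double mutants"]
unflagged = ["protein localizes to the membrane",
             "expression was observed in neurons"]
weights = train_word_weights(flagged, unflagged)
print(score("the allele suppresses the defect", weights) > 0)
print(score("protein expression in the membrane", weights) > 0)
```

One weight table per data type would give one flag per data type, which matches the "flag for specific data types" level on the slide. Hand-crafted categories could be added as extra features alongside the words.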

  13. Keeping better track of curation statistics.....

  14. .....and making curation statistics more transparent to users.
      • Users could search for curation status of any paper
      • Users could search for curation status of a given data type
      • Each database release would report newly curated papers
      • Each database release would document increases in data-type curation
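The four proposals above amount to three lookups over one table of curation events. A minimal sketch follows, assuming each event is stored as a (paper ID, data type, release) row. The row layout, function names, paper IDs, and release labels are all placeholders, not WormBase's actual schema or data.

```python
from collections import defaultdict

# (paper ID, data type, release in which it was curated); made-up rows.
RECORDS = [
    ("WBPaper00000001", "transgene", "WS190"),
    ("WBPaper00000001", "expression pattern", "WS191"),
    ("WBPaper00000002", "transgene", "WS191"),
]


def status_of_paper(records, paper_id):
    """Data types already curated for one paper, mapped to their release."""
    return {dtype: rel for pid, dtype, rel in records if pid == paper_id}


def papers_for_data_type(records, data_type):
    """Sorted IDs of papers curated for one data type."""
    return sorted(pid for pid, dtype, _ in records if dtype == data_type)


def release_report(records, release):
    """Papers touched in one release, plus per-data-type curation counts."""
    papers = set()
    counts = defaultdict(int)
    for pid, dtype, rel in records:
        if rel == release:
            papers.add(pid)
            counts[dtype] += 1
    return sorted(papers), dict(counts)


print(status_of_paper(RECORDS, "WBPaper00000001"))
print(papers_for_data_type(RECORDS, "transgene"))
print(release_report(RECORDS, "WS191"))
```

Keeping one row per curation event, rather than one status flag per paper, is what lets the same table answer both the per-paper and the per-release questions.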

  15. WormBase Literature Curation
      First Pass, Genetic Interactions: Andrei Petcherski, Caltech
      Gene Symbols, Alleles, Sequence Features, Mapping Data: Mary Ann Tuli, Sanger
      Gene Regulation, PWMs: Xiaodong Wang, Caltech; Erich Schwarz, Caltech
      Expression Patterns, Antibodies, Transgenes: Wen Chen, Caltech
      Gene Function: Concise Descriptions, Gene Ontology: Ranjana Kishore, Caltech; Erich Schwarz, Caltech; Kimberly Van Auken, Caltech
      Anatomy Ontology, Cell Function: Raymond Lee, Caltech
      Microarrays, SAGE: Igor Antoshechkin, Caltech
      Mutant Phenotypes (RNAi and Alleles): Igor Antoshechkin, Caltech; Jolene Fernandez, Caltech; Raymond Lee, Caltech; Gary Shindelman, Caltech; Karen Yook, Caltech
      Curation Tools, Database: Juancarlos Chan, Caltech
      Sequence, Gene Structures: Sanger, Wash U
      Authors, Papers: Cecilia Nakamura, Daniel Wang
