DATA MANAGEMENT AND CURATION AT TAIR

Margarita Garcia-Hernandez

The ‘systems biology’ paradigm
• FACT: huge amounts of data
• NEED: systematic harvesting and easy accessibility (store, sort, interlink)
• PROBLEM: complexity and heterogeneity of data
• CHALLENGE: to describe complete biological systems in an integrated way (organizing, defining relationships, defining metadata standards, interpreting, quality control assessment), i.e. DATA MANAGEMENT

Data Management Flow Chart
Data generation
Collection
Selection
• Organization of similar data types
• Remove redundancy
• Correct errors
Data curation
• Association of different data types
• Establish unambiguous identifiers
• Define and validate relationships
Resolve data heterogeneity (standardization)
• Annotation: add descriptions
• Define standard vocabulary
Database population
Data modeling
Data dissemination

Quality Control Issues
• Accuracy of information
• Consistency in format and content
• Currency (keeping data up to date)
• Conflicting data

TAIR’s approaches
• Personnel training (Ph.D. level biologists)
• User input
• Source attribution
• Checking curation consistency (computationally and manually)
• Adopt Standard Operating Procedures (SOPs)
• Define and use controlled vocabularies

Many Data Types with Many Sources
Sources: Literature; Public Databases; Community submissions; Functional Genomic projects; Computational analysis
Data Types: Genes/Gene Products; Mutant Phenotypes; Expression; Metabolism; Stocks

Two Examples of Data Curation
Sources: Literature; Public Databases (SMD); Community submissions; Functional Genomic projects (AFGC); Computational analysis
Data Types: Genes/Gene Products; Mutant Phenotypes; Microarray Expression; Metabolism; Stocks

Literature Curation: PubSearch
• A literature curation management system designed to store and manage the available literature for an organism of interest
• PubSearch software is freely available at http://pubsearch.org
• Generic Model Organism Database (GMOD), http://www.gmod.org: a joint effort by several model organism databases to develop reusable components for creating new biological databases

Literature Curation Step 1: Collection of References
• Sources: PubMed, Agricola, Biosis (‘Arabidopsis’ in title or abstract)
• Also collected: meeting abstracts, dissertations, textbooks
• Full text papers (scanning, online)
• Arabidopsis References: remove redundancy, standardize journal names
• PubSearch DB (21,527) (curation tool)
• TAIR DB (public db)
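The collection step above (keyword filter, redundancy removal, journal name standardization) could be sketched roughly as follows. This is only an illustration, not PubSearch's actual code: the record fields, the `JOURNAL_ALIASES` table, and the (title, year) duplicate key are all assumptions made for the example.

```python
import re

# Hypothetical table of journal-name variants mapped to one standard form.
JOURNAL_ALIASES = {
    "plant j": "Plant Journal",
    "plant journal": "Plant Journal",
    "the plant journal": "Plant Journal",
}

def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())

def standardize_journal(name):
    """Return the standard journal name if a variant is known, else the input."""
    return JOURNAL_ALIASES.get(normalize(name), name.strip())

def merge_references(records):
    """Keep Arabidopsis references only; keep the first record per (title, year)."""
    merged = {}
    for rec in records:
        searchable = normalize(rec["title"] + " " + rec.get("abstract", ""))
        if "arabidopsis" not in searchable:
            continue  # keyword filter: 'Arabidopsis' in title or abstract
        key = (normalize(rec["title"]), rec["year"])  # assumed duplicate key
        if key not in merged:
            merged[key] = {**rec, "journal": standardize_journal(rec["journal"])}
    return list(merged.values())

# Made-up records standing in for downloads from three sources.
records = [
    {"title": "A MADS-box gene in Arabidopsis.", "year": 2001,
     "journal": "Plant J.", "source": "PubMed"},
    {"title": "A MADS-box gene in Arabidopsis", "year": 2001,
     "journal": "The Plant Journal", "source": "Agricola"},
    {"title": "Rice genome notes", "year": 2002,
     "journal": "Plant J", "source": "Biosis"},
]
unique = merge_references(records)
```

A real pipeline would likely need fuzzier duplicate detection (author lists, page numbers), since titles can differ between sources.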

Literature Curation Step 2: Assigning References to Genes
• Term List (17,470): known gene names and candidate gene names
• Arabidopsis References (PubSearch) are scanned programmatically for terms in the list
• Reference hits: Gene X: Ref 1, Ref 2, ..., Ref n
• Validation (by curators using PubSearch)
• Result: validated list of references for each gene
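A minimal sketch of the programmatic scanning step, assuming references are available as plain text keyed by an identifier. The gene names, reference texts, and matching rule (whole-token, case-insensitive) are illustrative assumptions; the slides do not describe how PubSearch matches terms.

```python
import re

def build_pattern(terms):
    """Compile one regex matching any term as a whole token, ignoring case."""
    ordered = sorted(terms, key=len, reverse=True)  # prefer longer names first
    alternation = "|".join(re.escape(t) for t in ordered)
    # Guards stop "AP1" from matching inside a longer token such as "MAP1".
    return re.compile(
        r"(?<![A-Za-z0-9])(" + alternation + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )

def scan_references(references, terms):
    """Return {term: [reference ids]}: candidate hits awaiting curator validation."""
    pattern = build_pattern(terms)
    canonical = {t.lower(): t for t in terms}
    hits = {}
    for ref_id, text in references.items():
        found = {m.group(1).lower() for m in pattern.finditer(text)}
        for term in found:
            hits.setdefault(canonical[term], []).append(ref_id)
    return hits

# Made-up term list and reference texts.
terms = ["AP1", "APETALA1", "CLV3"]
references = {
    "ref1": "APETALA1 (AP1) specifies floral meristem identity.",
    "ref2": "CLV3 signalling restricts stem cell number.",
    "ref3": "The MAP1 kinase is unrelated.",
}
hits = scan_references(references, terms)
```

The output is only a candidate list: short or ambiguous gene names produce false hits, which is why the slide places curator validation after the scan.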

Literature Curation Step 3: Extracting Information
• Gene-centric curation approach
• Each curator is assigned 2 genes per day
• Papers are read and information extracted (following SOPs and using the PubSearch curation tool):
  - Name validation and adding aliases
  - Add sequence info
  - Assign locus (mapping to the genome by BLAST)
  - Merge/split genes
  - Write summary sentence
  - Correct errors
  - Annotation using controlled vocabularies (GO, POC)

Controlled Vocabularies
• A collection of defined terms (organized in a hierarchy) intended to serve as a standard nomenclature
• Provide a common set of terms that users of a single system (or across multiple systems) can share
• Allow retrieval of ALL relevant information
Example:
• Find all the genes that have transporter activity (regardless of how they are named, or what type they are)
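The transporter example can be sketched with a toy hierarchy: because terms are organized as parent and child, a query for a general term also retrieves genes annotated to its more specific descendants. The term names, gene identifiers, and data structures below are invented for illustration and are not taken from GO or TAIR.

```python
# Hypothetical mini-vocabulary: child term -> list of parent terms.
PARENTS = {
    "sugar transporter activity": ["transporter activity"],
    "ion transporter activity": ["transporter activity"],
    "potassium ion transporter activity": ["ion transporter activity"],
    "kinase activity": ["catalytic activity"],
}

# Hypothetical gene annotations, each to the most specific known term.
ANNOTATIONS = {
    "GENE_A": ["sugar transporter activity"],
    "GENE_B": ["potassium ion transporter activity"],
    "GENE_C": ["kinase activity"],
}

def ancestors(term):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def genes_with(term):
    """Genes annotated to the term itself or to any of its descendants."""
    return sorted(
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )

transporters = genes_with("transporter activity")
```

Neither gene is named "transporter", yet both are found, which is the point of annotating with a shared hierarchy rather than free text.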

Controlled Vocabularies used at TAIR
• Gene Ontology (GO) http://www.geneontology.org/
  Goal: to produce a controlled vocabulary for describing genes and proteins that can be applied to all organisms
  - Molecular Function
  - Cellular Component
  - Biological Process
• Plant Ontology Consortium (POC) http://www.plantontology.org/
  Gramene, TAIR, Univ Missouri St Louis, MaizeDB, IRIS, MIPS, Oryzabase, with Monsanto and Pioneer as collaborators
  Goal: to develop structured controlled vocabularies for plant-specific knowledge domains:
  - Plant Anatomy (morphology, organs, tissue and cell types)
  - Temporal stages (plant growth and developmental stages)
  - Phenotype Ontology (in the works)

Qualifying Annotations with Supporting Evidence
• References
• Evidence code usage: a controlled vocabulary that records the type of evidence supporting the association between a gene product and an annotation
  - IDA: Inferred from Direct Assay
  - IMP: Inferred from Mutant Phenotype
  - ISS: Inferred from Sequence Similarity
  - IEA: Inferred from Electronic Annotation
  - IEP: Inferred from Expression Pattern
  - ……
• Evidence code description, e.g., IPI: Inferred from Physical Interaction
  - Co-immunoprecipitation
  - Co-purification
  - Co-sedimentation
  - ….
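As a rough sketch of how an annotation might carry its evidence code and reference, and how software could reject records that lack them: the `Annotation` class, its field names, and the `check` function are hypothetical, and only the six evidence codes named on the slide are included.

```python
from dataclasses import dataclass

# Evidence codes listed on the slide; the full set is larger.
EVIDENCE_CODES = {
    "IDA": "Inferred from Direct Assay",
    "IMP": "Inferred from Mutant Phenotype",
    "ISS": "Inferred from Sequence Similarity",
    "IEA": "Inferred from Electronic Annotation",
    "IEP": "Inferred from Expression Pattern",
    "IPI": "Inferred from Physical Interaction",
}

@dataclass(frozen=True)
class Annotation:
    """Hypothetical record: a gene linked to a vocabulary term with evidence."""
    gene: str
    term: str
    evidence_code: str
    reference: str

def check(annotation):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if annotation.evidence_code not in EVIDENCE_CODES:
        problems.append(f"unknown evidence code: {annotation.evidence_code}")
    if not annotation.reference.strip():
        problems.append("missing reference")
    return problems

# Made-up example records.
good = Annotation("GENE_A", "transporter activity", "IDA", "ref1")
bad = Annotation("GENE_B", "kinase activity", "XYZ", "")
```

Here `check(good)` returns an empty list, while `check(bad)` reports both an unknown code and a missing reference.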

Gene Annotation Display in TAIR

QC of Literature Curation
• Weekly annotation meeting
• Quality control manager
• Use of standardized vocabularies
• Random checks of annotations
• Annotations are tagged by date and curator
• Automatic checks in software
• Use SOPs (curation guidelines)

Curation of Microarray Data
Sources: Literature; Public Databases (SMD); Community submissions; Functional Genomic projects (AFGC); Computational analysis
Data Types: Genes/Gene Products; Mutant Phenotypes; Microarray Data; Metabolism; Stocks

Curation of AFGC Microarray Data: Data Collection and Selection
Sources: Stanford Microarray Database; Arabidopsis Functional Genomics Consortium
Collected:
• results
• array design
• minimal descriptions of individual arrays
• sample info
• proposal abstracts
• protocols
Selection:
- All Arabidopsis public arrays
- Exclude QC arrays (45)
Selected Arrays (516):
• Numeric Results Data (raw and normalized)
• Metadata: Array Elements, Experiments, Samples, Protocols

Curation of Metadata
Array elements
1. Classify, organize, add missing sequences, correct errors
2. Mapping to the Arabidopsis genome and association to genes (pipeline)
Samples & Experiments
1. Data extraction from flat files (abstracts, RNA forms) and database (SMD), e.g., tissue type, treatments, experimental design
2. Organization of data and parsing into tables
3. Develop controlled vocabularies for experiment categorization and treatments
4. Standardization using those vocabularies
5. Data association
   - grouping: arrays → replicate sets → experiments
   - merging replicate samples to minimize redundancy
   - linking to other related data (germplasm, clones, publications, people)
6. Annotation
   - Experiments: GO process, category, experimental variables
   - Samples: tissues (POC anatomy & temporal) and treatment
Data Submission: http://arabidopsis.org/info/microarray.submission.jsp
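Steps 4 and 5 above (standardizing free-text sample descriptions, then grouping arrays into replicate sets) might look something like this. The synonym table, field names, and the choice of (experiment, tissue, treatment) as the replicate key are assumptions for the sketch, not TAIR's documented rules.

```python
from collections import defaultdict

# Hypothetical synonym table mapping free text to a controlled tissue term.
TISSUE_SYNONYMS = {
    "leaves": "leaf",
    "rosette leaves": "leaf",
    "roots": "root",
    "root tissue": "root",
}

def standardize(value, synonyms):
    """Lowercase, collapse whitespace, then map known synonyms to one term."""
    key = " ".join(value.lower().split())
    return synonyms.get(key, key)

def group_replicates(arrays):
    """Group array ids that share experiment, tissue, and treatment."""
    groups = defaultdict(list)
    for array in arrays:
        key = (
            array["experiment"],
            standardize(array["tissue"], TISSUE_SYNONYMS),
            standardize(array["treatment"], {}),  # no treatment synonyms here
        )
        groups[key].append(array["array_id"])
    return dict(groups)

# Made-up sample descriptions as they might appear in submission forms.
arrays = [
    {"array_id": "a1", "experiment": "exp1", "tissue": "Leaves", "treatment": "cold"},
    {"array_id": "a2", "experiment": "exp1", "tissue": "rosette  leaves", "treatment": "Cold "},
    {"array_id": "a3", "experiment": "exp1", "tissue": "roots", "treatment": "cold"},
]
replicate_sets = group_replicates(arrays)
```

Without the standardization step, "Leaves" and "rosette  leaves" would land in separate groups and the replicate relationship between a1 and a2 would be missed.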

Curation of Microarray Results Data (Numeric Results Data)
- Quality control
  • Remove poor quality arrays (2)
  • Exclude spots flagged as bad
  • Re-normalize using lowess method (minimize spatial bias)
  • Remove arrays with strong spatial/plate bias (72) (ANOVA)
  • Exclude array elements with intensity < 350 in both channels
  • Exclude array elements with null values in 80% of arrays
- Analysis
  • Calculate log2 ratio [ch2N/ch1N]
  • Calculate fold change [ch2N/ch1N]
  • Calculate averages for each array element (array & replicates)
  • Element fold change/log2 ratio std error per array
  • Element fold change/log2 ratio std error per replicate arrays
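A few of the filters and calculations above can be sketched in plain Python. This is a simplified illustration: lowess normalization and the ANOVA bias test are omitted, function names are invented, "null values in 80% of arrays" is read here as "at least 80%", and fold change is taken as the plain ch2N/ch1N ratio.

```python
import math
from statistics import mean, stdev

MIN_INTENSITY = 350  # threshold from the slide

def keep_spot(ch1, ch2, flagged_bad=False):
    """Drop flagged or null spots and spots below threshold in BOTH channels."""
    if flagged_bad or ch1 is None or ch2 is None:
        return False
    return not (ch1 < MIN_INTENSITY and ch2 < MIN_INTENSITY)

def too_many_nulls(values, max_null_fraction=0.8):
    """True if the element is null in at least the given fraction of arrays."""
    nulls = sum(v is None for v in values)
    return nulls / len(values) >= max_null_fraction

def fold_change(ch1n, ch2n):
    """Ratio of normalized channel 2 to normalized channel 1."""
    return ch2n / ch1n

def log2_ratio(ch1n, ch2n):
    """log2 of the normalized channel ratio."""
    return math.log2(fold_change(ch1n, ch2n))

def summarize(ratios):
    """Mean and standard error of an element's ratios across replicates."""
    n = len(ratios)
    std_error = stdev(ratios) / math.sqrt(n) if n > 1 else None
    return mean(ratios), std_error

# Made-up normalized intensities for one element on two replicate arrays.
replicate_ratios = [log2_ratio(100.0, 200.0), log2_ratio(100.0, 800.0)]
average, std_error = summarize(replicate_ratios)
```

A spot that is dim in only one channel is kept, since a strong signal in the other channel may reflect a genuine expression difference.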

Conclusions
Data curation:
• Requires trained biologists familiar with the data
• Can be facilitated computationally (repetitive tasks), but is mainly a knowledge-based task that can only be done by humans
• Is essential for assuring data quality
• Adds value to data
• Is a slow process
• Can be inconsistent
