Gene expression data in VectorBase
Fotis Kafatos, George Christophides, Bob MacCallum & Seth Redmond
Imperial College London (thanks also to EBI, Sanger and ND)
Outline
• Project goals
• What’s currently available
• Current challenges and future plans
Project goals
• For vector biologists:
  • Easy access to gene expression data
  • Consistent data processing
• For array specialists:
  • ArrayExpress submission
  • Advanced analysis tools
  • Array annotation
[Diagram: expression data; bulk loader; storage & analysis]
• BASE: BioArray Software Environment
• http://base.thep.lu.se/
• Open source, with active development and an active user community
• LIMS, data storage, export and analysis
• Web-based, with user/group access control
• BASE 2.x adoption will bring Affymetrix (Affy) support
Data submission
• Community submission guidelines available
• First batch of experiments loaded by us
• Bulk data loader
• Sample/experiment annotation requires intervention from curators
[Diagram: ArrayExpress; expression data; bulk loader; ‘public’ storage; storage & analysis]
• Data held in BASE is largely MIAME-compliant
• Script for semi-automated export in Tab2MAGE format
• One experiment submitted so far
[Diagram: ArrayExpress; expression data; bulk loader; ‘public’ storage; storage & analysis; data summaries]
• BASE web interface offers a powerful and extendable analysis environment
• Can be used for multi-site collaborations on pre-publication data
• Steep learning curve; not entirely intuitive
• Not easily linked to
• We provide simpler views so the casual user can quickly draw biological inferences
Standardised data
All displayed data is processed in the same way:
• Poor quality spots removed
  • Currently using submitted spot flags
• Normalisation
  • “lowess” for two-colour experiments
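The two processing steps above can be sketched as follows. This is a minimal illustration and not the VectorBase pipeline: the spot record layout (`id`, `red`, `green`, `flagged`) and the function names are hypothetical, and the smoother is a simplified lowess (tricube-weighted local linear fit, no robustness iterations) written with the standard library only.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot: log2 ratio and mean log2 intensity."""
    m = math.log2(red) - math.log2(green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a

def lowess_fit(xs, ys, frac=0.5):
    """Simplified lowess: local linear fit with tricube weights at each x.

    Assumption: no robustness iterations, unlike full lowess.
    """
    n = len(xs)
    k = max(2, math.ceil(frac * n))  # neighbours used per local fit
    fitted = []
    for x0 in xs:
        # Bandwidth = distance to the k-th nearest point (self included).
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def normalise(spots, frac=0.5):
    """Drop flagged spots, then subtract the intensity-dependent trend from M."""
    good = [s for s in spots if not s["flagged"]]  # uses submitted spot flags
    ma = [ma_values(s["red"], s["green"]) for s in good]
    m_vals = [m for m, _ in ma]
    a_vals = [a for _, a in ma]
    trend = lowess_fit(a_vals, m_vals, frac)
    return {s["id"]: m - t for s, m, t in zip(good, m_vals, trend)}

# Hypothetical two-colour spots with a constant dye bias (red = 2 x green).
example = [
    {"id": f"spot{i}", "red": 2.0 * g, "green": g, "flagged": False}
    for i, g in enumerate([100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0])
]
example.append({"id": "dust", "red": 50.0, "green": 5000.0, "flagged": True})
print(normalise(example))
```

Filtering before fitting matters: a flagged spot left in would pull the local trend and distort the normalised ratios of its neighbours.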
• 3 probe types, 6 array designs
• Mapping handled via Ensembl pipeline:
  Probe type    Mapping tool
  Oligo         exonerate
  PCR           e-PCR
  cDNA          exonerate2genes
[Diagram: ArrayExpress; expression data; bulk loader; probe mapping; ‘public’ storage; storage & analysis; data summaries]
[Diagram: VectorBase; ArrayExpress; expression data; genomic data; bulk loader; probe mapping; automatic annotation; GFF3; ‘public’ storage; storage & analysis; data summaries; genome browser]
[Diagram: VectorBase; ArrayExpress; expression data; genomic data; bulk loader; probe mapping; automatic annotation; ‘public’ storage; storage & analysis; data summaries; genome browser; data mining. Audiences: array biologists; genome biologists; vector biologists]
BioMart
• Beta version currently available
• http://base.vectorbase.org:9999/biomart/martview
• Improvements still needed:
  • Experiment annotations
  • Alignments (i.e. handle split alignments)
  • Federation with current marts
  • Integration with new data?
Current challenges and future plans
• How do you want to query?
• CVs & ontologies
• APIs
• Community submission
• Manual annotation
Querying strategy
• What do you want to query on?
  • Fetch all genes upregulated under condition X
  • Fetch all experiments with gene X and condition Y
  • Fetch all probes with expression similar to probe X
• All essentially boil down to:
  • Define probe (genes etc.)
  • Define significant expression
    • ANOVA?
    • Up/down-regulation with respect to what?
  • Define experimental conditions
    • Sample annotation
    • Experimental design
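One way to see how the example queries reduce to the same three definitions is a single query function that takes one predicate per definition. This is only a sketch: the `Measurement` record, the gene and condition names, and the fixed log2-ratio cut-off standing in for "significant expression" are all assumptions, since the slide leaves the significance test open.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    """Hypothetical record: one probe value in one experiment."""
    probe: str
    gene: str          # gene the probe maps to; "" if unmapped
    experiment: str
    condition: str     # taken from sample annotation
    log2_ratio: float  # relative to the experiment's reference sample

def query(measurements, probe_filter, is_significant, condition_filter):
    """Keep measurements passing all three definitions from the slide."""
    return [
        m for m in measurements
        if probe_filter(m) and is_significant(m) and condition_filter(m)
    ]

def upregulated_genes(measurements, condition, min_log2=1.0):
    """'Fetch all genes upregulated under condition X' (assumed cut-off)."""
    hits = query(
        measurements,
        probe_filter=lambda m: m.gene != "",                # mapped probes only
        is_significant=lambda m: m.log2_ratio >= min_log2,  # assumption, not ANOVA
        condition_filter=lambda m: m.condition == condition,
    )
    return sorted({m.gene for m in hits})

def experiments_with(measurements, gene, condition):
    """'Fetch all experiments with gene X and condition Y'."""
    hits = query(
        measurements,
        probe_filter=lambda m: m.gene == gene,
        is_significant=lambda m: True,  # no expression constraint here
        condition_filter=lambda m: m.condition == condition,
    )
    return sorted({m.experiment for m in hits})

# Hypothetical data for illustration only.
data = [
    Measurement("p1", "geneA", "exp1", "blood-fed", 2.1),
    Measurement("p2", "geneB", "exp1", "blood-fed", 0.2),
    Measurement("p3", "", "exp1", "blood-fed", 3.0),
    Measurement("p4", "geneA", "exp2", "sugar-fed", 1.5),
]
print(upregulated_genes(data, "blood-fed"))
print(experiments_with(data, "geneA", "sugar-fed"))
```

Keeping the significance test as a pluggable predicate means the open question on the slide (ANOVA, or up/down-regulation relative to some baseline) can be settled later without changing the query shape.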
[Diagram: ArrayExpress; expression data; genomic data; bulk loader; probe mapping; automatic annotation; CV / ontology; ‘public’ storage; storage & analysis; data summaries; genome browser; data mining. Audiences: array biologists; genome biologists; vector biologists]
[Diagram: ArrayExpress; expression data; genomic data; bulk loader; probe mapping; CV / ontology; automatic annotation; ‘public’ storage; storage & analysis; data summaries; genome browser; data mining. Access layers marked: AE API (?); e! API; MartJ / MQL; Array API (?)]
Array API
Perl / Java objects for retrieval / handling of array data
• Dual purpose:
  • Consistency & efficiency of VB expression website
  • Computational access to VB data for all
• Objects must be:
  • General, DB-independent
  • Compatible with pre-existing Bio API (BioPerl / BioJava)
• NB: there may be a pre-existing solution:
  • ArrayExpress API?
  • BioPerl-Expression?
  • MAGE-OM-stk
• http://neuron.cse.nd.edu/vectorbase/index.php/Array_API_proposal
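The "general, DB-independent" requirement can be illustrated with an abstract interface that client code depends on, plus interchangeable back ends. The proposal targets Perl / Java; the Python below is purely illustrative, and every class, method and identifier in it is hypothetical, not part of the proposed API.

```python
from abc import ABC, abstractmethod
from statistics import mean

class ExpressionSource(ABC):
    """Hypothetical storage-agnostic interface for array data."""

    @abstractmethod
    def probes_for_gene(self, gene_id):
        """Return the probe IDs mapped to a gene."""

    @abstractmethod
    def values_for_probe(self, probe_id, experiment_id):
        """Return the expression values of a probe in an experiment."""

class InMemorySource(ExpressionSource):
    """Toy back end; a database-backed class would expose the same methods."""

    def __init__(self, probe_to_gene, values):
        self._probe_to_gene = dict(probe_to_gene)
        self._values = dict(values)  # (probe_id, experiment_id) -> [floats]

    def probes_for_gene(self, gene_id):
        return sorted(p for p, g in self._probe_to_gene.items() if g == gene_id)

    def values_for_probe(self, probe_id, experiment_id):
        return list(self._values.get((probe_id, experiment_id), []))

def gene_mean(source, gene_id, experiment_id):
    """Client code sees only the interface, never the storage layer."""
    vals = [
        v
        for p in source.probes_for_gene(gene_id)
        for v in source.values_for_probe(p, experiment_id)
    ]
    return mean(vals) if vals else None

# Hypothetical identifiers for illustration only.
source = InMemorySource(
    {"probe1": "geneA", "probe2": "geneA", "probe3": "geneB"},
    {("probe1", "exp1"): [1.0, 3.0], ("probe2", "exp1"): [2.0]},
)
print(gene_mean(source, "geneA", "exp1"))
```

Under this design, both the website and external scripts would call the same interface, which is what the dual-purpose goal asks for.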
Community data submission
• Carrot?
• Help with ArrayExpress submission
• Analysis tools
• Dissemination
• Stick?
• Outreach (courses, conferences)
• Networking
GE data manual annotators
• Gene-build designed arrays
• Negative evidence less compelling
• EST clone-based arrays
• http://tinyurl.com/vlkwo
Longer-term plans
• Host-parasite GE data integration & analysis
• GE clusters: “upstream” regions, regulatory elements, upstream TFs
• RNAi phenotypes
• Images
CVs & ontologies
• Integrate MGED and specialist ontologies for
  • Body parts
  • Developmental stages
  • Disease processes
  • …
• Allows comparison across experiments with similar experimental conditions
BioMart
Most biomarts:
• Gene-based
• Mostly ‘binary’ data, e.g. a gene either has a signal domain or doesn’t
• Easily linked with other (gene-based) biomarts
VB BioMart:
• Probe-based
• Many probes not aligned
• Expression data less clear-cut, e.g. how to define ‘differential expression’
• Exports gene/transcript IDs for linking to other marts
Clustering
• A priority?
• Easy to do on reporter level within experiments
• Harder to do at gene level across all experiments
• Binary gene profile: “yes/no differentially expressed in experiment”?
• Amazon-style links to “genes which may have similar expression profiles”?
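The binary-profile idea can be sketched as follows: each gene is reduced to the set of experiments in which it was called differentially expressed, and "similar genes" are those whose sets overlap most. Jaccard similarity is used here as one plausible overlap measure, not one named on the slide, and the gene and experiment names are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two experiment sets: |a & b| / |a | b| (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_genes(profiles, gene, top=3):
    """Rank other genes by how similar their binary profile is to `gene`.

    `profiles` maps gene ID -> set of experiments where the gene was
    called differentially expressed (the yes/no profile).
    """
    target = profiles[gene]
    scored = [
        (jaccard(target, other), name)
        for name, other in profiles.items()
        if name != gene
    ]
    # Highest similarity first; ties broken by gene name for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top] if score > 0]

# Hypothetical profiles for illustration only.
profiles = {
    "geneA": {"exp1", "exp2", "exp3"},
    "geneB": {"exp1", "exp2"},
    "geneC": {"exp3", "exp4"},
    "geneD": set(),
}
print(similar_genes(profiles, "geneA"))
```

Because the profile only records yes/no per experiment, this works across array designs and platforms, at the cost of discarding the magnitude and direction of the change.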
BASE 2.x
• Adoption delayed, now in progress
• Brings Affymetrix support
• Cleaner, more modern interface
• Better API (Java)