Standards and gene expression data – from data archiving to extracting biological knowledge. Helen Parkinson, PhD, Production Coordinator, European Bioinformatics Institute. Talk content. Data sharing. Standards Landscape. Nature Reviews Genetics, Vol 7, p.593-605 (August 2006).
Standards and gene expression data – from data archiving to extracting biological knowledge
Helen Parkinson, PhD
Production Coordinator, European Bioinformatics Institute
Standards Landscape
Source: Nature Reviews Genetics, Vol 7, p.593-605 (August 2006)
ArrayExpress history
• 2001: 4
• 2002: 12 Expts
• 2003: 100 Expts
• 2004: TIGR Export, 420 Expts
• 2005: Re-funded, SMD Export, 1200 Expts
• 2006: New UI, 1600 Expts
• 2007: GEO Affy Data Import Phase 1, >2607 Expts
• 2008: 6631 Expts
ArrayExpress: Overview
[Diagram labels: Submit; Public/Private; Experiment queries, > 200 species; 134,000 hybs; Re-annotate; Summarize; ATLAS; Public Only; Gene/condition queries; Gene level queries, 9 species; 12,696,527]
Why re-annotate?
• Harmonize annotation
• For integration purposes
• Data ages
Developing an Experimental Factor Ontology
Our Use Cases
• Query support (e.g., query for 'cancer' and get also 'leukemia')
• Over-representation analysis in groups of samples (analogous to the use of GO terms in over-representation analysis in groups of genes)
• Ontology visualisation – e.g., presenting an ontology tree to the user of what is in the database
• Data integration by ontology terms – e.g., we assume that 'kidney' in independent studies roughly means the same, so we can count how many kidney samples we have in the database
• Intelligent template generation for different experiment types in submission or data presentation
• Summary level data
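The query-support and data-integration use cases above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny is-a hierarchy, the sample identifiers and the function names are hypothetical and are not taken from the Experimental Factor Ontology or from ArrayExpress.

```python
from collections import Counter

# Toy is-a hierarchy (child -> parents). Illustrative only, not real EFO content.
IS_A = {
    "leukemia": ["cancer"],
    "acute myeloid leukemia": ["leukemia"],
    "carcinoma": ["cancer"],
    "cancer": ["disease"],
    "kidney": ["organ"],
    "liver": ["organ"],
}

# Hypothetical (sample id, annotated term) pairs from independent studies.
SAMPLES = [
    ("exp1/s1", "leukemia"),
    ("exp1/s2", "kidney"),
    ("exp2/s1", "acute myeloid leukemia"),
    ("exp3/s1", "kidney"),
    ("exp3/s2", "liver"),
]


def descendants(term):
    """Return the term plus every term below it in the hierarchy."""
    found = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parents in IS_A.items():
            if current in parents and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def search(samples, query):
    """Return ids of samples annotated with the query term or a descendant."""
    wanted = descendants(query)
    return [sample_id for sample_id, term in samples if term in wanted]


def count_by_term(samples):
    """Count samples per annotation term across all studies."""
    return Counter(term for _, term in samples)


if __name__ == "__main__":
    print(search(SAMPLES, "cancer"))
    print(count_by_term(SAMPLES)["kidney"])
```

Searching for 'cancer' also returns the samples annotated as 'leukemia' because the query is expanded to every term below it before matching. Counting 'kidney' samples relies on the stated assumption that the term means roughly the same thing in independent studies.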
Oh the complexity!
[Architecture diagram labels: Submissions; ArrayExpress; Array Manufacturers (Affymetrix, Agilent); Other Microarray Databases (SMD, TIGR, Utrecht, RZPD); www; MIAMExpress; tab2mage; MAGE-ML; Submission tracking/curation environment; ArrayExpress repository; Experiment; Sample; Hybridisation; Array; Data; Normalisation; Publication; Source (e.g., Taxonomy); Gene (e.g., EMBL); External links; Warehouse (BioMart); Queries, analysis; Expression Profiler; Data Analysis Software (R/Bioconductor, J-Express, Resolver)]
Developing an Experimental Factor Ontology
Application Ontology: Status Quo
[Diagram labels: ATLAS; DW; AE]
• Text mining at data acquisition
• Tuned for queries, structured for use in ArrayExpress GUI
• Multi-species aspect
Developing an Experimental Factor Ontology
Semantic Roadmap
• Position of the ArrayExpress Experimental Factor Ontology in the ‘bigger picture’
• Key is orthogonal coverage, reuse of existing resources and shared frameworks
[Diagram labels: Chemical Entities of Biological Interest (ChEBI); Relation Ontology; Cell Type Ontology; Various Species Anatomy Ontologies; Anatomy Reference Ontology; Disease Ontology; AE Ontology]
2007 Affymetrix Data landscape
Source: Day et al, Genome Research, 2007
Data exchange – or the failure to federate
• We need all the data in house to re-process it
• We do not have a data exchange agreement with GEO
• SOFT vs. MAGE-ML/MAGE-TAB
• No ontology usage
• Some free text annotation, little process annotation
• Mass data acquisition
• 80% solution (or less)
• Employing text mining
• Data reprocessing
• Cost effective, eliminates user support
• Using spreadsheets (not XML)
• We could almost eliminate the database if we can index the files
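The last point, indexing the files instead of loading them into a database, can be sketched as a small inverted index over tab-delimited annotation files. This is only an illustration under stated assumptions: the file names, column names, contents and function names below are hypothetical, and this is not a description of how ArrayExpress indexes its data.

```python
import csv
import io
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")

# Hypothetical tab-delimited annotation files, held in memory for the example.
FILES = {
    "exp1.txt": "sample\tdescription\ns1\tHuman kidney biopsy\ns2\tHuman liver, control\n",
    "exp2.txt": "sample\tdescription\ns1\tMouse kidney, treated\n",
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())


def build_index(files):
    """Map each token to the set of (file name, data row number) where it occurs."""
    index = defaultdict(set)
    for name, text in files.items():
        reader = csv.reader(io.StringIO(text), delimiter="\t")
        next(reader, None)  # skip the header row
        for row_number, row in enumerate(reader, start=1):
            for cell in row:
                for token in tokenize(cell):
                    index[token].add((name, row_number))
    return index


def query(index, text):
    """Return the rows that contain every token of the query (AND semantics)."""
    tokens = tokenize(text)
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits


if __name__ == "__main__":
    idx = build_index(FILES)
    print(sorted(query(idx, "kidney")))
    print(sorted(query(idx, "human kidney")))
```

The index answers simple text queries directly from the files, which matches the observation that text-based queries are the most common. It does not replace the curation or ontology mapping that the other points on the slide refer to.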
2008 Data Landscape
[Chart comparing ArrayExpress and GEO; values shown: 5000, 8837, 1540, 135,000, 36512, 230749; category labels not recoverable]
Flexible Data Access Models
• GUIs – biologists
• Hyperlinks
• FTP – bioinformaticians
• Web services – workflows
• XML data dumps
• Spreadsheets
• Direct SQL access (not for ArrayExpress)
• Schema and code if you want it
• ‘Geek for a week’
Lessons learned
• Complex architecture means a lot of software engineering
• Biologists like Excel, bioinformaticians like tab-delimited files
• Spreadsheets scale, easy to check, harder to parse
• Generic systems will be future proof
• Legacy format converters are needed
• You don’t need to keep everything
• Text based queries most common
• Text mining very useful
• Scaling problems are hard to fix
• Bleeding edge technologies should be used sparingly
• Federation doesn’t really work for the goals we have
• Archiving alone does not add value
• Training is important and expensive
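The point that spreadsheets are easy to check can be sketched as a small checker for a tab-delimited sample sheet. The column names, the rules and the function name below are assumptions chosen for illustration; they are a simplified stand-in and not the MAGE-TAB format or the tab2mage checkers.

```python
import csv
import io

# Hypothetical required columns for a simplified sample sheet.
REQUIRED_COLUMNS = ["sample", "organism", "data_file"]


def check_sheet(text):
    """Return a list of human-readable problems found in a tab-delimited sheet."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    seen = set()
    for row in reader:
        line_number = reader.line_num  # line of the row just read
        for column in REQUIRED_COLUMNS:
            # Short rows are padded with None by DictReader, so guard for it.
            if not (row[column] or "").strip():
                problems.append(f"line {line_number}: empty {column}")
        sample = (row["sample"] or "").strip()
        if sample and sample in seen:
            problems.append(f"line {line_number}: duplicate sample {sample}")
        seen.add(sample)
    return problems


if __name__ == "__main__":
    sheet = (
        "sample\torganism\tdata_file\n"
        "s1\tHomo sapiens\ts1.CEL\n"
        "s1\t\ts2.CEL\n"
    )
    for problem in check_sheet(sheet):
        print(problem)
```

Reporting problems by line number keeps the feedback usable for someone who prepared the sheet in a spreadsheet program and exported it as tab-delimited text.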
Useful tools for life sciences data management
• Excel
• Whatizit – text mining software from EBI
• Our spreadsheet builder, checkers and format parsers: tab2mage.sf.net
• OBO Foundry ontologies, especially OBI, CTO, Disease Ontology
• Taverna for building workflows
• BASE – open source microarray data management tool
• BioMart – data warehouse for biological data: www.biomart.org
Acknowledgements
• ArrayExpress Production Team: Tomasz Adamusiak, Tony Burdett, Anna Farne, Ele Holloway, James Malone, Margus Lukk, Helen Parkinson, Tim Rayner, Eleanor Williams, Holly Zheng
• Ugis Sarkans – ArrayExpress Development Team Leader
• Misha Kapushesky – Main Atlas Developer
• Gabriella Rustici – Training officer
• Alvis Brazma – Group Leader
• UniProt and Ensembl teams
• Funders: EC (FELICS, EMERALD, DIAMONDS, GEN2PHEN, MUGEN), NIH-NHGRI, EMBL
• The submitters and microarray collaborators
• GEO, especially Tanya Barrett