Standards and gene expression data – from data archiving to extracting biological knowledge

Helen Parkinson, PhD, Production Coordinator, European Bioinformatics Institute

Talk content

Data sharing

Standards Landscape (Nature Reviews Genetics, Vol 7, p. 593-605, August 2006)

MIAME – Minimum Information About a Microarray Experiment

So has MIAME been successful?

ArrayExpress?

ArrayExpress history
• 2001: 4
• 2002: 12 Expts
• 2003: 100 Expts
• 2004: TIGR Export, 420 Expts
• 2005: Re-funded, SMD Export, 1200 Expts
• 2006: New UI, 1600 Expts
• 2007: GEO Affy Data Import Phase 1, >2607 Expts
• 2008: 6631 Expts

ArrayExpress: Overview
[Diagram labels: Submit; Re-annotate; Summarize; Public/Private; Public Only; ATLAS; 134,000 hybs; 12,696,527; Experiment queries, > 200 species; Gene level queries, 9 species; Gene/condition queries]
Why re-annotate?
• Harmonize annotation
• For integration purposes
• Data ages

Getting to summary-level data: the Atlas

Developing an Experimental Factor Ontology: Our Use Cases
• Query support (e.g., query for 'cancer' and also get 'leukemia')
• Over-representation analysis in groups of samples (analogous to the use of GO terms in over-representation analysis in groups of genes)
• Ontology visualisation, e.g., presenting an ontology tree to the user of what is in the database
• Data integration by ontology terms, e.g., we assume that 'kidney' in independent studies roughly means the same, so we can count how many kidney samples we have in the database
• Intelligent template generation for different experiment types in submission or data presentation
• Summary level data
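The query-support, integration and over-representation use cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `IS_A` table, the sample records and the function names are hypothetical and are not taken from ArrayExpress or the Experimental Factor Ontology. The over-representation score shown is a plain hypergeometric upper tail, which is one common choice and not necessarily the method used in the Atlas.

```python
from collections import Counter
from math import comb

# Hypothetical toy ontology: child term -> parent terms (is_a edges).
IS_A = {
    "leukemia": ["cancer"],
    "lymphoma": ["cancer"],
    "cancer": ["disease"],
    "kidney": ["organ"],
}

def descendants(term, is_a=IS_A):
    """Return the term plus every term that reaches it through is_a edges."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parents in is_a.items():
            if child not in found and any(p in found for p in parents):
                found.add(child)
                changed = True
    return found

def query(samples, term):
    """Select samples annotated with the term or any of its descendants."""
    wanted = descendants(term)
    return [s for s in samples if s["annotation"] in wanted]

def enrichment_p(k, n, K, N):
    """Hypergeometric upper tail P(X >= k): a group of n samples is drawn
    from N samples, K of which carry the term, and k of the group carry it."""
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)

# Hypothetical sample annotations from independent studies.
SAMPLES = [
    {"id": "s1", "annotation": "leukemia"},
    {"id": "s2", "annotation": "kidney"},
    {"id": "s3", "annotation": "cancer"},
    {"id": "s4", "annotation": "kidney"},
]

# Query for 'cancer' also returns the sample annotated 'leukemia'.
cancer_ids = [s["id"] for s in query(SAMPLES, "cancer")]

# Counting 'kidney' samples across studies relies on the shared term.
kidney_count = Counter(s["annotation"] for s in SAMPLES)["kidney"]
```

Expanding the query term to its descendants before matching is what lets a search for a general term pick up samples annotated with more specific ones, and counting by shared term only works if independent submissions use the same vocabulary.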

Oh the complexity!
[Architecture diagram of ArrayExpress; labels recovered from the slide:]
• Submissions: www, MIAMExpress, tab2mage, MAGE-ML
• Array Manufacturers (Affymetrix, Agilent)
• Other Microarray Databases (SMD, TIGR, Utrecht, RZPD), MAGE-ML
• Submission tracking/curation environment
• ArrayExpress repository: Experiment, Sample, Hybridisation, Array, Data, Normalisation, Analysis, Publication
• External links: Source (e.g., Taxonomy), Gene (e.g., EMBL)
• Warehouse (BioMart): www, queries, analysis
• Data Analysis Software (R/Bioconductor, J-Express, Resolver), Expression Profiler

Developing an Experimental Factor Ontology: Application Ontology, Status Quo
[Diagram labels: ATLAS, DW, AE]
• Text mining at data acquisition
• Tuned for queries, structured for use in ArrayExpress GUI
• Multi-species aspect

Developing an Experimental Factor Ontology: Semantic Roadmap
• Position of the ArrayExpress Experimental Factor Ontology in the ‘bigger picture’
• Key is orthogonal coverage, reuse of existing resources and shared frameworks
[Diagram: the AE Ontology in relation to Chemical Entities of Biological Interest (ChEBI), Relation Ontology, Cell Type Ontology, Various Species Anatomy Ontologies, Anatomy Reference Ontology, Disease Ontology]

What lies beneath?

Where does the data come from?

What is curation?

2007 Affymetrix data landscape (Day et al., Genome Research, 2007)

Data exchange – or the failure to federate
• We need all the data in house to re-process it
• We do not have a data exchange agreement with GEO
• SOFT vs. MAGE-ML/MAGE-TAB
• No ontology usage
• Some free text annotation, little process annotation
• Mass data acquisition
• 80% solution (or less)
• Employing text mining
• Data reprocessing
• Cost effective, eliminates user support
• Using spreadsheets (not XML)
• We could almost eliminate the database if we can index the files
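The last point, indexing the files instead of loading them into a database, can be illustrated with a small inverted index over tab-delimited annotation sheets. This is a minimal sketch, not the ArrayExpress implementation: the file names, column names and the `build_index` and `search` helpers are hypothetical, and the files are held as in-memory strings so the example is self-contained.

```python
import csv
import io
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(files):
    """Map each word to the set of (file name, data row number) pairs containing it."""
    index = defaultdict(set)
    for name, text in files.items():
        reader = csv.reader(io.StringIO(text), delimiter="\t")
        next(reader, None)  # skip the header row
        for row_number, row in enumerate(reader, start=1):
            for cell in row:
                for word in tokenize(cell):
                    index[word].add((name, row_number))
    return index

def search(index, text):
    """Return the rows that contain every word of the query (AND semantics)."""
    words = tokenize(text)
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)

# Hypothetical stand-ins for tab-delimited annotation files.
FILES = {
    "experiment_a.txt": (
        "Sample\tOrganism\tOrganism part\n"
        "s1\tHomo sapiens\tkidney\n"
        "s2\tMus musculus\tliver\n"
    ),
    "experiment_b.txt": (
        "Sample\tOrganism\tOrganism part\n"
        "s1\tHomo sapiens\tliver\n"
    ),
}

index = build_index(FILES)
liver_rows = search(index, "liver")
human_liver_rows = search(index, "Homo sapiens liver")
```

Each hit points back to a file and row, so the files themselves remain the primary store and the index can be rebuilt from them at any time.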

2008 Data Landscape
[Chart comparing ArrayExpress and GEO; values shown on the slide: 5000, 8837, 1540, 135,000, 36512, 230749]

Flexible Data Access Models
• GUIs – biologists
• Hyperlinks
• FTP – bioinformaticians
• Web services – workflows
• XML data dumps
• Spreadsheets
• Direct SQL access (not for ArrayExpress)
• Schema and code if you want it
• ‘Geek for a week’

Lessons learned
• Complex architecture means a lot of software engineering
• Biologists like Excel, bioinformaticians like tab-delimited files
• Spreadsheets scale, easy to check, harder to parse
• Generic systems will be future proof
• Legacy format converters are needed
• You don’t need to keep everything
• Text-based queries are the most common
• Text mining is very useful
• Scaling problems are hard to fix
• Bleeding edge technologies should be used sparingly
• Federation doesn’t really work for the goals we have
• Archiving alone does not add value
• Training is important and expensive
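As an illustration of why tab-delimited spreadsheets are easy to check, a checker can be little more than a loop over rows. The sketch below is an assumption-laden example, not the tab2mage checker: the required column names and the `check_sheet` function are hypothetical.

```python
import csv
import io

# Hypothetical required columns; a real submission template defines its own.
REQUIRED = ("Sample", "Organism", "Array")

def check_sheet(text, required=REQUIRED):
    """Return a list of human-readable problems found in a tab-delimited sheet."""
    problems = []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [column for column in required if column not in header]
    for column in missing:
        problems.append(f"missing column: {column}")
    # Data rows are numbered from 1, not counting the header row.
    for row_number, row in enumerate(reader, start=1):
        for column in required:
            if column in missing:
                continue  # already reported once for the whole sheet
            value = row.get(column)
            if value is None or not value.strip():
                problems.append(f"row {row_number}: empty {column}")
    return problems

GOOD = "Sample\tOrganism\tArray\ns1\tHomo sapiens\tA-1\n"
BAD = "Sample\tOrganism\ns1\t\n"

good_problems = check_sheet(GOOD)
bad_problems = check_sheet(BAD)
```

Reporting every problem in one pass, instead of stopping at the first, keeps the round trips with a submitter short.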

Useful tools for life sciences data management
• Excel
• Whatizit – text mining software from EBI
• Our spreadsheet builder, checkers and format parsers: tab2mage.sf.net
• OBO Foundry ontologies, especially OBI, CTO, Disease Ontology
• Taverna for building workflows
• BASE – open source microarray data management tool
• BioMart – data warehouse for biological data: www.biomart.org

Acknowledgements
• ArrayExpress Production Team
• Tomasz Adamusiak, Tony Burdett, Anna Farne, Ele Holloway, James Malone, Margus Lukk, Helen Parkinson, Tim Rayner, Eleanor Williams, Holly Zheng
• Ugis Sarkans – ArrayExpress Development Team Leader
• Misha Kapushesky – Main Atlas Developer
• Gabriella Rustici – Training officer
• Alvis Brazma – Group Leader
• UniProt and Ensembl teams
• Funders: EC (FELICS, EMERALD, DIAMONDS, GEN2PHEN, MUGEN), NIH-NHGRI, EMBL
• The submitters and microarray collaborators
• GEO, especially Tanya Barrett
