ArrayExpress and Gene Expression Atlas: Mining Functional Genomics Data Amy Tang, PhD ArrayExpress Production Team Functional Genomics Group EMBL-EBI
What’s covered this morning? http://www.ebi.ac.uk/training/course/bioinformatics-udine2013 • What do we mean by “functional genomics data”? Why do we need databases for them? • Two databases: • ArrayExpress • Expression Atlas • What’s in each database, how to browse, search, interpret, download data • (Microarray/sequencing data analysis; How to submit data to ArrayExpress?) 2 ArrayExpress
Functional genomics (FG) data • The aim of FG is to understand the function of genes and other (non-genic) parts of the genome • Often involves high-throughput technologies (microarrays, high-throughput sequencing [HTS]) • Questions addressed: • Gene expression - when? where? how much? changes? • Gene function - roles of genes in cellular processes, pathways • Gene/genome regulation - e.g. histone modifications, CpG (DNA) methylation
Example of FG data sets in ArrayExpress • Questions addressed: • Gene expression - when? where? how much? changes? • Gene function - roles of genes in cellular processes, pathways
Example of FG data sets in ArrayExpress • Questions addressed: • Gene/genome regulation - e.g. histone modifications, CpG (DNA) methylation
The two databases: how are they related? [Diagram: data reach ArrayExpress by direct submission or by import from external databases (mainly NCBI Gene Expression Omnibus); curation and statistical analysis lead from ArrayExpress to Expression Atlas; links to other databases and to analysis software]
The two databases: how do they compare?
ArrayExpress www.ebi.ac.uk/arrayexpress • Public repository for functional genomics data (both microarray and sequencing) • Together with GEO at NCBI and CIBEX at DDBJ, serves the scientific community as a data archive supporting publications • Provides access to curated data in a structured and standardised format – essential for easy sharing of experimental information • Submissions are curated based on community standards: • MIAME guidelines & MAGE-TAB format for microarray • MINSEQE guidelines & MAGE-TAB format for HTS data
Community standards for data requirements • MIAME = Minimum Information About a Microarray Experiment (http://www.mged.org/Workgroups/MIAME/miame_2.0.html) • MINSEQE = Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (http://www.mged.org/minseqe) • The checklist:
What is an experimental factor? • The main variable(s) studied: the independent variable, often related to the hypothesis of the experiment. • Values of the factor (“factor values”) should vary.
Reporting standards - MAGE-TAB format
A simple spreadsheet format that uses a number of tab-delimited text files:
• Array Design Format (ADF) file (microarray only)
  • Describes probes on an array, e.g. sequence, genomic mapping location
• Investigation Description Format (IDF) file
  • Experiment title
  • Experiment description
  • Submitter’s contact details
  • Definition of all protocols
• Sample and Data Relationship Format (SDRF) file
  • Starting materials with annotation
  • Derived materials (e.g. RNA extracts)
  • All assays (hybs/seq. lanes)
  • Resulting data file(s) for each assay
• Raw and processed data files (e.g. A1.CEL, 1.fq.gz, 2.fq.gz, Normalized.txt)
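Because an SDRF is plain tab-delimited text, it can be read with standard tools. The sketch below is only an illustration under stated assumptions: the two-sample table is made up, real SDRF files carry many more columns, and the helper names `read_sdrf` and `factor_values` are hypothetical.

```python
import csv
import io

# Toy SDRF content (invented for illustration; real files have many more columns).
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[disease]\tArray Data File\n"
    "sample 1\tHomo sapiens\tnormal\tA1.CEL\n"
    "sample 2\tHomo sapiens\tleukemia\tA2.CEL\n"
)

def read_sdrf(text):
    """Return one dict per SDRF row, keyed by column heading."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def factor_values(rows):
    """Collect the distinct values found in every 'Factor Value[...]' column."""
    factors = {}
    for row in rows:
        for column, value in row.items():
            if column.startswith("Factor Value["):
                name = column[len("Factor Value["):-1]  # text between the brackets
                factors.setdefault(name, set()).add(value)
    return factors

rows = read_sdrf(SDRF_TEXT)
print(factor_values(rows))
print([row["Array Data File"] for row in rows])
```

Each row links one sample to its annotation and to the data file produced for it, which is the relationship the SDRF is meant to capture.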
How much data in ArrayExpress? (as of 29 Oct 2013)
HTS data in ArrayExpress (as of 29 October 2013) • Microarray vs HTS • RNA-, DNA-, ChIP-seq breakdown
Browsing ArrayExpress www.ebi.ac.uk/arrayexpress
Browsing ArrayExpress experiments www.ebi.ac.uk/arrayexpress/experiments/browse.html • All columns can be sorted by clicking on the heading
File download on the Browse page • Direct download link (e.g. here it’s for a single raw data archive [i.e. *.zip] file) • A link to a page which lists all the archive files available for download (no direct link because there is more than one archive) • Specifically for HTS experiments: direct link to the European Nucleotide Archive (ENA) page which lists all the sequencing assays (called “runs” at the ENA)
ArrayExpress single-experiment view • Sample characteristics, factors and factor values • The microarray design used • MIAME or MINSEQE scores (* = compliant) • All files related to this experiment (e.g. IDF, SDRF, array design, raw data, R object) • Send data to GenomeSpace and analyse it yourself
Samples view – microarray experiment • All columns can be sorted by clicking on the heading • Direct link to data files for one sample • Sample characteristics • Factor values • Scroll left and right to see all sample characteristics and factor values
Samples view – sequencing experiment • Direct link to the European Nucleotide Archive (ENA) record about this sequencing assay • Direct link to fastq files at the ENA
Searching for experiments in ArrayExpress www.ebi.ac.uk/arrayexpress/experiments/browse.html
Experimental factor ontology (EFO) http://www.ebi.ac.uk/efo • Ontology: a way to systematically organise experimental factor terms: controlled vocabulary + hierarchy (relationship) • Used in EBI databases and external projects (e.g. NHGRI GWAS Catalogue) • Combines terms from a subset of well-maintained and compatible ontologies, e.g. • Gene Ontology (cellular component + biological process terms) • NCBI Taxonomy • Ontology in layman’s terms: http://jamesmaloneebi.blogspot.co.uk/2012/06/common-ontology-questions-1-what-is-it.html
Building EFO - an example • Take all experimental factors • Find the logical connection between them • Organize them in an ontology • Example terms and relationships: disease is the parent term; neoplasm is a type of disease; cancer is a synonym of neoplasm; sarcoma is a type of cancer; Kaposi’s sarcoma is a type of sarcoma
Exploring EFO - an example
Experimental factor ontology (EFO) http://www.ebi.ac.uk/efo EFO developed to: • increase the richness of annotations in databases • expand on search terms when querying ArrayExpress and Expression Atlas • using synonyms (e.g. “cerebral cortex” = “adult brain cortex”) • using child terms (e.g. “bone” includes “rib” and “vertebra”) • promote consistency (e.g. F/female, 1 day/24 hours) • facilitate automatic annotation and integration of external data (e.g. changing “gender” to “sex” automatically)
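Query expansion with synonyms and child terms can be pictured with a toy ontology. This is a minimal sketch, not how EFO or the ArrayExpress search is implemented: the terms are taken from the slides, while the `CHILDREN` and `SYNONYMS` dictionaries and the `expand` function are invented for illustration.

```python
# Toy ontology: terms come from the slides, the data structures are assumptions.
CHILDREN = {
    "disease": ["neoplasm"],
    "neoplasm": ["sarcoma"],
    "sarcoma": ["Kaposi's sarcoma"],
    "bone": ["rib", "vertebra"],
}
SYNONYMS = {
    "neoplasm": ["cancer"],
    "cerebral cortex": ["adult brain cortex"],
}

def expand(term):
    """Return the term plus its synonyms and all descendant terms."""
    # If the query is a synonym, switch to the primary label first.
    for label, alternatives in SYNONYMS.items():
        if term in alternatives:
            term = label
            break
    expanded = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue
        expanded.append(current)
        expanded.extend(s for s in SYNONYMS.get(current, []) if s not in expanded)
        # Reverse so that children are visited in their listed order.
        stack.extend(reversed(CHILDREN.get(current, [])))
    return expanded

print(expand("bone"))
print(expand("cancer"))
```

Searching for “bone” would then also retrieve experiments annotated with “rib” or “vertebra”, and searching for “cancer” would reach experiments annotated with “neoplasm” or any of its child terms.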
Searching ArrayExpress using EFO terms and filters • Filter your search results by: • Species of interest • Array design (platform) • Molecule (DNA, RNA, protein, etc.) • Technology (microarray or HTS) • “Auto-complete” with suggestions (like Google search) • Avoid acronyms as search terms • Enter keyword, click search, then filter.
What search terms can I use? • ArrayExpress accession number, e.g. “E-MEXP-568” • Secondary accession number e.g. GEO series “GSE5389” • Experiment title, description • Submitter's email address • Publication title, authors and journal name, PubMed ID • Sample attributes and experimental factor / factor values: • “genetic modification” “heart” “diabetes” • “neural stem cells” “penicillin” “ChIP-chip” • “methylation profiling” “Arabidopsis” “p53” • * Powered by EFO expansion. Use EFO terms wherever possible.
Example search: “leukemia” • Exact match to search term • Matched EFO synonyms of search term • Matched EFO child term of search term
Advanced search • Allows you to restrict your search to a specific field • Format of search term: field_name:search_term • Some examples: • More examples: https://www.ebi.ac.uk/arrayexpress/help/how_to_search.html#AdvancedSearchExperiment
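The `field_name:search_term` format can be composed programmatically. The sketch below only builds query strings. The field names `organism` and `exptype`, the quoting of multi-word terms and the `AND` operator are assumptions that should be checked against the help page linked above, and `field_query` and `combine` are hypothetical helpers.

```python
def field_query(field_name, search_term):
    """Format one restriction as field_name:search_term, quoting multi-word terms."""
    if " " in search_term:
        search_term = f'"{search_term}"'
    return f"{field_name}:{search_term}"

def combine(*restrictions):
    """Join several restrictions with AND (operator assumed, see the help page)."""
    return " AND ".join(restrictions)

query = combine(
    field_query("organism", "homo sapiens"),  # field name is an assumption
    field_query("exptype", "RNA-seq"),        # field name is an assumption
)
print(query)
```

The resulting string would be pasted into the ArrayExpress search box; nothing here contacts the ArrayExpress server.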
QUESTIONS?
Hands-on exercise 1: Find RNA-seq assays studying human prostate adenocarcinoma. Hands-on exercise 2: Find experiments studying the effect of sodium dodecyl sulphate on human skin.
The two databases [Diagram: data reach ArrayExpress by direct submission or by import from external databases (mainly NCBI Gene Expression Omnibus); curation and statistical analysis lead from ArrayExpress to Expression Atlas; links to other databases and to analysis software]
The two databases: how do they compare?
Atlas construction - experiment selection criteria
• Array (platform) designs relating to the experiment must be provided. Probe annotation must be adequate to map probes to genes and allow re-annotation of external references (e.g. Ensembl gene ID, UniProt ID)
• At least 3 replicates for each value of the experimental factor
• Maximum 4 experimental factors
• Adequate sample annotation using EFO terms
• Presence of raw data files: CEL raw data files for Affymetrix assays, fastq files for RNA-seq experiments
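Some of these criteria are simple enough to express as a check. The sketch below is illustrative only: the dictionary layout and the function name `meets_selection_criteria` are invented, and the annotation-quality criteria (probe annotation, EFO sample annotation) are left out because they need curator judgement.

```python
def meets_selection_criteria(experiment):
    """Check the countable criteria on a plain-dict description of an experiment.

    `experiment["factors"]` maps each experimental factor to a dict of
    {factor value: number of replicates}. This layout is an assumption.
    """
    factors = experiment["factors"]
    if len(factors) > 4:  # maximum 4 experimental factors
        return False
    for replicate_counts in factors.values():
        # at least 3 replicates for each factor value
        if any(count < 3 for count in replicate_counts.values()):
            return False
    return bool(experiment["has_array_design"] and experiment["has_raw_data"])

# Made-up experiments for illustration.
eligible = {
    "factors": {"disease": {"normal": 3, "leukemia": 4}},
    "has_array_design": True,
    "has_raw_data": True,
}
too_few_replicates = {
    "factors": {"disease": {"normal": 3, "leukemia": 2}},
    "has_array_design": True,
    "has_raw_data": True,
}
print(meets_selection_criteria(eligible))
print(meets_selection_criteria(too_few_replicates))
```

For sequencing experiments the array design requirement would not apply; the sketch treats every experiment as a microarray experiment to stay short.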
Atlas construction – analysis pipeline
A dummy example from one experiment:
• Input data (Affy CEL, non-Affy processed)
• Linear model* (Bioconductor limma)
• Output: 2-D matrix of genes by conditions (Cond.1, Cond.2, Cond.3), where 1 = differentially expressed and 0 = not differentially expressed
* More information about the statistical methodology: http://nar.oxfordjournals.org/content/38/suppl_1/D690.full
Atlas construction – analysis pipeline
How differential expression is calculated in one experiment: “Is gene X differentially expressed in condition 1 in this experiment?”
[Figure: single expression values for gene X, grouped by condition, with the Cond.1, Cond.2 and Cond.3 means and the mean of all samples marked]
• Compare and calculate statistic
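The comparison of a condition mean against the mean of all samples can be sketched numerically. This is a deliberately simplified stand-in, not the Atlas pipeline: the real analysis fits a linear model with limma, whereas the statistic here is just the difference divided by the standard error of the condition mean, and all expression values are made up.

```python
from math import sqrt
from statistics import mean, stdev

# Made-up expression values for gene X, three replicates per condition.
EXPRESSION = {
    "Cond.1": [9.1, 9.4, 9.0],
    "Cond.2": [6.0, 6.2, 5.9],
    "Cond.3": [6.1, 5.8, 6.3],
}

def condition_vs_overall(expression, condition):
    """Compare one condition's mean with the mean of all samples.

    Returns (difference, statistic), where the statistic is the difference
    divided by the standard error of the condition mean.
    """
    values = expression[condition]
    all_values = [v for replicates in expression.values() for v in replicates]
    difference = mean(values) - mean(all_values)
    standard_error = stdev(values) / sqrt(len(values))
    return difference, difference / standard_error

for condition in EXPRESSION:
    diff, stat = condition_vs_overall(EXPRESSION, condition)
    print(f"{condition}: difference={diff:+.2f}, statistic={stat:+.1f}")
```

A large positive statistic suggests the gene is up in that condition relative to the experiment as a whole, a large negative one that it is down; a proper analysis would also turn the statistic into a p-value and correct for testing many genes.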
Atlas construction - analysis pipeline • Apply linear modelling statistics to each of the n experiments (Exp.1: Cond.1, Cond.2, Cond.3; Exp.2: Cond.4, Cond.5, Cond.6; ... Exp.n: Cond.X, Cond.Y, Cond.Z), with a statistical test per gene • Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition
Atlas construction - result • Summary of the “verdicts” from different experiments
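Summarising the per-experiment verdicts amounts to counting votes. A minimal sketch, assuming each experiment reports +1 (up), -1 (down) or 0 (not differentially expressed) for one gene under one condition; the calls and the `summarise` helper are invented for illustration.

```python
from collections import Counter

# Made-up calls for one gene under one condition, one entry per experiment:
# +1 = up-regulated, -1 = down-regulated, 0 = not differentially expressed.
CALLS = {
    "Exp.1": +1,
    "Exp.2": +1,
    "Exp.3": 0,
    "Exp.4": -1,
}

def summarise(calls):
    """Count how many experiments voted up, down or not differentially expressed."""
    counts = Counter(calls.values())
    return {"up": counts[1], "down": counts[-1], "non_de": counts[0]}

print(summarise(CALLS))
```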
Expression Atlas home page http://www.ebi.ac.uk/gxa • Query for genes • Query for conditions • Restrict query by direction of differential expression (up, down, both, neither) • The ‘advanced query’ option allows building more complex queries
Mapping microarray probes to genes • Every (~monthly) Atlas release takes the latest Ensembl gene – probe identifier mapping data. • From Ensembl genes, we also get: • Compara genes • External references (xrefs) to other databases, e.g. UniProt protein IDs, NCBI RefSeq IDs, HGNC gene symbols, Gene Ontology terms, InterPro terms [Diagram labels: probe identifiers, expression data per probe, Ensembl genes]
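Linking per-probe expression data to genes is essentially a lookup through the probe-to-gene mapping. The sketch below uses invented probe and gene identifiers and a hypothetical helper, `expression_by_gene`; it does not query Ensembl.

```python
from collections import defaultdict

# Invented identifiers for illustration; a real mapping would come from Ensembl.
PROBE_TO_GENE = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
PROBE_EXPRESSION = {"probe_1": 7.2, "probe_2": 6.8, "probe_3": 3.1, "probe_4": 5.0}

def expression_by_gene(probe_expression, probe_to_gene):
    """Group per-probe expression values under the gene each probe maps to.

    Returns (by_gene, unmapped), where unmapped lists probes with no gene.
    """
    by_gene = defaultdict(dict)
    unmapped = []
    for probe, value in probe_expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            unmapped.append(probe)
        else:
            by_gene[gene][probe] = value
    return dict(by_gene), unmapped

by_gene, unmapped = expression_by_gene(PROBE_EXPRESSION, PROBE_TO_GENE)
print(by_gene)
print(unmapped)
```

Keeping the mapping separate from the expression data is what allows each release to pick up updated Ensembl mappings without touching the measurements themselves.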
Example Atlas search: KCC2 gene and BPA
Scenario: You study the health impact of Bisphenol A (BPA)
• BPA: common additive in household plastic items. Negative health effects have been linked to BPA, e.g. on foetal and neonatal brain development.
• + BPA: epigenetic downregulation; potassium chloride cotransporter 2 (Kcc2) mRNA levels ↓
• PNAS paper (Yeo et al., 2013): Bisphenol A delays the perinatal chloride shift in cortical neurons by epigenetic effects on the Kcc2 promoter.
Your questions:
• What is the human KCC2 gene? What is its general expression pattern?
• In which human organ/tissue is the KCC2 gene differentially expressed?
• What is the expression pattern of KCC2/Kcc2 orthologues?
Gene search: human KCC2 gene
(1) Summarised expression data for one gene • Default: sort by levels of diff. expression • Group by experimental factor / intent • Clicking on a factor/condition changes profile display
(2) The anatomogram
(3) Detailed expression profile • Drill down to 1 probe (210040_at), mapped to 1 gene (KCC2), in 1 experiment (E-GEOD-3526) • Samples mapped to “brain” experimental factor by EFO
(4) Jump to orthologues from gene summary • Orthology comes from the Ensembl Compara database
(5) Compare orthologues with parallel heatmaps
Atlas ‘condition-only’ query
Atlas ‘condition-only’ query (cont’d): heatmap view