ArrayExpress and Gene Expression Atlas: Mining Functional Genomics Data. Amy Tang, PhD (amytang@ebi.ac.uk), ArrayExpress Production Team, Functional Genomics Group, EMBL-EBI.
What’s covered this morning?
http://www.ebi.ac.uk/training/course/bioinformatics-transcriptomics-data-and-tools-cambridge-uk
- What do we mean by “functional genomics data”?
- Why do we need databases for them?
- Two databases: ArrayExpress and Expression Atlas
- What’s in each database, and how to browse, search, interpret and download data
- (Microarray/sequencing data analysis; how to submit data to ArrayExpress)
Functional genomics (FG) data
- The aim of FG is to understand the function of genes and other (non-genic) parts of the genome
- Often involves high-throughput technologies (microarrays, high-throughput sequencing [HTS])
- Questions addressed:
  - Gene expression: when? where? how much? changes?
  - Gene function: roles of genes in cellular processes and pathways
  - Gene/genome regulation: e.g. histone modifications, CpG (DNA) methylation
Example of FG data sets in ArrayExpress
Questions addressed:
- Gene expression: when? where? how much? changes?
- Gene function: roles of genes in cellular processes and pathways
Example of FG data sets in ArrayExpress
Questions addressed:
- Gene/genome regulation: e.g. histone modifications, CpG (DNA) methylation
The two databases: how are they related?
[Diagram labels: direct submission; import from external databases (mainly NCBI Gene Expression Omnibus); ArrayExpress; curation; statistical analysis; Expression Atlas; links to other databases; links to analysis software]
The two databases: how do they compare?
ArrayExpress: www.ebi.ac.uk/arrayexpress
- Public repository for functional genomics data (both microarray and sequencing)
- Together with GEO at NCBI and CIBEX at DDBJ, serves the scientific community as a data archive supporting publications
- Provides access to curated data in a structured and standardised format, which is essential for easy sharing of experimental information
- Submissions are curated based on community standards:
  - MIAME guidelines and MAGE-TAB format for microarray data
  - MINSEQE guidelines and MAGE-TAB format for HTS data
Community standards for data requirements
- MIAME = Minimum Information About a Microarray Experiment (http://www.mged.org/Workgroups/MIAME/miame_2.0.html)
- MINSEQE = Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (http://www.mged.org/minseqe)
The checklist:
What is an experimental factor?
- The main variable(s) studied. It is often related to the hypothesis of the experiment and is the independent variable, e.g. “genotype”.
- The “factor values” of the samples should vary (e.g. “p53-/-”, “wild type”).
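One rough way to spot candidate experimental factors in a sample table is to look for the attributes whose values actually differ between samples. The sketch below is illustrative only: the function name and the "organism" annotation are assumptions made for the example, while the genotype values are the ones quoted on the slide.

```python
def varying_attributes(samples):
    """Return the names of attributes whose values differ between samples."""
    names = sorted({name for sample in samples for name in sample})
    return [n for n in names if len({s.get(n) for s in samples}) > 1]


# Two hypothetical samples: same organism, different genotype.
samples = [
    {"organism": "Mus musculus", "genotype": "p53-/-"},
    {"organism": "Mus musculus", "genotype": "wild type"},
]

print(varying_attributes(samples))  # ['genotype']
```

An attribute that is constant across all samples (here "organism") describes the samples but cannot be the variable under study.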
Reporting standards - MAGE-TAB format
A simple spreadsheet format that uses a number of tab-delimited text files:
- Investigation Description Format (IDF) file
  - Experiment title
  - Experiment description
  - Submitter’s contact details
  - Definition of all protocols
- Sample and Data Relationship Format (SDRF) file
  - Starting materials with annotation
  - Derived materials (e.g. RNA extracts)
  - All assays (hybridisations/sequencing lanes)
  - Resulting data file(s) for each assay
- Array Design Format (ADF) file (microarray only)
  - Describes probes on an array, e.g. sequence, genomic mapping location
- Raw and processed data files (e.g. A1.CEL, 1.fq.gz, 2.fq.gz, Normalized.txt)
MAGE-TAB Example: IDF
MAGE-TAB Example: SDRF
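Because MAGE-TAB files are plain tab-delimited text, they can be read with standard tools. The following is a minimal sketch, not an official parser: the three-line SDRF fragment, its exact column headers and the file name A2.CEL are assumptions made up for illustration (only A1.CEL appears on the slides), and real SDRF files usually have many more columns.

```python
import csv
import io

# A tiny, invented SDRF-like fragment (tab-delimited, one row per sample).
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[genotype]\tArray Data File\n"
    "sample 1\tMus musculus\tp53-/-\tA1.CEL\n"
    "sample 2\tMus musculus\twild type\tA2.CEL\n"
)


def read_sdrf(handle):
    """Read tab-delimited SDRF rows into a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def files_by_factor_value(rows, factor_column, file_column):
    """Group data file names by the value of one factor column."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[factor_column], []).append(row[file_column])
    return grouped


rows = read_sdrf(io.StringIO(SDRF_TEXT))
print(files_by_factor_value(rows, "Factor Value[genotype]", "Array Data File"))
```

One caveat with this shortcut: `csv.DictReader` keeps only the last column when a header name is repeated, so a file with repeated column names would need `csv.reader` and positional handling instead.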
How much data in ArrayExpress? (as of 29 Oct 2013)
HTS data in ArrayExpress (as of 29 October 2013)
[Charts: microarray vs HTS; RNA-, DNA-, ChIP-seq breakdown]
Browsing ArrayExpress experiments: www.ebi.ac.uk/arrayexpress/experiments/browse.html
- All columns can be sorted by clicking on the heading
File download on the Browse page
- Direct download link (e.g. here it is for a single raw data archive file, i.e. *.zip)
- A link to a page which lists all the archive files available for download (there is no direct link because there is more than one archive)
- Specifically for HTS experiments: a direct link to the European Nucleotide Archive (ENA) page which lists all the sequencing assays (called “runs” at the ENA)
ArrayExpress single-experiment view
- Sample characteristics, factors and factor values
- The microarray design used
- MIAME or MINSEQE scores (* = compliant)
- All files related to this experiment (e.g. IDF, SDRF, array design, raw data, R object)
- Send data to GenomeSpace and analyse it yourself
Samples view – microarray experiment
- All columns can be sorted by clicking on the heading
- Direct link to data files for one sample
- Sample characteristics
- Factor values
- Scroll left and right to see all sample characteristics and factor values
Samples view – sequencing experiment
- Direct link to the European Nucleotide Archive (ENA) record about this sequencing assay
- Direct link to fastq files at the ENA
Searching for experiments in ArrayExpress: www.ebi.ac.uk/arrayexpress/experiments/browse.html
Experimental factor ontology (EFO): http://www.ebi.ac.uk/efo
- Ontology: a way to systematically organise experimental factor terms; controlled vocabulary + hierarchy (relationship)
- Used in EBI databases and external projects (e.g. NHGRI GWAS Catalogue)
- Combines terms from a subset of well-maintained and compatible ontologies, e.g. Gene Ontology (cellular component + biological process terms) and NCBI Taxonomy
- Ontology in layman’s terms: http://jamesmaloneebi.blogspot.co.uk/2012/06/common-ontology-questions-1-what-is-it.html
Building EFO - an example
1. Take all experimental factors: disease, neoplasm, cancer, sarcoma, Kaposi’s sarcoma
2. Find the logical connection between them:
   - disease is the parent term
   - neoplasm is a type of disease
   - cancer is a synonym of neoplasm
   - sarcoma is a type of cancer
   - Kaposi’s sarcoma is a type of sarcoma
3. Organize them in an ontology: disease > neoplasm (= cancer) > sarcoma > Kaposi’s sarcoma
Exploring EFO - an example
Experimental factor ontology (EFO): http://www.ebi.ac.uk/efo
EFO was developed to:
- increase the richness of annotations in databases
- expand on search terms when querying ArrayExpress and Expression Atlas
  - using synonyms (e.g. “cerebral cortex” = “adult brain cortex”)
  - using child terms (e.g. “bone” also matches “rib” and “vertebra”)
- promote consistency (e.g. F/female, 1 day/24 hours)
- facilitate automatic annotation and integration of external data (e.g. changing “gender” to “sex” automatically)
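Query expansion with synonyms and child terms can be pictured with a toy ontology. This is a sketch under stated assumptions, not how ArrayExpress implements it: the dictionary layout and the function name are invented for illustration, and only the example terms come from the slide.

```python
# Toy ontology: each term lists its synonyms and its direct child terms.
ONTOLOGY = {
    "bone": {"synonyms": [], "children": ["rib", "vertebra"]},
    "rib": {"synonyms": [], "children": []},
    "vertebra": {"synonyms": [], "children": []},
    "cerebral cortex": {"synonyms": ["adult brain cortex"], "children": []},
}


def expand_term(term, ontology):
    """Return the term plus its synonyms and all of its descendant terms."""
    expanded = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue
        expanded.append(current)
        entry = ontology.get(current, {})
        for synonym in entry.get("synonyms", []):
            if synonym not in expanded:
                expanded.append(synonym)
        # Reverse so that children are visited in their listed order.
        stack.extend(reversed(entry.get("children", [])))
    return expanded


print(expand_term("bone", ONTOLOGY))             # ['bone', 'rib', 'vertebra']
print(expand_term("cerebral cortex", ONTOLOGY))  # ['cerebral cortex', 'adult brain cortex']
```

A search for "bone" would then be run against every term in the expanded list, which is why experiments annotated only with "rib" can still be found.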
Searching ArrayExpress using EFO terms and filters
- Enter a keyword, click search, then filter
- “Auto-complete” with suggestions (like Google search)
- Avoid acronyms as search terms
- Filter your search results by:
  - species of interest
  - array design (platform)
  - molecule (DNA, RNA, protein, etc.)
  - technology (microarray or HTS)
What search terms can I use?
- ArrayExpress accession number, e.g. “E-MEXP-568”
- Secondary accession number, e.g. GEO series “GSE5389”
- Experiment title, description
- Submitter's email address
- Publication title, authors and journal name, PubMed ID
- Sample attributes and experimental factor / factor values: “genetic modification”, “heart”, “diabetes”, “neural stem cells”, “penicillin”, “ChIP-chip”, “methylation profiling”, “Arabidopsis”, “p53”
* Powered by EFO expansion. Use EFO terms wherever possible.
Example search: “leukemia”
[Screenshot legend: exact match to search term; matched EFO synonyms of search term; matched EFO child terms of search term]
Advanced search
- Allows you to restrict your search to a specific field
- Format of search term: field_name:search_term
- Some examples:
- More examples: https://www.ebi.ac.uk/arrayexpress/help/how_to_search.html#AdvancedSearchExperiment
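The field_name:search_term pattern is easy to generate programmatically. The helper below is a hypothetical convenience function, not part of any ArrayExpress tool. The field name "organism" and the quoting of multi-word terms are assumptions to be checked against the help page linked above.

```python
def field_query(field_name, search_term):
    """Format one fielded search term as field_name:search_term.

    Multi-word terms are wrapped in double quotes (an assumption about
    the search syntax; see the ArrayExpress help page for the real rules).
    """
    term = f'"{search_term}"' if " " in search_term else search_term
    return f"{field_name}:{term}"


# "organism" is an assumed field name used purely for illustration.
print(field_query("organism", "homo sapiens"))  # organism:"homo sapiens"
```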
QUESTIONS?
Hands-on exercise 1: Find RNA-seq assays studying human prostate adenocarcinoma
Hands-on exercise 2: Find experiments studying the effect of sodium dodecyl sulphate on human skin
The two databases
[Diagram labels: direct submission; import from external databases (mainly NCBI Gene Expression Omnibus); ArrayExpress; curation; statistical analysis; Expression Atlas; links to other databases; links to analysis software]
The two databases: how do they compare?
Atlas experiment selection criteria
- At least 3 replicates for each value of the experimental factor, and a maximum of 4 factors
- Adequate sample annotation using EFO terms
- Adequate array (platform) design to map probes to genes and allow re-annotation of external references (e.g. Ensembl gene ID, UniProt ID)
- RNA-seq experiments: good quality reads and reference genome build
- Presence of good quality raw data files, e.g. CEL raw data files for Affymetrix assays, fastq files for RNA-seq experiments
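The first criterion lends itself to an automatic check. The sketch below is one possible reading of it, counting replicates per value of each factor separately. The function and constant names are hypothetical, and the other criteria (annotation quality, raw data quality) are not covered.

```python
from collections import Counter

MIN_REPLICATES = 3  # "at least 3 replicates for each value"
MAX_FACTORS = 4     # "a maximum of 4 factors"


def meets_replicate_criteria(samples, factors):
    """Check the replicate and factor-count criteria for one experiment.

    samples: list of dicts mapping factor name -> factor value.
    factors: list of factor names studied in the experiment.
    """
    if len(factors) > MAX_FACTORS:
        return False
    for factor in factors:
        counts = Counter(sample[factor] for sample in samples)
        if any(n < MIN_REPLICATES for n in counts.values()):
            return False
    return True


samples = [{"genotype": g} for g in ["wild type"] * 3 + ["p53-/-"] * 2]
print(meets_replicate_criteria(samples, ["genotype"]))  # False: only 2 p53-/- samples
```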
New Atlas is launching in 3 days’ time!
- Launch date: week of 1 Dec 2013
- Where to find the Atlases before and after launch? [Diagram labels: Old; New]
New Atlas: “Baseline” and “differential”
Experiencing the old and new Atlases today
[Slide labels: Old; New; taster and preview; example use case and exercise]
“Old” Atlas construction – analysis pipeline
A dummy example from one experiment:
- Input data: Affymetrix CEL files, Agilent feature extraction files, RNA-seq fastq files
- Analysis: linear model* (Bioconductor limma), moderated t-test
- Output: 2-D matrix of genes by conditions (Cond.1, Cond.2, Cond.3), where 1 = differentially expressed and 0 = not differentially expressed
* More information about the statistical methodology: http://nar.oxfordjournals.org/content/38/suppl_1/D690.full
“Old” Atlas construction – analysis pipeline
How differential expression is calculated in one experiment: “Is gene X differentially expressed in condition 1 in this experiment?”
[Figure: each point is a single expression value for gene X; the Cond.1, Cond.2 and Cond.3 means and the mean of all samples are marked]
Compare and calculate statistic
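The comparison drawn on this slide can be sketched in a few lines. This covers only the "compare" step, on the assumption (read from the figure labels) that each condition mean is compared with the mean of all samples. The moderated t-statistic itself comes from limma and is not reproduced here, and the expression values are invented.

```python
from statistics import mean


def condition_deviations(expression_by_condition):
    """Return, per condition, the condition mean minus the mean of all samples."""
    all_values = [v for values in expression_by_condition.values() for v in values]
    overall = mean(all_values)
    return {
        condition: mean(values) - overall
        for condition, values in expression_by_condition.items()
    }


# Invented expression values for gene X, three samples per condition.
gene_x = {
    "Cond.1": [8.0, 9.0, 10.0],
    "Cond.2": [5.0, 6.0, 7.0],
    "Cond.3": [5.0, 6.0, 7.0],
}

print(condition_deviations(gene_x))
# {'Cond.1': 2.0, 'Cond.2': -1.0, 'Cond.3': -1.0}
```

In this toy data, gene X sits well above the overall mean in Cond.1 only, which is the kind of pattern the statistical test is then asked to judge.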
“Old” Atlas construction – analysis pipeline
- Apply linear modelling statistics to each of the n experiments
[Diagram: Exp. 1 (Cond.1, Cond.2, Cond.3), Exp. 2 (Cond.4, Cond.5, Cond.6), ..., Exp. n (Cond.X, Cond.Y, Cond.Z); a statistical test is applied to the genes of each experiment]
- Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition
“Old” Atlas construction – results
Summary of the “verdicts” from different experiments
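Summarising the per-experiment "verdicts" amounts to counting votes per condition. The sketch below assumes each verdict is recorded as "up", "down" or "none". The experiment labels, the conditions and the function name are all invented for illustration.

```python
from collections import Counter


def summarise_verdicts(verdicts):
    """Count, for each condition, how many experiments voted up / down / none.

    verdicts: iterable of (experiment, condition, call) tuples.
    """
    summary = {}
    for _experiment, condition, call in verdicts:
        summary.setdefault(condition, Counter())[call] += 1
    return summary


# Hypothetical verdicts for one gene from three experiments.
verdicts = [
    ("experiment 1", "brain", "up"),
    ("experiment 2", "brain", "up"),
    ("experiment 3", "brain", "none"),
    ("experiment 1", "liver", "down"),
]

for condition, votes in summarise_verdicts(verdicts).items():
    print(condition, dict(votes))
```

Counting votes keeps each experiment's analysis independent, so experiments on different platforms never have to be normalised against each other.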
Mapping microarray probes to genes
- Every (~monthly) Atlas release takes the latest Ensembl gene – probe identifier mapping data
- From Ensembl genes, we also get:
  - Compara genes
  - External references (xrefs) to other databases, e.g. UniProt protein IDs, NCBI RefSeq IDs, HGNC gene symbols, Gene Ontology terms, InterPro terms
[Diagram labels: expression data per probe; probe identifiers; Ensembl genes]
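Keeping expression data keyed by probe, with the probe-to-gene mapping stored separately, is what allows the mapping to be refreshed at every release without touching the data. A minimal sketch of that lookup follows. The probe 210040_at and its gene KCC2 appear on the detailed expression profile slide later in this deck. The second probe, the unmapped probe and all expression values are invented.

```python
# Probe-to-gene mapping; in the Atlas this would be refreshed from Ensembl.
probe_to_gene = {
    "210040_at": "KCC2",
    "hypothetical_probe_at": "KCC2",  # invented second probe for the same gene
}

# Expression data stay keyed by probe (invented values, one per sample).
probe_expression = {
    "210040_at": [7.2, 7.9, 3.1],
    "hypothetical_probe_at": [6.8, 7.5, 2.9],
    "unmapped_probe_at": [4.0, 4.1, 4.2],
}


def probes_for_gene(gene, mapping):
    """Return all probe identifiers currently mapped to a gene."""
    return sorted(probe for probe, mapped in mapping.items() if mapped == gene)


def expression_for_gene(gene, mapping, expression):
    """Return per-probe expression values for every probe mapped to a gene."""
    return {
        probe: expression[probe]
        for probe in probes_for_gene(gene, mapping)
        if probe in expression
    }


print(expression_for_gene("KCC2", probe_to_gene, probe_expression))
```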
Example Atlas use case: KCC2 gene and BPA
Scenario: you study the health impact of bisphenol A (BPA)
- BPA: a common additive in household plastic items
- Negative health effects have been linked to BPA, e.g. on foetal and neonatal brain development
- PNAS paper (Yeo et al., 2013): “Bisphenol A delays the perinatal chloride shift in cortical neurons by epigenetic effects on the Kcc2 promoter”
- [Diagram labels: BPA; epigenetic downregulation; potassium chloride cotransporter 2 (Kcc2) mRNA levels ↓]
Your questions:
- In which human organ/tissue is the KCC2 gene differentially expressed?
- Under what condition(s) is the human KCC2 gene differentially expressed?
- What is the expression pattern of KCC2/Kcc2 orthologues?
“Old” Atlas home page
- Query for a single gene or a group of genes
- Restrict query by direction of differential expression (up, down, both, neither)
- Query for conditions
- The ‘advanced query’ option allows building more complex queries
Gene search (old Atlas): human KCC2 gene
(1) Summarised expression data for one gene
- Default: sort by levels of differential expression
- Group by experimental factor / intent
- Clicking on a factor/condition changes the profile display
(2) The anatomogram
(3) Detailed expression profile
- Drill down to 1 probe (210040_at), mapped to 1 gene (KCC2), in 1 experiment (E-GEOD-3526)
- Samples mapped to the “brain” experimental factor by EFO
(4) Jump to orthologues from gene summary
- Orthology comes from the Ensembl Compara database
(5) Compare orthologues with parallel heatmaps
Baseline Atlas construction
Only RNA-seq data sets are used.
Input: fastq files (read 1 and read 2), e.g.
    @read_name/1
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    @read_name/2
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
1. Align with TopHat, using the reference genome from Ensembl; output: bam file of mapped reads
2. Cufflinks; output: FPKMs
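FPKM stands for fragments per kilobase of transcript per million mapped fragments. The function below shows only the basic definition with invented numbers. Cufflinks estimates abundances with a more elaborate statistical model, so this is a sketch of what the unit means rather than a re-implementation of the pipeline.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragments: fragments mapped to this transcript
    transcript_length_bp: transcript length in base pairs
    total_mapped_fragments: all mapped fragments in the sample
    """
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Invented example: 500 fragments on a 2,000 bp transcript,
# in a library of 20 million mapped fragments.
print(fpkm(500, 2000, 20_000_000))  # 12.5
```

Dividing by transcript length and library size is what makes values comparable between long and short genes and between deeply and shallowly sequenced samples.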
Baseline Atlas search for human KCC2
Baseline Atlas search results
Human KCC2 gene in Baseline Atlas
- FPKM threshold slider
Old Atlas ‘condition-only’ query
Old Atlas ‘condition-only’ query (cont’d): heatmap view
Old Atlas gene + condition query
Old Atlas query refining
Old Atlas query refining: AND
QUESTIONS?
Hands-on exercise 3: Find information on Tbx5 expression in mouse in relation to Holt-Oram syndrome
Hands-on exercise 4: Find transcription factor genes belonging to the androgen signaling pathway in prostate cancer
Diff. atlas changes: (1) analysis pipeline
How differential expression is calculated in one experiment: “Is gene X differentially expressed in condition 1 in this experiment?”
[Figure: each point is a single expression value for gene X; the Cond.1, Cond.2 and Cond.3 means and the mean of all samples are marked]
Create “contrasts” and calculate statistic
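A "contrast" here is a pairwise comparison between a test condition and a reference condition. The sketch below computes a log2 fold change per contrast, assuming the expression values are already on a log2 scale so that a difference of means is a log2 fold change. The choice of Cond.2 as reference, the values and the function name are all invented, and the test statistic itself is not reproduced.

```python
from statistics import mean


def contrast_log2_fold_changes(log2_expression, contrasts):
    """Return the log2 fold change (test mean - reference mean) per contrast.

    log2_expression: dict mapping condition -> list of log2 expression values.
    contrasts: list of (test_condition, reference_condition) pairs.
    """
    return {
        f"{test} vs {reference}": mean(log2_expression[test]) - mean(log2_expression[reference])
        for test, reference in contrasts
    }


# Invented log2 expression values for gene X.
gene_x = {
    "Cond.1": [8.0, 9.0, 10.0],
    "Cond.2": [5.0, 6.0, 7.0],
    "Cond.3": [5.0, 6.0, 7.0],
}

# Hypothetical design: Cond.2 is treated as the reference.
contrasts = [("Cond.1", "Cond.2"), ("Cond.3", "Cond.2")]
print(contrast_log2_fold_changes(gene_x, contrasts))
# {'Cond.1 vs Cond.2': 3.0, 'Cond.3 vs Cond.2': 0.0}
```

Unlike a comparison against the mean of all samples, each contrast answers a concrete question such as "is gene X higher in Cond.1 than in Cond.2?".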
Diff. atlas changes: (2) modern interface
- Lots of mouse-over tips/help (?)
- FDR cut-off
- Clearer indication of experimental factor and contrast
- Colour gradient showing significance of differential expression
- Experiment design, data analysis methods, full analytics data for download
- MA plots
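An FDR cut-off is applied to p-values that have been adjusted for multiple testing. The slide does not say which procedure the Atlas uses. The sketch below shows the widely used Benjamini-Hochberg adjustment as an assumed example, with invented p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values for four genes in one contrast.
p_values = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(p_values)

FDR_CUTOFF = 0.05  # example cut-off, as would be set in the interface
significant = [q <= FDR_CUTOFF for q in adjusted]
print(significant)  # [True, False, False, False]
```

Note that the gene with p = 0.04 passes an unadjusted 0.05 threshold but not the FDR cut-off, which is the practical effect of the correction.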
Diff. atlas changes: (2) modern interface
- Clearer indication of experimental factor and contrast
Diff. atlas changes: (3) verdict “summary”?
- What if there are differences in sample attributes?
Diff. atlas changes: (4) Histograms?
QUESTIONS?
ArrayExpress-Atlas Crossword
Find out more about the two databases
- Visit our eLearning portal, Train Online, for tutorials on ArrayExpress and Expression Atlas: http://www.ebi.ac.uk/training/online/
- ArrayExpress Bioconductor R package: http://bioconductor.org/packages/release/bioc/html/ArrayExpress.html
- ArrayExpress help: www.ebi.ac.uk/arrayexpress/help/index.html
- Email us at: miamexpress@ebi.ac.uk
- Atlas mailing list: arrayexpress-atlas@ebi.ac.uk
Open-source tools for FG data analysis
- GenePattern (Broad Institute): http://www.broadinstitute.org/cancer/software/genepattern/
- GenomeSpace (incorporates GenePattern; ArrayExpress provides a link to send data directly to GenomeSpace): http://genomespace.org/
- Galaxy (allows more modular customisation of workflows)
- Bioconductor, in R (comprehensive help documentation on standard workflows): http://www.bioconductor.org/help/
- Bioconductor Case Studies (Hahne et al.)
- Microarray Technology in Practice (Russell et al.)
Data submission to ArrayExpress Archive
Data submission to ArrayExpress
- Read this help page carefully before preparing any files
- Use the MAGE-TAB submission tools to create a tailor-made template spreadsheet (IDF and SDRF) for your experiment
Submission of HTS data
- ArrayExpress acts as a “broker” for the submitter
- Meta-data and processed data: ArrayExpress
- Raw sequence reads* (e.g. fastq, bam): ENA
* See http://www.ebi.ac.uk/ena/about/sra_data_format for accepted read file formats
What happens after submission?
- You can keep data private until publication; we will provide login account details to you and the reviewer for private data access
- Email confirmation
- Submission is ‘closed’, so no more editing on your end
- Curation: we will email you with any questions, and may ‘re-open’ the submission for you to make changes
- Get your submission in the best possible shape to shorten curation and processing time!
Submission checklist
Need help with submitting your data?
- Visit our eLearning portal, Train Online, for the specific tutorial on how to submit data using MAGE-TAB: www.ebi.ac.uk/training/online/course/arrayexpress-submitting-data-using-mage-tab
- ArrayExpress help page on submissions: www.ebi.ac.uk/arrayexpress/help/submissions_overview.html
- Watch this short YouTube video on how to navigate the MAGE-TAB submission tool: http://youtu.be/KVpCVGpjw2Y
- Email curators at: miamexpress@ebi.ac.uk