390 likes | 532 Views
Microarray Gene Expression Analysis Differential expression, clustering, networks, and functional enrichment. STEMREM 201 Fall 2012 Aaron Newman, Ph.D. 10/17/12. A genomics approach to biology involves…. A plethora of techniques and tools exist …
E N D
Microarray Gene Expression AnalysisDifferential expression, clustering, networks, and functional enrichment STEMREM 201 Fall 2012 Aaron Newman, Ph.D. 10/17/12
A genomics approach to biology involves… • A plethora of techniques and tools exist… • My goal is to introduce some practical, powerful, and freely available methods for gene expression analysis • Finding significant patterns in high-throughput data • Interpreting these patterns in the context of prior knowledge • Generating new hypotheses and predictions
Example workflow Gene expression profiling GSEA DEGs Cluster analysis Functional enrichment Network analysis Biological meaning
Materials and Methods GSE22651 ESCs and differentiated cells GSEA Excel & GenePattern Hierarchical & AutoSOME DAVID Toppfun STRING & Cytoscape Biological meaning
Microarray normalization • Probe/array-level normalization (reduce inter-array technical errors and noise) • Raw CEL files merged and normalized text file • Robust Multi-Chip Averaging (RMA) for Affymetrix arrays • Affymetrix Gene Expression Console software • QuantileNormalization for Agilent, Illumina, and others. • Sets all arrays to the same distribution (mean, median, sd, etc.) • Analysis-level normalization • Log2 Transformation • Improves statistical properties for analysis (log-normal) • Median/mean centering • Useful for reducing impact of transcript abundance on identifying/visualizing co-expressed genes • Unit variance • Standardize each column (array) to mean of 0 and standard deviation of 1 center
Differentially expressed genes (DEGs) • Goal: given known phenotypes, determine which genes exhibit significant differential expression. • Many genes will be tested; multiple hypothesis testing should be performed to control the false discovery rate (FDR). • Bonferroni (α / n) • Benjamini-Hochberg (largest k s.t. P(k) ≤ (αk) / n) • StoreyQ-value (p-value specific FDR) • Q-value software- http://genomics.princeton.edu/storeylab/qvalue/ • Sample permutations can improve p-value accuracy • *Only apply with ≥ 10 samples per class • Implemented in GenePattern (ComparativeMarkerSelection module), SAM
Supervised DEG identification tools (i.e., classes are known) • Likely familiar • Excel (T-test) coupled with FDR assessment • Watch for conversion to dates! (e.g., MARCH6 3/6) • Basic to Intermediate • Statistical Analysis of Microarrays (Windows only) • GenePattern (Broad Institute) • p.adjust (R) • Advanced • Bioconductor packages in R (e.g., limma)
Tutorial 1:Identifying DEGs in Excel using FDR cutoff of 1% (BH method) • Open “DEG_example_large.txt” in Excel (2 classes, 25151 genes). • In column L, use T-test function to test for significant differential expression between ESCs and non-ESCs. • =TTEST(“ESCs”, “non-ESCs”, 2, 3) • 2-tailed, unpaired with unequal variance • Sort p-values in column L in ascending order. • In column M, input p-value rank, going from 1,2,3…n. • Input following formula into column N to test for FDR of 1% • In general: = (0.01 * rank) / n • In our case: = (0.01 * M2) / 25151 • Autocomplete column N • In column O, test for p-values that do not exceed FDR of 1% • = if(L2 <= N2, 1, 0) • That’s it! All genes with a 1 in column O are significant at an FDR of 1%. • Should be 3237 significant genes.
Cleaning up result and creating a heat map • Calculate log2 fold change in column P • = AVERAGE(B2:F2) – AVERAGE(G2:K2) • Autocomplete • Use filter to isolate significant genes with absolute fold change ≥ 5 • Copy and paste into new tab • Sort by fold change in descending order • Save as new file (e.g., DEG_example_sorted_fdr01_fold5.txt) • Center genes in Cluster 3.0 • Open file • Go to “Adjust Data” • Check “Center genes” and select “Median” • Press “Apply”, then save file (e.g. DEG_example_sorted_fdr01_fold5_medcen.txt) • Open in Java TreeView. • To customize heat map display and text, use “Settings>Pixel Settings”, and “Settings>Font Settings”. • Export heat map image using “Export>Save Tree Image”.
Expected result FDR ≤ 1% Fold change ≥ abs(25) = 79 genes
Tutorial 2:Identifying DEGs in GenePattern • Create account (if you have not already) and log in to GenePattern • Select “Differential Expression Analysis”. • Skip to ComparativeMarkerSelection (step 2) and click “Open module”. • Enter following fields: • input* file = GSEA_example_expression_large.gct • cls* file = GSEA_example_classes.cls • number of permutations* = 0 • Select “Run” • When processing is finished, select “ComparativeMarkerSelectionViewer” from the pulldown menu next to “GSEA_example_expression_large.comp.marker.odf” • Select “Run” • Select “Open Visualizer” and “Allow” • A table will appear showing all genes and various statistics, including: • Benjamini_Hochberg corrected p-value; FDR(BH) • Storey q-value; Q Value • Bonferroni p-value • Fold change
Extracting data and displaying as a heat map • Pipe .odf file to “ExtractComparativeMarkerResults” module. • Select genes with BH FDR ≤ 1%. • Filtered data are available as a new .gct file. • To display a heat map, pipe .gct file to “HeatMapViewer” module. • Select “Run”
Example workflow Gene expression profiling GSEA DEGs Cluster analysis Functional enrichment Network analysis Biological meaning
The “Clustering Problem” for Large Data Sets Widespread Importance • Genomics • Phylogenetics • Disease • Galaxy Clusters • etc., etc., etc.
Common clustering methods Toy data set Hierarchical (Eisen) SOM K-means Figure 1. D’haeseleer, Nat Biotechnol. 2005
Tutorial 3: Hierarchical Clustering • Open “GSE22651_filtered.txt” in Cluster 3.0 • Normalize data (Adjust Data tab) • Check “Log transform data” • Check “Center genes” and select “Median” • Press “Apply Button” • Cluster data (Hierarchical tab) • Check “Cluster” under Genes and under Arrays • Leave “Similarity Metric” at Uncentered Correlation • Press “Centroid linkage” under “Cluster method” • Centroid = distance between cluster centers • Single = closest distance between clusters • Complete = farthest distance between clusters • Average = mean of all pairwise distances between clusters • Open clustered data table file (*.cdt) in Java TreeView.
Java TreeView Navigate the cluster tree to highlight genes with distinct expression patterns in particular samples Export>”Save List” to copy or save gene lists of interest for further analysis.
Expected result Genes Samples
Automatic clustering of Self-Organizing Map Ensembles AutoSOME Serial application of - SOM - Density Equalization - Minimum Spanning Tree - Ensemble Averaging Newman and Cooper (2010) BMC Bioinformatics, 11:117
Tutorial 4:Identifying discrete clusters without prior knowledge of cluster number • Launch the AutoSOME GUI via the large launch button. • Open “GSE22651_filtered.txt” • Skip filtering • Show “Basic Fields” • Set p-value = 0.05 • Show “Input Adjustment” • Check Log2 Scaling, Unit Variance, Median Center, and Sum Squares = 1 • Press the large “Run” button on the left. • From the menubar, select View>heat map> green red. • Select cluster 1 in the cluster list. • The data are rendered as a normalized heat map. To change the display, go to View>settings>image settings. • Under the Normalization tab, check “Display Original Data”, “Log2 Scaling”, and “Median Center”. • Check “Manually adjust range for contrast” and set minimum to -2 and maximum to 2. Press “Update” (lower left corner).
AutoSOME continued… • Select clusters 1 to 5 (hold down shift). Right click mouse in heat map window to resize. • Set “Zoom Factor” to 40 and Press “Save”. • See website for further tutorials and documentation. View>heat map>rainbow Representative heat map
Example workflow Gene expression profiling GSEA DEGs Cluster analysis Functional enrichment Network analysis Biological meaning
How to interpret clustering results • Gene set functional annotation • DAVID • Toppfun • MSigDB • Network analysis • STRING http://david.abcc.ncifcrf.gov/ http://toppgene.cchmc.org/enrichment.jsp http://www.broadinstitute.org/gsea/msigdb/ http://string-db.org/
Gene sets: where do they come from? • Gene ontology • An attempt to semantically organize genes and their functional relationships. • Data are arranged in a graph structure, from broad to specific • Ontologies: • Biological process (BP): series of ordered events • Molecular function (MF): activities that occur at the molecular level • Cellular component (CC): part of a cell • Biocarta/KEGG pathways (curated wiring diagrams) • High-throughput studies e.g., MSigDB
Tutorial 5: Functional annotation of a large cluster using DAVID http://david.abcc.ncifcrf.gov/ • In AutoSOME, find the cluster of genes with higher expression in stem cells. • Go to View>raw data. Highlight all genes in the cluster and copy the list. • Open the DAVID homepage and press “Start Analysis” • Select “Upload” tab and paste in gene list. • Under “Select Identifier,” select “OFFICIAL_GENE_SYMBOL”. • Select “Gene List” for “List Type”. • Submit the list. • Press OK at multi-species warning message. • Select “Homo sapiens” and press “Select Species”. • Select “Functional Annotation Tool” • Press “Functional Annotation Clustering” button.
DAVID Output Similar gene sets are clustered together, eliminating redundancy and facilitating interpretation Gene sets Pathways
Tutorial 6: Protein-protein associations with STRING Top 50 non-ESC genes
Protein-protein association network among top 50 non-ESC genes Evidence types
Example workflow Gene expression profiling GSEA DEGs Cluster analysis Functional enrichment Network analysis Biological meaning
Gene set enrichment in a ranked list • Gene Set Enrichment Analysis (GSEA) • “Threshold-less” • Arbitrary DEG cutoffs are avoided • Two modes of operation to rank input genes: • Rank by differential expression between phenotypes • Default metric is signal to noise, defined as: (avg[class 1] – avg[class 2]) / (sd[class 1] + sd[class 2]) • Pre-ranked according to user-defined criteria • Evaluates statistical bias in the distribution of each defined gene set over the list of ranked input genes.
Input format • Input: • Expression data (gene cluster text file format, *.gct) • Classes (*.cls) • File extensions matter! • If using Notepad in Windows, set Save as type: to “All Files (*.*) • If using TextEdit in Mac, go to Preferences > Open and Save > and uncheck ‘Add “.txt” extension to plain text files’. • Formatting instructions: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
Tutorial 7: GSEA • Input files: GSEA_example_expression_large.gct, GSEA_example_classes.cls Load data Run GSEA
Summary Gene expression profiling GSEA DEGs Cluster analysis Functional enrichment Network analysis Biological meaning