Explore statistical algorithms, such as SAM and J-Score, for analyzing microarray data with limited replicates. Compare results from 8x8 and 3x3 analyses, and consider strategies for a final gene selection approach.
Microarray Analysis with a Small Number of Replicates By Kung-Hua Chang & Dhondup Pemba Mentors: Cecilie Boysen, Ph.D & Jim Breaux, Ph.D Southern California Bioinformatics Institute Summer 2005 Funded By NSF/NIH
Outline • Background • - Affymetrix GeneChip® Microarrays • - VMAxS • - Steps in Microarray Data Analysis • Our Task • - Statistical Analysis with a Small Number of Replicates • - Functional Analysis • - Additional Projects
Affymetrix GeneChip® Microarrays • Signal detection. Fluorescence detection of hybridization between RNA target and oligonucleotide probe. 22 Probes define one gene FOR MORE INFO... http://www.affymetrix.com
Each gene on an Affy chip is represented by a probe set • Perfect Match (PM) probe represents a short segment of the gene of interest. • Mismatch (MM) probe measures background signal. • Data for a probe set is summarized into a single number (“gene-level” data). FOR MORE INFO... “Processing Affy chip Data: GCOS/MAS 5.0, RMA, and gcRMA” (Roger Bumgarner, University of Washington).
VMAxS • ViaLogy’s data analysis service for DNA microarray chip data • Employs Quantum Resonance Interferometry technology to detect signals below background noise FOR MORE INFO... Visit Vialogy.com.
Steps in Microarray Data Analysis • Raw Data Image • Image Analysis (extract cell-level data) [VMAxS] • Gene-level summarization • Normalization (remove non-biological variation) • Statistical Analysis (select differentially expressed genes) • Functional Analysis (identify affected processes and pathways)
Statistical Analysis with a Small Number of Replicates Overview • Overall objective: Perform end-to-end analysis on a client’s microarray data set (from raw image to pathway analysis) • Problem: Dataset contained a small number of replicates
Problem with small number of replicates • A small number of replicates yields unreliable estimates of each gene's variance. • With seven replicates, we are more confident that gene 1 is upregulated. FOR MORE INFO... Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays (Jain et al.)
Approach to dealing with a small number of replicates • Analyze a larger data set that has a good number of replicates (n = 8x8). • Assume this is the “truth” • Analyze a randomly selected subset of this data set (n = 3x3) using three different algorithms. • Compare output from 8x8 analysis to 3x3 analysis. • Decide how to analyze client’s data set based on results
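The subsetting step above can be sketched as follows. This is only an illustration under stated assumptions: "8x8" is read here as 8 arrays in each of two conditions, and the array labels (`C1`..`C8`, `T1`..`T8`), the function name `pick_subset`, and the seed are hypothetical, not taken from the slides.

```python
import random

def pick_subset(control_ids, treated_ids, k=3, seed=0):
    """Randomly choose k arrays from each condition (a 3x3 subset when k=3).

    A fixed seed makes the draw reproducible, so the 3x3 analysis can be
    rerun on exactly the same arrays.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(control_ids, k)), sorted(rng.sample(treated_ids, k))

# Hypothetical array labels for the larger 8x8 data set.
control = [f"C{i}" for i in range(1, 9)]
treated = [f"T{i}" for i in range(1, 9)]

sub_control, sub_treated = pick_subset(control, treated, k=3, seed=2005)
print(sub_control, sub_treated)
```

Drawing several different random subsets (different seeds) would show how much the 3x3 results depend on which arrays happen to be chosen.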
Statistical Analysis Algorithms • SAM: Significance Analysis of Microarrays (Tusher, Tibshirani & Chu) • J-Score (Jim Breaux) • Cyber-T (Baldi & Long)
SAM • Each gene receives a score based on the difference in average gene expression relative to the standard deviation of the repeated measurements. • Genes with scores greater than a threshold are considered significant. • This threshold is determined by the false discovery rate the user desires. FOR MORE INFO... Significance analysis of microarrays applied to the ionizing radiation response (Tusher et al.)
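A minimal sketch of a SAM-style score for one gene, assuming the "relative difference" form described by Tusher et al.: the difference in group means divided by a pooled standard error plus a small constant `s0`. The function name and the toy values are illustrative. The permutation step that SAM uses to turn a threshold into a false discovery rate is not shown.

```python
from math import sqrt
from statistics import mean

def sam_style_score(group_a, group_b, s0=0.0):
    """Difference in means divided by (pooled standard error + s0).

    s0 is a small positive constant that keeps genes with tiny variance
    from receiving very large scores; its value here is an assumption.
    With s0=0 the score is undefined for a gene with zero variance.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    # Pooled sum of squared deviations across both groups.
    ss = sum((x - mean_a) ** 2 for x in group_a) + sum((x - mean_b) ** 2 for x in group_b)
    s = sqrt((1 / n_a + 1 / n_b) * ss / (n_a + n_b - 2))
    return (mean_b - mean_a) / (s + s0)

# Toy log-expression values for one gene with three replicates per condition.
d_plain = sam_style_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
d_damped = sam_style_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], s0=1.0)
print(d_plain, d_damped)
```

With only three replicates per group the standard error is itself noisy, which is the problem the slides describe.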
J-Score • Each gene receives a score based on average fold-change in gene expression relative to the standard deviation of the repeated measurements. • Cut-off for selection of “significant” genes is arbitrary.
Cyber-T ‘Regularized t-test’ (Baldi & Long) • “Assumes genes of similar expression levels have similar measurement errors. • The variance of any single gene can be estimated from the variance from a number of genes of similar expression level. • The variance of any gene within any given treatment can be estimated by the weighted average of a prior estimate of variance for that gene.” FOR MORE INFO... Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework (Long et al.).
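The weighted-average idea can be sketched as below. This assumes the regularized variance takes the form given by Baldi & Long, (ν0·σ0² + (n−1)·s²) / (ν0 + n − 2), where σ0² is a background variance taken from genes of similar expression level and ν0 is the weight given to that prior. The function name and the numbers are illustrative only, and how σ0² is obtained from neighbouring genes is not shown.

```python
def regularized_variance(sample_var, n, prior_var, prior_df):
    """Weighted average of a gene's own variance and a prior variance.

    sample_var: variance of this gene's n replicate measurements
    prior_var:  background variance from genes of similar expression level
    prior_df:   weight (pseudo-observations) given to the prior; assumed value
    """
    return (prior_df * prior_var + (n - 1) * sample_var) / (prior_df + n - 2)

# With n = 3 replicates the prior dominates; with more replicates the
# gene's own variance carries more weight.
v_small_n = regularized_variance(sample_var=0.04, n=3, prior_var=0.10, prior_df=10)
v_large_n = regularized_variance(sample_var=0.04, n=30, prior_var=0.10, prior_df=10)
print(v_small_n, v_large_n)
```

The regularized variance then replaces the ordinary sample variance in the denominator of the t-statistic.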
Results: Comparison between SAM 8x8 and 3x3 methods • At 1% False Discovery Rate (FDR) SAM 8x8 picked up 762 significant genes (estimated number of false significant genes = 8). • Agreement between SAM 8x8 and the top 1000 genes from the 3x3 methods:
Results: Comparison between 3x3 methods • Venn Diagram: Union of all three methods = 433 unique genes
Results: Comparison between 3x3 methods • Agreement between any two methods: • These findings are consistent with a previous study by a group at NIH (Hosack et al.): • Found that agreement between various methods tested ranged from 7% to 60%.
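One way to compute agreement between two gene lists is sketched here. The slides do not say how agreement was defined, so the measure used (shared genes divided by the size of the smaller list) is an assumption, and the gene IDs are made up.

```python
from itertools import combinations

def agreement(list_a, list_b):
    """Fraction of the smaller list that also appears in the other list."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / min(len(a), len(b))

# Hypothetical top-gene lists from three methods.
top_genes = {
    "SAM": ["g1", "g2", "g3", "g4"],
    "J-Score": ["g3", "g4", "g5", "g6"],
    "Cyber-T": ["g1", "g3", "g4", "g7"],
}

# Agreement for every pair of methods.
pairwise = {
    (m1, m2): agreement(top_genes[m1], top_genes[m2])
    for m1, m2 in combinations(top_genes, 2)
}
for pair, value in pairwise.items():
    print(pair, f"{value:.0%}")
```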
Possible Approaches for Final Analysis • Method 1: Final set of significant genes is derived from the method that had the most overlap with SAM 8x8 (J-Score). • Final result: • 1000 total significant genes • At most 356 true positives • At most 652 false positives • Pro: • Decent number of true positives • Con: • Large number of false positives • Might be missing important genes found by other two methods
Possible Approaches for Final Analysis • Method 2: Final set of significant genes is the intersection of the three methods. • Final result: • 174 total significant genes • At most 174 true positives • At most 8 false positives • Pro: • Lowest number of false positives • Con: • Lowest number of true positives
Possible Approaches for Final Analysis • Method 3: Final set of significant genes is the union of the three methods • Final result: • 1631 total significant genes • At most 433 True positives • At most 1206 False positives • Pro: • Highest number of true positives. • Con: • Highest number of false positives
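The bounds quoted for the three methods are consistent with a simple calculation, sketched below. This reading is inferred from the numbers and is not stated on the slides: "true positives" is taken as the overlap with the SAM 8x8 list, and the false-positive bound as everything outside that overlap plus the 8 genes estimated to be false within SAM 8x8.

```python
def positive_bounds(total_selected, overlap_with_reference, est_false_in_reference=8):
    """Upper bounds on true and false positives for a selected gene list.

    Assumed reading: genes shared with the reference (SAM 8x8) list are at
    most all true, and the remaining genes plus the reference list's own
    estimated false calls are at most all false.
    """
    max_true = overlap_with_reference
    max_false = total_selected - overlap_with_reference + est_false_in_reference
    return max_true, max_false

# (total selected, overlap with SAM 8x8) as reported for each method.
methods = {
    "Method 1 (J-Score)": (1000, 356),
    "Method 2 (intersection)": (174, 174),
    "Method 3 (union)": (1631, 433),
}
bounds = {name: positive_bounds(total, overlap) for name, (total, overlap) in methods.items()}
for name, (max_true, max_false) in bounds.items():
    print(name, max_true, max_false)
```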
Final Approach • Return the largest number of true positives to the client (Method 3). • To deal with the large number of potential false positives in the results, we rank each gene by the average of its ranks from the Cyber-T, J-Score, and SAM methods. • For example, if “Gene 02” is ranked number 2 in Cyber-T, number 3 in J-Score, and number 4 in SAM, then its overall ranking is (2 + 3 + 4) / 3 = 3. • Higher overall ranking (a smaller average rank number) = more likely to be a true positive
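The rank-averaging step can be sketched as follows. The gene names and ranks are made up apart from the "Gene 02" example above, and the sketch assumes every gene has a rank from all three methods; how genes missing from one method's list were handled is not described in the slides.

```python
from statistics import mean

def combined_ranking(*rank_tables):
    """Average each gene's rank across methods and sort best (smallest) first."""
    genes = rank_tables[0].keys()
    average_rank = {gene: mean(table[gene] for table in rank_tables) for gene in genes}
    return sorted(average_rank.items(), key=lambda item: item[1])

# Hypothetical ranks (1 = most significant) from each method.
cyber_t = {"Gene 01": 1, "Gene 02": 2, "Gene 03": 3, "Gene 04": 4}
j_score = {"Gene 01": 1, "Gene 02": 3, "Gene 03": 2, "Gene 04": 4}
sam = {"Gene 01": 1, "Gene 02": 4, "Gene 03": 2, "Gene 04": 3}

ranking = combined_ranking(cyber_t, j_score, sam)
for gene, avg in ranking:
    print(gene, round(avg, 2))
```

Genes near the top of this combined list are the ones all three methods rank highly, so they are the most likely true positives within the union.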
Functional Analysis • Mapping to biological processes. • - EASE, the Expression Analysis Systematic Explorer from the National Institute of Allergy and Infectious Diseases at the National Institutes of Health. • Mapping to pathways. • - PathwayAssist software from Ariadne Genomics. FOR MORE INFO... http://apps1.niaid.nih.gov/david/ http://www.ariadnegenomics.com/products/pathway.html
Mapping to biological processes • The lists of up- and down-regulated genes were entered into EASE. • The lower the EASE score, the more highly ranked the process. • Example of the top 14 processes, locations and functions found from our significant genes.
Mapping to pathways • Genes 1, 2 and 3 are significantly up- or down-regulated according to our combination method. • Investigation of gene 1 reveals that genes 2 and 3 are involved in gene 1’s pathway.
Conclusion • Three algorithms for selecting differentially expressed genes produced different lists of genes with ~60% to 70% agreement. • Taking the union of the results from the three algorithms yielded the most true positives for our client. • Biological processes and pathways found through functional analysis correspond to what we expected based on the samples studied. • Functional analysis therefore helps to make microarray results more believable.
Additional Projects: Chris’s GUI • Automation of the previously discussed analyses with a GUI.
Additional Projects: Dhonam’s GUI • ViaLogy has individual scripts that are used to test quality of VMAxS output. • Current implementation requires working knowledge of R scripting. • Project: implement a user-friendly GUI program to execute multiple QC tests.
Dhonam’s GUI: Screen 3 • Optional window pops up if default parameters are not desired
Acknowledgements • SoCalBSI • - Dr. Sandra Sharp • - Dr. Wendie Johnston • - Dr. Jamil Momand • - Dr. Nancy Warter-Perez • - Other SoCalBSI Staff and Faculty • - SoCalBSI 2005 Participants • - Lien Chung (SoCalBSI Participant 2004) • ViaLogy • - Dr. Cecilie Boysen • - Dr. Jim Breaux • - Other ViaLogy Employees
References • Hosack DA, Dennis GJ, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE. Genome Biol 2003, 4:R70. • Cope LM, Irizarry RA, Jaffee HA, Wu J, Speed TP. A benchmark for Affymetrix GeneChip expression measures. Bioinformatics 2004; 20: 323–331. • Long, A.D., Mangalam, H.J., Chan, B.Y.P., Tolleri, L., Hatfield, G.W., and Baldi, P. (2001) Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. The Journal of Biological Chemistry 276(23): 19937-19944. • Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee JK: Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics 19(15): 1945-1951 (2003). • Processing Affy chip Data: GCOS/MAS 5.0, RMA, and gcRMA (Roger Bumgarner) • Saviozzi S, Calogero RA. Microarray probe expression measures, data normalization and statistical validation. Comp Funct Genom 2003; 4: 442–446. Conference review. • Tusher, V.G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response, PNAS, 98, 5116-5121. • http://www.tau.ac.il/lifesci/bioinfo/teaching/2002-2003/Differential_Genes_Dec03.ppt • http://www.kochi-u.ac.jp/~tatataa/RA/RA-targets.html • http://www.biostat.jhsph.edu/~ririzarr/Teaching/688/04-preproc-norm.pdf/ • http://nibn.bgu.ac.il/core_units/microarray_facility/microarray_technique.htm • http://www.Vialogy.com