Microarray-based Disease Prognosis using Gene Annotation Signatures Michael Kovshilovsky Swapna Annavarapu SoCalBSI 2005
Internship site: BioDiscovery, Inc. • Mentor: Dr. Bruce Hoff • Source of Funding: BioDiscovery, Inc.
Motivation • Microarray gene-expression profiling studies aim to predict disease outcomes (e.g., cancer outcome). • The goal is to improve treatment of patients based on knowledge of their gene-expression profile (molecular signature).
Lancet Paper • “Prediction of cancer outcome with microarrays: a multiple random validation strategy” • Findings of Stefan Michiels et al.: gene-expression microarray-based predictors of clinical outcome have been overly optimistic, and careful review shows that performance is poor and variable. • Analyzed data from the 7 largest published studies that attempted to predict the prognosis of cancer patients based on DNA microarray analysis. • Used a random sampling approach.
Goal • Reproduce the results of the Lancet paper. • Compare classification based on expression levels of microarray probes with classification based on GSEA scores of biological pathways. • Validate our hypothesis: by abstracting away from the gene-expression domain to that of biological properties, performance should stabilize and improve.
Phase I : Reproduce the Lancet Paper (Gene-Expression based classification)
Methodology • Data loading • Data preprocessing • Data selection • Correlating with clinical outcome • Determine the molecular signature • Classification of data
Data Loading • Read Affymetrix chip expression data. [Sample data table not shown]
Data Preprocessing • Scaling: identify the present, absent and marginal expression levels; scale the average fluorescent intensity of all genes to a constant target intensity of 2500; cap expression values above 45000 at 45000 and set those below 100 to 1. • Filtration: eliminate genes with low or no variance. • Log transformation: log2(values).
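The preprocessing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the function names and the `min_variance` threshold are hypothetical, the cap and floor values are taken from the slide as written, and scaling is assumed to mean multiplying each chip by a constant so that its mean intensity equals 2500.

```python
import math
from statistics import mean, pvariance

TARGET_INTENSITY = 2500.0   # target mean intensity (from the slide)
CEILING = 45000.0           # values above this are capped (from the slide)
FLOOR_THRESHOLD = 100.0     # values below this are replaced (from the slide)
FLOOR_VALUE = 1.0           # replacement value, as stated on the slide


def preprocess_chip(chip_values):
    """Scale one chip to the target mean, clip extremes, then log2-transform."""
    factor = TARGET_INTENSITY / mean(chip_values)
    result = []
    for value in chip_values:
        value = value * factor
        if value > CEILING:
            value = CEILING
        elif value < FLOOR_THRESHOLD:
            value = FLOOR_VALUE
        result.append(math.log2(value))
    return result


def genes_to_keep(matrix, min_variance=0.0):
    """Return indices of genes (rows) whose variance across samples exceeds
    min_variance. The threshold is a hypothetical parameter."""
    return [i for i, row in enumerate(matrix) if pvariance(row) > min_variance]
```

In this sketch the variance filter is applied to a genes-by-samples matrix after each chip has been scaled and transformed.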
Data Selection • Training-validation approach: • Training set for identifying the molecular signature. • Validation set for estimating the proportion of misclassifications. • Sets are drawn such that each includes half the patients with and half without a favorable outcome. • Dataset (N) → random selection → Training (n) + Validation (N − n)
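One possible reading of this split is sketched below. It assumes that the training set of size n is drawn at random with half of its patients having a favorable outcome and half an unfavorable one, and that the remaining N − n patients form the validation set. The function name and arguments are hypothetical.

```python
import random


def split_training_validation(outcomes, n_train, rng):
    """outcomes: list with +1 (favorable) or -1 (unfavorable) per patient.
    Returns (training indices, validation indices)."""
    favorable = [i for i, y in enumerate(outcomes) if y == 1]
    unfavorable = [i for i, y in enumerate(outcomes) if y == -1]
    half = n_train // 2
    # Assumption: the training set is balanced between the two outcomes.
    training = rng.sample(favorable, half) + rng.sample(unfavorable, n_train - half)
    chosen = set(training)
    validation = [i for i in range(len(outcomes)) if i not in chosen]
    return sorted(training), validation


# Usage: repeat with different seeds to obtain many random splits.
train_idx, valid_idx = split_training_validation([1] * 6 + [-1] * 6, 4, random.Random(0))
```

Repeating the call with fresh random states gives the repeated random training sets that the later result slides refer to.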
Correlation • Clinical outcome: favorable = 1 (continuous complete remission), unfavorable = −1 (relapse). • Correlate the expression values of each gene with the clinical outcome using Pearson’s correlation coefficient. • Determine the molecular signature, defined by the 50 genes most highly correlated with outcome.
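A small sketch of the signature selection follows. It assumes that "highest correlated" means largest absolute Pearson correlation with the ±1 outcome vector; the slide does not say whether the sign is taken into account, and the names `pearson` and `molecular_signature` are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant vector carries no correlation information
    return cov / (sd_x * sd_y)


def molecular_signature(expression, outcomes, size=50):
    """expression: dict mapping gene name -> values for the training patients.
    Returns the `size` genes with the largest absolute correlation."""
    scores = {gene: pearson(values, outcomes) for gene, values in expression.items()}
    ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
    return ranked[:size]
```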
Data Classification (Nearest Centroid Prediction Rule) • A new point is classified based on which centroid is nearest. • The data is 50-dimensional. • A PCA plot is used to visualize the data. • Principal component analysis (PCA) is a tool for analysing data by identifying patterns in it. [Figure: PCA plot showing the unfavorable and favorable centroids]
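The nearest-centroid rule can be illustrated with the sketch below. Euclidean distance is an assumption made here for simplicity, since the slide does not name the distance measure; the function names are hypothetical.

```python
import math


def centroid(samples):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(sample, favorable_centroid, unfavorable_centroid):
    """Return 1 (favorable) or -1 (unfavorable) for the nearer centroid.
    Ties go to the favorable class, an arbitrary choice."""
    d_fav = euclidean(sample, favorable_centroid)
    d_unf = euclidean(sample, unfavorable_centroid)
    return 1 if d_fav <= d_unf else -1
```

In the setting of this slide, each vector would hold the 50 signature values for one patient, and the two centroids would be computed from the training patients of each outcome class.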
Results(cont’d.) • Each of the 500 training sets provided a different molecular signature • Plot of genes that occurred most frequently in the molecular signature.
Analysis • The frequency of the genes participating in defining the signature is quite low. • This suggests that the molecular signature is selected almost randomly and is unstable.
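The stability check described here amounts to counting how often each gene appears across the signatures obtained from the resampled training sets. A minimal sketch, with a hypothetical function name and toy input:

```python
from collections import Counter


def signature_frequencies(signatures):
    """signatures: list of gene lists, one per training set.
    Returns a dict gene -> fraction of signatures containing it,
    ordered from most to least frequent."""
    counts = Counter(gene for signature in signatures for gene in set(signature))
    total = len(signatures)
    return {gene: count / total for gene, count in counts.most_common()}


# Toy usage with three tiny "signatures".
freqs = signature_frequencies([["a", "b"], ["a", "c"], ["a", "b"]])
```

A stable signature would show many genes with fractions close to 1; low fractions across the board indicate that membership depends heavily on which patients fall in the training set.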
Phase II Analysis of Microarray data using GSEA (Gene Set Enrichment Analysis) http://www.nature.com/ng/journal/v37/n1/full/ng1490.html
Methodology • Data loading • Data preprocessing • Data selection • GSEA – Determine enrichment scores • Correlating with clinical outcome • Classification of data
Preliminary steps • Data loading • Data preprocessing same as in phase I • Data selection
GSEA • Gene Set Enrichment Analysis • A microarray data analysis method that uses predefined gene sets and ranks of genes to identify significant biological changes in microarray data sets. • GSEA provides an enrichment score that measures the degree to which a gene set is enriched in a rank-ordered gene list derived from the data set.
GSEA(cont’d) • GSEA Inputs: • List of genes ranked according to the expression difference between two classes. • a priori defined gene sets (ex. pathways), each consisting of members drawn from the list of genes. • Ranking of genes is done using a distance metric, Signal-to-Noise ratio (SNR). http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf
Signal-to-Noise Ratio • The signal-to-noise ratio method looks at the difference of the means in each of the classes, scaled by the sum of the standard deviations: SNR = (α · √n) ÷ σ, where α (signal) is the difference in mean expression between the two classes and σ (noise) is the standard deviation.
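As an illustration, the sketch below follows the verbal description on this slide (difference of the class means divided by the sum of the class standard deviations) rather than the √n form of the formula. It assumes sample standard deviations and at least two samples per class; the function names are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b) for one gene's expression values."""
    noise = stdev(class_a) + stdev(class_b)
    if noise == 0:
        return 0.0  # avoid division by zero for constant genes
    return (mean(class_a) - mean(class_b)) / noise


def rank_genes(expr_a, expr_b):
    """expr_a, expr_b: dicts gene -> values in each class.
    Returns gene names sorted from highest to lowest SNR."""
    scores = {gene: signal_to_noise(expr_a[gene], expr_b[gene]) for gene in expr_a}
    return sorted(scores, key=scores.get, reverse=True)
```

The ranked list returned by `rank_genes` is the kind of input that the enrichment-score step expects.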
Implementation • Determine SNR for each microarray. • Sort the gene list based on SNR values. • Measure the degree of enrichment of each gene set (pathway) by comparing the SNR-ordered gene list with the gene set. http://www.nature.com/ng/journal/v37/n1/full/ng1490.html
Enrichment Score (ES) • Walk down the ranked gene list, keeping a running sum. • If a gene is in the gene set, increment the running sum by Y. • If a gene is not in the gene set, decrement the running sum by X. • X = √(G/(N−G)), Y = √((N−G)/G), where G = number of genes in the set and N = size of the data. • ES = greatest positive deviation of this running sum across all genes. http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc
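The running-sum calculation can be sketched as follows, reading the square roots as covering the whole fraction, i.e. X = sqrt(G/(N−G)) and Y = sqrt((N−G)/G). The function name is hypothetical, and G is assumed here to count only set members that actually appear in the ranked list.

```python
import math


def enrichment_score(ranked_genes, gene_set):
    """Greatest positive deviation of the running sum over the ranked list."""
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    if g == 0 or g == n:
        return 0.0  # increments are undefined when the set is empty or covers everything
    y = math.sqrt((n - g) / g)   # step up for a gene in the set
    x = math.sqrt(g / (n - g))   # step down for a gene outside the set
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += y if gene in gene_set else -x
        best = max(best, running)
    return best


# Usage: a set concentrated at the top of the list scores high.
es_top = enrichment_score(["a", "b", "c", "d"], {"a", "b"})
```

With these step sizes the running sum returns to zero at the end of the list, so a large ES means the set members cluster near the top of the ranking.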
Correlation & Classification • Similar to phase I. • First, the top 50 pathways are selected to create the favorable and unfavorable centroids. • Next, the training and validation sets are classified using the nearest-centroid prediction rule.
Results(cont’d.) • Each of the 500 training sets provided a different molecular signature • Plot of pathways that occurred in over 150 of the molecular signatures.
Results • Gene Expression: Average % = 93.77% • Gene Set Based: Average % = 97.88%
Results (cont’d) • Gene Expression: Average % = 93.80% • Gene Set Based: Average % = 96.45%
Results (cont’d) • Gene Expression: Average % = 75.17% • Gene Set Based: Average % = 52.91%
Results (cont’d) • Gene Expression: Average % = 26.48% • Gene Set Based: Average % = 47.76%
Three significant pathways • Iron ion homeostasis: reduces tumor angiogenesis by protecting cells from oxidative stress. • Unfolded protein response, positive regulation of target gene transcription: a stress-signaling pathway in tumor cells. • Tryptophan catabolism: has an antiproliferative effect on many tumor cells.
Conclusion • Our results show that: • Centroid classification based on gene expression performs poorly on the validation set. • The GSEA method does not perform any better than the gene-expression method.
Future Work • Analysis with a different classification approach. • Using much larger data sets from different samples.
Acknowledgements • Dr. Bruce Hoff • Dr. Soheil Shams • SoCalBSI
References • Michiels, S., Koscielny, S., Hill, C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, Vol. 365, 488–92 (2005). • Mootha, V. K., et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, Vol. 34, 267–273 (2003). • http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc • http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf • http://www.nature.com/ng/journal/v37/n1/full/ng1490.html