Microarray-based Disease Prognosis using Gene Annotation Signatures

Michael Kovshilovsky • Swapna Annavarapu • SoCalBSI 2005

Internship site: BioDiscovery, Inc. • Mentor: Dr. Bruce Hoff • Source of Funding: BioDiscovery, Inc.

Motivation • Microarray gene-expression profiling studies aim to predict disease outcomes (e.g., cancer outcome). • The goal is to improve the treatment of patients based on knowledge of their gene-expression profile (molecular signature).

Lancet Paper: “Prediction of cancer outcome with microarrays: a multiple random validation strategy” • Findings of Stefan Michiels et al.: “Gene expression microarray-based predictors of clinical outcome have been overly optimistic, and careful review shows that performance is poor and variable.” • Analyzed data from the 7 largest published studies that attempted to predict the prognosis of cancer patients based on DNA microarray analysis. • Used a random sampling approach.

Goal • Reproduce the Lancet paper. • Compare classification based on expression levels of microarray probes with classification based on GSEA scores of biological pathways. • Validate our hypothesis: by abstracting away from the gene-expression domain to that of biological properties, performance should stabilize and improve.

Phase I : Reproduce the Lancet Paper (Gene-Expression based classification)

Methodology • Data loading • Data preprocessing • Data selection • Correlating with clinical outcome • Determine the molecular signature • Classification of data

Data Loading • Read Affymetrix chip expression data. Sample data:

Data Preprocessing • Scaling • Identify the present, absent and marginal expression levels. • Scale the average of the fluorescence intensities of all genes to a constant target intensity of 2500. • Expression values above 45000 are capped at 45000, and those below 100 are set to 1. • Filtration • Eliminate genes with low or no variance. • Log transformation • log2(values)
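The preprocessing steps listed above can be sketched in plain Python. This is a minimal illustration, not the presenters' code: the function names, the list-of-lists data layout, and the `min_variance` threshold are assumptions, while the target of 2500 and the 45000 / 100 / 1 limits are taken from the slide. The present/absent/marginal calls are not modeled.

```python
import math
import statistics

TARGET_MEAN = 2500.0   # target average intensity (from the slide)
CEILING = 45000.0      # values above this are capped (from the slide)
FLOOR = 100.0          # values below this are replaced (from the slide)
FLOOR_VALUE = 1.0      # replacement value (from the slide)


def scale_chip(values, target=TARGET_MEAN):
    """Rescale one chip so that its mean intensity equals the target."""
    factor = target / statistics.fmean(values)
    return [v * factor for v in values]


def clip(value):
    """Cap high intensities and replace low ones."""
    if value > CEILING:
        return CEILING
    if value < FLOOR:
        return FLOOR_VALUE
    return value


def preprocess(chips, min_variance=0.0):
    """chips: one list of intensities per sample, same gene order in each.

    Returns (kept_gene_indices, log2 matrix restricted to those genes).
    min_variance is a hypothetical cutoff; the slide gives no value.
    """
    scaled = [[clip(v) for v in scale_chip(chip)] for chip in chips]
    n_genes = len(scaled[0])
    # Filtration: drop genes whose variance across samples is too low.
    kept = [g for g in range(n_genes)
            if statistics.pvariance([chip[g] for chip in scaled]) > min_variance]
    # Log transformation of the surviving genes.
    logged = [[math.log2(chip[g]) for g in kept] for chip in scaled]
    return kept, logged
```

The order used here (scale, clip, filter, log) follows the order of the bullets; whether the original pipeline filtered before or after the log transformation is not stated.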

Preprocessed Data: Before / After

Data Selection • Training-Validation Approach: • Training set for identifying the molecular signature. • Validation set for estimating the proportion of misclassifications. • The split is made such that each set includes half the patients with and half without a favorable outcome. • Dataset (N) → random selection → Training (n) + Validation (N − n)
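One possible reading of this split is sketched below: the training set of size n is drawn at random with half of its patients from each outcome group, and the remaining N − n patients form the validation set. The function name, the dictionary layout, and this interpretation of "half with and half without a favorable outcome" are assumptions.

```python
import random


def split_training_validation(outcomes, n_train, rng=None):
    """outcomes: dict mapping patient id -> 1 (favorable) or -1 (unfavorable).

    Returns (training_ids, validation_ids). The training set takes
    n_train // 2 favorable patients and the rest from the unfavorable group.
    """
    rng = rng or random.Random()
    favorable = sorted(p for p, o in outcomes.items() if o == 1)
    unfavorable = sorted(p for p, o in outcomes.items() if o == -1)
    half = n_train // 2
    training = rng.sample(favorable, half) + rng.sample(unfavorable, n_train - half)
    chosen = set(training)
    validation = [p for p in outcomes if p not in chosen]
    return training, validation
```

Calling this function repeatedly with a fresh random state gives the series of training and validation sets used in a multiple random validation strategy.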

Correlation • Clinical outcome • Favorable = 1 (continuous complete remission) • Unfavorable = -1 (relapse) • Correlate the expression values of each gene with the clinical outcome • Pearson’s correlation coefficient • Determine the molecular signature • Defined by the 50 genes most highly correlated with the outcome.
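The signature selection can be illustrated with a short sketch. It computes Pearson's correlation between each gene's expression values and the ±1 outcome labels, then keeps the top genes. Ranking by absolute correlation is an assumption (the slide only says "highest correlated"), and the function names and data layout are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def molecular_signature(expression, outcomes, size=50):
    """expression: dict gene -> values, one per training patient.
    outcomes: list of 1 / -1 labels in the same patient order.

    Returns the `size` genes with the largest absolute correlation.
    """
    scores = {gene: pearson(values, outcomes)
              for gene, values in expression.items()}
    ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
    return ranked[:size]
```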

Data Classification (Nearest Centroid Prediction Rule) • A new point is assigned to the class whose centroid is nearest. • The data is 50-dimensional. • A PCA plot is used to visualize the data. • Principal component analysis (PCA) is a powerful tool for analysing data by identifying patterns in it. • Figure labels: Unfavorable Centroid, Favorable Centroid
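A minimal sketch of the nearest-centroid rule follows. Euclidean distance is an assumption, since the slide does not name the distance measure, and the function names are illustrative. Each sample is a list of signature values (50 of them in this study).

```python
import math


def centroid(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def train_centroids(samples, outcomes):
    """Build one centroid per class from the training samples."""
    favorable = [s for s, o in zip(samples, outcomes) if o == 1]
    unfavorable = [s for s, o in zip(samples, outcomes) if o == -1]
    return {1: centroid(favorable), -1: centroid(unfavorable)}


def classify(sample, centroids):
    """Return the label (1 or -1) of the nearest centroid."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))


def misclassification_rate(samples, outcomes, centroids):
    """Proportion of samples whose predicted label differs from the outcome."""
    wrong = sum(1 for s, o in zip(samples, outcomes)
                if classify(s, centroids) != o)
    return wrong / len(samples)
```

The centroids are built from the training set only; the misclassification rate is then estimated on the validation set.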

Results(cont’d.) • Each of the 500 training sets provided a different molecular signature • Plot of genes that occurred most frequently in the molecular signature.

Analysis • The frequency of the genes participating in defining the signature is quite low. • This suggests that the molecular signature is selected almost randomly and is unstable.
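The stability analysis described above amounts to counting how often each gene appears across the signatures from the resampled training sets. A small sketch, with illustrative names, assuming each signature is a list of gene identifiers:

```python
from collections import Counter


def signature_frequencies(signatures):
    """signatures: list of gene lists, one per training set.

    Returns (gene, fraction_of_signatures) pairs, most frequent first.
    """
    counts = Counter()
    for signature in signatures:
        counts.update(set(signature))  # count each gene once per signature
    total = len(signatures)
    return [(gene, count / total) for gene, count in counts.most_common()]
```

A gene that defined a stable signature would appear in nearly every signature; low fractions across the board indicate that the selection depends heavily on which patients fall in the training set.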

Phase II : Analysis of Microarray Data using GSEA (Gene Set Enrichment Analysis) http://www.nature.com/ng/journal/v37/n1/full/ng1490.html

Methodology • Data loading • Data preprocessing • Data selection • GSEA – Determine enrichment scores • Correlating with clinical outcome • Classification of data

Preliminary steps • Data loading • Data preprocessing same as in phase I • Data selection

GSEA • Gene Set Enrichment Analysis • A microarray data analysis method that uses predefined gene sets and gene ranks to identify significant biological changes in microarray data sets. • GSEA provides an enrichment score that measures the degree of enrichment of a gene set in a rank-ordered gene list derived from the data set.

GSEA (cont’d) • GSEA Inputs: • A list of genes ranked according to the expression difference between two classes. • A priori defined gene sets (e.g., pathways), each consisting of members drawn from the list of genes. • Genes are ranked using a distance metric, the signal-to-noise ratio (SNR). http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf

Signal-to-Noise Ratio • The signal-to-noise ratio method looks at the difference of the means in each of the classes, scaled by the sum of the standard deviations: (α × √n) ÷ σ, where α (signal) is the difference in mean expression between the two classes and σ (noise) is the standard deviation.
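The verbal definition above (difference of the class means scaled by the sum of the class standard deviations) can be sketched as follows. Note that this follows the wording rather than the written expression, whose √n factor and single σ are not defined further on the slide, so it is an assumption about what was computed. Function names and the data layout are illustrative.

```python
import statistics


def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b); each class needs >= 2 values."""
    signal = statistics.fmean(class_a) - statistics.fmean(class_b)
    noise = statistics.stdev(class_a) + statistics.stdev(class_b)
    if noise == 0:
        return 0.0  # no spread in either class; treat as uninformative
    return signal / noise


def rank_genes(expression, labels):
    """expression: dict gene -> values, one per sample.
    labels: list of 1 / -1 class labels in the same sample order.

    Returns gene names sorted from highest to lowest SNR.
    """
    scores = {}
    for gene, values in expression.items():
        class_a = [v for v, label in zip(values, labels) if label == 1]
        class_b = [v for v, label in zip(values, labels) if label == -1]
        scores[gene] = signal_to_noise(class_a, class_b)
    return sorted(scores, key=scores.get, reverse=True)
```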

Implementation • Determine SNR for each microarray. • Sort the gene list based on SNR values. • The degree of enrichment of each gene set is measured by comparing the SNR-ordered gene list with the gene sets (pathways). http://www.nature.com/ng/journal/v37/n1/full/ng1490.html

Enrichment Score (ES) • Walk down the ranked gene list, keeping a running sum. • If a gene is in the gene set, increment the running sum by Y. • If a gene is not in the gene set, decrement the running sum by X. • X = √(G/(N−G)), Y = √((N−G)/G), where G = number of genes in the set and N = size of data. • ES = greatest positive deviation of this running sum across all genes. http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc
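The running-sum calculation can be sketched in a few lines. Here N is taken to be the number of genes in the ranked list and G the number of gene-set members found in that list, which is an interpretation of "size of data"; the function name is illustrative.

```python
import math


def enrichment_score(ranked_genes, gene_set):
    """Greatest positive deviation of the running sum over a ranked list.

    ranked_genes: gene names ordered from highest to lowest rank score.
    gene_set: collection of gene names (e.g., one pathway).
    """
    members = set(gene_set) & set(ranked_genes)
    n = len(ranked_genes)
    g = len(members)
    if g == 0 or g == n:
        return 0.0  # steps are undefined when the set is empty or covers everything
    step_down = math.sqrt(g / (n - g))   # X: applied for genes outside the set
    step_up = math.sqrt((n - g) / g)     # Y: applied for genes inside the set
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += step_up if gene in members else -step_down
        best = max(best, running)
    return best
```

With these step sizes the running sum returns to zero at the end of the list, so a high ES means that the members of the set are concentrated near the top of the ranking.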

Correlation & Classification • Similar to Phase I • First, the top 50 pathways are selected to create the favorable and unfavorable centroids. • Next, the training and validation sets are classified using the nearest-centroid prediction rule.

Results(cont’d.) • Each of the 500 training sets provided a different molecular signature • Plot of pathways that occurred in over 150 of the molecular signatures.

Results • Gene Expression: Average % = 93.77% • Gene Set Based: Average % = 97.88%

Results (cont’d) • Gene Expression: Average % = 93.80% • Gene Set Based: Average % = 96.45%

Three significant pathways • Iron ion homeostasis • Reduces tumor angiogenesis by protecting cells from oxidative stress • Unfolded protein response, positive regulation of target gene transcription • A stress-signaling pathway in tumor cells • Tryptophan catabolism • Has an antiproliferative effect on many tumor cells

Conclusion • Our results have shown that • The centroid classification based on gene expression performs poorly with the validation set. • The GSEA method does not perform any better than the gene expression method

Future Work • Analysis with a different classification approach. • Using much larger data sets from different samples.

Acknowledgements • Dr. Bruce Hoff • Dr. Soheil Shams • SoCalBSI

References • Michiels, S., Koscielny, S., Hill, C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, Vol. 365, 488–92 (2005). • Mootha, V. K., et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, Vol. 34, 267–273 (2003). • http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc • http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf • http://www.nature.com/ng/journal/v37/n1/full/ng1490.html
