
Microarray-based Disease Prognosis using Gene Annotation Signatures

Michael Kovshilovsky, Swapna Annavarapu. SoCalBSI 2005. Internship site: BioDiscovery, Inc. Mentor: Dr. Bruce Hoff. Source of funding: BioDiscovery, Inc.





Presentation Transcript


  1. Microarray-based Disease Prognosis using Gene Annotation Signatures Michael Kovshilovsky Swapna Annavarapu SoCalBSI 2005

  2. Internship site: BioDiscovery, Inc. • Mentor: Dr. Bruce Hoff • Source of Funding: BioDiscovery, Inc.

  3. Motivation • Microarray gene-expression profiling studies are used to predict disease outcomes. • e.g., cancer outcome • The aim is to improve the treatment of patients based on knowledge of their gene-expression profile (molecular signature).

  4. Lancet Paper “Prediction of cancer outcome with microarrays: a multiple random validation strategy” • Findings of Stefan Michiels et al.: “Gene expression microarray-based predictors of clinical outcome have been overly optimistic and careful review shows that performance is poor and variable.” • Analyzed data from the 7 largest published studies that have attempted to predict prognosis of cancer patients based on DNA microarray analysis. • Random sampling approach

  5. Goal • Reproduce the Lancet paper. • Compare classification based on expression levels of microarray probes with classification based on GSEA scores of biological pathways. • Validate our hypothesis: • By abstracting away from the gene expression domain to that of biological properties, performance should stabilize and improve.

  6. Phase I : Reproduce the Lancet Paper (Gene-Expression based classification)

  7. Methodology • Data loading • Data preprocessing • Data selection • Correlating with clinical outcome • Determine the molecular signature • Classification of data

  8. Data Loading • Read Affymetrix chip expression data. Sample data:

  9. Data Preprocessing • Scaling • Identify the present, absent and marginal expression levels. • Scale the average of the fluorescent intensities of all genes to a constant target intensity of 2500. • Cap expression values above 45000 at 45000 and set those below 100 to 1. • Filtration • Eliminate the genes with low or no variance. • Log transformation • Log2(values)
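The preprocessing steps listed on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the per-chip list layout, and the variance cutoff of zero are assumptions, while the constants 2500, 45000 and 100 come from the slide.

```python
import math
from statistics import mean, pvariance

def scale_and_threshold(chip, target=2500.0, floor=100.0, ceiling=45000.0):
    """Scale one chip so its mean intensity equals `target`, then threshold.

    Values above `ceiling` are capped at `ceiling`; values below `floor`
    are set to 1, as stated on the slide.
    """
    factor = target / mean(chip)
    out = []
    for value in chip:
        value *= factor
        if value > ceiling:
            value = ceiling
        elif value < floor:
            value = 1.0
        out.append(value)
    return out

def drop_low_variance(genes, min_variance=0.0):
    """Keep genes (lists of per-sample values) whose variance exceeds the cutoff.

    The cutoff is an assumption; the slide only says "low or no variance".
    """
    return [row for row in genes if pvariance(row) > min_variance]

def log2_transform(values):
    """Apply log2 to every (strictly positive) value."""
    return [math.log2(v) for v in values]
```

Because thresholded values are never below 1, the log transformation is always defined after `scale_and_threshold`.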

  10. Preprocessed Data: Before and After (figures)

  11. Data Selection • Training-Validation Approach: • Training set for identifying the molecular signature. • Validation set for estimating the proportion of misclassifications. • Each set includes half the patients with and half without a favorable outcome. • Dataset (N) → random selection → Training (n) + Validation (N − n)
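A random training/validation split of this kind could look like the sketch below. It assumes one reading of the slide: the training set takes a fixed fraction (one half by default) of the favorable patients and the same fraction of the unfavorable ones, so both sets keep the class balance of the full dataset. The function name and the +1 / -1 outcome coding (taken from the Correlation slide) are illustrative.

```python
import random

def split_training_validation(outcomes, rng, fraction=0.5):
    """Randomly split patient indices into training and validation sets.

    `outcomes` holds +1 (favorable) or -1 (unfavorable) per patient.
    The same fraction of each class goes to the training set (assumption).
    """
    training = []
    for label in (1, -1):
        members = [i for i, y in enumerate(outcomes) if y == label]
        training += rng.sample(members, int(len(members) * fraction))
    chosen = set(training)
    validation = [i for i in range(len(outcomes)) if i not in chosen]
    return sorted(training), validation
```

Calling the function repeatedly with the same `random.Random` instance yields a new random split each time, which is how many different training sets can be generated from one dataset.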

  12. Correlation • Clinical outcome • Favorable = 1 (continuous complete remission) • Unfavorable = -1 (relapse) • Correlate the expression values of each gene with the clinical outcome • Pearson’s correlation coefficient • Determine the molecular signature • Defined by the 50 genes most highly correlated with outcome.
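As a rough sketch of this step, the code below computes Pearson's correlation between each gene and the +1 / -1 outcome vector and keeps the top-ranked genes. Ranking by absolute correlation is an assumption (the slide only says "highest correlated"), and the dictionary layout and function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def molecular_signature(expression, outcomes, size=50):
    """Return the `size` genes most correlated (in absolute value) with outcome.

    `expression` maps a gene name to its per-patient expression values,
    in the same patient order as `outcomes`.
    """
    scores = {gene: pearson(values, outcomes) for gene, values in expression.items()}
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)[:size]
```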

  13. Data Classification (Nearest Centroid Prediction Rule) • A new point is classified based on which centroid is nearest. • Data is 50-dimensional. • A PCA plot is used to visualize the data. • Principal component analysis (PCA) is a powerful tool for analysing data by identifying patterns in it. • Plot labels: Unfavorable Centroid, Favorable Centroid
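The nearest-centroid rule can be illustrated with the short sketch below. Euclidean distance is an assumption here, since the slide does not name the distance measure, and the function names are illustrative.

```python
import math

def centroid(samples):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def nearest_centroid(sample, favorable_centroid, unfavorable_centroid):
    """Return +1 if the sample is at least as close to the favorable centroid, else -1."""
    d_favorable = math.dist(sample, favorable_centroid)
    d_unfavorable = math.dist(sample, unfavorable_centroid)
    return 1 if d_favorable <= d_unfavorable else -1

def misclassification_rate(samples, outcomes, favorable_centroid, unfavorable_centroid):
    """Proportion of samples whose predicted class differs from the true outcome."""
    wrong = sum(
        1
        for sample, truth in zip(samples, outcomes)
        if nearest_centroid(sample, favorable_centroid, unfavorable_centroid) != truth
    )
    return wrong / len(samples)
```

In this setting each sample would be a 50-dimensional vector of signature values, with the two centroids computed from the training set only.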

  14. Results(cont’d.) • Each of the 500 training sets provided a different molecular signature • Plot of genes that occurred most frequently in the molecular signature.

  15. Analysis • The frequency of the genes participating in defining the signature is quite low. • This suggests that the molecular signature is selected almost randomly and is unstable.

  16. Phase II: Analysis of Microarray Data using GSEA (Gene Set Enrichment Analysis) http://www.nature.com/ng/journal/v37/n1/full/ng1490.html

  17. Methodology • Data loading • Data preprocessing • Data selection • GSEA – Determine enrichment scores • Correlating with clinical outcome • Classification of data

  18. Preliminary steps • Data loading • Data preprocessing same as in phase I • Data selection

  19. GSEA • Gene Set Enrichment Analysis • A microarray data analysis method that uses predefined gene sets and ranks of genes to identify significant biological changes in microarray data sets. • GSEA provides an enrichment score that measures the degree of enrichment of the gene set within a rank-ordered gene list derived from the data set.

  20. GSEA (cont’d) • GSEA Inputs: • A list of genes ranked according to the expression difference between two classes. • A priori defined gene sets (e.g., pathways), each consisting of members drawn from the list of genes. • Ranking of genes is done using a distance metric, the signal-to-noise ratio (SNR). http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf

  21. Signal-to-Noise Ratio • The signal-to-noise ratio method looks at the difference of the means in each of the classes scaled by the sum of the standard deviations: (α · √n) / σ, where α (signal) is the difference in mean expression between the two classes and σ (noise) is the standard deviation.
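The sketch below follows the verbal definition on this slide: the difference of the two class means divided by the sum of the two class standard deviations. It does not attempt to reproduce the √n-scaled expression, because n is not defined on the slide. The use of the sample standard deviation and all function names are assumptions.

```python
from statistics import mean, stdev

def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b) for one gene.

    Uses the sample standard deviation (an assumption) and returns 0.0
    when both classes are constant, to avoid dividing by zero.
    """
    noise = stdev(class_a) + stdev(class_b)
    if noise == 0:
        return 0.0
    return (mean(class_a) - mean(class_b)) / noise

def rank_genes(expression_a, expression_b):
    """Sort gene names by decreasing SNR.

    Both arguments map a gene name to its expression values in one class.
    """
    snr = {g: signal_to_noise(expression_a[g], expression_b[g]) for g in expression_a}
    return sorted(snr, key=snr.get, reverse=True)
```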

  22. Implementation • Determine SNR for each microarray. • Sort gene list based on SNR values. • The degree of enrichment of the gene set is measured by comparing the SNR-ordered gene list with the gene set(pathways). http://www.nature.com/ng/journal/v37/n1/full/ng1490.html

  23. Enrichment Score (ES) • Walk down the ranked gene list, keeping a running sum. • If the gene is in the gene set, increment the running sum by Y. • If the gene is not in the gene set, decrement the running sum by X. • X = √(G / (N − G)) • Y = √((N − G) / G) • G = number of genes in the set • N = size of the data • ES = greatest positive deviation of this running sum across all genes • http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc
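The running-sum calculation on this slide can be sketched as below. The code assumes that N is the length of the ranked gene list and that G counts only the set members that actually appear in that list; the function name is illustrative.

```python
import math

def enrichment_score(ranked_genes, gene_set):
    """Greatest positive deviation of the running sum over a ranked gene list.

    The sum rises by sqrt((N - G) / G) for genes in the set and falls by
    sqrt(G / (N - G)) for genes outside it, so it returns to zero at the end.
    """
    n = len(ranked_genes)
    g = sum(1 for gene in ranked_genes if gene in gene_set)
    if g == 0 or g == n:
        return 0.0  # step sizes are undefined; treat as no enrichment
    step_up = math.sqrt((n - g) / g)
    step_down = math.sqrt(g / (n - g))
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += step_up if gene in gene_set else -step_down
        best = max(best, running)
    return best
```

A gene set concentrated at the top of the ranking drives the running sum up early and gives a large score, while a set concentrated at the bottom never rises above zero.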

  24. Correlation & Classification • Similar to phase I • First, the top 50 pathways are selected to create favorable and unfavorable centroids. • Next, the training and validation sets are classified based on the nearest-centroid prediction rule.

  25. Results(cont’d.) • Each of the 500 training sets provided a different molecular signature • Plot of pathways that occurred in over 150 of the molecular signatures.

  26. Results • Gene Expression: Average % = 93.77% • Gene Set Based: Average % = 97.88%

  27. Results (cont’d) • Gene Expression: Average % = 93.80% • Gene Set Based: Average % = 96.45%

  28. Results (cont’d) • Gene Expression: Average % = 75.17% • Gene Set Based: Average % = 52.91%

  29. Results (cont’d) • Gene Expression: Average % = 26.48% • Gene Set Based: Average % = 47.76%

  30. Three significant pathways • Iron ion homeostasis • Reduces tumor angiogenesis by protecting cells from oxidative stress • Unfolded protein response, positive regulation of target gene transcription • A stress-signaling pathway in tumor cells • Tryptophan catabolism • Has an antiproliferative effect on many tumor cells

  31. Conclusion • Our results have shown that: • The centroid classification based on gene expression performs poorly on the validation set. • The GSEA method does not perform any better than the gene expression method.

  32. Future Work • Analysis with a different classification approach. • Using much larger data sets from different samples.

  33. Acknowledgements • Dr. Bruce Hoff • Dr. Soheil Shams • SoCalBSI

  34. References • Michiels, S., Koscielny, S., Hill, C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet, Vol. 365, 488–492 (2005). • Mootha, V. K., et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics, Vol. 34, 267–273 (2003). • http://www.broad.mit.edu/gsea/doc/detailed_description_of_gsea_algorithm.doc • http://www.mit.edu/~scyudits/He&Yuditskaya_final_project_report.pdf • http://www.nature.com/ng/journal/v37/n1/full/ng1490.html
