Final Project Week 3 - 5/7/09 GSEA and Cluster Computing in Protein Research Leon Kay, Yan Tran, Chris Thomas Leon Chris Yan Gary

Gene Set Enrichment Analysis
• GSEA is a computational method that determines whether a defined set of genes shows statistically significant, concordant differences between two phenotypes
• 3 Key Steps
  • Calculation of the Enrichment Score (ES)
  • Estimation of Significance Level of ES
  • Adjustment for multiple hypothesis testing
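The first two key steps can be illustrated with a minimal sketch. It assumes the genes are already ranked by some per-gene score (for example, correlation with the phenotype) and uses the weighted running-sum statistic described in Subramanian et al. The function names, the toy gene names, and the scores are hypothetical. For brevity the null distribution is built by drawing random gene sets of the same size, which is a simplification; permuting the phenotype labels and re-ranking the genes is the other common choice. The third step, adjusting for multiple hypothesis testing across many gene sets, is not shown.

```python
import numpy as np


def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: gene names ordered by their ranking score (best first).
    scores: ranking score for each gene, in the same order.
    gene_set: set of gene names to test.
    """
    in_set = np.array([g in gene_set for g in ranked_genes], dtype=bool)
    weights = np.abs(np.asarray(scores, dtype=float)) ** p
    n_miss = int((~in_set).sum())
    hit_total = weights[in_set].sum()
    if n_miss == 0 or hit_total == 0:
        raise ValueError("need at least one hit with nonzero score and one miss")
    # Step up (weighted) at each set member, step down (uniform) otherwise.
    steps = np.where(in_set, weights / hit_total, -1.0 / n_miss)
    running = np.cumsum(steps)
    # ES is the maximum deviation of the running sum from zero.
    return float(running[np.argmax(np.abs(running))])


def permutation_p_value(ranked_genes, scores, gene_set, n_perm=1000, seed=0):
    """Nominal p-value from random gene sets of the same size (a simplification)."""
    rng = np.random.default_rng(seed)
    observed = enrichment_score(ranked_genes, scores, gene_set)
    k = sum(g in gene_set for g in ranked_genes)
    null = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.choice(len(ranked_genes), size=k, replace=False)
        random_set = {ranked_genes[j] for j in idx}
        null[i] = enrichment_score(ranked_genes, scores, random_set)
    # Compare only against null scores with the same sign as the observed ES.
    same_sign = null[null >= 0] if observed >= 0 else null[null < 0]
    if same_sign.size == 0:
        return observed, float("nan")
    return observed, float(np.mean(np.abs(same_sign) >= abs(observed)))


# Toy data (hypothetical): six genes ranked by a made-up score.
genes = ["a", "b", "c", "d", "e", "f"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es, p_val = permutation_p_value(genes, scores, {"a", "b"}, n_perm=200)
print(es, p_val)  # ES is 1.0 here: both set members sit at the top of the list
```

A gene set clustered at the bottom of the ranked list gives a negative ES of similar magnitude, which is why the p-value is computed against the same-sign side of the null distribution.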

Broad Institute GSEA Tool
• We tried using the GSEA tool from the Broad Institute, where most of the original work for GSEA was done - http://www.broad.mit.edu/gsea/
• Java Web Start app that launches quickly and easily, with lots of online documentation and tutorials.
• Unfortunately, we ran into some major issues getting our data to work with it.

Input to the GSEA Tool

Input to the GSEA Tool – Parameters
• Expression dataset – the expression data; in our case, sub-data extracted from clusters using T-MeV
• Gene sets database – databases of gene sets created by Broad and others, downloadable through the tool from Broad’s website
• Phenotype labels – a separate file, in a GSEA-specific format, that holds the class label for each sample plus header information; created from the original data
• Chip platform – the chip annotation file matching the microarray platform on which the data set was recorded

What is a Phenotype?
• Simply put, an observable characteristic of an organism that results from differing gene expression, plus possible environmental factors.
• In our data, the breast cancer classifications can be considered phenotypes.
• So the phenotype file is created from the breast cancer data, using the class labels as phenotypes.
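As a rough illustration of turning class labels into a phenotype file, the sketch below builds the text of a categorical GSEA phenotype (.cls) file: a first line with the sample count, the class count, and a constant 1; a second line starting with "#" that names the classes; and a third line with one label per sample, in the same order as the expression dataset columns. The function name `make_cls` and the class names `classA` and `classB` are hypothetical stand-ins, not the labels from our data.

```python
def make_cls(labels):
    """Build the text of a categorical .cls phenotype file from per-sample labels.

    labels: one class label per sample, in the column order of the expression data.
    """
    if not labels:
        raise ValueError("at least one sample label is required")
    for label in labels:
        if not label or any(ch.isspace() for ch in label):
            raise ValueError(f"labels must be non-empty and contain no whitespace: {label!r}")
    # Class names listed in order of first appearance among the samples.
    classes = list(dict.fromkeys(labels))
    header = f"{len(labels)} {len(classes)} 1"
    names = "# " + " ".join(classes)
    body = " ".join(labels)
    return "\n".join([header, names, body]) + "\n"


# Hypothetical example: five samples in two classes.
sample_labels = ["classA", "classA", "classB", "classB", "classB"]
print(make_cls(sample_labels), end="")
```

Listing the class names in order of first appearance keeps the second line consistent with the order in which the labels occur on the third line.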

Folding@Home
• The most powerful distributed computing cluster in the world
• One of the largest computing clusters as well
• Launched in 2000; managed by the Pande Group within Stanford's Chemistry Department
• Goal is “to understand protein folding, misfolding and related diseases”
• As of May 2009, 63 papers have been published utilizing Folding@Home

Folding@Home: Model
• Does not rely on a “supercomputer” for data processing
• Small client application installed on volunteers' machines
• Leverages otherwise unused computing power on that hardware
• As of April '09, a peak speed of 4.5 native PFLOPS from an estimated 400,000 machines
• Most modern CPUs are now multi-core, so the Pande Group has explored symmetric multiprocessing (SMP) to leverage the unused cores
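To put the April '09 figures in perspective, a quick back-of-the-envelope calculation gives the average contribution per machine. This is only a rough estimate: it assumes the peak speed and the machine count describe the same moment, and it ignores the large spread between CPU and GPU clients.

```python
# Figures quoted above (April 2009).
PEAK_FLOPS = 4.5e15   # 4.5 native PFLOPS
MACHINES = 400_000    # estimated number of contributing machines

# Average throughput per machine, assuming both figures describe the same snapshot.
avg_flops = PEAK_FLOPS / MACHINES
print(f"{avg_flops / 1e9:.2f} GFLOPS per machine on average")  # prints 11.25
```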

Folding@Home At a Glance

References
• “Folding at Home”, http://folding.stanford.edu/
• Spanish Inquisition Image, http://roflrazzi.com/upcoming/?pid=12265
• Subramanian, Aravind, et al.; “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles”; http://mootha.med.harvard.edu/PubPDFs/Subramanian2005.pdf
• GSEA, http://www.broad.mit.edu/gsea/
