Lab 5: Unsupervised and supervised clustering. Feb 22nd, 2012. Daniel Fernandez, Alejandro Quiroz
Outline • Unsupervised • Hierarchical clustering • Principal component analysis • Supervised • LIMMA package • Linear models for microarray data
Before any high-level analysis… • Download the data set used in lab 4 • Download GSE10940 • Load the .CEL files and use the custom CDF annotation file used in lab 4: "drosophila2dmrefseqcdf" • Perform RMA normalization and extract the expression intensities as a matrix • Obtain the genes that are up- or down-regulated with a fold change of at least 2 • Store the gene IDs in: X.top
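The fold-change filter in the last two steps can be sketched in base R as follows. This is a minimal sketch on simulated data: the matrix `X`, the labels in `group`, and the 3-versus-3 layout are assumptions made for illustration, not the actual design of GSE10940. In the lab, `X` would be the RMA expression matrix, which is on the log2 scale, so a fold change of 2 corresponds to an absolute log2 difference of 1. The sketch also assumes that `X.top` holds the expression rows of the selected genes, with the gene IDs as row names, since the later slides pass `X.top` to `dist()`.

```r
set.seed(1)

# Simulated log2 expression matrix: 1000 genes x 6 arrays (hypothetical layout)
n.genes <- 1000
X <- matrix(rnorm(n.genes * 6, mean = 8, sd = 0.2), nrow = n.genes,
            dimnames = list(paste0("gene", seq_len(n.genes)),
                            paste0(rep(c("ctrl", "mut"), each = 3), 1:3)))
group <- factor(rep(c("ctrl", "mut"), each = 3))

# Spike in 20 up-regulated and 20 down-regulated genes (4-fold change)
X[1:20,  group == "mut"] <- X[1:20,  group == "mut"] + 2
X[21:40, group == "mut"] <- X[21:40, group == "mut"] - 2

# On the log2 scale, the log fold change is a difference of group means
log2fc <- rowMeans(X[, group == "mut"]) - rowMeans(X[, group == "ctrl"])

# Fold change of at least 2 in either direction: |log2FC| >= 1
up   <- names(log2fc)[log2fc >=  1]
down <- names(log2fc)[log2fc <= -1]

# Keep the expression rows of the selected genes; rownames(X.top) are the gene IDs
X.top <- X[c(up, down), , drop = FALSE]
```

A plain fold-change cutoff ignores the variability of each gene, which is one motivation for the linear-model approach covered later in the lab.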
The data set • Secretory and transmembrane proteins traverse the endoplasmic reticulum (ER) and Golgi compartments for final maturation prior to reaching their functional destinations. • Members of the p24 protein family function in trafficking some secretory proteins in yeast and higher eukaryotes. • Yeast p24 mutants have minor secretory defects and induce an ER stress response that likely results from accumulation of proteins in the ER due to disrupted trafficking. • Test the hypothesis that loss of Drosophila melanogaster p24 protein function causes a transcriptional response characteristic of ER stress activation.
Supervised Method: LIMMA • Linear Models for MicroArray data • A package for differential expression analysis of microarray data. • Uses linear models to describe the expression of each gene. • Uses empirical Bayes and other shrinkage methods to borrow information across genes, making the analyses stable even for experiments with a small number of arrays.
LIMMA uses linear models to analyze microarray data. • The approach requires the definition of two matrices • Design matrix • Describes how the different factors are distributed across the arrays • A linear model is assumed: $E[y_j] = X\alpha_j$ • where $y_j$ contains the expression values for gene $j$, $X$ is the design matrix and $\alpha_j$ is the vector of coefficients • The estimates of $\alpha_j$ are provided by lmFit() • Contrast matrix • Defines the comparisons between factors of interest • If the parameters $\beta_j = C^T\alpha_j$ are of interest • $C$ is the contrast matrix • These parameters are estimated by contrasts.fit()
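The roles of the two matrices can be sketched in base R. This is a minimal sketch under stated assumptions: the two-group layout, the simulated matrix `Y`, and the names `alpha.hat`, `C` and `beta.hat` are hypothetical and chosen for illustration. It computes ordinary least-squares coefficients directly; with limma installed, `lmFit(Y, design)` followed by `contrasts.fit(fit, C)` should give the same coefficient estimates.

```r
set.seed(2)

# Hypothetical experiment: 3 control arrays and 3 mutant arrays
group <- factor(rep(c("ctrl", "mut"), each = 3))

# Design matrix: one column per group, no intercept
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Simulated log2 expression values: 5 genes x 6 arrays
Y <- matrix(rnorm(5 * 6, mean = 8), nrow = 5,
            dimnames = list(paste0("gene", 1:5), NULL))

# Least-squares estimate of the coefficients for all genes at once
# (rows = genes, columns = coefficients of the design matrix)
alpha.hat <- t(solve(crossprod(design), crossprod(design, t(Y))))

# Contrast matrix: mutant minus control
C <- matrix(c(-1, 1), ncol = 1,
            dimnames = list(levels(group), "mut_vs_ctrl"))

# Estimated contrast for each gene
beta.hat <- alpha.hat %*% C
```

With this design, each coefficient is simply a group mean, so the contrast is the difference in mean log2 expression between mutant and control arrays.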
Given the large number of linear model fits arising from a microarray experiment, there is a pressing need to take advantage of the parallel structure, whereby the same model is fitted to each gene • Using a hierarchical framework, a moderated t-statistic is computed • Standard errors are shrunk towards a common value using a Bayesian model • This borrows information across genes for the inference on individual genes • The degrees of freedom are increased • This reflects the greater reliability of the smoothed standard errors
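For reference, the shrinkage described above can be written out as follows. This is the standard form of the moderated t-statistic from the limma literature. The symbols are not defined in the slides and are introduced here: $\hat{\beta}_j$ is the estimated contrast for gene $j$, $s_j^2$ is its residual variance on $d_j$ degrees of freedom, $s_0^2$ and $d_0$ are the prior variance and prior degrees of freedom estimated from all genes, and $v_j$ is the unscaled variance of $\hat{\beta}_j$.

```latex
\tilde{s}_j^2 = \frac{d_0 s_0^2 + d_j s_j^2}{d_0 + d_j},
\qquad
\tilde{t}_j = \frac{\hat{\beta}_j}{\tilde{s}_j \sqrt{v_j}}
```

Under the null hypothesis, $\tilde{t}_j$ follows a t-distribution with $d_0 + d_j$ degrees of freedom, which is the increase in degrees of freedom mentioned in the last two bullets.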
Unsupervised Method: Hierarchical clustering • Hierarchical clustering • First, calculate all the pairwise distances • D=dist(t(X.top)) • Then, perform the hierarchical clustering • H1=hclust(D,method="single") • H2=hclust(D,method="complete") • H3=hclust(D,method="average") • plot(H1), plot(H2), plot(H3) • Is there something odd in the clustering?
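The calls above can be run end to end on simulated data. This is a minimal sketch: the simulated `X.top`, its 3-versus-3 sample layout, and the final `cutree()` step are assumptions added for illustration and are not part of the lab data. `X.top` is taken to be a genes-by-samples matrix, so `t(X.top)` makes `dist()` compare samples.

```r
set.seed(3)

# Simulated stand-in for X.top: 40 genes x 6 samples (hypothetical layout)
X.top <- matrix(rnorm(40 * 6), nrow = 40,
                dimnames = list(paste0("gene", 1:40),
                                paste0(rep(c("ctrl", "mut"), each = 3), 1:3)))
X.top[, 4:6] <- X.top[, 4:6] + 3   # shift the "mut" samples away from "ctrl"

# Pairwise Euclidean distances between samples (dist() works on rows)
D <- dist(t(X.top))

# Hierarchical clustering with three different linkage rules
H1 <- hclust(D, method = "single")
H2 <- hclust(D, method = "complete")
H3 <- hclust(D, method = "average")

# Draw the three dendrograms side by side
op <- par(mfrow = c(1, 3))
plot(H1, main = "single")
plot(H2, main = "complete")
plot(H3, main = "average")
par(op)

# Cut the complete-linkage tree into two groups of samples
groups <- cutree(H2, k = 2)
```

Comparing the three dendrograms side by side is one way to see how much the grouping depends on the linkage rule.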
Unsupervised Method: MDS • Multidimensional scaling (MDS) is a set of related statistical techniques to explore similarities in data*. • *Wikipedia.
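The slides do not name a function for MDS. One option in base R is classical MDS via `cmdscale()`, sketched below on simulated data. The simulated `X.top`, its sample layout, and the choice of two dimensions are assumptions made for illustration.

```r
set.seed(4)

# Simulated stand-in for X.top: 40 genes x 6 samples (hypothetical layout)
X.top <- matrix(rnorm(40 * 6), nrow = 40,
                dimnames = list(paste0("gene", 1:40),
                                paste0(rep(c("ctrl", "mut"), each = 3), 1:3)))
X.top[, 4:6] <- X.top[, 4:6] + 3   # shift the "mut" samples away from "ctrl"

# Pairwise distances between samples, as in the hierarchical clustering step
D <- dist(t(X.top))

# Classical MDS: place the 6 samples in 2 dimensions so that
# their distances approximate the entries of D
mds <- cmdscale(D, k = 2)

# Plot the samples in the reduced space, labelled by sample name
plot(mds, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(mds, labels = colnames(X.top))
```

Samples that are close in the original distance matrix should appear close together in the plot.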
Unsupervised Method: Principal component analysis • In R, the function prcomp performs principal component analysis • In our context, the idea is to visualize the effect of reducing the dimension of the gene space • Important: Remember that in prcomp, the genes have to be columns and the samples rows.
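A minimal sketch of this step on simulated data follows. The simulated `X.top` and its sample layout are assumptions made for illustration. Because `X.top` is taken to be genes by samples, it is transposed before the call so that samples are rows and genes are columns.

```r
set.seed(5)

# Simulated stand-in for X.top: 40 genes x 6 samples (hypothetical layout)
X.top <- matrix(rnorm(40 * 6), nrow = 40,
                dimnames = list(paste0("gene", 1:40),
                                paste0(rep(c("ctrl", "mut"), each = 3), 1:3)))
X.top[, 4:6] <- X.top[, 4:6] + 3   # shift the "mut" samples away from "ctrl"

# prcomp() expects observations in rows: samples as rows, genes as columns
pca <- prcomp(t(X.top))

# Proportion of the total variance captured by each principal component
var.explained <- pca$sdev^2 / sum(pca$sdev^2)

# Plot the samples on the first two principal components
plot(pca$x[, 1:2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1:2], labels = rownames(pca$x))
```

The vector `var.explained` shows how much of the variation across genes is retained when only the first few components are kept.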