Bioinformatics: gene expression basics Ollie Rando, LRB 903
Experimental Cycle • Biological question (hypothesis-driven or explorative) → Experimental design → Microarray experiment → Image analysis → Quality measurement (outcomes: Failed / Pass) → Pre-processing, normalization → Analysis (clustering, discrimination, estimation, testing) → Biological verification and interpretation • “To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.” (Ronald Fisher)
Microarray Analysis Examples • Figure panels: Brain, Lung, Liver, Liver Tumor • Figure values by tissue: Brain 67,679; Lung 20,224; Heart 9,400; Liver 37,807; Colon 4,832; Prostate 7,971; Bone 4,832; Skin 3,043
Raw data are not mRNA concentrations • tissue contamination • RNA degradation • amplification efficiency • reverse transcription efficiency • hybridization efficiency and specificity • clone identification and mapping • PCR yield, contamination • spotting efficiency • DNA support binding • other array manufacturing related issues • image segmentation • signal quantification • “background” correction
Scatterplot • Figure panels: Data; Data (log scale) • Message: look at your data on log scale!
MA Plot • M = log2(R/G) • A = 1/2 log2(R·G) • R and G are the red and green channel intensities of a spot
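The two MA-plot formulas can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function name `ma_values` and the intensities are made up for the example.

```python
import numpy as np

def ma_values(red, green):
    """Return (M, A) for paired red/green channel intensities."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    m = np.log2(red / green)          # M: log-ratio of the two channels
    a = 0.5 * np.log2(red * green)    # A: average log-intensity
    return m, a

# Hypothetical intensities for three spots
m, a = ma_values([400.0, 100.0, 50.0], [100.0, 100.0, 200.0])
print(m)  # positive M: more red than green; negative M: more green than red
print(a)
```

Plotting M against A (rather than Log Red against Log Green) rotates the scatterplot so that intensity-dependent bias shows up as a bend around the M = 0 line.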
Median centering • One of the simplest strategies is to bring the “centers” of all arrays to the same level. • Assumption: the majority of genes are unchanged between conditions. • The median is more robust to outliers than the mean. • Divide all expression measurements of each array by that array’s median (on the log scale this is the same as subtracting the log of the median). • Result: log signal, centered at 0
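A minimal sketch of median centering on the log scale, assuming NumPy and a genes × arrays matrix. The function name `median_center` and the raw values are invented for illustration.

```python
import numpy as np

def median_center(log_signal):
    """Subtract each array's (column's) median so every array is centered at 0."""
    log_signal = np.asarray(log_signal, dtype=float)
    return log_signal - np.median(log_signal, axis=0, keepdims=True)

# Made-up raw intensities: rows are genes, columns are arrays.
# The second array is uniformly 4x brighter than the first.
raw = np.array([[100.0, 400.0],
                [200.0, 800.0],
                [400.0, 1600.0]])
centered = median_center(np.log2(raw))
print(centered)  # both columns now share the same center
```

Because the second array differs from the first only by a global factor, the two centered columns come out identical, which is exactly the kind of effect a global method can remove.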
Problem of median-centering • Figure panels: scatterplot of log-signals after median-centering (Log Red vs. Log Green); M-A plot of the same data, with M = Log Red - Log Green and A = (Log Green + Log Red) / 2 • Median-centering is a global method. It does not adjust for local effects, intensity-dependent effects, print-tip effects, etc.
Lowess normalization • M = Log Red - Log Green • A = (Log Green + Log Red) / 2 • Compute a local estimate of M as a function of A • Use the estimate to bend the banana straight: subtract it from M
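The idea of a local estimate can be sketched with a simplified stand-in for lowess: a local linear fit with tricube weights and no robustness iterations. This is not the full lowess procedure; the function name `local_linear_fit`, the `frac` parameter, and the simulated banana-shaped data are assumptions for illustration.

```python
import numpy as np

def local_linear_fit(a, m, frac=0.4):
    """Estimate the intensity-dependent trend of M as a function of A."""
    a = np.asarray(a, dtype=float)
    m = np.asarray(m, dtype=float)
    k = max(3, int(np.ceil(frac * len(a))))      # neighbors used per fit
    fitted = np.empty_like(m)
    for i, a0 in enumerate(a):
        dist = np.abs(a - a0)
        idx = np.argsort(dist)[:k]               # k nearest points in A
        h = dist[idx].max()
        if h == 0.0:
            h = 1.0                              # all neighbors identical
        w = (1.0 - (dist[idx] / h) ** 3) ** 3    # tricube weights
        sw = np.sqrt(w)
        # Weighted least-squares line, centered at a0 so the intercept
        # is the fitted value at a0.
        X = np.column_stack([np.ones(len(idx)), a[idx] - a0])
        beta, *_ = np.linalg.lstsq(X * sw[:, None], m[idx] * sw, rcond=None)
        fitted[i] = beta[0]
    return fitted

# Simulated "banana": M bends with A although no gene truly changes.
rng = np.random.default_rng(0)
a = np.linspace(6.0, 14.0, 200)
m = 0.1 * (a - 10.0) ** 2 - 0.5 + rng.normal(0.0, 0.1, size=a.size)

trend = local_linear_fit(a, m)
m_normalized = m - trend    # bend the banana straight
print(m.std(), m_normalized.std())
```

A library lowess implementation would add robustness iterations so that truly changed genes pull the curve less; the sketch only shows the fit-then-subtract step.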
Summary I • Raw data are not mRNA concentrations • We need to check data quality on different levels • Probe level • Array level (all probes on one array) • Gene level (one gene on many arrays) • Always log your data • Normalize your data to avoid systematic (non-biological) effects • Lowess normalization straightens banana
OK, so I’ve got a gene list with expression changes: now what? “Huh. Turns out the standard names for the most upregulated genes all start with ‘HSP’, or ‘GAL’ … I wonder if that’s real …”
Gene Ontology • Organization of curated biological knowledge • 3 branches: biological process, molecular function, cellular component
Hypergeometric Distribution • Probability of observing x or more genes in a cluster of n genes with a common annotation • N = total number of genes in genome • M = number of genes with annotation • n = number of genes in cluster • x = number of genes in cluster with annotation • Multiple hypothesis correction required if testing multiple functions (Bonferroni, FDR, etc.) • Additional genes in clusters with strong enrichment may be related
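The tail probability described above can be computed directly from binomial coefficients. A minimal sketch using only the standard library; the function names and the example counts are hypothetical, and the symbols N, M, n, x follow the slide.

```python
from math import comb

def hypergeom_tail(N, M, n, x):
    """P(X >= x): chance of seeing x or more annotated genes in a cluster of n,
    drawn without replacement from N genes of which M carry the annotation."""
    upper = min(n, M)
    total = comb(N, n)
    hits = sum(comb(M, i) * comb(N - M, n - i) for i in range(x, upper + 1))
    return hits / total

def bonferroni(p_value, n_tests):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical numbers: 6000 genes, 60 annotated with a term,
# a cluster of 100 genes containing 8 annotated genes.
p = hypergeom_tail(N=6000, M=60, n=100, x=8)
print(p, bonferroni(p, n_tests=500))  # 500 tested terms is also made up
```

Bonferroni is the most conservative of the corrections the slide lists; an FDR procedure would be the usual alternative when many functions are tested.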
Kolmogorov-Smirnov test • Hypergeometric test requires “hard calls”: this list of 278 genes is my upregulated set • But say all 250 genes involved in oxygen consumption go up ~10-20% each: this would not likely show up • KS test asks whether the *distribution* for a given geneset (GO category, etc.) deviates from your dataset’s background, and is nonparametric • Figure: Cumulative Distribution Function (CDF) plot • Gene Set Enrichment Analysis: http://www.broadinstitute.org/gsea/
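The KS statistic is the largest vertical gap between two empirical CDFs. A minimal sketch, assuming NumPy; the function name `ks_statistic` and the simulated log-ratios are illustrative, and only the statistic is computed, not a p-value.

```python
import numpy as np

def ks_statistic(gene_set, background):
    """Maximum absolute difference between the two empirical CDFs."""
    gene_set = np.sort(np.asarray(gene_set, dtype=float))
    background = np.sort(np.asarray(background, dtype=float))
    grid = np.concatenate([gene_set, background])   # evaluate at every data point
    cdf_set = np.searchsorted(gene_set, grid, side="right") / len(gene_set)
    cdf_bg = np.searchsorted(background, grid, side="right") / len(background)
    return float(np.max(np.abs(cdf_set - cdf_bg)))

# Simulated log-ratios: the gene set is shifted up slightly, so few of its
# genes would pass a hard cutoff, yet its distribution differs as a whole.
rng = np.random.default_rng(1)
background = rng.normal(0.0, 1.0, size=5000)
gene_set = rng.normal(0.2, 1.0, size=250)
print(ks_statistic(gene_set, background))
```

For a p-value, a library routine such as `scipy.stats.ks_2samp` can be used instead of this hand-rolled statistic.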
GO term Enrichment Tools • SGD’s & Princeton’s GoTermFinder • http://go.princeton.edu • GOLEM (http://function.princeton.edu/GOLEM) • HIDRA Sealfon et al., 2006
Supervised analysis = learning from examples, classification • We have already seen groups of healthy and sick people. Now let’s diagnose the next person walking into the hospital. • We know that these genes have function X (and these others don’t). Let’s find more genes with function X. • We know many gene-pairs that are functionally related (and many more that are not). Let’s extend the number of known related gene pairs. Known structure in the data needs to be generalized to new data.
Un-supervised analysis = clustering • Are there groups of genes that behave similarly in all conditions? • Disease X is very heterogeneous. Can we identify more specific sub-classes for more targeted treatment? No structure is known. We first need to find it. Exploratory analysis.
Supervised analysis • “Calvin, I still don’t know the difference between cats and dogs …” • “Don’t worry! I’ll show you once more: Class 1: cats, Class 2: dogs” • “Oh, now I get it!!”
Un-supervised analysis • “Calvin, I still don’t know the difference between cats and dogs …” • “I don’t know it either. Let’s try to figure it out together …”
Supervised analysis: setup • Training set • Data: microarrays • Labels: for each one we know if it falls into our class of interest or not (binary classification) • New data (test data) • Data for which we don’t have labels. • E.g. genes without known function • Goal: Generalization ability • Build a classifier from the training data that is good at predicting the right class for the new data.
One microarray, one dot • Think of a space with #genes dimensions (yes, it’s hard for more than 3). • Each microarray corresponds to a point in this space. • If gene expression is similar under some conditions, the points will be close to each other. • If gene expression overall is very different, the points will be far away. • Figure axes: expression of gene 1 vs. expression of gene 2
Which line separates best? • Figure: candidate separating lines A, B, C, D
No sharp knife, but a … FAT PLANE
Support Vector Machines • Maximal margin separating hyperplane • Datapoints closest to the separating hyperplane = support vectors
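A rough sketch of a linear maximal-margin classifier, assuming NumPy. It minimizes the regularized hinge loss by plain subgradient descent, which is a stand-in for the quadratic-programming solvers that SVM libraries use; the function name, the hyperparameters, and the two simulated classes are all assumptions for illustration.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimize lam/2 * ||w||^2 + mean(hinge loss); labels y must be -1 or +1."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points inside the margin
        grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        grad_b = -y[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Two simulated classes in a 2-gene space, one dot per microarray
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 0.5, size=(20, 2)),
               rng.normal([-2.0, -2.0], 0.5, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

w, b = train_linear_svm(X, y)
predictions = np.sign(X @ w + b)
accuracy = float(np.mean(predictions == y))
distances = np.abs(X @ w + b) / np.linalg.norm(w)   # distance to the hyperplane
closest = np.argsort(distances)[:3]                 # candidates for support vectors
margin_width = 2.0 / np.linalg.norm(w)              # width of the "fat plane"
print(accuracy, margin_width, closest)
```

The points with the smallest distance to the hyperplane are the ones that determine where the fat plane sits; moving any other point slightly leaves the solution unchanged.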
How well did we do? • Training error: how well do we do on the data we trained the classifier on? • But how well will we do in the future, on new data? • Test error: how well does the classifier generalize? • Same classifier (= line), new data from the same classes: the classifier will usually perform worse than before • Test error > training error
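Cross-validation • Training error: train the classifier and test it on the same data • Test error: train on one part of the data (Train), test on the held-out part (Test) • K-fold cross-validation, here for K=3: split the data into K parts and let each part serve as the test set once • Step 1: Train, Train, Test • Step 2: Train, Test, Train • Step 3: Test, Train, Train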
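A minimal sketch of K-fold cross-validation, assuming NumPy. To keep it short, the classifier is a nearest-centroid rule rather than the SVM from the earlier slides; the function names and the simulated data are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test point to the class with the closest training mean."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    d = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def cross_validated_error(X, y, k=3):
    """Average test error over the k folds."""
    errors = []
    for train, test in k_fold_indices(len(y), k):
        pred = nearest_centroid_predict(X[train], y[train], X[test])
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))

# Simulated arrays from two classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 0.5, size=(20, 2)),
               rng.normal([-2.0, -2.0], 0.5, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
cv_error = cross_validated_error(X, y, k=3)
print(cv_error)
```

Any step that uses the labels, such as selecting genes, would have to be repeated inside each fold for the test error estimate to stay honest.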
Additional supervised approaches might depend on your goal: cell cycle analysis
Clustering • Let the data organize itself • Reordering of genes (or conditions) in the dataset so that similar patterns are next to each other (or in separate groups) • Identify subsets of genes (or experiments) that are related by some measure
Quick Example • Figure axes: Conditions, Genes
Why cluster? • “Guilt by association” – if unknown gene X is similar in expression to known genes A and B, maybe they are involved in the same/related pathway • Visualization: datasets are too large to be able to get information out without reorganizing the data
Clustering Techniques • Algorithm (method): Hierarchical, K-means, Self Organizing Maps, QT-Clustering, NNN, ... • Distance metric: Euclidean (L2), Pearson correlation, Spearman correlation, Manhattan (L1), Kendall’s τ, ...
Distance Metrics • Choice of distance measure is important for most clustering techniques • Pair-wise metrics – compare vectors of numbers • e.g. genes x & y, ea. with n measurements Euclidean Distance Pearson Correlation Spearman Correlation
Distance Metrics Euclidean Distance Pearson Correlation Spearman Correlation
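The three pair-wise metrics named above can be sketched with NumPy alone. These follow the standard textbook definitions, not necessarily the exact formulas shown on the original slides; the function names and the example vectors are made up.

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance between two expression vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def pearson_correlation(x, y):
    """Linear correlation: insensitive to shifts and scaling of either vector."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def average_ranks(v):
    """Ranks starting at 1, with tied values sharing their average rank."""
    v = np.asarray(v, dtype=float)
    ranks = np.empty(len(v))
    ranks[np.argsort(v)] = np.arange(1, len(v) + 1)
    _, inverse = np.unique(v, return_inverse=True)
    mean_rank = np.bincount(inverse, weights=ranks) / np.bincount(inverse)
    return mean_rank[inverse]

def spearman_correlation(x, y):
    """Pearson correlation of the ranks: only the ordering matters."""
    return pearson_correlation(average_ranks(x), average_ranks(y))

# Genes x and y, each with n = 5 measurements (made-up values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 100.0])   # same ordering, one outlier
print(euclidean_distance(x, y))
print(pearson_correlation(x, y))    # pulled below 1 by the outlier
print(spearman_correlation(x, y))   # unaffected: the ranks agree exactly
```

Clustering code usually needs a distance rather than a similarity; a common convention (an assumption here, not something the slides state) is to use 1 - r for the correlation-based metrics.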
Hierarchical clustering • Imposes (pair-wise) hierarchical structure on all of the data • Often good for visualization • Basic method (agglomerative): 1. Calculate all pair-wise distances 2. Join the closest pair 3. Calculate the pair’s distance to all others 4. Repeat from 2 until all joined
HC – Interior Distances • Three typical variants to calculate interior distances within the tree • Average linkage: mean/median over all possible pair-wise values • Single linkage: minimum pair-wise distance • Complete linkage: maximum pair-wise distance
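The agglomerative procedure and the three linkage variants can be sketched together. This is a naive illustration assuming NumPy and Euclidean distance, with average linkage taken as the mean; the function name `agglomerative` and the toy points are hypothetical, and real tools use much faster update rules.

```python
import numpy as np

def agglomerative(points, linkage="average"):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.
    Original points have ids 0..n-1; each merge creates the next free id."""
    points = np.asarray(points, dtype=float)
    # Step 1: all pair-wise (Euclidean) distances
    dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))
    reduce_fn = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = {i: [i] for i in range(len(points))}
    next_id = len(points)
    merges = []
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        # Step 2: find the closest pair of clusters under the chosen linkage
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                d = reduce_fn(dist[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Steps 3 and 4: join the pair, then repeat with the new cluster
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, float(d)))
        next_id += 1
    return merges

toy = [[0.0], [1.0], [5.0], [6.0], [20.0]]
for method in ("single", "average", "complete"):
    print(method, agglomerative(toy, linkage=method))
```

On the toy data the merge order is the same for all three variants, but the heights at which clusters join differ, which is what changes the shape of the tree.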
Hierarchical clustering: problems • Hard to define distinct clusters • Genes assigned to clusters on the basis of all experiments • Optimizing node ordering hard (finding the optimal solution is NP-hard) • Can be driven by one strong cluster – a problem for gene expression b/c data in row space is often highly correlated
Cluster analysis of combined yeast data sets Eisen M B et al. PNAS 1998;95:14863-14868 ©1998 by The National Academy of Sciences
To demonstrate the biological origins of patterns seen in Figs. 1 and 2, data from Fig. 1 were clustered by using methods described here before and after random permutation within rows (random 1), within columns (random 2), and both (random 3). Eisen M B et al. PNAS 1998;95:14863-14868
Hierarchical Clustering: Another Example • Expression of tumors hierarchically clustered • Expression groups by clinical class Garber et al.
K-means Clustering • Groups genes into a pre-defined number of independent clusters • Basic algorithm: 1. Define k = number of clusters 2. Randomly initialize each cluster with a seed (often with a random gene) 3. Assign each gene to the cluster with the most similar seed 4. Recalculate all cluster seeds as means (or medians) of genes assigned to the cluster 5. Repeat 3 & 4 until convergence (e.g. no genes move, means don’t change much, etc.)
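The basic algorithm above can be sketched as follows, assuming NumPy, squared Euclidean distance as the similarity, and means as cluster seeds. The function name `k_means` and the toy expression profiles are made up for illustration.

```python
import numpy as np

def k_means(data, k, seed=0, max_iter=100):
    """Return (labels, centers) after iterating assignment and update steps."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: seed each cluster with a randomly chosen gene
    centers = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each gene to the cluster with the closest seed
        d = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        # Step 5: stop when no gene moves
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recalculate seeds as the means of the assigned genes
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers

# Six hypothetical genes measured in two conditions, forming two groups
profiles = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = k_means(profiles, k=2)
print(labels)
print(centers)
```

The result depends on the random seeds in general, so running the algorithm several times with different seeds and comparing the solutions is a common precaution.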