Pabio590B – week 1 Microarrays

Presentation Transcript


  1. Pabio590B – week 1 Microarrays • Overview • Design & hybridization • Data analysis

  2. Overview • Affix/synthesize probes of known sequence to chip • Hybridize with labeled sample • Quantify level of hybridization to each probe • Normalization • Statistics • Clustering & more

  3. Experiments you might do • Measure RNA expression: changes in gene expression over time / lifecycle; differences between tissues/cell types; comparisons between species/strains/conditions; whole-genome transcript mapping (tiling arrays) • Measure DNA content: presence or absence of a region; copy number via Comparative Genomic Hybridization; SNP genotyping/re-sequencing • Other: ChIP-on-chip arrays; RIP-on-chip

  4. Microarray Design • Affix/synthesize probes of known sequence to chip • Hybridize with labeled sample • Quantify level of hybridization to each probe • Normalization • Statistics • Clustering & more

  5. RNA Expression Chip Designs • Expression array: N probes per gene of interest; trade-off between accuracy and number of features • Tiling array: place a probe of X nt every Y bases; biased vs unbiased [Figure labels: 20 nt, 50 nt, 50 nt, 70 nt, tiling window]
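
The "probe of X nt every Y bases" rule for tiling arrays can be sketched in a few lines. This is only an illustration: the function name `tile_probes` and the example sequence are hypothetical, and real designs also filter probes for specificity.

```python
def tile_probes(sequence, probe_len, step):
    """Return (start, probe) pairs: one probe of probe_len nt every step bases."""
    if probe_len <= 0 or step <= 0:
        raise ValueError("probe_len and step must be positive")
    last_start = len(sequence) - probe_len
    return [(s, sequence[s:s + probe_len]) for s in range(0, last_start + 1, step)]


# Hypothetical 10-nt target, 4-nt probes, one probe every 3 bases.
probes = tile_probes("ACGTACGTAC", probe_len=4, step=3)
```

A step smaller than the probe length gives overlapping probes; a larger step leaves gaps between them.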

  6. Probe considerations • Number of probes per region of interest • Specificity of probes • Distance between probes (tiling) • Mismatch probes (Affymetrix)

  7. Hybridization • Affix/synthesize probes of known sequence to chip • Hybridize with labeled sample • Quantify level of hybridization to each probe • Normalization • Statistics • Clustering & more

  8. Two-color vs One-color • Two-color • Two samples on each slide • Cy3 - green - 532nm • Cy5 - red - 635nm • One-color • One sample per slide • Cy3 • No significant difference in accuracy or reproducibility

  9. Designs for Two-color Array • Common Reference • Round Robin • Dye Swaps • Biological Replicates • Experiment Replicates

  10. Data Normalization • Affix/synthesize probes of known sequence to chip • Hybridize with labeled sample • Quantify level of hybridization to each probe • Normalization • Statistics • Clustering & more

  11. Within-Array Normalization • Lowess Normalization [Figure: Cy3/Cy5 ratio vs signal intensity, before and after normalization]
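
A rough sketch of what lowess normalization does to a two-color array: compute the log ratio M and the average log intensity A for each spot, fit a smooth trend of M against A, and subtract it. The code below is a simplified, single-pass locally weighted linear fit with tricube weights (no robustness iterations); the function names and the `frac` parameter are assumptions for illustration, not the routine used by any particular package.

```python
import math


def ma_transform(red, green):
    """Log ratio M and mean log intensity A for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def lowess_fit(x, y, frac=0.5):
    """Locally weighted linear fit of y on x, evaluated at every x."""
    n = len(x)
    k = max(2, math.ceil(frac * n))  # number of neighbours in each window
    fitted = []
    for x0 in x:
        # Window half-width: distance to the k-th nearest point.
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def lowess_normalize(red, green, frac=0.5):
    """Return log ratios with the intensity-dependent trend removed."""
    m, a = ma_transform(red, green)
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]
```

If one channel is uniformly twice as bright as the other, every M equals 1 before normalization and is approximately 0 afterwards.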

  12. Between-Array Normalization • RNA Spike-in • Random Probes • Median Scaling • Quantile Scaling Median and quantile normalization assume that the arrays in question have the same underlying distribution. That is to say, only if you can safely assume that the bulk of genes have the same expression across the arrays can you use those methods.
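
Median scaling is the simplest of the between-array methods listed above. A minimal sketch, assuming each array is a list of positive intensities and that the target is the mean of the per-array medians (the choice of target is an assumption; any common value works):

```python
from statistics import median


def median_scale(arrays):
    """Rescale each array so that all arrays share the same median."""
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[v * target / m for v in a] for a, m in zip(arrays, medians)]


# Two hypothetical arrays, the second twice as bright overall.
scaled = median_scale([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
```

After scaling, both hypothetical arrays have a median of 3.0 and identical values, which is only appropriate when most genes really are unchanged between arrays.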

  13. Quantile Normalization [Figure: before and after]
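
Quantile normalization forces every array to have exactly the same distribution of values: each value is replaced by the mean of the values holding the same rank across all arrays. A minimal sketch, assuming arrays of equal length and breaking ties by position rather than averaging them:

```python
def quantile_normalize(arrays):
    """Give every array the same distribution (the mean of the sorted arrays)."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the i-th smallest value across arrays.
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        result.append(out)
    return result


# Two hypothetical arrays of three probes each.
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
```

Each probe keeps its rank within its own array; only the values attached to the ranks change.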

  14. Statistical Analysis • Affix/synthesize probes of known sequence to chip • Hybridize with labeled sample • Quantify level of hybridization to each probe • Normalization • Statistics • Clustering & more

  15. Some Advice About Statistics • Don’t get too hung up on p-values [or any other stat]. • Ultimately what matters is biological relevance and external knowledge and other heterogeneous measures (related functions, pathways, other data types) that are not easily measured by statistics alone. • P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance. • Statistical significance is not necessarily the same as biological relevance and vice-versa. John Quackenbush

  16. Is this gene differentially expressed between the two conditions? [Figure: probe signal for Sample A and Sample B]

  17. To rephrase the question • Is the mean probe value different between Samples A & B? • Null Hypothesis = H0 = means are the same • Alternate Hypothesis = Ha = means are different

  18. What affects our ability to test the hypothesis? • Difference in means • Number of sample points • Standard deviations of sample

  19. The T-statistic • Directly proportional to difference in means • Inversely proportional to standard deviation • Increases with sample size (in proportion to its square root) The T-test calculates how likely a T-statistic at least this extreme is, given the null hypothesis that the means are actually the same.
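
The three dependencies listed on the slide are visible in the formula itself. The sketch below uses the unequal-variance (Welch) form of the statistic, which is an assumption since the slide does not say which variant is meant; each group needs at least two values and some spread.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Welch T-statistic: difference in means over its standard error."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical probe signals for one gene in two samples.
t = t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

A larger difference in means, a smaller variance, or more data points each push the statistic further from zero.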

  20. T-statistic and P-values • P-values can be determined from theoretical distributions or permutation testing • Theoretical distributions rely on a set of assumptions that array experiments do not necessarily follow • Permutation tests make far fewer assumptions (mainly that the values are exchangeable under the null hypothesis)

  21. Permutation Testing 1) Permute n times by random shuffling 2) Calculate T-statistic for each permutation 3) Calculate probability of original T-statistic [Figure: probe signal for Gene A and Gene B in Group 1 and Group 2, shown for the original data, Permutation 1 and Permutation 2]
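
The three permutation-testing steps can be sketched as follows. This is a minimal illustration under stated assumptions: a Welch T-statistic, a two-sided test, random shuffles rather than exhaustive enumeration, and the common (hits + 1) / (n + 1) estimate so the p-value is never exactly zero. The function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def t_stat(a, b):
    """Welch T-statistic; groups with no spread at all are handled explicitly."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if standard_error == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / standard_error


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Fraction of shuffles whose |T| is at least as large as the observed |T|."""
    rng = random.Random(seed)
    observed = abs(t_stat(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # step 1: random relabelling of the values
        shuffled_t = t_stat(pooled[:len(a)], pooled[len(a):])  # step 2
        if abs(shuffled_t) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # step 3
```

With only a handful of values per group there are few distinct relabellings, so the smallest achievable p-value is limited no matter how many shuffles are run.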

  22. Interpreting P-values • T-test tests the null hypothesis that sample means are equal • Gene X has p-value of 5% from T-test • If Gene X is NOT differentially expressed, there is a 5% chance of seeing a difference at least this large • This is not the same as a 95% chance that it is differentially expressed • α = False Positive Rate = 5%

  23. T-Test Refinements • Equal vs unequal variance of samples • Equal vs unequal sample size • Dependent vs independent samples CAVEAT: As sample sizes get smaller, the validity of p-values calculated via permutation diminishes. Microarrays typically have few probes per gene, so sample size is smallish.

  24. Multiple Testing Problem • If there is a 5% chance of false positives in one experiment, what happens when we are testing 10,000 genes? • The majority of those genes are not differentially expressed, but • a 5% p-value cutoff means we should expect about 500 false positives.

  25. Family-Wise Error Rate (FWER) FWER is the probability of making one or more false discoveries (type I errors) among all the hypotheses when performing multiple pair-wise tests. • One comparison: FWER = α • 10,000 comparisons: FWER ~ 1.0 That means that when making 10,000 comparisons you are almost certain to make at least one error.
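
The numbers on the last two slides follow from simple arithmetic. The sketch below assumes the tests are independent and that every gene is truly unchanged, which is a simplification; the function names are hypothetical.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of false positives when every null hypothesis is true."""
    return alpha * n_tests


def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


false_positives = expected_false_positives(0.05, 10_000)
fwer_one = family_wise_error_rate(0.05, 1)
fwer_many = family_wise_error_rate(0.05, 10_000)
```

For a single test the FWER equals the cutoff of 0.05; for 10,000 tests it is indistinguishable from 1, and about 500 false positives are expected.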

  26. Bonferroni Correction What if you want to keep the FWER at 5%? • 0.05 / 10,000 = 0.000005 = 5e-6 • Only those genes with T-test p-value of < 5e-6 are called differentially expressed • Leads to experiment-wide α of 0.05 The Standard Bonferroni correction is considered very conservative

  27. Adjusted Bonferroni • Rank all genes by ascending order of p-value • Assign gene with smallest p-value a corrected p-value cutoff of α / N (0.05/10,000) • Assign gene with second smallest p-value a corrected p-value cutoff of α / (N-1) • Etc… The Adjusted Bonferroni correction is less conservative
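
The two corrections can be compared side by side. The sketch below reads the "adjusted" procedure as a step-down rule that stops at the first gene failing its cutoff (the usual Holm formulation, which is an assumption about what the slide intends); the function names and example p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Standard Bonferroni: every gene is compared against alpha / N."""
    cutoff = alpha / len(p_values)
    return [p < cutoff for p in p_values]


def adjusted_bonferroni_significant(p_values, alpha=0.05):
    """Step-down version: cutoffs alpha/N, alpha/(N-1), ... in rank order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for rank, idx in enumerate(order):
        if p_values[idx] < alpha / (n - rank):
            significant[idx] = True
        else:
            break  # every gene with a larger p-value also fails
    return significant


# Hypothetical p-values for four genes.
p_values = [0.001, 0.015, 0.02, 0.04]
standard = bonferroni_significant(p_values)
adjusted = adjusted_bonferroni_significant(p_values)
```

On these four hypothetical p-values the standard correction keeps only the first gene, while the step-down version keeps all four, which illustrates why it is described as less conservative.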

  28. False Discovery Rate • Measures the expected proportion of false positives amongst “discovered” genes • Factors affecting FDR: • Proportion of actual differentially expressed genes • Distribution of the true differences • Measurement variability • Sample size
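
The slide does not name a procedure for controlling the FDR. One widely used choice, shown here purely as an illustration and not as the method the course uses, is the Benjamini-Hochberg step-up rule: find the largest rank k whose p-value is at most k × FDR / N and call the k smallest p-values significant.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag genes as significant while controlling the false discovery rate."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / n:
            max_rank = rank  # keep the largest rank that passes
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant


# Hypothetical p-values for five genes.
calls = benjamini_hochberg([0.001, 0.025, 0.029, 0.9, 0.8])
```

The second gene (p = 0.025) misses its own cutoff of 0.02 but is still called, because the third gene passes at rank 3; that step-up behaviour is what distinguishes this rule from the Bonferroni-style corrections.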

  29. Analysis of Variance (ANOVA) • Microarray testing across ≥ 3 conditions • Is a gene expressed equally across all conditions? • F-ratio for given gene X: (variability across conditions) / (variability within conditions) • Calculate p-value • Look up probability of F-ratio • Determine probability by permutation testing
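
A minimal sketch of the one-way ANOVA F-ratio for a single gene, using the usual definition: the mean square between conditions divided by the mean square within conditions. The function name and the example values are hypothetical.

```python
from statistics import mean


def f_ratio(groups):
    """One-way ANOVA F-ratio for one gene measured under several conditions."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical signals for one gene under three conditions.
f = f_ratio([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```

For these values the between-condition sum of squares is 26 on 2 degrees of freedom and the within-condition sum of squares is 6 on 6 degrees of freedom, giving F = 13. A large F means the conditions differ more than the replicates within a condition do.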

  30. Significance Analysis of Microarrays (SAM) • Gene-specific T-tests • Computes statistic (dj) for each gene j • measures the relationship between gene expression and a response variable • describes and groups the data based on experimental conditions • uses non-parametric statistics • repeated permutations are used to determine FDR • Accounts for correlations in genes and avoids parametric assumptions about the (normal vs non-normal) distribution of individual genes

  31. Clustering • Affix/synthesize probes of known sequence to chip • Hybridize with labeled sample • Quantify level of hybridization to each probe • Normalization • Statistics • Clustering & more

  32. Why do clustering? • Identify groups of possibly co-regulated genes (e.g. so you can look for common sequence motifs) • Identify typical temporal or spatial gene expression patterns (e.g. cell-cycle data) • Arrange a set of genes in a linear order that is at least not totally meaningless

  33. Can also cluster experiments • Quality control • detect bad/outlying experiments • Identify or categorize classes of biological samples • sorting by tumor sub-type

  34. How do you cluster? • Define a distance measure • Group genes (or experiments) based on that measure Objects are placed into groups. Objects within a group are more similar to each other than objects across groups. In some cases groups are hierarchically organized based on the intra-group similarity

  35. Distance Metrics: Correlation vs Euclidean • Correlation (X,Y) = 1; Distance (X,Y) = 4 • Correlation (X,Z) = -1; Distance (X,Z) = 2.83 • Correlation (X,W) = 1; Distance (X,W) = 1.41
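
The contrast between the two metrics is easy to reproduce. The profiles below are hypothetical and are not the ones plotted on the slide; the point is that correlation responds only to the shape of a profile, while Euclidean distance also responds to its magnitude.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical profiles over four conditions.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [3.0, 4.0, 5.0, 6.0]  # same shape as gene_x, shifted up by 2
gene_z = [4.0, 3.0, 2.0, 1.0]  # mirror image of gene_x
```

Here `gene_x` and `gene_y` are perfectly correlated yet 4 units apart, while `gene_x` and `gene_z` are perfectly anti-correlated.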

  36. Clustering considerations • Correlation clustering: direction only; ≥ 3 conditions • Euclidean clustering: magnitude & direction; ≥ 2 conditions • Array data is noisy, so you probably need multiple data points per condition • Clustering methods: hierarchical; partitional; other

  37. Hierarchical clustering Agglomerative, bottom-up method • Initial state - each item is a cluster • Iterate - join the two most similar clusters • Stop - when number of clusters reaches user-defined value

  38. Linkage methods Ways to determine cluster similarity • Single Link: Similarity of two most similar members • Complete Link: Similarity of two least similar members • Average Link: Average similarity of all members
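
Agglomerative clustering and the three linkage rules fit in a short sketch. It assumes Euclidean distance as the (dis)similarity measure, so "most similar" means "smallest distance", and it uses a simple search over all cluster pairs that is only suitable for small inputs. The function names are hypothetical.

```python
import math


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices)."""
    d = [math.dist(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":    # two most similar (closest) members
        return min(d)
    if linkage == "complete":  # two least similar (farthest) members
        return max(d)
    if linkage == "average":   # average over all member pairs
        return sum(d) / len(d)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, n_clusters, linkage="average"):
    """Bottom-up clustering: merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each item starts alone
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Four hypothetical genes measured under two conditions.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
groups = agglomerative(points, n_clusters=2, linkage="complete")
```

On well-separated data like this all three linkage rules agree; they diverge when clusters are elongated or touch each other.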

  39. Comparing linkage methods [Figure: single, average and complete linkage]

  40. Partitional (K-means) clustering Divisive, top-down method 1) Partition data into K random clusters 2) Assign each point to nearest cluster 3) Calculate centroid of each cluster 4) GOTO step 2
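
A minimal K-means sketch follows. It differs from the slide in one labelled respect: instead of starting from a random partition, it picks K random data points as the initial centroids, which avoids empty starting clusters. The stopping rule (no label changes, or `n_iter` passes) and the function name are also assumptions.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Alternate between assigning points and recomputing centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random points as seeds
    labels = None
    for _ in range(n_iter):
        # Assign each point to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Six hypothetical genes forming two obvious groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(data, k=2)
```

The result depends on the starting centroids and on K, so it is worth rerunning with several seeds and values of K.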

  41. Other methods • Support Vector Machines (SVM) • K-nearest Neighbor (KNN) • Self Organizing Maps (SOM) • Self Organizing Tree Algorithm (SOTA) • Cluster Affinity Search Technique (CAST) • QT Cluster (QTC) • Discriminant Analysis Classifier (DAM) • Principal Component Analysis (PCA) • Etc.

  42. Warnings and Limitations • Clusters are like statistics: ideally they mirror reality, but they should only be taken seriously in conjunction with confirmatory data from other sources. • Clustering software clusters things: if you tell it to find 4 clusters, it will find 4 clusters in anything! • Garbage In, Garbage Out: clustering typically relies on a set of input parameters that can be hard to evaluate except by empirically evaluating the outputs for a given set of input parameters.

  43. Cluster Interpretation - EASE (Expression Analysis Systematic Explorer) • Population size: 40 genes • Cluster size: 12 genes • 10 genes, shown in green, have a common biological theme, and 8 of them occur within the cluster
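
Whether 8 of the 10 themed genes landing in a 12-gene cluster is surprising can be checked with a plain hypergeometric tail probability. This sketch computes only that basic over-representation test; the EASE score reported by the tool is a more conservative variant, so the number below should not be read as EASE's own output. The function name is hypothetical.

```python
from math import comb


def hypergeom_tail(population, annotated, cluster, hits):
    """P(at least `hits` annotated genes in a random cluster of this size)."""
    total = comb(population, cluster)
    upper = min(annotated, cluster)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, cluster - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Numbers from the slide: 40 genes, 10 with the theme, cluster of 12, 8 hits.
p_enrichment = hypergeom_tail(population=40, annotated=10, cluster=12, hits=8)
```

With these numbers the probability is on the order of 2 in 10,000, so the theme is strongly over-represented in the cluster.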

  44. Microarray Analysis Software • TIGR MEV • Limma • SAM • EDGE These software packages are free and open-source. Each has different strengths/weaknesses and makes different assumptions about your data

  45. $$ Analysis Platforms • Gene Sifter • Rosetta Resolver • Bio Discovery

  46. Microarray Data Sources • Gene Expression Omnibus (NCBI) • ArrayExpress (EBI) • Stanford Microarray Database • Yale Microarray Database

  47. Microarray Data Standards • Microarray Gene Expression Data Society (MGED) • MIAME • MAGE-OM • MAGE-ML • RNA Abundance Database (RAD) • Integrating data from various types of expression experiments
