
Differential Analysis



Presentation Transcript


  1. Differential Analysis

  2. Differential Analysis: marker selection. Given phenotypically distinct classes (for example, normal vs. tumor), find “markers” that distinguish these classes from one another.

  3. Gene Marker Selection: hierarchy of difficulty (adapted from P. Tamayo). The degree of difficulty increases from I to IV.

     | Problem | Gene markers | Error | Example |
     |---|---|---|---|
     | I. Tissue or cell type (normal vs. abnormal) | ~1000-2000 | ~0% | Normal vs. renal carcinoma |
     | II. Morphological type | ~200-500 | ~0-5% | Leukemia: ALL vs. AML |
     | III. Morphological subtype (multiclass classification) | ~50-100 | ~0-15% | ALL B- vs. T-cell |
     | IV. Treatment outcome (drug sensitivity) | ~1-20 | ~5-50% | AML treatment outcome |

  4. Gene Marker Selection. Inputs: a dataset and phenotype/class labels. Compute a score for each gene (t-test, SNR, etc.) and sort the genes by score to obtain a ranked gene list. The slide defines two scores: the t-test and the Signal-to-Noise Ratio (SNR).
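
The scoring step on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: it assumes the common unequal-variance form of the t-score, (mean_A - mean_B) / sqrt(var_A/n_A + var_B/n_B), and the usual SNR form, (mean_A - mean_B) / (sd_A + sd_B). The function names and the toy data are made up for the example.

```python
import math
import statistics


def t_score(a, b):
    """Unequal-variance t-score between two lists of expression values."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b)
    )


def snr(a, b):
    """Signal-to-noise ratio: difference of means over the sum of standard deviations."""
    # No guard against zero standard deviations; a real pipeline would need one.
    return (statistics.mean(a) - statistics.mean(b)) / (
        statistics.stdev(a) + statistics.stdev(b)
    )


def rank_genes(expression, labels, score=snr):
    """Return (gene, score) pairs sorted from highest to lowest score.

    expression: dict mapping gene name -> list of values, one per sample
    labels: list of 0/1 class labels, one per sample
    """
    ranked = []
    for gene, values in expression.items():
        class_a = [v for v, lab in zip(values, labels) if lab == 0]
        class_b = [v for v, lab in zip(values, labels) if lab == 1]
        ranked.append((gene, score(class_a, class_b)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Toy data (invented for illustration): 3 genes, 6 samples.
expression = {
    "geneA": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],  # higher in class 0
    "geneB": [1.0, 2.0, 3.0, 5.0, 6.0, 7.0],  # higher in class 1
    "geneC": [4.0, 5.0, 6.0, 4.0, 5.0, 6.0],  # no difference
}
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_genes(expression, labels)
```

Genes at the top of `ranking` are candidate markers for class 0, and genes at the bottom are candidate markers for class 1.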

  5. Gene Marker Selection Challenges. Small sample size. Each gene tested is a separate hypothesis, which increases the likelihood of false positives. Gene interaction is not taken into account.

  6. Gene Marker Selection: Small Sample Size
     • Generate a 10,000 x 100 matrix from a Gaussian (mean = 0, SD = 0.5)
     • Pick n columns (n = 6, 14, 30, 100)
     • Assign sample labels yellow and green
     • Select the top 25 markers for yellow and the top 25 markers for green
     [Figure: marker heat maps, yellow vs. green, for 6, 14, 30, and 100 samples]
     With a small sample size it is easy to find genes that appear correlated with phenotype, even in random data.
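
The experiment on this slide is easy to reproduce in outline. The sketch below is an approximation under stated assumptions: it scores genes with the SNR (the slide does not say which score was used), draws fresh noise for each sample size instead of picking columns from one shared matrix (equivalent for independent Gaussian noise), and uses 1,000 genes rather than 10,000 to keep the run short. The function name `top_marker_scores` is invented for the example.

```python
import random
import statistics


def snr(a, b):
    """Signal-to-noise ratio: difference of means over the sum of standard deviations."""
    return (statistics.mean(a) - statistics.mean(b)) / (
        statistics.stdev(a) + statistics.stdev(b)
    )


def top_marker_scores(n_genes, n_samples, n_top=25, seed=0):
    """SNR scores of the top `n_top` markers for the first class on pure-noise data."""
    rng = random.Random(seed)
    half = n_samples // 2  # first half labeled "yellow", second half "green"
    scores = []
    for _ in range(n_genes):
        values = [rng.gauss(0.0, 0.5) for _ in range(n_samples)]
        scores.append(snr(values[:half], values[half:]))
    return sorted(scores, reverse=True)[:n_top]


for n in (6, 14, 30, 100):
    top = top_marker_scores(n_genes=1000, n_samples=n)
    print(n, "samples: weakest of the top 25 SNR scores =", round(top[-1], 2))
```

Because the data are noise, every "marker" here is a false positive. The scores of the top 25 genes shrink as the number of samples grows, which is the slide's point.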

  7. P-value calculation. If a gene is normally distributed, the t-score follows the t-distribution. What if the genes aren’t normally distributed? Use a permutation test:
     • shuffle the labels (class membership)
     • compute the score for each gene (t-score, SNR, ...)
     • repeat many times to obtain an empirical null distribution of scores for each gene
     • compare the observed score of each gene to the distribution of permuted scores for that gene
     No distributional assumptions are made, and the p-values are gene-specific.
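
A minimal sketch of the permutation test for a single gene follows. It assumes the SNR as the score and a two-sided test on the absolute score; the add-one correction in the final line is a common convention that keeps the estimate above zero, not something stated on the slide. The function name and the example values are made up.

```python
import random
import statistics


def snr(a, b):
    """Signal-to-noise ratio: difference of means over the sum of standard deviations."""
    return (statistics.mean(a) - statistics.mean(b)) / (
        statistics.stdev(a) + statistics.stdev(b)
    )


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Two-sided empirical p-value for one gene's SNR score."""
    rng = random.Random(seed)

    def score(current_labels):
        a = [v for v, lab in zip(values, current_labels) if lab == 0]
        b = [v for v, lab in zip(values, current_labels) if lab == 1]
        return snr(a, b)

    observed = abs(score(labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the link between values and classes
        if abs(score(shuffled)) >= observed:
            hits += 1
    # Add-one correction: the observed labeling counts as one permutation.
    return (hits + 1) / (n_permutations + 1)


labels = [0] * 6 + [1] * 6
# Invented values: one gene that separates the classes, one that does not.
marker_gene = [5.1, 6.0, 7.2, 5.5, 6.4, 6.9, 1.0, 2.2, 3.1, 1.5, 2.4, 2.9]
null_gene = [3.0, 1.2, 4.1, 2.2, 5.3, 3.9, 3.1, 1.0, 4.4, 2.0, 5.0, 4.0]
p_marker = permutation_p_value(marker_gene, labels)
p_null = permutation_p_value(null_gene, labels)
```

The smallest p-value this procedure can report is 1 / (n_permutations + 1), so the number of permutations bounds the resolution of the test.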

  8. Permutation test and P-value: determining how significant a gene’s statistical score is. [Figure: known class A and class B samples are scored under the “true” classes and under permutations 1 through n of the “called” class A / class B labels.] The permutations generate a “null distribution” of values for this gene, which is compared with the “real” score for this gene.

  9. Marker Selection Process. Inputs: a dataset and phenotype/class labels. Compute a score for each gene (t-test, SNR, etc.) to obtain a ranked gene list; measure significance with a permutation test; correct for multiple hypotheses (FDR, FWER, etc.); the genes that remain are the markers.

  10. Multiple Hypotheses: what to control
     • Bonferroni correction: the most conservative metric. It divides the significance threshold by the number of hypotheses (equivalently, it multiplies each p-value by the number of hypotheses).
     • FWER (Family-Wise Error Rate): probability of calling one or more hypotheses significant given that they are all null.
     • FDR (False Discovery Rate): probability that the null hypothesis is true given that the result is significant, that is, the expected fraction of false positives among the results called significant.
     • Try to reduce the number of hypotheses tested in the first place (i.e., filtering).
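
The two kinds of correction can be compared on a small example. The sketch below implements the Bonferroni adjustment and, as one specific FDR procedure, the Benjamini-Hochberg adjustment; the slide names FDR only in general, so the choice of Benjamini-Hochberg is an assumption. The p-values are invented.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (one way to control the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.001, 0.008, 0.039, 0.041, 0.60]  # invented example
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
significant_bonf = [p <= 0.05 for p in bonf]
significant_bh = [p <= 0.05 for p in bh]
```

On these values both corrections keep the first two hypotheses at the 0.05 level, but the Benjamini-Hochberg values for the third and fourth hypotheses are much closer to the threshold than the Bonferroni values, which illustrates why Bonferroni is described as the most conservative choice.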

  11. Exercise: ComparativeMarkerSelection Module
     • Choose module: Gene List Selection > ComparativeMarkerSelection
     • Choose input file: next to “input file”, choose “Specify URL”; view the datasets window in the Web browser; click and drag all_aml_train.preprocessed.gct
     • Choose class file: next to “cls file”, choose “Specify URL”; view the datasets window in the Web browser; click and drag all_aml_train.cls
     • Click Run

  12. Viewing Analysis Results

  13. Differential Analysis Cookbook
     • Reduce the number of hypotheses/genes by variation filtering (an attempt at reducing false negatives).
     • Choose a test statistic (e.g., SNR, t-score, ...).
     • If there are enough samples, compute p-values by permutation test (otherwise, compute an asymptotic test using the standard t-distribution).
     • Control for multiple hypothesis testing by using the FDR correction. Remember: if you choose FDR ≤ 0.05, you are willing to accept that about 5% of the genes called significant are false positives.
     • If the number of significant hypotheses/genes is “too large” even for very small threshold values, either use the maxT correction (possible with empirical p-values only) or use additional criteria (e.g., minimum fold change, minimum expression value, etc.).

  14. Differential Analysis: GenePattern modules
     • Create the expression data set: ExpressionFileCreator
     • Reduce the number of hypotheses/genes by variation filtering: PreprocessDataset
     • Make the class file
     • Run the differential analysis: ComparativeMarkerSelection. Choose a test statistic (say, t-score). If there are enough samples, compute p-values by permutation test (otherwise, use the asymptotic test).
     • View the results with ComparativeMarkerSelectionViewer. Control for MHT (multiple hypothesis testing) by using the FDR correction.
     • Use HeatMapViewer to view results for the top genes
     • Use GSEA to find gene sets (or pathways) that are enriched in your dataset
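
For the "make the class file" step, a class file can also be generated from a list of per-sample labels. The sketch below assumes the categorical CLS layout: a first line with the number of samples, the number of classes, and the constant 1; a second line starting with `#` followed by the class names; and a third line with one class index per sample, in the same order as the samples in the expression file. The function name and the sample labels are hypothetical.

```python
def format_cls(sample_labels):
    """Build the text of a categorical CLS file from per-sample class names.

    Classes are numbered from 0 in order of first appearance.
    """
    class_names = []
    for label in sample_labels:
        if label not in class_names:
            class_names.append(label)
    indices = [str(class_names.index(label)) for label in sample_labels]
    lines = [
        f"{len(sample_labels)} {len(class_names)} 1",  # samples, classes, constant 1
        "# " + " ".join(class_names),                  # class names
        " ".join(indices),                             # one index per sample
    ]
    return "\n".join(lines) + "\n"


# Hypothetical labels for five samples, in the same order as the dataset columns.
cls_text = format_cls(["ALL", "ALL", "AML", "ALL", "AML"])
```

Writing `cls_text` to a file with a `.cls` extension gives a class file that can be paired with the expression dataset; class names must not contain spaces in this layout.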

  15. Working with Samples and Features

  16. Overview
     • Extracting a set of samples
     • Computing co-expressed genes
     • Converting probe set ids to gene names
     • Computing overlap between gene sets

  17. Working with Samples and Features
     • From a combined dataset of cancer and normal samples, select the normal samples.
     • Within the normal samples, find the genes coexpressed with LRPPRC (Affymetrix probe M92439_at), a gene with mitochondrial function.
     • Compare these genes and those coexpressed with LRPPRC in another expression dataset to determine the coexpressed genes common to both datasets.
     Files and modules shown on the slide, in order: GCM_Total.res, SelectFeaturesColumns, GCM_Normals.res, GeneNeighbors, GCM_Normals.markerdata.gct, GCM_Normals.markerlist.odf, GeneListSignificanceViewer, CollapseDataset, GCM_Total_Normals.markerdata.collapsed.gct, ExtractRowNames, GCM_Total_Normals.markerdata.collapsed.row.names.txt, VennDiagram.
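
The logic of this exercise (find the genes most similar to a target gene in each dataset, then intersect the two lists) can be sketched outside GenePattern. The example assumes Pearson correlation as the similarity measure, which is only one possible choice, and uses two tiny invented datasets; the gene names other than LRPPRC and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of values."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def gene_neighbors(expression, target, n_neighbors=2):
    """Names of the genes most positively correlated with `target`."""
    scored = [
        (gene, pearson(expression[target], values))
        for gene, values in expression.items()
        if gene != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [gene for gene, _ in scored[:n_neighbors]]


# Two invented datasets that share gene names but not samples.
dataset_1 = {
    "LRPPRC": [1.0, 2.0, 3.0, 4.0],
    "g1": [2.0, 4.0, 6.0, 8.5],
    "g2": [4.0, 3.0, 2.0, 1.0],
    "g3": [1.0, 3.0, 2.0, 4.0],
}
dataset_2 = {
    "LRPPRC": [2.0, 1.0, 4.0, 3.0],
    "g1": [2.1, 1.2, 4.3, 3.1],
    "g2": [2.0, 1.0, 4.0, 3.5],
    "g3": [3.0, 4.0, 1.0, 2.0],
}
neighbors_1 = gene_neighbors(dataset_1, "LRPPRC")
neighbors_2 = gene_neighbors(dataset_2, "LRPPRC")
common = set(neighbors_1) & set(neighbors_2)  # the overlap a Venn diagram would show
```

The intersection only makes sense when both lists use the same identifiers, which is why the slide collapses probe set ids to gene names before computing the overlap.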

  18. Exercise
