Flow chart of Affymetrix from sample to information

This flow chart outlines the process of Affymetrix analysis, including functional annotation, pathway assignment, co-ordinate regulation, and tissue promoter motif commonalities. The output is generated as an Affy chp file.

Presentation Transcript


  1. Flow chart of Affymetrix from sample to information [Flow-chart labels: Tissue; cRNA; Hyb. (hybridize to Affy arrays); Generate Affy .dat file; Output as Affy .chp file; Text; Self-Organizing Maps (SOMs); Functional annotation; Pathway assignment; Co-ordinate regulation; Promoter motif commonalities]

  2. Microarray Data Analysis • Data preprocessing and visualization • Supervised learning • Machine learning approaches • Unsupervised learning • Clustering and pattern detection • Gene regulatory region prediction based on co-regulated genes • Linkage between gene expression data and gene sequence/function databases • …

  3. Data preprocessing • Data preparation or pre-processing • Normalization • Feature selection • Based on the quality of the signal intensity • Based on the fold change • T-test • …

  4. Normalization [Figure: Experiment 1 vs. control and Experiment 2 vs. control] • Need to scale the red sample so that the overall intensities for each chip are equivalent

  5. Normalization • To ensure the data are comparable, normalization attempts to correct for the following variables: • Number of cells in the sample • Total RNA isolation efficiency • Signal measurement sensitivity • … • Can use simple math • Normalization by global scaling (bring each image to the same average brightness) • Normalization by sectors • Normalization to housekeeping genes • … • Active research area
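Global scaling, as described on the slide above, can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: the function name `global_scale`, the choice of the mean as "average brightness", and the toy intensities are all assumptions.

```python
def global_scale(chips, target=None):
    """Scale every chip so that all chips share the same mean intensity.

    chips  : list of chips, each a list of probe intensities (hypothetical data)
    target : desired mean intensity; defaults to the average of the chip means
    """
    means = [sum(chip) / len(chip) for chip in chips]
    if target is None:
        target = sum(means) / len(means)
    # Multiply each chip by target / (its own mean).
    return [[value * target / mean for value in chip]
            for chip, mean in zip(chips, means)]


# Two toy chips; the second is uniformly twice as bright as the first.
chips = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
scaled = global_scale(chips)
```

After scaling, both toy chips have the same mean, so a probe-by-probe comparison is no longer dominated by overall brightness.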

  6. Basic Data Analysis • Fold change (relative change in intensity for each gene) [Figure labels: Mn-SOD, Annexin IV, Aminoacylase 1]
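A fold change can be computed as the ratio of a gene's intensity in the experiment to its intensity in the control. The sketch below assumes that definition; the log2 variant is a common convention that the slides do not mention, and the function names and input values are hypothetical.

```python
import math


def fold_change(experiment, control):
    """Ratio of experiment intensity to control intensity."""
    return experiment / control


def log2_fold_change(experiment, control):
    """Log2 ratio: symmetric around 0 for up- and down-regulation."""
    return math.log2(experiment / control)


up = fold_change(400.0, 100.0)         # 4-fold increase
down = log2_fold_change(100.0, 400.0)  # 4-fold decrease on the log2 scale
```

On the log2 scale a 4-fold increase and a 4-fold decrease are +2 and -2, which makes the two directions easier to compare than the raw ratios 4 and 0.25.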

  7. Microarray Data Analysis • Data preprocessing and visualization • Supervised learning • Machine learning approaches • Unsupervised learning • Clustering and pattern detection • Gene regulatory region prediction based on co-regulated genes • Linkage between gene expression data and gene sequence/function databases • …

  8. Microarrays: An Example • Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999 • 72 examples (38 train, 34 test), about 7,000 probes • well-studied (CAMDA-2000), good test example • ALL and AML: visually similar, but genetically very different

  9. Feature selection

     | Probe | AML1 | AML2 | AML3 | ALL1 | ALL2 | ALL3 | p-value |
     |---|---|---|---|---|---|---|---|
     | D21869_s_at | 170.7 | 55.0 | 43.7 | 5.5 | 807.9 | 1283.5 | 0.243 |
     | D25233cds_at | 605 | 31.0 | 629.2 | 441.7 | 95.3 | 205.6 | 0.487 |
     | D25543_at | 2148.7 | 2303.0 | 1915.5 | 49.2 | 96.3 | 89.8 | 0.0026 |
     | L03294_g_at | 241.8 | 721.5 | 77.2 | 66.1 | 107.3 | 132.5 | 0.332 |
     | J03960_at | 774.5 | 3439.8 | 614.3 | 556 | 14.4 | 12.9 | 0.260 |
     | M81855_at | 1087 | 1283.7 | 1372.1 | 1469 | 4611.7 | 3211.8 | 0.178 |
     | L14936_at | 212.6 | 2848.5 | 236.2 | 260.5 | 2650.9 | 2192.2 | 0.626 |
     | L19998_at | 367 | 3.2 | 661.7 | 629.4 | 151 | 193.9 | 0.941 |
     | L19998_g_at | 65.2 | 56.9 | 29.6 | 434.0 | 719.4 | 565.2 | 0.022 |
     | AB017912_at | 1813.7 | 9520.6 | 2404.3 | 3853.1 | 6039.4 | 4245.7 | 0.963 |
     | AB017912_g_at | 385.4 | 2396.8 | 363.7 | 419.3 | 6191.9 | 5617.6 | 0.236 |
     | U86635_g_at | 83.3 | 470.9 | 52.3 | 3272.5 | 3379.6 | 5174.6 | 0.022 |
     | … | … | … | … | … | … | … | … |
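One way to rank probes like those in the table is by the size of a two-sample t statistic between the AML and ALL columns. The sketch below uses three rows copied from the table and an unequal-variance (Welch) t statistic; that choice and the name `welch_t` are assumptions, since the slides do not say which t-test variant produced the p-values. Only the statistic is computed here, because a p-value would need the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t statistic without assuming equal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# (AML values, ALL values) for three probes taken from the table above.
probes = {
    "D25543_at": ([2148.7, 2303.0, 1915.5], [49.2, 96.3, 89.8]),
    "U86635_g_at": ([83.3, 470.9, 52.3], [3272.5, 3379.6, 5174.6]),
    "L19998_at": ([367, 3.2, 661.7], [629.4, 151, 193.9]),
}

# Probes with the largest |t| separate the two classes best.
ranked = sorted(probes, key=lambda name: abs(welch_t(*probes[name])), reverse=True)
```

The resulting order agrees with the p-value column: the probe with the smallest listed p-value ranks first and the one with the largest ranks last.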

  10. Hypothesis Testing • A null hypothesis is a hypothesis about a population parameter. • Hypothesis testing tests the viability of the null hypothesis against a set of experimental data • Example: • Test whether the time to respond to a tone is affected by the consumption of alcohol • Hypothesis: µ1 - µ2 = 0 • µ1 is the mean time to respond after consuming alcohol • µ2 is the mean time to respond otherwise

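  11. Z-test • Theorem: If $x_i$, $i = 1, \dots, n$, are independent and each has a normal distribution with mean $\mu$ and variance $\sigma^2$, then $U = \sum a_i x_i$ has a normal distribution with mean $E(U) = \mu \sum a_i$ and variance $D(U) = \sigma^2 \sum a_i^2$. • Hence $\bar{x} = \sum x_i / n \sim N(\mu, \sigma^2/n)$. • Z test: $H: \mu = \mu_0$ ($\mu_0$ and $\sigma_0$ are known; assume $\sigma = \sigma_0$) • What would one conclude about the null hypothesis that a sample of N = 46 with a mean of 104 could reasonably have been drawn from a population with the parameters of $\mu = 100$ and $\sigma = 8$? Use $z = (\bar{x} - \mu_0)/(\sigma/\sqrt{N}) = (104 - 100)/(8/\sqrt{46}) \approx 3.39$, which exceeds 1.96, the two-sided critical value at the 5% level. Reject the null hypothesis.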
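The worked example on the slide above (N = 46, sample mean 104, µ = 100, σ = 8) can be checked numerically. This is a small sketch; the function name `z_test` and the choice of a two-sided p-value are assumptions rather than something the slide specifies.

```python
from math import erfc, sqrt


def z_test(sample_mean, mu0, sigma, n):
    """Return the z statistic and the two-sided p-value under H: mu = mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_two_sided = erfc(abs(z) / sqrt(2))
    return z, p_two_sided


z, p = z_test(sample_mean=104, mu0=100, sigma=8, n=46)
reject = p < 0.05  # reject the null hypothesis at the 5% level
```

The statistic comes out at about 3.39, well beyond the 1.96 cut-off, so the null hypothesis is rejected.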

  12. Histogram [Figure: histograms of Set 1 and Set 2]

  13. T-test

  14. William Sealy Gosset (1876-1937) (Guinness Brewing Company)

  15. Project 3 • A training data set • (38 samples, 7129 probes, 27 ALL, 11 AML) • A testing data set • (35 samples, 7129 probes, 22 ALL, 13 AML) • Lab today: pick the top probes that can differentiate the two subtypes and process the testing data set

  16. K Nearest Neighbor Classification [Figure: three scatter plots of Feature 1 vs. Feature 2; L = ALL, M = AML, plus a marker for the test sample]
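The k-nearest-neighbor idea pictured on this slide can be sketched as follows: find the k training samples closest to the test sample and let them vote. The function name `knn_classify`, the default k = 3, the use of Euclidean distance, and the two-feature toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_classify(train, labels, sample, k=3):
    """Label `sample` by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], sample))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (Feature 1, Feature 2) values for three ALL and three AML samples.
train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

prediction = knn_classify(train, labels, (2, 2))
```

With two classes, an odd k avoids tied votes.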

  17. Distance measures • Euclidean distance • Manhattan distance
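The two distance measures named above differ only in how per-feature differences are combined: Euclidean distance takes the square root of the summed squares, and Manhattan distance sums the absolute values. A minimal sketch, with function names chosen here for illustration:

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


d_euclid = euclidean((0, 0), (3, 4))  # 5.0
d_manhat = manhattan((0, 0), (3, 4))  # 7
```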

  18. Jury Decisions • Use one feature at a time for the classification • Combine the results from the top 51 features • Majority decision [Figure: a test sample classified separately by Feature0, Feature1, …, Feature50, with individual votes such as M, L, M, M]
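The jury scheme can be sketched as one vote per feature followed by a majority count. The slide does not say how a single feature casts its vote; the sketch below assumes each feature votes with the label of the training sample whose value on that feature is closest. Function names and data are hypothetical.

```python
from collections import Counter


def one_feature_vote(train_values, labels, value):
    """Vote of a single feature: label of the nearest training value (assumed rule)."""
    nearest = min(range(len(train_values)), key=lambda i: abs(train_values[i] - value))
    return labels[nearest]


def jury_decision(train, labels, sample):
    """Classify with each feature separately, then take the majority vote."""
    votes = Counter(
        one_feature_vote([row[j] for row in train], labels, sample[j])
        for j in range(len(sample))
    )
    return votes.most_common(1)[0][0]


# Four training samples with three features each (toy numbers).
train = [(1.0, 10.0, 5.0), (2.0, 11.0, 6.0), (9.0, 20.0, 5.5), (8.0, 21.0, 6.5)]
labels = ["L", "L", "M", "M"]

verdict = jury_decision(train, labels, (1.2, 10.2, 6.4))
```

In this toy case two features vote L and one votes M, so the verdict is L. Using an odd number of features, such as the 51 on the slide, rules out ties between two classes.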

  19. False Discovery • Two possible errors in making a decision about the null hypothesis. • We could reject the null hypothesis when it is actually true, i.e., our results were obtained by chance. (Type I error). • We could fail to reject the null hypothesis when it is actually false, i.e. our experiment failed to detect the true difference that exists. (Type II error) • We set α at a level which will minimize the chances of making either of these errors.

  20. False Discovery • Type I error: False Discovery • The expected number of false discoveries (loosely called the False Discovery Rate, FDR, here) is the p-value threshold of the t-test × the number of genes on the array • For a p-value of 0.01 × 10,000 genes = 100 false “different” genes • You cannot eliminate false positives, but by choosing a more stringent p-value, you can keep them manageable (try p=0.001) • The number of false discoveries must be smaller than the number of real differences that you find - which in turn depends on the size of the differences and variability of the measured expression values
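The arithmetic on this slide is easy to reproduce. The sketch below assumes the worst case in which no gene is truly different, so every gene passing the threshold is a false positive; the function name is hypothetical.

```python
def expected_false_positives(p_threshold, n_genes):
    """Expected number of genes passing the threshold by chance alone."""
    return p_threshold * n_genes


at_001 = expected_false_positives(0.01, 10_000)    # about 100 genes
at_0001 = expected_false_positives(0.001, 10_000)  # about 10 genes
```

Tightening the threshold from 0.01 to 0.001 cuts the expected number of chance hits tenfold, which is the slide's reason for suggesting the stricter value.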

  21. RCC subtypes • Clear Cell RCC (70-80%) • Papillary (15-20%) • Chromophobe (4-5%) • Collecting duct • Oncocytoma • Sarcomatoid RCC • Goal: Identify a panel of discriminator genes

  22. Genetic Algorithm for Feature Selection [Figure: raw measurement data for each sample (clear cell RCC, etc.) reduced to features f1 to f5; feature vector = pattern]

  23. Why Genetic Algorithm? • Assuming 2,000 relevant genes, 20 important discriminator genes (features). • Cost of an exhaustive search for the optimal set of features? $C(n,k) = \frac{n!}{k!\,(n-k)!}$, so $C(2000, 20) = \frac{2000!}{20!\,1980!} \ge 100^{20} = 10^{40}$. If it takes one femtosecond ($10^{-15}$ second) to evaluate a set of features, it takes more than $3 \times 10^{17}$ years to find the optimal solution on the computer.
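The size of this search space can be verified directly, since Python integers have arbitrary precision. The sketch assumes a year of 365.25 days; variable names are chosen here for illustration.

```python
from math import comb

n_subsets = comb(2000, 20)   # exact number of 20-gene subsets of 2,000 genes
lower_bound = 100 ** 20      # the slide's bound, equal to 10**40

seconds_per_year = 365.25 * 24 * 3600
# Time to evaluate 10**40 subsets at one femtosecond (1e-15 s) each.
years = lower_bound * 1e-15 / seconds_per_year
```

Even the lower bound of 10^40 subsets already takes more than 3 × 10^17 years, and the exact count is larger still.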

  24. Evolutionary Methods • Based on the mechanics of Darwinian evolution • The evolution of a solution is loosely based on biological evolution • Population of competing candidate solutions • Chromosomes (a set of features) • Genetic operators (mutation, recombination, etc.) • generate new candidate solutions • Selection pressure directs the search • those that do well survive (selection) to form the basis for the next set of solutions.

  25. A Simple Evolutionary Algorithm [Figure: cycle of Evaluation, Selection, and Genetic Operators, with fitness increasing over iterations]

  26. Genetic Algorithm [Figure: loop diagram with numbered steps 1 to 5 over chromosomes of gene identifiers (e.g. g100, g2, g5, g7); the loop repeats while the result is "not good enough" and stops when it is "good enough"]
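The loop in this diagram (evaluate, stop if good enough, otherwise select and apply genetic operators) can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the function name `run_ga`, keeping the top half as survivors, a single-gene replacement as the only operator, and the toy fitness that counts hits in a hypothetical set of informative genes.

```python
import random


def run_ga(fitness, n_features, k, pop_size=20, generations=50,
           good_enough=None, seed=0):
    """Evolve chromosomes (lists of k distinct feature indices) toward high fitness."""
    rng = random.Random(seed)
    # Start from a random population of k-feature chromosomes.
    population = [rng.sample(range(n_features), k) for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate: rank the population from best to worst.
        population.sort(key=fitness, reverse=True)
        # Stop as soon as the best chromosome is good enough.
        if good_enough is not None and fitness(population[0]) >= good_enough:
            break
        # Selection: the better half survives unchanged.
        survivors = population[: pop_size // 2]
        # Genetic operator: each child is a survivor with one gene replaced.
        children = []
        for parent in survivors:
            child = list(parent)
            unused = [f for f in range(n_features) if f not in child]
            child[rng.randrange(k)] = rng.choice(unused)
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


informative = {3, 7, 11}  # hypothetical "truly discriminating" genes


def toy_fitness(chromosome):
    return len(informative & set(chromosome))


best = run_ga(toy_fitness, n_features=30, k=3)
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next. In a real application the fitness function would score how well the selected genes classify the samples.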

  27. Encoding • The most difficult and most important part of any GA • Encode so that illegal solutions are not possible • Encode to simplify the “evolutionary” processes, e.g. reduce the size of the search space • Most GAs use a binary encoding of a solution, but other schemes are possible

  28. GA Fitness • At the core of any optimization approach is the function that measures the quality of a solution or optimization. • Called: • Objective function • Fitness function • Error function • measure • etc.

  29. Genetic Operators • Recombination is intended to produce promising individuals. • Mutation maintains population diversity, preventing premature convergence. [Figure: crossover at a randomly selected crossover point, showing chromosomes 10 30 60 80 and 20 40 50 70 alongside 10 30 50 70 and 20 40 60 80; mutation at a randomly selected mutation site, showing 10 30 62 80]
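The two operators can be sketched with the numbers from this slide. Reading 10 30 60 80 and 20 40 50 70 as the parents, and 62 as the mutated value of 60, is an interpretation of the flattened figure rather than something the transcript states; the function names are hypothetical.

```python
import random


def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two chromosomes after position `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(chromosome, site, new_value):
    """Return a copy of the chromosome with one gene replaced."""
    child = list(chromosome)
    child[site] = new_value
    return child


children = one_point_crossover([10, 30, 60, 80], [20, 40, 50, 70], point=2)
mutant = mutate([10, 30, 60, 80], site=2, new_value=62)

# In a real run both positions are drawn at random, for example:
rng = random.Random(0)
random_point = rng.randrange(1, 4)  # a cut strictly inside a 4-gene chromosome
```

With the cut after the second gene, the children are 10 30 50 70 and 20 40 60 80, matching the chromosomes shown on the slide.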

  30. Genetic Algorithm/K-Nearest Neighbor Algorithm [Figure: Microarray Database; Feature Selection (GA); Classifier (kNN)]
