
Fusing Results from Microarray Experiments

Matt.Boardman@dal.ca http://www.cs.dal.ca/~boardman. Primary paper: Gilks et al, “Fusing microarray experiments with multivariate regression,” Bioinformatics, 2005.


Presentation Transcript


  1. Fusing Results from Microarray Experiments Matt.Boardman@dal.ca http://www.cs.dal.ca/~boardman

  2. Summary • Primary paper: • Gilks et al, “Fusing microarray experiments with multivariate regression,” Bioinformatics, 2005. • The basic idea: • Microarray experiments are subject to noise and variation • Regression model “fuses” data from several microarray tests • Unique visualization! • Application: • Rustici et al, “Periodic gene expression program of the fission yeast cell cycle,” Nature Genetics, 2004.

  3. Agenda • Introduction to Microarrays • Regression Model • Experimental Procedures • Visualization of Results • Project Proposal

  4. DNA Transcription to mRNA Orengo et al, Figure 1.1

  5. DNA Microarrays • A microarray is a collection of thousands of small test locations, arranged in a 1” x 3” array. • Each test location has a small fragment of DNA, called a probe (about 20-70 bases), which corresponds to a particular gene. • Fragments of mRNA (recently transcribed messenger RNA) from a test subject bind to each probe. • We measure the quantity of mRNA that “sticks” to each probe, to determine how much mRNA for that gene is present in the sample. http://www.agilent.com/about/newsroom/lsca/imagelibrary/index_2003.html

  6. DNA Microarrays • Flash demo: http://www.bio.davidson.edu/Courses/genomics/chip/chipQ.html

  7. DNA Microarrays • Slide Manufacturers: Agilent (HP spinoff) Amersham “Codelink” Corning “CMT GAPS II” Erie Sciences “Gold Seal” … • Scanner Manufacturers: Affymetrix Agilent Applied Precision Asper Biotech Axon Molecular Devices National Instruments Vidar … • Commercial Software: Axon/GenePix GeneExplorer Iobion-Stratagene/GeneTraffic Rosetta Resolver Spotfire SGI/GeneSpring … http://microarrays.ucsd.edu/biogem/resources/images/agilent_scanner.jpg

  8. DNA Microarrays • Sources of error: • gene-specific dye bias • probe design and manufacturing • “heterogeneity in source material” (The Fly!) • glass surface abnormalities (warpage, curvature) • variations in glass thickness • slide movement within scanner • slide manufacturing quality • mRNA deterioration • Remedies: • daily calibration • dynamic autofocus (Agilent) • software fixes (e.g. normalization) • repeat, repeat, repeat … http://www.moleculardevices.com/pages/instruments/gn_genepix4000.html

  9. Multivariate Regression Model • Microarray test repetition • Different laboratories • Different slides / scanners / software • Different procedures for sample preparation • Authors propose a new model to combine data from multiple microarray tests • No need to infer the causes of error • Automatically filter out noise and artefacts • Iteratively weight each test based on quality of results • Avoid “polluting high-quality results” with lower quality data • Deliver “fused and cleaned” dataset for further analysis

  10. Multivariate Regression Model • Let: • N be the number of microarray tests • m be the number of genes in each microarray • n be the number of hypothetical cell types under test • Note: • Typically m ≫ N • We don’t know n, but we assume n < N

  11. Multivariate Regression Model • D = XC + ε (Gilks et al, Equation 1) • Where: • D is the matrix of observations: the actual microarray tests • X is a matrix of weights, uniquely designed for each experiment • C is the ideal, perfect microarray test with no variation or noise • ε contains unknown residual errors and noise • So … • D are the warped, noisy observations of the perfect microarray test C
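The shapes implied by the definitions of N, m and n can be written out explicitly. This is a sketch that assumes rows of D index microarray tests and columns index genes; the index names h, g and k are chosen here for illustration.

```latex
D = XC + \varepsilon, \qquad
D \in \mathbb{R}^{N \times m},\quad
X \in \mathbb{R}^{N \times n},\quad
C \in \mathbb{R}^{n \times m},\quad
\varepsilon \in \mathbb{R}^{N \times m}

D_{hg} = \sum_{k=1}^{n} X_{hk}\, C_{kg} + \varepsilon_{hg},
\qquad h = 1,\dots,N,\quad g = 1,\dots,m
```

Read row by row, each observed test h is a weighted mix of the n ideal profiles in C, plus noise.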

  12. Experiment: Periodic Cell-Cycles in Yeast • Question: • Which genes are involved in cell reproduction in yeast? • Schizosaccharomyces pombe: (fission yeast) • Nine experiments were designed in order to synchronize the cell cycles in yeast: • centrifugal elutriation • cdc25 block-release • combinations of both methods • Microarrays taken every 15 minutes, for roughly two cell cycles (about 5.5 hours)

  13. Experiment: Periodic Cell-Cycles in Yeast • Goal: • Fuse these nine different experiments into one “ideal” • Result will be a set of microarray results for one cell-cycle • Problem: • Different synchronization methods → different cell-cycles • Experiments are not exactly in phase with each other • Experiments result in different cell-cycle lengths

  14. Experiment: Periodic Cell-Cycles in Yeast • These nine experiments produce N=178 microarray tests • Each microarray test has m=407 genes • Selected since they are identified as periodic in cell-cycle • 136 of these show significant changes during cycle • Define an ideal cell-cycle, divided into n=10 “fusion times” • Each microarray test will be at a different “angle” in the ideal cycle • The coefficients in X are chosen to weight the relevance of each microarray test to each of the fusion times

  15. Experiment: Periodic Cell-Cycles in Yeast • How are the coefficients in X chosen? • Suppose microarray test h occurs at θh in the cell-cycle • Linear interpolation: • Find the two fusion times on either side of θh • Weight each one according to how close they are to θh • The other fusion times for h have a zero weight • How is this done? We don’t know θh ! • Algorithm assumes initial weight values, then iteratively updates according to resulting generalization error • Authors claim convergence of these weights within 3 or 4 iterations, but continue through 10 iterations in their results for precision (see Gilks et al, Equations 6 and 7)
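The linear interpolation step can be sketched as follows, assuming θh is already known and that the ten fusion times sit at multiples of 2π/10 starting from angle 0. The function name and the starting angle are assumptions made for illustration; the iterative estimation of θh itself is not shown.

```python
import math

def interpolation_weights(theta, n_fusion=10):
    """Return one row of X: weights over the fusion times for a test at angle theta."""
    step = 2 * math.pi / n_fusion              # spacing between fusion times (pi/5 for n=10)
    pos = (theta % (2 * math.pi)) / step       # position measured in fusion-time steps
    lower = int(math.floor(pos)) % n_fusion    # fusion time just before theta
    upper = (lower + 1) % n_fusion             # fusion time just after (wraps around the cycle)
    frac = pos - math.floor(pos)               # how far theta lies from lower towards upper
    weights = [0.0] * n_fusion                 # all other fusion times get zero weight
    weights[lower] += 1.0 - frac
    weights[upper] += frac
    return weights

# A test halfway between the first two fusion times
row = interpolation_weights(math.pi / 10)
# A test just before the end of the cycle wraps back to the first fusion time
wrapped = interpolation_weights(2 * math.pi - math.pi / 10)
```

Each row has at most two non-zero entries and sums to one, so a test lying exactly on a fusion time puts all of its weight there.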

  16. Experiment: Periodic Cell-Cycles in Yeast • Now use the D = XC + ε model to estimate C • But how do we know the answer we get is correct? • Need a technique to visualize the results!
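As a rough illustration of estimating C once X is fixed, the sketch below simulates D = XC + ε on toy data and recovers C by ordinary least squares. This is a simplification: the authors' algorithm also re-weights tests iteratively, which is not reproduced here, and the sizes, noise level and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, n = 60, 25, 10                      # tests, genes, fusion times (toy sizes)

# Hypothetical ideal profiles C (n x m)
C_true = rng.standard_normal((n, m))

# Interpolation-style weights X (N x n): two non-zero entries per test
pos = rng.uniform(0, n, size=N)           # position of each test within the cycle
lower = np.floor(pos).astype(int) % n
frac = pos - np.floor(pos)
X = np.zeros((N, n))
X[np.arange(N), lower] = 1.0 - frac
X[np.arange(N), (lower + 1) % n] = frac

# Noisy observations D (N x m)
D = X @ C_true + 0.05 * rng.standard_normal((N, m))

# Ordinary least-squares estimate of C, solved for all genes at once
C_hat, _, rank, _ = np.linalg.lstsq(X, D, rcond=None)
max_error = np.abs(C_hat - C_true).max()
print("rank of X:", rank, " largest error in C:", max_error)
```

With many more tests than fusion times, the estimate averages out much of the noise in individual tests, which is the intuition behind fusing the experiments.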

  17. Singular Value Decomposition (SVD) • A technique in linear algebra • Commonly used to solve systems of linear equations • Also used for linear least-squares problems, or “curve fitting” • The authors use SVD to find the two eigenvectors of a matrix which exhibit the highest variation • i.e. the “most variable” components of a matrix • not part of the actual model, just used for visualization • Similar in purpose to PCA (Principal Components Analysis), which identifies the components with highest variance • For more information on SVD and PCA with bioinformatics applications, see Wall et al.
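A minimal sketch of this kind of two-dimensional SVD projection, assuming C holds fusion times in rows and genes in columns. The centring step, the function name and the toy data are assumptions for illustration and may differ from the authors' exact procedure.

```python
import numpy as np

def project_genes(C):
    """Project genes (columns of C, shape n x m) onto the two leading singular directions."""
    centered = C - C.mean(axis=0, keepdims=True)      # remove each gene's mean level
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    gene_xy = (Vt[:2] * s[:2, None]).T                # m x 2 coordinates, one point per gene
    time_xy = U[:, :2]                                # n x 2 coordinates, one per fusion time
    radius = np.hypot(gene_xy[:, 0], gene_xy[:, 1])   # distance of each gene from the origin
    return gene_xy, time_xy, radius

# Toy data: 5 strongly periodic genes and 5 nearly flat genes over 10 fusion times
rng = np.random.default_rng(1)
angles = 2 * np.pi * np.arange(10) / 10
phases = rng.uniform(0, 2 * np.pi, size=5)
periodic = np.cos(angles[:, None] - phases[None, :])
flat = 0.01 * rng.standard_normal((10, 5))
C_toy = np.hstack([periodic, flat])

gene_xy, time_xy, radius = project_genes(C_toy)
print(np.round(radius, 3))
```

In this toy case the periodic genes land far from the origin and the flat genes close to it, which mirrors the idea that radius reflects cell-cycle dependence.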

  18. Gilks et al, Figure 1

  19. Closeup of experiment “cdc25-1” Ten “fusion times” are evenly spaced at π/5 radian intervals in the cycle. Gilks et al, Figure 2

  20. “Peppered Fried Egg Plot” (Gilks et al, Figure 4) • Fusion Times: Fusion times are evenly spaced at intervals of π/5 radians. Longer arrows indicate more variability in gene expression levels at this fusion time. • Specific Genes: The “pepper” represents the periodic activation of particular genes. Larger radius from the origin indicates more cell-cycle dependence. • Cell-Cycleness: The boundary of the “yolk” represents the average radius from the origin of all genes, at each point in the cell cycle. • Gene Density: The boundary of the “egg white” represents the average gene density, at each point in the cell cycle.

  21. Multivariate Regression Model • Possible difficulties with proposed algorithm? • Assumes linear relationships for simplicity of algorithm • Note the linear interpolation in our choice of X coefficients • Microarray tests “which fail to cohere with the generality of results will be downweighted” automatically, as part of the algorithm • In other words, the majority wins: what if the majority of experiments have been conducted poorly? • Difference in coverage over cell-cycle • Some parts of the cell-cycle have many contributors, others few • Treatment of missing data: KNN (K Nearest Neighbors) • However, these “imputed” data points have the same weight in the algorithm as the measured data points • Doesn’t address some significant sources of error, such as gene-specific dye bias • Most microarray experiments use the same dyes, Cy3 and Cy5
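The KNN treatment of missing data can be sketched as below. This is a generic nearest-neighbour imputation, not necessarily the variant used in the paper; the function name, the distance measure and the genes-in-rows layout are assumptions. The function also returns a mask of imputed entries, which is one possible way to let later steps treat imputed values differently from measured ones.

```python
import numpy as np

def knn_impute(expr, k=3):
    """Fill NaNs in expr (genes x tests) with the mean of the k most similar genes."""
    expr = np.asarray(expr, dtype=float)
    filled = expr.copy()
    imputed = np.isnan(expr)                        # True where a value was not measured
    n_genes = expr.shape[0]
    for g in np.where(imputed.any(axis=1))[0]:
        dists = []
        for other in range(n_genes):
            if other == g:
                continue
            shared = ~imputed[g] & ~imputed[other]  # tests measured for both genes
            if not shared.any():
                continue
            diff = expr[g, shared] - expr[other, shared]
            dists.append((float(np.sqrt(np.mean(diff ** 2))), other))
        dists.sort()                                # nearest genes first
        for t in np.where(imputed[g])[0]:
            # take the k nearest genes that actually have a measurement for test t
            donors = [expr[o, t] for _, o in dists if not imputed[o, t]][:k]
            if donors:
                filled[g, t] = np.mean(donors)
    return filled, imputed

toy = [[1.0, 2.0, 3.0],
       [1.0, 2.0, np.nan],
       [1.1, 2.1, 3.2],
       [5.0, 6.0, 7.0]]
filled, imputed = knn_impute(toy, k=2)
print(filled[1, 2])
```

Here the missing value is filled from the two genes whose measured profiles are closest, and the distant fourth gene is ignored.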

  22. Project Proposal • Can we use different methods to obtain similar results? • SVM regression (Support Vector Machines)? • To model the ideal, noise-free microarray test at any point in cycle • ICA (Independent Components Analysis)? • Identify contributions from n different cell types and a noise component • Simulated Annealing (a stochastic optimization method)? • Identify the best cell-cycle synchronization points • Why SVM regression? • Ability to generalize from a low number of samples • Detect non-linear relationships (paper assumes linear!) • Why ICA? • Computationally complex, but requires no assumptions about underlying data or noise models (we don’t need to know n!)

  23. References • Primary Paper: • W.R.Gilks, B.D.M.Tom, A.Brazma, “Fusing microarray experiments with multivariate regression,” Bioinformatics, 21(Suppl. 2):137–143, 2005. • Experimental Procedures: • G.Rustici, J.Mata, K.Kivinen, P.Lió, C.J.Penkett, G.Burns, J.Hayles, A.Brazma, P.Nurse, J.Bähler, “Periodic gene expression program of the fission yeast cell cycle,” Nature Genetics, 36(8):809–817, 2004. • Microarrays: • Wikipedia Contributors, “DNA microarray,” (http://en.wikipedia.org/wiki/CDNA_microarray), 2006. • A.M.Campbell, “DNA microarray methodology: Flash animation,” Department of Biology, Davidson College, Davidson, NC, (http://www.bio.davidson.edu/Courses/genomics/chip/chipQ.html), 2001. • C.A.Orengo, D.T.Jones, J.M.Thornton, Bioinformatics: Genes, Proteins & Computers, New York: Springer-Verlag, pp.218–228, 2003. • Singular Value Decomposition (SVD): • W.H.Press, S.A.Teukolsky, W.T.Vetterling, B.P.Flannery, Numerical Recipes in C: The Art of Scientific Computing, 2nd ed., Cambridge University Press, pp.59–70, 1992. • M.E.Wall, A.Rechtsteiner, L.M.Rocha, “Singular value decomposition and principal component analysis,” in A Practical Approach to Microarray Data Analysis, D.P.Berrar, W.Dubitzky, M.Granzow, eds., Kluwer: Norwell, MA, pp.91–109, 2003. • Support Vector Machines (SVM): • K.P.Bennett, C.Campbell, “Support vector machines: Hype or hallelujah?” SIGKDD Explorations, 2(2):1–13, 2000. • A.J.Smola, B.Schölkopf, “A tutorial on support vector regression,” Statistics and Computing, 14(3):199–222, 2004. • V.N.Vapnik, The Nature of Statistical Learning Theory, 2nd ed., New York: Springer-Verlag, 1999. • Independent Components Analysis (ICA): • A.Hyvärinen, “Survey on independent components analysis,” Neural Computing Surveys, 2:94–128, 1999.

  24. Support Vector Machines • SVMs use statistical machine learning • Constrained optimization problem: Objective: Find a hyperplane which maximizes margin • Higher dimensional mappings provide flexibility • Non-separable data: a tradeoff to allow misclassification of some points in order to improve generalization performance (cost parameter) • Non-linear SVM (Polynomial, Sigmoid, Gaussian kernels)

  25. Support Vector Machines • The importance of data normalization (centre and scale) • The importance of free-parameter selection: Dataset from MLDB: “Iris Plant Database”
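Centring and scaling can be sketched as follows. The function name and toy numbers are assumptions; the one design choice illustrated is that the mean and standard deviation are computed on the training data only and then reused for new data, so both are mapped with the same transformation.

```python
import numpy as np

def centre_and_scale(train, new):
    """Standardize each feature using statistics from the training data only."""
    train = np.asarray(train, dtype=float)
    new = np.asarray(new, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0                 # leave constant features unscaled
    return (train - mean) / std, (new - mean) / std

# Two features on very different scales
train = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])
new = np.array([[2.0, 2000.0]])
train_scaled, new_scaled = centre_and_scale(train, new)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0), new_scaled)
```

Without this step, a kernel based on distances would be dominated by the feature with the largest numeric range.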

  26. ε-Tube Support Vector Regression • Can we use ε-SVR for outlier detection? • i.e. identify contributing samples which are outside the ε boundary, remove them, and retrain the model • Missing data: can we include the number of missing data points as another input variable for the SVM model? Bennett et al, Figure 12
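The idea of flagging samples outside the ε boundary can be sketched with the ε-insensitive loss. The predictions below are hypothetical numbers standing in for the output of a trained regression model; the function names and the value of ε are assumptions for illustration.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps):
    """Loss is zero inside the tube and grows linearly outside it."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - eps)

def outside_tube(y_true, y_pred, eps):
    """Boolean mask of samples whose residual exceeds eps (candidate outliers)."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return residual > eps

y_true = [1.0, 2.0, 3.0, 10.0]        # observed values
y_pred = [1.05, 1.9, 3.2, 3.9]        # hypothetical model predictions
loss = eps_insensitive_loss(y_true, y_pred, eps=0.25)
mask = outside_tube(y_true, y_pred, eps=0.25)
kept = [y for y, out in zip(y_true, mask) if not out]   # samples to retrain on
print(loss, mask, kept)
```

Whether removing such samples and retraining improves the fused result is exactly the open question the slide raises; the sketch only shows how the flagging step could work.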

  27. Independent Components Analysis • ICA attempts to find the true underlying signals from multiple observations of a mix of signals • Finds signals which are as statistically independent from one another as possible: “blind source separation” • Different to PCA, which identifies the measured signals with highest variance • For example, consider a hypothetical political debate: • Martin and Harper are speaking at the same time • two omnidirectional microphones listening to both speakers • ICA can isolate each speaker’s voice! • For a demo: http://www2.ele.tue.nl/ica99/realworld.html
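The two-speakers, two-microphones example can be imitated with synthetic signals. The sketch below is a small FastICA-style fixed-point iteration written in NumPy, chosen here for illustration and not necessarily the algorithm behind the demo on the slides; the source signals, mixing matrix and function names are all assumptions.

```python
import numpy as np

def sym_decorrelate(W):
    """Return (W W^T)^(-1/2) W so that the rows of W become orthonormal."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W

def fast_ica(X, n_iter=200, tol=1e-8, seed=0):
    """Estimate independent sources from mixed signals X (signals x samples)."""
    X = X - X.mean(axis=1, keepdims=True)            # centre each recording
    d, E = np.linalg.eigh(X @ X.T / X.shape[1])      # covariance eigendecomposition
    Z = (E / np.sqrt(d)).T @ X                       # whitened data (identity covariance)
    n, T = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.standard_normal((n, n)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)                           # nonlinearity applied to current sources
        W_new = G @ Z.T / T - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
        W_new = sym_decorrelate(W_new)
        change = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0))
        W = W_new
        if change < tol:                             # rows of W stopped rotating
            break
    return W @ Z

# Two "speakers": a sine wave and a square wave
t = np.linspace(0, 100, 5000)
S_true = np.vstack([np.sin(2 * t), np.sign(np.sin(3 * t))])
A = np.array([[1.0, 0.5], [0.5, 1.0]])               # each microphone hears both speakers
X_mixed = A @ S_true
S_est = fast_ica(X_mixed)

# Absolute correlation between each true source and each recovered signal
cross = np.abs(np.corrcoef(np.vstack([S_true, S_est]))[:2, 2:])
print(np.round(cross, 3))
```

ICA recovers sources only up to order, sign and scale, which is why the check uses absolute correlations instead of comparing the signals directly.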

  28. Independent Components Analysis: Test

  29. Independent Components Analysis: Samples

  30. Independent Components Analysis: Results

  31. Cell-cycle for Selected Genes Gilks et al, Figure 5
