
Center for Biofilm Engineering

Center for Biofilm Engineering. Some statistical considerations in molecular methods. Al Parker Statistician and Research Engineer Montana State University. BSTM– July 2009. Acknowledgments. Colleagues in the CBE: James Moberly, Seth D’Imperio, Brent Peyton Markus Dieser


Presentation Transcript


  1. Center for Biofilm Engineering Some statistical considerations in molecular methods Al Parker Statistician and Research Engineer Montana State University BSTM– July 2009

  2. Acknowledgments • Colleagues in the CBE: • James Moberly, Seth D’Imperio, Brent Peyton • Markus Dieser • Marty Hamilton

  3. The problem: How to extract useful information from hundreds to thousands of response variables (e.g. from micro-array analysis) measured on only a few replicates (experiments or environmental samples)

  4. Statistical thinking • Multivariate Statistics attempts to organize and summarize data sets with large numbers of response variables • “organize and summarize” = dimension reduction • In this talk, I will focus on abundance data, estimated for example from micro-array or clone analysis of PCR

  5. Statistical thinking • Hierarchical Clustering • Principal Components • Canonical Correlation

  6. Hierarchical Clustering (38 variables, 9 replicates)

  7. Hierarchical Clustering (38 variables, 9 replicates) • Similarity or distance measure • Linkage: how the similarity measure determines clusters

  8. Two different ways to generate clusters with the same similarity measure
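A minimal sketch of how the linkage rule alone can change the clusters while the distance measure stays the same. The seven points and the absolute-difference distance are made-up illustration values, not data from the talk, and `agglomerate` is a hypothetical helper name. A correlation-based distance such as 1 - Corr could be substituted for the absolute difference.

```python
def agglomerate(points, n_clusters, linkage):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    clusters = [[p] for p in points]

    def cluster_distance(a, b):
        pairwise = [abs(x - y) for x in a for y in b]
        # Single linkage uses the nearest pair of members,
        # complete linkage uses the farthest pair.
        return min(pairwise) if linkage == "single" else max(pairwise)

    while len(clusters) > n_clusters:
        # Find the pair of clusters (i < j) with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)


# Made-up 1-D observations; the distance is the absolute difference in both runs.
points = [0.0, 1.9, 4.0, 6.2, 8.5, 11.1, 12.0]
single = agglomerate(points, 2, "single")
complete = agglomerate(points, 2, "complete")
print("single linkage:  ", single)
print("complete linkage:", complete)
```

With these numbers single linkage chains the first five points together, while complete linkage splits the set more evenly, so the two rules disagree about where 8.5 belongs.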

  9. A Distance or Similarity Measure • Correlation measures the strength and direction of a linear relationship between paired variables x and y: $\mathrm{Corr}(x,y) = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,S_x S_y}$, where $S_x$ and $S_y$ are the sample standard deviations • Unitless • Values between -1 and 1
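The correlation formula on this slide can be sketched directly in code. The function names `sample_sd` and `corr` and the toy vectors are assumptions made for illustration. They are not taken from the talk.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation with the (n - 1) divisor."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def corr(x, y):
    """Corr(x, y) = sum((xi - mean(x)) * (yi - mean(y))) / ((n - 1) * Sx * Sy)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross / ((n - 1) * sample_sd(x) * sample_sd(y))


# Made-up paired data.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
line = [2.0 * v + 1.0 for v in x]   # exact linear relationship
parabola = [v * v for v in x]       # strong but non-linear relationship

r_line = corr(x, line)
r_parabola = corr(x, parabola)
print(r_line, r_parabola)
```

The parabola case is a reminder that a correlation of zero only rules out a linear relationship, not a relationship of any kind.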

  10. An example (2 variables, 9 replicates) Corr(Actinobacteria, Acidobacteria) = 0.7833

  11. Another (made up) example Corr(species 1, species 2) = 0.000

  12. A matrix of scatterplots for 6 variables

  13. A correlation matrix of 6 variables

  14. Principal Components Analysis (PCA) • PCA uses the correlation matrix formed by the original variables to optimally construct a smaller number of new variables which capture the maximum amount of variability in the original variables • PCA applied to the correlation matrix is not affected by disparate units between the different variables • The number of new variables can be no larger than the number of replicates (or the number of original variables, if that is smaller)

  15. PCA with 2 (standardized) responses Original variable #1 Original variable #2

  16. PCA with 2 (standardized) responses • Axes: Original variable #1, Original variable #2 • 1st PC (78%) is loaded by Orig Var #1 • 2nd PC (22%) is loaded by Orig Var #2

  17. PCA terminology • The new variables are called principal components • The amount of variability in the original data captured by each component is reported for that component (e.g. as a percentage) • The correlations between the original variables and the principal components are called principal component loadings
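The three terms above can be sketched with NumPy, assuming PCA is carried out as an eigenanalysis of the correlation matrix. The data are simulated (9 replicates, 3 variables), and `pca_from_correlation` is a hypothetical helper name. Neither comes from the talk.

```python
import numpy as np


def pca_from_correlation(data):
    """PCA via eigenanalysis of the correlation matrix (rows = replicates)."""
    data = np.asarray(data, dtype=float)
    corr_matrix = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr_matrix)   # returned in ascending order
    order = np.argsort(eigvals)[::-1]                # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Share of the total variability captured by each component.
    proportion = eigvals / eigvals.sum()
    # Loadings: correlation between each original variable (rows)
    # and each principal component (columns).
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return eigvals, proportion, loadings


# Simulated data: two variables share a common signal, the third is noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=9)
data = np.column_stack([
    shared + 0.3 * rng.normal(size=9),
    shared + 0.3 * rng.normal(size=9),
    rng.normal(size=9),
])

eigvals, proportion, loadings = pca_from_correlation(data)
print("proportion of variability:", np.round(proportion, 3))
print("loadings:\n", np.round(loadings, 3))
```

Because the analysis starts from the correlation matrix, the eigenvalues sum to the number of variables, and the squared loadings of each variable sum to 1 across all components.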

  18. Reducing 7 original variables to 2 PCs • Original variables: 1. Water depth 2. Core depth 3. Fe 4. Mn 5. Cu 6. Pb 7. Zn • New variables = Principal Components • 1st PC (55%): Metals • 2nd PC (18%): Water depth and Core depth

  19. Reducing 7 original variables to 2 PCs 1st PC - 55% 2nd PC – 18% Total: 73%

  20. PCA is another way to cluster

  21. Canonical Correlation Analysis (CCA) • CCA uses the correlation matrix to determine the (linear) relationship between input variables (e.g. environmental variables) and response variables (e.g. phylogenetic data) • CCA simultaneously finds new variables from the input and response variables which have maximal correlation • The number of new variables (canonical components) can be no larger than the number of replicates
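A rough sketch of how canonical correlations can be computed, using a QR-then-SVD route that is one standard numerical approach and not necessarily the method behind these slides. It assumes more replicates than variables in each set and no exactly redundant variables. The data are simulated, and `canonical_correlations` is a hypothetical helper name.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between two variable sets (rows = replicates)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Center each variable.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations, largest first.
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)


# Simulated data: 3 input variables, 2 response variables, 50 replicates.
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 3))
Y = np.column_stack([
    X[:, 0] + 2.0 * X[:, 1],   # exact linear combination of the inputs
    rng.normal(size=n),        # unrelated noise
])

rho = canonical_correlations(X, Y)
print("canonical correlations:", np.round(rho, 3))
```

The first response is built as a linear combination of the inputs, so the first canonical correlation comes out at 1. The number of canonical pairs equals the size of the smaller variable set, here 2.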

  22. CCA Example (7 inputs, 6 outputs, 9 replicates) • Original environmental variables: 1. Water depth 2. Core depth 3. Fe 4. Mn 5. Cu 6. Pb 7. Zn • Original microbial variables: 1. Acidobacteria 2. Actinobacteria 3. Bacteroidetes 4. Chloroflexi 5. Proteobacteria 6. Verrucomicrobia

  23. CCA (7 inputs, 6 outputs, 9 replicates) • Original environmental variables: 1. Water depth 2. Core depth 3. Fe 4. Mn 5. Cu 6. Pb 7. Zn • Original microbial variables: 1. Acidobacteria 2. Actinobacteria 3. Bacteroidetes 4. Chloroflexi 5. Proteobacteria 6. Verrucomicrobia • Environmental side: 1st CC: Water depth and Core depth; 2nd CC: Metals • Microbial side: 1st CC: Acidobacteria, …, Verrucomicrobia; 2nd CC: Bacteroidetes

  24. CCA (7 inputs, 6 outputs, 9 replicates) • Environmental side: 1st CC: Water depth and Core depth; 2nd CC: Metals • Microbial side: 1st CC: Acidobacteria, …, Verrucomicrobia; 2nd CC: Bacteroidetes

  25. Summary • PROBLEM: Lots of variables measured from a few samples • SOME APPROACHES: • Cluster similar variables together • Principal component analysis creates a few new variables which optimally represent the data • Canonical correlation analysis describes the optimal (linear) relationship between input and output variables

  26. Fin

  27. Principal Component Analysis: water depth , core depth (, Mn-Total, Fe-Total, C

      Eigenanalysis of the Correlation Matrix

      Eigenvalue          3.8467  1.2443  1.0043  0.6628  0.1567  0.0830  0.0023
      Proportion           0.550   0.178   0.143   0.095   0.022   0.012   0.000
      Cumulative           0.550   0.727   0.871   0.965   0.988   1.000   1.000

      Variable               PC1     PC2     PC3     PC4     PC5     PC6     PC7
      water depth (cm)     0.090  -0.529  -0.732   0.338   0.131   0.201  -0.062
      core depth (cm)     -0.193   0.702  -0.154   0.558   0.194   0.313   0.009
      Mn-Total             0.488   0.163  -0.171  -0.016  -0.366   0.084   0.752
      Fe-Total             0.477   0.228  -0.126  -0.057  -0.504   0.154  -0.651
      Cu-Total             0.227  -0.358   0.608   0.633  -0.119   0.188   0.004
      Zn-Total             0.463   0.019   0.147  -0.326   0.634   0.505  -0.026
      Pb-Total             0.474   0.142  -0.055   0.253   0.376  -0.735  -0.080
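The Proportion and Cumulative rows follow from the eigenvalues alone. This short sketch recomputes them, relying on the fact that the eigenvalues of a correlation matrix sum to the number of variables (7 here). The variable names are illustrative.

```python
# Eigenvalues as listed in the eigenanalysis output on this slide.
eigenvalues = [3.8467, 1.2443, 1.0043, 0.6628, 0.1567, 0.0830, 0.0023]

# For a correlation matrix the total variability equals the number of variables.
n_variables = len(eigenvalues)

proportion = [ev / n_variables for ev in eigenvalues]

# Running total of the proportions.
cumulative = []
running_total = 0.0
for p in proportion:
    running_total += p
    cumulative.append(running_total)

for k, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{k}: proportion = {p:.3f}, cumulative = {c:.3f}")
```

The first two rows reproduce the 55% and 18% quoted for the 1st and 2nd PCs, which together account for about 73% of the variability.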

  28. CCA (7 input variables, 9 replicates) 1st CC: Water depth and Core depth 2nd CC: Metals

  29. CCA (6 response variables, 9 replicates) 1st CC: Acidobacteria, …, Verrucomicrobia 2nd CC: Bacteroidetes

  30. Hierarchical Clustering • The large number of variables is organized into a smaller number of clusters of similar variables • One can choose a representative variable from each cluster (e.g. a mean)
