
Statistical Analyses of Life Science and Pathology from Genomic Perspective

This study explores the roles of statistics and data science in genome/omics research, including quality control of noisy high-throughput data, tests and estimation/inference, classification/clustering, and more.


Presentation Transcript


  1. Statistical Analyses of Life Science and Pathology from Genomic Perspective Unit of Statistical Genetics, Kyoto University Ryo Yamada ryamada@genome.med.kyoto-u.ac.jp

  2. Roles of statistics/data science for genome/omics • Quality Control of Noisy High-Throughput Data • Tests, Estimation/Inference, Classification/Clustering • Multi-dimensional/High-dimensional Data • Random value-based approaches • Others : Experimental Designs

  3. Quality Control of Noisy High-Throughput Data • Systematic errors/biases: samples, reagents, date/machine/personnel effects • How to correct or control the noise • Outlier detection • Transformation of all records with a function • Normalization for “locational effects” • “Control samples”

  4. Transformation of all records with a function • Genomic control for GWAS • Preprocessing microarray data: median-based correction, log-transformation

  5. Normalization for “locational effects” • Locational tendencies should be considered. • Batch effects should be considered. • Non-data-driven methods • Data-driven methods

  6. Tests, Estimation/Inference, Classification/Clustering • Tests • Significance, Error Controlling, Multiple-testing issue • Estimation/Inference • Interval, Models, Bayes • Classification/Clustering • Unsupervised Learning vs. Supervised Learning

  7. Multiple Comparison P-value vs. Q-value

  8. Multiple Comparison • Almost all hypotheses are NULL

  9. Uniform distribution • Under the null hypothesis, p-values are uniformly distributed on (0, 1).

  10. Minimum p-value distribution • For example, the minimum among 2^10 p-values. • The minimum p-value can take values much larger than its mean, but in many cases it is smaller than the mean; such small values are not rare.

  11. Minimum p-value distribution [Figure: distributions of the minimum p-value as the number of tests grows 1, 2, 4, 8, …, 10^6]
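As a rough illustration (not part of the slides), under the global null the minimum of N independent uniform p-values follows a Beta(1, N) distribution; a minimal simulation sketch in Python:

```python
import numpy as np

rng = np.random.default_rng(0)

# Under the global null every p-value is Uniform(0, 1), and the minimum
# of N independent p-values follows a Beta(1, N) distribution.
for N in [1, 2, 4, 8, 1024, 10**6]:
    minp = rng.beta(1, N, size=100_000)   # draw the minimum p-value directly
    # The mean of Beta(1, N) is 1 / (N + 1); the distribution is skewed,
    # so the minimum lies below its mean about 63% of the time.
    print(f"N={N:>7}  mean={minp.mean():.2e}  theory={1 / (N + 1):.2e}  "
          f"P(min-p < mean)={np.mean(minp < 1 / (N + 1)):.2f}")
```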

  12. NON-NULL, FDR (False Discovery Rate) • Many hypotheses are NON-NULL, or almost all hypotheses are NON-NULL

  13. Combination of two distributions • Uniform p-values • Small p-values

  14. Pick the smaller p-values. The threshold value should change with the rank of each p-value. The fraction of “true positives” among the picked p-values is controlled.
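The rank-dependent thresholding described here matches the Benjamini-Hochberg step-up procedure; a minimal sketch (the example data are invented):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: rank-dependent thresholds controlling FDR."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    # Reject every hypothesis up to the largest rank k with p_(k) <= (k/m)*q.
    below = p[order] <= ranks / m * q
    k = ranks[below].max() if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Mostly uniform (null) p-values plus a few small (non-null) ones.
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(size=95), rng.uniform(0, 1e-4, size=5)])
print(benjamini_hochberg(pvals).sum(), "discoveries at FDR 0.05")
```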

  15. Large-scale inference • When you observe many at once, their distribution is informative. • The estimates for each observation using that information differ from the estimates not using it. • The q-value of FDR is one such estimate. • Use the information of the distribution when many are observed together • Empirical Bayes
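A minimal sketch of a Storey-style q-value computation, one common empirical-Bayes flavor (the lambda cutoff of 0.5 and the example data are illustrative choices):

```python
import numpy as np

def storey_qvalues(pvals, lam=0.5):
    """Storey-style q-values: estimate pi0, the fraction of null hypotheses,
    from the flat part of the p-value histogram, then convert p to q."""
    p = np.asarray(pvals)
    m = p.size
    # Null p-values are uniform, so the density above `lam` estimates pi0.
    pi0 = min(1.0, np.mean(p > lam) / (1 - lam))
    order = np.argsort(p)
    ranks = np.arange(1, m + 1)
    q = pi0 * m * p[order] / ranks
    q = np.minimum.accumulate(q[::-1])[::-1]   # enforce monotonicity
    out = np.empty(m)
    out[order] = np.minimum(q, 1.0)
    return out

rng = np.random.default_rng(1)
p = np.concatenate([rng.uniform(size=95), rng.uniform(0, 1e-3, size=5)])
print((storey_qvalues(p) < 0.05).sum(), "calls at q < 0.05")
```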

  16. Estimation/Inference • Models, Parameters, Interval, Bayes • Uniform p-values • Small p-values • Assuming a mixture of these two distributions; this is a model.

  17. Estimation/Inference • Samples → Point estimates, Interval estimates • Sampling distribution, theoretical estimates, unbiased estimates, … • Frequentist: the statement “The star’s weight is between a and b” will be right 9 times out of 10.
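A small simulation of that frequentist coverage statement (normal data and a 90% t-interval are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
true_mean, sd, n, reps = 10.0, 2.0, 25, 10_000
t95 = 1.711   # t quantile for a two-sided 90% interval with df = 24

hits = 0
for _ in range(reps):
    x = rng.normal(true_mean, sd, size=n)
    half = t95 * x.std(ddof=1) / np.sqrt(n)
    hits += x.mean() - half <= true_mean <= x.mean() + half
print(f"coverage ~ {hits / reps:.3f}  (target: right 9 times out of 10)")
```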

  18. Estimation/Inference • Frequentists vs. Bayesians • Frequentist approaches are difficult for students who are not good at mathematics, and their thinking processes are not easy to follow. • Bayesian thinking processes, in contrast, tend to be easy for many to follow.

  19. Estimation/Inference • Bayesian • A model has parameter(s) • Data + Model → Estimation of parameter values • Likelihood-based: maximum-likelihood estimates; interval estimates based on likelihood
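A minimal sketch of likelihood-based estimation for a binomial model (the counts are invented; the 1.92 cutoff is the usual chi-square(1)/2 rule giving roughly a 95% interval):

```python
import numpy as np

# Observe k successes in n trials; binomial model with parameter theta.
k, n = 7, 20
theta = np.linspace(1e-4, 1 - 1e-4, 9_999)
loglik = k * np.log(theta) + (n - k) * np.log(1 - theta)

mle = theta[np.argmax(loglik)]   # maximum-likelihood estimate (= k/n)
# Likelihood-based interval: all theta whose log-likelihood is within
# 1.92 of the maximum.
inside = theta[loglik >= loglik.max() - 1.92]
print(f"MLE={mle:.3f}  interval=({inside.min():.3f}, {inside.max():.3f})")
```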

  20. Summary for Genotypes and Phenotypes: Data + Model ↓ Estimation of parameter values • To start your analysis • Record “Values” • “Values” take various shapes • “Simple value”: a Number • “Number” • Natural number, Integer, Rationals, Reals, Complex, Vector, Matrix, … • “Values” for analysis that are not “Numbers” • Mathematical models • Biological phenomena have random errors • Stochastic models, Statistical models • Models have parameters, so • “Values” for parameters are “Numbers” again • “Simple values” are values of parameters in simple models. • Complex models and their parameters can also be values for your analysis.

  21. Quality Control of Noisy High-Throughput Data • Tests, Estimation/Inference, Classification/Clustering • Multi-dimensional/High-dimensional Data • Random value-based approaches • Others: Experimental Designs • Estimation/Inference • Frequentists vs. Bayesians • Use both rather than selecting one of them; that is the way in the 21st century • Bayesian approaches seem to be used more and more, because • Models became more complicated. • Computer assistance: complicated distributions can be handled by simulation • Large-scale data: empirical Bayes approaches can be applied

  22. Estimation/Inference • Frequentists vs. Bayesians • A “prior” distribution is necessary • What is the “appropriate prior”?

  23. Success rate: no information at all • Somebody you don’t know at all will take an exam on which you have no information at all. How likely do you think (s)he will pass it? • The Jeffreys prior: one of the non-subjective priors
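A small sketch using the Jeffreys prior for a success probability, Beta(1/2, 1/2) (the 3-passes-in-10 update is a hypothetical illustration, not from the slides):

```python
from scipy import stats

# Jeffreys prior for a success probability: Beta(1/2, 1/2).
# With no information at all, the prior mean is 0.5.
prior = stats.beta(0.5, 0.5)
print("prior mean:", prior.mean())

# After hypothetically observing 3 passes in 10 attempts, conjugacy gives
# the posterior Beta(0.5 + 3, 0.5 + 7).
posterior = stats.beta(0.5 + 3, 0.5 + 7)
print("posterior mean:", posterior.mean())
print("95% interval:", posterior.interval(0.95))
```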

  24. Estimation/Inference • Frequentists vs. Bayesians • Use both rather than selecting one of them; that is the way in the 21st century • Large-scale inference • The prior can be set based on the data set ~ empirical Bayes

  25. Multi-dimensional/High-dimensional Data • No way to visualize high-dimensional data • It is almost impossible for us to understand high-dimensional data themselves

  26. Multi-dimensional/High-dimensional Data • How many dimensions can we handle? • 2D space or 3D space • Extra dimensions • Gray/Color scale • Arrows • Time

  27. Multi-dimensional/High-dimensional Data • Dimension reduction • Pick up 2 or 3 seemingly important dims; then visualization is easy and we feel we can understand them. • Only a few dims are truly meaningful and all the others are noise. Pick the true dims. • LASSO, Compressed sensing
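A minimal LASSO sketch with scikit-learn (the data, penalty strength, and dimensions are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, k = 100, 1000, 5            # n samples, p features, k truly meaningful dims
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:k] = 3.0                    # only the first k dims carry signal
y = X @ beta + rng.normal(size=n)

model = Lasso(alpha=0.5).fit(X, y)   # the L1 penalty drives most coefs to zero
print("selected dims:", np.flatnonzero(model.coef_))
```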

  28. Multi-dimensional/High-dimensional Data • The space is high-dimensional but the data are low-dimensional • Manifold learning • The data sit in a higher-dimensional space; pull them back to a low-dimensional space.
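A minimal manifold-learning sketch (Isomap from scikit-learn on the standard swiss-roll toy data, a 2D manifold embedded in 3D; the parameter choices are illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A 2D manifold (the swiss roll) embedded in 3D space.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Pull the points back to a 2D representation.
emb = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(X.shape, "->", emb.shape)   # (1000, 3) -> (1000, 2)
```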

  29. High-dimensionality • Many genes, many biomarkers, many features

  30. Multi-dimensional/High-dimensional Data • Life-science data are high-dimensional • The number of observed items is huge. • But the items are mutually strongly correlated, and their dimension is much smaller in reality. • Examples: FACS, ethnic diversity

  31. Multi-dimensional/High-dimensional Data • Objects with low dimensions in higher dimensional space • Topology • Graph, network and topology

  32. Multi-dimensional/High-dimensional Data • Graph: itemize, then connect items that are related • Only pairwise relations are captured. • Trio-wise or higher-order relations are not represented.

  33. Multi-dimensional/High-dimensional Data • Graphs, their matrix representations, and linear algebra • Graphs tend to be sparse • … sparse analysis

  34. Multi-dimensional/High-dimensional Data • Two important features • No “common” individuals • Sparse

  35. High-dimensionality • No commons • Central area: a sphere inscribed in a cube. In 2D the fraction is π/4 ≈ 0.785; it shrinks as the dimension grows, so almost nobody sits in the “common” center.
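The central fraction can be computed for any dimension from the volume of the unit ball; a short sketch:

```python
from math import gamma, pi

# Volume of the unit ball inscribed in the cube [-1, 1]^d,
# divided by the cube's volume 2^d: the "central" fraction of the space.
def central_fraction(d):
    ball = pi ** (d / 2) / gamma(d / 2 + 1)   # volume of a radius-1 ball
    return ball / 2 ** d

for d in [2, 3, 6, 10, 20]:
    print(f"d={d:>2}  central fraction = {central_fraction(d):.6f}")
# d=2 gives pi/4 = 0.785; the fraction vanishes as d grows,
# so no individual is "common" (near the center) in high dimensions.
```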

  36. High-dimensionality • Sparse • To estimate density you need a reasonable number of samples per small cell, but with bin width 0.1 per axis each cell covers only 0.1^d of the space: • Dim = 1: 0.1 • Dim = 2: 0.01 • Dim = 3: 0.001 • … • Dim = 6: 0.000001

  37. High-dimensionality • The space is quite spacious, yet the distribution is reasonably dense. • Such a distribution must be low-dimensional.

  38. Multi-dimensional/High-dimensional Data • Life-science data are high-dimensional • The number of observed items is huge. • But the items are mutually strongly correlated, and their dimension is much smaller in reality. • Examples: FACS, ethnic diversity

  39. Low-dimensional distribution in a higher-dimensional space and its local density • Regular density estimation methods do not work. • Small cells are still sparsely populated in high-dimensional space • How to estimate local density • k-nearest neighbor method • In graph theory, a similar idea is applicable. • Minimum spanning tree
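A minimal k-nearest-neighbor density-estimation sketch (Gaussian toy data; the ball-volume formula is the standard kNN estimator):

```python
from math import gamma, pi

import numpy as np
from scipy.spatial import cKDTree

def knn_density(X, k=5):
    """kNN density estimate: k points fall inside the ball reaching the
    k-th neighbor, so density ~ k / (n * volume of that ball)."""
    n, d = X.shape
    r = cKDTree(X).query(X, k=k + 1)[0][:, -1]   # distance to k-th neighbor
    ball = pi ** (d / 2) / gamma(d / 2 + 1) * r ** d
    return k / (n * ball)

X = np.random.default_rng(4).normal(size=(500, 2))
print(knn_density(X)[:5])   # higher values near the center of the cloud
```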

  40. Sparse in high dimensional space • How sparse? • One-dimensional manifolds • But significant variance

  41. Sparse in high dimensional space • How sparse? • One-dimensional manifolds • But significant variance Clustering

  42. Two types of clustering methods • Hierarchical • Non-hierarchical

  43. Hierarchical • Tree structure: a graph, again • Its structure has information • Its structure is related to dimension • On the tree, a distance is defined. • Some phenomena have reasons to be analyzed hierarchically.
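A minimal hierarchical-clustering sketch with SciPy (toy data in 10 dimensions; the linkage method and cut level are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(5)
# Two clouds in 10-dimensional space.
X = np.vstack([rng.normal(0, 1, (20, 10)), rng.normal(5, 1, (20, 10))])

Z = linkage(X, method="average")                  # build the tree (dendrogram)
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```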

  44. Classification • Separate something difficult to segregate. J. Med. Imag. 1(3), 034501 (Oct 09, 2014). doi:10.1117/1.JMI.1.3.034501

  45. Classification/Clustering • Unsupervised Learning • Supervised Learning • No teacher, but we want to check whether the classification criterion is reliable or not. • Cross-validation: one of the resampling methods
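A minimal cross-validation sketch with scikit-learn (logistic regression on invented data; the 5-fold choice is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# 5-fold cross-validation: each fold is held out once to check the
# classification criterion learned on the remaining folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", np.round(scores, 2), "mean:", round(scores.mean(), 2))
```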

  46. Small n Large p • Large n, small p: sample size 100; test the association between a trait and the expression of one gene. N = 100, p = 1. • Small n, large p: sample size 100; test the association between a trait and the expression of MANY genes. N = 100, p = 25,000.

  47. n << p • One set of variables gives the perfect answer. • Another set of variables gives a perfect but different answer. • Which answer is the truth? • Closer fitting is not always the best. • AIC ~ the simpler model is better • LASSO, sparsity • The assumption that k << n variables should be the answer is a “prior” belief: Bayesian
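A tiny numerical illustration (toy data) that with n << p many different coefficient vectors fit the observations perfectly:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 10, 50                        # far more variables than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm exact solution, plus a second exact solution obtained by
# adding a vector from the null space of X: both fit y perfectly.
b1 = np.linalg.pinv(X) @ y
null_vec = np.linalg.svd(X)[2][-1]   # a direction that X maps to (near) zero
b2 = b1 + 10.0 * null_vec
print(np.allclose(X @ b1, y), np.allclose(X @ b2, y))   # True True
```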

  48. Resampling • Estimation based on samples • Jackknife (subsets), Bootstrap (resampling with replacement) • Statistical significance • Permutation ~ exact probability • Cross-validation • Pseudo-random generators from computers
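A minimal permutation-test sketch (toy two-group data; the add-one correction is one common convention):

```python
import numpy as np

def permutation_test(a, b, n_perm=10_000, rng=None):
    """Permutation p-value for a difference in group means."""
    rng = rng or np.random.default_rng(0)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)           # relabel the groups at random
        diff = perm[:a.size].mean() - perm[a.size:].mean()
        count += abs(diff) >= abs(observed)
    return (count + 1) / (n_perm + 1)            # add-one correction

rng = np.random.default_rng(8)
print(permutation_test(rng.normal(0.8, 1, 30), rng.normal(0, 1, 30), rng=rng))
```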

  49. Pseudo-random number sequences • From the uniform distribution • From other known distributions • From arbitrary distributions … Gibbs sampling • Using a Gibbs sampler, based on a stochastic model, estimate parameters of distributions and generate random values from the estimated distributions • BUGS (Bayesian inference Using Gibbs Sampling)
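A minimal Gibbs-sampler sketch for a standard bivariate normal (the correlation value is illustrative; real applications such as BUGS derive model-specific conditionals):

```python
import numpy as np

# Gibbs sampling for a standard bivariate normal with correlation rho:
# alternately draw each coordinate from its conditional distribution.
rho, n_iter = 0.8, 20_000
rng = np.random.default_rng(9)
x = y = 0.0
samples = np.empty((n_iter, 2))
for i in range(n_iter):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # x | y
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # y | x
    samples[i] = x, y

# Discard burn-in and check the sampled correlation (~0.8).
print("empirical correlation:", np.corrcoef(samples[1000:].T)[0, 1])
```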
