This study explores the roles of statistics and data science in genome/omics research, including quality control of noisy high-throughput data, tests and estimation/inference, classification/clustering, and more.
Statistical Analyses of Life Science and Pathology from Genomic Perspective Unit of Statistical Genetics, Kyoto University Ryo Yamada ryamada@genome.med.kyoto-u.ac.jp
Roles of statistics/data science for genome/omics • Quality Control of Noisy High-Throughput Data • Tests, Estimation/Inference, Classification/Clustering • Multi-dimensional/High-dimensional Data • Random value-based approaches • Others : Experimental Designs
Quality Control of Noisy High-Throughput Data • Systematic errors/biases: samples, reagents, date/machine/personnel effects • How to correct or control the noise • Outlier detection • Transformation of all records with a function • Normalization for “locational effects” • “Control samples”
Transformation of all records with a function • Genomic control for GWAS • Preprocessing microarray data • Median-based correction • Log-transformation
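As an illustration of median-based correction combined with log-transformation, a minimal sketch in Python (the array values are made-up numbers, not real microarray data):

```python
import math

# Toy expression values from two hypothetical arrays.
array_a = [110.0, 220.0, 440.0, 880.0]
array_b = [55.0, 110.0, 220.0, 440.0]

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def log2_median_center(values):
    """Log-transform, then subtract the array median so arrays align."""
    logged = [math.log2(v) for v in values]
    m = median(logged)
    return [x - m for x in logged]

a = log2_median_center(array_a)
b = log2_median_center(array_b)
# After median-centering on the log scale, both arrays have median 0,
# so a constant multiplicative bias between arrays is removed.
```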
Normalization for “locational effects” • Overall tendencies should be considered. • Batch effects should be considered. • Non-data-driven • Data-driven
Tests, Estimation/Inference, Classification/Clustering • Tests • Significance, Error Controlling, Multiple-testing issue • Estimation/Inference • Interval, Models, Bayes • Classification/Clustering • Unsupervised Learning vs. Supervised Learning
Multiple Comparison P-value vs. Q-value
Multiple Comparison • Almost all hypotheses are NULL
Minimum p-value distribution • With 2^10 tests, the minimum p-value can occasionally take a value considerably larger than its mean, but in many cases it is smaller than the mean; such small values are not rare.
Minimum p-value distribution (figure: distributions of the minimum p-value for 1, 2, 4, 8, …, 10^6 tests)
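The behavior of the minimum p-value can be checked by simulation; a sketch assuming all hypotheses are null, so every p-value is uniform on [0, 1]:

```python
import random

random.seed(0)

def simulated_min_p(n_tests, n_sims=2000):
    """Average of the minimum p-value across n_tests null tests
    (null p-values are uniform on [0, 1])."""
    total = 0.0
    for _ in range(n_sims):
        total += min(random.random() for _ in range(n_tests))
    return total / n_sims

# Theoretical mean of the minimum of k uniforms is 1 / (k + 1):
# the more tests, the smaller the smallest p-value, even under the null.
for k in (1, 2, 4, 1024):
    print(k, round(simulated_min_p(k), 4), round(1 / (k + 1), 4))
```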
NON-NULL, FDR (False Discovery Rate) • Many hypotheses are NON-NULL, or Almost all hypotheses are NON-NULL
Combination of two distributions • Uniform p-values • Small p-values
Pick the smaller p-values. The threshold value changes with the rank of each p-value. The fraction of false discoveries among those picked is controlled.
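The rank-dependent threshold described above is the Benjamini-Hochberg procedure; a minimal pure-Python sketch (the p-values are made-up):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return indices of hypotheses rejected at FDR level q.

    The cutoff grows with rank: reject up to the largest rank i with
    p_(i) <= (i / m) * q, so the expected fraction of false discoveries
    among the rejections is controlled at q.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            k = rank
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
print(benjamini_hochberg(pvals, q=0.05))  # → [0, 1]
```

Note that 0.039 survives a plain per-test cutoff of 0.05 but not the rank-3 threshold 3/8 × 0.05 = 0.01875, which is exactly how the changing threshold works.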
Large-scale inference • When you observe many things at once, their distribution is informative. • Estimates for each observation that use this information differ from estimates that do not. • The q-value of FDR is one such estimate. • Use the information in the distribution when many things are observed together • Empirical Bayes
Estimation/Inference • Models, Parameters, Interval, Bayes • Uniform p-values • Small p-values Assuming the mixture of two distributions; This is a model.
Estimation/Inference • Samples → Point estimates, Interval estimates • Sampling distribution, theoretical estimates, unbiased estimates, … • Frequentist: the statement “The star’s weight is between a and b” will be right 9 times out of 10.
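The frequentist coverage statement can be verified by simulation; a sketch assuming normal samples with a known standard deviation (all the numbers here are illustrative choices):

```python
import random

random.seed(1)

def coverage(n_trials=2000, n=30, mu=10.0, sigma=2.0):
    """Fraction of 90% normal-theory intervals that contain the true mean."""
    z = 1.645  # two-sided 90% quantile of the standard normal
    hits = 0
    for _ in range(n_trials):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        half = z * sigma / n ** 0.5
        if mean - half <= mu <= mean + half:
            hits += 1
    return hits / n_trials

print(coverage())  # close to 0.9, i.e. "right 9 times out of 10"
```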
Estimation/Inference • Frequentists vs. Bayesians • Frequentist approaches are difficult for students who are not good at mathematics, and their reasoning processes are not easy to follow. • Bayesian thinking processes, in contrast, tend to be easy for many to follow.
Estimation/Inference • Bayesian • A model has parameter(s) • Data + Model → Estimation of parameter values • Likelihood-based; maximum-likelihood estimates; interval estimates based on likelihood
Summary for Genotypes and Phenotypes • Data + Model ↓ Estimation of parameter values • To start your analysis • Record “Values” • “Values” take various shapes • “Simple value”: a Number • “Number” • Natural numbers, integers, rationals, reals, complex, vectors, matrices, … • There are “values” for analysis that are not “numbers” • Mathematical models • Biological phenomena have random errors • Stochastic models, statistical models • Models have parameters, so • “Values” for parameters are “numbers” again • “Simple values” are values of parameters in simple models. • Complex models and their parameters can also be values for your analysis.
Estimation/Inference • Frequentists vs. Bayesians • Use both rather than selecting one; that is the way in the 21st century • Bayesian approaches seem to be used more and more, because • Models have become more complicated • Computer assistance: complicated distributions can be handled by simulation • Large-scale data: empirical Bayes approaches can be applied
Estimation/Inference • Frequentists vs. Bayesians • A “prior” distribution is necessary • What is the “appropriate prior”?
Success rate: no information at all • Somebody you don’t know at all will take an exam about which you have no information at all. How likely do you think (s)he will pass it? • Jeffreys prior: one of the non-subjective priors
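A sketch of the Jeffreys prior for a success probability, which is Beta(1/2, 1/2); the exam counts used for the update are hypothetical:

```python
# With no data at all, the predictive probability of passing is the
# prior mean of the Jeffreys prior Beta(1/2, 1/2); observing outcomes
# for similar people updates the Beta posterior by simple addition.

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

a0, b0 = 0.5, 0.5                 # Jeffreys prior
print(beta_mean(a0, b0))          # 0.5: no information -> even odds

successes, failures = 7, 3        # hypothetical exam outcomes
print(beta_mean(a0 + successes, b0 + failures))  # 7.5 / 11 ≈ 0.68
```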
Estimation/Inference • Frequentists vs. Bayesians • Use both rather than selecting one; that is the way in the 21st century • Large-scale inference • The prior can be set based on the data set ~ empirical Bayes
Multi-dimensional/High-dimensional Data • There is no way to visualize high-dimensional data directly • It is almost impossible for us to understand high-dimensional data themselves
Multi-dimensional/High-dimensional Data • How many dimensions can we handle? • 2D space or 3D space • Extra dimensions • Gray/Color scale • Arrows • Time
Multi-dimensional/High-dimensional Data • Dimension reduction • Pick 2 or 3 seemingly important dims; then visualization is easy and we feel we can understand them. • Only a few dims are truly meaningful and all the others are noise. Pick the true dims. • LASSO, Compressed sensing
Multi-dimensional/High-dimensional Data • The space is high-dimensional but the data are low-dimensional • Manifold learning • Put the data into a higher-dimensional space and pull them back to a low-dimensional space.
High-dimensionality • Many genes, many biomarkers, many features
Multi-dimensional/High-dimensional Data • Life-science data are high-dimensional • The number of observed items is huge. • But the items are strongly mutually correlated, and their effective dimension is much smaller in reality. • FACS • Ethnic diversity
Multi-dimensional/High-dimensional Data • Objects with low dimensions in higher dimensional space • Topology • Graph, network and topology
Multi-dimensional/High-dimensional Data • Graph: itemize, then connect items that have a relation • Only pairwise relations are captured. • Trio-wise or higher-order relations are not.
Multi-dimensional/High-dimensional Data • Graph and its matrix representation and linear algebra • Graph tends to be sparse • … Sparse analysis
Multi-dimensional/High-dimensional Data • Two important features • No “common” individuals • Sparse
High-dimensionality • No commons • Central area: the ball inscribed in a square covers π/4 ≈ 0.785 of it in 2D, and the fraction shrinks rapidly as the dimension grows.
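The shrinking central volume can be checked by Monte Carlo; a sketch sampling uniform points in the unit cube and counting how many fall in the inscribed ball:

```python
import random

random.seed(2)

def inscribed_ball_fraction(dim, n_points=20000):
    """Fraction of uniform points in the unit cube that fall inside
    the inscribed ball (radius 0.5, centered at the cube's center)."""
    inside = 0
    for _ in range(n_points):
        r2 = sum((random.random() - 0.5) ** 2 for _ in range(dim))
        if r2 <= 0.25:
            inside += 1
    return inside / n_points

for d in (2, 3, 6, 10):
    print(d, inscribed_ball_fraction(d))
# dim 2 is close to pi/4 = 0.785; the fraction collapses toward 0
# as the dimension grows, so almost no point is "central".
```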
High-dimensionality • Sparse • To estimate density you need a reasonable number of samples per small cubic volume, but the fraction of space in a cell of side 0.1 is: • Dim = 1: 0.1 • Dim = 2: 0.01 • Dim = 3: 0.001 • … • Dim = 6: 0.000001
High-dimensionality • The space is quite spacious, yet the distribution is reasonably dense. • So the distribution must be low-dimensional.
Low dimensional distribution in higher dimensional space and its local density • Regular density estimation methods do not work. • Small cubes are still spacious in high-dimensional space. • How to estimate local density • k-nearest neighbor method • In graph theory a similar idea is applicable. • Minimum spanning tree
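A minimal 1-D sketch of the k-nearest-neighbor density idea (the data points are made-up): instead of fixing a cell size, fix k and let the volume adapt to the data.

```python
def knn_density(point, data, k=3):
    """k-nearest-neighbor density estimate in 1D:
    density ~ k / (n * volume of the ball reaching the k-th neighbor)."""
    n = len(data)
    dists = sorted(abs(x - point) for x in data)
    r = dists[k - 1]    # distance to the k-th nearest neighbor
    volume = 2 * r      # length of the interval [point - r, point + r]
    return k / (n * volume)

data = [0.0, 0.1, 0.2, 0.3, 5.0]
dense = knn_density(0.15, data)   # inside the tight cluster
sparse = knn_density(4.0, data)   # in the empty region
print(dense, sparse)
```

Because the ball expands until it captures k points, the estimate stays defined even where a fixed-size cube would be empty.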
Sparse in high dimensional space • How sparse? • One-dimensional manifolds • But significant variance
Clustering
Clustering methods: two types • Hierarchical • Non-hierarchical
Hierarchical • Tree structure --- a graph, again • Its structure carries information • Its structure is related to dimension • On the tree, a distance is defined. • Some phenomena have reasons to be analyzed hierarchically.
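To illustrate how a hierarchical tree is built by repeated merges, a toy single-linkage agglomerative clustering on made-up 1-D data (a sketch, not a production implementation):

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering with single linkage on 1-D data:
    repeatedly merge the two clusters whose closest members are closest."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between nearest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

print(single_linkage([0.0, 0.2, 0.3, 5.0, 5.1, 9.0], 3))
# → [[0.0, 0.2, 0.3], [5.0, 5.1], [9.0]]
```

The sequence of merges is exactly the tree structure: cutting the tree at different heights yields different numbers of clusters.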
Classification • Separate something difficult to segregate. J. Med. Imag. 1(3), 034501 (Oct 09, 2014). doi:10.1117/1.JMI.1.3.034501
Classification/Clustering • Unsupervised Learning • Supervised Learning • No teacher, but we want to check whether the classification criterion is reliable or not. • Cross-validation: one of the resampling methods
Small n Large p • Large n, Small p: sample size 100; test association between a trait and the expression of a single gene. n = 100, p = 1 • Small n, Large p: sample size 100; test association between a trait and the expression of MANY genes. n = 100, p = 25,000
n << p • One set of variables gives a perfect answer. • Another set of variables gives a perfect but different answer. • Which answer is the truth? • Closer fitting is not always the best. • AIC ~ a simpler model is better • LASSO, sparsity • The assumption that k << n variables should be the answer is a “prior” belief: Bayesian
Resampling • Estimation based on samples • Jack-knife (subsets), Bootstrap (sampling with replacement) • Statistical significance • Permutation ~ exact probability • Cross-validation • Pseudo-random generators from computers
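A permutation test sketch for the difference of two group means (the group values are made-up numbers): shuffling the pooled labels generates the null distribution directly from the data.

```python
import random

random.seed(3)

def permutation_test(group_a, group_b, n_perm=5000):
    """Two-sample permutation test on the absolute difference of means."""
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = group_a + group_b
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        random.shuffle(pooled)  # reassign group labels at random
        diff = abs(sum(pooled[:n_a]) / n_a -
                   sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            count += 1
    return count / n_perm

a = [5.1, 5.3, 5.0, 5.4, 5.2]
b = [6.0, 6.2, 6.1, 5.9, 6.3]
print(permutation_test(a, b))  # small p-value: the groups clearly differ
```

With small samples the full set of label permutations can be enumerated instead, which gives the exact probability mentioned above.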
Pseudo-random number sequences • From the uniform distribution • From other known distributions • From arbitrary distributions … Gibbs sampling • Using a Gibbs sampler, • based on a stochastic model, estimate the parameters of distributions and generate random values from the estimated distributions… • BUGS (Bayesian inference Using Gibbs Sampling)
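A minimal Gibbs sampler sketch for a standard bivariate normal with correlation rho, drawing each coordinate from its conditional distribution given the other (the parameter values are illustrative):

```python
import random

random.seed(4)

def gibbs_bivariate_normal(rho=0.8, n_samples=20000, burn_in=1000):
    """Gibbs sampler for a standard bivariate normal with correlation rho:
    alternately draw each coordinate from its conditional given the other."""
    sd = (1 - rho ** 2) ** 0.5
    x, y = 0.0, 0.0
    samples = []
    for i in range(n_samples + burn_in):
        x = random.gauss(rho * y, sd)   # x | y ~ N(rho * y, 1 - rho^2)
        y = random.gauss(rho * x, sd)   # y | x ~ N(rho * x, 1 - rho^2)
        if i >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal()
mean_x = sum(x for x, _ in samples) / len(samples)
corr_xy = sum(x * y for x, y in samples) / len(samples)
print(round(mean_x, 2), round(corr_xy, 2))  # near 0 and near 0.8
```

Only the one-dimensional conditionals are ever sampled, which is why the same scheme scales to complicated models where the joint distribution cannot be sampled directly.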