This study explores the roles of statistics and data science in genome/omics research, including quality control of noisy high-throughput data, tests and estimation/inference, classification/clustering, and more.
Statistical Analyses of Life Science and Pathology from Genomic Perspective Unit of Statistical Genetics, Kyoto University Ryo Yamada ryamada@genome.med.kyoto-u.ac.jp
Roles of statistics/data science for genome/omics • Quality Control of Noisy High-Throughput Data • Tests, Estimation/Inference, Classification/Clustering • Multi-dimensional/High-dimensional Data • Random value-based approaches • Others : Experimental Designs
Quality Control of Noisy High-Throughput Data • Systematic errors/biases: samples, reagents, date/machine/personnel effects • How to correct or control the noise • Outlier detection • Transformation of all records with a function • Normalization for “locational effects” • “Control samples”
Transformation of all records with a function • Genomic control for GWAS • Preprocessing microarray data • Median-based correction • Log-transformation
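As an illustration of median-based correction combined with log-transformation, a minimal sketch in Python (the array values are made-up numbers, not real microarray data):

```python
import math

# Toy expression values from two hypothetical arrays.
array_a = [110.0, 220.0, 440.0, 880.0]
array_b = [55.0, 110.0, 220.0, 440.0]

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def log2_median_center(values):
    """Log-transform, then subtract the array median so arrays align."""
    logged = [math.log2(v) for v in values]
    m = median(logged)
    return [x - m for x in logged]

a = log2_median_center(array_a)
b = log2_median_center(array_b)
# After median-centering on the log scale, both arrays have median 0,
# so a constant multiplicative bias between arrays is removed.
```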
Normalization for “locational effects” • Overall tendencies should be considered. • Batch effects should be considered. • Non-data-driven • Data-driven
Tests, Estimation/Inference, Classification/Clustering • Tests • Significance, Error Controlling, Multiple-testing issue • Estimation/Inference • Interval, Models, Bayes • Classification/Clustering • Unsupervised Learning vs. Supervised Learning
Multiple Comparison P-value vs. Q-value
Multiple Comparison • Almost all hypotheses are NULL
Minimum p-value distribution • With 2^10 tests, the minimum p-value can occasionally take a value considerably larger than its mean, but in many cases it is smaller than the mean; such small values are not rare.
Minimum p-value distribution (figure: distributions of the minimum p-value for 1, 2, 4, 8, …, 10^6 tests)
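The behavior of the minimum p-value can be checked by simulation; a sketch assuming all hypotheses are null, so every p-value is uniform on [0, 1]:

```python
import random

random.seed(0)

def simulated_min_p(n_tests, n_sims=2000):
    """Average of the minimum p-value across n_tests null tests
    (null p-values are uniform on [0, 1])."""
    total = 0.0
    for _ in range(n_sims):
        total += min(random.random() for _ in range(n_tests))
    return total / n_sims

# Theoretical mean of the minimum of k uniforms is 1 / (k + 1):
# the more tests, the smaller the smallest p-value, even under the null.
for k in (1, 2, 4, 1024):
    print(k, round(simulated_min_p(k), 4), round(1 / (k + 1), 4))
```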
NON-NULL, FDR (False Discovery Rate) • Many hypotheses are NON-NULL, or Almost all hypotheses are NON-NULL
Combination of two distributions • Uniform p-values • Small p-values
Pick the smaller p-values. The threshold value changes with the rank of each p-value. The fraction of false discoveries among those picked is controlled.
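The rank-dependent threshold described above is the Benjamini-Hochberg procedure; a minimal pure-Python sketch (the p-values are made-up):

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return indices of hypotheses rejected at FDR level q.

    The cutoff grows with rank: reject up to the largest rank i with
    p_(i) <= (i / m) * q, so the expected fraction of false discoveries
    among the rejections is controlled at q.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            k = rank
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
print(benjamini_hochberg(pvals, q=0.05))  # → [0, 1]
```

Note that 0.039 survives a plain per-test cutoff of 0.05 but not the rank-3 threshold 3/8 × 0.05 = 0.01875, which is exactly how the changing threshold works.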
Large-scale inference • When you observe many things at once, their distribution is informative. • Estimates for each observation that use this information differ from estimates that do not. • The q-value of FDR is one such estimate. • Use the information in the distribution when many things are observed together • Empirical Bayes
Estimation/Inference • Models, Parameters, Interval, Bayes • Uniform p-values • Small p-values Assuming the mixture of two distributions; This is a model.
Estimation/Inference • Samples → Point estimates, Interval estimates • Sampling distribution, theoretical estimates, unbiased estimates, … • Frequentist: the statement “The star’s weight is between a and b” will be right 9 times out of 10.
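The frequentist coverage statement can be verified by simulation; a sketch assuming normal samples with a known standard deviation (all the numbers here are illustrative choices):

```python
import random

random.seed(1)

def coverage(n_trials=2000, n=30, mu=10.0, sigma=2.0):
    """Fraction of 90% normal-theory intervals that contain the true mean."""
    z = 1.645  # two-sided 90% quantile of the standard normal
    hits = 0
    for _ in range(n_trials):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        half = z * sigma / n ** 0.5
        if mean - half <= mu <= mean + half:
            hits += 1
    return hits / n_trials

print(coverage())  # close to 0.9, i.e. "right 9 times out of 10"
```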
Estimation/Inference • Frequentists vs. Bayesians • Frequentist approaches are difficult for students who are not good at mathematics, and their reasoning processes are not easy to follow. • Bayesian thinking processes, in contrast, tend to be easy for many to follow.
Estimation/Inference • Bayesian • A model has parameter(s) • Data + Model → Estimation of parameter values • Likelihood-based; maximum-likelihood estimates; interval estimates based on likelihood
Summary for Genotypes and Phenotypes • Data + Model ↓ Estimation of parameter values • To start your analysis • Record “Values” • “Values” take various shapes • “Simple value”: a Number • “Number” • Natural numbers, integers, rationals, reals, complex, vectors, matrices, … • There are “values” for analysis that are not “numbers” • Mathematical models • Biological phenomena have random errors • Stochastic models, statistical models • Models have parameters, so • “Values” for parameters are “numbers” again • “Simple values” are values of parameters in simple models. • Complex models and their parameters can also be values for your analysis.
Estimation/Inference • Frequentists vs. Bayesians • Use both rather than selecting one; that is the way in the 21st century • Bayesian approaches seem to be used more and more, because • Models have become more complicated • Computer assistance: complicated distributions can be handled by simulation • Large-scale data: empirical Bayes approaches can be applied
Estimation/Inference • Frequentists vs. Bayesians • A “prior” distribution is necessary • What is the “appropriate prior”?
Success rate: no information at all • Somebody you don’t know at all will take an exam about which you have no information at all. How likely do you think (s)he will pass it? • Jeffreys prior: one of the non-subjective priors
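A sketch of the Jeffreys prior for a success probability, which is Beta(1/2, 1/2); the exam counts used for the update are hypothetical:

```python
# With no data at all, the predictive probability of passing is the
# prior mean of the Jeffreys prior Beta(1/2, 1/2); observing outcomes
# for similar people updates the Beta posterior by simple addition.

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

a0, b0 = 0.5, 0.5                 # Jeffreys prior
print(beta_mean(a0, b0))          # 0.5: no information -> even odds

successes, failures = 7, 3        # hypothetical exam outcomes
print(beta_mean(a0 + successes, b0 + failures))  # 7.5 / 11 ≈ 0.68
```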
Estimation/Inference • Frequentists vs. Bayesians • Use both rather than selecting one; that is the way in the 21st century • Large-scale inference • The prior can be set based on the data set ~ empirical Bayes
Multi-dimensional/High-dimensional Data • There is no way to visualize high-dimensional data directly • It is almost impossible for us to understand high-dimensional data themselves
Multi-dimensional/High-dimensional Data • How many dimensions can we handle? • 2D space or 3D space • Extra dimensions • Gray/Color scale • Arrows • Time
Multi-dimensional/High-dimensional Data • Dimension reduction • Pick 2 or 3 seemingly important dims; then visualization is easy and we feel we can understand them. • Only a few dims are truly meaningful and all the others are noise. Pick the true dims. • LASSO, Compressed sensing
Multi-dimensional/High-dimensional Data • The space is high-dimensional but the data are low-dimensional • Manifold learning • Put the data into a higher-dimensional space and pull them back to a low-dimensional space.
High-dimensionality • Many genes, many biomarkers, many features
Multi-dimensional/High-dimensional Data • Life-science data are high-dimensional • The number of observed items is huge. • But the items are strongly mutually correlated, and their effective dimension is much smaller in reality. • FACS • Ethnic diversity
Multi-dimensional/High-dimensional Data • Objects with low dimensions in higher dimensional space • Topology • Graph, network and topology
Multi-dimensional/High-dimensional Data • Graph: itemize, then connect items that have a relation • Only pairwise relations are captured. • Trio-wise or higher-order relations are not.
Multi-dimensional/High-dimensional Data • Graph and its matrix representation and linear algebra • Graph tends to be sparse • … Sparse analysis
Multi-dimensional/High-dimensional Data • Two important features • No “common” individuals • Sparse
High-dimensionality • No commons • Central area: the ball inscribed in a square covers π/4 ≈ 0.785 of it in 2D, and the fraction shrinks rapidly as the dimension grows.
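The shrinking central volume can be checked by Monte Carlo; a sketch sampling uniform points in the unit cube and counting how many fall in the inscribed ball:

```python
import random

random.seed(2)

def inscribed_ball_fraction(dim, n_points=20000):
    """Fraction of uniform points in the unit cube that fall inside
    the inscribed ball (radius 0.5, centered at the cube's center)."""
    inside = 0
    for _ in range(n_points):
        r2 = sum((random.random() - 0.5) ** 2 for _ in range(dim))
        if r2 <= 0.25:
            inside += 1
    return inside / n_points

for d in (2, 3, 6, 10):
    print(d, inscribed_ball_fraction(d))
# dim 2 is close to pi/4 = 0.785; the fraction collapses toward 0
# as the dimension grows, so almost no point is "central".
```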
High-dimensionality • Sparse • To estimate density you need a reasonable number of samples per small cubic volume, but the fraction of space in a cell of side 0.1 is: • Dim = 1: 0.1 • Dim = 2: 0.01 • Dim = 3: 0.001 • … • Dim = 6: 0.000001
High-dimensionality • The space is quite spacious, yet the distribution is reasonably dense. • So the distribution must be low-dimensional.
Low dimensional distribution in higher dimensional space and its local density • Regular density estimation methods do not work. • Small cubes are still spacious in high-dimensional space. • How to estimate local density • k-nearest neighbor method • In graph theory a similar idea is applicable. • Minimum spanning tree
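A minimal 1-D sketch of the k-nearest-neighbor density idea (the data points are made-up): instead of fixing a cell size, fix k and let the volume adapt to the data.

```python
def knn_density(point, data, k=3):
    """k-nearest-neighbor density estimate in 1D:
    density ~ k / (n * volume of the ball reaching the k-th neighbor)."""
    n = len(data)
    dists = sorted(abs(x - point) for x in data)
    r = dists[k - 1]    # distance to the k-th nearest neighbor
    volume = 2 * r      # length of the interval [point - r, point + r]
    return k / (n * volume)

data = [0.0, 0.1, 0.2, 0.3, 5.0]
dense = knn_density(0.15, data)   # inside the tight cluster
sparse = knn_density(4.0, data)   # in the empty region
print(dense, sparse)
```

Because the ball expands until it captures k points, the estimate stays defined even where a fixed-size cube would be empty.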
Sparse in high dimensional space • How sparse? • One-dimensional manifolds • But significant variance
Clustering
Clustering methods: two types • Hierarchical • Non-hierarchical
Hierarchical • Tree structure --- a graph, again • Its structure carries information • Its structure is related to dimension • On the tree, a distance is defined. • Some phenomena have reasons to be analyzed hierarchically.
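To illustrate how a hierarchical tree is built by repeated merges, a toy single-linkage agglomerative clustering on made-up 1-D data (a sketch, not a production implementation):

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering with single linkage on 1-D data:
    repeatedly merge the two clusters whose closest members are closest."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between nearest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

print(single_linkage([0.0, 0.2, 0.3, 5.0, 5.1, 9.0], 3))
# → [[0.0, 0.2, 0.3], [5.0, 5.1], [9.0]]
```

The sequence of merges is exactly the tree structure: cutting the tree at different heights yields different numbers of clusters.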
Classification • Separate something difficult to segregate. J. Med. Imag. 1(3), 034501 (Oct 09, 2014). doi:10.1117/1.JMI.1.3.034501
Classification/Clustering • Unsupervised Learning • Supervised Learning • No teacher, but we want to check whether the classification criterion is reliable or not. • Cross-validation: one of the resampling methods
Small n Large p • Large n, Small p: sample size 100; test association between a trait and the expression of a single gene. n = 100, p = 1 • Small n, Large p: sample size 100; test association between a trait and the expression of MANY genes. n = 100, p = 25,000
n << p • One set of variables gives a perfect answer. • Another set of variables gives a perfect but different answer. • Which answer is the truth? • Closer fitting is not always the best. • AIC ~ a simpler model is better • LASSO, sparsity • The assumption that k << n variables should be the answer is a “prior” belief: Bayesian
Resampling • Estimation based on samples • Jack-knife (subsets), Bootstrap (sampling with replacement) • Statistical significance • Permutation ~ exact probability • Cross-validation • Pseudo-random generators from computers
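A permutation test sketch for the difference of two group means (the group values are made-up numbers): shuffling the pooled labels generates the null distribution directly from the data.

```python
import random

random.seed(3)

def permutation_test(group_a, group_b, n_perm=5000):
    """Two-sample permutation test on the absolute difference of means."""
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = group_a + group_b
    n_a = len(group_a)
    count = 0
    for _ in range(n_perm):
        random.shuffle(pooled)  # reassign group labels at random
        diff = abs(sum(pooled[:n_a]) / n_a -
                   sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            count += 1
    return count / n_perm

a = [5.1, 5.3, 5.0, 5.4, 5.2]
b = [6.0, 6.2, 6.1, 5.9, 6.3]
print(permutation_test(a, b))  # small p-value: the groups clearly differ
```

With small samples the full set of label permutations can be enumerated instead, which gives the exact probability mentioned above.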
Pseudo-random number sequences • From the uniform distribution • From other known distributions • From arbitrary distributions … Gibbs sampling • Using a Gibbs sampler, • based on a stochastic model, estimate the parameters of distributions and generate random values from the estimated distributions… • BUGS (Bayesian inference Using Gibbs Sampling)
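A minimal Gibbs sampler sketch for a standard bivariate normal with correlation rho, drawing each coordinate from its conditional distribution given the other (the parameter values are illustrative):

```python
import random

random.seed(4)

def gibbs_bivariate_normal(rho=0.8, n_samples=20000, burn_in=1000):
    """Gibbs sampler for a standard bivariate normal with correlation rho:
    alternately draw each coordinate from its conditional given the other."""
    sd = (1 - rho ** 2) ** 0.5
    x, y = 0.0, 0.0
    samples = []
    for i in range(n_samples + burn_in):
        x = random.gauss(rho * y, sd)   # x | y ~ N(rho * y, 1 - rho^2)
        y = random.gauss(rho * x, sd)   # y | x ~ N(rho * x, 1 - rho^2)
        if i >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal()
mean_x = sum(x for x, _ in samples) / len(samples)
corr_xy = sum(x * y for x, y in samples) / len(samples)
print(round(mean_x, 2), round(corr_xy, 2))  # near 0 and near 0.8
```

Only the one-dimensional conditionals are ever sampled, which is why the same scheme scales to complicated models where the joint distribution cannot be sampled directly.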