Peter J. Bickel Department of Statistics University of California at Berkeley, USA

Some examples of statistical inference in genomics
Joint work with Ben Brown, Haiyan Huang, Nancy Zhang, Nathan Boley, Jessica Li, and the ENCODE Consortium

Outline
• The ENCODE Project
• The first question: testing the hypothesis of lack of association between two features of the genome
a) Modeling issues
b) A minimal nonparametric model
c) Theory and practical applications of our nonparametric view
• The second question: determining the reliability of genomic features derived by different algorithms from ChIP-seq and other assays
a) The method is based on consistency of biological replicates, since ground truth is rarely, if ever, available
b) A curve, a copula model, and an analogue of the False Discovery Rate

The ENCODE Project
The ENCODE Project Consortium. "The ENCODE (ENCyclopedia Of DNA Elements) Project". Science 306 (5696), 22 October 2004.

The Genome Structural Correction
References for Part I
• Peter J. Bickel, Nathan Boley, James B. Brown, Haiyan Huang, and Nancy R. Zhang. “Subsampling methods for genomic inference”. Annals of Applied Statistics, Volume 4, Number 4 (2010), 1660-1697.
• The ENCODE Project Consortium. “Initial Analysis of the Encyclopedia of DNA Elements in the Human Genome”. Nature (2012), in press.
• Gerstein et al. “Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project”. Science, Vol. 330, no. 6012 (2010), 1775-1787.
• Birney E, et al. “Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project”. Nature 447 (2007), 799-816.
• Margulies EH, et al. “Analysis of deep mammalian sequence alignments and constraint predictions for 1% of the human genome”. Genome Research 17 (2007), 760-774.

Association of functional annotations in the Human Genome
• The ENCODE Consortium found that many Transcription Start Sites (TSSs) are anti-sense to GENCODE exons.
• They also found vastly more TSSs than previously supposed.
• Is the association between TSSs and exons in the anti-sense direction real, or is it experimental noise in TSS identification?
[Figure: GENCODE exons and Transcription Start Sites (TSSs) marked on the two strands, 5' to 3' and 3' to 5']

Association of experimental annotations across whole chromosomes
• Do two factors tend to bind together more closely or more often than other pairs of factors?
• Does a factor’s binding site position relative to TSSs tend to change across genomic regions?

Feature Overlap: the question
A mathematical question arises: do these features overlap more, or less, than “expected at random”?
[Figure: transcription fragments and conserved sequence marked along a 5' to 3' strand]

Our formulation
Defining “expectation” and “at random”:
• The genome is highly structured.
• Analysis of feature inter-dependence must account for superficial structure.
• “Expected at random” becomes: overlap between two feature sets bearing structure, under no biological constraints.

Naïve Method
• Treat bases as independent with the same distribution (ordinary bootstrap).
• Hypothesis: feature markings are independent.
• Specific object: a test based on % Feature Overlap – (% Feature 1)(% Feature 2) and standard statistics.
Why naïve? Bases are NOT independent.
Better method: keep one type of feature fixed and simulate moving the start sites of the other feature uniformly (feature bootstrap).
Why still a problem? Even if feature occurrences are functionally independent, there can be clumping caused by the complex underlying genome sequence structure (i.e. inhomogeneity, local sequence dependence).
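As a rough illustration of the two baseline approaches on this slide, the sketch below computes the naïve overlap statistic and a start-site-shuffling null distribution. The function names, the 0/1 list encoding of features, and the `(start, length)` interval representation are assumptions made for this example, not part of the original analysis; shuffled intervals are allowed to overlap one another, which is also a simplification.

```python
import random

def overlap_excess(f1, f2):
    """Fraction of bases in both features minus the product of the two
    marginal fractions; f1 and f2 are equal-length lists of 0/1 marks."""
    n = len(f1)
    both = sum(a * b for a, b in zip(f1, f2)) / n
    return both - (sum(f1) / n) * (sum(f2) / n)

def shuffle_start_sites(intervals, n, rng):
    """Re-place each (start, length) interval at a uniformly chosen start,
    keeping its length. Intervals may overlap each other (simplification)."""
    marks = [0] * n
    for _, length in intervals:
        start = rng.randrange(0, n - length + 1)
        for i in range(start, start + length):
            marks[i] = 1
    return marks

def shuffle_null(f1, intervals2, n, B, rng):
    """Keep feature 1 fixed, shuffle feature 2 start sites B times, and
    return the B values of the overlap statistic."""
    return [overlap_excess(f1, shuffle_start_sites(intervals2, n, rng))
            for _ in range(B)]
```

Comparing the observed `overlap_excess` with the values from `shuffle_null` gives the shuffling-based test; as the slide notes, this null ignores clumping induced by the underlying sequence structure.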

A nonparametric model
Requirements:
• It should roughly reflect known statistics of the genome.
• It should encompass the methods listed.
• It should be possible to do inference and tests, and to set confidence bounds, meaningfully.

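Segmented Stationary Model
Let $X_i$ = base at position $i$, $i = 1,\dots,n$, with segment boundaries $0 = \tau_0 < \tau_1 < \dots < \tau_r = n$ such that, for each $k = 1,\dots,r$, the segment $X_{\tau_{k-1}+1},\dots,X_{\tau_k}$ is:
• Stationary (homogeneity within blocks)
• Mixing (bases at distant positions are nearly independent)
• r << n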

Empirical Interpretations
Within a segment:
• For k small compared to the minimum segment length, statistics of random k-mers do not differ between large subsegments of the segment.
• Knowledge of the first k-mer does not help in predicting a distant k-mer.
Remark: If this model holds, it also applies to derived local features, e.g. {I1,…,In}, where Ik = 1 if position k belongs to a binding site for a given factor.

Using our model for inference
• Many genomic statistics are functions of one or more sums of the form $S = \sum_{i=1}^{n} I_i$, where, e.g., $I_i$ is 1 or 0 depending on the presence or absence of a feature or features at position $i$.
• When the summands are small compared to S: Gaussian case. Example: region overlap for common features, or rare features over large regions.
• Under segmented stationarity, these distributions can be estimated from the data.

Some theory
• Theorem 1: Segmented stationarity, exponential mixing, and fraction of short segments → 0 together imply asymptotic normality of linear statistics.
• Theorem 2: If the ordinary stationary bootstrap (Politis/Romano) is used under suitable conditions on L, and different stationary segments are present, then the asymptotic bootstrap distribution is heavier tailed than a Gaussian of the same variance.
• Theorem 3: If the true segmentation is estimated in an approximately consistent way, then, for approximately linear statistics, the resulting segmented bootstrap is consistent.
By the delta method, Gaussianity also holds for smooth functions of vectors of linear statistics, and the segmented bootstrap and the preceding theorems extend to them as well.

Distributions of feature overlaps
The Block Bootstrap
We can’t observe independent occurrences of ENCODE regions, but if our hypothesis of segmented stationarity holds, then the distribution of sum statistics and their functions can be approximated as follows.

Block Bootstrap for r = 1
Algorithm 4.1:
• 1) Given L << n, choose a number N uniformly at random from {1,...,n-L}.
• 2) Given the statistic $T_n(X_1,\dots,X_n)$, under the assumption that $X_1,\dots,X_n$ is stationary, compute $T_L^* = T_L(X_{N+1},\dots,X_{N+L})$.
• 3) Repeat B times to obtain $T_{L,1}^*,\dots,T_{L,B}^*$.
• 4) Estimate the distribution of the suitably normalized $T_n$ by the empirical distribution of the correspondingly normalized $T_{L,1}^*,\dots,T_{L,B}^*$.
By Theorem 4.2.1 of Politis, Romano and Wolf (1999), this is asymptotically valid.
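The steps of Algorithm 4.1 can be sketched as follows. This is a minimal illustration, assuming the sequence is held in a Python list and the statistic is any function of a block; the names `block_bootstrap` and `mean` are hypothetical, and the normalization in step 4 is left to the caller.

```python
import random

def mean(block):
    """Example statistic: fraction of positions marked 1 in the block."""
    return sum(block) / len(block)

def block_bootstrap(x, statistic, L, B, rng=None):
    """Return B values of `statistic`, each computed on one contiguous
    block of length L drawn from x (single stationary segment, r = 1)."""
    rng = rng or random.Random(0)
    n = len(x)
    if not 0 < L < n:
        raise ValueError("need 0 < L < n")
    reps = []
    for _ in range(B):
        N = rng.randint(1, n - L)   # uniform on {1,...,n-L}, both ends inclusive
        block = x[N:N + L]          # 0-based slice = X_{N+1},...,X_{N+L}
        reps.append(statistic(block))
    return reps
```

For example, `block_bootstrap(feature_marks, mean, L=1000, B=500)` would give 500 subsample values of the coverage fraction, whose empirical distribution (after normalization) stands in for that of the full-sequence statistic; `feature_marks` here is a placeholder for a 0/1 list.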

Block Bootstrap Animation, r = 1
• Observed sequence X; statistic S = f(X).
• Draw a block of length L from the original sequence; this is the block-bootstrapped sequence.
• Calculate the statistic on the block-bootstrapped sequence.
• Repeat this procedure identically B times.

Observing the distributions
[Figure: block bootstrap distribution of the Region Overlap Statistic, shown with the PDF of the normal distribution with the same mean and variance. The histogram of the block bootstrap statistic is approximately the same as this normal density.]
[Figure: QQ plot of the block bootstrap (BB) distribution vs. the standard normal]

What if r > 1?
• The estimated distribution is always heavier tailed, leading to conservative p-values.
• But it can be enormously so if the segment means of the statistic differ substantially.
• Less so, but still meaningfully, if the means agree but the variances differ.

Solutions
• Segment using biological knowledge. This was essentially done in ENCODE: poor segmentation occasionally led to non-Gaussian distributions (excessively conservative).
• Segment using a particular linear statistic which we expect to identify homogeneous segments.

Block Bootstrap given Segmentation
1. Draw a subsample of length L (the figure shows three blocks labelled f1L, f2L, f3L).
2. Compute the statistic on the subsample: T(X*).
3. Do this B times to obtain T(X1*),…,T(XB*).
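A sketch of the three steps above, under a stated assumption: the labels f1L, f2L, f3L are read here as block lengths proportional to each segment's share of the sequence, so that one contiguous block is drawn per segment and the blocks are concatenated. That reading, the function name, and the 0-based `boundaries` convention are assumptions of this example rather than details confirmed by the slide.

```python
import random

def segmented_block_bootstrap(x, boundaries, statistic, L, B, rng=None):
    """boundaries: 0-based segment end points [t_1, ..., t_r], t_r == len(x).
    Each replicate concatenates one block per segment, with block length
    about f_k * L where f_k = (segment length) / n (assumed reading)."""
    rng = rng or random.Random(0)
    n = len(x)
    starts = [0] + list(boundaries[:-1])
    reps = []
    for _ in range(B):
        sample = []
        for lo, hi in zip(starts, boundaries):
            seg_len = hi - lo
            blk = max(1, round(L * seg_len / n))  # f_k * L, at least one base
            blk = min(blk, seg_len)               # never longer than the segment
            s = rng.randint(lo, hi - blk)         # block start within the segment
            sample.extend(x[s:s + blk])
        reps.append(statistic(sample))
    return reps
```

With two segments whose means differ sharply, for instance `x = [0]*50 + [1]*50` and `boundaries = [50, 100]`, every replicate contains both segments in the right proportions, so the spread caused purely by which segment a block happened to land in disappears.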

[Figure panels: True distribution; Uniform Start Site Shuffling; Block Bootstrap without Segmentation; Block Bootstrap with True Segmentation; Block Bootstrap with Estimated Segmentation]

Testing Association
Question: How do we estimate the null distribution given only data for which we believe the null is false?

Testing Association (bp overlap)
[Figure: observed sequence with Feature 1 and Feature 2 marked]
1. Sample two blocks of equal length.
2. Align Feature 1 of the first block with Feature 2 of the second block, and vice versa.
3. Calculate the overlap in the blocks after swapping = (X2)(Y1)+(X1)(Y2).
Statistic: (X2)(Y1)+(X1)(Y2), properly normalized and set to mean 0. Under the null hypothesis of independence, this should be Gaussian.
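The swap step can be sketched as below, reading X as the Feature 1 marks and Y as the Feature 2 marks, with the subscript giving the block. The function name and the requirement that the two blocks be disjoint are assumptions of this example; the normalization and centering mentioned on the slide are not included.

```python
import random

def swapped_overlap(f1, f2, L, rng):
    """Sample two disjoint blocks of length L and return the raw swapped
    overlap (X2)(Y1) + (X1)(Y2); f1, f2 are equal-length 0/1 lists."""
    n = len(f1)
    if n < 2 * L:
        raise ValueError("need n >= 2L to fit two disjoint blocks")
    while True:
        a = rng.randint(0, n - L)
        b = rng.randint(0, n - L)
        if abs(a - b) >= L:        # assumed: blocks must not overlap
            break
    x1, y1 = f1[a:a + L], f2[a:a + L]   # features 1 and 2 in the first block
    x2, y2 = f1[b:b + L], f2[b:b + L]   # features 1 and 2 in the second block
    cross_21 = sum(p * q for p, q in zip(x2, y1))
    cross_12 = sum(p * q for p, q in zip(x1, y2))
    return cross_21 + cross_12
```

Repeating this draw many times gives overlap values in which Feature 1 and Feature 2 come from different places on the sequence, which is one way to approximate the null of no association using data where the null may be false.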

Correlating DNA copy number variation with genomic content
• Redon et al. (2007) claimed that copy number variant regions (CNVs) are significantly (p < 0.05 with multiple testing correction) negatively correlated with coding regions, i.e. have less overlap than expected at random.
• Their analysis is based on random shufflings of start positions.
• Our analysis suggests that the effect is probably an artifact.

Issues
• Choosing a method of segmentation, e.g. dyadic, and its tuning parameters
• Choosing the block size for the bootstrap using:
  • stability
  • segmentation

Test Statistic
H: Features not associated in each segment (so-called “dummy overlap”).
Then the suitably normalized overlap statistic has a Gaussian distribution.
We form the test statistic [formula not preserved in this transcript] from the following per-segment quantities:
• length of segment i / n
• % of basepairs in segment i identified as Feature 1
• % of basepairs in segment i identified as Feature 2

Measuring reproducibility of high-throughput experiments
Qunhua Li, James B. Brown, Haiyan Huang, and Peter J. Bickel. Annals of Applied Statistics, Volume 5, Number 3 (2011), 1752-1779.

A consistency measure

Our method of fitting

Nancy Zhang Ben Brown Qunhua Li Nathan Boley Jessica Li Anshul Kundaje Haiyan Huang Peter Bickel
