
Peter J. Bickel Department of Statistics University of California at Berkeley, USA



Presentation Transcript


  1. Peter J. Bickel, Department of Statistics, University of California at Berkeley, USA. Some examples of statistical inference in genomics. Joint work with Ben Brown, Haiyan Huang, Nancy Zhang, Nathan Boley, Jessica Li, and the ENCODE Consortium.

  2. Outline. The ENCODE Project. The first question: testing the hypothesis of lack of association between two features of the genome: a) modeling issues; b) a minimal nonparametric model; c) theory and practical applications of our nonparametric view. The second question: determining the reliability of genomic features derived by different algorithms from ChIP-seq and other assays: a) the method is based on consistency of biological replicates, since ground truth is rarely, if ever, available; b) a curve, a copula model, and an analogue of the False Discovery Rate.

  3. The ENCODE Project. The ENCODE Project Consortium. “The ENCODE (ENCyclopedia Of DNA Elements) Project”. Science 306 (5696): 636-640 (22 October 2004).

  4. The Genome Structural Correction. References for Part I: Peter J. Bickel, Nathan Boley, James B. Brown, Haiyan Huang, and Nancy R. Zhang. “Subsampling methods for genomic inference”. Annals of Applied Statistics, Volume 4, Number 4 (2010), 1660-1697. The ENCODE Project Consortium. “Initial Analysis of the Encyclopedia of DNA Elements in the Human Genome”. 2012. Nature, in press. Gerstein et al. “Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project”. Science (2010), Vol. 330, no. 6012, pp. 1775-1787. Birney E et al. (2007). “Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project”. Nature. 447: 799-816. Margulies EM, et al. (2007). “Analysis of deep mammalian sequence alignments and constraint predictions for 1% of the human genome”. Genome Research. 17: 760-774.

  5. Association of functional annotations in the Human Genome. The ENCODE Consortium found that many Transcription Start Sites (TSSs) are anti-sense to GENCODE exons. They also found vastly more TSSs than previously supposed. Is the association between TSSs and exons in the anti-sense direction real, or is it experimental noise in TSS identification? [Figure: GENCODE exons and TSSs drawn on the two strands, 5'→3' and 3'→5'.]

  6. Association of experimental annotations across whole chromosomes. Do two factors tend to bind together more closely or more often than other pairs of factors? Does the position of a factor’s binding sites relative to TSSs tend to change across genomic regions?

  7. Feature Overlap: the question. A mathematical question arises: do these features overlap more, or less, than “expected at random”? [Figure: transcription fragments and conserved sequence along a 5'→3' strand.]

  8. Our formulation. Defining “expectation” and “at random”: the genome is highly structured, so analysis of feature inter-dependence must account for superficial structure. “Expected at random” becomes: the overlap between two feature sets bearing structure, under no biological constraints.

  9. Naïve Method. Treat bases as independent with the same distribution (ordinary bootstrap). Hypothesis: feature markings are independent. Specific object: a test based on % Feature Overlap - (% Feature1)(% Feature2) and standard statistics. Why naïve? Bases are NOT independent. Better method: keep one type of feature fixed and simulate moving the start sites of the other feature uniformly (feature bootstrap). Why is this still a problem? Even if feature occurrences are functionally independent, there can be clumping caused by the complex underlying genome sequence structure (i.e. inhomogeneity, local sequence dependence).
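
As a minimal sketch of the naïve statistic on this slide, the snippet below computes % Feature Overlap - (% Feature1)(% Feature2) on two toy 0/1 tracks. The function name `naive_overlap_statistic` and the toy arrays are illustrative assumptions, not part of the talk.

```python
import numpy as np

def naive_overlap_statistic(f1, f2):
    """Observed fraction of overlap minus the product of the two coverages."""
    f1 = np.asarray(f1, dtype=bool)
    f2 = np.asarray(f2, dtype=bool)
    overlap = np.mean(f1 & f2)            # fraction of bases covered by both features
    return overlap - f1.mean() * f2.mean()

# Toy tracks (hypothetical data): each feature covers 4 of 10 bases, sharing 2.
f1 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
f2 = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0])
stat = naive_overlap_statistic(f1, f2)    # 0.2 - 0.4 * 0.4
```

A positive value says the tracks overlap more than independent bases would; the slide's point is that the reference distribution for this number is wrong when bases are dependent.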

  10. A nonparametric model. Requirements: it should roughly reflect known statistics of the genome; it should encompass the methods listed; it should be possible to do inference and tests, and to set confidence bounds, meaningfully.

  11. Segmented Stationary Model. Let Xi = base at position i, i = 1,…,n, such that for each k = 1,…,r, the k-th segment of the sequence is: • stationary (homogeneity within blocks) • mixing (bases at distant positions are nearly independent), with • r << n.
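
The segment notation did not survive in the transcript. One way to write the model, assuming breakpoints named \tau_k (a labeling chosen here for illustration, not taken from the slide), is:

```latex
% Assumed notation: the breakpoints \tau_0, ..., \tau_r are illustrative names.
0 = \tau_0 < \tau_1 < \dots < \tau_r = n, \qquad
\mathbf{X}^{(k)} = \bigl(X_{\tau_{k-1}+1}, \dots, X_{\tau_k}\bigr), \quad k = 1, \dots, r,
\qquad r \ll n,
```

with each segment \mathbf{X}^{(k)} stationary and mixing.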

  12. Empirical Interpretations. Within a segment: for k small compared to the minimum segment length, statistics of random k-mers do not differ between large subsegments of the segment, and knowledge of the first k-mer does not help in predicting a distant k-mer. Remark: if this model holds, it also applies to derived local features, e.g. {I1,…,In}, where Ik = 1 if position k belongs to a binding site for a given factor.

  13. Using our model for inference. Many genomic statistics are functions of one or more sums S, taken over positions, whose summands are, e.g., 1 or 0 depending on the presence or absence of a feature or features. When the summands are small compared to S: Gaussian case. Example: region overlap for common features, or rare features over large regions. Under segmented stationarity, these distributions can be estimated from the data.
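
The formula for the sums is missing from the transcript. A plausible reading, with f_i, I_i and J_i as assumed names for the summands and for the two feature indicators, is:

```latex
% Assumed notation: f_i depends only on bases near position i;
% I_i and J_i are 0/1 indicators of Feature 1 and Feature 2 at position i.
S = \sum_{i=1}^{n} f_i, \qquad
\text{e.g. base-pair overlap: } f_i = I_i J_i, \quad S = \sum_{i=1}^{n} I_i J_i .
```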

  14. Some theory. Theorem 1: Segmented stationarity, exponential mixing, and fraction of short segments → 0 together imply asymptotic normality of linear statistics. Theorem 2: If the ordinary stationary bootstrap (Politis/Romano) is used under suitable conditions on L, and different stationary segments are present, then the asymptotic bootstrap distribution is heavier tailed than a Gaussian of the same variance. Theorem 3: If the true segmentation is estimated in an approximately consistent way, then, for approximately linear statistics, the resulting segmented bootstrap is consistent. By the delta method, Gaussianity holds for smooth functions of vectors of linear statistics, and the segmented bootstrap and the previous theorems extend to them as well.

  15. Distributions of feature overlaps: the Block Bootstrap. We can’t observe independent occurrences of ENCODE regions, but if our hypothesis of segmented stationarity holds, then the distribution of sum statistics and their functions can be approximated as follows.

  16. Block Bootstrap for r = 1. Algorithm 4.1: 1) Given L << n, choose a number N uniformly at random from {1,...,n-L}. 2) Given the statistic Tn(X1,…,Xn), under the assumption that X1,…,Xn is stationary, compute the statistic on the block XN+1,…,XN+L. 3) Repeat B times to obtain B block values of the statistic. 4) Estimate the distribution of the statistic by the empirical distribution of these B values, suitably normalized. By Theorem 4.2.1 of Politis, Romano and Wolf (1999), this is asymptotically okay.
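
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `block_bootstrap`, the toy i.i.d. track, and the sqrt(L) normalization in the last line are all choices made here.

```python
import numpy as np

def block_bootstrap(x, statistic, L, B, rng):
    """Return B values of `statistic`, each computed on a random block of length L."""
    x = np.asarray(x)
    n = len(x)
    if not 0 < L < n:
        raise ValueError("need 0 < L < n")
    # Step 1: offsets N drawn uniformly from {1, ..., n-L}.
    starts = rng.integers(1, n - L + 1, size=B)
    # Steps 2-3: the 0-based slice x[N:N+L] is the block X_{N+1}, ..., X_{N+L}.
    return np.array([statistic(x[s:s + L]) for s in starts])

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10_000)   # toy 0/1 track standing in for a feature indicator
L = 500
t_n = x.mean()                        # statistic on the full sequence
t_blocks = block_bootstrap(x, np.mean, L=L, B=200, rng=rng)
# Step 4 (assumed normalization): sqrt(L) * (T_L - T_n) stands in for sqrt(n) * (T_n - E T_n).
z = np.sqrt(L) * (t_blocks - t_n)
```

Quantiles of `z` can then be compared with a normal distribution of matching variance, as the next slides do for the region overlap statistic.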

  17. Block Bootstrap Animation, r = 1. Statistic: S = f(X), computed on the observed sequence X. Draw a block of length L from the original sequence; this is the block-bootstrapped sequence. Calculate the statistic on the block-bootstrapped sequence. Repeat this procedure identically B times.

  18. Observing the distributions. Block bootstrap distribution of the Region Overlap Statistic, shown here with the PDF of the normal distribution with the same mean and variance: the histogram of the block bootstrap values is approximately the same as the normal density. [Figure: QQ-plot of the block bootstrap (BB) distribution vs. the standard normal.]

  19. What if r > 1? The estimated distribution is always heavier tailed, leading to conservative p-values. But it can be enormously so if the segment means of the statistic differ substantially; less so, but still meaningfully, if the means agree but the variances differ.
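
A rough calculation suggests why; it is a sketch under simplifying assumptions, not the slides' proof. Take T to be a sample mean, ignore blocks that straddle segment boundaries and dependence across boundaries, and let segment k have length fraction \lambda_k, mean \mu_k and long-run variance \sigma_k^2 (all assumed notation). A block drawn uniformly from the whole sequence lands in segment k with probability about \lambda_k, so the unsegmented block bootstrap sees a mixture:

```latex
% Assumed notation: \lambda_k = n_k / n, \bar\mu = \sum_k \lambda_k \mu_k.
\operatorname{Var}\bigl(\sqrt{n}\,T_n\bigr) \approx \sum_{k=1}^{r} \lambda_k \sigma_k^2,
\qquad
\operatorname{Var}^{*}\bigl(\sqrt{L}\,T_L\bigr) \approx \sum_{k=1}^{r} \lambda_k \sigma_k^2
  + L \sum_{k=1}^{r} \lambda_k \bigl(\mu_k - \bar\mu\bigr)^2 .
```

The extra term grows with L when the segment means differ. When the means agree but the variances differ, the bootstrap distribution is a scale mixture of normals, which has heavier tails than a single Gaussian of the same variance.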

  20. Solutions. Segment using biological knowledge. This was essentially done in ENCODE: poor segmentation occasionally led to non-Gaussian distributions (excessively conservative). Alternatively, segment using a particular linear statistic which we expect to identify homogeneous segments.

  21. Block Bootstrap given Segmentation. 1. Draw a subsample of length L (in the figure, sub-blocks of lengths f1L, f2L and f3L). 2. Compute the statistic on the subsample: T(X*). 3. Do this B times to obtain T(X1*),…,T(XB*).
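
A minimal sketch of this procedure, assuming that each segment k contributes one contiguous sub-block of length about f_k L, with f_k the segment's share of the sequence. That reading of the figure labels, the function name `segmented_block_bootstrap`, and the toy data are assumptions made here.

```python
import numpy as np

def segmented_block_bootstrap(x, breakpoints, statistic, L, B, rng):
    """B values of `statistic` on subsamples built from one sub-block per segment.

    `breakpoints` are 0-based boundaries [0, t1, ..., n]. Segment k contributes
    a contiguous sub-block of length round(f_k * L), where f_k = n_k / n.
    """
    x = np.asarray(x)
    n = len(x)
    bounds = list(zip(breakpoints[:-1], breakpoints[1:]))
    values = []
    for _ in range(B):
        pieces = []
        for lo, hi in bounds:
            n_k = hi - lo
            l_k = max(1, min(n_k, round(n_k / n * L)))   # sub-block length f_k * L
            start = rng.integers(lo, hi - l_k + 1)       # sub-block stays inside the segment
            pieces.append(x[start:start + l_k])
        values.append(statistic(np.concatenate(pieces)))
    return np.array(values)

# Toy sequence (hypothetical): two segments with very different means.
x = np.concatenate([np.zeros(600), np.ones(400)])
rng = np.random.default_rng(0)
vals = segmented_block_bootstrap(x, [0, 600, 1000], np.mean, L=100, B=50, rng=rng)
```

In this toy case every subsample has 60 zeros and 40 ones, so the between-segment difference in means no longer inflates the bootstrap spread, whereas an unsegmented block of length 100 would have a mean anywhere between 0 and 1.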

  22. [Figure panels: true distribution; uniform start site shuffling; block bootstrap without segmentation; block bootstrap with true segmentation; block bootstrap with estimated segmentation.]

  23. Testing Association. Question: how do we estimate the null distribution given only data for which we believe the null is false?

  24. Testing Association (bp overlap). Start from the observed sequence, marked with Feature 1 and Feature 2. Sample two blocks of equal length. Align Feature 1 of the first block with Feature 2 of the second block, and vice versa. Calculate the overlap in the blocks after swapping = (X2)(Y1)+(X1)(Y2). The statistic is (X2)(Y1)+(X1)(Y2), properly normalized and set to mean 0. Under the null hypothesis of independence, this should be Gaussian.
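
The swap can be sketched as below. Several things here are assumptions, not taken from the slide: X denotes the Feature 1 indicator and Y the Feature 2 indicator, the subscript is the block, the division by 2L is one possible normalization, and the function name `swapped_overlap` is hypothetical.

```python
import numpy as np

def swapped_overlap(f1, f2, L, rng):
    """Overlap after pairing Feature 1 of one block with Feature 2 of the other."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    n = len(f1)
    if len(f2) != n or not 0 < L <= n:
        raise ValueError("tracks must have equal length and 0 < L <= n")
    s1, s2 = rng.integers(0, n - L + 1, size=2)    # starts of the two blocks
    x1, y1 = f1[s1:s1 + L], f2[s1:s1 + L]          # block 1: Feature 1, Feature 2
    x2, y2 = f1[s2:s2 + L], f2[s2:s2 + L]          # block 2: Feature 1, Feature 2
    # (X2)(Y1) + (X1)(Y2), divided by 2L (assumed normalization).
    return (x2 @ y1 + x1 @ y2) / (2 * L)

rng = np.random.default_rng(0)
f1 = rng.integers(0, 2, size=5_000)                # toy independent 0/1 tracks
f2 = rng.integers(0, 2, size=5_000)
null_draws = np.array([swapped_overlap(f1, f2, L=200, rng=rng) for _ in range(100)])
```

Because each feature is paired with the other feature from a different block, local association between the two tracks is broken while each track keeps its own within-block structure. The sketch does not force the two blocks to be far apart, which a careful implementation might want.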

  25. Correlating DNA copy number variation with genomic content. Redon et al (2007) claimed that copy number variant regions (CNVs) are significantly (p < 0.05 with multiple testing correction) negatively correlated with coding regions, i.e. that they overlap coding regions less than expected at random. Their analysis is based on random shufflings of start positions. Our analysis suggests that the effect is probably an artifact.

  26. Issues. • Choosing a method of segmentation, e.g. dyadic, and its tuning parameters • Choosing the block size for the bootstrap, using: stability; segmentation

  27. Test Statistic. H: features are not associated in each segment (so-called “dummy overlap”). Under H, the overlap statistic has a Gaussian distribution. We form the test statistic from the following per-segment quantities: the length of segment i divided by n; the % of basepairs in segment i identified as Feature 1; the % of basepairs in segment i identified as Feature 2.

  28. Measuring reproducibility of high-throughput experiments. Qunhua Li, James B. Brown, Haiyan Huang, and Peter J. Bickel. Annals of Applied Statistics, Volume 5, Number 3 (2011), 1752-1779.

  29. A consistency measure

  30. Our method of fitting

  31. Nancy Zhang, Ben Brown, Qunhua Li, Nathan Boley, Jessica Li, Anshul Kundaje, Haiyan Huang, Peter Bickel
