Refined Non-Parametric Methods for Genomic Inference
Peter J. Bickel, Department of Statistics, University of California at Berkeley, USA
Joint work with Nancy R. Zhang (Stanford), James B. Brown (UCB) and Haiyan Huang (UCB)
Motivating Questions
Association of functional annotations in the Human Genome
• The ENCODE Consortium found that many Transcription Start Sites (TSSs) are anti-sense to GENCODE exons
• They also found vastly more TSSs than previously supposed
• Is the association between TSSs and exons in the anti-sense direction real, or experimental noise in TSS identification?
[Figure: GENCODE exons and Transcription Start Sites (TSSs) on opposite strands, 5' to 3' and 3' to 5']
Association of experimental annotations across whole chromosomes
• Do two factors tend to bind together more closely or more often than other pairs of factors?
• Does a factor's binding site relative to TSSs tend to change across genomic regions?
The statistical relation of Transcription Start Sites and protein binding sites
• Normalized ChIP-chip signals around GENCODE TSSs in ENCODE regions
• Most peak over the TSS and are clearly significant
• Does the upstream bump in CTCF constitute good evidence of enhancer binding activity?
[Figure from ENCODE Consortium paper, Nature, June 14th, 2007: normalized signal intensity around TSSs; possible enhancer activity upstream]
What is a non-parametric model for the Genome and why is it needed?
Feature Overlap: the question
• A mathematical question arises: do these features overlap more, or less, than "expected at random"?
[Figure: transcription fragments and conserved sequence along a 5' to 3' strand]
Our formulation • Defining “expectation” and “at random”: • The genome is highly structured • Analysis of feature inter-dependence must account for superficial structure • “Expected at random” becomes: • Overlap between two feature sets bearing structure, under no biological constraints
Naïve Method
• Treat bases as independent with the same distribution (ordinary bootstrap)
• Hypothesis: feature markings are independent
• Specific Object Test based on % Feature Overlap – (% Feature 1)(% Feature 2) and standard statistics
• Why naïve? Bases are NOT independent
• Better method: keep one type of feature fixed and simulate by moving the start sites of the other feature uniformly (feature bootstrap)
• Why still a problem? Even if feature occurrences are functionally independent, there can be clumping caused by the complex underlying genome sequence structure (i.e. inhomogeneity, local sequence dependence)
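For concreteness, here is a minimal Python sketch of the specific-object statistic used by the naïve method: % joint overlap minus the product of the two marginal feature frequencies. The function name and the 0/1-track data layout are my assumptions, not the authors' code.

```python
import numpy as np

def naive_overlap_statistic(f1, f2):
    """Naive specific-object statistic: % joint overlap minus the
    product of the two marginal feature frequencies.

    f1, f2 : 0/1 arrays marking Feature 1 and Feature 2 at each base.
    """
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    return np.mean(f1 * f2) - np.mean(f1) * np.mean(f2)
```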
A non parametric model Requirements: • It should roughly reflect known statistics of the genome • It should encompass the methods listed • It should be possible to do inference, run tests, and set confidence bounds meaningfully
Segmented Stationary Model
Let Xi = base at position i, i = 1,…,n, with segment boundaries 0 = t0 < t1 < … < tr = n, such that for each k = 1,…,r the subsequence Xtk-1+1,…,Xtk is:
• Stationary (homogeneity within blocks)
• Mixing (bases at distant positions are nearly independent)
• r << n
Empirical Interpretations
• Within a segment:
• For k small compared to the minimum segment length, statistics of random k-mers do not differ between large subsegments of the segment
• Knowledge of the first k-mer does not help in predicting a distant k-mer
• Remark: if this model holds, it also applies to derived local features, e.g. {I1,…,In} where Ik = 1 if position k belongs to a binding site for a given factor
The other models mentioned are special cases with r = 1
• Independent identically distributed (bootstrap)
• Stationary Markov
• Uniform displacement of start sites (homogeneous Poisson process)
Is the Effect Serious?
Example statistic: overlap between two features in a binary sequence of 10K bases (the region statistic in the ENCODE studies)
• Feature 1: occurrence of the motif 111000; Feature 2: more than six 1's in 10 consecutive bases
• Ordinary bootstrap: base-by-base sampling randomly from the observed sequence for the two features separately
• Feature randomization: keep one type of feature fixed and randomize the start positions of the other
Results:
• True distribution: Mean = 5.23, SD = 0.53
• Ordinary bootstrap: Mean = 4.83, SD = 0.26
• Feature randomization: Mean = 6.19, SD = 0.81
• Block bootstrap: Mean = 4.81, SD = 0.55
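A hedged sketch of how the two toy feature tracks in this simulation might be computed from a binary sequence; the function name and defaults are illustrative assumptions, not code from the study.

```python
import numpy as np

def feature_tracks(x, motif=(1, 1, 1, 0, 0, 0), window=10, min_ones=7):
    """Toy feature tracks for a 0/1 sequence x (illustrative only).

    Feature 1 marks positions where the motif 111000 starts.
    Feature 2 marks windows of 10 consecutive bases containing more
    than six 1's (i.e. at least min_ones of them).
    """
    x = np.asarray(x, dtype=int)
    n, m = len(x), len(motif)
    f1 = np.zeros(n, dtype=int)
    for i in range(n - m + 1):
        if tuple(x[i:i + m]) == tuple(motif):
            f1[i] = 1
    f2 = np.zeros(n, dtype=int)
    for i in range(n - window + 1):
        if x[i:i + window].sum() >= min_ones:
            f2[i] = 1
    return f1, f2
```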
Evidence for Segmented Stationarity
• DNA sequence is known to be inhomogeneous
• However, it has been segmented into homogeneous domains based on:
• Base composition (e.g. finding isochores)
• CpG density
• Density of higher-order features (e.g. ORFs, palindromes, TFBSs)
• Our model aims to capture these "domain-specific" effects, while avoiding parametric assumptions within domains
[Figure from Li, 2001]
References: Elton (1974, J. Theoretical Bio.), Braun and Müller (1998, Statistical Science), Li et al. (1998, Genome Res.), Liu and Lawrence (1999, Bioinformatics)
Inference with our model • Use X1,…,Xn as the basic data; Xk could be a base identity, a feature identity, or a vector of feature identities obeying the segmented stationarity assumption
Using our model for inference
Many genomic statistics are functions of one or more sums of the form S = f(X1) + … + f(Xn), where e.g. f(Xi) is 1 or 0 depending on the presence or absence of a feature or features.
When the summands are small compared to S, we are in the Gaussian case. Example: region overlap for common features, or rare features over large regions.
Under segmented stationarity, these distributions can be estimated from the data.
Distributions of feature overlaps • The Block Bootstrap • Can’t observe independent occurrences of ENCODE regions, but if our hypothesis of segmented stationarity holds then the distribution of sum statistics and their functions can be approximated as follows
Block Bootstrap for r = 1
Algorithm 4.1:
• Given L << n, choose a number N uniformly at random from {1,…,n − L + 1}
• Given the statistic Tn(X1,…,Xn), under the assumption that X1,…,Xn is stationary, compute the same statistic on the block of length L starting at N: TL(XN,…,XN+L−1)
• Repeat B times to obtain TL,1*,…,TL,B*
• Estimate the distribution of Tn by the empirical distribution of TL,1*,…,TL,B* (centered and scaled as appropriate)
Justified by Theorem 4.2.1 of Politis, Romano and Wolf (1999)
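A minimal Python sketch of Algorithm 4.1 as described above (r = 1). The function signature, and the choice to return the raw block-level values and leave any centering and scaling to the caller, are my assumptions.

```python
import numpy as np

def block_bootstrap_r1(x, statistic, L, B=1000, rng=None):
    """Sketch of Algorithm 4.1 for r = 1.

    x         : observed sequence (array-like), assumed stationary
    statistic : function mapping a sequence to a number (e.g. % overlap)
    L         : block length, L << n
    B         : number of bootstrap replicates

    Returns the B block-level values of the statistic; Algorithm 4.1
    estimates the distribution of the full-sequence statistic from
    their empirical distribution (centered and scaled as appropriate).
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    n = len(x)
    # 0-based equivalent of drawing N uniformly from {1, ..., n - L + 1}
    starts = rng.integers(0, n - L + 1, size=B)
    return np.array([statistic(x[s:s + L]) for s in starts])
```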
Block Bootstrap Animation, r = 1
Statistic: S = f(X), computed on the observed sequence X
• Draw a block of length L from the original sequence; this is the block-bootstrapped sequence
• Calculate the statistic on the block-bootstrapped sequence
• Repeat this procedure identically B times
Observing the distributions
[Figure: block bootstrap distribution of the Region Overlap Statistic, shown with the PDF of the normal distribution with the same mean and variance; the histogram of the block bootstrap values is approximately the same as that density. QQ plot of the block bootstrap distribution vs. the standard normal.]
What if r > 1?
• If segmentation is ignored, the estimated distribution is always heavier-tailed, leading to conservative p-values
• But it can be enormously so if the segment means of the statistic differ substantially
• Less so, but still meaningful, if the means agree but the variances differ
Simulation Study • For simplicity, we concatenate 2 homogeneous regions generated as above
Simulation results and comparison to a naïve method
[Figure: distributions compared: true distribution, uniform start-site shuffling, block bootstrap without segmentation, block bootstrap with true segmentation, block bootstrap with estimated segmentation]
Solutions • Segment using biological knowledge • Essentially done in ENCODE: poor segmentation occasionally led to non-Gaussian distributions (excessively conservative) • Segment using a particular linear statistic which we expect to identify homogeneous segments
Block Bootstrap with Segmentation • Draw a block from each sub-segment and concatenate to form a block bootstrap sample
Block Bootstrap given Segmentation
1. Draw a subsample of length L: one block from each segment, of lengths f1L, f2L, f3L (fi = fraction of the sequence in segment i), concatenated
2. Compute the statistic on the subsample: T(X*)
3. Do this B times: T(X1*),…,T(XB*)
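A sketch of the segmented version just described: one block is drawn from each segment, with length proportional to the segment's share of the sequence, and the blocks are concatenated before the statistic is recomputed. The `boundaries` argument and the rounding of block lengths are my assumptions.

```python
import numpy as np

def segmented_block_bootstrap(x, boundaries, statistic, L, B=1000, rng=None):
    """Block bootstrap given a segmentation (r > 1), as a sketch.

    boundaries : segment endpoints 0 = t0 < t1 < ... < tr = n
    Each replicate concatenates one block per segment, with block
    length approximately f_i * L where f_i is the segment's fraction
    of the whole sequence.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    n = len(x)
    out = np.empty(B)
    for b in range(B):
        pieces = []
        for lo, hi in zip(boundaries[:-1], boundaries[1:]):
            lb = max(1, int(round(L * (hi - lo) / n)))  # ~ f_i * L
            lb = min(lb, hi - lo)                       # block must fit
            s = rng.integers(lo, hi - lb + 1)
            pieces.append(x[s:s + lb])
        out[b] = statistic(np.concatenate(pieces))
    return out
```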
Simulation results, with segmentation
[Figure: distributions compared: true distribution, uniform start-site shuffling, block bootstrap without segmentation, block bootstrap with true segmentation, block bootstrap with estimated segmentation]
Dyadic Segmentation
• Define M(j), a scaled difference between the means of the chosen statistic to the left and right of position j
• Find jmax maximizing M(j), creating intervals Ileft and Iright
• If the length of both intervals falls below a stopping criterion, stop
• Else, repeat the process for Ileft and/or Iright, whichever are longer than the stopping criterion, with M(j) redefined on the new interval
Dyadic Segmentation (illustration)
• Start from the statistic as a function of position
• The first cut maximizes the difference between the means in the new segments (the change in mean of the statistic)
• All subsequent cuts are greedy, making maximal splits
• The mean is recomputed in each segment, so long as the segment is longer than a set threshold
• When no new cuts exist, the segmentation is complete
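A simplified Python sketch of the greedy dyadic segmentation just described. It uses a CUSUM-like scaled mean difference as the split criterion and a plain threshold stop; the full Algorithm 4.8 (given later) standardizes the criterion by a variance estimate, so treat this as illustrative only.

```python
import numpy as np

def dyadic_segmentation(y, min_len, threshold):
    """Greedy recursive segmentation of a statistic y by position:
    cut each segment at the point maximizing the scaled difference
    between the means to its left and right, stopping when segments
    are too short or the best criterion falls below `threshold`.
    """
    y = np.asarray(y, dtype=float)

    def best_cut(lo, hi):
        seg = y[lo:hi]
        n = len(seg)
        best_j, best_m = None, -np.inf
        for j in range(min_len, n - min_len + 1):
            left, right = seg[:j], seg[j:]
            # CUSUM-like scaled mean difference
            m = np.sqrt(j * (n - j) / n) * abs(left.mean() - right.mean())
            if m > best_m:
                best_j, best_m = lo + j, m
        return best_j, best_m

    cuts = []

    def recurse(lo, hi):
        if hi - lo < 2 * min_len:
            return
        j, m = best_cut(lo, hi)
        if j is None or m < threshold:
            return
        cuts.append(j)
        recurse(lo, j)
        recurse(j, hi)

    recurse(0, len(y))
    return sorted([0] + cuts + [len(y)])
```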
[Figure: simulation results: true distribution, uniform start-site shuffling, block bootstrap without segmentation, block bootstrap with true segmentation, block bootstrap with estimated segmentation]
Confidence Bounds: r > 1
• Given a statistic, e.g. basepair % overlap: find bounds (lower, upper) such that P(lower ≤ θ ≤ upper) ≈ 1 − α, with the interval as narrow as possible
• Here θ is the "average basepair overlap over all potential genomes for the region considered"
Use Algorithm 4.1
• For each segment pick a random block of length proportional to the segment length
• Concatenate to get a block of length L
• Compute % bp overlap for the block
• Repeat many times
• Use the 100(1 − α)% percentiles of the resulting values for the confidence bounds
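A small sketch turning the bootstrap replicates from the previous steps into percentile confidence bounds. The equal-tailed two-sided form is my reading of "use the 100(1 − α)% percentiles", not something stated explicitly in the slides.

```python
import numpy as np

def bootstrap_confidence_bounds(boot_values, alpha=0.05):
    """Percentile confidence bounds from block-bootstrap replicates of
    % bp overlap (e.g. the output of segmented_block_bootstrap above)."""
    boot_values = np.asarray(boot_values)
    lower = np.percentile(boot_values, 100 * alpha / 2)
    upper = np.percentile(boot_values, 100 * (1 - alpha / 2))
    return lower, upper
```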
Testing Association Question: how do we estimate the null distribution given only data for which we believe the null is false?
Testing Association (bp overlap)
Observed sequence, with Feature 1 and Feature 2 marked.
• Sample two blocks of equal length
• Align Feature 1 of the first block with Feature 2 of the second block, and vice versa
• Calculate the overlap in the blocks after swapping: (X2)(Y1) + (X1)(Y2)
The statistic is (X2)(Y1) + (X1)(Y2), properly normalized and set to mean 0. Under the null hypothesis of independence, this should be Gaussian.
Test Statistic
H: features are not associated in each segment. Under H the block-swapped ("dummy") overlap, properly centered, has a Gaussian distribution. We form the test statistic as a weighted sum over segments of the overlap centered by Ii·Ji, where:
• λi = length of segment i / n (the weight)
• Ii = % of basepairs in segment i identified as Feature 1
• Ji = % of basepairs in segment i identified as Feature 2
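One plausible reading of this statistic, sketched in Python: a weighted sum over segments of the observed % bp overlap minus Ii·Ji, with weights λi = segment length / n. The exact standardization used in the talk is not reproduced here; this is an assumption for illustration.

```python
import numpy as np

def observed_association_statistic(f1, f2, boundaries):
    """Weighted sum over segments of (% bp overlap - I_i * J_i),
    with weights lambda_i = segment length / n (a sketch, not the
    talk's exact standardized statistic)."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    n = len(f1)
    t = 0.0
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        lam = (hi - lo) / n
        i_hat = f1[lo:hi].mean()               # % Feature 1 in segment
        j_hat = f2[lo:hi].mean()               # % Feature 2 in segment
        overlap = (f1[lo:hi] * f2[lo:hi]).mean()  # % bp overlap in segment
        t += lam * (overlap - i_hat * j_hat)
    return t
```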
Null Distribution
• Choose pairs of blocks at random
• Compute the false ("dummy") overlap H
• Compute I = % Feature 1 and J = % Feature 2 in the sampled blocks
• Block-bootstrapped null: H − IJ
• If r > 1, pairs of blocks are chosen in each region, and H and IJ are weighted sums across regions
• The null is mean zero and has the correct variance
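A hedged sketch of this block-bootstrap null for the r = 1 case: sample two blocks, pair Feature 1 of each with Feature 2 of the other, and record the centered dummy overlap H − IJ. Averaging the two pairings is my choice of normalization; the slides only say "properly normalized".

```python
import numpy as np

def dummy_overlap_null(f1, f2, L, B=1000, rng=None):
    """Block-bootstrap null for the association test (r = 1 sketch).

    For each replicate, sample two blocks of length L, compute the
    "dummy" overlap between Feature 1 of one block and Feature 2 of
    the other (averaged over the two pairings), and center it by the
    product of the feature frequencies in the sampled blocks.
    """
    rng = np.random.default_rng(rng)
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    n = len(f1)
    null = np.empty(B)
    for b in range(B):
        s1, s2 = rng.integers(0, n - L + 1, size=2)
        x1, y1 = f1[s1:s1 + L], f2[s1:s1 + L]
        x2, y2 = f1[s2:s2 + L], f2[s2:s2 + L]
        # dummy overlap: Feature 1 of one block against Feature 2 of the other
        h = 0.5 * (np.mean(x1 * y2) + np.mean(x2 * y1))
        i_hat = 0.5 * (x1.mean() + x2.mean())   # % Feature 1 in sampled blocks
        j_hat = 0.5 * (y1.mean() + y2.mean())   # % Feature 2 in sampled blocks
        null[b] = h - i_hat * j_hat
    return null
```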
Example from ENCODE data • ENm001: ENCODE Consortium annotated over 2500 feature-instances exclusive of UTRs and CDSs • Question: “Do these (largely) non-coding features exhibit more overlap with constrained sequences than expected at random?” • To answer, we used the block bootstrap to obtain null distribution • When null is Gaussian, it has the correct variance • When not, it is overly conservative • Segmentation can reduce conservativeness, and detect significance that would otherwise be missed
[Figure: null distributions: No Segmentation, p-value 0.1; Estimated Segmentation, p-value 0.001]
There are two L's
• Ls: the minimum segment length during segmentation (to be discussed)
• L: the length of blocks during subsampling (chosen on grounds of stability)
A philosophical question: the Issue of Scale
• Relevant probability assessments depend on the segmentation
• Segmentation depends on scale
• Things which seem surprising on small scales may not be at larger ones, e.g. differences in GC content
• My view: it's only determinable biologically
Some Future Directions
• KS-type tests: beyond overlap, KS-type tests can compare the distributions of features, e.g. "Does the pattern of constrained sequence in coding regions differ from that in non-coding regions?"
• Maxima: aggregative plots can summarize one feature in the neighborhood of another, e.g. "Does binding data (such as ChIP-chip) show that a given regulatory factor tends to bind near TSSs?"
• Other types of association: does wavelet analysis offer significant support for the large-scale association of replication timing and conservation? Many others arising from ENCODE, modENCODE, and elsewhere
• Other types of segmentation: dyadic segmentation is analytically convenient, but other segmentations may be useful
Acknowledgements • The ENCODE Consortium • The MSA and Transcription and Regulation Groups • Especially: Elliot Margulies, Tom Gingeras and Ewan Birney • Supported by NIGMS and NHGRI
Association of functional annotations in the Human Genome [Table from ENCODE Consortium Paper: Nature, June 14th, 2007]
Dyadic Segmentation: Algorithm 4.8
For a minimum region length Ls and threshold b, initialize t = (t0, t1) = (0, n):
• For i = 1,…,|t|−1, let M(i)(j) and V(i)(j) be respectively the processes (4.7) and (4.8) computed on the subsequence Xti-1+1,…,Xti. Let t'i = argmaxj M(i)(j), and mi = min(t'i − ti-1, ti − t'i).
• Let Vi = V(i)(t'i), and let Bi be the standardized split criterion formed from M(i)(t'i) and Vi, with candidate splits for which mi < Ls excluded. If maxi Bi ≤ b, stop and return t.
• Let i* = argmaxi Bi, and tnew = t'i*.
• Let t = t ∪ {tnew}, reordered so that ti is monotonically increasing in i, and repeat.