Exploring the Role of Non-Coding DNA in the Function of the Human Genome through Variation. Christine Bird cpb@sanger.ac.uk. Hypothesis: Conserved non-coding DNA has a function in the human genome. Does human variation data suggest selection is acting on noncoding DNA?
Hypothesis: Conserved non-coding DNA has a function in the human genome. Does human variation data suggest selection is acting on non-coding DNA? • Are conserved non-coding sequences selectively constrained? • Detection of fast-evolving conserved non-coding sequences. • Exploring the properties and genomic context of human fast-evolving non-coding regions.
The Human Genome: ~25,000 genes. 1 to 1.5% of human DNA is coding. Is the remaining 98.5% “junk”?
Selective constraint in mammalian genomes. [Figure: genome divided into Neutral and Constrained sequence, with 5% labelled.] Waterston et al. Nature 2002
Proportions of Lineage Specific Conserved non-coding (CNC) sequences. 418 MCSs (Multiple vertebrate Conserved Sequences) in 571 Kb: 58 coding, 46 UTRs and 314 non-coding. ~27 species. Margulies et al. PNAS 2005
CNCs are evenly distributed in the human genome Dermitzakis et al. Nat Rev Genet 2005
The density of CNCs and exons is negatively correlated Dermitzakis et al. Nat Rev Genet 2005
Why study conserved non-coding DNA? • Abundance beyond that expected under neutral evolution. • If function is gene regulation, understanding is limited. • Gene regulation is considered a crucial contributor to evolutionary change (King and Wilson, 1975). • Conserved non-coding sequences (CNCs) may well harbour critical regulatory changes that have driven recent human evolution.
Conserved non-coding sequences • Top conserved 5% of the human genome as detected with a phylogenetic hidden Markov model (phyloHMM) (Siepel, 2005). • Best-in-genome pairwise alignments by blastz, followed by chaining. • A multiple alignment constructed by MULTIZ. • PhastCons constructs a two-state phylo-HMM for conserved and non-conserved regions. • Remove overlap with Ensembl gene annotation. http://genome.ucsc.edu/
Are conserved non-coding sequences selectively constrained? • Conservation of non-coding sequence due to forces acting on the human genome. • CNC SNP density is only 82% of that in non-conserved non-coding sequence: 3.9 × 10⁻⁴ vs. 4.8 × 10⁻⁴; χ² = 686, 1 df; p < 10⁻⁹⁹. Is this just due to low local mutation rates? Or are new alleles deleterious, and therefore less likely to be fixed in the population? • Address this by looking at the derived allele frequency (DAF) spectrum, as it is unaffected by local mutation rates. Drake et al. Nat Genet 2006
Derived Allele Frequency
• Selective constraint shifts the distribution of constrained alleles toward rarer frequencies (Fay & Wu, 2000).
• Allele frequencies in 4 populations from 210 unrelated individuals in the HapMap project:
  CEU - American of European ancestry (60)
  YRI - Yoruba from Nigeria (60)
  JPT - Japanese from Tokyo (45)
  CHB - Han Chinese from Beijing (45)
• Derived allele frequency (DAF) was generated for 1 million Phase I and 4 million Phase II HapMap SNPs.
• The ancestral allele was inferred by comparison to chimp and/or macaque.
• SNPs were assigned to defined genomic features to allow comparison.
Drake et al. Nat Genet 2006
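The polarisation step described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, genotypes are assumed to be two-letter strings, and the rule that disagreeing outgroups leave the SNP unpolarised is an assumption of the sketch.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}


def infer_ancestral(chimp, macaque):
    """Return the outgroup base, or None if outgroups are missing or disagree.

    Assumption of this sketch: when only one outgroup base is available it is
    used on its own; when both are available they must agree.
    """
    calls = {b for b in (chimp, macaque) if b in BASES}
    return calls.pop() if len(calls) == 1 else None


def derived_allele_frequency(genotypes, ancestral):
    """Frequency of the non-ancestral allele at one biallelic SNP.

    genotypes: two-letter strings such as "CT", one per individual.
    Returns None when the SNP cannot be polarised.
    """
    if ancestral is None:
        return None
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    if total == 0 or len(counts) > 2:
        return None  # no data, or more than two alleles observed
    if len(counts) == 2 and ancestral not in counts:
        return None  # outgroup base matches neither human allele
    return 1.0 - counts[ancestral] / total


# Example: the outgroups agree on C, so T is the derived allele.
ancestral = infer_ancestral("C", "C")
daf = derived_allele_frequency(["CC", "CT", "TT", "CT"], ancestral)
```

Collecting these values over all SNPs in a feature class (CNCs, introns, and so on) gives the DAF spectrum that the following slides compare.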
CNCs are selectively constrained. [Figure: axis labelled "Selective constraint", from High to Low.] Mann-Whitney U test; P << 10⁻⁴. Drake et al. Nat Genet 2006
CNCs have an excess of low-frequency derived alleles compared to introns. [Figure: axis labelled "Selective constraint", from High to Low.] Mann-Whitney U test; CNCs vs introns, P << 10⁻¹⁶
CNC sequences are selectively constrained and not mutation cold spots • Nucleotide variation revealed strong selective constraints upon CNCs in human populations. • SNP density in CNCs is 82% of that in non-conserved non-coding sequence (18% lower). • CNCs have an excess of low-frequency derived alleles. • CNCs are subject to purifying selection in humans and are likely to harbour functionally important variants. Drake et al. Nat Genet 2006
Why are they conserved? • Regions of the genome are therefore selectively constrained despite being non-coding. But what is the reason for this conservation…? • What is novel about their biology? • How can we tackle this question for so many elements? • What are the most interesting regions? • A subset of CNCs undergoing rapid change with potential common properties or roles.
Why study fast-evolving non-coding? • If CNCs contribute to chimpanzee-human lineage differentiation through changes in gene regulation, then changes in their nucleotide sequence should be expected despite their overall conservation. • Following gene duplication, subfunctionalization can occur by the partitioning of gene regulation among descendant copies (Force, 1999). • Older models of gene duplication proposed an important role for positive selection after duplication (Bridges 1935; Ohno 1970; Ohta 1987).
Subfunctionalization • Duplicated genes are preserved through subfunctionalization under the Duplication-Degeneration-Complementation model. • If CNCs are regulatory elements involved in this process, they would have changed rapidly since duplication. [Figure: a duplicated gene with separated tissue-specific regulation; tissue labels Brain and Heart.] Lynch and Force, Genetics 2000
Detecting fast-evolving non-coding sequences
MULTIZ alignments (Webb Miller) of human, chimp and macaque:
Human    GACTACGTTTGGTTTAGAGAT
Chimp    GACTGGCTTTACTTTTGAGAT
Macaque  GTCTGGGTTTACTTTTCAGAT
Lineage-specific substitutions: human ($S_1$) = 5, chimp ($S_2$) = 1, macaque = 2.
Tajima's relative rate test: $\chi^2 = \frac{(S_1 - S_2)^2}{S_1 + S_2}$
Tajima, Genetics 1993
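As a worked example, the relative rate test can be applied to the three-way alignment shown on this slide. The sketch below is illustrative only: the function names are hypothetical, columns containing anything other than A, C, G or T are skipped (an assumption), and the p-value uses the chi-square distribution with 1 degree of freedom.

```python
import math

BASES = {"A", "C", "G", "T"}


def lineage_specific_counts(human, chimp, macaque):
    """Count substitutions unique to human (S1) and to chimp (S2).

    A site is human-specific when chimp and macaque agree but human differs,
    and chimp-specific when human and macaque agree but chimp differs.
    """
    if not (len(human) == len(chimp) == len(macaque)):
        raise ValueError("aligned sequences must have equal length")
    s1 = s2 = 0
    for h, c, m in zip(human, chimp, macaque):
        if not {h, c, m} <= BASES:
            continue  # skip gaps and ambiguous bases
        if h != c and c == m:
            s1 += 1
        elif h != c and h == m:
            s2 += 1
    return s1, s2


def tajima_relative_rate(s1, s2):
    """Return (chi-square, p-value) for Tajima's relative rate test, 1 df."""
    if s1 + s2 == 0:
        raise ValueError("no lineage-specific substitutions to test")
    chi2 = (s1 - s2) ** 2 / (s1 + s2)
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


human = "GACTACGTTTGGTTTAGAGAT"
chimp = "GACTGGCTTTACTTTTGAGAT"
macaque = "GTCTGGGTTTACTTTTCAGAT"
s1, s2 = lineage_specific_counts(human, chimp, macaque)
chi2, p_value = tajima_relative_rate(s1, s2)
```

For this alignment the counts are S1 = 5 and S2 = 1, giving a chi-square of 16/6 (about 2.67), which does not reach the 0.05 threshold on its own.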
χ² test of base substitutions
Alignments = 304,291
With power to detect acceleration = 26,477
Accelerated at P < 0.05 = 2,794 (11% of those with power), termed ANCs (Accelerated Non-Coding)
Accelerated in chimp = 1,438
Accelerated in human = 1,356
Are Accelerated Non-Coding sequences (ANCs) functional?
• Compare to 3 sets of control sequences:
  • Power CNCs (not lineage specific): CNCs with >= 4 substitutions = 23,683
  • Non-accelerated CNCs: CNCs with < 4 substitutions = 277,814
  • DAF controls 1 & 2: 1,356 x 20 Kb windows, 500 Kb from the 5' and 3' ends of ANCs
• Repeat analyses excluding potential confounders: segmental duplications (SD), copy number variants (CNV), pseudogenes and retroposed genes.
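One plausible reading of the 4-substitution cut-off is that it is the smallest number of lineage-specific substitutions at which the relative rate test can reach P < 0.05 at all. The sketch below checks that reading under stated assumptions: the critical value 3.841 is the upper 5% point of the chi-square distribution with 1 degree of freedom, and the function names are hypothetical. The counts are also internally consistent: 23,683 power CNCs plus 2,794 ANCs gives the 26,477 alignments with power, and adding the 277,814 non-accelerated CNCs gives 304,291.

```python
CHI2_CRIT_1DF_P05 = 3.841  # upper 5% point of chi-square with 1 df


def max_chi2(total):
    """Largest relative-rate chi-square attainable with `total` substitutions.

    Tries every split of `total` into (S1, S2) = (total - k, k).
    """
    return max((total - 2 * k) ** 2 / total for k in range(total + 1))


def min_total_with_power(critical=CHI2_CRIT_1DF_P05, limit=100):
    """Smallest S1 + S2 for which some split exceeds the critical value."""
    for total in range(1, limit + 1):
        if max_chi2(total) > critical:
            return total
    return None


threshold = min_total_with_power()
```

The most extreme split (all substitutions on one lineage) gives a chi-square equal to the total, so three substitutions can reach at most 3.0 while four reach 4.0.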
Are ANC sequences functional? • Does nucleotide variation data indicate particular modes of selection implying function? (Is acceleration recent or ancient?) • Derived allele frequency spectrum comparisons • Population differentiation, FST • Are ANCs involved in subfunctionalization? • Is there enrichment in recently duplicated sequences? • What function do these rapidly evolving sequences have? • Association of ANC variation with expression levels of nearby genes
Loss of constraint & directional selection? Excess of high-frequency derived alleles in ANCs. [Figure: axis labelled "Selective constraint".] Mann-Whitney U test; non-accelerated CNCs vs ANCs, P = 1.63 × 10⁻⁶
Power CNCs are neutral. Mann-Whitney U test; power CNCs vs control, P = 0.15
Loss of constraint & directional selection? Excess of rare alleles in ANCs excluding confounding elements. Mann-Whitney U test; ANCs vs ANCs without confounders, P = 0.48
Detecting recent evolution and population-specific selection
• A measure of population structure, Wright's FST.
• Compares the mean amount of genetic diversity found within subpopulations to that of the meta-population.
• Sampling from 2 diverged subpopulations as if they were a single panmictic population gives an excess of homozygotes and a deficiency of heterozygotes.
• FST can be defined as: $F_{ST} = \frac{H_T - H_S}{H_T}$, where $H_T$ is the expected heterozygosity of the total population and $H_S$ the mean expected heterozygosity within subpopulations.
• Calculated for ANCs using:
  MSG - mean square error within populations
  MSP - mean square error between populations
  nc - variance-corrected average sample size
Weir and Cockerham, Evolution 1984
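The heterozygosity-based definition of FST on this slide can be illustrated for a single biallelic SNP. This is a sketch of Wright's definition only, not the Weir and Cockerham estimator used for the ANC analysis, which additionally corrects for sample size. The function name and the optional size weighting are assumptions of the sketch.

```python
def wright_fst(freqs, sizes=None):
    """Wright's FST for one biallelic SNP.

    freqs: frequency of one allele in each subpopulation.
    sizes: optional sample sizes used as weights (equal weights if omitted).
    Returns None when the SNP is monomorphic overall (FST undefined).
    """
    if sizes is None:
        sizes = [1] * len(freqs)
    total = sum(sizes)
    weights = [n / total for n in sizes]
    # Expected heterozygosity of the pooled population.
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return None
    # Mean expected heterozygosity within subpopulations.
    h_s = sum(w * 2 * p * (1 - p) for w, p in zip(weights, freqs))
    return (h_t - h_s) / h_t


# Two equally sized populations with allele frequencies 0.2 and 0.8.
example_fst = wright_fst([0.2, 0.8])
```

Identical frequencies give FST = 0 and fixed differences give FST = 1; the example above falls in between because pooling the two populations hides part of the heterozygote deficit.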
ANC FST values are higher than those of non-accelerated CNCs. Mann-Whitney U test; non-accelerated CNCs vs ANCs, P = 0.0504; non-accelerated CNCs vs ANCs without confounders, P = 0.0363
Enrichment in Segmental Duplications • Approximately 5-6% of the human genome is in SDs (Bailey et al. Science 2002). Proportion in SDs: ANCs 8%, power CNCs 10%, non-accelerated CNCs 5%. • Excess of ANCs and power CNCs in SDs (chi-square; P < 10⁻⁴). • The general enrichment in SDs is not surprising, as sequence divergence has been observed to be elevated in duplicated sequences (Hurles et al. GenBio. 2004; She et al. GenRes. 2006).
Excess of recent segmental duplications associated with ANCs. [Figure label: Human Specific.] Mann-Whitney U test; P << 10⁻⁴
Testing for evidence of involvement in gene regulation. [Figure: association between a SNP within an ANC and the mRNA level of a nearby gene.]
ANC SNP-Expression Association
• Additive association model: linear regression on genotype coded as an allele count, e.g. CC = 0, CT = 1, TT = 2.
• What is the functional impact of ANC variation on gene expression phenotypes?
• 47,294 transcripts probed in lymphoblastoid cell lines of 210 unrelated HapMap individuals.
• Associate SNP genotypes within ANCs with transcript expression levels by linear regression.
• Statistical significance adjusted following 10,000 permutations per gene.
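The additive model and the permutation step can be sketched with the standard library alone. This is a simplified, single-SNP illustration under stated assumptions: the function names are hypothetical, the test statistic is the regression r², and expression values are shuffled across individuals for one SNP. The slide's per-gene adjustment presumably considers all SNPs tested against a gene, which is not reproduced here.

```python
import random


def encode_additive(genotype, counted_allele):
    """Code a genotype as the number of copies of one allele (0, 1 or 2)."""
    return genotype.count(counted_allele)


def slope_and_r2(x, y):
    """Least-squares slope and r-squared of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotype or expression")
    return sxy / sxx, sxy * sxy / (sxx * syy)


def permutation_p(x, y, n_perm=10000, seed=0):
    """Empirical p-value: share of shuffles with r-squared >= the observed."""
    rng = random.Random(seed)
    observed = slope_and_r2(x, y)[1]
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if slope_and_r2(x, shuffled)[1] >= observed - 1e-12:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (invented for illustration): expression rises with T allele count.
genotypes = ["CC", "CT", "TT", "CC", "CT", "TT"]
expression = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
dosage = [encode_additive(g, "T") for g in genotypes]
slope, r2 = slope_and_r2(dosage, expression)
p_emp = permutation_p(dosage, expression, n_perm=2000)
```

Shuffling expression values keeps the genotype distribution fixed while breaking any real association, so the empirical p-value does not rely on the residuals being normally distributed.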
SNPs within ANCs are significantly associated with gene expression phenotypes.
• Significant SNPs at the 0.01 permutation threshold:
  68% of ANC SNPs tested (496 out of 729)
  9% of power CNC SNPs tested (1,047 out of 11,468)
  A SNP within an ANC is 7 times more likely to be associated with gene expression levels than a SNP within a power CNC.
• Significant elements at the 0.01 permutation threshold:
  16% of ANCs tested (59 out of 366)
  3% of power CNCs tested (165 out of 5,968)
  Nucleotide variation within ANCs is 5 times more likely to be associated with gene expression levels than variation in a power CNC.
• Tendency for derived alleles within ANCs to be associated with lower expression levels.
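The fold differences quoted on this slide can be recomputed directly from the counts given. The helper below is a hypothetical convenience function. From the raw counts the SNP-level ratio is about 7.5 and the element-level ratio about 5.8, so the quoted "7 times" and "5 times" appear to be rounded down or derived from the rounded percentages.

```python
def fraction_and_fold(hits_a, total_a, hits_b, total_b):
    """Return (fraction in set A, fraction in set B, A/B fold difference)."""
    frac_a = hits_a / total_a
    frac_b = hits_b / total_b
    return frac_a, frac_b, frac_a / frac_b


# SNP level: significant SNPs in ANCs vs power CNCs.
snp_anc, snp_cnc, snp_fold = fraction_and_fold(496, 729, 1047, 11468)

# Element level: ANCs vs power CNCs with at least one significant SNP.
elem_anc, elem_cnc, elem_fold = fraction_and_fold(59, 366, 165, 5968)
```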
Summary • CNCs are not mutation cold spots but selectively constrained. • Fast evolving noncoding sequences in the human lineage have lost this constraint and some are potentially undergoing positive selection. • This may have contributed to some recent differentiation in human populations. • ANCs are enriched in the most recent segmental duplications. • SNPs in ANCs are associated with significant change in gene expression phenotypes.
Acknowledgements Thanks to my joint supervisors Emmanouil Dermitzakis and Matthew Hurles and the members of their teams; • Barbara Stranger • Dan Jeffares • Catherine Ingle • Julian Huppert • Antigone Dimas • Sarah Lindsay • Dan Andrews • Dan Turner • Chris Barnes Particular thanks to my other co-authors, • Webb Miller - human-chimpanzee-macaque alignments • Daryl Thomas - DAF for both phase I and II SNPs • Maureen Liu - quantifying gene density The Rhesus Macaque Genome Sequencing Consortium (RMGSC) and the HapMap consortium for making data available, and the Wellcome Trust and MRC for funding.
Fig. 3. Phylogenetic tree of vertebrate species. By using the generated 27-species multisequence alignment, branch lengths were calculated based on analysis of synonymous coding positions. The branch lengths (as substitutions per synonymous site) between human and each species are listed (with additional pair-wise branch lengths provided in the supporting information). The last common ancestor among the catarrhine primates (A) is estimated at 25 mya (36, 37), between the rodents and primates (B) at 75 mya (5, 6), between eutherians and metatherians (C) at 185 mya (14), between monotremes and other therians (D) at 200 mya (14), and between mammals and birds (E) at 310 mya (13). Margulies et al. PNAS 2005
Proportions of Lineage Specific Conserved non-coding sequences. Fig. 4. Lineage specificity of MCSs. The proportion of nonexonic MCSs found in the sequences of species in each category is indicated. Note that virtually all MCSs overlapping known exonic sequences are present in all mammals (data not shown). All Mammals: cat, dog, cow, pig, rat, mouse, N.A. opossum, wallaby, and platypus; Eutherian: cat, dog, cow, pig, rat, and mouse; Marsupials: N.A. opossum and wallaby; and Other: species combinations containing 2% of the analyzed MCSs (see the supporting information for the complete data set). Hashed areas of “All Mammals” reflect portions lacking one or both rodents, and hashed portions of “Eutherian Marsupials” reflect portions lacking both rodents. Margulies et al. PNAS 2005
Distribution of large and small CNCs (Conserved Non-Coding sequences) and exons on Hsa21. [Figure: frequency (0 to 400) of exons, big CNCs and small CNCs along the long arm of Hsa21 (0 to 30 Mb).] Big CNCs: 70% ID, 100 bps ungapped. Small CNCs: 85% ID, 35-99 bps ungapped. Dermitzakis et al. Nature 2002
Conservation of CNCs in multiple species. [Figure: a conserved block aligned between human and mouse.] Dermitzakis et al. Science 2003
Testing DAF spectrum distributions
• Non-parametric distributions of unequal sample size
• Mann-Whitney U test:
  • Compares the median of two populations
  • Uses the rank order of values in the two samples.
• Kolmogorov-Smirnov test:
  • Measures differences in the entire distributions of two samples, in both shape and location, but at the cost that it is less sensitive to differences in location only.
  • KS is less powerful than the Mann-Whitney U test against the alternative hypothesis of a difference in location.
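The two tests compared on this slide can be sketched without external libraries. This is an illustrative implementation under stated assumptions: the Mann-Whitney p-value uses the normal approximation with a tie correction and no continuity correction, only the Kolmogorov-Smirnov statistic (not its p-value) is computed, and the function names and toy DAF values are invented for the example.

```python
import math
from bisect import bisect_right


def mann_whitney_u(a, b):
    """Return (U for sample a, two-sided p-value by normal approximation)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n1, n2 = len(a), len(b)
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


def ks_statistic(a, b):
    """Largest absolute gap between the two empirical distribution functions."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Toy DAF values: one set skewed toward rare alleles, one toward common ones.
constrained = [0.05, 0.1, 0.1, 0.2]
neutral = [0.3, 0.5, 0.6, 0.9]
u_stat, p_mwu = mann_whitney_u(constrained, neutral)
d_stat = ks_statistic(constrained, neutral)
```

Because the Mann-Whitney test works on ranks, a consistent shift toward rarer alleles is detected even when the shapes of the two spectra are similar, which is the situation the DAF comparisons above are designed for.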