Using Vertebrate Genome Comparisons to Find Gene Regulatory Elements. Penn State University, Center for Comparative Genomics and Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton Nekrutenko, Ross Hardison University of California at Santa Cruz: David Haussler, Jim Kent
Using Vertebrate Genome Comparisons to Find Gene Regulatory Elements
Penn State University, Center for Comparative Genomics and Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton Nekrutenko, Ross Hardison
University of California at Santa Cruz: David Haussler, Jim Kent
National Human Genome Research Institute: Laura Elnitski
Children’s Hospital of Philadelphia: Mitch Weiss
Lawrence Livermore National Laboratory: Ivan Ovcharenko
Comparative genomics to find functional sequences
• Find common sequences (blastZ, multiZ): all mammals, 1000 Mbp; also birds, 72 Mb
• Identify functional sequences: ~145 Mbp
• Papers in Nature from the mouse, rat and chicken genome consortia, 2002, 2004
[Figure: genome sizes in million base pairs (Mbp); species labels: Human, Mouse, Rat; values shown: 2,900, 2,400, 2,500, 1,200]
Coverage of human by alignments with other vertebrates ranges from 1% to 91%
[Figure: coverage of human by alignments versus divergence time; axis in millions of years: 5.4, 91, 92, 173, 220, 310, 360, 450; label: 5%]
Distinctive divergence rates for different types of functional DNA sequences
Large divergence in cis-regulatory modules from opossum to platypus
cis-Regulatory modules conserved from human to fish
• About 20% of CRMs
• Tend to regulate genes whose products control transcription and development
• Recent reports:
  • Sandelin, A. et al. (2004) BMC Genomics 5: 99.
  • Woolfe, A. et al. (2005) PLoS Biol 3: e7.
  • Plessy, C., Dickmeis, T., Chalmel, F., Strahle, U. (2005) Trends Genet. 21: 207-10.
[Figure: time axis in millions of years: 91, 173, 310, 450]
cis-Regulatory modules conserved from human to chicken
• About 40% of CRMs
• Noncoding sequences conserved from human to chicken tend to cluster in gene-poor regions
  • Conservation jungles
  • Hillier et al. (2004) Nature
• Stable gene deserts are conserved from human to chicken
  • Ovcharenko et al. (2005) Genome Res. 15: 137-145.
• Conserved noncoding sequences in stable gene deserts tend to be long-range enhancers
  • Nobrega, M.A., Ovcharenko, I., Afzal, V., Rubin, E.M. (2003) Science 302: 413.
Posters 120 (Bob Harris), 121 (Laura Elnitski), 192 (Ivan Ovcharenko)
[Figure: time axis in millions of years: 91, 173, 310, 450]
cis-Regulatory modules conserved in eutherian mammals (and marsupials?)
• About 80-90% of CRMs
• Within aligned noncoding DNA of eutherians, need to distinguish constrained DNA (purifying selection) from neutral DNA.
[Figure: time axis in millions of years: 91, 173, 310, 450]
Score multi-species alignments for features associated with function
• Multiple alignment scores
  • Binomial, parsimony (Margulies et al., 2003)
  • PhastCons (Siepel and Haussler, 2003; Siepel et al., 2005)
    • Phylogenetic Hidden Markov Model
    • Posterior probability that a site is among the 10% most highly conserved sites
    • Allows for variation in rates and autocorrelation in rates
• Factor binding sites conserved in human, mouse and rat
  • Tffind (from M. Weirauch, Schwartz et al., 2003)
• Score alignments by frequency of matches to patterns distinctive for CRMs
  • Regulatory potential (Elnitski et al., 2003; Kolbe et al., 2004)
Computing Regulatory Potential (RP)
RP of any 3-way alignment is the sum of the log likelihood ratios of finding the strings of alignment characters in known regulatory regions vs. ancestral repeats.
Alignment
seq1                G T A C C T A C T A C G C A
seq2                G T G T C G - - A G C C C A
seq3                A T G T C A - - A A T G T A
Collapsed alphabet  1 2 1 3 4 5 7 7 6 8 3 6 3 9
• A 3-way alignment has 124 types of columns. Collapse these to a smaller alphabet with characters s (for example, 1-9).
• Train two order t Markov models for the probability that t alignment columns are followed by a particular column in training sets:
  • positive (alignments in known regulatory regions)
  • negative (alignments in ancestral repeats, a model for neutral DNA)
• E.g. frequency that 3 4 is followed by 5:
  • 0.001 in regulatory regions
  • 0.0001 in ancestral repeats
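The scoring described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the published RP implementation: the function names (`train_markov`, `rp_score`), the add-one smoothing, and the toy training sets are all assumptions made for the example. The count of 124 column types is consistent with five characters per row (A, C, G, T, gap) minus the all-gap column.

```python
import math
from collections import defaultdict

# 5 characters per row (A, C, G, T, gap), 3 rows, minus the all-gap column.
n_column_types = 5 ** 3 - 1  # 124, as stated on the slide


def train_markov(sequences, order, alphabet, pseudocount=1.0):
    """Estimate P(symbol | previous `order` symbols) from training sequences.

    Add-one (pseudocount) smoothing is an assumption of this sketch so that
    unseen transitions do not produce zero probabilities.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[tuple(seq[i - order:i])][seq[i]] += 1.0

    def prob(context, symbol):
        seen = counts.get(tuple(context), {})
        total = sum(seen.values()) + pseudocount * len(alphabet)
        return (seen.get(symbol, 0.0) + pseudocount) / total

    return prob


def rp_score(seq, order, p_reg, p_neutral):
    """Sum of log likelihood ratios (regulatory vs. neutral) over a sequence."""
    score = 0.0
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        score += math.log(p_reg(context, seq[i]) / p_neutral(context, seq[i]))
    return score


# Slide example: "3 4" followed by 5 has frequency 0.001 in regulatory
# regions and 0.0001 in ancestral repeats, so that step contributes ln(10).
slide_step = math.log(0.001 / 0.0001)

# Toy data (hypothetical): the collapsed alignment from the slide serves as
# the "regulatory" training set and its reverse as the "neutral" one.
alphabet = list(range(1, 10))
collapsed = [1, 2, 1, 3, 4, 5, 7, 7, 6, 8, 3, 6, 3, 9]
p_reg = train_markov([collapsed], order=2, alphabet=alphabet)
p_neutral = train_markov([collapsed[::-1]], order=2, alphabet=alphabet)
toy_score = rp_score(collapsed, 2, p_reg, p_neutral)
```

With real data the two models would be trained on many alignments of known regulatory regions and ancestral repeats, and the score would typically be computed over sliding windows; neither detail is specified on the slide.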
More species and better models improve discriminatory power of RP scores
[Figure: ROC curves for different RP scores, tested on a set of known regulatory regions from the HBB gene complex]
Poster 257: James Taylor
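For readers unfamiliar with ROC curves, the following sketch shows how one could be computed for any alignment-based score. The scores and labels are made-up toy values, not data from the talk, and the function names are hypothetical; tied scores are not handled specially.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    `labels` holds 1 for known regulatory regions and 0 for neutral DNA.
    The threshold is swept from the highest score to the lowest.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (hypothetical values): three regulatory and three neutral regions.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_points = roc_points(toy_scores, toy_labels)
toy_auc = area_under_curve(toy_points)
```

An area near 1 means the score ranks known regulatory regions above neutral DNA almost every time; an area near 0.5 means it does no better than chance.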
Galaxy metaserver for integrative analysis of genomic data
• Use servers at primary data repositories (e.g. UCSC Table Browser) to gather initial data
• Results stored and analyzed at Galaxy
• Operations
  • Union, intersection, subtraction
  • Clustering, proximity
• Bioinformatic tools:
  • Retrieve alignments
  • Ka/Ks
• Giardine, Riemer … Nekrutenko, Poster 90
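The set operations listed above can be illustrated on plain lists of genomic intervals. This is a sketch of the concepts only and does not reflect how Galaxy implements them; it assumes half-open (start, end) coordinates on a single chromosome, and the function names are hypothetical.

```python
def merge(intervals):
    """Sort intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union(a, b):
    """All positions covered by either set."""
    return merge(list(a) + list(b))


def intersection(a, b):
    """Positions covered by both sets."""
    out = []
    for s1, e1 in merge(a):
        for s2, e2 in merge(b):
            start, end = max(s1, s2), min(e1, e2)
            if start < end:
                out.append((start, end))
    return out


def subtraction(a, b):
    """Positions covered by `a` but not by `b`."""
    out = []
    b = merge(b)
    for start, end in merge(a):
        cursor = start
        for b_start, b_end in b:
            if b_end <= cursor or b_start >= end:
                continue  # no overlap with the remaining part of this interval
            if b_start > cursor:
                out.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < end:
            out.append((cursor, end))
    return out


# Toy example (hypothetical coordinates).
set_a = [(100, 200), (300, 400)]
set_b = [(150, 320)]
```

For the toy sets, the union is one interval from 100 to 400, the intersection is (150, 200) and (300, 320), and subtracting `set_b` from `set_a` leaves (100, 150) and (320, 400).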
How well do these alignment-based scores work in finding cis-regulatory modules?
RP and phastCons can discriminate most known functional elements from neutral DNA
Genes co-expressed in late erythroid maturation
• G1E-ER cells: proerythroblast line from mice lacking the transcription factor GATA-1.
• Can restore the activity of GATA-1 by expressing an estrogen-responsive form of GATA-1
• Allows cells to mature further to erythroblasts
• Use microarray analysis of each to find genes that increase or decrease expression upon induction.
• Walsh et al., (2004) BLOOD; Image from k-means cluster, GEO:
[Figure: k-means clusters of induced and repressed genes versus time after restoration of GATA-1]
Predicting cis-regulatory modules (preCRMs)
• Identify a genomic region with a regulated gene.
• Find all intervals whose RP score exceeds an empirical threshold.
• Subtract exons.
• Find all matches to GATA-1 binding sites that are conserved (cGATA-1_BS).
• Intervals with RP scores above the threshold and with a cGATA-1_BS within 50 bp are preCRMs.
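The prediction steps above can be sketched as a small filtering pipeline. Everything specific here is an assumption for illustration: the function names, the half-open coordinates, the representation of each conserved GATA-1 site as a single position, the reading of "within 50 bp" as distance to the nearest base of the interval, and the toy threshold of 0.02. The slide does not give the actual threshold or data formats.

```python
def subtract(intervals, exons):
    """Remove exon positions from each interval (half-open coordinates)."""
    out = []
    exons = sorted(exons)
    for start, end in sorted(intervals):
        cursor = start
        for x_start, x_end in exons:
            if x_end <= cursor or x_start >= end:
                continue  # exon does not overlap what remains of the interval
            if x_start > cursor:
                out.append((cursor, x_start))
            cursor = max(cursor, x_end)
        if cursor < end:
            out.append((cursor, end))
    return out


def distance(pos, start, end):
    """Distance in bp from a site position to the interval [start, end)."""
    if pos < start:
        return start - pos
    if pos >= end:
        return pos - end + 1
    return 0


def predict_precrms(rp_intervals, exons, cgata_sites, threshold, max_dist=50):
    """Return intervals that pass the RP threshold, exclude exons, and lie
    within `max_dist` bp of a conserved GATA-1 binding site."""
    high = [(s, e) for s, e, score in rp_intervals if score > threshold]
    noncoding = subtract(high, exons)
    return [
        (s, e)
        for s, e in noncoding
        if any(distance(pos, s, e) <= max_dist for pos in cgata_sites)
    ]


# Toy input (hypothetical coordinates and scores).
rp_intervals = [(1000, 1200, 0.05), (2000, 2300, 0.01),
                (3000, 3400, 0.08), (5000, 5100, 0.09)]
exons = [(3100, 3200)]
cgata_sites = [1230, 3350, 6000]
precrms = predict_precrms(rp_intervals, exons, cgata_sites, threshold=0.02)
```

In the toy input, the interval at 2000 fails the threshold, the exon splits the interval at 3000 into two pieces of which only the second has a site nearby, and the interval at 5000 has no conserved site within 50 bp, so two preCRMs remain: (1000, 1200) and (3200, 3400).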
Test predicted cis-regulatory modules (preCRMs)
• Enhancement in transient transfections of erythroid cells
• Activation and induction of reporter genes after site-directed, stable integration in erythroid cells
• Chromatin immunoprecipitation (ChIP) for GATA-1
[Figure: dual luciferase assay in K562 cells; test fragment with HBG promoter driving firefly (FF) luciferase; tk promoter driving Renilla (Ren) luciferase]
9 of 24 Zfpm1 preCRMs enhance after stable integration at RL5
About half of the preCRMs are validated as functional
Assay                      Number tested   Number positive   % validated
GATA-1 ChIPs               5               5                 100
Transient transfections    64              18                28
Site-directed integrants   54              24                44
All assays                 64              34                53
Conclusions
• Particular types of functional DNA sequences are conserved over distinctive evolutionary distances.
• Multispecies alignments can be used to predict whether a sequence is functional (signature of purifying selection).
• Alignments can be used to predict certain functional regions, including some cis-regulatory elements.
• About half of the predicted cis-regulatory elements for erythroid genes are validated in at least one assay (34 of 64).
• Databases such as the UCSC Table Browser, GALA and Galaxy provide access to these data.
• Expect improvements at all steps.
Many thanks …
PSU Database crew: Belinda Giardine, Cathy Riemer, Yi Zhang, Anton Nekrutenko
Wet Lab: Yuepin Zhou, Hao Wang, Ying Zhang, Yong Cheng, David King
RP scores and other bioinformatic input: Francesca Chiaromonte, James Taylor, Shan Yang, Diana Kolbe, Laura Elnitski
Alignments, chains, nets, browsers, ideas, …: Webb Miller, Jim Kent, David Haussler
Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU
Marsupial genome adds substantially to the conserved fraction of regulatory regions
All preCRMs in Gata2 are functional in at least one assay ChIP data are from publications from E. Bresnick’s lab.
The distal Major regulatory element of the human HBA gene complex is conserved in opossum but not beyond
Neutral DNA “cleared out” over 200 Myr
Most human DNA is not alignable to species separated by more than 200 Myr.
Divergence dates from Kumar and Hedges (Nature 1998) and Hedges (Nature Rev Genet 2002)
[Figure: species compared with human: Chimp, Dog, Cow, Mouse, Rat, Opossum, Platypus, Chick, Frog, Fish]