730 likes | 918 Views
Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu. " Nothing in biology makes sense except in the light of evolution ." Theodosius Dobzhansky, 1932. What is comparative genomics.
E N D
Comparative Genomics Sridhar Hannenhalli sridhar@umiacs.umd.edu "Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky, 1932
What is comparative genomics A comparative investigation of functional elements ( genes, regulatory elements) and biological networksacross multiple species, or evolutionarily related systems.
The basic premise of comparative genomics is that • Living things are related by evolution • The essential properties of biological entities and processes are likely to be conserved across similar organisms • The organism-specific genomic characteristics might reveal genetic bases for phenotypic differences between organisms.
Contexts of comparative genomics Protein family Gene sequence Structure Gene networks Gene order Non-coding regions
Some of the questions that comparative genomic may help address • What genomic regions are likely to be serve an essential biological role? • What genomic regions might underlie species-specific phenotypic differences? • What are the core set of proteins essential to a group species? • Despite sequences changes, are genetic interaction conserved? • Which genes or functional elements might have evolved adaptively?
Different Questions Require Different Comparisons From: Hardison. Plos Biology. Vol 1 (2): 156-160
A General Framework for Comparative Genomics • Comparing a biological entity or process between individuals or species • Why do we want to compare? • How do we compare? (actual data, or model?) • How do we assess the significance of comparison score? • Both similarity and non-similarity can be of interest
Overview of topics • Gene identification in yeast Kellis et al. Nature 423, 241-254 (2003). • Selection Clark, A. G. (2006). "Genomics of the evolutionary process." Trends EcolEvol21(6): 316-321. • KaKs test for selection in protein coding genes • Tree-HMM to identify conserved regions • Selection on epigenomic properties • Phylogenetic profiles Pellegrini et al. PNAS 96, 4285-4288 (1999) • Network alignment Kelley et al. (2003). "Conserved pathways within bacteria and yeast as revealed by global protein network alignment." PNAS 100(20): 11394-11399.
Sequencing and comparison of yeast species to identify genes and regulatory elements. Kellis et al. Nature 423, 241-254 (2003).
….a bit of digression….. Ab Initio Gene Prediction CATGTTTCCACTTACAGATGCTTCAAAAAGAGTGTTTCATAACTGCTCTATGAAAAGGAATGTTCAACTC TGTGAGTTAAATAAAAGCATCAAAAAAAAGTTTCTGAGAATGCTTCTGTCTAGTTTTTATGTGAAGATAT TTCCATTTTCTCTATAAGCCTCAAAGCTGTCCAAATGTCCACTTGCAGATACTACAAAAAGAGTGTTTCA AAAGTGCTCAATGAAAAGGAATGTTCAGCTCTGTGAGTTAAATGCAAACATCACAAATAAGTTTCTGAGA ATGCTTCTGTCTAGTTTTTATGGGAAGATAATTCCGTGTCCAGCGAAGGCTTCAAAGCTTTCAAAATATC CACTTGCAAATTCTACAAAAAGAGTGTTTCAAAGCTGCTTTATCAAAAGAAAGTTTCAACTCTGTGAGTT GAATGTGCACATCACAAAGAAGTTTCTGAGAATGCCTTCAGTCTGGTTTTTATGTGAAGATATTCCCTTT TCCAACGAAAGCCTCGAAGCTGTCCAAATATCCACTTGTAAGTGCTGCAAAAAGAGTGTTTCAAAACTGC TACAGCAAAAGAAAGGTTTATCTCTGTGAGTTGAGTAGACACATCAAGAAGAAATTTCTGAGAATGCTTC TGTCTAGTTTTTATGTGAAGATATTTCCTTTGTCACCATAGGCCTCCAAGCCCTCCAAATGTCCACTTGC AGATGCTACAAAAAGAGTGTTTCAAAACTGCTGTATGAAAAGAAATGCTCAAATCTGTGAGATAAATGCA TACATCACAAAGAAGTCTTTGAGAATGCTTCTGTCTAGTTTTTATGTTAAGATATTTCCTATTTCACCAT ACGTCTCAACGCACACAAAATGTACACTTGCAGATGCTACAAAGAGAGTGTTTCAAAACTTGTAGATCAA AACAAGTGTTCAACTTTGTGAGTTGAGGACACACATCTGAAAGAAGTTTCTGAGAATGCTTCTGTCTAGT TTTTATGTGAAGATATTCCCGTTTCCAGCGAAAGCCCCAAAACTATCCAAATATCCACTTGCACATTCTA CAAAAAGAGTGTTTCAAATCTGCTCTATCAAAATAAAGGTTCAACTCTGTGAGTTGACTACACACATCAC AAAGAAGTTTCTAAGAATGCTTCTGTCTGGTTTTTATGGGAAGATATTTCCTTTTTCAACATAGGCCTTG CAGCATCTACAAAAAGAGTTTTTCAAAACTCCTCTAAGAAAAGGAATGATCAACTCCATGAGTTTAATGC AAAGATCACAAAGAAGTTTCTGAGAATGCTTCTGTCTAGTTTTAACCTGAAGACAGTTCCGTTTCCAGTG AAGGCTTGAAAGCTGTCCAAATATCCACATGCAAATTCTCAAAACGAGTGTTTAAAAGCTGCTCTATCAC TAGAAAGTTTCACCTCTGTGAGCTGAATGCACACAGAAGTTTCTGAGAATGCTTCTGTCTGGTTTTTATG TGAAGATATTCCCGTTTCCAACCAAAGCCTCAAAGCTGTCCAAATATCCATTTGCAGATCCTACAGGGAG AGTGTTTCAAAACTGCTCTATAAAAAGAAAGGTTCTACTCTGTGAGTGGAGTACACACATCACAAAGAAG TTTCTGAGAATGCTTCTGTCTAGTTTTTATGTGAAAATAGTTCGTTCTCCAAAATAGTCCTCAAAGCGCT CCAAATGTCCACTTGCAGATTCTACAAAAAGAGTGTTTCAAAACTGCTCTATGAAAAGGAATGTTCAACT
Markov Model Q2 Q1 Q2 Q1 Q4 Q1 Q1 Q2 Q4 Q3 Q3 Q4 Finite State Set : Q1, Q2, Q3, Q4 Random variable X takes values from the set X(n) : State of X at time n, X(1) = Q2, X(2) = Q1 and so on…. T(i,j) : Transition probability from State i to State j = P(X(n+1) = Qj | X(n) = Qi) Output generated by the Model is a State-sequence
Hidden Markov Model Q2 Q1 Q4 Q1 Q1 Q2 Q4 Q3 Q2 Q1 ACGGATGGCGGGCTGATGATGTAGGGGAAT Q3 Q4 The sequence generated by each state is governed by a state-specific probability function Stochastic Process/ Prob. function ATGCCCGC
A simple HMM to model intronless genes INTERGENIC START Generates A,C,G or T p Generates ATG or GTG 1-p 1 1 1-q q Generates a stop codon Generates a non-stop codon STOP CODING ATG {a sequence of codons except stop codons} TAA Open Reading Frame or ORF
Consider a stochastic process where codons are chosen at random among the 64 possibilities Success is defined as choosing a stop codon => p = 3/64 Define random variable L = number of trial until first success P(L = u) = (1-p)u-1p (Geometric distribution) P(L > u) = (1-p)u-1 E(L) = 1/p ~ 20 Var(L) = (1-p)/p2 P(L = 100) = 0.0004 A ‘long’ ORF is highly likely to be a gene
….back to the ‘comparative’ genomics….. Genome scale alignments Human Global Alignment Mouse Local Alignment Establish synteny Genome Alignment Align syntenic regions
Main challenges with genome alignment • Computational limitations • Incomplete Data • What is the right parameter – poor evolutionary model • Lack of gold standards
Genome Alignment procedures can be generalized as a hierarchical process • Compute anchors (high scoring local alignments) • Chain/Link the anchors • Refine alignment (e.g. Needleman Wunsch)
AVID Bray et al., 2003 • Find all exact matches • Filter matches(keep the matches at least half as long as the longest match ) • Compute optimal chain of matches (Needleman-Wunsch-style) • Within each consecutive match pair in the optimal chain • Back to step 2 and repeat OR • Apply Needelman-Wunsch
The LAGAN algorithm (Brudno et al. 2003) Get local alignments Gap filling using Needleman-Wunsch Optimal Chaining
Yeast genome alignment Identify ORFs, BLAST all versus all SC SB Remove all edges with scores less than 80% of the node max SC SB Remove edges in conflict with synteny SC SB Remove edges not preferred by either node SC SB
Accuracy of orthology mapping • All but 211 of the 6,235 SC ORFs were assigned unambiguously • These 211 genes fall in 32 clusters, almost all of which are in telomere
All 20 inversions are flanked by tRNA genes in opposite orientation • 7 reciprocal translocations have Ty elements and 3 have pairs of highly similar ribosomal protein genes • A majority of rearrangements are in telomeric regions
Q: Is this surprisingly conserved? SC SP SM SB A: Well it depends (on what?)
ORF-based tree • Distance-based tree reconstruction methods: • UPGMA • Neighbor-Joining Intron-based tree
Species specific calibration • Relative to S. Cerevisiae percent identity in • Paradoxus 90% coding 80% non-coding • Mikatae 84% coding 62% non-coding • Bayanus 80% coding 62% non-coding • Coding versus non-coding discrimination • In 4-way alignment • Mutated sites: coding 30% non-coding 58% • Indels: coding 1.3% non-coding 14% • 1,2 indels: coding 0.14% non-coding 10.2%
Reading Frame Conservation (RFC) Test • Label ref species bases 1,2,3,1,2,3… • Label other species 1,2,3 then 2,3,1.. then 3,2,1.. And use the best • Count the number of aligned positions in frame (same label) • Average of all 100 bps window with 50 jump
RFC 80 Use RFC score of 80% ,75% and 70% as cutoff for the three species. Each species votes +1, 0 (absent) and -1. Sum of votes >= 1 qualifies an ORF as a gene Negative control 96% percent of the windows were rejected in introns. 3% do contain an ORF based on manual inspection => 1% false positive Summary of gene list update ~400 genes removed ~50 novel short genes What about the genes with introns?
In summary • Present an effective way to identify new protein coding genes and correct old annotations • Identify gene order differences between species, and present some evidence for mechanism of inversion through non-homologous recombination • Identify specific genes that are fast evolving or extremely conserved
Overview of topics • Gene identification in yeast • Selection • KaKs test for selection in protein coding genes • Tree-HMM to identify conserved regions • Selection on epigenomic properties • Phylogenetic profiles • Network alignment
Mutations create variants and are critical for evolution • Deleterious mutations are likely to be purged from the populations (purifying or negative selection) • Advantageous variants tend to increase in frequency and eventually replace other variants (adaptive or positive selection) • Study of selection involves testing whether observed evolutionary changes are more consistent with a neutral evolution or with either a purifying or adaptive selection
Test of Neutral evolution of coding genesThe KaKs test Synonymous substitution No change Multiple Non-Synonymous substitution Thr Leu His Ser ACG CTC CAT TCT ACG CTT CAA AGT Thr Leu Gln Ser Non-Synonymous substitution • If Non-synonymous mutation rate > synonymous substitution rate it implies positive selection • If Non-synonymous mutation rate < synonymous substitution rate it implies purifying selection • If Non-synonymous mutation rate = synonymous substitution rate it implies neutral evolution
Number of Synonymous and non-synonymous sites Non-synomymous 1/3 Synonymous 2/3 Non-synomymous His C A T Non-synomymous S = 0 + 0 + 1/3 = 1/3 N = 1 + 1 + 2/3 = 8/3 S + N = 9/3 = 3 (total number of nucleotides) Non-synomymous Ile A T T 2/3 Synonymous 1/3 Non-synomymous Non-synomymous S = 0 + 0 + 2/3 = 2/3 N = 1 + 1 + 1/3 = 7/3 S + N = 9/3 = 3 (total number of nucleotides) The number of syn and non-synonymous sites are averaged for the two aligned codons CAT S = (1/3 + 2/3)/2 = 1/2 ATT N = (8/3 + 7/3)/2 = 5/2 S + N = 3 (total number of nucleotides)
Number of Synonymous and non-synonymous mutations TAT TAT TAT TAC TCT TCC 1 syn change Tyr TAC 2 non-syn changes?? Tyr TAT Ser TCC Ser TCT 1 non-syn change Ser TCT Tyr TAC Both paths involve 1 syn substitution And 1 non-syn substitution. This averages out to be 1 syn and 1 non-syn substitutions Codon Sd Nd total 1 1 0 1 2 0 1 1 3 1 1 2 Sd + Nd equals the number of mismatches between the codon pair
Number of Synonymous and non-synonymous mutations 2 non-syn changes?? CTA GTT Leu CTA Ile GTT Val GTA First path involves 2 non-syn substitutions Second path involves 1 syn and 1 non-synonymous substitutions This averages out to be 0.5 syn and 1.5 non-syn, for a total of 2 substitutions. To be fixed Leu CTT Codon Sd Nd total 1 0.5 1.5 2
KS is the relative rate of synonymous mutations per synonymous site KA is the relative rate of nonsynonymous mutations per non-synonymous site = KA/KS If = 1, neutral evolution If < 1, purifying selection If > 1, positive Darwinian selection A log-likelihood measure is used to asses the significance of Often, different sections of a gene will have different KA/KS Detecting genes with unusual values for KA & KS often leads to interesting discoveries
A scan for positively selected genes in the genomes of humans and chimpanzees. Nielsen et al. 2005 • 13,731 human-chimp orthologs were analyzed for evidence of positive selection. • Genes that show the strongest evidence for positive selection are involved in tumor suppression, apoptosis, and spermatogenesis. • Genes with maximal expression in the testis tend to be enriched with positively selected genes. • Genes on the X chromosome also tend to show an elevated tendency for positive selection.
Regions of high conservation in human genome PhastCons Siepel et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034-1050 (2005)
Path 1 Path 2 Si Sn S1 D T Human Mouse Rat T C G C G A C A T A T A C G A T T G G G G C A T G T G G G T A G C A G A C G T C C G C A A Conserved Non-conserved The goal is to determine the state assignments for each site, i.e., the path, that maximizes the likelihood of observing the data.
Dn Di D1 Si Sn S1 T T C G C G A C A T A T A C G A T T G G G G C A T G T G G G T A G C A G A C G T C C G C A A Data States
Conserved = C, Non-conserved = N Si-1 Si Sn S1
Substitution Matrix Q (kimura, HKY model etc.) X={A,C,G,T} t2 t1 Y={A,C,G,T} t3 t4 C C A This can be computed efficiently using the pruning algorithm (dynamic programming)
T C G C G A C A T A T A C G A T T G G G G C A T G T G G G T A G C A G A C G T C C G C A A Non-conserved State Conserved State t2 t1 rt2 rt1 t3 t4 rt3 rt4 A C C C C A The tree representing the conserved state is a scaled version of the non-conserved tree. Scaling factor – r, is the unknown parameter to be estimated.
Parameters • States s1 …. Sn • State frequencies • State transition probabilities • Conservation scaling factor – r • Branch lengths for un-conserved tree model Parameters can be estimated using an Expectation-Maximization (EM) method