Comparative genomics Haixu Tang School of Informatics
WGS of human genome • 2001 Two assemblies of initial human genome sequences published • International Human Genome project • Celera Genomics: WGS approach
Model organisms • 1995 Haemophilus influenzae sequenced • 1997 E. coli sequenced • 1998 Complete sequence of the Caenorhabditis elegans genome • 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome
Why model organisms? • Testing and improvements of genome sequencing technology and strategy
Model organisms • 1993 Whole genome shotgun sequencing proposed (J. C. Venter) • 1995 Haemophilus influenzae sequenced, ~1.5-2 Mbp • 1995 Automated fluorescent sequencing instruments and robotic operations (PerkinElmer, Inc.) • 1996 Yeast sequenced • 1996 Double-barrelled sequencing • 1997 E. coli sequenced, ~4 Mbp • 1998 Complete sequence of the Caenorhabditis elegans genome, ~100 Mbp • 1998 Whole genome shotgun sequencing (Weber & Myers) • 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome, ~180 Mbp
Why model organisms? • Testing and improvements of genome sequencing technology and strategy • Model organisms have important biological implications themselves.
Model organisms • 1995 Haemophilus influenzae sequenced (infectious disease) • 1996 Yeast sequenced (industry and biology) • 1997 E. coli sequenced (industry and biotechnology) • 1998 Complete sequence of the Caenorhabditis elegans genome (multi-cellular organism, development) • 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome (genetics, entomology)
Why model organisms? • Testing and improvements of genome sequencing technology and strategy. • Model organisms have important biological implications themselves. • Genome sequences provide useful information to study genome function and evolution.
Model organisms • 1995 Haemophilus influenzae sequenced (bacterial) • 1996 Yeast sequenced (uni-cellular) • 1997 E. coli sequenced (bacterial) • 1998 Complete sequence of the Caenorhabditis elegans genome (multi-cellular organism, nematode) • 2000 Complete sequence of the euchromatic portion of the Drosophila melanogaster genome (multi-cellular organism, insect)
Model mammalian and vertebrate genomes • 2001 Human genome • 2002 Mouse genome • Initial sequencing and comparative analysis of the mouse genome • 2003 Rat genome • 2004 Chicken genome (first bird) • 2005 Chimpanzee genome
Comparative genomics • Solving biological problems by comparing genomic sequences • Function of genes and genomes • Evolution of genes and genomes • Data driven approaches • Computational methods are the core
Which genomes to sequence? • Species with important biological applications • For comparative genomics studies • Functional considerations • Evolutionarily divergent genomes reveal conserved elements, e.g. human vs. mouse (~75% identical) • Evolutionarily close genomes reveal divergent elements, e.g. human vs. chimpanzee (98.4% identical) • Evolutionary considerations • Specific evolutionary puzzles, e.g. whole genome duplications in yeast
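Identity figures such as those quoted above are usually computed over aligned sequence. The following is only a minimal sketch, assuming two sequences that have already been aligned to equal length with "-" marking gaps. The function name and the choice to skip gap columns are illustrative assumptions, not the procedure behind the quoted percentages.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of ungapped alignment columns in which both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: columns containing a gap are not counted
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment (made-up sequences): 3 of 4 ungapped columns match.
print(percent_identity("ACGT", "ACGA"))
```

Whether gap columns count as mismatches or are skipped changes the reported figure, so published identity values are only comparable when the convention is stated.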
Ongoing eukaryotic genome projects • http://igweb.integratedgenomics.com/ERGO_supplement/genomes_eukarya.html • >20 yeast, insects (12 Drosophila, 2 mosquitoes, silkworm), flea, sea urchin, frog, fish (zebrafish, Fugu), mammals (mouse, rat, dog, cow, pig, monkey, etc.), plants (Arabidopsis, rice (>2), maize, etc.)
Comparative genomics: case studies • Gene function and evolution • Gene-gene relationship • Genome evolution
Homologue relationships of genes • Orthologues: any pairwise gene relation where the ancestor node is a speciation event • Paralogues: any pairwise gene relation where the ancestor node is a duplication event
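The two definitions above can be sketched as a lookup of the last common ancestor of two genes in a gene tree whose internal nodes are labelled with their event type. This is an illustrative sketch only: the `Node` class, the function names, and the toy tree (loosely using gene labels H1, H2, M1, M2, M2') are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    name: str
    event: Optional[str] = None  # "speciation" or "duplication" on internal nodes
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> Set[str]:
    """Names of all leaf genes below (or at) this node."""
    if not node.children:
        return {node.name}
    found: Set[str] = set()
    for child in node.children:
        found |= leaves(child)
    return found


def last_common_ancestor(root: Node, a: str, b: str) -> Node:
    """Deepest node whose subtree contains both leaves a and b."""
    node = root
    while True:
        for child in node.children:
            if {a, b} <= leaves(child):
                node = child  # both genes are below this child, descend
                break
        else:
            return node  # no single child contains both genes


def relation(root: Node, a: str, b: str) -> str:
    if a == b or not {a, b} <= leaves(root):
        raise ValueError("need two distinct genes present in the tree")
    lca = last_common_ancestor(root, a, b)
    return "orthologues" if lca.event == "speciation" else "paralogues"


# Hypothetical toy tree: an old duplication, then speciation in each copy,
# then a recent duplication in one lineage.
tree = Node("root", "duplication", [
    Node("s1", "speciation", [Node("H1"), Node("M1")]),
    Node("s2", "speciation", [
        Node("H2"),
        Node("d2", "duplication", [Node("M2"), Node("M2'")]),
    ]),
])

print(relation(tree, "H1", "M1"))   # ancestor node is a speciation
print(relation(tree, "H1", "H2"))   # ancestor node is a duplication
```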
Homologue relationships • [Figure: gene tree along a time axis with duplication and speciation events; nodes A, A1, A2, H1, H2, M1, M2, M2'; gene pairs labelled as orthologues, inparalogues, or outparalogues]
Functional implications • Orthologous genes: same function in different species • Paralogous genes: different functions
Yeast species • 5-20 million years • Sufficient conservation to align • Sufficient divergence to identify conserved functional elements • [Figure: tree of the species cerevisiae, paradoxus, mikatae, bayanus, glabrata, castellii, lactis, gossypii, waltii, hansenii, albicans, lipolytica, crassa, graminearum, grisea, nidulans, pombe, with ~5M and ~20M marked]
Large scale genome evolution • Most genes have a clear match • Clear blocks of synteny
Human–chimpanzee comparisons • Positive selection: a sequence change in a species that results in increased fitness is subject to positive selection. As a consequence, the change normally becomes fixed, leading to adaptive evolution of that species.
Genome vs. Genes • The whole genome sequence can tell not only what genes exist in a genome, but also what genes do not exist (deleted) in a genome.
Phylogenetic profile analysis • A non-homologous approach to gene function prediction • The phylogenetic profile of a gene is a string encoding the presence or absence of the gene in every sequenced genome • The phylogenetic profiles of genes involved in the same biological process are often "similar", since they may co-evolve.
Phylogenetic profile analysis • Phylogenetic profile (against N genomes) • For each gene X in a target genome (e.g., E. coli), build a phylogenetic profile as follows • If gene X has a homolog in genome #i, the i-th bit of X's phylogenetic profile is "1"; otherwise it is "0"
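The construction rule above can be sketched in a few lines. This assumes homolog detection has already been done elsewhere, so each genome is represented simply as the set of target-genome genes that have a homolog in it. The function name, gene names, and toy data are hypothetical.

```python
from typing import List, Set


def phylogenetic_profile(gene: str, genomes: List[Set[str]]) -> str:
    """Bit string whose i-th character is '1' if `gene` has a homolog in genome i."""
    return "".join("1" if gene in homologs else "0" for homologs in genomes)


# Toy input: for each of N = 4 genomes, the set of genes with a homolog there.
genomes = [
    {"geneX", "geneY"},
    {"geneY"},
    {"geneX"},
    set(),
]

print(phylogenetic_profile("geneX", genomes))
print(phylogenetic_profile("geneY", genomes))
```

The order of the genomes must be the same for every gene, since profiles are compared position by position.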
Phylogenetic profile analysis • Example: phylogenetic profiles based on 89 genomes
orf1034:1110110110010111110100010100000000111100011111110110111010101
orf1036:1011110001000001010000010010000000010111101110011011010000101
orf1037:1101100110000001110010000111111001101111101011101111000010100
orf1038:1110100110010010110010011100000101110101101111111111110000101
orf1039:1111111111111111111111111111111111111111101111111111111111101
orf104: 1000101000000000000000101000000000110000000000000100101000100
orf1040:1110111111111101111101111100000111111100111111110110111111101
orf1041:1111111111111111110111111111111101111111101111111111111111101
orf1042:1110100101010010010110000100001001111110111110101101100010101
orf1043:1110100110010000010100111100100001111110101111011101000010101
orf1044:1111100111110010010111010111111001111111111111101101100010101
orf1045:1111110110110011111111111111111101111111101111111111110010101
orf1046:0101100000010001011000000111110000010100000001010010100000000
orf1047:0000000000000001000010000001000100000000000000010000000000000
orf105: 0110110110100010111101101010111001101100101111100010000010001
orf1054:0100100110000001100001000100000000100100100001000100100000000
• Genes with similar phylogenetic profiles have related functions or are functionally linked (D. Eisenberg and colleagues, 1999)
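One simple way to make "similar profiles" concrete is to count the positions at which two profiles differ (Hamming distance) and report gene pairs below a cutoff. This is a sketch under that assumption; the slides do not say which similarity measure was used. The function names, the cutoff, and the toy profiles are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def hamming(p: str, q: str) -> int:
    """Number of genomes in which exactly one of the two genes is present."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(p, q))


def similar_pairs(profiles: Dict[str, str],
                  max_distance: int) -> List[Tuple[str, str, int]]:
    """All gene pairs whose profiles differ in at most `max_distance` positions."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        distance = hamming(p1, p2)
        if distance <= max_distance:
            pairs.append((g1, g2, distance))
    return pairs


# Toy profiles over six genomes (made-up data, not from the example above).
profiles = {
    "geneA": "110100",
    "geneB": "110101",
    "geneC": "001011",
}

print(similar_pairs(profiles, max_distance=1))
```

Genes present in nearly every genome (or in almost none) end up close to each other under this measure whatever their function, so profiles of that kind are usually less informative.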
Genome evolution • Genome rearrangement • Whole genome duplication
Turnip vs Cabbage: Look and Taste Different • Although cabbages and turnips share a recent common ancestor, they look and taste different
Turnip vs Cabbage: Comparing Gene Sequences Yields No Evolutionary Information
Turnip vs Cabbage: Different mtDNA Gene Order • Gene order comparison [Figure: gene order before and after rearrangement] • Evolution is manifested as the divergence in gene order
Comparative Genomic Architecture of Human and Mouse Genomes • To locate where a corresponding gene lies in humans, the relative architectures of the human and mouse genomes were analyzed.
Types of Rearrangements ("|" separates chromosomes) • Reversal: 1 2 3 4 5 6 → 1 2 -5 -4 -3 6 • Translocation: 1 2 3 | 4 5 6 → 1 2 6 | 4 5 3 • Fusion: 1 2 3 | 4 5 6 → 1 2 3 4 5 6 • Fission: 1 2 3 4 5 6 → 1 2 3 | 4 5 6
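The four operations can be sketched on chromosomes represented as lists of signed integers, where the sign encodes gene orientation. The function names and the index conventions (Python slice positions) are assumptions made for this illustration.

```python
from typing import List, Tuple

Chromosome = List[int]


def reversal(chrom: Chromosome, start: int, end: int) -> Chromosome:
    """Reverse the segment chrom[start:end] and flip the sign of each gene in it."""
    return chrom[:start] + [-g for g in reversed(chrom[start:end])] + chrom[end:]


def translocation(a: Chromosome, b: Chromosome,
                  cut_a: int, cut_b: int) -> Tuple[Chromosome, Chromosome]:
    """Exchange the tails of two chromosomes at the given cut points."""
    return a[:cut_a] + b[cut_b:], b[:cut_b] + a[cut_a:]


def fusion(a: Chromosome, b: Chromosome) -> Chromosome:
    """Join two chromosomes end to end."""
    return a + b


def fission(chrom: Chromosome, cut: int) -> Tuple[Chromosome, Chromosome]:
    """Split one chromosome into two at the cut point."""
    return chrom[:cut], chrom[cut:]


print(reversal([1, 2, 3, 4, 5, 6], 2, 5))
print(translocation([1, 2, 3], [4, 5, 6], 2, 2))
print(fusion([1, 2, 3], [4, 5, 6]))
print(fission([1, 2, 3, 4, 5, 6], 3))
```

Each call reproduces the corresponding example on the slide; fission with the same cut point undoes a fusion.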
Comparative Genomic Architectures: Mouse vs Human Genome • Humans and mice have similar genomes, but their genes are ordered differently • ~245 rearrangements • Reversals • Fusions • Fissions • Translocations
Hypothesis (1997): Whole Genome Duplication? • [Figure: tree of the species cerevisiae, paradoxus, mikatae, bayanus, glabrata, castellii, lactis, gossypii, waltii, hansenii, albicans, lipolytica, crassa, graminearum, grisea, nidulans, pombe, with "?" and ~100M marked]
Hypothetical resolution of WGD • A 1:2 mapping where • nearly every region in species Y would correspond to two sister regions in S. cerevisiae • the two sister regions in S. cerevisiae would contain ordered interleaving subsequences of the genes in the corresponding region of species Y • nearly every region of S. cerevisiae would correspond to one region of species Y, and thus be paired to a sister region in S. cerevisiae
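The interleaving condition in the 1:2 mapping can be sketched as a check on gene order: each sister region, once its genes are relabelled by their match in species Y, should be an ordered subsequence of the Y region, and together the two sisters should cover nearly all of it. This is an illustrative sketch only. The function names, the `min_coverage` threshold, and the toy gene labels are assumptions, and a real analysis would also need to tolerate local rearrangements.

```python
from typing import Iterable, Sequence


def is_ordered_subsequence(sub: Iterable[str], full: Iterable[str]) -> bool:
    """True if the genes of `sub` appear in `full` in the same relative order."""
    remaining = iter(full)
    # `gene in remaining` consumes the iterator up to the match, enforcing order.
    return all(gene in remaining for gene in sub)


def consistent_with_duplication(region_y: Sequence[str],
                                sister_1: Sequence[str],
                                sister_2: Sequence[str],
                                min_coverage: float = 0.9) -> bool:
    """Check the ordered-interleaving pattern expected after a whole genome duplication."""
    if not region_y:
        return False
    ordered = (is_ordered_subsequence(sister_1, region_y)
               and is_ordered_subsequence(sister_2, region_y))
    covered = (set(sister_1) | set(sister_2)) & set(region_y)
    coverage = len(covered) / len(set(region_y))
    return ordered and coverage >= min_coverage


# Toy region of species Y and two hypothetical sister regions in S. cerevisiae.
region_y = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
sister_1 = ["g1", "g3", "g4", "g7"]
sister_2 = ["g2", "g4", "g5", "g6", "g8"]

print(consistent_with_duplication(region_y, sister_1, sister_2))
```

In the toy data, g4 is retained in both sisters while every other gene survives in only one, which matches the picture of widespread loss of one copy after duplication.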
Aligning the S. cerevisiae and K. waltii genomes • Most regions in K. waltii mapped to two regions in S. cerevisiae with each containing matches to only a subset of the K. waltii genes
What happens to genes post WGD? • 12% (457) of paralogous gene pairs were retained • 76 of the 457 gene pairs (17%) show accelerated protein evolution