Genome Biology and Biotechnology 2. The genome structures of invertebrates Prof. M. Zabeau Department of Plant Systems Biology Flanders Interuniversity Institute for Biotechnology (VIB) University of Gent International course 2005
Sequenced genomes of invertebrates • Nematodes • Caenorhabditis elegans (1998) • Caenorhabditis briggsae (2003) • Insects • Drosophila melanogaster - fruit fly (2000) • Drosophila pseudoobscura - fruit fly (2005) • Anopheles gambiae - mosquito (2002) • Bombyx mori - silkworm (2004) • Tunicates: ancestral vertebrate genome • Ciona intestinalis (2002)
Phylogeny of the invertebrates [Figure: phylogenetic tree with divergence times labelled ~800 MY, >1000 MY and 550 MY]
Genome Sequence of the Nematode C. elegans • Paper presents • The first complete genome sequence of a multicellular organism • The initial sequence covered 97 Mbp (6 gaps) • The complete sequence (June 2003) comprises 100.2 Mbp without gaps The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
Protein-Coding Genes • First large-scale genome sequence annotation • Gene structure predictions were based on EST and protein similarities • Only 40% of the predicted genes had a confirming EST match • The first annotation predicted 19,099 genes • An average density of 1 predicted gene per 5 kb • 27% of the genome resides in predicted exons • Each gene has an average of five introns • WormBase: updated and manually curated gene set • Currently contains 18,808 genes Reprinted from: The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
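The density figure on this slide can be checked with simple arithmetic. The sketch below is only an illustration that uses the numbers quoted above (97 Mbp initial sequence, 19,099 predicted genes, 27% exonic); the function names are made up for this example and do not come from the paper.

```python
# Figures quoted on the slide (C. elegans Sequencing Consortium, 1998).
GENOME_BP = 97_000_000     # initial sequence, ~97 Mbp
PREDICTED_GENES = 19_099   # genes in the first annotation
EXON_FRACTION = 0.27       # share of the genome in predicted exons


def kb_per_gene(genome_bp: int, n_genes: int) -> float:
    """Average stretch of genome per predicted gene, in kilobases."""
    return genome_bp / n_genes / 1000


def exon_bp_per_gene(genome_bp: int, n_genes: int, exon_fraction: float) -> float:
    """Average amount of predicted exon sequence per gene, in base pairs."""
    return genome_bp * exon_fraction / n_genes


density = kb_per_gene(GENOME_BP, PREDICTED_GENES)
exonic = exon_bp_per_gene(GENOME_BP, PREDICTED_GENES, EXON_FRACTION)

print(f"{density:.2f} kb of genome per predicted gene")
print(f"{exonic:.0f} bp of predicted exon per gene")
```

With these inputs the density comes out at about 5.1 kb per gene, in line with the "1 gene per 5 kb" on the slide, and the exonic sequence at roughly 1.4 kb per gene.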
RNA genes and repetitive sequences • RNA genes • rRNA genes: occur in long tandem arrays • tRNA genes: 659 tRNA genes occur widely dispersed • Noncoding RNA genes: in dispersed multigene families • MicroRNA genes (miRNA) • ~100 identified to date • Repetitive Sequences • Dispersed repeat sequences • Most of them are associated with transposons of C. elegans, which are probably no longer active in the genome • Local repeat sequences • Tandem, inverted, or simple sequence repeats Reprinted from: The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
Chromosome Structure and Organization • The genome structure is remarkably uniform • Gene density is fairly constant across the chromosomes • No localized centromeres: the chromosomes are holocentric • In contrast to most other eukaryotes • Differences between the central portion and the arms of the chromosomes • The conserved eukaryotic genes are in the central portion • Repetitive DNA is more prevalent in the arms • Meiotic recombination is much higher on the chromosome arms • This suggests that DNA in the arms might be evolving more rapidly than in the central regions Reprinted from: The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
Distribution of sequence elements on Chromosome I [Figure: tracks for TTAGGC repeats, tandem repeats, inverted repeats, yeast similarities, EST matches and predicted genes, plotted along the chromosome (arm, central part, arm)] Reprinted from: The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
Conclusions • The complete sequence of the C. elegans genome has • provided a basis for the discovery of all the genes of a multicellular eukaryotic organism • First inventory of eukaryotic genes • C. elegans is a very effective model organism for • eukaryotic gene analysis: widely used for functional genomics • human disease gene research • nematode pest control research Reprinted from: The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics • Paper presents • High-quality draft (>10-fold coverage) sequence of C. briggsae • Comparative genome analysis of C. briggsae and C. elegans • The two species diverged ~100 million years ago • morphologically indistinguishable • same chromosome number (6) and similar genome size (104 and 100 Mb) • Comparisons of the genomes of related species allow • More precise annotation of protein-coding genes • Discovery of noncoding genes, regulatory sequences and “unknown” functional elements Stein et al., PLoS Biol 1: 166-192 (2003)
Collinearity of the C. briggsae and C. elegans Genomes • Alignment of sequences • ~80% collinearity • inversions and translocations • blocks of synteny • orthologous genes Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
Annotation of Protein-Coding Genes • Concordance of gene predictions refines gene models • C. elegans gene annotation improvement • >6,000 genes (30%) with exon additions, deletions or alterations • 1,300 new genes • 18,808 protein-coding genes in C. elegans • 19,507 protein-coding genes in C. briggsae [Figure label: most concordant] Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
Comparison of Protein-Coding Genes • ~65% are orthologs in C. briggsae / C. elegans • gene pairs with a one-to-one correspondence in the two species • have a common ancestor • have similar gene and coding sequence lengths • show ~80% identity at the protein level • ~25% are paralogs in C. briggsae / C. elegans • proteins with multiple BLASTP matches in the other species • Evolving gene families • ~5% are orphans in C. briggsae / C. elegans • proteins that have no BLASTP matches in the other species • 807 genes in C. elegans and 1,061 in C. briggsae • Novel genes or pseudogenes? Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
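The three classes on this slide are defined by how many matches a protein has in the other species. The sketch below is a deliberately simplified illustration of that rule, not the pipeline used by Stein et al.: it assumes the BLASTP matches have already been collected into dictionaries, and the protein names and the function `classify` are hypothetical.

```python
def classify(hits_a_to_b: dict, hits_b_to_a: dict) -> dict:
    """Label each protein of species A by its matches in species B.

    hits_a_to_b maps a protein in A to the set of proteins in B it matches;
    hits_b_to_a is the same table in the other direction.
    """
    classes = {}
    for protein, matches in hits_a_to_b.items():
        if not matches:
            # No match at all in the other species.
            classes[protein] = "orphan"
        elif len(matches) == 1:
            (partner,) = matches
            # One-to-one only if the partner matches nothing but this protein.
            if hits_b_to_a.get(partner, set()) == {protein}:
                classes[protein] = "ortholog"
            else:
                classes[protein] = "paralog"
        else:
            # Several matches: member of a multigene family.
            classes[protein] = "paralog"
    return classes


# Toy data with made-up identifiers.
elegans_hits = {"ce1": {"cb1"}, "ce2": {"cb2", "cb3"}, "ce3": set(), "ce4": {"cb2"}}
briggsae_hits = {"cb1": {"ce1"}, "cb2": {"ce2", "ce4"}, "cb3": {"ce2"}}

result = classify(elegans_hits, briggsae_hits)
for name, label in sorted(result.items()):
    print(name, label)
```

In the toy data, ce1 pairs one-to-one with cb1 and is called an ortholog, ce2 and ce4 belong to a shared family and are called paralogs, and ce3 has no match and is an orphan. A real analysis would also need score thresholds and alignment-length filters, which are left out here.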
Conservation of Operon Structure • C. elegans is unusual among animals in having operons • co-transcribed genes that make a polycistronic pre-mRNA • subsequently separated into single-gene mRNAs by trans-splicing • ~15% of C. elegans genes are encoded in ~1000 operons • contain 2-8 genes • 96% of the operons are preserved intact in the C. briggsae genome • C. elegans operons comprise • co-regulated genes encoding proteins with related functions • specific functional classes of genes • Transcription • RNA splicing • Translation • RNA degradation Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
Repetitive sequences • The different genome sizes result from • Differences in repeat content • 23.3 Mbp of the C. briggsae genome (104 Mbp) • 16.5 Mbp of the C. elegans genome (100.3 Mbp) • Repeated DNA families • comprise DNA transposons or tandem arrays • Not orthologous between the two genomes • suggests that most repeat elements in the two genomes postdate the divergence of the two species • Accumulation of new repetitive elements is balanced by deletions so that • genome sizes remain similar Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
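The repeat figures on this slide are easier to compare as fractions of each genome. The following sketch only reworks the numbers quoted above; the dictionary layout and the function name are choices made for this illustration.

```python
# Figures quoted on the slide (Stein et al., 2003), in Mbp.
GENOMES = {
    "C. briggsae": {"size": 104.0, "repeats": 23.3},
    "C. elegans": {"size": 100.3, "repeats": 16.5},
}


def repeat_fraction(size_mbp: float, repeats_mbp: float) -> float:
    """Share of the genome made up of repetitive sequence."""
    return repeats_mbp / size_mbp


fractions = {
    name: repeat_fraction(g["size"], g["repeats"]) for name, g in GENOMES.items()
}
size_gap = GENOMES["C. briggsae"]["size"] - GENOMES["C. elegans"]["size"]
repeat_gap = GENOMES["C. briggsae"]["repeats"] - GENOMES["C. elegans"]["repeats"]

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%} repetitive")
print(f"genome size difference: {size_gap:.1f} Mbp")
print(f"repeat content difference: {repeat_gap:.1f} Mbp")
```

On these numbers about 22% of the C. briggsae genome and about 16% of the C. elegans genome is repetitive, and the difference in repeat content (6.8 Mbp) is larger than the difference in genome size (3.7 Mbp).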
Chromosome Structure and Organization • The centers contain orthologous (1) and essential genes (2) • Very long synteny blocks • The arms contain orphan genes (3) and repetitive elements (4) • Short synteny blocks • The arms of the chromosomes are evolving more rapidly than the centers [Figure: panels 1-4, matching the numbers above] Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
Conclusions • The C. briggsae / C. elegans comparison shows that • despite large differences at the genomic level, C. briggsae and C. elegans are morphologically almost indistinguishable • Many protein families are very dynamic • ~200 families have expanded or contracted by >2-fold • several hundred families are either novel or have diverged extensively • The two genomes share only ~50% of the non-coding sequence • Sequencing of additional species is necessary to • identify candidate cis-regulatory elements based on sequence conservation • the noise level in a two-way comparison is too high Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
The Genome Sequence of Drosophila melanogaster • Draft sequence (2000) • Whole-genome shotgun sequencing • Sequence contained 128 physical gaps and 1630 sequence gaps • Some regions were of poor sequence quality • Demonstrated that whole-genome shotgun sequencing can be used for large eukaryotic genomes • Adams et al., Science, 287, 2185 (2000) • Finished sequence (2002) • BAC clone sequencing and gap filling • Sequence contains 7 physical gaps and 37 sequence gaps • Very accurate sequence: error rate of <1 in 100,000 • Celniker et al., Genome Biol. 3: research 0079.1–0079.14 (2002)
The Drosophila Genome • The (female) Drosophila genome is ~176 Mb in size • Euchromatic part: 117 Mb completely sequenced • Heterochromatic part: partly (~20 Mb) sequenced (unassembled) • Female: estimated at ~59 Mb • Male: the 40 Mb Y chromosome is completely heterochromatic Reprinted from: Adams et al., Science, 287, 2185 (2000)
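The size estimates on this slide add up as follows. This is a small consistency check on the quoted figures only; the variable names are chosen for this example.

```python
# Figures quoted on the slide, in Mb.
EUCHROMATIN_MB = 117               # completely sequenced
HETEROCHROMATIN_FEMALE_MB = 59     # estimated size in the female genome
HETEROCHROMATIN_SEQUENCED_MB = 20  # sequenced but unassembled

# Female genome = euchromatic part + heterochromatic part.
female_total_mb = EUCHROMATIN_MB + HETEROCHROMATIN_FEMALE_MB

# Share of the female genome that is euchromatic.
euchromatic_share = EUCHROMATIN_MB / female_total_mb

# Share of the female heterochromatin that has been sequenced.
heterochromatin_sequenced_share = (
    HETEROCHROMATIN_SEQUENCED_MB / HETEROCHROMATIN_FEMALE_MB
)

print(f"female genome: {female_total_mb} Mb")
print(f"euchromatic share: {euchromatic_share:.1%}")
print(f"heterochromatin sequenced: {heterochromatin_sequenced_share:.1%}")
```

The two parts sum to the ~176 Mb stated for the female genome. About two thirds of that is euchromatin, and roughly one third of the heterochromatin had been sequenced.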
Euchromatin and Heterochromatin • Euchromatin • Gene-rich portion of the genome • Condenses during mitosis and decondenses thereafter • Portion of the genome that can be cloned stably in BACs • Heterochromatin • Consists mainly of simple sequence repeats (satellite DNAs), transposable elements, and tandem arrays of rRNA genes • Remains condensed after mitosis • Gene-poor portion of the genome • Contains elements required for centromere function • Euchromatin - heterochromatin transition • is gradual at the molecular level Reprinted from: Adams et al., Science, 287, 2185 (2000)
Euchromatic Genome Sequence [Figure: chromosome map showing transposons and the centromere] Reprinted from: Celniker et al., Genome Biol. 3: research 0079.1–0079.14 (2002)
Gene Content of the Drosophila Genome • Annotation of the draft genome sequence • Predicted 13,601 genes • >10,000 genes (>75%) supported by EST and protein matches • This annotation was incomplete • Large number of sequence gaps and sequencing errors • Annotation of the finished genome sequence • Predicted nearly the same number of genes: 13,676 • Majority (85%) of the gene models revised • Improved by a collection of 250,000 ESTs and full-length cDNAs • Found only 17 pseudogenes (far fewer than in C. elegans) • Heterochromatic part may contain ~500 genes • The 20 Mb sequenced contains ~300 protein-coding genes • Reannotation reveals many complex gene models • genes that do not fit the simple 5’UTR - exons - 3’UTR structure Reprinted from: Adams et al., Science, 287, 2185 (2000)
Complex Gene models • Alternative splicing or alternative polyadenylation • At least ~20% of genes have >1 predicted transcript • 65% encode two or more protein products • 35% differ in the UTRs - most have different 5’UTRs: alternative promoters Reprinted from: Misra et al., Genome Biology, 3: research 0083.1-0083.22 (2002)
Complex Gene models • Dicistronic genes: 2 non-overlapping coding regions on one mRNA • The 31 dicistronic gene pairs found are probably an underestimate Reprinted from: Misra et al., Genome Biology, 3: research 0083.1-0083.22 (2002)
Complex Gene models • Overlapping genes • overlap of mRNAs on opposite strands: 15% of the genes • Nested genes • genes included within introns of other genes: 15% of the genes Reprinted from: Misra et al., Genome Biology, 3: research 0083.1-0083.22 (2002)
Conclusions • The Drosophila genome sequence reveals • genes and proteins common to all multicellular organisms • proteins involved in transcription control and metabolism are very similar to their human counterparts • Drosophila provides an experimental platform for • the study of human disease genes involved in • DNA replication and repair • Metabolism of drugs and toxins Reprinted from: Adams et al., Science, 287, 2185 (2000)
Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution • Paper presents • High-quality draft genome sequence of a second Drosophila species, Drosophila pseudoobscura • Comparison with the genome sequence of D. melanogaster • Evolutionary distance is well suited to study • Conserved and diverged genes • Conserved regulatory elements • Mechanisms of genome rearrangement Richards et al., Genome Res. 15: 1-18 (2005)
The D. pseudoobscura genome • The euchromatic part is estimated at 131 Mb • ~17% larger than that of D. melanogaster • the additional sequence is • primarily found in the intergenic regions • only partly caused by expansion of repeated DNA • The two species show very high gene synteny • Synteny blocks were identified • on the basis of conservation of protein order • ~10,500 of ~14,000 genes are true orthologs • All synteny blocks are short and extremely mixed • extensive genome rearrangement in the two Drosophila lineages Reprinted from: Richards et al., Genome Res. 15: 1-18 (2005)
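Identifying synteny blocks from conserved gene order can be illustrated with a toy example. The sketch below is a simplified stand-in, not the method of Richards et al.: it assumes the genes of one species are already listed in chromosomal order together with the position of each ortholog in the other species, and it calls a block any maximal run in which those positions stay adjacent and keep one direction. The function name and the input list are hypothetical.

```python
def synteny_blocks(positions_in_b: list) -> list:
    """Split genes of species A into runs with conserved order in species B.

    positions_in_b[i] is the position in species B of the ortholog of the
    i-th gene of species A. A run continues while consecutive orthologs are
    neighbours in B (step of +1 or -1) and the direction does not change.
    Returns a list of blocks, each a list of indices into species A.
    """
    if not positions_in_b:
        return []
    blocks = []
    current = [0]
    direction = 0  # 0 = not yet known, +1 = same order, -1 = inverted
    for i in range(1, len(positions_in_b)):
        step = positions_in_b[i] - positions_in_b[i - 1]
        if step in (1, -1) and direction in (0, step):
            current.append(i)
            direction = step
        else:
            # Order is broken: close the block and start a new one.
            blocks.append(current)
            current = [i]
            direction = 0
    blocks.append(current)
    return blocks


# Toy chromosome arm: genes 3-5 of species A lie in an inverted segment.
example = [0, 1, 2, 7, 6, 5, 3, 4]
blocks = synteny_blocks(example)
print(blocks)  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```

The middle block runs in the opposite direction in the second species, which is the pattern an inversion within a chromosome arm would leave. Real data would also have to handle missing orthologs and small gaps, which this sketch ignores.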
The synteny between D. pseudoobscura and D. melanogaster • The great majority of syntenic blocks are found • on the same chromosome arms in the two species • Chromosomal rearrangements in the two species • Almost exclusively paracentric inversions Reprinted from: Richards et al., Genome Res. 15: 1-18 (2005)
Intraspecific inversion breakpoints • Repetitive sequences at the inversion breakpoints • Frequently comprise a breakpoint motif • The breakpoint motif is found only in D. pseudoobscura Reprinted from: Richards et al., Genome Res. 15: 1-18 (2005)
Conservation of gene segments • Sequence conservation in noncoding regions • Is insufficient for the identification of regulatory sequences • Multiple genome sequence alignments will be needed Reprinted from: Richards et al., Genome Res. 15: 1-18 (2005)
The Genome Sequence of the Malaria Mosquito Anopheles gambiae • The papers present • Draft genome sequence of the PEST strain of A. gambiae • A comparison of the genomes and proteomes of Anopheles and Drosophila • Two very different Diptera that diverged ~250 MY ago Sequence: Holt et al., Science, 298: 129-149 (2002) Comparison: Zdobnov et al., Science, 298, 149 (2002)
The Mosquito Genome Sequence • The draft genome spans 278 Mb • Covers the entire genome including the heterochromatic DNA • Mosquitoes have larger genomes than Drosophila • estimates from 250 to 500 Mb • Transposable elements constitute ~16% of the genome • Drosophila experienced a recent genome size reduction • The predicted number of genes is ~14,000 • Very similar to Drosophila • The comparison of the Anopheles and Drosophila genomes and proteomes reveals • considerable similarities and numerous differences • Reflects selection and adaptation to different ecologies and life strategies Reprinted from: Holt et al., Science, 298: 129-149 (2002)
Similarity at the protein level • Identified 4 protein classes • True orthologs: ~45% (~6,000) • Exhibit 1:1 relationship • Genes with conserved function • Paralogs: ~12% • Duplicated genes • Homologs: ~25% • Unclear relationship • Orphans: 11% to 18% • New genes • Rapidly evolving genes Reprinted from: Zdobnov et al., Science, 298, 149 (2002)
The core of conserved proteins • Dynamics of gene structure over a span of 250 MY • Exon lengths and intron frequencies are similar • introns in Drosophila have half the length of those in Anopheles • systematic reduction of noncoding regions in Drosophila • Only 50% of the introns are perfectly conserved • one intron gain or loss per gene per 125 MY • Intron sequences diverge rapidly • sequence similarity in <2% of the equivalent introns Reprinted from: Zdobnov et al., Science, 298, 149 (2002)
Family expansions and reductions • Increases and decreases in protein families • Related to adaptations to life strategies and environment • Expansions or reductions are • Uneven: a single gene in one species has many paralogs in the other • More frequent in Anopheles • Examples: • Cuticular proteins • Innate immunity genes • FBN-like (fibrinogen) proteins massively expanded in Anopheles Reprinted from: Zdobnov et al., Science, 298, 149 (2002)
Genome Rearrangements • Microsynteny • 34% of the orthologs map to ~1000 microsynteny blocks • 2-3 genes per block (cf. fish-human) • Macrosynteny • Both species have five major chromosomal arms • Clear 1:1 homologies between the chromosomal arms • Inversions much more frequent than translocations Reprinted from: Zdobnov et al., Science, 298, 149 (2002)
The Draft Genome of Ciona intestinalis: Insights into chordate and vertebrate origins Dehal et al., Science, 298, 2157-2167 (2002) • Paper presents • Draft genome sequence of Ciona intestinalis, an ancestral chordate • Chordates appear in the fossil record at the Cambrian explosion • ~550 million years ago [Figure: phylogeny with the tunicate branch labelled 550 MY]
Ciona intestinalis • Sessile, hermaphroditic marine invertebrates • Adults are simple filter feeders • Encased in a fibrous tunic [Figure: adult, and juvenile showing the internal structures: ds, digestive system; es, endostyle; ht, heart; os, neuronal complex; pg, pharyngeal gill] Reprinted from: Dehal et al., Science, 298, 2157-2167 (2002)
Gene content and global comparisons • Predicted ~16,000 gene models • 75% of the predicted genes are supported by EST evidence • Genes are compact and densely packed: one gene per 7.5 kb • Global comparisons • 60% of the genes have a detectable fly or worm homolog • 20% of the genes have no clear homolog • tunicate-specific genes • 17% of the genes have a vertebrate homolog but no detectable fly or worm homolog • Many are single-copy genes for the vertebrate gene families • signalling and regulatory processes in development • The gene content is a reasonable approximation of the ancestral chordate Reprinted from: Dehal et al., Science, 298, 2157-2167 (2002)
Future Perspectives • Invertebrate genomes are being sequenced at a rapid pace • Worms: 10 species of medical and agricultural importance • Schistosoma, Ancylostoma, Ascaris, Globodera, Meloidogyne • Insects: ~20 species of primarily agricultural importance • Mosquitoes, honey bee, Lepidoptera and >10 Drosophila species • Protozoa: several species of medical importance • Trypanosoma, Theileria, Plasmodium, Leishmania,… • Broad range of species • Sponge, sea urchin, Daphnia, Hydra, snail, lamprey,… • Source: GOLD™ Genomes OnLine Database • http://www.genomesonline.org/
Recommended reading • The nematode genome sequence • The C. elegans Sequencing Consortium, Science, 282, 2012 (1998) • The Drosophila genome sequence • Adams et al., Science, 287, 2185 (2000)
Further reading • Nematode genomes • C. briggsae: • Stein et al., PLoS Biol 1: 166-192 (2003) • Insect genomes • Finished Drosophila genome sequence: • Celniker et al., Genome Biol. 3: research 0079.1–0079.14 (2002) • Annotation of the Drosophila genome: • Misra et al., Genome Biology, 3: research 0083.1-0083.22 (2002) • Draft Drosophila pseudoobscura genome sequence • Richards et al., Genome Res. 15: 1-18 (2005) • Draft mosquito genome sequence • Holt et al., Science, 298: 129-149 (2002) • Zdobnov et al., Science, 298, 149 (2002) • Ciona genome • Dehal et al., Science, 298, 2157-2167 (2002)