Genome Biology and Biotechnology

Genome Biology and Biotechnology 5. The genome structures of plants Prof. M. Zabeau Department of Plant Systems Biology Flanders Interuniversity Institute for Biotechnology (VIB) University of Gent International course 2005

Sequenced genomes of invertebrates and plants • Completed plant genomes • Arabidopsis thaliana • Oryza sativa (rice) • Draft genome sequences • Finished chromosomes • Genome sequencing in progress • Poplar (draft sequence completed) • Medicago (in progress) • Tomato (in progress) • Maize (started)

Phylogeny of the flowering plants [Figure: phylogenetic tree showing the dicot and monocot lineages, labelled ~250 MY]

Analysis of the genome sequence of the flowering plant Arabidopsis thaliana • Plants and animals evolved independently from unicellular eukaryotes, representing contrasting life forms • The worm and fly genomes revealed the common genetic basis of developmental and physiological processes in multicellular organisms • The genome sequence of a plant provides a glimpse of the genetic basis of differences between plants and other eukaryotes • The genome sequence is one of the most accurately sequenced genomes (error rate < 1:100,000) The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

The Arabidopsis Genome Sequence • The complete genome size is estimated at ~125 Mb • The total length of the sequenced region is 115.409 Mb • The unsequenced centromeres and rRNA repeat (chr. 2 & 4) regions are estimated at 10 Mb • General features such as gene density and repeat distribution are very consistent across the five chromosomes Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Representation of the Arabidopsis Chromosomes [Figure: the five chromosomes with their sequenced lengths: Chr. 1, 29.1 Mb; Chr. 2, 19.6 Mb; Chr. 3, 23.2 Mb; Chr. 4, 17.5 Mb; Chr. 5, 26.0 Mb; rDNA repeat regions marked] Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Representation of Arabidopsis Chromosome 1 [Figure: chromosome 1 from telomere to telomere, with the centromere and pericentromeric region marked; tracks show the density of protein genes, ESTs, transposons, mitochondrial/chloroplast insertions and RNA genes] Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Coding Gene Content • AGI annotation predicted 25,489 genes • Non-homogeneous annotation: performed by different groups • Re-annotation estimates 28,000 to 29,000 genes • Larger than C. elegans (19,099) and D. melanogaster (13,601) • Larger gene set results from numerous gene duplications • MIPS classification of Arabidopsis proteins in 12 functional categories (cf. yeast) • ~70% classified according to sequence similarity to proteins of known function in all organisms • 9% experimentally characterized • ~30% could not be assigned to functional categories • Representing 10,000 “unknown genes” Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Functional Analysis of Arabidopsis Genes Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Comparison of Functional Categories • Comparison of Arabidopsis genes with those of the other complete genomes reveals: • High conservation of eukaryotic gene function • >50% of the genes involved in protein synthesis have counterparts in the other eukaryotic genomes • Independent evolution of many plant gene families • Transcription factors: only 8–23% of Arabidopsis proteins involved in transcription have related genes in other eukaryotic genomes • Acquisition of bacterial genes • From the cyanobacterial ancestor of the plastid: on the order of 1,000 genes have been translocated over time from the organelle to the nuclear genome • Genes with high similarity to Synechocystis Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

RNA Gene Content • rRNA genes • Nucleolar organizers (NORs) on chromosomes 2 and 4 contain 350–400 repeats of 10 kb encoding the 18S, 5.8S and 25S rRNA genes, comprising 3.5–4.0 Mb • 5S rRNA genes • Tandem arrays in the centromeric regions of chr 3, 4 and 5 • tRNA genes: dispersed organization • 589 cytoplasmic tRNAs, 27 organelle-derived tRNAs and 13 pseudogenes • Spliceosomal RNAs, small nucleolar RNAs (snoRNAs) • Several copies occur dispersed on all chromosomes Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Genome Duplication in Arabidopsis • The Arabidopsis genome exhibits traces of extensive duplications • >75% of the Arabidopsis genes are duplicated • The fact that most genes are duplicated explains the higher gene number than in other organisms • Segmental duplications • Segmental duplications were first described in yeast • Identified 24 large duplicated segments of >100 kb • These duplicated regions encompass 58% of the genome • Tandem gene arrays • Tandem arrays of genes are common in all genomes • 1,528 tandem arrays containing 4,140 individual genes • 17% of all genes of Arabidopsis are arranged in tandem arrays Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Genome Organization and Duplication • First analysis of segmental duplications • Detection of collinear clusters of genes using TBLASTX • This approach detects the “obvious” duplications • The proportion of homologous genes in each duplicated segment varies widely • Extensive loss or gain of genes after the segmental duplication occurred • Sequence conservation/divergence of the duplicated genes varies greatly • Duplications vary in age • suggesting several different large-scale duplication events • Duplications occurred between 75 and 200 million years ago • Earliest duplication coincides with the radiation of the flowering land plants Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)
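The idea of detecting collinear clusters of genes can be sketched in a few lines of Python. This is a minimal illustration and not the AGI pipeline: it assumes that homologous gene pairs have already been found (for example with TBLASTX) and are supplied as hypothetical (pos_a, pos_b) gene-order indices on two chromosomal segments. The function name and the parameters `max_gap` and `min_pairs` are illustrative assumptions.

```python
def collinear_blocks(pairs, max_gap=3, min_pairs=3):
    """Greedily chain homologous gene pairs into collinear blocks.

    pairs: iterable of (pos_a, pos_b) gene-order indices of homologous genes
    on two segments. A pair extends the current block when both coordinates
    advance by at most max_gap genes (same orientation only, for simplicity).
    Blocks with fewer than min_pairs pairs are discarded.
    """
    blocks, current = [], []
    for a, b in sorted(pairs):
        if current:
            last_a, last_b = current[-1]
            if 0 < a - last_a <= max_gap and 0 < b - last_b <= max_gap:
                current.append((a, b))
                continue
            # The chain is broken: keep it only if it is long enough.
            if len(current) >= min_pairs:
                blocks.append(current)
        current = [(a, b)]
    if len(current) >= min_pairs:
        blocks.append(current)
    return blocks


# Hypothetical homologous pairs: two collinear runs plus one isolated hit.
example_pairs = [(1, 10), (2, 11), (4, 13), (20, 3), (21, 4), (22, 6), (30, 50)]
example_blocks = collinear_blocks(example_pairs)
print(len(example_blocks))
```

Because the chaining is greedy, a single spurious hit inside a block splits it, and gene loss beyond `max_gap` hides the block entirely. This is one way to see why only the "obvious" duplications are found by a simple approach.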

Overall View of the Duplicated Regions Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Implications of Genomic Duplications • What does the duplication in the Arabidopsis genome tell us about the evolution of the species? • Polyploidy occurs widely in plants but not in animals • The hypothesis is that Arabidopsis had one or more tetraploid ancestors • The majority of the Arabidopsis genome is represented in duplicated segments • Suggests that the duplicated segments arose from whole genome duplications • The long period of time (75 to 200 My) provided ample opportunity for the divergence of the functions of the duplicated genes • Duplicated genes often have redundant functions • Majority of insertion mutants in Arabidopsis have no obvious phenotypic effect Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

The Origin of Genomic Duplications • First detailed analysis of the duplications: • Vision et al., Science 290: 2114 (2000) • Identified 103 duplicated segments with ≥7 matching ORFs • 81% of the Arabidopsis genes fall within at least one block • The ages of the duplicated blocks were estimated from the average extent of amino acid substitution • The number of duplication events was estimated from the distribution of the estimated block ages • A single polyploidization event will produce a unimodal distribution of ages with homogeneity among blocks • Independent duplication events will produce a multimodal distribution Reprinted from: Vision et al., Science 290: 2114 (2000)

Age Classes of Duplicated Blocks • Distribution of divergence suggests 4 duplication events • Classes C through F yield age estimates of 100, 140, 170, and 200 Mya • Age class C, the most recent, comprises 50% of the duplicated segments • Age class F predates the divergence of monocots and dicots, 180 to 220 Mya Reprinted from: Vision et al., Science 290: 2114 (2000)

The Origin of Genomic Duplications • Recent study of the Arabidopsis genome duplications • Simillion et al., PNAS 99: 13627 (2002) • More refined algorithms detect degenerated block duplications • Degeneration results from extensive gene loss and subsequent reshuffling of gene order • Algorithms detect hidden duplications missed in earlier studies • Study revealed a much larger number of duplications • 304 nonhidden duplications and 53 hidden duplications • Comprising 82% of all genes in Arabidopsis • >70% of the genes are lost from the duplicated segments

Nonhidden and Hidden Duplications [Figure: panels showing nonhidden and hidden duplications] Reprinted from: Simillion et al., PNAS 99: 13627 (2002)

Multiplication levels of the Duplications • Chromosomal segments exhibit multiple duplications • Multiplication numbers vary from 5 to 8 Reprinted from: Simillion et al., PNAS 99: 13627 (2002)

Conclusions • High multiplication levels • Suggest multiple rounds of whole genome duplication • Observed many duplications with multiplication levels of 5 to 8 • Indicating a maximum of three rounds of duplications • Dating based on silent substitutions • Accurate for the youngest duplication • dated 75 million years ago • Less reliable for the two older age classes • dated 163 and 221 million years ago • Results suggest three whole genome duplication (polyploidization) events • The oldest one may have occurred before the monocot/dicot split Reprinted from: Simillion et al., PNAS 99: 13627 (2002)
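The link between multiplication levels and rounds of duplication can be made explicit with a small calculation. The sketch below assumes an idealized model in which each whole genome duplication at most doubles the number of copies of a segment, so a segment present in m copies needs at least ceil(log2 m) rounds. The function name is hypothetical.

```python
def min_duplication_rounds(multiplication_level):
    """Smallest number of genome doublings that can yield this many copies.

    Assumes each round at most doubles the copy number, so the result is
    ceil(log2(m)), computed with integer arithmetic to avoid rounding issues.
    """
    if multiplication_level < 1:
        raise ValueError("multiplication level must be at least 1")
    return (multiplication_level - 1).bit_length()


# Multiplication levels of 5 to 8, as reported on the slide.
rounds_needed = {m: min_duplication_rounds(m) for m in range(5, 9)}
print(rounds_needed)
```

Under this model every level from 5 to 8 requires three doublings, and three doublings cannot produce more than 8 copies, which is consistent with the three rounds inferred above.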

The grass genomes • Grasses are the primary food source • Wheat, rice, maize, barley, sorghum… • Grass genomes vary widely in size

Macro synteny of the grass genomes Reprinted from: Moore et al., Curr. Biol. 5: 737-739 (1995)

The rice genome sequence • Draft genome sequences (2002): whole genome shotgun sequences • Oryza sativa L. ssp. japonica, Syngenta • fragmented sequence covers 78% in >42,000 contigs • Goff et al., Science 296: 92 (2002) • Oryza sativa L. ssp. indica, Beijing Genomics Institute • very fragmented sequence covers 69% in >110,000 contigs • Yu et al., Science 296: 79 (2002) • Finished genome sequence (2005): map-based genome sequence • Oryza sativa L. ssp. japonica, The International Rice Genome Sequencing Project • finished quality sequence that covers 95% of the 389 Mb genome • including all of the euchromatin and two centromeres • International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)

Maps of the twelve rice chromosomes • The size of the rice genome was estimated at 389 Mb • the sequence covers 95% of the genome and 98.9% of the euchromatin [Figure: maps of the twelve chromosomes with the centromeres marked] Reprinted from: International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)

Maps of the Centromeric Region of Rice Chr 8 • Centromeres contain • highly repetitive 155-165 bp CentO satellite DNA • centromere-specific retrotransposons [Figure tracks: BACs, transposons, 155-bp CentO satellite DNA] Reprinted from: Wu et al., Plant Cell 16: 967-976 (2004)

Annotation Map of the Centromeric Region of Chr 8 [Figure tracks: genes, transposons] Reprinted from: Wu et al., Plant Cell 16: 967-976 (2004)

Distribution of features on rice chromosome 10 Reprinted from: The Rice Chromosome 10 Sequencing Consortium, Science. 300: 1566-1569 (2003)

Protein coding genes • Predicted 37,544 protein-coding genes • density of one gene per 9.9 kb • 22,840 (61%) genes are supported by ESTs or full-length cDNAs • 4,500 additional genes match entries in the Swiss-Prot database • ~10,000 are predicted ab initio • Rice – Arabidopsis homologies • 90% of the predicted Arabidopsis proteins have a rice protein homologue • 71% of the predicted rice proteins have an Arabidopsis protein homologue • Unique rice genes match unknown or hypothetical proteins • interesting differences between the genome content of these two groups of angiosperms remain to be discovered Reprinted from: International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)
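As a quick consistency check on the figures above, the gene count and gene density can be multiplied out and compared with the sequenced fraction of the genome. The sketch uses only numbers quoted on the slides (37,544 genes, one gene per 9.9 kb, 95% of 389 Mb, 22,840 supported genes); the variable names are illustrative.

```python
# Figures quoted on the slides.
predicted_genes = 37_544
kb_per_gene = 9.9
genome_size_mb = 389
sequenced_fraction = 0.95
supported_genes = 22_840

# Sequence length implied by the gene count and the average gene spacing.
implied_span_mb = predicted_genes * kb_per_gene / 1000

# Sequence length implied by the stated genome coverage.
sequenced_mb = genome_size_mb * sequenced_fraction

# Share of predicted genes supported by ESTs or full-length cDNAs.
supported_share = supported_genes / predicted_genes

print(f"{implied_span_mb:.1f} Mb vs {sequenced_mb:.1f} Mb sequenced")
print(f"{supported_share:.0%} of genes supported by ESTs or cDNAs")
```

The two length estimates agree to within about 1%, and the supported share rounds to the 61% given on the slide.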

Classification of the predicted rice genes • Functional classification • The number of genes in each functional class is very similar to that in Arabidopsis Reprinted from: Goff et al., Science 296: 92 (2002)

Other gene features • Tandem gene families • 29% of the genes arranged in tandem repeats • Compared to 17% of genes in Arabidopsis • Non-coding RNA genes • rDNA repeats are located in the nucleolar organizer on chr 9 • A total of 763 tRNA genes • Identified 158 MicroRNAs (miRNAs) • MicroRNAs regulate gene expression by interacting with the target messenger RNAs • Organellar insertions in the nuclear genome • 421−453 chloroplast insertions • 909−1,191 mitochondrial insertions • several successive transfer events have occurred Reprinted from: International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)

Transposable elements • Transposon content is at least 35% • More divergent elements were identified using profile HMM • Much larger than Arabidopsis Reprinted from: International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)

Intraspecific sequence polymorphism • Comparison of orthologous sequences of ssp. indica and ssp. japonica • Aligned 308 Mb (79%) of the genome • Identified 80,127 sites that differ between the two subspecies Reprinted from: International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)

Gene duplication in rice [Figure: duplicated segments] Reprinted from: Paterson et al., PNAS 101: 9903-9908 (2004)

Genome duplication in rice • Extensive gene duplication • 9 duplicated blocks account for 62% of the rice genes • blocks have retained 16% to 25% of the duplicate copies • retention of duplicated gene copies is greater than predicted • suggests that gene loss is not random • Phylogenetic dating of the genome duplication • Ks values (synonymous substitutions per synonymous site) suggest a single duplication event • except the chromosome 11-12 duplication, which was more recent • The Ks peak for the rice duplicates corresponds to 70 MYA • The time of divergence of the cereals is estimated at 50 MYA • a polyploidization event occurred 70 MYA • before the divergence of the major cereals Reprinted from: Paterson et al., PNAS 101: 9903-9908 (2004)
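Dating a duplication from Ks values rests on a simple molecular-clock relation: the two copies diverge independently, so the age is T = Ks / (2r), where r is the synonymous substitution rate per site per year. The sketch below is illustrative only. The default rate of 6.5e-9 is an assumed value of the kind often used for grasses, not a figure taken from the slides, and the example Ks of 0.91 is hypothetical, chosen to show what divergence would correspond to about 70 My under that rate.

```python
def duplication_age_my(ks, rate_per_site_per_year=6.5e-9):
    """Estimate the age of a duplication in millions of years from Ks.

    Uses T = Ks / (2 * r): both duplicate copies accumulate synonymous
    substitutions, hence the factor of 2. The default rate is an assumption.
    """
    if ks < 0 or rate_per_site_per_year <= 0:
        raise ValueError("ks must be >= 0 and the rate must be > 0")
    return ks / (2.0 * rate_per_site_per_year) / 1e6


# Hypothetical Ks value, chosen for illustration.
example_age = duplication_age_my(0.91)
print(f"estimated age: about {example_age:.0f} My")
```

The estimate scales inversely with the assumed rate, so halving r doubles the inferred age. Saturation of synonymous sites also makes the relation less reliable for older duplications.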

Genomic Duplications in Angiosperm Evolution [Figure: duplication events placed on the monocot and dicot lineages] Reprinted from: Paterson et al., PNAS 101: 9903-9908 (2004)

Comparison of rice and grass genomes • Synteny between rice and Arabidopsis • Limited to relatively short segments comprising few genes • Successive rounds of genome duplications in the two lineages (Arabidopsis 2; rice 1) have blurred the ancestral synteny • Macro synteny of the grass genomes is confirmed at the sequence level • 98% of the genes found in the different grasses have a rice homolog • Rice is a model system for the larger cereal genomes

Micro synteny of the grass genomes • Collinear arrangement of genes is interrupted by • Intergenic retrotransposon blocks Reprinted from: Ramakrishna et al., Genetics, 162, 1389 (2002)

The maize genome • Large (2,365 Mb) and complex genome • Unusually high repetitive DNA content (>80%) • Stepwise sequencing approach designed to meet the challenge • Sequencing the gene-rich fraction • Enrichment of Gene-Coding Sequences by Genome Filtration • Whitelaw et al., Science 301: 2118-2120 (2003) • High resolution physical map of 300,000 BAC clones • BAC end sequencing: completed • Sequence composition and genome organization of maize • Messing et al., PNAS 101: 14349-14354 (2004) • BAC skim sequencing: in progress • Low pass sequencing of minimal tiling path BACs • Expect the complete genome sequence by 2007 • Martienssen et al., Curr. Op. in Plant Biol. 7: 102-107 (2004)

Structure of the Maize genome • The maize genome is 6 times larger than that of rice • ~60% of the genome comprises highly repetitive sequences • >90% are LTR-retrotransposons inserted in the last 3 to 6 MY • 10 to 100 kb tracts of nested insertions separate genic regions Reprinted from: SanMiguel et al., Nat. Genet. 20: 43 (1998)

Duplicated genes in maize • A conservative estimate predicts 59,000 genes • A very large fraction of duplicated genes • Two interesting aspects of the gene organization • Although the genome was duplicated only 5-10 My ago, the tetraploidization was followed by a heavy loss of duplicate genes • <50% of the duplicates are retained (cf. yeast) • Tandem gene amplification is unusually high • ~1/3 of the genes belong to tandemly arrayed gene families • The maize genome illustrates the exceptional dynamics of genome evolution in plants Reprinted from: Messing et al., PNAS 101: 14349-14354 (2004)

Origin of rice, maize and sorghum [Figure: divergence of the three lineages with the genome duplication marked] Reprinted from: Messing et al., PNAS 101: 14349-14354 (2004)

Enrichment of Gene-Coding Sequences in Maize by Genome Filtration • Paper presents • Two methodologies that enrich for genic sequences for sequencing complex genomes • Methylation filtering • High C0t selection • Combination of the two techniques resulted in a six-fold reduction in the effective genome size • Powerful technologies for sequencing repeat-rich genomes Whitelaw et al., Science 301: 2118-2120 (2003)

Enrichment of Gene-Coding Sequences • Methylation filtering • hypermethylated sequences are excluded with the use of bacterial restriction systems that cleave methylated sequences • In plants two methylases methylate C residues in CG and CNG • Methylation is restricted primarily to repeated DNA sequences • High C0t (HC) selection • allows separation of DNA fractions into low-copy (High C0t) or high-copy (Low C0t) sequences • The repetitive DNA renatures first • The double-stranded DNA can be separated from lower copy number, unrenatured DNA • The low-copy number fraction is enriched in genes Reprinted from: Whitelaw et al., Science 301: 2118-2120 (2003)
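Why repetitive DNA renatures first can be illustrated with the ideal second-order reassociation equation, in which the fraction of DNA still single-stranded is C/C0 = 1 / (1 + k·C0t). The sketch below is a simplified model, not the protocol from the paper. It assumes that the effective rate constant of a sequence class scales with its copy number, and the rate constant, copy number and C0t value are all hypothetical.

```python
def fraction_single_stranded(c0t, k):
    """Ideal second-order renaturation: C/C0 = 1 / (1 + k * C0t)."""
    if c0t < 0 or k <= 0:
        raise ValueError("c0t must be >= 0 and k must be > 0")
    return 1.0 / (1.0 + k * c0t)


# Hypothetical values: a single-copy rate constant and a 1000-copy repeat.
k_single_copy = 1e-3                  # assumed, in units of 1 / C0t
repeat_copy_number = 1000             # assumed
k_repeat = k_single_copy * repeat_copy_number

c0t = 100.0                           # assumed sampling point
repeat_ss = fraction_single_stranded(c0t, k_repeat)
single_ss = fraction_single_stranded(c0t, k_single_copy)

print(f"repeats still single-stranded:     {repeat_ss:.3f}")
print(f"single-copy still single-stranded: {single_ss:.3f}")
```

In this toy example nearly all of the repeat fraction has reannealed while most of the single-copy fraction is still single-stranded, so removing the double-stranded DNA at that point leaves a fraction enriched in low-copy, gene-rich sequence.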

Sorghum Genome Sequencing by Methylation Filtration Bedell et al., PLoS Biol. 3: e13 (2005) • Paper presents • Sequence from the hypomethylated portion of the sorghum genome obtained by applying methylation filtration • 96% of the genes have been sequence tagged, with an average coverage of 65% across their length • MF preferentially captures exons and introns, promoters, microRNAs, and simple sequence repeats • MF preferentially minimizes interspersed repeats • MF provides a robust view of the functional parts of the genome.

Genome Reduction in Sorghum Reprinted from: Bedell et al., PLoS Biol. 3: e13 (2005)

Plant and animal genome evolution • Animal genomes • Marked conservation of synteny over long evolutionary times • Evolution proceeds mainly through expansion/contraction of gene families through tandem duplication • Total number of genes remains more or less constant • Increased gene diversity through enhanced alternative splicing • Balanced gene birth and death • Plant genomes • Genomes evolve at a more rapid pace, driven by successive rounds of whole genome duplication events • Duplication events followed by massive gene losses, with retention of substantial fractions (~30%) of the duplicated genes • With subsequent neo-functionalization of duplicated genes (?) • Marked tendency towards increased number of genes • Alternative splicing is much less common

Gene Content versus Genome Size [Figure: number of genes plotted against genome size in million base pairs; labelled points include maize, rice, fish and fungi]

Future Perspectives • Different plant genome projects ongoing or planned (currently totaling ~30) • Grasses: 5 species • Maize, barley, sorghum, oat and grass • Flowering plants: >10 species • Tomato, potato, coffee, cotton, soybean, clover, lotus, grapevine,… • Trees: 4 species • Poplar, eucalyptus, pine and banana • Algae and mosses: ~10 different species • Source: GOLD™ Genomes OnLine Database • http://www.genomesonline.org/

Recommended reading • The Arabidopsis genome sequence • The Arabidopsis Genome Initiative, Nature 408: 796 (2000) • The map-based sequence of the rice genome • International Rice Genome Sequencing Project, Nature 436: 793-800 (2005) • The maize genome sequence • Martienssen et al., Curr. Op. in Plant Biol. 7: 102-107 (2004)
