Advances in Genome Sequencing

Genome Biology and Biotechnology Prof. M. Zabeau Department of Plant Systems Biology Flanders Interuniversity Institute for Biotechnology (VIB) University of Gent International course 2005

Genome Biology and Biotechnology • Introduction: “The genomics revolution” • Genome structure and evolution • The genome structures of unicellular organisms • The genome structures of invertebrates • The genome structures of vertebrates • The variable human genome • The genome structures of plants • Functional genomics • The ORFeome • The phenome • The transcriptome • The localizome • The proteome • Concluding remarks: “Systems biology”

Genome Biology and Biotechnology The genomics revolution International course 2005

The Human Genome Project • Timeline 1990 – 2005 (figure): technological innovations and high throughput automation enabled large scale genome sequencing • 1000-fold increase in sequencing output: <1 Mb/year, >1000 Mb/year, 20,000 Mb/year

Technological Innovations • High throughput fingerprinting of BAC clones • Construction of physical maps • Starting DNA for large scale sequencing 1 2 Mb

Technological Innovations • High throughput fingerprinting of BAC clones • Construction of physical maps • Improvements of the dideoxy sequencing technique • Fluorescent labeling and improved sequencing enzymes • Improved sequencing strategies • Shotgun sequencing, improved shotgun libraries • Software for automated interpretation of fluorograms • Assigns 'assembly-quality scores' to each base in the assembled sequence • Assembly of high quality sequence contigs

Shotgun DNA Sequencing Strategy BAC clone

High throughput automation • Automated DNA sequence gel readers • First generation: slab gel-based DNA sequencers • 32 – 96 samples per run • Manual loading • Difficulties in lane tracking caused considerable losses of data • Second generation: capillary DNA sequencers • Automated loading, allowing unattended operation and perfect lane tracking • 20 × 96 samples/day = ~2 million bases of raw sequence/day • Automation of sample preparation and handling • Liquid handling robots made scale-up feasible • Eliminated most of the “human error”

Sequencing Complex Genomes: the Challenge • Difficulties arise because of repeated sequences • Small amounts of repeated sequence pose little problem for shotgun sequencing • Bacterial genomes (about 1.5% repeat) • Mammalian genomes are filled (> 50%) with repeated sequences • Interspersed repeats derived from transposable elements • Large duplicated segments with high sequence identity (98–99.9%) • Repeated sequences complicate the correct assembly of shotgun sequence reads • Two strategies for sequencing complex genomes • Hierarchical shotgun sequencing strategy ('map-based', 'BAC-based' or 'clone-by-clone' strategy) • Whole genome shotgun (WGS) sequencing strategy
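The idea of assembling shotgun reads from their overlaps can be illustrated with a toy greedy assembler. This is only a minimal sketch under strong assumptions (error-free reads, a single strand, no repeats longer than the minimum overlap); it is not the software used by the genome projects, and the names `overlap`, `greedy_assemble` and `min_len` are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Stops when no pair overlaps by at least `min_len` bases, and returns
    the remaining contigs. Repeated sequences longer than `min_len` can
    mislead this procedure, which is the difficulty described above.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


# Three overlapping reads sampled from a made-up 15-base sequence.
example_reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(example_reads))
```

With real data, reads contain errors and come from both strands, so practical assemblers score inexact overlaps and use paired-end and map information rather than exact string matching.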

Hierarchical Shotgun Sequencing Strategy Reprinted from: International Human Genome Sequencing Consortium Nature 409, 860 (2001)

Whole Genome Shotgun Sequencing • Different insert sizes of cloned DNA • 2 kb in multi copy vectors • 10 kb in fosmid vectors • 100 – 200 kb in BACs Reprinted from: Venter et al., Science 280: 1540 (1998)

Whole-genome shotgun sequence assembly • STS: sequence tagged sites Reprinted from: Venter et al., Science, 291, 1304 (2001)

Comparison of the two strategies • The hierarchical shotgun sequencing strategy • Is slower and has a higher upfront cost • Requires the creation of a detailed physical map of clones • Sequencing of 10,000s of individual BAC clones involves more handling steps • Is indispensable for the production of a finished sequence • The whole-genome shotgun approach • Is faster and more cost effective • Fully exploits the potential of a streamlined robotics-based operation • But cannot deliver more than a (high quality) draft sequence

Draft Sequences versus Finished Sequences • Draft genome sequences • High quality draft sequence: high (8- to 10-fold) coverage • Yields sequence contigs that cover 95% – 98% of the sequence • Draft sequence is by definition incomplete • 10,000 – 100,000 gaps • Incorrectly assembled sequences – duplicated segments • Finished genome sequences • Close gaps and resolve ambiguities in draft sequences • Correct order and orientation of sequence contigs • Resolution of duplicated regions that are collapsed in the draft sequence • Standard error rate: < 1 error per 10,000 bases
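The finished-sequence error standard can be expressed on the logarithmic Phred quality scale, Q = -10 · log10(P), which is widely used for base-quality scores. The use of the Phred scale here is an assumption added for illustration; the slides only state the threshold of < 1 error per 10,000 bases. The function names are hypothetical.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert an error probability into a Phred-scaled quality score."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Convert a Phred-scaled quality score back into an error probability."""
    return 10.0 ** (-quality / 10.0)


# Finished-sequence standard from the slide: 1 error per 10,000 bases.
finished_q = phred_quality(1 / 10_000)   # about Q40
print(f"1 error in 10,000 bases corresponds to about Q{finished_q:.0f}")

# Expected number of errors in a 12 Mb genome at that error rate.
print(f"expected errors in 12 Mb: {12_000_000 * error_probability(40):.0f}")
```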

Sequencing Complex Genomes • Projects currently underway use • A combination of the two approaches for model organisms where a finished genome sequence is indispensable • Human, mouse, Drosophila, zebrafish • Whole genome shotgun to generate high quality drafts • Comparative genome analysis • Hierarchical strategy for genomes in which repetitive DNA is clustered in centromeres or telomeres • Plant genomes • Alternative strategies • Methyl filtration or Cot enriched libraries are used for particular (large) plant genomes

Genome sequencing: progress to date • Extraordinary progress in the development of sequencing technologies in the past 15 years has resulted in • Completion of the human genome project ahead of schedule (2004) • Over 30 eukaryotic genome sequences (including 6 vertebrate genomes) • Over 200 bacterial and archaeal genome sequences • The completion of the human genome marks the “end of the beginning” • Many more genomes are to follow • The daunting task of unraveling its secrets still awaits

Genome Sequencing Milestones (timeline figure, 1995 – 2005) • Genomes shown: H. influenzae, S. cerevisiae, other yeasts, S. pombe, C. elegans, Drosophila melanogaster, Arabidopsis thaliana, Anopheles, Ciona, Neurospora, an alga, silkworm, Fugu, Tetraodon, chicken, mouse, rat • Human: chromosomes 21 & 22, chromosome 20, working draft, finished sequence

The global sequencing output to date • Equivalent of 15 human genomes (Feb 2004) • GenBank website: http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html

Annotation of Genome Sequences • The challenge of identifying genes in genomic sequences varies greatly among organisms • Gene identification is almost trivial in bacteria and yeasts • Genes are readily recognized by ab initio analysis as ORFs coding for >100 amino acids (no introns) • Smaller ORFs and overlapping genes are missed • Gene identification is relatively straightforward in small genomes, such as worm, plant and Drosophila • Coding sequences comprise a large proportion of the genome (~50%) • Introns are relatively small • Gene identification is very difficult in large complex genomes (mammalian) • Coding sequences comprise only a few per cent of the genome • Exons are small and introns are very large
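The simplest form of ab initio gene finding in intron-free genomes, scanning for long open reading frames, can be sketched as follows. This is a minimal illustration, not the method of any particular annotation pipeline: it assumes the standard stop codons, requires an ATG start, scans all six frames, and uses a hypothetical `min_codons` threshold of 100 to mirror the length cut-off on the slide. ORFs shorter than the threshold are missed by construction.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100):
    """Find ATG...stop open reading frames in all six reading frames.

    Returns tuples (strand, frame, start, end, n_codons). Coordinates are
    0-based, end-exclusive, and refer to the scanned strand; n_codons
    counts the start codon but not the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (pos - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, pos + 3, n_codons))
                    start = None                     # look for the next ORF
    return orfs


# Made-up example: a 121-codon ORF embedded in a short sequence.
example = "CC" + "ATG" + "GCT" * 120 + "TAA" + "GG"
print(find_orfs(example))
```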

Gene Prediction Methods • Three basic approaches • Direct evidence of transcription: ESTs or full length cDNAs • Limited to the more frequently expressed genes – misses rarely expressed genes • Indirect evidence based on sequence similarity to previously identified genes and proteins • Correctly identifies genes, but these may be pseudogenes • Limited to known genes – misses unknown genes • Ab initio prediction of groups of exons on the basis of hidden Markov models (HMMs) that • Combine statistical information about splice sites, coding bias and exon and intron lengths (for example, Genscan, Genie and FGENES)

Genome annotation: state-of-the-art • Genome annotation is an ongoing effort • In all published model genomes the gene counts and gene models are constantly being revised • The gene numbers do not change drastically (10% range) • Gene models are often subject to considerable change • Improvements will result from • The availability of many more complete genome sequences • Comparative genome analysis between related species • Larger databases of confirmed gene and protein sequences • The challenge ahead is the identification of regulatory sequences • Comparing multiple genomes of related species • Yeast and the mammalian genome projects

Principal Types of Microarrays • Oligonucleotide arrays • Produced by in situ synthesis of short (25–70 mer) oligonucleotides onto glass slides • Spotted arrays • Produced by robotic deposition of nucleic acids (PCR products, plasmids or oligonucleotides) onto a glass slide Reprinted from: Lockhart and Winzeler, Nature 405, 827 (2000)

Photolithographic microarrays Reprinted from: Lipshutz et al., Nature Genet. 21, 20 (1999)

Spotted Microarrays • Technology developed in the early 90's • Deposit micro droplets (nanoliter volumes) onto chemically treated glass surfaces • Multi-pin tools transfer liquid from microtiter plates onto the glass surface • Chemical coating is necessary for binding nucleic acids • Figure labels: DNA spotting, prehybridization, blocking, silanized slides, transcribe RNA to labeled cDNA, hybridization, washing

Future Perspectives • Technology developments will continue to drive the genomics field • Large scale genome sequencing improvements • Higher throughput and accuracy – more genomes • Lower the cost of genome sequencing • Microarray technology improvements • Higher probe densities – higher resolution data sets • Enable novel applications – functional genomics • Revolutionary new technologies are now being pioneered • 1000€ (human) genome programmes

Genome Biology and Biotechnology 1. The genome structures of unicellular organisms International course 2005

Sequenced genomes of unicellular eukaryotes • Budding yeasts • Saccharomyces cerevisiae (baker's yeast) • Related species • S. paradoxus, S. mikatae and S. bayanus • Other yeast species • Kluyveromyces waltii and Ashbya gossypii (vitamin B2 production) • Fission yeast • Schizosaccharomyces pombe • Fungi • Neurospora crassa • Phanerochaete chrysosporium (white rot) • Unicellular algae • Diatom Thalassiosira pseudonana • Red alga Cyanidioschyzon merolae

Life With 6000 Genes – the Yeast Genome Goffeau et al., Science, 274, 546-567 (1996) • The genome of the yeast Saccharomyces cerevisiae was the first eukaryotic genome to be sequenced • Sequencing was performed through a worldwide collaboration and took several years • The genome sequence of 12,068 kb (12 Mb) comprises • The complete genome minus the rRNA repeats

The Yeast Genome • The genome is very compact • One protein-encoding gene per 2.1 kb • ~70% of the total sequence consists of ORFs • Only 4% of protein-encoding genes contain an intron (mostly one) • The genome encodes 5885 ORFs • ORF: encodes a protein of ≥ 100 aa • Extensive genetic analysis previously defined ~1,000 genes (<20%) • Sequence reveals the existence of ~5,000 unknown genes • Repertoire of RNA genes • ~140 rRNA genes in a large tandem array on chromosome XII • 275 tRNA genes (43 families) scattered on the 16 chromosomes • 40 small nuclear RNA (snRNA) genes are also widely distributed Reprinted from: Goffeau et al., Science, 274, 546-567 (1996)
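The compactness figures on this slide follow from simple arithmetic on the numbers quoted in the deck (12,068 kb of sequence, 5885 ORFs, ~70% coding, ~1,000 previously known genes). The short check below only recombines those quoted values; the variable names are illustrative.

```python
genome_kb = 12_068        # sequenced genome size, from the slides
orf_count = 5_885         # ORFs of at least 100 aa, from the slides
coding_fraction = 0.70    # share of the sequence made up of ORFs
known_genes = 1_000       # genes defined by genetics before sequencing

kb_per_gene = genome_kb / orf_count                       # gene density
mean_orf_kb = coding_fraction * genome_kb / orf_count     # average ORF length
known_fraction = known_genes / orf_count                  # previously known share
unknown_genes = orf_count - known_genes                   # newly revealed genes

print(f"one gene per {kb_per_gene:.2f} kb")
print(f"mean ORF length about {mean_orf_kb:.2f} kb")
print(f"previously known: {known_fraction:.0%}; unknown: {unknown_genes}")
```

The results (about one gene per 2.05 kb, about 17% previously known, 4,885 unknown) agree with the rounded values on the slide.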

Statistics on the Yeast Functional Catalogue • MIPS classification of 50% of the yeast proteins • On the basis of their amino acid sequence similarity with other proteins of known function • 11 functional categories Reprinted from: Goffeau et al., Science, 274, 546-567 (1996)

Conclusions • The yeast genome sequence has provided the first glimpse into the eukaryotic genome • The sequence confirms that yeast is the model eukaryote of choice for the study of the functions common to all eukaryotic cells – the basic cellular functions • Transcription, translation • Cell division • Metabolism • Cellular organization and biogenesis • The challenge will be to elucidate the function of all of the novel genes revealed by the genome sequence • Yeast is the model organism for functional genomics Reprinted from: Goffeau et al., Science, 274, 546-567 (1996)

Genome Duplications In Yeast • The discovery of segmental duplications in the yeast genome came as a surprise… • Identified 53 homology regions in which • homologous genes have the same order and the same transcriptional orientation Reprinted from: Goffeau et al., Science, 274, 546-567 (1996)

Proof and evolutionary analysis of ancient genome duplication in the yeast S. cerevisiae Kellis et al., Nature 428, 617 - 624 (2004) • The paper provides convincing evidence that • the segmental duplications in the yeast genome are the result of an ancient whole genome duplication • The evidence was based on comparative genome analysis with the related yeast species Kluyveromyces waltii Reprinted from: Kellis et al., Nature 428, 617 - 624 (2004)

Doubly Conserved Synteny • Each region in K. waltii is syntenic to 2 regions in S. cerevisiae • 145 sister regions in S. cerevisiae covering 88% of the genome Reprinted from: Kellis et al., Nature 428, 617 - 624 (2004)

Doubly Conserved Synteny • Regions in K. waltii corresponding to 2 regions in S. cerevisiae contain • a few duplicated genes • only a subset of the K. waltii genes • Figure label: duplicated genes Reprinted from: Kellis et al., Nature 428, 617 - 624 (2004)

Model of Whole Genome Duplication • Whole genome duplication • Progressive gene loss • Different gene sets retained • Few duplicated genes retained • Doubly conserved synteny Reprinted from: Kellis et al., Nature 428, 617 - 624 (2004)

Centromere Synteny • Each centromere of K. waltii is syntenic to 2 centromeres of S. cerevisiae • Doubling of the number of chromosomes in S. cerevisiae (8 to 16) • S. cerevisiae is a degenerate tetraploid Reprinted from: Kellis et al., Nature 428, 617 - 624 (2004)

Long-term evolution of a duplicated genome • Pattern of gene loss after duplication • Gene loss occurred by many small deletions • 88% of paralogous genes were lost • the current S. cerevisiae genome contains only 10% more genes than K. waltii • 12% of the paralogous gene pairs were retained • A total of 457 gene pairs are retained in S. cerevisiae • Homologous genes: genes exhibiting sequence homology • Orthologous genes: homologous genes in different genomes • Both sequence and function are retained • Paralogous genes: multiple copies of homologous genes • Genes may have different functions Reprinted from: Kellis et al., Nature 428, 617 - 624 (2004)
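The link between the fraction of retained pairs and the modest increase in gene number can be made explicit with a small calculation. This is a simplified bookkeeping model, assuming every ancestral gene was duplicated once and that each locus afterwards keeps either one copy or both; the ancestral gene count of 5,000 is a hypothetical round number, not a value from the slides.

```python
def genes_after_wgd(ancestral_genes: float, fraction_pairs_retained: float) -> float:
    """Gene count after whole genome duplication followed by gene loss.

    Loci that lose one copy contribute one gene; loci that retain the
    pair contribute two genes.
    """
    single_copy = ancestral_genes * (1.0 - fraction_pairs_retained)
    retained_pairs = ancestral_genes * fraction_pairs_retained
    return single_copy + 2.0 * retained_pairs


ancestral = 5_000                       # hypothetical pre-duplication gene count
after = genes_after_wgd(ancestral, 0.12)
print(f"genes after duplication and loss: {after:.0f}")
print(f"increase over the ancestral count: {after / ancestral - 1:.0%}")
```

Under these assumptions, retaining 12% of the pairs leaves roughly 12% more genes than before the duplication, which is of the same order as the ~10% difference quoted on the slide.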

Long-term evolution of a duplicated genome • Evolution of gene pairs after duplication • Duplication permits the study of the evolution of the two paralogues with respect to each other and to the non-duplicated orthologue • Measure the amino acid substitution rate (encoded proteins) and the sequence divergence rate (regulatory sequences) • Pairs evolving at similar rates: majority (321 / 457) • Subtle functional changes in the gene pairs: e.g. gene expression • Pairs exhibiting accelerated protein divergence (76 / 457) • Accelerated evolution confined to only one of the two paralogues • the slowly evolving paralogue retained the ancestral gene function • the rapidly evolving paralogue probably acquired a derived function Reprinted from: Kellis et al., Nature 428, 617 - 624 (2004)

Conclusions • Whole genome duplication events are followed by • massive gene loss • gene specialization: neo-functionalization • Gene duplication provides the raw material for the evolution of novel functions (Ohno) • Whole genome duplication offers opportunities for coordinated evolution of genes • E.g. novel pathways or networks may evolve through concerted evolution of the different members • The consequence of genome duplications is genetic redundancy • Duplicated genes encode proteins with very similar sequences • Genetic redundancy complicates genetic analysis • Knock-out mutations in redundant genes often exhibit no phenotype, and hence escape genetic analysis

Sequencing and comparison of yeast species to identify genes and regulatory elements • Landmark paper on comparative genomics • Paper describes • high-quality draft genome sequences of three related Saccharomyces species • separated from S. cerevisiae by an estimated 5–20 million years of evolution • The 4 genome sequences were compared to • Confirm the predicted yeast gene models • Identify putative regulatory elements Kellis et al., Nature 423, 241 - 254 (2003)

Comparative analysis of the 4 yeast genomes • Large-scale alignment of genomic regions shows • Most ORFs have clear one-to-one matches defining blocks of conserved synteny across the 4 species • Figure legend: conserved ORF, novel ORF Reprinted from: Kellis et al., Nature 423, 241 - 254 (2003)

Genome evolution • Macro scale: • Chromosome rearrangements occur at low frequency • Reciprocal translocations: 0 to 5 • Inversions: 3 to 13 • Segmental duplications: 4 in one strain • Micro scale: • Nucleotide changes occur at high frequencies • SNPs: single nucleotide polymorphisms – nucleotide substitutions • Indels: insertions and deletions • Cumulative rate of nucleotide change: 30% to 67% • More frequent in intergenic regions than in genic regions Reprinted from: Kellis et al., Nature 423, 241 - 254 (2003)

Confirmation of Predicted Yeast Genes • Reading frame conservation test • Tests whether the aligned sequences in related species encode the same ORF • True protein-coding ORFs will be under strong selective pressure to preserve the open reading frame • Spurious ORFs will accumulate frameshifts and stop codons • Figure labels: frameshifts, conserved bases Reprinted from: Kellis et al., Nature 423, 241 - 254 (2003)
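The logic of a reading frame conservation test can be sketched on a pairwise alignment. This is a deliberately simplified illustration, not the scoring scheme of the cited paper: it assumes a pre-computed alignment with `-` as the gap character and reports the fraction of aligned bases that sit in the most common relative frame. The function name and the example sequences are hypothetical.

```python
from collections import Counter


def reading_frame_conservation(ref_aln: str, other_aln: str) -> float:
    """Fraction of aligned bases that share the dominant relative frame.

    Both arguments are rows of one pairwise alignment (equal length, '-'
    for gaps). An indel whose length is a multiple of 3 keeps downstream
    bases in frame; any other indel shifts the frame of what follows.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned sequences must have equal length")
    ref_pos = other_pos = 0
    frame_counts = Counter()
    for r, o in zip(ref_aln, other_aln):
        if r != "-" and o != "-":
            frame_counts[(ref_pos - other_pos) % 3] += 1
        if r != "-":
            ref_pos += 1
        if o != "-":
            other_pos += 1
    aligned = sum(frame_counts.values())
    if aligned == 0:
        return 0.0
    return max(frame_counts.values()) / aligned


ref = "ATGGCTGCTGCTTAA"
in_frame_deletion = "ATG---GCTGCTTAA"   # 3-base deletion keeps the frame
frameshift = "ATGG-TGCTGCTTAA"          # 1-base deletion shifts the frame

print(reading_frame_conservation(ref, in_frame_deletion))
print(reading_frame_conservation(ref, frameshift))
```

A real ORF is expected to score close to 1 against each related species, whereas a spurious ORF accumulates frameshifts and scores lower; the cut-off used to accept or reject an ORF is a separate choice that this sketch does not make.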

Updated Yeast Gene Catalogue • Public yeast gene catalogue • Initial annotation identified 5885 ORFs encoding >100 aa • SGD (May 2002): 6,062 ORFs encoding ≥100 amino acids • Reading frame conservation (RFC) test • validated 5,550 ORFs • rejected 367 ORFs, most of which are 'uncharacterized' ORFs • Updated yeast gene catalogue • 5,538 ORFs encoding proteins of >100 amino acids • 188 small ORFs encoding proteins of <100 amino acids • Ambiguous ORF matches • Most are clustered in telomeric regions • Rapid genome evolution in telomeric regions • local gene-family expansion or contraction Reprinted from: Kellis et al., Nature 423, 241 - 254 (2003)

Identification Of Regulatory Elements • Rationale to identify functional (regulatory) elements • functional elements should have a greater degree of sequence conservation than non-functional sequences • Nucleotide change is high enough to facilitate recognition of functional elements • Regulatory elements are typically • short (6–15 bp) • tolerate some degree of sequence variation • Known regulatory elements show strong conservation in the 4 genome sequences • Gal4-binding sites are perfectly conserved in the 4 species • Genome-wide motif discovery • Test the motif conservation score of all XYZn(0–21)UVW motifs • Discovered 72 motifs Reprinted from: Kellis et al., Nature 423, 241 - 254 (2003)
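A gapped motif of the form XYZn(0–21)UVW, two fixed triplets separated by a spacer of fixed length, can be scanned for and scored for conservation as sketched below. This is a minimal illustration that assumes a gap-free multiple alignment (the same index is the same position in every species) and uses a plain conservation rate rather than the motif conservation score of the cited paper. The well-known Gal4 consensus CGG-N11-CCG serves as the example; the sequences are made up.

```python
def motif_sites(seq: str, left: str, right: str, gap: int):
    """Start positions where `left`, `gap` arbitrary bases, then `right` occur."""
    span = len(left) + gap + len(right)
    return [i for i in range(len(seq) - span + 1)
            if seq.startswith(left, i)
            and seq.startswith(right, i + len(left) + gap)]


def conservation_rate(aligned, left: str, right: str, gap: int) -> float:
    """Fraction of reference motif sites present at the same position in all species.

    `aligned` is a list of equal-length, gap-free aligned sequences; the
    first entry is the reference species.
    """
    ref, others = aligned[0], aligned[1:]
    ref_sites = motif_sites(ref, left, right, gap)
    if not ref_sites:
        return 0.0
    other_sites = [set(motif_sites(o, left, right, gap)) for o in others]
    conserved = sum(all(i in sites for sites in other_sites) for i in ref_sites)
    return conserved / len(ref_sites)


# Made-up intergenic sequences with two Gal4-like sites (CGG-N11-CCG).
site = "CGG" + "ATATATATATA" + "CCG"
spacer_variant = "CGG" + "ATATCTATATA" + "CCG"   # spacer change is tolerated
broken = "CAG" + "ATATATATATA" + "CCG"           # fixed triplet mutated

reference = "TTT" + site + "TTTT" + site + "TT"
species_2 = "TTT" + spacer_variant + "TTTT" + site + "TT"
species_3 = "TTT" + site + "TTTT" + broken + "TT"

print(motif_sites(reference, "CGG", "CCG", 11))
print(conservation_rate([reference, species_2, species_3], "CGG", "CCG", 11))
```

A genome-wide search would repeat this for every choice of the two triplets and every spacer length from 0 to 21, and compare each motif's conservation against a background expectation.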

Conclusions • Despite the intensive study of S. cerevisiae • Comparative genome analysis resulted in a major revision of the yeast gene catalogue • affecting more than 15% of all ORFs • Comparative genome analysis of a modest collection of species can permit • Identification of regulatory elements, which is impossible using the genome of a single species • The power of this approach increases with the number of related species sequences available Reprinted from: Kellis et al., Nature 423, 241 - 254 (2003)

The genome sequence of Schizosaccharomyces pombe Wood et al., Nature 415, 871 (2002) • Paper presents • Genome sequence of S. pombe • S. pombe is a free living fungus that diverged from budding yeast ~330–420 Myr ago • S. pombe has been extensively studied since the 1950s • ~1,200 genes have been characterized (cf. budding yeast) • excellent model organism for the study of • Cell-cycle control, mitosis and meiosis • DNA repair and recombination Reprinted from: Wood et al., Nature 415, 871 (2002)

The genome sequence of S. pombe • The 13.8 Mb genome comprises • ~12.5 Mb of unique sequence, similar to S. cerevisiae • only 4,824 protein-coding genes • Similar to the yeast species having no genome duplication • 43% of the genes contain introns, 4,730 introns in total (S. cerevisiae: 4%) • Intergene regions are longer than in S. cerevisiae • possibly reflecting more-extended control regions • The centromeres are 35 kb to 110 kb • S. cerevisiae has ~120 bp centromeres • Why are the centromeres 300–1,000 times larger? • Different structures of the kinetochore? Reprinted from: Wood et al., Nature 415, 871 (2002)
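The ratios on this slide can be checked directly from the quoted figures. The calculation below only recombines numbers given in the deck (centromere sizes, gene and intron counts, unique sequence length); the variable names are illustrative.

```python
pombe_centromeres_bp = (35_000, 110_000)   # smallest and largest S. pombe centromere
cerevisiae_centromere_bp = 120             # approximate S. cerevisiae centromere

# How many times larger the S. pombe centromeres are.
fold = [size / cerevisiae_centromere_bp for size in pombe_centromeres_bp]

gene_count = 4_824
intron_count = 4_730
genes_with_introns = 0.43 * gene_count                   # 43% of the genes
introns_per_spliced_gene = intron_count / genes_with_introns

unique_sequence_kb = 12_500
kb_per_gene = unique_sequence_kb / gene_count            # gene density

print(f"centromeres are {fold[0]:.0f} to {fold[1]:.0f} times larger")
print(f"about {introns_per_spliced_gene:.1f} introns per intron-containing gene")
print(f"one gene per {kb_per_gene:.1f} kb of unique sequence")
```

The result of roughly 290- to 920-fold matches the "300–1,000 times" on the slide, and a density of about one gene per 2.6 kb is consistent with intergene regions being longer than in S. cerevisiae (about one gene per 2.1 kb).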

Centromeres contain mostly repetitive sequences Reprinted from: Wood et al., Nature 415, 871 (2002)
