Human Genome Sequence and Variability Gabor T. Marth, D.Sc. Department of Biology, Boston College marth@bc.edu Medical Genomics Course – Debrecen, Hungary, May 2006
Lecture overview
1. Genome sequencing strategies, sequencing informatics
2. Genome annotation, functional and structural features in the human genome
3. Genome variability, DNA nucleotide, structural, and epigenetic variations
The genome sequence • the primary template on which to outline functional features of our genetic code (genes, regulatory elements, secondary structure, tertiary structure, etc.)
Completed genomes (figure: genome sizes of ~1 Mb, ~100 Mb, >100 Mb, and ~3,000 Mb)
Main genome sequencing strategies
• Clone-based shotgun sequencing (Human Genome Project)
• Whole-genome shotgun sequencing (Celera Genomics, Inc.)
Hierarchical genome sequencing (Lander et al. Nature 2001)
1. BAC library construction
2. clone mapping
3. shotgun subclone library construction
4. sequencing/read processing
5. sequence reconstruction (sequence assembly)
Shotgun subclone library construction (figure labels: BAC primary clone in cloning vector; subclone insert in sequencing vector)
Robotic automation Lander et al. Nature 2001
Base calling PHRED base = A Q = 40
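The slide shows a base call of A with Q = 40. By the standard Phred definition, Q = -10 log10(P), where P is the estimated probability that the call is wrong, so Q = 40 corresponds to an error probability of 1 in 10,000. A minimal sketch of that conversion follows; the function names are illustrative and are not part of PHRED itself.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality value Q into the estimated error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an estimated error probability P into a Phred quality value Q."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q = {q}: P(error) = {phred_to_error_prob(q):.4g}")
```

Each step of 10 in Q lowers the estimated error probability by a factor of ten.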
Sequence assembly PHRAP
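To illustrate the idea behind sequence assembly, the toy sketch below merges reads by repeatedly joining the pair with the longest exact suffix-prefix overlap. This is a simplified greedy scheme chosen for illustration only. It is not PHRAP's algorithm, it ignores base qualities, sequencing errors and reverse-complement reads, and it can mis-join reads in repetitive sequence. The function names and the `min_len` parameter are assumptions.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (>= min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedily merge the best-overlapping pair until no overlap of min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    # Three overlapping toy reads (hypothetical data).
    reads = ["AGCGTGGTAG", "GGTAGCGCGA", "GCGCGAGTTT"]
    print(greedy_assemble(reads))
```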
Sequence completion (finishing)
• targets: regions of low sequence coverage and/or quality; gaps
• tools: CONSED, AUTOFINISH
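Finishing starts from knowing where the assembly is weak. The sketch below computes per-base read depth from read placements and reports stretches below a chosen depth. With a threshold of 1 it reports gaps. The interval representation (0-based, end-exclusive) and the `min_depth` parameter are assumptions for illustration, and this is not how CONSED or AUTOFINISH work internally.

```python
def low_coverage_regions(length, read_intervals, min_depth=2):
    """Return (start, end) stretches, end-exclusive, where read depth < min_depth."""
    depth = [0] * length
    for start, end in read_intervals:
        for pos in range(max(start, 0), min(end, length)):
            depth[pos] += 1

    regions = []
    region_start = None
    for pos, d in enumerate(depth):
        if d < min_depth and region_start is None:
            region_start = pos  # a weak stretch begins here
        elif d >= min_depth and region_start is not None:
            regions.append((region_start, pos))  # the weak stretch ended
            region_start = None
    if region_start is not None:
        regions.append((region_start, length))
    return regions


if __name__ == "__main__":
    reads = [(0, 8), (4, 12), (15, 20)]  # hypothetical read placements
    print("low coverage:", low_coverage_regions(20, reads, min_depth=2))
    print("gaps:", low_coverage_regions(20, reads, min_depth=1))
```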
Genome annotation – Goals
• repetitive elements
• protein coding genes
• RNA genes
• GC content
The starting material AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT AGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGT GCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGT AGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAG TCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTG GGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCT CGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTAT ATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATGCT GATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCT AGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGA AGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT
Coding genes – ab initio predictions
ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA
(figure labels: Start codon, Open Reading Frame = ORF, Stop codon, PolyA signal)
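The ORF idea on this slide can be sketched as a scan for a start codon (ATG) followed by in-frame codons up to the first stop codon (TAA, TAG or TGA). The sketch searches only the forward strand and ignores introns, so it illustrates the concept rather than a real gene finder. The function name and the `min_codons` parameter are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf) for forward-strand ORFs; end is exclusive."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                # Walk codon by codon until the first in-frame stop codon.
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:
                        if (j + 3 - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, seq[i:j + 3]))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs


if __name__ == "__main__":
    slide_seq = "ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA"
    for start, end, orf in find_orfs(slide_seq):
        print(start, end, orf)
```

On the slide's example sequence this reports one ORF that begins at the first ATG and ends with the in-frame TAG.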
Ab initio predictions Gene structure
Ab initio predictions
…AGAATAGGGCGCGTACCTTCCAACGAAGACTGGG…
(figure labels: splice acceptor site, splice donor site)
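Canonical introns begin with GT at the donor site and end with AG at the acceptor site. A first pass at locating candidate splice sites therefore lists every occurrence of those dinucleotides. The sketch below does only that. Real ab initio gene finders score the surrounding sequence context with statistical models, which this sketch does not attempt, and the function name is an assumption.

```python
def candidate_splice_sites(seq):
    """Return 0-based positions of GT (donor) and AG (acceptor) dinucleotides."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


if __name__ == "__main__":
    slide_seq = "AGAATAGGGCGCGTACCTTCCAACGAAGACTGGG"
    donors, acceptors = candidate_splice_sites(slide_seq)
    print("candidate donors (GT):", donors)
    print("candidate acceptors (AG):", acceptors)
```

Most such candidates are false positives, which is why the dinucleotide alone is never enough to call a splice site.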
Ab initio predictions
• gene finders: Genscan, Grail, Genie, GeneFinder, Glimmer, etc.
• spliced alignment of expressed sequences to the genome: EST_genome, Sim4, Spidey, EXALIN
Homology based predictions
• evidence: known coding sequence from another organism; expressed sequence
• figure sequences: ACGGAAGTCT, GGACTATAAA, ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA
• genes predicted by homology: Genomescan, Twinscan, etc.
Consolidation – gene prediction systems (figure labels: Sim4, dbEst, Genewise, Grail, Genscan, FgenesH; Ensembl, Otto)
ncRNA genes
• prediction based on structure (e.g. tRNAs)
• for other novel ncRNAs, only homology-based predictions have been successful
Repeat annotations
• repeat annotations are based on sequence similarity to known repetitive elements in a repeat sequence library
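The library-lookup idea can be sketched with exact string matching: every occurrence of a library element in the sequence is converted to lowercase ("soft masked"). Real repeat annotation relies on sensitive alignment that tolerates diverged copies, so exact matching is a deliberate simplification here. The library contents and names below are hypothetical.

```python
def soft_mask_repeats(seq, repeat_library):
    """Lowercase every exact occurrence of a library element in seq."""
    upper = seq.upper()
    masked = list(upper)
    for element in repeat_library.values():
        element = element.upper()
        start = upper.find(element)
        while start != -1:
            for pos in range(start, start + len(element)):
                masked[pos] = masked[pos].lower()
            # Step by one base so overlapping occurrences are also found.
            start = upper.find(element, start + 1)
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical two-entry repeat library.
    library = {"CT_run": "CTCT", "A_run": "AAAA"}
    print(soft_mask_repeats("GGCTCTCTAAAAAAGG", library))
```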
Gene annotations – # of coding genes Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Gene annotations – gene length Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Gene annotations – gene function Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
GC content and coding potential Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
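GC content is the fraction of G and C bases in a stretch of sequence, usually reported in windows along the genome. A minimal sketch follows; the window and step sizes are arbitrary assumptions, and ambiguous bases such as N are excluded from the denominator.

```python
def gc_content(seq):
    """Fraction of G/C among the unambiguous (A, C, G, T) bases of seq."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0


def windowed_gc(seq, window, step):
    """Return (window_start, gc_fraction) for each full window along seq."""
    return [(start, gc_content(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    for start, gc in windowed_gc("GGGGCCCCAAAATTTTGCGC", window=8, step=4):
        print(start, f"{gc:.2f}")
```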
ncRNAs Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Segmental duplications Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Repeat elements Lander et al. Initial sequencing and analysis of the human genome, Nature, 2001
Physical vs. genetic map (Mb/cM)
• genetic distances: 0.4 cM, 1.3 cM, 0.7 cM
• physical distances: 0.4 Mb, 0.7 Mb, 0.3 Mb
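The relationship between the two maps can be expressed per interval as a recombination rate in cM per Mb, or as its inverse in Mb per cM. The sketch below assumes that the three genetic distances and the three physical distances on the slide describe the same three intervals in the order listed. That pairing is an assumption, since the original figure is not available.

```python
def recombination_rates(genetic_cm, physical_mb):
    """Return cM/Mb for each interval, pairing the two lists in order."""
    if len(genetic_cm) != len(physical_mb):
        raise ValueError("need one physical distance per genetic distance")
    return [cm / mb for cm, mb in zip(genetic_cm, physical_mb)]


if __name__ == "__main__":
    # Values from the slide; the interval pairing is ASSUMED, not confirmed.
    genetic_cm = [0.4, 1.3, 0.7]
    physical_mb = [0.4, 0.7, 0.3]
    for cm, mb, rate in zip(genetic_cm, physical_mb,
                            recombination_rates(genetic_cm, physical_mb)):
        print(f"{cm} cM over {mb} Mb: {rate:.2f} cM/Mb ({1 / rate:.2f} Mb/cM)")
```

Under that assumed pairing the rate differs between intervals, which is the point of comparing the two maps: genetic distance is not a constant multiple of physical distance.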
DNA sequence variations
• the reference human genome sequence is 99.9% identical to the genome of any individual human
• sequence variations make our genetic makeup unique
• the most abundant human variations are single-nucleotide polymorphisms (SNPs); 10 million SNPs are currently known
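A 0.1% difference across a genome of roughly 3,000 Mb amounts to about 3 million differing bases per individual. At the level of two already aligned sequences of equal length, single-nucleotide differences can be listed by comparing them position by position, as in the sketch below. Real SNP discovery also has to handle alignment, base quality and sequencing error, which this sketch ignores, and the function name is an assumption.

```python
def single_nucleotide_differences(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [(pos, ref_base, alt_base)
            for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
            if ref_base != alt_base]


if __name__ == "__main__":
    # Hypothetical aligned sequences differing at one position.
    print(single_nucleotide_differences("ACGTACGT", "ACGTTCGT"))
```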
DNA sequence variations insertion-deletion (INDEL) polymorphisms
Structural variations Speicher & Carter, NRG 2005
Structural variations Feuk et al. Nature Reviews Genetics 7, 85–97 (February 2006) | doi:10.1038/nrg1767
Detection of structural variants Feuk et al. Nature Reviews Genetics 7, 85–97 (February 2006) | doi:10.1038/nrg1767
Epigenetic changes: chromatin structure Sproul, NRG 2005
Epigenetic changes: DNA methylation Laird, NRC 2003