The Evolution of the Reference Human Genome

The Evolution of the Reference Human Genome. Deanna M. Church Staff Scientist, NCBI. 6 May 2012. February 2001. Throughput: 500 Mb/year Cost: < $0.25 per base Variation: 100,000 SNPs mapped. Collins FS et al, 1998. 1999. 2000. 2005. 2011. 2010. Steve Sherry, NCBI.


Presentation Transcript


  1. The Evolution of the Reference Human Genome. Deanna M. Church, Staff Scientist, NCBI. 6 May 2012

  2. February 2001

  3. Throughput: 500 Mb/year. Cost: < $0.25 per base. Variation: 100,000 SNPs mapped. Collins FS et al, 1998

  4. 1999 2000 2005 2011 2010 Steve Sherry, NCBI

  5. Kidd et al, 2007 APOBEC cluster

  6. http://www.ncbi.nlm.nih.gov/dbvar

  7. How much sequence do you need? Lander and Waterman (1988) Genomics
     Assumptions: reads are randomly distributed; overlap between reads does not vary.
     Variables:
     G = haploid genome length in bp
     L = sequence read length in bp
     N = number of reads sequenced
     T = amount of overlap needed for detection in bp
     C = coverage, C = LN/G
     y = number of events in an interval
     λ = mean number of events in an interval
     Poisson distribution: $P(Y = y) = \lambda^{y} e^{-\lambda} / y!$
     For sequence calculations, coverage can be viewed as λ. Using this equation, you can calculate the probability that a base has been sequenced y times. By manipulating this formula, you can estimate the number of gaps for any given level of coverage.
     http://www.genetics.wustl.edu/bio5488/lecture_notes_2005/Lander.htm
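
The coverage and Poisson formulas on this slide can be sketched in a few lines of Python. This is a minimal illustration and not part of the original talk. The function names are hypothetical, and the gap estimate uses the standard Lander-Waterman expression N·e^(−C(1 − T/L)) for the expected number of contigs, which is assumed here to stand in for the number of gaps.

```python
import math


def coverage(read_len, n_reads, genome_len):
    """Coverage C = L * N / G, using the slide's variable names."""
    return read_len * n_reads / genome_len


def prob_depth(y, lam):
    """Poisson probability that a base is sequenced exactly y times (mean lam)."""
    return lam ** y * math.exp(-lam) / math.factorial(y)


def expected_gaps(read_len, n_reads, genome_len, min_overlap=0):
    """Lander-Waterman estimate of the number of contigs (roughly, gaps).

    Assumption: gaps ~ N * exp(-C * (1 - T / L)), where T is the overlap
    needed to detect that two reads join.
    """
    c = coverage(read_len, n_reads, genome_len)
    sigma = 1 - min_overlap / read_len
    return n_reads * math.exp(-c * sigma)


# Illustrative numbers only (not from the slide): 500 bp reads, 3 Gb genome.
c = coverage(500, 48_000_000, 3_000_000_000)
print("coverage:", c)
print("P(base never sequenced):", prob_depth(0, c))
print("expected gaps:", expected_gaps(500, 48_000_000, 3_000_000_000, min_overlap=50))
```

Setting y = 0 gives the fraction of bases expected to be missed, e^(−C), which is why the number of gaps falls off exponentially with coverage but never reaches zero.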

  8. 2009 Sanger cost: shotgun sequence ~ $0.01/base; finished sequence ~ $0.03/base. This clone: shotgun = $1500, finish = $3000
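
The per-clone figures follow from the per-base rates if the clone is about 150 kb. The slide does not state the clone length, so that value is an assumption chosen because it reproduces the slide's numbers. The sketch also assumes that "Finish" means the extra cost on top of shotgun sequencing. Costs are kept in integer cents to avoid floating-point rounding.

```python
SHOTGUN_CENTS_PER_BASE = 1    # ~$0.01/base, from the slide
FINISHED_CENTS_PER_BASE = 3   # ~$0.03/base, from the slide
CLONE_LENGTH_BP = 150_000     # ASSUMPTION: clone length is not given on the slide


def clone_costs(length_bp):
    """Return (shotgun_dollars, extra_finishing_dollars) for one clone."""
    shotgun = SHOTGUN_CENTS_PER_BASE * length_bp // 100
    finished_total = FINISHED_CENTS_PER_BASE * length_bp // 100
    # Assumption: the finishing cost is the finished total minus the shotgun cost.
    return shotgun, finished_total - shotgun


print(clone_costs(CLONE_LENGTH_BP))
```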

  9. Captured gap = no sequence, but a clone spans the gap. Uncaptured gap = no sequence, no clone spanning gap. Bob Blakesly, NISC

  10. Sanger sequencing: not all bases are created equal. Phred quality scores measure the probability that a base is incorrect. If a base has a 1/1000 probability of being incorrect, it has a Phred score of 30. [Figure: quality scores by base; axis mark at 20.]
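
The relationship behind the 1/1000 example is the standard Phred definition, Q = −10·log10(P). A minimal sketch of the conversion in both directions follows; the function names are hypothetical.

```python
import math


def phred_from_error_prob(p):
    """Phred score Q = -10 * log10(P), where P is the probability the base is wrong."""
    return -10 * math.log10(p)


def error_prob_from_phred(q):
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    print(q, error_prob_from_phred(q))
```

On this scale, every 10 points corresponds to a tenfold drop in error probability, so Q20 is a 1/100 chance of error and Q30 is 1/1000.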

  11. Genome Research, May, 1997

  12. Clone based assemblies: BAC insert in a BAC vector. Shotgun sequence, then assemble, leaving gaps. Deeper sequence coverage rarely resolves all gaps. [Figure: gaps vs. fold sequence.] “Finishers” go in to manually fill the gaps, often by PCR.

  13. WGS: Sanger reads. Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb. End-sequence all clones and retain pairing information (“mate-pairs”). Each end sequence is referred to as a read. Find sequence overlaps. [Figure labels: tails, WGS contig, scaffold.]

  14. Schatz et al, 2010

  15. De Bruijn graph representations: error free, no repeat, no polymorphism; sequencing error (short reads); repeat > kmer length; SNP, variant, long read error; structural variant, inversion; structural variant, deletion… Ewan Birney, EBI

  16. De Bruijn graph representations Ewan Birney, EBI
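
To make the graph cases on these two slides concrete, here is a minimal sketch of de Bruijn graph construction from reads. It is a toy illustration and not the representation used by any particular assembler. The function names and the example reads are hypothetical. Nodes are (k−1)-mers and each k-mer contributes one edge.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor: where errors, SNPs, or repeats show up."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Error free, no repeat, no polymorphism: a simple path with no branches.
clean = de_bruijn_edges(["ACGTTGCA"], k=4)
print(branching_nodes(clean))

# Adding a read with a single-base difference opens a "bubble" in the graph.
with_snp = de_bruijn_edges(["ACGTTGCA", "ACGTAGCA"], k=4)
print(branching_nodes(with_snp))
```

In the second graph the two paths split at one node and rejoin at a later one, which is the bubble shape associated with a SNP or a sequencing error.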

  17. How the genome is fragmented determines the map type: meiosis gives a genetic map, radiation gives an RH map, and enzyme digestion gives a clone based map. Each line represents an individual cell line/animal that carries a particular break. STSs can be amplified from DNA in these cell lines/animals. Based on cell line/animal marker content, the breaks can be determined and the markers ordered.

  18. Electronic PCR (e-PCR). STS marker D6S1606, a microsatellite repeat.
     Marker strand (labeled "forward primer"): GAGTTTGCACCATTGCACTCCAGCCTGGGCAACAAGAGTGAAACTCTGTCACAGA (CA)n AACGTGGCATGTGCCTGTACTCTC
     Complementary strand (labeled "reverse primer"): CTCAAACGTGGTAACGTGAGGTCGGACCCGTTGTTCTCACTTTGAGACAGTGTCT (GT)n TTGCACCGTACACGGACATGAGAG
     PCR product size: 92 - 100 bases.
     E-PCR software searches DNA sequences for exact matches to both primers in the correct order, orientation, and spacing to be consistent with the known PCR product size. Schuler (1997), Genome Research 7, 541-550
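
The search described on this slide can be sketched as follows. This is a simplified illustration, not the NCBI e-PCR implementation: it checks one strand of the template only, requires exact matches, and uses made-up primers and a made-up template. The real primer boundaries for D6S1606 are not given in the transcript, so they are not used here.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def epcr(template, fwd_primer, rev_primer, min_size, max_size):
    """Return (start, product_size) for each exact primer-pair hit.

    A hit needs the forward primer, then the reverse complement of the
    reverse primer downstream of it, with a product size in [min_size, max_size].
    """
    rev_site = reverse_complement(rev_primer)
    hits = []
    start = template.find(fwd_primer)
    while start != -1:
        end = template.find(rev_site, start + len(fwd_primer))
        while end != -1:
            size = end + len(rev_site) - start
            if min_size <= size <= max_size:
                hits.append((start, size))
            end = template.find(rev_site, end + 1)
        start = template.find(fwd_primer, start + 1)
    return hits


# Hypothetical toy marker: primers flanking a (CA)5 repeat.
template = "TT" + "GATTACA" + "CA" * 5 + "GGCCTA" + "AA"
print(epcr(template, "GATTACA", "TAGGCC", min_size=20, max_size=26))
```

Allowing a size range instead of a single length matters for microsatellite markers, because the repeat count, and therefore the product size, varies between individuals.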

  19. Electronic PCR (e-PCR) http://www.ncbi.nlm.nih.gov/sutils/e-pcr/

  20. Putting genomes together. Ideally… [Figure: markers A through O ordered consistently between the sequence and a non-sequence based map, with one segment (N) marked "flip".]

  21. Putting genomes together. More like… [Figure: markers A through O with inconsistent placements across maps, additional markers V through Z, and a "?".]

  22. Sequence vs. non-sequence based maps: WI Genetic, WI/MRC RH. Mmu7

  23. Human PANTHER classifications (biological process): enrichment, observed vs. expected. [Figure: bar chart over the categories select regulatory molecule, other transcription factor, nucleic acid binding, G-protein modulator, extracellular matrix, ribosomal protein, protein kinase, unclassified, hydrolase, chemokine, oxygenase, kinase, apolipoprotein, oxidoreductase, structural protein, cytokine receptor, cysteine protease, transcription factor, signaling molecule, intermediate filament, miscellaneous function, cell adhesion molecule, other cytokine receptor, defense/immunity protein, cysteine protease inhibitor, other cell adhesion molecule, zinc finger transcription factor, KRAB box transcription factor, tumor necrosis factor receptor, CAM family adhesion molecule, immunoglobulin receptor family member, major histocompatibility complex antigen.] Evan Eichler, University of Washington

  24. Church et al., 2009 Leo Goodstadt

  25. Mouse has an increased amount of intra-chromosomal duplication when compared to human. [Figure: inter-chromosomal, intra-chromosomal, and both, for mouse and human.]

  26. Snyder et al., 2010

  27. Martin Shumway 5.E+13

  28. YH1 (BGI) KB1 (BGI) Craig, part1 Craig, Part 2 Broad NA12878

  29. Scaffold N50s
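
For readers unfamiliar with the statistic, N50 is the length such that scaffolds of that length or longer contain at least half of the total assembled bases. A minimal sketch follows; the function name and the example lengths are hypothetical, and this is not data from the slide.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```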

  30. Gap Counts

  31. http://genomereference.org

  32. Gaps Closed Issues 5 July 2011 Open Issues

  33. Large-Scale Variation Complicates Genome Assembly. Sequences from haplotype 1; sequences from haplotype 2. Old assembly model: compress into a consensus. New assembly model: represent both haplotypes.

  34. GRCh37 (hg19): MAPT, UGT2B17, MHC. 7 alternate haplotypes at the MHC. Alternate loci released as: FASTA, AGP, alignment to chromosome. http://genomereference.org

  35. Assembly (e.g. GRCh37): Primary Assembly; non-nuclear assembly unit (e.g. MT); ALT 1 through ALT 9; PAR; genomic regions (UGT2B17, MHC, MAPT).

  36. NCBI36 (hg18)

  37. NCBI36 NC_000004.10 (chr4) Tiling Path TMPRSS11E TMPRSS11E2 TMPRSS11E TMPRSS11E GRCh37 NC_000004.11 (chr4) Tiling Path AC147055.2 AC079749.5 AC021146.7 AC134921.1 AC074378.4 AC093720.2 AC079749.5 AC147055.2 AC019173.4 AC021146.7 AC134921.2 AC140484.1 AC093720.2 AC074378.4 GRCh37: NT_167250.1 (UGT2B17 alternate locus) AC021146.7 AC019173.4 AC074378.4 AC226496.2 AC140484.1 Xue Y et al, 2008

  38. H1 H2 17q deletion Zody et al, 2008
