The Evolution of the Reference Human Genome

Deanna M. Church, Staff Scientist, NCBI
6 May 2012

February 2001

Throughput: 500 Mb/year
Cost: < $0.25 per base
Variation: 100,000 SNPs mapped
Collins FS et al., 1998

[Figure: panels labeled 1999, 2000, 2005, 2010, 2011]
Steve Sherry, NCBI

Kidd et al, 2007 APOBEC cluster

http://www.ncbi.nlm.nih.gov/dbvar

How much sequence do you need?
Lander and Waterman (1988) Genomics
Assumptions:
Reads are randomly distributed
Overlap between reads does not vary
Variables:
G = haploid genome length in bp
L = sequence read length in bp
N = number of reads sequenced
T = amount of overlap needed for detection in bp
C = coverage, $C = LN/G$
Poisson distribution:
y = number of events in an interval
$\lambda$ = mean number of events in an interval
$P(Y=y) = \frac{\lambda^{y} e^{-\lambda}}{y!}$
For sequence calculations, coverage can be viewed as $\lambda$. Using this equation, you can calculate the probability that a base has been sequenced y times. By manipulating this formula, you can estimate the number of gaps for any given level of coverage.
http://www.genetics.wustl.edu/bio5488/lecture_notes_2005/Lander.htm
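The coverage arithmetic above can be sketched in a few lines of Python. This is a minimal sketch, not code from the talk: the function names and the numbers in the usage note are illustrative assumptions. The gap estimate uses the Lander-Waterman expectation N·e^(-C(1 - T/L)) for the number of contigs, which is approximately the number of gaps between them.

```python
import math


def coverage(read_len, n_reads, genome_len):
    """Fold coverage C = L*N/G."""
    return read_len * n_reads / genome_len


def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson distribution with mean lam.

    With lam set to the coverage C, this is the probability that a
    given base has been sequenced exactly y times.
    """
    return lam ** y * math.exp(-lam) / math.factorial(y)


def expected_contigs(n_reads, read_len, genome_len, min_overlap=0):
    """Lander-Waterman expected number of contigs: N * exp(-C * (1 - T/L)).

    min_overlap is T, the overlap needed to detect that two reads join.
    The number of gaps is roughly the number of contigs.
    """
    c = coverage(read_len, n_reads, genome_len)
    sigma = 1 - min_overlap / read_len
    return n_reads * math.exp(-c * sigma)
```

As a hypothetical example, `coverage(500, 48_000_000, 3_000_000_000)` gives 8-fold coverage, and `poisson_pmf(0, 8.0)` is e^-8, so under this model about 0.03% of bases would remain unsequenced.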

2009 Sanger cost:
shotgun sequence ~ $0.01/base
finished sequence ~ $0.03/base
This clone: Shotgun = $1500, Finish = $3000

Captured gap = no sequence, but a clone spans the gap
Uncaptured gap = no sequence, no clone spanning the gap
Bob Blakesley, NISC

Sanger Sequencing: not all bases are created equal
Phred quality scores measure the probability that a base is incorrect. If a base has a 1/1000 probability of being incorrect, it has a Phred score of 30.
[Figure: quality score (y-axis, with 20 marked) by base (x-axis)]
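The relationship between error probability and Phred score stated above is Q = -10·log10(P). The following is a minimal sketch of that conversion in both directions; the function names are illustrative assumptions, not part of any Phred software.

```python
import math


def phred_from_error_prob(p):
    """Phred score Q = -10 * log10(P) for an error probability 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def error_prob_from_phred(q):
    """Invert the Phred definition: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)
```

A 1/1000 error probability maps to a score of 30, as on the slide, and a score of 20 corresponds to a 1/100 error probability.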

Genome Research, May, 1997

Clone based assemblies
BAC insert, BAC vector
Shotgun sequence, then assemble
[Plot: gaps vs. fold sequence] Deeper sequence coverage rarely resolves all gaps
GAPS: “finishers” go in to manually fill the gaps, often by PCR

WGS: Sanger Reads
Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb
End-sequence all clones and retain pairing information (“mate-pairs”)
Each end sequence is referred to as a read
Find sequence overlaps
[Diagram labels: tails, WGS contig, scaffold]

Schatz et al, 2010

De Bruijn graph representations
Error free, no repeat, no polymorphism
Sequencing error (short reads)
Repeat > kmer length
SNP, variant, long read error
Structural variant, inversion
Structural variant, deletion…
Ewan Birney, EBI

De Bruijn graph representations Ewan Birney, EBI
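To make the graph shapes listed above concrete, here is a minimal sketch of building a de Bruijn graph from reads. It is a toy construction under stated assumptions, not the representation used by any particular assembler: nodes are (k-1)-mers, each k-mer in a read adds one edge, and the function names and example reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def branch_nodes(graph):
    """Nodes with more than one successor, i.e. where the path forks."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)
```

With a single error-free, repeat-free read such as `ACGTTGCA` and `k=4`, the graph is a simple path with no branch nodes. Adding a second read that differs at one base, `ACGTAGCA`, makes node `CGT` fork into two paths that rejoin at `GCA`, which is the bubble shape associated with a SNP or a sequencing error.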

How the genome is fragmented, and the resulting map:
meiosis - genetic
radiation - RH
enzyme - clone based
• Each line represents an individual cell line/animal that carries a particular break.
- STSs can be amplified from DNA in these cell lines/animals.
- Based on cell line/animal marker content, the breaks can be determined and the markers ordered.

Electronic PCR (e-PCR)
STS marker D6S1606 (microsatellite repeat)
forward primer
GAGTTTGCACCATTGCACTCCAGCCTGGGCAACAAGAGTGAAACTCTGTCACAGA (CA)n AACGTGGCATGTGCCTGTACTCTC
CTCAAACGTGGTAACGTGAGGTCGGACCCGTTGTTCTCACTTTGAGACAGTGTCT (GT)n TTGCACCGTACACGGACATGAGAG
reverse primer
PCR product size: 92 - 100 bases
E-PCR software searches DNA sequences for exact matches to both primers in the correct order, orientation, and spacing to be consistent with the known PCR product size.
Schuler (1997), Genome Research 7, 541-550
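The search described above can be sketched as follows. This is a simplified illustration, not the e-PCR implementation: it scans only the given strand of the template, requires exact primer matches, and the function names and the toy sequences in the usage note are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def epcr_hits(template, fwd_primer, rev_primer, min_size, max_size):
    """Find (start, product_size) pairs where both primers match exactly.

    The forward primer must occur on the template strand, followed
    downstream by the reverse complement of the reverse primer, with
    the implied product size inside [min_size, max_size].
    """
    rev_site = reverse_complement(rev_primer)
    hits = []
    start = template.find(fwd_primer)
    while start != -1:
        # Look for the reverse primer site downstream of the forward primer.
        pos = template.find(rev_site, start + len(fwd_primer))
        while pos != -1:
            size = pos + len(rev_site) - start
            if min_size <= size <= max_size:
                hits.append((start, size))
            pos = template.find(rev_site, pos + 1)
        start = template.find(fwd_primer, start + 1)
    return hits
```

For a toy template `GGACGTACCACACACATTGACCAA` with forward primer `ACGTAC` and reverse primer `GGTCAA`, a size window of 18 to 22 bases reports one hit starting at offset 2 with a 20-base product; a window of 25 to 30 reports none, because the spacing is inconsistent with the expected product size.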

Electronic PCR (e-PCR) http://www.ncbi.nlm.nih.gov/sutils/e-pcr/

Putting genomes together: ideally…
[Diagram: overlapping sequence segments carrying markers A through O are ordered against a non-sequence based map; one segment is flipped to match the map]

Putting genomes together: more like…
[Diagram: markers A through O, with conflicting markers (V through Z) and an unresolved placement marked "?"]

Sequence vs. non-sequence based maps (Mmu7)
Maps compared: WI Genetic, WI/MRC RH

[Figure: enrichment, observed vs. expected, by Human PANTHER classification (biological process)]
Categories shown: Select regulatory molecule, Other transcription factor, Nucleic acid binding, G-protein modulator, Extracellular matrix, Ribosomal protein, Protein kinase, Unclassified, Hydrolase, Chemokine, Oxygenase, Kinase, Apolipoprotein, Oxidoreductase, Structural protein, Cytokine receptor, Cysteine protease, Transcription factor, Signaling molecule, Intermediate filament, Miscellaneous function, Cell adhesion molecule, Other cytokine receptor, Defense/immunity protein, Cysteine protease inhibitor, Other cell adhesion molecule, Zinc finger transcription factor, KRAB box transcription factor, Tumor necrosis factor receptor, CAM family adhesion molecule, Immunoglobulin receptor family member, Major histocompatibility complex antigen
Evan Eichler, University of Washington

Church et al., 2009 Leo Goodstadt

[Figure: duplications classed as inter-chromosomal, intra-chromosomal, or both, in mouse and human]
Mouse has an increased amount of intra-chromosomal duplication when compared to human

Snyder et al., 2010

Martin Shumway 5.E+13

YH1 (BGI); KB1 (BGI); Craig, part 1; Craig, part 2; Broad NA12878

Scaffold N50s
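For readers unfamiliar with the statistic named in this slide title: N50 is the length such that scaffolds of that length or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows; the function name and the example lengths are illustrative assumptions, not data from the talk.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Walk the lengths from longest to shortest and return the length at
    which the running total first reaches half of the total bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")
```

For hypothetical scaffold lengths of 80, 70, 50, 40, 30 and 20 kb (290 kb in total), the two longest scaffolds already hold 150 kb, so the N50 is 70 kb.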

Gap Counts

http://genomereference.org

Gaps Closed Issues 5 July 2011 Open Issues

Large-Scale Variation Complicates Genome Assembly
Sequences from haplotype 1; sequences from haplotype 2
Old assembly model: compress into a consensus
New assembly model: represent both haplotypes

GRCh37 (hg19)
Alternate loci: MAPT, UGT2B17, MHC (7 alternate haplotypes at the MHC)
Alternate loci released as: FASTA, AGP, alignment to chromosome
http://genomereference.org

[Diagram: assembly model]
Assembly (e.g. GRCh37)
Assembly units: Primary Assembly; Non-nuclear assembly unit (e.g. MT); ALT 1 through ALT 9
Regions: PAR; Genomic Region (MHC); Genomic Region (UGT2B17); Genomic Region (MAPT)

NCBI36 (hg18)

NCBI36 NC_000004.10 (chr4) Tiling Path
Genes: TMPRSS11E, TMPRSS11E2, TMPRSS11E, TMPRSS11E
GRCh37 NC_000004.11 (chr4) Tiling Path
AC147055.2, AC079749.5, AC021146.7, AC134921.1, AC074378.4, AC093720.2
AC079749.5, AC147055.2, AC019173.4, AC021146.7, AC134921.2, AC140484.1, AC093720.2, AC074378.4
GRCh37: NT_167250.1 (UGT2B17 alternate locus)
AC021146.7, AC019173.4, AC074378.4, AC226496.2, AC140484.1
Xue Y et al., 2008

H1 H2 17q deletion Zody et al, 2008
