The Human Reference Assembly: Updating the assembly. Deanna M. Church, Staff Scientist, NCBI. Short Course in Medical Genetics 2013. @deannachurch
Updating the assembly Oh No! Not a new version of the human genome!
GRCh37.p13 (160 regions: >3% of chromosomes)
• 120 Fix PATCHES: Chromosome update in GRCh38 (adds >5 Mb of novel sequence to the assembly)
• 71 Novel PATCHES: Additional sequence added (adds >800 Kb of novel sequence to the assembly)
• Summer of 2013
• Releasing patches quarterly
Data Model (figure). Labels: Assembly (e.g. GRCh37.p5); Primary Assembly; PAR; Non-nuclear assembly unit (e.g. MT); ALT 1 to ALT 9; Patches. Genomic regions shown: UGT2B17, PECAM1, MHC, MAPT, ABO, SMA, …
Data Model (figure, with GenBank/RefSeq accessions). Labels: Primary Assembly; Non-nuclear assembly unit (e.g. MT); ALT 1 to ALT 9; Patches. Accession pairs shown: GCA_000001405.6/GCF_000001405.17; GCA_000001345.1/GCF_000001345.1; GCA_000001305.1/GCF_000001305.13; GCA_000001355.1/GCF_000001355.1; GCA_000006015.1/GCF_000006015.1; GCA_000001365.1/GCF_000001365.2; GCA_000001375.1/GCF_000001375.1; GCA_000001315.1/GCF_000001315.1; GCA_000001385.1/GCF_000001385.1; GCA_000001325.1/GCF_000001325.2; GCA_000001395.1/GCF_000001395.1; GCA_000001335.1/GCF_000001335.1; GCA_000005045.5/GCF_000005045.4 (Patches).
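The data model on these slides can be sketched in a few lines of Python. This is a minimal illustration, not NCBI's schema: the class name `Assembly`, its fields, and the region-to-unit placements below are assumptions made for the example, since the slides do not say which ALT unit carries which region.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Assembly:
    name: str
    # Assembly-unit names, e.g. "Primary Assembly", "ALT 1", "Patches".
    units: List[str] = field(default_factory=list)
    # Region name -> non-primary units carrying a sequence for that region.
    placements: Dict[str, List[str]] = field(default_factory=dict)

    def add_placement(self, region: str, unit: str) -> None:
        """Record that `unit` holds an alternate sequence for `region`."""
        if unit not in self.units:
            raise ValueError(f"unknown assembly unit: {unit}")
        self.placements.setdefault(region, []).append(unit)

    def representations(self, region: str) -> List[str]:
        """Units that hold a sequence for the region, primary first."""
        return ["Primary Assembly"] + self.placements.get(region, [])


grch37 = Assembly(
    name="GRCh37.p5",
    units=["Primary Assembly", "Non-nuclear (MT)", "Patches"]
    + [f"ALT {i}" for i in range(1, 10)],
)

# Hypothetical placements, for illustration only.
grch37.add_placement("MHC", "ALT 1")
grch37.add_placement("MHC", "ALT 2")
grch37.add_placement("ABO", "Patches")

print(grch37.representations("MHC"))
```

The point of the sketch is that a region can have more than one sequence representation, so a coordinate only makes sense together with the unit it belongs to.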
GRCh38 is coming (September, 2013)
Why does missing sequence matter?
Sample genome: Duplicon A, Duplicon B
GRCh37: Duplicon A only
• G>A (allelic difference: true variant)
• G>C (paralogous sequence variant: false positive)
May or may not detect increased coverage, depending on sequencing depth and library quality (easier to find with new technologies than with old, low-throughput technologies)
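A toy Python example can make the false-positive mechanism concrete. It is a sketch under strong assumptions: the 12 bp sequences are invented, each "read" spans a whole duplicon, and reads are placed by fewest mismatches rather than by a real aligner.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def call_differences(reads, reference):
    """Place each read on its best-matching reference sequence and report
    every mismatch as (sequence name, position, ref base, read base)."""
    calls = set()
    for read in reads:
        name = min(reference, key=lambda n: hamming(read, reference[n]))
        for pos, (ref_base, read_base) in enumerate(zip(reference[name], read)):
            if ref_base != read_base:
                calls.add((name, pos, ref_base, read_base))
    return calls


# Hypothetical duplicons; they differ only at index 8 (G in A, C in B).
DUP_A = "ACGTGACCGTTA"
DUP_B = "ACGTGACCCTTA"

# The sample carries both copies; its copy of A has a real G>A change at index 2.
sample_reads = ["ACATGACCGTTA", DUP_B]

# Reference that lacks duplicon B versus one that includes it.
missing_b = call_differences(sample_reads, {"dupA": DUP_A})
with_b = call_differences(sample_reads, {"dupA": DUP_A, "dupB": DUP_B})

print(sorted(missing_b))
print(sorted(with_b))
```

When duplicon B is absent from the reference, the read from B is forced onto A and its paralogous difference shows up as a G>C call next to the true G>A variant. Once B is present, only the G>A call remains.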
CDC27 1KG Phase 1 Strict accessibility mask SNP (all) SNP (not 1KG) http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
Part of chr22 assembly Alternate locus for chr22 White: Insertion Black: Deletion Kidd et al, 2007 APOBEC cluster
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320. Figure labels: 129S6/SvEvTac tiling path; Alignment to C57BL/6J chr1; + 32 Kb in 129S6/SvEvTac; B6 Genes; 129S6/SvEvTac Genes
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320 NM_031192.3: transcript from C57BL/6J NM_031193.2: transcript from FVB/N 129S6/SvEvTac Alt Locus Alignment (allelic) FVB/N Transcript Alignment (paralog)
Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320 NM_031192.3: transcript from C57BL/6J NM_031193.2: transcript from FVB/N 129S6/SvEvTac Ren1 FVB Ren2 Tx Paralogous diff SNP + Paralogous diff
Doggett et al., 2006. Hydin: chr16 (16q22.2). Hydin2: chr1 (1q21.1). Missing in NCBI35/NCBI36; Unlocalized in GRCh37; Finished in GRCh38. Alignment to Hydin2 Genomic: 300 Kb, 99.4% ID. Alignment to Hydin1 CHM1_1.0: >99.9% ID
1q32 1q21 1p21 1p21 patch alignment to chromosome 1 Dennis et al., 2012
GRCh37 (current reference assembly) chrX Preview of GRCh38 (scheduled Fall 2013) TEX28 TKTL1 LOC101060233 LOC101060234 (opsin related) (TEX28 related)
GRCh37 (hg19) NCBI36 (hg18) http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-21
AL139246.20 NCBI35 (hg17) GRCh37 (hg19) AL139246.21
A = 0.000 G=1.000 rs4732519
RP11 WGS reads Private RP11 variant? Missing in 1000G? rs4732519
FAM23_MRC1 Region, chr10 Segmental Duplications 1KG accessibility Mask Novel Patch 250 kb of artificial duplication
Adding Novel Sequence Karen Hayden and Jim Kent
Human Resolved for GRCh38 http://genomereference.org
MHC Alternate locus Alignment to chr6 Richa Agarwala
Making the assembly accessible to existing tools: masking Query set: 439,109,084 NA12878 HiSeq reads
Masking effectively blocks alignments in regions with high identity
• Simulated reads from GRCh37.p9
  • Unpaired reads
  • 101 bp
  • 1x coverage
  • Default wgsim parameters
• Masking parameters
  • Percent Id: 100%
  • Step size: 5 bp
  • Minimum length: 101 bp
• Center SNPs in unmasked regions
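One way to read the masking parameters above is as a sliding-window scan over an alt or patch sequence aligned to the primary assembly. The Python below is a hedged sketch of that reading, not the procedure actually used for these slides: the function name `mask_identical_windows` is hypothetical, and it assumes a gapless, equal-length alignment, which real alt-locus alignments are not.

```python
def mask_identical_windows(primary, alt, window=101, step=5):
    """Replace with N every window of `alt` that is 100% identical to
    the aligned window of `primary`; windows start every `step` bases."""
    if len(primary) != len(alt):
        raise ValueError("sketch assumes a gapless, equal-length alignment")
    masked = list(alt)
    for start in range(0, len(alt) - window + 1, step):
        end = start + window
        # Compare the original sequences, not the partially masked copy.
        if primary[start:end] == alt[start:end]:
            masked[start:end] = "N" * window
    return "".join(masked)


# Tiny invented example with a single difference at index 20.
primary_seq = "A" * 20 + "C" + "A" * 20
alt_seq = "A" * 20 + "G" + "A" * 20

masked_alt = mask_identical_windows(primary_seq, alt_seq, window=10, step=5)
print(masked_alt)
```

Short window and step values are used here only so the result is readable. The stretch around the differing base stays unmasked, so reads that carry the alt-specific base still have somewhere unique to align, while the identical flanks no longer compete with the primary assembly.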
Masking improves alignments in regions with alternate loci or patches
Masking effectively reduces the increase in NA12878 reads with MAPQ=0 alignments that occurs when the full assembly is used as an alignment substrate.
NA12878 reads whose best alignment was on an alt/patch in the masked assembly were evaluated for their alignment location when aligned to the primary assembly alone.
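The MAPQ=0 comparison can be reproduced in outline by counting the fifth column of SAM records. The following Python is a minimal sketch: the function name `mapq0_fraction` and the example records (including the sequence name `chr6_alt`) are made up for illustration, and it skips unmapped, secondary, and supplementary records so that each read is counted once.

```python
def mapq0_fraction(sam_lines):
    """Fraction of primary mapped alignments whose MAPQ is 0."""
    mapped = zero = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        # 0x4 unmapped, 0x100 secondary, 0x800 supplementary
        if flag & 0x904:
            continue
        mapped += 1
        if mapq == 0:
            zero += 1
    return zero / mapped if mapped else 0.0


# Invented records: one unique hit, two MAPQ=0 hits, one unmapped read.
example_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr6\t100\t60\t101M\t*\t0\t0\t*\t*",
    "r2\t0\tchr6\t200\t0\t101M\t*\t0\t0\t*\t*",
    "r3\t16\tchr6_alt\t300\t0\t101M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

print(mapq0_fraction(example_sam))
```

Running the same function on alignments to the primary assembly alone, the full assembly, and the masked full assembly would give the three numbers being compared on this slide.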
Take home messages
• The assembly you use for analysis is an important part of your analysis package.
• The reference assembly is not a set of linear sequences but can now represent allelic diversity.
• Tools still need to catch up.
• The human reference assembly is updating soon! (Remember: assemblies are not static if you are lucky!)