The Human Reference Assembly

Deanna M. Church, Staff Scientist, NCBI. Short Course in Medical Genetics 2013. @deannachurch.

Presentation Transcript


  1. The Human Reference Assembly: Updating the assembly. Deanna M. Church, Staff Scientist, NCBI. Short Course in Medical Genetics 2013. @deannachurch

  2. Updating the assembly Oh No! Not a new version of the human genome!

  3. Updating the assembly

  4. GRCh37.p13 (160 regions: >3% of chromosomes). 120 Fix PATCHES: chromosome update in GRCh38 (adds >5 Mb of novel sequence to the assembly). 71 Novel PATCHES: additional sequence added (adds >800K of novel sequence to the assembly). Summer of 2013. Releasing patches quarterly.

  5. Data Model: Assembly (e.g. GRCh37.p5); Primary Assembly; Non-nuclear assembly unit (e.g. MT); PAR; ALT 1 to ALT 9; Patches. Genomic Regions: UGT2B17, PECAM1, MHC, MAPT, ABO, SMA, …

  6. Data Model with assembly accessions (GCA/GCF pairs, in the order they appear on the slide). Units: Non-nuclear assembly unit (e.g. MT); Primary Assembly; ALT 1 to ALT 9; Patches. Accessions: GCA_000001405.6/GCF_000001405.17; GCA_000001345.1/GCF_000001345.1; GCA_000001305.1/GCF_000001305.13; GCA_000001355.1/GCF_000001355.1; GCA_000006015.1/GCF_000006015.1; GCA_000001365.1/GCF_000001365.2; GCA_000001375.1/GCF_000001375.1; GCA_000001315.1/GCF_000001315.1; GCA_000001385.1/GCF_000001385.1; GCA_000001325.1/GCF_000001325.2; GCA_000001395.1/GCF_000001395.1; GCA_000001335.1/GCF_000001335.1; Patches: GCA_000005045.5/GCF_000005045.4

  7. GRCh38 is coming (September, 2013)

  8. http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

  9. http://genomereference.org

  10. Why does missing sequence matter? Sample genome: Duplicon A, Duplicon B. GRCh37: Duplicon A. G>A: allelic difference (true variant). G>C: paralogous sequence variant (false positive). May or may not detect increased coverage, depending on sequencing depth and library quality (easier to find with new technologies than with old, low-throughput technologies).
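The false-positive scenario on slide 10 can be illustrated with a toy example. This is only a sketch: the sequences, the flank and spacer strings, and the naive ungapped "aligner" (`best_hit`) are all made up for illustration and are not taken from the presentation or from any real aligner. It compares the bases seen at one site when the reference contains only Duplicon A against a reference that also contains Duplicon B.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best ungapped placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        mismatches = hamming(read, reference[pos:pos + len(read)])
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


def alleles_at(reads, reference, site):
    """Bases observed at reference coordinate `site` after placing every read."""
    observed = []
    for read in reads:
        hit = best_hit(read, reference)
        if hit is not None and hit[0] <= site < hit[0] + len(read):
            observed.append(read[site - hit[0]])
    return observed


# Hypothetical sequences: Duplicon B differs from Duplicon A at one base (G>C).
DUPLICON_A = "ACGTTGCAGGCTTACGATCG"
DUPLICON_B = DUPLICON_A[:9] + "C" + DUPLICON_A[10:]
FLANK = "TTTTT"
SPACER = "AAAAAAAAAA"

# Reference missing Duplicon B versus a reference that contains both copies.
collapsed_ref = FLANK + DUPLICON_A + FLANK
complete_ref = FLANK + DUPLICON_A + SPACER + DUPLICON_B + FLANK

# The sample genome carries both duplicons, so it yields reads from each.
reads = [DUPLICON_A, DUPLICON_B]
site = len(FLANK) + 9  # coordinate of the differing base within Duplicon A

collapsed_calls = alleles_at(reads, collapsed_ref, site)
complete_calls = alleles_at(reads, complete_ref, site)
print("Duplicon B missing from reference:", collapsed_calls)
print("Duplicon B present in reference:  ", complete_calls)
```

With Duplicon B missing, the read from B has nowhere better to go and piles up on A, so the site shows both G and C and looks like a G>C variant. With B present, that read is placed on its own copy and only G is seen at the site.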

  11. CDC27 1KG Phase 1 Strict accessibility mask SNP (all) SNP (not 1KG) http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

  12. Sudmant et al., 2010

  13. Part of chr22 assembly Alternate locus for chr22 White: Insertion Black: Deletion Kidd et al, 2007 APOBEC cluster

  14. http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes

  15. 129S6/SvEvTac tiling path. Alignment to C57BL/6J chr1. +32 Kb in 129S6/SvEvTac. B6 Genes. 129S6/SvEvTac Genes. Mouse Ren1, chr1 (CM000994.2/NC_000067.6): 133350674-133360320

  16. Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320 NM_031192.3: transcript from C57BL/6J NM_031193.2: transcript from FVB/N 129S6/SvEvTac Alt Locus Alignment (allelic) FVB/N Transcript Alignment (paralog)

  17. Mouse Ren1 chr1 (CM000994.2/NC_000067.6): 133350674-133360320 NM_031192.3: transcript from C57BL/6J NM_031193.2: transcript from FVB/N 129S6/SvEvTac Ren1 FVB Ren2 Tx Paralogous diff SNP + Paralogous diff

  18. Doggett et al., 2006. Hydin: chr16 (16q22.2). Hydin2: chr1 (1q21.1). Missing in NCBI35/NCBI36. Unlocalized in GRCh37. Finished in GRCh38. Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID. Alignment to Hydin1 CHM1_1.0, >99.9% ID.

  19. 1q32 1q21 1p21 1p21 patch alignment to chromosome 1 Dennis et al., 2012

  20. GRCh37 (current reference assembly) chrX Preview of GRCh38 (scheduled Fall 2013) TEX28 TKTL1 LOC101060233 LOC101060234 (opsin related) (TEX28 related)

  21. GRCh37 (hg19) NCBI36 (hg18) http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-21

  22. AL139246.20 NCBI35 (hg17) GRCh37 (hg19) AL139246.21

  23. Fixing Rare/Incorrect Bases

  24. Fixing Rare/Incorrect Bases

  25. Fixing Rare/Incorrect Bases

  26. A = 0.000, G = 1.000 (rs4732519)

  27. RP11 WGS reads Private RP11 variant? Missing in 1000G? rs4732519

  28. FAM23_MRC1 Region, chr10 Segmental Duplications 1KG accessibility Mask Novel Patch 250 kb of artificial duplication

  29. Genovese et al., 2013

  30. Adding Novel Sequence

  31. Adding Novel Sequence Karen Hayden and Jim Kent

  32. Human Resolved for GRCh38 http://genomereference.org

  33. MHC Alternate locus Alignment to chr6 Richa Agarwala

  34. Making the assembly accessible to existing tools: masking Query set: 439,109,084 NA12878 HiSeq reads

  35. Masking effectively blocks alignments in regions with high identity
      • Simulated reads from GRCh37.p9
          • Unpaired reads
          • 101 bp
          • 1x coverage
          • Default wgsim parameters
      • Masking parameters
          • Percent Id: 100%
          • Step size: 5 bp
          • Minimum length: 101 bp
          • Center SNPs in unmasked regions
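The slide lists masking parameters but not the procedure itself, so the following is only one possible reading, sketched under loud assumptions: the alternate sequence and the primary sequence it aligns to are given as two equal-length, ungapped strings; a window of the minimum length is slid along in fixed steps; and any window that is 100% identical is replaced with `N`. The function name `mask_identical_windows` is hypothetical, and the "center SNPs in unmasked regions" step is not implemented.

```python
def mask_identical_windows(alt, primary_aligned, window=101, step=5):
    """Mask (with 'N') every alt window that is identical to the primary.

    Assumes an ungapped, equal-length alignment; this is an illustrative
    simplification, not the procedure used for the presentation.
    """
    if len(alt) != len(primary_aligned):
        raise ValueError("sequences must be the same length (ungapped alignment)")
    masked = list(alt)
    for start in range(0, len(alt) - window + 1, step):
        end = start + window
        # 100% identity over the whole window is required before masking.
        if alt[start:end] == primary_aligned[start:end]:
            masked[start:end] = "N" * window
    return "".join(masked)


# Tiny demonstration with a 10 bp window so the effect is easy to see.
primary = "A" * 30
alt = "A" * 22 + "C" + "A" * 7  # one difference at index 22
demo = mask_identical_windows(alt, primary, window=10, step=5)
print(demo)
```

In the demonstration, the windows that overlap the single difference are left unmasked, so reads covering that difference can still align to the alternate sequence, while the identical stretch before it is blocked.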

  36. Masking improves alignments in regions with alternate loci or patches

  37. Masking effectively reduces the increase in NA12878 reads with MAPQ=0 alignments that occurs when the full assembly is used as an alignment substrate. NA12878 reads whose best alignment was on an alt/patch in the masked assembly were evaluated for their alignment location when aligned to the primary assembly alone.
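The MAPQ=0 comparison on slide 37 amounts to counting, for each alignment substrate, how many mapped reads received a mapping quality of zero. A minimal sketch of that count over SAM-format text lines is shown below. It relies only on the standard SAM column layout (FLAG in column 2, MAPQ in column 5); the function name `mapq_zero_fraction` and the example records are made up, and this is not the analysis code behind the slide.

```python
def mapq_zero_fraction(sam_lines):
    """Fraction of primary, mapped alignments whose MAPQ is 0."""
    mapped = 0
    zero = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x4:
            continue  # read is unmapped
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800) alignment
        mapped += 1
        if mapq == 0:
            zero += 1
    return zero / mapped if mapped else 0.0


# Hypothetical records: two MAPQ=0 alignments, one confident, one unmapped.
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t0\tchr1_alt\t50\t0\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
    "r4\t16\tchr1\t200\t0\t5M\t*\t0\t0\tACGTA\tIIIII",
]
fraction = mapq_zero_fraction(example)
print(f"{fraction:.3f}")
```

Running the same count on alignments to the primary assembly alone, the full assembly, and the masked assembly would give the three numbers being compared.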

  38. Take home messages
      • The assembly you use for analysis is an important part of your analysis package.
      • The reference assembly is not a set of linear sequences but can now represent allelic diversity.
      • Tools still need to catch up.
      • The human reference assembly is updating soon! (Remember: assemblies are not static if you are lucky!)