
26th International Mammalian Genome Conference 2012

@IMGC2012. #IMGC2012. 26th International Mammalian Genome Conference 2012. Bioinformatics Workshop. Sunday, October 21, 2012. 09.00 – 12.00. Wi-Fi: twgroup / password: group5500. Location: Tarpon Room. IMGS 2012 Bioinformatics Workshop. Deanna Church, NCBI


Presentation Transcript


  1. @IMGC2012 #IMGC2012 26th International Mammalian Genome Conference 2012 Bioinformatics Workshop Sunday, October 21, 2012 09.00 – 12.00 Wi-Fi: twgroup / password: group5500 Location: Tarpon Room

  2. IMGS 2012 Bioinformatics Workshop • Deanna Church, NCBI • Carol Bult, The Jackson Laboratory

  3. Tutorial Resources • Galaxy • https://main.g2.bx.psu.edu/ • Genome Analysis for Biologists • http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/ • NCBI 1000 Genomes Browser • http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/ • Genome Reference Consortium • http://genomereference.org/

  4. Schedule • 9-10 am: Intro • Genome Assembly Basics • Alignment Basics • 10-11 am: Getting Stuff Done • File formats (sequences, alignments, annotations) • 11 am-12 pm: Doing Stuff • Typical RNA-Seq workflow • RNA-Seq in Galaxy • Differential Gene Expression with RNA-Seq data

  5. Assembly Basics 19 Oct 2012

  6. Some assembly required…

  7. WGS: Sanger Reads (Overlap-Layout-Consensus) • Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb • End-sequence all clones and retain pairing information (“mate-pairs”) • Each end sequence is referred to as a read • Find sequence overlaps • Diagram labels: tails, WGS contig
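
The "find sequence overlaps" step can be illustrated with a minimal Python sketch. It is not the method used by any particular assembler: the function name `overlap`, the `min_len` threshold, and the three reads are hypothetical, and real overlappers tolerate sequencing errors and use indexes rather than a brute-force scan.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that exactly
    matches a prefix of `b`, or 0 if it is shorter than `min_len`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


# Made-up reads, for illustration only.
reads = ["ACGTTGCA", "TGCATTAG", "TTAGGCCA"]

# Report every ordered pair of distinct reads that overlaps.
for i, a in enumerate(reads):
    for j, b in enumerate(reads):
        if i != j and overlap(a, b) > 0:
            print(f"read {i} -> read {j}: overlap of {overlap(a, b)} bases")
```

Chaining the reported overlaps lays the reads out in order, which is the starting point for building a contig consensus.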

  8. http://schatzlab.cshl.edu/teaching/2010/Lecture%203%20-%20Graphs%20and%20Genomes.pdf

  9. Alignable trace count in frameshift window vs control in Opossum • 51 nt window, >95% identity • 23,894 genes • 452 models with >1 exon, sym. best hit, and one frameshift • 334 cases have 3 or fewer hits • Alexander Souvorov, NCBI

  10. Fragmented genomes tend to have fewer frameshifts • Alexander Souvorov, NCBI

  11. Fragmented genomes tend to have more partial models Alexander Souvorov, NCBI

  12. Clone based assemblies • BAC insert in a BAC vector • Shotgun sequence • Assemble • Deeper sequence coverage rarely resolves all gaps • “Finishers” go in to manually fill the gaps, often by PCR • Diagram labels: fold sequence, gaps

  13. Scaffold N50 by chromosome
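
For readers unfamiliar with the statistic, N50 is the length of the scaffold at which the running total, taken from the longest scaffold downward, first reaches half of the total assembled length. A minimal Python sketch follows; the function name `n50` and the lengths are made up for illustration and do not come from the slide's data.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first scaffold that brings the running sum
        # to at least half of the total length.
        if running * 2 >= total:
            return length
    return 0


# Made-up scaffold lengths (in kb): total 290, half is 145.
example = [80, 70, 50, 40, 30, 20]
print(n50(example))  # 80 + 70 = 150 >= 145, so this prints 70
```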

  14. Spanned Gaps by Assembly

  15. http://genomereference.org Church et al., 2011 PLoS Biology

  16. GRCh37 (hg19) NCBI36 (hg18)

  17. AL139246.20 NCBI35 (hg17) GRCh37 (hg19) AL139246.21

  18. Build sequence contigs based on contigs defined in the TPF (Tiling Path File) • Check for orientation consistencies • Select switch points • Instantiate sequence for further analysis • Diagram labels: switch point, consensus sequence

  19. NCBI36

  20. nsv832911 (nstd68) Submitted on NCBI35 (hg17)

  21. Moved approximately 2 Mb distal on chr15 • NCBI35 (hg17) Tiling Path, NC_000015.8 (chr15): gap inserted, removed from assembly • GRCh37 (hg19) Tiling Path, NC_000015.9 (chr15): added to assembly • HG-24

  22. Sequences from haplotype 1 Sequences from haplotype 2 Old Assembly model: compress into a consensus New Assembly model: represent both haplotypes

  23. nsv532126 (nstd37) NCBI36 NC_000004.10 (chr4) Tiling Path TMPRSS11E2 TMPRSS11E2 TMPRSS11E TMPRSS11E GRCh37 NC_000004.11 (chr4) Tiling Path AC147055.2 AC079749.5 AC021146.7 AC134921.1 AC074378.4 AC093720.2 AC079749.5 AC147055.2 AC019173.4 AC021146.7 AC134921.2 AC140484.1 AC093720.2 AC074378.4 GRCh37: NT_167250.1 (UGT2B17 alternate locus) AC021146.7 AC019173.4 AC074378.4 AC226496.2 AC140484.1 Xue Y et al, 2008

  24. GRCh37 (hg19) • Alternate loci: MAPT, UGT2B17, MHC • 7 alternate haplotypes at the MHC • Alternate loci released as: FASTA, AGP, alignment to chromosome • http://genomereference.org

  25. ALT 1 Non-nuclear assembly unit (e.g. MT) Assembly (e.g. GRCh37.p2) ALT 2 PAR Primary Assembly ALT 6 ALT 3 Genomic Region (UGT2B17) Genomic Region (PECAM1) Genomic Region (ABO) Genomic Region (MAPT) Genomic Region (SMA) Genomic Region (MHC) ALT 7 ALT 4 ALT 8 ALT 5 ALT 9 … Patches

  26. MHC (chr6) Chr 6 representation (PGF) Alt_Ref_Locus_2 (COX)

  27. Richa Agarwala • Eugene Yaschenko

  28. Data Archives GenBank • Data in a common format • Data in a single location (and mirrored) • Most quality checked prior to deposition • Robust data tracking mechanism (accession.version) • Data owned by submitter

  29. Data tracking: ABC14-1065514J1 • FP565796.1: phase 1, 1 gap, 21-Oct-2009 • FP565796.2: phase 1, 0 gaps, 14-Oct-2010 • FP565796.3: phase 3, 0 gaps, 07-Nov-2010
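
The accession.version convention on this slide is easy to work with programmatically: the accession stays fixed while the version increments with each sequence update. A small Python sketch, in which the helper name `split_accession` is hypothetical and the records are the three versions listed on the slide:

```python
def split_accession(acc_ver: str):
    """Split 'FP565796.2' into ('FP565796', 2)."""
    accession, _, version = acc_ver.partition(".")
    return accession, int(version)


# The three versions of the record shown on the slide, in arbitrary order.
records = ["FP565796.2", "FP565796.1", "FP565796.3"]

# All three share one accession; the highest version is the current sequence.
latest = max(records, key=lambda r: split_accession(r)[1])
print(latest)
```

Citing the full accession.version, rather than the bare accession, pins an analysis to one exact sequence.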

  30. Mouse chrX: 35,000,000-36,000,000

  31. Mouse chrX: 35,000,000-36,000,000 X MGSCv3 MGSCv36

  32. Unique Identification chrX in MGSCv36 NC_000086.6 List of scaffolds and gaps (AGP) List of components and gaps (AGP)
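
An AGP file is a tab-delimited listing of the pieces that make up a larger sequence object, with gap rows marked by component type `N` or `U`. The Python sketch below reads a few rows; the object name and component accessions are made up for illustration, and the parser covers only the columns it needs rather than validating the full AGP specification.

```python
# Made-up AGP rows: object, start, end, part number, component type, then
# either (component id, comp start, comp end, orientation) for a sequence
# row or (gap length, gap type, linkage, evidence) for a gap row.
rows = [
    ["chrX_example", "1", "5000", "1", "F", "AB000001.1", "1", "5000", "+"],
    ["chrX_example", "5001", "6000", "2", "N", "1000", "scaffold", "yes", "paired-ends"],
    ["chrX_example", "6001", "9000", "3", "F", "AB000002.3", "1", "3000", "-"],
]
agp_text = "\n".join("\t".join(r) for r in rows)


def parse_agp(text: str):
    """Parse AGP text into a list of dicts, one per component or gap."""
    parsed = []
    for line in text.splitlines():
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        f = line.split("\t")
        entry = {"object": f[0], "start": int(f[1]), "end": int(f[2])}
        if f[4] in ("N", "U"):  # gap of known (N) or unknown (U) size
            entry.update(kind="gap", gap_length=int(f[5]))
        else:
            entry.update(kind="component", component_id=f[5], orientation=f[8])
        parsed.append(entry)
    return parsed


for entry in parse_agp(agp_text):
    print(entry)
```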

  33. What’s in a name? GRCh37 hg19 Zv7 danRer5 MGSCv37 mm8 NCBIM37

  34. What’s in a name?

  35. Assemblies with the same name aren’t always the same chr21:8,913,216-9,246,964

  36. Assemblies with the same name aren’t always the same Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX

  37. Assembly Database to the rescue GRCh37 hg19 GCA_000001405.1 GRCh37.p2 GCA_000001405.3

  38. GRCh37 hg19 http://www.ncbi.nlm.nih.gov/genome/assembly

  39. Assembly (e.g. GRCh37.p5) ALT 1 Non-nuclear assembly unit (e.g. MT) GCA_000001405.6 /GCF_000001405.17 GCA_000001345.1/ GCF_000001345.1 GCA_000001305.1/ GCF_000001305.13 ALT 2 Primary Assembly GCA_000001355.1/ GCF_000001355.1 ALT 6 ALT 3 GCA_000006015.1/ GCF_000006015.1 GCA_000001365.1/ GCF_000001365.2 ALT 7 ALT 4 GCA_000001375.1/ GCF_000001375.1 ALT 8 GCA_000001315.1/ GCF_000001315.1 GCA_000001385.1/ GCF_000001385.1 ALT 5 GCA_000001325.1/GCF_000001325.2 GCA_000001395.1/ GCF_000001395.1 ALT 9 GCA_000001335.1/ GCF_000001335.1 Patches GCA_000005045.5 GCF_000005045.4

  40. GenBank vs RefSeq • GenBank: submitter owned; redundant; updated rarely; INSDC • RefSeq: RefSeq owned; non-redundant; curated; not INSDC • BRCA1 in GenBank: 83 genomic records, 31 mRNA records, 27 protein records • BRCA1 in RefSeq: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records

  41. Sequence Alignments Basics

  42. Hypothesis

  43. The biological basis of sequence alignment is evolution • Sequences that share a common ancestor are homologous • Sequence similarity is evidence of homology • Sequences, genes, etc. are homologous or not, there is no “percent homology”

  44. Homology • Orthologous sequences • Common ancestor; speciation • Paralogous sequences • Gene duplication within a species (lineage specific expansion) http://www.nature.com/nrd/journal/v2/n8/box/nrd1152_BX2.html

  45. Alignment to NR -> Homology Alignment to an Assembly -> Mapping

  46. Global and local alignments • Optimal global alignment (Needleman-Wunsch): sequences align essentially from end to end • Optimal local alignment (Smith-Waterman): sequences align only in small, isolated regions • References: Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453. Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
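
The difference between the two alignment modes shows up clearly in a small dynamic-programming sketch. The Python below computes scores only (no traceback) and assumes a simple scheme of +1 for a match, -1 for a mismatch, and -1 per gap position; the function name `align_score`, the scoring values, and the sequences are illustrative assumptions, not parameters from the workshop or from any specific tool.

```python
def align_score(a: str, b: str, local: bool = False,
                match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Score of the best global (Needleman-Wunsch style) or local
    (Smith-Waterman style) alignment under a linear gap penalty."""
    # First row: leading gaps are penalized globally, free locally.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)    # a local alignment may restart anywhere
                best = max(best, score)  # and may end anywhere
            cur.append(score)
        prev = cur
    return best if local else prev[-1]


# Made-up sequences that share only a short core (ACGT).
x, y = "TTTACGTAAA", "GGGACGTCCC"
print("global:", align_score(x, y))
print("local: ", align_score(x, y, local=True))
```

With these inputs the global score is dragged down by the unrelated flanks, while the local score reflects only the shared core.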
