@IMGC2012. #IMGC2012. 26th International Mammalian Genome Conference 2012. Bioinformatics Workshop. Sunday, October 21, 2012. 09.00 – 12.00. Wi-Fi: twgroup / password: group5500. Location: Tarpon Room. IMGS 2012 Bioinformatics Workshop. Deanna Church, NCBI
IMGS 2012 Bioinformatics Workshop Deanna Church, NCBI Carol Bult, The Jackson Laboratory
Tutorial Resources • Galaxy • https://main.g2.bx.psu.edu/ • Genome Analysis for Biologists • http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/ • NCBI 1000 Genomes Browser • http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/ • Genome Reference Consortium • http://genomereference.org/
Schedule 9-10 am: Intro • Genome Assembly Basics • Alignment Basics 10-11 am: Getting Stuff Done • File formats (sequences, alignments, annotations) 11 am-12 pm: Doing stuff • Typical RNA-Seq workflow • RNA-Seq in Galaxy • Differential Gene Expression with RNA-Seq data
Assembly Basics 19 Oct 2012
Overlap-Layout-Consensus WGS: Sanger Reads • Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb • End-sequence all clones and retain pairing information (“mate-pairs”) • Each end sequence is referred to as a read • Find sequence overlaps (tails) • WGS contig
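The "find sequence overlaps" step can be illustrated with a minimal Python sketch. It assumes exact (error-free) suffix-prefix matches, which real Sanger reads do not guarantee; the function names `overlap` and `merge` and the toy reads are hypothetical, not part of any assembler.

```python
def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 if no overlap of at least `min_length` bases exists.
    """
    start = 0
    while True:
        # Look for the first `min_length` bases of b somewhere in a.
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        # Check whether the rest of a, from that point on, is a prefix of b.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def merge(a: str, b: str, overlap_length: int) -> str:
    """Join two overlapping reads into one contig-like string."""
    return a + b[overlap_length:]


if __name__ == "__main__":
    read1, read2 = "TTACGT", "CGTACCGT"  # toy reads, not real data
    n = overlap(read1, read2)
    print(n, merge(read1, read2, n))
```

Real assemblers tolerate sequencing errors and use mate-pair information to order contigs, which this sketch ignores.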
http://schatzlab.cshl.edu/teaching/2010/Lecture%203%20-%20Graphs%20and%20Genomes.pdf
Alignable trace count in frameshift window vs control in Opossum: 51 nt window, >95% identity • 23,894 genes • 452 models with >1 exon, sym. best hit, and one frameshift • 334 cases have 3 or fewer hits Alexander Souvorov, NCBI
Fragmented genomes tend to have fewer frameshifts Alexander Souvorov, NCBI
Fragmented genomes tend to have more partial models Alexander Souvorov, NCBI
Clone-based assemblies • BAC insert in BAC vector • Shotgun sequence • Fold sequence • Assemble • Gaps: deeper sequence coverage rarely resolves all gaps • “Finishers” go in to manually fill the gaps, often by PCR
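As a rough illustration of fold coverage, the sketch below uses the idealized Lander-Waterman model, in which reads land uniformly at random and the expected fraction of uncovered bases is about e^(-c) at coverage c. That model is an assumption introduced here, not something stated on the slide, and the read counts are made up. Real clones contain repeats and hard-to-sequence regions, which is one reason deeper coverage rarely closes every gap.

```python
import math


def fold_coverage(n_reads: int, read_length: int, target_length: int) -> float:
    """Average number of times each base is sequenced: c = N * L / G."""
    return n_reads * read_length / target_length


def expected_uncovered_fraction(coverage: float) -> float:
    """Idealized (Lander-Waterman) fraction of bases with no read: e^(-c)."""
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical numbers: 2,000 reads of 500 bp from a 150 kb BAC insert.
    c = fold_coverage(2000, 500, 150_000)
    print(f"coverage = {c:.2f}x")
    print(f"expected uncovered fraction = {expected_uncovered_fraction(c):.5f}")
```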
http://genomereference.org Church et al., 2011 PLoS Biology
GRCh37 (hg19) NCBI36 (hg18)
AL139246.20 NCBI35 (hg17) GRCh37 (hg19) AL139246.21
Build sequence contigs based on contigs defined in TPF (Tiling Path File). Check for orientation consistencies Select switch points Instantiate sequence for further analysis Switch point Consensus sequence
nsv832911 (nstd68) Submitted on NCBI35 (hg17)
Moved approximately 2 Mb distal on chr15 NCBI35 (hg17) Tiling Path NC_000015.8 (chr15) Gap Inserted Removed from assembly GRCh37 (hg19) Tiling Path Added to assembly NC_000015.9 (chr15) HG-24
Sequences from haplotype 1 Sequences from haplotype 2 Old Assembly model: compress into a consensus New Assembly model: represent both haplotypes
nsv532126 (nstd37) NCBI36 NC_000004.10 (chr4) Tiling Path TMPRSS11E2 TMPRSS11E2 TMPRSS11E TMPRSS11E GRCh37 NC_000004.11 (chr4) Tiling Path AC147055.2 AC079749.5 AC021146.7 AC134921.1 AC074378.4 AC093720.2 AC079749.5 AC147055.2 AC019173.4 AC021146.7 AC134921.2 AC140484.1 AC093720.2 AC074378.4 GRCh37: NT_167250.1 (UGT2B17 alternate locus) AC021146.7 AC019173.4 AC074378.4 AC226496.2 AC140484.1 Xue Y et al., 2008
GRCh37 (hg19) MAPT UGT2B17 MHC 7 alternate haplotypes at the MHC Alternate loci released as: FASTA AGP Alignment to chromosome http://genomereference.org
ALT 1 Non-nuclear assembly unit (e.g. MT) Assembly (e.g. GRCh37.p2) ALT 2 PAR Primary Assembly ALT 6 ALT 3 Genomic Region (UGT2B17) Genomic Region (PECAM1) Genomic Region (ABO) Genomic Region (MAPT) Genomic Region (SMA) Genomic Region (MHC) ALT 7 ALT 4 ALT 8 ALT 5 ALT 9 … Patches
MHC (chr6) Chr 6 representation (PGF) Alt_Ref_Locus_2 (COX)
Richa Agarwala Eugene Yaschenko
Data Archives GenBank • Data in a common format • Data in a single location (and mirrored) • Most quality checked prior to deposition • Robust data tracking mechanism (accession.version) • Data owned by submitter
Data tracking ABC14-1065514J1 • FP565796.1: phase 1, 1 gap, 21-Oct-2009 • FP565796.2: phase 1, 0 gaps, 14-Oct-2010 • FP565796.3: phase 3, 0 gaps, 07-Nov-2010
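The accession.version convention can be handled programmatically. The following Python sketch, with the hypothetical helper names `split_accession` and `latest_versions`, splits identifiers such as FP565796.3 into a stable accession and an integer version, then reports the highest version seen for each accession.

```python
from typing import Dict, Iterable, Tuple


def split_accession(accver: str) -> Tuple[str, int]:
    """Split 'FP565796.3' into ('FP565796', 3)."""
    accession, sep, version = accver.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {accver!r}")
    return accession, int(version)


def latest_versions(records: Iterable[str]) -> Dict[str, int]:
    """Map each accession to the highest version number seen."""
    latest: Dict[str, int] = {}
    for record in records:
        accession, version = split_accession(record)
        latest[accession] = max(version, latest.get(accession, 0))
    return latest


if __name__ == "__main__":
    history = ["FP565796.1", "FP565796.2", "FP565796.3"]
    print(latest_versions(history))
```

The accession stays constant while the sequence is updated; only the version changes, so a coordinate reported against FP565796.1 may not apply to FP565796.3.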
Mouse chrX: 35,000,000-36,000,000 X MGSCv3 MGSCv36
Unique Identification chrX in MGSCv36 NC_000086.6 List of scaffolds and gaps (AGP) List of components and gaps (AGP)
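An AGP file lists, line by line, the components and gaps that make up a scaffold or chromosome. The Python sketch below is a minimal reader, assuming the tab-delimited nine-column layout of AGP 2.0, in which component types N and U mark gap lines. The object name `chrDemo` and the component IDs are invented for illustration and are not real records.

```python
from dataclasses import dataclass
from typing import List, Optional

GAP_COMPONENT_TYPES = {"N", "U"}  # N: gap of specified size, U: unknown size


@dataclass
class AgpRow:
    obj: str
    obj_beg: int
    obj_end: int
    part_number: int
    component_type: str
    component_id: Optional[str] = None
    component_beg: Optional[int] = None
    component_end: Optional[int] = None
    orientation: Optional[str] = None
    gap_length: Optional[int] = None
    gap_type: Optional[str] = None
    linkage: Optional[str] = None

    @property
    def is_gap(self) -> bool:
        return self.component_type in GAP_COMPONENT_TYPES


def parse_agp_line(line: str) -> AgpRow:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    row = AgpRow(fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4])
    if row.is_gap:
        # Columns 6-8 describe the gap itself.
        row.gap_length = int(fields[5])
        row.gap_type = fields[6]
        row.linkage = fields[7]
    else:
        # Columns 6-9 describe the placed component sequence.
        row.component_id = fields[5]
        row.component_beg = int(fields[6])
        row.component_end = int(fields[7])
        row.orientation = fields[8]
    return row


def parse_agp(text: str) -> List[AgpRow]:
    """Parse AGP text, skipping blank lines and '#' comment lines."""
    return [parse_agp_line(l) for l in text.splitlines() if l and not l.startswith("#")]


# Invented example: two components separated by a 100 bp gap.
EXAMPLE = "\n".join([
    "\t".join(["chrDemo", "1", "1000", "1", "F", "COMP0001.1", "1", "1000", "+"]),
    "\t".join(["chrDemo", "1001", "1100", "2", "N", "100", "scaffold", "yes", "paired-ends"]),
    "\t".join(["chrDemo", "1101", "1600", "3", "F", "COMP0002.1", "1", "500", "-"]),
])

if __name__ == "__main__":
    for row in parse_agp(EXAMPLE):
        print("gap" if row.is_gap else row.component_id, row.obj_beg, row.obj_end)
```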
What’s in a name? GRCh37 hg19 Zv7 danRer5 MGSCv37 mm8 NCBIM37
Assemblies with the same name aren’t always the same chr21:8,913,216-9,246,964
Assemblies with the same name aren’t always the same Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
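Because the same coordinates can point at different sequence in different assemblies, it can help to carry the assembly identifier together with every region. The Python sketch below parses a browser-style region string and keeps the assembly name alongside it. The `Region` type and `parse_region` function are hypothetical names, and the sketch assumes 1-based inclusive coordinates.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    assembly: str
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumes 1-based, inclusive coordinates.
        return self.end - self.start + 1


_REGION_PATTERN = re.compile(r"^(?P<chrom>[^:]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(assembly: str, text: str) -> Region:
    """Parse 'chr21:8,913,216-9,246,964' into a Region tied to an assembly."""
    match = _REGION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized region string: {text!r}")
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return Region(assembly, match.group("chrom"), start, end)


if __name__ == "__main__":
    region = parse_region("Zv7", "chr21:8,913,216-9,246,964")
    print(region, region.length)
```

Two `Region` values with different `assembly` fields compare as unequal even when the chromosome and coordinates match, which mirrors the point of the slide.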
Assembly Database to the rescue GRCh37 hg19 GCA_000001405.1 GRCh37.p2 GCA_000001405.3
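Assembly accessions follow a regular pattern: a GCA (GenBank) or GCF (RefSeq) prefix, nine digits, and a version. The Python sketch below, using the hypothetical helpers `parse_assembly_accession` and `same_assembly_chain`, shows how GRCh37 (GCA_000001405.1) and GRCh37.p2 (GCA_000001405.3) can be recognized as versions of the same assembly.

```python
import re
from typing import Tuple

_ASSEMBLY_PATTERN = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


def parse_assembly_accession(accession: str) -> Tuple[str, str, int]:
    """Split 'GCA_000001405.3' into ('GCA', '000001405', 3)."""
    match = _ASSEMBLY_PATTERN.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, digits, version = match.groups()
    return prefix, digits, int(version)


def same_assembly_chain(a: str, b: str) -> bool:
    """True if two accessions differ only by version."""
    return parse_assembly_accession(a)[:2] == parse_assembly_accession(b)[:2]


if __name__ == "__main__":
    grch37, grch37_p2 = "GCA_000001405.1", "GCA_000001405.3"
    print(parse_assembly_accession(grch37_p2))
    print(same_assembly_chain(grch37, grch37_p2))
```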
GRCh37 hg19 http://www.ncbi.nlm.nih.gov/genome/assembly
Assembly (e.g. GRCh37.p5) ALT 1 Non-nuclear assembly unit (e.g. MT) GCA_000001405.6/GCF_000001405.17 GCA_000001345.1/GCF_000001345.1 GCA_000001305.1/GCF_000001305.13 ALT 2 Primary Assembly GCA_000001355.1/GCF_000001355.1 ALT 6 ALT 3 GCA_000006015.1/GCF_000006015.1 GCA_000001365.1/GCF_000001365.2 ALT 7 ALT 4 GCA_000001375.1/GCF_000001375.1 ALT 8 GCA_000001315.1/GCF_000001315.1 GCA_000001385.1/GCF_000001385.1 ALT 5 GCA_000001325.1/GCF_000001325.2 GCA_000001395.1/GCF_000001395.1 ALT 9 GCA_000001335.1/GCF_000001335.1 Patches GCA_000005045.5/GCF_000005045.4
GenBank vs RefSeq • GenBank: submitter owned; redundant; updated rarely; INSDC; BRCA1: 83 genomic records, 31 mRNA records, 27 protein records • RefSeq: RefSeq owned; non-redundant; curated; not INSDC; BRCA1: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records
The biological basis of sequence alignment is evolution • Sequences that share a common ancestor are homologous • Sequence similarity is evidence of homology • Sequences, genes, etc. are either homologous or not; there is no “percent homology”
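What can be expressed as a percentage is identity (or similarity) between aligned sequences, not homology. The Python sketch below computes percent identity over the ungapped columns of a pairwise alignment. This is one of several conventions in use, and the function name `percent_identity` and the toy sequences are assumptions for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of ungapped alignment columns where both sequences agree.

    Both inputs must already be aligned (equal length, gaps as `gap`).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Toy alignment, not real data: one mismatch in seven columns.
    print(round(percent_identity("GATTACA", "GACTACA"), 2))
```

A high value here is evidence for homology; the homology call itself remains a yes-or-no inference.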
Homology • Orthologous sequences • Common ancestor; speciation • Paralogous sequences • Gene duplication within a species (lineage specific expansion) http://www.nature.com/nrd/journal/v2/n8/box/nrd1152_BX2.html
Alignment to NR -> Homology Alignment to an Assembly -> Mapping
Global and local alignments • Optimal global alignment (Needleman-Wunsch): sequences align essentially from end to end • Optimal local alignment (Smith-Waterman): sequences align only in small, isolated regions References Needleman and Wunsch (1970). J. Mol. Biol. 48, 443-453. Smith and Waterman (1981). J. Mol. Biol. 147, 195-197.
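The difference between the two approaches shows up directly in the dynamic-programming recurrences. The Python sketch below computes only the optimal scores, assuming a simple scheme (match +1, mismatch -1, linear gap penalty -1) chosen for illustration; it does not reproduce the scoring used in the original papers or in production aligners.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch style score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Smith-Waterman style score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    x, y = "AAAGATTACATTT", "CCCGATTACAGGG"  # toy sequences sharing GATTACA
    print("global:", global_score(x, y))
    print("local: ", local_score(x, y))
```

The only structural differences are the first row and column (gap penalties versus zeros), the floor at zero in each cell, and where the final score is read from (the last cell versus the maximum cell).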