Tools For HTS Analysis Michael Brudno and Marc Fiume Department of Computer Science University of Toronto
Outline • Lab focus • Our tools • SHRiMP: read mapper • VARiD: SNP and indel finder • Savant: genome browser • Discussion
Our Tools READ MAPPING (SHRiMP) ASSEMBLY (UNNAMED) VISUALIZATION (SAVANT) SNP DETECTION (VARiD) INDEL DETECTION (MODiL) CNV DETECTION (CNVer)
SHRiMP – Short read mapping package Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
Key SHRiMP Features • High Sensitivity • Support for common formats (SAM, FASTQ, etc) • Flexible seeding framework • Multi-threading • Full support for SOLiD and Illumina (and 454) reads
Sensitivity/Specificity Comparison
Runtime comparison • Mapping 6 million reads to C. savignyi (180 Mb) • Unpaired 50 bp reads • Paired 75 bp reads
VARiD – SNP and indel detection
Variation detection from NGS reads • Determine differences (variation) between reference and donor using NGS reads of the donor • Reference: TCAGCATCGGCATCGACTGCACAGGACCAGTCGATCGAC • Donor: ??????????????????????????????????????? • Aligned reads: GCATCGACTGCA, CGGGATCGACTG, ATCCATTGCA, GATCCACTGCAC motivation | methods | results | summary
Motivation • Color-space and Letter-space platforms • bring them together Methods Results Summary
Sequencing Platforms • letter-space • Sanger, 454, Illumina, etc. • example: > NC_005109.2 | BRCA1 SX3 TCAGCATCGGCATCGACTGCACAGG • color-space • AB SOLiD • fewer software tools available • example: > NC_005109.2 | BRCA1 AF3 T212313230313232121311120 • many differences -> useful to combine this information • sequencing biases • inherent errors • advantages
Color Space • Translation Matrix • Translation Automata
Color Space Translating • color read: > T212313230313232121311120 • translated: > TCAGCATCGGCATCGACTGCACAGG
Sequencing Error vs SNP • Sequencing error: > T212313230313232121311120 read as > T212313230310232121311120 (one color differs) translates to > TCAGCATCGGCAAGCTGACGTGTCC (every base after the error differs) • SNP: > TCAGCATCGGCATCGACTGCACAGG vs > TCAGCATCGGCAGCGACTGCACAGG (one base differs) encodes to > T212313230312332121311120 (two adjacent colors differ)
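The translation on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the usual SOLiD two-base encoding in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names `decode_colors` and `encode_letters` are illustrative only and are not part of SHRiMP or VARiD.

```python
# Assumed 2-bit base codes; color = code(previous base) XOR code(next base).
BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}


def decode_colors(read: str) -> str:
    """Translate a color-space read (primer base followed by colors) to letters."""
    letters = [read[0]]
    for color in read[1:]:
        previous = CODE[letters[-1]]
        letters.append(BASES[previous ^ int(color)])
    return "".join(letters)


def encode_letters(seq: str) -> str:
    """Translate a letter sequence to its first base followed by colors."""
    colors = [str(CODE[a] ^ CODE[b]) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(colors)


# The three sequences from the slide.
print(decode_colors("T212313230313232121311120"))  # TCAGCATCGGCATCGACTGCACAGG
print(decode_colors("T212313230310232121311120"))  # TCAGCATCGGCAAGCTGACGTGTCC
print(encode_letters("TCAGCATCGGCAGCGACTGCACAGG"))  # T212313230312332121311120
```

Because each decoded base depends on the one before it, a single wrong color corrupts the whole tail of the naive translation, whereas a real single-base change touches exactly two adjacent colors.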
Color Space • clear distinction between a sequencing error and a SNP • can this help us in SNP detection? sounds like it! • a single color change suggests a sequencing error • two adjacent color changes suggest a (likely) SNP • Figure: reads aligned to the reference in color-space, by position; labels: difficult case, well-covered bases, easy SNP call
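The rule of thumb on this slide can be sketched as a small classifier over color mismatches. This is only an illustration of the heuristic, not VARiD's method (VARiD uses an HMM, described later in the deck). The helper name `classify_color_diffs` and its labels are hypothetical. The extra XOR check relies on the assumed two-base encoding: a single-base change leaves the XOR of the two affected colors unchanged.

```python
def classify_color_diffs(ref_colors: str, read_colors: str):
    """Label runs of color mismatches between two equal-length color strings.

    Returns a list of (start_index, run_length, label) tuples.
    """
    if len(ref_colors) != len(read_colors):
        raise ValueError("color strings must have equal length")
    calls = []
    i, n = 0, len(ref_colors)
    while i < n:
        if ref_colors[i] == read_colors[i]:
            i += 1
            continue
        start = i
        while i < n and ref_colors[i] != read_colors[i]:
            i += 1
        run = i - start
        if run == 1:
            label = "likely sequencing error"
        elif run == 2:
            # A single-base change preserves the XOR of the two colors.
            ref_xor = int(ref_colors[start]) ^ int(ref_colors[start + 1])
            read_xor = int(read_colors[start]) ^ int(read_colors[start + 1])
            label = "candidate SNP" if ref_xor == read_xor else "adjacent errors"
        else:
            label = "unclassified"
        calls.append((start, run, label))
    return calls


ref = "212313230313232121311120"
print(classify_color_diffs(ref, "212313230310232121311120"))
print(classify_color_diffs(ref, "212313230312332121311120"))
```

With the color strings from the previous slide, the first call reports one isolated mismatch and the second reports a two-color run that passes the XOR check.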
Motivation • variation caller to handle both letter-space & color-space reads • Detection • Heterozygous SNPs • Homozygous SNPs • Tri-allelic SNPs • small indels • account for various errors, quality values & misalignments • VARiD • system to make inferences on the donor bases • variation detection
Motivation • Methods • Simple HMM Model • states, emissions, transitions, FB • Extended HMM Model • gaps, diploids, exceptions Results Summary
Hidden Markov Model (HMM) • Statistical model for a system with states • Assume that the system is a Markov process with unobserved state • Markov process: the next state depends only on the current state • We can observe each state's emission (output) • each state has a probability distribution over outputs • Diagram: states S1, S2, S3, each with emissions e1 and e2
Hidden Markov Model (HMM) • Apply HMM to variation detection: • we don’t know the state (donor), but • we can observe some output determined by the state (reads)
Hidden Markov Model (HMM) • unknown donor: ..... B6 B7 B8 B9 ..... • hidden states are overlapping pairs of adjacent donor bases: B6B7, B7B8, B8B9 • example state values: AA (color 0), AC (color 1)
States • state B6B7 • Why pairs of letters? To handle colors. • AA and TT give the same color, so we can’t just model colors • The donor could be: • letters: AA color 0 • letters: AC color 1 • ... • letters: TT color 0 • 16 combinations
States and Transitions • States • 16 possible states (AA, ..., AT, CA, ..., CT, GA, ..., TT) • only look at second letter • Transitions • only certain transitions allowed • when allowed, p(Xt|Xt-1) = freq(Xt) • each state depends only on the previous state (Markov Process) • Diagram: ... B6B7 -> B7B8 ...
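The state space and transition rule on this slide can be sketched as follows. Two points are assumptions and not stated in the deck: a transition is taken to be allowed when the second letter of the previous pair equals the first letter of the next pair, and freq(Xt) is taken to be a background frequency of the newly added base, set to a uniform 0.25 purely for illustration. The names `STATES`, `BASE_FREQ` and `transition_prob` are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# The 16 hidden states are ordered pairs of adjacent donor bases.
STATES = ["".join(pair) for pair in product(BASES, repeat=2)]

# Assumption: freq(X_t) is the background frequency of the newly added base.
BASE_FREQ = {base: 0.25 for base in BASES}


def transition_prob(prev: str, nxt: str) -> float:
    """p(X_t | X_{t-1}) for pair states; zero unless the two pairs overlap."""
    if prev[1] != nxt[0]:
        return 0.0  # pairs must share the middle base
    return BASE_FREQ[nxt[1]]


TRANSITIONS = {(p, n): transition_prob(p, n) for p in STATES for n in STATES}
allowed = [pair for pair, prob in TRANSITIONS.items() if prob > 0]
print(len(STATES), len(allowed))  # 16 64
```

Under these assumptions each state has exactly four successors, so only 64 of the 256 possible transitions are non-zero.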
Emissions • Unknown genome: ..... B6 B7 B8 B9 ..... (state B7B8 highlighted) • Color reads: T01020100311223, T1030101311223, T20100311223 • Letter reads: ATTGCGCAATGCG, TTGGGCAATGCGA, GCGCACTGCGAC • Labels on the color reads: color 0, color 1, color 0 • Labels on the letter reads: AA, AA, AC
Emission Probabilities • for state AA, p(em|AA):
  emission    probability
  color 0     1 - 3ε
  color 1     ε
  color 2     ε
  color 3     ε
  letter A    1 - 3ξ
  letter C    ξ
  letter G    ξ
  letter T    ξ
• state TT: same color emission distribution as AA, different letter emission distribution
Emission Probabilities • ..... B6 B7 B8 B9 ..... • Combining emission probabilities • probability that this state emitted these reads • Color reads: T01020100311223, T1030101311223, T20100311223 • Letter reads: ATTGCGCAATGCG, TTGGGCAATGCGA, GCGCACTGCGAC • E.g. for state CC: (worked formula not preserved in this transcript)
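One way to combine the per-read emission probabilities is sketched below. It rests on assumptions that the slides do not spell out: reads are treated as independent given the state, so their probabilities multiply (summed here in log space); a letter read is taken to report the second base of the pair; and the error rates `EPS` (ε) and `XI` (ξ) are arbitrary illustrative values. All function names are hypothetical.

```python
import math

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
EPS = 0.01  # assumed color error rate (epsilon); illustrative value only
XI = 0.01   # assumed letter error rate (xi); illustrative value only


def color_emission(state: str, color: int, eps: float = EPS) -> float:
    """p(observed color | pair state): 1 - 3*eps if it matches, else eps."""
    expected = CODE[state[0]] ^ CODE[state[1]]
    return 1 - 3 * eps if color == expected else eps


def letter_emission(state: str, letter: str, xi: float = XI) -> float:
    """p(observed letter | pair state), judged on the second base of the pair."""
    return 1 - 3 * xi if letter == state[1] else xi


def log_emission(state: str, colors, letters) -> float:
    """Log-probability that `state` emitted all observations at one position."""
    total = 0.0
    for color in colors:
        total += math.log(color_emission(state, color))
    for letter in letters:
        total += math.log(letter_emission(state, letter))
    return total


# Hypothetical observations at one position: three colors and three letters.
for state in ("CC", "AA", "AC"):
    print(state, log_emission(state, [0, 1, 0], ["C", "C", "A"]))
```

Working in log space avoids underflow when many reads cover a position. States AA and TT score identically on colors and differ only through the letter reads, which matches the previous slide.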
Simple HMM • Summary • unknown state • donor pair at location • transitions • transition probabilities • emissions • reads at location • emission probabilities • Diagram: state B6B7 with example values AA (color 0) and AC (color 1)
Forward-Backward Algorithm • Have set up a form of an HMM • run Forward-Backward algorithm • get probability distribution over states (AA, ..., AT, CA, ..., CT, GA, ..., TT) at some position; here AT is the likely state • Variation Detection: • compare most likely state with reference: • ref: GCTATCCA • don: ...AT...
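For reference, a generic Forward-Backward pass can be written compactly. This is a textbook sketch in pure Python, not VARiD's implementation. The input layout (`init`, `trans`, `emit`) and the toy two-state numbers are assumptions chosen for illustration; in the setting of these slides the states would be the 16 letter pairs and `emit[t]` would hold the combined read probabilities at position t.

```python
def forward_backward(init, trans, emit):
    """Posterior state probabilities for an HMM.

    init[i]     : prior probability of state i at position 0
    trans[i][j] : p(state j at t+1 | state i at t)
    emit[t][i]  : probability of the observations at position t given state i
    Returns post[t][i] = p(state i at position t | all observations).
    """
    n_pos, n_states = len(emit), len(init)
    # Forward pass, rescaled at every position to avoid underflow.
    fwd = []
    for t in range(n_pos):
        if t == 0:
            raw = [init[j] * emit[0][j] for j in range(n_states)]
        else:
            raw = [
                emit[t][j] * sum(fwd[-1][i] * trans[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        scale = sum(raw)
        fwd.append([value / scale for value in raw])
    # Backward pass, also rescaled.
    bwd = [[1.0] * n_states for _ in range(n_pos)]
    for t in range(n_pos - 2, -1, -1):
        raw = [
            sum(trans[i][j] * emit[t + 1][j] * bwd[t + 1][j] for j in range(n_states))
            for i in range(n_states)
        ]
        scale = sum(raw)
        bwd[t] = [value / scale for value in raw]
    # Combine and normalise per position.
    post = []
    for f_row, b_row in zip(fwd, bwd):
        raw = [f * b for f, b in zip(f_row, b_row)]
        total = sum(raw)
        post.append([value / total for value in raw])
    return post


# Toy two-state example with made-up numbers.
posterior = forward_backward(
    init=[0.6, 0.4],
    trans=[[0.7, 0.3], [0.2, 0.8]],
    emit=[[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
)
most_likely = [row.index(max(row)) for row in posterior]
print(most_likely)
```

Picking the highest-posterior state at each position and comparing it with the reference is the comparison step described on the slide.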
Motivation • Methods • Simple HMM Model • states, emissions, transitions, FB • Extended HMM Model • gaps, diploids, exceptions Results Summary
Extended HMM • Simple HMM • only detects homozygous SNPs • Extended HMM: • short indels • heterozygous SNPs • complex error profiles & quality values
Expansion: Gaps and het. SNPs • Expand states • Have states that include gaps (e.g. A-, --, -G) • emit: gap or color • Have larger states, for diploids (e.g. AGTG, T-T-) • Transitions built in similar fashion as before • Same algorithm, but in all we have 1600 states with very sparse transitions
Expansion • Emission probabilities • Support quality values • Use variable error rates for emissions • Translate through the first letter • Donor: ACAGCATCGGCATCGACTGC, colors: 1123132303123321213 • read: >T2123132303123321213 becomes > C123132303123321213 • first color is incorrect • letter-space signal • Post-process putative SNPs • uncorrelated adjacent errors may support het SNPs • check putative SNPs
Summary • Pipeline figure (blue: VARiD steps)
Motivation Methods Results Summary
Results • Human dataset from Harismendy et al., 2009 (NA17156, 17275, 17460, 17773) • Color-space dataset: • Compare random subsets: • Corona (with AB mapper) • VARiD (with SHRiMP) • VARiD (with AB mapper) • Conclusions: • Using F-measure, the three pipelines perform very similarly. • High-coverage results are as good as can be achieved
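For readers unfamiliar with the metric, the F-measure combines precision and recall into one score. The sketch below assumes the balanced F1 variant (beta = 1), since the slides do not say which weighting was used. The function name `f_measure` and the counts in the example are hypothetical and are not results from the study.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int, beta: float = 1.0) -> float:
    """F-measure from variant-call counts; beta = 1 gives the balanced F1 score."""
    called = true_pos + false_pos
    real = true_pos + false_neg
    precision = true_pos / called if called else 0.0
    recall = true_pos / real if real else 0.0
    if precision + recall == 0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts: 8 correct calls, 2 false calls, 2 missed variants.
print(f_measure(8, 2, 2))
```

A single score like this makes it straightforward to compare pipelines across coverage levels, which is how the comparison on this slide is framed.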
Results • Human dataset from Harismendy et al., 2009 (NA17156, 17275, 17460, 17773) • Letter-space dataset: • Compare random subsets: • GigaBayes (with Mosaik) • VARiD (with SHRiMP) • VARiD (with Mosaik) • Conclusion: • Using F-measure, the three pipelines perform very similarly. • High-coverage results are as good as can be achieved
Results • VARiD: Combining Letter-space and Color-space Data to achieve increased accuracy in at-cost comparison
Motivation Methods Results Summary
Summary • Summary of VARiD • HMM modeling underlying donor • Treats color-space and letter-space together in the same framework • no translation – take advantage of each technology’s properties • accurately calls short SNPs, indels in both color- and letter-space • improved results with hybrid data. • Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available) • Contact: varid@cs.utoronto.ca
SAVANT GENOME BROWSER
Challenge in Genomic Data Analysis • genomic data is generated in high volumes • interpretation and analysis challenge • typical pipeline employs many separate tools for computation and visualization
Tools for HTS data analysis • substantial disconnect between the processes of computational analysis and visualization
Savant Genome Browser • platform for integrated visual analysis of genomic data • feature-rich genome browser • computationally extensible via plugin framework
(Very) Short List of Features
FEATURE demonstration • INTERFACE • HTS READ ALIGNMENTS • EXAMPLE PLUGINS: SNP FINDER
Power of visual analytics • task: find the correct parameter for command-line tool
Plugin Framework • unlocks the potential for performing visual analytics • mutually beneficial for both users and tool developers • for users: perform complex data analyses on-the-fly within a visual environment • for programmers: platform for simple development and deployment of various programs
Conclusions
Conclusions • Savant is a platform for integrated visualization and analysis of genomic data • stand-alone genome browser • novel features: e.g. table view, visualization modes, data selection, etc. • computationally extensible through plugin framework • makes interpretation and analysis of genomic data easier and more efficient
Acknowledgements Orion Vanessa Joe Nilgun Paul Vera Recep Andrew Vlad Mike Brudno Yue Marc Misko Yoni
Questions? • SHRiMP • http://compbio.cs.toronto.edu/shrimp • VARiD • http://compbio.cs.toronto.edu/varid • Savant Genome Browser • http://compbio.cs.toronto.edu/savant