210 likes | 354 Views
Pictured above: The structure of HIV. VirVarSeq vs ViVaMBC. Bie Verbist | NCS Brugge | 10-10-2014. Statistical methods for improved variant calling of massively parallel sequencing data. OUTLINE Viral dynamics Massive parallel sequencing Variant calling VirVarSeq ViVaMBC Results
E N D
Pictured above: The structure of HIV. • VirVarSeq vs ViVaMBC • BieVerbist | NCS Brugge | 10-10-2014 • Statistical methods for improved variant calling of massively parallel sequencing data.
OUTLINE • Viral dynamics • Massive parallel sequencing • Variant calling • VirVarSeq • ViVaMBC • Results • HCV plasmids • HCV clinical sample
Viral dynamics A virus is a small infectious agent that replicates only inside the living cells of other organisms. • High replication rate (1011 replications a day for HIV) • High mutation rate Viral population consist of closely related subgroups, viral quasispecies, which we want to identify and quantify.
Viral dynamics Drug-sensitive variants Drug-resistant variant Number of virusus in population Heterogeneous viral population Undetectable Time Before treatment On treatment
Sequencing Sanger sequencing • detection limit: 20-30% • no accurate estimate of frequency Massively parallel sequencing ACGGTTTCCGTCTGGG ACGGTTTCTGTCTGGG ACGGTTTCCGTCTGGG ACGGTTTCTGTCTGGG ACGGTTTCTGTCTGGG ACGGTTTCTGTCTGGG ACGGTTTCTGTCTGGG ACGATTTCTGTCTGGG • detection limit << 20% • more accurate estimate of frequency
Massively parallel sequencing Fragmentation Amplification Viral population DNA Fragments Sequencing by synthesis Example, one fragment: A A G T G C A C G T T T A C G T C
Massively parallel sequencing Viral population @HWUSI-EAS1524:17:FC:1:120:19254:21417 1:N:0:GATCAG GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAA + G@GG@GG@GGHHHBH>GEGDGGBGEGG?GGHHHH>GEGBG@?BEF?DBB<GDGGGGFGG3GGEBA>EC:; @HWUSI-EAS1524:17:FC:1:120:9430:21420 1:N:0:GATCAG ATCGGAAGAGCACACGNCTGAACTCCAGTCACGATCAGATCCCGTATGCCGTCTTCTGCTTGAAAAAAAA + DDDDDDDDDD2DDDDD#DDDDDDDDDDDDDDDDDDDDDDD2:8:7;<@>;DDDDDDDDDDD:DDDDD### @HWUSI-EAS1524:17:FC:1:120:12760:21420 1:N:0:GATCAG ATCATACTGTCTTACTNTGATAAAACCTCCAATTCCCCCTANCATTNTTGGTTNCCATCTTCCTTGCAAA + HHHHHHHHHHHHHHHG#GGGFFFF@HHHHHHGHHHHHHHHF#FFEB#BBBA>B#BFFFFFHHHHHHHHHG
Variant calling Distinguish low-frequency variants from sequencing error. VirVarSeq ViVaMBC Adaptive filtering approach based on quality scores. Verbist et al. 2014, Bioinformatics. doi: 10.1093/ bioinformatics/btu587. Model based clustering approach which models the error probabilities based on quality scores. Verbist et al. 2014, BMC bioinformatics. under revision.
VirVarSeq Extract reads that cover codon of interest Filter based on the quality scores. Build a codon table ... ... ... ... Reference Reads ... CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA ... ... ... ... ... Codon Table Filtering ... ... ... ... ... ... ... ... ... ... * codon = nucleotide triplets which specifies a single amino acid
VirVarSeq QIT Image or graphic goes here Definition of the Q-threshold (QIT) : Fit mixture distribution on Q-scores with 3 components: • Point prob around Q 2 • Error distribution • Reliable call distribution Intersection point is threshold.
ViVaMBC Extract reads that cover codon of interest Perform Model Based Clustering Model the error probability Clusters unknown, EM algorithm ... ... Reference Reads ... CGA CGA CCA CGA CGA CGA CCA CCA CGT CGA CGA GGA ... ... ... CCA ... ... CCA GGA Codon Table ... Clustering CCA ... CGT ... ... CGA ... ... CGA CGA ... CGA CGA ... CGA CGA ... Cluster medoids = variant Size of Cluster = Frequency N° Clusters = N° variants ...
Results – HCV plasmids Two plasmids • Amino acids 1 to 181 of NS3 region • differ at two codon positions (36 and 155) • mixed 4 different proportions
Results – HCV plasmids Other variants (11481 max) are false positives. VirVarSeq reports: • more false positives • with frequencies going up to 0,5%
Results - HCV clinical sample Image or graphic goes here VirVarSeq ViVaMBC VirVarSeq reports more variants. Above 1% methods in agreement, even above 0.5%. Few false pos in GC region for ViVaMBC ?
VirVarSeq vs ViVaMBC When applying reporting limits of 1% or 0.5%, methods are in agreement. Below this limit, trade-off between sensitivity and specificity, with VirVarSeq less specific. VirVarSeq Adaptive approach Easy development Runs fast ViVaMBC • More elegant • Longer development time • Longer run time
Acknowledgements Promoters: Prof.Dr.O.Thas1, Prof.Dr.L.Clement1 and Prof.Dr.L.Bijnens2 Yves Wetzels, Tobias Verbeke, Joris Meys1 for IT support Scientists within discovery sciences2 Non-clinical statistics team2 2 2 1
bverbis2@its.jnj.com • 10-10-2014 • Thank you
ViVaMBC • Notation: • ri: best base calls of read i (i=1 ... n) • si: second best base calls of read i (i=1 ... n) • zij: zij=1 when read i belongs to haplotype j (j=1...k) • τj: probability to belong to haplotype j • Complete Data Likelihood:
ViVaMBC • Complete Data Likelihood: Likelihood depends on cluster membership zij EM algorithm
Library preparation Sequencing by synthesis