590 likes | 789 Views
Inferring Genomic Sequences. Irina Astrovskaya Dr. Alexander Zelikovsky 02/15/2011. Outline. Quasispecies Spectrum Reconstruction Background Viral Quasispecies 454 Life Sciences Pyrosequencing System Problem Formulation ViSpA: Viral Spectrum Assembler
E N D
Inferring Genomic Sequences Irina Astrovskaya Dr. Alexander Zelikovsky 02/15/2011
Outline • Quasispecies Spectrum Reconstruction • Background • Viral Quasispecies • 454 Life Sciences Pyrosequencing System • Problem Formulation • ViSpA: Viral Spectrum Assembler • Experimental Results • Future Work • Tagging
Viral Quasispecies • RNA viruses (HIV, HCV) • Many replication mistakes • Quasispecies (qsps) = co-existing closely related variants • Variants differ in • virulence • ability to escape the immune system • resistance to antiviral therapies • tissue tropism • How do qsps contribute to viral persistence and evolution?
454 Pyrosequencing • Pyrosequencing =Sequencing by Synthesis. • GS FLX Titanium : • Fragments (reads): 300-800 bp • Sequence of the reads • Assembled single genome via the software • We need a software that assembles reads to multiple genomes!
Quasispecies Spectrum Reconstruction (QSR) Problem • Given • pyrosequencing reads from a quasispecies population of unknown size and distribution • Reconstruct the quasispecies spectrum • sequences • frequencies
State-of-the-Art: ShoRAH • N. Eriksson et al. (2008): • error correction via clustering • #clusters: statistical significance tests • 3 windows • majority rule • assembling via chain decomposition • maximum-likelihood freq. estimation • ShoRAH (O. Zagordi et al) : • probabilistic clustering • #clusters: Dirichlet process mixture
Alignment • Reference sequence is available. • Multiple coverage • No repeats => unique alignment. • Alignment score • minimizes Hamming distances • penalizes indels more than mismatches • DeletionsInsertions
454 Genotyping Error • Genotyping error rate =0.1%. • Fixed number of incorporated bases vs. light intensity value. • Incorrect resolution of homopolymers => • over-calls (insertions) • 65-75% of errors • under-calls (deletions) • 20-30% of errors
Preprocessing of Aligned Reads • Deletions in reads: D • Replace deletion, confirmed by a single read, with either allele value that is present in all other reads or N. • Insertions into reference: I • Remove insertions, confirmed by a single read. • Imputation of missing values N
Genotyping Error Correction • Rare mutations vs genotyping errors • ViSpA “Tag boxes” = correlation ≥ 0.8. • Fix incorrect value based on the top association. • ShoRAH (Zagordi et al, 2010) • Probabilistic clustering • 3 overlapping windows • k-mer correction (P.Pevzner, SOLID (D. Brinza)) • Count k-mers • Still works for quasispecies (P. Skums, CDC)
Read Graph: Vertices ACTGGTCCCTCCTGAGTGT GGTCCCTCCT TGGTCACTCGTGAG ACCTCATCGAAGCGGCGTCCT Subread = completely contained in some read with ≤ n mismatches. Superread = not a subread => the vertex in the read graph.
Read Graph: Edges • Edge b/w two vertices exists • if there is an overlap between superreads • they agree on their overlap with ≤ m mismatches. • Auxiliary vertices: source and sink
Read Graph: Transitive Reduction Forbidden paths Candidate sequence may be represented by several paths.
Read Graph: Edge Cost Δ • The most probable source-sink path through each vertex • Cost: uncertainty that two superreads are from the same qsps. • OverhangΔis the shift in start positions of two overlapping superreads.
Contig Assembling read r of length l qspss of length L k is #mismatches, t/L is a mutation rate • Max Bandwidth Path through vertex • path minimizing maximum edge cost for the path and each subpath • Consensus of path’s superreads • Each position: >70%-majority or N • Weighted consensus obtained on all reads • Remove duplicates • Duplicated sequences = statistical evidence
Expectation Maximization • E step: • M step: • Bipartite graph: • Qqis a candidate with frequency fq • Rris a read with observed frequency or • Weight hq,r= probability that read r is produced by qspsq with j mismatches
Experimental Validations • Comparison with ShoRAH on • Simulated data • Error-free reads from known HCV quasispecies • Reads with errors generated by FlowSim • Real 454 reads • HCV data • HIV data (10 qsps)
Simulations: Error-Free Reads • 44 qsps (1739 bp long) from the E1E2 region of Hepatitis C virus (von Hahn et al. (2006)) • Simulated reads: • 4 populations sizes: 10, 20, 30, 40 sequences • Geometric distribution • The quasispecies population: • Number of reads is (20K, 40K, 60K, 80K, 100K } • The read length distribution N(μ,400); μ is varied from 200 to 500.
Simulations with FlowSim • 44 qsps (1739 bp long) from E1E2 region of HCV • 30K reads • average length=350bp • Indels: ~99.96% of aligned reads • Insertions length: 1 (99.6%) • Deletions length: 1 (99.97%) • N: ~1.1% • of • reads
Simulations with FlowSim • “Bootstrap” tests • 10%-reduced data (subsample) • For each out of 10 most frequent assemblies on sample: • count % of iterations when it has <= k mismatches with one out of top 10 qsps assembled on subsample
HCV Qsps (from P. Balfe) • 30927 reads from 5.2Kb-long region of HCV-1a genomes • intravenous drug user being infected for less than 3 months => mutation rate is in [1.75%, 8%] • 27764 reads • average length=292bp • Indels: ~77% of reads • Insertions length: 1 (86%) , 3 (9.8%) • Deletions length: 1 (98%) • N: ~7% of reads
Estimation of Genotyping Error • Optimistic estimate: 0.000253 • # spurious insertions (confirmed by a single read) • Divide it by #reads, covering the positions in which spurious insertions were seen. • Pessimistic estimate: 0.001203 • Insertion is spurious if it has < 6 reads confirming it. • Pessimistic estimate is used in EM-based algorithm as genotyping error probability.
NJ Tree for 12 Most Frequent Qsps (No Insertions) Peter Balfe: “We found that the most abundant of these theoretical assemblies was 99% identical to one of the actual ORFs obtained by cloning the quasispecies.” • The top sequence: • 26.9% (no mismatches) and 50.4% (≤1 mismatch) of the reads. • In sum: • 35.6% (no mismatches ) and 64.5% (≤1 mismatch) of the reads.
NJ Tree for 10 Most Frequent Qsps: ShoRAH +ViSpA • ShoRAH • 1 qsps with viable protein • ViSpA: • 7 qsps with viable protein • 3 qsps have stop codon at 5229 or 5238 (length=5244) • Top 20: 16 are viable.
Robustness • ShoRAH : 35% of times infers only the 3rd most frequent sequence. • ViSpA repeats 7 sequences >= 15% times and the top sequence is repeated 40% times.
Systematic Sequencing Errors • Stop codons in amino-acid sequences • (Manual) Resolution method for qsps sequences • Find the frame (MSTNP for HCV) • Find the first stop-codon position in qsps • Align the amino-acid translations of qsps and reference • In the alignment, go to the left until alignments agree • Extend/reduce closest to the left homopolymer • Choose the one that matches the reference
Example Reference Contig
HIV Qsps (Zagordi et al.(2010)) • 55,611 reads from ten 1.5Kbp long region of HIV-1 • average length=345bp • No deletion of low-quality reads! • Indels: ~99% of reads • Insertions length: 1 (85%) , 2 (10%), 3 (3.5%) • Deletions length: 1 (99.97%) • N : ~11.6 % of reads
HIV Qsps (Zagordi et al.(2010)) • Mutation rate is in [3%,10%] • ShoRAH: • 2 qspssequences with <=4 mismatches • ViSpA : • 5 qspswith <=2 mismatches • 2 qsps is inferred exactly • ShoRAHreads+ViSpA: • 3qspsis inferred exactly
ViSpA’s Parameters • Three self-tuned parameters • n =#mismatches allowed between a read and a superread • m = #mismatches allowed in the overlap between two consecutive reads, • t/L = mutation rate • One user-defined parameter • # of assemblies to focus • default value is 10
Rationale for m Value • m depends on n. • n mismatches between read and its superread => • n mismatches in overlap between superreads • 2n mismatches between their subreads • Experiments: better connectivity if m=2.5n. • For each candidate path (m= n) • the same path (m=2.5n) or • path with better total cost (if m=2.5n). • path (m= n) represents only part of the quasispecies. • Thus, m=2.5n.
Self-Tuning of n Value n=0 while (true){ build a read graph(n,m=2.5n) count % of non-broken paths if (the percentage>=95%) break; else n++; } • In our experiments, n varies from 0 to 6. • Note: ViSpA do not transitively reduce graph. • Longer edge would not be used
Self-Tuning of t Value • Variability is decreased by error correction, reads clustering reads t=L*1.75 cumFr=0 and maxT=0 while (true){ assemble sequences estimate their frequencies cum = cumulative freq.of 10 top assemblies if (cumFr<cum) { cumFr=cum and maxT=t} t=t+0.2*L } t =maxT
ViSpA’s Availability • http://alan.cs.gsu.edu/~irina/vispa.html • a jar file upon request • It requires SEGEMEHL alignment • Reference file (fasta format) • 454 reads file (fasta format) • Then ViSpA’s input: • Reference file (fasta format) • SEGEMEHL output: <readfile>.map nohup java –Xms7020m –Xmx7020m –jar vispa.jar reference.fa readfile.map &
Conclusion • Viral Spectrum Assembler (ViSpA) tool • a simple error correction • qsps assembling based on maximum-bandwidth paths in weighted read graphs • frequency estimation via EM on all reads • statistical and experimental validation on simulated data and assemblies constructed on 454 reads from HCV and HIV samples: • ShoRAH’s error correction algorithm is prone to overcorrection. • ViSpA is better in assembling sequences.
Future Work Weights=1 Weights are {0,1}
Thank You • My advisor Dr. Alexander Zelikovsky • My committee members Dr. Yi Pan, Dr. Robert Harrison and Dr. Yury Khudyakov • My parents, Anatolij and Alexandra, my sister Nataliya, my niece Kseniya and Nirajan • My friends and colleagues Gulsah, Dima, Kelly, Vanessa, Tanya, David, Qiong, Navin, Akshaye, Serghei, Marco, Ed, Chad, Stephen, Dinesh, Bassam, Hamed… • Computer Science Department at GSU • Molecular Basis of Disease Program
Outline • SNPs, Haplotypes, and Genotypes • Tagging Problem • Overfitting • Minimum Tag Selection Problem for 2-Tags Covering • 2LR Tagging • Correlation Calculation • Graph Formulation • Linear Programming • Experimental results • Conclusions and Future Work
SNPs,Haplotypes and Genotypes • Length of Human Genome 3109bp. • Single Nucleotide Polymorphism (SNP) is a nucleotide variation that occurs across population (107bp). • SNPs are mostly biallelic. • Minor allele frequency should be considerable e.g. >0.1% • Haplotype: SNPs on a single copy of chromosome (0=major allele, 1=minor allele) • Genotype: a pair of haplotypes (0=major allele, 1=minor allele, 2=heterozygote; or -1=major allele, 1=minor allele, 0=heterozygote)
Tagging Motivation Since many SNPs are linked, we can: • Reduce genotyping cost: predictive tagging • sequence small amount of SNPs (tags) • infer the rest of SNP based on the tags • Reduce data analysis: informative tagging • find informative SNPs that cover variation of the non-tagged SNPs with correlation of at least 0.8. • remove “noise”
1 1 1 0 0 1 1 0 2 1 0 0 1 1 2 2 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 1 1 0 1 0 SNPs data 1 0 0 2 0 1 0 0 1 0 1 0 1 1 1 Tags data Minimum Tag Selection Problem (MTS Problem) • Given a sample from population S of n genotypes (haplotypes), each consisting of m SNPs • Find a minimum size subset of tag SNPs that statistically covers all other SNPs.
Tagging Tools/Methods • HapBlock (K. Zhang, M.S. Waterman, et al.) • greedy algorithm for tag selection • majority voting for prediction • STAMPA (E. Halperin and R. Shamir) • dynamic programming for tag selection • majority voting for prediction • Tagger (de Bakker et al.) • graph algorithm for tag selection • majority voting for prediction • Multiple Linear Regression (MLR) Tagging • greedy algorithm for tag selection • MLR for prediction
Overfitting • How many tags should be used to explain variation of an untagged SNP? • Is there a trade-off between number of tags and overfitting? • Minimum Tag Selection Problem for Two-Tag Covering (M2TS Problem) • Given a sample from population S of n genotypes, each consisting of m SNPs, • Select a minimum size subset of tag SNPs such that each untagged SNP is covered by at most two tags. • 2LR Tagging covers a SNP variation by either a single tag or a linear combination of two tags. • Tagger finds approximately minimum number of tags such that there is a tag per each non-tag SNP. • MLRfinds less tags since it covers non-tagged SNPs by a linear combination of tags. • Multiple linear regression might suffer from overfitting: unrelated tags can cover an untagged SNP. • The more tags per untagged SNP is used, the lower prediction accuracy is.
Correlation for Triple of SNPs The correlation between SNPs A and B: • Geometrically, the best approximation X =αB +βC of SNP A by SNPs B and C: • multiple linear regression • rounding to {-1; 0; 1}. • A correlation r2(A|B;C) between SNP A and pair of SNPs B and C:
Correlation and Missing Values Assumption: missing values are distributed randomly and uniformly. We use Expectation Maximization (EM) algorithm to estimate the most probable frequencies of alleles in SNPs with missing values.