Inferring Genomic Sequences

Inferring Genomic Sequences Irina Astrovskaya Dr. Alexander Zelikovsky 02/15/2011

Outline • Quasispecies Spectrum Reconstruction • Background • Viral Quasispecies • 454 Life Sciences Pyrosequencing System • Problem Formulation • ViSpA: Viral Spectrum Assembler • Experimental Results • Future Work • Tagging

Viral Quasispecies • RNA viruses (HIV, HCV) • Many replication mistakes • Quasispecies (qsps) = co-existing closely related variants • Variants differ in • virulence • ability to escape the immune system • resistance to antiviral therapies • tissue tropism • How do qsps contribute to viral persistence and evolution?

454 Pyrosequencing • Pyrosequencing =Sequencing by Synthesis. • GS FLX Titanium : • Fragments (reads): 300-800 bp • Sequence of the reads • Assembled single genome via the software • We need a software that assembles reads to multiple genomes!

Quasispecies Spectrum Reconstruction (QSR) Problem • Given • pyrosequencing reads from a quasispecies population of unknown size and distribution • Reconstruct the quasispecies spectrum • sequences • frequencies

State-of-the-Art: ShoRAH • N. Eriksson et al. (2008): • error correction via clustering • #clusters: statistical significance tests • 3 windows • majority rule • assembling via chain decomposition • maximum-likelihood freq. estimation • ShoRAH (O. Zagordi et al) : • probabilistic clustering • #clusters: Dirichlet process mixture

ViSpA Viral Spectrum Assembler

Alignment • Reference sequence is available. • Multiple coverage • No repeats => unique alignment. • Alignment score • minimizes Hamming distances • penalizes indels more than mismatches • DeletionsInsertions

454 Genotyping Error • Genotyping error rate =0.1%. • Fixed number of incorporated bases vs. light intensity value. • Incorrect resolution of homopolymers => • over-calls (insertions) • 65-75% of errors • under-calls (deletions) • 20-30% of errors

Preprocessing of Aligned Reads • Deletions in reads: D • Replace deletion, confirmed by a single read, with either allele value that is present in all other reads or N. • Insertions into reference: I • Remove insertions, confirmed by a single read. • Imputation of missing values N

Genotyping Error Correction • Rare mutations vs genotyping errors • ViSpA “Tag boxes” = correlation ≥ 0.8. • Fix incorrect value based on the top association. • ShoRAH (Zagordi et al, 2010) • Probabilistic clustering • 3 overlapping windows • k-mer correction (P.Pevzner, SOLID (D. Brinza)) • Count k-mers • Still works for quasispecies (P. Skums, CDC)

Read Graph: Vertices ACTGGTCCCTCCTGAGTGT GGTCCCTCCT TGGTCACTCGTGAG ACCTCATCGAAGCGGCGTCCT Subread = completely contained in some read with ≤ n mismatches. Superread = not a subread => the vertex in the read graph.

Read Graph: Edges • Edge b/w two vertices exists • if there is an overlap between superreads • they agree on their overlap with ≤ m mismatches. • Auxiliary vertices: source and sink

Read Graph: Transitive Reduction Forbidden paths Candidate sequence may be represented by several paths.

Read Graph: Edge Cost Δ • The most probable source-sink path through each vertex • Cost: uncertainty that two superreads are from the same qsps. • OverhangΔis the shift in start positions of two overlapping superreads.

Contig Assembling read r of length l qspss of length L k is #mismatches, t/L is a mutation rate • Max Bandwidth Path through vertex • path minimizing maximum edge cost for the path and each subpath • Consensus of path’s superreads • Each position: >70%-majority or N • Weighted consensus obtained on all reads • Remove duplicates • Duplicated sequences = statistical evidence

Expectation Maximization • E step: • M step: • Bipartite graph: • Qqis a candidate with frequency fq • Rris a read with observed frequency or • Weight hq,r= probability that read r is produced by qspsq with j mismatches

Experimental Validations • Comparison with ShoRAH on • Simulated data • Error-free reads from known HCV quasispecies • Reads with errors generated by FlowSim • Real 454 reads • HCV data • HIV data (10 qsps)

Simulations: Error-Free Reads • 44 qsps (1739 bp long) from the E1E2 region of Hepatitis C virus (von Hahn et al. (2006)) • Simulated reads: • 4 populations sizes: 10, 20, 30, 40 sequences • Geometric distribution • The quasispecies population: • Number of reads is (20K, 40K, 60K, 80K, 100K } • The read length distribution N(μ,400); μ is varied from 200 to 500.

Results

Simulations with FlowSim • 44 qsps (1739 bp long) from E1E2 region of HCV • 30K reads • average length=350bp • Indels: ~99.96% of aligned reads • Insertions length: 1 (99.6%) • Deletions length: 1 (99.97%) • N: ~1.1% • of • reads

Simulations with FlowSim • “Bootstrap” tests • 10%-reduced data (subsample) • For each out of 10 most frequent assemblies on sample: • count % of iterations when it has <= k mismatches with one out of top 10 qsps assembled on subsample

HCV Qsps (from P. Balfe) • 30927 reads from 5.2Kb-long region of HCV-1a genomes • intravenous drug user being infected for less than 3 months => mutation rate is in [1.75%, 8%] • 27764 reads • average length=292bp • Indels: ~77% of reads • Insertions length: 1 (86%) , 3 (9.8%) • Deletions length: 1 (98%) • N: ~7% of reads

HCV Data Statistics

Estimation of Genotyping Error • Optimistic estimate: 0.000253 • # spurious insertions (confirmed by a single read) • Divide it by #reads, covering the positions in which spurious insertions were seen. • Pessimistic estimate: 0.001203 • Insertion is spurious if it has < 6 reads confirming it. • Pessimistic estimate is used in EM-based algorithm as genotyping error probability.

NJ Tree for 12 Most Frequent Qsps (No Insertions) Peter Balfe: “We found that the most abundant of these theoretical assemblies was 99% identical to one of the actual ORFs obtained by cloning the quasispecies.” • The top sequence: • 26.9% (no mismatches) and 50.4% (≤1 mismatch) of the reads. • In sum: • 35.6% (no mismatches ) and 64.5% (≤1 mismatch) of the reads.

NJ Tree for 10 Most Frequent Qsps: ShoRAH +ViSpA • ShoRAH • 1 qsps with viable protein • ViSpA: • 7 qsps with viable protein • 3 qsps have stop codon at 5229 or 5238 (length=5244) • Top 20: 16 are viable.

Robustness • ShoRAH : 35% of times infers only the 3rd most frequent sequence. • ViSpA repeats 7 sequences >= 15% times and the top sequence is repeated 40% times.

Systematic Sequencing Errors • Stop codons in amino-acid sequences • (Manual) Resolution method for qsps sequences • Find the frame (MSTNP for HCV) • Find the first stop-codon position in qsps • Align the amino-acid translations of qsps and reference • In the alignment, go to the left until alignments agree • Extend/reduce closest to the left homopolymer • Choose the one that matches the reference

Example Reference Contig

HIV Qsps (Zagordi et al.(2010)) • 55,611 reads from ten 1.5Kbp long region of HIV-1 • average length=345bp • No deletion of low-quality reads! • Indels: ~99% of reads • Insertions length: 1 (85%) , 2 (10%), 3 (3.5%) • Deletions length: 1 (99.97%) • N : ~11.6 % of reads

HIV Qsps (Zagordi et al.(2010)) • Mutation rate is in [3%,10%] • ShoRAH: • 2 qspssequences with <=4 mismatches • ViSpA : • 5 qspswith <=2 mismatches • 2 qsps is inferred exactly • ShoRAHreads+ViSpA: • 3qspsis inferred exactly

ViSpA’s Parameters • Three self-tuned parameters • n =#mismatches allowed between a read and a superread • m = #mismatches allowed in the overlap between two consecutive reads, • t/L = mutation rate • One user-defined parameter • # of assemblies to focus • default value is 10

Rationale for m Value • m depends on n. • n mismatches between read and its superread => • n mismatches in overlap between superreads • 2n mismatches between their subreads • Experiments: better connectivity if m=2.5n. • For each candidate path (m= n) • the same path (m=2.5n) or • path with better total cost (if m=2.5n). • path (m= n) represents only part of the quasispecies. • Thus, m=2.5n.

Self-Tuning of n Value n=0 while (true){ build a read graph(n,m=2.5n) count % of non-broken paths if (the percentage>=95%) break; else n++; } • In our experiments, n varies from 0 to 6. • Note: ViSpA do not transitively reduce graph. • Longer edge would not be used

Self-Tuning of t Value • Variability is decreased by error correction, reads clustering reads t=L*1.75 cumFr=0 and maxT=0 while (true){ assemble sequences estimate their frequencies cum = cumulative freq.of 10 top assemblies if (cumFr<cum) { cumFr=cum and maxT=t} t=t+0.2*L } t =maxT

ViSpA’s Availability • http://alan.cs.gsu.edu/~irina/vispa.html • a jar file upon request • It requires SEGEMEHL alignment • Reference file (fasta format) • 454 reads file (fasta format) • Then ViSpA’s input: • Reference file (fasta format) • SEGEMEHL output: <readfile>.map nohup java –Xms7020m –Xmx7020m –jar vispa.jar reference.fa readfile.map &

Conclusion • Viral Spectrum Assembler (ViSpA) tool • a simple error correction • qsps assembling based on maximum-bandwidth paths in weighted read graphs • frequency estimation via EM on all reads • statistical and experimental validation on simulated data and assemblies constructed on 454 reads from HCV and HIV samples: • ShoRAH’s error correction algorithm is prone to overcorrection. • ViSpA is better in assembling sequences.

Future Work Weights=1 Weights are {0,1}

Thank You • My advisor Dr. Alexander Zelikovsky • My committee members Dr. Yi Pan, Dr. Robert Harrison and Dr. Yury Khudyakov • My parents, Anatolij and Alexandra, my sister Nataliya, my niece Kseniya and Nirajan • My friends and colleagues Gulsah, Dima, Kelly, Vanessa, Tanya, David, Qiong, Navin, Akshaye, Serghei, Marco, Ed, Chad, Stephen, Dinesh, Bassam, Hamed… • Computer Science Department at GSU • Molecular Basis of Disease Program

Genotype Tagging with Limited Overfitting

Outline • SNPs, Haplotypes, and Genotypes • Tagging Problem • Overfitting • Minimum Tag Selection Problem for 2-Tags Covering • 2LR Tagging • Correlation Calculation • Graph Formulation • Linear Programming • Experimental results • Conclusions and Future Work

SNPs,Haplotypes and Genotypes • Length of Human Genome  3109bp. • Single Nucleotide Polymorphism (SNP) is a nucleotide variation that occurs across population (107bp). • SNPs are mostly biallelic. • Minor allele frequency should be considerable e.g. >0.1% • Haplotype: SNPs on a single copy of chromosome (0=major allele, 1=minor allele) • Genotype: a pair of haplotypes (0=major allele, 1=minor allele, 2=heterozygote; or -1=major allele, 1=minor allele, 0=heterozygote)

Tagging Motivation Since many SNPs are linked, we can: • Reduce genotyping cost: predictive tagging • sequence small amount of SNPs (tags) • infer the rest of SNP based on the tags • Reduce data analysis: informative tagging • find informative SNPs that cover variation of the non-tagged SNPs with correlation of at least 0.8. • remove “noise”

1 1 1 0 0 1 1 0 2 1 0 0 1 1 2 2 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 1 1 0 1 0 SNPs data 1 0 0 2 0 1 0 0 1 0 1 0 1 1 1 Tags data Minimum Tag Selection Problem (MTS Problem) • Given a sample from population S of n genotypes (haplotypes), each consisting of m SNPs • Find a minimum size subset of tag SNPs that statistically covers all other SNPs.

Tagging Tools/Methods • HapBlock (K. Zhang, M.S. Waterman, et al.) • greedy algorithm for tag selection • majority voting for prediction • STAMPA (E. Halperin and R. Shamir) • dynamic programming for tag selection • majority voting for prediction • Tagger (de Bakker et al.) • graph algorithm for tag selection • majority voting for prediction • Multiple Linear Regression (MLR) Tagging • greedy algorithm for tag selection • MLR for prediction

Overfitting • How many tags should be used to explain variation of an untagged SNP? • Is there a trade-off between number of tags and overfitting? • Minimum Tag Selection Problem for Two-Tag Covering (M2TS Problem) • Given a sample from population S of n genotypes, each consisting of m SNPs, • Select a minimum size subset of tag SNPs such that each untagged SNP is covered by at most two tags. • 2LR Tagging covers a SNP variation by either a single tag or a linear combination of two tags. • Tagger finds approximately minimum number of tags such that there is a tag per each non-tag SNP. • MLRfinds less tags since it covers non-tagged SNPs by a linear combination of tags. • Multiple linear regression might suffer from overfitting: unrelated tags can cover an untagged SNP. • The more tags per untagged SNP is used, the lower prediction accuracy is.

Correlation for Triple of SNPs The correlation between SNPs A and B: • Geometrically, the best approximation X =αB +βC of SNP A by SNPs B and C: • multiple linear regression • rounding to {-1; 0; 1}. • A correlation r2(A|B;C) between SNP A and pair of SNPs B and C:

Correlation and Missing Values Assumption: missing values are distributed randomly and uniformly. We use Expectation Maximization (EM) algorithm to estimate the most probable frequencies of alleles in SNPs with missing values.

Inferring Genomic Sequences

Inferring Genomic Sequences

Presentation Transcript

Finding approximate palindromes in genomic sequences.

Evolutionary and genomic approaches to find gene regulatory sequences

Inferring:

Algorithms for Alignment of Genomic Sequences

Inferring meaning

Inferring

Finding Promoters other important genomic sequences

Inferring

Inferring

Inferring

Inferring

Standardized Metadata for Human Pathogen/Vector Genomic Sequences

Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics

Alignment of large genomic sequences

Alignment of Genomic Sequences

Inferring Phylogeny using Permutation Patterns on Genomic Data

Locus Reference Genomic (LRG) Sequences

INFERRING

High Performance Direct Pairwise Comparison of Genomic Sequences

Inferring Specifications

Extracting homoeologous genomic sequences – the challenge of the wheat genome

Alignment of large genomic sequences