1 / 33

An alignment-free test for recombination

An alignment-free test for recombination. Presented by WONG Pak Kan. Contents. Background Motivation Methods Theory of rush Experiments on strains of E.coli Discussions. Coalescent theory. A retrospective model of population genetics

cece
Download Presentation

An alignment-free test for recombination

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. An alignment-free test for recombination Presented by WONG Pak Kan

  2. Contents • Background • Motivation • Methods • Theory of rush • Experiments on strains of E.coli • Discussions

  3. Coalescent theory • A retrospective model of population genetics • To trace all alleles of a gene shared by all members of a population to a single ancestral copy (the most recent common ancestor) • Gene genealogy • The inheritance relationships between alleles • Model the time to coalescence and neutral variation • Applications • Disease gene mapping • The genomic distribution of heterozygosity Reference: http://en.wikipedia.org/wiki/Coalescent_theory Image credit: http://www.math.duke.edu/groups/mathbio/images/geneTreesInSpeciesTree.jpg

  4. Horizontal gene transfer (HGT) Also termed lateral gene transfer (LGT) • The transfer of genes between organisms in a manner other than traditional reproduction • An important factor in the evolution of many organisms • Primary reason for bacterial antibiotic resistance Reference: http://en.wikipedia.org/wiki/Horizontal_gene_transfer http://zlgc.seu.edu.cn/jpkc/2010jpkc/jykc2/Content/jxzy/genetics/chapt14/art_library/color_art_library/14_09.jpg

  5. Motivation • Finding horizontal gene transfer in bacteria to uncover the emergence of virulent pathogens • Assessing the genetic exchange • Detect the presence of recombination • Estimate the recombination rate

  6. Utilize shustrings and statistical measures Methods

  7. SHortest Unique subSTRING(shustring) • Consider the queryq and the subject s. • At every position i in q we look for the shortest substring q[i..j] that is absent from s. • This is the shustring at position i and denote its length by Xi. • Example: q=ATATC and s=ACGTG

  8. Finding all shustrings • Enhanced suffix array (an abstract version of suffix tree) • Illustration in suffix tree

  9. Homologous matches and shustrings • Assumption • In pairs of homologous DNA sequence the shustrings correspond to homologous matches and hence their lengths represent distances to the next SNP. • Reasonable for closely related sequences such as those sampled from populations or recently diverged species. Reference: http://en.wikipedia.org/wiki/Homologous_recombination

  10. Test statistics (mean-based) • Consider the case without recombination • If sequences differ at a fraction π of their sites, every site i has probability π of differing between q and s, independently of other sites. • Based on the infinite sites model from population genetics, the average shustring length • where Xi is the length of the shustring at position i • is the total number of shustrings

  11. Fluctuations in the coalescence times • Recombination leads to fluctuations in the coalescence times along a sequence and therefore to clustering of polymorphisms. • The mean shustring length increases. • Unfortunately, inferring the expected average shustring length requires estimating π to a precision attainable only with alignments. Fig. 1. Simulated time to the most recent common ancestor (TMRCA) and mutations (vertical lines) along a recombination stretch of DNA sequence. TMRCA is proportional to the population size.

  12. Test statistics (variance-based) • Consider the case without recombination • The empirical variance of shustring lengths • where Xi is the length of the shustring at position i • is the total number o f shustrings • is then compared with the observed variance to test the null hypothesis and

  13. One-sided test • Assume that is normally distributed with expectation • Use a one-sided test for the normalized difference between and • which is approximately normally distributed with mean 0 and standard deviation 1. • A rough measure of recombination ?

  14. rush (Recombination detection Using SHustrings) • Input • A query and a subject DNA sequence in FASTA format • Output • Compute Q, Dr and the corresponding P-value. • Suffix tree construction • Based on deep-shallow algorithm (one of the most efficient string indexing methods) • Restricted to the analysis of sequences consisting solely of the nucleotide designations {A,C,G,T}, all other characters were removed prior to the analysis.

  15. On complete genome sequences from GenBank for the 58 E.colistrains Experiments

  16. Experiments • Simulation • Verify • Sensitivity analysis on the rejection rate • Comparison on Φw, Max-χ2 and Dr • Finding out horizontal gene transfer

  17. Parameters for simulations • Samples of homologous DNA sequences using coalescent simulation program ms in conjunction with ms2dna • http://guanine.evolbio.mpg.de/bioBox/ • Given the effective population size Ne • Population recombination rate ρ= 2Nec • C: probability of recombination per generation. • Population mutation rate ϑ= 2Neμ • μ: mutation probability per generation • Infinite site model • ϑ = expected number of pairwise mismatches =π

  18. Accuracy of < <

  19. Null hypothesis of no recombination is rejected at α=0.05 Rejection rate grows to >0.99 until Sequence length = 106 π = 10-3 Replicates = 104 Sample size for Φw = 4 0.05 π Population recombination rate Fig. 4. Dr has similar sensitivity as the alignment-based (Bruen et al., 2006). The graph shows the fraction of hypothesis tests rejected with significance P≤0.05 as a function of the rate of recombination, ρ. Φw: homoplasy-based test statistic.

  20. The number of mismatches sampled affects sensitivity. Length increases  Sensitivity increases π = 10-3 ρ = 100 for entire region Replicates = 104 Fig. 5. Rejection rate as a function of sequence length for Dr compared with Φw.

  21. Rejection rate: Φw, Max-χ2 vs. Dr Φw and Max-χ2 perform much better than Dr. Φw and Max-χ2 become too slow. Dr. is comparable to Φw and Max-χ2. ϑ/bp= 0.01 ρ = 16 Expected number of recombination events as in the sample of 10 = 45.3 Fig. 6. Comparing rejection frequencies between Φw, Max-χ2 and Dr as a function of the length of the input sequences.

  22. Running time: Φw, Max-χ2 vs. Dr • Simulated sequence quartets of lengths 104-107bp and timed Phi given an alignment. Phi: Slope 2.2 (22.2=4.6) rush: Slope 1.2 (21.2=2.3)

  23. Invariant to codon data and changes in GC context

  24. Horizontal gene transfer in E.coli • Apply to the 58 fully sequenced E.coli genomes available in GenBank. Cluster diagram of the strains drawn by Phylip using distances from kr.

  25. Statistics for the measures of recombination • 58*57 pairwise recombination tests. (16.5s per test) • 97% of these tests were rejected with significance P≤0.05 • Median 2.051 with a range of 0.587-40.939 • KO11 and K12_MG1655 has the largest Q. where

  26. Horizontal gene transfer K12_MG1655 Mb *alfy is a program for comparing one or more query sequences to a set of subject sequences and determining the closest homologue among the subjects along each query. http://math.cmu.edu/~lleung/project/Alfy_1.5/Doc/alfyDoc.pdf

  27. Discussion • Recombination detector rush • String-indexing techniques • Variance of the shustring length to form the statistical measures • Linear run time in the length of the sequences • Accurate and robust when either genetic diversity (10-3) is small or sequence length is long (>100kb). • Weakness • Identification of clustered polymorphisms methods are sensitive to variations in the rate of mutation. • 97% of the pairs of E.coli genomes tested had a significant Dr • Should use homoplasy-based methods including Φw over cluster-based methods

  28. Future • Use shustring to develop other measures • One algorithm may not be enough to solve the same problem at different scale. • How can alignment-based method attain the same or even faster speed (with statistical-based approach)?

  29. BioBox: Collection of Tools for Sequence Analysis • http://guanine.evolbio.mpg.de/bioBox/ • This is a collection of programs for efficiently carrying out routine sequence analysis tasks under the UNIX command line. Each tool comes with its own documentation. Please drop me a line at last_name at evolbio.mpg.de in case you find an error in this software. • cchar, v. 1.6: Count characters in sequence data. • cpg, v. 0.7: Compute the CpG content of DNA sequences. • cutSeq, v. 0.11: Cut regions from molecular sequences. • generateQuerySbjct, v. 0.4: Generate pairs of homologous DNA sequences. • gd, v. 0.12: Calculate genetic diversity (pi, S, and Tajima's D) from aligned DNA sequences with or without sliding window. • getSeq, v. 0.4: Get specific sequences from a FASTA file containing multiple entries. • ms2dna, v. 1.16: Generate samples of homologous DNA sequences evolved under defined evolutionary scenarios by converting the output of Richard Hudson's coalescent simulation program ms. As of version 1.11, it can also deal with output generated by Gary Chen's fast coalescent simulator MaCS using the pipeline macs [options] | msformatter | ms2dna -a. • randomizeSeq, v. 0.8: Randomize sequences. • sequencer, v. 1.12: Simulate shotgun sequencing with paired (as of version 1.11) or unpaired reads and a user-defined error rate.

  30. Infinite sites model • http://www.math.wisc.edu/~roch/285k.1.10s/285k-s10-lect20.pdf • http://www.stats.ox.ac.uk/~didelot/popgen/lecture8.pdf

  31. Search for the evidence of horizontal gene transfer • Regions of KO11 genome that were more closely related to K12_MG1655 than to its closest relative • However, an artifact of the clustering algorithm, which guarantees finding the correct tree only for distances that actually fit a tree. • Empirical distances may not form a tree • Look up from raw distances • Find that the distance between KO11 and KO11FL is 2.1x10-5, whereas that between KO11 and W1 or W2 is below the sensitivity of kr, (the program we used for estimating the substitution rates between genomes)

  32. Tools • Program Phi computes Max-Χ2 and Φw • Distance between E.coli genomes are computed using kr • Tree based on these distances is computed and drawn with Phylipusing neighbor joining and midpoint rooting

  33. Comparison with Max-χ2 • Max-χ2: traditional method for detecting recombination for sequences no longer than a few kilobase. • Simulation scheme: For Φw and Dr, we simulated samples of 10 sequences with ϑ/bp=0.01 and ρ =16 • Keeping the number of recombination events as in the sample of 10

More Related