Maximum-likelihood estimation of substitution heterogeneity through clustering
Zhang Zhang and Jeffrey P. Townsend
Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut, United States of America
[Figure 2 flowchart: Input sequence $X = (x_0 x_1 \ldots x_{N-1})$; set $i = 0$, $j = N-1$. Detect a possible cluster $(c_s, c_e)$ among $(x_i x_{i+1} \ldots x_j)$. Is the cluster $(c_s, c_e)$ significant? If yes (Y), repeat the detection on three sub-sequences: $(i, c_s - 1)$, $(c_s, c_e)$ and $(c_e + 1, j)$. If no (N), end.]
[Figure 1 axis: 0, $c_s$, $c_e$, $N-1$]
Summary
Detecting substitution heterogeneity is of great importance for locating divergent and conserved regions along DNA and protein sequences. Heterogeneous regions correspond to nonuniform features and imply different evolutionary processes arising from different functional constraints or selection. Here we propose a maximum-likelihood method to detect substitution heterogeneity in sequences. The method uses divide and conquer to cluster sequences into heterogeneous regions with different levels of substitution. To determine whether a cluster deviates significantly from the rest of the sequence, the method adopts several criteria: the Likelihood Ratio Test (LRT), the Akaike Information Criterion (AIC) and its variant (AICc), and the Bayesian Information Criterion (BIC). The method needs no a priori knowledge of cluster size or of the number of clusters. In particular, it is more accurate, as shown by its application to several real data sets, and it can be applied to comparative evolutionary studies. In addition, we recommend that criteria that take sample size into account (AICc and BIC) be used in all circumstances.
Introduction
• Substitutions along sequences are concentrated in some regions and/or relatively sparse in others.
Why should we care?
• Substitution heterogeneity provides clues to sequence function.
• It provides implications of sequence structure.
• It provides evidence of interesting evolutionary phenomena.
What is the question?
• Substitutions are not uniform.
• Positive selection often acts on small regions along a sequence.
• Heterogeneity of substitution within genes is not well accounted for by existing methods, which either average selective pressure across all sites under the assumption that all sites evolve at the same rate, or estimate selective pressure at each site under a given statistical distribution.
• There is no statistical method to detect clustering within discrete linear sequences when neither the cluster size nor the number of clusters is specified a priori.
How can we do it?
• Here we propose a method based on maximum likelihood estimation (MLE) to detect regional clusters with different substitution heterogeneity.
Model selection
To examine whether a clustered model best fits a sequence, our method adopts several different criteria for model selection.
• Likelihood Ratio Test (LRT) is one of the most popular strategies for model selection. Twice the difference between the maximized log-likelihoods of the clustering ($\ln L_c$) and null ($\ln L_0$) models, $2(\ln L_c - \ln L_0)$, should be asymptotically distributed as $\chi^2$ with two degrees of freedom. A p-value below the significance level (usually 0.05) indicates that the clustering model fits the data significantly better than the null model; otherwise the null model is retained.
• Akaike Information Criterion (AIC) estimates the Kullback-Leibler distance between the true model and an examined model and so quantifies the information lost by approximating the true model: $\mathrm{AIC} = -2\ln L + 2k$, where $L$ is the maximized likelihood value and $k$ is the number of parameters. We define $k_0$ and $k_c$ as the numbers of parameters under the null hypothesis and the clustering model, respectively; thus $k_0 = 0$ and $k_c = 2$.
• AICc, a modification of AIC, accounts for sample size ($n$) as well as $k$ and $L$: $\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}$. This matters especially when clustering short sequences into heterogeneous regions, which probably involves more bias.
• Bayesian Information Criterion (BIC), like AICc, is a function of the maximized log-likelihood, the number of parameters and the sample size: $\mathrm{BIC} = -2\ln L + k\ln n$.
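The four criteria listed under Model selection come down to a few lines of arithmetic, sketched here in Python. This is an illustration only, not the MLCluster implementation. The function names and the `sample_size` argument are hypothetical, the formulas are the standard textbook forms of AIC, AICc and BIC, and the LRT p-value uses the closed-form tail of a chi-square distribution with two degrees of freedom, exp(-x/2). What should count as the sample size (for example, the number of sites in the sub-sequence) is an assumption left to the caller.

```python
import math


def aic(ln_l, k):
    """Akaike Information Criterion from a maximized log-likelihood."""
    return -2.0 * ln_l + 2.0 * k


def aicc(ln_l, k, sample_size):
    """Small-sample corrected AIC; requires sample_size > k + 1."""
    return aic(ln_l, k) + 2.0 * k * (k + 1) / (sample_size - k - 1)


def bic(ln_l, k, sample_size):
    """Bayesian Information Criterion."""
    return -2.0 * ln_l + k * math.log(sample_size)


def lrt_p_value(ln_l0, ln_lc):
    """P-value of 2(lnLc - lnL0) under chi-square with 2 degrees of freedom."""
    statistic = max(0.0, 2.0 * (ln_lc - ln_l0))
    return math.exp(-statistic / 2.0)  # exact survival function for 2 d.f.


def prefers_cluster(ln_l0, ln_lc, sample_size, criterion="AICc",
                    alpha=0.05, k0=0, kc=2):
    """True if the chosen criterion favors the clustering model.

    k0 = 0 and kc = 2 follow the parameter counts given in the text.
    """
    if criterion == "LRT":
        return lrt_p_value(ln_l0, ln_lc) < alpha
    if criterion == "AIC":
        return aic(ln_lc, kc) < aic(ln_l0, k0)
    if criterion == "AICc":
        return aicc(ln_lc, kc, sample_size) < aicc(ln_l0, k0, sample_size)
    if criterion == "BIC":
        return bic(ln_lc, kc, sample_size) < bic(ln_l0, k0, sample_size)
    raise ValueError("unknown criterion: " + criterion)
```

For the information criteria the model with the lower score is preferred, so a cluster is kept only when its gain in log-likelihood outweighs the penalty for its two extra parameters. The BIC penalty grows with the sample size, which makes it the more conservative choice on long sequences.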
Algorithm implementation
• The new method uses a divide-and-conquer approach to detect all possible clusters along a sequence. After locating the first cluster, the method partitions the sequence into three sub-sequences and then repeats the same analysis on each of them, until all segments of the sequence have failed to demonstrate clustering (see Figure 2).
Figure 2. Flowchart of the detection of heterogeneous clusters using the divide-and-conquer approach. Note that $i$ and $j$ represent the start and end positions of the (sub-)sequence to be clustered.
Results
• We used our method to detect clusters along the Adh gene in five species of the Drosophila melanogaster species subgroup (D. melanogaster, D. sechellia, D. simulans, D. yakuba and D. erecta) and identified heterogeneous clusters with different levels of substitution (see Table 1 and Figure 3).
Figure 3. Clusters for the Adh gene (254 amino acids) in D. melanogaster, using (a) LRT, AIC and AICc and (b) BIC as the criterion. Substitution sites reside at positions 1, 8, 25, 71, 80, 82, 84, 97, 190, 212, 215, 218, 228 and 246 (site numbering starts at 0) and are represented by solid lines above the sequence coordinate. The left vertical axis indicates the substitution probability within each cluster. The clusters between sites 98 and 189 and between sites 26 and 70 are cold spots, shown in both (a) and (b). In contrast, the clusters between sites 80 and 84 and between sites 212 and 218 are hot spots, shown only in (a). The null hypothesis is rejected at the 0.05 threshold by the LRT method.
Discussion
The new method adopts maximum likelihood estimation, uses several criteria to identify whether a cluster is present within the sequence, and employs a divide-and-conquer approach to locate all possible clusters. It has several properties:
• It does not need a priori knowledge of the number of clusters.
• It offers much more statistical power.
• It uses multiple criteria to test significance.
• It has other applications, such as detecting GC heterogeneity by setting G/C = 1 and A/T = 0.
Methods to detect positive selection and McDonald-Kreitman (MK) analysis do not consider substitution heterogeneity. Therefore, our future work includes:
• Detecting positive selection on localized clusters.
• Extending the method to polymorphism data within an MK framework.
Materials and methods
Alignment
• Suppose that two or more aligned sequences have $N$ sites and that, for each site, 0 represents identical and 1 represents variant. The aligned sequences can therefore be denoted as $X = (x_0 x_1 \ldots x_{N-1})$, with $x_i \in \{0, 1\}$.
Clustering model
• The null hypothesis assumes no heterogeneous cluster among the sequences. Consequently, the likelihood of the null model (without clusters) is calculated as
$$L_0 = \left(\frac{n}{N}\right)^{n}\left(1 - \frac{n}{N}\right)^{N-n},$$
where $n$ is the number of variant sites.
• Under the clustering model, the entire sequence is partitioned into three regions and the central region is considered to be the heterogeneous cluster (see Figure 1).
Figure 1. Illustration of a cluster within sequence $X$. Here $c_s$ and $c_e$ are the start and end positions of the cluster, respectively, and $n_s$, $n_c$ and $n_e$ are the numbers of variant sites in the beginning, central and ending regions, respectively, where $n = n_s + n_c + n_e$.
• The likelihood of the clustering model is formulated as
$$L_c = p_0^{\,n_s+n_e}(1-p_0)^{N-(c_e-c_s+1)-(n_s+n_e)}\; p_c^{\,n_c}(1-p_c)^{(c_e-c_s+1)-n_c},$$
where $p_0 = \frac{n_s+n_e}{N-(c_e-c_s+1)}$ and $p_c = \frac{n_c}{c_e-c_s+1}$.
• Therefore, we define:
• $p_0 < p_c$: the central cluster $(c_s, c_e)$ is a hot spot.
• $p_0 > p_c$: the central cluster $(c_s, c_e)$ is a cold spot.
Literature cited
Gaut, B. S., and B. S. Weir. 1994. Detecting substitution-rate heterogeneity among regions of a nucleotide sequence. Mol Biol Evol 11:620-629.
Goss, P. J., and R. C. Lewontin. 1996. Detecting heterogeneity of substitution along DNA and protein sequences. Genetics 143:589-602.
Posada, D., and T. R. Buckley. 2004. Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Syst Biol 53:793-808.
Tang, H., and R. C. Lewontin. 1999.
Locating regions of differential variability in DNA and protein sequences. Genetics 153:485-495.
Acknowledgments
We thank Zheng Wang, Francesc López, Aleksandra Adomas, Gina Wilpiszeski and Andrea Hodgins-Davis for valuable discussions. This work is supported by a grant from the National Institute of General Medical Sciences at the U.S. National Institutes of Health (GM068087).
Further information
Please contact Zhang.Zhang@yale.edu and Jeffrey.Townsend@yale.edu. The proposed method has been implemented in the program MLCluster, which is freely available at www.yale.edu/townsend/software.html. A PDF version of the poster can be obtained at www.yale.edu/townsend/Poster/MLCluster.pdf.
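The clustering model from Materials and methods and the divide-and-conquer procedure from Algorithm implementation can be sketched together in Python. This is a minimal illustration under stated assumptions, not the MLCluster program. All function names are hypothetical. Each region is assumed to follow a Bernoulli model with its maximum-likelihood substitution probability. The candidate cluster is found by exhaustive search over all $(c_s, c_e)$, which the poster does not specify. Significance is judged by plain AIC with $k_0 = 0$ and $k_c = 2$, standing in for whichever criterion is preferred.

```python
import math


def bernoulli_ln_l(n_variant, n_sites):
    """Maximized log-likelihood of n_variant 1s among n_sites binary sites."""
    if n_variant == 0 or n_variant == n_sites:
        return 0.0  # p is 0 or 1 (or the region is empty): likelihood is 1
    p = n_variant / n_sites
    return n_variant * math.log(p) + (n_sites - n_variant) * math.log(1.0 - p)


def best_cluster(x, i, j):
    """Return (lnL0, lnLc, cs, ce) for the best proper cluster in x[i..j]."""
    n_sites = j - i + 1
    prefix = [0]  # prefix[m] = number of variant sites in x[i : i + m]
    for value in x[i:j + 1]:
        prefix.append(prefix[-1] + value)
    n = prefix[-1]
    ln_l0 = bernoulli_ln_l(n, n_sites)
    best = (ln_l0, -math.inf, None, None)
    for cs in range(i, j + 1):
        for ce in range(cs, j + 1):
            length = ce - cs + 1
            if length == n_sites:
                continue  # a cluster must leave some background sites
            nc = prefix[ce - i + 1] - prefix[cs - i]
            ln_lc = (bernoulli_ln_l(nc, length)
                     + bernoulli_ln_l(n - nc, n_sites - length))
            if ln_lc > best[1]:
                best = (ln_l0, ln_lc, cs, ce)
    return best


def find_clusters(x, i=0, j=None, k0=0, kc=2):
    """Recursively split x[i..j]; return (start, end, probability) segments."""
    if j is None:
        j = len(x) - 1
    if i > j:
        return []  # empty sub-sequence
    ln_l0, ln_lc, cs, ce = best_cluster(x, i, j)
    # Assumed criterion: keep the cluster only if its AIC is strictly lower.
    if cs is None or -2.0 * ln_lc + 2 * kc >= -2.0 * ln_l0 + 2 * k0:
        return [(i, j, sum(x[i:j + 1]) / (j - i + 1))]
    return (find_clusters(x, i, cs - 1, k0, kc)
            + find_clusters(x, cs, ce, k0, kc)
            + find_clusters(x, ce + 1, j, k0, kc))
```

As a usage example, `find_clusters([0] * 20 + [1] * 10 + [0] * 20)` splits the sequence into a fully variant central segment flanked by two invariant ones. The exhaustive search costs quadratic time in the sub-sequence length at each level of recursion, which is acceptable for gene-length inputs but is a simplification chosen here for clarity.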