Multiple Sequence Alignment
Highly conserved regions in a multiple sequence alignment (MSA) may carry important functional information.
Families • gene family: a set of homologous genes • protein family: a set of homologous proteins • examples: • globin gene family • HOX gene family • serine/threonine kinase family
Protein families • “The large majority of proteins come from no more than one thousand families” (Chothia 1994)
Protein structure • amino acid sequence (primary) • three-dimensional structure • small scale (secondary) • alpha-helix, beta sheet, fold • large scale (tertiary) • domain • fully functional protein (quaternary)
Domains • A protein may be composed of several domains • Each domain carries a specific function • Structure is more likely to be conserved than sequence • One exon might represent one domain
Related Motivation • Gain insight into evolutionary history • By looking at the number of mutations necessary to go from one sequence to another, one can assess the time of divergence
Alternatives to SP score • What we have now (log-likelihood ratio) • A natural extension for aligning 3 sequences (can be unrealistically over-parameterized)
Example • VSNS • SNA • AS
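For comparison, the SP (sum-of-pairs) score itself can be sketched on the three example sequences. The match, mismatch and gap values and the two candidate alignments below are illustrative assumptions, not taken from the slides; a real scorer would use a substitution matrix such as BLOSUM.

```python
from itertools import combinations

def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: add up the pairwise score of every column over all row pairs."""
    assert len({len(row) for row in alignment}) == 1, "rows must have equal length"
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue          # a gap aligned with a gap contributes 0 (assumed convention)
            elif a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total

# Two hypothetical alignments of VSNS, SNA and AS.
alignment_a = ["VSNS", "-SNA", "--AS"]
alignment_b = ["VSNS", "-SNA", "-AS-"]
print(sp_score(alignment_a))  # -9
print(sp_score(alignment_b))  # -11
```

Under these assumed scores the first alignment is preferred, because it has the higher SP score.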
Carrillo & Lipman Algorithm: an attempt to reduce the volume of the dynamic programming matrix
3 or more sequences The optimal alignment path is contained in a "polyhedron" close to the main diagonal. Here, a polyhedron is a solid formed by plane faces, or more complicated 2-dimensional surfaces. For better visualization, the polyhedron's shadows are displayed. While visiting a node and looking for the minimum along all the incoming edges, we can ignore those edges that are "coming from outside the polyhedron", as in the top part of the inset. On its top-left side, the cube is "covered" by the polyhedron. The edges 1, 2, 3, 6 and 7 are coming from the inside, and edges 4 and 5 can be ignored.
Progressive Alignment Methods Most commonly used approach to multiple alignment
Progressive Methods • Start with the most closely related sequences, then progressively add less related sequences to the initial alignment
Guide Tree for Progressive Methods Do NOT confuse with phylogenetic tree
Ad hoc Guide Tree Building First construct a distance matrix of all pairwise distances
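The distance-matrix step can be sketched as follows. This is a minimal illustration that assumes plain edit distance (Levenshtein) as the pairwise distance; programs such as CLUSTALW derive distances from pairwise alignment scores instead. The helper names `edit_distance`, `distance_matrix` and `closest_pair` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric matrix of all pairwise distances."""
    n = len(seqs)
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j])
    return d

def closest_pair(d):
    """Indices of the two most similar sequences: the pair a guide tree would join first."""
    n = len(d)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    return min(pairs, key=lambda p: d[p[0]][p[1]])

seqs = ["VSNS", "SNA", "AS"]
d = distance_matrix(seqs)
print(d)                # [[0, 2, 3], [2, 0, 3], [3, 3, 0]]
print(closest_pair(d))  # (0, 1)
```

A guide tree builder would merge the closest pair first and then repeat on the reduced matrix.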
Problems of Progressive Alignment • No guarantee of a globally optimal multiple alignment • Initial choice of sequences affects the final alignment • When sequences are highly divergent, the progressive approach becomes less reliable
The CLUSTALW program • Fine tuned version of the above algorithm • Sequences are weighted to account for biased representation in large sub-families. • Substitution matrix is chosen flexibly • Manipulation of gap penalties
Motif Representations • Aligned sites: CGGCGCACTCTCGCCCG, CGGGGCAGACTATTCCG, CGGCGGCTTCTAATCCG, ..., CGGGGCAGACTATTCCG • Consensus: CGGNGCACANTCNTCCG • Frequency Matrix • Logo
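The consensus and frequency-matrix representations can be sketched as below. Only the four sites printed on the slide are used (the elided ones are unknown), so the computed consensus need not match the slide's consensus, which was presumably derived from the full set. The rule "emit N unless one base has frequency above 0.5" is an assumption made for illustration.

```python
from collections import Counter

def frequency_matrix(sites, alphabet="ACGT"):
    """One dict of base frequencies per aligned position."""
    width = len(sites[0])
    assert all(len(s) == width for s in sites), "sites must be aligned and equal length"
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({b: counts[b] / len(sites) for b in alphabet})
    return matrix

def consensus(matrix, threshold=0.5):
    """Most frequent base per position, or N when no base exceeds the threshold."""
    out = []
    for col in matrix:
        base, freq = max(col.items(), key=lambda kv: kv[1])
        out.append(base if freq > threshold else "N")
    return "".join(out)

# The four sites that are visible on the slide.
sites = [
    "CGGCGCACTCTCGCCCG",
    "CGGGGCAGACTATTCCG",
    "CGGCGGCTTCTAATCCG",
    "CGGGGCAGACTATTCCG",
]
m = frequency_matrix(sites)
print(m[3])          # {'A': 0.0, 'C': 0.5, 'G': 0.5, 'T': 0.0}
print(consensus(m))  # CGGNGCANNCTANTCCG
```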
Logo explanation • The characters representing the sequence are stacked on top of each other for each position in the aligned sequences. • The height of each letter is made proportional to its frequency, the most common one is on top. • The height of the entire stack is then adjusted to signify the information content of the sequences at that position.
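Information Content • Uncertainty: $H(l) = -\sum_{b} f(b,l) \log_2 f(b,l)$, where $f(b,l)$ is the frequency of base $b$ at position $l$ • Information: $R_{seq}(l) = 2 - (H(l) + e(n))$ bits for DNA, where $e(n)$ is a small-sample correction for $n$ sequences • Thomas D. Schneider and R. Michael Stephens, Nucleic Acids Research, 18: 6097-6100 (1990)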
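A minimal sketch of the per-position calculation, assuming a DNA alphabet and omitting the small-sample correction used by Schneider and Stephens. The function names are hypothetical.

```python
import math
from collections import Counter

def column_information(column, alphabet="ACGT"):
    """Information content in bits: log2(|alphabet|) minus the Shannon uncertainty.

    Assumption: no small-sample correction is applied.
    """
    n = len(column)
    counts = Counter(column)
    uncertainty = -sum((counts[b] / n) * math.log2(counts[b] / n)
                       for b in alphabet if counts[b] > 0)
    return math.log2(len(alphabet)) - uncertainty

def letter_heights(column, alphabet="ACGT"):
    """Height of each letter in the logo stack: frequency times information content."""
    info = column_information(column, alphabet)
    counts = Counter(column)
    return {b: counts[b] / len(column) * info for b in alphabet if counts[b] > 0}

print(column_information("CCCC"))  # 2.0 bits: fully conserved
print(column_information("CGCG"))  # 1.0 bit: two equally likely bases
print(column_information("ACGT"))  # 0.0 bits: no conservation
print(letter_heights("CGCG"))      # {'C': 0.5, 'G': 0.5}
```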
Other MSA methods • Phylogenetic tree building • Alignment using the Sum-of-Pairs scoring scheme can be accomplished in a more probabilistic framework, using a profile HMM • EM algorithm
Motif Sampler (EM) • Lawrence et al. 1993, Liu et al. 1995 • Model the distribution of residues with multinomial distributions • One multinom. dist’n per position within motif • One background dist’n for outside motif • The motif location is missing!
Problem Description • Given a set of $N$ sequences $S_1, \dots, S_N$ of lengths $n_k$ ($k = 1, \dots, N$) • Identify a single pattern of fixed width $W$ within each of the $N$ input sequences • $A = \{a_k\}$ ($k = 1, \dots, N$): the set of starting positions of the common pattern within each sequence; $a_k = 1, \dots, n_k - W + 1$ • Objective: find the “best” common pattern, defined as the most probable one
Algorithm: Initialization (1) • Choose random starting positions $\{a_k\}$ within the various sequences • $A = \{a_k\}$ ($k = 1, \dots, N$): the set of starting positions of the common pattern within each sequence; $a_k = 1, \dots, n_k - W + 1$
Example: $N = 6$, $W = 10$ • $q_{1,A} = 3/5$, $q_{2,G} = 2/5$, …, $q_{1,G} = 0$
Algorithm: Predictive Update (2) • One of the $N$ sequences, $z$, is chosen either at random or in a specified order. • The pattern description $q_{ij}$ and the background frequencies $q_{0j}$ are then calculated from the current positions $a_k$ in all sequences excluding $z$.
Calculate the new multinomial frequencies $q_{ij}$ as if the motif starts at a given location in $z$ • The background frequencies $q_{0j}$ are calculated analogously, with counts taken over all non-motif positions • Find the most “reasonable” location in $z$ • Iterate!
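The predictive update can be sketched as below. This is an illustration under stated assumptions: start positions are 0-based (the slides use 1-based positions), the alphabet is DNA, and a pseudocount of 1 per residue keeps every frequency non-zero. The function name `predictive_update` is hypothetical.

```python
def predictive_update(seqs, starts, z, width, alphabet="ACGT", pseudo=1.0):
    """Estimate motif frequencies q[i][j] and background frequencies q0[j]
    from every sequence except the one at index z."""
    motif_counts = [{b: pseudo for b in alphabet} for _ in range(width)]
    bg_counts = {b: pseudo for b in alphabet}
    for k, (seq, a) in enumerate(zip(seqs, starts)):
        if k == z:
            continue                         # leave the chosen sequence out
        for pos, base in enumerate(seq):
            i = pos - a                      # offset inside the motif window
            if 0 <= i < width:
                motif_counts[i][base] += 1
            else:
                bg_counts[base] += 1         # non-motif position
    q = [{b: col[b] / sum(col.values()) for b in alphabet} for col in motif_counts]
    total = sum(bg_counts.values())
    q0 = {b: bg_counts[b] / total for b in alphabet}
    return q, q0

# Toy data (hypothetical): motif "AC" of width 2, sequence 2 is held out.
seqs = ["ACGT", "TACG", "GGAC"]
starts = [0, 1, 2]
q, q0 = predictive_update(seqs, starts, z=2, width=2)
print(q[0]["A"], q[1]["C"])  # 0.5 0.5
print(q0["G"])               # 0.375
```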
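$A_x = Q_x / B_x = \prod_{i=1}^{W} q_{i,x_i} / q_{0,x_i}$, where $Q_x$ is the probability of segment $x$ under the pattern model, $B_x$ its probability under the background model, and $x_i$ the residue at position $i$ of $x$ • Select a set of $a_k$'s that maximizes the product of these ratios, or equivalently $F$ • $F = \sum_{i=1}^{W} \sum_{j \in \{A,T,G,C\}} c_{i,j} \log (q_{ij} / q_{0j})$, where $c_{i,j}$ is the count of residue $j$ at motif position $i$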
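The weighting and selection step can be sketched as follows. The motif and background frequencies in the usage example are made-up values, positions are 0-based, and drawing the new start in proportion to the weights is the Gibbs-sampler choice; taking the maximum-weight position instead would be a greedy variant. All function names are hypothetical.

```python
import math
import random

def segment_weights(seq, q, q0):
    """Ratio A_x = Q_x / B_x for every width-W segment x of seq."""
    width = len(q)
    weights = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for i in range(width):
            base = seq[start + i]
            ratio *= q[i][base] / q0[base]
        weights.append(ratio)
    return weights

def sample_start(seq, q, q0, rng=random):
    """Draw a new start position with probability proportional to A_x."""
    weights = segment_weights(seq, q, q0)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

def objective_f(counts, q, q0):
    """F = sum over positions i and residues j of c[i][j] * log(q[i][j] / q0[j])."""
    return sum(c * math.log(q[i][b] / q0[b])
               for i, col in enumerate(counts)
               for b, c in col.items() if c > 0)

# Hypothetical frequencies for a width-2 motif that favours "AC".
q = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
     {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
q0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

w = segment_weights("GGACG", q, q0)
print(w.index(max(w)))             # 2: the segment "AC" has the largest weight
print(sample_start("GGACG", q, q0))
```

In a full sampler this step alternates with the predictive update until the set of start positions stabilizes or F stops improving.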