Multiple Sequence Alignment

Leo

Presentation Transcript


  1. Multiple Sequence Alignment

  2. A highly conserved region in an MSA (multiple sequence alignment) may indicate functional importance.

  3. Families • gene family: a set of homologous genes • protein family: a set of homologous proteins • examples: • globin gene family • HOX gene family • serine/threonine kinase family

  4. Protein families • “The large majority of proteins come from no more than one thousand families” (Chothia 1994)

  5. Protein structure • amino acid sequence (primary) • three-dimensional structure • small scale (secondary) • alpha-helix, beta-sheet, fold • large scale (tertiary) • domain • fully functional protein (quaternary)

  6. Domains • A protein may be composed of several domains • Each domain carries a specific function • Structure is more likely to be conserved than sequence • One exon might represent one domain

  7. Domains

  8. Related Motivation • Gain insight into evolutionary history • By looking at the number of mutations necessary to go from one sequence to another, one can assess the time of divergence

  9. Alternatives to SP score • What we have now (log-likelihood ratio) • A natural extension for aligning 3 sequences (can be unrealistically over-parameterized)

  10. Example • VSNS • SNA • AS
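
As a point of reference for the sum-of-pairs (SP) score mentioned above, here is a minimal Python sketch that scores one candidate alignment of the three example sequences. The alignment itself and the scoring values (match +1, mismatch -1, residue-gap -2, gap-gap 0) are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols; '-' denotes a gap (toy values)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_score(alignment):
    """Sum-of-pairs score: add pair_score over every pair of rows in every column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total

# One hypothetical alignment of VSNS, SNA and AS
print(sp_score(["VSNS", "-SNA", "--AS"]))  # -9
```

Because every pair of rows is scored independently, the cost of evaluating one column grows quadratically with the number of sequences.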

  11. Carrillo & Lipman Algorithm: an attempt to reduce the volume of the dynamic programming matrix

  12. 3 or more sequences • The optimal alignment path is contained in a "polyhedron" close to the main diagonal. Here, a polyhedron is a solid formed by plane faces, or by more complicated 2-dimensional surfaces. For better visualization, the polyhedron's shadows are displayed. While visiting a node and looking for the minimum along all the incoming edges, we can ignore those edges that are "coming from outside the polyhedron", as in the top part of the inset. On its top-left side, the cube is "covered" by the polyhedron. Edges 1, 2, 3, 6 and 7 come from the inside, and edges 4 and 5 can be ignored.

  13. Progressive Alignment Methods • The most commonly used approach to multiple alignment

  14. Progressive Methods • Start with the most closely related sequences, then progressively add less related sequence(s) to the initial alignment

  15. Guide Tree for Progressive Methods • Do NOT confuse with a phylogenetic tree

  16. Ad hoc Guide Tree Building • First construct a distance matrix of all pairwise distances

  17. Joining Nearest Neighbors
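
The first two guide-tree steps (build a matrix of pairwise distances, then pick the nearest pair to join) can be sketched in Python as below. This is a simplified illustration: it assumes the sequences already have equal length and uses the fraction of mismatching positions as the distance, whereas a real progressive aligner would derive distances from pairwise alignments. The function names are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(seqs):
    """Symmetric matrix of all pairwise distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = p_distance(seqs[i], seqs[j])
    return d

def closest_pair(d):
    """Indices (i, j), i < j, of the two nearest sequences: the first pair to join."""
    return min(combinations(range(len(d)), 2), key=lambda ij: d[ij[0]][ij[1]])

seqs = ["ACGT", "ACGA", "TTGA"]
d = distance_matrix(seqs)
print(closest_pair(d))  # (0, 1)
```

After joining the closest pair, a full implementation would recompute distances from the new cluster to the remaining sequences and repeat until one tree remains.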

  18. Preserving & Adding Gaps

  19. An Example of Progressive Multiple Alignment

  20. Problems of Progressive Alignment • No guarantee of finding the globally optimal multiple alignment • The initial choice of sequences affects the final alignment • When sequences are highly divergent, the progressive approach becomes less reliable

  21. The CLUSTALW program • Fine tuned version of the above algorithm • Sequences are weighted to account for biased representation in large sub-families. • Substitution matrix is chosen flexibly • Manipulation of gap penalties

  22. Motif Representations • Aligned sites: CGGCGCACTCTCGCCCG, CGGGGCAGACTATTCCG, CGGCGGCTTCTAATCCG, ..., CGGGGCAGACTATTCCG • Consensus: CGGNGCACANTCNTCCG • Frequency Matrix • Logo
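
Two of these representations, the frequency matrix and the consensus string, are easy to derive from a set of aligned sites. The Python sketch below is illustrative only: the 0.75 threshold for calling a position "N" is an assumption, and the slide does not say which rule produced its consensus.

```python
from collections import Counter

def frequency_matrix(sites, alphabet="ACGT"):
    """One dict of base frequencies per column of the aligned sites."""
    n = len(sites)
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({b: counts[b] / n for b in alphabet})
    return matrix

def consensus(matrix, threshold=0.75):
    """Most frequent base per column, or 'N' if it falls below the threshold."""
    out = []
    for freqs in matrix:
        base, f = max(freqs.items(), key=lambda kv: kv[1])
        out.append(base if f >= threshold else "N")
    return "".join(out)

toy_sites = ["ACG", "ACT", "AGT", "ATG"]
print(consensus(frequency_matrix(toy_sites)))  # ANN
```

The same two functions can be applied to the 17-base sites listed on the slide; the consensus obtained depends on the chosen threshold.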

  23. Logo explanation • The characters representing the sequence are stacked on top of each other for each position in the aligned sequences. • The height of each letter is made proportional to its frequency, the most common one is on top. • The height of the entire stack is then adjusted to signify the information content of the sequences at that position.

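  24. Information Content • Uncertainty: H(l) = −Σ_b f(b,l) log2 f(b,l), summed over the bases b ∈ {A,C,G,T}, where f(b,l) is the frequency of base b at position l • Information: R_seq(l) = 2 − (H(l) + e(n)) bits, where e(n) is a small-sample correction for n sequences • Thomas D. Schneider and R. Michael Stephens, Nucleic Acids Research, 18: 6097-6100 (1990)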
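
In the Schneider and Stephens formulation, the uncertainty at a position is the Shannon entropy of its base frequencies, and the information content is the maximum possible entropy minus that uncertainty. The Python sketch below assumes a four-letter DNA alphabet and leaves out the small-sample correction term, so it is an approximation suited to large alignments.

```python
import math

def uncertainty(freqs):
    """Shannon entropy in bits of one alignment column's frequencies."""
    return -sum(f * math.log2(f) for f in freqs if f > 0)

def information(freqs, alphabet_size=4):
    """Information content in bits: log2(alphabet size) minus the uncertainty."""
    return math.log2(alphabet_size) - uncertainty(freqs)

print(information([1.0, 0.0, 0.0, 0.0]))      # 2.0 (perfectly conserved)
print(information([0.5, 0.5, 0.0, 0.0]))      # 1.0
print(information([0.25, 0.25, 0.25, 0.25]))  # 0.0 (no conservation)
```

In a sequence logo, this value sets the total height of the stack at each position, and each letter takes a share of it proportional to its frequency.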

  25. Other MSA methods • Phylogenetic tree building • Alignment using the Sum-of-Pairs scoring scheme can be accomplished in a more probabilistic framework: using profile HMM • EM algorithm

  26. Motif Sampler (EM) • Lawrence et al. 1993, Liu et al. 1995 • Model the distribution of residues with multinomial distributions • One multinomial distribution per position within the motif • One background distribution for positions outside the motif • The motif location is missing!

  27. Problem Description • Given a set of N sequences S_1,…,S_N of lengths n_k (k=1,…,N) • Identify a single pattern of fixed width W within each of the N input sequences • A = {a_k} (k=1,…,N): a set of starting positions for the common pattern within each sequence; a_k = 1,…,n_k−W+1 • Objective: to find the “best” common pattern, defined as the most probable one

  28. Algorithm: Initialization (1) • Choose random starting positions {a_k} within the various sequences • A = {a_k} (k=1,…,N): a set of starting positions for the common pattern within each sequence; a_k = 1,…,n_k−W+1
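
The initialization step amounts to drawing one valid start per sequence. A minimal Python sketch follows; note that it uses 0-based positions (0 to n_k−W) rather than the 1-based positions on the slide, and the function name is hypothetical.

```python
import random

def random_starts(seqs, W, rng=None):
    """Draw one random motif start per sequence (0-based, so 0 <= a_k <= n_k - W)."""
    rng = rng or random.Random()
    starts = []
    for seq in seqs:
        if len(seq) < W:
            raise ValueError("every sequence must be at least W long")
        starts.append(rng.randrange(len(seq) - W + 1))
    return starts

seqs = ["TTACGTT", "ACGTTTT", "TTTTACG"]
starts = random_starts(seqs, W=3, rng=random.Random(0))
print(starts)
```

Passing a seeded `random.Random` makes a run reproducible, which helps when comparing results across several random restarts.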

  29. N = 6, W = 10 • q_1A = 3/5, q_2G = 2/5, …, q_1G = 0

  30. Algorithm: Predictive Update (2) • One of the N sequences, z, is chosen either at random or in a specified order. • The pattern description q_ij and the background frequencies q_0j are then calculated excluding z.

  31. Calculate the new multinomial frequencies if the motif starts at a given location in z • The background frequencies q_0j are calculated analogously, with counts taken over all non-motif positions • Find the most “reasonable” location in z • Iterate!
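
One iteration of the update can be sketched in Python as below. This is a simplified illustration, not the published implementation: positions are 0-based, the alphabet is assumed to be A, C, G, T, and a pseudocount of 1 is added to every count to avoid zero frequencies (the worked example on the earlier slide, with values such as 0 and 3/5, corresponds to no pseudocount). All function names are hypothetical.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def motif_and_background(seqs, starts, W, skip, pseudo=1.0):
    """Estimate q[i][j] and q0[j] from every sequence except index `skip`."""
    motif_counts = [Counter() for _ in range(W)]
    background = Counter()
    for k, (seq, a) in enumerate(zip(seqs, starts)):
        if k == skip:
            continue  # predictive update: leave the chosen sequence z out
        for pos, base in enumerate(seq):
            if a <= pos < a + W:
                motif_counts[pos - a][base] += 1
            else:
                background[base] += 1
    denom = (len(seqs) - 1) + pseudo * len(ALPHABET)
    q = [{b: (c[b] + pseudo) / denom for b in ALPHABET} for c in motif_counts]
    bg_total = sum(background.values()) + pseudo * len(ALPHABET)
    q0 = {b: (background[b] + pseudo) / bg_total for b in ALPHABET}
    return q, q0

def position_weights(z, W, q, q0):
    """Motif-to-background likelihood ratio for every possible start in z."""
    weights = []
    for x in range(len(z) - W + 1):
        ratio = 1.0
        for i in range(W):
            ratio *= q[i][z[x + i]] / q0[z[x + i]]
        weights.append(ratio)
    return weights

def sample_start(weights, rng=None):
    """Draw a new start for z with probability proportional to its weight."""
    rng = rng or random.Random()
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

seqs = ["TTACGTT", "ACGTTTT", "TTTTACG"]
starts = [2, 0, 4]
q, q0 = motif_and_background(seqs, starts, W=3, skip=0)
weights = position_weights(seqs[0], 3, q, q0)
starts[0] = sample_start(weights, random.Random(0))
```

Sampling in proportion to the weights, rather than always taking the largest one, is what lets the procedure escape poor starting positions; repeating the step over all sequences in turn gives the "Iterate!" loop.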

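  32. A_x = Q_x / B_x, where Q_x is the probability of generating segment x under the pattern model q_ij and B_x is the probability of generating it under the background frequencies q_0j • Select a set of a_k’s that maximizes the product of these ratios, or F • F = Σ_{1≤i≤W} Σ_{j∈{A,T,G,C}} c_ij log(q_ij / q_0j), where c_ij is the count of base j at motif position i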
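
The score F can be computed directly from the motif counts and the two sets of frequencies. The Python sketch below assumes the natural logarithm (the slide does not state a base) and skips zero counts, which contribute nothing to the sum; the data layout (a list of per-position dicts) is a choice made for this illustration.

```python
import math

def motif_score(counts, q, q0):
    """F = sum over positions i and bases j of c_ij * log(q_ij / q_0j)."""
    total = 0.0
    for i, column in enumerate(counts):
        for base, c in column.items():
            if c > 0:
                total += c * math.log(q[i][base] / q0[base])
    return total

# Toy motif of width 2 built from 4 sites
counts = [{"A": 4}, {"C": 2, "G": 2}]
q = [{"A": 0.8, "C": 0.1, "G": 0.1}, {"A": 0.2, "C": 0.4, "G": 0.4}]
q0 = {"A": 0.4, "C": 0.2, "G": 0.4}
print(motif_score(counts, q, q0))
```

A larger F means the chosen positions look more unlike the background, so F can be tracked across iterations to decide when to stop or which random restart to keep.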
