
Chapter 2 Data Searches and Pairwise Alignments




  1. Chapter 2: Data Searches and Pairwise Alignments • Department of Computer Science and Information Engineering, National Chi Nan University • 黃光璿 • 2004/03/08

  2. Introduction • What is the difference between acctga and agcta?
       a c c t g a
       a g c t g a
       a g c t - a

  3. Nomenclature

  4. 2.1 Dot Plots
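A dot plot can be sketched in a few lines. The version below is a minimal illustration, assuming a window size of one character and using the two sequences from the Introduction slide; the function name `dot_plot` is made up for this example.

```python
def dot_plot(a, b):
    """Return one text row per character of `a`; '*' marks positions where
    the character of `a` equals the character of `b`, '.' marks the rest."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


# Rows follow the first sequence, columns follow the second.
for row in dot_plot("acctga", "agcta"):
    print(row)
```

Runs of `*` along a diagonal correspond to stretches that the two sequences share.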

  5. 2.2 Simple Alignments • No gap

  6. mutation (substitution): common • insertion, deletion: gap, indel (rare) • scoring scheme • match score • mismatch score
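Scoring a simple (gap-free) alignment only needs a match score and a mismatch score. The sketch below assumes a match score of 1 and a mismatch score of 0; the slides do not give values, so both are placeholders, and `ungapped_score` is a hypothetical helper name.

```python
def ungapped_score(a, b, match=1, mismatch=0):
    """Score two equal-length sequences position by position, with no gaps."""
    if len(a) != len(b):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# Five of the six positions agree; only the second one (c vs. g) differs.
print(ungapped_score("acctga", "agctga"))  # -> 5
```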

  7. 2.3 Gaps

  8. 2.3.1 Gap Penalty • uniform gap • affine gap • origination penalty • length penalty
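One common reading of the affine scheme is that a gap costs its origination penalty once plus a length penalty per gapped position. The sketch below follows that reading; the penalty values (-2 and -1) and the function name are assumptions for illustration only.

```python
def affine_gap_penalty(length, origination=-2, per_position=-1):
    """Penalty of a single gap spanning `length` positions.

    origination  : charged once, for opening the gap (assumed value)
    per_position : charged for every position the gap covers (assumed value)
    """
    if length <= 0:
        return 0
    return origination + per_position * length


# One gap of length 3 is cheaper than three separate gaps of length 1.
print(affine_gap_penalty(3))      # -> -5
print(3 * affine_gap_penalty(1))  # -> -9
```

A uniform gap penalty is the special case `origination=0`, where every gapped position costs the same regardless of how the positions are grouped.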

  9. 2.4 Scoring Matrices

  10. Problems with Modeling • Does nature really operate according to these rules?

  11. Modeling

  12. Define the odds ratio as
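A standard way to write this ratio, following the match-model versus random-model setup in Durbin et al. (listed in the references), is sketched below. The symbols are assumptions rather than the slide's own notation: $p_{ab}$ is the probability that residues $a$ and $b$ are aligned under the match model $M$, and $q_a$ is the background frequency of residue $a$ under the random model $R$.

```latex
\[
\frac{P(x, y \mid M)}{P(x, y \mid R)}
  = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \, \prod_i q_{y_i}}
  = \prod_i \frac{p_{x_i y_i}}{q_{x_i} \, q_{y_i}}
\]
```

Taking the logarithm turns the product into a sum of per-position scores, which is what a scoring matrix stores.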

  13. 2.4.1 PAM Matrices • Dayhoff, Schwartz, Orcutt (1978) • Point Accepted Mutation • Based on observed substitution rates (Box 2.1) • Input: a set of observed substitution rates • Output: PAM-1 matrix (log-odds matrix)

  14. Multiple Alignment (1) Group the sequences with high similarity (> 85% identity).

  15. Phylogenetic Tree (2) For each group, build the corresponding phylogenetic tree.

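  16. Mutation Frequency • (3) Count the substitutions observed on the tree, e.g. A->G, I->L, A->G, A->L, C->S, G->A gives F_{G,A} = 3 (A->G and G->A are counted together)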
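The counting in step (3) can be reproduced with a short sketch. It assumes that the direction of a change is ignored, which is what makes the slide's total for G and A come out as 3; the function name is hypothetical.

```python
from collections import Counter


def mutation_frequencies(substitutions):
    """Count substitutions per unordered amino-acid pair."""
    counts = Counter()
    for old, new in substitutions:
        counts[frozenset((old, new))] += 1
    return counts


# The six substitutions listed on the slide.
observed = [("A", "G"), ("I", "L"), ("A", "G"), ("A", "L"), ("C", "S"), ("G", "A")]
F = mutation_frequencies(observed)
print(F[frozenset("GA")])  # -> 3
```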

  17. Relative Mutability • (4)

  18. Mutation Probability • (5)

  19. Odds Ratio • (6)

  20. Log-Odds Ratio • (7)
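The final step, turning an odds ratio into an integer matrix entry, can be sketched as follows. The probabilities are invented for the example, and the scale factor of 2 (half-bit units, matching the "(ln 2)/2 nats" label on the PAM 120 table later in the slides) is an assumption about how the entries are rounded.

```python
import math


def log_odds_score(p_ab, q_a, q_b, scale=2.0):
    """Integer log-odds score for aligning residues a and b.

    p_ab     : probability of seeing a aligned with b in related sequences
    q_a, q_b : background frequencies of a and b
    scale    : 2.0 gives half-bit units (assumed)
    """
    return round(scale * math.log2(p_ab / (q_a * q_b)))


print(log_odds_score(0.02, 0.1, 0.05))    # pair seen 4x more often than chance -> 4
print(log_odds_score(0.0025, 0.1, 0.05))  # pair seen half as often as chance -> -2
```

Positive entries mark pairs that occur more often in related sequences than chance would predict, negative entries mark pairs that occur less often.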

  21. Which PAM matrix is the most appropriate? It depends on • the length of the sequences • how closely the sequences are believed to be related • → PAM 120 for database searches • → PAM 200 for comparing two specific proteins

  22. 2.4.2 BLOSUM Matrices • Henikoff & Henikoff (1992) • PAM-k: the larger k is, the more divergent the sequences • BLOSUM-k: the larger k is, the more similar the sequences • → BLOSUM62: for ungapped matching • → BLOSUM50: for gapped matching

  23. 2.5 Dynamic Programming • The Needleman and Wunsch Algorithm (Global Alignment)

  24. Alignment Graph

  25. A C - - T C G
      A C A G T A G
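The Needleman and Wunsch algorithm can be sketched as a table fill followed by a traceback. The version below is a minimal illustration that uses a uniform gap penalty, with assumed scores (match 1, mismatch 0, gap -1) that the slides do not state; it is applied to the two sequences from the alignment above with their gaps removed.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment: return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell takes the best of: diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner recovers one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


best, top_row, bottom_row = needleman_wunsch("ACTCG", "ACAGTAG")
print(best)  # -> 2
print(top_row)
print(bottom_row)
```

Several alignments may share the optimal score, so the traceback can return a different but equally good alignment than the one shown on the slide. Filling the table takes time and space proportional to the product of the two sequence lengths.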

  26. Complexity

  27. 2.6 Global and Local Alignments • Semi-global alignment • Local alignment

  28. 2.6.1 Semi-global Alignments
      A A C A C G T G T C T
      - - - A C G T - - - -
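A semi-global alignment can be sketched by leaving gaps at either end unpenalized: the first row and column of the table start at zero, and the result is the best value found in the last row or last column. The scores below (match 1, mismatch 0, gap -1) are assumptions, and the function returns only the score.

```python
def semiglobal_score(a, b, match=1, mismatch=0, gap=-1):
    """Best alignment score when leading and trailing gaps are free."""
    rows, cols = len(a) + 1, len(b) + 1
    # Zero-filled first row and column: leading gaps cost nothing.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Free trailing gaps: the alignment may stop anywhere on the last row or column.
    return max(max(score[-1]), max(row[-1] for row in score))


print(semiglobal_score("AACACGTGTCT", "ACGT"))  # -> 4
```

With the same scores, a global alignment of these two sequences would pay for the seven end gaps and score -3 instead of 4.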

  29. 2.6.2 Local Alignment • The Smith-Waterman Alignment
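The Smith-Waterman recurrence differs from the global one in two ways: no cell is allowed to drop below zero, and the result is the largest value anywhere in the table. The sketch below returns only the score; the scoring values (match 1, mismatch -1, gap -1) and both example sequences are made up for illustration.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best local alignment between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets an alignment restart wherever the score would turn negative.
            table[i][j] = max(0,
                              table[i - 1][j - 1] + sub,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
            best = max(best, table[i][j])
    return best


# The shared substring ACGT gives the best local score.
print(smith_waterman_score("TTACGTAA", "GGACGTCC"))  # -> 4
```

A negative mismatch score matters here: with a mismatch score of 0, local alignments could be extended through unrelated regions at no cost.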

  30. 2.7 Database Searches • BLAST and its relatives • FASTA and related algorithms

  31. 2.7.1 BLAST and Its Relatives

  32. BLASTP • Using PAM or BLOSUM matrices

  33. 2.7.2 FASTA and Related Algorithms • Improves on dot plots and band search • Preprocess the target sequence: identify the positions of each word (for amino acids and word length = 1, a 20-entry array) • Scan the query sequence: compute the shifts of the query that align each word with the target • Find the mode of the shifts • Join the possible shifts into one new target sequence • Perform the full local alignment algorithm

  34. Target: FAMLGFIKYLPGCM
      Query:  TGFIKYLPGACT
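The word-lookup and shift-counting steps can be sketched for the target and query above. This is an illustration of the idea, not the FASTA program itself: word length is 1, positions are 0-based, and a shift is defined here as target position minus query position (the opposite sign convention would work equally well). The function names are hypothetical.

```python
from collections import Counter, defaultdict


def word_positions(target, k=1):
    """Preprocess the target: map each word of length k to its start positions."""
    table = defaultdict(list)
    for pos in range(len(target) - k + 1):
        table[target[pos:pos + k]].append(pos)
    return table


def shift_histogram(target, query, k=1):
    """Scan the query and count the shift needed to line up each shared word."""
    table = word_positions(target, k)
    shifts = Counter()
    for qpos in range(len(query) - k + 1):
        for tpos in table.get(query[qpos:qpos + k], []):
            shifts[tpos - qpos] += 1
    return shifts


shifts = shift_histogram("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
best_shift, hits = shifts.most_common(1)[0]
print(best_shift, hits)  # -> 3 8
```

The mode, a shift of 3 supported by 8 words, corresponds to the shared stretch GFIKYLPG; the expensive local alignment then only needs to be run around that diagonal.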

  35. 2.7.3 Alignment Scores and Statistical Significance of Database Searches • related model vs. random model • S-score: the alignment score • E-score: expected number of sequences with score >= S by random chance • P-score: probability that one or more sequences with score >= S would be found by random chance • → Low E and P are better.
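The E-score and P-score are linked if the number of chance hits is assumed to follow a Poisson distribution: P = 1 - exp(-E). The sketch below shows that relation together with the usual Karlin-Altschul form for a normalized bit score, E = m * n * 2^(-S'). The function names and the example lengths are assumptions, and real search tools also adjust the lengths m and n before applying the formula.

```python
import math


def e_value_from_bits(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score` (in bits)."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value_from_e(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), accurate for small E


e = e_value_from_bits(bit_score=20, query_len=100, db_len=1000)
print(e, p_value_from_e(e))
```

For small values the two numbers are nearly identical, which is why low E and low P carry the same message.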

  36. length correction • Scores

  37. PAM 120 (scores in units of (ln 2)/2 nats, i.e. half-bits)

          A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
      A   3  -3  -1   0  -3  -1   0   1  -3  -1  -3  -2  -2  -4   1   1   1  -7  -4   0   0  -1  -1  -8
      R  -3   6  -1  -3  -4   1  -3  -4   1  -2  -4   2  -1  -5  -1  -1  -2   1  -5  -3  -2  -1  -2  -8
      N  -1  -1   4   2  -5   0   1   0   2  -2  -4   1  -3  -4  -2   1   0  -4  -2  -3   3   0  -1  -8
      D   0  -3   2   5  -7   1   3   0   0  -3  -5  -1  -4  -7  -3   0  -1  -8  -5  -3   4   3  -2  -8
      C  -3  -4  -5  -7   9  -7  -7  -4  -4  -3  -7  -7  -6  -6  -4   0  -3  -8  -1  -3  -6  -7  -4  -8
      Q  -1   1   0   1  -7   6   2  -3   3  -3  -2   0  -1  -6   0  -2  -2  -6  -5  -3   0   4  -1  -8
      E   0  -3   1   3  -7   2   5  -1  -1  -3  -4  -1  -3  -7  -2  -1  -2  -8  -5  -3   3   4  -1  -8
      G   1  -4   0   0  -4  -3  -1   5  -4  -4  -5  -3  -4  -5  -2   1  -1  -8  -6  -2   0  -2  -2  -8
      H  -3   1   2   0  -4   3  -1  -4   7  -4  -3  -2  -4  -3  -1  -2  -3  -3  -1  -3   1   1  -2  -8
      I  -1  -2  -2  -3  -3  -3  -3  -4  -4   6   1  -3   1   0  -3  -2   0  -6  -2   3  -3  -3  -1  -8
      L  -3  -4  -4  -5  -7  -2  -4  -5  -3   1   5  -4   3   0  -3  -4  -3  -3  -2   1  -4  -3  -2  -8
      K  -2   2   1  -1  -7   0  -1  -3  -2  -3  -4   5   0  -7  -2  -1  -1  -5  -5  -4   0  -1  -2  -8
      M  -2  -1  -3  -4  -6  -1  -3  -4  -4   1   3   0   8  -1  -3  -2  -1  -6  -4   1  -4  -2  -2  -8
      F  -4  -5  -4  -7  -6  -6  -7  -5  -3   0   0  -7  -1   8  -5  -3  -4  -1   4  -3  -5  -6  -3  -8
      P   1  -1  -2  -3  -4   0  -2  -2  -1  -3  -3  -2  -3  -5   6   1  -1  -7  -6  -2  -2  -1  -2  -8
      S   1  -1   1   0   0  -2  -1   1  -2  -2  -4  -1  -2  -3   1   3   2  -2  -3  -2   0  -1  -1  -8
      T   1  -2   0  -1  -3  -2  -2  -1  -3   0  -3  -1  -1  -4  -1   2   4  -6  -3   0   0  -2  -1  -8
      W  -7   1  -4  -8  -8  -6  -8  -8  -3  -6  -3  -5  -6  -1  -7  -2  -6  12  -2  -8  -6  -7  -5  -8
      Y  -4  -5  -2  -5  -1  -5  -5  -6  -1  -2  -2  -5  -4   4  -6  -3  -3  -2   8  -3  -3  -5  -3  -8
      V   0  -3  -3  -3  -3  -3  -3  -2  -3   3   1  -4   1  -3  -2  -2   0  -8  -3   5  -3  -3  -1  -8
      B   0  -2   3   4  -6   0   3   0   1  -3  -4   0  -4  -5  -2   0   0  -6  -3  -3   4   2  -1  -8
      Z  -1  -1   0   3  -7   4   4  -2   1  -3  -3  -1  -2  -6  -1  -1  -2  -7  -5  -3   2   4  -1  -8
      X  -1  -2  -1  -2  -4  -1  -1  -2  -2  -1  -2  -2  -2  -3  -2  -1  -1  -5  -3  -1  -1  -1  -2  -8
      *  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8

  38. Applications • Reconstructing long sequences of DNA from overlapping sequence fragments • Determining physical and genetic maps from probe data under various experiment protocols • Database searching • Comparing two or more sequences for similarities

  39. Protein structure prediction (building profiles) • Comparing the same gene sequenced by two different labs

  40. 2.8 Multiple Sequence Alignments • CLUSTAL • D. G. Higgins & P. M. Sharp, 1988 • CLUSTALW • Sequences are weighted according to how divergent they are from the most closely related pair of sequences. • Gaps are weighted for different sequences.

  41. Summary • the notion of similarity • the scoring system used to rank alignments • the algorithms used to find optimal-scoring alignments • the statistical methods used to evaluate the significance of an alignment score

  42. References and Figure Sources • Fundamental Concepts of Bioinformatics, by Dan E. Krane and Michael L. Raymer, Benjamin/Cummings, 2003. • BLAST, by I. Korf, M. Yandell, and J. Bedell, O'Reilly & Associates, 2003. (distributed by 天瓏) • Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Cambridge University Press, 1998. • Biochemistry, by J. M. Berg, J. L. Tymoczko, and L. Stryer, Fifth Edition, 2001.
