Chapter 2: Data Searches and Pairwise Alignments • Department of Computer Science and Information Engineering, National Chi Nan University (暨南大學資訊工程學系) • 黃光璿 • 2004/03/08
Introduction • What is the difference between acctga and agcta?
a c c t g a
a g c t g a
a g c t - a
2.2 Simple Alignments • No gap
mutation (substitution): common • insertion, deletion: gap or indel (rare) • scoring scheme: match score, mismatch score
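A minimal sketch of scoring an ungapped alignment under a match/mismatch scheme. The function name `ungapped_score` and the default scores (match = 1, mismatch = 0) are assumptions chosen for illustration; the slides do not fix particular values.

```python
def ungapped_score(top, bottom, match=1, mismatch=0):
    """Score two equal-length sequences position by position (no gaps)."""
    if len(top) != len(bottom):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    # Add the match score where the residues agree, the mismatch score otherwise.
    return sum(match if x == y else mismatch for x, y in zip(top, bottom))


# acctga vs agctga differ at one position (c/g): five matches, one mismatch.
print(ungapped_score("acctga", "agctga"))  # 5
```

With a negative mismatch score (for example `mismatch=-1`) the same pair scores 4, which shows how the scoring scheme changes the ranking of candidate alignments.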
2.3.1 Gap Penalty • uniform gap penalty • affine gap penalty: origination penalty + length penalty
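The two gap models can be compared with a short sketch. The function names, the default penalty values, and the convention that the length penalty is charged for every gap position (including the first) are assumptions; other texts charge it only for positions after the first.

```python
def uniform_gap_penalty(length, per_position=1):
    """Uniform model: every gap position costs the same amount."""
    return per_position * length


def affine_gap_penalty(length, origination=2, length_penalty=1):
    """Affine model: a one-time origination penalty plus a per-position length penalty."""
    if length <= 0:
        return 0
    return origination + length_penalty * length


# One gap of length 3 versus three separate gaps of length 1.
print(affine_gap_penalty(3))      # 5
print(3 * affine_gap_penalty(1))  # 9
```

Under the affine model a single long gap is cheaper than several short ones of the same total length, while the uniform model cannot tell the two cases apart.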
Problems with Modeling • Does nature really operate according to these rules?
2.4.1 PAM Matrices • Dayhoff, Schwartz, Orcutt (1978) • Point Accepted Mutation • Based on observed substitution rates (Box 2.1) • Input: a set of observed substitution rates • Output: PAM-1 matrix (log-odds matrix)
Multiple Alignment (1) Group the sequences with high similarity (> 85% identity).
Phylogenetic Tree (2) For each group, build the corresponding phylogenetic tree.
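Mutation Frequency • (3) Count each observed substitution in both directions. Example: A->G, I->L, A->G, A->L, C->S, G->A gives F(G,A) = 3 (two A->G plus one G->A).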
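The count in step (3) can be reproduced with a few lines of code. This is a sketch assuming that substitutions are tallied symmetrically, which is what makes F(G,A) equal 3 in the example; `substitution_counts` is a hypothetical helper name.

```python
from collections import Counter


def substitution_counts(mutations):
    """Tally each observed substitution in both directions (symmetric counts)."""
    counts = Counter()
    for src, dst in mutations:
        counts[(src, dst)] += 1
        counts[(dst, src)] += 1
    return counts


# The six substitutions listed on the slide.
observed = [("A", "G"), ("I", "L"), ("A", "G"), ("A", "L"), ("C", "S"), ("G", "A")]
F = substitution_counts(observed)
print(F[("G", "A")])  # 3
```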
Relative Mutability • (4)
Mutation Probability • (5)
Odds Ratio • (6)
Log-Odds Ratio • (7)
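Steps (3) to (7) can be tied together in a simplified sketch. It is not a reproduction of Box 2.1: the formulas below follow the usual textbook outline of the Dayhoff procedure (relative mutability, mutation probability, odds ratio, log-odds), and the uniform background frequencies, the 1% scaling convention, the factor of 10 with log base 10, and the name `pam1_sketch` are all assumptions made for illustration.

```python
import math


def pam1_sketch(F, f, target_change=0.01):
    """Toy PAM-1 construction from symmetric counts F[(i, j)] and frequencies f[j]."""
    letters = sorted(f)
    # Total number of observed substitutions involving each residue j.
    changes = {j: sum(F.get((i, j), 0) for i in letters if i != j) for j in letters}
    total = sum(changes.values())
    # Relative mutability, scaled so that on average 1% of residues change.
    m = {j: target_change * changes[j] / (f[j] * total) for j in letters}
    M, S = {}, {}
    for j in letters:
        for i in letters:
            if i == j:
                M[i, j] = 1.0 - m[j]                     # residue j stays unchanged
            elif changes[j]:
                M[i, j] = m[j] * F.get((i, j), 0) / changes[j]  # j replaced by i
            else:
                M[i, j] = 0.0
            odds = M[i, j] / f[i]                        # odds ratio
            S[i, j] = 10 * math.log10(odds) if odds > 0 else float("-inf")
    return M, S


# Symmetric counts from the mutation-frequency example; frequencies are assumed uniform.
F = {("A", "G"): 3, ("G", "A"): 3, ("I", "L"): 1, ("L", "I"): 1,
     ("A", "L"): 1, ("L", "A"): 1, ("C", "S"): 1, ("S", "C"): 1}
f = {aa: 1 / 6 for aa in "AGILCS"}
M, S = pam1_sketch(F, f)
print(round(M["G", "A"], 3))  # 0.015
```

Each column of `M` sums to 1, and the log-odds scores come out symmetric because the counts are symmetric. Pairs never observed to substitute get a score of negative infinity in this toy version; a real matrix is built from far more data.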
Which PAM matrix is the most appropriate? • It depends on the length of the sequences and on how closely they are believed to be related. • PAM 120 for database search • PAM 200 for comparing two specific proteins
2.4.2 BLOSUM Matrices • Henikoff & Henikoff (1992) • PAM-k: the larger k is, the more divergent the sequences • BLOSUM-k: the larger k is, the more similar the sequences • BLOSUM62: for ungapped matching • BLOSUM50: for gapped matching
2.5 Dynamic Programming • The Needleman and Wunsch Algorithm (Global Alignment)
A C - - T C G
A C A G T A G
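A compact sketch of global alignment by dynamic programming in the style of Needleman and Wunsch, applied to the two sequences in the example above. The scoring values (match = 1, mismatch = 0, gap = -1) and the function name are assumptions; ties in the traceback are broken arbitrarily, so the returned alignment may differ from the one shown while having the same score.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap              # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap              # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = needleman_wunsch("ACTCG", "ACAGTAG")
print(best)  # 2: four matches, one mismatch, two gap positions
print(row1)
print(row2)
```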
2.6 Global and Local Alignments • Semi-global alignment • Local alignment
2.6.1 Semi-global Alignments
A A C A C G T G T C T
- - - A C G T - - - -
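One common way to obtain a semi-global alignment is to leave leading and trailing gaps unpenalized: the first row and column of the table start at zero, and the answer is the best value in the last row or last column. The sketch below assumes that formulation, along with match = 1, mismatch = 0, gap = -1; the function name is hypothetical.

```python
def semiglobal_score(a, b, match=1, mismatch=0, gap=-1):
    """Best alignment score when gaps at either end of either sequence are free."""
    m = len(b)
    prev = [0] * (m + 1)                   # row 0: leading gaps cost nothing
    best = prev[m]
    for x in a:
        cur = [0] * (m + 1)                # column 0: leading gaps cost nothing
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if x == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        best = max(best, cur[m])           # last column: trailing gaps in b are free
        prev = cur
    return max(best, max(prev))            # last row: trailing gaps in a are free


print(semiglobal_score("AACACGTGTCT", "ACGT"))  # 4
```

A global alignment of the same pair would have to pay for seven gap positions, so the short sequence would score poorly even though it matches a stretch of the long one exactly.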
2.6.2 Local Alignment • The Smith-Waterman Alignment
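A score-only sketch of local alignment in the style of Smith and Waterman. The key differences from the global recurrence are that every cell is floored at zero and the result is the maximum over the whole table. The scoring values (match = 1, mismatch = -1, gap = -1), the function name, and the two test sequences are assumptions made for illustration.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between any substring of a and any substring of b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Flooring at 0 lets a new local alignment start at any cell.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


# The shared core ACGT is found despite unrelated flanking residues.
print(smith_waterman_score("TTACGTAA", "GGACGTCC"))  # 4
```

A negative mismatch score matters here: with mismatch = 0, unrelated flanks could be appended to a local alignment at no cost.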
2.7 Database Searches • BLAST and its relatives • FASTA and related algorithms
BLASTP • Using PAM or BLOSUM matrices
2.7.2 FASTA and Related Algorithms • Improves on dot plots and band search • Preprocess the target sequence: identify the positions of each word (for amino acids with word length 1, a 20-entry array). • Scan the query sequence: compute the shifts of the query that align each word with the target. • Find the mode (the most frequent value) of the shifts. • Join the possible shifts into one new target sequence, then perform the full local alignment algorithm.
Target: FAMLGFIKYLPGCM
Query: TGFIKYLPGACT
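The word lookup and shift-counting steps can be sketched for the target and query above with word length 1. This covers only the first stage of the heuristic, not a full FASTA implementation; the function names and the sign convention (shift = target position minus query position) are assumptions.

```python
from collections import Counter, defaultdict


def word_positions(target, k=1):
    """Lookup table: each length-k word mapped to its start positions in the target."""
    table = defaultdict(list)
    for pos in range(len(target) - k + 1):
        table[target[pos:pos + k]].append(pos)
    return table


def shift_histogram(target, query, k=1):
    """Count how often each shift (target position - query position) aligns a word."""
    table = word_positions(target, k)
    shifts = Counter()
    for qpos in range(len(query) - k + 1):
        for tpos in table.get(query[qpos:qpos + k], []):
            shifts[tpos - qpos] += 1
    return shifts


shifts = shift_histogram("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
print(shifts.most_common(1))  # [(3, 8)]
```

The mode is a shift of 3 with eight supporting words, which corresponds to lining up GFIKYLPG in both sequences; the full local alignment then only needs to examine the region around that diagonal.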
2.7.3 Alignment Scores and Statistical Significance of Database Searches • related model vs. random model • S-score: the alignment score • E-score: the expected number of sequences with score >= S by random chance • P-score: the probability that one or more sequences with score >= S would be found randomly • Lower E and P values indicate a more significant match.
length correction • Scores
PAM 120 (scores in units of (ln 2)/2 nats)
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   3  -3  -1   0  -3  -1   0   1  -3  -1  -3  -2  -2  -4   1   1   1  -7  -4   0   0  -1  -1  -8
R  -3   6  -1  -3  -4   1  -3  -4   1  -2  -4   2  -1  -5  -1  -1  -2   1  -5  -3  -2  -1  -2  -8
N  -1  -1   4   2  -5   0   1   0   2  -2  -4   1  -3  -4  -2   1   0  -4  -2  -3   3   0  -1  -8
D   0  -3   2   5  -7   1   3   0   0  -3  -5  -1  -4  -7  -3   0  -1  -8  -5  -3   4   3  -2  -8
C  -3  -4  -5  -7   9  -7  -7  -4  -4  -3  -7  -7  -6  -6  -4   0  -3  -8  -1  -3  -6  -7  -4  -8
Q  -1   1   0   1  -7   6   2  -3   3  -3  -2   0  -1  -6   0  -2  -2  -6  -5  -3   0   4  -1  -8
E   0  -3   1   3  -7   2   5  -1  -1  -3  -4  -1  -3  -7  -2  -1  -2  -8  -5  -3   3   4  -1  -8
G   1  -4   0   0  -4  -3  -1   5  -4  -4  -5  -3  -4  -5  -2   1  -1  -8  -6  -2   0  -2  -2  -8
H  -3   1   2   0  -4   3  -1  -4   7  -4  -3  -2  -4  -3  -1  -2  -3  -3  -1  -3   1   1  -2  -8
I  -1  -2  -2  -3  -3  -3  -3  -4  -4   6   1  -3   1   0  -3  -2   0  -6  -2   3  -3  -3  -1  -8
L  -3  -4  -4  -5  -7  -2  -4  -5  -3   1   5  -4   3   0  -3  -4  -3  -3  -2   1  -4  -3  -2  -8
K  -2   2   1  -1  -7   0  -1  -3  -2  -3  -4   5   0  -7  -2  -1  -1  -5  -5  -4   0  -1  -2  -8
M  -2  -1  -3  -4  -6  -1  -3  -4  -4   1   3   0   8  -1  -3  -2  -1  -6  -4   1  -4  -2  -2  -8
F  -4  -5  -4  -7  -6  -6  -7  -5  -3   0   0  -7  -1   8  -5  -3  -4  -1   4  -3  -5  -6  -3  -8
P   1  -1  -2  -3  -4   0  -2  -2  -1  -3  -3  -2  -3  -5   6   1  -1  -7  -6  -2  -2  -1  -2  -8
S   1  -1   1   0   0  -2  -1   1  -2  -2  -4  -1  -2  -3   1   3   2  -2  -3  -2   0  -1  -1  -8
T   1  -2   0  -1  -3  -2  -2  -1  -3   0  -3  -1  -1  -4  -1   2   4  -6  -3   0   0  -2  -1  -8
W  -7   1  -4  -8  -8  -6  -8  -8  -3  -6  -3  -5  -6  -1  -7  -2  -6  12  -2  -8  -6  -7  -5  -8
Y  -4  -5  -2  -5  -1  -5  -5  -6  -1  -2  -2  -5  -4   4  -6  -3  -3  -2   8  -3  -3  -5  -3  -8
V   0  -3  -3  -3  -3  -3  -3  -2  -3   3   1  -4   1  -3  -2  -2   0  -8  -3   5  -3  -3  -1  -8
B   0  -2   3   4  -6   0   3   0   1  -3  -4   0  -4  -5  -2   0   0  -6  -3  -3   4   2  -1  -8
Z  -1  -1   0   3  -7   4   4  -2   1  -3  -3  -1  -2  -6  -1  -1  -2  -7  -5  -3   2   4  -1  -8
X  -1  -2  -1  -2  -4  -1  -1  -2  -2  -1  -2  -2  -2  -3  -2  -1  -1  -5  -3  -1  -1  -1  -2  -8
*  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8  -8
Applications • Reconstructing long sequences of DNA from overlapping sequence fragments • Determining physical and genetic maps from probe data under various experimental protocols • Database searching • Comparing two or more sequences for similarities • Protein structure prediction (building profiles) • Comparing the same gene sequenced by two different labs
2.8 Multiple Sequence Alignments • CLUSTAL • D. G. Higgins & P. M. Sharp, 1988 • CLUSTALW • Sequences are weighted according to how divergent they are from the most closely related pair of sequences. • Gaps are weighted for different sequences.
Summary • notion of similarity • the scoring system used to rank alignments • the algorithms used to find optimal scoring alignment • the statistical method used to evaluate the significance of an alignment score
References and Figure Sources • Fundamental Concepts of Bioinformatics, by Dan E. Krane and Michael L. Raymer, Benjamin/Cummings, 2003. • BLAST, by I. Korf, M. Yandell, and J. Bedell, O'Reilly & Associates, 2003. (distributed by 天瓏) • Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Cambridge University Press, 1998. • Biochemistry, by J. M. Berg, J. L. Tymoczko, and L. Stryer, Fifth Edition, 2001.