Lesson 3 Aligning sequences and searching databases
Homology • Similarity between objects due to a common ancestry
Sequence homology • Similarity between sequences that results from a common ancestor
VLSPAVKWAKVGAHAAGHG
VLSEAVLWAKVEADVAGHG
• Basic assumption: Sequence homology → similar structure/function
Sequence alignment Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.
Homology • Ortholog – homolog with similar function (via speciation) • Paralog – homolog which arose from gene duplication • Orthologs – 2 homologs from different species • Paralogs – 2 homologs within the same species
How close? • Rule of thumb: • Proteins are homologous if over 25% identical (length >100) • DNA sequences are homologous if over 70% identical
Twilight zone • < 20% identity in proteins – the sequences may or may not be homologous • (Note that 5% identity will be obtained completely by chance!)
Why sequence alignment? Predict characteristics of a protein – use the structure/function of known proteins for predicting the structure/function of an unknown protein
Sequence modifications Sequences change in the course of evolution due to random mutations. Three types of mutations: • Insertion – an insertion of a nucleotide or several nucleotides into the sequence. AAGA → AAGTA • Deletion – a deletion of a nucleotide (or more) from the sequence. AAGA → AGA • Substitution – a replacement of a nucleotide by another. AAGA → AACA • Insertion or deletion? → Indel
Local vs. Global • Global alignment – finds the best alignment across the entire two sequences. • Local alignment – finds regions of similarity in parts of the sequences.
Global alignment forces alignment in regions which differ:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local alignment will return only regions of good alignment:
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
Global alignment • PTK2 protein tyrosine kinase 2 of human and rhesus monkey
Protein tyrosine kinase domain • Human PTK2 and leukocyte tyrosine kinase • Both function as tyrosine kinases, in completely different contexts • Ancient duplication
Pairwise alignment
AAGCTGAATTCGAA
AGGCTCATTTCTGA
One possible alignment:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
This alignment includes: • 2 mismatches • 4 indels (gaps) • 10 perfect matches
Choosing an alignment: • Many different alignments are possible for the sequences AAGCTGAATTCGAA and AGGCTCATTTCTGA:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
or
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Which alignment is better?
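The column counts on this slide can be checked mechanically. The following is a minimal sketch, not part of the original lesson; the function name `count_columns` is a hypothetical helper introduced only for illustration.

```python
def count_columns(top: str, bottom: str) -> dict:
    """Classify each alignment column as a match, mismatch, or indel."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    counts = {"match": 0, "mismatch": 0, "indel": 0}
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            counts["indel"] += 1      # a gap in either row
        elif a == b:
            counts["match"] += 1      # identical characters
        else:
            counts["mismatch"] += 1   # two different characters
    return counts


counts = count_columns("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
print(counts)  # {'match': 10, 'mismatch': 2, 'indel': 4}
```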
Alignment scoring - scoring of sequence similarity: • Assumes independence between positions • Each position is considered separately • Scores each position • Positive if identical (match) • Negative if different (mismatch) or gap (indel) • Total score = sum of position scores • Can be positive or negative
Example - naïve scoring system: • Perfect match: +1 • Mismatch: -2 • Indel (gap): -1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
Score: (+1)×10 + (-2)×2 + (-1)×4 = 2
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
Score: (+1)×9 + (-2)×2 + (-1)×6 = -1
Higher score → better alignment
Scoring system: • The choice of +1, -2, and -1 scores is quite arbitrary • Different scoring systems → different alignments • Scoring systems implicitly represent a particular theory of evolution • Some mismatches are more plausible • Transition vs. transversion • Lys→Arg ≠ Lys→Cys • Gap extension ≠ gap opening
Scoring matrix • Representing the scoring system as a table or matrix n × n (n is the number of letters the alphabet contains: n=4 for nucleotides, n=20 for amino acids) • Symmetric
DNA scoring matrices • Uniform substitutions between all nucleotides: one score for every match, one score for every mismatch
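The two scores above can be reproduced with a few lines of Python. This is a sketch under the slide's naïve scoring system (+1, -2, -1); the names `MATCH`, `MISMATCH`, `GAP`, and `naive_score` are assumptions made for this example.

```python
MATCH, MISMATCH, GAP = 1, -2, -1  # the naive scoring system from the slide


def naive_score(top: str, bottom: str) -> int:
    """Sum independent per-column scores over an existing alignment."""
    score = 0
    for a, b in zip(top, bottom):
        if "-" in (a, b):
            score += GAP
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score


first = naive_score("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
second = naive_score("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-")
print(first, second)  # 2 -1
```

Because every column is scored independently, changing any one of the three constants can change which alignment ranks higher.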
DNA scoring matrices Can take into account biological phenomena such as: • Transition-transversion
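One way to encode the transition-transversion idea is to penalise transitions (purine↔purine or pyrimidine↔pyrimidine) less than transversions. The sketch below does that; the score values +1, -1, and -2 are illustrative assumptions, not values given in the lesson.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one nucleotide pair; the default values are hypothetical."""
    if a == b:
        return match
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


# Build the full 4 x 4 matrix as a dictionary keyed by nucleotide pairs.
matrix = {(a, b): substitution_score(a, b) for a in "ACGT" for b in "ACGT"}

for a in "ACGT":
    print(a, [matrix[(a, b)] for b in "ACGT"])
```

The resulting matrix is symmetric, as the scoring-matrix slide requires.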
Amino-acid scoring matrices • Take into account physico-chemical properties
Amino-acid substitutions matrices • Actual substitutions: • Based on empirical data • Commonly used by many bioinformatics programs • PAM & BLOSUM
Protein matrices – actual substitutions The idea: Given an alignment of a large number of closely related sequences, we can score the relation between amino acids based on how frequently they substitute each other
M G Y D E
M G Y D E
M G Y E E
M G Y D E
M G Y Q E
M G Y D E
M G Y E E
M G Y E E
In the fourth column, E and D are found in 7 of the 8 sequences
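The counting step behind this idea can be sketched as follows. This only tallies residues and residue pairs in one column of the toy alignment above; it is an illustration of the counting idea, not the actual PAM or BLOSUM derivation, and all variable names are assumptions.

```python
from collections import Counter
from itertools import combinations

sequences = ["MGYDE", "MGYDE", "MGYEE", "MGYDE",
             "MGYQE", "MGYDE", "MGYEE", "MGYEE"]

# Residues observed in the fourth column (index 3).
column = [seq[3] for seq in sequences]
residue_counts = Counter(column)

# Fraction of sequences that carry D or E at this position.
fraction_d_or_e = (residue_counts["D"] + residue_counts["E"]) / len(sequences)

# Count every pair of sequences whose residues differ in this column.
pair_counts = Counter(
    tuple(sorted(pair)) for pair in combinations(column, 2) if pair[0] != pair[1]
)

print(residue_counts)   # D: 4, E: 3, Q: 1
print(fraction_d_or_e)  # 0.875
print(pair_counts)      # ('D', 'E'): 12, ('D', 'Q'): 4, ('E', 'Q'): 3
```

Pairs seen often, such as D with E here, would receive a higher score than rare pairs.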
PAM Matrix - Point Accepted Mutations • Based on a database of 1,572 changes in 71 groups of closely related proteins (85% identity) • Alignment was easy • Counted the number of the substitutions per amino-acid pair (20 x 20) • Found that common substitutions occurred between chemically similar amino acids
PAM Matrices • Family of matrices PAM 80, PAM 120, PAM 250 • The number on the PAM matrix represents evolutionary distance • Larger numbers are for larger distances
Example: PAM 250 Similar amino acids have greater score
PAM - limitations • Based only on a single, and limited dataset • Examines proteins with few differences (85% identity) • Based mainly on small globular proteins so the matrix is biased
BLOSUM • Henikoff and Henikoff (1992) derived a set of matrices based on a much larger dataset • BLOSUM observes significantly more replacements than PAM, even for infrequent pairs
BLOSUM: BLOcks SUbstitution Matrix • Based on the BLOCKS database • ~2000 blocks from 500 families of related proteins • Families of proteins with identical function • Blocks are short conserved patterns of 3-60 aa without gaps
AABCDA----BBCDA
DABCDA----BBCBB
BBBCDA-AA-BCCAA
AAACDA-A--CBCDB
CCBADA---DBBDCC
AAACAA----BBCCC
BLOSUM • Each block represents a sequence alignment with different identity percentage • For each block the amino-acid substitution rates were calculated to create the BLOSUM matrix
BLOSUM Matrices • BLOSUMn is based on sequences that share at least n percent identity • BLOSUM62 represents closer sequences than BLOSUM45
Example: BLOSUM62, derived from blocks in which the sequences share at least 62% identity
PAM vs. BLOSUM, from closer to more distant sequences:
PAM100 = BLOSUM90
PAM120 = BLOSUM80
PAM160 = BLOSUM60
PAM200 = BLOSUM52
PAM250 = BLOSUM45
Scoring system = substitution matrix + gap penalty
Gap penalty • We penalize gaps • Scoring for gap opening & gap extension: • Gap-extension penalty < gap-open penalty
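A common way to express separate opening and extension costs is an affine gap penalty. The sketch below assumes the convention "opening cost for the first gap position, extension cost for each further position" and uses hypothetical values of -10 and -1; other tools use slightly different conventions and values.

```python
def affine_gap_score(length: int, gap_open: int = -10, gap_extend: int = -1) -> int:
    """Score one gap of the given length (penalty values are illustrative)."""
    if length <= 0:
        return 0
    # The first position pays the opening cost, the rest pay the extension cost.
    return gap_open + (length - 1) * gap_extend


one_long_gap = affine_gap_score(3)                          # -12
two_short_gaps = affine_gap_score(1) + affine_gap_score(2)  # -21
print(one_long_gap, two_short_gaps)
```

With these values, one gap of length 3 costs less than two separate gaps covering the same three positions, which reflects the assumption that a single indel event can insert or delete several residues at once.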
Optimal alignment algorithms • Needleman-Wunsch (global) • Smith-Waterman (local)
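As an illustration of the global case, here is a minimal Needleman-Wunsch sketch. It assumes the naïve scoring system from earlier in the lesson (+1, -2, -1) with a linear gap penalty rather than separate opening and extension costs; the function and variable names are choices made for this example.

```python
def needleman_wunsch(s, t, match=1, mismatch=-2, gap=-1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    n, m = len(s), len(t)

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    # F[i][j] holds the best score for aligning s[:i] with t[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j),  # align s[i-1] with t[j-1]
                          F[i - 1][j] + gap,            # gap in t
                          F[i][j - 1] + gap)            # gap in s

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("AAGCTGAATTCGAA", "AGGCTCATTTCTGA")
print(score)
print(top)
print(bottom)
```

Smith-Waterman follows the same table-filling scheme but floors every cell at zero and starts the traceback from the highest-scoring cell, which is what restricts the result to a local region.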
Alignment Search Space • The “search space” (number of possible gapped alignments) for optimally aligning two sequences is exponential in the length of the sequences (n). • If n=100, there are 100¹⁰⁰ = 10²⁰⁰ different alignments (a 1 followed by 200 zeros)! • Average protein length is about n=250!
Searching a sequence database • Using a sequence as a query to find homologous sequences in a sequence database
Query sequence: DNA or protein? • For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences. • Which is preferable?
Protein is better! • Selection (and hence conservation) works (mostly) on the protein level: CTT TCA = Leu-Ser, TTG AGT = Leu-Ser
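The example on this slide can be verified with a tiny translation sketch. The codon table below is deliberately partial (only the four codons from the slide), and the helper names `translate` and `identity` are assumptions made for illustration.

```python
# Partial codon table: only the four codons used on the slide.
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna: str) -> str:
    """Translate a coding sequence codon by codon (partial table only)."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return "-".join(CODON_TABLE[codon] for codon in codons)


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(translate("CTTTCA"), translate("TTGAGT"))  # Leu-Ser Leu-Ser
print(identity("CTTTCA", "TTGAGT"))              # 1 of 6 positions identical
```

The two DNA sequences share only one of six positions, yet they encode the same dipeptide, which is why the protein-level comparison detects the relationship more easily.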
Query type • Nucleotides: a four letter alphabet • Amino acids: a twenty letter alphabet • Two random DNA sequences will, on average, have 25% identity • Two random protein sequences will, on average, have 5% identity
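The 25% and 5% figures follow if every letter is assumed to be equally frequent and positions are compared without gaps: the chance that two independent letters agree is the sum of the squared letter frequencies. The sketch below makes that assumption explicit; real sequences have uneven composition, so the true chance level differs somewhat.

```python
def expected_identity(frequencies: dict) -> float:
    """Probability that two independently drawn letters are identical."""
    return sum(p * p for p in frequencies.values())


# Assumption: uniform letter frequencies in both alphabets.
dna = {base: 1 / 4 for base in "ACGT"}
protein = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}

dna_identity = expected_identity(dna)          # 0.25
protein_identity = expected_identity(protein)  # about 0.05
print(dna_identity, round(protein_identity, 4))
```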
Conclusions • Using the amino-acid sequence is preferable for homology search • Why use a nucleotide sequence after all? • No ORF found, e.g. newly sequenced genome • No similar protein sequences were found • Specific DNA databases are available (EST)
Some terminology • Query sequence - the sequence with which we are searching • Hit– a sequence found in the database, suspected as homologous
How do we search a database? • Assume we perform pairwise alignment of the query against all the sequences in the database • Exact pairwise alignment is O(mn) ≈ O(n²) (m – length of sequence 1, n – length of sequence 2)