Overview of sequence database searching techniques and multiple alignment. May 1, 2001 Quiz on May 3-Dynamic programming-Needleman-Wunsch method
Overview of sequence database searching techniques and multiple alignment • May 1, 2001 • Quiz on May 3: dynamic programming, Needleman-Wunsch method • Learning objectives: Understand how to calculate percent similarity. Become familiar with the parameters that affect sequence searches. Be aware of strategies that you can use to make sequence comparisons. • Workshop-
How to calculate percent similarity once optimal alignment has been achieved • Homework assignment: determine percent similarity of two protein sequences using the Needleman-Wunsch method. Sequence 1: MPRCLCQRWNCEA Sequence 2: PERCKCRNWCWA
Needleman-Wunsch matrix (columns: Sequence 1; rows: Sequence 2)
    M  P  R  C  L  C  Q  R  W  N  C  E  A
P   7  8  6  5  5  4  4  3  3  2  2  1  0
E   7  7  6  5  5  4  4  3  3  2  1  2  0
R   6  6  7  5  5  4  4  4  3  2  1  1  0
C   5  5  5  6  5  5  4  3  3  2  2  1  0
K   5  5  5  5  5  4  4  3  3  2  1  1  0
C   4  4  4  5  4  5  4  3  3  2  2  1  0
R   3  3  4  3  3  3  3  4  3  2  1  1  0
N   3  3  3  3  3  3  3  3  2  3  1  1  0
W   3  3  3  3  3  2  2  2  3  2  1  1  0
C   2  2  2  3  2  3  2  2  1  1  2  1  0
W   1  1  1  1  1  1  1  1  2  1  1  1  0
A   0  0  0  0  0  0  0  0  0  0  0  0  1
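The numbers in this matrix are consistent with the original Needleman-Wunsch summation, assuming a score of 1 for an identical pair, 0 for a mismatch, and no gap penalty: each cell holds its own match score plus the largest value reachable in the next row (to the right) or the next column (below). The following is a minimal Python sketch under those assumptions; the function name `nw_sum_matrix` is only illustrative and does not come from the slides.

```python
def nw_sum_matrix(seq_cols, seq_rows):
    """Fill the summed match matrix from the bottom-right corner upward.

    Assumes identity = 1, mismatch = 0, no gap penalty.
    """
    n_rows, n_cols = len(seq_rows), len(seq_cols)
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows - 1, -1, -1):
        for j in range(n_cols - 1, -1, -1):
            match = 1 if seq_rows[i] == seq_cols[j] else 0
            best_next = 0
            if i + 1 < n_rows and j + 1 < n_cols:
                # Cells to the right in the next row ...
                next_row = matrix[i + 1][j + 1:]
                # ... and cells below in the next column.
                next_col = [matrix[k][j + 1] for k in range(i + 1, n_rows)]
                best_next = max(next_row + next_col)
            matrix[i][j] = match + best_next
    return matrix


SEQ1 = "MPRCLCQRWNCEA"  # columns
SEQ2 = "PERCKCRNWCWA"   # rows
matrix = nw_sum_matrix(SEQ1, SEQ2)

# The maximum-match score is the largest value in the first row or column.
best_score = max(matrix[0] + [row[0] for row in matrix])

for residue, row in zip(SEQ2, matrix):
    print(residue, " ".join(str(value) for value in row))
```

With the homework sequences, the largest value sits in the first row, at the P/P cell, which is where a traceback of the optimal path would start.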
Path Score (columns: PAM250, GAP (PAM250), GAP Ext.) • Overall PAM250 score = 63 - 40 - 5 = 18 • Percent Similarity = # of pos. scores / # of residues in the aligned portion = 8/12 = 66.7% • Overall Percent Similarity = # of pos. scores / # of res. in longest seq. = 8/13 = 61.5%
Path Score (columns: PAM250, GAP (PAM250), GAP Ext.) • Overall PAM250 score = 48 - 40 - 5 = 3 • Percent Similarity = # of pos. scores / # of residues in the aligned portion = 8/12 = 66.7% • Overall Percent Similarity = # of pos. scores / # of res. in longest seq. = 8/13 = 61.5%
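The arithmetic on the two path-score slides can be checked with a short sketch. It assumes the three table columns are the summed PAM250 pair scores, a gap penalty, and a gap extension penalty; the function names are illustrative only.

```python
def overall_score(pam250_sum, gap_penalty, gap_extension):
    """Summed pair scores minus gap penalties (column meanings assumed)."""
    return pam250_sum - gap_penalty - gap_extension


def percent_similarity(positive_positions, length):
    """Share of positions with a positive score, as a percentage."""
    if length <= 0:
        raise ValueError("length must be positive")
    return 100.0 * positive_positions / length


score_path_a = overall_score(63, 40, 5)
score_path_b = overall_score(48, 40, 5)
aligned_pct = percent_similarity(8, 12)   # aligned portion
overall_pct = percent_similarity(8, 13)   # longest sequence

print(score_path_a, score_path_b)
print(round(aligned_pct, 1), round(overall_pct, 1))
```

Note that 8/13 works out to about 61.5%, and that both percentages depend only on the count of positive-scoring positions, so the two paths share them even though their overall scores differ.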
Why search sequence databases? • 1. I have just sequenced something. What is known about the thing I sequenced? • 2. I have a unique sequence. Is there similarity to another gene that has a known function? • 3. I found a new protein in a lower organism. Is it similar to a protein from another species? • 4. I have decided to work on a new gene that I read about. The people in the field will not give me the plasmid. I need the complete cDNA sequence to perform RT-PCR.
Perfect Searches • First “hit” should be an exact match. • Next “hits” should contain all of the genes that are related to your gene (either homologs or orthologs). • Next “hits” should be similar but are not homologs. • Note that in the archaebacterium Methanococcus jannaschii, more than 40% of the open reading frames could be assigned a function based on significant sequence similarities to proteins of known function.
How does one achieve the “perfect search”? • Comparison Matrices (PAM vs. BLOSUM) • Database Search Algorithms • Databases • Search Parameters • Expect Value • Translation • Filtering
Comparison Matrices • In general, the BLOSUM series is thought to be superior to the PAM series because it is derived from conserved blocks of aligned sequences. • It is important to vary the parameters when performing a sequence comparison. Similarity scores for truly related sequences are usually not sensitive to changes in scoring matrix and gap penalty. Thus, if your hit list holds up after changing these parameters, you can be more confident that you are detecting truly similar sequences.
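One way to make "the hit list holds up" concrete is to compare the sets of hits returned under two parameter settings. The sketch below uses the Jaccard overlap, which is an assumption of this example and not a measure named in the slides; the hit identifiers are hypothetical placeholders, not real accessions.

```python
def hit_overlap(hits_a, hits_b):
    """Jaccard overlap between two hit lists (1.0 = identical sets)."""
    set_a, set_b = set(hits_a), set(hits_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical hit lists from the same query run with two scoring matrices.
hits_matrix_1 = ["hit1", "hit2", "hit3", "hit4"]
hits_matrix_2 = ["hit1", "hit2", "hit3", "hit5"]

overlap = hit_overlap(hits_matrix_1, hits_matrix_2)
print(overlap)
```

Hits that appear in both lists are the ones least sensitive to the parameter change and therefore the better candidates for true relatives.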
Which Program should one use? • Most researchers use methods for determining local similarities: • Smith-Waterman (gold standard) • FASTA • BLAST • FASTA and BLAST do not find every possible alignment of query with database sequence. They are used because they run faster than Smith-Waterman.
When to use the correct program
Problem | Program | Explanation
Identify unknown protein | BLASTP; FASTA3 | General protein comparison. Use ktup=2 for speed; ktup=1 for a sensitive search.
Identify unknown protein | Smith-Waterman | Slower than FASTA3 but provides maximum sensitivity.
Identify unknown protein | TFASTX3; TFASTY3; TBLASTN | Use if a homolog cannot be found in protein databases; approx. 33% slower.
Identify unknown protein | Psi-BLAST | Finds distantly related sequences. It replaces the query sequence with a position-specific score matrix after an initial BLASTP search, then uses the matrix to find distantly related sequences.
When to use the correct program (cont. 1)
Problem | Program | Explanation
Identify new orthologs | TFASTX3; TFASTY3; TBLASTN; TBLASTX | Use a PAM matrix <=20 to avoid detecting distant relationships. Search EST sequences within the same species.
Identify EST sequence | FASTX3; FASTY3; BLASTX; TBLASTX | Always attempt to translate your sequence into protein prior to searching.
Identify DNA sequence | FASTA; BLASTN | Nucleotide sequence comparison.
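The advice to translate a nucleotide sequence into protein before searching can be illustrated with a six-frame translation, which is conceptually what translated-search programs do to a DNA query or database. This is a minimal sketch using the standard genetic code; the function names are illustrative, and any codon containing a character other than A, C, G, or T is reported as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate in frame 1; stop codons become '*', unknown codons 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def reverse_complement(dna):
    """Reverse complement of an A/C/G/T sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Translate all three forward and three reverse frames."""
    frames = {}
    reverse = reverse_complement(dna)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frames("ATGGCC")
print(frames)
```

Each of the six protein strings could then be compared against a protein database, which is generally more sensitive than comparing the nucleotides directly.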
Choosing the database • Remember that the E value increases linearly with database size. • When searching for distant relationships, always use the smallest database likely to contain the homolog of interest. • Thought problem: If the E-value one obtains for a search is 12 in Swiss-PROT and the E-value one obtains for the same search is 74 in PIR, how large is PIR compared to Swiss-PROT? 74/12 = ~6, so PIR would be about six times larger.
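The thought problem follows directly from the linear relationship between E value and database size. A small sketch, assuming the same alignment score is found in both databases so that only the database size differs (function names are illustrative):

```python
def relative_database_size(e_value_a, e_value_b):
    """Size of database B relative to A, assuming E scales linearly with size."""
    return e_value_b / e_value_a


def rescale_e_value(e_value, old_size, new_size):
    """E value expected for the same hit in a database of a different size."""
    return e_value * new_size / old_size


size_ratio = relative_database_size(12, 74)
print(round(size_ratio, 1))

# The same hit in a database one tenth the size gets a tenfold smaller E value.
smaller_db_e = rescale_e_value(12, old_size=10, new_size=1)
print(smaller_db_e)
```

This is why a smaller, well-chosen database helps when looking for distant relationships: the same alignment score yields a lower, more convincing E value.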
Filtering Repetitive Sequences • Over 50% of genomic DNA is repetitive • This is due to: • retrotransposons • ALU region • microsatellites • centromeric sequences, telomeric sequences • 5’ Untranslated Region of ESTs Example of ESTs with simple low complexity regions: T27311 GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTC TCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC
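To show what filtering does to a sequence like the EST above, here is a deliberately simplified sketch that masks tandem repeats of short units with N. It is not the algorithm used by real search programs (BLAST, for example, relies on dedicated low-complexity filters such as DUST and SEG); the thresholds `unit_max` and `min_copies` are arbitrary assumptions chosen for illustration.

```python
import re


def mask_simple_repeats(seq, unit_max=3, min_copies=6, mask_char="N"):
    """Mask runs where a unit of 1..unit_max bases repeats >= min_copies times."""
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (unit_max, min_copies - 1)
    )
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq.upper())


# EST T27311 from the slide, joined across the line break.
est = (
    "GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTC"
    "TCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC"
)
masked = mask_simple_repeats(est)
print(masked)
```

Only the short unique prefix survives masking, so a search with the filtered query can no longer match every other sequence that happens to contain a TC repeat.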