
Rationale for searching sequence databases

Rationale for searching sequence databases. June 25, 2003. Writing projects due July 11. Learning objectives: FASTA and BLAST programs; PSI-BLAST. Workshop: use PSI-BLAST to determine sequence similarities; use BLASTX to gain information on gene structure. FASTA (Pearson and Lipman 1988).

jubal

Presentation Transcript


  1. Rationale for searching sequence databases • June 25, 2003 • Writing projects due July 11 • Learning objectives: FASTA and BLAST programs; PSI-BLAST • Workshop: use PSI-BLAST to determine sequence similarities; use BLASTX to gain information on gene structure.

  2. FASTA (Pearson and Lipman 1988) • FASTA combines a word search with the Smith-Waterman algorithm. • The query sequence is divided into short words of a fixed size. • The initial comparison of the query sequence to the database is performed using these “words”. • If several “words” lie on the same diagonal of the comparison array, the region surrounding that diagonal is analyzed further. • Search time is proportional only to the size of the database, not to (database size × query length).

  3. The FASTA program uses hash tables, which speed up the word search. • Query sequence = TCTCTC (positions 1-6) • Database sequence = TTCTCTC (positions 1-7) • With a word size of 4, the table has 4^4 = 256 possible words. • Each entry records the word, its positions within the query, its positions within the database, and the offset (query position minus database position): • TCTC: query positions 1, 3; database positions 2, 4; offsets -1, -3, or 1 • CTCT: query position 2; database position 3; offset -1 • TTCT: not in the query; database position 1; no offset
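
The word lookup on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not FASTA's actual implementation; the function names `word_positions` and `word_offsets` are made up for this example, and positions are 1-based to match the slide.

```python
from collections import defaultdict


def word_positions(seq, k):
    """Map each k-letter word to its 1-based start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return dict(table)


def word_offsets(query, db, k):
    """For each word shared by query and db, list the distinct offsets
    (query position minus database position), sorted ascending."""
    q_table = word_positions(query, k)
    d_table = word_positions(db, k)
    return {
        word: sorted({q - d for q in q_pos for d in d_table[word]})
        for word, q_pos in q_table.items()
        if word in d_table
    }


offsets = word_offsets("TCTCTC", "TTCTCTC", 4)
print(offsets)  # {'TCTC': [-3, -1, 1], 'CTCT': [-1]}
```

Offset -1 is shared by both words, which is the kind of agreement along one diagonal that FASTA goes on to examine more closely.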

  4. FASTA Steps (four-panel figure, panels numbered 1-4; labels as extracted) • Identical offset values in a contiguous sequence • Different offset values • Local regions of identity are found • Diagonals are extended • Rescore the local regions using a PAM or BLOSUM matrix • Eliminate short diagonals below a cutoff score • Create a gapped alignment in a narrow segment and then perform S-W alignment

  5. Summary of FASTA steps 1. The database is scanned for contiguous identical matches (between 5 and 10 amino acids in length, sharing the same offset value). 2. The longest diagonals are rescored using the PAM matrix (or another matrix). The best scores are saved as “init1” scores. 3. Short diagonals are removed. 4. Long diagonals that are neighbors are joined. The score for this joined region is “initn”; it may be lower because of a gap penalty. 5. A S-W dynamic programming alignment is performed around the joined region to give an “opt” score. Thus, the time-consuming S-W step is performed only on the top-scoring sequences.

  6. The ktup value • The ktup (for k-tuple) value is the length of the word used to search for identity. • For proteins, a ktup value of 3 gives a hash table of 20^3 = 8000 entries. • The higher the ktup value, the less likely you are to get a match unless the sequences are identical (remember the dot plots). • The lower the ktup value, the more background you will have. • The higher the ktup value, the faster the analysis (fewer diagonals). • The following rules typically apply when using FASTA: • ktup = 1: proteins, distantly related • ktup = 2: proteins, somewhat related (default) • ktup = 3: DNA (default)

  7. FASTA Versions • FASTA: nucleotide or protein sequence searching • FASTx/FASTy: compare a translated DNA query sequence to a protein sequence database (forward or backward translation of the query) • tFASTx/tFASTy: compare a protein query sequence to a DNA sequence database that has been translated into three forward and three reverse reading frames

  8. FASTA Statistical Significance A way of measuring the significance of a score considers the mean of the random score distribution. The difference between the similarity score for your single alignment and the mean of the random score distribution is normalized by the standard deviation of that random score distribution. This is the Z-score. Higher Z-scores are better because the further the real score is from this mean (in standard deviation units) the more significant it is.

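  9. FASTA Statistical Significance • Z-score for a single alignment: $Z = \dfrac{\text{similarity score} - \text{mean score from database}}{\text{standard deviation from database}}$ • Standard deviation: $\sigma = \sqrt{\dfrac{\sum \text{scores}^2 - \left(\sum \text{scores}\right)^2 / N}{N}}$, where $N$ is the total number of sequences.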
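
As a rough illustration of the Z-score calculation, the sketch below computes the mean and standard deviation of a list of database scores and normalizes one alignment score against them. The scores are invented for the example, and the function name `z_score` is hypothetical; the real FASTA program applies further corrections (for example for sequence length) that are not shown here.

```python
import math


def z_score(score, db_scores):
    """Z = (score - mean of database scores) / standard deviation.

    The standard deviation is the population form:
    sqrt((sum(s^2) - (sum(s))^2 / N) / N).
    """
    n = len(db_scores)
    total = sum(db_scores)
    total_sq = sum(s * s for s in db_scores)
    mean = total / n
    std_dev = math.sqrt((total_sq - total ** 2 / n) / n)
    return (score - mean) / std_dev


# Made-up database scores, for illustration only.
background = [10, 20, 30, 40, 50]
print(round(z_score(100, background), 2))  # 4.95
```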

  10. (Figure: distributions of similarity scores, labeled “Mean similarity scores of complete database” and “Mean similarity scores of related records”.)

  11. FASTA statistics (cont.) Using the distribution of the z-scores in the database, the FASTA program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() value: the number of sequences you would expect to find with this score by searching a database of random sequences. Thus, as the z-score increases, the E() value decreases.

  12. Evaluating the Results of FASTA • Best scores: Init1: 2847; Initn: 2847; Opt: 2847; z-score: 2609.2; E(): 1.4e-138; Smith-Waterman score: 2847; 100.0% identity in 413 overlap • Good scores: Init1: 719; Initn: 748; Opt: 793; z-score: 734.0; E(): 3.8e-34; Smith-Waterman score: 796; 41.3% identity in 378 overlap • Mediocre scores: Init1: 249; Initn: 304; Opt: 260; z-score: 243.2; E(): 8.3e-07; Smith-Waterman score: 270; 35.0% identity in 183 overlap

  13. BLAST • Basic Local Alignment Search Tool • Speed is achieved by: • Pre-indexing the database before the search • Parallel processing • Uses a hash table that contains neighborhood words rather than just random words.

  14. Neighborhood words • The program declares a hit if a word taken from the query sequence scores >= T against a database word when a scoring matrix is used. • This allows the word size W (similar to the ktup value) to be kept high (for speed) without sacrificing sensitivity. • If the user increases T, the number of background hits is reduced and the program runs faster.
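
A small sketch may help show what a neighborhood of words looks like. It enumerates every word of the same length as the query word and keeps those scoring at least T. Everything here is an assumption made for illustration: the four-letter alphabet, the `toy_score` function (it stands in for a real matrix such as BLOSUM62), and the function names. BLAST itself builds this list far more efficiently.

```python
from itertools import product


def toy_score(x, y):
    """Hypothetical scoring rule, NOT a real substitution matrix:
    identical residues +4, I/L/V swapped among themselves +2, otherwise -3."""
    hydrophobic = {"I", "L", "V"}
    if x == y:
        return 4
    if x in hydrophobic and y in hydrophobic:
        return 2
    return -3


def neighborhood(word, alphabet, score, threshold):
    """Return all words whose summed pairwise score against `word` is >= threshold."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return sorted(hits)


print(neighborhood("LIK", "ILVK", toy_score, 10))
# ['IIK', 'LIK', 'LLK', 'LVK', 'VIK']
print(neighborhood("LIK", "ILVK", toy_score, 12))
# ['LIK']
```

Raising T from 10 to 12 shrinks the neighborhood to the exact word, which is why a higher T gives fewer background hits and a faster search.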

  15. Comparison Matrices In general, the BLOSUM series is thought to be superior to the PAM series for detecting evolutionarily distant sequences because it is derived from regions of conserved sequence. It is important to vary the parameters when performing a sequence comparison. Similarity scores for truly related sequences are usually not sensitive to changes in scoring matrix and gap penalty. Thus, if your “hits list” holds up after changing these parameters, you can be more confident that you are detecting truly similar sequences.

  16. Which Program should one use? • Most researchers use methods for determining local similarities: • Smith-Waterman (gold standard) • FASTA • BLAST • FASTA and BLAST do not find every possible alignment of the query with a database sequence; they are used because they run faster than S-W.

  17. When to use the correct program (columns: Problem, Program, Explanation) • Problem: identify unknown protein • Program: BLASTP; FASTA3. Explanation: general protein comparison. Use ktup=2 for speed; ktup=1 for a sensitive search. • Program: Smith-Waterman. Explanation: slower than FASTA3 and BLAST but provides maximum sensitivity. • Program: TFASTX3; TFASTY3; TBLASTN. Explanation: use if a homolog cannot be found in protein databases; approx. 33% slower. • Program: PSI-BLAST. Explanation: finds distantly related sequences. It replaces the query sequence with a position-specific score matrix after an initial BLASTP search, then uses this matrix to find distantly related sequences.

  18. When to use the correct program (cont. 1) (columns: Problem, Program, Explanation) • Problem: identify new orthologs in closely related species. Program: TFASTX3; TFASTY3; TBLASTN; TBLASTX. Explanation: use PAM matrix <=20 or BLOSUM90 to avoid detecting distant relationships. Search EST sequences within the same species. • Problem: identify EST sequence. Program: FASTX3; FASTY3; BLASTX; TBLASTX. Explanation: always attempt to translate your sequence into protein prior to searching. • Problem: identify DNA sequence. Program: FASTA; BLASTN. Explanation: nucleotide sequence comparison. • Notes: TBLASTX compares a nucleotide query to a translated nucleotide DB; BLASTX compares a nucleotide query to a protein DB.

  19. Choosing the database • Remember that the E-value increases linearly with database size. • When searching for distant relationships, always use the smallest database likely to contain the homolog of interest. • Thought problem: if the E-value one obtains for a search is 12 in Swiss-PROT and the E-value for the same search is 74 in PIR, how large is PIR compared to Swiss-PROT? 74/12 ≈ 6, so PIR is about six times larger.
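
The thought problem can be worked through in code. The sketch assumes, as the slide states, that the E-value of a given hit scales linearly with database size; the function names are hypothetical.

```python
def database_size_ratio(e_value_a, e_value_b):
    """Estimate size(B) / size(A) from the E-values of the same hit in
    databases A and B, assuming E scales linearly with database size."""
    return e_value_b / e_value_a


def rescale_e_value(e_value, old_size, new_size):
    """Predict the E-value of the same hit in a database of a different size."""
    return e_value * new_size / old_size


ratio = database_size_ratio(12, 74)
print(round(ratio, 2))  # 6.17
```

Read the other way, a hit with E = 12 in the smaller database would be expected to score E of about 74 in a database roughly six times larger, which is why a smaller database helps when looking for distant relationships.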

  20. Filtering Repetitive Sequences • Over 50% of genomic DNA is repetitive. • This is due to: • retrotransposons • Alu regions • microsatellites • centromeric and telomeric sequences • 5’ untranslated regions of ESTs • Example of an EST with a simple low-complexity region: T27311 GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTC TCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC

  21. Filtering Repetitive Sequences (cont. 1) • Programs like BLAST have the option of filtering out low-complexity regions. • Repetitive sequences increase the chance of a spurious match during a database search.
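
To make the idea of low-complexity filtering concrete, here is a minimal sketch that masks windows with low Shannon entropy. It is not the filter BLAST actually uses; the window length, the entropy cutoff, and the function names are assumptions chosen for this example. The test sequence is the start of EST T27311, with the TC repeat shortened.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per position) of the letters in a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.2):
    """Replace every position covered by a low-entropy window with 'N'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "N"
    return "".join(masked)


est = "GGGTGCAGGAATTCGGCACGAG" + "TC" * 13  # TC repeat shortened for the example
print(mask_low_complexity(est))
# GGGTGCAGGAATTCGGCACGAGNNNNNNNNNNNNNNNNNNNNNNNNNN
```

A window containing only T and C in equal numbers has an entropy of 1 bit, below the cutoff, so the whole repeat is masked while the more varied leading sequence is left alone.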

  22. PSI-BLAST • PSI stands for position-specific iterated. • A position-specific scoring matrix (PSSM) is constructed automatically from the multiple alignment of hits from an initial BLAST search, which uses a normal E-value cutoff. • This profile is then used to perform a second BLAST search with a low E-value cutoff (E = 0.001). • Results: 1) distantly related sequences are obtained; 2) the residues important for function or structure are revealed.
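
The profile idea can be sketched as follows: count the residues in each column of an alignment, add pseudocounts, and convert the frequencies to log-odds scores against a background. This is a simplified illustration only. PSI-BLAST itself also weights sequences and handles gaps, none of which is shown, and the toy alignment, uniform background, and function names are assumptions.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Return one dict of log2-odds scores per column of an ungapped alignment.

    Assumes a uniform background frequency over `alphabet`.
    """
    background = 1 / len(alphabet)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


# Toy alignment over a 4-letter alphabet, for illustration only.
profile = build_pssm(["ACGT", "ACGA", "ACCT"], "ACGT")
print(round(profile[0]["A"], 2))                     # 1.19
print(score_with_pssm(profile, "ACGT") > score_with_pssm(profile, "TTTT"))  # True
```

Columns where every aligned sequence agrees get the highest scores, which is how the profile points to residues that are likely to matter for function or structure.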
