Database Searching:BLAST and FASTA

Database Searching:BLAST and FASTA

Biological Sequence Databases • NCBI • Nucleotide • GenBank, EMBL, DDBJ, GSDB, patent sequences • Protein • SwissProt, PIR, PRF, PDB (structural) • PopSet • GDB, FlyBase, OMIM, HGMD, MGD, GED • There are other types of biological databases : EcoCyc, WIT/WIT2, others

Total amount of non-redundant, well-maintained sequence data is 10- 15 GB and growing expontially • This does not include lots of other easily accessible data like EST’s (expressed sequence tags)

In-class exercise: GenBank flatfile format • Open Netscape browser • URL: http://www.ncbi.nlm.nih.gov/ • Make sure GenBank appears in Search window; type in calmodulin; select Go • Pick any of the returned files; double click to select it • Observe output “flat file” • Remember: the database is not a collection of flatfiles!

Database searching: the problem • Experiments yield a sequence, and you want to know what kind of gene/protein it is (or you have a known gene sequence from one organism and you want to know if another organism has a homologous gene) • Databases have millions of sequences • Need method to compare query sequences with all those in databases: speed vs. accuracy

The real problem … • How do we compare sequences? • Seq 1: CTGCACTA • Seq 2: CACTA • or C---ACTA

The real problem … • How do we compare sequences? • Seq 1: CTGCACTA • Seq 2: CACTA • or C---ACTA • Scoring tries to approximate evolution: scores for substitutions and for gaps (insertions/deletions) • Scores = sum of terms for substitutions and for gaps (sequence as character string) 41 17

Sequence alignment I • Simplest scoring: 1 for match, 0 for no match • CTGCACTA • CACTA • CTGCACTA • C---ACTA Score = 5 Score = 5

Sequence alignment II • Slightly more advanced scoring: +1 for match, 0 for no match, -1 for gap • CTGCACTA • CACTA • CTGCACTA • C---ACTA Score = 5 Score = 2

G C A T G 1 0 0 0 C 0 1 0 0 A 0 0 1 0 T 0 0 0 1 G C A T G 1 -1 -1 -1 C -1 1 -1 -1 A -1 -1 1 -1 T -1 -1 -1 1 Identity scoring matrices: top, simple form; below, with mismatch penalty

In-class exercise II • Using the “advanced scoring method” calculate the scores for the following pairs of nucleotide sequences:

Linear vs. affine gap penalties • Linear gap penalty: same penalty subtracted from each space in the gap • Affine gap penalty: first space in the gap has a larger score than subsequent spaces in the gap; i.e., easier to lose/gain more subunits from a gap than to start a new gap/insertion (this makes sense evolutionarily)

Linear vs. Affine: example • Match = +1, mismatch = -1, gap = -2; • CCTGGGCTATGC • CC-GG-TT-TGC • Same as above but with affine penalty = -1 • CCTGGGCTATGC • CC--GGTT-TGC Score = 1 Score = 2

What about proteins? • Chemistry of amino acids means that some substitutions in the sequence are better than others • Substitution matrix: empirically derived scores for frequency of substitution of each amino acid for all 19 others.

BLOSUM 62 Substitutionmatrix

In-class exercise III • Using the BLOSUM62 substitution matrix and a gap penalty of -2, score the following pairs of protein sequences (do not penalize end gaps)

In-class exercise IV • Using the BLOSUM62 substitution matrix, find the scores of the following alignment using • A) linear gap penalty of d = 8 • B) affine gap penalty of d = 8, e = 2 • AVAHV---D--DMPNALS • AAIQLQVTGVVVTDATLK

Search algorithms • Early search algorithms scored the whole query sequence against every library sequence; very accurate but very slow • FASTA and BLAST search algorithms use the idea that similar sequences are likely to have small areas of exact matches

Original BLAST Algorithm • Make a list of short “words” = wmers from the query sequence: • For nucleotide sequences wmers are used “as is” • for protein sequences wmers are all words which score higher than some threshold T using a substitution matrix • Choose all possible wmers by sliding a window of length w down the query sequence

DNA CACTAGCTAAA For w = 6. CACTAG ACTAGC CTAGCT etc. Protein WRKRKKRTGLE For w=3, T=11 WRK WRR WKR RKR RKK, etc. W-mers: DNA vs. protein

extension wmer Original BLAST algorithm cont. • Scan the library for sequences that match wmers generated in step 1 • Extend hits; any extensions must increase the score over that of the original hit; extension stops when no increase in score

Original BLAST algorithm cont • BLAST results can give more than one area of sequence similarity per pair of proteins compared • BLAST results are more amenable to statistical analysis than FASTA

Gapped BLAST improvements • BLAST now requires two hits within distance A on the same diagonal to trigger extension (overlapping hits are ignored); to keep sensitivity, T is lowered, increasing the number of hits; but since computation time is mostly in the extension phase the overall computation time is much reduced.

Gapped BLAST improvement, cont • Gapping is now allowed. This is handled by having a moderate cutoff score for HSP’s that triggers dynamic programming algorithm (discussed in two weeks) to align the two sequences in question • Statistical analysis somewhat compromised, but still very good.

In-class exercise IV • Open browser; go to www.ncbi.nlm.nih.gov • In “search” window, use pulldown menu to select proteins • Type cytochrome P450 in “for” window; select go • Select a protein by double-clicking • In the database flatfile window, where it says “Default View”, open pulldown menu and select FASTA; then select the Display button

In-class exercise IV, cont. • Paint sequence starting with the first amino acid; select Editcopy; • From top menu bar select Protein; in next window select BLAST from sidebar • In BLAST window, under protein BLAST, select standard protein-protein BLAST • Paste sequence into “search” box; keep the defaults • Select BLAST

In-class exercise IV, cont. • In the next window, you will see information about the conserved domain (CD) search that is done by default; you should select Format first, then you can play with the CD results for a while. If you click on the CD image first, you can’t get back and have to start the process over. • The CD search uses profile searching, which we will discuss later in the course • The Format button will open a new window that will display results when the search is over.

BLAST results • Bar graph with matches; list of matches with E values (more next time) • Small and intermediate E values are most similar

FASTA algorithm • First, look for all identities between small “word” = ktup and every sequence in database. Ktup size determines how many letters must be identical (e.g., 3) • CTGCACTA • CTG • TGC • GCA • etc AGCTGACGCA CTG GCA

FASTA, cont • Ktup matches can be depicted in a matrix; diagonals indicate matches. For every library sequence, the 10 best diagonals are constructed from the ktup matches using a distance formula

FASTA, cont. • The top 10 diagonals are rescored using substitution matrices and identities smaller than ktup; each of these rescored diagonals is called an initial region.

FASTA, cont. • Initial regions are then joined using a joining penalty (like a gap penalty). The highest score of the joined initial regions is then the score for that library sequence. The library sequences are ranked by this score.

FASTA, cont. • In the last step of the FASTA algorithm, library sequences that scored above some threshold value are aligned with the query sequence and each other using alignment methods (dynamic programming) we will discuss later.

In-class exercise V • Get into seqlab, open bioinfI.list Select FunctionsDatabase Reference SearchingLookup • Select Search the chosen sequence libraries; select GenBank • In the all text box, type hsp70 • Select Run; do NOT close the green output window

In-class exercise, cont • Making sure nothing in main list is selected, toggle to editor (blank screen) • Select FileAdd Sequences FromDatabases. • Paint the name of the entry GB_BA1:BORHSP70 in the green output window with the left mouse button; click in the Database specification window with the left button, then click with the right button to paste. Select Add to main window.

In-class exercise • Close the Database browser window and the green output window • Select the name of the sequence in the editor; select functionsDatabase Sequence SearchingFASTA • Select Search set; then Add database sequences; then Primate under GenEMBL; then click on Add to search set; then close.

In-class exercise cont • If there is anything besides the primate library in the search set box, delete it; you can save the primate search set if you want to; then close the search set window. • Select Run on the FASTA window; this will take at least 5 minutes

Interpreting FASTA results • FASTA results are reported in a histogram of expected values compared to a random search set. We will discuss this more next time. • The bottom part of the histogram contains the matches of interest (lowest probability of being random)

Practical considerations in searches • FASTA: varying ktup • BLAST: varying S or E (w optimized by default) in advanced BLAST • Secondary consideration: gap initiation and extension penalties (usual is about 6 and 2 or 6 and 1)

PSI-BLAST • PSI-BLAST (position-specific-iterated BLAST) is an example of a search method that uses profiles, which we will discuss later in the course • PSI-BLAST starts off with a user-supplied query sequence, and a normal BLAST search is performed

If the user decides to iterate (e.g., if sequences with good E-values are returned in the first step), then the PSI-BLAST program aligns each returned sequence with the query sequence • The PSI-BLAST program stacks up or aggregates all these individual alignments, producing something that looks like a multiple alignment but is really not

From this stack of pairwise alignments, a scoring matrix is calculated by first finding the frequency of amino acids in each column, and then weighting that frequency to take into account very frequent and less frequent amino acids • Then the database is searched again, using this scoring matrix, which takes the variation at each sequence position of the query sequence into account • This procedure can be repeated many times to expand the sequences retrieved by the original query sequence

Caveats for PSI-BLAST • The results of a PSI-BLAST search can be skewed greatly by sequences that are recruited into the initial set; “greedy algorithm” • Different but related initial query may give different results (though there is likely to be overlap)

Other database search methods • SSEARCH: Based on dynamic programming algorithm; we will talk about this in two weeks. • We will talk about PHI-BLAST, BLOCKS, PROBE, BAYES ALIGNER, and other profile-based methods later in the course

Database Searching:BLAST and FASTA