Applied Bioinformatics

Applied Bioinformatics Week 4

Similarity Searching • Heuristic Algorithms • FASTA • BLAST

FASTA Algorithm • Algorithm? • Pearson and Lipman 1988

FASTA - Algorithm - High level algorithm
Let q be a query
max ← 0
For each sequence s in DB
    compare q with s and compute a score y
    if max < y
        max ← y; bestSequence ← s
Return bestSequence
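The high-level loop above can be sketched in Python. This is only an illustration: the names `best_sequence` and `shared_kmers` are hypothetical, and the toy score (number of distinct shared k-tuples) is an assumption standing in for the real FASTA score described on the following slides.

```python
def shared_kmers(a, b, k=2):
    """Toy similarity score (assumption): number of distinct k-tuples shared by a and b."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(a) & kmers(b))


def best_sequence(query, database, score=shared_kmers):
    """Scan the database and keep the sequence with the highest score."""
    best, best_score = None, 0          # max starts at 0, as on the slide
    for s in database:
        y = score(query, s)
        if best_score < y:
            best, best_score = s, y
    return best, best_score


db = ["TTTT", "ACCGCGATGACGAATA", "ACGT"]
hit, hit_score = best_sequence("ACCGCGACCCTGACGAATA", db)
```

Note that if no database sequence scores above 0, the sketch returns `None` as the best sequence.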

FASTA Hashing • Hashing based on k in k-tuple (sequence of size k) • k can range from 1 character to n characters, e.g.
   0123456789012345678
Q: ACCGCGACCCTGACGAATA
D: ACCGCGATGACGAATA

Second Step (diff)

Third Step (Freq Dist) • Calculate frequency distribution • Histogram • Find most frequent offset • Shift query against sequence by that offset • Outcome: exact matches
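The hashing, difference, and frequency-distribution steps can be sketched as follows. The function names `kmer_index` and `offset_histogram` are hypothetical, and the offset convention (query position minus database position) and the default k = 2 are assumptions of this sketch.

```python
from collections import Counter, defaultdict


def kmer_index(seq, k):
    """Hash table: k-tuple -> list of start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def offset_histogram(query, target, k=2):
    """Count, for every offset (query pos - target pos), the matching k-tuples."""
    index = kmer_index(query, k)
    hist = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            hist[i - j] += 1
    return hist


Q = "ACCGCGACCCTGACGAATA"
D = "ACCGCGATGACGAATA"
hist = offset_histogram(Q, D, k=2)
best_offset = max(hist, key=hist.get)
```

For the Q and D shown on the hashing slide with k = 2, the most frequent offset is 3 (8 matching 2-tuples), followed by offset 0 (7 matching 2-tuples). Shifting D by 3 positions lines up the shared TGACGAATA stretch.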

FASTA - Heuristic - • Heuristic: a good local alignment should contain some exactly matching subsequence. FASTA focuses on these regions.

FASTA - Algorithm - • Step 1 Find all hot-spots (remember hashing) // Hot spots are pairs of words of length k that match exactly [Figure: hot spots between Sequence 1 and Sequence 2]

FASTA - Algorithm - • Step 2 Score the hot-spots and locate the ten best diagonal runs.
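A simplified sketch of Step 2 follows. It groups hot spots by diagonal and scores each run of hot spots on a diagonal. The scoring scheme here (a fixed `hit_score` per hot spot and a `spacing_cost` per missing hot spot between two neighbours) is an assumption chosen for illustration, not the exact scoring used by the FASTA program, and all function names are hypothetical.

```python
from collections import defaultdict
import heapq


def hot_spots(query, target, k):
    """All (query_pos, target_pos) pairs where a k-tuple matches exactly."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return [(i, j) for j in range(len(target) - k + 1)
            for i in index.get(target[j:j + k], ())]


def best_diagonal_runs(query, target, k=2, hit_score=2, spacing_cost=1, top=10):
    """Return up to `top` runs as (score, diagonal, query_start, query_end)."""
    by_diag = defaultdict(list)
    for i, j in hot_spots(query, target, k):
        by_diag[i - j].append(i)

    runs = []
    for diag, starts in by_diag.items():
        starts.sort()
        best = (0, None, None)
        score, run_start, prev = 0, None, None
        for i in starts:
            if prev is not None:
                score -= spacing_cost * (i - prev - 1)   # pay for skipped positions
            if prev is None or score <= 0:
                score, run_start = 0, i                  # start a new run here
            score += hit_score
            prev = i
            if score > best[0]:
                best = (score, run_start, i + k)
        runs.append((best[0], diag, best[1], best[2]))
    return heapq.nlargest(top, runs)


runs = best_diagonal_runs("ACCGCGACCCTGACGAATA", "ACCGCGATGACGAATA")
```

With these assumed parameters, the best run for the two example sequences lies on diagonal 3 and the second best on diagonal 0.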

FASTA - Algorithm - • Step 3 Combine sub-alignments into one alignment with gaps [Figure: local sub-alignments joined by gaps]

FASTA - Algorithm - • Step 4 Consider a weighted directed graph. Let each node be a sub-alignment found in step 1. Let u and v be nodes. Edge (u,v) exists if sub-alignment u comes before v in the sequence. Each edge carries a gap penalty (negative weight). Find the maximum-weight path. [Figure: sub-alignments as nodes along one sequence, connected by edges]

FASTA - Algorithm - • Step 4 in detail [Figure: sub-alignments joined by gap edges weighted -5, -3, -3; maximum-weight path]
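The maximum-weight path over sub-alignments can be found with a short dynamic program, sketched below. The tuple layout `(q_start, q_end, t_start, t_end, score)` with half-open coordinates, the function name `best_chain`, and the constant `join_penalty` per edge are assumptions of this sketch. The figure shows different penalties per edge, which a real implementation would derive from the gap being bridged.

```python
def best_chain(segments, join_penalty=3):
    """Return (weight, path) of the best chain of compatible sub-alignments.

    An edge u -> v exists when u ends before v starts in both sequences.
    Path weight = sum of node scores minus join_penalty per edge.
    """
    if not segments:
        return 0, []
    # Process nodes in order of query start so predecessors are handled first.
    order = sorted(range(len(segments)), key=lambda n: segments[n][0])
    best, prev = {}, {}
    for pos, v in enumerate(order):
        q_start, _, t_start, _, score = segments[v]
        best[v], prev[v] = score, None
        for u in order[:pos]:
            _, u_q_end, _, u_t_end, _ = segments[u]
            if u_q_end <= q_start and u_t_end <= t_start:
                candidate = best[u] - join_penalty + score
                if candidate > best[v]:
                    best[v], prev[v] = candidate, u
    # Trace back from the best end node.
    node = max(best, key=best.get)
    weight, path = best[node], []
    while node is not None:
        path.append(node)
        node = prev[node]
    return weight, path[::-1]


segments = [(0, 7, 0, 7, 12), (10, 19, 7, 16, 16), (6, 8, 0, 2, 4)]
weight, path = best_chain(segments)
```

In this example the chain of segment 0 followed by segment 1 wins: 12 - 3 + 16 = 25.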

FASTA - Algorithm - • Step 5 Use dynamic programming in a restricted band around the best-scoring alignment to find the final best-scoring alignment. The width of this band is a parameter.
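Step 5 can be illustrated with a banded local alignment (Smith-Waterman restricted to a band). In this sketch, `diagonal` is the query index minus the target index of the band centre and `width` is the half-width of the band. Those names, the function name `banded_local_score`, and the match/mismatch/gap values are all assumptions for illustration.

```python
def banded_local_score(query, target, diagonal, width,
                       match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |(i - j) - diagonal| <= width."""
    prev = {}                     # scores of the previous row, keyed by column j
    best = 0
    for i in range(1, len(query) + 1):
        cur = {}
        lo = max(1, i - diagonal - width)
        hi = min(len(target), i - diagonal + width)
        for j in range(lo, hi + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + s,    # diagonal move
                        prev.get(j, 0) + gap,      # gap in target
                        cur.get(j - 1, 0) + gap)   # gap in query
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


Q = "ACCGCGACCCTGACGAATA"
D = "ACCGCGATGACGAATA"
ungapped = banded_local_score(Q, D, diagonal=3, width=0)
banded = banded_local_score(Q, D, diagonal=3, width=3)
```

With a band of width 0 only the diagonal itself is scored (9 matches). Widening the band to 3 lets the alignment also pick up the 7 leading matches on diagonal 0 at the cost of three gap positions, giving 7 + 9 - 6 = 10.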

FASTA - Algorithm - • Summary of Algorithm
1: Find all hot-spots // Hot spots are pairs of words of length k that match exactly
2: Score the hot-spots and locate the ten best diagonal runs.
3: Combine sub-alignments into one alignment.
4: Score each alignment with gap penalties and pick the best-scoring alignment.
5: Use dynamic programming in a restricted band around the best-scoring alignment to find the final best-scoring alignment.

FASTA Rumors • Is said to be more sensitive for nucleotide sequences than BLAST • Is supposed to be slower than BLAST • Is mostly found in European institutes • Whose job is it to confirm or reject these assumptions?

End Theory I • Mind mapping • 10 min break

Practice I

FASTA Hashing • Apply the FASTA hashing algorithm to the following two sequences • AGTATGTGATGTAGAT • TGATG • Show the histogram • Interpret the histogram in context of the two sequences

FASTA Query • Select a nucleotide sequence of your interest • Copy the first 100 nucleotides into a text file and add a definition line to turn it into FASTA format • Copy and paste the sequence 10 times and remove 10 nt each time • Change the definition line accordingly • Outcome: 10 sequences (10 .. 100 nt)

FASTA Query • Copy the 10 sequences again and add them to the end of the file • Change the definition lines • Add mutations (substitutions) to the sequences

End Practice I • 15 min break

Similarity Searching • Heuristic Algorithms • FASTA • BLAST

BLAST - Heuristic - • Another heuristic algorithm • Heuristic, but the result is evaluated statistically • Homologous sequences are likely to contain a short high-scoring word pair, a hit • BLAST tries to extend a hit on both sides to get larger matches [Figure: sequence A T T A G … with a short high-scoring word marked]
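Extending a hit on both sides can be sketched as an ungapped extension that stops once the running score falls too far below the best score seen so far. The function name `extend_hit`, the drop-off parameter `x_drop`, and the match/mismatch values are assumptions of this sketch, not values taken from the slides.

```python
def extend_hit(query, subject, q_pos, s_pos, word_len,
               match=1, mismatch=-1, x_drop=3):
    """Extend an exact word hit without gaps; return (score, q_start, q_end)."""
    def pair_score(a, b):
        return match if a == b else mismatch

    def extend(q, s, step):
        # Walk away from the seed; remember the best score and how far it reached.
        best = running = best_len = n = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += pair_score(query[q], subject[s])
            n += 1
            if running > best:
                best, best_len = running, n
            elif best - running > x_drop:
                break
            q += step
            s += step
        return best, best_len

    seed = sum(pair_score(query[q_pos + n], subject[s_pos + n])
               for n in range(word_len))
    right, right_len = extend(q_pos + word_len, s_pos + word_len, +1)
    left, left_len = extend(q_pos - 1, s_pos - 1, -1)
    return seed + left + right, q_pos - left_len, q_pos + word_len + right_len


# Seed word "TTA" at position 3 in both sequences.
result = extend_hit("AAATTAGCCC", "GGGTTAGTTT", q_pos=3, s_pos=3, word_len=3)
```

Here the seed TTA is extended by one matching G to the right and not at all to the left, so the reported segment is `query[3:7]`, that is TTAG, with score 4.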

BLAST - Algorithm - Neighbourhood Word • Step 1: pre-processing the query Compile the list of short high-scoring words from the query. The length of a query word, w, is 3; the scoring threshold T is 13.

BLAST - Algorithm - • Step 1 – 2 Create neighbourhood words for each query word [Figure: a query word and its neighbourhood words]
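Building the neighbourhood of a query word can be sketched by enumerating all words of the same length and keeping those that score at least T against the query word. The names `query_words` and `neighbourhood`, and the toy nucleotide scoring (+5 match, -4 mismatch) with T = 6, are assumptions for illustration. A protein search with w = 3 and T = 13 would instead take pair scores from a substitution matrix.

```python
from itertools import product


def query_words(query, w):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighbourhood(word, alphabet, pair_score, threshold):
    """All words of the same length scoring >= threshold against `word`."""
    return sorted(
        "".join(candidate)
        for candidate in product(alphabet, repeat=len(word))
        if sum(pair_score(a, b) for a, b in zip(word, candidate)) >= threshold
    )


def toy_score(a, b):
    # Assumed toy scoring, not a real substitution matrix.
    return 5 if a == b else -4


words = query_words("ACGTA", 3)
neighbours = neighbourhood("ACG", "ACGT", toy_score, threshold=6)
```

With this toy scoring, the neighbourhood of ACG contains the word itself plus the nine words that differ from it in exactly one position. Raising the threshold to 15 leaves only the exact word.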

BLAST - Algorithm - • Step 2: Scanning DB For each word list, identify all exact matches with DB sequences. The purpose of Steps 1 and 2 is the same as in FASTA. [Figure: Step 1 builds the neighbourhood word list from each query word; Step 2 matches it against Sequence 1 and Sequence 2 in the DB]

Statistical Assessment • Combine matches • Calculate statistics for each alignment • Bit Score • E-value • Report results
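The bit score and E-value are commonly computed from the raw alignment score with the Karlin-Altschul formulas, sketched below. The statistical parameters `lam` (lambda) and `k` depend on the scoring system and are passed in by the caller. The function names and the values used in the example are arbitrary assumptions, not parameters of any particular BLAST run.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalised score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative numbers only.
bits = bit_score(raw_score=50, lam=0.3, k=0.1)
expect = e_value(bits, query_len=100, db_len=1_000_000)
```

A higher bit score gives a smaller E-value, and a larger database gives a larger E-value for the same bit score, which is why the same alignment can look more or less significant depending on the database searched.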

FASTA vs. BLAST • BLAST compares the query with the sequences in the DB using the same threshold. • FASTA compares the query with each sequence one by one and then compares the results. • What does this mean? [Figure: query compared against DB]

End of Theoretical Part 2 • 5 min mind mapping • 10 min break

Practical Part 2

EBI FASTA • Use the FASTA file you created before • Run your query on EBI using the fasta algorithm with the default settings • Change the settings and keep track of which settings you use and the number of queries that have the correct result as the top hit • Use Excel (settings, %correct)

NCBI BLAST • Use the FASTA file you produced before and repeat with NCBI BLAST the analysis you did for EBI fasta • Use blastn • Select the proper database • Finish EBI FASTA if you couldn't before

Homework Find a query that you can find with FASTA but not with BLAST and vice versa Submit the queries to bioinformatics@allmer.de
