Based on a Lecture by Gary Benson, Departments of Computer Science and Biology,

Searching for Similarities in Geneticand Proteomic SequencesDr. Jaume BacarditProf. N. Krasnogor Based on a Lecture by Gary Benson, Departments of Computer Science and Biology, Boston University, given at the 2003 ISSCB. Examples from P.A. Pevzner’s “Computational Molecular Biology”, D.A.Krane & M.L. Raymer’s “Fundamental Concepts of Bioinformatics” and from C. Gybas & P.Jambeck’s “Developing Bioinformatics Computer Skills”

(i,j)=(Y,X) • Initiate first raw and column to multiples of the gap • For position (2,2) either: • Gap in 1st sequence • Gap in 2nd sequence • An alignment • So three options: • Use (2,1) and add a gap penalty, representing a gap along the left sequence • Use (1,2) and add a gap penalty, representing a gap along the top sequence • Use(1,1) and add a (mist)match score, representing an alignment • Pick the one resulting in the maximum value • Repeat filling tableaux

G G TCG TAG - TCG GTAG AC - -TCG ACAGTAG

Alignment Examples Using ALIGN (I) Alignment of two closely related proteins: two hemoglobins from a sea lamprey and from hagfish

Alignment Examples Using ALIGN (II) Alignment of two distantly related proteins: two hemoglobins from a sea lamprey and from rice

Alignment Examples Using ALIGN (II) Alignment of two unrelated proteins: a lamprey hemoglobin and a human retinol binding protein

Local Sequence Alignment • LCS and GSA seek similarities among entire strings • Useful when similarity is expected to extend over the entire strings, e.g., protein sequences of the same protein family • These protein sequences are well conserved even among different species having almost the same length among humans vs fruit flies. • There are other applications where the alignment score of some substrings of the query pair can be better than the alignment of the whole query pair. • Homeobox appear in many species with large variations but with a highly conserved region called homeodomain. • How do we find the common region ignoring the variable region?

Local Alignment Problem Find the best local alignment between two strings Input:Strings X,Y and a scoring matrix δ Output:Substrings of X and Y whose global alignment score (as defined by δ) is maximal among all possible alignments of all substrings of X and Y

Consider AACCTATAGCT and GCGATATA • Using GSAP algorithm we would get: • A A C – C T A T A G C T • - G C GA T A T A - - - • This is not too good an alignment as the first 5 positions and the last 3 are gaps or mismatches • However the TATA box has been identified • We can use Smith-Waterman to only identify the commonalities ignoring gaps and mismatches at the beginning and/or end

0 si-1,j + δ(Xi,-) si,j-1 + δ(-,Yj) si-1,j-1 + δ(Xi,Yj) Smith-Waterman Algorithm si,j=max • Dynamic Programming • Modification to the LCS & GSAP algorithms

0 si-1,j -1 si,j-1 -1 si-1,j-1 + 1, if Xi=Yj si-1,j-1 - 1, if Xi ≠Yj si,j=max • That is, we place a zero in any position in the table if all other considerations result in scores smaller than zero. • Find the maximum partial alignment score and work backwards as before until one reaches a zero

LALIGN, BLAST • local alignment method • gives multiple alternative alignments • defines the gap penalty as q + r (k-1). Sources: • http://fasta.bioch.virginia.edu/fasta/lalign.htm • www.ch.embnet.org/software/LALIGN_form.html • www-bio.unizh.ch/cgi-bin/man-cgi?lalign

Approximate alignment methods: Introduction The dynamic programming method for sequence alignment finds always the best alignment between two sequences However, it is very costly to compute. The time spent by the method grows proportionally to n·m Imagine a situation where we would like to align a query sequence against a whole database containing millions of sequences !!! An alternative to these methods are the approximate methods, using some kind of heuristic. These methods are not guaranteed to find the optimal alignment, specially for distant relationships, but will work well most of the time, and are significantly faster.

Approximate methods: k-tuple methods • One family of approximate methods are the k-tuple methods • These methods take a query sequence and construct all possible subsequences of size k (k=3 for protein sequences) and search these substrings in the database • The aim of this step is to thin out the list of candidate strings • Afterwards, the method takes the rest of sequences that had a good score in these small alignments of size k and attempts to extend the alignment as much as possible • This step is costly, but because it is only performed on a subset of the sequences in the database, the overall method is very efficient

Approximate methods • BLAST (Basic Local Sequence Alignment Tool) is the most well-known of the approximate local alignment methods • The expression “to blast” has become a synonym of “to align” • The initial BLAST method was proposed in 1990, and it could only produce alignments without gaps • Alignments with gaps were possible by combining several non-gapped alignments identified for the same database sequence, but only if all of them were identified • A new version was proposed in 1997, that could also explicitly produce alignments with gaps • The gapped blast can be ~3 times faster than the original blast, and more than 100 times faster than dynamic programming

Approximate methods. Steps of BLAST • Step 1: Make a list of 3 letter words of Protein seq. MATCHES MAT ATC TCH CHE HES • Step 2: Align each word (without gaps) with all possible 3-letter words. That is, make 8000 alignments (20x20x20)

Approximate methods. Steps of BLAST • Step 3: Compute the score (similarity score) of all these 8000 alignments using the BLOSUM62 substitution matrix. • Discard those that have a score less than T • T = neighborhood word score threshold • This reduces the list from 8000 to ~50 • Step 4: Repeat this process for all the words in the query protein • For a 250 aa seq. => 12,500 words

Approximate methods. Steps of BLAST • STEP 5. Scan database for exact matches of words  hits • STEP 6. Ungapped aligments • This step is performed if, for the same sequence in the database, two non-overlapped hits are found in the same diagonal of the dotplot that are less than A residues apart (usually A=40) • If this situation is found, the alignment is extended (without gaps) starting with these hits as long as the score is at most Xu units less than the best score found so far

Approximate methods. Steps of BLAST • Step 7. Gapped alignments • The dynamic programming method is used for the gapped alignments • Because the exact method is very costly, BLAST performs very few of these gapped alignments • It will only be activated if in the previous step an alignment with a normalized score of at least 20 bits is found • Exploration will start in the central residue of the length-11 segment with highest alignment score within the high scoring alignment  the seed • The trajectory will be extended from the seed to all possible directions, but any direction that has an score Xg units less than the best score found so far will be dropped

Approximate methods: Evaluating BLAST results • A BLAST search in a sequence database might produce hundreds of candidate alignments. • How to know which are meaningful, i.e. homologous? • BLAST provides several metrics: • Raw scores • Bit scores • E-values

Approximate methods: Evaluating BLAST results • Raw scores: the sum of the scores of the maximal-scoring hits that makes up the alignment. Differences between scoring matrices raw scores are not directly comparable • Bit scores: these are raw scores that have been converted from the log base of the scoring matrix that creates the alignment to log base 2. This rescaling allows bit scores to be comparable. • E-scores: is the likelihood that a given sequence alignment is significant. The e-value indicates the number of alignments one expects to find with a score equal or greater to the given one in a search against a random database. • Large e-value (5 or 10) indicates that the alignment is probably by chance • E-values of 0.1 or 0.05 are typical cuttoff values for data base search • Proteins with less than 25% similarity are not similar enough for a reliable BLAST analysis and structural comparison must be used.

Basic Local Alignment Sequence Tool Advantages: • very good for searching large databases Disadvantages of BLAST: • Needs islands of strong homology • The variants blastx, tblastn, tblastx use 6 frame translations and will miss sequences with frameshifts • Finds ONLY local alignments

Multiple Sequence Alignments • The goal of protein sequence comparison is to discover structural or functional similarities among proteins • Biologically similar proteins might not be too similar sequence-ways but we still want to detect whatever similarity there is • Pairwise similarity is not sensitive enough to detect weak commonalities but multiple alignments oftentimes allows to detect weak similarities.

Multiple Alignment of three Sequences • The algorithm for multiple sequence alignment is a generalization of the algorithm for LAP • The scoring matrix δ for a three-wise alignment would be a 3-dimensional array that will specify the scores and penalties associated to combinations of 3 symbols either from the sequences or the gaps. • The tableaux to be filled will also be 3-d

To get cell (i,j,k) in the tableaux you could use any one of the following predecessors:

The recurrence relation becomes: (1) (2) (3) (4) (5) (6) (7) Remember that branch 7 really gives you several alternatives depending on whether: v=w=u, v=w != u, v=u !=w, w=u !=v, v!=w!=v , each with its own (possibly distinct) score!

The running time , in the case of k sequences, is O((2n)k) • Many heuristics are, in practice, used as to reduce this time: • Compute all (K,2) optimal pairwise alignments and then combine them together in such a way that pairwise alignments induced by the multiple alignment are close to the optimal one. • Not always possible to do

Another heuristic: • Progressive Multiple Alignment: • Choose a pair of strings with strong similarity • Produce a strong alignment based in these 2 strings • Iteratively add the remaining k-1 sequences into the alignment • ClustalW is an example of a famous program that does something similar

To effectively solve MSA one needs to resort • to a variety of heuristics (1) at the biology level • and (2) at the search & optimisation level. • At the Biology Level: • Phylogenetic Trees • At the Search & Optimisation Level: • Genetic Algorithms • Find our How ClustalW works • (self Study)

Some Help for Self-Study

How ClustalW works Based on Progressive Pairwise Alignment (PPA) • Perform pair-wise alignment of all the sequences • Use FASTA or full Dynamic Programming • Use the alignment scores to produce a phylogenetic tree • Genetic distance between sequences is computed: the number of mismatched positions in an alignment divided by the total number of matched positions (positions oposite a gap are not scored) • Using the genetic distance matrix compute a tree • Align the sequences sequentially using dynamic programming algorithm guided by the phylogenetic relationship indicated by the tree

When to use ClustalW ? ClustalW performs well when: • aligning sequences of similar lengths • aligning small to large protein families of similar sequences • few divergent sequences may be included

Sources for ClustalW • http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html • http://www.bionavigator.com • http://www2.ebi.ac.uk/clustalw/ • http://bioweb.pasteur.fr/intro-uk.html • http://pbil.ibcp.fr/ • http://www.clustalw.genome.ad.jp/

How Dialign works ? • Local alignment approach • identify gap-free fragments (called diagonals) of high similarity • built segments into multiple alignment using an iterative approach • works with DNA and proteins

When to use Dialign ? Dialign performs well when: • sequences have long terminal extensions • sequences have large insertions • useful for finding conserved blocks within a set of sequences

Sources for Dialign http://www.hgmp.mrc.ac.uk/ http://mep.bio.psu.edu/alignment.html http://genomatix.gsf.de/ http://bibiserv.techfak.uni-bielefeld.de/

Sequence Alignment by Genetic Algorithms (SAGA) • The Genetic Algorithm is a stochastic heuristic search algorithm • It is loosely based on the concept of evolution by natural selection • Evolving genotypes represent solutions to a problem • As evolution progresses genotypes evolve towards better solutions • SAGA was introduced by Notredame and Higgins in 1996

The Algorithm • An initial population of size 100 is generated with sequences randomly aligned to have up to 50 residues long subsequences (out of 200) aligned. The ends are padded with gaps: xxxxxxxxxxxxx-------------- -----xxxxxxxxxxxxxxxxxxx -------xxxxxxxxxxxxxxx---- ---xxxxxxxxxxxxxxxxxx--

Each individual, representing a MSA, is scored using the sum of pairs method with both natural and quasi natural gap scoring schemes. Standard amino acid scoring matrices and gap opening and extension penalties are used by SAGA • The population is ranked and the lowest half (better half) is copied unchanged into the next generation. From the upper half (worst half) individuals are chosen randomly with probabilities inversely proportional to their fitnesses • The selected individuals undergo mutation with a probability Pm. • From the mutated individual new offspring are generated using a crossover operator • The next generation compose of the 1st half and the newly created individuals are evaluated as in step 2. • Steps 2 to 6 is typically repeated for 100 generations. • The entire GA is repeated several time and the overall best MSA is returned

The Mutation Operators • The sequences themselves are not changed • Gaps are inserted and rearranged in an attempt to obtain better scoring MSA. • Using an estimated phylogenetic tree sequences are divided into two groups and gaps of random length are inserted into random positions in the alignment. • A Memetic & SA versions exist were hill-climbing takes place: after the division is made gaps are inserted in each group as to maximize the score.

Other mutations exist: xxx--xxxxx xx--xxxxxx xxx--xxxxx xxxxx--xxx xxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx xx--xxxxx x--xxxxxxx xxx--xxxxx xx-xx-xxxx xxxxxxxxx xxxxxxxxxx xxxxxxxxxx xxxxxxxxxx Starting Block Whole Block Move (L) Split Block Horiz Split Block Vert

Recombination: xxGxxxxDxx xxGxx-xDxx xxGxx-xDxx xxGx-xxDxx xxGxxxxDxx xxGxxxxDxx xxGxx-xDxx xxGxxxxDxx xxGxxxxDxx xxGxxxxDxx xxGx-xxDxx xxGx-xxDxx Parent B Alignment Parent A Alignment Child

Exercises • Familiarise yourself with: • ALIGN • BLAST • FASTA • CLUSTALW • Install in your pc, use through the web, get to know the format for input/output

Based on a Lecture by Gary Benson, Departments of Computer Science and Biology,

Based on a Lecture by Gary Benson, Departments of Computer Science and Biology,

Presentation Transcript

Connections between Computer Science and Biology

A Model Academic Ethics and Integrity Policy for Computer Science Departments

Based on the Novel, Hatchet, by Gary Paulsen

Lecture 2: Computer Architecture: A Science ofTradeoffs

Science and Chemistry of Biology

Biology Blended with Math and Computer Science

Lecture 5 of Computer Science II

Lecture on Computer Science as a Discipline

By: Dee Benson

The field of science in which biology, computer science, and information technology

Computational Biology and Bioinformatics in Computer Science

Based on a Lecture by Gary Benson, Departments of Computer Science and Biology,

Challenges for computer science as a part of Systems Biology

Computer Science 101 A Survey of Computer Science

Where Computer Science meets Biology

Computer Science 101 A Survey of Computer Science

Lecture 5 of Computer Science II

Lecture 5 of Computer Science II

Material taken (1) from a Lecture by Gary Benson, Departments of Computer Science and Biology,

Lecture 7 of Computer Science II

Panagiotis Papapetrou*, Gary Benson ** and George Kollios * *Department of Computer Science

BIOLOGY DEPARTMENT @CSI csi.cuny/departments/biology