Sequence Alignment


Two general methods for sequence alignment:
• Global alignment: considers similarity across the full extent of the sequences, e.g. MegAlign.
• Local alignment: focuses on regions of similarity in parts of the sequences only, e.g. the BLAST programs.
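To make the idea of a local alignment concrete, here is a minimal sketch in the style of the Smith-Waterman algorithm (a standard local-alignment method, not named on the slides). The scoring values (match = 2, mismatch = -1, gap = -2) and the function name are illustrative assumptions, not parameters used by any BLAST program.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b).

    Scores are illustrative assumptions. A cell never drops below 0,
    which is what lets the alignment start and end anywhere.
    """
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          h[i - 1][j] + gap,     # gap in b
                          h[i][j - 1] + gap)     # gap in a
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until the score falls to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("MNPLSSSGQPHTLM", "MNGPLSSSGQTSTSPH")
print(score)
print(top)
print(bottom)
```

Only the well-matching region is reported; the dissimilar ends of both sequences are left out, which is the practical difference from a global alignment.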

Questions:
• How similar are two sequences?
• What is the best alignment between the two sequences?
• How should alignments be scored?
• And, if gaps are allowed, how should they be scored?

Three things are required:
• a means of scoring matches and mismatches,
• a means of scoring gaps, and
• a method of using the two to evaluate numerous possible alignments.

Sequence 1 ALCPQCDIE
           ALC +CD+E
Sequence 2 ALCAKCDVE

Grouping of amino acids based on physico-chemical properties important in protein structures.

Commonly used substitution matrices are:
• Point Accepted Mutation (PAM) matrices, e.g. PAM250
• BLOcks SUBstitution Matrix (BLOSUM) matrices, e.g. BLOSUM62
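As a rough illustration of how a substitution matrix is used, the sketch below scores the ungapped alignment of ALCPQCDIE and ALCAKCDVE and rebuilds its match line. The dictionary holds only the few BLOSUM62 entries needed for this pair; the function names are hypothetical, and a real tool would load the full matrix.

```python
# A handful of BLOSUM62 entries: only the residue pairs this example needs.
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("L", "L"): 4, ("C", "C"): 9,
    ("D", "D"): 6, ("E", "E"): 5,
    ("A", "P"): -1, ("K", "Q"): 1, ("I", "V"): 3,
}


def pair_score(a, b):
    """Look up the (symmetric) substitution score for residues a and b."""
    key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
    return BLOSUM62_SUBSET[key]


def score_ungapped(seq1, seq2):
    """Return (total score, match line) for two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    total = 0
    marks = []
    for a, b in zip(seq1, seq2):
        s = pair_score(a, b)
        total += s
        if a == b:
            marks.append(a)      # identical residues: show the letter
        elif s > 0:
            marks.append("+")    # positive-scoring (conservative) substitution
        else:
            marks.append(" ")    # zero or negative score: leave blank
    return total, "".join(marks)


total, match_line = score_ungapped("ALCPQCDIE", "ALCAKCDVE")
print(total)       # 40
print(match_line)  # ALC +CD+E
```

The two "+" positions are Q/K and I/V, which score positively in BLOSUM62, whereas P/A scores negatively and is left blank.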

Gap penalties

Mutational events include not only substitutions but also insertions and deletions.
• Affine gap penalties impose an 'opening' penalty for starting a gap and a smaller 'extension' penalty for each additional position in an already opened gap, so lengthening an existing gap costs less than opening a new one.

Sequence 1 ALCPQCDIE
           ALC   D+E
Sequence 2 ALCA--DVE
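The effect of an affine penalty can be sketched in a few lines. The values below (opening = 10, extension = 1) are illustrative assumptions, and conventions differ between tools on whether the first gap position is charged the opening penalty alone or opening plus extension; this sketch assumes the former.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one contiguous gap of `length` positions.

    Assumed convention: the first gap position costs `gap_open`,
    every further position costs `gap_extend`.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# One gap of two positions versus two separate gaps of one position each.
one_long_gap = affine_gap_penalty(2)
two_short_gaps = 2 * affine_gap_penalty(1)
print(one_long_gap, two_short_gaps)  # 11 20
```

Under this scheme a single two-residue gap, as in the example above, is penalised less than two isolated one-residue gaps, which reflects the view that one insertion or deletion event often spans several positions.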

Sequence Search

Sensitivity versus speed
• FASTA looks for exactly matching 'words'.
• BLAST uses a scoring matrix.

BLAST (Basic Local Alignment Search Tool)
• The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity.
• They include a set of similarity search programs designed to explore all of the available sequence databases, regardless of whether the query is protein or DNA.
• The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits.
• Local alignment may produce more biologically meaningful and sensitive results.

Dynamic programming
• First described in the 1950s.
• First applied in this context by Needleman and Wunsch in 1970.
• Works by breaking the original problem into smaller and smaller subproblems until the subproblems have a trivial solution, and then using those solutions to construct solutions for larger and larger portions of the original problem.
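The subproblem idea can be sketched as a small global aligner in the style of Needleman and Wunsch. Each table cell holds the best score for aligning a prefix of one sequence with a prefix of the other, and is built from the three neighbouring, already-solved cells. The scoring values (match = 1, mismatch = -1, linear gap = -2) are illustrative assumptions, not values from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global alignment score, aligned a, aligned b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # trivial subproblems: prefix vs. nothing
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # pair residues
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ALCPQCDIE", "ALCAKCDVE"))
```

Filling the table takes time proportional to the product of the two sequence lengths, which is why exhaustive dynamic programming is too slow for searching whole databases and heuristic tools are used instead.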

All BLAST programs take the following steps:
• The query is divided into short, overlapping "words" of a fixed size (e.g. 3 for an amino acid sequence, 11 for a nucleotide sequence).
• Words with simple compositions are filtered out.
• The remaining words are searched for in the databases.
• After the best-matching sequences have been found for each word, each match is extended in both directions until the high-scoring segment pairs (HSPs) are found.
• HSPs are reported to the client.

Example:
Query: MNPLSSSGQPHTLM
Words (size 3): MNP NPL PLS LSS SSS SSG SGQ GQP QPH PHT HTL TLM
Database sequence: MNGPLSSSGQTSTSPH
Word hit: LSS, extended in both directions to PLSSSGQ
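The word-splitting and extension steps can be sketched as follows. This is a deliberate simplification and not the BLAST implementation: it accepts only exact word matches, skips the low-complexity filter, and stops extending at the first mismatch, whereas BLAST scores the extension with a substitution matrix. All function names are hypothetical.

```python
def split_into_words(query, word_size=3):
    """Return every overlapping word of length `word_size` in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def find_exact_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) for each exact word match."""
    hits = []
    for q_pos, word in enumerate(split_into_words(query, word_size)):
        s_pos = subject.find(word)
        while s_pos != -1:
            hits.append((q_pos, s_pos))
            s_pos = subject.find(word, s_pos + 1)
    return hits


def extend_hit(query, subject, q_pos, s_pos, word_size=3):
    """Extend a word hit left and right, without gaps, while residues are identical."""
    q_start, s_start = q_pos, s_pos
    while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
        q_start -= 1
        s_start -= 1
    q_end, s_end = q_pos + word_size, s_pos + word_size
    while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
        q_end += 1
        s_end += 1
    return query[q_start:q_end]


query = "MNPLSSSGQPHTLM"
subject = "MNGPLSSSGQTSTSPH"
print(split_into_words(query))
print(find_exact_hits(query, subject))
print(extend_hit(query, subject, 3, 4))  # word LSS -> PLSSSGQ
```

Starting from the word LSS (query position 3, subject position 4, counting from 0), the extension grows one residue to the left and three to the right and yields the segment PLSSSGQ.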

BLAST Programs
• BLASTN: compares a nucleotide query sequence against a nucleotide sequence database.
• BLASTP: compares an amino acid query sequence against a protein sequence database.
• BLASTX: compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
• TBLASTN: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
• TBLASTX: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
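The "all reading frames" used by the translated searches are the three frames of the given strand plus the three frames of its reverse complement. The sketch below shows one way to produce them, assuming the standard genetic code; the frame labels (+1 to +3, -1 to -3) and function names are illustrative choices, and this is not how BLAST itself is implemented.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; a trailing partial codon is ignored.

    Codons containing anything other than A, C, G, T become 'X';
    stop codons are written as '*'.
    """
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame label to protein translation."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCC").items():
    print(label, protein)
```

A translated search then compares each of these six protein strings with the other side of the search, so a coding region can be found even if its frame or strand is unknown.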

If your sequence is NUCLEOTIDE

Length | Database | Purpose | BLAST program
20 bp or longer | Nucleotide | Identify the query sequence | MEGABLAST (accepts batch queries), Standard BLAST (blastn)
20 bp or longer | Nucleotide | Find sequences similar to query sequence | Standard BLAST (blastn)
20 bp or longer | Nucleotide | Find similar proteins to translated query in a translated database | Translated BLAST (tblastx)
20 bp or longer | Protein | Find similar proteins to translated query in a protein database | Translated BLAST (blastx)
7-20 bp | Nucleotide | Find primer binding sites or map short contiguous motifs | Search for short, nearly exact matches

If your sequence is PROTEIN

Length | Database | Purpose | BLAST program
15 residues or longer | Protein | Identify the query sequence or find protein sequences similar to query | Standard Protein BLAST (blastp)
15 residues or longer | Protein | Find members of a protein family or build a custom position-specific score matrix | PSI-BLAST
15 residues or longer | Protein | Find proteins similar to the query around a given pattern | PHI-BLAST
15 residues or longer | Conserved Domains | Find conserved domains in the query | CD-search (RPS-BLAST)
15 residues or longer | Conserved Domains | Find conserved domains in the query and identify other proteins with similar domain architectures | Domain Architecture Retrieval Tool (DART)
15 residues or longer | Nucleotide | Find similar proteins in a translated nucleotide database | Translated BLAST (tblastn)
5-15 residues | Protein | Search for peptide motifs | Search for short, nearly exact matches

BLAST search examples
