Some frequently-used Bioinformatics Tools

Konstantinos Mavrommatis, Prokaryotic Superprogram

Outline
• Pairwise Alignment
  • Global/Local, Scoring
  • BLAST, BLAT, SIM, LALIGN, Dotlet, Ublast
• Multiple Sequence Alignment
  • ClustalW, Kalign, MAFFT, Muscle, T-Coffee, MSA, DIALIGN, Match-Box, Multalin, MUSCA
• Phylogenetic analysis and tree construction
  • BIONJ, DendroUPGMA, PHYLIP, PhyML, Phylogeny.fr, POWER, BlastO, TraceSuite II
• HMM
  • Protein family profiles
http://expasy.org/tools/

Alignment
• Find an arrangement of two sequences that identifies regions of similarity.
• Insert gaps (spaces) at arbitrary positions so that both sequences end up with the same length and no column has a gap in both sequences.

Alignment methods: Dot plots
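A dot plot can be sketched in a few lines: one sequence goes along each axis, and a cell is marked wherever the residues (or short windows of residues) are identical. The function name `dot_plot`, the `window` parameter and the test sequence below are illustrative assumptions and are not taken from any particular tool such as Dotlet.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return one text row per position in seq_a.

    A '*' marks a position pair whose windows of length `window`
    are identical; '.' marks everything else.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# Comparing a sequence with itself gives a filled main diagonal.
for line in dot_plot("GATTACA", "GATTACA", window=3):
    print(line)
```

Larger windows suppress the background of chance single-residue matches, so similar regions stand out as diagonal runs.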

Global vs Local alignment
• Global alignment: assumes the two sequences are basically similar over their entire length and aligns them end to end.
• Local alignment: searches for segments of the two sequences that match well.
• It may seem that one should always use local alignments. However, each has its application.

Substitution matrices http://www.russelllab.org/aas/

Scoring an alignment
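As a minimal sketch of how a gapped pairwise alignment is scored, the function below walks the alignment column by column and adds up a per-column score. The flat match/mismatch/gap values are an assumption made to keep the example short; in practice the match and mismatch scores would come from a substitution matrix. The name `score_alignment` and the toy rows are hypothetical.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Sum column scores for two gapped rows of equal length.

    Assumes a flat scoring scheme; '-' denotes a gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column may not contain two gaps")
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


# Three matching columns and two gap columns: 3 * 1 + 2 * (-1)
print(score_alignment("ACGT-", "A-GTT"))
```

The two checks at the top enforce the definition of an alignment: both rows have the same length, and no column holds a gap in both rows.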

Global alignment
S1 = HGSAQVKGHG
S2 = KTEAEMKAESEDLKKHGT

S2: KTEAEMKAESEDLKKHGT
S1: --HG--SA--Q-VKGHG-
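A global alignment like the one above is typically computed by dynamic programming (the Needleman-Wunsch algorithm). The sketch below assumes a flat scheme of +1 for a match and -1 for a mismatch or gap, which is a simplification of matrix-based scoring, so the alignment it returns need not be the one shown on the slide. The function name is illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a flat scoring scheme (an assumption).

    Returns (score, gapped_a, gapped_b) for one optimal alignment.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("KTEAEMKAESEDLKKHGT", "HGSAQVKGHG")
print(best)
print(top)
print(bottom)
```

Because every residue of both sequences must appear in the result, the shorter sequence is padded with gaps along its whole length, which is the behaviour that distinguishes global from local alignment.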

Local Alignment
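Local alignment can be sketched with the Smith-Waterman variant of the same dynamic programming idea: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The flat +1/-1/-1 scoring and the function name are assumptions for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment with a flat scoring scheme (an assumption).

    Returns (score, gapped_segment_a, gapped_segment_b).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("HGSAQVKGHG", "KTEAEMKAESEDLKKHGT"))
```

Only the best-matching segments are reported; the unrelated flanking residues that a global alignment would have to pad with gaps are simply left out.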

How BLAST works
Query:              MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG
Subject (database): VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG
• Find a common 3-mer between query and subject.
• Extend the match in both directions, e.g. to GCQSQCGG.
• Report the extended match as an HSP (high-scoring segment pair):

Score = 66.6 bits (161), Expect = 3e-12, Method: Compositional matrix adjust.
Identities = 32/53 (60%), Positives = 39/53 (74%), Gaps = 0/53 (0%)

Query  6   ILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGG  58
           ++   L   SY  QCG++AGGALC   LCCS++G+CGST  YCG GCQSQCGG
Sbjct  15  VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG  67
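The seed-and-extend idea can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: seeds here are exact shared 3-mers (BLAST also accepts similar "neighborhood" words scored with a substitution matrix), the extension is ungapped with a flat +1/-1 score, and the `xdrop` cutoff value is an arbitrary assumption. All function names are hypothetical.

```python
from collections import defaultdict


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) for every exactly shared k-mer."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, qi, sj, k=3, match=1, mismatch=-1, xdrop=5):
    """Ungapped extension of one seed in both directions.

    Extension stops when the running score falls more than `xdrop`
    below the best score seen so far, or when a sequence ends.
    Returns (score, query_segment, subject_segment).
    """
    def walk(q, s, step):
        score = best = 0
        length = best_len = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > xdrop:
                break
            q += step
            s += step
        return best, best_len

    right_score, right_len = walk(qi + k, sj + k, +1)
    left_score, left_len = walk(qi - 1, sj - 1, -1)
    total = k * match + left_score + right_score
    q_seg = query[qi - left_len:qi + k + right_len]
    s_seg = subject[sj - left_len:sj + k + right_len]
    return total, q_seg, s_seg


query = "MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG"
subject = "VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG"
seed = (query.find("GCQ"), subject.find("GCQ"))
print(seed in find_seeds(query, subject))
print(extend_seed(query, subject, *seed))
```

With this flat scoring the extended segment will not necessarily match the HSP boundaries BLAST reports, since BLAST scores extensions with a substitution matrix and applies its own statistics.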

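Types of Blast
Program   Query                                Database
blastn    nucleic acid                         nucleic acid
blastp    protein                              protein
blastx    nucleic acid (6-frame translation)   protein
tblastn   protein                              nucleic acid (6-frame translation)
tblastx   nucleic acid (6-frame translation)   nucleic acid (6-frame translation)
Example nucleic acid sequence: atcgatatatatagactgactgact
Example protein sequence: MTAVYHILRALRARARVARARVH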
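The 6-frame translation used by blastx, tblastn and tblastx can be sketched as below: three reading frames on the given strand plus three on its reverse complement. The sketch assumes the standard genetic code and plain A/C/G/T input; the function names are illustrative, and codons containing other characters are rendered as "X" by this sketch's own convention.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Reverse-complement an A/C/G/T string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Frames +1, +2, +3 followed by -1, -2, -3."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


for frame in six_frame_translation("atcgatatatatagactgactgact"):
    print(frame)
```

Translating both query and database in all six frames is what makes tblastx the most expensive of the five programs.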

Exact multiple alignment by dynamic programming
• Complexity = $O(N^{S} \, 2^{S} \, S^{2})$
• N: length of the sequences
• S: number of sequences
• Only feasible for 4-5 sequences at most.
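To see why exact multiple alignment stops being practical so quickly, the growth can be tabulated directly. The sketch below reads the complexity on this slide as N^S * 2^S * S^2 (an assumption about the slide's formula) and uses an arbitrary sequence length of 100; the counts are orders of magnitude, not measured run times.

```python
def msa_dp_operations(n, s):
    """Rough operation count N^S * 2^S * S^2 for exact DP alignment.

    n: sequence length, s: number of sequences.
    """
    return n ** s * 2 ** s * s ** 2


# Sequence length 100 is an arbitrary illustrative value.
for s in range(2, 7):
    print(s, f"{msa_dp_operations(100, s):.2e}")
```

Each additional sequence multiplies the work by more than a factor of 2N, which is why heuristic (progressive or iterative) aligners are used for larger sets.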

Neighbor Joining
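The core step of neighbor joining is choosing which pair of taxa to join next. A minimal sketch of that single step is shown below: it computes the usual criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) and picks the pair with the smallest value. The five taxa and their distances are made-up illustrative numbers, and the function name is hypothetical; a full implementation would also compute branch lengths, merge the pair and repeat.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair with the lowest Q value, and that value.

    dist maps frozenset({x, y}) to the distance between x and y.
    """
    n = len(names)
    totals = {
        x: sum(dist[frozenset((x, y))] for y in names if y != x)
        for x in names
    }
    best_pair, best_q = None, None
    for x, y in combinations(names, 2):
        q = (n - 2) * dist[frozenset((x, y))] - totals[x] - totals[y]
        if best_q is None or q < best_q:
            best_pair, best_q = (x, y), q
    return best_pair, best_q


# Illustrative distance matrix (not taken from the slides).
names = ["a", "b", "c", "d", "e"]
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(pair): d for pair, d in raw.items()}
print(nj_pick_pair(names, dist))
```

Note that the pair with the smallest raw distance (d and e here) is not necessarily the pair that is joined first, because Q also accounts for how far each taxon is from all the others.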

Unrooted NJ tree

Comparison of Multiple sequence alignment programs

Primary sequence changes:

Profiles
Probability of the sequence CGGSV under the profile:
0.8 * 0.4 * 0.8 * 0.6 * 0.2 = 0.031
ln(0.8) + ln(0.4) + ln(0.8) + ln(0.6) + ln(0.2) = -3.48
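The arithmetic above can be reproduced with a small sketch. The profile below contains only the five per-position probabilities quoted on the slide; the probabilities of all other residues at each position are not given, so they are left out rather than guessed, and a residue missing from a column is treated as probability zero by this sketch's own convention. The name `profile_score` is hypothetical.

```python
import math

# Per-position probabilities for C, G, G, S, V as quoted on the slide.
profile = [{"C": 0.8}, {"G": 0.4}, {"G": 0.8}, {"S": 0.6}, {"V": 0.2}]


def profile_score(seq, profile):
    """Return (probability, log-probability) of seq under the profile."""
    if len(seq) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    prob, log_prob = 1.0, 0.0
    for residue, column in zip(seq, profile):
        p = column.get(residue, 0.0)
        if p == 0.0:
            return 0.0, float("-inf")
        prob *= p
        log_prob += math.log(p)
    return prob, log_prob


prob, log_prob = profile_score("CGGSV", profile)
print(round(prob, 3), round(log_prob, 2))
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long sequences, which is why log scores are the usual choice in practice.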

Hidden Markov Models
• Assumptions:
  • Observations are ordered
  • The random process can be represented by a stochastic finite state machine with emitting states
Probabilistic parameters of a Hidden Markov Model
• x: states
• y: possible observations
• a: state transition probabilities
• b: output/emission probabilities

HMM estimation, usage & applications
Training/Estimation
• Feed an architecture (given in advance) a set of observation sequences
• The training process will iteratively alter its parameters to fit the training set
• The trained model will assign the training sequences high probabilities
Usage
• Evaluate the probability of an observation sequence given the model (Forward)
• Find the most likely path through the model for a given observation sequence (Viterbi)
Applications
• Gene finding
• Protein family modeling
• …
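The two usage modes can be sketched on a toy model. Everything numeric below is an illustrative assumption: a two-state model with states x1 and x2, three possible observations y1 to y3, transition probabilities `a`, emission probabilities `b`, and a `start` distribution. The sketch skips training entirely and works with raw probabilities, which is only safe for short sequences.

```python
states = ("x1", "x2")
start = {"x1": 0.6, "x2": 0.4}                       # assumed initial distribution
a = {"x1": {"x1": 0.7, "x2": 0.3},                   # state transition probabilities
     "x2": {"x1": 0.4, "x2": 0.6}}
b = {"x1": {"y1": 0.5, "y2": 0.4, "y3": 0.1},        # emission probabilities
     "x2": {"y1": 0.1, "y2": 0.3, "y3": 0.6}}


def forward(obs):
    """Probability of the observation sequence, summed over all paths."""
    alpha = {s: start[s] * b[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * a[r][s] for r in states) * b[s][o]
                 for s in states}
    return sum(alpha.values())


def viterbi(obs):
    """Most likely state path and its joint probability."""
    prob = {s: start[s] * b[s][obs[0]] for s in states}
    path = {s: [s] for s in states}
    for o in obs[1:]:
        new_prob, new_path = {}, {}
        for s in states:
            # Best predecessor state for arriving in s at this step.
            prev = max(states, key=lambda r: prob[r] * a[r][s])
            new_prob[s] = prob[prev] * a[prev][s] * b[s][o]
            new_path[s] = path[prev] + [s]
        prob, path = new_prob, new_path
    best = max(states, key=lambda s: prob[s])
    return path[best], prob[best]


observations = ["y1", "y2", "y3"]
print(forward(observations))
print(viterbi(observations))
```

Forward replaces Viterbi's `max` with a `sum`, so it answers "how probable is this sequence under the model" while Viterbi answers "which single state path explains it best".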

Profile HMMs
• Families of functional biological sequences
• Primary sequences have diverged through evolution while maintaining structure/function.
• Questions:
  • Does a biological sequence belong to a certain protein family? For example, is a given protein (sequence) a globin?
  • Given a set of sequences, how can more sequences of the same family be found?

Trade offs

Questions?
