Computational Biology

Computational Biology Prepared By: SyedKhaleelullaHussaini

Outline • Proteins • DNA • RNA • Genetics and evolution • The Sequence Matching Problem • RNA Sequence Matching • Complexity of the Algorithms

DEFINITION • Computational Biology encompasses all computational methods and theories applicable to molecular biology and areas of computer based techniques for solving biological problems.

Protiens • They are building blocks of living organism • It is a large molecule that is composed of sequences of amino acids • There are 20 amino acids which are divided into classes hydrophobic(h-phob) hydrophillic(h-phil) polar(pos,neg)

Amino acid codes Name, 3-letter & single-letter codes • Aspartic Acid Asp D Glutamic Acid Glu E • PhenylaninePhe F GlycineGly G • Alanine Ala A CystineCys C • Histidine His H Isoleucine Ile I • Lysine Lys K LeucineLeu L • Methionine Met M AsparagineAsn N • Proline Pro P Glutamine Gln Q • ArginineArg R Serine Ser S • ThreonineThr T Valine Val V • Tryptophan Trp W Tyrosine Tyr Y

DNA(Deoxyribonucleic acid) • Blueprint of living organisms • DNA is composed of two strands hold by a weak hydrogen bond • Each strand is a sequence of nucleotides • DNA has four bases which are classified as two chemical types BASE SYMBOL TYPE Adenine A Purine Thymine T Purine Sytosine C Pyrimidine Guanine G Pyrimidine

DNA Double Helix

RNA • RNA is chemically very similar to DNA • There are two important differences • Four bases present in RNA are: Adenine(A) Guanine(G) Cystosine(C) Uracil(U) • RNA nucleotides contain a different sugar molecule(ribose)

Genetics and Evolution • Mutation • The changing of the structure of a gene, resulting in a variant form that may be transmitted to subsequent generations, caused by the alteration of single base units in DNA. • Natural selection • The process whereby organisms better adapted to their environment tend to survive and produce more offspring. • Genetic Drift • Variation in the relative frequency of different genotypes in a small population.

Sequence matching problem • Proteins are longer and DNA strands are even longer • We match them by breaking them in to shorter subsequences • Breaking and matching is done by notion of alignment.

Sequence matching example • Consider two amino acid sequences: ACCTGAGAG ACGTGGCAG sequence alignment A C C T G A G – A C A C G T G – G C A C

Finite state machines in blast • It is used to find out which of the sequences in a database are related to the new given sequence using BLAST • The BLAST system is a three step process: 1. Examine the query string and select set of substrings of length w(between 4 and 20) which are good for producing matches 2. Build a DFSM that uses set of substrings and find the sequences with the highest local matches in the database 3. Examine the matches found in step2 and try to build a longer matching sequences

Regular expressions specify protein motif • Aligning collection of related proteins we can define a motif Example: E S G H D T Y Y NKN R M D T TTTT S W QS R G S D T TT P D MT A G P T TW R NT Once an motif is defined we can search for the occurrences of it in other protein sequence by using regular expressions

HMM for sequence matching • HMM’s are used when sequences become fairly diverse • We can capture the variations among the members of the family and the probabilities associated with them • So by using HMM’s we can find the best alignment between two sequences and from which family does a given new sequence belongs to

HMM profile is given by M = (K,O,π,A,B) • K is a set of n states, one for each position in the sequence • O is the output alphabet • Π contains the initial state probabilities • A contains the transition probabilities • B contains the output probabilities

Example of HMM describing protein sequence family

RNA sequence matching and secondary structure prediction using the tools of context-free languages • In RNA a change to a single nucleotide in a stem region could completely alter the molecules shape and its function • So an change in the stem must be matched by a corresponding change in the paired nucleotide • Context free languages are used describe these nested dependencies and secondary structure

Example:

Complexity of algorithms used in computational biology • Approaches to many of the problems described here are computational like breaking up of large protein and DNA molecules into substrings • NP-hard • Conversion to decision problem SHOERTEST-SUPERSTRING(<S,K> ): S is a set of strings and there exists some superstring T such that every element of S is a substring of T and T has length less than or equal to K) – NP-complete

Reference • http://en.wikipedia.org/wiki/Computational_biology • http://www.google.com • http://www.cs.utexas.edu/~ear/cs341/automatabook/

Thank you…

Computational Biology