Computational biology
Outline • Proteins • DNA • RNA • Genetics and evolution • The Sequence Matching Problem • RNA Sequence Matching • Complexity of the Algorithms
Definition • Computational biology encompasses the computational methods and theories applicable to molecular biology, together with computer-based techniques for solving biological problems.
Proteins • Building blocks of living organisms • A protein is a large molecule composed of sequences of amino acids • There are 20 amino acids, which are divided into classes: hydrophobic (h-phob), hydrophilic (h-phil), polar (pos, neg)
DNA • Blueprint of living organisms • DNA is composed of two strands held together by weak hydrogen bonds • Each strand is a sequence of nucleotides • DNA has four bases, which are classified into two chemical types (purines and pyrimidines)
RNA • RNA is chemically very similar to DNA • There are two important differences • RNA uses uracil (U) in place of thymine (T), so its four bases are adenine (A), guanine (G), cytosine (C), and uracil (U) • RNA nucleotides contain a different sugar molecule (ribose)
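The base difference between DNA and RNA can be illustrated with a minimal sketch. It assumes the DNA string is the coding strand, so the RNA is obtained by swapping T for U; the function name `transcribe` is hypothetical and not from the slides.

```python
DNA_BASES = set("ACGT")

def transcribe(dna: str) -> str:
    """Return the RNA string for a DNA coding strand by replacing T with U.

    Assumption: the input is the coding strand, so no complementing is done.
    """
    dna = dna.upper()
    unknown = set(dna) - DNA_BASES
    if unknown:
        raise ValueError(f"not DNA bases: {sorted(unknown)}")
    return dna.replace("T", "U")

if __name__ == "__main__":
    print(transcribe("ACCTGAGAG"))
```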
Genetics and evolution • Mutation • Natural selection • Genetic drift
Sequence matching problem • Matching DNA, RNA, or protein sequences between a diseased organism and a healthy organism • Proteins are long and DNA strands are even longer • We match them by breaking them into shorter subsequences • Breaking and matching are done using the notion of alignment.
Sequence matching example • Consider two amino acid sequences: ACCTGAGAG and ACGTGGCAG • One possible sequence alignment (– marks a gap):
A C C T G A G – A G
A C G T G – G C A G
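The slides do not name an alignment algorithm, so the following is only a sketch of one standard choice, global alignment by dynamic programming (Needleman-Wunsch style). The scoring values (match +1, mismatch -1, gap -1) and the function name `align` are assumptions made for illustration, and the code writes gaps as `-`.

```python
def align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    s, x, y = align("ACCTGAGAG", "ACGTGGCAG")
    print(s)
    print(x)
    print(y)
```

With a different scoring scheme the returned alignment may differ from the one on the slide; several alignments can share the same score.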
Finite state machines in BLAST • BLAST is used to find which of the sequences in a database are related to a newly given sequence • The BLAST system is a three-step process: 1. Examine the query string and select a set of substrings of length w (between 4 and 20) that are good candidates for producing matches 2. Build a DFSM from that set of substrings and use it to find the sequences in the database with the highest-scoring local matches 3. Examine the matches found in step 2 and try to extend them into longer matching sequences
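Steps 1 and 2 can be sketched with a multi-pattern automaton in the style of Aho-Corasick. This is not BLAST's actual implementation: the sketch takes every length-w substring of the query as a seed (real BLAST is more selective), and the names `build_matcher` and `find_seeds` are hypothetical.

```python
from collections import deque

def build_matcher(words):
    """Build a trie with failure links over the seed words."""
    goto, fail, out = [{}], [0], [[]]
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(w)
    # Breadth-first pass: compute failure links and inherit outputs.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out

def find_seeds(text, matcher):
    """Scan text once and return (start_index, word) for every seed hit."""
    goto, fail, out = matcher
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))
    return hits

if __name__ == "__main__":
    query, w = "ACGTG", 3
    seeds = sorted({query[i:i + w] for i in range(len(query) - w + 1)})
    print(find_seeds("TTACGTGA", build_matcher(seeds)))
```

Step 3 (extending seed hits into longer matches) is left out; each hit only marks where an extension would start.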
Regular expressions specify protein motifs • By aligning a collection of related proteins we can define a motif • Example: E S G H D T Y Y NKN R M D T TTTT S W QS R G S D T TT P D MT A G P T TW R NT • Once a motif is defined, we can search for occurrences of it in other protein sequences by using regular expressions
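Searching for a motif with a regular expression can be sketched as follows. The motif used here is not the one from the slide: it is the PROSITE N-glycosylation pattern N-{P}-[ST]-{P}, chosen only as a familiar example, and the test sequence and the name `find_motif` are made up for illustration.

```python
import re

# N, then any residue except P, then S or T, then any residue except P.
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")

def find_motif(protein: str, pattern=N_GLYCOSYLATION):
    """Return (start_index, matched_text) for every occurrence of the motif."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(protein)]

if __name__ == "__main__":
    print(find_motif("MKNGSAWNPTQ"))
```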
HMMs for sequence matching • HMMs are used when the sequences in a family become fairly diverse • They capture the variations among the members of the family and the probabilities associated with them • Using HMMs we can find the best alignment between two sequences and determine which family a given new sequence belongs to
An HMM profile is given by M = (K, O, π, A, B) • K is a set of n states, one for each position in the sequence • O is the output alphabet • π contains the initial state probabilities • A contains the transition probabilities • B contains the output probabilities
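The tuple M = (K, O, π, A, B) maps directly onto a small data structure. The sketch below uses a made-up two-state model (not a real protein family profile) to show the forward algorithm, which scores how well a sequence fits the model, and the Viterbi algorithm, which finds the most likely state path. All names and probabilities are illustrative assumptions, and both functions expect a non-empty sequence.

```python
def forward(obs, K, pi, A, B):
    """Probability that the model emits obs (summed over all state paths)."""
    alpha = {k: pi[k] * B[k][obs[0]] for k in K}
    for sym in obs[1:]:
        alpha = {k: sum(alpha[j] * A[j][k] for j in K) * B[k][sym] for k in K}
    return sum(alpha.values())

def viterbi(obs, K, pi, A, B):
    """Return (probability, path) for the single most likely state path."""
    best = {k: (pi[k] * B[k][obs[0]], [k]) for k in K}
    for sym in obs[1:]:
        new = {}
        for k in K:
            p, path = max(((best[j][0] * A[j][k], best[j][1]) for j in K),
                          key=lambda t: t[0])
            new[k] = (p * B[k][sym], path + [k])
        best = new
    return max(best.values(), key=lambda t: t[0])

# Toy model: hypothetical states m1, m2 over the output alphabet O = {A, C}.
K = ["m1", "m2"]
pi = {"m1": 0.5, "m2": 0.5}
A = {"m1": {"m1": 0.5, "m2": 0.5}, "m2": {"m1": 0.5, "m2": 0.5}}
B = {"m1": {"A": 1.0, "C": 0.0}, "m2": {"A": 0.0, "C": 1.0}}

if __name__ == "__main__":
    print(forward("AC", K, pi, A, B))
    print(viterbi("AC", K, pi, A, B))
```

To decide which family a new sequence belongs to, one would compute `forward` under each family's model and compare the results.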
RNA sequence matching and secondary structure prediction using the tools of context-free languages • In RNA, a change to a single nucleotide in a stem region could completely alter the molecule's shape and its function • So a change in the stem must be matched by a corresponding change in the paired nucleotide • Context-free languages are used to describe these nested dependencies and the secondary structure
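The nested pairing in a stem can be described by productions of the form S → x S y, where x and y are complementary bases, with the unpaired loop in the middle. The recursive function below mirrors that rule; it is a simplified sketch that handles a single hairpin with Watson-Crick pairs only, and the name `stem_pairs`, the `min_loop` parameter, and the example sequences are assumptions.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def stem_pairs(rna: str, min_loop: int = 3) -> int:
    """Count nested base pairs peeled from both ends (S -> x S y)."""
    # Apply S -> x S y only if at least min_loop bases remain inside.
    if len(rna) - 2 >= min_loop and (rna[0], rna[-1]) in PAIRS:
        return 1 + stem_pairs(rna[1:-1], min_loop)
    return 0  # S -> loop: the rest is treated as unpaired

if __name__ == "__main__":
    print(stem_pairs("GGCAAAAGCC"))  # intact stem
    print(stem_pairs("GACAAAAGCC"))  # one stem base changed
    print(stem_pairs("GACAAAAGUC"))  # paired base changed to compensate
```

The second call shows a single change breaking the stem, and the third shows the compensating change in the paired position restoring it.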
Complexity of algorithms used in computational biology • Many of the approaches described here are computational, such as breaking up large protein and DNA molecules into substrings and reassembling them • Finding a shortest superstring of a set of strings is NP-hard • Converted to a decision problem: SHORTEST-SUPERSTRING = {<S, K> : S is a set of strings and there exists some superstring T such that every element of S is a substring of T and T has length less than or equal to K} is NP-complete
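The decision problem can be made concrete with a brute-force sketch. It tries every ordering of the strings and merges neighbors with maximal overlap, so its running time grows factorially and it is usable only for very small sets, which is consistent with the problem being NP-hard. The function names are hypothetical.

```python
from itertools import permutations

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def shortest_superstring(strings) -> str:
    """Exhaustive search; exponential time, for tiny inputs only."""
    uniq = sorted(set(strings))
    # Strings contained in another string never affect the answer.
    core = [s for s in uniq if not any(s != t and s in t for t in uniq)]
    if not core:
        return ""
    best = None
    for order in permutations(core):
        merged = order[0]
        for s in order[1:]:
            merged += s[overlap(merged, s):]
        if best is None or len(merged) < len(best):
            best = merged
    return best

def in_shortest_superstring(strings, k: int) -> bool:
    """Decide <S, K>: is there a superstring of length at most k?"""
    return len(shortest_superstring(strings)) <= k

if __name__ == "__main__":
    fragments = ["ACG", "CGT", "GTA"]
    print(shortest_superstring(fragments))
    print(in_shortest_superstring(fragments, 5))
```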
References • Elaine Rich, Automata, Computability and Complexity: Theory and Applications (book). • http://en.wikipedia.org/wiki/Computational_biology