440 likes | 561 Views
Computational Molecular Biology. Introduction and Preliminaries. Preliminaries in Computer Science. Strings and alphabet Basic notations in graph theory Algorithms and Complexity. Strings. Consist of a sequence of letters: DNA: four nucleotides A, C, G, T
E N D
Computational Molecular Biology Introduction and Preliminaries
Preliminaries in Computer Science • Strings and alphabet • Basic notations in graph theory • Algorithms and Complexity My T. Thai mythai@cise.ufl.edu
Strings • Consist of a sequence of letters: • DNA: four nucleotides A, C, G, T • Proteins: 20 symbol alphabet of animo acids • Given a string s, we have the following notations: • Length: |s| • Substring: ACT is a substring of ATGACTG • Superstring: ATGACTG is a superstring of ACT • Index and interval: s[i] and s[i..j] • Prefix and suffix: s[1..j] and s[i..|s|] My T. Thai mythai@cise.ufl.edu
Graphs • G = (V, E) where V is a set of vertices and E is a set of edges • Undirected graph: edges are undirected • Directed graph: edges are directed • Weighted graph G = (V, E, w) where each edge has some weight • Some special graphs: complete graph, bipartite graph, tree, and interval graph • Subgraph, spanning tree, steiner tree My T. Thai mythai@cise.ufl.edu
Interval Graphs • Intersection graph of a set of intervals on the real line • A vertex represents an interval and an edge (u, v) exists if intervals u and v intersect My T. Thai mythai@cise.ufl.edu
Some Problems in Graphs • Euler circuit: Given a graph, find a cycle that passes through each edge exactly once • Hamiltonian circuit: Given a graph, find a cycle that passes through each vertex exactly once • Minimum Spanning Tree: Given a weighted undirected graph, find a spanning tree with minimum total weight • Maximum Matching: Given an undirected graph, find a maximum cardinality matching, which is a subset of edges such that no two edges in the subset share an endpoint My T. Thai mythai@cise.ufl.edu
P vs. NP • Class of P: Set of problems solvable by polynomial-time algoirthms • Class of NP: Set of problems whose solutions, once found, can be verified in polynomial time • NP-complete (NP-hard) problems: cannot obtain an optimal solutions in polynomial time My T. Thai mythai@cise.ufl.edu
Some approaches for NP-complete Problems • Special-case method: Work on the problem with a restricted class of inputs • Exhaustive search: Design an exponential-time algorithms that may perform well in practice • Approximation algorithms: Design a polynomial-time algorithm that is guaranteed to find near-optimal solutions (with a good approximation ratio) • Heuristics: Fast algorithms that produce satisfactory solutions most of the time but without guarantee My T. Thai mythai@cise.ufl.edu
Preliminaries in Molecular Biology My T. Thai mythai@cise.ufl.edu
DNA and Base Pairs • Double helix consisting of two dual strands • Has four types of nucleotides: Adenine, Thymine, Guanine, Cytosine • Base Pairs: A↔T, C↔G • Two ends of a strand are marked with 3’ and 5’ • The entire DNA of a living organism is called its genome My T. Thai mythai@cise.ufl.edu
DNA Sequences My T. Thai mythai@cise.ufl.edu
DNA Replication • Strands are separated • Each strand is replicated using one of the parental strands as a template My T. Thai mythai@cise.ufl.edu
Cell, Chromosome, and DNA My T. Thai mythai@cise.ufl.edu
Cell Classification My T. Thai mythai@cise.ufl.edu
Chromosomes • Consists of a DNA molecule associated with proteins that fold and pack the DNA thread into a more compact structure and proteins required for the process of gene expression, DNA replication and DNA repair. • Human genome is distributed over 24 chromosomes • Each cell contains 46 chromosomes • 22 pairs common to both males and females • 2 sex chromosome X and Y in males and two Xs in female My T. Thai mythai@cise.ufl.edu
Genes • Segments of DNA • Functional and physical unit of heredity passed from parent to offspring • Contain the information for making a specific protein My T. Thai mythai@cise.ufl.edu
Proteins • Shorts strings in the amino acid 20-letter alphabet • Human genome: about 100,000 proteins, with each protein a few hundred amino acids long • Bacteria make 500-1500 proteins • Made by genes (fragments of DNA) that are roughly three times longer than the corresponding proteins. • Why? Every 3 nucleotides in the DNA alphabet code one letter in the protein alphabet of amino acids My T. Thai mythai@cise.ufl.edu
Central Dogma of Molecular Biology My T. Thai mythai@cise.ufl.edu
Transcription My T. Thai mythai@cise.ufl.edu
Translation • Translation • mRNA (after exported out of the nucleus and reaching the cytosol) directs the synthesis of the protein by joining together amino acids in the order encoded by the mRNA • Genetic code • Defines a mapping between codons and amino acid. • Codon • Triplet of nucleotides specifies a single amino acid in a corresponding protein • 64 codons and 20 amino acids • Translation is carried out by ribosomes My T. Thai mythai@cise.ufl.edu
Polymerase Chain Reaction (PCR) • Primer • Nucleic acid strand • Serves as a starting point of DNA replication My T. Thai mythai@cise.ufl.edu
Plasmid Vector • Vector • an agent that can carry a DNA fragment into a host cell • Plasmid • Circular and double-stranded DNA • Antibiotic resistance • Automatic replication • Exists in bacteria My T. Thai mythai@cise.ufl.edu
DNA Cloning Using Plasmids as Vectors • (a) DNA recombination • (b) Transformation My T. Thai mythai@cise.ufl.edu
DNA Cloning Using Plasmids as Vectors (Cont) • (c) Selective amplification • (d) Isolation of desired DNA clones My T. Thai mythai@cise.ufl.edu
DNA Library Screening • Probe: • Labeled with radioisotope or fluorescence • Used to detect specific DNA sequences by hybridization • Hybridization: • Binding of two nucleic acid chains by base paring • DNA Library Screening • To identify each clone whether it contains a probe from a given set of probes • Positive clone:contains a probe My T. Thai mythai@cise.ufl.edu
Some Computational Problems • Pooling Design • Non-unique probe selection • Sequence Alignment, Multi Sequence Alignment • DNA sequencing • Genome Rearrangement • Protein Structure Prediction and Recognition • Protein-Protein Interactions • Functional Groups, Modules My T. Thai mythai@cise.ufl.edu
Pooling Designs • Problem Definition • Given a set of n clones with at most d positive clones • Identify all positive clones with the minimum number of tests • Pool:a subset of clones • Positive pool: a pool contains at least one positive clone My T. Thai mythai@cise.ufl.edu
Pooling Designs clones c1 c2 cj cn p1 0 0 … 0 … 0 … 0 … 0 0 p2 0 1 … 0 … 0 … 0 … 0 1 pools . . . . pi 0 0 … 0 … 1 … 0 … 0 1 . . . . pt 0 0 … 0 … 0 … 0 … 0 0 txn tx1 M[i, j] = 1 iff the ith pool contains the jth clone Decoding Algorithm: Given M and V(D), identify all positive clones V(D) Testing Mtxn = My T. Thai mythai@cise.ufl.edu
Challenges • Challenge 1: How to construct the binary matrixM such that: • Outputs of any union of d columns are distinct • Challenge 2: How to design a decoding algorithm with efficient time complexity [O(tn)] My T. Thai mythai@cise.ufl.edu
Probe Selection • Problem Definition: • Given a biological sample (e.g., blood) and a set of probes • Identify the presence (or absence) of some biological objects (e.g., viruses or bacteria) with the minimum number of probes My T. Thai mythai@cise.ufl.edu
Unique Probes VS. Non-unique Probes • Unique probes • Gene-specific probes or signature probes. • Difficult to find • Non-unique probes • Hybridize to more than one target. • Difficult to decode the results My T. Thai mythai@cise.ufl.edu
Probe-Target Matrix • 12 probe candidates. • 4 targets (genes). • For target set S, define P(S) as set of probes reacting to any gene in S. • P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12}. • P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12}. • Symmetric set difference: P({1, 2})∆P({2, 3}) = {2, 5, 6, 10}. Probes that separate two sets. My T. Thai mythai@cise.ufl.edu
Sequence Alignment • Problem Definition: • Given: 2 DNA or protein sequences • Find: Best match between them • What is an Alignment: • Given: 2 Strings S and S’ • Goal: The lengths of S and S’ are the same by inserting spaces into these strings My T. Thai mythai@cise.ufl.edu
Matches, Mismatches and Indels • Match: two aligned, identical characters in an alignment • Mismatch: two aligned, unequal characters • Indel: A character aligned with a space A A C T A C T -- C C T A A C A C T -- -- -- -- C T C C T A C C T -- -- T A C T T T 10 matches, 2 mismatches, 7 indels My T. Thai mythai@cise.ufl.edu
Basic Algorithmic Problem • Find the alignment of the two strings that: • Max m where m = (# matches – mismatches – indels) • m defines the similarity of the two strings, also called Optimal Global Alignment • Biologically: a mismatch represents a mutation, whereas an indel represents a historical insertion or deletion of a single character My T. Thai mythai@cise.ufl.edu
Multiple Sequence Alignment • Problem Definition: • Similar to the sequence alignment problem but the input has more than 2 strings • Challenges: • NP-hard • Guarantee factor: 2 – 2/k where k is the number of the input sequences. • More work to reduce the time and space complexity My T. Thai mythai@cise.ufl.edu
DNA Sequencing • Problem Definition: • Given a set of fragments that are contained in a DNA string S • Goal: Determine the string S • NP-complete • Further complicated due to the existence of repetitive sequences in the genome • Can cast this as a Hamiltonian path or Euler path problem (was introduced by Pavel Pavzner) My T. Thai mythai@cise.ufl.edu
Genome Rearrangement • Problem Definition: • Given genomes of 2 different species • Goal: Find a sequence of evolutionary events that turn the first genome to the second one. • Biological reasons: How close between these species, how much evolution separate these species. • E.g.: We usually test new drugs on mice before humans. However, how close is a mouse to a human? My T. Thai mythai@cise.ufl.edu
Genome Rearrangement • Can we use the solutions of sequence alignment to solve this problem? • Answer: NO, because: • Genome is a very long strings (3 million letters for a human genome • Model of sequence alignment is not appropriate for human genome comparison since the differences are not in terms of insertions/deletion/mutations of a nucleotide, but a rearrangement of a long DNA regions • The basic comparison is gene My T. Thai mythai@cise.ufl.edu
An Example • If we compare these two strings by sequence alignment, it’s impossible • However, the second string is the first string after reverse the fragment AATGGT…CCC. My T. Thai mythai@cise.ufl.edu
Main Evolutionary Events • Deletions: A fragment is removed • Duplications: create many copies of a fragment and insert into different positions • Transpositions: A fragment is removed and re-inserted into a different position • Inversions: A fragment is removed, reversed, and then reinserted into the same position • Translocations: A pair of fragments are exchanged between the ends of two chromosomes My T. Thai mythai@cise.ufl.edu
My T. Thai mythai@cise.ufl.edu
Protein Structure Prediction • Problem Definition: • Given: A sequence of amino acids • Goal: Predict the 3D structure of the protein • Some approaches: • Determine the position of a protein’s atoms so as to minimize the total free energy • Find the similarities to some known proteins My T. Thai mythai@cise.ufl.edu
Community Structure • Problem Definition: • Given a graph G = (V, E) representing a network • Partition G into a set of subgraph (community structure) so that nodes in each subgraph are highly connected • Biological reason: Genes with similar expression data may have similar functions. Identify the community structure can help us to reduce the number of tests • Others: Community structure is also studied in different fields My T. Thai mythai@cise.ufl.edu