Encoding Information for DNA computing Shinnosuke Seki
Purpose • What is the advantage of encoding? • To construct a “good” (tractable) code set for DNA computing. • To develop polynomial-time algorithms that decide whether a given code set is “good” or “bad”.
Claude Elwood Shannon • The father of information theory (Shannon entropy) • Showed that Boolean algebra and binary arithmetic can be used to simplify arrangements of electromechanical relays • In “A mathematical theory of communication” [Sha48], he showed that information can be sent with an arbitrarily small error rate even over a noisy channel. • Chess program using a minimax evaluation procedure • etc.
Shannon’s information channel • R > C: overflow. • R ≤ C: we can make the error rate as small as we like. • To attain R = C on a noisy channel, we need to find a “good” code. [Figure: sender → encoder → channel of capacity C, subject to positive and negative noise → decoder → receiver, with information flow R]
Biological perspective • Every biological reaction can be modeled as an information channel. • Example: heredity. • Over billions of years, has Mother Nature developed a wonderful code system? • Biology -> Computer Science. [Figure: heredity as a channel from parent DNA to child DNA, subject to mutation and natural selection]
Review: in vitro DNA computing • Encode a given problem into single- or double-stranded DNAs (ssDNAs, dsDNAs). • Compute by a succession of bio-operations. • Decode the resulting solution and extract its output.
Review: WK-complementarity • Bases pair through hydrogen bonds: A with T, and C with G. • Two strands that are complementary to each other and run in opposite directions can form a (complete) dsDNA. • Example: 5’-ATCGGTCAACTGCCCTAATG-3’ pairs with 3’-TAGCCAGTTGACGGGATTAC-5’.
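The pairing rule on this slide can be checked mechanically. The following is a minimal sketch; the function names `complement` and `reverse_complement` are chosen here for illustration and do not come from the slides.

```python
# Watson-Crick pairing: A with T, C with G.
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Base-by-base complement, kept in the same left-to-right order."""
    return "".join(WK_PAIR[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """The partner strand written 5'->3' (complement, then reverse)."""
    return complement(strand)[::-1]

upper = "ATCGGTCAACTGCCCTAATG"            # 5'->3' strand from the slide
lower_3_to_5 = complement(upper)          # partner as drawn under it, 3'->5'
lower_5_to_3 = reverse_complement(upper)  # same partner, read 5'->3'

print(lower_3_to_5)  # TAGCCAGTTGACGGGATTAC
print(lower_5_to_3)  # CATTAGGGCAGTTGACCGAT
```

The first printed strand matches the lower strand of the slide's example; the second is the same molecule read in the conventional 5’→3’ direction.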
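Adleman’s first trial • Find a solution of the Hamiltonian path problem in a (chemical) solution, in time polynomial in the size of the input graph. • The solution is filled with encoding oligonucleotides. • Vertex oligonucleotides: 1 = ACG CTT, 2 = ATA GAT, 3 = CGG TTA, 4 = ACT TAA. • Edge oligonucleotides: 1 -> 2 = GAA TAT, 2 -> 3 = CTA GCC, 3 -> 4 = AAT TGA; each is complementary to the second half of its source vertex followed by the first half of its target vertex. [Figure: input graph on the vertices 1, 2, 3, 4]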
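The edge oligonucleotides on this slide can be derived from the vertex oligonucleotides. The sketch below assumes, as the listed sequences suggest, that an edge u -> v is the base-by-base complement of the second half of vertex u followed by the first half of vertex v; the names `VERTEX` and `edge_oligo` are illustrative only.

```python
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Vertex code words as listed on the slide (spaces removed).
VERTEX = {1: "ACGCTT", 2: "ATAGAT", 3: "CGGTTA", 4: "ACTTAA"}

def edge_oligo(u: int, v: int) -> str:
    """Complement of (second half of vertex u) + (first half of vertex v)."""
    half = len(VERTEX[u]) // 2
    bridged = VERTEX[u][half:] + VERTEX[v][:half]
    return "".join(WK_PAIR[base] for base in bridged)

edges = {(u, v): edge_oligo(u, v) for u, v in [(1, 2), (2, 3), (3, 4)]}
for (u, v), oligo in edges.items():
    print(f"{u} -> {v}: {oligo}")
```

Because each edge strand overlaps half of two vertex strands, hybridization can only chain vertices along edges of the graph, which is what lets paths assemble in the test tube.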
What’s a good code set? • No code word (oligonucleotide) should form an undesirable structure by itself; such a structure may make the code word inert. • Code words should not interact with each other in undesirable ways. • Structure formation is governed by WK-complementarity and Gibbs free energy. [Figure: the code word for vertex 2, ATA GAT, folded into an undesirable structure]
What’s a good code set? (cont.) • Uniform melting temperature • Preventing undesirable hybridizations • Other constraints: avoiding repeated bases; forbidden subsequences (when a restriction enzyme is used, its recognition site should appear only at the intended positions); using only 3 types of nucleotides, A, C, T
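The "other constraints" on this slide are simple syntactic checks. The sketch below is one possible reading; the thresholds, the function names, and the use of the EcoRI recognition site GAATTC as the forbidden subsequence are assumptions made for illustration.

```python
import re

def longest_run(seq: str) -> int:
    """Length of the longest block of identical consecutive bases."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)

def satisfies_constraints(seq: str,
                          max_run: int = 3,
                          forbidden: tuple = ("GAATTC",),  # example: EcoRI site
                          alphabet: str = "ACGT") -> bool:
    """Check alphabet restriction, repeated bases, and forbidden subsequences."""
    if any(base not in alphabet for base in seq):
        return False
    if longest_run(seq) > max_run:
        return False
    return not any(site in seq for site in forbidden)

print(satisfies_constraints("ACTCATTCA", alphabet="ACT"))  # True
print(satisfies_constraints("ACGT", alphabet="ACT"))       # False: G not allowed
print(satisfies_constraints("AAAAC"))                      # False: run of four A's
print(satisfies_constraints("TTGAATTCAA"))                 # False: contains GAATTC
```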
Melting temperature • The melting temperature Tm of a dsDNA is the temperature at which half of the dsDNAs are denatured. • The higher Tm is, the more stable the dsDNA is. • Tm = ΔH / (ΔS + R ln(Ct / α)), where • R: gas constant, • Ct: total oligo concentration, • ΔH & ΔS: enthalpy & entropy, • α: 1 for self-complementary strands and 4 for non-self-complementary strands.
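As a worked illustration of the quantities listed on this slide, the sketch below uses the standard two-state expression Tm = ΔH / (ΔS + R ln(Ct / α)). The units (cal/mol for ΔH, cal/(K·mol) for ΔS, mol/L for Ct) and the numeric inputs are assumptions chosen for the example, not values from the slides.

```python
import math

R = 1.987  # gas constant in cal/(K*mol); the unit choice is an assumption

def melting_temperature(delta_h: float, delta_s: float, total_conc: float,
                        self_complementary: bool = False) -> float:
    """Two-state melting temperature in degrees Celsius.

    delta_h: enthalpy in cal/mol, delta_s: entropy in cal/(K*mol),
    total_conc: total oligo concentration Ct in mol/L.
    """
    alpha = 1.0 if self_complementary else 4.0
    tm_kelvin = delta_h / (delta_s + R * math.log(total_conc / alpha))
    return tm_kelvin - 273.15

# Hypothetical duplex: dH = -150 kcal/mol, dS = -400 cal/(K*mol), Ct = 1 uM.
tm = melting_temperature(-150_000.0, -400.0, 1e-6)
print(f"Tm = {tm:.1f} C")
```

In practice ΔH and ΔS are themselves obtained by summing nearest-neighbor parameters over the duplex.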
Nearest-neighbor method • Refer to [AlSa97] and [TKY04] (cited as [8] and [9] in the table on this slide).
Melting temperature (cont.) • Uniform melting temperature: making Tm uniform across code words eliminates hybridization bias. • GC content: the ratio of the number of G’s and C’s to the total number of nucleotides in a sequence. • A G-C pair is more stable than an A-T pair. • Higher GC content implies higher Tm. • Sequences are designed with 50% GC content.
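GC content as defined on this slide is a one-line computation. A minimal sketch follows; the function name `gc_content` is illustrative, and the test sequences are the WK-complementarity example strand and two of the vertex code words from the Adleman slide.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)

print(gc_content("ATCGGTCAACTGCCCTAATG"))  # 0.5
print(gc_content("ACGCTT"))                # 0.5
print(gc_content("ATAGAT"))                # 1 of 6 bases is G or C
```

The last two code words have the same length but different GC content, so under the rule of thumb above their duplexes would not melt at the same temperature.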
Gibbs free energy (ΔG) • A well-known indicator of the stability of DNA structures • A structure with lower ΔG is more stable. • The ΔG of an entire structure is the sum of the ΔG values of its substructures [ZuSt81].
Template method [ArKo02] • Prepare 2 bit sequences, each of which has some desirable property (e.g., 50% GC content, error correction). • Using a conversion rule, construct one DNA sequence from these 2 bit sequences.
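The conversion rule itself is not reproduced on this slide, so the sketch below assumes one plausible rule: the template bit decides whether a position is A/T (0) or G/C (1), and the map bit picks the base within that pair. The table `CONVERT` and the function `apply_template` are hypothetical names introduced for this example.

```python
# Assumed conversion rule (not taken from the slides):
# template bit 0 -> {A, T}, template bit 1 -> {C, G}; the map bit picks one.
CONVERT = {("0", "0"): "A", ("0", "1"): "T",
           ("1", "0"): "C", ("1", "1"): "G"}

def apply_template(template: str, word: str) -> str:
    """Combine a template and a map word of equal length into a DNA sequence."""
    if len(template) != len(word):
        raise ValueError("template and word must have the same length")
    return "".join(CONVERT[(t, m)] for t, m in zip(template, word))

template = "110100"   # three 1's out of six positions: 50% GC content
print(apply_template(template, "101101"))  # GCTGAT
print(apply_template(template, "000000"))  # CCACAA
```

Under this assumed rule the GC content of every generated sequence is fixed by the template alone, while mismatches between map words carry over to mismatches between the DNA sequences.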
Template method (cont.) • Design criteria • Template: a template x should have at least d mismatches with x^R, xx, x^R x^R, x x^R, and x^R x, where x^R denotes the reverse of x; a good template is found by exhaustive search. • Map (error-correcting code): a code in which any two words have at least k mismatches, e.g. a BCH code. • Drawback: it cannot prevent sequences from forming secondary structures.
AG-templates, GC-templates [KKA03] • GC-template: the template contains the same number of 0’s and 1’s (50% GC content); the map is an error-correcting code. • AG-template: the map is a constant-weight code (50% GC content); this results in a bigger set of sequences.
Other approaches • DNASequenceGenerator [FBR00] • Software with a GUI • Creates sequences with a specified melting temperature and GC content, and without palindromes, start codons, or restriction sites.
Other approaches • Suyama’s approach [YoSu00] • Generate sequences at random, and add each one to the sequence set iff it satisfies all of the following constraints: uniform melting temperature; no mis-hybridization; no formation of stable secondary structure. • Drawback: it easily falls into local optima.
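The generate-and-filter idea on this slide can be sketched as follows. This is a simplified stand-in, not the method of [YoSu00]: 50% GC content is used as a crude proxy for uniform melting temperature, Hamming distance to accepted words and their reverse complements as a proxy for mis-hybridization, and the secondary-structure check is omitted. All names and parameters are assumptions.

```python
import random

WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def revcomp(seq: str) -> str:
    return "".join(WK_PAIR[b] for b in reversed(seq))

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def generate_code_set(n_words: int, length: int, min_dist: int,
                      max_tries: int = 10_000, seed: int = 0) -> list:
    """Randomly propose words; keep a word iff it passes every filter."""
    rng = random.Random(seed)
    code = []
    for _ in range(max_tries):
        if len(code) == n_words:
            break
        cand = "".join(rng.choice("ACGT") for _ in range(length))
        if 2 * sum(b in "GC" for b in cand) != length:  # exactly 50% GC
            continue
        if hamming(cand, revcomp(cand)) < min_dist:     # self-hybridization
            continue
        if all(hamming(cand, w) >= min_dist and
               hamming(cand, revcomp(w)) >= min_dist for w in code):
            code.append(cand)
    return code

print(generate_code_set(n_words=5, length=8, min_dist=3, seed=1))
```

Since accepted words are never revised, an early poor choice can block later candidates, which is one way to read the local-optimum drawback noted on the slide.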
Other approaches • Hybrid randomized neighborhoods [TuHo03] • Stochastic local search (SLS) algorithm • With probability ε, it explores neighbors by randomly mutating the current best sequences. • With probability 1-ε, it moves in the direction that decreases the number of constraint conflicts the most.
Other approaches • GA (genetic algorithm)-based approach [ANH00] • Uses a GA whose fitness function evaluates candidate solutions • Criteria: restriction sites; GC content; Hamming distance; repetition of the same base
Other approaches • Gibbs free energy-based approach • Takes thermodynamics into consideration • Uses Gibbs free energy as a stability measure • Advantage: greater accuracy, because it takes into account the stability of loops and the stacking between base pairs • Disadvantage: more computation time is needed to calculate free energy • How can this computational cost be decreased? See [TKY05], [KNO08]
A formal language approach • Design a set of structure-free codes in terms of WK-complementarity. • Advantages: more reliable codes than the free-energy approach; more efficient algorithms for decision problems • Disadvantage: each structure needs to be considered separately.
A formal language approach (cont.) • Abstraction of concepts: {A, C, G, T} → an alphabet V; WK-complementarity → an antimorphic involution θ. • Involution: a mapping θ s.t. θ² is the identity (symmetry). • Antimorphism: θ(xy) = θ(y)θ(x) (opposite direction). • E.g., θ(TCATCCGATTTCGGG) = CCCGAAATCGGATGA. [Figure: 5’-TCATCCGATTTCGGG-3’ paired with 3’-AGTAGGCTAAAGCCC-5’]
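The two defining properties of θ can be checked directly on the slide's example word. The sketch below models θ on {A, C, G, T} as reverse complementation; the function name `theta` is chosen here for illustration.

```python
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(word: str) -> str:
    """WK-complementarity as an antimorphic involution: complement and reverse."""
    return "".join(WK_PAIR[base] for base in reversed(word))

w = "TCATCCGATTTCGGG"
x, y = w[:6], w[6:]

print(theta(w))                              # CCCGAAATCGGATGA
print(theta(theta(w)) == w)                  # True: involution
print(theta(x + y) == theta(y) + theta(x))   # True: antimorphism
```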
Bond-free properties[KKS05] • θ-non-overlapping: • θ-compliant: • Strictly (a) : a property (a) with θ-non-overlapping
Bond-free properties[KKS05] • θ-p-compliant: • θ-s-compliant:
Bond-free properties[KKS05] • θ-free: • θ-sticky-free:
Bond-free properties[KKS05] • θ-3’-overhang-free: • θ-5’-overhang-free: • θ-overhang-free: both of these
Decidability [KKS05] • Theorem • The following problem is decidable in time quadratic in |A|. • Input: an NFA A. • Output: Yes/No depending on whether L(A) satisfies any of the following properties (or their strict versions): • θ-compliant, θ-p-compliant, θ-s-compliant, • θ-sticky-free, • θ-3’-overhang-free, θ-5’-overhang-free, θ-overhang-free.
Decidability and maximality [KKS05] • Theorem • Let M be a regular language and L a regular subset of M satisfying a property ρ, where ρ is one of the following: θ-compliant, θ-p-compliant, θ-s-compliant, or θ-sticky-free. • Then it is decidable whether L is a maximal subset of M satisfying ρ.
Secondary structure prevention • Secondary structures: • Hairpin loop (or simply hairpin) • Internal loop • Multiple-branch loop • Pseudoknot • They can be undesirable, e.g. for Adleman’s encoding technique for the Hamiltonian Path Problem (HPP).
Secondary Structures • [Figure: examples of a hairpin, an internal loop, and a hairpin frame (multiple loop)]
Hairpin-free language • A formal model of a hairpin: x v y θ(v) z. • Example: TAA---ACG---CGTTA---CGT---CGGT, with x = TAA, v = ACG, y = CGTTA, θ(v) = CGT, z = CGGT. • Hairpin freeness: intuitively, it is almost impossible to prevent hairpins of short stack length (say 2 or 3); the goal is to prevent any hairpin of stack length at least some given parameter k.
Hairpin-free language [KKL06] • A word w is (θ, k)-hairpin-free (abbr. hp(θ, k)-free) iff w = x v y θ(v) z with x, v, y, z ∈ Σ* implies |v| < k. • hpf(θ, k): the set of all hp(θ, k)-free words in Σ* • hp(θ, k): Σ* - hpf(θ, k). • A language L is called (θ, k)-hairpin-free iff L ⊆ hpf(θ, k).
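Following the hairpin model x v y θ(v) z, a word can be tested for hairpins of stack length at least k by a direct scan. The sketch below assumes θ is reverse complementation and treats a word as containing such a hairpin when some factor v of length k is followed later, without overlap, by θ(v); checking length exactly k suffices because a longer stack contains a stack of length k. The function names are illustrative.

```python
WK_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(word: str) -> str:
    return "".join(WK_PAIR[base] for base in reversed(word))

def has_hairpin(w: str, k: int) -> bool:
    """True iff w = x v y theta(v) z for some factor v with |v| >= k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    for i in range(len(w) - 2 * k + 1):
        v = w[i:i + k]
        if theta(v) in w[i + k:]:   # theta(v) must start after v ends
            return True
    return False

def is_hairpin_free(w: str, k: int) -> bool:
    return not has_hairpin(w, k)

word = "TAAACGCGTTACGTCGGT"        # TAA-ACG-CGTTA-CGT-CGGT without separators
print(has_hairpin(word, 3))        # True: v = ACG and theta(v) = CGT both occur
print(is_hairpin_free(word, 6))    # True: no stack of length 6
```

This scan is quadratic in the word length for fixed k; it is meant only to make the definition concrete, not to stand in for the automata-based decision procedures discussed next.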
Regularity of hairpin languages • hp(θ, k) and hpf(θ, k) are regular. • In particular, there exists a finite automaton M s.t. L(M) = hpf(θ, k).
Hairpin Freedom Problems • Hairpin-Freedom problem: Input: a nondeterministic automaton M. Output: Y/N depending on whether L(M) is hp(θ, k)-free. • Maximal Hairpin-Freedom problem: Input: a deterministic automaton M1 and an NFA M2. Output: Y/N depending on whether there is a word s.t. is hp(θ, k)-free.
Decidability • The hairpin-freedom problem for regular languages is decidable in time. • The maximal hairpin-freedom problem for regular languages is decidable in time.
Hairpin Frames • Also known as multiple loops • hp-frame of degree n: • Figure is an example of an hp-frame of degree 3. • A word u is an hp(fr, j)-word if it contains an hp-frame of degree j.
Regularity & decidability • hp(θ, fr, j) : the set of all hp(fr, j)-words on Σ* • hpf(θ, fr, j) : its complement in Σ* • The languages hp(θ, fr, j) & hpf(θ, fr, j) are regular. • The hp(fr, j)-freedom problem is decidable in linear time. • The maximal hp(fr, j)-freedom problem is decidable in time.
Application: DNA-HRAMs • An n-bit DNA-HRAM consists of n hairpins. • Each hairpin stores 1 bit of information by opening and closing; the open and closed states represent the bit values 0 and 1. [Figure: the strand --A-C-T-G-T-C-G-A-C-A-G-T-- in its open form and closed into a hairpin, with the transitions labeled opening and closing]
n-bit DNA-HRAM • A concatenation of n 1-bit RAMs, which is equivalent to an hp-frame of degree n. • In order for this word to work as an n-bit RAM, the following subword should be hp(θ, 20)-free. • A DNA memory with 4 hairpins was proposed in [KYO08].
References • [AlSa97] Allawi, H.T., SantaLucia, J.: Thermodynamics and NMR of internal G·T mismatches in DNA. Biochemistry 36(34) (1997) 10581-10594 • [ArKo02] Arita, M., Kobayashi, S.: DNA sequence design using templates. New Generation Computing 20 (2002) 263-277 • [ANH00] Arita, M., Nishikawa, A., Hagiya, M., Komiya, K., Gouzu, H., Sakamoto, K.: Improving sequence design for DNA computing. Proc. Genetic and Evolutionary Computation Conference (2000) 875-882 • [FBR00] Feldkamp, U., Saghafi, S., Rauhe, H.: A DNA sequence compiler. Proc. DNA6 (2000) • [KKS05] Kari, L., Konstantinidis, S., Sosik, P.: Preventing undesirable bonds between DNA codewords. Proc. DNA10, LNCS 3384 (2005) 182-191 • [KKL06] Kari, L., Konstantinidis, S., Losseva, E., Sosik, P., Thierrin, G.: A formal language analysis of DNA hairpin structures. Fundamenta Informaticae 71 (2006) 453-475 • [KKA03] Kobayashi, S., Kondo, T., Arita, M.: On template method for DNA sequence design. Proc. DNA8, LNCS 2568 (2003) 205-214
References (cont.) • [KNO08] Kawashimo, S., Ng, Y-K., Ono, H., Sadakane, K., Yamashita, M.: Speeding up local-search type algorithms for designing DNA sequences under thermodynamical constraints. Proc. DNA14 (2008) 152-161 • [KYO08] Kameda, A., Yamamoto, M., Ohuchi, A., Yaegashi, S., Hagiya, M.: Unravel four hairpins! Natural Computing 7 (2008) 287-298 • [RFL01] Ruben, A. J., Freeland, S. J., Landweber, L. F.: PUNCH: An evolutionary algorithm for optimizing bit set selection. Proc. DNA7 (2001) 150-160 • [Sha48] Shannon, C.E.: A mathematical theory of communication. Bell System Technical Journal 27 (1948) 379-423, 623-656 • [TKY04] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Thermodynamic parameters based on a nearest-neighbor model for DNA sequences with a single-bulge loop. Biochemistry 43(22) (2004) 7143-7150 • [TKY05] Tanaka, F., Kameda, A., Yamamoto, M., Ohuchi, A.: Design of nucleic acid sequences for DNA computing based on a thermodynamic approach. Nucleic Acids Res. 33(3) (2005) 903-911
References (cont.) • [TuHo03] Tulpan, D., Hoos, H.: Hybrid randomised neighbourhoods improve stochastic local search for DNA code design. In Advances in Artificial Intelligence: 16th Conference of the Canadian Society for Computational Studies of Intelligence, 2671 (2003) 418-433 • [YoSu00] Yoshida, H., Suyama, A.: Solution to 3-SAT by breadth first search. Proc. the 5th DIMACS Workshop on DNA Based Computers, 54 (2000) 9-22 • [ZuSt81] Zuker, M., Stiegler, P.: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9(1) (1981) 133-148