Linear-Time Encoding/Decoding of Irreducible Words for Codes Correcting Tandem Duplications

Explore the efficient encoding and decoding methods for correcting tandem duplications in genetic data using irreducible words. Our research focuses on developing optimal codes to address errors caused by biological mutations in living organisms. The study includes the formulation of a code construction goal, using a linear-time encoder, and providing upper bounds for optimal codes.

Presentation Transcript


  1. Linear-Time Encoding/Decoding of Irreducible Words for Codes Correcting Tandem Duplications • Tuan Thanh Nguyen, Nanyang Technological University (NTU), Singapore • Joint work with: Yeow Meng Chee, Han Mao Kiah, Johan Chrisnata

  2. Our motivation • Applications that store data in living organisms • Shipman et al. (2017): CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria.

  3. Our motivation • Errors due to biological mutations: deletion, insertion, substitution, duplication, inversion, translocation • Duplication: is one of the two common repeats found in the human genome (more than 50%) • plays an important role in determining an individual’s inherited traits • is believed to be the cause of several disorders • Example (duplication of G A T): A C G A T G C → A C G A T G A T G C
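The duplication example on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `tandem_duplicate` and its `start`/`length` parameters are hypothetical and do not come from the slides.

```python
def tandem_duplicate(word: str, start: int, length: int) -> str:
    """Insert a copy of word[start:start+length] right after that segment."""
    end = start + length
    if length <= 0 or start < 0 or end > len(word):
        raise ValueError("segment out of range")
    # Everything up to the end of the segment, the copied segment, the rest.
    return word[:end] + word[start:end] + word[end:]


# The slide's example: the segment GAT of ACGATGC is duplicated in place.
print(tandem_duplicate("ACGATGC", 2, 3))
```

The copy always lands directly next to its source, which is what separates a tandem duplication from an ordinary insertion.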

  4. Problem Classification • By duplication length: 1. Fixed-length duplications, 2. Variable-length duplications • By number of errors: 1.1 and 2.1 Bounded, 1.2 and 2.2 Unbounded • Tandem duplication example, given A C G A G C A T: fixed length yields A C G A C G A G C A G C A T; variable length yields A A C G C G A G C A G C A T T • We focus on the worst-case scenario!

  5. Notation • Given an alphabet and an integer • Irreducible word: 012 • Descendants in the descendant cone of 012 (figure): 0012, 012012, 01122, 01212012, 00112012212012
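A word is irreducible when it contains no tandem repeat within the allowed duplication length. The check below is a minimal sketch of that definition. The parameter `k` (maximum duplication length) and both function names are assumptions made for illustration, since the slide's own symbols were lost in the transcript.

```python
def has_tandem_repeat(word: str, length: int) -> bool:
    """True if some block of the given length is immediately repeated."""
    return any(
        word[i:i + length] == word[i + length:i + 2 * length]
        for i in range(len(word) - 2 * length + 1)
    )


def is_irreducible(word: str, k: int) -> bool:
    """True if word has no tandem repeat of any length from 1 to k."""
    return not any(has_tandem_repeat(word, length) for length in range(1, k + 1))


print(is_irreducible("012", 3))     # the irreducible word from the slide
print(is_irreducible("0012", 3))    # contains the repeat 00
print(is_irreducible("012012", 3))  # contains the repeat 012012
```

Irreducibility depends on the bound: `012012` fails the check for `k = 3` but passes for `k = 2`, because its only repeat has length three.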

  6. Problem Formulation • Goal: Given , construct a code such that “For all ” • Figure: 0122 and 0112 both have 01122 as a descendant • Previous Works: • Optimal codes are found when (Jain et al. 2017) • A method to construct codes when is provided (Jain et al. 2017) • Main idea: using “irreducible words” • There is no known result when
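The words 0122, 0112 and 01122 on this slide suggest why two words cannot always be codewords together: one duplication can turn either into the same string. The sketch below enumerates all words reachable by a single tandem duplication and intersects the two sets. The function name and the bound `k = 3` are assumptions chosen for illustration.

```python
def one_step_descendants(word: str, k: int) -> set:
    """All words obtained from word by one tandem duplication of length <= k."""
    result = set()
    for length in range(1, k + 1):
        for start in range(len(word) - length + 1):
            end = start + length
            result.add(word[:end] + word[start:end] + word[end:])
    return result


shared = one_step_descendants("0122", 3) & one_step_descendants("0112", 3)
print(shared)
```

A receiver that sees a word in `shared` cannot tell which of the two inputs was sent, so a duplication-correcting code may contain at most one of them.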

  7. Previous Work • The code is optimal • For different irreducible words generate different descendants! • Figure: descendant cones labelled a, b, c, d, drawn for the irreducible words 0120, 0121, 1201, 1210

  8. Previous Work • Figure: cones of the irreducible words 0120, 0121, 1201, 1210, with points labelled a, b, c, d and A, C, D • For we can choose more than one codeword in each cone! • Irreducible words form an “almost optimal” code!

  9. Our Main Results • Detailed analysis of codes constructed from irreducible words when ; such codes are denoted by • Provide an explicit formula to compute the size and asymptotic rate • Provide an upper bound for the optimal code and hence conclude that is almost optimal • Linear-time encoder for • The extension of this encoder provides the first known encoder for previously constructed codes • Publication: IEEE International Symposium on Information Theory 2018.

  10. Encoder of for (Sketched idea) • Pipeline: Input x → encoder → irreducible word y → Duplication channel → y' → Decoder (Error-decoder, then Irr-decoder) → output x • To achieve encoding rates at least optimal rate, we only require • For we define the neighbours of • Figure: x … Irr Irr Irr … y
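In the pipeline on this slide, the error-decoder must recover the irreducible word y from the channel output y'. One way to picture that step is to delete repeated copies until no tandem repeat remains. The sketch below does this naively. The name `root`, the parameter `k`, and the greedy scan order are assumptions for illustration, and the sketch presumes the setting in which the resulting irreducible word does not depend on the order of deletions. It is not the linear-time procedure the slides refer to.

```python
def root(word: str, k: int) -> str:
    """Delete one copy of a tandem repeat of length <= k until none is left."""
    changed = True
    while changed:
        changed = False
        for length in range(1, k + 1):
            for i in range(len(word) - 2 * length + 1):
                if word[i:i + length] == word[i + length:i + 2 * length]:
                    # Keep the first copy and drop the second one.
                    word = word[:i + length] + word[i + 2 * length:]
                    changed = True
                    break
            if changed:
                break  # rescan the shortened word from the beginning
    return word


print(root("ACGATGATGC", 3))      # undoes the duplication of GAT
print(root("00112012212012", 3))  # collapses back to a short irreducible word
```

Each pass rescans from the start, so this version can take quadratic time or worse. It illustrates what the error-decoder computes, not how to compute it efficiently.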

  11. Example • Figure: the word 20101 (symbols 1, 0, 1, 0, 2) is shown with the length-three blocks 010, 212, 120, 210 and the resulting word 212010210120120

  12. Recent Work: Special attention on • The GC-content of a DNA string refers to the number of nucleotides that are G or C. DNA strings whose GC-content is too high or too low are more prone to both synthesis and sequencing errors. Many recent works use DNA strings whose GC-content is close to 50% or exactly 50%. This is referred to as the “GC-balanced constraint”. • Our updated encoder: irreducible and GC-balanced • Figure examples: ATGCTACG (irreducible and GC-balanced), ATACTA (irreducible, not GC-balanced), AAAA (neither)
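The GC-balanced constraint is easy to check directly. The following sketch counts G and C symbols and tests for exactly 50%. The function names are hypothetical, and the three strings are the ones that appear on the slide.

```python
def gc_content(dna: str) -> int:
    """Number of symbols in dna that are G or C."""
    return sum(1 for base in dna if base in "GC")


def is_gc_balanced(dna: str) -> bool:
    """True if exactly half of the symbols are G or C."""
    return 2 * gc_content(dna) == len(dna)


for s in ("ATGCTACG", "ATACTA", "AAAA"):
    print(s, gc_content(s), is_gc_balanced(s))
```

Comparing `2 * gc_content(dna)` with the length avoids floating-point division and rejects odd-length strings automatically.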

  13. Knuth Balancing Method • Input: 0 1 0 0 0 1 0 0 • Flip the first four bits: 1 0 1 1 0 1 0 0 • Output codeword: 1 0 1 1 0 1 0 0 • Redundancy: (to encode the index t, plus a look-up table) • Modified Knuth Method (Irreducible + GC-balanced) • Input: A T C A T G A T • Flip the first four symbols: G C T G T G A T • Output codeword: G C T G T G A T • Redundancy: (linear-time encoding, no look-up table needed)
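The classical Knuth idea on this slide is to flip a prefix of the input until the word is balanced, then transmit the prefix length t as redundancy. The sketch below searches for the smallest such t by brute force. The function names are hypothetical, and the DNA symbol map (A↔G, T↔C) is read off the slide's single example rather than stated by the authors. The sketch does not enforce irreducibility, which the modified method additionally requires.

```python
def flip_prefix(word: str, t: int, mapping: dict) -> str:
    """Replace each of the first t symbols by its image under mapping."""
    return "".join(mapping[s] for s in word[:t]) + word[t:]


def knuth_balance(bits: str):
    """Return (t, codeword) for the smallest t whose prefix flip balances bits."""
    if len(bits) % 2:
        raise ValueError("length must be even")
    flip = {"0": "1", "1": "0"}
    for t in range(len(bits) + 1):
        candidate = flip_prefix(bits, t, flip)
        if 2 * candidate.count("1") == len(candidate):
            return t, candidate
    raise AssertionError("a balancing prefix exists for every even-length word")


print(knuth_balance("01000100"))

# Symbol map inferred from the slide's example (an assumption).
dna_flip = {"A": "G", "G": "A", "T": "C", "C": "T"}
print(flip_prefix("ATCATGAT", 4, dna_flip))
```

A balancing index always exists because each extra flipped bit changes the number of ones by exactly one, and flipping the whole word mirrors the imbalance. Decoding flips the same prefix back once t is known.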

  14. Recent Work: Design codes when • The size of our code is at least • In terms of rate:

  15. Summary • Goal: “Given , construct the largest code where each codeword is of length over -ary alphabet that can correct unbounded tandem duplications of length at most .” • Previous Works: • Optimal codes when (Jain et al. 2017) • A method to construct codes when • Our work: • Provide upper bound and lower bound for codes when • Linear-time encoder for known TD codes ( ) • (IEEE ISIT 2018) • Linear-time encoder for TD GC-balanced code • A method to construct codes when • Further work: • Find optimal codes when • Reduce the redundancy of the encoder for TD GC-balanced code • Design better codes when
