Shortest Superstring (SS)

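Let s be a shortest superstring in which the input strings appear in the order s1, s2, s3, s4, s5. Write overlap(s_i, s_j) for the longest suffix of s_i that is also a prefix of s_j, and pref(s_i, s_j) for the part of s_i that precedes this overlap. Then

s = pref(s1, s2) + pref(s2, s3) + pref(s3, s4) + pref(s4, s5) + s5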
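The decomposition above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the helper names overlap, pref and superstring are made up here, and overlaps are restricted to proper ones, which is harmless when no input string is a substring of another. The function builds the superstring for a given order; it does not search for the best order.

```python
def overlap(a: str, b: str) -> str:
    """Longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return b[:k]
    return ""


def pref(a: str, b: str) -> str:
    """Part of a that precedes its maximum overlap with b."""
    return a[: len(a) - len(overlap(a, b))]


def superstring(order: list[str]) -> str:
    """pref(s1, s2) + pref(s2, s3) + ... + last string, for the given order."""
    parts = [pref(a, b) for a, b in zip(order, order[1:])]
    return "".join(parts) + order[-1]


example = superstring(["abcd", "cdef", "efgh"])
print(example)
```

Every input string occurs in the result because each string equals its pref followed by an overlap that is a prefix of the next string.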

SS rewritten

Split the last string as s5 = pref(s5, s1) + overlap(s5, s1). The superstring

s = pref(s1, s2) + pref(s2, s3) + pref(s3, s4) + pref(s4, s5) + s5

then becomes

s = pref(s1, s2) + pref(s2, s3) + pref(s3, s4) + pref(s4, s5) + pref(s5, s1) + overlap(s5, s1)

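TSP ≈ SS

TSP on a digraph with vertices s_i and distances |pref(s_i, s_j)|. For the order s1, …, s5:

TSP = pref(s1, s2) + pref(s2, s3) + pref(s3, s4) + pref(s4, s5) + pref(s5, s1)
SS = pref(s1, s2) + pref(s2, s3) + pref(s3, s4) + pref(s4, s5) + pref(s5, s1) + overlap(s5, s1)

So the tour is shorter than the superstring by exactly the closing overlap, and the optimal tour costs at most OPT.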
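To make the relation between the tour and the superstring concrete, here is a small sketch that works with lengths only. The function names (overlap_len, pref_len, tour_weight, superstring_length) are illustrative assumptions, and overlaps are taken to be proper.

```python
def overlap_len(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def pref_len(a: str, b: str) -> int:
    """Edge weight |pref(a, b)| in the prefix digraph."""
    return len(a) - overlap_len(a, b)


def tour_weight(order: list[str]) -> int:
    """Weight of the closed tour s1 -> s2 -> ... -> sn -> s1."""
    n = len(order)
    return sum(pref_len(order[i], order[(i + 1) % n]) for i in range(n))


def superstring_length(order: list[str]) -> int:
    """Tour weight plus the closing overlap of the last string with the first."""
    return tour_weight(order) + overlap_len(order[-1], order[0])


order = ["abcd", "cdef", "efab"]
print(tour_weight(order), superstring_length(order))
```

For this order the superstring is "abcdefab", whose length equals the tour weight plus the overlap "ab" between the last and the first string.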

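Approximate TSP

Replace the tour by a minimum cycle cover of the digraph, e.g. two cycles:

CC1 = pref(s1, s2) + pref(s2, s3) + pref(s3, s1)
CC2 = pref(s4, s5) + pref(s5, s4)

Open each cycle into a string by appending its closing overlap, overlap(s3, s1) for CC1 and overlap(s5, s4) for CC2, and concatenate:

approx SS = pref(s1, s2) + pref(s2, s3) + pref(s3, s1) + overlap(s3, s1) + pref(s4, s5) + pref(s5, s4) + overlap(s5, s4)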
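The cycle cover construction can be tried out on tiny inputs with the sketch below. It is an illustration under stated assumptions, not the method from the slides: the minimum cycle cover is found by brute force over all permutations (exponential; an assignment-problem solver would be used in practice), self-loops are allowed, overlaps are proper, and all function names are made up for this example.

```python
from itertools import permutations


def overlap_len(a: str, b: str) -> int:
    """Length of the longest proper suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def min_cycle_cover(strings):
    """Brute force: a cycle cover is a permutation i -> perm[i]."""
    n = len(strings)
    best_weight, best_perm = None, None
    for perm in permutations(range(n)):
        weight = sum(
            len(strings[i]) - overlap_len(strings[i], strings[perm[i]])
            for i in range(n)
        )
        if best_weight is None or weight < best_weight:
            best_weight, best_perm = weight, perm
    return best_weight, best_perm


def cycles_of(perm):
    """Split a permutation into its cycles, each starting at its smallest index."""
    seen, cycles = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cycle, i = [], start
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = perm[i]
        cycles.append(cycle)
    return cycles


def open_cycle(strings, cycle):
    """Prefixes along the cycle, then the whole last string (pref + closing overlap)."""
    parts = [
        strings[a][: len(strings[a]) - overlap_len(strings[a], strings[b])]
        for a, b in zip(cycle, cycle[1:])
    ]
    return "".join(parts) + strings[cycle[-1]]


def approx_superstring(strings):
    """Concatenate the opened cycles of a minimum cycle cover."""
    _, perm = min_cycle_cover(strings)
    return "".join(open_cycle(strings, c) for c in cycles_of(perm))


strings = ["abab", "baba", "xyz"]
weight, perm = min_cycle_cover(strings)
result = approx_superstring(strings)
print(weight, result)
```

Here the cover consists of the cycle abab → baba → abab (weight 2) and a self-loop on xyz (weight 3); opening both cycles and concatenating gives a superstring of all three inputs.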

Estimating the error size

How big is overlap(s3, s1) compared to CC? Pretty big in the worst case. E.g.

s1 = abcabcabc
s2 = bcabcabca
s3 = cabcabcab

pref(s1, s2) = a
pref(s2, s3) = b
pref(s3, s1) = c

CC = “abc”, but the opened cycle is CC + overlap(s3, s1) = “abc” + “abcabcab”: a cycle of weight 3 with a closing overlap of length 8.
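The numbers in this example can be checked directly. The helpers overlap and pref below are illustrative names, assuming proper overlaps as before.

```python
def overlap(a: str, b: str) -> str:
    """Longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return b[:k]
    return ""


def pref(a: str, b: str) -> str:
    """Part of a that precedes its maximum overlap with b."""
    return a[: len(a) - len(overlap(a, b))]


s1, s2, s3 = "abcabcabc", "bcabcabca", "cabcabcab"

cycle = pref(s1, s2) + pref(s2, s3) + pref(s3, s1)  # weight of the cycle
closing = overlap(s3, s1)                           # the error term
print(cycle, closing, cycle + closing)
```

The cycle contributes only 3 characters while the closing overlap contributes 8, so the overlap cannot be bounded by the weight of its own cycle alone.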

Three upper bounds for overlap

• A trivial one: overlap(s3, s1) ≤ |s1|
• A semi-trivial one: let r_1, r_2, …, r_k be the first strings of the cycles in the cover, in the order in which they appear in OPT. Then Σ |r_i| ≤ OPT + Σ overlap(r_i, r_{i+1}), since Σ (|r_i| - overlap(r_i, r_{i+1})) ≤ OPT
• A clever one: overlap(r_i, r_{i+1}) ≤ CC_i + CC_{i+1}

The clever bound

overlap(r_i, r_{i+1}) ≤ CC_i + CC_{i+1}

If |r_i| ≤ CC_i, the bound follows trivially, since overlap(r_i, r_{i+1}) ≤ |r_i| (similarly, if |r_{i+1}| ≤ CC_{i+1} there is nothing to show).

Else, r_i is longer than CC_i. Huh? This can only happen if r_i is periodic, since r_i is fully contained in CC_i: going around the cycle spells out r_i, so r_i repeats with period CC_i (similarly, r_{i+1} is periodic too).
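One way to read "fully contained in CC_i" is that r_i is a prefix of the cycle string repeated over and over. The sketch below checks exactly that; the function name is_prefix_of_power is made up for this illustration, and the reading itself is an assumption about what the slide means.

```python
def is_prefix_of_power(s: str, period: str) -> bool:
    """True if s is a prefix of period repeated often enough to cover s."""
    reps = len(s) // len(period) + 1
    return (period * reps).startswith(s)


# r = abcabcabc is longer than its cycle string "abc" and repeats it.
print(is_prefix_of_power("abcabcabc", "abc"))
# A string that breaks the pattern is not periodic with this period.
print(is_prefix_of_power("abcabd", "abc"))
```

With the strings from the worst-case example, the representative abcabcabc has length 9 but its cycle has weight 3, which is possible only because it is "abc" repeated.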

The clever bound

Now, by way of contradiction, assume we have two periodic strings r_i, r_{i+1} such that

overlap(r_i, r_{i+1}) ≥ CC_i + CC_{i+1}

I.e. we have two periodic strings r_i, r_{i+1}, each containing its cycle, with a long overlap.

Intuitive idea: if two periodic things overlap for long enough, they must be contained in each other modulo shifts. If so, it is not hard to see that CC_i covers every string in CC_{i+1}, and hence the two cycles can be merged at cost CC_i, which contradicts the fact that we had a minimum cycle cover.

The intuition is right; details are in the book.

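• A trivial one: overlap(s3, s1) ≤ |s1|
• Σ |r_i| ≤ OPT + Σ overlap(r_i, r_{i+1})
• overlap(r_i, r_{i+1}) ≤ CC_i + CC_{i+1}

approx SS = pref(s1, s2) + pref(s2, s3) + pref(s3, s1) + overlap(s3, s1) + pref(s4, s5) + pref(s5, s4) + overlap(s5, s4)
  ≤ OPT + overlap(s3, s1) + overlap(s5, s4)   (the minimum cycle cover weighs at most the optimal tour, which is at most OPT)
  ≤ OPT + |s1| + |s4|   (trivial bound)
  ≤ OPT + Σ |r_i|   (s1 and s4 are the first strings r_i of the two cycles)
  ≤ OPT + OPT + Σ overlap(r_i, r_{i+1})   (semi-trivial bound)
  ≤ OPT + OPT + Σ (CC_i + CC_{i+1})   (clever bound)
  ≤ OPT + OPT + OPT + OPT   (each cycle is counted at most twice, and Σ CC_i ≤ OPT)
  = 4 OPT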
