Group Testing and Coding Theory. Or, A Theoretical Computer Scientist’s (Biased) View of Group Testing. Atri Rudra, University at Buffalo, SUNY.
Group testing overview: Test a soldier for a disease. WWII example: syphilis.
Group testing overview: Test an army for a disease. WWII example: syphilis. What if only one soldier has the disease? Can we do better?
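Communicating with my 2 year old: • “Code” C: “Akash English” • C(x) is a “codeword” • Diagram: x is encoded as C(x); the listener receives y = C(x) + error and either recovers x or gives up.
The setup: • Mapping C: error-correcting code, or just code • Encoding: x → C(x) • Decoding: y → x • C(x) is a codeword • Diagram: x is encoded as C(x); the decoder receives y = C(x) + error and either outputs x or gives up.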
The fundamental tradeoff: Correct as many errors as possible with as little redundancy as possible. Can one achieve the “optimal” tradeoff with efficient encoding and decoding?
The main message Coding Theory Group Testing
Asymptotic view: n!, 10n^2, n^2
O() notation: ≤ is O with glasses. poly(n) is O(n^c) for some fixed c.
Group testing overview: Test an army for a disease. WWII example: syphilis. What if only one soldier has the disease? Can pool blood samples and check if at least one soldier has the disease.
Group testing: Tons of applications. Set of items: (unknown) vector x in {0,1}^n. At most d positives: |x| ≤ d. Tests: each test is a subset S of {1,..,n}. Result of a test: OR of the x_i such that i is in S. Non-adaptive tests: all tests are fixed a priori. Goal 1: Figure out x (output the + items). Goal 2: Minimize the number of tests t; t = O(d^2 log n) is possible. [Figure: t × n binary test matrix, columns indexed by items 1..n, rows by tests 1..t]
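The model on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper name `run_tests`, the toy matrix `M`, and the sizes are made up here and are not from the talk.

```python
def run_tests(M, x):
    """Return the result vector r: r[j] is the OR of x[i] over the items i in test j."""
    return [int(any(in_test and x_i for in_test, x_i in zip(row, x))) for row in M]


# Hypothetical toy instance: n = 5 items, t = 3 non-adaptive tests.
# Row j of M is the indicator vector of test j; all rows are fixed up front.
M = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
]
x = [0, 0, 0, 1, 0]  # the unknown vector, with |x| <= d = 1

r = run_tests(M, x)
print(r)  # tests 2 and 3 contain the positive item
```

The decoder only ever sees `M` and `r`; recovering `x` from them is Goal 1.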
The decoding step: The t × n test matrix is to be designed, the vector x = (x_1, …, x_n) is unknown, and the result vector r = (r_1, …, r_t) is observed. How fast can this step (recovering x from r) be done?
An application: heavy hitters. Stream items are numbers in the range {1,…,n}. Output all items that occur at least a 1/d fraction of the time. One pass, poly log space, poly log update time, poly log report time.
Cormode-Muthukrishnan idea: Use group testing: maintain a counter for each test. Heavy tail property: total frequency of non-heavy items < 1/d. Maintain the total count m. Maintain the count c_i of items in test i. r_i = 1 iff c_i ≥ m/d. x_j = 1 iff j is a heavy item (|x| ≤ d). Then r = M × x, so reporting the heavy items is just decoding! [Figure: t × n test matrix M with one counter c_1, …, c_t per row]
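A minimal sketch of the counter idea, assuming a toy stream that satisfies the heavy tail property. The function name `heavy_hitter_results`, the matrix `M`, the stream, and the 0-based item numbering are all assumptions made for illustration. The update loop touches every test, so it does not attempt the poly log update time from the previous slide.

```python
from collections import Counter


def heavy_hitter_results(M, stream, d):
    """One pass over the stream: keep one counter per test, then threshold at m/d."""
    counters = [0] * len(M)
    m = 0
    for item in stream:
        m += 1
        for j, row in enumerate(M):
            if row[item]:
                counters[j] += 1
    # r_j = 1 iff c_j >= m/d, written as c_j * d >= m to stay in integers
    return [int(c * d >= m) for c in counters]


# Hypothetical instance: n = 6 items (numbered 0..5), t = 5 tests, d = 3.
M = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
]
d = 3
stream = [0] * 5 + [3] * 5 + [1, 4]  # m = 12; items 0 and 3 are heavy; tail weight 2 < m/d

r = heavy_hitter_results(M, stream, d)

# Ground truth by exact counting, only to check that r equals M x (Boolean OR).
counts = Counter(stream)
x = [int(counts[j] * d >= len(stream)) for j in range(len(M[0]))]
expected = [int(any(a and b for a, b in zip(row, x))) for row in M]
print(r, expected)
```

Because the non-heavy items together occur fewer than m/d times, a counter crosses the threshold exactly when its test contains a heavy item, which is why the two vectors agree.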
Requirements from group testing: Non-adaptiveness is crucial. Minimize t (space). Strongly explicit matrix. Minimize decoding time (report time).
An overview of results
# tests (t) | Decoding time | Reference
O(d^2 log n) | O(nt) | [DR82], [PR08]
O(d^4 log n) | O(t) | [GI04]
O(d^2 log^2 n) | poly(t) | [GI04, implicit]
O(d^2 log n) | poly(t) | [INR10, NPR11]
Big savings when d is O(log n): decoding in O(t) or poly(t) time instead of O(nt).
Tackling the first row
# tests (t) | Decoding time | Reference
O(d^2 log n) | O(nt) | [DR82], [PR08]
O(d^4 log n) | O(t) | [GI04]
O(d^2 log^2 n) | poly(t) | [GI04, implicit]
O(d^2 log n) | poly(t) | [INR10, NPR11]
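d-disjunct matrices: A sufficient condition for group testing. For every subset of d columns and every column disjoint from it, there exists a row where the disjoint column has a 1 and all d columns have a 0. If the d columns are the set of positives, that test has result 0, so every non-positive column has at least one test with result 0.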
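The condition can be checked by brute force on small matrices. The sketch below is only meant to make the definition concrete: the function name `is_d_disjunct` and both toy matrices are assumptions, and the running time is exponential in d.

```python
from itertools import combinations


def is_d_disjunct(M, d):
    """Check that for every d columns S and every other column c,
    some row has a 1 in column c and 0 in every column of S."""
    t, n = len(M), len(M[0])
    for S in combinations(range(n), d):
        for c in range(n):
            if c in S:
                continue
            witness = any(
                M[j][c] and not any(M[j][i] for i in S) for j in range(t)
            )
            if not witness:
                return False
    return True


identity = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
# Here the middle column (1,1) covers the first column (1,0).
nested = [
    [1, 1, 0],
    [0, 1, 1],
]

print(is_d_disjunct(identity, 2))
print(is_d_disjunct(nested, 1))
```

The identity matrix is trivially disjunct but uses t = n tests; the point of the constructions later in the talk is to get the same property with far fewer rows.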
Naïve decoder for d-disjunct matrices: If r_j = 0 then for every column i that is in test j, set x_i = 0. If x_i = 1 then all tests column i participates in will have a 1. Running time: O(nt) over all n columns, or O(Lt) when only L candidate columns need to be checked. [Figure: test matrix with the d columns of the set of positives marked]
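The naïve decoder can be sketched as follows. The name `naive_decode`, the optional `candidates` argument, and the toy matrix (its columns are the six 2-element subsets of 4 tests, which makes it 1-disjunct) are assumptions for illustration.

```python
def naive_decode(M, r, candidates=None):
    """Keep column i iff every test that contains i came back positive.

    Checks all n columns (O(nt) time) by default, or only the given
    candidate columns (O(Lt) time for L candidates).
    """
    t, n = len(M), len(M[0])
    columns = range(n) if candidates is None else candidates
    return [i for i in columns if all(r[j] for j in range(t) if M[j][i])]


# Hypothetical 1-disjunct matrix: t = 4 tests, n = 6 items (numbered 0..5).
M = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
r = [0, 1, 1, 0]  # results when item 3 is the only positive

print(naive_decode(M, r))
print(naive_decode(M, r, candidates=[3, 4]))
```

Every negative item sits in some test with result 0 (that is what disjunctness guarantees), so only the positive item survives.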
What is known: Strongly explicit d-disjunct matrix with t = O(d^2 log^2 n) [Kautz-Singleton 1964]. Randomized d-disjunct matrix with t = O(d^2 log n) [Dyachkov-Rykov 1982]. Deterministic d-disjunct matrix with t = O(d^2 log n) [Porat-Rothschild 2008]. Lower bound of Ω(d^2 log n/log d) [Dyachkov-Rykov 1982]. Decoding a d-disjunct matrix to find the set of positives takes O(nt) time.
Up next
# tests (t) | Decoding time | Reference
O(d^2 log n) | O(nt) | [DR82], [PR08]
O(d^4 log n) | O(t) | [GI04]
O(d^2 log^2 n) | poly(t) | [GI04, implicit]
O(d^2 log n) | poly(t) | [INR10, NPR11]
Error-correcting codes: • Mapping C: Σ^k → Σ^m • Dimension k, blocklength m • m ≥ k • Rate R = k/m ≤ 1 • Decoding time complexity: efficient means polynomial in m • Diagram: x is encoded as C(x); the decoder receives y and either outputs x or gives up.
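Noise model: Errors are worst case (Hamming): arbitrary error locations and arbitrary symbol changes. Limit on the total number of errors.
Hamming’s 60 yr old observation: Large “distance” is good. If every two codewords are at distance ≥ D, then fewer than D/2 errors can be corrected uniquely.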
All you need to remember about Reed-Solomon codes, Part I: q is a prime power. There are q^(q/(d+1)) vectors from [q]^q where every two agree in < q/(d+1) positions.
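A small sketch of where these vectors come from, assuming q is prime so that arithmetic mod q is a field (the slide allows any prime power). The codewords are the evaluations of all polynomials with k coefficients at every point of Z_q. The names `rs_codewords` and `max_agreement` are made up here, and `k` plays the role of q/(d+1) up to rounding.

```python
from itertools import product


def rs_codewords(q, k):
    """All q**k Reed-Solomon codewords of length q over Z_q (q prime):
    evaluations of every polynomial of degree < k at the points 0..q-1."""
    words = []
    for coeffs in product(range(q), repeat=k):
        word = tuple(
            sum(c * pow(a, e, q) for e, c in enumerate(coeffs)) % q
            for a in range(q)
        )
        words.append(word)
    return words


def max_agreement(words):
    """Largest number of positions in which two distinct codewords agree."""
    return max(
        sum(a == b for a, b in zip(u, v))
        for i, u in enumerate(words)
        for v in words[i + 1:]
    )


words = rs_codewords(5, 2)
print(len(words), max_agreement(words))
```

Two distinct polynomials of degree < k agree on at most k - 1 points, which is the agreement bound the slide asks you to remember.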
Concatenation of codes [Forney 66]: How do we get binary codes? Outer code C1: ({0,1}^k)^K → ({0,1}^k)^M. Inner code C2: {0,1}^k → {0,1}^m. Concatenated code C1 ∘ C2: {0,1}^(kK) → {0,1}^(mM). Typically k = O(log M). Encoding: x = (x_1, x_2, …, x_K) is mapped to C1(x) = (w_1, w_2, …, w_M), and then to C1 ∘ C2(x) = (C2(w_1), C2(w_2), …, C2(w_M)).
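Disjunct matrices from RS codes [Kautz, Singleton]: Column i gets the ith codeword, so n = q^(q/(d+1)). Code concatenation: each symbol x in [q] is replaced by the length-q binary vector with a single 1 in position x. The result is a d-disjunct matrix with t = q × q = q^2 = O(d^2 log^2 n) rows.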
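The construction can be sketched end to end for small parameters. Assumptions: q is prime, `k` is the number of polynomial coefficients (standing in for q/(d+1) up to rounding), and the function names are hypothetical. The brute-force disjunctness check is only there to confirm the property on a toy instance.

```python
from itertools import combinations, product


def rs_codewords(q, k):
    """Evaluations over Z_q (q prime) of all polynomials of degree < k."""
    return [
        tuple(sum(c * pow(a, e, q) for e, c in enumerate(coeffs)) % q
              for a in range(q))
        for coeffs in product(range(q), repeat=k)
    ]


def kautz_singleton(q, k):
    """q*q by q**k binary matrix: row (pos*q + sym) of column c is 1
    iff codeword c has symbol sym at position pos."""
    words = rs_codewords(q, k)
    M = [[0] * len(words) for _ in range(q * q)]
    for col, word in enumerate(words):
        for pos, sym in enumerate(word):
            M[pos * q + sym][col] = 1
    return M


def is_d_disjunct(M, d):
    """Brute force: no column's support is covered by the union of d others."""
    n = len(M[0])
    supports = [{j for j, row in enumerate(M) if row[c]} for c in range(n)]
    for S in combinations(range(n), d):
        covered = set().union(*(supports[i] for i in S))
        if any(c not in S and supports[c] <= covered for c in range(n)):
            return False
    return True


M = kautz_singleton(5, 2)
print(len(M), len(M[0]))     # t = q^2 rows, n = q^k columns
print(is_d_disjunct(M, 2))
```

With q = 5 and k = 2, any two columns share at most one row, so d = 2 columns can cover at most 2 of the 5 ones in any other column.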
A q=3 example. [Figure: Reed-Solomon codewords over {0,1,2} written as columns, with each symbol replaced by a length-3 binary block containing a single 1]
1-Agreement between two columns: Agreement in binary = agreement among RS codewords < q/(d+1). In the q=3 example, two columns agree in ≤ 1 position. [Figure: two columns of the q=3 matrix and their binary expansions]
d-disjunct matrices: A sufficient condition for group testing. For every subset of d columns and every column disjoint from it, there exists a row where the disjoint column has a 1 and all d columns have a 0.
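d-disjunctness of Kautz-Singleton: Any other column agrees with each of the d columns in < q/(d+1) positions, so it agrees with their union in < q*d/(d+1) positions. That leaves > q - q*d/(d+1) > 0 rows where the other column has a 1 and all d columns have a 0.
Up next
# tests (t) | Decoding time | Reference
O(d^2 log n) | O(nt) | [DR82], [PR08]
O(d^4 log n) | O(t) | [GI04]
O(d^2 log^2 n) | poly(t) | [GI04, implicit]
O(d^2 log n) | poly(t) | [INR10, NPR11]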
The basic idea: Every column is a codeword. Show that recovering the unknown x from the observed r is the same as ‘decoding’ the code. n = # codewords = exp(m); t = poly(m). [Figure: t × n test matrix with unknown x = (x_1, …, x_n) and observed r = (r_1, …, r_t)]
Decoding: C(x) sent, y received; x in Σ^k, y in Σ^m; R = k/m. How much of y must be correct to recover x? At least k symbols must be correct, so at most (m-k)/m = 1-R fraction of errors. ρ: the fraction of errors the decoder can handle. 1-R is the information-theoretic limit: it implies ρ ≤ 1-R.
Can we get to the limit of 1-R? Not if we always want to uniquely recover the original message. Limit for unique decoding: ρ < (1-R)/2. [Figure: codewords c1 and c2 at distance 1-R, with a received word r at distance (1-R)/2 from each]
List decoding [Elias57, Wozencraft58]: Always insisting on a unique codeword is restrictive. The “pathological” cases are rare: a “typical” received word can be decoded beyond (1-R)/2. This holds for almost all the space in higher dimension: all but an exponential (in m) fraction. Better error-recovery model: output a list of answers (list decoding). Example: spell checker.
Information theoretic limit: • ρ < 1 - R • Information-theoretic limit • Can handle twice as many errors as unique decoding. Achievable by random codes. NOT ALGORITHMIC! [Plot: fraction of errors (ρ) vs. rate (R), with curves for unique decoding and the information-theoretic limit]
Other applications of list decoding. Cryptography: cryptanalysis of certain block-ciphers [Jakobsen98]; efficient traitor tracing scheme [Silverberg, Staddon, Walker 03]. Complexity theory: hardcore predicates from one way functions [Goldreich, Levin 89; Impagliazzo 97; Ta-Shma, Zuckerman 01]; worst-case vs. average-case hardness [Cai, Pavan, Sivakumar 99; Goldreich, Ron, Sudan 99; Sudan, Trevisan, Vadhan 99; Impagliazzo, Jaiswal, Kabanets 06]. Other algorithmic applications: IP traceback [Dean, Franklin, Stubblefield 01; Savage, Wetherall, Karlin, Anderson 00]; guessing secrets [Alon, Guruswami, Kaufman, Sudan 02; Chung, Graham, Leighton 01].
Algorithmic list decoding results: ρ = 1 - R - ε for any ε > 0, achieved by folded RS codes [Guruswami, R. 06]. [Plot: fraction of errors (ρ) vs. rate (R), with curves for unique decoding, Guruswami-Sudan 98, Parvaresh-Vardy 05, folded RS, and the information-theoretic limit]
Concatenation of codes [Forney 66]: Concatenated codes • Brute force decoding for the inner code. Outer code C1: ({0,1}^k)^K → ({0,1}^k)^M. Inner code C2: {0,1}^k → {0,1}^m. Concatenated code C1 ∘ C2: {0,1}^(kK) → {0,1}^(mM). Typically k = O(log M). Encoding: x = (x_1, x_2, …, x_K) is mapped to C1(x) = (w_1, w_2, …, w_M), and then to C1 ∘ C2(x) = (C2(w_1), C2(w_2), …, C2(w_M)).
List decoding C1 ∘ C2: The received blocks y_1, y_2, …, y_M in {0,1}^m are decoded by the inner code into lists S_1, S_2, …, S_M of symbols in {0,1}^k. How do we “list decode” from lists?
List recovery: Input: lists S_1, S_2, S_3, …, S_M, where each S_i is a subset of [q] and |S_i| ≤ d. Output all codewords (c_1, c_2, c_3, …, c_M) that agree with (all) the input lists, i.e. c_i in S_i for every i.
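To make the problem statement concrete, here is a brute-force sketch that scans every codeword. It is not the efficient algorithm the talk relies on; the name `list_recover` and the toy code over {0,1,2} are assumptions.

```python
def list_recover(codewords, lists):
    """Return every codeword c with c[i] in lists[i] at every position i."""
    return [
        c for c in codewords
        if all(symbol in S for symbol, S in zip(c, lists))
    ]


# Hypothetical toy code of length 3 over the alphabet {0, 1, 2}.
code = [
    (0, 0, 0), (1, 1, 1), (2, 2, 2),
    (0, 1, 2), (1, 2, 0), (2, 0, 1),
]
lists = [{1, 2}, {2}, {0, 2}]  # |S_i| <= d = 2

print(list_recover(code, lists))
```

The scan takes time proportional to the number of codewords, which is exactly what a poly(q) list-recovery algorithm avoids.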
All you need to remember about (Reed-Solomon) codes, Part II: q is a prime power. There are q^(q/(d+1)) vectors from [q]^q where every two agree in < q/(d+1) positions. There is a poly(q) time algorithm for list recovery: given lists S_1, S_2, S_3, …, S_q, where each S_i is a subset of [q] and |S_i| ≤ d, output all codewords (c_1, c_2, c_3, …, c_q) that agree with all the input lists.
Back to the example. [Figure: the q=3 matrix with the + items marked and the result vector; each length-3 block of the result vector gives a list of symbols, here {1,2}, {2}, {0,2}]
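The step from result vector to lists to positives can be traced on a q = 3 instance. This is a sketch under assumptions: the column ordering, the 0-based numbering, and the choice of positive items are made up here (chosen so that the lists come out as {1,2}, {2}, {0,2}), and brute-force list recovery stands in for the poly(q) time algorithm.

```python
from itertools import product

q = 3  # codewords: evaluations of c0 + c1*a over Z_3 at a = 0, 1, 2

words = [
    tuple((c0 + c1 * a) % q for a in range(q))
    for c0, c1 in product(range(q), repeat=2)
]

# Kautz-Singleton matrix: row (pos*q + sym) of column c is 1
# iff codeword c has symbol sym at position pos.
M = [[int(w[pos] == sym) for w in words] for pos in range(q) for sym in range(q)]

positives = {4, 6}  # hypothetical set of + items, |x| <= d = 2
r = [int(any(row[i] for i in positives)) for row in M]

# Each length-q block of the result vector is the indicator of one list S_pos.
lists = [{sym for sym in range(q) if r[pos * q + sym]} for pos in range(q)]

# List recovery by brute force: columns whose codeword agrees with every list.
recovered = [
    i for i, w in enumerate(words)
    if all(w[pos] in lists[pos] for pos in range(q))
]

print(r)
print(lists)
print(recovered)
```

Group-testing decoding has become a coding question: find the codewords consistent with the lists read off the result vector.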
All you ever needed to know about (Reed-Solomon) codes… at least for this talk: q is a prime power. There are q^(q/(d+1)) vectors from [q]^q where every two agree in < q/(d+1) positions. There is a poly(q) time algorithm for list recovery: given lists S_1, S_2, S_3, …, S_q, where each S_i is a subset of [q] and |S_i| ≤ d, output all codewords (c_1, c_2, c_3, …, c_q) that agree with all the input lists.
What does this imply? Implicit in [Guruswami-Indyk 04]: with the KS matrix, t = O(d^2 log^2 n). List recovery on the results r_1, …, r_t narrows the columns down to d^2 candidates in poly(t) time; the naïve decoder then finds the set of positives (d columns) among them in O(d^2 t) time.
Up next
# tests (t) | Decoding time | Reference
O(d^2 log n) | O(nt) | [DR82], [PR08]
O(d^4 log n) | O(t) | [GI04]
O(d^2 log^2 n) | poly(t) | [GI04, implicit]
O(d^2 log n) | poly(t) | [INR10, NPR11]
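Filter-evaluate decoding paradigm: Filter: a “filtering” matrix with t’ tests and results y_1, …, y_t’ narrows the items down to L candidate columns in poly(t’) time. Evaluate: the naïve decoder on a d-disjunct matrix with t tests and results r_1, …, r_t finds the set of positives (d columns) among the L candidates in O(Lt) time.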
So all we need to do: o(d^2 log n/log d) tests [Indyk, Ngo, R. 10], [Ngo, Porat, R. 11].
Overview of the results
# tests (t) | Decoding time | Reference
O(d^2 log n) | O(nt) | [DR82], [PR08]
O(d^4 log n) | O(t) | [GI04]
O(d^2 log^2 n) | poly(t) | [GI04, implicit]
O(d^2 log n) | poly(t) | [INR10, NPR11]