Group Testing and Coding Theory Or, A Theoretical Computer Scientist’s (Biased) View of Group Testing Atri Rudra University at Buffalo, SUNY
Group testing overview Test a soldier for a disease WWII example: syphilis
Group testing overview Can we do better? Test an army for a disease WWII example: syphilis What if only one soldier has the disease?
Communicating with my 2 year old: the message x is encoded as C(x), what comes back is y = C(x) + error, and the listener either recovers x or gives up. • "Code" C • "Akash English" • C(x) is a "codeword"
The setup: the message x is encoded as C(x), what is received is y = C(x) + error, and the decoder outputs x or gives up. • Mapping C • Error-correcting code, or just code • Encoding: x → C(x) • Decoding: y → x • C(x) is a codeword
The fundamental tradeoff: correct as many errors as possible with as little redundancy as possible. Can one achieve the "optimal" tradeoff with efficient encoding and decoding?
The main message: Coding Theory ↔ Group Testing
Asymptotic view: n! vs. 10n^2 vs. n^2
O() notation: ≤ with glasses; poly(n) is O(n^c) for some fixed c
Group testing overview Can pool blood samples and check if at least one soldier has the disease Test an army for a disease WWII example: syphilis What if only one soldier has the disease?
Group testing (tons of applications)
• Set of items: (unknown) vector x in {0,1}^n
• At most d positives: |x| ≤ d
• Tests: a test is a subset S of {1,..,n}
• Result of a test: OR of the x_i's with i in S
• Non-adaptive tests: all tests are fixed a priori (a t x n 0/1 matrix, one row per test; the output marks the + items)
• Goal 1: figure out x
• Goal 2: minimize the number of tests t; t = O(d^2 log n) is possible
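To make the setup concrete, here is a minimal sketch (the matrix M, the helper name run_tests, and the toy binary-indexing matrix are illustrative choices, not from the talk): each test is a row of a fixed t x n 0/1 matrix, and its result is the OR of the x_i it contains.

```python
import numpy as np

def run_tests(M, x):
    """Non-adaptive group testing: the result of each test is the OR of the
    x_i over the items i included in that test (the rows of M applied to x)."""
    return (M @ x > 0).astype(int)   # "at least one positive in the test"

# Toy example: n = 8 items, at most d = 1 positive, t = 3 tests.
# Test j contains item i iff bit j of i is 1, so the results spell out the
# index of the unique positive in binary.
n, t = 8, 3
M = np.array([[(i >> j) & 1 for i in range(n)] for j in range(t)])
x = np.zeros(n, dtype=int)
x[5] = 1                     # item 5 is the only positive
print(run_tests(M, x))       # [1 0 1], the binary expansion of 5
```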
The decoding step: the observed result vector (r_1, ..., r_t) equals the t x n test matrix (to be designed) applied to the unknown vector (x_1, ..., x_n), with OR in place of addition. How fast can this step be done?
An application: heavy hitters. One pass, poly log space, poly log update time, poly log report time. Stream items are numbers in the range {1,…,n}. Output all items that occur at least a 1/d fraction of the time.
Cormode-Muthukrishnan idea: use group testing.
• Maintain a counter c_i for each test (the count of stream items falling in that test) and the total count m
• x_j = 1 iff j is a heavy item (so |x| ≤ d)
• Heavy tail property: total frequency of non-heavy items < 1/d
• Set r_i = 1 iff c_i ≥ m/d; then r = M × x
• Reporting the heavy items is just decoding!
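A hedged sketch of the idea just described, assuming some t x n test matrix M is given; the class name and methods are illustrative, not the authors' data structure. One counter per test plus a total counter are maintained, and reporting runs the group-testing decoder on the thresholded counters. For clarity the sketch stores M explicitly; a real streaming implementation would use a strongly explicit matrix and generate the needed column on the fly (cf. the requirements slide below).

```python
import numpy as np

class GroupTestingHeavyHitters:
    def __init__(self, M, d):
        self.M = M                                      # t x n 0/1 test matrix
        self.d = d                                      # heavy = frequency >= m/d
        self.counts = np.zeros(M.shape[0], dtype=int)   # c_1, ..., c_t
        self.m = 0                                      # total number of updates

    def update(self, item):
        """Process one stream element, assumed here to lie in {0, ..., n-1}."""
        self.m += 1
        self.counts += self.M[:, item]                  # bump every test containing item

    def report(self):
        """r_i = 1 iff c_i >= m/d; decoding r recovers the heavy items."""
        r = (self.counts >= self.m / self.d).astype(int)
        # Naive group-testing decoding, inlined: an item survives iff no test
        # containing it came back below the threshold.
        eliminated = self.M[r == 0].sum(axis=0) > 0
        return [i for i in range(self.M.shape[1]) if not eliminated[i]]
```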
Requirements from group testing:
• Non-adaptiveness is crucial
• Minimize t (space)
• Strongly explicit matrix
• Minimize decoding time (report time)
An overview of results (d is O(log n)):

# tests (t)        Decoding time
O(d^2 log n)       O(nt)                  [DR82], [PR08]
O(d^4 log n)       O(t)  (big savings)    [GI04]
O(d^2 log^2 n)     poly(t)                [GI04, implicit]
O(d^2 log n)       poly(t)                [INR10, NPR11]
Tackling the first row: t = O(d^2 log n) tests with O(nt) decoding time [DR82], [PR08].

# tests (t)        Decoding time
O(d^2 log n)       O(nt)       [DR82], [PR08]
O(d^4 log n)       O(t)        [GI04]
O(d^2 log^2 n)     poly(t)     [GI04, implicit]
O(d^2 log n)       poly(t)     [INR10, NPR11]
d-disjunct matrices: a sufficient condition for group testing. For every subset of d columns (the possible set of positives) and every column disjoint from that subset, there exists a row in which the disjoint column has a 1 and all d columns have a 0. Consequently every non-positive column appears in some test whose result is 0.
Naïve decoder for d-disjunct matrices: if r_j = 0 then for every column i that is in test j, set x_i = 0; declare the remaining columns positive. If x_i = 1 then all tests column i participates in have result 1, so no positive is ever eliminated, and d-disjunctness guarantees every non-positive is eliminated. Running time: O(nt), or O(Lt) when restricted to L candidate columns.
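A direct sketch of the naïve decoder just described (the function name and optional candidates parameter are illustrative): every test with result 0 eliminates all items it contains, and whatever survives is declared positive.

```python
import numpy as np

def naive_decode(M, r, candidates=None):
    """M: t x n 0/1 test matrix, r: length-t 0/1 result vector.
    Eliminates every item that appears in some test with result 0 and returns
    the rest. For a d-disjunct M and at most d positives this is exactly the
    positive set. Time O(nt), or O(Lt) when restricted to L candidate columns."""
    t, n = M.shape
    if candidates is None:
        candidates = range(n)
    zero_tests = [j for j in range(t) if r[j] == 0]
    return sorted(i for i in candidates
                  if all(M[j, i] == 0 for j in zero_tests))
```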
What is known:
• Strongly explicit d-disjunct matrix with t = O(d^2 log^2 n) [Kautz-Singleton 1964]
• Randomized d-disjunct matrix with t = O(d^2 log n) [Dyachkov-Rykov 1982]
• Deterministic d-disjunct matrix with t = O(d^2 log n) [Porat-Rothschild 2008]
• Lower bound of Ω(d^2 log n / log d) [Dyachkov-Rykov 1982]
In all cases the naïve decoder takes O(nt) time.
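The randomized O(d^2 log n) bound above can be illustrated by the standard random construction; the sketch below is an assumption-laden illustration (the constant 3 and the density 1/(d+1) are the usual textbook choices, and the matrix is d-disjunct only with high probability, which the code does not verify).

```python
import numpy as np

def random_disjunct_candidate(n, d, seed=0):
    """Each entry is 1 independently with probability 1/(d+1).
    With t ~ 3*(d+1)^2 * ln(n) rows, a union bound over all choices of one
    column plus d other columns shows the matrix is d-disjunct with
    probability tending to 1 (not checked here)."""
    rng = np.random.default_rng(seed)
    t = int(np.ceil(3 * (d + 1) ** 2 * np.log(n)))
    return (rng.random((t, n)) < 1.0 / (d + 1)).astype(int)
```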
Up next:

# tests (t)        Decoding time
O(d^2 log n)       O(nt)       [DR82], [PR08]
O(d^4 log n)       O(t)        [GI04]
O(d^2 log^2 n)     poly(t)     [GI04, implicit]
O(d^2 log n)       poly(t)     [INR10, NPR11]
Error-correcting codes • Mapping C : Σ^k → Σ^m • Dimension k, block length m, m ≥ k • Rate R = k/m ≤ 1 • Efficient means polynomial in m • Decoding time complexity: the time to recover x from y (or give up)
Noise model: errors are worst case (Hamming): error locations are arbitrary and symbol changes are arbitrary; only the total number of errors is limited.
Hamming's 60-year-old observation: if every two codewords differ in at least D positions, then any pattern of fewer than D/2 errors leaves the transmitted codeword as the unique closest one. Large "distance" is good.
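The observation in symbols, where Δ denotes Hamming distance (a restatement of the slide, not an addition):

```latex
\[
  \Delta(c_1, c_2) \ge D \ \text{ for all codewords } c_1 \neq c_2
  \quad\Longrightarrow\quad
  \text{any pattern of fewer than } D/2 \text{ errors is uniquely correctable.}
\]
```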
All you need to remember about Reed-Solomon codes – Part I: q is a prime power; there are q^{q/(d+1)} vectors from [q]^q where every two agree in fewer than q/(d+1) positions.
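A tiny sketch of the Reed-Solomon fact above, restricted to prime q so that arithmetic mod q is a field (the general prime-power case needs finite-field arithmetic): polynomials of degree below k = q/(d+1), evaluated at all q field elements, give q^{q/(d+1)} codewords, and any two distinct ones agree in fewer than k positions.

```python
from itertools import product

def rs_codewords(q, k):
    """All q^k Reed-Solomon codewords of length q over F_q (q prime):
    evaluations of degree-<k polynomials at the points 0, 1, ..., q-1."""
    for coeffs in product(range(q), repeat=k):
        yield tuple(sum(c * pow(a, i, q) for i, c in enumerate(coeffs)) % q
                    for a in range(q))

# Sanity check on a tiny field: with q = 5 and d = 1, k = q // (d + 1) = 2,
# and every two distinct codewords agree in fewer than 2 positions.
q, d = 5, 1
k = q // (d + 1)
words = list(rs_codewords(q, k))
assert len(words) == q ** k
assert all(sum(a == b for a, b in zip(u, v)) < k
           for i, u in enumerate(words) for v in words[i + 1:])
```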
Concatenation of codes [Forney 66]: how do we get binary codes?
• C1 : ({0,1}^k)^K → ({0,1}^k)^M (outer code)
• C2 : {0,1}^k → {0,1}^m (inner code)
• C1 ∘ C2 : {0,1}^{kK} → {0,1}^{mM}
• Typically k = O(log M)
Encoding: x = (x_1, ..., x_K) → C1(x) = (w_1, ..., w_M) → C1 ∘ C2(x) = (C2(w_1), C2(w_2), ..., C2(w_M)).
Disjunct matrices from RS codes [Kautz, Singleton]: column i of the matrix is the i-th codeword after code concatenation, where the inner code maps a symbol a in [q] to the length-q indicator vector with a single 1 in position a (q rows per RS position). This gives a d-disjunct matrix with n = q^{q/(d+1)} columns and t = q^2 = O(d^2 log^2 n) rows.
A q=3 example: each Reed-Solomon codeword over {0,1,2} of length 3 becomes a binary column of length 9 by replacing each symbol a with the length-3 indicator vector that has a single 1 in position a.
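A sketch of the Kautz-Singleton construction just illustrated, for prime q; the rs_codewords helper from the earlier sketch is repeated so the block runs on its own. Each RS codeword becomes one column, with every symbol a replaced by the length-q indicator vector e_a.

```python
import numpy as np
from itertools import product

def rs_codewords(q, k):          # repeated from the earlier sketch (q prime)
    for coeffs in product(range(q), repeat=k):
        yield tuple(sum(c * pow(a, i, q) for i, c in enumerate(coeffs)) % q
                    for a in range(q))

def kautz_singleton(q, d):
    """t x n 0/1 matrix with t = q^2 rows and n = q^(q//(d+1)) columns:
    column i is the i-th RS codeword with each symbol a in [q] replaced by
    the indicator vector that has a single 1 in position a."""
    k = q // (d + 1)
    columns = []
    for cw in rs_codewords(q, k):
        col = np.zeros((q, q), dtype=int)
        for position, symbol in enumerate(cw):
            col[position, symbol] = 1          # inner "identity" code
        columns.append(col.reshape(q * q))     # q blocks of q rows each
    return np.array(columns).T

M = kautz_singleton(5, 1)
print(M.shape)                                 # (25, 25): t = q^2 rows, n = q^2 columns here
```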
Agreement between two columns: agreement between two binary columns equals agreement between the corresponding RS codewords, which is < q/(d+1) (at most 1 position in the q=3 example).
d-disjunct matrices (recap): a sufficient condition for group testing. For every subset of d columns and every column disjoint from that subset, there exists a row in which the disjoint column has a 1 and all d columns have a 0.
d-disjunctness of Kautz-Singleton: the disjoint column has a 1 in each of its q blocks (one per RS position), while each of the d positive columns agrees with it in fewer than q/(d+1) of those blocks. So the number of rows with a 1 in the disjoint column and a 0 in all d positives is more than q − d·q/(d+1) > 0.
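The same counting as a one-line display:

```latex
\[
  \#\{\text{rows with a 1 in the disjoint column and a 0 in all $d$ positives}\}
  \;>\; q - d\cdot\frac{q}{d+1} \;=\; \frac{q}{d+1} \;>\; 0 .
\]
```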
Up next:

# tests (t)        Decoding time
O(d^2 log n)       O(nt)       [DR82], [PR08]
O(d^4 log n)       O(t)        [GI04]
O(d^2 log^2 n)     poly(t)     [GI04, implicit]
O(d^2 log n)       poly(t)     [INR10, NPR11]
The basic idea: every column of the t x n matrix is a codeword, so recovering the unknown x from the observed result vector (r_1, ..., r_t) is the same as `decoding' the code. Since n = # codewords = exp(m) while t = poly(m), decoding the code in time polynomial in the codeword length gives poly(t) time instead of O(nt).
Decoding: C(x) is sent, y is received; x has k symbols and y has m. How much of y must be correct to recover x? At least k symbols must be correct, so at most an (m−k)/m = 1−R fraction of errors can be tolerated (R = k/m). 1−R is the information-theoretic limit: the fraction of errors a decoder can possibly handle.
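The arithmetic behind the limit, written out:

```latex
\[
  \#\text{errors} \;\le\; m - k
  \quad\Longrightarrow\quad
  \frac{\#\text{errors}}{m} \;\le\; \frac{m-k}{m} \;=\; 1 - R,
  \qquad R = \frac{k}{m}.
\]
```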
Can we get to the limit of 1−R? Not if we always want to uniquely recover the original message: two codewords c_1 and c_2 can be at (relative) distance 1−R, and a received word r halfway between them is (1−R)/2 from each, so it cannot be uniquely decoded. The limit for unique decoding is < (1−R)/2.
List decoding [Elias 57, Wozencraft 58]: a better error-recovery model. Always insisting on a unique codeword is restrictive; the "pathological" received words are rare — all but an exponentially small (in m) fraction, i.e. almost all of the space in high dimension. A "typical" received word can be decoded beyond (1−R)/2. Instead, output a list of answers. Example: a spell checker.
[Plot: fraction of errors (ρ) vs. rate (R), showing the unique-decoding bound (1−R)/2 and the information-theoretic limit 1−R.] List decoding can handle ρ < 1 − R, the information-theoretic limit — twice as many errors as unique decoding. Achievable by random codes. NOT ALGORITHMIC!
Other applications of list decoding
• Cryptography: cryptanalysis of certain block ciphers [Jakobsen 98]; efficient traitor tracing schemes [Silverberg, Staddon, Walker 03]
• Complexity theory: hardcore predicates from one-way functions [Goldreich, Levin 89; Impagliazzo 97; Ta-Shma, Zuckerman 01]; worst-case vs. average-case hardness [Cai, Pavan, Sivakumar 99; Goldreich, Ron, Sudan 99; Sudan, Trevisan, Vadhan 99; Impagliazzo, Jaiswal, Kabanets 06]
• Other algorithmic applications: IP traceback [Dean, Franklin, Stubblefield 01; Savage, Wetherall, Karlin, Anderson 00]; guessing secrets [Alon, Guruswami, Kaufman, Sudan 02; Chung, Graham, Leighton 01]
[Plot: fraction of errors (ρ) vs. rate (R), comparing unique decoding, the information-theoretic limit, and algorithmic list-decoding results: Guruswami-Sudan 98, Parvaresh-Vardy 05, Folded RS.] Algorithmic list decoding results: Folded RS codes [Guruswami, R. 06] correct a 1 − R − ε fraction of errors for any ε > 0.
Concatenated codes (recap) [Forney 66]: C1 : ({0,1}^k)^K → ({0,1}^k)^M (outer code), C2 : {0,1}^k → {0,1}^m (inner code), C1 ∘ C2 : {0,1}^{kK} → {0,1}^{mM}, encoding x = (x_1, ..., x_K) → C1(x) = (w_1, ..., w_M) → (C2(w_1), ..., C2(w_M)). Typically k = O(log M), so brute-force decoding of the inner code is affordable.
List decoding C1 ∘ C2: brute-force decode each received block y_1, ..., y_M (strings in {0,1}^m) into lists S_1, ..., S_M of candidate inner messages in {0,1}^k. How do we "list decode" the outer code from lists?
List recovery: given lists S_1, S_2, S_3, ..., S_M with each S_i a subset of [q] and |S_i| ≤ d, output all codewords (c_1, ..., c_M) that agree with all the input lists, i.e. c_i ∈ S_i for every position i.
All you need to remember about (Reed-Solomon) codes – Part II: q is a prime power; there are q^{q/(d+1)} vectors from [q]^q where every two agree in fewer than q/(d+1) positions; and there is a poly(q) time algorithm for list recovery — given lists S_1, ..., S_q ⊆ [q] with |S_i| ≤ d, it outputs all codewords that agree with all the input lists.
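To make the list-recovery interface concrete, here is a toy brute-force check; note this is NOT the poly(q)-time algorithm the slide refers to (that requires Reed-Solomon list-recovery machinery), it only illustrates the input/output behavior.

```python
def list_recover_bruteforce(codewords, lists):
    """Return every codeword c with c[i] in lists[i] for every position i."""
    return [c for c in codewords
            if all(c[i] in S for i, S in enumerate(lists))]

# Toy usage with the rs_codewords sketch from earlier (q = 5, k = 2):
# matches = list_recover_bruteforce(list(rs_codewords(5, 2)),
#                                   [{0, 1}, {2}, {3, 4}, {0}, {1, 2}])
```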
Back to the q=3 example: each block of 3 bits in the result vector gives a list of the symbols that could occur in that RS position among the positives (e.g. {1,2}, {2}, {0,2}); list recovery on these lists outputs the + items.
All you ever needed to know about (Reed-Solomon) codes… at least for this talk: q is a prime power; there are q^{q/(d+1)} vectors from [q]^q where every two agree in fewer than q/(d+1) positions; and there is a poly(q) time algorithm for list recovery from lists S_1, ..., S_q ⊆ [q] with |S_i| ≤ d.
What does this imply? Implicit in [Guruswami-Indyk 04]: with the KS matrix (t = O(d^2 log^2 n)), list recovery narrows the n columns down to at most d^2 candidate columns in poly(t) time, and the naïve decoder run on just those candidates finds the set of positives in O(d^2 t) time — poly(t) decoding overall.
Up next:

# tests (t)        Decoding time
O(d^2 log n)       O(nt)       [DR82], [PR08]
O(d^4 log n)       O(t)        [GI04]
O(d^2 log^2 n)     poly(t)     [GI04, implicit]
O(d^2 log n)       poly(t)     [INR10, NPR11]
Filter-evaluate decoding paradigm: stack a "filtering" matrix (results y_1, ..., y_{t'}) on top of a d-disjunct matrix (results r_1, ..., r_t). The filtering matrix is decoded in poly(t') time to a list of L candidate columns; the naïve decoder then checks only those L candidates against the d-disjunct part in O(Lt) time.
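A hedged sketch of the filter-evaluate paradigm above, assuming some filtering step has already produced the list of L candidate columns (how that list is produced is the content of [INR10, NPR11] and is not implemented here): the naïve decoder is simply rerun on the candidates only.

```python
import numpy as np

def filter_evaluate_decode(M_disjunct, r, candidates):
    """M_disjunct: t x n d-disjunct matrix, r: its length-t result vector,
    candidates: the L columns surviving the filtering step.
    Runs the naive elimination only over the candidates: O(L * t) time."""
    survivors = []
    for i in candidates:
        tests_with_i = np.nonzero(M_disjunct[:, i])[0]
        if all(r[j] == 1 for j in tests_with_i):   # no 0-result test contains i
            survivors.append(i)
    return survivors
```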
So all we need to do [Indyk, Ngo, R. 10] [Ngo, Porat, R. 11]: construct an efficiently decodable filtering matrix. Since the filter only has to narrow the items down to a short list rather than identify the positives exactly, it can get by with o(d^2 log n / log d) tests, below the d-disjunct lower bound.
Overview of the results:

# tests (t)        Decoding time
O(d^2 log n)       O(nt)       [DR82], [PR08]
O(d^4 log n)       O(t)        [GI04]
O(d^2 log^2 n)     poly(t)     [GI04, implicit]
O(d^2 log n)       poly(t)     [INR10, NPR11]