Computational Molecular Biology

Computational Molecular Biology Group Testing – Pooling Designs

Group Testing (GT)
• Definition:
  • Given n items with at most d positive ones
  • Identify all positive ones using the minimum number of tests
  • Each test is on a subset of items
  • Positive test outcome: there exists a positive item in the subset
My T. Thai mythai@cise.ufl.edu

An Idea of GT
[Figure: a row of items, most of them negative (_) and a few positive (+); legend: Positive, Negative]

Example 1 – Sequential Method
[Figure: items 1 2 3 4 5 6 7 8 9, narrowed to 1 2 3 4 5, then to 4 5]
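The narrowing in Example 1 can be sketched as adaptive halving. This is only an illustrative sketch: it assumes exactly one positive item, and the names `find_single_positive` and `is_positive_pool` are hypothetical, not taken from the slides.

```python
def find_single_positive(items, is_positive_pool):
    """Adaptive halving, assuming exactly one item is positive.

    is_positive_pool(pool) is a hypothetical test oracle that returns True
    when the pool contains the positive item.
    """
    candidates = list(items)
    tests = 0
    while len(candidates) > 1:
        # Test the first half (rounded up) of the remaining candidates.
        half = candidates[: (len(candidates) + 1) // 2]
        tests += 1
        if is_positive_pool(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0], tests


positive = 4  # assumed positive item for this illustration
item, tests = find_single_positive(range(1, 10), lambda pool: positive in pool)
print(item, tests)  # 4 3
```

Each test depends on the previous outcome, which is why the sequential method uses few tests but cannot run them in parallel.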

Example 2 – Non-adaptive Method
Items 1–9 are arranged in a grid; each row (p1–p3) and each column (p4–p6) is a pool:
         p4  p5  p6
    p1    1   2   3
    p2    4   5   6
    p3    7   8   9
• Non-adaptive group testing is called pooling design in biology

Sequential and Non-adaptive
• Sequential GT needs fewer tests, but takes longer (tests run one after another).
• Non-adaptive GT needs more tests, but takes less time (all tests can run in parallel).
• In molecular biology, non-adaptive GT is usually chosen. Why?

Because…
• The same library is screened with many different probes. Preparing a pool for the first time is expensive; once prepared, a pool can be screened many times with different probes.
• Screening one pool at a time is expensive. Screening pools in parallel with the same probe is cheaper.
• There are constraints on pool sizes. If a pool contains too many different clones, positive pools can become too dilute and be mislabeled as negative.

Pooling Designs
• Problem Definition
  • Given a set of n clones with at most d positive clones
  • Identify all positive clones with the minimum number of tests
• Pool: a subset of clones
• Positive pool: a pool that contains at least one positive clone
• Clones = Items

Relation to Pooling Designs
• M (t × n): rows are pools p1 … pt, columns are clones c1 … cn
• M[i, j] = 1 iff the ith pool contains the jth clone
• V (t × 1): outcome vector obtained by testing the pools of M

              clones
          c1 c2 … cj … cn | V
    p1     0  0 …  0 …  0 | 0
    p2     0  1 …  0 …  0 | 1
    .                     | .
    pi     0  0 …  1 …  0 | 1
    .                     | .
    pt     0  0 …  0 …  0 | 0

• Decoding Algorithm: Given M and V, identify all positive clones
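The relation between M, the positive clones, and V can be sketched as follows. The helper name `pool_outcomes` and the small matrix are hypothetical, chosen only for illustration, and error-free tests are assumed.

```python
def pool_outcomes(M, positives):
    """Return the outcome vector V, where V[i] = 1 iff pool i contains
    at least one positive clone (error-free testing is assumed)."""
    return [int(any(row[j] for j in positives)) for row in M]


# Hypothetical design with 3 pools (rows) and 4 clones (columns).
M = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
V = pool_outcomes(M, positives={1})  # clone index 1 (0-based) is positive
print(V)  # [1, 1, 0]
```

Equivalently, V is the Boolean sum (union) of the columns of M that belong to the positive clones.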

Observation
                       clones
          c1 c2 c3 c4 c5 c6 c7 c8 c9
    p1     1  1  1  0  0  0  0  0  0
    p2     0  0  0  1  1  1  0  0  0
    p3     0  0  0  0  0  0  1  1  1
    p4     1  0  0  1  0  0  1  0  0
    p5     0  1  0  0  1  0  0  1  0
    p6     0  0  1  0  0  1  0  0  1
• Observation: All columns are distinct.
• To identify up to d positives, all unions of up to d columns should be distinct!
• Union of d columns: Boolean sum of these d columns
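The observation can be checked by brute force. The sketch below is an assumption-laden illustration (the names `column_union` and `is_separable` are hypothetical): it enumerates all sets of 1 to d columns of the 6 × 9 grid design and looks for two sets with the same union.

```python
from itertools import combinations


def column_union(M, cols):
    """Boolean sum of the given columns, returned as a tuple of 0/1."""
    return tuple(int(any(row[j] for j in cols)) for row in M)


def is_separable(M, d):
    """True if the unions of all sets of 1..d columns are pairwise distinct."""
    n = len(M[0])
    seen = set()
    for size in range(1, d + 1):
        for cols in combinations(range(n), size):
            u = column_union(M, cols)
            if u in seen:
                return False
            seen.add(u)
    return True


# The 3 x 3 grid design: rows p1..p6, columns c1..c9.
M = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 0, 1],
]
print(is_separable(M, 1))  # True: all single columns are distinct
print(is_separable(M, 2))  # False: e.g. {c1, c5} and {c2, c4} collide
print(column_union(M, (0, 4)) == column_union(M, (1, 3)))  # True
```

So the grid design identifies a single positive clone, but two positives on a diagonal are indistinguishable from the two on the opposite diagonal.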

Challenges
• Challenge 1: How to construct the binary matrix M such that:
  • The unions of any two different sets of up to d columns are distinct
• Challenge 2: How to design a decoding algorithm with efficient time complexity [O(tn)]

d-separable Matrix
[Figure: t × n binary matrix with rows (pools) p1 … pt and columns (clones) c1 … cn]
• All unions of d columns are distinct.
• With at most d positives: all unions of up to d columns are distinct.
• Decoding: O(n^d)

d-disjunct Matrix
• Definition: A binary matrix M (t × n) is a d-disjunct matrix (d < t) if:
  • The union of any d columns does not contain any other column
• Example: a 2-disjunct matrix
        1 0 0
    M = 0 1 0
        0 0 1
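The definition can be tested directly on small matrices. The following is a brute-force sketch (the function name `is_disjunct` is hypothetical) that takes time exponential in d, so it is only meant for checking toy examples such as the one above.

```python
from itertools import combinations


def is_disjunct(M, d):
    """True if no column is contained in the union of any d other columns."""
    n = len(M[0])
    for j in range(n):
        others = [c for c in range(n) if c != j]
        for cols in combinations(others, d):
            # Column j is contained in the union if every row where column j
            # has a 1 also has a 1 in at least one of the chosen columns.
            if all(any(row[c] for c in cols) for row in M if row[j]):
                return False
    return True


identity = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
print(is_disjunct(identity, 2))  # True

not_disjunct = [
    [1, 1],
    [0, 1],
]
print(is_disjunct(not_disjunct, 1))  # False: column 1 contains column 0
```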

d-disjunct Matrix (cont)
• A d-disjunct matrix can efficiently identify up to d positive clones. Why?
• Theorem 1: All unions of d distinct columns are distinct (thus d-disjunct implies d-separable)
• Theorem 2: The number of clones not in negative pools is always at most d
• Corollary 1: The tests with negative outcomes determine all negative clones
• Decoding time complexity: O(tn)

Proof of Theorem 2
• Note that an item does not appear in any negative pool iff its corresponding column is contained in the union of the (at most d) positive columns
• Therefore, the number of items not appearing in any negative pool is more than d iff there is at least one negative item whose column is contained in the union of the positive columns
• But M is d-disjunct, so no such item exists; hence Theorem 2 follows

Decoding Algorithm
Input: d-disjunct matrix M and output vector V
Output: All positive clones
    for each clone c in n clones
        if c is in a negative pool
            remove c
    return remaining clones
Example:
          c1 c2 c3 c4 c5 c6 | V
    p1     1  1  1  0  0  0 | 1
    p2     1  0  0  1  1  0 | 0
    p3     0  1  0  1  0  1 | 0
    p4     0  0  1  0  1  1 | 1
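The decoding procedure above can be sketched in a few lines, run here on the slide's example matrix. The function name `decode` is hypothetical, and error-free outcomes are assumed.

```python
def decode(M, V):
    """Return the (0-based) indices of clones that appear in no negative pool.

    Assumes M is d-disjunct, there are at most d positives, and V is error-free.
    Runs in O(tn) time for a t x n matrix.
    """
    n = len(M[0])
    candidates = set(range(n))
    for row, outcome in zip(M, V):
        if outcome == 0:
            # Every clone in a negative pool must be negative.
            candidates -= {j for j in range(n) if row[j]}
    return sorted(candidates)


M = [
    [1, 1, 1, 0, 0, 0],  # p1
    [1, 0, 0, 1, 1, 0],  # p2
    [0, 1, 0, 1, 0, 1],  # p3
    [0, 0, 1, 0, 1, 1],  # p4
]
V = [1, 0, 0, 1]
print([f"c{j + 1}" for j in decode(M, V)])  # ['c3']
```

Here the negative pools p2 and p3 eliminate c1, c2, c4, c5, and c6, leaving c3 as the only positive clone.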

Fields
• Field: any set of elements that satisfies the field axioms for both addition and multiplication and is a commutative division algebra
• E.g.: the complex, rational, and real numbers

Division Algebra

Finite Fields
• Finite Field:
  • A field with a finite field order, i.e., a finite number of elements
  • The order of a finite field is always a prime or a prime power (power of a prime)
  • E.g.: 16 = 2^4 is a prime power, whereas 6 and 15 are not
  • E.g.: in GF(5), 4 + 3 = 7 is reduced to 2 modulo 5
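For a prime order p, arithmetic in GF(p) is ordinary integer arithmetic reduced modulo p. The sketch below assumes p is prime (prime-power orders such as 16 need polynomial arithmetic and are not covered); the helper names are hypothetical.

```python
def gf_add(a, b, p):
    """Addition in GF(p), p prime."""
    return (a + b) % p


def gf_mul(a, b, p):
    """Multiplication in GF(p), p prime."""
    return (a * b) % p


def gf_inv(a, p):
    """Multiplicative inverse in GF(p) via Fermat's little theorem:
    a^(p-2) * a = a^(p-1) = 1 (mod p) for a not divisible by p."""
    if a % p == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    return pow(a, p - 2, p)


print(gf_add(4, 3, 5))  # 2, as in the GF(5) example
print(gf_inv(3, 5))     # 2, since 3 * 2 = 6 = 1 (mod 5)
```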

How to construct a d-disjunct matrix
Consider a finite field GF(q). Choose s, q, k satisfying:
Step 1: Construct matrix A (s × n) as follows:
    for x from 0 to s - 1
        for each polynomial pj of degree k
            A[x, pj] = pj(x)
[Figure: matrix A with rows x = 0, 1, …, s-1 and columns p1, p2, …, pj, …, pn; the entry in row x, column pj is pj(x)]

Algorithm (cont)
Step 2: Construct matrix B (t × n) from A (s × n) as follows:
    for x from 0 to s - 1
        for y from 0 to q - 1
            for each polynomial pj of degree k
                if A[x, pj] == y
                    B[(x, y), pj] = 1
                else
                    B[(x, y), pj] = 0
[Figure: matrix A (rows x = 0 … s-1, columns p1 … pn) next to matrix B (rows (0,0), (0,1), …, (x,y), …, (s-1,q-1), columns p1 … pn); B has a 1 in row (x,y), column pj where pj(x) = y, and a 0 where pj(x) ≠ y]
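Steps 1 and 2 can be sketched as below. Several choices here are assumptions of this sketch rather than statements from the slides: q is taken to be prime (so field arithmetic is integer arithmetic mod q), the columns are all polynomials of degree at most k, and the example parameters satisfy kd + 1 ≤ s ≤ q with d = 2, a sufficient condition for d-disjunctness in that setting because two distinct polynomials of degree at most k agree on at most k points. The function names are hypothetical.

```python
from itertools import product


def poly_eval(coeffs, x, q):
    """Evaluate a polynomial (coefficients from low to high degree)
    at x in GF(q), q prime, using Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % q
    return result


def build_matrices(q, k, s):
    """Return (A, B): A is s x n with A[x][j] = p_j(x); B is (s*q) x n with
    B[x*q + y][j] = 1 iff p_j(x) == y. Columns are all polynomials of
    degree at most k over GF(q), so n = q**(k + 1)."""
    polys = list(product(range(q), repeat=k + 1))
    A = [[poly_eval(p, x, q) for p in polys] for x in range(s)]
    B = [
        [1 if A[x][j] == y else 0 for j in range(len(polys))]
        for x in range(s)
        for y in range(q)
    ]
    return A, B


A, B = build_matrices(q=3, k=1, s=3)
print(len(B), len(B[0]))  # 9 9
```

With q = 3, k = 1, s = 3, every column of B has exactly s = 3 ones and any two columns share at most k = 1 row, so two other columns can cover at most 2 of a column's 3 ones.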

Algorithm Analysis
• Theorem 3 (Correctness): If kd ≤ s ≤ q, then B (t × n) is d-disjunct.
• Theorem 4: The number of tests t obtained from this algorithm is t = qs = O(q^2) where:

Errors in Experiments
• False negative:
  • The pool contains some positive clones
  • But the test returns a negative outcome
• False positive:
  • The pool contains only negative clones
  • But the test returns a positive outcome

An e-Error Correcting Model
• Definition:
  • Assume that there are at most e errors in testing
  • All positive clones can still be identified
• Hamming distance: the Hamming distance of two column vectors is the number of components in which they differ
• e-error-correcting: A matrix is said to be e-error-correcting if the Hamming distance of any two unions of d columns is at least 2e + 1

(d,e)-disjunct Matrix
• Definition: A t × n binary matrix M is (d, e)-disjunct if for any one column j and any other d columns j1, j2, …, jd, there exist e + 1 rows i0, i1, …, ie such that M[iu, j] = 1 and M[iu, jv] = 0 for u = 0, 1, …, e and v = 1, 2, …, d
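The definition says that every column keeps at least e + 1 "private" rows against any d other columns. A brute-force sketch for small matrices follows; the name `is_de_disjunct` and the example matrix are hypothetical.

```python
from itertools import combinations


def is_de_disjunct(M, d, e):
    """True if, for every column j and every set of d other columns, at least
    e + 1 rows have a 1 in column j and 0 in all d chosen columns."""
    n = len(M[0])
    for j in range(n):
        others = [c for c in range(n) if c != j]
        for cols in combinations(others, d):
            private_rows = sum(
                1 for row in M if row[j] and not any(row[c] for c in cols)
            )
            if private_rows < e + 1:
                return False
    return True


# Hypothetical example: the 3 x 3 identity matrix stacked twice.
M = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]
print(is_de_disjunct(M, 2, 1))  # True: each column has 2 private rows
print(is_de_disjunct(M, 2, 2))  # False: 3 private rows would be needed
```

With e = 0 the check reduces to ordinary d-disjunctness.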

e-Error Correcting
• Theorem 5: For every (d,e)-disjunct matrix, the Hamming distance between any two unions of d columns is at least 2e + 2
• Since 2e + 2 ≥ 2e + 1, every (d,e)-disjunct matrix is e-error-correcting

Theorem 6
• Theorem 6: Suppose testing is based on a (d,e)-disjunct matrix. If the number of errors is at most e, then the number of negative pools containing a positive item is always smaller than the number of negative pools containing a negative item

Proof of Theorem 6
• Let i be a positive item and j a negative item.
• Suppose the number of negative pools containing i is m. Every pool containing i should test positive, so these m pools must have received errors.
• Hence at most e - m errors turn a negative outcome into a positive one.
• With no errors, the number of negative pools containing j is at least e + 1, because M is (d,e)-disjunct.
• Hence the number of negative pools containing j is at least (e + 1) - (e - m) = m + 1 > m.

Decoding in e-error-correcting
• Corollary: By Theorem 6, to decode the positives from testing based on a (d,e)-disjunct matrix, we only need to compute the number of negative pools containing each item and select the d items with the smallest counts. This runs in time O(nt)

Decoding Algorithm with e Errors
    T = empty set
    for each clone ci (i = 1…n)
        t(ci) = # negative pools containing ci
        T = T ∪ {t(ci)}
    end for
    Let Td = set of d smallest t(ci) in T
    return ci if t(ci) in Td
Time complexity: O(tn)
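A minimal sketch of this decoder follows. The function name and the example design (the 3 × 3 identity stacked three times, used with d = 1 and one injected false negative) are assumptions made for illustration.

```python
def decode_with_errors(M, V, d):
    """Return the (0-based) indices of the d clones that appear in the fewest
    negative pools. Assumes M is (d, e)-disjunct and V has at most e errors."""
    n = len(M[0])
    # t(c_j): number of negative pools containing clone j.
    counts = [
        sum(1 for row, outcome in zip(M, V) if outcome == 0 and row[j])
        for j in range(n)
    ]
    ranked = sorted(range(n), key=lambda j: counts[j])
    return sorted(ranked[:d])


# Hypothetical design: each clone has 3 private pools.
M = [
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
    [1, 0, 0], [0, 1, 0], [0, 0, 1],
]
# Clone 0 is positive; the error-free outcome would be [1,0,0,1,0,0,1,0,0].
# The first test is flipped to simulate one false negative.
V = [0, 0, 0, 1, 0, 0, 1, 0, 0]
print(decode_with_errors(M, V, d=1))  # [0]
```

In this run the positive clone sits in 1 negative pool while each negative clone sits in 3, so the ranking still recovers it, as Theorem 6 predicts.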
