This presentation covers batch codes for load-balancing scenarios, with examples and applications in private information retrieval. Explore computational and information-theoretic PIR approaches, time complexity, and amortized PIR using hashing.
Batch Codes and Their Applications. Y. Ishai, E. Kushilevitz, R. Ostrovsky, A. Sahai. Preliminary version in STOC 2004.
Talk Outline • Batch codes • Amortized PIR • via hashing • via batch codes • Constructing batch codes • Concluding remarks
What’s wrong with a random partition? • Good on average for “oblivious” queries. • However: • Can’t balance adversarial queries • Can’t balance few random queries • Can’t relieve “hot spots” in multi-user setting
Example • 3 devices, 50% storage overhead. • By how much can the maximal load be reduced? • Replicating bits is no good: some device holds the only copy of at least 1/6 of the bits. • A factor-2 load reduction is possible: store the left half L of the data on one device, the right half R on another, and L⊕R on the third.
Batch Codes • An (n,N,m,k) batch code encodes a string x ∈ {0,1}^n into m strings ("buckets") y1, y2, …, ym of total length N, so that any k bits x_{i1}, …, x_{ik} can be recovered by reading at most one bit from each bucket. • Notes • Rate = n / N • By default, insist on minimal load (one probe) per bucket, which forces m ≥ k. • Load is measured by # of probes. • Generalizations • Allow t probes per bucket • Larger alphabet
Multiset Batch Codes • An (n,N,m,k) multiset batch code is defined similarly, except that the requested indices ⟨i1,…,ik⟩ may form a multiset (repetitions allowed); all k requests must be answered while still probing each bucket at most once in total. • Motivation • Models multiple users (with off-line coordination) • Useful as a building block for standard batch codes • Nontrivial even for multisets of the form ⟨i,i,…,i⟩
Examples • Trivial codes • Replication: N=kn, m=k (a multiset code) • Optimal m, bad rate. • One bit per bucket: N=m=n • Optimal rate, bad m. • (L,R,L⊕R) code: rate = 2/3, m=3, k=2 (a multiset code). • Goal: simultaneously obtain • High rate (close to 1) • Small m (close to k)
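To make the (L,R,L⊕R) example concrete, here is a minimal Python sketch (an illustration only; the function names batch_encode and batch_decode2 are ours, not from the paper). It encodes an n-bit string into three buckets of n/2 bits and serves any two queries, even a repeated one, with at most one probe per bucket.

# Sketch of the (L, R, L xor R) batch code: n bits -> 3 buckets of n/2 bits,
# any 2 positions can be read with at most one probe per bucket.

def batch_encode(x):
    """x: list of 0/1 bits of even length n. Returns buckets (L, R, L xor R)."""
    half = len(x) // 2
    L, R = x[:half], x[half:]
    LxR = [a ^ b for a, b in zip(L, R)]
    return [L, R, LxR]

def batch_decode2(buckets, i1, i2, n):
    """Recover x[i1], x[i2], probing each bucket at most once."""
    half = n // 2
    probes = {}  # bucket index -> position probed (at most one per bucket)

    def read(b, pos):
        assert b not in probes, "bucket probed twice"
        probes[b] = pos
        return buckets[b][pos]

    side1, off1 = (0, i1) if i1 < half else (1, i1 - half)
    side2, off2 = (0, i2) if i2 < half else (1, i2 - half)
    if side1 != side2:
        # The two queries live in different halves: read each directly.
        return read(side1, off1), read(side2, off2)
    # Same half (possibly the same index): read one bit directly, and
    # reconstruct the other from the opposite half and the XOR bucket.
    v1 = read(side1, off1)
    v2 = read(1 - side2, off2) ^ read(2, off2)
    return v1, v2

if __name__ == "__main__":
    import itertools, random
    n = 8
    x = [random.randint(0, 1) for _ in range(n)]
    y = batch_encode(x)
    for i1, i2 in itertools.product(range(n), repeat=2):
        assert batch_decode2(y, i1, i2, n) == (x[i1], x[i2])
    print("all 2-query batches decoded with load <= 1 per bucket")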
Private Information Retrieval (PIR) • Goal: allow a user to query a database while hiding the identity of the data items she is after. • Motivation: patent databases, web searches, ... • Paradox(?): imagine buying in a store without the seller knowing what you buy. • Note: encrypting requests is useful against third parties, but not against the server holding the data.
Modeling • Database: an n-bit string x • User: wishes to • retrieve x_i, and • keep i private
(Diagram: the User retrieves x_i from the Server, which learns nothing about i.)
Some "Solutions" 1. User downloads the entire database. Drawback: n communication bits (vs. log n + 1 without privacy). Main research goal: minimize communication complexity. 2. User masks i with additional random indices. Drawback: gives a lot of information about i. 3. Enable anonymous access to the database. Note: this addresses the different security concern of hiding the user's identity, not the fact that x_i is retrieved. Fact: PIR as described so far (a single server, information-theoretic privacy) requires Ω(n) communication bits.
Two Approaches • Computational PIR [KO97, CMS99, ...] • Computational privacy • Based on cryptographic assumptions • Information-Theoretic PIR [CGKS95, Amb97, ...] • Replicate the database among s servers • Unconditional privacy against t servers • Default: t=1
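For intuition about the information-theoretic model, here is a sketch of the classic two-server XOR protocol with linear communication (a textbook warm-up, not one of the sublinear-communication schemes cited next; the helper names are ours).

# Textbook 2-server PIR with O(n) communication: the user sends a random
# subset S to server 1 and S xor {i} to server 2; each server returns the
# XOR of its selected bits; XORing the two answers yields x[i].
# Each query, viewed in isolation, is a uniformly random 0/1 vector, so
# neither server learns anything about i (assuming they do not collude).
import random

def server_answer(x, query):
    """query: 0/1 vector of length n. Returns the XOR of the selected bits."""
    ans = 0
    for bit, q in zip(x, query):
        ans ^= bit & q
    return ans

def user_retrieve(i, n, ask_server1, ask_server2):
    q1 = [random.randint(0, 1) for _ in range(n)]
    q2 = list(q1)
    q2[i] ^= 1  # flip position i
    return ask_server1(q1) ^ ask_server2(q2)

if __name__ == "__main__":
    n = 16
    x = [random.randint(0, 1) for _ in range(n)]
    for i in range(n):
        got = user_retrieve(i, n, lambda q: server_answer(x, q),
                            lambda q: server_answer(x, q))
        assert got == x[i]
    print("2-server PIR reconstructs every bit correctly")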
Communication Upper Bounds • Computational PIR • O(n^ε), polylog(n), O(κ·log n), O(κ + log n), where κ is a security parameter [KO97, CMS99, …] • Information-theoretic PIR • 2 servers, O(n^(1/3)) [CGKS95] • s servers, O(n^(1/c(s))) where c(s) = Ω(s·log s / log log s) [CGKS95, Amb97, BIKR02] • O(log n / log log n) servers, polylog(n)
Time Complexity of PIR • Given low-communication protocols, the efficiency bottleneck shifts to the servers' time complexity. • Protocols require (at least) linear time per query. • This is an inherent limitation! • Possible workarounds: • Preprocessing • Amortize cost over multiple queries
Previous Results [BIM00] • PIR with preprocessing • s-server protocols with O(n^ε) communication and O(n^(1/s+ε)) work per query, requiring poly(n) storage. • Disadvantages: • Only works for multi-server PIR • Storage typically huge • Amortized PIR • Slight savings possible using fast matrix multiplication • Requires a large batch of queries and high communication • Applies also to queries originating from different users. • This work: • Assume a batch of k queries originates from a single user. • Allow preprocessing (not always needed). • Nearly optimal amortization
Model • The User retrieves x_{i1}, x_{i2}, …, x_{ik} from the Server(s), which learn nothing about i1, …, ik.
Amortized PIR via Hashing • Let P be a PIR protocol. • Hashing-based amortized PIR: • User picks a random hash function h from a family H, defining a random partition of x into k buckets of size ≈ n/k, and sends h to the Server(s). • Except with negligible (2^(-σ)) failure probability, at most t = O(log k) queries fall in each bucket. • P is applied t times for each bucket. • Complexity: • Time ≈ k·t·T(n/k) ≈ t·T(n) • Communication ≈ k·t·C(n/k) • Asymptotically optimal up to "polylog factors"
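A small sketch of the bucketing step only (our own illustration; the underlying protocol P is abstracted away, and a truly random function stands in for the hash family H).

# Hashing step of amortized PIR via hashing: a random hash h splits [n] into
# k buckets; the k query indices rarely crowd one bucket, so the (abstracted)
# PIR protocol P only has to run t ~ O(log k) times per bucket, each time on
# a database of size ~ n/k.
import random
from collections import Counter

def random_partition(n, k, seed):
    """Return h: index -> bucket (a truly random function, for illustration)."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in range(n)]

def max_bucket_load(h, queries):
    counts = Counter(h[i] for i in queries)
    return max(counts.values()) if counts else 0

if __name__ == "__main__":
    n, k = 1 << 16, 64
    queries = random.sample(range(n), k)
    h = random_partition(n, k, seed=1234)
    t = max_bucket_load(h, queries)
    print(f"max queries per bucket: {t} (k = {k}); "
          f"P would be run {t} times on each of the {k} buckets")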
So what's wrong? • Not much… • Still: • Not perfect • introduces either error or privacy loss • Useless for small k • the t = O(log k) overhead dominates • Cannot hash "once and for all" • for every h there is a bad k-tuple of queries • Sounds familiar?
Amortized PIR via Batch Codes • Idea: use batch-encoding instead of hashing. • Protocol: • Preprocessing: Server(s) encode x as y = (y1, y2, …, ym). • Based on i1, …, ik, the User computes the index of the bit it needs from each bucket. • P is applied once for each bucket. • Complexity • Time: Σ_{1≤j≤m} T(N_j) ≈ T(N) • Communication: Σ_{1≤j≤m} C(N_j) ≤ m·C(n) • Trivial batch codes imply trivial protocols. • (L,R,L⊕R) code: 2 queries, 1.5× time, 3× communication
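A sketch of the protocol flow for the (L,R,L⊕R) code (our own illustration): pir_retrieve stands in for one execution of P and is simulated here by a direct lookup, and unused buckets are queried at a dummy position so that exactly one run of P per bucket takes place.

# Amortized PIR via the (L, R, L xor R) batch code (k = 2).  P is run exactly
# once on each of the 3 buckets (on a dummy position when a bucket is not
# otherwise needed), so the servers just see 3 ordinary PIR executions.
# Time is roughly 3 * T(n/2) = 1.5 * T(n); communication is 3 * C(n/2).

def encode(x):
    half = len(x) // 2
    L, R = x[:half], x[half:]
    return [L, R, [a ^ b for a, b in zip(L, R)]]

def pir_retrieve(bucket, pos):
    return bucket[pos]          # placeholder for one run of P on this bucket

def amortized_query(buckets, i1, i2, n):
    half = n // 2
    side1, off1 = divmod(i1, half)
    side2, off2 = divmod(i2, half)
    # Decide which position to request from each bucket (dummy position = 0).
    pos = [0, 0, 0]
    if side1 != side2:
        pos[side1], pos[side2] = off1, off2
    else:
        pos[side1] = off1                    # first query: direct read
        pos[1 - side2] = pos[2] = off2       # second query: via the XOR bucket
    ans = [pir_retrieve(b, p) for b, p in zip(buckets, pos)]  # one P run each
    if side1 != side2:
        return ans[side1], ans[side2]
    return ans[side1], ans[1 - side2] ^ ans[2]

if __name__ == "__main__":
    import itertools, random
    n = 10
    x = [random.randint(0, 1) for _ in range(n)]
    y = encode(x)
    for i1, i2 in itertools.product(range(n), repeat=2):
        assert amortized_query(y, i1, i2, n) == (x[i1], x[i2])
    print("two arbitrary queries answered with one PIR call per bucket")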
Overview • Recall the notion: x is encoded into m buckets y1, …, ym of total length N, from which any k requested indices i1, …, ik can be served. • Main qualitative questions: 1. Can we get an arbitrarily high constant rate (n/N = 1-ε) while keeping m feasible in terms of k (say m = poly(k))? 2. Can we insist on a nearly optimal m (say m = Õ(k)) and still get close to a constant rate? • Several incomparable constructions • Answer both questions affirmatively.
Batch Codes from Unbalanced Expanders • Bipartite graph: n left vertices (data bits), m right vertices (buckets); each bucket stores a copy of every data bit adjacent to it. • By Hall's theorem, the graph represents an (n, N=|E|, m, k) batch code iff every set S containing at most k vertices on the left has at least |S| neighbors on the right. • Fully captures replication-based batch codes.
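A brute-force sketch of this Hall-type criterion (our own code; practical only for tiny graphs, since it enumerates all left sets of size at most k).

# Check the Hall condition for a replication-based batch code given as a
# bipartite graph: adj[i] = set of buckets storing data bit i.  The graph
# defines an (n, N = sum of degrees, m, k) batch code iff every set S of at
# most k left vertices has at least |S| neighbours; Hall's theorem then gives
# a system of distinct representatives, so any k distinct bits can be read
# from k distinct buckets.
from itertools import combinations

def is_k_batch(adj, k):
    n = len(adj)
    for size in range(1, k + 1):
        for S in combinations(range(n), size):
            neighbours = set().union(*(adj[i] for i in S))
            if len(neighbours) < size:
                return False
    return True

if __name__ == "__main__":
    # Toy example: 4 data bits, 4 buckets, each bit replicated on 2 buckets
    # arranged in a cycle (rate 1/2).
    adj = [{0, 1}, {1, 2}, {2, 3}, {3, 0}]
    print("k=2 batch code:", is_k_batch(adj, 2))   # True
    print("k=3 batch code:", is_k_batch(adj, 3))   # True
    print("k=4 batch code:", is_k_batch(adj, 4))   # True (the 4-cycle expands)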
Parameters • Non-explicit: N = dn, m = O(k·(nk)^(1/(d-1))) • d=3: rate = 1/3, m = O(k^(3/2)·n^(1/2)). • d = log n: rate = 1/log n, m = O(k). Settles Q2. • Explicit (using [TUZ01], [CRVW02]) • Nontrivial, but quite far from optimal • Limitations: • Rate < ½ (unless m = Ω(n)) • For constant rate, m must also depend on n. • Cannot handle multisets.
The Subcube Code • Generalize the (L,R,L⊕R) example in two ways: • Trade better rate for larger m: split x into s parts and store (Y1, Y2, …, Ys, Y1⊕…⊕Ys) • still k=2, now with rate s/(s+1) and m = s+1 • Handle larger k via composition.
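A sketch of this k=2 building block (our own code; the parameter choices and function names are ours): any two, possibly equal, indices are served with at most one probe per bucket.

# Primitive step of the subcube code: buckets Y1,...,Ys (the s parts of x)
# plus an extra bucket Y1 xor ... xor Ys.  Rate s/(s+1), m = s+1, k = 2.
from functools import reduce

def subcube_encode(x, s):
    part = len(x) // s                      # assume s divides len(x)
    Y = [x[j * part:(j + 1) * part] for j in range(s)]
    Y.append([reduce(lambda a, b: a ^ b, col) for col in zip(*Y)])
    return Y

def subcube_decode2(Y, i1, i2, part, s):
    """Recover x[i1], x[i2], probing each of the s+1 buckets at most once."""
    j1, o1 = divmod(i1, part)
    j2, o2 = divmod(i2, part)
    if j1 != j2:
        return Y[j1][o1], Y[j2][o2]
    # Same part: read x[i1] directly; reconstruct x[i2] by XORing the
    # corresponding position of every other part and of the XOR bucket.
    v1 = Y[j1][o1]
    v2 = 0
    for j in range(s + 1):
        if j != j2:
            v2 ^= Y[j][o2]
    return v1, v2

if __name__ == "__main__":
    import itertools, random
    s, part = 4, 3
    x = [random.randint(0, 1) for _ in range(s * part)]
    Y = subcube_encode(x, s)
    for i1, i2 in itertools.product(range(len(x)), repeat=2):
        assert subcube_decode2(Y, i1, i2, part, s) == (x[i1], x[i2])
    print("subcube gadget: any 2 queries served with one probe per bucket")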
Geometric Interpretation • (Figure, for s=2 and two levels of composition: the data is arranged in a 2×2 grid of parts A, B, C, D, and the 9 buckets are A, B, C, D, A⊕B, C⊕D, A⊕C, B⊕D, A⊕B⊕C⊕D, i.e., the parts together with their row sums, column sums, and total sum.)
Parameters • N ≈ k^(log(1+1/s))·n, m ≈ k^(log(s+1)) • s = O(log k) gives an arbitrary constant rate with m = k^(O(log log k)). "Almost" resolves Q1. • Advantages: • Arbitrary constant rate • Handles multisets • Very easy decoding • Asymptotically dominated by a subsequent construction.
The Gadget Lemma • A batch code is primitive if each bucket contains just a single symbol. • Gadget Lemma: from a primitive multiset (n0, N0, m, k) batch code one gets, for every r, an (r·n0, r·N0, m, k) multiset batch code, by letting bucket j of the larger code hold the j-th symbol of the gadget applied to each of the r blocks of the data. • From now on, we can choose a "convenient" n and get the same rate and m(k) for arbitrarily larger n.
Batch Codes vs. Smooth Codes • Def. A code C: Σ^n → Σ^m is q-smooth if there exists a (randomized) decoder D such that • D(i) decodes x_i by probing q symbols of C(x). • Each symbol of C(x) is probed with probability at most q/m. • Smooth codes are closely related to locally decodable codes [KT00]. • Two-way relation with batch codes: • q-smooth code ⇒ primitive multiset batch code with k = m/q^2 (ideally we would like k = m/q). • Primitive multiset batch code ⇒ (expected) q-smooth for q = m/k. • Batch codes and smooth codes are very different objects: • The relation breaks when relaxing "multiset" or "primitive". • The gap between m/q and m/q^2 is very significant in the high-rate case. • The best known smooth codes with rate > 1/2 require q > n^(1/2). • These codes are provably useless as batch codes.
Batch Codes from RM Codes • (s,d) Reed-Muller code over F: • Message viewed as an s-variate polynomial p over F of total degree (at most) d. • Encoded by the sequence of its evaluations at all points of F^s. • The case |F| > d is useful due to a "smooth decoding" feature: p(z) can be extrapolated from the values of p at any d+1 points on a line passing through z.
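A sketch of this smooth-decoding feature for s=2 over a small prime field (our own illustration; the field size q=17 and degree d=4 are arbitrary choices).

# Smooth decoding of a bivariate Reed-Muller codeword over F_q (q prime,
# q > d+1): the restriction of a total-degree-d polynomial p to a line
# t -> z + t*v is a univariate polynomial of degree <= d in t, so p(z)
# (the value at t = 0) can be interpolated from its values at t = 1..d+1.
import random

q, d = 17, 4                      # field size and total degree, q > d+1

def rand_poly():
    """Random bivariate polynomial of total degree <= d, as a coefficient dict."""
    return {(a, b): random.randrange(q)
            for a in range(d + 1) for b in range(d + 1 - a)}

def evaluate(p, x, y):
    return sum(c * pow(x, a, q) * pow(y, b, q) for (a, b), c in p.items()) % q

def smooth_decode(codeword, z):
    """Recover p(z) from d+1 points on a random line through z."""
    v = (random.randrange(1, q), random.randrange(q))   # nonzero direction
    ts = list(range(1, d + 2))
    vals = [codeword[((z[0] + t * v[0]) % q, (z[1] + t * v[1]) % q)] for t in ts]
    # Lagrange interpolation of the line polynomial at t = 0.
    total = 0
    for j, tj in enumerate(ts):
        num, den = 1, 1
        for m, tm in enumerate(ts):
            if m != j:
                num = num * (0 - tm) % q
                den = den * (tj - tm) % q
        total = (total + vals[j] * num * pow(den, q - 2, q)) % q
    return total

if __name__ == "__main__":
    p = rand_poly()
    codeword = {(x, y): evaluate(p, x, y) for x in range(q) for y in range(q)}
    z = (5, 11)
    assert smooth_decode(codeword, z) == evaluate(p, *z)
    print("p(z) recovered from d+1 points on a line through z")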
Two approaches for handling conflicts (two queries whose decoding lines want to probe the same point): • Replicate each point t times. • Use redundancy to "delete" intersections. • Slightly increases the field size, but still allows constant rate. • (Figure: the data bits x1, x2, …, xn are placed at the points of the plane; s = 2, d ≈ (2n)^(1/2).)
Parameters • Rate = 1/s! - ε, m = k^(1 + 1/(s-1) + o(1)) • Multiset codes with constant rate (< ½) • Rate = Ω(1/k), m = Õ(k): resolves Q2 for multiset codes as well. • Main remaining challenge: resolve Q1.
The Subset Code • Choose s, d such that n ≤ (s choose d). • Each data bit i ∈ [n] is associated with a distinct d-subset T of [s]. • Each bucket j ∈ [m] is associated with a subset S of [s] of size at most d. • Primitive code: y_S = ⊕_{T ⊇ S} x_T (XOR over the data-bit subsets T containing S).
Batch Decoding the Subset Code • Lemma: For each T' ⊆ T, x_T can be decoded from all y_S such that S ∩ T = T'; concretely, x_T = ⊕_{S: S∩T=T'} y_S. • Let L_{T,T'} denote the set of such S. • Note: {L_{T,T'} : T' ⊆ T} defines a partition of the set of buckets. • (Illustration: viewing subsets of [s] as 0/1 vectors, T = 0011110000 and a class L_{T,T'} is a wildcard pattern such as **0110****.)
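A brute-force check of the encoding and of this decoding formula for small parameters, matching the definition above (our own test harness; s=6 and d=3 are arbitrary choices).

# Primitive subset code for small parameters: data bits are indexed by
# d-subsets T of [s], buckets by subsets S of size <= d, and y_S is the XOR
# of x_T over all T containing S.  The decoding lemma: for every T' inside T,
# x_T equals the XOR of y_S over all buckets S with S & T == T'.
from itertools import combinations
import random

s, d = 6, 3
universe = range(s)
d_subsets = [frozenset(c) for c in combinations(universe, d)]
buckets = [frozenset(c) for i in range(d + 1) for c in combinations(universe, i)]

def encode(x):
    """x: dict mapping each d-subset T to a bit."""
    return {S: sum(x[T] for T in d_subsets if S <= T) % 2 for S in buckets}

def decode(y, T, Tprime):
    """Recover x_T from the buckets in L_{T,T'} = {S : S & T == T'}."""
    return sum(y[S] for S in buckets if S & T == Tprime) % 2

if __name__ == "__main__":
    x = {T: random.randint(0, 1) for T in d_subsets}
    y = encode(x)
    for T in d_subsets:
        for r in range(d + 1):
            for Tp in combinations(T, r):
                assert decode(y, T, frozenset(Tp)) == x[T]
    print("decoding lemma verified for every T and every T' contained in T")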
Batch Decoding the Subset Code (contd.) • Goal: Given T1, …, Tk, find subsets T'1 ⊆ T1, …, T'k ⊆ Tk such that the classes L_{Ti,T'i} are pairwise disjoint. • Easy if all Ti are distinct or if all Ti are the same. • Attempt 1: let T'i be a random subset of Ti. • Problem: if Ti, Tj are disjoint, L_{Ti,T'i} and L_{Tj,T'j} intersect w.h.p. • Attempt 2: greedily assign to Ti the largest T'i such that L_{Ti,T'i} does not intersect any previous L_{Tj,T'j}. • Problem: adjacent sets may "block" each other. • Solution: pick a random T'i with a bias towards large sets.
Parameters • Allows an arbitrary constant rate with m = poly(k). Settles Q1. • Both the subcube code and the subset code can be viewed as sub-codes of the binary RM code. • The full binary RM code cannot be batch decoded when its rate is > 1/2.
Concluding Remarks: Batch Codes • A common relaxation of very different combinatorial objects: • Expanders • Locally-decodable codes • The problem makes sense even for small values of m, k. • For multiset codes with m=3, k=2, rate 2/3 is optimal. • Open for m ≥ k+2. • Useful building block for "distributed data structures".
Concluding Remarks: PIR • (Table of amortization settings: {single user, multiple users} × {non-adaptive, adaptive}; this work handles the single-user non-adaptive case, and the other three cells are marked "?" as open.) • Single-user amortization is useful in practice only if PIR is significantly more efficient than download. • Certainly true for multi-server PIR • Most likely true also for single-server PIR • Killer app for lattice-based cryptosystems?