Noise Tolerant Learning
Presented by Aviad Maizels
Based on:
• "Noise-tolerant learning, the parity problem, and the statistical query model" – Avrim Blum, Adam Kalai and Hal Wasserman
• "A Generalized Birthday Problem" – David Wagner
• "Hard-core predicates for any one-way function" – Oded Goldreich and Leonid A. Levin
• "Simulated annealing and Boltzmann machines" – Emile Aarts and Jan Korst
void Agenda() {
  do {
    • A few sentences about codes
    • The opposite problem
    • Learning with noise
    • The k-sum problem
    • Can we do it faster?
    • Annealing
  } while (!understandable);
}
void fast_introduction_to_LECC() {
[Figure: binary symmetric channel – each bit arrives intact with probability 1−p and is flipped with probability p]
The communication channel may corrupt the original data.
Proposed solution: encode messages to give some protection against errors.
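To make the channel model concrete, here is a minimal Python sketch (my own illustration, not from the slides) of a binary symmetric channel with crossover probability p:

```python
import random

def bsc(bits, p):
    """Binary symmetric channel: flip each bit independently with probability p."""
    return [b ^ int(random.random() < p) for b in bits]

# Example: send an all-zero word through a channel with 10% crossover probability.
print(bsc([0] * 20, 0.1))
```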
void fast_introduction_to_LECC()(Continue – terminology)
Linear codes:
• Fixed-size block code
• Additive closure (the sum of two codewords is a codeword)
A code is tagged using two parameters (n,k):
• k – data size
• n – encoded word size
[Figure: Source → Encoder → Channel; msg = u1u2…uk is encoded into codeword = x1x2…xn, and noise is added on the channel]
void fast_introduction_to_LECC()(Continue – terminology)
[Figure: a codeword split into k data bits followed by n−k redundancy bits]
• Systematic code – the original data appears directly inside the codeword.
• Generating matrix (G) – a matrix s.t. multiplying a message by it outputs the encoded word.
  • Number of rows == space dimension (k)
  • Every codeword can be represented as a linear combination of G's rows.
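As a hedged illustration of a generating matrix (not taken from the slides), here is encoding with a toy systematic (7,4) generator over GF(2); the parity part P is one standard choice, namely the Hamming(7,4) code:

```python
import numpy as np

# Systematic generator G = [I_4 | P] over GF(2); P here is the Hamming(7,4) parity part.
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)

def encode(msg):
    """Multiply the k-bit message by G modulo 2 to get the n-bit codeword."""
    return (np.array(msg, dtype=np.uint8) @ G) % 2

print(encode([1, 0, 1, 1]))   # the first 4 bits are the original data (systematic code)
```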
void fast_introduction_to_LECC()(Continue – terminology)
[Figure: the vertices 000,…,111 of the 3-bit Hamming cube]
• Hamming distance – the number of places in which two vectors differ
  • Denoted dist(x,y)
• Hamming weight – the number of non-zero places in a vector
  • Denoted wt(x)
• Minimum distance of a linear code – the minimum weight of any non-zero codeword
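Small helper functions for these definitions (my own sketch, assuming 0/1 lists as vectors):

```python
def hamming_weight(x):
    """wt(x): the number of non-zero positions in a vector."""
    return sum(1 for b in x if b != 0)

def hamming_distance(x, y):
    """dist(x, y): the number of positions where two equal-length vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

# For a linear code dist(x, y) = wt(x XOR y), which is why the minimum distance
# equals the minimum weight over the non-zero codewords.
print(hamming_distance([1, 0, 1, 1], [1, 1, 1, 0]), hamming_weight([0, 1, 0, 1]))
```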
void fast_introduction_to_LECC()(Continue – terminology)
• Perfect code (t) – every vector has Hamming distance <= t from a unique codeword
[Figure: Channel → Decoder → Target; received word = x + e, error vector e = e1e2…en, decoded message msg' = ??]
void fast_introduction_to_LECC()(Continue – terminology)
• Complete decoding – the acceptance regions around the codewords together contain all vectors of length n
...
}
void the_opposite_problem() {
• Decoding linear (n,k) codes in the presence of random noise, when k > O(log n), in poly(n) time.
  • k = O(log n) is trivial
In !(coding-theory) terms:
• Given a finite set of codewords (examples) of length n, their labels, and a new vector, find/learn the label of that vector, in the presence of random noise, in poly(n) time.
void the_opposite_problem()(Continue – Main idea)
Without noise:
• Any vector can be written as a linear combination of previously seen examples.
• The vector's label can then be deduced from the same combination of the examples' labels.
So… all we need is to find a basis; from it we can deduce the label of any new example (a sketch of this noiseless case follows below).
Q: Is it the same in the presence of noise?
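Here is a sketch of the noiseless case just described: collect examples, run Gaussian elimination over GF(2), and read off the parity vector. The function name and toy data are mine, assuming labels of the form <v, x> mod 2:

```python
import numpy as np

def solve_parity_gf2(examples, labels):
    """Gaussian elimination over GF(2): recover v from noiseless examples (x, <v,x> mod 2)."""
    A = np.array(examples, dtype=np.uint8)
    b = np.array(labels, dtype=np.uint8)
    n = A.shape[1]
    M = np.concatenate([A, b.reshape(-1, 1)], axis=1)
    row = 0
    for col in range(n):
        pivot = next((r for r in range(row, len(M)) if M[r, col]), None)
        if pivot is None:
            continue                      # the examples do not span this coordinate
        M[[row, pivot]] = M[[pivot, row]]
        for r in range(len(M)):
            if r != row and M[r, col]:
                M[r] ^= M[row]            # eliminate this column entry mod 2
        row += 1
    return M[:n, -1]                      # v, assuming the examples had full rank

# Hypothetical usage with v = (1, 0, 1): labels are the noiseless parities <v, x> mod 2.
xs = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
ys = [1, 0, 1, 1]
print(solve_parity_gf2(xs, ys))           # -> [1 0 1]
```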
void the_opposite_problem()(Continue – Main idea)
Well… no. Summing examples actually boosts the noise:
Given s examples and a noise rate of η < ½, the sum (XOR) of the s examples has a noise rate of ½ − ½(1−2η)^s.
So the plan: write the basis vectors as sums of a small number of examples, and the new sample as a linear combination of the above.
}
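A quick sanity check of the noise-rate formula above, via a small simulation (mine, not from the paper):

```python
import random

def xor_noise_rate(eta, s, trials=100_000):
    """Empirical error rate of the XOR of s labels, each flipped independently with probability eta."""
    errors = 0
    for _ in range(trials):
        flips = sum(random.random() < eta for _ in range(s))
        errors += flips % 2 == 1          # the XOR is wrong iff an odd number of labels flipped
    return errors / trials

eta, s = 0.1, 8
print(xor_noise_rate(eta, s), 0.5 - 0.5 * (1 - 2 * eta) ** s)   # the two values should be close
```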
void learning_with_noise() {
• Concept – a Boolean function over the input space
• Concept class – a set of concepts
• World model:
  • There is a fixed noise rate η < 1/2
  • A fixed probability distribution D over the input space
  • The algorithm may ask for a labeled example (x, l)
  • … and there is an unknown concept c
void learning_with_noise()(Continue – Preliminaries)
[Figure: a k-bit example x = 1010111… fed into the concept c]
• Goal: find an ε-approximation of c
  • a function h s.t. Pr_{x←D}[h(x) = c(x)] ≥ 1 − ε
• Parity function: defined by a corresponding vector v ∈ {0,1}^n. The function is then given by the rule c(x) = <v, x> mod 2, i.e. the XOR of the bits of x selected by v.
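A minimal sketch of the parity concept and of estimating how well a hypothesis approximates it (names and the uniform distribution are my assumptions):

```python
import random

def parity(v, x):
    """The parity concept defined by v: the inner product <v, x> mod 2."""
    return sum(vi & xi for vi, xi in zip(v, x)) % 2

def agreement(h, c, n, samples=10_000):
    """Estimate Pr_x[h(x) = c(x)] under the uniform distribution over {0,1}^n."""
    hits = 0
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        hits += h(x) == c(x)
    return hits / samples

n = 16
v = [random.randint(0, 1) for _ in range(n)]
c = lambda x: parity(v, x)
print(agreement(c, c, n))   # a hypothesis equal to c agrees with probability 1
```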
void learning_with_noise()(Continue – Preliminaries)
• Efficiently learnable: a concept class C is efficiently learnable in the presence of random classification noise under distribution D if:
  • there exists an algorithm A s.t. for every ε > 0, δ > 0, noise rate η < 1/2 and concept c ∈ C,
  • A produces an ε-approximation of c with probability at least 1 − δ when given access to D-random examples, and
  • A runs in time polynomial in n, 1/ε, 1/δ and 1/(1/2 − η).
void learning_with_noise()(Continue – Goal)
We'll show that:
The length-k parity problem with noise rate η < 1/2 can be solved with computation time and total number of examples 2^(O(k/log k)).
Observe the behavior of the noise when we add up examples:
void learning_with_noise()(Continue – Noise behavior)
For each label, let p_i be the probability that the bit is noisy and q_i the probability that it is correct (p_1, q_1 for the first example's label, p_2, q_2 for the second):
• p_i + q_i = 1
• Denote the bias s_i = q_i − p_i = 1 − 2p_i = 2q_i − 1, so s_i ∈ [−1, 1]
The XOR of the two labels is noisy exactly when one of the two is noisy, so:
• p_3 = p_1·q_2 + p_2·q_1 ; q_3 = p_1·p_2 + q_1·q_2
• s_3 = q_3 − p_3 = s_1·s_2, i.e. the biases multiply and shrink with every sum
void learning_with_noise()(Continue – Idea)
Main idea: draw many more examples than needed, so that basis vectors can be found as sums of relatively small numbers of examples.
• If η < 1/2, the sum of ω(log n) labels would be polynomially indistinguishable from random, so the sums must stay short
• We can repeat the process to boost reliability
void learning_with_noise()(Continue – Definitions)
[Figure: a k-bit example 1010111… split into a blocks of b bits each, numbered 1,…,a]
A few more definitions:
• k = a·b
• V_i – the subspace of {0,1}^(ab) consisting of vectors whose last i blocks are zeroed
• i-sample – a set of independent vectors, each uniformly distributed over V_i
void learning_with_noise()(Continue – Main construction)
Construction: given an i-sample of size s, we construct an (i+1)-sample of size at least s − 2^b, in time O(s) (a small sketch follows below).
Behold:
• i-sample = {x_1,…,x_s}
• Partition the x's based on the value of block (a−i) (we get at most 2^b partitions)
• For each non-empty partition, pick a random vector, add it to the other vectors in its partition, and then discard it
Result: vectors z_1,…,z_m with m ≥ s − 2^b, where:
• One more block is zeroed out, so the last i+1 blocks are zero
• The z_j are independent and uniformly distributed over V_(i+1)
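One possible rendering of this partition-and-cancel step in Python; the function and variable names are mine, and the block position is passed in explicitly rather than tracked as in the slides:

```python
import random
from collections import defaultdict

def zero_out_block(sample, block_start, block_len):
    """Partition the (vector, label) pairs by the value of one b-bit block, XOR a
    random representative of each bucket into the others, and discard it.
    The chosen block becomes zero in every surviving vector."""
    buckets = defaultdict(list)
    for vec, lab in sample:
        key = tuple(vec[block_start:block_start + block_len])
        buckets[key].append((vec, lab))
    out = []
    for group in buckets.values():
        rep_idx = random.randrange(len(group))
        rep_vec, rep_lab = group[rep_idx]
        for idx, (vec, lab) in enumerate(group):
            if idx == rep_idx:
                continue                                   # the representative is discarded
            out.append(([a ^ b for a, b in zip(vec, rep_vec)], lab ^ rep_lab))
    return out

# Toy usage: vectors of length 12 seen as a = 3 blocks of b = 4 bits; labels are random here.
sample = [([random.randint(0, 1) for _ in range(12)], random.randint(0, 1)) for _ in range(200)]
print(len(zero_out_block(sample, block_start=8, block_len=4)))   # at least 200 - 2^4
```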
void learning_with_noise()(Continue – Algorithm)
Algorithm (finding the 1st bit):
• Ask for a·2^b labeled examples
• Apply the construction (a−1) times to get an (a−1)-sample
• There is roughly a 1 − 1/e chance that the vector (1,0,…,0) is a member of the (a−1)-sample. If it's not there, we do it again with new labeled examples (the expected number of repetitions is constant)
Note: we've written (1,0,…,0) as a sum of at most 2^(a−1) examples, boosting its noise rate to ½ − ½(1−2η)^(2^(a−1)).
void learning_with_noise()(Continue – Observations)
Observations:
• We found the first bit of our new sample using a number of examples and computation time polynomial in 2^b and in (1/(1−2η))^(2^a)
• We can shift all examples to determine the remaining bits
• Fixing a = (1/2)·log k and b = 2k/log k gives the desired 2^(O(k/log k)) bound for a constant noise rate η.
}
void the_k_sum_problem() {
The key to improving the above algorithm is a better way to solve a problem similar to the "k-sum" problem.
Problem: given k lists L_1,…,L_k of elements drawn uniformly and independently from {0,1}^n, find x_1 ∈ L_1,…,x_k ∈ L_k s.t. x_1 ⊕ x_2 ⊕ … ⊕ x_k = 0.
Note: a solution to the "k-sum" problem exists with good probability if |L_1|·|L_2|·…·|L_k| >> 2^n (similar to the birthday paradox).
void the_k_sum_problem()(Continue – Wagner's Algorithm – Definitions)
Preliminary definitions and observations:
• low_l(x) – the l least significant bits of x
• L_1 ⋈_l L_2 – contains all pairs from L_1 × L_2 that agree on the l least significant bits
• If low_l(x_1 ⊕ x_2) = 0 and low_l(x_3 ⊕ x_4) = 0, then low_l(x_1 ⊕ x_2 ⊕ x_3 ⊕ x_4) = 0 and Pr[x_1 ⊕ x_2 ⊕ x_3 ⊕ x_4 = 0] = 2^l/2^n
• Join (⋈_l) implementations:
  • Hash join: stores one list in a hash table and scans through the other – O(|L_1| + |L_2|) steps and storage
  • Merge join: sorts both lists and scans them together – O(max(|L_1|,|L_2|)·log(max(|L_1|,|L_2|))) time
void the_k_sum_problem()(Continue – Wagner's Algorithm – Simple case)
[Figure: a tree joining L_1 ⋈_l L_2 and L_3 ⋈_l L_4, then merging into {(x_1,…,x_4): x_1 ⊕ … ⊕ x_4 = 0}]
The 4-list case:
• Extend the lists until they each contain 2^l elements
• Generate a new list L_12 of values x_1 ⊕ x_2 s.t. low_l(x_1 ⊕ x_2) = 0, and a new list L_34 in the same way
• Search for matches between L_12 and L_34
void the_k_sum_problem()(Continue – Wagner's Algorithm)
Observation:
• Pr[low_l(x_i ⊕ x_j) = 0] = 1/2^l when 1 ≤ i < j ≤ 4 and x_i, x_j are chosen uniformly at random
• E[|L_ij|] = (|L_i|·|L_j|)/2^l = 2^(2l)/2^l = 2^l
• The expected number of matches between L_12 and L_34 that yield the desired solutions is |L_12|·|L_34|/2^(n−l) (taking l ≈ n/3 gives at least 1)
Complexity:
• O(2^(n/3)) time and space
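A sketch of the 4-list case under the notation above; the helper names are mine, and since the expected number of matches is about one, the search may occasionally return nothing:

```python
import random
from collections import defaultdict

def low(x, l):
    """low_l(x): the l least significant bits of x."""
    return x & ((1 << l) - 1)

def join_low(A, B, l):
    """Hash join: all pairs (a, b) with low_l(a ^ b) = 0, i.e. agreeing on the low l bits."""
    table = defaultdict(list)
    for a in A:
        table[low(a, l)].append(a)
    return [(a, b) for b in B for a in table[low(b, l)]]

def four_list_sum(L1, L2, L3, L4, l):
    """Wagner-style 4-list step: look for x1, x2, x3, x4 with x1^x2^x3^x4 = 0."""
    L12 = {a ^ b: (a, b) for a, b in join_low(L1, L2, l)}   # partial sums; low l bits already 0
    for c, d in join_low(L3, L4, l):
        if c ^ d in L12:
            a, b = L12[c ^ d]
            return a, b, c, d
    return None                                             # no match this time

# Toy usage with n = 24, l = n // 3 = 8 and four lists of 2^l random n-bit values.
n, l = 24, 8
lists = [[random.getrandbits(n) for _ in range(1 << l)] for _ in range(4)]
print(four_list_sum(*lists, l=l))
```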
void the_k_sum_problem()(Continue – Wagner's Algorithm)
Useful variations:
• We don't need the low l bits to be zero – we can fix them to any α (i.e. join on low_l(x_i ⊕ x_j) = α)
• The value 0 in x_1 ⊕ … ⊕ x_k = 0 can be replaced with any constant c of our choice (by replacing L_k with L_k' = L_k ⊕ c)
• If k > k', the complexity of the "k-sum" problem can be no larger than the complexity of the "k'-sum" problem (just pick arbitrary x_(k'+1),…,x_k, define c = x_(k'+1) ⊕ … ⊕ x_k and use the "k'-sum" algorithm to find a solution to x_1 ⊕ … ⊕ x_k' = c)
⇒ we can solve the "k-sum" problem with complexity at most O(2^(n/3)) for all k ≥ 4
void the_k_sum_problem()(Continue – Wagner's Algorithm)
Extending the 4-list case:
• Create a complete binary tree of depth log k with the k lists at the leaves.
• At depth h we use ⋈_l joins, zeroing out the next l low bits at every level (with l = n/(1 + log k)).
So we get an algorithm that requires O(k·2^(n/(1+log k))) time and space.
Note: if k is not a power of 2, we take k' to be the largest power of 2 less than k and then use the list-elimination trick from the previous slide.
}
void can_we_do_it_better_?() {
But… maybe there's a problem with the approach? Yes:
• How many samples do we really need to get a solution with good probability?  k + log k − log(−ln(1−ε))
• Do we even need a basis?  Yes
• Can we do it without scanning the whole space?  Yes & no…
• Do we need the best solution?  No
void can_we_do_it_better_?()(Continue – Sampling space)
To have a solution we need k linearly independent vectors in our sampling space S.
So we'll want Pr[S contains k linearly independent vectors] ≥ 1 − ε, where ε ∈ [0,1], which gives
|sampling space| = O(k + log k + f(ε))
}
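To get a feel for how fast the full-rank probability approaches 1, here is a small sketch using the standard product formula for random GF(2) matrices (my own illustration; the slide's f(ε) is not reconstructed here):

```python
def full_rank_probability(k, m):
    """Probability that m uniform random vectors in GF(2)^k contain k linearly
    independent ones: prod_{i=0}^{k-1} (1 - 2^(i - m)), for m >= k."""
    p = 1.0
    for i in range(k):
        p *= 1.0 - 2.0 ** (i - m)
    return p

# How many samples beyond k are needed before the probability is close to 1?
for extra in (0, 2, 5, 10, 20):
    print(extra, round(full_rank_probability(128, 128 + extra), 4))
```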
void annealing() {
Simulated annealing mimics the physical process of heating a solid until it melts, and then cooling it slowly into a state of a perfect lattice.
Problem': find, among a potentially very large number of solutions, a solution with minimal cost.
• Note: we don't even need the minimal-cost solution – just one whose noise rate is below our threshold
void annealing()(Continue – Combinatorial optimization)
Some definitions:
• The set of solutions of the combinatorial problem is taken as the set of states S'
  • Note: in our case the states are built from the sampled vectors in S (see the next slides)
• The cost function is the energy E : S' → R that we minimize
• The transition probability between neighboring states depends on their energy difference and on an external temperature T
void annealing()(Continue – Pseudo code algorithm)
• Set T to a high temperature
• Choose an arbitrary initial state c
• Loop:
  • Select a neighbor c' of c; set ΔE = E(c') − E(c)
  • If ΔE < 0 then move to c', else move to c' with probability exp(−ΔE/T)
  • Repeat the two steps above several more times
  • Decrease T
• Wait long enough and cross fingers… (preferably more than 2)
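The pseudo-code above, written out as a runnable sketch; the neighbor and energy callables are placeholders for the problem-specific parts (my assumptions, not the presenter's):

```python
import math
import random

def simulated_annealing(initial, neighbor, energy,
                        T=10.0, cooling=0.95, steps_per_phase=100, phases=100):
    """Generic annealing loop mirroring the pseudo-code: accept downhill moves always,
    uphill moves with probability exp(-dE/T), and lower T after every phase."""
    state, e = initial, energy(initial)
    best, best_e = state, e
    for _ in range(phases):
        for _ in range(steps_per_phase):
            cand = neighbor(state)
            delta = energy(cand) - e
            if delta < 0 or random.random() < math.exp(-delta / T):
                state, e = cand, e + delta
                if e < best_e:
                    best, best_e = state, e
        T *= cooling                      # decrease the temperature
    return best, best_e

# Toy usage: minimize |x - 42| over the integers.
print(simulated_annealing(0, lambda x: x + random.choice([-1, 1]), lambda x: abs(x - 42)))
```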
void annealing()(Continue – Problems)
Problems:
• Not all states can yield our new sample (only the ones containing at least one vector from S\basis)
• The probability that a "capable" state will yield the zero vector is 1/2^k
• The probability that any j vectors from S, 1 ≤ j ≤ k, yield a solution is small
  • Note: when |S| is close to k, that probability approaches zero
void annealing()(Continue – Reduction)
Idea:
• Sample a little more than is needed: |S| = O(c·k)
• Assign each vector its Hamming weight and sort S by it
Reduction:
• Spawning the next generation: keep all the states that include a vector whose Hamming weight is at most 2·wt(l)
void annealing()(Continue – Convergence & Complexity ??)
Complexity: O(τ · L · ln|S'|)
• L denotes the number of steps needed to reach quasi-equilibrium in each phase, and τ denotes the computation time of a single transition
• ln(|S'|) denotes the number of phases needed to reach an accepted solution, using a polynomial-time cooling schedule
Game Over
"I don't even see the code anymore… all I can see now are blondes, brunettes, redheads…" – Cypher ("The Matrix")
void appendix()([GL])
Theorem: Suppose we have oracle access to a random process b_x : {0,1}^n → {0,1} such that
Pr_r[b_x(r) = b(x,r)] ≥ 1/2 + ε,
where the probability is taken uniformly over the internal coin tosses of b_x and all possible choices of r, and b(x,r) denotes the inner product mod 2 of x and r. Then we can, in time polynomial in n/ε, output a list of strings that contains x with probability at least 1/2.
void appendix()(Continue – [GL] – highway)
How??
1st way (to extract x_i): suppose s(x) = Pr_r[b_x(r) = b(x,r)] ≥ 3/4 + ε   (hmmm??)
The probability that both b_x(r) = b(x,r) and b_x(r⊕e_i) = b(x, r⊕e_i) hold is at least 1/2 + 2ε, and since b(x,r) ⊕ b(x, r⊕e_i) = x_i, a majority vote over many r's recovers x_i… but…
void appendix()(Continue – [GL] – better way)
2nd way:
Idea: guess b(x,r) by ourselves.
Problem: we need polynomially many r's, and guessing them all independently succeeds only with negligible probability.
Solution: generate polynomially many r's that are "sufficiently" random, but such that we can still guess all of them correctly with non-negligible probability.
void appendix()(Continue – [GL] – better way)
Construction:
• Select uniformly l strings in {0,1}^n and denote them s_1,…,s_l.
• Guess σ_i = b(x, s_i) for i = 1,…,l. The probability that all l guesses are correct is 2^(−l).
• Associate each r_J with a different non-empty subset J of {1,…,l} s.t. r_J = ⊕_{i∈J} s_i.
• Note that: b(x, r_J) = ⊕_{i∈J} b(x, s_i) = ⊕_{i∈J} σ_i, and the r_J's are pairwise independent.
• Try all 2^l possibilities for σ_1,…,σ_l and output a list of 2^l candidate strings z ∈ {0,1}^n.
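A small sketch of the query-generation part of this construction (names are mine): l random seeds determine 2^l − 1 pairwise-independent queries, and guessing b(x, ·) on the seeds determines it on every query:

```python
import itertools
import random

def gl_correlated_queries(n, l):
    """Pick l uniform seed strings s_1..s_l and form r_J = XOR of the seeds indexed by
    each non-empty subset J of {1..l}.  Once b(x, s_i) is guessed for the l seeds,
    b(x, r_J) is determined for every J, since b(x, .) is linear in its second argument."""
    seeds = [random.getrandbits(n) for _ in range(l)]
    queries = {}
    for size in range(1, l + 1):
        for J in itertools.combinations(range(l), size):
            r = 0
            for i in J:
                r ^= seeds[i]
            queries[J] = r
    return seeds, queries

# With guesses sigma[i] for b(x, seeds[i]), the induced guess for b(x, queries[J])
# is the XOR of sigma[i] over i in J.
seeds, queries = gl_correlated_queries(n=16, l=4)
print(len(queries))   # 2^l - 1 = 15 correlated queries from only l = 4 random seeds
```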