Fourier Analysis and Boolean Function Learning
Jeff Jackson, Duquesne University
www.mathcs.duq.edu/~jackson
Themes
• Fourier analysis is central to learning-theoretic results in a wide variety of models
• These results are generally the strongest known for learning Boolean function classes with respect to the uniform distribution
• Work on learning problems has led to some new harmonic results:
  • Spectral properties of Boolean function classes
  • Algorithms for approximating Boolean functions
Uniform Learning Model
[Diagram] A target function f: {0,1}^n → {0,1} from a Boolean function class F (e.g., DNF) is hidden behind an example oracle EX(f), which returns uniform random examples <x, f(x)>. Given accuracy ε > 0, learning algorithm A must output a hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε.
Circuit Classes
• Constant-depth AND/OR circuits (AC0 without the polynomial-size restriction; call this CDC)
• DNF: depth-2 circuit with OR at root
[Figure: a depth-d circuit of alternating ∧ and ∨ levels over inputs v1, v2, v3, …, vn; negations allowed]
Decision Trees
[Figure: an example decision tree with internal nodes v1–v4 and 0/1 leaves; successive slides trace the evaluation of x = 11001, following the x3 = 0 branch, then the x1 = 1 branch, down to a leaf giving f(x) = 1]
Function Size
• Each function representation has a natural size measure:
  • CDC, DNF: # of gates
  • DT: # of leaves
• The size s_F(f) of f with respect to class F is the size of the smallest representation of f within F
• For all Boolean f: s_CDC(f) ≤ s_DNF(f) ≤ s_DT(f)
Efficient Uniform Learning Model
[Diagram] Same as the uniform learning model, but A must run in time poly(n, s_F(f), 1/ε): target f: {0,1}^n → {0,1} from class F (e.g., DNF), uniform random examples <x, f(x)> from EX(f), accuracy ε > 0, output h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε.
Harmonic-Based Uniform Learning
• [LMN]: constant-depth circuits are quasi-efficiently (n^polylog(s/ε)-time) uniform learnable
• [BT]: monotone Boolean functions are uniform learnable in time roughly 2^(√n · log n)
  • Monotone: for all x, i: f(x|x_i=0) ≤ f(x|x_i=1)
  • Also exponential in 1/ε (so assumes ε is constant)
  • But independent of any size measure
Notation
• Assume f: {0,1}^n → {-1,1}
• For all a in {0,1}^n, χ_a(x) ≡ (-1)^(a·x)
• For all a in {0,1}^n, the Fourier coefficient f̂(a) of f at a is f̂(a) ≡ E_{x~U}[f(x)·χ_a(x)]
• Sometimes write, e.g., f̂({1}) for f̂(10…0)
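These definitions are easy to operationalize. Below is a minimal Python sketch (not from the talk) that estimates a single coefficient f̂(a) from uniform random examples; the target `majority3` and the sample size are illustrative assumptions.

```python
import random

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x) for bit-tuples a, x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def estimate_coefficient(f, a, n, samples=10000):
    """Estimate f_hat(a) = E_x[f(x) * chi_a(x)] from uniform random x."""
    total = 0
    for _ in range(samples):
        x = tuple(random.randint(0, 1) for _ in range(n))
        total += f(x) * chi(a, x)
    return total / samples

# Illustrative target: majority of 3 bits, with outputs in {-1, +1}.
def majority3(x):
    return 1 if sum(x) >= 2 else -1

# ~ -0.5: f_hat(100) is negative because x_1 = 1 pushes MAJ toward +1
# while chi_{100} flips to -1.
print(estimate_coefficient(majority3, (1, 0, 0), n=3))
```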
Fourier Properties of Classes
• [LMN]: if f is a constant-depth circuit of depth d and S = { a : |a| < log^d(s/ε) }, then Σ_{a∉S} f̂²(a) < ε   ( |a| ≡ # of 1's in a )
• [BT]: if f is a monotone Boolean function and S = { a : |a| < √n / ε }, then Σ_{a∉S} f̂²(a) < ε
Proof Techniques
• [LMN]: Håstad's Switching Lemma + harmonic analysis
• [BT]: based on [KKL]
  • Define the average sensitivity AS(f) ≡ n · Pr_{x,i}[f(x|x_i=0) ≠ f(x|x_i=1)]
  • If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε
  • For monotone f, harmonic analysis + Cauchy-Schwarz shows AS(f) ≤ √n
  • Note: this is tight for MAJ
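The quantity AS(f) is defined operationally, so it can be estimated by sampling given black-box access to f. A minimal sketch, with an illustrative sample size:

```python
import random

def average_sensitivity(f, n, samples=20000):
    """Monte Carlo estimate of AS(f) = n * Pr_{x,i}[ f(x|x_i=0) != f(x|x_i=1) ]."""
    hits = 0
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        i = random.randrange(n)
        x0, x1 = list(x), list(x)
        x0[i], x1[i] = 0, 1
        hits += f(tuple(x0)) != f(tuple(x1))
    return n * hits / samples

# For majority of 3 bits, a coordinate is pivotal iff the other two differ,
# so AS = 3 * 1/2 = 1.5.
```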
Function Approximation
• For all Boolean f: f = Σ_a f̂(a)·χ_a
• For S ⊆ {0,1}^n, define f_S ≡ Σ_{a∈S} f̂(a)·χ_a
• [LMN]: Pr_x[f(x) ≠ sign(f_S(x))] ≤ E_x[(f − f_S)²] = Σ_{a∉S} f̂²(a)
“The” Fourier Learning Algorithm
• Given: ε (and perhaps s, d, ...)
• Determine k such that for S = {a : |a| < k}, Σ_{a∉S} f̂²(a) < ε
• Draw a sufficiently large sample of examples <x, f(x)> to closely estimate f̂(a) for all a∈S
  • Chernoff bounds: ~n^k/ε sample size is sufficient
• Output h ≡ sign(Σ_{a∈S} f̃(a)·χ_a), where f̃(a) is the estimate of f̂(a)
• Run time ~n^(2k)/ε
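A hedged sketch of this low-degree algorithm in Python (same conventions as the earlier snippet: examples are (x, f(x)) pairs with f(x) ∈ {-1, +1}; choosing k and the sample size is the caller's responsibility, per the spectral bounds on the previous slides):

```python
from itertools import combinations

def chi(a, x):
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def low_degree_learn(examples, n, k):
    """LMN-style low-degree algorithm: estimate every f_hat(a) with |a| < k
    from one sample of uniform examples, then output the sign of the
    truncated Fourier expansion."""
    m = len(examples)
    coeffs = {}
    for degree in range(k):
        for ones in combinations(range(n), degree):
            a = tuple(1 if i in ones else 0 for i in range(n))
            coeffs[a] = sum(y * chi(a, x) for x, y in examples) / m
    def h(x):
        return 1 if sum(c * chi(a, x) for a, c in coeffs.items()) >= 0 else -1
    return h

# Usage: examples = [(x, f(x)) for uniform random x]; h = low_degree_learn(examples, n, k)
```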
Halfspaces
• [KOS]: halfspaces are efficiently uniform learnable (given that ε is constant)
  • Halfspace: ∃ w ∈ R^(n+1) s.t. f(x) = sign(w · (x∘1))
• If S = {a : |a| < (21/ε)²} then Σ_{a∉S} f̂²(a) < ε
  • Apply the LMN algorithm
• A similar result applies for an arbitrary function applied to a constant number of halfspaces
  • Intersection of halfspaces is a key learning problem
Halfspace Techniques
• [O] (cf. [BKS], [BJTa]):
  • The noise sensitivity of f at γ is the probability that corrupting each bit of x independently with probability γ changes f(x)
  • NS_γ(f) ≡ ½(1 − Σ_a (1−2γ)^|a| f̂²(a))
• [KOS]:
  • If S = {a : |a| < 1/γ} then Σ_{a∉S} f̂²(a) < 3·NS_γ(f)
  • If f is a halfspace then NS_γ(f) < 9√γ
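Since noise sensitivity is defined operationally, it too can be estimated directly by sampling. A minimal sketch (the halfspace target and sample size are illustrative assumptions):

```python
import random

def noise_sensitivity(f, n, gamma, samples=20000):
    """Estimate NS_gamma(f) = Pr[f(x) != f(y)], where y corrupts each bit
    of x independently with probability gamma."""
    hits = 0
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        y = [b ^ (random.random() < gamma) for b in x]
        hits += f(tuple(x)) != f(tuple(y))
    return hits / samples

# Illustrative halfspace: a majority vote; [KOS] bounds its NS_gamma by 9*sqrt(gamma).
def halfspace(x):
    return 1 if 2 * sum(x) >= len(x) else -1
```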
Monotone DT
• [OS]: monotone functions are efficiently learnable given:
  • ε is constant
  • s_DT(f) is used as the size measure
• Techniques:
  • Harmonic analysis: for monotone f, AS(f) ≤ √(log s_DT(f))
  • [BT]: if S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε
  • Friedgut: ∃ T with |T| ≤ 2^(AS(f)/ε) s.t. Σ_{A⊄T} f̂²(A) < ε
Weak Approximators
• [KKL] also show that if f is monotone, there is an i such that −f̂({i}) ≥ log²n/n
• Therefore Pr[f(x) = −χ_{i}(x)] ≥ ½ + log²n/2n
• In general, h s.t. Pr[f = h] ≥ ½ + 1/poly(n,s) is called a weak approximator to f
• If A outputs a weak approximator for every f in F, then F is weakly learnable
Weak Uniform Learning Model
[Diagram] Same as the uniform learning model, except there is no accuracy parameter: A must output h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ½ − 1/p(n,s) for some polynomial p.
Efficient Weak Learning Algorithm for Monotone Boolean Functions
• Draw a set of ~n² examples <x, f(x)>
• For i = 1 to n:
  • Estimate f̂({i})
• Output h ≡ −χ_{i*}, where i* = argmax_i (−f̂({i}))
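A minimal sketch of this weak learner, following the slide (estimate the n degree-1 coefficients from one sample, output the negated single-variable parity with the largest advantage); drawing the ~n² examples is left to the caller:

```python
def weak_learn_monotone(examples, n):
    """Weak learner for monotone f: estimate f_hat({i}) for each i from
    uniform examples (x, f(x)) with f(x) in {-1, +1}, then output
    h = -chi_{i*} for the i* maximizing -f_hat({i})."""
    m = len(examples)
    est = [sum(y * (-1 if x[i] else 1) for x, y in examples) / m
           for i in range(n)]
    i_star = min(range(n), key=lambda i: est[i])  # most negative f_hat({i})
    return lambda x: 1 if x[i_star] else -1       # -chi_{i*}(x)
```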
Weak Approximation for MAJ of Constant-Depth Circuits
• Note that adding a single MAJ gate to a CDC destroys the LMN spectral property
• [JKS]: MAJ of CDCs is quasi-efficiently quasi-weakly uniform learnable
• If f is a MAJ of CDCs of depth d, and the number of gates in f is s, then there is an A ∈ {0,1}^n such that:
  • |A| < log^d s ≡ k
  • Pr[f(x) = χ_A(x)] ≥ ½ + 1/(4s·n^k)
Weak Learning Algorithm
• Compute k = log^d s
• Draw ~s·n^k examples <x, f(x)>
• Repeat over A with |A| < k:
  • Estimate f̂(A)
• Until an A is found s.t. f̂(A) > 1/(2s·n^k)
• Output h ≡ χ_A
• Run time ~n^polylog(s)
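A sketch of the search loop (brute force over parities of degree < k, with the threshold from the slide passed in by the caller; this is feasible only because k is polylogarithmic, which is the point of the "quasi-efficient" qualifier):

```python
from itertools import combinations

def find_weak_parity(examples, n, k, threshold):
    """Scan parities chi_A with |A| < k and return the first A whose estimated
    coefficient exceeds the threshold (e.g., 1/(2*s*n**k) per the slide)."""
    m = len(examples)
    for degree in range(k):
        for ones in combinations(range(n), degree):
            est = sum(y * (-1 if sum(x[i] for i in ones) % 2 else 1)
                      for x, y in examples) / m
            if est > threshold:
                return ones  # output h = chi_A for A = ones
    return None
```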
Weak Approximator Proof Techniques
• "Discriminator Lemma" [HMPST]:
  • Implies that one of the CDCs is a weak approximator to f
• LMN spectral characterization of CDC
• Harmonic analysis
• A result of Beigel is used to extend weak learning to CDC with polylog many MAJ gates
Boosting
• In many (not all) cases, uniform weak learning algorithms can be converted to uniform (strong) learning algorithms using a boosting technique ([S], [FS], …); see the sketch below
• Need to learn weakly with respect to near-uniform distributions
  • For near-uniform distribution D, find weak h_j s.t. Pr_{x~D}[h_j = f] > ½ + 1/poly(n,s)
• The final h is typically a MAJ of the weak approximators
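The sketch mentioned above: a minimal AdaBoost-style loop in the spirit of [FS]. The weak-learner interface (takes the sample plus a distribution over it, returns a ±1 hypothesis) is an assumption of this sketch, not a detail from the talk:

```python
import math

def boost(examples, weak_learner, rounds):
    """Boosting sketch: maintain a distribution over the sample, call the weak
    learner against it, and output a weighted majority of the weak hypotheses."""
    m = len(examples)
    weights = [1.0 / m] * m
    hs, alphas = [], []
    for _ in range(rounds):
        h = weak_learner(examples, weights)   # weak w.r.t. current distribution
        err = sum(w for w, (x, y) in zip(weights, examples) if h(x) != y)
        err = min(max(err, 1e-9), 1 - 1e-9)   # guard against degenerate rounds
        alpha = 0.5 * math.log((1 - err) / err)
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, (x, y) in zip(weights, examples)]
        z = sum(weights)
        weights = [w / z for w in weights]
        hs.append(h)
        alphas.append(alpha)
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1
```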
Strong Learning for MAJ of Constant-Depth Circuits
• [JKS]: MAJ of CDC is quasi-efficiently uniform learnable
  • Show that for near-uniform distributions, some parity function is a weak approximator
  • Beigel's result again extends this to CDC with polylog many MAJ gates
• [KP] + boosting: there are distributions for which no parity is a weak approximator
Uniform Learning from a Membership Oracle
[Diagram] Same as the uniform learning model, except that instead of receiving random examples, A may query the membership oracle MEM(f) with any x and receive f(x).
Uniform Membership Learning of Decision Trees
• [KM]:
  • L_1(f) ≡ Σ_a |f̂(a)| ≤ s_DT(f)
  • If S = {a : |f̂(a)| ≥ ε/L_1(f)} then Σ_{a∉S} f̂²(a) < ε
• [GL]: algorithm (membership oracle) for finding {a : |f̂(a)| ≥ θ} in time ~n/θ^6
• So decision trees can be efficiently uniform membership learned
• Output h has the same form as LMN: h ≡ sign(Σ_{a∈S} f̃(a)·χ_a)
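A hedged sketch of the [KM]-style coefficient search, which is where membership queries earn their keep: the total Fourier weight of all coefficients extending a prefix α can be estimated from queries alone, so light subtrees are pruned, and by Parseval only ~1/θ² prefixes survive at each level. The sample size and pruning slack below are illustrative:

```python
import random

def chi(a, x):
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def km_find_heavy(f, n, theta, samples=4000):
    """Kushilevitz-Mansour-style search for {a : |f_hat(a)| >= theta}, using
    membership queries to f (here, arbitrary calls to the Python function f)."""
    def weight(alpha):
        # W(alpha) = sum over suffixes beta of f_hat(alpha.beta)^2, estimated
        # via W(alpha) = E_{x,y,y'}[ f(yx) f(y'x) chi_alpha(y) chi_alpha(y') ]
        k = len(alpha)
        total = 0.0
        for _ in range(samples):
            x = tuple(random.randint(0, 1) for _ in range(n - k))
            y = tuple(random.randint(0, 1) for _ in range(k))
            yp = tuple(random.randint(0, 1) for _ in range(k))
            total += f(y + x) * f(yp + x) * chi(alpha, y) * chi(alpha, yp)
        return total / samples

    heavy, stack = [], [()]
    while stack:
        alpha = stack.pop()
        if weight(alpha) < theta * theta / 2:  # prune light subtrees
            continue                           # (slack for sampling error)
        if len(alpha) == n:
            heavy.append(alpha)                # full-length prefix: a candidate a
        else:
            stack.extend([alpha + (0,), alpha + (1,)])
    return heavy
```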
Uniform Membership Learning of DNF
• [J]:
  • ∀ distributions D, ∃ χ_a s.t. Pr_{x~D}[f(x) = χ_a(x)] ≥ ½ + 1/(6s_DNF)
  • A modified [GL] can efficiently locate such a χ_a given an oracle for near-uniform D
  • Boosters can provide such an oracle when uniform learning
  • Boosting then provides strong learning
• [BJTb], [KS], [F]:
  • For near-uniform D, can find χ_a in time ~n·s²
Uniform Learning from a Random Walk Oracle
[Diagram] Same as the uniform learning model, except that the examples <x, f(x)> come from a random walk oracle RW(f): successive x's are adjacent steps of a random walk on {0,1}^n rather than independent uniform draws.
Random Walk DNF Learning
• [BMOS]:
  • Noise sensitivity and related quantities can be accurately estimated using a random walk oracle (see the sketch below)
  • NS_γ(f) ≡ ½(1 − Σ_a (1−2γ)^|a| f̂²(a))
  • T_b(f) ≡ Σ_a b^|a| f̂²(a)
  • Estimating T_b(f) is efficient if |b| is logarithmic
  • Only logarithmic |b| is needed to learn DNF [BF]
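To see why a random walk oracle gives access to noise-sensitivity-type quantities, note that two points m steps apart on an "updating" walk (each step re-randomizes one uniformly chosen coordinate) look approximately like a γ-noisy pair with γ = (1 − (1 − 1/n)^m)/2. A simulation sketch of that idea (this is the spirit of [BMOS], not their exact estimator; coordinate flips are only approximately independent):

```python
import random

def walk_pair_disagreement(f, n, m, pairs=5000):
    """Estimate Pr[f(x) != f(x')] for x' reached from x by m steps of the
    updating random walk; approximates NS_gamma(f) for
    gamma ~ (1 - (1 - 1/n)**m) / 2."""
    hits = 0
    for _ in range(pairs):
        x = [random.randint(0, 1) for _ in range(n)]
        xp = list(x)
        for _ in range(m):
            i = random.randrange(n)
            xp[i] = random.randint(0, 1)  # re-randomize one coordinate
        hits += f(tuple(x)) != f(tuple(xp))
    return hits / pairs
```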
Random Walk Parity Learning
• [JW] (unpublished):
  • Effectively, [BMOS] is limited to finding "heavy" Fourier coefficients f̂(a) with logarithmic |a|
  • Using a "breadth-first" variation of KM, can locate any a with |f̂(a)| > θ in time O(n^(log 1/θ))
  • A "heavy" coefficient corresponds to a parity function that weakly approximates f
Uniform Learning from a Classification Noise Oracle
[Diagram] Same as the uniform learning model, except the oracle EX_η(f) draws uniform random x and returns <x, f(x)> with probability 1−η and <x, −f(x)> with probability η, for error rate η > 0; accuracy ε > 0 as before.
Uniform Learning from a Statistical Query Oracle
[Diagram] Same as the uniform learning model, except that instead of examples, A submits a query (q(·,·), τ) to the statistical query oracle SQ(f) and receives E_U[q(x, f(x))] ± τ; accuracy ε > 0 as before.
SQ and Classification Noise Learning
• [K]:
  • If F is uniform SQ learnable in time poly(n, s_F, 1/ε, 1/τ) then F is uniform CN learnable in time poly(n, s_F, 1/ε, 1/τ, 1/(1−2η))
• Empirically, it is almost always true that if F is efficiently uniform learnable then F is efficiently uniform SQ learnable (i.e., 1/τ poly in the other parameters)
  • Exception: F = PAR_n ≡ {χ_a : a ∈ {0,1}^n, |a| ≤ n}
Uniform SQ Hardness for PAR
• [BFJKMR]:
  • Harmonic analysis shows that for any q and χ_a: E_U[q(x, χ_a(x))] = q̂(0^(n+1)) + q̂(a∘1)
  • Thus the adversarial SQ response to (q, τ) is q̂(0^(n+1)) whenever |q̂(a∘1)| < τ
  • Parseval: |q̂(b∘1)| < τ for all but 1/τ² Fourier coefficients
  • So a 'bad' query eliminates only poly many coefficients
  • Even PAR_(log n) is not efficiently SQ learnable
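The first bullet is a short calculation; here is a sketch in LaTeX, splitting the query by its dependence on the label (the q_0, q_1 decomposition is ours, combined with the slide's convention of encoding the label as an (n+1)-st coordinate):

```latex
% Write q(x,y) with y \in \{-1,1\} as q(x,y) = q_0(x) + y\,q_1(x), where
% q_0(x) = \tfrac12\big(q(x,1) + q(x,-1)\big), \quad
% q_1(x) = \tfrac12\big(q(x,1) - q(x,-1)\big).
\mathbb{E}_U\big[q(x,\chi_a(x))\big]
  = \mathbb{E}_U[q_0(x)] + \mathbb{E}_U\big[\chi_a(x)\,q_1(x)\big]
  = \hat{q}_0(0^n) + \hat{q}_1(a)
  = \hat{q}(0^{n+1}) + \hat{q}(a \circ 1).
```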
Uniform Learning from an Attribute Noise Oracle
[Diagram] Same as the uniform learning model, except the oracle EX_{D_N}(f) draws uniform random x and noise r ~ D_N and returns <x⊕r, f(x)>, for a noise model D_N; accuracy ε > 0 as before.
Uniform Learning with Independent Attribute Noise
• [BJTa]:
  • The LMN algorithm produces estimates of f̂(a) · E_{r~D_N}[χ_a(r)] (see the worked identity below)
• Example application:
  • Assume the noise process D_N is a product distribution: D_N(r) = ∏_i (p_i·r_i + (1−p_i)(1−r_i))
  • Assume p_i < 1/polylog n and 1/ε at most quasi-poly(n) (mild restrictions)
  • Then a modified LMN uniform learns attribute-noisy AC0 in quasi-poly time
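The worked identity behind the first bullet: the noise factors out of each coefficient estimate because characters are multiplicative over ⊕ (a standard calculation; the derivation is ours):

```latex
\mathbb{E}_{x,r}\big[f(x)\,\chi_a(x \oplus r)\big]
  = \mathbb{E}_x\big[f(x)\chi_a(x)\big]\cdot\mathbb{E}_r\big[\chi_a(r)\big]
  = \hat{f}(a)\prod_{i:\,a_i=1}(1 - 2p_i),
% using \chi_a(x \oplus r) = \chi_a(x)\chi_a(r), independence of x and r,
% and \mathbb{E}[(-1)^{r_i}] = 1 - 2p_i for the product noise model.
```

So each estimated coefficient is attenuated by a known product factor; if the p_i are known or estimable, that attenuation can be divided out, which is roughly how a modified LMN can proceed.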
Agnostic Learning Model
[Diagram] The target f: {0,1}^n → {0,1} is an arbitrary Boolean function; given uniform random examples <x, f(x)> from EX(f) and accuracy ε > 0, A must output an h in the hypothesis class H s.t. Pr_{x~U}[f(x) ≠ h(x)] ≤ opt_H + ε, where opt_H is the error of the best hypothesis in H.
Agnostic Learning of Halfspaces
• [KKMS]:
  • An agnostic learning algorithm for H the set of halfspaces
  • The algorithm is not Fourier-based (L_1 regression)
  • However, a somewhat weaker result can be obtained by simple Fourier analysis
Near-Agnostic Learning via LMN
• [KKMS]:
  • Let f be an arbitrary Boolean function
  • Fix any set S ⊆ {0,1}^n and fix ε
  • Let g be any function s.t.:
    • Σ_{a∉S} ĝ²(a) < ε, and
    • Pr[f ≠ g] (call this η) is minimized over all such g
  • Then for the h learned by LMN by estimating the coefficients of f over S: Pr[f ≠ h] < 4η + ε
Summary
• Most uniform-learning results for Boolean function classes depend on harmonic analysis
• Learning theory provides motivation for new harmonic observations
• Even very "weak" harmonic results can be useful in learning-theory algorithms
Some Open Problems
• Efficient uniform learning of monotone DNF
  • Best to date for small s_DNF is [Ser], time ~n·s^(log s) (based on [BT], [M], [LMN])
• Non-uniform learning
  • Relatively easy to extend many results to product distributions; e.g., [FJS] extends [LMN]
  • A key issue for real-world applicability
Open Problems (cont’d)
• Weaker dependence on ε
  • Several algorithms are fully exponential (or worse) in 1/ε
• Additional proper learning results
  • Proper learning allows for interpretation of the learned hypothesis
References
• Beigel. When Do Extra Majority Gates Help?...
• [BF] Bshouty, Feldman. On Using Extended Statistical Queries to Avoid Membership Queries.
• [BFJKMR] Blum, Furst, Jackson, Kearns, Mansour, Rudich. Weakly Learning DNF...
• [BJTa] Bshouty, Jackson, Tamon. Uniform-Distribution Attribute Noise Learnability.
• [BJTb] Bshouty, Jackson, Tamon. More Efficient PAC-learning of DNF...
• [BKS] Benjamini, Kalai, Schramm. Noise Sensitivity of Boolean Functions...
• [BMOS] Bshouty, Mossel, O’Donnell, Servedio. Learning DNF from Random Walks.
• [BT] Bshouty, Tamon. On the Fourier Spectrum of Monotone Functions.
• [F] Feldman. Attribute Efficient and Non-adaptive Learning of Parities...
• [FJS] Furst, Jackson, Smith. Improved Learning of AC0 Functions.
• [FS] Freund, Schapire. A Decision-theoretic Generalization of On-line Learning...
• Friedgut. Boolean Functions with Low Average Sensitivity Depend on Few Coordinates.
• [GL] Goldreich, Levin. A Hard-core Predicate for All One-way Functions.
• [HMPST] Hajnal, Maass, Pudlak, Szegedy, Turan. Threshold Circuits of Bounded Depth.
• [J] Jackson. An Efficient Membership-Query Algorithm for Learning DNF...
• [JKS] Jackson, Klivans, Servedio. Learnability Beyond AC0.
• [JW] Jackson, Wimmer. In preparation.
• [KKL] Kahn, Kalai, Linial. The Influence of Variables on Boolean Functions.
• [KKMS] Kalai, Klivans, Mansour, Servedio. On Agnostic Boosting and Parity Learning.
• [K] Kearns. Efficient Noise-tolerant Learning from Statistical Queries.
• [KM] Kushilevitz, Mansour. Learning Decision Trees Using the Fourier Spectrum.
• [KOS] Klivans, O’Donnell, Servedio. Learning Intersections and Thresholds of Halfspaces.
• [KP] Krause, Pudlak. On Computing Boolean Functions by Sparse Real Polynomials.
• [KS] Klivans, Servedio. Boosting and Hard-core Sets.
• [LMN] Linial, Mansour, Nisan. Constant-depth Circuits, Fourier Transform, and Learnability.
• [M] Mansour. An O(n^(log log n)) Learning Algorithm for DNF...
• [O] O’Donnell. Hardness Amplification within NP.
• [OS] O’Donnell, Servedio. Learning Monotone Functions from Random Examples in Polynomial Time.
• [S] Schapire. The Strength of Weak Learnability.
• [Ser] Servedio. On Learning Monotone DNF under Product Distributions.