Conditional and Reference Class Linear Regression
Brendan Juba, Washington University in St. Louis
Based on ITCS'17 paper and joint works with: Hainline, Le, and Woodruff; Calderon, Li, Li, and Ruan. AISTATS'19, arXiv:1806.02326
Outline • Introduction and motivation • Overview of sparse regression via list-learning • General regression via list-learning • Recap and challenges for future work
How can we determine which data is useful and relevant for making data-driven inferences?
Conditional Linear Regression
• Given: data drawn from a joint distribution over x ∈ {0,1}^n, y ∈ R^d, z ∈ R
• Find:
  • A k-DNF c(x). Recall: a k-DNF is an OR of "terms" of size ≤ k, where terms are ANDs of Boolean literals
  • Parameters w ∈ R^d
• Such that on the condition c(x), the linear rule <w,y> predicts z.
[Figure: scatter of (y, z); the line <w,y> fits the points selected by c(x) = • ∨ •]
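To make the objective concrete, here is a minimal Python sketch (not from the paper) of how a candidate pair (c, w) would be scored on a sample: the k-DNF c selects a subset of the data, and the pair is judged by the empirical squared loss of <w,y> on that subset together with the fraction of data c covers. The term representation, lists of (index, value) literals, is an illustrative choice.

```python
import numpy as np

def dnf_satisfied(terms, x):
    """Evaluate a k-DNF: each term is a list of (index, required value) literals over Boolean x."""
    return any(all(x[i] == v for i, v in term) for term in terms)

def conditional_loss(terms, w, X, Y, Z):
    """Empirical squared loss of <w, y> on the points selected by the k-DNF, plus its coverage."""
    mask = np.array([dnf_satisfied(terms, x) for x in X])
    if not mask.any():
        return np.inf, 0.0
    residuals = Y[mask] @ w - Z[mask]
    return float(np.mean(residuals ** 2)), float(mask.mean())  # (conditional loss, Pr[c(x)])
```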
Reference Class Regression
• Given: data drawn from a joint distribution over x ∈ {0,1}^n, y ∈ R^d, z ∈ R, and a point of interest x*
• Find:
  • A k-DNF c(x)
  • Parameters w ∈ R^d
• Such that on the condition c(x), the linear rule <w,y> predicts z, and c(x*) = 1.
[Figure: as before, with the fit restricted to the region c(x) = • ∨ • ∨ • containing x*]
Motivation
• Rosenfeld et al. 2015: some sub-populations have strong risk factors for cancer that are insignificant in the full population
• "Intersectionality" in social sciences: subpopulations may behave differently
• Good experiments isolate a set of conditions in which a desired effect has a simple model
• It's useful to find these "segments" or "conditions" in which simple models fit
• And, we don't expect to be able to model all cases simply…
Results: algorithms for conditional linear regression
We can solve this problem for k-DNF conditions on n Boolean attributes and regression on d real attributes when
• w is sparse (‖w‖_0 ≤ s for constant s), for all l_p norms:
  • loss ε → Õ(ε n^{k/2}) for general k-DNF
  • loss ε → Õ(ε T log log n) for T-term k-DNF
• For general coefficients w, with σ-subgaussian residuals and max_{terms t of c} ‖Cov(y|t) − Cov(y|c)‖_op sufficiently small:
  • l_2 loss ε → Õ(ε T (log log n + log σ)) for T-term k-DNF
Technique: "list regression" (BBV'08, J'17, CSV'17)
Why only k-DNF?
• Theorem (informal): algorithms that find w and c satisfying E[(<w,y>−z)^2 | c(x)] ≤ α(n)·ε whenever a conjunction c* exists would enable PAC-learning of DNF (in the usual, "distribution-free" model).
• Sketch: encode the DNF's "labels" for x as follows
  • Label 1 ⟹ z ≡ 0 (easy to predict with w = 0)
  • Label 0 ⟹ z has high variance (high prediction error)
• Terms of the DNF are conjunctions c* with easy-to-predict z
• A condition c selecting x with easy-to-predict z gives a weak learner for DNF
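A minimal sketch of the encoding step in this reduction, assuming labeled Boolean examples are given; the helper name and the ±1 noise choice are illustrative and not from the paper, and the surrounding PAC-learning argument is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def regression_instance_from_dnf_labels(X, labels):
    """Encode DNF labels as a conditional-regression target (hardness-reduction sketch):
    label 1 -> z = 0 (fit exactly by w = 0); label 0 -> z = +/-1 (variance 1, hard to fit)."""
    labels = np.asarray(labels)
    Z = np.where(labels == 1, 0.0, rng.choice([-1.0, 1.0], size=len(labels)))
    Y = np.zeros((len(labels), 1))  # no useful real-valued features; only the condition matters
    return X, Y, Z
```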
Why only approximate?
• The same construction: algorithms for conditional linear regression solve the agnostic learning task for c(x) on Boolean x
• The state of the art for k-DNFs suggests that a poly(n) blow-up of the loss, α(n), is required (n = # Boolean attributes)
  • ZMJ'17: achieve Õ(n^{k/2}) blow-up for k-DNF for the corresponding agnostic learning task
  • JLM'18: achieve Õ(T log log n) blow-up for T-term k-DNF
• Formal evidence of hardness? Open question!
Outline • Introduction and motivation • Overview of sparse regression via list-learning • General regression via list-learning • Recap and challenges for future work
Main technique – "list regression"
• Given examples, we can find a polynomial-size list of candidate linear predictors including an approximately optimal linear rule w' on the unknown subset S = {(x^{(j)}, y^{(j)}, z^{(j)}) : c*(x^{(j)}) = 1, j = 1,…,m} for the unknown condition c*(x).
• We learn a condition c for each w in the list, and take a pair (w', c') for which c' satisfies a μ-fraction of the data
Sparse max-norm "list regression" (J.'17)
• Fix a set of s coordinates i_1,…,i_s.
• For the unknown subset S, the optimal linear rule using i_1,…,i_s is the solution to the following linear program:
  minimize ε subject to −ε ≤ w_{i_1} y_{i_1}^{(j)} + … + w_{i_s} y_{i_s}^{(j)} − z^{(j)} ≤ ε for j ∈ S.
• s+1 dimensions – the optimum is attained at a basic feasible solution given by s+1 tight constraints
Sparse max-norm "list regression" (J.'17)
• The optimal linear rule using i_1,…,i_s is given by the solution to the system
  w_{i_1} y_{i_1}^{(j_r)} + … + w_{i_s} y_{i_s}^{(j_r)} − z^{(j_r)} = σ_r ε for r = 1,…,s+1,
  for some j_1,…,j_{s+1} ∈ S and σ_1,…,σ_{s+1} ∈ {−1,+1}.
• Enumerate (i_1,…,i_s, j_1,…,j_{s+1}, σ_1,…,σ_{s+1}) in [d]^s × [m]^{s+1} × {−1,+1}^{s+1} and solve for w in each case
• This includes all (j_1,…,j_{s+1}) ∈ S^{s+1} (regardless of S!)
• The list has size d^s m^{s+1} 2^{s+1} = poly(d,m) for constant s.
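A minimal numpy sketch of the inner solve for one enumerated tuple, under the assumption that Y is the m×d matrix of real-valued features and Z the vector of targets; the function and variable names are illustrative.

```python
import numpy as np

def candidate_from_tuple(Y, Z, coords, rows, signs):
    """Solve the (s+1)x(s+1) system  w_{i_1} y_{i_1}^{(j_r)} + ... + w_{i_s} y_{i_s}^{(j_r)} - z^{(j_r)} = sigma_r * eps
    for the s sparse weights and eps, for one enumerated tuple (i_1..i_s, j_1..j_{s+1}, sigma_1..sigma_{s+1})."""
    s = len(coords)
    A = np.zeros((s + 1, s + 1))
    b = np.zeros(s + 1)
    for r, (j, sigma) in enumerate(zip(rows, signs)):
        A[r, :s] = Y[j, list(coords)]  # coefficients of w_{i_1}, ..., w_{i_s}
        A[r, s] = -sigma               # the eps column (sigma_r * eps moved to the left-hand side)
        b[r] = Z[j]
    try:
        sol = np.linalg.solve(A, b)
    except np.linalg.LinAlgError:
        return None                    # degenerate tuple; skip it
    return sol[:s], abs(sol[s])        # (sparse weights on coords, candidate max-error eps)
```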
Summary – algorithm for max-norm conditional linear regression (J.'17)
For each (i_1,…,i_s, j_1,…,j_{s+1}, σ_1,…,σ_{s+1}) in [d]^s × [m]^{s+1} × {−1,+1}^{s+1}:
• Solve w_{i_1} y_{i_1}^{(j_r)} + … + w_{i_s} y_{i_s}^{(j_r)} − z^{(j_r)} = σ_r ε for r = 1,…,s+1
• If ε > ε* (given), continue to the next iteration.
• Initialize c to the k-DNF over all terms of size ≤ k
• For j = 1,…,m: if |<w, y^{(j)}> − z^{(j)}| > ε
  • For each term T in c: if T(x^{(j)}) = 1
    • Remove T from c
• If #{j : c(x^{(j)}) = 1} > μ'm (μ' initially 0):
  • Put μ' = #{j : c(x^{(j)}) = 1}/m, w' = w, c' = c
(The condition c is learned using the "labels" provided by w; we choose the condition c that covers the most data.)
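Here is a hedged sketch of the condition-learning step for one candidate (w, ε), transcribing the pseudocode above: enumerate all terms of size ≤ k, discard any term satisfied by a point that w predicts badly, and report the coverage used to pick the best pair. The term representation is illustrative and the code is not optimized.

```python
from itertools import combinations, product
import numpy as np

def all_terms(n, k):
    """All conjunctions of at most k literals over n Boolean variables,
    represented as tuples of (index, required value)."""
    terms = []
    for size in range(1, k + 1):
        for idx in combinations(range(n), size):
            for vals in product([0, 1], repeat=size):
                terms.append(tuple(zip(idx, vals)))
    return terms

def learn_condition(X, Y, Z, w, eps, terms):
    """Keep only the terms never satisfied by a point that w predicts with error above eps;
    the surviving terms form the k-DNF condition c for this candidate w."""
    bad = np.abs(Y @ w - Z) > eps
    bad_rows = np.where(bad)[0]
    surviving = [t for t in terms
                 if not any(all(X[j, i] == v for i, v in t) for j in bad_rows)]
    covered = np.array([any(all(X[j, i] == v for i, v in t) for t in surviving)
                        for j in range(len(X))])
    return surviving, float(covered.mean())   # (condition c, fraction of data covered)
```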
Extension to l_p-norm list regression. Joint work with Hainline, Le, Woodruff (AISTATS'19)
• Consider the matrix S = [y^{(j)}, z^{(j)}]_{j=1,…,m : c*(x^{(j)})=1}
• The l_p-norm of S(w,−1) approximates the l_p-loss of w on c*
• Since w* ∈ R^d is s-sparse, there exists a small "sketch" matrix S' such that ‖Sv‖_p ≈ ‖S'v‖_p for all vectors v on these s coordinates
• (Cohen-Peng '15): moreover, the rows of S' can be taken to be O(1)-rescaled rows of S
• New algorithm: search over approximate weights, minimizing the l_p-loss to find candidates for w
l_p-norm conditional linear regression. Joint work with Hainline, Le, Woodruff (AISTATS'19)
• Using the polynomial-size list containing an approximation to w*, we still need to extract a condition c such that E[|<w,y>−z|^p | c(x)] ≤ α(n)·ε
• Use |<w, y^{(i)}> − z^{(i)}|^p as the weight/label for the i-th point
• Easy Õ(T log log n) approximation for T-term k-DNF (greedy cover sketched below):
  • only consider terms t with E[|<w,y>−z|^p | t(x)] Pr[t(x)] ≤ ε Pr[c*(x)]
  • Greedy algorithm for partial set cover (terms = sets of points), covering a (1−γ) Pr[c*(x)]-fraction
  • Obtains a cover of size T log m – a small k-DNF c'
  • Haussler '88: to estimate T-term k-DNFs, we only require m = O(Tk log n) points ⇒ Õ(ε T log log n) loss on c'
• Reference class: add any surviving term satisfied by x*
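A minimal sketch of the greedy partial-cover step, assuming the per-term point sets and p-th-power losses have already been computed from the candidate w; the loss filter and the coverage target correspond to the first two sub-bullets above.

```python
def greedy_partial_cover(term_points, term_losses, loss_budget, target_count):
    """Greedy partial set cover: each term is the set of point indices it satisfies.
    Terms whose total p-th-power residual exceeds loss_budget are discarded up front;
    the rest are picked greedily by how many still-uncovered points they add,
    until target_count points (a (1-gamma) fraction of c*'s support) are covered."""
    candidates = [i for i, loss in enumerate(term_losses) if loss <= loss_budget]
    covered, chosen = set(), []
    while len(covered) < target_count and candidates:
        best = max(candidates, key=lambda i: len(term_points[i] - covered))
        gain = term_points[best] - covered
        if not gain:
            break  # no candidate covers anything new
        chosen.append(best)
        covered |= gain
    return chosen, covered
```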
l_p-norm conditional linear regression. Joint work with Hainline, Le, Woodruff (AISTATS'19)
• Using the polynomial-size list containing an approximation to w*, we still need to extract a condition c such that E[|<w,y>−z|^p | c(x)] ≤ α(n)·ε
• Use |<w, y^{(i)}> − z^{(i)}|^p as the weight/label for the i-th point
• ZMJ'17 / Peleg '07: a more sophisticated algorithm achieves an Õ(n^{k/2}) approximation for general k-DNF (plug in directly to obtain conditional linear regression)
• J., Li '19: the same guarantee can be obtained for the reference-class setting
l_2-norm conditional linear regression vs. selective linear regression
[Figure: LIBSVM benchmarks; red: conditional regression, black: selective regression. Panels: Boston (m=506, d=13), Bodyfat (m=252, d=14), Space_GA (m=3107, d=6), Cpusmall (m=8192, d=12)]
Outline • Introduction and motivation • Overview of sparse regression via list-learning • General regression via list-learning. Joint work with Calderon, Li, Li, and Ruan (arXiv). • Recap and challenges for future work
CSV'17-style List Learning
• Basic algorithm: alternate between a soft relaxation of fitting the unknown S and outlier detection and reduction
• Improve accuracy by clustering the output w's and "recentering"
• We reformulate CSV'17 in terms of terms (rather than individual points)
Relaxation of fitting the unknown c
• For fixed weights ("inlier" indicators) u(1),…,u(T) ∈ [0,1]
• Each term t has its own parameters w(t)
• Solve: min_{w,Y} ∑_t u(t) |t| l_t(w(t)) + λ tr(Y)
  (Y: enclosing ellipsoid; λ: a carefully chosen constant)
[Figure: the per-term parameters w(1), w(2), w(3), … lie inside the enclosing ellipsoid Y; the w(outlier)'s fall outside]
Outlier detection and reduction
• Fix parameters w(1),…,w(T) ∈ R^d
• Give each term t its own formula indicators c_{t'}(t)
• Must find a "coalition" c(t) of ≥ μ'-fraction (where |c(t)| = ∑_{t'∈c(t)} |t'| and μ' = (1/m) ∑_{t∈c} |t|) such that
  w(t) ≈ ŵ(t) = (1/|c(t)|) ∑_{t'∈c(t)} c_{t'}(t) |t'| w(t')
• Reduce the inlier indicator u(t) by a factor of 1 − |l_t(w(t)) − l_t(ŵ(t))| / max_{t'} |l_{t'}(w(t')) − l_{t'}(ŵ(t'))|
• Intuition: points fit by parameters in a small ellipsoid have a good coalition for which the objective value changes little (for smooth loss l); outliers cannot find a good coalition and are damped/removed.
[Figure: inlier parameters w(1), w(2), w(3) lie near their coalition averages ŵ; the outlier's w(outlier) is far from ŵ(outlier)]
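A minimal sketch of the multiplicative weight-reduction rule in the factor above, assuming the per-term losses l_t(w(t)) and the coalition-averaged losses l_t(ŵ(t)) have already been computed; finding the coalitions themselves (the averaging step) is not shown.

```python
import numpy as np

def reduce_outlier_weights(u, losses, coalition_losses):
    """Damp each term's inlier weight u(t) by how much its loss changes when its own
    parameters are replaced by its coalition's averaged parameters; the terms whose loss
    changes the most (relative to the worst change) lose the most weight."""
    change = np.abs(np.asarray(losses) - np.asarray(coalition_losses))
    worst = change.max()
    if worst == 0:
        return np.asarray(u)          # every coalition already fits; nothing to damp
    return np.asarray(u) * (1.0 - change / worst)
```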
Clustering and recentering
Full algorithm (initially, one data cluster): alternate between running the basic algorithm on the cluster centers and clustering its outputs
• Next iteration: run the basic algorithm with parameters of radius R/2 centered on each cluster
[Figure: the per-term outputs w(1), w(2), w(3), … grouped into clusters with centers such as w(j), w(k), w(l)]
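The outer loop can be summarized as below; `basic_algorithm` and `cluster` are hypothetical stand-ins for the relaxation/outlier-reduction step and the padded-decomposition clustering, so this is only a structural sketch under those assumptions.

```python
def list_regression_outer_loop(terms, data, R0, rounds, basic_algorithm, cluster):
    """Outer loop of the full algorithm (sketch): run the basic algorithm around each current
    center, cluster the per-term outputs w(t), and halve the search radius each round."""
    centers, R = [None], R0            # start from a single, unconstrained cluster
    for _ in range(rounds):
        outputs = [w for c in centers
                   for w in basic_algorithm(terms, data, center=c, radius=R)]
        centers = cluster(outputs, radius=R)   # one new center per cluster of nearby w(t)
        R /= 2                                 # recenter with half the radius next round
    return centers                             # candidate list of regression parameters
```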
Overview of analysis (1/3): loss bounds from the basic algorithm
• Follows the same essential outline as CSV'17
• Guarantee for the basic algorithm: we find ŵ such that, given ‖w*‖ ≤ R,
  E[l(ŵ)|c*] − E[l(w*)|c*] ≤ O(R · max_{w, t∈c*} ‖∇E[l_t(w)|t] − ∇E[l(w)|c*]‖ / √μ)
• Here ‖∇E[l_t(w)|t] − ∇E[l(w)|c*]‖ ⟶ ‖(w−w*)(Cov(y|t) − Cov(y|c*))‖
• The bound is O(R^2 max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / √μ) for all R ≥ 1/poly(γμm/σT) (errors σ-subgaussian on c*)
Overview of analysis (2/3): from loss to accuracy via convexity
• We can find ŵ such that, given ‖w*‖ ≤ R,
  E[l(ŵ)|c*] − E[l(w*)|c*] ≤ O(R^2 max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / √μ)
• This implies that for all significant t in c*,
  ‖w(t) − w*‖^2 ≤ O(R^2 T max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / κ√μ),
  where κ is the convexity parameter of l(w)
• Iteratively reduce R with clustering and recentering…
Overview of analysis (3/3): reduce R by clustering & recentering
• In each iteration, for all significant t in c*,
  ‖w(t) − w*‖^2 ≤ O(R^2 T max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / κ√μ),
  where κ is the convexity parameter of l(w)
• Padded decompositions (FRT'04): we can find a list of clusterings such that one cluster in each contains w(t) for all significant t in c* with high probability
• Moreover, if κ ≥ Ω(T log(1/μ) max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / √μ), then we obtain a new cluster center ŵ in our list such that ‖ŵ − w*‖ ≤ R/2 with high probability
• So we can iterate, reducing R → 1/poly(γμm/σT)
• m large ⟹ ‖ŵ(t) − w*‖ ⟶ 0
Finishing up: obtaining a k-DNF from ŵ
• Cast as a weighted partial set cover instance (terms = sets of points), using ∑_{i : t(x^{(i)})=1} l_i(w) as the weight of term t and the ratio objective (cost/size)
• ZMJ'17: with the ratio objective, we still obtain an O(log μm) approximation
• Recall: we chose ŵ to optimize ∑_{t∈c} ∑_{i : t(x^{(i)})=1} l_i(w) (= the cost of the cover c in the set-cover instance) – this adds only a factor of T
• Recall: we only consider terms satisfied with probability ≥ γμ/T – so we use at most T/γ terms
• Haussler '88, again: we only need ~O((Tk/γ) log n) points
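A sketch of the ratio-objective greedy cover referenced above, assuming the per-term costs ∑_{i: t(x^{(i)})=1} l_i(ŵ) and per-term point sets are given; this is the standard greedy heuristic for that objective, not the exact ZMJ'17 procedure.

```python
def greedy_ratio_cover(term_points, term_costs, target_count):
    """Weighted partial set cover with the ratio objective: repeatedly pick the term
    minimizing cost per newly covered point, until target_count points are covered."""
    covered, chosen = set(), []
    remaining = set(range(len(term_points)))
    while len(covered) < target_count and remaining:
        def ratio(i):
            gain = len(term_points[i] - covered)
            return term_costs[i] / gain if gain else float("inf")
        best = min(remaining, key=ratio)
        if not term_points[best] - covered:
            break  # nothing new can be covered
        chosen.append(best)
        covered |= term_points[best]
        remaining.discard(best)
    return chosen, covered
```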
Summary: guarantee for general conditional linear regression
Theorem. Suppose that D is a distribution over x ∈ {0,1}^n, y ∈ R^d, and z ∈ R such that for some T-term k-DNF c* and w* ∈ R^d, <w*,y> − z is σ-subgaussian on D|c* with E[(<w*,y>−z)^2 | c*(x)] ≤ ε, Pr[c*(x)] ≥ μ, and E[(<w,y>−z)^2 | c*(x)] is κ-strongly convex in w with κ ≥ Ω(T log(1/μ) max_{t∈c*} ‖Cov(y|t) − Cov(y|c*)‖ / √μ). Then there is a polynomial-time algorithm that uses examples from D to find w and c such that, with probability 1−δ, E[(<w,y>−z)^2 | c(x)] ≤ Õ(ε T (log log n + log σ)) and Pr[c(x)] ≥ (1−γ)μ.
Comparison to sparse conditional linear regression on benchmarks
[Figure: Space_GA (m=3107, d=6) and Cpusmall (m=8192, d=12). On the small benchmarks (Bodyfat, m=252, d=14; Boston, m=506, d=13), the method does not converge.]
Code: https://github.com/wumming/lud
Outline • Introduction and motivation • Overview of sparse regression via list-learning • General regression via list-learning • Recap and challenges for future work
Summary: new algorithms for conditional linear regression
We can solve conditional linear regression for k-DNF conditions on n Boolean attributes and regression on d real attributes when
• w is sparse (‖w‖_0 ≤ s for constant s), for all l_p norms:
  • loss ε → Õ(ε n^{k/2}) for general k-DNF
  • loss ε → Õ(ε T log log n) for T-term k-DNF
• For general coefficients w, with σ-subgaussian residuals and max_{terms t of c} ‖Cov(y|t) − Cov(y|c)‖_op sufficiently small:
  • l_2 loss ε → Õ(ε T (log log n + log σ)) for T-term k-DNF
Technique: "list regression" (BBV'08, J'17, CSV'17)
Open problems
• Remove the covariance requirement!
• Improve the large-formula error bounds to O(n^{k/2} ε) for general (dense) regression
• Algorithms without semidefinite programming? Without padded decompositions?
• Note: the algorithms for sparse regression already satisfy the first three.
• Formal evidence for hardness of polynomial-factor approximations for agnostic learning?
• Conditional supervised learning for other hypothesis classes?
References
• Hainline, Juba, Le, Woodruff. Conditional sparse l_p-norm regression with optimal probability. In AISTATS, PMLR 89:1042-1050, 2019.
• Calderon, Juba, Li, Li, Ruan. Conditional linear regression. arXiv:1806.02326 [cs.LG], 2018.
• Juba. Conditional sparse linear regression. In ITCS, 2017.
• Rosenfeld, Graham, Hamoudi, Butawan, Eneh, Kahn, Miah, Niranjan, Lovat. MIAT: A novel attribute selection approach to better predict upper gastrointestinal cancer. In DSAA, pp. 1-7, 2015.
• Balcan, Blum, Vempala. A discriminative framework for clustering via similarity functions. In STOC, pp. 671-680, 2008.
• Charikar, Steinhardt, Valiant. Learning from untrusted data. In STOC, pp. 47-60, 2017.
• Fakcharoenphol, Rao, Talwar. A tight bound on approximating arbitrary metrics by tree metrics. JCSS, 69(3):485-497, 2004.
• Zhang, Mathew, Juba. An improved algorithm for learning to perform exception-tolerant abduction. In AAAI, pp. 1257-1265, 2017.
• Juba, Li, Miller. Learning abduction under partial observability. In AAAI, pp. 1888-1896, 2018.
• Cohen, Peng. l_p row sampling by Lewis weights. In STOC, 2015.
• Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177-221, 1988.
• Peleg. Approximation algorithms for the Label-Cover_MAX and red-blue set cover problems. J. Discrete Algorithms, 5:55-64, 2007.