Polynomial-time probabilistic reasoning with partial observations via implicit learning in probability logics Brendan Juba Washington University in St. Louis In AAAI’19 Supported by NSF Award CCF-1718380
Probabilistic inference
• Given constraints on real random variables, for example:
  • Support constraints: cancer ∈ {0,1}, lung_cancer ∈ {0,1}, skin_cancer ∈ {0,1}, elevatedX ∈ {0,1}, Xlevel ≥ 0
  • Rules: lung_cancer ⇒ cancer, skin_cancer ⇒ cancer, etc.
  • Definitions: elevatedX ⟺ (Xlevel ≥ 10) (actually: elevatedX(Xlevel − 10) ≥ 0, (1 − elevatedX)(10 − Xlevel) ≥ 0)
• Bounds on expectations, for example: Pr[elevatedX] ≥ 10%
• Decide queries, e.g., can E[Xlevel] ≤ 0.999?
  • (No: E[Xlevel] ≥ 10 Pr[elevatedX] ≥ 1.)
• In general: add a bound or constraint, decide consistency
• Conditional expectation queries via rewritten bounds, e.g., Pr[cancer | elevatedX] ≥ p ⟺ E[cancer·elevatedX] ≥ p·E[elevatedX] ⟺ E[cancer·elevatedX − p·elevatedX] ≥ 0
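To spell out the "No" answer above (a step-by-step version of our own, using the constraints elevatedX(Xlevel − 10) ≥ 0, Xlevel ≥ 0, and elevatedX ∈ {0,1}):

```latex
\mathbb{E}[\mathrm{Xlevel}]
  = \mathbb{E}[\mathrm{Xlevel}\cdot \mathrm{elevatedX}]
  + \mathbb{E}[\mathrm{Xlevel}\cdot(1-\mathrm{elevatedX})]
  \;\ge\; 10\,\mathbb{E}[\mathrm{elevatedX}] + 0
  \;=\; 10\,\Pr[\mathrm{elevatedX}]
  \;\ge\; 10\cdot 0.1 \;=\; 1 \;>\; 0.999 .
```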
…using partial examples
• For example, given a data set of 1000 patients
• Some attributes are unspecified – say, we observed lung cancer in patient 2, but don't know about patient 3
  Write: lung_cancer2 = 1, lung_cancer3 = * (missing)
• Assuming patients are drawn i.i.d. from a population with distribution D, answer queries about D
  • Pr[cancer | elevatedX] > ?
• Can we learn useful bounds and constraints?
This work: For the "sum-of-squares" probability logic…
• Tractable syntactic cases ("fragments") simulate the most powerful known tractable fragments of resolution
  • Captures forward chaining and more…
• Bounds and constraints can be efficiently learned implicitly from partially observed examples
Our approach vs. standard approaches
• This work: Distribution given only by
  • Bounds & constraints
  • Partial examples
• Inference without generating a full representation
• Cf. building explicit models of distributions
  • Graphical models (Markov Logic, Probabilistic Soft Logic) and sum-product networks; Bayesian/Probabilistic Logic Programs; regression trees, …
  • Computationally challenging, especially "structure learning" under partial information
Strengths & weaknesses • Polynomial-time inference from partial observations capturing rich inference rules • Easy to compute marginals of small # of attributes: no problem “summing out” other attributes • But doesn’t capture joint probability of large # of attributes, conditioning on large #s of attributes,… • Unconstrained distributions: no requirement of independence of attributes, masking at random… • Unable to assert knowledge of independence • Not a generative model ⟹ no counterfactuals
Outline
• Probability logics: from Nilsson's Linear Programming-based logic to sum-of-squares
• Sum-of-squares is powerful: simulating fragments of resolution and more
• Implicit learning of bounds and constraints from partial examples in sum-of-squares
Nilsson's probability logic based on linear programming (Nilsson '86)
Issues using Nilsson's probability logic, Halpern–Pucella's logic of expectation
• Exponentially many propositional formulas
• Exponential-size LP for linear-combination inferences
• Axioms include propositional tautologies, etc. – checking applications is NP-hard
• Define a tractable fragment?
  • May still include exponentially many formulas, for example, if we can represent all clauses
  • Unclear how to perform inference
The sum-of-squares logic (essentially Grigoriev & Vorobjov '01)
• Language: real polynomial (in)equalities p(x) ≥ 0 on a vector of n indeterminates x
• Sum-of-squares polynomial: σ(x) = ∑_l p_l(x)²
• A refutation system: given
  • Constraints {g_i(x) ≥ 0}, {h_j(x) = 0} (polynomials g_i(x), h_j(x))
  • Bounds {b_k(x)} (E[b_k(x)] ≥ 0 for polynomials b_k(x))
  if we can write a formal expression (with reals c_k ≥ 0)
    σ_0(x) + ∑_i σ_i(x) g_i(x) + ∑_j p_j(x) h_j(x) + ∑_k c_k b_k(x) = −1
  then no expectation operator E[·] is consistent: E[σ_0(x)] ≥ 0, E[σ_i(x) g_i(x)] ≥ 0, E[p_j(x) h_j(x)] = 0, and E[c_k b_k(x)] ≥ 0. This is a "sum-of-squares refutation."
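A toy illustration of the refutation format (not from the talk): take a single constraint g_1(x) = x (i.e., x ≥ 0) and a single bound b_1(x) = −x − 1 (i.e., E[−x − 1] ≥ 0). With σ_0 = 0, σ_1 = 1, and c_1 = 1,

```latex
\sigma_0(x) + \sigma_1(x)\,g_1(x) + c_1\,b_1(x) \;=\; 0 + 1\cdot x + 1\cdot(-x-1) \;=\; -1,
```

so no expectation operator is consistent with both: the constraint forces E[x] ≥ 0, while the bound forces E[x] ≤ −1.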
Conventions and notation
• We will assume every variable is Boolean or explicitly bounded in the constraints, and extend this to monomials
  • For every indeterminate x, include the constraint x² − x = 0 (x ∈ {0,1}, aka the "Boolean axiom") or the bound B² − x² ≥ 0 for some real B (x ∈ [−B, B])
• If, for constraints {g_i(x) ≥ 0}, {h_j(x) = 0}, we can write g′(x) = σ_0(x) + ∑_i σ_i(x) g_i(x) + ∑_j p_j(x) h_j(x), we say "{g_i(x) ≥ 0}, {h_j(x) = 0} ⊦ g′(x) ≥ 0"
  • We can then use σ′(x) g′(x) in a refutation (it expands out to σ′(x) σ_0(x) + ∑_i σ′(x) σ_i(x) g_i(x) + ∑_j σ′(x) p_j(x) h_j(x)…)
• Examples:
  • x² − x = 0 ⊦ 1 − x ≥ 0  (1 − x = (1 − x)² + (−1)(x² − x))
  • B² − x² ≥ 0 ⊦ x + B ≥ 0  (x + B = (1/(2B))(B² − x²) + (1/(2B))(B + x)²)
  • B² − x² ≥ 0, C² − y² ≥ 0 ⊦ BC − xy ≥ 0  (BC − xy = (C/(2B))(B² − x²) + (B/(2C))(C² − y²) + (BC/2)(x/B − y/C)²)
    (and similarly for every monomial – simply assume monomial bounds are given…)
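The three derivation identities quoted above are easy to check mechanically; here is a quick sympy verification (the script and variable names are ours):

```python
# Check that the three identities behind the example derivations hold.
import sympy as sp

x, y, B, C = sp.symbols('x y B C', positive=True)

# x^2 - x = 0  ⊦  1 - x >= 0
lhs1 = (1 - x)**2 + (-1)*(x**2 - x)
assert sp.expand(lhs1 - (1 - x)) == 0

# B^2 - x^2 >= 0  ⊦  x + B >= 0
lhs2 = (1/(2*B))*(B**2 - x**2) + (1/(2*B))*(B + x)**2
assert sp.expand(lhs2 - (x + B)) == 0

# B^2 - x^2 >= 0, C^2 - y^2 >= 0  ⊦  BC - xy >= 0
lhs3 = (C/(2*B))*(B**2 - x**2) + (B/(2*C))*(C**2 - y**2) + (B*C/2)*(x/B - y/C)**2
assert sp.expand(lhs3 - (B*C - x*y)) == 0

print("all three identities check out")
```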
Polynomial-time fragments of sum-of-squares
• Degree-d fragment: every polynomial σ_0(x), σ_i(x) g_i(x), p_j(x) h_j(x), b_k(x) in the refutation
    σ_0(x) + ∑_i σ_i(x) g_i(x) + ∑_j p_j(x) h_j(x) + ∑_k c_k b_k(x)
  has total degree at most d (analogously, "⊦_d")
• Theorem (Shor, Nesterov, Parrilo, Lasserre): there is a degree-d refutation of {g_i(x) ≥ 0}, {h_j(x) = 0}, {E[b_k(x)] ≥ 0} iff an n^O(d)-size semidefinite program (determined by the system) is infeasible.
• Semidefinite programming: linear programming extended by "A ≽ 0" constraints (x⊤Ax ≥ 0 for all x), for matrices A of linear forms over the variables
• Decidable in time polynomial in the system size for fixed d
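As a minimal sketch of the reduction underlying these fragments (assuming cvxpy with an SDP-capable solver such as SCS is installed): a polynomial is a sum of squares exactly when its coefficients can be matched by a positive semidefinite Gram matrix over a monomial basis. The degree-d refutation SDPs are assembled from the same kind of coefficient-matching constraints; here we only check that p(x) = x⁴ − 2x² + 1 = (x² − 1)² is a sum of squares.

```python
# "SOS membership = SDP feasibility":
# p(x) = m(x)^T Q m(x) for some PSD Q, where m(x) = (1, x, x^2).
import cvxpy as cp

# Target coefficients of p(x) = x^4 - 2x^2 + 1, indexed by degree:
c = {0: 1.0, 1: 0.0, 2: -2.0, 3: 0.0, 4: 1.0}

Q = cp.Variable((3, 3), PSD=True)   # Gram matrix over monomials 1, x, x^2

constraints = [
    Q[0, 0] == c[0],                 # constant term
    2 * Q[0, 1] == c[1],             # coefficient of x
    2 * Q[0, 2] + Q[1, 1] == c[2],   # coefficient of x^2
    2 * Q[1, 2] == c[3],             # coefficient of x^3
    Q[2, 2] == c[4],                 # coefficient of x^4
]

prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve(solver=cp.SCS)
print(prob.status)   # "optimal" => feasible => p is a sum of squares
```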
Key question: How expressive are these polynomial-time fragments?
Outline
• Probability logics: from Nilsson's Linear Programming-based logic to sum-of-squares
• Sum-of-squares is powerful: simulating fragments of resolution and more
• Implicit learning of bounds and constraints from partial examples in sum-of-squares
Encoding Boolean knowledge
• Two ways of writing clauses, for Boolean x & y:
  • Monomial equalities: x ∨ ¬y ⟺ (1 − x)y = 0
    • More generally: Boolean formulas over the {∧,¬} basis
    • Downside: width (max # of variables in a clause) = degree
  • Linear inequalities: x ∨ ¬y ⟺ x + (1 − y) − 1 ≥ 0
    • Degree = 1, independent of width
Propositional proof systems (recall): Resolution
• Language: clauses (ORs of literals: variables or negations)
• Inference rule: cut – from x ∨ C and ¬x ∨ D, infer C ∨ D (where C, D are clauses)
• Refute a system by deriving the empty clause
• "Treelike:" the derivation does not re-use formulas
• "Space-s:" the derivation can be carried out in a memory that holds at most s clauses at a time
Propositional proof systems: Polynomial calculus
• Language: polynomial equalities (p(x) = 0, q(x) = 0, …)
• Inference rules:
  • Linear combination (infer a·p(x) + b·q(x) = 0 for reals a, b)
  • Multiplication by an indeterminate (infer y·p(x) = 0 for an indeterminate/variable y)
• Refute a system by deriving 1 = 0
• With "twin variables" ¬x = 1 − x for each x and Boolean axioms x² − x = 0 for each x, simulates width-w resolution in degree w.
Simulations
"A simulates B" if, whenever there is a proof in system B, there is also a proof in system A.
• Theorem (Berkholz '18): Degree-2d sum-of-squares simulates degree-d polynomial calculus, given Boolean axioms for all variables.
• Theorem (new): Degree-s sum-of-squares simulates space-s treelike resolution using the linear-inequality encoding and Boolean axioms.
• Note: space-2 treelike resolution simulates forward chaining (in Horn KBs)
Sketch of simulation
• Theorem (Ansótegui, Bonet, Levy, Manyà '08): Space-(s+1) treelike resolution is equivalent to
  • unit propagation (from a unit clause x, infer x = 1) with
  • up to s − 1 nested applications of the "failed literal" rule: guess x = 1; if a refutation exists, infer x = 0.
• Unit propagation is simulated in degree 1
  • Unit clauses x and ¬x are encoded as x − 1 ≥ 0 and (1 − x) − 1 ≥ 0 (i.e., −x ≥ 0), respectively
  • Example: clauses ¬x, y ∨ x are encoded as (1 − x) − 1 ≥ 0, y + x − 1 ≥ 0; we can derive y − 1 ≥ 0 (since ((1 − x) − 1) + (y + x − 1) = y − 1)
Sketch of simulation
• Simulating the failed literal rule:
  • If there is no degree-d refutation in sum-of-squares, and the solution to the corresponding semidefinite program satisfies E[x] ≠ 0, then there is a conditioning operation (Karlin, Mathieu, Nguyen '11) that produces a solution to the degree-(d−1) semidefinite program satisfying E[x] = 1 (so there is no degree-(d−1) refutation when E[x] = 1).
  • Contrapositive: if there is a degree-(d−1) refutation given E[x] = 1, then either there is a degree-d refutation or else the degree-d solution satisfies E[x] = 0.
Probabilistic inference example: "Markov's inequality" inference (simple)
• Given constraints: elevatedX² − elevatedX = 0, Xlevel ≥ 0, elevatedX(Xlevel − 10) ≥ 0, (1 − elevatedX)(10 − Xlevel) ≥ 0
• Given bound: elevatedX − 0.1 ≥ 0 (Pr[elevatedX] ≥ 0.1)
• Infer Xlevel − 1 ≥ 0 (thus, E[Xlevel] ≥ 1) as follows:
    Xlevel − 1 = 10(elevatedX − 0.1) + elevatedX(Xlevel − 10) + (1 − elevatedX)²·Xlevel + (−Xlevel)(elevatedX² − elevatedX)
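The identity behind this derivation can be checked mechanically; here is a sympy check (the script itself is ours):

```python
# Verify the sum-of-squares identity behind the "Markov's inequality" inference.
import sympy as sp

X, e = sp.symbols('Xlevel elevatedX')

rhs = (10*(e - sp.Rational(1, 10))   # 10 * (bound: elevatedX - 0.1 >= 0)
       + e*(X - 10)                  # constraint: elevatedX*(Xlevel - 10) >= 0
       + (1 - e)**2 * X              # (1 - elevatedX)^2 * (constraint: Xlevel >= 0)
       + (-X)*(e**2 - e))            # multiple of the equality elevatedX^2 - elevatedX = 0

assert sp.expand(rhs - (X - 1)) == 0 # identity: rhs == Xlevel - 1
print("identity verified: the derivation gives E[Xlevel] >= 1")
```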
Outline
• Probability logics: from Nilsson's Linear Programming-based logic to sum-of-squares
• Sum-of-squares is powerful: simulating fragments of resolution and more
• Implicit learning of bounds and constraints from partial examples in sum-of-squares
Our examples have used a given family of bounds and constraints… How can we learn relevant bounds and constraints directly from empirical data?
Real-valued assignments, partially observed
• Our data consists of i.i.d. assignments x = (x1, x2, …, xn) drawn from a distribution D
• A masking process M hides part of each assignment; we observe the partial example ρ = m(x)
• We use existing models for the data source ["PAC-Semantics," Valiant AIJ'00] ["Masking processes," Michael AIJ'10 / Rubin Biometrika '76]
Learning under partial information: the implicit learning technique (J.'13)
• Implicit learning: answer queries against an "implicit" (not explicitly represented) KB when…
  • We only require the ability to decide logical queries
  • We could efficiently answer queries if only we had the KB
  • There is no statistical/information-theoretic obstacle, i.e., there is sufficient data/information to find the answer
• Key question: what constraints can we use?
Naïve norm: a measure of proof size
We will be able to learn systems of "small" constraints that are simultaneously "witnessed true" with high probability – "testable"
• Naïve norm:
  • Each monomial x^α (= ∏_i x_i^α_i) has given lower/upper bounds L_α & B_α (based on the known ranges for each x_i).
  • For a polynomial p(x), suppose we substitute B_α for x^α in p if the coefficient of x^α is positive, and otherwise (if the coefficient is negative) substitute L_α for x^α – this is the upper bound of p.
  • Similarly, if we substitute L_α for x^α in p if the coefficient of x^α is positive, and otherwise substitute B_α for x^α, this is the lower bound of p.
  • The naïve norm of p(x) is the maximum of the absolute values of the upper bound and the lower bound of p(x).
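A small sketch of these definitions, with data structures of our own choosing (a polynomial as a dict from monomial labels to coefficients, and per-monomial bounds L, B):

```python
# Naive norm of a polynomial represented as {monomial: coefficient},
# given per-monomial lower/upper bounds L[monomial], B[monomial].

def upper_bound(poly, L, B):
    # substitute B for positively-weighted monomials, L for negatively-weighted ones
    return sum(c * (B[m] if c > 0 else L[m]) for m, c in poly.items())

def lower_bound(poly, L, B):
    # substitute L for positively-weighted monomials, B for negatively-weighted ones
    return sum(c * (L[m] if c > 0 else B[m]) for m, c in poly.items())

def naive_norm(poly, L, B):
    return max(abs(upper_bound(poly, L, B)), abs(lower_bound(poly, L, B)))

# Example: p(x, y) = 2xy - 3y + 1 with x, y in [0, 1]
p = {'xy': 2.0, 'y': -3.0, '1': 1.0}
L = {'xy': 0.0, 'y': 0.0, '1': 1.0}
B = {'xy': 1.0, 'y': 1.0, '1': 1.0}
print(naive_norm(p, L, B))   # max(|2 - 0 + 1|, |0 - 3 + 1|) = 3
```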
Witnessed constraints via partial evaluation
We will be able to learn systems of "small" constraints that are simultaneously "witnessed true" with high probability – "testable"
• Naïve norm: the maximum of the absolute values of the upper bound and the lower bound of p(x).
• Partial evaluation: Plug the values specified in ρ into a polynomial p to obtain the partially evaluated polynomial p|ρ
  • Example: p(x,y,z) = 3x²yz − xy² + 2z − 1, ρ = (10, *, 2): p|ρ(y) = 3(10)²y(2) − 10y² + 2(2) − 1 = −10y² + 600y + 3
• Witnessing: For a constraint/bound p(x) ≥ 0, if the lower bound of p|ρ is still positive, we say that p(x) ≥ 0 is witnessed true under ρ.
  • Example: If −1 ≤ y ≤ 1, then for the p(x,y,z) above, the lower bound of p|ρ(y) is −10 − 600 + 3 = −607, so p ≥ 0 is not witnessed true under ρ. If 0 ≤ y ≤ 1/10, then the lower bound is −10(1/100) + 600(0) + 3 = 2.9, so p ≥ 0 is witnessed true under ρ.
• Testable: If the system of polynomial bounds/constraints K is simultaneously witnessed true with probability at least 1 − ε for partial examples ρ drawn from M(D), we say that K is (1 − ε)-testable w.r.t. M(D).
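A sketch of partial evaluation and the witnessing check on the slide's example (the sympy representation and the univariate bound helper are our own choices):

```python
# Partial evaluation and witnessing on the slide's example.
import sympy as sp

x, y, z = sp.symbols('x y z')
p = 3*x**2*y*z - x*y**2 + 2*z - 1

# Partial example rho = (10, *, 2): x and z are observed, y is masked.
p_rho = sp.expand(p.subs({x: 10, z: 2}))        # -> -10*y**2 + 600*y + 3

def monomial_range(lo, hi, deg):
    """Range of y**deg when y ranges over [lo, hi]."""
    if deg == 0:
        return 1, 1
    vals = [lo**deg, hi**deg]
    if deg % 2 == 0 and lo < 0 < hi:
        vals.append(0)                           # even powers also attain 0
    return min(vals), max(vals)

def lower_bound(poly, var, lo, hi):
    """Naive lower bound: substitute each monomial's worse endpoint given its sign."""
    total = 0
    for (deg,), coeff in sp.Poly(poly, var).terms():
        m_lo, m_hi = monomial_range(lo, hi, deg)
        total += coeff * (m_lo if coeff > 0 else m_hi)
    return total

# p >= 0 is witnessed true under rho exactly when this lower bound is positive.
print(lower_bound(p_rho, y, -1, 1))                   # -607: not witnessed
print(lower_bound(p_rho, y, 0, sp.Rational(1, 10)))   # 29/10: witnessed
```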
Implicit learning guarantee for constraints in sum-of-squares
• Let S be an upper bound on |B_α − L_α| over all x^α of total degree at most d.
• Theorem. Given m = Ω(S²(d log n + log 1/δ)) partial examples from M(D), there is an efficient algorithm that, on input a system of bounds and constraints K, with probability 1 − δ:
  • Accepts if D satisfies the system K
  • Rejects if there is a system of (1 − 1/(2S))-testable constraints under M(D) that completes a sum-of-squares refutation of K of total naïve norm at most S.
The approach: motivation
• If we had complete examples from D…
  • We could check that the constraints were empirically satisfied in 1/ε log 1/δ examples
  • We could estimate the moments E[x^α] via Chernoff/Hoeffding bounds and check the bounds
  • This would ensure that, with probability 1 − δ, both are satisfied w.r.t. D with probability 1 − ε
• But we only have partial examples from M(D)
The actual approach
• Use sum-of-squares to check that each partial example is consistent with the constraints.
• Use partially evaluated empirical moments: set them as bounds on the unknown moments of D
  Write: (1/m)∑_ρ x^α|ρ + O(B_α (d ln(n/δ)/2m)^1/2) ≥ x^α  and  x^α ≥ (1/m)∑_ρ x^α|ρ − O(L_α (d ln(n/δ)/2m)^1/2)
• Check whether the input bounds are necessarily consistent with these partially evaluated moments.
  • Use inference from the constraints: set up as a single sum-of-squares query
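A rough sketch of the second bullet, under simplifying assumptions of our own (Boolean variables, masked entries marked None, the O(·) constant taken to be 1, and names chosen here): each partially evaluated empirical moment yields a pair of polynomial bounds that can be handed to the sum-of-squares query.

```python
# Build partially evaluated empirical moment bounds as polynomial constraints.
import math
import sympy as sp

def moment_bound_polys(examples, symbols, monomials, delta, B=1.0, L=0.0):
    """For each monomial x^alpha (a sympy expression), return the pair
         upper = (1/m)*sum_rho x^alpha|rho + B*slack - x^alpha,
         lower = x^alpha - (1/m)*sum_rho x^alpha|rho + L*slack,
       each asserted to have nonnegative expectation (E[.] >= 0) in the SOS system."""
    m, n = len(examples), len(symbols)
    d = max(sp.total_degree(t) for t in monomials)
    slack = math.sqrt(d * math.log(n / delta) / (2 * m))
    bounds = []
    for mono in monomials:
        # partially evaluate x^alpha on each example; masked variables stay symbolic
        avg = sp.Rational(1, m) * sum(
            mono.subs({s: v for s, v in zip(symbols, rho) if v is not None})
            for rho in examples)
        bounds.append((sp.expand(avg + B * slack - mono),
                       sp.expand(mono - avg + L * slack)))
    return bounds

# Tiny usage example: two Boolean variables, three partial examples (None = masked).
x1, x2 = sp.symbols('x1 x2')
examples = [(1, 0), (1, None), (None, 1)]
print(moment_bound_polys(examples, (x1, x2), [x1, x2, x1*x2], delta=0.05))
```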
Sketch of the analysis, pt. 1: no refutation if D is consistent with K
• Chernoff/Hoeffding bounds imply our "slack" is sufficient to guarantee that our empirical constraints are valid.
• Therefore: in the first case, D is consistent ⟹ no sum-of-squares refutation exists and the algorithm accepts.
• Easy!
Sketch of the analysis, pt. 2: empirical constraints capture the implicit KB
• Consider an empirical average of the small-naïve-norm proof using the implicit KB with K, partially evaluated by each partial example ρ:
    −1 = (1/m)∑_ρ (σ_0(x) + ∑_i σ_i(x) g_i(x) + ∑_j p_j(x) h_j(x) + ∑_k c_k b_k(x))|ρ
       = (1/m)∑_ρ (sum-of-squares expression using K)|ρ + (1/m)∑_ρ (implicit KB terms)|ρ
• For a (1 − 1/(2S))-testable implicit KB, in all but a ≈ 1/(2S)-fraction of the examples, the constraints are all witnessed true.
  • Chernoff/Hoeffding guarantees that the actual fraction is adequately bounded, say by < 2/(3S)
• Sum-of-squares can derive the implicit-KB terms for the witnessed examples (these are trivial).
• When the implicit KB is not witnessed, the expression can be bounded by the naïve norm of the overall proof: S. Since these occupy a < 2/(3S)-fraction of the examples, they contribute at most 2/3 to the overall expression.
  • This part of the proof can be bounded by 2/3 in sum-of-squares, so we still obtain a derivation of 2/3 − 1 = −1/3.
• Thus, using these bounds and rescaling the proof by 3 gives a sum-of-squares refutation (deriving −1).
• So, in the second case there is a sum-of-squares refutation and the algorithm rejects.
Implicit learning guarantee for constraints in sum-of-squares
• Let S be an upper bound on |B_α − L_α| over all x^α of total degree at most d.
• Theorem. Given m = Ω(S²(d log n + log 1/δ)) partial examples from M(D), there is an efficient algorithm that, on input a system of bounds and constraints K, with probability 1 − δ:
  • Accepts if D satisfies the system K
  • Rejects if there is a system of (1 − 1/(2S))-testable constraints under M(D) that completes a sum-of-squares refutation of K of total naïve norm at most S.
Recap: in this work…
Sum-of-squares is useful as a probability logic for deciding bound/constraint queries
• Tractable fragments simulate the most powerful known tractable fragments of resolution
  • Captures forward chaining and more…
• Bounds and constraints can be efficiently learned implicitly from partially observed examples
Future directions
• Can we extend to relational (first-order) logics?
  • Unsatisfying answer: we can use propositionalization (similar to Valiant '00)
  • Better idea: can we use a "Relational Semidefinite Program," similar to Relational Linear Programming (Kersting et al. '17)?
• Can we relax the requirement that the implicit KB is (1 − 1/(2S))-testable – make use of (1 − ε)-testable implicit KBs to guarantee that D is consistent up to some ε-probability modification?
References
• Ansótegui, C.; Bonet, M. L.; Levy, J.; and Manyà, F. 2008. Measuring the hardness of SAT instances. In Proc. AAAI'08, 222–228.
• Berkholz, C. 2018. The relation between polynomial calculus, Sherali-Adams, and sum-of-squares proofs. In Proc. 35th STACS, LIPIcs, 11:1–11:14.
• De, A.; Mossel, E.; and Neeman, J. 2013. Majority is stablest: discrete and SoS. In Proc. 45th STOC, 477–486.
• Fagin, R.; Halpern, J. Y.; and Megiddo, N. 1990. A logic for reasoning about probabilities. Information and Computation 87(1–2):78–128.
• Grigoriev, D., and Vorobjov, N. 2001. Complexity of Null- and Positivstellensatz proofs. Ann. Pure and Applied Logic 113(1):153–160.
• Halpern, J. Y., and Pucella, R. 2007. Characterizing and reasoning about probabilistic and non-probabilistic expectation. J. ACM 54(3):15.
• Juba, B. 2013. Implicit learning of common sense for reasoning. In Proc. 23rd IJCAI, 939–946.
• Karlin, A. R.; Mathieu, C.; and Nguyen, C. T. 2011. Integrality gaps of linear and semi-definite programming relaxations for knapsack. In International Conference on Integer Programming and Combinatorial Optimization, 301–314. Springer.
• Kersting, K.; Mladenov, M.; and Tokmakov, P. 2017. Relational linear programming. Artificial Intelligence 244:188–216.
• Lasserre, J. B. 2001. Global optimization with polynomials and the problem of moments. SIAM J. Optimization 11(3):796–817.
• Michael, L. 2010. Partial observability and learnability. Artificial Intelligence 174(11):639–669.
• Mossel, E.; O'Donnell, R.; and Oleszkiewicz, K. 2010. Noise stability of functions with low influences: invariance and optimality. Ann. Math. 171(1):295–341.
• Nesterov, Y. 2000. Squared functional systems and optimization problems. High Performance Optimization 13:405–440.
• Nilsson, N. J. 1986. Probabilistic logic. Artificial Intelligence 28:71–87.
• Parrilo, P. A. 2000. Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. Ph.D. Dissertation, California Institute of Technology.
• Putinar, M. 1993. Positive polynomials on compact semialgebraic sets. Indiana U. Math. J. 42:969–984.
• Rubin, D. B. 1976. Inference and missing data. Biometrika 63(3):581–592.
• Shor, N. 1987. An approach to obtaining global extremums in polynomial mathematical programming problems. Cybernetics and Systems Analysis 23(5):695–700.
• Valiant, L. G. 2000. Robust logics. Artificial Intelligence 117:231–253.