
Sampling and Soundness: Can We Have Both?


Presentation Transcript


  1. Sampling and Soundness: Can We Have Both? Carla Gomes, Bart Selman, Ashish Sabharwal (Cornell University), Jörg Hoffmann (DERI Innsbruck) …and I am: Frank van Harmelen

  2. ISWC’07 Talk Roadmap • A Sampling Method with a Correctness Guarantee • Can we apply this to the Semantic Web? • Discussion

  3. ISWC’07 How Might One Count? How many people are present in the hall? Problem characteristics: • Space naturally divided into rows, columns, sections, … • Many seats empty • Uneven distribution of people (e.g. more near doors, aisles, the front, etc.)

  4. ISWC’07 #1: Brute-Force Counting Idea: • Go through every seat • If occupied, increment counter Advantage: • Simplicity, accuracy Drawback: • Scalability
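
Translated from seats to propositional model counting, the brute-force idea is exhaustive enumeration. A minimal sketch (the clause-list representation and function name are my own illustration, not from the talk):

```python
from itertools import product

def brute_force_count(clauses, num_vars):
    """Count satisfying assignments by checking every one of the 2^n candidates.

    A clause is a list of DIMACS-style literals, e.g. [1, -3] means (x1 OR NOT x3).
    Exact and simple, but hopeless beyond a few dozen variables -- the slide's
    scalability drawback.
    """
    count = 0
    for bits in product([False, True], repeat=num_vars):
        assignment = {v + 1: bits[v] for v in range(num_vars)}  # variables are 1-indexed
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause) for clause in clauses):
            count += 1
    return count

# (x1 OR x2) AND (NOT x1 OR x3) has exactly 4 models over 3 variables
print(brute_force_count([[1, 2], [-1, 3]], 3))  # -> 4
```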

  5. ISWC’07 #2: Branch-and-Bound (DPLL-style) Idea: • Split the space into sections, e.g. front/back, left/right/center, … • Use smart detection of full/empty sections • Add up all partial counts Advantage: • Relatively faster, exact Drawback: • Still “accounts for” every single person present: needs extremely fine granularity • Scalability Framework used in DPLL-based systematic exact counters, e.g. Relsat [Bayardo et al. ’00], Cachet [Sang et al. ’04]
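
The same toy CNF setting makes the branch-and-bound idea concrete: split on a variable, detect “full” and “empty” sub-spaces, and add up the partial counts. A bare-bones sketch (real counters like Relsat and Cachet add component caching, clause learning, and much better branching; this is only an illustration):

```python
def dpll_count(clauses, variables):
    """Exact model counting by recursively splitting the search space."""
    if not clauses:
        # "Full section": no constraints left, every assignment of the
        # remaining variables is a model.
        return 2 ** len(variables)
    if any(len(clause) == 0 for clause in clauses):
        # "Empty section": an empty clause means this branch has no models.
        return 0
    v, rest = variables[0], variables[1:]

    def simplify(value):
        # Drop clauses satisfied by v=value; remove the falsified literal elsewhere.
        true_lit, false_lit = (v, -v) if value else (-v, v)
        return [[l for l in c if l != false_lit] for c in clauses if true_lit not in c]

    # Add up the partial counts of the two half-spaces.
    return dpll_count(simplify(True), rest) + dpll_count(simplify(False), rest)

print(dpll_count([[1, 2], [-1, 3]], [1, 2, 3]))  # -> 4, matching the brute-force count
```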

  6. ISWC’07 #3: Naïve Sampling Estimate Idea: • Randomly select a region • Count within this region • Scale up appropriately Advantage: • Quite fast Drawback: • Robustness: can easily under- or over-estimate • Scalability in sparse spaces: e.g. 10^60 solutions out of 10^300 means the region would need to be much larger than 10^240 to “hit” any solutions
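
For contrast, the naïve sampling estimate in the same toy setting (illustrative code, not from the talk): sample random assignments, measure the satisfying fraction, and scale up. In sparse solution spaces essentially every sample misses and the estimate collapses to 0, which is exactly the robustness problem noted on the slide.

```python
import random

def naive_sampling_estimate(clauses, num_vars, samples=10_000):
    """Estimate the model count as (fraction of random assignments satisfying F) * 2^n."""
    hits = 0
    for _ in range(samples):
        assignment = {v: random.choice([False, True]) for v in range(1, num_vars + 1)}
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause) for clause in clauses):
            hits += 1
    return hits / samples * 2 ** num_vars

print(naive_sampling_estimate([[1, 2], [-1, 3]], 3))  # roughly 4, up to sampling noise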

  7. ISWC’07 Sampling with a Guarantee Idea: • Identify a “balanced” row split or column split (roughly equal number of people on each side) • Use local search for the estimate • Pick one side at random • Count on that side recursively • Multiply the result by 2 This provably yields the true count on average! • Even when an unbalanced row/column is accidentally picked for the split, e.g. even when samples are biased or insufficiently many • Surprisingly good in practice, using local search as the sampler
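
The key claim, that picking a side uniformly at random and doubling its count is unbiased even when the split is badly unbalanced, can be checked in a few lines (my own illustration of the expectation argument, not code from the talk):

```python
import random

def halve_and_double(left_count, right_count):
    """Pick one side of a (possibly unbalanced) split uniformly at random and double its count."""
    return 2 * random.choice([left_count, right_count])

# A badly unbalanced split: 90 people on one side, 10 on the other (true total 100).
# Each individual trial returns 180 or 20, yet the average is (2*90 + 2*10) / 2 = 100,
# so the estimator is unbiased regardless of balance; balance only reduces the variance.
trials = [halve_and_double(90, 10) for _ in range(100_000)]
print(sum(trials) / len(trials))  # close to 100
```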

  8. ISWC’07 Algorithm SampleCount [Gomes-Hoffmann-Sabharwal-Selman IJCAI’07] (showing one trial)
  Input: Boolean formula F
  Set numFixed = 0, slack = some constant (e.g. 2, 4, 7, …)
  Repeat until F becomes feasible for exact counting:
  • Obtain s solution samples for F
  • Identify the most balanced variable and variable pair [“x is balanced”: s/2 samples have x = 0, s/2 have x = 1; “(x,y) is balanced”: s/2 samples have x = y, s/2 have x = ¬y]
  • If x is more balanced than (x,y), randomly set x to 0 or 1; else randomly replace x with y or ¬y; simplify F
  • Increment numFixed
  Output: model count ≥ 2^(numFixed − slack) × exactCount(simplified F), with confidence (1 − 2^(−slack))
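
A compressed Python sketch of one SampleCount trial. Here `sampler(clauses, s)` is assumed to return s solution samples as {var: 0/1} dicts (e.g. from a WalkSAT/SampleSat-style local search) and `exact_counter` is an exact counter for the residual formula; both are caller-supplied placeholders, and for brevity the sketch fixes only single balanced variables, omitting the balanced variable-pair (x = y / x = ¬y) case of the full algorithm.

```python
import random

def assign(clauses, var, value):
    """Simplify a CNF (list of literal lists) under var := value (value is 0 or 1)."""
    true_lit, false_lit = (var, -var) if value else (-var, var)
    return [[l for l in c if l != false_lit] for c in clauses if true_lit not in c]

def sample_count_trial(clauses, variables, sampler, exact_counter,
                       threshold=20, slack=2, s=20):
    """One trial of SampleCount, heavily simplified (placeholder sampler/counter)."""
    num_fixed = 0
    free_vars = list(variables)
    while len(free_vars) > threshold:          # until F is feasible for exact counting
        samples = sampler(clauses, s)
        # Most balanced variable: #(samples with x = 1) closest to s/2.
        x = min(free_vars, key=lambda v: abs(sum(smp[v] for smp in samples) - s / 2))
        clauses = assign(clauses, x, random.choice([0, 1]))   # set x randomly, simplify F
        free_vars.remove(x)
        num_fixed += 1
    # Lower bound that holds with probability >= 1 - 2**(-slack) for this single trial.
    return 2 ** (num_fixed - slack) * exact_counter(clauses, free_vars)
```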

  9. ISWC’07 Correctness Guarantee Key properties: • Holds irrespective of the quality of the local search estimates • No free lunch! Bad estimates → high variance of the trial outcome → min(trials) is high-confidence but not tight • Confidence grows exponentially with slack and t Ideas used in the proof: • Expected model count = true count (for each trial) • Use Markov’s inequality Pr[X > k·E[X]] < 1/k to bound the error probability (X is the outcome of one trial) Theorem: SampleCount with t trials gives a correct lower bound with probability ≥ (1 − 2^(−slack·t)), e.g. slack = 2, t = 4 → ≥ 99% correctness confidence
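
To make the arithmetic behind the theorem concrete: each trial’s output already includes the 2^(−slack) safety factor, and the t trials are combined by taking the minimum. A tiny helper (names are mine) reproducing the slack = 2, t = 4 example:

```python
def lower_bound_from_trials(trial_counts, slack):
    """Combine t independent SampleCount trials into a single lower bound.

    Taking the minimum means the combined bound fails only if *every* trial
    overshot the true count; each trial overshoots with probability < 2**(-slack)
    (Markov), so all t overshoot with probability < 2**(-slack * t).
    """
    t = len(trial_counts)
    return min(trial_counts), 1 - 2 ** (-slack * t)

# The slide's example: slack = 2, t = 4 gives confidence 1 - 2^-8, about 99.6%.
print(lower_bound_from_trials([3.1e13, 1.6e13, 2.4e13, 4.0e13], slack=2))
# -> (16000000000000.0, 0.99609375)
```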

  10. ISWC’07 Circuit Synthesis, Random CNFs

  Instance    | True Count   | SampleCount (99% conf.)    | Relsat (exact)           | Cachet (exact)
  2bitmax_6   | 2.1 x 10^29  | ≥ 2.4 x 10^28 (29 sec)     | 2.1 x 10^29 (66 sec)     | 2.1 x 10^29 (2 sec)
  3bitadd_32  | ---          | ≥ 5.9 x 10^1339 (32 min)   | --- (12 hrs)             | --- (12 hrs)
  wff-3-3.5   | 1.4 x 10^14  | ≥ 1.6 x 10^13 (4 min)      | 1.4 x 10^14 (2 hrs)      | 1.4 x 10^14 (7 min)
  wff-3-1.5   | 1.8 x 10^21  | ≥ 1.6 x 10^20 (4 min)      | ≥ 4.0 x 10^17 (12 hrs)   | 1.8 x 10^21 (3 hrs)
  wff-4-5.0   | ---          | ≥ 8.0 x 10^15 (2 min)      | ≥ 1.8 x 10^12 (12 hrs)   | ≥ 1.0 x 10^14 (12 hrs)

  11. ISWC’07 Talk Roadmap • A Sampling Method with a Correctness Guarantee • Can we apply this to the Semantic Web? • Discussion

  12. ISWC’07 Talk Roadmap • A Sampling Method with a Correctness Guarantee • Can we apply this to the Semantic Web? [Highly speculative] • Discussion

  13. ISWC’07 Counting in the Semantic Web… • … should certainly be possible with this method • Example: given RDF database D, count how many triples comply with query q • Throw a constraint cutting the set of all triples in half • If feasible, count n triples exactly; return n × 2^(#constraints − slack) • Else, iterate • “Merely” technical challenges: • What are “constraints” cutting the set of all triples in half? • How to “throw” a constraint? • When to stop throwing constraints? • How to efficiently count the remaining triples?
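
Purely to make this (admittedly speculative) recipe concrete, one way the loop could look, with a seeded coin flip per triple standing in for a “constraint that cuts the set in half”. Everything here is hypothetical: `matching_triples` is assumed to be an iterable of (s, p, o) tuples complying with q, and in a real store that set is far too large to materialize, so the constraint would have to be pushed into the query itself, which is exactly the list of open challenges above.

```python
import random

def estimate_matching_triples(matching_triples, threshold=1_000, slack=2):
    """Speculative sketch: repeatedly halve the candidate set, then count and scale up."""
    triples = list(matching_triples)
    num_constraints = 0
    while len(triples) > threshold:            # not yet feasible to count exactly
        salt = random.random()
        # Keep a triple iff a hash of (triple, salt) is even: roughly half survive.
        triples = [t for t in triples if hash((t, salt)) % 2 == 0]
        num_constraints += 1
    # Count the survivors exactly and scale back up, keeping the 2**(-slack) safety factor.
    return 2 ** (num_constraints - slack) * len(triples)
```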

  14. ISWC’07 What about Deduction? • Does φ follow from Σ? • Exploit the connection “implication ⟺ UNSAT” via upper bounds? • A similar theorem does NOT hold for upper bounds • Nutshell: Markov’s inequality Pr[X > k·E[X]] < 1/k does not have a symmetric counterpart bounding Pr[X < E[X]/k] • An adaptation is possible but has many problems → does not look too promising • Heuristic alternative: • Add constraints to Σ to obtain Σ′; check whether Σ′ implies φ • If “No”, stop; if “Yes”, go to the next trial • After t successful trials, output “it’s enough, I believe it” • No provable confidence, but may work well in practice
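
The heuristic alternative can be written down directly. `add_random_constraint` and `entails` are placeholders for a constraint generator and a reasoner, neither specified in the talk; one observation not spelled out on the slide is that under classical (monotone) semantics a “No” answer is actually conclusive, since every model of the strengthened Σ′ is also a model of Σ.

```python
def heuristic_entailment_check(sigma, phi, add_random_constraint, entails, t=10):
    """Sketch of the heuristic alternative; helper names are placeholders, not from the talk."""
    for _ in range(t):
        sigma_prime = add_random_constraint(sigma)    # strengthen Sigma into Sigma'
        if not entails(sigma_prime, phi):
            # A countermodel of phi among the models of Sigma' is also a model of
            # Sigma, so this refutes Sigma |= phi outright: stop.
            return "does not follow"
        # A "Yes" only concerns the strengthened theory; keep collecting trials.
    return "probably follows (t successful trials, no provable confidence)"
```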

  15. ISWC’07 What about Deduction? • Does φ follow from Σ? • Much more distant adaptation: • “Constraint” = something that removes half of Σ !! • Throw some and check whether Σ′ ⊨ φ • Confidence problematic: • Can we draw any conclusions if Σ′ does NOT entail φ? • It may be that ψ1, ψ2 ∈ Σ with ψ1 ∧ ψ2 ⊨ φ, but a constraint separated ψ1 from ψ2 • It may be that all relevant ψ are thrown out • Are there interesting cases where we can bound the probability of these events??

  16. ISWC’07 Talk Roadmap • A Sampling Method with a Correctness Guarantee • Can we apply this to the Semantic Web? [Highly speculative] • Discussion

  17. ISWC’07 Discussion • In propositional CNF, one can efficiently obtain high-confidence lower bounds on the number of models by sampling • Application to the Semantic Web: • Adaptation to counting tasks should be possible • Adaptation for Σ ⊨ φ, via upper bounds, is problematic • Promising: a heuristic method sacrificing the confidence guarantee • Alternative adaptation weakens Σ instead of strengthening it • “Sampling the knowledge base” • Confidence guarantees?? Your feedback and thoughts are highly appreciated!!

  18. ISWC’07 What about Deduction? • Does φ follow from Σ? • Straightforward adaptation: • There is a variant of this algorithm that computes high-confidence upper bounds instead • Throw “large” constraints into Σ ∧ ¬φ and check whether the result Σ′ is SAT • If SAT, no implication; if UNSAT in each of t iterations, confidence in an upper bound on #models • Many problems: • Is Σ′ actually easier to check?? • “Large” constraints are tough even in the propositional CNF context! • (“Large” = involves half of the propositional variables; needed for confidence) • An upper bound on #models is not confidence in UNSAT!
