Some Recent Progress in Combinatorial Statistics Elchanan Mossel, UC Berkeley + Weizmann Institute http://www.stat.berkeley.edu/~mossel
What is Combinatorial Statistics • "Combinatorial Statistics" deals with inference of discrete parameters such as graphs, trees, or partitions, • with the goal of providing algorithms with guarantees on sampling complexity, computational complexity, and success probability. • Alternatively: sampling or complexity lower bounds.
Markov Random Fields / Graphical Models • A common model for stochastic networks: a bounded-degree graph G = (V, E) and weight functions ψ_C : Σ^|C| → R≥0 for every clique C. (Figure: example graph on nodes 1–4 with cliques (1,2), (1,3), (2,3), (3,4).)
Markov Random Fields / Graphical Models • A common model for stochastic networks: a bounded-degree graph G = (V, E) and weight functions ψ_C : Σ^|C| → R≥0 for every clique C. • Nodes v are assigned values a_v in an alphabet Σ; the distribution over states σ ∈ Σ^V is given by Pr[σ] ∝ ∏_C ψ_C(a_u, u ∈ C), where the product goes over all cliques C. (Figure: potential tables for the cliques (1,2), (1,3), (2,3), (3,4) of the example graph.)
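For concreteness, a minimal sketch (not from the talk) of this definition: a toy pairwise MRF on four binary nodes with hand-picked potential tables, and the probability of a state computed by brute-force normalization. All names and numbers below are illustrative.

```python
import itertools
import numpy as np

nodes = [1, 2, 3, 4]
# 2x2 potential tables psi_C, indexed by the binary values at the two clique nodes
cliques = {
    (1, 2): np.array([[1.0, 2.0], [2.0, 1.0]]),
    (1, 3): np.array([[3.0, 1.0], [1.0, 3.0]]),
    (2, 3): np.array([[1.0, 1.0], [1.0, 2.0]]),
    (3, 4): np.array([[2.0, 1.0], [1.0, 2.0]]),
}

def unnormalized_weight(sigma):
    """Product of clique potentials psi_C evaluated on the state sigma (dict: node -> value)."""
    w = 1.0
    for (u, v), psi in cliques.items():
        w *= psi[sigma[u], sigma[v]]
    return w

# Normalize by brute force (only feasible for tiny n) to get Pr[sigma].
states = [dict(zip(nodes, vals)) for vals in itertools.product([0, 1], repeat=len(nodes))]
Z = sum(unnormalized_weight(s) for s in states)
print(unnormalized_weight({1: 0, 2: 1, 3: 1, 4: 0}) / Z)
```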
Reconstruction task for Markov random fields • Suppose we can obtain independent samples from the Markov random field • Given observed data at the nodes, is it possible to reconstruct the model (network)? • Important: Do we see data at all of the nodes or only a subset?
Reconstruction problem • Problem: Given k independent samples of σ = (σ_1, …, σ_n) at all the nodes, find the graph G. (Given activity at all the nodes, can the network be reconstructed?) • Restrict attention to graphs with max degree d. A structure estimator is a map from the k samples in (Σ^n)^k to a graph on n nodes. • Questions: 1. How many samples k are required (asymptotically) in order to reconstruct MRFs with n nodes and max degree d with probability approaching 1, i.e., so that the probability of recovering the correct graph tends to 1 as n → ∞? 2. Want an efficient algorithm for reconstruction.
Related work • Tree Markov fields can be reconstructed efficiently (even with hidden nodes) [Erdös, Steel, Szekely, Warnow 99], [Daskalakis, Mossel, Roch 06]. • PAC setup: [Abbeel, Koller, Ng 06] produce a factorized distribution that is εn-close in Kullback-Leibler divergence to the true distribution; no guarantee to reconstruct the correct graph; running time and sampling complexity is n^O(d). • A more restricted problem was studied by [Wainwright, Ravikumar, Lafferty 06]: restricted to the Ising model, sample complexity of order d^5 log n, a convergence condition that is difficult to verify; the technique is based on L1 regularization. Moreover, it works for graphs, not for graphical models (clique potentials are not allowed). • Subsequent to our results, [Santhanam, Wainwright 08] determine the information-theoretic sampling complexity, and [Wainwright, Ravikumar, Lafferty 08] get d log n sampling (restricted to Ising models; still no checkable guarantee of convergence).
Reconstructing General Networks - New Results • Theorem (Bresler-M-Sly 08; lower bound on sample complexity): In order to recover G of max degree d, at least c d log n samples are needed, for some constant c. The proof follows by counting the number of networks; information-theoretic lower bounds. More formally: given any prior distribution that is uniform over degree-d graphs (no restrictions on the potentials), in order to recover the correct graph with probability ≥ 2^-n one needs at least c d log n samples. • Theorem (Bresler-M-Sly 08; asymptotically optimal algorithm): If the distribution is "non-degenerate", c d log n samples suffice to reconstruct the model with probability ≥ 1 − 1/n^100, for some (other) constant c. Running time is n^O(d). (Sampling complexity is tight up to a constant factor; whether the running time is optimal is unknown.)
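A back-of-the-envelope version of the counting argument (my reconstruction of the standard information-theoretic sketch; the constants are illustrative): there are roughly n^(c_1 d n) graphs on n nodes with max degree d, so log_2 #{graphs} ≥ c_1 d n log_2 n. Each sample reveals at most n bits (for a binary alphabet), so k samples reveal at most k n bits; recovering the graph with non-negligible probability therefore forces k n to be at least of order c_1 d n log_2 n, i.e. k ≥ c d log n.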
Intuition Behind Algorithms • Observation: knowing the graph is the same as knowing the neighborhoods, but a neighborhood is determined by the Markov property. The same intuition is behind the work of Abbeel et al. • "Algorithm": Step 1. Compute empirical probabilities, which are concentrated around the true probabilities. Step 2. For each node, simply test the Markov property of each candidate neighborhood. • Main challenge: show that non-degeneracy ⇒ the algorithm works. (A minimal sketch of the neighborhood test appears below.)
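The sketch below illustrates the two-step idea only; it is not the Bresler-Mossel-Sly algorithm (their tests, thresholds, and concentration analysis differ). It assumes samples are given as dicts mapping node to value, and accepts for each node the smallest candidate set whose empirical conditional law of that node barely changes when any extra node is added to the conditioning set.

```python
import itertools
from collections import Counter, defaultdict

def cond_dist(samples, v, cond):
    """Empirical conditional distribution of sample[v] given the values on `cond`."""
    counts = defaultdict(Counter)
    for s in samples:                       # each sample is a dict: node -> value
        counts[tuple(s[u] for u in cond)][s[v]] += 1
    return {key: {x: c / sum(cnt.values()) for x, c in cnt.items()}
            for key, cnt in counts.items()}

def markov_violation(samples, v, N, others):
    """Largest change (in total variation) of the law of v when one extra node is added to N."""
    base = cond_dist(samples, v, N)
    worst = 0.0
    for u in others:
        ext = cond_dist(samples, v, N + [u])
        for key, dist in ext.items():
            base_dist = base.get(key[:-1], {})
            diff = 0.5 * sum(abs(dist.get(x, 0.0) - base_dist.get(x, 0.0))
                             for x in set(dist) | set(base_dist))
            worst = max(worst, diff)
    return worst

def estimate_neighborhood(samples, v, nodes, d, eps=0.1):
    """Return the smallest candidate set (size <= d) that passes the Markov test; eps is an illustrative threshold."""
    for size in range(d + 1):
        for N in itertools.combinations([u for u in nodes if u != v], size):
            others = [u for u in nodes if u != v and u not in N]
            if markov_violation(samples, v, list(N), others) < eps:
                return set(N)
    return None
```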
Example 2: Reconstruction of MRFs with Hidden Nodes • In many applications only some of the nodes can be observed: visible nodes W ⊆ V. The Markov random field over the visible nodes is σ_W = (σ_w : w ∈ W). • Is reconstruction still possible? • What does "reconstruction" even mean?
Reconstruction versus distinguishing • It is easy to construct many models that lead to the same distribution (statistical unidentifiability). • Assuming "this is not a problem", are there computational obstacles to reconstruction? • In particular: how hard is it to distinguish statistically different models?
Distinguishing problems • Let M1, M2 be two models with hidden nodes. • PROBLEM 1: Can you tell if M1 and M2 are statistically close or far apart (on the visible nodes)? • PROBLEM 2: Assuming M1 and M2 are statistically far apart and given access to samples from one of them, can you tell where the samples came from?
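Here "statistically close or far apart" means total variation (statistical) distance between the two models' distributions on the visible nodes. For a toy model small enough to enumerate, this can be computed by brute-force marginalization of the hidden nodes, as in the sketch below (illustrative names only); the point of the hardness results is that nothing like this is feasible in general.

```python
import itertools

def visible_marginal(prob, visible_idx, n):
    """Marginalize an explicit distribution over {0,1}^n (dict: tuple -> prob) onto the visible coordinates."""
    marg = {}
    for x in itertools.product([0, 1], repeat=n):
        key = tuple(x[i] for i in visible_idx)
        marg[key] = marg.get(key, 0.0) + prob.get(x, 0.0)
    return marg

def tv_distance(p, q):
    """Total variation (statistical) distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```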
Hardness result with hidden nodes • In Bogdanov-M-Vadhan-08: Problems 1 and 2 are intractable (in the worst case) unless NP = RP • Conversely, if NP = RP then distinguishing (and other forms of reconstruction) are achievable
A possible objection • The “hard” models M1, M2 describe distributions that are not efficiently samplable • But if nature is efficient, we never need to worry about such distributions!
Distinguishing problem for samplable distributions • PROBLEM 3: If M1 and M2 are statistically far apart and given access to samples from one of them, can you tell where the samples came from, assuming M1 and M2 are efficiently samplable? • Theorem: Problem 3 is intractable unless computational zero knowledge is trivial. • We don't know if this is tight.
Example 3: Consensus Ranking and the Mallows Model • Problem: Given a set of rankings {π_1, π_2, …, π_N}, find the consensus ranking (or central ranking), i.e. the permutation minimizing Σ_i d(π, π_i), for d a distance on the set of permutations of n objects. The most natural choice is d_K, the Kendall distance.
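A small illustration (mine, not the talk's) of this objective: d_K counts pairwise disagreements, and the consensus ranking minimizes the total d_K to the observed rankings. The brute-force search below is feasible only for tiny n.

```python
import itertools

def kendall_distance(p, q):
    """Number of pairs of items ordered differently by permutations p and q (items listed in ranked order)."""
    n = len(p)
    pos_q = {item: r for r, item in enumerate(q)}
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if pos_q[p[i]] > pos_q[p[j]])

def consensus_brute_force(rankings):
    """Exhaustively minimize the total Kendall distance to the given rankings."""
    items = rankings[0]
    return min(itertools.permutations(items),
               key=lambda c: sum(kendall_distance(c, r) for r in rankings))

rankings = [('a', 'b', 'c', 'd'), ('a', 'c', 'b', 'd'), ('b', 'a', 'c', 'd')]
print(consensus_brute_force(rankings))
```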
The Mallows Model • Exponential family model in β: P(π | β, π_0) = Z(β)^-1 exp(−β d_K(π, π_0)). • β = 0: uniform distribution over permutations; β > 0: π_0 is the unique mode of P_{β,π_0}. • Theorem [Meila, Phadnis, Patterson, Bilmes 07]: Consensus ranking (i.e., ML estimation of π_0 for constant β) can be solved exactly by a branch-and-bound (B&B) algorithm. • The B&B algorithm can take (super-)exponential time in some cases; it seems to perform well on simulated data.
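For experimentation, a hedged sketch of the standard repeated-insertion sampler for the Mallows model (β and π_0 as above; this construction matches the exponential-family definition, but it is not part of the talk):

```python
import math
import random

def sample_mallows(pi0, beta, rng=random):
    """One draw from P(pi | beta, pi0) proportional to exp(-beta * d_K(pi, pi0))."""
    q = math.exp(-beta)
    out = []
    for i, item in enumerate(pi0, start=1):
        # Insert the i-th item of pi0 at position j in {1, ..., i}; choosing j puts
        # i - j earlier items after it, i.e. adds i - j inversions w.r.t. pi0,
        # so position j gets weight q**(i - j).
        weights = [q ** (i - j) for j in range(1, i + 1)]
        j = rng.choices(range(1, i + 1), weights=weights)[0]
        out.insert(j - 1, item)
    return out

print(sample_mallows(list("abcde"), beta=1.0))
```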
Related work • ML estimation: [Fligner & Verducci 86] introduce the generalized Mallows model and ML estimation; [Fligner & Verducci 88] (FV) give a heuristic for estimating π_0. • Consensus ranking: [Cohen, Schapire, Singer 99] greedy algorithm (CSS), + improvement by finding strongly connected components, + missing data (not all π_i rank all n items); [Ailon, Newman, Charikar 05] randomized algorithm with a guaranteed 11/7-factor approximation (ANC); [Mathieu 07] (1+ε) approximation, time O(n^6/ε + 2^{2^O(1/ε)}); [Davenport, Kalagnanan 03] heuristics based on edge-disjoint cycles; [Conitzer, D, K 05] exact algorithm based on integer programming, better bounds.
Efficient Sorting of the Mallows Model • [Braverman-Mossel 09]: Given r independent samples from the Mallows model, find the ML solution exactly(!) in time n^b, where b = 1 + O((β r)^-1) and r is the number of samples, with high probability (say ≥ 1 − n^-100).
Sorting the Mallows Model • [Braverman-M 09]: The optimal order can be found in polynomial time and O(n log n) queries. • Proof ingredient 1: "statistical properties" of the generated permutations π_i in terms of the original order π_0. With high probability: Σ_x |π_i(x) − π_0(x)| = O(n) and max_x |π_i(x) − π_0(x)| = O(log n). • Additional ingredient: a dynamic programming algorithm to find π_0 given a starting point in which each element is at most k away, with running time O(n 2^6k).
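As an illustration of how the displacement bounds get used (a caricature of the first stage only, not the Braverman-Mossel algorithm): ordering items by their average position across the samples already places every item close to its position in π_0, and the dynamic program then searches within a small window around that ordering.

```python
def average_position_order(samples):
    """Order items by their average position; samples are permutations given as lists in ranked order."""
    totals = {}
    for perm in samples:
        for pos, item in enumerate(perm):
            totals[item] = totals.get(item, 0) + pos
    return sorted(totals, key=lambda item: totals[item] / len(samples))

samples = [list("abcde"), list("abdce"), list("bacde")]
print(average_position_order(samples))  # -> ['a', 'b', 'c', 'd', 'e']
```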
Example 4: MCMC Diagnostics • Sampling from a joint probability distribution π on Ω. • Construct a Markov chain X_0, X_1, … which converges to π. • Applications: multidimensional integration, Bayesian inference (sampling from "posterior distributions" of the model), computational physics, computational biology, vision / image segmentation.
Convergence Diagnostics • A method to determine t such that P^t(X_0, ·) is "sufficiently close" to π. (MCMC in Practice, Gilks, Richardson & Spiegelhalter.) • Are we there yet?
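A toy illustration (mine, not from the talk) of what a diagnostic must decide: for a small chain given explicitly as a transition matrix, one can compute P^t(X_0, ·) directly and measure its total variation distance from π. The hardness results below concern chains given implicitly on exponentially large state spaces, where nothing like this is available.

```python
import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])   # a small symmetric (doubly stochastic) chain
pi = np.array([1/3, 1/3, 1/3])    # its stationary distribution (uniform)

def tv_from_stationary(P, pi, x0, t):
    """Total variation distance of P^t(x0, .) from pi, via matrix powers."""
    row = np.linalg.matrix_power(P, t)[x0]
    return 0.5 * np.abs(row - pi).sum()

for t in [1, 10, 100]:
    print(t, tv_from_stationary(P, pi, x0=0, t=t))
```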
Convergence Diagnostics in the Literature • Visual inspection of functions of samples. • - MCMC in Practice, Gilks, Richardson & Spiegelhalter • [Raftery-Lewis] • Bounding the variance of estimates of quantiles of functions of parameters. • [Gelman-Rubin] • Convergence achieved if the separate empirical distributions are approximately the same as the combined distribution. • [Cowles-Carlin] • Survey of 13 diagnostics designed to identify specific types of convergence failure. • Recommend combinations of strategies – running parallel chains, computing autocorrelations.
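A hedged sketch of the Gelman-Rubin idea cited above: run m parallel chains and compare the between-chain variance of the chain means with the average within-chain variance; a potential scale reduction factor (R-hat) near 1 is taken as heuristic evidence of convergence. The formula below follows the common textbook version; exact details vary across implementations.

```python
import numpy as np

def gelman_rubin(chains):
    """chains: array of shape (m, n) holding n draws of a scalar quantity from each of m chains."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)              # potential scale reduction factor (R-hat)

rng = np.random.default_rng(0)
chains = rng.normal(size=(4, 1000))          # 4 chains already at stationarity
print(gelman_rubin(chains))                  # close to 1.0
```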
Diagnostic Algorithm • Diagnostic algorithm A: • Input: a Markov chain P on Ω ⊆ {0,1}^n and a time t such that either τ(1/4) < t or τ(1/4) > 100t. • Output: "Yes" if τ(1/4) < t, "No" if τ(1/4) > 100t.
Results - The General Case • Theorem [Bhatnagar-Bogdanov-M 09]: Given a Markov chain P on Ω ⊆ {0,1}^n and a time t such that either τ(1/4) < t or τ(1/4) > 100t, it is a PSPACE-complete problem to distinguish the two.
PSPACE-complete problems • [Crasmaru-Tromp] Ladder problems in Go: does Black have a winning strategy? • PSPACE-hard problems are "much harder" than NP-hard problems.
Results - Polynomial Mixing • Theorem 2 [Bhatnagar-Bogdanov-M 09]: Given a Markov chain P on Ω ⊆ {0,1}^n with a promise that τ(1/4) < poly(n), and a time t such that either τ(1/4) < t or τ(1/4) > 100t, it is as hard as any coNP problem to distinguish the two. (Reduction illustrated on the slide: CNF formulas such as (x0 ∨ x3 ∨ x7) ∧ (x2 ∨ x7 ∨ xn) ∧ … ∧ (…) are mapped to chains with τ(1/4) < t or τ(1/4) > 100t.)
Results - Mixing from a Given State • Theorem 3 [Bhatnagar-Bogdanov-M 09]: Given a Markov chain P on Ω ⊆ {0,1}^n such that τ(1/4) < poly(n), a starting state X_0, and a time t such that either τ_{X_0}(1/4) < t or τ_{X_0}(1/4) > 100t, a diagnostic algorithm can be used to distinguish whether two efficiently samplable distributions are close or far in statistical distance. • [Sahai-Vadhan] This problem is "Statistical Zero Knowledge"-complete. (Figure: two sampling circuits C1 and C2 with random inputs r1…rn and r'1…r'n and outputs a1…am.)
Comb. Stat. vs. Common ML Approaches • Some features of combinatorial statistics in comparison to ML: • +: Uses explicit, well-defined models. • +: Explicit non-asymptotic running time / sampling bounds. • -: Often not "statistically efficient". • -: Inputs must be generated from the correct distribution, or close to it (but this is often NP-hard otherwise, so it is true of all methods). • +/-: Need to understand the combinatorial structure of each new problem, vs. applying general statistical techniques where the structure is hidden in programming/design etc.
Conclusions and Questions • [Cowles-Carlin] Survey of 13 diagnostics designed to identify specific types of convergence failure. Each can fail to detect the type of convergence it was designed for. • [Arora-Barak] Efficient algorithms are not believed to exist for PSPACE-complete, coNP-complete or SZK-complete problems. • Diagnostic algorithms do not exist for large classes of MCMC algorithms, like Gibbs samplers unless there are efficient algorithms for PSPACE or coNP or SZK. • For what classes of samplers are convergence diagnostics possible? • Hardness of testing convergence for more restricted classes of samplers.
Collaborators • Nayantara Bhatnagar (Berkeley) • Guy Bresler (Berkeley) • Mark Braverman (Microsoft) • Andrej Bogdanov (Hong Kong) • Allan Sly (Berkeley) • Salil Vadhan (Harvard)