Efficient Learning in High Dimensions with Trees and Mixtures
Marina Meila, Carnegie Mellon University
Learning multidimensional data
• Multidimensional (noisy) data
• Learning tasks - intelligent data analysis
  • categorization (clustering)
  • classification
  • novelty detection
  • probabilistic reasoning
• Data is changing and growing; tasks change → need to make learning automatic and efficient
Combining probability and algorithms
• Automatic: probability and statistics
• Efficient: algorithms
• This talk: the tree statistical model
Talk overview
• Perspective: generative models and decision tasks
• Introduction: statistical models
• The tree model
• Mixtures of trees
  • Learning
  • Experiments
• Accelerated learning
• Bayesian learning
Statistical model: a multivariate domain
Variables: Smoker, Bronchitis, Lung cancer, Cough, X ray
• Data: Patient 1, Patient 2, . . .
• Queries
  • diagnose a new patient
  • is smoking related to lung cancer?
  • understand the "laws" of the domain
[Diagrams: the five variables drawn as a graphical model, once per query]
Probabilistic approach
• Smoker, Bronchitis, . . . are (discrete) random variables
• A statistical model (joint distribution) P(Smoker, Bronchitis, Lung cancer, Cough, X ray) summarizes knowledge about the domain
• Queries:
  • inference, e.g. P(Lung cancer = true | Smoker = true, Cough = false)
  • structure of the model
    • discovering relationships
    • categorization
Probability table representation

  v1 v2:     00    01    11    10
  v3 = 0:   .01   .14   .22   .01
  v3 = 1:   .23   .03   .33   .03

• Query: P(v1=0 | v2=1) = P(v1=0, v2=1) / P(v2=1) = (.14 + .03) / (.14 + .03 + .22 + .33) = .23
• Curse of dimensionality: if v1, v2, . . . vn are binary variables, P_{V1 V2 ... Vn} is a table with 2^n entries!
• How to represent? How to query? How to learn from data? Structure?
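As a concrete illustration of querying the full joint table, and of why this representation cannot scale, here is a minimal Python sketch; the array layout and names are assumptions for illustration, not part of the talk.

```python
import numpy as np

# Joint table P(v1, v2, v3) for three binary variables, indexed as P[v1, v2, v3];
# the numbers are the entries of the slide's example table (rows there are v3).
P = np.zeros((2, 2, 2))
P[0, 0, 0], P[0, 1, 0], P[1, 1, 0], P[1, 0, 0] = .01, .14, .22, .01
P[0, 0, 1], P[0, 1, 1], P[1, 1, 1], P[1, 0, 1] = .23, .03, .33, .03

# Query P(v1=0 | v2=1): marginalize out v3, then normalize over v1.
P12 = P.sum(axis=2)                      # P(v1, v2)
print(P12[0, 1] / P12[:, 1].sum())       # (.14+.03)/(.14+.03+.22+.33) ~ 0.236

# Curse of dimensionality: for n binary variables the table has 2**n entries,
# so explicit representation, querying and learning all become infeasible.
```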
Graphical models
• Structure
  • vertices = variables
  • edges = "direct dependencies"
• Parametrization by local probability tables
  • compact parametric representation
  • efficient computation
  • learning parameters by a simple formula
  • learning structure is NP-hard
[Example diagram: a graphical model for galaxy observation - distance, galaxy type, size, spectrum, Z (red-shift), dust, observed size, observed spectrum, photometric measurement]
The tree statistical model
• Structure: tree (graph with no cycles)
• Parameters: probability tables associated to edges, e.g. T_3, T_34, T_{4|3}
• T(x) factors over tree edges:

  T(x) = ∏_{uv ∈ E} T_{v|u}(x_v | x_u)

  or, equivalently,

  T(x) = ∏_{uv ∈ E} T_{uv}(x_u, x_v) / ∏_{v ∈ V} T_v(x_v)^{deg v − 1}
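To make the factorization concrete, here is a hedged sketch of evaluating log T(x) from edge tables and node marginals; the dictionary-based representation is my own choice for illustration, not the talk's implementation.

```python
import numpy as np

def tree_loglik(x, edges, T_edge, T_node, degree):
    """log T(x) = sum_{uv in E} log T_uv(x_u, x_v) - sum_v (deg v - 1) log T_v(x_v).
    x: dict variable -> value; T_edge[(u, v)]: 2-D table; T_node[v]: 1-D marginal."""
    ll = sum(np.log(T_edge[(u, v)][x[u], x[v]]) for (u, v) in edges)
    ll -= sum((degree[v] - 1) * np.log(T_node[v][x[v]]) for v in T_node)
    return ll
```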
Examples
• Splice junction domain: tree over the junction type and the sequence positions −7 . . . −1, +1 . . . +8
• Premature babies' Broncho-Pulmonary Disease (BPD): tree over BPD, Gestation, Weight, Temperature, Hypertension, Acidosis, Coag, Thrombocyt, Neutropenia, HyperNa, PulmHemorrh, Suspect, Lipid
[Figures: the learned tree structures for the two domains]
Trees - basic operations   (|V| = n)

  T(x) = ∏_{uv ∈ E} T_{uv}(x_u, x_v) / ∏_{v ∈ V} T_v(x_v)^{deg v − 1}

Querying the model
• computing the likelihood T(x)                        ~ n
• conditioning T_{V−A | A} (junction tree algorithm)   ~ n
• marginalization T_uv for arbitrary u, v              ~ n
• sampling                                             ~ n
Estimating the model
• fitting to a given distribution                      ~ n^2
• learning from data                                   ~ n^2 N_data
The tree is a simple model.
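As one example of the ~n operations above, here is a sketch of ancestral sampling using the directed factorization T(x) = ∏ T_{v|u}(x_v | x_u); the data structures (children lists, conditional tables indexed as [child value, parent value]) are assumptions for illustration.

```python
import numpy as np

def sample_tree(root, children, T_root, T_cond, rng=np.random.default_rng()):
    """Draw one sample in O(n): sample the root, then each child given its parent."""
    x = {root: rng.choice(len(T_root), p=T_root)}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            p_v_given_u = T_cond[(u, v)][:, x[u]]      # column x_u of T_{v|u}
            x[v] = rng.choice(len(p_v_given_u), p=p_v_given_u)
            stack.append(v)
    return x
```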
The mixture of trees (Meila '97)
• h = "hidden" variable, P(h = k) = λ_k, k = 1, 2, . . . m

  Q(x) = Σ_{k=1}^{m} λ_k T_k(x)

• NOT a graphical model
• computational efficiency preserved
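A small sketch of evaluating the mixture: Q(x) is a convex combination of the per-tree likelihoods, computed here in log space for numerical stability. The function names are assumed for illustration (e.g. the tree_loglik sketch above, partially applied).

```python
import numpy as np

def mixture_loglik(x, lam, tree_logliks):
    """log Q(x) = log sum_k lambda_k T_k(x), via the log-sum-exp trick.
    tree_logliks: list of functions, each returning log T_k(x)."""
    logs = np.log(lam) + np.array([ll(x) for ll in tree_logliks])
    return logs.max() + np.log(np.exp(logs - logs.max()).sum())
```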
Learning - problem formulation
• Maximum likelihood learning
  • given a data set D = { x^1, . . . x^N }
  • find the model that best predicts the data: T_opt = argmax_T T(D)
• Fitting a tree to a distribution
  • given a data set D = { x^1, . . . x^N } and a distribution P that weights each data point
  • find T_opt = argmin_T KL(P || T)
  • KL is the Kullback-Leibler divergence
  • includes maximum likelihood learning as a special case
Fitting a tree to a distribution (Chow & Liu '68)
  T_opt = argmin_T KL(P || T)
• optimization over structure + parameters
• sufficient statistics
  • probability tables P_uv = N_uv / N,  u, v ∈ V
  • mutual informations I_uv = Σ_{x_u, x_v} P_uv log [ P_uv / (P_u P_v) ]
Fitting a tree to a distribution - solution
• Structure: E_opt = argmax_E Σ_{uv ∈ E} I_uv
  • found by the Maximum Weight Spanning Tree algorithm with edge weights I_uv
• Parameters: copy the marginals of P,  T_uv = P_uv for uv ∈ E
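A hedged sketch of the Chow & Liu procedure described above: estimate pairwise marginals from data, compute the mutual informations I_uv, and take a maximum weight spanning tree (here via Prim's algorithm). Binary variables and the helper names are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def chow_liu_edges(X):
    """X: (N, n) array of binary observations. Returns the edges of the optimal tree."""
    N, n = X.shape
    I = np.zeros((n, n))
    bins = [[-0.5, 0.5, 1.5]] * 2
    for u, v in combinations(range(n), 2):
        Puv = np.histogram2d(X[:, u], X[:, v], bins=bins)[0] / N    # empirical P_uv
        Pu, Pv = Puv.sum(axis=1), Puv.sum(axis=0)
        nz = Puv > 0
        I[u, v] = I[v, u] = (Puv[nz] * np.log(Puv[nz] / np.outer(Pu, Pv)[nz])).sum()

    # Maximum weight spanning tree with weights I_uv (Prim's algorithm).
    edges, best = [], {v: (I[0, v], 0) for v in range(1, n)}
    while best:
        v = max(best, key=lambda w: best[w][0])
        _, u = best.pop(v)
        edges.append((u, v))
        for w in best:
            if I[v, w] > best[w][0]:
                best[w] = (I[v, w], v)
    return edges   # tree parameters: copy the empirical marginals P_uv on these edges
```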
Learning mixtures by the EM algorithm (Meila & Jordan '97)
• Initialize randomly
• E step: which x^i come from T^k?  → distribution P^k(x)
• M step: fit T^k to its set of points, min KL(P^k || T^k)
• converges to a local maximum of the likelihood
Remarks
• Learning a tree
  • solution is globally optimal over structures and parameters
  • tractable: running time ~ n^2 N
• Learning a mixture by the EM algorithm
  • both E and M steps are exact and tractable
  • running time: E step ~ m n N, M step ~ m n^2 N
  • assumes m known
  • converges to a local optimum
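A hedged sketch of the EM loop for the mixture, assuming a helper fit_tree(X, weights) (e.g. a weighted version of the Chow & Liu sketch above) that returns a model with a loglik(X) method; this is an illustration, not the talk's implementation.

```python
import numpy as np

def em_mixture_of_trees(X, m, fit_tree, n_iter=50, rng=np.random.default_rng(0)):
    """EM for a mixture of m trees: E step = soft assignment of points to trees,
    M step = weighted tree fitting (min KL(P_k || T_k)) plus mixture weight update."""
    N = X.shape[0]
    gamma = rng.dirichlet(np.ones(m), size=N)          # random initial responsibilities
    for _ in range(n_iter):
        # M step: fit T_k to the points it is responsible for; update lambda_k.
        trees = [fit_tree(X, gamma[:, k]) for k in range(m)]
        lam = gamma.mean(axis=0)
        # E step: gamma_ik proportional to lambda_k * T_k(x_i), normalized per point.
        logp = np.log(lam) + np.stack([t.loglik(X) for t in trees], axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=1, keepdims=True)
    return lam, trees
```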
Finding structure - the bars problem
• Data: n = 25
• Structure recovery: 19 out of 20 trials
• Hidden variable accuracy: 0.85 ± 0.08 (ambiguous), 0.95 ± 0.01 (unambiguous)
• Data likelihood [bits/data point]: true model 8.58, learned model 9.82 ± 0.95
[Figures: example data and the learned structure]
Experiments - density estimation
• Digits and digit pairs: N_train = 6000, N_valid = 2000, N_test = 5000
  • digits: n = 64 variables (m = 16 trees)
  • digit pairs: n = 128 variables (m = 32 trees)
[Plots: test performance of mixtures of trees on both tasks]
DNA splice junction classification
• n = 61 variables
• class = Intron/Exon, Exon/Intron, Neither
[Plot: classification accuracy of the Tree model vs. TANB, NB, and supervised methods (DELVE)]
Discovering structure

IE junction (Intron | Exon), positions 15 16 . . . 25 26 27 28 29 30 31
  Tree:  -   CT  CT  CT  -   -   CT  A   G   G
  True:  CT  CT  CT  CT  -   -   CT  A   G   G

EI junction (Exon | Intron), positions 28 29 30 31 32 33 34 35 36
  Tree:  CA  A   G   G   T   AG  A   G   -
  True:  CA  A   G   G   T   AG  A   G   T

(true consensus sequences from Watson, "The Molecular Biology of the Gene", '87)
[Figure: tree adjacency matrix, with the class variable marked]
Irrelevant variables
• 61 original variables + 60 "noise" variables
[Figures: learned structure on the original data vs. the data augmented with irrelevant variables]
Accelerated tree learning (Meila '99)
• Running time for the tree learning algorithm ~ n^2 N
• Quadratic running time may be too slow. Example: document classification
  • document = data point  →  N = 10^3 - 10^4
  • word = variable  →  n = 10^3 - 10^4
  • sparse data  →  # words per document ≈ s, with s << n, N
• Can sparsity be exploited to create faster algorithms?
Sparsity
[Figure: a sparse binary data matrix; each row has at most s non-zero entries, stored as a linked list of length s]
• assume a special value "0" that occurs frequently
• sparsity s = # non-zero variables in each data point, with s << n, N
• Idea: do not represent / count zeros
Presort mutual informations
Theorem (Meila '99). If v, v' are variables that do not co-occur with u in the data (i.e. N_uv = N_uv' = 0), then N_v > N_v' ⟹ I_uv > I_uv'.
• Consequences
  • sort the N_v ⟹ for each u, all edges uv with N_uv = 0 are implicitly sorted by I_uv
  • these edges need not be represented explicitly
  • construct a "black box" that outputs the next "largest" edge
The black box data structure
• for each variable v: an explicit list of the u with N_uv > 0, sorted by I_uv, and a (virtual) list of the u with N_uv = 0, sorted by N_v
• an F-heap of size ~ n merges the per-variable lists and outputs the next edge uv
• Total running time: n log n + s^2 N + nK log n   (standard algorithm: n^2 N)
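A simplified sketch of the sparsity idea (not the full F-heap construction): accumulate counts only over the s non-zero entries of each data point, so co-occurring pairs cost ~ s^2 N, and rely on the presorting theorem for all remaining edges. The representation (a set of non-zero variable indices per data point) is an assumption for illustration.

```python
from collections import Counter
from itertools import combinations

def sparse_counts(data_points):
    """data_points: iterable of sets of non-zero variable indices (length <= s each).
    Returns single counts N_v and co-occurrence counts N_uv for co-occurring pairs only."""
    Nv, Nuv = Counter(), Counter()
    for nz in data_points:
        Nv.update(nz)
        Nuv.update(combinations(sorted(nz), 2))      # ~ s^2 pairs per data point
    return Nv, Nuv

# Edges uv with N_uv == 0 are never stored: by the presorting theorem, for a fixed u
# they are already ordered by N_v, so the spanning tree algorithm can draw the
# "next largest" edge lazily by merging this virtual list with the explicit one.
```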
Experiments - sparse binary data
• N = 10,000
• s = 5, 10, 15, 100
[Plot: running time of the standard vs. accelerated algorithm]
Remarks
• Realistic assumption
• Exact algorithm with provably efficient time bounds
• Degrades slowly to the standard algorithm if the data are not sparse
• General:
  • non-integer counts
  • multi-valued discrete variables
Bayesian learning of trees (Meila & Jaakkola '00)
• Problem
  • given a prior distribution over trees P_0(T) and data D = { x^1, . . . x^N }
  • find the posterior distribution P(T | D)
• Advantages
  • incorporates prior knowledge
  • regularization
• Solution: Bayes' formula
  P(T | D) = (1/Z) P_0(T) ∏_{i=1..N} T(x^i)
• . . . but practically hard
  • a distribution over structure E and parameters θ_E is hard to represent
  • computing Z is intractable in general
  • exception: conjugate priors
Decomposable priors
• want priors that factor over tree edges:  P_0(T) ∝ ∏_{uv ∈ E} f(u, v, θ_{u|v})
• prior for structure E:  P_0(E) ∝ ∏_{uv ∈ E} β_uv
• prior for tree parameters:  P_0(θ_E) = ∏_{uv ∈ E} D(θ_{u|v}; N'_uv)
  • a (hyper-)Dirichlet with hyper-parameters N'_uv(x_u, x_v), u, v ∈ V
  • the posterior is also Dirichlet, with hyper-parameters N_uv(x_u, x_v) + N'_uv(x_u, x_v), u, v ∈ V
Decomposable posterior
• Posterior distribution:  P(T | D) ∝ ∏_{uv ∈ E} W_uv,  with  W_uv = β_uv D(θ_{u|v}; N'_uv + N_uv)
  • factored over edges
  • same form as the prior
• Remains to compute the normalization constant Z
The Matrix Tree Theorem
(discrete case: graph theory; continuous case: Meila & Jaakkola '99)
• If  P_0(E) = (1/Z) ∏_{uv ∈ E} β_uv,  β_uv ≥ 0,  and M(β) is the matrix with entries
    M_vv = Σ_{v'} β_{vv'}  (diagonal),   M_uv = −β_uv  (off-diagonal),
  with one row and the corresponding column removed, then  Z = det M(β).
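A small sketch of using the matrix tree theorem to compute Z in ~ n^3 operations; the numpy-based helper is an assumption for illustration.

```python
import numpy as np

def tree_normalizer(beta):
    """Z = sum over spanning trees of prod_{uv in E} beta_uv.
    beta: symmetric (n, n) array of non-negative edge weights with zero diagonal."""
    M = np.diag(beta.sum(axis=1)) - beta     # M_vv = sum_v' beta_vv', M_uv = -beta_uv
    return np.linalg.det(M[1:, 1:])          # delete one row/column, take determinant

# Sanity check: with all beta_uv = 1 on 3 variables, Z = 3 (the 3 spanning trees of K_3).
print(tree_normalizer(np.ones((3, 3)) - np.eye(3)))   # -> 3.0
```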
Remarks on the decomposable prior
• It is a conjugate prior for the tree distribution
• It is tractable
  • defined by ~ n^2 parameters
  • computed exactly in ~ n^3 operations
  • posterior obtained in ~ n^2 N + n^3 operations
  • derivatives w.r.t. parameters, averaging, . . . ~ n^3
• Mixtures of trees with decomposable priors: MAP estimation with the EM algorithm is tractable
• Other applications
  • ensembles of trees
  • maximum entropy distributions on trees
So far . . .
• Trees and mixtures of trees are structured statistical models
• Algorithmic techniques enable efficient learning
  • mixtures of trees
  • accelerated algorithm
  • matrix tree theorem & Bayesian learning
• Examples of usage
  • structure learning
  • compression
  • classification
Generative models and discrimination
• Trees are generative models
  • descriptive
  • can perform many tasks, but suboptimally
• Maximum Entropy discrimination (Jaakkola, Meila, Jebara '99)
  • optimize for specific tasks
  • use generative models
  • combine simple models into ensembles
  • complexity control by an information-theoretic principle
• Discrimination tasks
  • detecting novelty
  • diagnosis
  • classification
Bridging the gap
[Diagram: tasks positioned between descriptive learning and discriminative learning]
Future . . .
• Tasks have structure
  • multi-way classification
  • multiple indexing of documents
  • gene expression data
  • hierarchical, sequential decisions
• Learn structured decision tasks
  • sharing information between tasks (transfer)
  • modeling dependencies between decisions