
Structure Extension to Logistic Regression: Discriminative Parameter Learning of Belief Net Classifiers






Presentation Transcript


  1. Structure Extension to Logistic Regression: Discriminative Parameter Learning of Belief Net Classifiers
  Russell Greiner* and Wei Zhou
  University of Alberta / University of Waterloo   *greiner@cs.ualberta.ca
  [Slide figure: a belief net over C, E1, E2, …, Ek plus nodes X, W, Y, Z, Q, alongside a small table of labeled training instances.]
  • Belief Net B = ⟨V, A, Θ⟩
    • Nodes V (variables)
    • Arcs A (dependencies)
    • Parameters Θ (conditional probabilities)
  • Learner's task, if the goal is…
    • Generative (learn the distribution): B^(ML) = argmax_B 1/|S| Σ_i ln P_B(c_i, e_i); ideally, minimize KL(truth, B)
    • Discriminative (learn a classifier): B* = argmin_B err(B) = argmin_B Σ_i δ(c_i ≠ h_B(e_i)), which motivates B^(MCL) = argmax_B 1/|S| Σ_i ln P_B(c_i | e_i)
    • where err(B) = Σ_{c,e} P(c,e) · δ(c ≠ h_B(e))
    • [Slide figure: the truth P(c,e) generates the labeled sample; the Learner produces a belief net B; the Performer uses B either as a Distribution (scored by KL(truth, B)) or as a Classifier h_B(e) (scored by err(B)).]
  • Our specific task:
    • Given: a structure (nodes, arcs… not parameters) and a labeled data sample
    • Find: parameters that maximize conditional likelihood over All_{G,γ} = { Θ ∈ ParamFor(G) | θ_{d|f} ≥ γ for every d|f }, i.e. Θ*_{G,γ} = argmax { CL(Θ) | Θ ∈ All_{G,γ} }
  • Computational Complexity:
    • NP-hard to find the parameter values that minimize the objective
    • … even restricted to All_{G,γ}, for γ = O(1/N)
  • Sample Complexity:
    • Given a structure G = ⟨V, A⟩ with |Θ| = K parameters over |V| = N variables:
    • for any ε, δ > 0, let Θ̂ be the parameters that optimize the conditional likelihood for a sample S of the stated size (the sample-size bound is not recoverable from this transcript).
    • Then, with probability at least 1 − δ, LCL(Θ̂) is within ε of LCL(Θ*_{G,γ}).
    • Proof idea: [slide figure: a construction over variables X1, X2, …, XN, C1, …, CR and D2, D3, …, DK]
  • Notes:
    • Similar bounds hold when dealing with err() as with LCL().
    • [Dasgupta, 1997] proves complete tuples are sufficient wrt Likelihood; same O(·) as our bound, ignoring ln²(·) and ln³(·) terms.
    • The γ is unavoidable here… (unlike the likelihood case [ATW91]).
  • Other Algorithms…
    • When given complete data: compare to OFE (Observed Frequency Estimate), a trivial algorithm that maximizes Likelihood. For example, with 5 training instances of which 3 have "C=1": θ_{C=1} = 3/5; if 2 of those 3 have "E1=1, C=1": θ_{E1=1|C=1} = 2/3, and similarly for θ_{E1=1|C=0}, …
    • When given incomplete data: EM (Expectation Maximization), and APN [BKRK97], which hillclimbs in (unconditional) Likelihood.
  • Relation to Logistic Regression:
    • ELR on a NaïveBayes structure ≡ standard Logistic Regression.
    • ELR deals with arbitrary structures and incomplete data.
  • ELR Learning Algorithm:
    • Input: structure, labeled data sample
    • Output: parameters Θ
    • Goal: find Θ*_{G,γ}
    • As the task is NP-hard… hillclimb! Change each θ_{d|f} to improve the conditional likelihood. How?
  • How to hillclimb?
    • Not by changing the {θ_{d|f}} directly, because of the constraints: (a) θ_{d|f} ≥ 0, and (b) Σ_d θ_{d|f} = 1.
    • So… use "softmax" terms: θ_{d|f} = exp(β_{d|f}) / Σ_{d'} exp(β_{d'|f}), and climb along the β_{d|f}'s!
    • Need the derivative ∂LCL/∂β_{d|f}. (A minimal code sketch of this hill-climbing loop, for the NaïveBayes case, follows this slide.)
  • Optimizations:
    • Initialize using OFE values (not random), the "plug-in parameters".
    • Line search and conjugate gradient ([Minka, 2001] confirms these are effective for Logistic Regression).
    • The derivative is 0 when D and F are d-separated from E and C, so those parameters can be ignored!
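The hill-climbing recipe above (softmax reparameterization, OFE initialization, gradient ascent on the conditional log-likelihood) can be made concrete for the simplest case: a NaïveBayes structure over binary attributes, where ELR coincides with logistic regression. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names (`elr_naive_bayes`, `ofe_init`), the learning rate, and the Laplace smoothing constant are illustrative, and it uses fixed-step gradient ascent rather than the line-search and conjugate-gradient optimizations mentioned on the slide.

```python
import numpy as np

def softmax_log(beta, axis=-1):
    """Turn unconstrained 'softmax' parameters beta into log-probabilities."""
    return beta - np.logaddexp.reduce(beta, axis=axis, keepdims=True)

def ofe_init(X, y, n_classes, alpha=1.0):
    """OFE ('plug-in') initialization with Laplace smoothing, stored as softmax
    parameters, for a NaiveBayes net over binary attributes X in {0,1}."""
    n, d = X.shape
    beta_c = np.log(np.bincount(y, minlength=n_classes) + alpha)
    beta_e = np.zeros((n_classes, d, 2))
    for c in range(n_classes):
        p1 = (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2.0 * alpha)
        beta_e[c, :, 1] = np.log(p1)
        beta_e[c, :, 0] = np.log(1.0 - p1)
    return beta_c, beta_e

def elr_naive_bayes(X, y, n_classes, iters=200, lr=0.1):
    """Hill-climb LCL = mean_n log P(c_n | e_n) by gradient ascent on the softmax
    parameters beta, which keeps each CPtable row non-negative and normalized."""
    n, d = X.shape
    beta_c, beta_e = ofe_init(X, y, n_classes)
    onehot_y = np.eye(n_classes)[y]                     # (n, n_classes)
    obs = np.stack([1 - X, X], axis=2).astype(float)    # (n, d, 2): one-hot of x_{n,i}
    for _ in range(iters):
        log_pc = softmax_log(beta_c)                    # log P(C)
        log_pe = softmax_log(beta_e, axis=2)            # log P(E_i | C)
        # log P(c, e_n) for every instance n and class value c
        log_joint = (log_pc[None, :]
                     + X @ log_pe[:, :, 1].T
                     + (1 - X) @ log_pe[:, :, 0].T)     # (n, n_classes)
        log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resid = onehot_y - np.exp(log_post)             # I[c = c_n] - P(c | e_n)
        # dLCL/dbeta_c[c]     = mean_n resid[n, c]
        # dLCL/dbeta_e[c,i,v] = mean_n resid[n, c] * (obs[n,i,v] - theta_{v|c,i})
        grad_c = resid.mean(axis=0)
        grad_e = (np.einsum('nc,ndv->cdv', resid, obs) / n
                  - resid.mean(axis=0)[:, None, None] * np.exp(log_pe))
        beta_c += lr * grad_c
        beta_e += lr * grad_e
    return beta_c, beta_e
```

For prediction, compute log_joint for a test instance in the same way and take the argmax over class values, i.e. h_B(e). Initializing from the OFE values rather than random ones follows the slide's "plug-in parameters" advice; in this sketch it simply means the climb starts from the maximum-likelihood point.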

  2. Empirical Results
  This work was partially funded by NSERC and by Syncrude.
  • NaïveBayes (NB) structure:
    • Attributes are independent, given the Class.
    • [Slide figure: class node C with an arc to each attribute E1, E2, …, Ek.]
  • TAN structure:
    • Link from the Class node to each attribute, plus a tree structure connecting the attributes.
    • Permits dependencies between attributes.
    • Efficient learning algorithm and classification algorithm; works well in practice… [FGG'97]
  • Complete Data:
    • Every attribute of every instance is specified.
    • 25 datasets: 23 from UCI (continuous + discrete), 2 from the SelectiveNB study (used by [FGG'96]).
    • All 25 domains: points below y = x mean NB+ELR is better than NB+OFE; bars are 1 standard deviation. ELR better than OFE! (p < 0.005)
    • TAN can deal with dependent attributes, NB cannot… but ELR is designed to help classify, OFE is not.
    • NB does poorly on CORRAL (an artificial dataset, a function of 4 attributes). In general: NB+ELR ≈ TAN+OFE.
    • TAN+ELR did perfectly on CORRAL! TAN+ELR ≥ NB+ELR, and TAN+ELR > TAN+OFE (p < 0.025).
    • ELR-OFE: initialize the parameters using OFE values, then run ELR (Chess domain).
  • Missing Data:
    • So far, each dataset was complete: it included the value of every attribute in each instance.
    • Now… some omissions: omit attribute values with probability 0.25, "Missing Completely At Random" (MCAR).
    • OFE works only with COMPLETE data. Given INCOMPLETE data: EM (Expectation Maximization) and APN (Adaptive Probabilistic Networks [BKRK97]).
    • Experiments using NaïveBayes and TAN, with 25% MCAR omissions.
    • Results: NB+ELR better than NB+EM and NB+APN (p < 0.025); TAN+ELR ≈ TAN+EM ≈ TAN+APN; the TAN algorithm is problematic given incomplete data.
  • Correctness of Structure:
    • Compare NB+ELR to NB+OFE wrt increasingly "non-NB data".
    • [Slide figures: models #0, #1, #2 over C and E1 … E4, with progressively more attributes joined.]
    • P(C) = 0.9, P(Ei|C) = 0.2, P(Ei|¬C) = 0.8; … then P(Ei|E1) = 1.0, P(Ei|¬E1) = 0.0 when "joined", for model #2, model #3, …
    • Measured classification error; k = 5, 400 records, …
    • (A small code sketch of this synthetic-data generation and the MCAR omissions follows this slide.)
  • Summary of Results
  • Why does ELR work so well, vs OFE (complete data) and vs EM/APN (incomplete data), for a fixed simple structure (NB, TAN)?
    • A "generative" learner (OFE/APN/EM) is very constrained by the structure… so if the structure is wrong, it cannot do well!
    • A "discriminative" learner (ELR) is not as constrained!
  • Analysis:
    • OFE is guaranteed to find parameters that are optimal wrt Likelihood for the structure G.
    • If G is incorrect… optimal-for-G is bad wrt the true distribution ⇒ wrong answers to queries.
    • … ELR is not as constrained by G… it can do well even when the structure is incorrect!
    • ELR is useful, as the structure is often incorrect: to avoid overfitting, one uses a constrained set of structures (NB, TAN, …).
    • See Discriminative vs Generative learning… complete data, incomplete data.
  • Other Studies
  • Nearly correct structure:
    • Given data: use PowerConstructor [CG02, CG99] to build the structure; then use OFE vs ELR to find the parameters.
    • For Chess: TAN+ELR > TAN+OFE and NB+ELR > NB+OFE. [Insert fig 2b from paper!]
  • Future work:
    • Now: assume a fixed structure. Next: learn the STRUCTURE as well… discriminatively.
    • It is NP-hard to learn LCL-optimal parameters for an arbitrary structure with incomplete data. What is the complexity given complete data? … a simple structure?
  • Contributions:
    • Motivate/describe discriminative learning of BN parameters.
    • Complexity of the task (NP-hard; polynomial sample size).
    • An algorithm for the task, ELR: complete or incomplete data, arbitrary structures, soft-max version, optimizations, …
    • Empirical results showing ELR works, plus a study of why…
    • Clearly a good idea… it should be used for Classification Tasks!
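To make the "Correctness of Structure" and missing-data setups above concrete, here is a small data-generator sketch under stated assumptions. The probabilities (P(C)=0.9, P(Ei|C)=0.2, P(Ei|¬C)=0.8, joined attributes copying E1, 25% MCAR omissions) come from the slide; the number of attributes, exactly which attributes are joined in each model, and the function names are illustrative choices, not the paper's protocol.

```python
import numpy as np

def sample_non_nb_data(n_records=400, n_attrs=4, n_joined=0, seed=None):
    """Sample (attributes, class) from the slide's synthetic models:
    P(C)=0.9, P(Ei|C)=0.2, P(Ei|~C)=0.8.  With n_joined=k, the k attributes
    after E1 are 'joined' to E1 (copied deterministically, so P(Ej|E1)=1.0 and
    P(Ej|~E1)=0.0), which violates the NaiveBayes independence assumption."""
    rng = np.random.default_rng(seed)
    c = (rng.random(n_records) < 0.9).astype(int)
    p_e = np.where(c[:, None] == 1, 0.2, 0.8)            # P(Ei=1 | C)
    e = (rng.random((n_records, n_attrs)) < p_e).astype(int)
    for j in range(1, 1 + n_joined):                     # joined attributes copy E1
        e[:, j] = e[:, 0]
    return e, c

def mcar_omit(X, p_missing=0.25, seed=None):
    """Hide attribute values Missing-Completely-At-Random with probability 0.25,
    as in the slide's missing-data experiments; NaN marks an omitted value."""
    rng = np.random.default_rng(seed)
    X_missing = X.astype(float)
    X_missing[rng.random(X.shape) < p_missing] = np.nan
    return X_missing
```

For example, `sample_non_nb_data(n_joined=2)` would play the role of "model #2", and `mcar_omit` reproduces the 25% MCAR condition on any complete attribute matrix.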
  • ELR was relatively slow:
    • About 0.5 sec/iteration for small data, … minutes for large data; much slower than OFE; roughly comparable to APN/EM.
    • … the same algorithm handles complete and incomplete data.
    • … ELR used unoptimized Java code.
  • Correct structure, incomplete data:
    • Consider the Alarm [BSCC89] structure (+ parameters): 36 nodes, 47 links, 505 parameters.
    • Multiple queries: 8 variables as the pool of query variables, 16 other variables as the pool of evidence variables.
    • Each query: 1 query variable; each evidence variable is included with probability ½, so expect 16/2 = 8 evidence variables.
    • NOTE: a different query variable for different queries! (Like multi-task learning.)
    • (A sketch of this query-sampling protocol appears below.)
    • Results: [Insert fig 6c from paper!] (figure labels: C=1, E=1|C=1)
  • Trade-Off:
    • Most BN-learners spend LOTS of time learning the structure and little time learning the parameters.
    • Why not… use a SIMPLE (quick-to-learn) structure and focus the computational effort on getting good parameters?
  • Related Work:
    • Lots of work on learning BNs… most of it is Generative learning.
    • Some discriminative learners exist, but most learn the STRUCTURE discriminatively, then the parameters generatively!
    • See also Logistic Learning: [GGS'97] learns parameters discriminatively, but with different queries and an L2-norm objective (not LCL), and it needed 2 types of data samples, …
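For the Alarm experiment above, the multi-query evaluation protocol (one query variable drawn from an 8-variable pool, each of 16 evidence variables included independently with probability ½) can be sketched as follows. This is an illustrative sketch, not the paper's code; the function name, the variable pools, and the `observed` dictionary of sampled values are assumptions.

```python
import random

def sample_alarm_query(query_pool, evidence_pool, observed, rng=random):
    """Build one query in the slide's protocol: pick a single query variable
    from the 8-variable pool, then include each of the 16 evidence variables
    independently with probability 1/2 (about 8 evidence values on average)."""
    query_var = rng.choice(query_pool)
    evidence = {v: observed[v] for v in evidence_pool if rng.random() < 0.5}
    return query_var, evidence
```

Different calls pick different query variables, which is the "multi-task learning" flavor noted on the slide.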
