Study of Bayesian Network Classifiers
Huang Kaizhu
Supervisors: Prof. Irwin King, Prof. Lyu Rung Tsong Michael
Markers: Prof. Chan Lai Wan, Prof. Wong Kin Hong
Outline
• Background
  • What is a Bayesian network?
  • How can Bayesian networks be used as classifiers?
  • Why choose Bayesian networks?
  • What is the problem of learning Bayesian networks?
• My main work
  • Large-Node Chow-Liu Tree
  • Maximum-likelihood Large-Node-Bounded Semi-Naïve BN
• Future work
• Conclusion
Background
• What is a Bayesian Network (BN)?
  • A BN is composed of a "structure" component G and a "parameter" component Θ.
  • G = (V, E) is a directed acyclic graph with node set V and edge set E. The nodes represent the attributes; the edges between the nodes represent the dependence relationships between them.
  • Θ is a set of conditional probability tables, one per node.
  • Together they encode the following joint probability over the nodes (X1, X2, …, Xn):
    P(X1, X2, …, Xn) = ∏i P(Xi | Pa(Xi)), where Pa(Xi) denotes the parents of Xi in G.
Background (con't)
Example of a Bayesian network (figure showing the structure component and the parameter component).
The Bayesian network above encodes the following probability relationship:
P(F, B, L, D, H) = P(F) P(B) P(L | F) P(D | F, B) P(H | D)
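As a sanity check on this factorization, here is a small Python sketch; the CPT values below are invented for illustration, since the slide's actual parameter tables are not reproduced here.

```python
# Invented CPTs for the five binary nodes of the example network.
P_F = {True: 0.10, False: 0.90}                       # P(F)
P_B = {True: 0.20, False: 0.80}                       # P(B)
P_L = {True:  {True: 0.70, False: 0.30},              # P(L | F), keyed by F
       False: {True: 0.05, False: 0.95}}
P_D = {(True, True):   {True: 0.90, False: 0.10},     # P(D | F, B)
       (True, False):  {True: 0.60, False: 0.40},
       (False, True):  {True: 0.50, False: 0.50},
       (False, False): {True: 0.01, False: 0.99}}
P_H = {True:  {True: 0.80, False: 0.20},              # P(H | D), keyed by D
       False: {True: 0.10, False: 0.90}}

def joint(f, b, l, d, h):
    """P(F,B,L,D,H) = P(F) P(B) P(L|F) P(D|F,B) P(H|D)."""
    return P_F[f] * P_B[b] * P_L[f][l] * P_D[(f, b)][d] * P_H[d][h]

# One entry of the 2^5-entry joint table:
print(joint(True, False, True, True, True))   # 0.1*0.8*0.7*0.6*0.8 = 0.02688
```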
Background (con't)
• How can a BN be a classifier?
  • First, use a BN to model the dataset.
  • Then, use the distribution encoded in the BN to do the classification.
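Concretely, with one network per class this is the standard maximum-a-posteriori decision rule (a textbook formulation, not spelled out on the slide):

```latex
c^{*} \;=\; \arg\max_{c} P(c \mid \mathbf{x})
      \;=\; \arg\max_{c}\; P(c) \prod_{i=1}^{n} P\big(x_i \mid \mathrm{Pa}(x_i),\, c\big)
```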
Background (con't)
• Why choose Bayesian networks?
  • A Bayesian network represents the inner dependence relationships between the attributes.
  • The joint probability based on a BN can be written in a decomposable form.
Background (con't)
• What is the problem of learning a Bayesian network?
  • Given a training dataset D = {u1, u2, u3, …, uN} of instances of the attribute set U, find a network B that best matches D.
• What is the difficulty in learning a BN?
  • Generally speaking, the BN optimization problem is intractable [Maxwell1995].
• Two approaches
  • Either constrain the search to a certain restricted class of networks (Naïve BN, Semi-Naïve BN, CL-tree, etc.)
    Q1: Are these restricted classes expressive enough to represent the data?
  • Or adopt heuristic methods on general networks (K2, etc.)
    Q2: Are the heuristic methods on general networks efficient?
    Q3: Are the general networks obtained by heuristic methods redundant for representing the data?
Background (con't)
• Problems in the two approaches
  • Q1: Are these restricted classes expressive enough to represent the data?
    • No; in many cases they are too limited in expressive ability to model the data.
  • Q2: Are the heuristic methods on general networks efficient?
    • No; they have a large search space, which makes them greatly time-consuming.
  • Q3: Is it possible that the general networks obtained by heuristic methods are redundant for representing the data?
    • Yes; these methods sometimes favor more complex structures, which increases the risk of over-fitting.
Possible solutions
• Upgrading solution
  • First obtain a restricted BN, then address the shortcomings caused by the restriction and upgrade the network into a less simple structure.
• Bound solution
  • Adopt a strategy that bounds the complexity of the networks, then find the best structure within that bound. The final network is controlled by a bound parameter.
Work 1: Large-Node Chow-Liu Tree
• Upgrade the Chow-Liu tree (CLT) into the Large-Node Chow-Liu Tree (LNCLT)
• What is the restriction of the CLT?
  • The CLT restricts the network to a tree structure among the variables.
• Shortcoming caused by the restriction
  • The CLT cannot represent the many datasets whose underlying structure is not a tree.
• Observation
  • A "large-node tree" may partially overcome this shortcoming.
• Example: see the figure on the slide; a sketch of the plain CLT construction follows.
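For reference, a minimal sketch of the plain Chow-Liu construction [Chow, Liu1968] that the LNCLT starts from: estimate pairwise mutual information from the data and keep a maximum-weight spanning tree over the attributes (Prim's algorithm here; the data, tie-breaking, and bookkeeping are illustrative, not the thesis code).

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information between two discrete columns."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def chow_liu_tree(data):
    """Maximum-weight spanning tree over the attributes, with pairwise
    mutual information as edge weights (Prim's algorithm)."""
    d = data.shape[1]
    w = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            w[i, j] = w[j, i] = mutual_information(data[:, i], data[:, j])
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        i, j = max(((i, j) for i in in_tree for j in range(d)
                    if j not in in_tree), key=lambda e: w[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges

data = np.array([[0, 0, 0], [1, 1, 0], [1, 1, 1], [0, 0, 1]])
print(chow_liu_tree(data))   # e.g. [(0, 1), (0, 2)]
```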
Work 1: Large-Node Chow-Liu Tree
• A large node, which is a combination of several nodes, may partially relax the tree restriction. In forming a large node, there are two requirements.
• Requirement 1
  • A large node must really behave like a single node, which means the nodes inside a large node should be strongly dependent on each other.
• Requirement 2
  • A large node cannot be too "large", or the probability estimation for that large node will not be reliable.
  • An extreme case is combining all the nodes into one large node; this loses all the advantages of a Bayesian network.
Upgrading the CLT into the Large-Node CLT
• A bounded frequent itemset can satisfy Requirements 1 and 2.
• What is a frequent itemset?
  • It is a set of attributes that frequently occur together.
  • Example (food store): {bread}, {butter}, {bread, butter}
• A frequent itemset with high frequency behaves more like a "large node". --- Requirement 1
• We restrict the number of nodes in a large node to be no greater than a threshold K. --- Requirement 2
• Frequent itemsets can be obtained with the Apriori algorithm [AS1994]; a simplified sketch follows.
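A minimal level-wise sketch of the idea (the real Apriori of [AS1994] uses a smarter candidate join and prune; the baskets below are invented):

```python
from itertools import combinations

def apriori(transactions, min_support, max_size):
    """Level-wise frequent-itemset mining (simplified from [AS1994]).
    Returns {frozenset: support} for itemsets of size <= max_size."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    frequent, level = {}, [frozenset([i]) for i in items]
    for size in range(1, max_size + 1):
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: k / n for c, k in counts.items() if k / n >= min_support}
        frequent.update(current)
        # naive candidate generation: join frequent k-sets into (k+1)-sets
        level = {a | b for a, b in combinations(list(current), 2)
                 if len(a | b) == size + 1}
    return frequent

baskets = [{"bread", "butter"}, {"bread"}, {"bread", "butter", "milk"}]
print(apriori(baskets, min_support=0.5, max_size=2))
# {bread}: 1.0, {butter}: 0.67, {bread, butter}: 0.67
```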
Upgrading the CLT into the Large-Node CLT
• The construction algorithm
  1. Call Apriori [AS1994] to generate the frequent itemsets of size no greater than K. Record all the frequent itemsets together with their supports in a list L.
  2. Draft the CL-tree of the dataset with the CLT algorithm.
  3. Until L is empty, iteratively combine the frequent itemsets that satisfy the combination condition: a father-son or sibling relationship in the current tree.
• Example: assume K = 2. After step 1 we obtain the frequent itemsets {A, B}, {A, C}, {B, C}, {B, E}, {B, D}, {D, E}, with f({B, C}) > f({A, B}) > f({B, E}) > f({B, D}) > f({D, E}) (f denotes the frequency of a frequent itemset). Figure (b) is the CLT from step 2. Then:
  1. {A, C} does not satisfy the combination condition, so filter out {A, C}.
  2. f({B, C}) is the largest and {B, C} satisfies the combination condition, so combine it, giving (c).
  3. Filter out the frequent itemsets that overlap with {B, C}; {D, E} is left.
  4. {D, E} is a frequent itemset and satisfies the combination condition, so combine it, giving (d).
• A runnable sketch of this combination loop follows.
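The sketch below is a simplified reading of the slides (pairs only, i.e. K = 2); the tree and the frequencies are hypothetical but chosen to be consistent with the narrated example.

```python
def build_lnclt(parent, itemsets):
    """Greedy combination step of the LNCLT construction. `parent` maps each
    node to its parent in the drafted Chow-Liu tree (root -> None);
    `itemsets` is a list of (frozenset, frequency) pairs sorted by
    descending frequency, as produced by Apriori."""
    merged, used = [], set()
    for items, _freq in itemsets:
        if items & used:                 # overlaps an accepted large node:
            continue                     # the 'filter' step on the slide
        a, b = sorted(items)
        father_son = parent.get(a) == b or parent.get(b) == a
        sibling = (parent.get(a) is not None
                   and parent.get(a) == parent.get(b))
        if father_son or sibling:        # combination condition
            merged.append(items)
            used |= items
    return merged

# Hypothetical tree and frequencies matching the slide's narration:
parent = {"A": None, "B": "A", "C": "B", "D": "B", "E": "D"}
itemsets = [(frozenset("BC"), 0.9), (frozenset("AB"), 0.8),
            (frozenset("BE"), 0.7), (frozenset("BD"), 0.6),
            (frozenset("DE"), 0.5), (frozenset("AC"), 0.4)]
print(build_lnclt(parent, itemsets))     # the frozensets {B,C} and {D,E}
```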
Experiment
• Database
  • The experiments are conducted on the MNIST handwritten digit database.
  • MNIST consists of:
    • a 60,000-digit training set
    • a 10,000-digit test set
  • Both are 28×28 gray-level digit images.
Experiment
• Preprocessing of the MNIST database
  • Binarization: we use a global binarization method to binarize the MNIST images.
  • Feature extraction [Bakis68]: a 4×4×6 = 96-dimensional feature vector (a hypothetical reconstruction is sketched below).
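The exact [Bakis68] feature recipe is not given on the slides; the sketch below is a hypothetical stand-in that at least reproduces the quoted dimensionality: a 4×4 grid of 7×7 zones with 6 measurements per zone. All six measurements and the threshold are my own choices, not the paper's.

```python
import numpy as np

def binarize(img, threshold=128):
    """Global binarization of a 28x28 gray-level digit.
    The threshold value is an assumption; the slide only says 'global'."""
    return (img >= threshold).astype(np.uint8)

def zone_features(binary_img):
    """Hypothetical zoning features: 4x4 grid of 7x7 zones, 6 numbers per
    zone, giving the 4*4*6 = 96 dimensions quoted on the slide."""
    feats = []
    for r in range(4):
        for c in range(4):
            z = binary_img[7 * r:7 * (r + 1), 7 * c:7 * (c + 1)].astype(int)
            feats.extend([
                z.mean(),                                 # ink density
                z.sum(axis=0).argmax() / 7.0,             # densest column
                z.sum(axis=1).argmax() / 7.0,             # densest row
                np.abs(np.diff(z, axis=0)).sum() / 49.0,  # vertical edges
                np.abs(np.diff(z, axis=1)).sum() / 49.0,  # horizontal edges
                z.trace() / 7.0,                          # diagonal ink
            ])
    return np.array(feats)

img = np.random.default_rng(0).integers(0, 256, size=(28, 28))
print(zone_features(binarize(img)).shape)   # (96,)
```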
Experiment
• We build 10 LNCLTs, one per digit, and classify by selecting the LNCLT with the maximum probability output (see the sketch below).
• We compare the LNCLT with the CLT on:
  • Data fitness --- log-likelihood
  • Recognition rate
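A minimal sketch of that decision rule; the `log_likelihood` method is a hypothetical interface to a trained per-digit LNCLT, not code from the thesis.

```python
import numpy as np

def classify(x, models):
    """Return the digit whose LNCLT gives the feature vector `x` the
    highest (log-)probability; `models[i]` is the LNCLT trained on digit i."""
    return int(np.argmax([m.log_likelihood(x) for m in models]))
```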
Experimental results
Data fitness --- log-likelihood testing (results figure).
Experimental results
Recognition rate (results figure): we randomly selected 1,000 digits from the 10,000-digit test set as the test data, and repeated the test 10 times.
Work 2: Bound approach to the Semi-Naïve Bayesian network
1. A bounded Semi-Naïve Bayesian network (SNB).
2. We reduce the SNB to a network in which every subset has the same number of nodes K, where K is the bound parameter.
3. We use linear programming to do the optimization.
4. Our solution is shown to be sub-optimal.
Comparison between our model and the traditional SNB
• Time cost
  • Our model can be solved in polynomial time.
  • The traditional SNB has an exponential time cost.
• Structure
  • Each large node in our model has the same number of nodes K, where K is the bound parameter.
  • The subsets of the traditional SNB vary in size, and some of them may be very large.
• Performance
  • Our model is shown to be sub-optimal under the bound restriction.
  • For the traditional SNB, there is no evidence showing whether it is optimal or sub-optimal.
Experimental results
• We evaluate our approach on the Tic-tac-toe and Vote datasets from the UCI Machine Learning Repository (results figure).
Future work
• Evaluate our approaches on a larger number of datasets from the UCI Machine Learning Repository.
• Build a Bayesian network that combines the upgrading strategy and the bound strategy.
  • In fact, we are considering whether we can upgrade our bounded SNB into a mixture model of bounded SNBs.
Conclusion
• A dilemma between simple structures and complex structures seems to exist in learning Bayesian network classifiers.
• In this presentation, we examined two approaches to this problem: the Large-Node Chow-Liu Tree, based on the upgrading idea, and the bounded Semi-Naïve Bayesian network, based on the bound idea.
• The experimental results show that both approaches are promising and encouraging.
Main References
• [AS1994] Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of VLDB-94.
• [Chow, Liu1968] Chow, C.K. and Liu, C.N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, pp. 462-467.
• [Friedman1997] Friedman, N., Geiger, D. and Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29, pp. 131-161.
• [Kononenko1991] Kononenko, I. (1991). Semi-naive Bayesian classifier. In Y. Kodratoff (Ed.), Proceedings of the Sixth European Working Session on Learning, pp. 206-219. Springer-Verlag.
• [Maxwell1995] Chickering, D.M. Learning Bayesian networks is NP-complete. In Learning from Data: Artificial Intelligence and Statistics V. Springer-Verlag.
• [Pearl1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
• [Cheng1997] Cheng, J., Bell, D.A. and Liu, W. (1997). Learning belief networks from data: an information theory based approach. In Proceedings of ACM CIKM'97.
• [Cheng2001] Cheng, J. and Greiner, R. (2001). Learning Bayesian belief network classifiers: algorithms and system. In E. Stroulia and S. Matwin (Eds.), AI 2001, LNAI 2056, pp. 141-151.
• [Meretakis, Wuthrich1999] Meretakis, D. and Wuthrich, B. (1999). Extending naive Bayes classifiers using long itemsets. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, pp. 165-174.
• [Srebro2000] Srebro, N. (2000). Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA. http://www.ai.mit.edu
Q & A
Thanks!
Work 2: Bound strategy on the Semi-Naïve BN
• We restrict the semi-naïve network to a structure that is not too complex: the Large-Node Bounded Semi-Naïve BN (Bounded-SNB) model.
• Bounded-SNB model definition (the definition slide is a figure; a plausible reconstruction follows).
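The model-definition figure is not reproduced here. Based on the surrounding slides (disjoint subsets of bounded cardinality whose product gives the joint), the definition presumably has the form:

```latex
% Presumed form of the Bounded-SNB model: the attributes are partitioned
% into m disjoint subsets S_j of cardinality at most K, and the joint
% distribution is the product of the subset marginals.
P(A_1, \ldots, A_n) = \prod_{j=1}^{m} P(S_j),
\qquad S_i \cap S_j = \emptyset \ (i \neq j),
\qquad \bigcup_{j=1}^{m} S_j = \{A_1, \ldots, A_n\},
\qquad |S_j| \le K.
```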
Reducing the Bounded-SNB model
According to Lemma 1, given a bound K, we should not separate the variable set into too many small subsets; otherwise it is likely that some of the subsets can be combined into a new subset whose cardinality is still no greater than K, and the resulting SNB would be coarser than the old one. From this viewpoint, we reduce the search space of the Bounded-SNB to the space of K-regular SNBs, since no SNB coarser than a K-regular SNB can exist within the K-bound. Even though it is reasonable to search for the maximum-likelihood SNB in the K-regular-SNB space, we do not claim that a K-regular SNB is always better than a non-K-regular SNB whose largest cardinality is no more than K: obviously, some non-K-regular SNBs cannot be combined into a K-regular SNB. In this way, we reduce the search space to a sub-space of the K-bounded SNBs.
Differences between our model and the traditional SNB
• Different approach
  • The traditional SNB employs independence testing to find the semi-structure, which incurs an exponential computational cost.
  • Our approach employs linear programming to find the semi-structure, which is polynomial in computational complexity.
• Different performance
  • There is no evidence showing that the traditional SNB finds an optimal or sub-optimal structure.
  • Our approach can maintain a sub-optimal structure.
The K-Bounded-SNB problem
K-Bounded-SNB problem: find the m = [n/K] subsets of cardinality K from the attribute set that satisfy the SNB conditions and maximize the log-likelihood (3), where [x] denotes rounding x to the nearest integer.
Transforming into an Integer Programming problem
(The IP model definition is a figure.) If we relax the integrality constraint (6), x ∈ {0, 1}, into 0 ≤ x ≤ 1, the IP is transformed into a Linear Programming problem, which can be solved in polynomial time (a sketch follows).
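A minimal sketch of that relaxation with made-up numbers: x[s] = 1 selects candidate subset s, the objective is the summed log-likelihood contribution of the selected subsets, and the equality constraints force each attribute into exactly one subset (the SNB partition condition). Everything here except the relax-to-[0, 1] step is illustrative.

```python
import numpy as np
from scipy.optimize import linprog

w = np.array([-1.2, -0.7, -2.1, -0.9])        # log-likelihood of each candidate subset
A_eq = np.array([[1, 0, 1, 0],                # attribute 1 lies in subsets 0, 2
                 [1, 1, 0, 0],                # attribute 2 lies in subsets 0, 1
                 [0, 1, 0, 1],                # attribute 3 lies in subsets 1, 3
                 [0, 0, 1, 1]])               # attribute 4 lies in subsets 2, 3
b_eq = np.ones(4)                             # each attribute covered exactly once

# linprog minimizes, so negate the objective; bounds (0, 1) replace the 0/1
# integrality constraint (6) -- this is the relaxation the slide describes.
res = linprog(c=-w, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 4)
print(res.x)   # may be fractional; a rounding scheme is then needed for a 0/1 selection
```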
Computational complexity analysis
• The traditional SNB has an exponential time cost.
• Our model has a polynomial time cost.