Hierarchical Classification with a Small Set of Features Yongwook Yoon Apr 3, 2003 NLP Lab., POSTECH
Contents • Document Classification • Bayesian Classifier • Feature Selection • Experiment & Results • Candidate Baseline System • Future Work
Hierarchical Classification • A massive number of documents is produced daily • From the WWW, or • From intranet environments, such as within an enterprise • Topic hierarchies already exist for those documents • We need a classifier capable of classifying documents hierarchically according to those topics • History of hierarchical collections • MEDLINE (medical literature maintained by NLM) • Patent document collections • Recently, Web search sites such as Yahoo, Infoseek, and Google
Simple vs. Hierarchical classification • [Diagram: a flat class space with a single root over documents D1 … Dn, versus a topic hierarchy (e.g., Root → Business → Grain, Oil) with documents attached to the leaf topics]
Why Hierarchy? • The simplistic approach breaks down • A flattened class space with only one root • For a large corpus, however, • There are hundreds of classes and thousands of features • The computational cost is prohibitive • The resulting classifier is very large • Many thousands of parameters lead to overfitting of the training data • It also loses the intuition that topics close to each other in the hierarchy have much more in common • In hierarchical classification, • Feature selection is a useful tool for dealing with these issues
Bayesian Classifier • A Bayesian classifier is simply a Bayesian network applied to a classification domain • It contains a node C for the unobservable class variable • And a node Xi for each of the features • Given a specific instance x = (x1, x2, … , xn), • It gives the class probability P(C=ck | X=x) for each possible class ck • Bayes-optimal classification • Select the class ck for which the probability P(ck | x) is maximized • c* = argmax_ck P(ck | x) = argmax_ck P(x | ck) P(ck)
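The decision rule above is easy to make concrete. Below is a minimal sketch, assuming the likelihoods P(x | ck) for the instance at hand and the priors P(ck) have already been estimated by some model; the class names and probabilities are purely illustrative.

```python
def bayes_optimal_class(likelihoods, priors):
    """Return the class ck maximizing P(x | ck) * P(ck).

    likelihoods: dict mapping class -> P(x | class) for one instance x
    priors:      dict mapping class -> P(class)
    """
    return max(priors, key=lambda ck: likelihoods[ck] * priors[ck])

# Hypothetical numbers for a two-class example (not taken from the paper):
print(bayes_optimal_class({"grain": 0.04, "oil": 0.05},
                          {"grain": 0.7, "oil": 0.3}))   # -> "grain"
```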
Bayesian Network (1/2) • Naïve Bayesian classifier • Very restrictive assumption: independence • P(X | C) = Πi P(Xi | C) • Simple and unrealistic, but still widely used • More complex forms are more expressive • Augment the network with some dependencies between features • But inducing an optimal Bayesian classifier is NP-hard
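Under the independence assumption, the likelihood factorizes into per-feature terms that can be estimated by simple counting. The following is a minimal naive Bayes sketch with Laplace smoothing over a bag-of-words representation, working in log space to avoid underflow; the helper names and toy representation are assumptions, not the paper's code.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: parallel list of class labels.
    Returns log priors and Laplace-smoothed log conditionals P(word | class)."""
    vocab = {w for d in docs for w in d}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)              # class -> word -> count
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    totals = {c: sum(word_counts[c].values()) for c in class_counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_cond = {c: {w: math.log((word_counts[c][w] + alpha) /
                                (totals[c] + alpha * len(vocab)))
                    for w in vocab}
                for c in class_counts}
    return log_prior, log_cond

def predict(tokens, log_prior, log_cond):
    """argmax_c [ log P(c) + sum_i log P(xi | c) ], ignoring unseen words."""
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_cond[c][w] for w in tokens if w in log_cond[c]))
```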
Bayesian Network (2/2) • Two main solutions to this problem • Tree augmented network (TAN) • Restricts each node to have at most one additional parent • An optimal classifier can be found in time quadratic in the number of features • KDB algorithm • Each node has at most k parents • Chooses as the parents of a node Xi • The k other features that Xi is most dependent on • Using a metric of class-conditional mutual information, I(Xi; Xj | C) • Computational complexity • Network construction: quadratic in the total # of features • Parameter estimation (i.e., conditional probability table construction): exponential in k
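The parent-selection step of KDB can be sketched by estimating the class-conditional mutual information from counts and giving each feature its k most dependent peers as extra parents. This is only an illustration of the metric, assuming binary feature vectors; it omits the node ordering and the parameter estimation of the full algorithm.

```python
import math
from collections import Counter

def cond_mutual_info(X, y, i, j):
    """Estimate I(Xi; Xj | C) from feature vectors X and class labels y."""
    n = len(y)
    joint = Counter((x[i], x[j], c) for x, c in zip(X, y))   # counts of (xi, xj, c)
    pi = Counter((x[i], c) for x, c in zip(X, y))
    pj = Counter((x[j], c) for x, c in zip(X, y))
    pc = Counter(y)
    mi = 0.0
    for (xi, xj, c), cnt in joint.items():
        p_joint = cnt / n                                    # P(xi, xj, c)
        # P(xi, xj | c) / (P(xi | c) * P(xj | c))
        ratio = (cnt / pc[c]) / ((pi[(xi, c)] / pc[c]) * (pj[(xj, c)] / pc[c]))
        mi += p_joint * math.log(ratio)
    return mi

def kdb_parents(X, y, k):
    """For each feature index i, pick the k other features it is most dependent on."""
    d = len(X[0])
    return {i: [j for _, j in sorted(((cond_mutual_info(X, y, i, j), j)
                                      for j in range(d) if j != i),
                                     reverse=True)[:k]]
            for i in range(d)}
```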
Feature Selection (1/2) • We have a feature for every word that appears in any document in the corpus • The computation would be prohibitive even if the TAN or KDB algorithm is applied • So feature selection is imperative • But simply reducing the features, without combining it with the hierarchical classifier, does not yield high performance • Because the set of features required to classify the topics varies widely from one node to another • Ex) distinguishing (agriculture vs. computer) at one node vs. (corn vs. wheat) at another • Adopt a method based on information-theoretic measures • It determines a subset of the original domain features that seems to best capture the class distribution in the data
Feature Selection (2/2) • Formally, the cross-entropy metric between two distributions μ and σ is defined as D(μ, σ) = Σx μ(x) log( μ(x) / σ(x) ), the “distance” between μ and σ • The algorithm • For each feature Xi, compute the expected cross-entropy δi = P(Xi) D(P(C|X), P(C|X-i)) • where X-i is the set of all domain features except Xi • Then eliminate the features Xi for which δi is minimized • This process can be iterated to eliminate as many features as desired • To compute P(C|X), the algorithm uses the Naïve Bayes model for speed and simplicity
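A rough sketch of the elimination loop, under the slide's simplifications: the posterior P(C | x) is approximated with naive Bayes, the expectation weighted by P(Xi) is replaced by an average over the training documents, and the feature with the smallest expected cross-entropy is dropped on each round. The posterior(doc, features) helper is an assumption here, not part of the original algorithm's code.

```python
import math

def kl(p, q):
    """Cross-entropy 'distance' D(p, q) = sum_c p(c) * log(p(c) / q(c))."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p if p[c] > 0)

def eliminate_features(docs, features, posterior, n_keep):
    """posterior(doc, feature_set) is assumed to return a dict P(class | doc)
    computed (e.g. with naive Bayes) from only the given features."""
    features = set(features)
    while len(features) > n_keep:
        deltas = {}
        for f in features:
            reduced = features - {f}
            deltas[f] = sum(kl(posterior(d, features), posterior(d, reduced))
                            for d in docs) / len(docs)
        features.remove(min(deltas, key=deltas.get))   # drop the least informative
    return features
```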
Experiment • The source corpus • The Reuters-22173 dataset • It does not have a pre-determined hierarchical classification scheme • But each document can have multiple labels • Goal • To construct two hierarchically classified document collections • Refinement of the corpus • Select two subsets from the collection, named “Hier1” and “Hier2” • For each document, assign only one major topic and one minor topic • All documents are grouped together at the top level, named “Current Business” • Next, each of these datasets was split 70%/30% into training and test sets
Experimental Methodology • Learning phase • Feature selection for the first tier of the hierarchy • Using just the major topics as classes • Next, build a probabilistic classifier with the reduced feature set • For each major topic, a separate round of probabilistic feature selection is employed • Finally, construct a separate classifier for each major topic on its appropriately reduced feature set • Testing phase • Test documents are classified by the first-level classifier • and then sent down to the chosen second-level classifier • In the flat classification scheme, • Feature selection is still done, but only one classifier is induced
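The two-level methodology can be summarized as a short pipeline sketch. select_features and train_classifier are hypothetical helpers standing in for the cross-entropy selection and naive Bayes sketches above, and each training document is assumed to carry its tokens plus major and minor topic labels; none of these names come from the paper.

```python
from collections import namedtuple

Doc = namedtuple("Doc", ["tokens", "major", "minor"])

def train_hierarchy(train_docs, n_features, select_features, train_classifier):
    """select_features(docs, label_fn, n) -> reduced feature set
    train_classifier(docs, features, label_fn) -> object with .predict(tokens)"""
    # First tier: feature selection over the major topics, then one classifier.
    top_feats = select_features(train_docs, lambda d: d.major, n_features)
    top_clf = train_classifier(train_docs, top_feats, lambda d: d.major)
    # Second tier: a separate selection round and classifier per major topic.
    second = {}
    for major in {d.major for d in train_docs}:
        subset = [d for d in train_docs if d.major == major]
        feats = select_features(subset, lambda d: d.minor, n_features)
        second[major] = train_classifier(subset, feats, lambda d: d.minor)
    return top_clf, second

def classify(doc, top_clf, second):
    major = top_clf.predict(doc.tokens)        # route through the first level...
    minor = second[major].predict(doc.tokens)  # ...then down the chosen subtree
    return major, minor
```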
Results – Baseline • Without employing any probabilistic feature selection • A very large number of features • The hierarchical method never performs better than the simple flat method • It also allows the more expressive KDB algorithm to overfit the training data • This leads to the need for feature selection
Results – with feature selection (1/2) • Reduce the feature set at each node • From 1258 features to 20, and then to 10 • Recall, however, that a potentially very different set of 10 or 20 features is selected at each node in the hierarchy • As a whole, the hierarchical classifier actually examines a much larger set of features
Results – with feature selection (2/2) • Overall improvement in accuracy over the baseline results • And of the hierarchical method over the flat method • Only one exception, in the case of (Hier2, KDB-1) • This classifier was trained on only 24 instances, so it is quite possibly a statistical anomaly • The Hier1 dataset, which has more instances for induction, does not show such a problem
Results – analysis cont’d • In the case of (Flat, #features=50), • The accuracy is very close to the “Hier” cases such as TAN and KDB • But the complexity of the classifier-learning algorithms is not comparable • Quadratic in the # of features (10² vs. 50²) • Conclusion • Feature selection applied together with hierarchical classification yields far better performance than the simple (flat) classification scheme • Simple Bayesian networks such as TAN or KDB combine well with the hierarchical classification scheme
Candidate System • Requirements • Hierarchical classification • Feedback of classification results – online adaptive learning • Support for various types of data – heterogeneous data sources • Experiment environment • Modeled after a real deployed system, although running on different hardware • Shares the same specification of system structure and functions • Training and test datasets come from the above real system
Enterprise Information Systems - KT • KMS (Knowledge Management System) • Systematically manages the various knowledge produced within the company and increases its utilization, contributing to business activities • Each employee registers the information (knowledge) they produce in the system, and also searches for and extracts the information needed for their work • Knowledge candidates: documents, meeting materials, business know-how, proposals, literature, etc. • Management scheme: a knowledge map organized around 569 HR job categories • Integrated Document System (Groupware) • Systematically manages documents, e-mail, messages, and department/personal information circulated within the company • For documents, every function from drafting, approval, production, and delivery to storage and retrieval is handled in one unified system • Jointly developed through a strategic alliance with Microsoft – web interface • In addition, per-department Q&A boards and an employee-to-employee messenger
Future Work • Target baseline system • Basic hierarchical classification with some real data • Research issues (+α) • Hierarchy • Utilize other implemented systems (BOW toolkit) • Online learning • Efficient and appropriate algorithms for adaptive learning • Bayesian online perceptron, online Gaussian processes • Automatic expansion and removal of the lowest level of subtopics over time • Pre-processing of the raw corpus • Integration of heterogeneous data types • Text, tables, images, e-mails, specially formatted texts, etc.