Hierarchical Classification with a Small Set of Features Yongwook Yoon Apr 3, 2003 NLP Lab., POSTECH
Contents • Document Classification • Bayesian Classifier • Feature Selection • Experiment & Results • Candidate Baseline System • Future Work
Hierarchical Classification • A massive number of documents is produced daily • From the WWW, or • From intranet environments, such as within an enterprise • Topic hierarchies already exist for those documents • We need a classifier capable of classifying documents hierarchically according to those topics • History of hierarchical collections • MEDLINE (medical literature maintained by NLM) • Patent document collections • Recently, Web search sites such as Yahoo, Infoseek, and Google
Simple vs. Hierarchical classification • [Diagram: a flat class space with a single root over documents D1 … Dn, versus a topic hierarchy (e.g., Root → Business → Grain, Oil) with documents attached to the leaf topics]
Why Hierarchy? • The simplistic approach breaks down • A flattened class space with only one root • For a large corpus, however, • There are hundreds of classes and thousands of features • The computational cost is prohibitive • The resulting classifier is very large • Many thousands of parameters lead to overfitting of the training data • It also loses the intuition that topics close to each other in the hierarchy have much more in common • In hierarchical classification, • Feature selection is a useful tool for dealing with these issues
Bayesian Classifier • A Bayesian classifier is simply a Bayesian network applied to a classification domain • It contains a node C for the unobservable class variable • And a node Xi for each of the features • Given a specific instance x = (x1, x2, … , xn), • It gives the class probability P(C=ck | X=x) for each possible class ck • Bayes-optimal classification • Select the class ck for which the probability P(ck | x) is maximized • c* = argmax_ck P(ck | x) = argmax_ck P(x | ck) P(ck)
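The decision rule above is easy to make concrete. Below is a minimal sketch, assuming the likelihoods P(x | ck) for the instance at hand and the priors P(ck) have already been estimated by some model; the class names and probabilities are purely illustrative.

```python
def bayes_optimal_class(likelihoods, priors):
    """Return the class ck maximizing P(x | ck) * P(ck).

    likelihoods: dict mapping class -> P(x | class) for one instance x
    priors:      dict mapping class -> P(class)
    """
    return max(priors, key=lambda ck: likelihoods[ck] * priors[ck])

# Hypothetical numbers for a two-class example (not taken from the paper):
print(bayes_optimal_class({"grain": 0.04, "oil": 0.05},
                          {"grain": 0.7, "oil": 0.3}))   # -> "grain"
```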
Bayesian Network (1/2) • Naïve Bayesian classifier • Very restrictive assumption: independence • P(X | C) = Πi P(Xi | C) • Simple and unrealistic, but still widely used • More complex forms are more expressive • Augment the network with some dependencies between features • But inducing an optimal Bayesian classifier is NP-hard
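Under the independence assumption, the likelihood factorizes into per-feature terms that can be estimated by simple counting. The following is a minimal naive Bayes sketch with Laplace smoothing over a bag-of-words representation, working in log space to avoid underflow; the helper names and toy representation are assumptions, not the paper's code.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: parallel list of class labels.
    Returns log priors and Laplace-smoothed log conditionals P(word | class)."""
    vocab = {w for d in docs for w in d}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)              # class -> word -> count
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    totals = {c: sum(word_counts[c].values()) for c in class_counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_cond = {c: {w: math.log((word_counts[c][w] + alpha) /
                                (totals[c] + alpha * len(vocab)))
                    for w in vocab}
                for c in class_counts}
    return log_prior, log_cond

def predict(tokens, log_prior, log_cond):
    """argmax_c [ log P(c) + sum_i log P(xi | c) ], ignoring unseen words."""
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_cond[c][w] for w in tokens if w in log_cond[c]))
```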
Bayesian Network (2/2) • Two main solutions to this problem • Tree augmented network (TAN) • Restricts each node to have at most one additional parent • An optimal classifier can be found in time quadratic in the number of features • KDB algorithm • Each node has at most k parents • Chooses as the parents of a node Xi • The k other features that Xi is most dependent on • Using a metric of class-conditional mutual information, I(Xi; Xj | C) • Computational complexity • Network construction: quadratic in the total # of features • Parameter estimation (i.e., conditional probability table construction): exponential in k
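The parent-selection step of KDB can be sketched by estimating the class-conditional mutual information from counts and giving each feature its k most dependent peers as extra parents. This is only an illustration of the metric, assuming binary feature vectors; it omits the node ordering and the parameter estimation of the full algorithm.

```python
import math
from collections import Counter

def cond_mutual_info(X, y, i, j):
    """Estimate I(Xi; Xj | C) from feature vectors X and class labels y."""
    n = len(y)
    joint = Counter((x[i], x[j], c) for x, c in zip(X, y))   # counts of (xi, xj, c)
    pi = Counter((x[i], c) for x, c in zip(X, y))
    pj = Counter((x[j], c) for x, c in zip(X, y))
    pc = Counter(y)
    mi = 0.0
    for (xi, xj, c), cnt in joint.items():
        p_joint = cnt / n                                    # P(xi, xj, c)
        # P(xi, xj | c) / (P(xi | c) * P(xj | c))
        ratio = (cnt / pc[c]) / ((pi[(xi, c)] / pc[c]) * (pj[(xj, c)] / pc[c]))
        mi += p_joint * math.log(ratio)
    return mi

def kdb_parents(X, y, k):
    """For each feature index i, pick the k other features it is most dependent on."""
    d = len(X[0])
    return {i: [j for _, j in sorted(((cond_mutual_info(X, y, i, j), j)
                                      for j in range(d) if j != i),
                                     reverse=True)[:k]]
            for i in range(d)}
```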
Feature Selection (1/2) • We have a feature for every word that appears in any document in the corpus • The computation would be prohibitive even if the TAN or KDB algorithm is applied • So feature selection is imperative • But simply reducing the features, without combining it with the hierarchical classifier, does not yield high performance • Because the set of features required to classify the topics varies widely from one node to another • Ex) distinguishing (agriculture vs. computer) at one node vs. (corn vs. wheat) at another • Adopt a method based on information-theoretic measures • It determines a subset of the original domain features that seems to best capture the class distribution in the data
Feature Selection (2/2) • Formally, the cross-entropy metric between two distributions μ and σ is defined as D(μ, σ) = Σx μ(x) log( μ(x) / σ(x) ), the “distance” between μ and σ • The algorithm • For each feature Xi, compute the expected cross-entropy δi = P(Xi) D(P(C|X), P(C|X-i)) • where X-i is the set of all domain features except Xi • Then eliminate the features Xi for which δi is minimized • This process can be iterated to eliminate as many features as desired • To compute P(C|X), the algorithm uses the Naïve Bayes model for speed and simplicity
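A rough sketch of the elimination loop, under the slide's simplifications: the posterior P(C | x) is approximated with naive Bayes, the expectation weighted by P(Xi) is replaced by an average over the training documents, and the feature with the smallest expected cross-entropy is dropped on each round. The posterior(doc, features) helper is an assumption here, not part of the original algorithm's code.

```python
import math

def kl(p, q):
    """Cross-entropy 'distance' D(p, q) = sum_c p(c) * log(p(c) / q(c))."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p if p[c] > 0)

def eliminate_features(docs, features, posterior, n_keep):
    """posterior(doc, feature_set) is assumed to return a dict P(class | doc)
    computed (e.g. with naive Bayes) from only the given features."""
    features = set(features)
    while len(features) > n_keep:
        deltas = {}
        for f in features:
            reduced = features - {f}
            deltas[f] = sum(kl(posterior(d, features), posterior(d, reduced))
                            for d in docs) / len(docs)
        features.remove(min(deltas, key=deltas.get))   # drop the least informative
    return features
```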
Experiment • The source corpus • The Reuters-22173 dataset • It does not have a pre-determined hierarchical classification scheme • But each document can have multiple labels • Goal • To construct two hierarchically classified document collections • Refinement of the corpus • Select two subsets from the collection, named “Hier1” and “Hier2” • For each document, assign only one major topic and one minor topic • All documents are grouped together at the top level, named “Current Business” • Next, each of these datasets was split 70%/30% into training and test sets
Experimental Methodology • Learning phase • Feature selection for the first tier of the hierarchy • Using just the major topics as classes • Next, build a probabilistic classifier with the reduced feature set • For each major topic, a separate round of probabilistic feature selection is employed • Finally, construct a separate classifier for each major topic on its appropriately reduced feature set • Testing phase • Test documents are classified by the first-level classifier • and then sent down to the chosen second-level classifier • In the flat classification scheme, • Feature selection is still done, but only one classifier is induced
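The two-level methodology can be summarized as a short pipeline sketch. select_features and train_classifier are hypothetical helpers standing in for the cross-entropy selection and naive Bayes sketches above, and each training document is assumed to carry its tokens plus major and minor topic labels; none of these names come from the paper.

```python
from collections import namedtuple

Doc = namedtuple("Doc", ["tokens", "major", "minor"])

def train_hierarchy(train_docs, n_features, select_features, train_classifier):
    """select_features(docs, label_fn, n) -> reduced feature set
    train_classifier(docs, features, label_fn) -> object with .predict(tokens)"""
    # First tier: feature selection over the major topics, then one classifier.
    top_feats = select_features(train_docs, lambda d: d.major, n_features)
    top_clf = train_classifier(train_docs, top_feats, lambda d: d.major)
    # Second tier: a separate selection round and classifier per major topic.
    second = {}
    for major in {d.major for d in train_docs}:
        subset = [d for d in train_docs if d.major == major]
        feats = select_features(subset, lambda d: d.minor, n_features)
        second[major] = train_classifier(subset, feats, lambda d: d.minor)
    return top_clf, second

def classify(doc, top_clf, second):
    major = top_clf.predict(doc.tokens)        # route through the first level...
    minor = second[major].predict(doc.tokens)  # ...then down the chosen subtree
    return major, minor
```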
Results – Baseline • Without employing any probabilistic feature selection • A very large number of features • The hierarchical method never performs better than the simple flat method • It also allows the more expressive KDB algorithm to overfit the training data • This leads to the need for feature selection
Results – with feature selection (1/2) • Reduce the feature set at each node • From 1258 features to 20, and then to 10 • Recall, however, that a potentially very different set of 10 or 20 features is selected at each node in the hierarchy • As a whole, the hierarchical classifier actually examines a much larger set of features
Results – with feature selection (2/2) • Overall improvement in accuracy over the baseline results • And of the hierarchical method over the flat method • Only one exception, in the case of (Hier2, KDB-1) • This classifier was trained on only 24 instances, so it is quite possibly a statistical anomaly • The Hier1 dataset, which has more instances for induction, does not show such a problem
Results – analysis cont’d • In the case of (Flat, #features=50), • The accuracy is very close to the “Hier” cases such as TAN and KDB • But the complexity of the classifier-learning algorithms is not comparable • Quadratic in the # of features (10² vs. 50²) • Conclusion • Feature selection applied together with hierarchical classification yields far better performance than the simple (flat) classification scheme • Simple Bayesian networks such as TAN or KDB combine well with the hierarchical classification scheme
Candidate System • Requirements • Hierarchical classification • Feedback of classification results – online adaptive learning • Support for various types of data – heterogeneous data sources • Experiment environment • Modeled after a real deployed system, although running on different hardware • Shares the same specification of system structure and functions • Training and test datasets come from the above real system
Enterprise Information Systems - KT • KMS (Knowledge Management System) • Systematically manages the various knowledge produced within the company and increases its utilization, contributing to business activities • Each employee registers the information (knowledge) they produce in the system, and also searches for and extracts the information needed for their work • Knowledge candidates: documents, meeting materials, business know-how, proposals, literature, etc. • Management scheme: a knowledge map organized around 569 HR job categories • Integrated Document System (Groupware) • Systematically manages documents, e-mail, messages, and department/personal information circulated within the company • For documents, every function from drafting, approval, production, and delivery to storage and retrieval is handled in one unified system • Jointly developed through a strategic alliance with Microsoft – web interface • In addition, per-department Q&A boards and an employee-to-employee messenger
Future Work • Target baseline system • Basic hierarchical classification with some real data • Research issues (+α) • Hierarchy • Utilize other implemented systems (BOW toolkit) • Online learning • Efficient and appropriate algorithms for adaptive learning • Bayesian online perceptron, online Gaussian processes • Automatic expansion and removal of the lowest level of subtopics over time • Pre-processing of the raw corpus • Integration of heterogeneous data types • Text, tables, images, e-mails, specially formatted texts, etc.