Incremental Context Mining for Adaptive Document Classification Advisor: Dr. Hsu Graduate: Chien-Shing Chen Authors: Rey-Long Liu, Yun-Ling Lu
Outline • Motivation • Objective • Introduction • Overview of the approach • Incremental context mining for ACclassifier • Experiments • Conclusions • Personal Opinion • Review
Motivation • Adaptive document classification (ADC) adapts a document classification (DC) system to the evolving contextual requirement of each document category, so that input documents can be classified based on their contexts of discussion.
Objective • 1. Contextual requirement (CR) terms should be mined by analyzing multiple documents from multiple categories. • 2. Inappropriate features may introduce inefficiency and errors. • 3. ADC may serve as the basis for supporting efficient and high-precision DC.
1. Introduction ACclassifier (Adaptive Context-based Classifier) has two components: 1. an incremental context miner; 2. a document classifier. Both components work on a given text hierarchy in which each node corresponds to a document category.
3-3. CR CR: the contextual requirement of a category.
3-4. TFIDF Strength(w, c): the strength of word w serving as a context word for the documents under category c. TFIDF: Term Frequency × Inverse Document Frequency.
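The slide names TFIDF but its formula did not survive in the text. As a reminder of the general idea only, here is a minimal sketch of textbook TF-IDF in Python. It is an assumption for illustration, not the Strength(w, c) definition used by ACclassifier; the function name `tfidf` and the toy corpus are hypothetical.

```python
import math

def tfidf(term, doc_tokens, corpus):
    """Textbook TF-IDF: relative term frequency times log inverse document frequency.

    Assumption: this is the generic formulation, not the paper's Strength(w, c).
    """
    if not doc_tokens:
        return 0.0
    # Term frequency: share of the document's tokens that are `term`.
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Document frequency: number of corpus documents containing `term`.
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)
    return tf * idf

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["computer", "dos", "computer", "mouse"],
    ["market", "sales", "computer"],
    ["dos", "java", "delphi"],
]

score_mouse = tfidf("mouse", corpus[0], corpus)        # rare word, higher weight
score_computer = tfidf("computer", corpus[0], corpus)  # common word, lower idf
```

In this toy corpus "mouse" appears in only one document, so it gets a larger inverse document frequency than "computer", which appears in two.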
3-5. TFIDF Strength(w_computer, c_MIS) = 0.909; Strength(w_dos, c_MIS) = 2 (the values used for MIS in 4-8).
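3-6. The incremental context miner [Figure: example text hierarchy annotated with word strengths. Recoverable labels: Electrical Engineering; S(computer)>0.909; S(computer)=0.909, S(dos)=2; S(EC)=0.476, S(computer)=0.022]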
4-1. DOA Given a document d to be classified, the basic idea is to compute its degree of acceptance (DOA) by each category c. The DOA is computed from the strengths of d's distinct words on c.
4-2. Two phases of classifier • The estimation of DOA for each category. • The identification of the winner category.
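The two phases can be sketched in Python as follows. This is a minimal sketch, assuming the DOA of a category is the sum, over the document's distinct words, of the word's strength on that category times its relative frequency in the document (the form used in the arithmetic of 4-8). The function names, the category labels `cat_a` and `cat_b`, and the strength values are hypothetical; any thresholds or constraints the real classifier applies are not modeled.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each distinct word to its share of the document's tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: n / total for w, n in counts.items()}

def estimate_doa(freqs, strengths):
    """Phase 1 (assumed form): sum of strength * relative frequency."""
    return sum(strengths.get(w, 0.0) * f for w, f in freqs.items())

def classify(tokens, strengths_by_category):
    """Phase 1 for every category, then phase 2: pick the highest DOA."""
    freqs = relative_frequencies(tokens)
    doas = {c: estimate_doa(freqs, s) for c, s in strengths_by_category.items()}
    winner = max(doas, key=doas.get)
    return winner, doas

# Hypothetical strengths for two hypothetical categories.
strengths_by_category = {
    "cat_a": {"computer": 1.0, "dos": 2.0},
    "cat_b": {"computer": 0.1},
}
tokens = ["computer", "computer", "computer", "dos"]

winner, doas = classify(tokens, strengths_by_category)
print(winner, doas)
```

Words with no recorded strength on a category contribute nothing to that category's DOA, so a document dominated by another category's context words scores low.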
4-4. DOA If w is a strong context word in c and occurs many times in d, c is more likely to “accept” d. Example values: frequency = 5; D1 = 5000; minSupport = 0.001.
4-5. Constraint I New Di, word frequencies: • Computer 20/40 • DOS 10/40 • Java 2/40 • Mouse 3/40 • Delphi 1/40
4-7. Given a document to be classified If w is a strong context word in c and occurs many times in d, c is more likely to “accept” d. • New Di: Computer 20/40, DOS 10/40 • MIS: S(computer)=0.909, S(dos)=2 • DSS: S(EC)=0.476, S(computer)=0.022
4-8. DOA DOA of Dnew on MIS: • computer: 0.909 × 20/40 = 0.4545 • dos: 2 × 10/40 = 0.5 • DOA_MIS(Dnew) = 0.4545 + 0.5 = 0.9545
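The arithmetic on this slide can be checked with a few lines of Python. This is only a re-computation of the slide's numbers; the variable names are hypothetical, and only the two words the slide uses (Computer and DOS) are included.

```python
# Strengths of the two words on category MIS, as given on the slides.
strength_mis = {"computer": 0.909, "dos": 2.0}

# Relative frequencies of those words in the new document (counts out of 40).
freq_new = {"computer": 20 / 40, "dos": 10 / 40}

# Per-word contribution: strength on MIS times relative frequency in the document.
contributions = {w: strength_mis[w] * freq_new[w] for w in freq_new}

# DOA of the new document on MIS is the sum of the contributions.
doa_mis = sum(contributions.values())
print(contributions)
print(round(doa_mis, 4))
```

The two contributions are 0.4545 and 0.5, and their sum rounds to 0.9545, matching the slide.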
5-1. Correct classification • Built from the 1100 documents used for initial training.
5-2. Correct classification • Baseline systems: allowed to use 5000 features in their feature sets.
5-3. correct classification • Using all training documents to build their feature set and classifiers.
5-4. Consider the test document entitled • “Setting up Email in DOS with today’s ISP using a dialup PPP TCP/IP connection”. • Baseline systems: “Software”, “Windows”, and “Operating Systems” • ACclassifier: “TCP/IP”, “connection”, “computer networking”, “userID”
5-5, 5-6. Cumulative training & testing time (sec.) • The time spent by ACclassifier grew more slowly once about 1400 training documents had been entered.
6. Conclusions • 1. Efficient mining of the contextual requirements for high-precision DC. • 2. Incremental mining without reprocessing previous documents. • 3. Evolutionary maintenance of the feature set. • 4. Efficient and fault-tolerant hierarchical DC.
7. Personal Opinion • The approach is acceptable in terms of the purity of the hierarchy.