Mining for Context Recognition in Document Filtering and Classification

Rey-Long Liu, Dept. of Information Management, Chung Hua University, HsinChu, Taiwan, R.O.C.

Problem Definition
• Hierarchical document filtering & document classification (DF & DC)
  • Goal: putting suitable information into suitable categories, which are organized hierarchically
  • Motivation: information management, dissemination, & sharing

Problem Definition (Cont.)
• Main challenge
  • Recognition of the context of discussion (COD)
    • For example, a document mentioning “911” may belong to any of many categories in the text hierarchy of Google, including recreation!
  • Deriving COD thresholds for making DF & DC decisions

Problem Definition (Cont.)
• Main contributions
  • ICenter, which integrates DF and DC by COD recognition
    • Mining the profile of each category
    • Tuning a COD threshold for each category
    • DF & DC
  • Through COD recognition, higher-quality DF & DC may be achieved

Outline
• Overview of ICenter
• The profile miner
• The COD threshold tuner
• The filtering classifier
• Empirical evaluation
• Conclusion

ICenter
• ICenter: an information center for a user community
[Figure: ICenter architecture. Training: Training Documents, Profile Mining, Category Profiles, Threshold Tuning, Category Thresholds. Testing: Incoming Documents, Filtering & Classification, Classified Documents, Filtered Documents.]

The Profile Miner
• Procedure: ProfileMining(c), where c is a category in the text hierarchy.
• Effect: Build the profile P_x of each descendant category x of c.
• Begin
  (1) For each child category x of c, do
    (1.1) P_x = ∅;
    (1.2) W = {w | w is a word in documents under x, and w is not a stop word};
    (1.3) For each word w in W, do
      (1.3.1) s_{w,x} = P(w|x);
      (1.3.2) g_{w,x} = P(w|x) × (B_x / Σ_i P(w|x_i)), where B_x = 1 + number of siblings of x, and x_i is the i-th child of c, 1 ≤ i ≤ B_x;
      (1.3.3) P_x = P_x ∪ {<w, s_{w,x}, g_{w,x}>};
    (1.4) If x is not a leaf category, recursively invoke ProfileMining(x) to build the profile of each descendant category of x;
• End.
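The ProfileMining procedure can be sketched in Python as follows. This is a minimal illustration, not the ICenter implementation: the slides do not say how P(w|x) is estimated, so the sketch assumes it is the fraction of documents under x that contain w. The nested-dict hierarchy layout, the whitespace tokenizer, the `STOP_WORDS` list, and the example documents are all hypothetical, and category names are assumed to be unique.

```python
# Hypothetical stop list; the slides do not give one.
STOP_WORDS = {"the", "a", "of", "and", "for"}

def docs_under(node):
    """Collect the documents stored at a node and at all of its descendants."""
    docs = list(node.get("docs", []))
    for child in node.get("children", {}).values():
        docs.extend(docs_under(child))
    return docs

def support(word, docs):
    """Assumed estimate of P(w|x): fraction of documents containing the word."""
    if not docs:
        return 0.0
    return sum(1 for d in docs if word in d) / len(docs)

def profile_mining(node, profiles=None):
    """Build {category: {word: (s, g)}} for every descendant of `node`."""
    profiles = {} if profiles is None else profiles
    children = node.get("children", {})
    # Each document becomes a set of lowercased, whitespace-separated words.
    tokens = {name: [set(d.lower().split()) for d in docs_under(child)]
              for name, child in children.items()}
    b = len(children)  # B_x = 1 + number of siblings of x
    for name, child in children.items():
        words = set().union(*tokens[name]) - STOP_WORDS        # step (1.2)
        profile = {}
        for w in words:
            s = support(w, tokens[name])                       # step (1.3.1)
            total = sum(support(w, tokens[sib]) for sib in children)
            g = s * b / total                                  # step (1.3.2)
            profile[w] = (s, g)                                # step (1.3.3)
        profiles[name] = profile
        if child.get("children"):                              # step (1.4)
            profile_mining(child, profiles)
    return profiles

# Hypothetical toy hierarchy loosely modeled on the slide's example.
hierarchy = {"children": {
    "Manufacturing": {"docs": ["product factory", "factory system"]},
    "Systems Development": {"children": {
        "TPS": {"docs": ["system accounting sales", "computer accounting"]},
        "DSS": {"docs": ["system decision simulation", "computer decision"]},
    }},
}}
profiles = profile_mining(hierarchy)
print(profiles["TPS"]["accounting"], profiles["TPS"]["system"])
```

In this toy data, "accounting" appears only under TPS and so gets g = 2.0 (twice the sibling average), while "system" is equally frequent under both siblings and gets g = 1.0, which matches the intent that g measures how discriminative a term is among siblings.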

[Figure: example text hierarchy with terms marked (O) or (X). Manufacturing: Product, factory, … (O). Systems Development: System, Computer, Analysis, … (O). Transaction Processing Systems: Accounting, Sales, … (O); System, Computer, … (X). Decision Support Systems: Decision, simulation, … (O); System, Computer, … (X).]
• Measuring how representative and discriminative a term w is in a category x:
  • s_{w,x} = Support(w, x) (= P(w|x))
  • g_{w,x} = Support(w, x) / Avg Support(w, x_i), where x_i is in {x} ∪ {siblings of x}

The COD Threshold Tuner
• Procedure: CODThresholdTuning(x), where x is a leaf category.
• Effect: (1) For each ancestor a of x, tune a COD threshold h_{a,x}, and (2) tune a COD threshold h_{x,x} for x.
• Begin
  (1) P = {p | p is a document belonging to x};
  (2) For each ancestor category a of x, do
    (2.1) UB = Min{DOA_{p,a}}, where p ∈ P, and DOA_{p,a} is the DOA value of p with respect to a (DOAp,a = sw,angw,arw,atsw,p);
    (2.2) h_{a,x} = Max{DOA_{n,a}}, where n is a document not belonging to a, and DOA_{n,a} ≤ UB;
  (3) Q = {q | q is a document not belonging to x};
  (4) For each q in Q, do
    (4.1) For each ancestor a of x
      (4.1.1) If DOA_{q,a} ≤ h_{a,x}, Q = Q – {q};
  (5) h_{x,x} = DOA_{p,x}, which maximizes the system’s performance on P and Q (p ∈ P);
• End.
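The tuning steps can be sketched in Python as below. This is an illustration under several assumptions, not the ICenter code: DOA values are taken as precomputed inputs, the comparisons lost in the transcript are read as "≤", "the system's performance" in step (5) is assumed to be F1 with a document accepted when its DOA is at least the threshold, and an ancestor with no qualifying non-member document gets a threshold of negative infinity. The function name, argument layout, and example numbers are hypothetical.

```python
def tune_cod_thresholds(leaf, ancestors, docs, members, doa):
    """Return {category: threshold} for `leaf` and each of its ancestors.

    members[c] is the set of documents belonging to category c;
    doa[(d, c)] is the precomputed DOA value of document d with respect to c.
    """
    P = [d for d in docs if d in members[leaf]]                # step (1)
    h = {}
    for a in ancestors:                                        # step (2)
        ub = min(doa[(p, a)] for p in P)                       # step (2.1)
        below = [doa[(n, a)] for n in docs
                 if n not in members[a] and doa[(n, a)] <= ub]
        # Assumption: with no qualifying non-member, the ancestor filters nothing.
        h[a] = max(below) if below else float("-inf")          # step (2.2)

    # Steps (3)-(4): keep only the non-members of the leaf that pass every ancestor test.
    Q = [q for q in docs if q not in members[leaf]
         and all(doa[(q, a)] > h[a] for a in ancestors)]

    def f1(t):
        """Assumed performance measure: F1 on P and Q when accepting DOA >= t."""
        tp = sum(doa[(p, leaf)] >= t for p in P)
        fp = sum(doa[(q, leaf)] >= t for q in Q)
        fn = len(P) - tp
        return 2 * tp / (2 * tp + fp + fn)

    # Step (5): candidate thresholds are the DOA values of the leaf's own documents.
    h[leaf] = max((doa[(p, leaf)] for p in P), key=f1)
    return h

# Hypothetical data: p1, p2 belong to DSS; t1 to a sibling leaf; m1, m2 lie outside SD.
docs = ["p1", "p2", "t1", "m1", "m2"]
members = {"SD": {"p1", "p2", "t1"}, "DSS": {"p1", "p2"}}
doa = {("p1", "SD"): 0.8, ("p2", "SD"): 0.6, ("t1", "SD"): 0.7,
       ("m1", "SD"): 0.3, ("m2", "SD"): 0.9,
       ("p1", "DSS"): 0.9, ("p2", "DSS"): 0.5, ("t1", "DSS"): 0.4,
       ("m1", "DSS"): 0.2, ("m2", "DSS"): 0.6}
thresholds = tune_cod_thresholds("DSS", ["SD"], docs, members, doa)
print(thresholds)
```

Here the ancestor threshold for SD becomes 0.3, so m1 is dropped from Q before the leaf threshold is tuned; the leaf threshold 0.5 then gives the best F1 on the remaining documents.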

[Figure: the example text hierarchy (Manufacturing; Systems Development (SA); Transaction Processing Systems; Decision Support Systems (DSS)), annotated as follows.]
• At Systems Development (SA): the threshold allows all relevant documents to pass (but may filter out many non-relevant documents)
• At Decision Support Systems (DSS): only those non-relevant documents that pass the test of SA are considered to tune an optimum threshold

The Filtering Classifier
• Procedure: DF&DC(d), where d is a document.
• Return: A set S of categories to which d is classified (d may be classified into c only when it passes all tests of c and the ancestors of c)
• Begin
  (1) Invoke DOAEstimation(d) to estimate DOA_{d,c}, for each category c;
  (2) S = ∅;
  (3) For each leaf category x, do
    (3.1) IsAccepted = true;
    (3.2) For each ancestor a of x, do
      (3.2.1) If DOA_{d,a} ≤ h_{a,x},
        (3.2.1.1) IsAccepted = false;
        (3.2.1.2) Exit the for-loop;
    (3.3) If IsAccepted = true,
      (3.3.1) If DOA_{d,x} < h_{x,x},
        (3.3.1.1) IsAccepted = false;
    (3.4) If IsAccepted = true,
      (3.4.1) S = S ∪ {x};
  (4) Return S;
• End.
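A short Python sketch of the DF&DC procedure follows. It assumes the DOA values of the document have already been estimated (step 1), that the comparison lost in step (3.2.1) is "≤", and that thresholds are stored in a dict keyed by (category, leaf); the function name, data layout, and example numbers are hypothetical.

```python
def df_dc(doa_d, leaves, ancestors, h):
    """Return the set of leaf categories that accept a document.

    doa_d[c] is the document's DOA value for category c;
    h[(a, x)] is the COD threshold tuned for category a on behalf of leaf x.
    An empty result means the document is filtered out.
    """
    accepted = set()                                   # step (2)
    for x in leaves:                                   # step (3)
        is_accepted = True
        for a in ancestors[x]:                         # step (3.2)
            if doa_d[a] <= h[(a, x)]:
                is_accepted = False
                break
        if is_accepted and doa_d[x] < h[(x, x)]:       # step (3.3)
            is_accepted = False
        if is_accepted:                                # step (3.4)
            accepted.add(x)
    return accepted

# Hypothetical thresholds for two sibling leaves under one ancestor, SD.
leaves = ["TPS", "DSS"]
ancestors = {"TPS": ["SD"], "DSS": ["SD"]}
h = {("SD", "TPS"): 0.2, ("TPS", "TPS"): 0.6,
     ("SD", "DSS"): 0.3, ("DSS", "DSS"): 0.5}

in_context = df_dc({"SD": 0.7, "TPS": 0.4, "DSS": 0.55}, leaves, ancestors, h)
out_of_context = df_dc({"SD": 0.1, "TPS": 0.9, "DSS": 0.9}, leaves, ancestors, h)
print(in_context, out_of_context)
```

The second document scores highly on both leaves but fails the ancestor test, so it is filtered out. This is the role COD recognition plays: a leaf-level match does not count unless the surrounding context also matches.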

Empirical Evaluation
• Data
  • Source: the text hierarchy of Yahoo!
    • There were 507 categories (under 5 first-level categories), among which 211 were leaves (maximum height = 8)
    • There were 3612 documents in the leaves
  • Data splitting
    • 90% of the leaves served as “in-space” data (for DC)
    • 10% of the leaves served as “out-space” data (for DF)

Empirical Evaluation (Cont.)
• Validation
  • 5-fold cross validation (i.e., 80% for training and 20% for testing)
• Evaluation criteria
  • For DC
    • Precision (P)
    • Recall (R)
    • F1 = 2PR / (P + R)
  • For DF
    • Percentage of out-space documents successfully filtered (FR)
    • Average # of misclassifications for misclassified out-space documents (AM)
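The criteria above can be computed as in the following Python sketch. The slides do not say how precision and recall are averaged, so micro-averaging over (document, category) pairs is an assumption here; AM is read as the mean number of categories assigned to those out-space documents that were not filtered. The function names and example data are hypothetical.

```python
def dc_scores(predicted, actual):
    """Micro-averaged precision, recall, and F1 for in-space documents (assumed averaging)."""
    tp = sum(len(predicted.get(d, set()) & actual[d]) for d in actual)
    n_pred = sum(len(predicted.get(d, set())) for d in actual)
    n_true = sum(len(actual[d]) for d in actual)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_true if n_true else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def df_scores(predicted_out):
    """FR (percent of out-space documents filtered) and AM for out-space documents."""
    misclassified = [cats for cats in predicted_out.values() if cats]
    filtered = len(predicted_out) - len(misclassified)
    fr = 100.0 * filtered / len(predicted_out)
    am = (sum(len(c) for c in misclassified) / len(misclassified)
          if misclassified else 0.0)
    return fr, am

# Hypothetical predictions: each document maps to the set of categories assigned to it.
actual = {"d1": {"A"}, "d2": {"B"}, "d3": {"C"}}
predicted = {"d1": {"A"}, "d2": {"A", "B"}, "d3": set()}
p, r, f1 = dc_scores(predicted, actual)

predicted_out = {"o1": set(), "o2": {"A", "B"}, "o3": set(), "o4": {"C"}}
fr, am = df_scores(predicted_out)
print(p, r, f1, fr, am)
```

A higher FR and a lower AM both indicate better filtering: more out-space documents are rejected outright, and those that slip through land in fewer wrong categories.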

Empirical Evaluation (Cont.)
• Systems evaluated
  • ICenter
  • Baseline: Rocchio’s classifier with thresholding (RO+T)
    • χ² (chi-square) technique for feature selection

Empirical Evaluation (Cont.)
• Result
  • Compared with the baseline using a feature set of size 40000, ICenter achieved a 6.2% improvement in FR and an 18% reduction in AM

Conclusion
• Main contribution
  • Exploring how and to what extent COD recognition may contribute to integrated DF and DC
  • The developed technique, ICenter, is both
    • More manageable (no need to tune feature sets), and
    • More competent (able to achieve better performance in both DC and DF)
