Mining for Context Recognition in Document Filtering and Classification Rey-Long Liu Dept. of Information Management Chung Hua University HsinChu, Taiwan, R.O.C.
Problem Definition • Hierarchical document filtering & document classification (DF & DC) • Goal • Putting suitable information into suitable categories, which are organized hierarchically • Motivation • Information management, dissemination, & sharing
Problem Definition (Cont.) • Main challenge • Recognition of context of discussion (COD) • For example, a document mentioning “911” may be from many categories in the text hierarchy of Google, including recreation! • Deriving COD thresholds for making DF & DC decisions
Problem Definition (Cont.) • Main contributions • ICenter, which integrates DF and DC by COD recognition • Mining the profile of each category • Tuning a COD threshold for each category • DF & DC • Through COD recognition, higher-quality DF & DC may be achieved
Outline • Overview of ICenter • The profile miner • The COD threshold tuner • The filtering classifier • Empirical evaluation • Conclusion
ICenter • ICenter: an information center for a user community • [Figure: system overview. Training: training documents feed profile mining and threshold tuning, which produce category profiles and category thresholds. Testing: incoming documents go through filtering & classification, which outputs classified documents and filtered documents.]
The Profile Miner
• Procedure: ProfileMining(c), where c is a category in the text hierarchy.
• Effect: Build the profile $P_x$ of each descendant category x of c.
• Begin
• (1) For each child category x of c, do
•   (1.1) $P_x = \emptyset$;
•   (1.2) W = {w | w is a word in documents under x, and w is not a stop word};
•   (1.3) For each word w in W, do
•     (1.3.1) $s_{w,x} = P(w|x)$;
•     (1.3.2) $g_{w,x} = P(w|x) \times \left( B_x / \sum_i P(w|x_i) \right)$, where $B_x$ = 1 + number of siblings of x, and $x_i$ is the i-th child of c, $1 \le i \le B_x$;
•     (1.3.3) $P_x = P_x \cup \{\langle w, s_{w,x}, g_{w,x} \rangle\}$;
•   (1.4) If x is not a leaf category, recursively invoke ProfileMining(x) to build the profile of each descendant category of x;
• End.
[Figure: example text hierarchy with candidate profile terms marked (O) or (X). Manufacturing: product, factory, … (O). Systems Development: system, computer, analysis, … (O). Transaction Processing Systems: accounting, sales, … (O); system, computer, … (X). Decision Support Systems: decision, simulation, … (O); system, computer, … (X).]
• Measuring how representative and discriminative a term w is in a category x:
• $s_{w,x}$ = Support(w, x) (= $P(w|x)$)
• $g_{w,x}$ = Support(w, x) / Avg Support(w, $x_i$), where $x_i \in \{x\} \cup \{\text{siblings of } x\}$
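The ProfileMining procedure and the two measures can be sketched in Python as follows. This is a minimal illustration, not the author's implementation. It assumes that P(w|x) is estimated as the fraction of documents under x that contain w (the slides do not say how P(w|x) is estimated), and the names `support`, `profile_mining`, `children`, and `docs` are hypothetical.

```python
from collections import Counter


def support(documents, stop_words):
    """Estimate P(w|x) for every non-stop word w in the documents under x.

    Assumption: P(w|x) is the fraction of documents that contain w.
    """
    n = len(documents)
    counts = Counter()
    for doc in documents:
        counts.update(set(doc) - stop_words)
    return {w: c / n for w, c in counts.items()}


def profile_mining(c, children, docs, stop_words=frozenset(), profiles=None):
    """Build the profile {w: (s, g)} of each descendant category of c.

    children: category -> list of child categories
    docs:     category -> list of tokenized documents under that category
    """
    if profiles is None:
        profiles = {}
    kids = children.get(c, [])
    supports = {x: support(docs[x], stop_words) for x in kids}
    b = len(kids)  # B_x = 1 + number of siblings of x
    for x in kids:
        profile = {}
        for w, s in supports[x].items():
            # Sum of P(w|x_i) over x and its siblings; at least s, so never zero.
            total = sum(supports[xi].get(w, 0.0) for xi in kids)
            profile[w] = (s, s * b / total)
        profiles[x] = profile
        if children.get(x):  # step (1.4): recurse into non-leaf categories
            profile_mining(x, children, docs, stop_words, profiles)
    return profiles


# Toy hierarchy: "system" is frequent in both siblings, so its g value is low
# compared with "factory" or "sales", which occur in only one of them.
children = {"root": ["A", "B"]}
docs = {
    "A": [["system", "factory"], ["system", "product"]],
    "B": [["system", "sales"], ["accounting", "sales"]],
}
profiles = profile_mining("root", children, docs)
```

In the toy data, a word shared by both siblings gets a g value near 1 or below, while a word found in only one sibling gets g equal to the number of siblings plus one, which matches the intent of g as a discrimination measure.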
The COD Threshold Tuner
• Procedure: CODThresholdTuning(x), where x is a leaf category.
• Effect: (1) For each ancestor a of x, tune a COD threshold $h_{a,x}$, and (2) tune a COD threshold $h_{x,x}$ for x.
• Begin
• (1) P = {p | p is a document belonging to x};
• (2) For each ancestor category a of x, do
•   (2.1) $UB = \min\{\mathrm{DOA}_{p,a}\}$, where $p \in P$, and $\mathrm{DOA}_{p,a}$ is the DOA value of p with respect to a (DOAp,a = sw,angw,arw,atsw,p);
•   (2.2) $h_{a,x} = \max\{\mathrm{DOA}_{n,a}\}$, where n is a document not belonging to a, and $\mathrm{DOA}_{n,a} \le UB$;
• (3) Q = {q | q is a document not belonging to x};
• (4) For each q in Q, do
•   (4.1) For each ancestor a of x
•     (4.1.1) If $\mathrm{DOA}_{q,a} \le h_{a,x}$, $Q = Q \setminus \{q\}$;
• (5) $h_{x,x} = \mathrm{DOA}_{p,x}$, which maximizes the system's performance on P and Q ($p \in P$);
• End.
[Figure: the example text hierarchy, highlighting Systems Development (SA) and Decision Support Systems (DSS).]
• SA: The threshold allows all relevant documents to pass (but may filter out many non-relevant documents)
• DSS: Only those non-relevant documents that pass the test of SA are considered to tune an optimum threshold
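The tuning steps can be illustrated with a short Python sketch. It takes precomputed DOA values as input rather than computing them. Two points are assumptions, not statements from the slides: "the system's performance" in step (5) is measured by F1, and an ancestor threshold defaults to negative infinity when no non-member document scores at or below UB. The names `tune_cod_thresholds`, `members`, and `doa` are hypothetical.

```python
def tune_cod_thresholds(x, ancestors, all_docs, members, doa):
    """Return ({a: h_ax}, h_xx) for a leaf category x.

    ancestors: list of the ancestor categories of x
    members:   category -> set of documents belonging to that category
    doa:       doa[d][c] is the precomputed DOA value of document d w.r.t. c
    """
    P = members[x]
    h = {}
    for a in ancestors:
        # (2.1) Lowest DOA among the positive documents.
        ub = min(doa[p][a] for p in P)
        # (2.2) Highest DOA among non-members of a that does not exceed UB.
        below = [doa[n][a] for n in all_docs
                 if n not in members[a] and doa[n][a] <= ub]
        h[a] = max(below, default=float("-inf"))  # assumed default
    # (3)-(4) Keep only negatives that pass every ancestor test.
    Q = [q for q in all_docs
         if q not in P and all(doa[q][a] > h[a] for a in ancestors)]
    # (5) Pick the DOA value of a positive document that maximizes F1
    # (assumed performance measure); a document is accepted when DOA >= h.
    best_h, best_f1 = None, -1.0
    for cand in sorted({doa[p][x] for p in P}):
        tp = sum(doa[p][x] >= cand for p in P)
        fp = sum(doa[q][x] >= cand for q in Q)
        fn = len(P) - tp
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best_f1:
            best_h, best_f1 = cand, f1
    return h, best_h


# Toy data: two DSS documents and three documents from elsewhere.
all_docs = ["p1", "p2", "n1", "n2", "n3"]
members = {"SA": {"p1", "p2"}, "DSS": {"p1", "p2"}}
doa = {
    "p1": {"SA": 0.8, "DSS": 0.9},
    "p2": {"SA": 0.6, "DSS": 0.7},
    "n1": {"SA": 0.3, "DSS": 0.8},
    "n2": {"SA": 0.5, "DSS": 0.1},
    "n3": {"SA": 0.7, "DSS": 0.75},
}
h_anc, h_leaf = tune_cod_thresholds("DSS", ["SA"], all_docs, members, doa)
```

In the toy data, n1 scores high on DSS but is already stopped at SA, so it plays no part in tuning the DSS threshold; only n3, which passes the SA test, competes with the positive documents.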
The Filtering Classifier
• Procedure: DF&DC(d), where d is a document.
• Return: A set S of categories to which d is classified (d may be classified into c only when it passes all tests of c and ancestors of c)
• Begin
• (1) Invoke DOAEstimation(d) to estimate $\mathrm{DOA}_{d,c}$, for each category c;
• (2) $S = \emptyset$;
• (3) For each leaf category x, do
•   (3.1) IsAccepted = true;
•   (3.2) For each ancestor a of x, do
•     (3.2.1) If $\mathrm{DOA}_{d,a} \le h_{a,x}$,
•       (3.2.1.1) IsAccepted = false;
•       (3.2.1.2) Exit the for-loop;
•   (3.3) If IsAccepted = true,
•     (3.3.1) If $\mathrm{DOA}_{d,x} < h_{x,x}$,
•       (3.3.1.1) IsAccepted = false;
•   (3.4) If IsAccepted = true,
•     (3.4.1) $S = S \cup \{x\}$;
• (4) Return S;
• End.
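A compact Python sketch of the DF&DC procedure is given below. It assumes the DOA values of the document have already been estimated (step 1) and are passed in as a dictionary; the names `df_dc`, `doa_d`, `ancestors`, and `h` are hypothetical. An empty result is read here as "the document is filtered out", which is an interpretation of the DF side of the procedure.

```python
def df_dc(doa_d, leaves, ancestors, h):
    """Return the set of leaf categories that accept the document.

    doa_d:     category -> DOA value of the document w.r.t. that category
    ancestors: leaf category -> list of its ancestor categories
    h:         (category, leaf) -> COD threshold tuned for that leaf
    """
    accepted = set()
    for x in leaves:
        # (3.2) The document must pass the test of every ancestor of x.
        if any(doa_d[a] <= h[(a, x)] for a in ancestors[x]):
            continue
        # (3.3) It must then reach the threshold of x itself.
        if doa_d[x] < h[(x, x)]:
            continue
        accepted.add(x)
    return accepted


leaves = ["DSS", "TPS"]
ancestors = {"DSS": ["SA"], "TPS": ["SA"]}
h = {
    ("SA", "DSS"): 0.5, ("DSS", "DSS"): 0.7,
    ("SA", "TPS"): 0.4, ("TPS", "TPS"): 0.6,
}

in_context = df_dc({"SA": 0.65, "DSS": 0.72, "TPS": 0.2}, leaves, ancestors, h)
off_context = df_dc({"SA": 0.3, "DSS": 0.9, "TPS": 0.9}, leaves, ancestors, h)
```

The second document has high DOA values at both leaves but fails the ancestor test, so no leaf accepts it. This is the COD effect: a document that merely shares vocabulary with a leaf is rejected when its broader context does not match.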
Empirical Evaluation • Data • Source: the text hierarchy of Yahoo! • There were 507 categories (under 5 first-level categories) among which 211 were leaves (maximum height = 8) • There were 3612 documents in the leaves • Data splitting • 90% of the leaves served as “in-space” data (for DC) • 10% of the leaves served as “out-space” data (for DF)
Empirical Evaluation (Cont.) • Validation • 5-fold cross validation (i.e., 80% for training and 20% for testing) • Evaluation criteria • For DC • Precision • Recall • F1 = 2PR / (P+R) • For DF • Percentage of out-space documents successfully filtered (FR) • Average # of misclassifications for misclassified out-space documents (AM)
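The evaluation criteria can be computed as in the sketch below. It assumes that FR is the fraction of out-space documents assigned to no category, and that AM averages the number of assigned categories over the out-space documents that were assigned to at least one. The function names are hypothetical.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R); defined as 0 when both P and R are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def df_metrics(out_space_assignments):
    """Return (FR, AM) from the category sets assigned to out-space documents."""
    n = len(out_space_assignments)
    wrong = [len(s) for s in out_space_assignments if s]
    fr = (n - len(wrong)) / n if n else 0.0
    am = sum(wrong) / len(wrong) if wrong else 0.0
    return fr, am


f1 = f1_score(0.5, 1.0)
fr, am = df_metrics([set(), {"A"}, {"A", "B", "C"}, set()])
```

Here two of the four out-space documents are filtered (FR = 0.5), and the two misclassified ones received one and three categories respectively (AM = 2.0).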
Empirical Evaluation (Cont.) • Systems evaluated • ICenter • Baseline: Rocchio's classifier with thresholding (RO+T) • χ² (chi-square) technique for feature selection
Empirical Evaluation (Cont.) • Result • Compared with the baseline using a feature set of size 40000, ICenter achieved a 6.2% improvement in FR and an 18% reduction in AM
Conclusion • Main contribution • Exploring how and to what extent COD recognition may contribute to integrated DF and DC • The developed technique, ICenter, is both • More manageable (no need to tune feature sets), and • More competent (able to achieve better performance in both DC and DF)