Recent Trends in Text Mining Girish Keswani gkeswani@micron.com
Text Mining? • What? • Data Mining on Text Data • Why? • Information Retrieval • Confusion Set Disambiguation • Topic Distillation • How? • Data Mining
Organization • Text Mining Algorithms • Jargon Used • Background • Data Modeling, • Text Classification, and • Text Clustering • Applications • Experiments {NBC, NN and ssFCM} • Further work • References
Text Mining Algorithms • Classification Algorithms • Naïve Bayes Classifier • Decision Trees • Neural Networks • Clustering Algorithms • EM Algorithms • Fuzzy
Jargon • DM: Data Mining • IR: Information Retrieval • NBC: Naïve Bayes Classifier • EM: Expectation Maximization • NN: Neural Networks • ssFCM: Semi-Supervised Fuzzy C-Means • Labeled Data (Training Data) • Unlabeled Data • Test Data
Background: Modeling • Vector Space Model
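The slide names the Vector Space Model without showing it. As a minimal sketch, assuming plain term-frequency weights and whitespace tokenization (neither is stated in the slides), each document becomes a vector over a shared vocabulary and documents are compared by cosine similarity. The helper names `tf_vector` and `cosine_similarity` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_vector(text, vocabulary):
    """Map a document to a term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical, not the 20 Newsgroups data used in the talk).
docs = [
    "text mining finds patterns in text",
    "data mining finds patterns in data",
    "neural networks learn weights",
]
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [tf_vector(doc, vocabulary) for doc in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # documents share several terms
sim_02 = cosine_similarity(vectors[0], vectors[2])  # documents share no terms
```

In practice the raw counts are often replaced by TF-IDF weights, but the geometry is the same: documents with overlapping vocabulary point in similar directions.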
Background: Modeling • Generative Models of Data [13] : Probabilistic “to generate a document, a class is first selected based on its prior probability and then a document is generated using the parameters of the chosen class distribution” • NBC and EM Algorithms are based on this model
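The generative description above (pick a class by its prior, then generate the document from that class's distribution) can be sketched as a multinomial Naïve Bayes classifier. This is an illustrative sketch, not the implementation used in the experiments: the Laplace smoothing parameter `alpha`, the whitespace tokenizer, the function names, and the toy training set are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nbc(labeled_docs, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word probabilities per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)

    total_docs = sum(class_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_counts.items()}

    log_likelihood = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocabulary
        }
    return log_prior, log_likelihood, vocabulary


def classify(text, model):
    """Return the class maximizing log P(class) + sum of log P(word | class)."""
    log_prior, log_likelihood, vocabulary = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in text.lower().split():
            if w in vocabulary:  # words unseen in training are ignored
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical labeled data, for illustration only.
training = [
    ("the team won the game", "sports"),
    ("the player scored a goal", "sports"),
    ("the chip stores data in memory", "tech"),
    ("the processor runs the program", "tech"),
]
model = train_nbc(training)
prediction = classify("the player won", model)
```

An EM-based variant would alternate between labeling unlabeled documents with the current model and re-estimating these same parameters; that loop is not shown here.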
Importance of Unlabeled Data? • Provides access to the feature distribution in set F using joint probability distributions [Figure: regions A to G spanning Labeled Data, Unlabeled Data, and Test Data]
Experimental Results [1] Using NBC, EM and ssFCM
Experimental Results [2] Using NBC and EM
Extensions and Variants of these approaches • The authors of [6] propose a Class Distribution Constraint matrix • Results on Confusion Set Disambiguation • Automatic Title Generation [7]: • Uses the EM Algorithm • Non-extractive approach
Relational Data [9] • Relational data: a collection of data in which the relations between entities are explicitly described • Probabilistic Relational Models
Commercial Use/Products • IBM Text Analyzer [11] • Decision Tree Based • SAS Text Miner [12] • Singular Value Decomposition • Filtering Junk Email • Hotmail, Yahoo • Advanced Search Engines
Experiments • NBC • Naïve Bayes Classifier • Probabilistic • NN • Neural Networks • ssFCM • Semi-Supervised Fuzzy Clustering • Fuzzy
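To make the fuzzy clustering entry concrete, here is a minimal sketch of fuzzy c-means with a simple semi-supervised twist: labeled points have their memberships clamped to their known cluster after every update. The clamping scheme is an assumption chosen for illustration and is not necessarily the ssFCM formulation of [1]; the fuzzifier `m`, the iteration count, and the toy 2-D points are also assumptions. The sketch assumes every cluster has at least one labeled point.

```python
def euclidean(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5


def update_memberships(points, centers, m=2.0):
    """Standard FCM update: u_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1))."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [euclidean(x, c) for c in centers]
        if min(dists) == 0.0:
            # Point sits exactly on a center: give it full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            memberships.append([h / sum(hits) for h in hits])
        else:
            memberships.append(
                [1.0 / sum((dk / dj) ** power for dj in dists) for dk in dists]
            )
    return memberships


def update_centers(points, memberships, m=2.0):
    """Each center is the mean of all points weighted by membership^m."""
    centers = []
    for k in range(len(memberships[0])):
        weights = [u[k] ** m for u in memberships]
        total = sum(weights)
        centers.append(
            [sum(w * x[d] for w, x in zip(weights, points)) / total
             for d in range(len(points[0]))]
        )
    return centers


def clamp_labeled(memberships, labels):
    """Overwrite rows of labeled points with crisp (0/1) memberships."""
    n_clusters = len(memberships[0])
    return [
        row if label is None
        else [1.0 if k == label else 0.0 for k in range(n_clusters)]
        for row, label in zip(memberships, labels)
    ]


def semi_supervised_fcm(points, labels, n_clusters, m=2.0, iterations=20):
    """labels[i] is a cluster index for labeled points and None otherwise."""
    uniform = [[1.0 / n_clusters] * n_clusters for _ in points]
    memberships = clamp_labeled(uniform, labels)
    for _ in range(iterations):
        centers = update_centers(points, memberships, m)
        memberships = clamp_labeled(update_memberships(points, centers, m), labels)
    return memberships, centers


# Hypothetical 2-D data: one labeled point per cluster, the rest unlabeled.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = [0, None, None, 1, None, None]
memberships, centers = semi_supervised_fcm(points, labels, n_clusters=2)
```

For text, the points would be document vectors rather than 2-D coordinates, and a crisp class can be read off each unlabeled row by taking the cluster with the largest membership.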
Datasets (20 Newsgroups Data) • Sampling I: • Sampling II: [Figure: Raw Data converted to Vectors via Sampling I and Sampling II]
NBC [Figure: results for Sample25 and Sample30]
Further Work • Ensemble of Classifiers [16]
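One common way to combine classifiers is plain majority voting, sketched below. The slide does not say which ensemble method is intended, so this is only an illustration; `majority_vote`, `keyword_classifier`, and the toy keyword rules are hypothetical stand-ins for trained models such as NBC, NN, and ssFCM.

```python
from collections import Counter


def majority_vote(classifiers, document):
    """Return the label predicted by the largest number of classifiers."""
    votes = [classify(document) for classify in classifiers]
    label, _count = Counter(votes).most_common(1)[0]
    return label


def keyword_classifier(keyword, label, default):
    """Toy classifier: predict `label` if `keyword` occurs, else `default`."""
    def classify(document):
        return label if keyword in document.lower().split() else default
    return classify


# Three deliberately weak, hypothetical base classifiers.
ensemble = [
    keyword_classifier("goal", "sports", "tech"),
    keyword_classifier("team", "sports", "tech"),
    keyword_classifier("chip", "tech", "sports"),
]

result = majority_vote(ensemble, "the team scored a goal")
```

Voting helps most when the base classifiers make different mistakes, which is the usual motivation for mixing probabilistic, neural, and fuzzy models.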
Further Work • Knowledge Gathering from Experts • E.g. 3-class data [Figure: Input Data {C1, C2, C3}, a Classifier, and Test Data with unknown class (?)]
References
[1] “Text Classification using Semi-Supervised Fuzzy Clustering,” Girish Keswani and L.O. Hall, IEEE WCCI 2002.
[2] “Using Unlabeled Data to Improve Text Classification,” Kamal Paul Nigam.
[3] “Text Classification from Labeled and Unlabeled Documents using EM,” Kamal Paul Nigam et al.
[4] “The Value of Unlabeled Data for Classification Problems,” Tong Zhang.
[5] “Learning from Partially Labeled Data,” Martin Szummer et al.
[6] “Training a Naïve Bayes Classifier via the EM Algorithm with a Class Distribution Constraint,” Yoshimasa Tsuruoka and Jun’ichi Tsujii.
[7] “Automatic Title Generation using EM,” Paul E. Kennedy and Alexander G. Hauptmann.
[8] “Unlabeled Data can degrade Classification Performance of Generative Classifiers,” Fabio G. Cozman and Ira Cohen.
[9] “Probabilistic Classification and Clustering in Relational Data,” Ben Taskar et al.
[10] “Using Clustering to Boost Text Classification,” Y.C. Fang et al.
[11] IBM Text Analyzer: “A decision-tree-based symbolic rule induction system for text categorization,” D.E. Johnson et al.
[12] “SAS Text Miner,” Reincke.
[13] “Pattern Recognition,” Duda and Hart, 2000.
[14] “Machine Learning,” Tom Mitchell.
[15] “Data Mining,” Margaret Dunham.
[16] http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/