A Study of Text Categorization: Classifying Programming Newsgroup Discussions using Text Categorization Algorithms • by Lingfeng Mo
What is text categorization? • Definition: the classification of documents into a fixed number of predefined categories. • Sometimes also referred to as text data mining.
What is Data Mining? • Many Definitions • Non-trivial extraction of implicit, previously unknown and potentially useful information from data • Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks... • Classification [Predictive] • Clustering [Descriptive] • Association Rule Discovery [Descriptive] • Sequential Pattern Discovery [Descriptive] • Regression [Predictive] • Deviation Detection [Predictive]
Data Mining Tasks • Prediction Methods • Use some variables to predict unknown or future values of other variables. • Description Methods • Find human-interpretable patterns that describe the data. From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
Classification: Definition • Given a collection of records (the training set) • Each record contains a set of attributes; one of the attributes is the class. • Find a model that predicts the class attribute as a function of the values of the other attributes. • Goal: previously unseen records should be assigned a class as accurately as possible. • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set and a test set: the training set is used to build the model and the test set to validate it.
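The split-and-evaluate procedure above can be sketched in a few lines (a minimal illustration with hypothetical helper names, not MALLET's API):

```python
import random

def train_test_split(records, train_fraction=0.7, seed=0):
    """Randomly divide labeled records into a training set (used to
    build the model) and a test set (used to validate it)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, test_set):
    """Fraction of previously unseen records assigned the correct class."""
    correct = sum(1 for attributes, label in test_set
                  if model(attributes) == label)
    return correct / len(test_set)
```

The model here is any callable mapping a record's attributes to a predicted class; in the study this role is played by a trained MALLET classifier.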
Classification Example • A corpus of 12 German documents and 12 English documents • Randomly choose a certain portion as the training set and learn a classifier • The remaining documents form the test set used to evaluate the model
Why our study? • Programmers often seek and exchange information online about problems with a certain library, framework, or API. • Titles often do not correspond to the content of newsgroup discussions. • Novices often do not know how to phrase a question precisely. • By categorizing an ongoing discussion, such techniques could directly point the developers who ask questions to previous discussions of similar problems.
What’s this study for? • Ideal Goal: Automatically classify discussions into meaningful semantic categories. • Approach: • Collect and save raw data • Import and optimize data • Select a certain portion of the data to train a classifier model • Classify data • Evaluate results
Tool we use • MALLET (MAchine Learning for LanguagE Toolkit) is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. • Via: http://mallet.cs.umass.edu/
Data Collecting • Download discussions from a Java programming forum • Save each discussion as a text document (.txt) – Article(text, name) • Manually put similar discussions into the same folder (folders serve as labels)
Import Data • Input: labeled articles (the folder name supplies the label) • How it works: each article passes through a MALLET pipe – Char Sequence -> Token Sequence -> Feature Vectors • Output: a MALLET instance with three parts: • Data: the feature vectors • Name/Source: drive name + folder name + article name • Target: the label
Example of Import Data • import-file: builds one instance per line of a single file • import-dir: builds one instance per file, with each directory treated as one label
Training Classifier • Input the data produced by the import process • Set the training portion to split the training set and test set • k-fold cross-validation – usually 10 trials • Set the trainer – NaiveBayesTrainer
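The k-fold idea can be sketched as follows (our own illustration of the technique, not MALLET's internal code):

```python
def k_fold_splits(instances, k=10):
    """Yield (train, test) pairs for k-fold cross-validation:
    the data is divided into k folds, and each fold serves as the
    test set exactly once while the rest form the training set."""
    folds = [instances[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Running k trials this way yields one accuracy figure per trial, which the output slides below summarize as means, standard deviations, and standard errors.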
Basic Bayes Theorem • A probabilistic framework for solving classification problems • Conditional probability: P(C|A) = P(A, C) / P(A) • Bayes theorem: P(C|A) = P(A|C) P(C) / P(A)
Example of Bayes Theorem • Given: • A doctor knows that a cold causes cough 50% of the time • The prior probability of any patient having a cold is 1/50,000 • The prior probability of any patient having a cough is 1/20 • If a patient has a cough, what is the probability that he/she has a cold?
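Plugging the given numbers into Bayes' theorem gives a small worked check:

```python
# Worked example of Bayes' theorem:
# P(cold | cough) = P(cough | cold) * P(cold) / P(cough)
p_cough_given_cold = 0.5     # a cold causes cough 50% of the time
p_cold = 1 / 50_000          # prior probability of having a cold
p_cough = 1 / 20             # prior probability of having a cough

p_cold_given_cough = p_cough_given_cold * p_cold / p_cough
# ≈ 0.0002, i.e. 1 in 5,000: even given a cough, a cold is unlikely,
# because the prior P(cold) is so small.
```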
Output of Classification • Confusion matrix • Test data accuracy for every trial • Train data accuracy: mean, standard deviation, standard error • Test data accuracy: mean, standard deviation, standard error
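The per-trial summary statistics can be computed as follows (a sketch; `summarize_trials` is a hypothetical helper, not a MALLET function):

```python
from statistics import mean, stdev
from math import sqrt

def summarize_trials(accuracies):
    """Mean, sample standard deviation, and standard error of the
    per-trial accuracies from k-fold cross-validation."""
    m = mean(accuracies)
    sd = stdev(accuracies)           # sample standard deviation (n - 1)
    se = sd / sqrt(len(accuracies))  # standard error of the mean
    return m, sd, se
```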
How to improve the accuracy? • Increase the recognition rate – Word Splitting • Unify word tenses – Word Stemming • Remove noisy data – Stop-Word Removal • Handle overlapped categories – Top N Method
Word Stemming • Change a verb's tense back to its original form • Ex. performed -> perform
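A minimal rule-based stemmer can illustrate the idea (a toy sketch that only strips a few common suffixes; real systems typically use the Porter stemmer):

```python
def stem(word):
    """Strip a few common verb/plural suffixes to map inflected
    forms to a shared base form. Deliberately simplistic: it does
    not handle doubled consonants (running -> runn) or irregulars."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w
```

Stemming matters here because "performed", "performs", and "perform" should all count as the same feature when training the classifier.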
Word Splitting (1 of 3) • In what cases can we split a word? • Punctuation • Blank (whitespace) • Underscore
Word Splitting (2 of 3) • Examples: • Ex. Set_Value -> Set Value; • ImageIcon("myIcon.gif")); -> ImageIcon myIcon gif; • actionPerformed(ActionEvent e) -> actionPerformed Action Event e • See any problems? • There are cases where people write several words, or words with numbers, joined together. • Ex. JButton, actionListener, Button1, Button2, Button3, etc.
Word Splitting (3 of 3) • What are the special cases? • 1. Begins with one or more capital letters combined with one or more words. Ex. JFrame -> J Frame; JJJJJJJButton -> JJJJJJJ Button; JButtonApple -> J Button Apple • 2. A lower-case letter or word combined with a word beginning with a capital letter. Ex. cButton -> c Button; ccccccccButton -> cccccccc Button; setValue -> set Value; addActionListener -> add Action Listener • 3. Several words, each beginning with a capital letter, combined together. Ex. MyFrame -> My Frame; SetActionCommand -> Set Action Command • 4. A word combined with numbers. Ex. Button1 -> Button 1; 1Button -> 1 Button; Button123 -> Button 123; 123Button -> 123 Button
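All four cases above can be handled by inserting spaces at case and letter/digit boundaries (our own sketch; `split_identifier` is a hypothetical name, not the study's actual code):

```python
import re

def split_identifier(name):
    """Split a programming identifier into its component words."""
    # Punctuation, whitespace, and underscores become separators.
    s = re.sub(r'[_\W]+', ' ', name)
    # Case 2: lower-case letter followed by a capital (setValue -> set Value).
    s = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', s)
    # Cases 1 & 3: a capital run followed by a capitalized word
    # (JJJJJJJButton -> JJJJJJJ Button; MyFrame already handled above).
    s = re.sub(r'(?<=[A-Z])(?=[A-Z][a-z])', ' ', s)
    # Case 4: boundaries between letters and digits (Button123 -> Button 123).
    s = re.sub(r'(?<=[A-Za-z])(?=[0-9])', ' ', s)
    s = re.sub(r'(?<=[0-9])(?=[A-Za-z])', ' ', s)
    return s.split()
```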
Remove Stop Words (1 of 2) • What are stop words? • The most common short function words, such as the, is, at, which, and, on. • Any special cases?
Remove Stop Words (2 of 2) • Extra stop words • Common programming words. Ex. public, private, class, new, etc. • A word frequency counter helps identify them.
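A word frequency counter for surfacing candidate extra stop words can be sketched as:

```python
from collections import Counter

def top_words(documents, n=5):
    """Count word frequencies across all documents; the most common
    words are candidates for the extra stop-word list, since words
    that appear everywhere (e.g. `public`) carry little information
    for distinguishing categories."""
    counts = Counter(word for doc in documents
                     for word in doc.lower().split())
    return counts.most_common(n)
```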
Overlapped Categories • Each category is treated as an independent label by default. • How to handle realistic, overlapping problems? – the Top N Method
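The Top N idea fits in a few lines (our own sketch): instead of requiring the classifier's single best label to match, a prediction counts as correct when the true label appears among the N highest-scoring labels.

```python
def top_n_correct(label_scores, true_label, n=3):
    """label_scores maps each label to the classifier's score;
    return True if the true label ranks within the top n."""
    ranked = sorted(label_scores, key=label_scores.get, reverse=True)
    return true_label in ranked[:n]
```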
Top N Method • [Diagram comparing the regular way (single best label) with the Top N method (top N candidate labels)]
Some of our test results • Tests are based on 10 different labels and 45 instances in total. • The following charts make the results easy to see at a glance.
Classify data with original MALLET Lowest: 19% Highest: 32% Average: 25.9%
After Stemming & Word Splitting Lowest: 26% Highest: 44% Average: 35%
After Removing Stop Words Lowest: 36% Highest: 62% Average: 45%
With the Top N Method Lowest: 54% Highest: 72% Average: 63.6%
Any way to improve accuracy further? • Highlight the key features. • Use only code data as training data.
Only Code Data • Delete all text other than code. • What is considered code? Code includes not only snippets longer than one line, but also class names, such as JButton and JActionListener, and method calls, such as addActionListener(aListener).
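One heuristic for deciding whether a token "counts as code" might look like this (our own assumption-laden sketch, not the study's actual filter):

```python
def looks_like_code(token):
    """Heuristic: a token is code-like if it is a method call
    (contains '(') or has an uppercase letter after its first
    character, as camelCase identifiers like JButton and
    addActionListener do. Plain prose words fail both tests."""
    has_call = "(" in token
    internal_upper = token[:1].isalpha() and any(c.isupper() for c in token[1:])
    return has_call or internal_upper
```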
Test Results for Code-only Data • All data with Top N (previous best): Lowest: 54% Highest: 72% Average: 63.6% • Code only: Lowest: 24% Highest: 42% Average: 34%
Why did this happen? • Code-only data is not enough. • We cannot remove too much data, especially data that actually contributes to feature selection. • Is our data size big enough?
Increase the Data Scale • What have we done? - Increased the total instances from 45 to 158 - Increased the number of labels from 10 to 17 • Data analysis and quality improvement, since categories may overlap
After Data Scale Increased • Before (45 instances): Lowest: 36% Highest: 62% Average: 45% • After (158 instances): Lowest: 51.88% Highest: 67.5% Average: 59.75%
After Data Scale Increased, with Top N • Before (45 instances): Lowest: 54% Highest: 72% Average: 63.6% • After (158 instances): Lowest: 66.25% Highest: 79.38% Average: 72.05%
Why did the accuracy increase? • The Naïve Bayes classifier uses a Gaussian distribution to represent the class-conditional probability of continuous attributes, so we wondered whether the frequency distribution of each word in the articles resembles a normal distribution. • Count the frequencies of each word in the articles to create a histogram, and see whether the histogram looks like a normal distribution.
Only Code Data Again • All data (Top N): Lowest: 66.25% Highest: 79.38% Average: 72.05% • Code only (Top N): Lowest: 51.54% Highest: 64.62% Average: 58.23%
Data Without Code • All data (Top N): Lowest: 66.25% Highest: 79.38% Average: 72.05% • Without code (Top N): Lowest: 67.5% Highest: 76.25% Average: 71.44%
Results Analysis • Unlike for human readers, code is not the decisive factor for the classifier. • In our prepared data, code is only a small part of each instance.
Compare to Maximum Entropy • Maximum Entropy: Lowest: 57.5% Highest: 70% Average: 63.06% • Naive Bayes: Lowest: 51.88% Highest: 67.5% Average: 59.75%
Maximum Entropy with Top N • Naive Bayes (Top N): Lowest: 66.25% Highest: 79.38% Average: 72.05% • Maximum Entropy (Top N): Lowest: 69.38% Highest: 83.76% Average: 78.63%