NLDB04 – 9th International Conference on Applications of Natural Language to Information Systems Using IR techniques to improve Automated Text Classification Teresa Gonçalves, Paulo Quaresma tcg@di.uevora.pt, pq@di.uevora.pt Departamento de Informática Universidade de Évora, Portugal
Overview
• Application area: Text Categorisation
• Research area: Machine Learning
• Paradigm: Support Vector Machines
• Written languages: European Portuguese and English
• Study: evaluation of preprocessing techniques
Datasets
• Multilabel classification datasets
  • Each document can be classified into multiple concepts
• Written language
  • European Portuguese: PAGOD dataset
  • English: Reuters dataset
PAGOD dataset
• Represents the decisions of the Portuguese Attorney General's Office since 1940
• Characteristics
  • 8151 documents, 96 MB of characters
  • 68886 distinct words
  • Average document: 1339 words, 306 distinct
  • Taxonomy of 6000 concepts, of which around 3000 are used
• 5 most used concepts (number of documents): 909, 680, 497, 410, 409
Reuters-21578 dataset
• Originally collected by the Carnegie Group from the Reuters newswire in 1987
• Characteristics
  • 9603 training documents, 3299 test documents (ModApté split)
  • 31715 distinct words
  • Average document: 126 words, 70 distinct
  • Taxonomy of 135 concepts, 90 of which appear in the train/test sets
• 5 most used concepts (number of documents)
  • Train set: 2861, 1648, 534, 428, 385
  • Test set: 1080, 718, 179, 148, 186
Experiments
• Document representation: bag-of-words
  • Retain word frequencies
  • Discard words that contain digits
• Algorithm: linear SVM (WEKA software package)
• Classes of preprocessing experiments
  • Feature reduction/construction
  • Feature subset selection
  • Term weighting
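A minimal sketch of this basic setup, assuming scikit-learn as a stand-in for the WEKA package used in the paper; the documents, labels and regular expression below are illustrative only (the paper trains one binary classifier per concept for the multilabel task).

```python
# Bag-of-words with raw term frequencies, discarding tokens that contain digits,
# followed by a linear SVM. Toy data; not the authors' actual corpus or settings.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["the court ruled on the pension claim",
        "the officer requested a pension review",
        "trade figures rose in the first quarter"]
labels = [1, 1, 0]  # toy single-concept labels; in practice one classifier per concept

# token pattern keeps alphabetic tokens only, mimicking "discard words with digits"
vectorizer = CountVectorizer(token_pattern=r"(?u)\b[^\W\d]+\b", lowercase=True)
X = vectorizer.fit_transform(docs)      # rows = documents, columns = word counts

clf = LinearSVC()                       # linear-kernel SVM
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["pension claim review"])))
```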
Feature reduction/construction
• Uses linguistic information
  • red1: no use of linguistic information
  • red2: remove a list of non-relevant words (articles, pronouns, adverbs and prepositions)
  • red3: remove the red2 word list and transform each word into its lemma (its stem for the English dataset)
• Resources
  • Portuguese: POLARIS, a Portuguese lexical database
  • English: FreeWAIS stop-list and the Porter algorithm
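A hedged sketch of the three English reduction levels, using NLTK's stop-word list and Porter stemmer as stand-ins for the FreeWAIS stop-list and stemming step described above; the function name and example tokens are illustrative.

```python
# red1: keep tokens as-is; red2: drop non-relevant words; red3: red2 + stemming.
from nltk.corpus import stopwords       # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def reduce_tokens(tokens, level):
    if level == "red1":                                   # no linguistic information
        return tokens
    kept = [t for t in tokens if t.lower() not in STOP]   # red2: remove stop-words
    if level == "red2":
        return kept
    return [stemmer.stem(t) for t in kept]                # red3: red2 + Porter stemming

print(reduce_tokens(["the", "decisions", "of", "the", "office"], "red3"))
# e.g. ['decis', 'offic']
```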
Feature subset selection
• Uses a filtering approach
• Keeps the features (words) that receive higher scores
• Scoring functions
  • scr1: term frequency
  • scr2: mutual information
  • scr3: gain ratio
• Threshold value
  • scr1: the number of times each word appears in all documents
  • scr2, scr3: the same number of features as scr1
• Experiments: sel1, sel50, sel100, sel200, sel400, sel800, sel1200, sel1600
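A sketch of the filtering idea for the mutual-information scoring (scr2), assuming scikit-learn's estimator as a stand-in for the paper's exact scoring; keeping the top-k features approximates the threshold-based selection described above, and all names below are hypothetical.

```python
# Score every word against the class labels and keep the k best-scoring ones.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_top_k(X, y, k):
    """X: documents x words count matrix, y: class labels; keep the k best words."""
    scores = mutual_info_classif(X, y, discrete_features=True)
    keep = np.argsort(scores)[::-1][:k]     # indices of the k highest-scoring features
    return X[:, keep], keep

# usage (X, labels from the bag-of-words step): X_sel, kept = select_top_k(X, labels, k=400)
```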
Number of attributes (with respect to threshold values)
Term weighting
• Uses the document, collection and normalisation components
  • wgt1: binary representation, no collection component, normalised to unit length
  • wgt2: raw term frequency, no collection or normalisation component
  • wgt3: term frequency, no collection component, normalised to unit length
  • wgt4: term frequency divided by the collection component and normalised to unit length
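A hedged numpy sketch of the four weighting schemes applied to a document-term count matrix; the collection component is taken here to be document frequency, which is an assumption about the exact formula used in the paper.

```python
import numpy as np

def weight(tf, scheme):
    """tf: documents x words count matrix; returns the weighted matrix for wgt1-wgt4."""
    tf = tf.astype(float)
    if scheme == "wgt2":                        # raw term frequency, no normalisation
        return tf
    if scheme == "wgt1":                        # binary representation
        w = (tf > 0).astype(float)
    elif scheme == "wgt3":                      # term frequency
        w = tf
    elif scheme == "wgt4":                      # tf divided by the collection component
        df = (tf > 0).sum(axis=0)               # assumed collection component: document frequency
        w = tf / np.maximum(df, 1)
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, 1e-12)         # normalise each document to unit length
```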
Experimental results
• Method
  • PAGOD: 10-fold cross-validation
  • Reuters: train and test sets (ModApté split)
• Measures
  • Precision, recall and F1
  • Micro- and macro-averaging over the top 5 concepts
  • Significance tests at 95% confidence
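A short sketch of the evaluation measures, assuming scikit-learn's metrics and toy multilabel indicator vectors over five concepts; the values shown are not the paper's results.

```python
# Precision, recall and F1 with micro- and macro-averaging over the top-5 concepts.
from sklearn.metrics import precision_recall_fscore_support

y_true = [[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]]   # toy ground truth, one column per concept
y_pred = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 1]]   # toy classifier output

for avg in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```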
PAGOD dataset: results
Reuters dataset: results
Results
• PAGOD
  • Best combination: scr1 – red2 – wgt3 – sel1
  • Worst values: scr3 and wgt2 experiments
• Reuters
  • Best combination: scr2 – (red1, red3) – (wgt3, wgt4) – sel400
  • Worst values: scr3 and wgt2 experiments
Discussion
• Worse values for PAGOD: written language? more difficult concepts to learn? more imbalanced dataset?
• Best experiments differ between the two datasets: written language? area of the written documents?
• SVM deals well with non-informative and non-independent features in different languages
Future work
• Explore
  • the impact of the imbalanced nature of the datasets
  • the use of morpho-syntactic information
  • other datasets
• Try more powerful document representations
Scoring functions
• scr1: term frequency
  • The score is the number of times the feature appears in the dataset
• scr2: mutual information
  • Evaluates the worth of a feature A by measuring its mutual information with the class C, I(C;A)
• scr3: gain ratio
  • The worth is the attribute's gain ratio with respect to the class
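A hedged sketch of these scores for a single binary word feature against a binary class, using the standard entropy-based definitions (I(C;A) = H(C) − H(C|A), gain ratio = I(C;A)/H(A)); the function names and arrays are illustrative only.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def feature_scores(a, c):
    """a, c: 0/1 arrays (one entry per document) for the feature A and class C."""
    a, c = np.asarray(a), np.asarray(c)
    n = len(a)
    scr1 = a.sum()            # with 0/1 presence this is document frequency;
                              # raw counts would give the paper's term frequency
    p_a = np.bincount(a, minlength=2) / n
    p_c = np.bincount(c, minlength=2) / n
    h_c_given_a = 0.0
    for i in (0, 1):          # conditional entropy H(C|A)
        if p_a[i] > 0:
            p_c_given_a = np.bincount(c[a == i], minlength=2) / (a == i).sum()
            h_c_given_a += p_a[i] * entropy(p_c_given_a)
    scr2 = entropy(p_c) - h_c_given_a                          # mutual information I(C;A)
    scr3 = scr2 / entropy(p_a) if entropy(p_a) > 0 else 0.0    # gain ratio
    return scr1, scr2, scr3

print(feature_scores([1, 1, 0, 0], [1, 1, 0, 0]))   # perfectly informative feature
```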