NLDB04 – 9th International Conference on Applications of Natural Language to Information Systems Using IR techniques to improve Automated Text Classification Teresa Gonçalves, Paulo Quaresma tcg@di.uevora.pt, pq@di.uevora.pt Departamento de Informática Universidade de Évora, Portugal
Overview
• Application area: Text Categorisation
• Research area: Machine Learning
• Paradigm: Support Vector Machines
• Written languages: European Portuguese and English
• Study: evaluation of preprocessing techniques
Datasets
• Multilabel classification datasets
  • Each document can be classified into multiple concepts
• Written language
  • European Portuguese: PAGOD dataset
  • English: Reuters dataset
PAGOD dataset
• Represents the decisions of the Portuguese Attorney General's Office since 1940
• Characteristics
  • 8151 documents, 96 MB of characters
  • 68886 distinct words
  • Average document: 1339 words, 306 distinct
  • Taxonomy of 6000 concepts, of which around 3000 are used
• 5 most used concepts (number of documents): 909, 680, 497, 410, 409
Reuters-21578 dataset
• Originally collected by the Carnegie Group from the Reuters newswire in 1987
• Characteristics
  • 9603 training documents, 3299 test documents (ModApté split)
  • 31715 distinct words
  • Average document: 126 words, 70 distinct
  • Taxonomy of 135 concepts, 90 of which appear in the train/test sets
• 5 most used concepts (number of documents)
  • Train set: 2861, 1648, 534, 428, 385
  • Test set: 1080, 718, 179, 148, 186
Experiments
• Document representation: bag-of-words
  • Retain word frequencies
  • Discard words that contain digits
• Algorithm: linear SVM (WEKA software package)
• Classes of preprocessing experiments
  • Feature reduction/construction
  • Feature subset selection
  • Term weighting
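A minimal sketch of this basic setup, assuming scikit-learn as a stand-in for the WEKA package used in the paper; the documents, labels and regular expression below are illustrative only (the paper trains one binary classifier per concept for the multilabel task).

```python
# Bag-of-words with raw term frequencies, discarding tokens that contain digits,
# followed by a linear SVM. Toy data; not the authors' actual corpus or settings.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["the court ruled on the pension claim",
        "the officer requested a pension review",
        "trade figures rose in the first quarter"]
labels = [1, 1, 0]  # toy single-concept labels; in practice one classifier per concept

# token pattern keeps alphabetic tokens only, mimicking "discard words with digits"
vectorizer = CountVectorizer(token_pattern=r"(?u)\b[^\W\d]+\b", lowercase=True)
X = vectorizer.fit_transform(docs)      # rows = documents, columns = word counts

clf = LinearSVC()                       # linear-kernel SVM
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["pension claim review"])))
```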
Feature reduction/construction
• Uses linguistic information
  • red1: no use of linguistic information
  • red2: remove a list of non-relevant words (articles, pronouns, adverbs and prepositions)
  • red3: remove the red2 word list and transform each word into its lemma (its stem for the English dataset)
• Resources
  • Portuguese: POLARIS, a Portuguese lexical database
  • English: FreeWAIS stop-list and the Porter algorithm
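A hedged sketch of the three English reduction levels, using NLTK's stop-word list and Porter stemmer as stand-ins for the FreeWAIS stop-list and stemming step described above; the function name and example tokens are illustrative.

```python
# red1: keep tokens as-is; red2: drop non-relevant words; red3: red2 + stemming.
from nltk.corpus import stopwords       # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
stemmer = PorterStemmer()

def reduce_tokens(tokens, level):
    if level == "red1":                                   # no linguistic information
        return tokens
    kept = [t for t in tokens if t.lower() not in STOP]   # red2: remove stop-words
    if level == "red2":
        return kept
    return [stemmer.stem(t) for t in kept]                # red3: red2 + Porter stemming

print(reduce_tokens(["the", "decisions", "of", "the", "office"], "red3"))
# e.g. ['decis', 'offic']
```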
Feature subset selection
• Uses a filtering approach
• Keeps the features (words) that receive higher scores
• Scoring functions
  • scr1: term frequency
  • scr2: mutual information
  • scr3: gain ratio
• Threshold value
  • scr1: the number of times each word appears in all documents
  • scr2, scr3: the same number of features as scr1
• Experiments: sel1, sel50, sel100, sel200, sel400, sel800, sel1200, sel1600
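A sketch of the filtering idea for the mutual-information scoring (scr2), assuming scikit-learn's estimator as a stand-in for the paper's exact scoring; keeping the top-k features approximates the threshold-based selection described above, and all names below are hypothetical.

```python
# Score every word against the class labels and keep the k best-scoring ones.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_top_k(X, y, k):
    """X: documents x words count matrix, y: class labels; keep the k best words."""
    scores = mutual_info_classif(X, y, discrete_features=True)
    keep = np.argsort(scores)[::-1][:k]     # indices of the k highest-scoring features
    return X[:, keep], keep

# usage (X, labels from the bag-of-words step): X_sel, kept = select_top_k(X, labels, k=400)
```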
Number of attributes (with respect to threshold values)
Term weighting
• Uses the document, collection and normalisation components
  • wgt1: binary representation, no collection component, normalised to unit length
  • wgt2: raw term frequency, no collection or normalisation component
  • wgt3: term frequency, no collection component, normalised to unit length
  • wgt4: term frequency divided by the collection component and normalised to unit length
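A hedged numpy sketch of the four weighting schemes applied to a document-term count matrix; the collection component is taken here to be document frequency, which is an assumption about the exact formula used in the paper.

```python
import numpy as np

def weight(tf, scheme):
    """tf: documents x words count matrix; returns the weighted matrix for wgt1-wgt4."""
    tf = tf.astype(float)
    if scheme == "wgt2":                        # raw term frequency, no normalisation
        return tf
    if scheme == "wgt1":                        # binary representation
        w = (tf > 0).astype(float)
    elif scheme == "wgt3":                      # term frequency
        w = tf
    elif scheme == "wgt4":                      # tf divided by the collection component
        df = (tf > 0).sum(axis=0)               # assumed collection component: document frequency
        w = tf / np.maximum(df, 1)
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, 1e-12)         # normalise each document to unit length
```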
Experimental results
• Method
  • PAGOD: 10-fold cross-validation
  • Reuters: train and test sets (ModApté split)
• Measures
  • Precision, recall and F1
  • Micro- and macro-averaging over the top 5 concepts
  • Significance tests at 95% confidence
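A short sketch of the evaluation measures, assuming scikit-learn's metrics and toy multilabel indicator vectors over five concepts; the values shown are not the paper's results.

```python
# Precision, recall and F1 with micro- and macro-averaging over the top-5 concepts.
from sklearn.metrics import precision_recall_fscore_support

y_true = [[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]]   # toy ground truth, one column per concept
y_pred = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 1]]   # toy classifier output

for avg in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```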
PAGOD dataset: results
Reuters dataset: results
Results
• PAGOD
  • Best combination: scr1 – red2 – wgt3 – sel1
  • Worst values: scr3 and wgt2 experiments
• Reuters
  • Best combination: scr2 – (red1, red3) – (wgt3, wgt4) – sel400
  • Worst values: scr3 and wgt2 experiments
Discussion
• Worse values for PAGOD: written language? more difficult concepts to learn? more imbalanced dataset?
• Best experiments differ between the two datasets: written language? area of the written documents?
• SVM deals well with non-informative and non-independent features in different languages
Future work
• Explore
  • the impact of the imbalanced nature of the datasets
  • the use of morpho-syntactic information
  • other datasets
• Try more powerful document representations
Scoring functions
• scr1: term frequency
  • The score is the number of times the feature appears in the dataset
• scr2: mutual information
  • Evaluates the worth of a feature A by measuring its mutual information with the class C, I(C;A)
• scr3: gain ratio
  • The worth is the attribute's gain ratio with respect to the class
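A hedged sketch of these scores for a single binary word feature against a binary class, using the standard entropy-based definitions (I(C;A) = H(C) − H(C|A), gain ratio = I(C;A)/H(A)); the function names and arrays are illustrative only.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def feature_scores(a, c):
    """a, c: 0/1 arrays (one entry per document) for the feature A and class C."""
    a, c = np.asarray(a), np.asarray(c)
    n = len(a)
    scr1 = a.sum()            # with 0/1 presence this is document frequency;
                              # raw counts would give the paper's term frequency
    p_a = np.bincount(a, minlength=2) / n
    p_c = np.bincount(c, minlength=2) / n
    h_c_given_a = 0.0
    for i in (0, 1):          # conditional entropy H(C|A)
        if p_a[i] > 0:
            p_c_given_a = np.bincount(c[a == i], minlength=2) / (a == i).sum()
            h_c_given_a += p_a[i] * entropy(p_c_given_a)
    scr2 = entropy(p_c) - h_c_given_a                          # mutual information I(C;A)
    scr3 = scr2 / entropy(p_a) if entropy(p_a) > 0 else 0.0    # gain ratio
    return scr1, scr2, scr3

print(feature_scores([1, 1, 0, 0], [1, 1, 0, 0]))   # perfectly informative feature
```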