Concept Indexing for Automated Text Categorization Enrique Puertas Sanz epuertas@uem.es Universidad Europea de Madrid
OUTLINE • Motivation • Concept indexing with WordNet synsets • Concept indexing in ATC • Experiments set-up • Summary of results & discussion • Updated results • Conclusions & current work
MOTIVATION [Figure: ATC pipeline — pre-classified documents → representation & learning → classifier(s); new documents → representation → classification into categories] • Most popular & effective model for thematic ATC • IR-like text representation • ML feature selection, learning classifiers
MOTIVATION • Bag of Words weighting: Binary, TF, TF*IDF • Preprocessing: stoplist, stemming • Feature selection
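The bag-of-words weightings listed above can be sketched as follows. This is a minimal, stdlib-only illustration of TF*IDF (term frequency times inverse document frequency) over pre-tokenized documents; the function name and toy documents are illustrative, not from the original work.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF*IDF weights for a list of tokenized documents.

    TF is the raw term count in a document; IDF is log(N / df(t)),
    where df(t) is the number of documents containing term t.
    """
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Toy corpus: "wagon" appears in every document, so its IDF (and weight) is 0
docs = [["car", "wagon", "car"], ["train", "wagon"]]
w = tfidf(docs)
```

A binary representation would replace `tf[t]` with 1, and plain TF would drop the IDF factor; stoplisting and stemming would be applied to the token lists before weighting.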
MOTIVATION • Text representation requirements in thematic ATC • Semantic characterization of text content • Words convey an important part of the meaning • But we must deal with polysemy and synonymy • Must allow effective learning • Thousands to tens of thousands of attributes → noise (hurts effectiveness) & lack of efficiency
CONCEPT INDEXING WITH WORDNET SYNSETS • Using vectors of synsets instead of word stems • Ambiguous words mapped to their correct senses • Synonyms mapped to the same synsets • Example: "car", "automobile", "wagon" → synset N036030448 {automobile, car, wagon}; "wagon" (railway sense) → synset N206726781 {train wagon, wagon}
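The synonym-and-sense mapping described above can be sketched with a toy lexicon. The synset IDs below are the ones shown on the slide; the lexicon and function are illustrative stand-ins for a real WordNet lookup plus word sense disambiguation, which would choose between the two senses of "wagon" from context.

```python
# Toy synset lexicon (a real system would query WordNet and disambiguate)
SYNSETS = {
    "car": "N036030448",
    "automobile": "N036030448",  # synonym -> same synset as "car"
    "wagon": "N036030448",       # assuming the road-vehicle sense here
}

def index_by_synsets(tokens, lexicon):
    """Map each token to its synset ID; keep unknown words as-is."""
    return [lexicon.get(t, t) for t in tokens]

indexed = index_by_synsets(["car", "automobile", "train"], SYNSETS)
# "car" and "automobile" collapse into one indexing unit
```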
CONCEPT INDEXING WITH WORDNET SYNSETS • Considerable controversy in IR • Assumed potential for improving text representation • Mixed experimental results, ranging from • Very good [Gonzalo et al. 98] to bad [Voorhees 98] • Recent review in [Stokoe et al. 03] • A problem of state-of-the-art WSD effectiveness • But ATC is different!!!
CONCEPT INDEXING IN ATC • Apart from the potential... • We have much more information about ATC categories than about IR queries • WSD's lack of effectiveness can be less harmful because of term (feature) selection • But we have new problems!!! • Data sparseness & noise • Most terms are rare (Zipf's Law) → bad estimates • Categories with few documents → bad estimates, lack of information
CONCEPT INDEXING IN ATC • Concept indexing helps to solve both the IR & the new ATC problems • Text ambiguity in IR & ATC • Data sparseness & noise in ATC • Fewer indexing units of higher quality (after selection) → probably better estimates • Categories with few documents → why not enrich the representation with WordNet semantic relations? • Hyperonymy, meronymy, etc.
CONCEPT INDEXING IN ATC • Literature review • As in IR, mixed results, ranging from • Good [Fukumoto & Suzuki, 01] to bad [Scott, 98] • Notably, researchers use the words in synsets instead of the synset codes themselves • Still lacking: an evaluation of concept indexing in ATC over a representative range of selection strategies and learning algorithms
EXPERIMENTS SETUP • Primary goal • Comparing terms vs. correct synsets as indexing units • Requires a perfectly disambiguated collection (SemCor) • Secondary goals • Comparing perfect WSD with simple methods • More scalability, less accuracy • Comparing terms with/without stemming, stop-listing • Nature of SemCor (genre + topic classification)
EXPERIMENTS SETUP • Overview of parameters • Binary classifiers vs. multi-class classifiers • Three concept indexing representations • Correct WSD (CD) • WSD by POS Tagger (CF) • WSD by corpus frequency (CA)
EXPERIMENTS SETUP • Overview of parameters • Four term indexing representations • No Stemming, No StopList (BNN) • No Stemming, with Stoplist (BNS) • With Stemming, without Stoplist (BSN) • With Stemming and Stoplist (BSS)
EXPERIMENTS SETUP • Levels of selection with IG • No selection (NOS) • top 1% (S01) • top 10% (S10) • IG>0 (S00)
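Selection with information gain (IG), as used above, scores each term by how much knowing its presence reduces uncertainty about the category. This is a minimal, stdlib-only sketch for binary terms and binary labels; the toy documents and function names are illustrative.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a pos/neg count split, in bits."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def info_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t)H(C|t) + P(not t)H(C|not t)].

    docs: list of token sets; labels: parallel list of 0/1 categories.
    """
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    h_c = entropy(sum(labels), n - sum(labels))
    h_cond = 0.0
    for subset in (with_t, without):
        if subset:
            h_cond += (len(subset) / n) * entropy(sum(subset),
                                                  len(subset) - sum(subset))
    return h_c - h_cond

# "car" perfectly predicts category 1 in this toy corpus
docs = [{"car", "engine"}, {"car"}, {"train"}, {"train", "rail"}]
labels = [1, 1, 0, 0]
ig_car = info_gain(docs, labels, "car")
```

The selection levels on the slide would then keep the top 1% (S01), top 10% (S10), or all terms with IG > 0 (S00) after sorting by this score.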
EXPERIMENTS SETUP • Learning algorithms • Naïve Bayes • kNN • C4.5 • PART • SVMs • Adaboost+Naïve Bayes • Adaboost+C4.5
EXPERIMENTS SETUP • Evaluation metrics • F1 (harmonic mean of recall & precision) • Macroaverage • Microaverage • K-fold cross validation (k=10 in our experiments)
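The two averaging schemes above differ in how per-category results are pooled: macroaveraging averages the F1 of each category (all categories weigh equally), while microaveraging pools the raw counts first (frequent categories dominate). A minimal sketch, assuming per-category true positive / false positive / false negative counts:

```python
def f1(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_micro_f1(per_category):
    """per_category: list of (tp, fp, fn) tuples, one per category."""
    # Macro: average the per-category F1 scores
    macro = sum(f1(*c) for c in per_category) / len(per_category)
    # Micro: pool the counts, then compute one global F1
    tp = sum(c[0] for c in per_category)
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    micro = f1(tp, fp, fn)
    return macro, micro

# One perfect category and one completely missed category
macro, micro = macro_micro_f1([(10, 0, 0), (0, 5, 5)])
```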
SUMMARY OF RESULTS & DISCUSSION • Overview of results [Tables: binary classification and multi-class classification results]
SUMMARY OF RESULTS & DISCUSSION • CD > C* weakly supports that accurate WSD is required • BNN > B* does not hold, so results do not support that stemming & stop-listing are unnecessary • Genre/topic orientation • Most importantly, CD > B* does not hold, so results do not support that synsets are better indexing units than words (stemmed & stop-listed or not)
UPDATED RESULTS • Recent results combining synsets & words (no stemming, no stop-listing, binary problem) • NB: S00 • C4.5: S00, S01, S10 • SVM: S01 • AB+NB: S00, S10
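Combining synsets & words, as in the updated results above, amounts to concatenating the two feature spaces. A toy sketch (the prefixes and function name are illustrative, chosen only to keep word and synset features from clashing):

```python
def combine(word_vec, synset_vec):
    """Merge a word-feature dict and a synset-feature dict into one
    feature space, prefixing keys so the two vocabularies stay disjoint."""
    merged = {f"w:{k}": v for k, v in word_vec.items()}
    merged.update({f"s:{k}": v for k, v in synset_vec.items()})
    return merged

words = {"car": 2, "wagon": 1}
synsets = {"N036030448": 3}
merged = combine(words, synsets)
```

Feature selection (e.g. by information gain) would then operate over the combined space, letting the learner pick whichever representation is more informative per category.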
CONCLUSIONS & CURRENT WORK • Synsets are NOT a better representation on their own, but they IMPROVE the bag-of-words representation • We are testing semantic relations (hyperonymy) on SemCor • More work is required on Reuters-21578 • We will have to address WSD, initially with the approaches described in this work