Concept Indexing for Automated Text Categorization Enrique Puertas Sanz epuertas@uem.es Universidad Europea de Madrid
OUTLINE • Motivation • Concept indexing with WordNet synsets • Concept indexing in ATC • Experiments set-up • Summary of results & discussion • Updated results • Conclusions & current work
MOTIVATION [Figure: ATC pipeline — pre-classified documents → representation & learning → classifier(s); new documents → representation → classification into categories] • Most popular & effective model for thematic ATC • IR-like text representation • ML feature selection, learning classifiers
MOTIVATION • Bag of Words weighting: Binary, TF, TF*IDF • Preprocessing: stoplist, stemming • Feature selection
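The bag-of-words weightings listed above can be sketched as follows. This is a minimal, stdlib-only illustration of TF*IDF (term frequency times inverse document frequency) over pre-tokenized documents; the function name and toy documents are illustrative, not from the original work.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF*IDF weights for a list of tokenized documents.

    TF is the raw term count in a document; IDF is log(N / df(t)),
    where df(t) is the number of documents containing term t.
    """
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Toy corpus: "wagon" appears in every document, so its IDF (and weight) is 0
docs = [["car", "wagon", "car"], ["train", "wagon"]]
w = tfidf(docs)
```

A binary representation would replace `tf[t]` with 1, and plain TF would drop the IDF factor; stoplisting and stemming would be applied to the token lists before weighting.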
MOTIVATION • Text representation requirements in thematic ATC • Semantic characterization of text content • Words convey an important part of the meaning • But we must deal with polysemy and synonymy • Must allow effective learning • Thousands to tens of thousands of attributes → noise (hurts effectiveness) & lack of efficiency
CONCEPT INDEXING WITH WORDNET SYNSETS • Using vectors of synsets instead of word stems • Ambiguous words mapped to their correct senses • Synonyms mapped to the same synsets • Example: "car", "automobile", "wagon" → synset N036030448 {automobile, car, wagon}; "wagon" (railway sense) → synset N206726781 {train wagon, wagon}
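The synonym-and-sense mapping described above can be sketched with a toy lexicon. The synset IDs below are the ones shown on the slide; the lexicon and function are illustrative stand-ins for a real WordNet lookup plus word sense disambiguation, which would choose between the two senses of "wagon" from context.

```python
# Toy synset lexicon (a real system would query WordNet and disambiguate)
SYNSETS = {
    "car": "N036030448",
    "automobile": "N036030448",  # synonym -> same synset as "car"
    "wagon": "N036030448",       # assuming the road-vehicle sense here
}

def index_by_synsets(tokens, lexicon):
    """Map each token to its synset ID; keep unknown words as-is."""
    return [lexicon.get(t, t) for t in tokens]

indexed = index_by_synsets(["car", "automobile", "train"], SYNSETS)
# "car" and "automobile" collapse into one indexing unit
```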
CONCEPT INDEXING WITH WORDNET SYNSETS • Considerable controversy in IR • Assumed potential for improving text representation • Mixed experimental results, ranging from • Very good [Gonzalo et al. 98] to bad [Voorhees 98] • Recent review in [Stokoe et al. 03] • A problem of state-of-the-art WSD effectiveness • But ATC is different!!!
CONCEPT INDEXING IN ATC • Apart from the potential... • We have much more information about ATC categories than about IR queries • WSD's lack of effectiveness can be less harmful because of term (feature) selection • But we have new problems!!! • Data sparseness & noise • Most terms are rare (Zipf's Law) → bad estimates • Categories with few documents → bad estimates, lack of information
CONCEPT INDEXING IN ATC • Concept indexing helps to solve both the IR & the new ATC problems • Text ambiguity in IR & ATC • Data sparseness & noise in ATC • Fewer indexing units of higher quality (after selection) → probably better estimates • Categories with few documents → why not enrich the representation with WordNet semantic relations? • Hyperonymy, meronymy, etc.
CONCEPT INDEXING IN ATC • Literature review • As in IR, mixed results, ranging from • Good [Fukumoto & Suzuki, 01] to bad [Scott, 98] • Notably, researchers use the words in synsets instead of the synset codes themselves • Still lacking: an evaluation of concept indexing in ATC over a representative range of selection strategies and learning algorithms
EXPERIMENTS SETUP • Primary goal • Comparing terms vs. correct synsets as indexing units • Requires a perfectly disambiguated collection (SemCor) • Secondary goals • Comparing perfect WSD with simple methods • More scalability, less accuracy • Comparing terms with/without stemming, stop-listing • Nature of SemCor (genre + topic classification)
EXPERIMENTS SETUP • Overview of parameters • Binary classifiers vs. multi-class classifiers • Three concept indexing representations • Correct WSD (CD) • WSD by POS Tagger (CF) • WSD by corpus frequency (CA)
EXPERIMENTS SETUP • Overview of parameters • Four term indexing representations • No Stemming, No StopList (BNN) • No Stemming, with Stoplist (BNS) • With Stemming, without Stoplist (BSN) • With Stemming and Stoplist (BSS)
EXPERIMENTS SETUP • Levels of selection with IG • No selection (NOS) • top 1% (S01) • top 10% (S10) • IG>0 (S00)
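Selection with information gain (IG), as used above, scores each term by how much knowing its presence reduces uncertainty about the category. This is a minimal, stdlib-only sketch for binary terms and binary labels; the toy documents and function names are illustrative.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a pos/neg count split, in bits."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def info_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t)H(C|t) + P(not t)H(C|not t)].

    docs: list of token sets; labels: parallel list of 0/1 categories.
    """
    with_t = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    h_c = entropy(sum(labels), n - sum(labels))
    h_cond = 0.0
    for subset in (with_t, without):
        if subset:
            h_cond += (len(subset) / n) * entropy(sum(subset),
                                                  len(subset) - sum(subset))
    return h_c - h_cond

# "car" perfectly predicts category 1 in this toy corpus
docs = [{"car", "engine"}, {"car"}, {"train"}, {"train", "rail"}]
labels = [1, 1, 0, 0]
ig_car = info_gain(docs, labels, "car")
```

The selection levels on the slide would then keep the top 1% (S01), top 10% (S10), or all terms with IG > 0 (S00) after sorting by this score.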
EXPERIMENTS SETUP • Learning algorithms • Naïve Bayes • kNN • C4.5 • PART • SVMs • Adaboost+Naïve Bayes • Adaboost+C4.5
EXPERIMENTS SETUP • Evaluation metrics • F1 (harmonic mean of recall & precision) • Macroaverage • Microaverage • K-fold cross validation (k=10 in our experiments)
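The two averaging schemes above differ in how per-category results are pooled: macroaveraging averages the F1 of each category (all categories weigh equally), while microaveraging pools the raw counts first (frequent categories dominate). A minimal sketch, assuming per-category true positive / false positive / false negative counts:

```python
def f1(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_micro_f1(per_category):
    """per_category: list of (tp, fp, fn) tuples, one per category."""
    # Macro: average the per-category F1 scores
    macro = sum(f1(*c) for c in per_category) / len(per_category)
    # Micro: pool the counts, then compute one global F1
    tp = sum(c[0] for c in per_category)
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    micro = f1(tp, fp, fn)
    return macro, micro

# One perfect category and one completely missed category
macro, micro = macro_micro_f1([(10, 0, 0), (0, 5, 5)])
```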
SUMMARY OF RESULTS & DISCUSSION • Overview of results [Tables: binary classification and multi-class classification results]
SUMMARY OF RESULTS & DISCUSSION • CD > C* weakly supports that accurate WSD is required • BNN > B* does not hold, so results do not support that stemming & stop-listing are unnecessary • Genre/topic orientation • Most importantly, CD > B* does not hold, so results do not support that synsets are better indexing units than words (stemmed & stop-listed or not)
UPDATED RESULTS • Recent results combining synsets & words (no stemming, no stop-listing, binary problem) • NB: S00 • C4.5: S00, S01, S10 • SVM: S01 • AB+NB: S00, S10
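Combining synsets & words, as in the updated results above, amounts to concatenating the two feature spaces. A toy sketch (the prefixes and function name are illustrative, chosen only to keep word and synset features from clashing):

```python
def combine(word_vec, synset_vec):
    """Merge a word-feature dict and a synset-feature dict into one
    feature space, prefixing keys so the two vocabularies stay disjoint."""
    merged = {f"w:{k}": v for k, v in word_vec.items()}
    merged.update({f"s:{k}": v for k, v in synset_vec.items()})
    return merged

words = {"car": 2, "wagon": 1}
synsets = {"N036030448": 3}
merged = combine(words, synsets)
```

Feature selection (e.g. by information gain) would then operate over the combined space, letting the learner pick whichever representation is more informative per category.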
CONCLUSIONS & CURRENT WORK • Synsets are NOT a better representation on their own, but they IMPROVE the bag-of-words representation • We are testing semantic relations (hyperonymy) on SemCor • More work is required on Reuters-21578 • We will have to address WSD, initially with the approaches described in this work