An EM based training algorithm for Cross-Lingual Text Categorization

Bing Liu
Department of Computer Science, University of Illinois at Chicago, Chicago, Illinois, USA
liub@cs.uic.edu

Leonardo Rigutini and Marco Maggini
Department of Information Engineering, University of Siena, Siena, Italy
{rigutini,maggini}@dii.unisi.it
Outline
• From Cross-Lingual Information Retrieval (CLIR) to Cross-Lingual Text Categorization (CLTC)
• CLTC based on learning set translation:
  • The basic algorithm
  • The improved algorithm
• Experimental results
• Conclusions
Cross-Lingual Text Categorization
• Due to globalization, many companies and institutions need to efficiently organize and search repositories containing multilingual documents
• Managing these heterogeneous text collections significantly increases costs, because experts in different languages are required to organize them
• Cross-Lingual Text Categorization provides techniques to extend existing automatic classification systems in one language to new languages without requiring additional intervention by human experts
Approaches
• CLTC is closely related to Cross-Lingual Information Retrieval (CLIR):
  • Poly-Lingual: the data consist of documents in different languages; a single classifier is trained on all of them
  • Cross-Lingual: the language is identified and a translation step is required. This approach includes three different solutions:
    • Training set translation
    • Test set translation
    • "Esperanto"
a) Poly-Lingual
• Advantages:
  • Conceptually simple method
  • A single classifier is used
  • Quite good performance
• Drawbacks:
  • Requires many labeled documents in the learning set for each language
  • High dimensionality of the dictionary:
    • union of n vocabularies
    • many terms shared between two languages
  • Feature selection is difficult due to the coexistence of many different languages
1) Cross-Lingual: training set translation
• The classifier is trained on a learning set in language L1 translated into L2, where L2 is the language of the unlabeled data:
  • the translated learning set is highly noisy and the classifier shows poor performance
• The system works on documents in language L2:
  • the number of translations is lower than in the test set translation approach
2) Cross-Lingual: test set translation
• The model is trained on documents in language L1 without translation:
  • the training data is not corrupted by translation noise
• The unlabeled documents in language L2 are translated into language L1:
  • the translation step is highly time consuming
  • translation quality is poor and it introduces much noise
  • a filtering phase on the test data after translation is needed
• The translated documents are categorized by the classifier trained in language L1:
  • possible inconsistency between training and unlabeled data
3) Cross-Lingual: "Esperanto"
• All documents, in every language, are translated into a new universal language, "Esperanto" (LE):
  • the new language should preserve all the semantic features of each language
  • very difficult to design
  • a large amount of knowledge about each language is needed
• The system works in the new universal language:
  • it requires translating both the training set and the test set
  • very time consuming
The approaches
In Cross-Lingual Text Categorization:
• Poly-Lingual approach:
  • n mono-lingual text categorization problems, one for each language
  • it requires a training set for each language, i.e. experts who label documents in every language
• Cross-Lingual approaches:
  • Test set translation:
    • requires translating the whole test set, which is time consuming
  • Esperanto:
    • very time consuming and requires a large amount of knowledge about each language
  • Training set translation:
    • requires only a labeled training set in the source language, which is then translated (introducing noise)
Feasible approaches to CLTC
• "Given a predefined category organization for documents in a well-known language L1, the task is to classify documents in an unknown language L2 according to that organization, without manually labeling data in L2, since that requires experts in the language and is expensive."
• Thus:
  • The Poly-Lingual approach is not usable in this case, since it requires a learning set in the unknown language L2
  • The "Esperanto" approach is also not feasible, since it needs knowledge about all the languages
  • Only the training set and test set translation approaches can be used for this type of problem
Learning from labeled and unlabeled examples
• Learning from labeled and unlabeled examples:
  • use a small initial labeled dataset to extract information for categorization from a large unlabeled dataset
• EM algorithm:
  • E step: the data are labeled using the parameter configuration estimated by the M step at iteration t-1 (M_{t-1})
  • M step: the model is updated assuming the labels assigned by the E step at iteration t (E_t) to be correct
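As a minimal sketch of this E/M alternation for semi-supervised text classification (the `Classifier` interface with fit()/predict() is a hypothetical placeholder, not the authors' code):

```python
# Minimal sketch of the E/M alternation for semi-supervised text
# classification. The classifier is a hypothetical object exposing
# fit()/predict() (e.g. a Naive Bayes model), not the authors' code.

def em_train(classifier, labeled_docs, labeled_y, unlabeled_docs, max_iter=20):
    # Initialize the model parameters on the (small) labeled set.
    classifier.fit(list(labeled_docs), list(labeled_y))
    prev_labels = None
    for t in range(max_iter):
        # E step: label the unlabeled data with the model from step M_{t-1}.
        labels = list(classifier.predict(unlabeled_docs))
        # M step: re-estimate the model assuming those labels are correct.
        classifier.fit(list(labeled_docs) + list(unlabeled_docs),
                       list(labeled_y) + labels)
        # Stop when the labeling no longer changes between iterations.
        if labels == prev_labels:
            break
        prev_labels = labels
    return classifier
```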
Cross-Lingual Text Categorization
• The problem can be summarized as:
  • we have a small labeled dataset in language L1
  • we want to categorize a large unlabeled dataset in language L2
  • we do not want to use experts for language L2
• The idea is:
  • translate the small training set into language L2
  • initialize an EM algorithm with these very noisy data
  • reinforce the behavior of the classifier using the unlabeled data in language L2
The basic algorithm
[Diagram] The labeled training set Tr1 (language L1) is translated into Tr12 (language L2); the classifier C21 is trained on Tr12 and then refined by EM iterations on the unlabeled set Ts2: the E step labels Ts2 with the current model E(t), the M step updates the classifier, producing the final results.
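A rough sketch of this pipeline, reusing the em_train routine sketched above (the translation function and the classifier are hypothetical stand-ins for a machine translation system and the Naive Bayes model, not the authors' implementation):

```python
# Sketch of the basic algorithm: translate the labeled set from L1 to L2,
# initialize the classifier on the (noisy) translated data Tr12, then run
# EM on the unlabeled L2 documents Ts2. `translate_l1_to_l2` and `classifier`
# are hypothetical stand-ins.

def basic_cltc(tr1_docs, tr1_labels, ts2_docs, translate_l1_to_l2, classifier):
    # Translation step: Tr1 (language L1) -> Tr12 (language L2).
    tr12_docs = [translate_l1_to_l2(doc) for doc in tr1_docs]
    # EM iterations on the unlabeled target-language set Ts2, starting
    # from the classifier trained on the translated (noisy) data.
    return em_train(classifier, tr12_docs, tr1_labels, ts2_docs)
```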
Problems in the basic algorithm
• Once the classifier is trained, it can be used to label a larger dataset:
  • the algorithm can start from a small initial dataset, which is an advantage since our initial dataset is very noisy
• PROBLEMS:
  • Multilingual data:
    • Temporal dependency: documents about the same topic at different times deal with different themes, events, etc.
    • Geographical dependency: documents about the same topic in different places deal with different people, facts, events, etc.
  • The goal is to find the terms that discriminate each topic independently of time and place
Problems
• Translation:
  • translation systems perform very poorly, especially when the text is badly written:
    • Named Entity Recognition (NER): words that should not be translated, or different words referring to the same entity
    • Word-sense disambiguation: a fundamental problem in translation
• Algorithm:
  • EM usually tends to form a few large central classes and many small peripheral classes containing outliers:
    • this depends on the starting point and on the noise added to each class at every EM iteration
Improved algorithm using feature selection
[Diagram] The translated training set Tr12 is first filtered by FS1 to initialize the classifier C21; at each EM iteration the E step labels Ts2 with the current model E(t), the filter FS2 selects the most discriminative words, and the M step updates the classifier, producing the final results.
The filter FS1
• Highly selective, since the data consist of translated text and are very noisy
• Initializes the EM process by selecting the most informative words in the translated training set Tr12
The filter FS2
• It has a regularizing effect on the EM algorithm:
  • it selects the most discriminative words according to the classification produced by the previous E step
  • in fact, the non-significant words do not influence the updating of the centroids during the EM iterations
• This filter can be less selective than FS1:
  • in fact, it works on the original data, which are not corrupted by translation noise
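A sketch of how the two filters could fit into the EM loop (the `select_features` and `project` helpers are hypothetical, e.g. an Information Gain ranking and a vocabulary restriction; the classifier is again a placeholder with fit()/predict(), not the authors' code):

```python
# Sketch of the improved algorithm: FS1 is an aggressive feature selection
# on the translated training set Tr12; FS2 is a milder one, recomputed at
# every EM iteration from the labels assigned by the E step. k1 and k2 are
# the numbers of retained terms (the experiments use k1=300, k2=1000).

def improved_cltc(tr12_docs, tr12_labels, ts2_docs, classifier,
                  select_features, project, k1=300, k2=1000, max_iter=20):
    # FS1: highly selective filter on the noisy translated data.
    vocab = select_features(tr12_docs, tr12_labels, k=k1)
    classifier.fit(project(tr12_docs, vocab), tr12_labels)
    for t in range(max_iter):
        # E step: label the original (untranslated) L2 documents.
        labels = classifier.predict(project(ts2_docs, vocab))
        # FS2: less selective filter, recomputed on the L2 data using the
        # labels assigned by the E step (regularizes the M step).
        vocab = select_features(ts2_docs, labels, k=k2)
        # M step: re-estimate the model on the filtered L2 data.
        classifier.fit(project(ts2_docs, vocab), labels)
    return classifier, vocab
```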
Previous work
• In a work at the University of Barcelona, the authors used the ILO corpus in two languages (English and Spanish) to test three different approaches to CLTC (Nuria et al., 2003):
  • Poly-lingual
  • Test set translation
  • Profile-based translation
• They used the Winnow (ANN) and Rocchio algorithms and compared the results with the monolingual test
• Performance: 70%-75% accuracy
Multi-lingual dataset
• Very few multi-lingual datasets are available:
  • none including the Italian language
• We built the dataset by crawling newsgroups:
  • availability of the same groups in different languages
  • large number of available messages
  • different levels for each topic
• Multi-lingual dataset composition:
  • two languages: Italian (LI) and English (LE)
  • three groups: auto, hardware and sport
Multi-lingual dataset
• Drawbacks:
  • very short messages
  • informal documents:
    • slang terms
    • badly written words
  • often transversal topics:
    • advertising, spam, other current topics (for example, the U.S. elections)
  • temporal dependency: the same topic at two different times deals with different problems
  • geographical dependency: the same topic in two different places deals with different people, facts, etc.
Experimental results
• We used a Multinomial Naive Bayes classifier with Good-Turing smoothing as the classifier in the EM algorithm
• We used Information Gain as the feature selection method for FS1 and FS2
• The results are averaged over a ten-fold cross validation
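For reference, a self-contained sketch of Information Gain scoring over a bag-of-words corpus (a standard formulation over binary term presence, not the authors' implementation):

```python
import math
from collections import Counter, defaultdict

def information_gain(docs, labels):
    """Score each term by Information Gain over binary term presence.

    `docs` is a list of token lists, `labels` the class of each document.
    Standard formulation: IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t).
    """
    n = len(docs)
    class_counts = Counter(labels)
    # H(C): entropy of the class distribution.
    h_c = -sum((c / n) * math.log2(c / n) for c in class_counts.values())

    # df[term][class] = number of documents of that class containing the term.
    df = defaultdict(Counter)
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            df[term][y] += 1

    scores = {}
    for term, per_class in df.items():
        n_t = sum(per_class.values())      # documents containing the term
        n_not = n - n_t                    # documents not containing it
        # Conditional entropy H(C | term present).
        h_t = -sum((c / n_t) * math.log2(c / n_t)
                   for c in per_class.values() if c > 0)
        # Conditional entropy H(C | term absent).
        h_not = 0.0
        for cls, total in class_counts.items():
            miss = total - per_class.get(cls, 0)
            if miss > 0 and n_not > 0:
                h_not -= (miss / n_not) * math.log2(miss / n_not)
        scores[term] = h_c - (n_t / n) * h_t - (n_not / n) * h_not
    return scores
```

Keeping only the k highest-scoring terms (k1 for FS1, k2 for FS2) would then implement the selection step used in the sketches above.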
Monolingual and multilingual tests
[Diagrams] In the monolingual test, the classifier CI is trained on the Italian training set TrI and evaluated on the Italian test set TsI; in the baseline multilingual test, the English training set TrE is translated into Italian (TrEI) and the classifier CEI is evaluated on TsI.
• Monolingual test:
  • no translation, training set and test set both in the Italian language
• Baseline multilingual test:
  • translation from English to Italian
Basic algorithm
[Diagram] The English training set TrE is translated into Italian (TrEI); the classifier CEI is initialized on TrEI and refined by EM iterations on the Italian test set TsI (E step labels TsI with the current model E(t), M step updates the classifier).
• Translation from English to Italian
Improved algorithm
[Diagram] Same setup as the basic algorithm, with Information Gain feature selection: IG retaining k1 terms on the translated training set TrEI, and IG retaining k2 terms at each EM iteration on TsI.
• Translation from English to Italian
• k1 = 300
• k2 = 1000
Conclusions
• The filtered EM algorithm performs very well
• It does not need an initial labeled dataset in the desired language:
  • no other proposed algorithm has this feature
• It achieves good results starting from a few translated documents:
  • it does not require much time for translation
• The algorithm can be tested with a different classifier (e.g. SVM) or a different feature selection method (a different informativeness function or LSI)
Thank you for your attention
rigutini@dii.unisi.it
maggini@dii.unisi.it