Learn how to employ active learning for cross-lingual sentiment classification and control data quality from source language to target language for improved accuracy. Experimentation shows the effectiveness of this approach.
Employing Active Learning to Cross-Lingual Sentiment Classification with Data Quality Controlling Shoushan Li †‡ Rong Wang† Huanhuan Liu† Chu-Ren Huang‡ † Soochow University ‡ Hong Kong Polytechnic University
Outline • Introduction • Inadequacies of the Existing Work • Our Methods • Experimental Results • Conclusion
Introduction • Sentiment classification is the task of predicting the sentiment orientation (e.g., positive or negative) of a given text. • However, labeled resources are highly imbalanced across languages. • For example, because most sentiment classification research has focused on English, labeled data in English is abundant, while labeled data in many other languages is very limited.
Introduction (Cont.) • Cross-lingual sentiment classification aims to predict the sentiment orientation of a text in one language (the target language) with the help of resources from another language (the source language).
Inadequacies of the Existing Work • Classification performance using only the labeled data in the source language remains far from satisfactory because of large differences in linguistic expression and social culture. • One challenge in active learning-based cross-lingual sentiment classification is the heavy imbalance between the labeled data from the source and target languages. • With such an imbalance, the small amount of labeled target data is easily swamped by the abundant labeled source data, which greatly reduces the contribution of the labeled data in the target language.
Our Methods • We propose a certainty-based quality measurement (the intra-quality measurement), used together with cross-validation, to select high-quality samples in the source language. • We propose a similarity measurement (the extra-quality measurement) to select the samples in the source language that are similar to those in the target language. • For a given sample in the target language, these two measurements are integrated to select high-quality samples in the source language. • After obtaining the high-quality samples in the source language, we employ standard uncertainty sampling for active learning-based cross-lingual sentiment classification.
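The uncertainty sampling step in the last bullet can be sketched as follows. This is a minimal illustration for the binary case, assuming the classifier outputs P(positive | x) for each unlabeled target sample; the function name `uncertainty_sample` and its inputs are hypothetical and not taken from the slides.

```python
def uncertainty_sample(posteriors, batch_size):
    """Return indices of the unlabeled samples the classifier is least sure about.

    posteriors: list of P(positive | x), one value per unlabeled sample
    (hypothetical input; any probabilistic classifier could produce it).
    """
    # For binary labels, uncertainty is highest when P(positive) is close to 0.5.
    ranked = sorted(range(len(posteriors)), key=lambda i: abs(posteriors[i] - 0.5))
    return ranked[:batch_size]


if __name__ == "__main__":
    probs = [0.95, 0.52, 0.10, 0.45, 0.70]
    # The two samples closest to 0.5 would be sent to the annotator first.
    print(uncertainty_sample(probs, 2))
```

In an active learning loop, the returned samples would be labeled manually, added to the training data, and the classifier retrained before the next round.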
Intra-quality Measurement • It employs only the data in the source language to measure the quality of the samples in the source language. • We first split the labeled data from the source language into two parts: one serves as the training data and the other serves as the validation data. • Then, we use the training data to train a classifier, which is used to predict the labels of the samples in the validation data. • After the prediction process, we assume that the samples with high posterior probabilities are capable of representing the classification knowledge in the training data.
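The split, train, and predict procedure above might look like the sketch below. It rests on several assumptions that the slides do not specify: a small Naive Bayes classifier with add-one smoothing stands in for whatever classifier the authors used, folds are formed by index modulo `n_folds`, and the certainty of a sample is taken to be the larger of its two class posteriors. All function names are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Collect the counts needed for multinomial Naive Bayes (labels are 0 or 1)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for tokens, y in zip(docs, labels):
        word_counts[y].update(tokens)
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def posterior_positive(model, tokens):
    """Return P(label = 1 | tokens) with add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for y in (0, 1):
        total_words = sum(word_counts[y].values())
        score = math.log((class_counts[y] + 1) / (total_docs + 2))
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[y][w] + 1) / (total_words + len(vocab)))
        log_scores[y] = score
    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    exp = {y: math.exp(s - top) for y, s in log_scores.items()}
    return exp[1] / (exp[0] + exp[1])


def intra_quality(docs, labels, n_folds=2):
    """Score each source sample by classifier certainty when it is held out."""
    scores = [0.0] * len(docs)
    for fold in range(n_folds):
        held_out = [i for i in range(len(docs)) if i % n_folds == fold]
        train = [i for i in range(len(docs)) if i % n_folds != fold]
        model = train_nb([docs[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            p_pos = posterior_positive(model, docs[i])
            # Assumption: certainty = posterior of the predicted class.
            scores[i] = max(p_pos, 1.0 - p_pos)
    return scores


if __name__ == "__main__":
    docs = [["good", "great"], ["good", "nice"], ["bad", "awful"],
            ["bad", "poor"], ["bad", "awful", "poor"], ["good", "great"]]
    labels = [1, 1, 0, 0, 0, 1]
    for doc, score in zip(docs, intra_quality(docs, labels)):
        print(doc, round(score, 3))
```

Source samples would then be ranked by this score, with the highest-scoring ones treated as high-quality candidates.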
Integrating Intra- and Extra-Quality Measurements • When integrating the two measurements, we treat the certainty measurement as the main ranking factor and the similarity measurement as a supplementary one. • Input: translated training data from the source language; testing data from the target language • Output: the selected data set
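One way to combine the two measurements with the input and output listed above is sketched here. The slides do not give the exact integration procedure, so this is only an assumed scheme: certainty first filters the source samples down to the `top_n` most certain ones, and then, for each target document, the `k` candidates with the highest cosine similarity are kept. The names `cosine`, `select_source`, `top_n`, and `k` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca if w in cb)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def select_source(source_docs, certainty, target_docs, top_n, k):
    """Pick source samples: certainty filters first, similarity refines.

    Hypothetical integration scheme, not necessarily the authors' algorithm.
    """
    # Main factor: keep only the top_n most certain source samples.
    by_certainty = sorted(range(len(source_docs)),
                          key=lambda i: certainty[i], reverse=True)
    candidates = by_certainty[:top_n]
    selected = set()
    # Supplementary factor: for each target document, keep its k most similar candidates.
    for target in target_docs:
        nearest = sorted(candidates,
                         key=lambda i: cosine(source_docs[i], target),
                         reverse=True)
        selected.update(nearest[:k])
    return sorted(selected)


if __name__ == "__main__":
    source = [["good", "screen"], ["bad", "plot"],
              ["fast", "laptop"], ["slow", "laptop"]]
    certainty = [0.9, 0.95, 0.8, 0.6]
    target = [["laptop", "fast", "screen"]]
    print(select_source(source, certainty, target, top_n=3, k=1))
```

In this toy run, the fourth source sample overlaps with the target document but is excluded because its certainty is too low, which reflects certainty acting as the main factor.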
Active Learning-based Cross-lingual Sentiment Classification
Experimental Settings • Labeled Data in the Source Language: English reviews from four domains: Book (B), DVD (D), Electronics (E) and Kitchen (K). Each domain contains 1000 positive and 1000 negative reviews. All these labeled samples are translated into Chinese with Google Translate. • Testing Data in the Target Language: Chinese reviews from IT168 and Chinese reviews from 360BUY, together with 2000 unlabeled reviews. • Unlabeled Data in the Target Language: We select 500 positive and 500 negative reviews as the unlabeled samples for active learning.
Experimental Results • Table 1: The classification performance when using all 8000 samples in the source domain • Four approaches: Random + No_source; Uncertainty + No_source; Uncertainty + All_source; Uncertainty + Selected_source
Conclusion • We propose an active learning approach for cross-lingual sentiment classification and address the challenge of data imbalance by controlling data quality in the source language. Experimentation verifies the appropriateness of active learning for cross-lingual sentiment classification. • In future work, we would like to improve the extra-quality measurement to make it more effective at selecting high-quality samples. We will also apply data quality control to other cross-lingual NLP tasks.