A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources • Carmen Banea, Rada Mihalcea, University of North Texas, carmenb@unt.edu, rada@cs.unt.edu • Janyce Wiebe, University of Pittsburgh, wiebe@cs.pitt.edu
Subjectivity analysis • Subjectivity analysis (opinions and sentiments) • Used in a wide variety of applications • Tracking sentiment timelines in news (Lloyd et al., 2005) • Review classification (Turney, 2002; Pang et al., 2002) • Mining opinions from product reviews (Hu and Liu, 2004) • Expressive text-to-speech synthesis (Alm et al., 2005) • Text semantic analysis (Wiebe and Mihalcea, 2006; Esuli and Sebastiani, 2006) • Question answering (Yu and Hatzivassiloglou, 2003) • Much work on subjectivity analysis has focused on English • Some work on other languages: Japanese (Takamura et al., 2006), Chinese (Hu et al., 2005), German (Kim and Hovy, 2006)
Proportion of Languages on the Web • [Figure: proportion of languages on the Web. Source: internetworldstats.com, updated November 30, 2007]
Objective • Develop a method for subjectivity analysis that • Requires few electronic resources • Can be easily ported to a new language • Is applicable to the large number of languages that have scarce electronic resources
Related Work • Tools that rely on manually or semi-automatically constructed lexicons • Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006 • Enable efficient rule-based subjectivity and sentiment classifiers that rely on the presence of lexicon entries in text • These tools assume the availability of • advanced language processing tools: • Syntactic parsers (Wiebe, 2000), Information extraction (Riloff and Wiebe, 2003) • broad-coverage rich lexical resources • WordNet (Esuli and Sebastiani, 2006) • Our approach relates most closely to the method of Turney (2002) for the construction of lexicons annotated for polarity • We address the task of acquiring a subjectivity lexicon • We rely on fewer, smaller-scale resources
Our Method • Based on bootstrapping • Requires: • A small seed set of subjective entries • One/multiple electronic dictionaries • A small training corpus (approx. 500,000 words) • Experiments focused on Romanian • Applicable to other languages as well
Bootstrapping Process • [Flowchart: seeds → query online dictionary → candidate synonyms → fixed filtering → selected synonyms → max. no. of iterations reached? (no: query again; yes: variable filtering)]
Seed Set • 60 seeds, evenly sampled from verbs, nouns, adjectives and adverbs • Manually selected • Seed sources: • 11th-grade curriculum for Romanian Language and Literature • Translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005)
Expansion • [Diagram: seed → definition → candidate synonyms] • Candidates: all open-class words, longer than 3 letters, that have a definition in the dictionary • Diacritics are removed • Romanian dictionary: http://www.dexonline.ro • Dictionaries for other languages are also available, or can be obtained from paper dictionaries through OCR
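The expansion step can be illustrated with a minimal sketch of how candidate words might be pulled out of a dictionary definition. Everything named here is an assumption made for illustration: `strip_diacritics`, `extract_candidates`, and the tiny `CLOSED_CLASS` stop list (a stand-in for a real open-class filter) do not come from the slides, and no actual dexonline.ro query is performed.

```python
import re
import unicodedata
from typing import List

# Hypothetical stop list standing in for a real open-class (part-of-speech) filter.
CLOSED_CLASS = {"pentru", "care", "este", "prin", "dupa"}


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'plăcut' becomes 'placut'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def extract_candidates(definition: str, min_len: int = 4) -> List[str]:
    """Return the unique words of a definition that are longer than 3 letters.

    Words are lowercased, stripped of diacritics, and returned in order of
    first appearance. Words in CLOSED_CLASS are dropped (assumed filter).
    """
    words = re.findall(r"[^\W\d_]+", strip_diacritics(definition.lower()))
    seen = set()
    candidates = []
    for word in words:
        if len(word) >= min_len and word not in CLOSED_CLASS and word not in seen:
            seen.add(word)
            candidates.append(word)
    return candidates
```

For instance, `extract_candidates("Cu gust dulce; plăcut, dulceag")` drops the two-letter word "cu" and keeps the remaining words without diacritics.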
Filtering • Candidates are filtered based on a measure of similarity with the original seeds • We use Latent Semantic Analysis (LSA) (Dumais et al., 1988) trained on the SemCor corpus (Miller et al., 1993) • After each iteration, only candidates with an LSA score higher than a given threshold are selected for further expansion • Example: • Seed: dulce (sweet) • Candidate synonyms: cu gust dulce (sweet-tasting), placut (pleasant), dulceag (quasi-sweet)
Filtering • Several iterations of the bootstrapping process result in a subjectivity lexicon consisting of a ranked list of candidates, in decreasing order of similarity to the original seeds • A variable filtering threshold can be used to further restrict the similarity and obtain a purer lexicon • Filtering parameters: • Similarity threshold • Number of iterations
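The loop described on the filtering slides can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `bootstrap`, `variable_filter`, the `lookup` callable (standing in for a dictionary query) and the `similarity` callable (standing in for an LSA score against the seed set) are hypothetical names, and the toy dictionary and scores are invented data. The sketch also assumes that only newly found candidates, not the seeds themselves, appear in the ranked list.

```python
from typing import Callable, Iterable, List, Tuple


def bootstrap(
    seeds: Iterable[str],
    lookup: Callable[[str], List[str]],
    similarity: Callable[[str], float],
    threshold: float = 0.5,
    max_iterations: int = 5,
) -> List[Tuple[str, float]]:
    """Expand the seeds and return (word, score) pairs, most similar first."""
    frontier = list(seeds)
    known = set(frontier)
    lexicon = {}
    for _ in range(max_iterations):
        selected = []
        for word in frontier:
            for candidate in lookup(word):
                if candidate in known:
                    continue  # already scored in an earlier step
                known.add(candidate)
                score = similarity(candidate)
                if score > threshold:  # fixed filtering
                    lexicon[candidate] = score
                    selected.append(candidate)
        if not selected:
            break  # nothing left to expand
        frontier = selected  # only selected candidates are expanded further
    return sorted(lexicon.items(), key=lambda item: item[1], reverse=True)


def variable_filter(ranked: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Apply a stricter threshold to an already ranked lexicon."""
    return [word for word, score in ranked if score > threshold]


# Invented toy data, for illustration only.
toy_dictionary = {
    "dulce": ["placut", "dulceag", "zahar"],
    "placut": ["agreabil", "dulce"],
}
toy_scores = {"placut": 0.8, "dulceag": 0.7, "zahar": 0.2, "agreabil": 0.6}

ranked = bootstrap(
    ["dulce"],
    lookup=lambda word: toy_dictionary.get(word, []),
    similarity=lambda word: toy_scores.get(word, 0.0),
)
purer = variable_filter(ranked, threshold=0.65)
```

Passing the dictionary lookup and the similarity measure as callables keeps the loop independent of any particular dictionary or LSA toolkit, which matches the stated goal of porting the method to other languages.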
Evaluation • Rule-based classifier of subjectivity • (Riloff and Wiebe, 2003) • Subjective sentence: three or more subjective entries • Objective sentence: two or fewer subjective entries • Gold standard data set • (Mihalcea, Banea and Wiebe, 2007) • 504 sentences from five SemCor documents (manually translated into Romanian) • Labeled by two annotators • Agreement (all): 83% (κ = 0.67) • Agreement (uncertain removed): 89% (κ = 0.77) • Baseline: 54% (all subjective)
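The classification rule above (three or more lexicon entries means subjective, otherwise objective) can be sketched in a few lines. The function name `classify_sentence`, the simple letter-run tokenizer, and the choice to count every matching token occurrence are assumptions made for illustration. The sketch handles single-word entries only and expects the sentence to be normalized the same way as the lexicon (for example, with diacritics removed).

```python
import re
from typing import Set


def classify_sentence(sentence: str, lexicon: Set[str], min_entries: int = 3) -> str:
    """Label a sentence by counting tokens that appear in the subjectivity lexicon."""
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())
    hits = sum(1 for token in tokens if token in lexicon)
    return "subjective" if hits >= min_entries else "objective"


# Invented toy lexicon, for illustration only.
toy_lexicon = {"dulce", "placut", "minunat"}
label_a = classify_sentence("Un gust dulce, placut si minunat.", toy_lexicon)
label_b = classify_sentence("Un gust dulce si placut.", toy_lexicon)
```

Here the first sentence contains three lexicon entries and the second only two, so they fall on opposite sides of the rule.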
Number of Iterations • [Figure: F-measure for the bootstrapping subjectivity lexicon over 5 iterations, with an LSA threshold of 0.5]
Similarity Threshold • [Figure: F-measure for the fifth bootstrapping iteration for varying LSA scores]
Comparison • Bootstrapping rule-based classifier: uses a 3,913-entry subjectivity lexicon obtained through 5 iterations and a similarity threshold of 0.5
Conclusions • Our bootstrapping method uses few electronic resources: • A small seed set • One/multiple dictionaries • A small corpus of half a million words • A large subjectivity lexicon of approx. 4000 entries was extracted • Using an unsupervised rule-based classifier, a subjectivity F-measure of 66.20% and an overall F-measure of 61.69% can be achieved