ACL - 2007. PageRanking WordNet Synsets: An Application to Opinion Mining. Andrea Esuli and Fabrizio Sebastiani, Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche, Via Giuseppe Moruzzi, 1 – 56124 Pisa, Italy
ACL - 2007 PageRanking WordNet Synsets: An Application to Opinion Mining Andrea Esuli and Fabrizio Sebastiani Istituto di Scienza e Tecnologie dell’Informazione Consiglio Nazionale delle Ricerche Via Giuseppe Moruzzi, 1 – 56124 Pisa, Italy {andrea.esuli, fabrizio.sebastiani}@isti.cnr.it Advisor: Hsin-Hsi Chen Speaker: Yong-Sheng Lo Date: 2007/10/31
Introduction • Recent years have witnessed an explosion of work on opinion mining • An important part of this research has been the work on the automatic determination of the opinion-related properties (ORPs) of terms • ORPs = positive, negative, or neutral • polarity
Related work 1/2 • Traditional work • Determine the polarity of adjectives • Hatzivassiloglou and McKeown (1997) • Kamps et al. (2004) • Determine the polarity of generic terms • Turney and Littman (2003) • Kim and Hovy (2004) • Takamura et al. (2005)
Related work 2/2 • Recent work • Using glosses from online dictionary • Extend a set of terms of known positivity/negativity • Andreevskaia and Berger (2006a) • Determine the ORPs of generic terms • Esuli and Sebastiani (2005; 2006a) • Determining the ORPs of WordNet synsets (synonym sets) • Esuli and Sebastiani (2006b)
In this work • We have investigated the applicability of a random walk model to the problem of ranking synsets (synonym sets) according to positivity and negativity. • Using PageRank • Need nodes and links • Using eXtended WordNet version 2.0-1.1 • Based on WordNet version 2.0
eXtended WordNet (XWN) • The goal of this project is to develop a tool that takes as input the current or future versions of WordNet and automatically generates an eXtended WordNet that provides several important enhancements intended to remedy the present limitations of WordNet. • XWN has 4 files • adj.xml • adv.xml • noun.xml • verb.xml • How the information is represented in XWN ?
Graph generation for PageRank • The directed graph G = ( N, L ) • N (node) : The set of all WordNet synsets • 115,424 synsets • L (link) : From synset Si to synset Sk ( Si → Sk ) • iff the gloss of Si contains at least one term belonging to Sk • For example • the gloss of Si contains “ by a small margin ; …“ • Sk contains “small , …” • so Si → Sk
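The link rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the synset ids, the toy glosses, and the function name `build_gloss_graph` are all hypothetical, and gloss terms are matched to synsets by plain string equality (an assumption; the slides do not say how gloss terms are resolved to synsets).

```python
def build_gloss_graph(synsets):
    """Build links Si -> Sk: the gloss of Si contains a term that belongs to Sk.

    synsets: dict mapping a synset id to (set of member terms, list of gloss terms).
    Returns a dict mapping each synset id to the set of synset ids it links to.
    """
    # Index every member term by the synsets that contain it.
    term_to_synsets = {}
    for sid, (terms, _gloss) in synsets.items():
        for term in terms:
            term_to_synsets.setdefault(term, set()).add(sid)

    links = {sid: set() for sid in synsets}
    for sid, (_terms, gloss) in synsets.items():
        for term in gloss:
            for target in term_to_synsets.get(term, ()):
                if target != sid:  # self-loops are pruned
                    links[sid].add(target)
    return links


# Hypothetical toy data, loosely modelled on the "by a small margin" example.
toy_synsets = {
    "S_narrowly": ({"narrowly"}, ["by", "a", "small", "margin"]),
    "S_small": ({"small", "little"}, ["limited", "in", "size"]),
    "S_margin": ({"margin"}, ["a", "small", "amount"]),
}
toy_links = build_gloss_graph(toy_synsets)
```

In this toy graph, the synset whose gloss is "by a small margin" links to both the "small" synset and the "margin" synset, while the "small" synset has no outgoing links.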
The PageRank algorithm 1/4 • Input : • The row-normalized adjacency matrix (W) • Let Wo be the |N| X |N| adjacency matrix of G • |N| = # of synsets • Wo[ i,j ] = 1 iff there is a link from node i to node j • W is the row-normalized version of Wo • If Wo[ i,j ] = 1 then W[ i,j ] = 1 / | F(i) | • Else W[ i,j ] = 0 • B(i) = { nj | Wo[ j,i ] = 1 } : the nodes that link to node i • The set of the backward neighbors of ni • F(i) = { nj | Wo[ i,j ] = 1 } : the nodes that node i links to • The set of the forward neighbors of ni • Output : • A vector [ a1,……,a|N| ] • ai represents the score of node ni, i = 1~|N|
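As a small worked example of these definitions, the sketch below derives W, F(i) and B(i) for a three-node graph. The node names, the `links` dictionary, and both function names are made up for illustration, and a dense list-of-lists matrix is used only because the example is tiny (a sparse structure would be needed for 115,424 synsets).

```python
def row_normalize(links, nodes):
    """Return W where W[i][j] = 1 / |F(i)| if node i links to node j, else 0."""
    index = {node: pos for pos, node in enumerate(nodes)}
    size = len(nodes)
    W = [[0.0] * size for _ in range(size)]
    for src in nodes:
        forward = links.get(src, set())  # F(src): forward neighbors
        for dst in forward:
            W[index[src]][index[dst]] = 1.0 / len(forward)
    return W


def backward_neighbors(links, node):
    """B(node): every node that has a link into `node`."""
    return {src for src, targets in links.items() if node in targets}


# Hypothetical toy graph: n1 -> n2, n1 -> n3, n2 -> n3.
nodes = ["n1", "n2", "n3"]
links = {"n1": {"n2", "n3"}, "n2": {"n3"}, "n3": set()}
W = row_normalize(links, nodes)
B_n3 = backward_neighbors(links, "n3")
```

Here the row for n1 is [0, 0.5, 0.5] because n1 has two forward neighbors, and the row for n3 is all zeros because it has none.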
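The PageRank algorithm 2/4 • PageRank iteratively computes vector a (Equation 1) : • $a_i^{(k)} = \alpha \sum_{j \in B(i)} \frac{a_j^{(k-1)}}{|F(j)|} + (1-\alpha)\, e_i$ • $a_i^{(k)}$ is the score of node ni at the k-th iteration; α is a control parameter balancing the two terms • The value of ei amounts to an internal source of score for node i • It is constant (=1/|N|) across the iterations and independent from its backward neighbours • In vectorial form, Equation 1 can be written as • $a^{(k)} = \alpha\, a^{(k-1)} W + (1-\alpha)\, e$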
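The iteration can be sketched as follows, assuming the update takes the form the slides describe: each node receives α times the sum, over its backward neighbours j, of a_j / |F(j)|, plus (1 - α) times its own e_i. The function name, the default α = 0.85, the tolerance-based stopping rule, and starting from a = e are all assumptions made for this illustration; nodes without forward links are given no special treatment.

```python
def pagerank(links, e, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate a_i <- alpha * sum_{j in B(i)} a_j / |F(j)| + (1 - alpha) * e_i.

    links: dict node -> set of forward neighbors (every node must be a key).
    e:     dict node -> internal source of score for that node.
    """
    nodes = list(links)
    a = dict(e)  # assumed starting point
    for _ in range(max_iter):
        new_a = {node: (1 - alpha) * e[node] for node in nodes}
        for j in nodes:
            forward = links[j]
            for i in forward:  # j is a backward neighbor of i
                new_a[i] += alpha * a[j] / len(forward)
        delta = sum(abs(new_a[node] - a[node]) for node in nodes)
        a = new_a
        if delta < tol:  # assumed termination condition
            break
    return a


# Hypothetical chain n1 -> n2 -> n3 with all internal score placed on n1.
chain = {"n1": {"n2"}, "n2": {"n3"}, "n3": set()}
scores = pagerank(chain, {"n1": 1.0, "n2": 0.0, "n3": 0.0}, alpha=0.85)
```

On this chain the score injected at n1 flows along the links and is damped by α at every hop, so n1 ends up ranked above n2, and n2 above n3.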
The PageRank algorithm 3/4 • In this work • Using the ei values as internal sources of a given ORP (positivity or negativity) for node i • by attributing a null ei value to all but a few “seed” synsets known to possess that ORP • Simple procedure : • PageRank will thus make the ORP flow from the seed synsets, at a rate constant throughout the iterations, into other synsets along the Si → Sk relation, until a stable state is reached; the final ai values can be used to rank the synsets in terms of that ORP.
The PageRank algorithm 4/4 • Run 1: • Run 2:
Why PageRank ? 1/2 • If terms contained in synset Sk occur in the glosses of many positive synsets, and if the positivity scores of these synsets are high, then it is likely that Sk is itself positive (the same happens for negativity). • This justifies the summation of Equation 1.
Why PageRank ? 2/2 • If the gloss of a positive synset that contains a term in synset Sk also contains many other terms, then this is a weaker indication that Sk is itself positive • This justifies dividing by |F(j)| in Equation 1 • The ranking resulting from the algorithm needs to be biased in favour of a specific ORP • Synsets already known to possess the ORP receive higher scores • This justifies the presence of the ei factor in Equation 1
Full procedure 1/2 • (1) The graph G is generated • Numbers, articles and prepositions occurring in the glosses are discarded • Since they can be assumed to carry no positivity or negativity • This leaves only nouns, adjectives, verbs, and adverbs • (2) The row-normalized adjacency matrix W of G is derived • The graph G is “pruned” by removing “self-loops”
Full procedure 2/2 • (3) PageRank setting • The ei values are loaded into the e vector • All synsets other than the seed synsets of renowned positivity (negativity) are given a value of 0 • We experiment with several different versions of the e vector and several different values of α • (4) PageRank is executed using W and e, iterating until a predefined termination condition is reached • (5) We rank all the synsets of WordNet in descending order of their ai score • The process is run twice, once for positivity and once for negativity
Setup (e) 1/2 • e1(baseline) • all values uniformly set to 1/|N| • e2 • uniform non-null ei scores assigned to the synsets that contain the adjective good (bad) • null scores for all other synsets • e3 • uniform non-null ei scores assigned to the synsets that contain at least one of the seven “paradigmatic” positive (negative) adjectives • Positive : good, nice, excellent, positive, fortunate, correct, superior • Negative : bad, nasty, poor, negative, unfortunate, wrong, inferior • null scores for all other synsets
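A seed-based e vector such as e2 or e3 could be built along these lines. This is only a sketch: the synset ids and their member terms are invented, `seed_vector` is a hypothetical helper, and scaling the non-null entries so that they sum to 1 is an assumption (the slide only says the scores are uniform and non-null).

```python
POSITIVE_SEEDS = {"good", "nice", "excellent", "positive",
                  "fortunate", "correct", "superior"}


def seed_vector(synset_terms, seed_terms):
    """Give a uniform non-null score to synsets containing a seed term, 0 elsewhere."""
    seeded = {sid for sid, terms in synset_terms.items() if terms & seed_terms}
    # Assumption: the non-null entries are scaled to sum to 1.
    weight = 1.0 / len(seeded) if seeded else 0.0
    return {sid: (weight if sid in seeded else 0.0) for sid in synset_terms}


# Hypothetical synsets, each given as its set of member terms.
toy_synsets = {
    "s1": {"good", "well"},
    "s2": {"table"},
    "s3": {"nice", "pleasant"},
    "s4": {"bad"},
}
e2_positive = seed_vector(toy_synsets, {"good"})          # e2-style setup
e3_positive = seed_vector(toy_synsets, POSITIVE_SEEDS)    # e3-style setup
```

The negativity run would use the same helper with the negative adjectives as `seed_terms`.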
Setup (e) 2/2 • e4 • The score assigned to a synset (for ei) is proportional to the positivity (negativity) score assigned to it by SentiWordNet, and all entries sum up to 1. • Using SentiWordNet release 1.0 • SentiWordNet is a lexical resource in which each WordNet synset is given a positivity score, a negativity score, and a neutrality score. • e5 • Same as e4 • Using SentiWordNet release 1.1
SentiWordNet • Esuli and Sebastiani • LREC-06 • SentiWordNet is a lexical resource for opinion mining • SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity
The benchmark 1/4 • Micro-WNOp corpus (Cerini et al., 2007) • It consists of a set of 1,105 WordNet synsets, each of which was manually assigned scores • The corpus is divided into three parts : • Common: 110 synsets which all the evaluators evaluated by working together, so as to align their evaluation criteria. • Group1: 496 synsets which were each independently evaluated by three evaluators. • Group2: 499 synsets which were each independently evaluated by the other two evaluators.
The benchmark 2/4 • To ensure that the corpus is composed of synsets which are relevant to the opinion topic • It was generated by randomly selecting 100 positive + 100 negative + 100 objective terms from the General Inquirer (GI) lexicon (Turney and Littman, 2003) • and including all the synsets that contained at least one such term, without paying attention to Part-Of-Speech.
The benchmark 3/4 • How the information is represented in Micro-WNOp corpus ? • Score = 0 ~ 1
The benchmark 4/4 • In this work • We obtain the positivity (negativity) ranking from Micro-WNOp by averaging the positivity (negativity) scores assigned by the evaluators of each group into a single score, and by sorting the synsets according to the resulting score. • Using Group 1 as a validation set • In order to tune α • Using Group 2 as a test set
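The averaging-and-sorting step can be illustrated with a short sketch. The synset ids and evaluator scores below are invented, and `gold_ranking` is a hypothetical helper; how ties between averaged scores are ordered is not stated in the slides, so the sketch simply keeps Python's stable sort order.

```python
def gold_ranking(evaluator_scores):
    """Average the evaluators' scores per synset and sort in descending order.

    evaluator_scores: dict synset id -> list of scores in [0, 1], one per evaluator.
    Returns (ranked list of synset ids, dict of averaged scores).
    """
    averaged = {sid: sum(vals) / len(vals) for sid, vals in evaluator_scores.items()}
    ranked = sorted(averaged, key=lambda sid: averaged[sid], reverse=True)
    return ranked, averaged


# Hypothetical positivity scores from three evaluators (a Group1-style setup).
toy_scores = {
    "syn_a": [1.0, 0.5, 0.75],
    "syn_b": [0.0, 0.25, 0.0],
    "syn_c": [0.5, 0.5, 0.5],
}
ranking, mean_scores = gold_ranking(toy_scores)
```

The same helper would be applied separately to the positivity and the negativity scores to obtain the two gold rankings.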
The effectiveness measure • The p-normalized Kendall τ distance • $\tau_p = \frac{n_d + p \cdot n_u}{Z}$ • 0 ≤ τp ≤ 1 • Smaller is better • For example • If the two rankings agree completely: nd = nu = 0, so τp = 0 • nd : the number of discordant pairs • nu : the number of pairs ordered (i.e., not tied) in the gold standard and tied in the prediction • Z : the total number of pairs • p = 1/2
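The measure can be computed with a brute-force pass over all pairs, as in the sketch below. It assumes τp = (nd + p · nu) / Z with the quantities defined on the slide, and it takes Z to be the count of all pairs, as the slide describes; whether pairs tied in the gold standard should be excluded from Z is not stated here. The function name and the toy scores are hypothetical.

```python
from itertools import combinations


def kendall_tau_p(gold, pred, p=0.5):
    """p-normalized Kendall tau distance between two score assignments.

    gold, pred: dicts mapping the same item ids to scores (higher = ranked earlier).
    """
    n_d = n_u = z = 0
    for x, y in combinations(gold, 2):
        z += 1  # Z: total number of pairs, following the slide
        gold_diff = gold[x] - gold[y]
        pred_diff = pred[x] - pred[y]
        if gold_diff == 0:
            continue  # tied in the gold standard: counts toward neither n_d nor n_u
        if pred_diff == 0:
            n_u += 1  # ordered in gold, tied in the prediction
        elif (gold_diff > 0) != (pred_diff > 0):
            n_d += 1  # discordant pair
    return (n_d + p * n_u) / z if z else 0.0


# Hypothetical scores for three synsets.
gold = {"a": 3.0, "b": 2.0, "c": 1.0}
same = kendall_tau_p(gold, {"a": 0.9, "b": 0.5, "c": 0.1})
reversed_order = kendall_tau_p(gold, {"a": 0.1, "b": 0.5, "c": 0.9})
all_tied = kendall_tau_p(gold, {"a": 0.5, "b": 0.5, "c": 0.5})
```

Identical rankings give the minimum distance, a fully reversed ranking gives the maximum, and a prediction that ties everything falls halfway when p = 1/2.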
Conclusion • We argue that the binary relation ( Si → Sk ) is structurally akin to the relation between hyperlinked Web pages, and thus lends itself to PageRank analysis. • This paper thus presents a proof-of-concept of the model, and the results of experiments support our intuitions.
Reference • eXtended WordNet • http://xwn.hlt.utdallas.edu/ • SentiWordNet (registration required) • http://sentiwordnet.isti.cnr.it/ • The Micro-WNOp corpus • http://www.unipv.it/wnop/