Entity Set Expansion in Opinion Documents Lei Zhang Bing Liu University of Illinois at Chicago
Introduction to opinion mining • Opinion mining: the computational study of opinions and sentiments expressed in text • Why opinion mining now? Mainly because of the Web: we can now obtain huge volumes of opinionated text
Why opinion mining is important • Whenever we need to make a decision, we would like to hear others' advice. In the past: • Individuals: friends or family. • Businesses: surveys and consultants. • Now: word of mouth on the Web. People can express their opinions in reviews, forum discussions, blogs…
What is an entity in opinion documents • An entity can be a product, service, person, organization or event discussed in an opinion document. • Basically, opinion mining extracts the opinions expressed on entities and their attributes. • “I bought a Sony camera yesterday, and its picture quality is great.” Picture quality is the product attribute; Sony is the entity.
Why we need entity extraction • Without knowing the entity, an opinion has little value. • Companies want to know their competitors in the market. Entity extraction is the first step toward understanding the competitive landscape from opinion documents.
Related work Named entity recognition (NER) aims to identify entities such as names of persons, organizations and locations in natural language text. Our problem is similar to NER, but with some differences. • Fine-grained entity classes (products, services) rather than coarse-grained entity classes (people, locations, organizations) • We only want a specific type: e.g., a particular type of drug names. • Neologisms: e.g., “Sammy” (Sony), “SE” (Sony-Ericsson) • Feature sparseness (lack of contextual patterns) • Data noise (over-capitalization, under-capitalization)
NER methods • Supervised learning methods The currently dominant techniques for the NER problem: Hidden Markov Models (HMM) Maximum Entropy models (ME) Support Vector Machines (SVM) Conditional Random Fields (CRF) Shortcomings: they rely on large sets of labeled examples, and labeling is labor-intensive and time-consuming.
NER methods • Unsupervised learning methods Mainly clustering: gathering named entities from clustered groups based on the similarity of their contexts. These techniques rely on lexical resources (e.g., WordNet), on lexical patterns, and on statistics computed over a large unannotated corpus. Shortcomings: low precision and recall.
NER methods • Semi-supervised learning methods Show promise for identifying and labeling entities. Starting with a set of seed entities, semi-supervised methods use either class-specific patterns to populate an entity class or distributional similarity to find terms similar to the seeds. Specific methods: Bootstrapping Co-training Distributional similarity
Our problem is a set expansion problem • To find competing entities, the extracted entities must be relevant, i.e., they must be of the same class/type as the user-provided entities. • The user can only provide a few names because there are so many different brands and models. • Our problem is thus a set expansion problem, which expands a set of given seed entities.
Set expansion problem • Given a set Q of seed entities of a particular class C, and a set D of candidate entities, we wish to determine which of the entities in D belong to C. That is, we “grow” the class C based on the set of seed examples Q. • This is a classification problem. However, in practice, the problem is often solved as a ranking problem.
Distributional similarity • Distributional similarity is a classical method for the set expansion problem. (Basic idea: words with similar meanings tend to appear in similar contexts.) • It compares the distribution of the words surrounding a candidate entity with that of the seed entities, and then ranks the candidate entities by their similarity values, as sketched below. • Our results show that this approach is inaccurate.
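A minimal sketch of this baseline (function and variable names are ours; the slides do not give an implementation): collect context-word counts around each entity mention, pool the seed contexts, and rank candidates by cosine similarity. Multi-word entities are simplified to single-token matching here.

```python
from collections import Counter
import math

def context_vector(entity, sentences, window=3):
    """Count words within `window` positions of each single-token mention of `entity`."""
    vec = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok.lower() == entity.lower():
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                vec.update(t.lower() for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return vec

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def rank_by_similarity(seeds, candidates, sentences):
    """Rank candidates by cosine similarity of their contexts to the pooled seed context."""
    seed_vec = Counter()
    for s in seeds:
        seed_vec.update(context_vector(s, sentences))
    scores = {c: cosine(context_vector(c, sentences), seed_vec) for c in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```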
Bayesian sets • Based on Bayesian inference and designed specifically for the set expansion problem. • It learns from a seed set (i.e., a positive set P) and an unlabeled candidate set U.
Bayesian sets • Given a candidate set D and a query (seed) set Q ⊂ D, we aim to rank the elements e ∈ D by how well they would “fit into” a set that includes Q • Define a score for each candidate: score(e) = p(e | Q) / p(e) • From Bayes' rule, the score can be rewritten as: score(e) = p(e, Q) / (p(e) p(Q))
Bayesian sets • Intuitively, the score compares the probability that e and Q were generated by the same model with the same unknown parameters θ, to the probability that e and Q came from models with different parameters θ and θ′.
Bayesian sets For binary feature vectors with independent Beta-Bernoulli models, compute the following posterior quantities from the N seed entities q1, …, qN: α̃j = αj + Σi qij and β̃j = βj + N − Σi qij, with αj = k·mj and βj = k·(1 − mj), where mj is the mean of feature j over the data and k is a scaling factor.
Bayesian sets • The final score can be computed as: log score(e) = c + Σj wj·ej, where wj = log α̃j − log αj − log β̃j + log βj and c = Σj [log(αj + βj) − log(αj + βj + N) + log β̃j − log βj] α and β are hyperparameters obtained from the data. The top-ranked entities should be highly related to the seed set Q according to the Bayesian Sets algorithm.
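A compact sketch of the scoring computation above for binary feature vectors (our own vectorized rendering; the default k = 1 follows the scaling factor mentioned later in these slides, and the clipping is an implementation choice, not from the slides):

```python
import numpy as np

def bayesian_sets_scores(seed_vecs, candidate_vecs, k=1.0):
    """Log Bayesian Sets score for each candidate (rows are binary feature vectors)."""
    data = np.vstack([seed_vecs, candidate_vecs])
    m = data.mean(axis=0).clip(1e-6, 1 - 1e-6)   # mean m_j of each feature j
    alpha, beta = k * m, k * (1 - m)             # hyperparameters from the data

    N = seed_vecs.shape[0]                       # number of seed entities
    s = seed_vecs.sum(axis=0)                    # sum_i q_ij
    alpha_t = alpha + s                          # alpha~_j
    beta_t = beta + N - s                        # beta~_j

    c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
               + np.log(beta_t) - np.log(beta))
    w = np.log(alpha_t) - np.log(alpha) - np.log(beta_t) + np.log(beta)
    return c + candidate_vecs @ w                # log score(e) = c + sum_j w_j e_j
```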
Direct application of Bayesian sets For seeds and candidate entities, the feature vector is created as follows: (1) A set of features is first designed to represent each entity. (2) For each entity, identify all the sentences in the corpus that contain the entity. Based on their contexts, produce a single feature vector to represent the entity. But this produces poor results. (Reasons: first, Bayesian Sets uses binary features, so the multiple occurrences of an entity in the corpus, which give rich contextual information, are not fully exploited; second, the number of seeds is very small, so the result is not reliable.)
Improving Bayesian sets We propose a more sophisticated way to use Bayesian Sets. It consists of the following two steps: (1) Feature identification: design a set of features to represent each entity. (2) Data generation: (a) multiple feature vectors per entity, (b) feature reweighting, (c) candidate entity ranking.
How to get candidate entities Part-of-speech (POS) tags – NNP, NNPS and CD serve as entity indicators; a phrase (possibly one word) with a sequence of NNP, NNPS and CD POS tags is treated as one candidate entity, as in the sketch below. (e.g., “Nokia/NNP N97/CD” is a single entity “Nokia N97”)
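A small sketch of this extraction step, assuming NLTK's tokenizer and POS tagger (the slides do not name a specific tagger):

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

ENTITY_TAGS = {"NNP", "NNPS", "CD"}

def extract_candidates(sentence):
    """Return maximal runs of NNP/NNPS/CD-tagged tokens as candidate entities."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    candidates, current = [], []
    for word, tag in tagged:
        if tag in ENTITY_TAGS:
            current.append(word)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates

# e.g. extract_candidates("I bought the Nokia N97 yesterday") should yield ["Nokia N97"]
```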
How to identify features • Like a typical learning algorithm, one has to design a set of features for learning. Our feature set consists of two subsets (illustrated below): • Entity word features (EWF): characterize the words of the entities themselves. This set of features is completely domain independent. (e.g., “Sony”, “IBM”) • Surrounding word features (SWF): the words surrounding a candidate entity. (e.g., “I bought the Sony tv yesterday”)
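An illustrative sketch of the two feature families; the concrete feature templates (capitalization, digits, context window size) are our assumptions, not the paper's exact list:

```python
def entity_word_features(entity):
    """EWF: domain-independent properties of the entity words themselves (illustrative)."""
    words = entity.split()
    return {
        "ewf_all_capitalized": all(w[0].isupper() for w in words),
        "ewf_has_digit": any(ch.isdigit() for w in words for ch in w),
        "ewf_multiword": len(words) > 1,
    }

def surrounding_word_features(tokens, start, end, window=3):
    """SWF: binary indicators for the words around tokens[start:end] (the entity)."""
    feats = {}
    for i in range(max(0, start - window), min(len(tokens), end + window)):
        if not (start <= i < end):
            feats["swf_" + tokens[i].lower()] = True
    return feats

# e.g. surrounding_word_features("I bought the Sony tv yesterday".split(), 3, 4)
# marks "i", "bought", "the", "tv" and "yesterday" as context features of "Sony"
```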
Data generation • For each candidate entity, several feature vectors are generated, which causes problems of feature sparseness and entity ranking. • We propose two techniques to deal with these problems: (1) feature reweighting and (2) candidate entity ranking.
Feature reweighting • Recall the feature weight in the Bayesian Sets score: wj = log(1 + Σi qij / (k·mj)) − log(1 + (N − Σi qij) / (k·(1 − mj))) N is the number of items in the seed set. qij is feature j of seed entity qi. mj is the mean of feature j over all possible entities. k is a scaling factor (we use 1). In order to make a positive contribution to the final score of entity e, wj must be greater than zero.
Feature reweighting This means that for feature j to be effective (wj > 0), the seed data mean must be greater than the candidate data mean on feature j. Due to the idiosyncrasy of the data, there are many high-quality features whose seed data mean is nonetheless less than the candidate data mean.
Feature reweighting For example, in the drug data set, “prescribe” is a very good entity feature: “prescribe EN/NNP” (where EN represents an entity and NNP is its POS tag) strongly suggests that EN is a drug. However, the mean of this feature in the seed set is 0.024, which is less than its candidate set mean of 0.025. This makes it worse than having no feature at all!
Feature reweighting In order to fully utilize all features, we lower the original mean mj to m′j = mj / t, where t is a scaling factor chosen to force all feature weights wj > 0. The idea is that we lower the candidate data mean intentionally so that all the features found from the seed data can be utilized. To determine t, we require Σi qij / m′j to be greater than N for every feature j that occurs in the seed set, as sketched below.
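A sketch of choosing t, assuming the derivation above (wj > 0 iff Σi qij > N·m′j); the safety margin and the guard for empty seed features are our additions:

```python
import numpy as np

def rescaled_means(seed_vecs, m):
    """Lower m_j to m'_j = m_j / t so that w_j > 0 for every feature in the seed data.

    From the derivation above, w_j > 0 iff sum_i q_ij > N * m'_j, so t must
    exceed N * m_j / sum_i q_ij for every feature that occurs in the seed set.
    """
    N = seed_vecs.shape[0]
    s = seed_vecs.sum(axis=0)                # sum_i q_ij
    present = s > 0                          # features that occur in the seed data
    if not present.any():
        return m
    t = (N * m[present] / s[present]).max() * 1.01  # margin for strict inequality
    return m / max(t, 1.0)                   # only ever lower the means
```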
Identifying high-quality features Features A and B with the same feature frequency receive the same feature weight. But in some cases, for feature A, all feature counts may come from only one entity in the seed set, while for feature B the counts come from different entities (e.g., “bought” + “Nokia”, “Motorola”, “SE”). Feature B is the better feature because it is shared by, or associated with, more entities.
Identifying high-quality features r represents the quality of feature j, based on h, the number of unique seed entities that have the jth feature, and T, the total number of entities in the seed set: the larger the fraction of seed entities sharing feature j, the higher its quality (see the sketch below).
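A minimal reading of this measure, assuming r is simply the fraction h/T (the slide's exact formula is not reproduced here, so treat this form as an assumption):

```python
def feature_quality(entity_presence):
    """Assumed form r_j = h / T over a 0/1 matrix with one row per seed entity."""
    T = len(entity_presence)                        # total entities in the seed set
    h = [sum(row[j] for row in entity_presence)     # unique entities with feature j
         for j in range(len(entity_presence[0]))]
    return [hj / T for hj in h]
```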
Candidate entity ranking • Each unique candidate entity may generate multiple feature vectors. • It is highly desirable to rank correct and frequent entities at the top. Ranking candidate entities: Md is the median of all feature vector scores of candidate entity d, n is the candidate entity's frequency, and fs(d), the final score for the candidate entity, combines the two so that entities that score consistently well and occur often rank highest.
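One plausible realization of fs(d) (the exact combination of Md and n is not shown on this slide; the log scaling of frequency is an assumption):

```python
import math
from statistics import median

def final_score(vector_scores):
    """Combine Md (median of d's feature-vector scores) with frequency n."""
    Md = median(vector_scores)
    n = len(vector_scores)       # one feature vector per occurrence, so n is the frequency
    return Md * math.log(1 + n)  # log scaling assumed, not from the slide
```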
Additional technique: enlarge the seed set • Enlarge the seed set using some high-precision syntactic coordination patterns: • EN [or | and] EN • from EN to EN • neither EN nor EN • prefer EN to EN • [such as | especially | including] EN (, EN)* [or | and] EN EN is the entity name, e.g., “Nokia and Samsung do not produce smart phones.” A regex sketch follows below.
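A rough sketch of a subset of these patterns; approximating EN with a capitalized-token regex rather than POS tags is our simplification:

```python
import re

# EN approximated as a run of capitalized/numeric tokens (our simplification).
EN = r"[A-Z][\w-]*(?:\s+[A-Z0-9][\w-]*)*"
PATTERNS = [
    rf"({EN})\s+(?:or|and)\s+({EN})",
    rf"from\s+({EN})\s+to\s+({EN})",
    rf"neither\s+({EN})\s+nor\s+({EN})",
    rf"prefer\s+({EN})\s+to\s+({EN})",
]

def extract_seeds(sentence):
    """Return entity names matched by the coordination patterns."""
    found = []
    for pat in PATTERNS:
        for match in re.finditer(pat, sentence):
            found.extend(match.groups())
    return found

# e.g. extract_seeds("Nokia and Samsung do not produce smart phones")
# yields ["Nokia", "Samsung"]
```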
Additional technique: bootstrapping Bayesian sets • This strategy again tries to find more seeds, but using Bayesian Sets itself, run iteratively (see the sketch below). • At the end of each iteration, add the top k ranked entities to the seed set (k = 5 in our experiments). • The iteration ends when no new entity is added to the current seed list.
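A sketch of the bootstrapping loop, with rank_fn standing in for one full run of the Bayesian Sets ranking (a hypothetical callback; max_iters is our safety bound, not from the slides):

```python
def bootstrap(seeds, candidates, rank_fn, k=5, max_iters=20):
    """Grow the seed set iteratively with the top-k newly ranked entities.

    rank_fn(seeds, candidates) -> candidate entities sorted best-first.
    Stops when an iteration adds no new entity (or max_iters is reached).
    """
    seeds = set(seeds)
    for _ in range(max_iters):
        remaining = [c for c in candidates if c not in seeds]
        ranked = rank_fn(sorted(seeds), remaining)
        new_entities = [c for c in ranked[:k] if c not in seeds]
        if not new_entities:
            break
        seeds.update(new_entities)
    return seeds
```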
Similar web-based systems • Google Sets • Boo! Wa!