Project Discussion-3. Prof. Vincent Ng Jitendra Mohanty Girish Vaidyanathan Chen Chen. Agenda. Brief recap of last presentation Things we have done so far Finding interests of user Finding user’s gender Finding trending topic Finding Opinion target pair Argument detection Future plans
Project Discussion-3 Prof. Vincent Ng Jitendra Mohanty Girish Vaidyanathan Chen Chen
Agenda • Brief recap of last presentation • Things we have done so far • Finding interests of user • Finding user’s gender • Finding trending topic • Finding Opinion target pair • Argument detection • Future plans • Argument detection • User profile construction
Brief recap of last presentation • Finding interests of user • 20 categories of interest • Algorithm • Neural network • Support Vector Machine • Passive Aggressive algorithm • Data • Twitter data • Blog data
Data Preparation • Interest categories with their listed values • music: 0.3305412193304446 • photography: 0.13342809041481524 • art: 0.10207545607595286 • reading: 0.3219854828471283 • movie: 0.19912786686170067 • sport: 0.2109817017635857 • writing: 0.1620484089090056 • travel: 0.12136726188833384 • cooking: 0.0761322551265421 • fashion: 0.0824524604642177 • food: 0.06361604062594872 • politics: 0.04773272983192118 • god: 0.05087903292578588 • singing: 0.055998675240802584 • dancing: 0.05427372836916623 • family: 0.05007865757734662 • animal: 0.04193690834322303 • shopping: 0.043261667540639745 • game: 0.046159578284988824 • Social media: 0.05710264123864985 • We have 120,778 users in total. • 60% are used as training data. • 20% are used as development data (to tune the learning algorithm parameters). • 20% are used as testing data.
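The 60/20/20 split could be produced along the following lines. This is a minimal sketch: the function name `split_users`, the fixed random seed, and the shuffle-then-cut strategy are assumptions for illustration, not the procedure the project actually used.

```python
import random

def split_users(user_ids, train=0.6, dev=0.2, seed=0):
    """Shuffle user ids and cut them into train/dev/test partitions.

    Whatever is left after the train and dev cuts becomes the test set.
    """
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = int(len(ids) * train)
    n_dev = int(len(ids) * dev)
    return (ids[:n_train],
            ids[n_train:n_train + n_dev],
            ids[n_train + n_dev:])

# Hypothetical user ids 0..120777, matching the 120,778 users on the slide
train_ids, dev_ids, test_ids = split_users(range(120778))
```

The development partition is the one used for tuning, so the test partition stays untouched until the final evaluation.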
Finding interests of user • Feature groups • POS sequence: 1003 • Named entities: 14 • Social linguistic: 37063 • Bigram: 193985 • Unigram: 273985 • Unigram for description: 18855 • Bigram for description: 15754 • Ngram for user name: 17482 • Ngram for screen name: 19944 • Total: 578,085
Finding interests of user • Measures • Precision. The fraction of users predicted to have an interest who really have that interest. • Recall. The fraction of users who really have an interest that the system predicts as having it. • F score. • Accuracy.
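The measures above can be computed per interest category as in the sketch below. The slides do not say which F score is used, so the balanced F1 (harmonic mean of precision and recall) is an assumption here, as are the function name and the toy user ids.

```python
def precision_recall_f1(predicted, actual):
    """Score one interest category.

    predicted: set of user ids the classifier labels with the interest
    actual:    set of user ids who really declare the interest
    """
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # assumed: balanced F1
    return precision, recall, f1

# Toy example: 4 users predicted, 5 users truly interested, 2 in common
p, r, f = precision_recall_f1({"u1", "u2", "u3", "u4"},
                              {"u1", "u2", "u5", "u6", "u7"})
```

With few relevant tweets per user, the classifier tends to predict an interest for fewer users than actually have it, which is the pattern where precision stays above recall.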
Finding interests of user • Neural Network Result
Finding interests of user • Support Vector Machine Result
Finding interests of user • Passive Aggressive
Finding interests of user • Result Comparison
Finding interests of user • Result Analysis • Recall Analysis • Recall is low because of the limited amount of data per user. • Some users claim a certain interest but have not published any tweets or blogs related to it. Once they publish tweets or blogs related to that interest, our system can make the right prediction. • Precision Analysis • In some cases people have published tweets or blogs related to a certain interest but do not specify it as their interest. • Precision is higher than recall for most interest categories.
Finding interests of user • Including more features • Tweet POS sequence: 1003 • Named entities: 14 • Social linguistic: 37063 • Bigram for tweets and blogs: 193985 • Unigram for tweets and blogs: 273985 • Unigram for description: 18855 • Bigram for description: 15754 • Ngram for user name: 17482 • Ngram for screen name: 19944 • Gender: 2 • Blog POS sequence: 3075 • Unigram for “About me”: 18753 • Bigram for “About me”: 12391 • Industry: 39 • Location: 4505 • Occupation: 332 • Total: 617180
Finding interests of user • Neural network result after including more features
Finding interests of user • Result after including more features. • After including the additional features we are able to predict the interests of all categories better than before. Some categories, such as music, reading, writing, fashion and politics, improve by more than 1 percent.
Finding interests of user • Feature Analysis • There are 16 feature groups in total. Remove one feature group at a time and observe the change in the result. • Because the neural network gives the best result, we use the neural network for this analysis.
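The leave-one-group-out analysis can be organised as in this sketch. The `evaluate` callback stands in for training and scoring the neural network on a given set of feature groups; it, the toy weights, and the group names used in the example are hypothetical placeholders, not project results.

```python
def ablation(feature_groups, evaluate):
    """Remove one feature group at a time and record the score drop.

    evaluate: function taking a list of group names and returning a score
              (e.g. F score on the development data).
    """
    baseline = evaluate(list(feature_groups))
    drops = {}
    for group in feature_groups:
        kept = [g for g in feature_groups if g != group]
        drops[group] = baseline - evaluate(kept)  # large drop = useful group
    return baseline, drops

# Toy stand-in for the real classifier: each group adds a fixed amount
TOY_WEIGHTS = {"unigram": 0.30, "bigram": 0.20, "gender": 0.05}

def toy_evaluate(groups):
    return sum(TOY_WEIGHTS[g] for g in groups)

baseline, drops = ablation(list(TOY_WEIGHTS), toy_evaluate)
most_useful = max(drops, key=drops.get)
```

A group whose removal barely changes the score is a candidate for dropping, which matters when the full feature space has several hundred thousand dimensions.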
Finding gender of user • Motivation: • Helps to construct the user’s profile • Helps to compare opinions between different genders • Data • Tweet data • Blog data • Feature group: • POS sequence: 1003 • Named entities: 14 • Social linguistic: 37063 • Bigram: 193985 • Unigram: 273985 • Unigram for description: 18855 • Bigram for description: 15754 • Ngram for user name: 17482 • Ngram for screen name: 19944 • Interest features * • Total number: 578085 + 20 (the number of interest features)
Finding gender of user • Result: If we can improve our interest prediction, it may be possible to improve the gender prediction.
Finding Trending Topics • Motivation • Helps in finding the interesting topics that attract people’s attention • Trending topics are helpful in argument detection • Possible Approaches • Naïve Approach • Online Clustering • Latent Dirichlet Allocation (LDA)
Trending Topics Results • Naïve Approach: A brief recap • Visit every tweet in our dataset and find the words and phrases that occur frequently. • Those are the probable trending topics in our dataset. • The timeframe of a trending topic can be found in a similar fashion by keeping track of the timestamp associated with every tweet and finding the minimum and maximum timestamp for every phrase/word.
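The naïve approach can be sketched as a single pass over the tweets. The whitespace tokenisation, the restriction to unigrams and bigrams, and the function name `naive_trending` are assumptions made for this example.

```python
from collections import Counter

def naive_trending(tweets, top_k=3):
    """tweets: iterable of (timestamp, text) pairs.

    Returns a list of (term, tweet_count, first_seen, last_seen) for the
    top_k most frequent words and two-word phrases.
    """
    counts = Counter()
    first_seen, last_seen = {}, {}
    for ts, text in tweets:
        words = text.lower().split()
        # unigrams plus adjacent word pairs, counted once per tweet
        terms = set(words) | {" ".join(pair) for pair in zip(words, words[1:])}
        for term in terms:
            counts[term] += 1
            first_seen[term] = min(ts, first_seen.get(term, ts))
            last_seen[term] = max(ts, last_seen.get(term, ts))
    return [(t, c, first_seen[t], last_seen[t])
            for t, c in counts.most_common(top_k)]

# Toy data with integer timestamps
sample = [(1, "obama speech today"), (5, "watching obama speech"), (3, "lady gaga")]
stats = {t: (c, lo, hi) for t, c, lo, hi in naive_trending(sample, top_k=100)}
```

Because the ranking is by raw count alone, everyday phrases with large counts rank as highly as real topics.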
Some Results for the month of December in our dataset in chronological order using Naïve Approach
Problems in Naïve approach • Many irrelevant words or phrases with large counts will be considered as trending topics. • For example, • Youtube video • I arrived • Watching movie
Solution for the problem in the previous slide ( part of future work ) • Possible solutions • Online Clustering • Latent Dirichlet Allocation
Solution for the problem in the previous slide ( part of future work ) • Online clustering algorithm • All the tweets are ordered on the timeline. • Tweets are represented as vectors of tf-idf (term frequency and inverse document frequency) weights. • Tweets which have the highest similarity are clustered together. • Every cluster corresponds to a trending topic. • The online clustering algorithm works better because it uses tf-idf weights for all terms and finds the similarity between a tweet and all the currently available clusters. • For example, consider the following tweets • I love lady gaga • I love gandhi • Lady gaga is the best singer in the world
Solution for the problem in the previous slide ( part of future work ) • Latent Dirichlet Allocation (LDA) • LDA is a bag-of-words model. • In LDA, each tweet is viewed as a mixture of various topics. • If a tweet contains a particular trending topic, it has a high probability of belonging to that topic.
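A minimal sketch of single-pass clustering over tf-idf vectors follows. Several choices are assumptions rather than part of the plan on the slide: idf is computed once over the whole batch (a truly online system would update it incrementally), idf is smoothed as log(N/df) + 1, a cluster is represented by the sum of its member vectors, and the similarity threshold of 0.5 is arbitrary.

```python
import math
from collections import Counter

def tfidf_vectors(tweets):
    """Return one sparse {term: weight} dict per tweet."""
    docs = [Counter(t.lower().split()) for t in tweets]
    df = Counter(term for doc in docs for term in doc)  # document frequency
    n = len(docs)
    return [{term: tf * (math.log(n / df[term]) + 1.0)
             for term, tf in doc.items()} for doc in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def online_cluster(vectors, threshold=0.5):
    """Assign each vector, in timeline order, to the most similar cluster."""
    clusters = []  # each cluster: {"centroid": dict, "members": [indices]}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [i]})
        else:
            for term, w in vec.items():  # summed centroid; cosine ignores scale
                best["centroid"][term] = best["centroid"].get(term, 0.0) + w
            best["members"].append(i)
    return clusters

toy_tweets = ["obama healthcare plan", "obama healthcare plan vote",
              "lady gaga concert", "lady gaga concert tonight"]
clusters = online_cluster(tfidf_vectors(toy_tweets))
members = [c["members"] for c in clusters]
```

On very short tweets, frequent function words can dominate the similarity, so stop-word removal before vectorising is likely to matter in practice.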
Opinion-Target: What is it? • Opinion words in an opinionated sentence are, in most cases, adjectives which act directly on their target. For example: • I am so excited that the vacation is coming. • Here the opinion word is excited • And its target is I • The water is green and clear. • Here the opinion word is green • And its target is water • The Dream Lake is a beautiful place. • Here the opinion word is beautiful • And its target is Dream Lake
Why Opinion-Target pair? Motivation • An opinionated sentence gives a sense of the general opinion of a person on a *subject material* or *topic*, called the *target* in the research literature. • The *subject material* or *topic* is diverse. • For example, it could be a travel article which deals with several tourist attractions. • The last two examples on the previous slide are tourism-related opinions by us. • Opinions change over the course of time. Example: • At time t1, user p’s view: place x has a really good scenic view, let’s go for it. • At time t2 = t1 + (1 year), user p’s view: place y has a better scenic view than place x. • The opinion of the user about tourist place x has changed over the one-year time frame, from positive to negative. • This suggests that by listening to the posts (tweets in our case), we can create a profile that lets us see whether a user’s interest in a particular topic changes over a period of time.
Extraction of Opinion-Target Pair • The Stanford parser was run over the tweets to give us the dependencies among different entities in a tweet. • The following 5 rules were applied to the dependency information generated in the previous step to produce opinion-target pairs. • Direct Object Rule • dobj(opinion, target) • I love(opinion1) Firefox(target1) and defended(opinion2) it. • Nominal Subject Rule • nsubj(opinion, target) • IE(target) breaks(opinion) with everything. • Adjective Modifier Rule • amod(target, opinion) • The annoying(opinion) popup(target) • The opinion is the adjectival modifier of the target. • Prepositional Object Rule • If prep(target1, IN) => pobj(IN, target2) • The prepositional object of a known target is also a target of the same opinion. • The annoying(op) popup(tar1) in IE(tar2) • Recursive Modifiers Rule • If conj(adj2, opinion adj1) => amod(target, adj2)
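The five rules can be sketched as a function over dependency triples. The input format, a list of (relation, governor, dependent) tuples assumed to have been produced by the parser beforehand, and the function name `extract_pairs` are assumptions for illustration; the project's actual implementation is not shown in the slides.

```python
def extract_pairs(deps):
    """deps: list of (relation, governor, dependent) triples for one tweet.

    Returns a list of (opinion, target) pairs without duplicates.
    """
    pairs = []

    def add(opinion, target):
        if (opinion, target) not in pairs:
            pairs.append((opinion, target))

    # Rules 1-3: dobj(opinion, target), nsubj(opinion, target), amod(target, opinion)
    for rel, gov, dep in deps:
        if rel in ("dobj", "nsubj"):
            add(gov, dep)
        elif rel == "amod":
            add(dep, gov)

    # Rule 4: prep(target1, IN) and pobj(IN, target2) => target2 shares the opinion
    preps = [(gov, dep) for rel, gov, dep in deps if rel == "prep"]
    pobjs = {gov: dep for rel, gov, dep in deps if rel == "pobj"}
    for opinion, target in list(pairs):
        for prep_gov, prep_word in preps:
            if prep_gov == target and prep_word in pobjs:
                add(opinion, pobjs[prep_word])

    # Rule 5: an adjective conjoined with a known opinion modifies the same target
    for rel, gov, dep in deps:
        if rel == "conj":
            for opinion, target in list(pairs):
                if opinion == dep:
                    add(gov, target)
                elif opinion == gov:
                    add(dep, target)
    return pairs

# "The annoying popup in IE", written by hand as dependency triples
example = [("amod", "popup", "annoying"), ("prep", "popup", "in"), ("pobj", "in", "IE")]
pairs = extract_pairs(example)
```

The nominal-subject and direct-object rules fire on every verb, so the raw output contains many pairs whose "opinion" word carries no sentiment; a polarity lexicon lookup afterwards filters these.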
What to do with the extracted Opinion-Target pairs? • Once we have the opinion-target pairs, we use the subjectivity lexicon of (Wilson et al., 2005), which contains 8221 words, to express the polarity of each opinion. The words in the lexicon are the opinion words. • Some samples from the lexicon • type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative • type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative • type=weaksubj len=1 word1=ability pos1=noun stemmed1=n priorpolarity=positive • type=weaksubj len=1 word1=above pos1=anypos stemmed1=n priorpolarity=positive • type=strongsubj len=1 word1=amazing pos1=adj stemmed1=n priorpolarity=positive • type=strongsubj len=1 word1=absolutely pos1=adj stemmed1=n priorpolarity=neutral • type=weaksubj len=1 word1=absorbed pos1=verb stemmed1=n priorpolarity=neutral
What do the extracted Opinion-Target pairs look like? • Columns: Tweet_id, Opinion-Target pair, Polarity-Target pair • Tweet:3, will-I, I+ • Tweet:6, evil-I, I- • Tweet:7, best-books, books+ • Tweet:7, cry-me, me- • Tweet:7, think-what, what* • Tweet:7, think-you, you* • Tweet:9, amazing-houses, houses+ • Tweet:10, love-I, I+ • Tweet: <More amazing gingerbread houses> • The opinion-target pair extracted from this tweet is <amazing-houses> and the corresponding polarity-target pair is <houses+>, because amazing has positive prior polarity.
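Turning an opinion-target pair into a polarity-target pair amounts to a lexicon lookup, sketched below. The parsing follows the key=value layout of the lexicon samples; reading `*` as the marker for neutral polarity is inferred from the sample output, and ignoring the part-of-speech field (so a later entry for the same word overwrites an earlier one) is a simplification.

```python
def parse_lexicon(lines):
    """Map each word1 entry to its prior polarity."""
    lexicon = {}
    for line in lines:
        fields = dict(item.split("=", 1) for item in line.split())
        lexicon[fields["word1"]] = fields["priorpolarity"]
    return lexicon

# Assumed mapping from prior polarity to the suffix used in the output
SYMBOL = {"positive": "+", "negative": "-", "neutral": "*"}

def polarity_target(opinion, target, lexicon):
    """Return e.g. 'houses+' or None if the opinion word has no usable entry."""
    polarity = lexicon.get(opinion)
    if polarity not in SYMBOL:
        return None
    return target + SYMBOL[polarity]

sample_lines = [
    "type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative",
    "type=strongsubj len=1 word1=amazing pos1=adj stemmed1=n priorpolarity=positive",
    "type=strongsubj len=1 word1=absolutely pos1=adj stemmed1=n priorpolarity=neutral",
]
lexicon = parse_lexicon(sample_lines)
result = polarity_target("amazing", "houses", lexicon)
```

Pairs whose opinion word is missing from the lexicon return `None`, which gives a natural filter for non-opinion words picked up by the extraction rules.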
What Next ? • Integration • Next, we apply these polarity-target pairs to the tweets to get useful information about the interests of a person.
Integrating opinion-target pairs with trending-topic tweets • Trending topics are those that are immediately popular in the Twitter world. • This helps people discover the *most breaking* news stories from across the world. • Polarity-target pairs are applied to the trending-topic tweets to find the opinion of a person w.r.t. a trending topic. • This tells us which user posted the tweet, the actual message, the topic the tweet is about and the corresponding polarity. • For example: the following tweet has the tweet_id 746 in a general tweet file. • RT @GirlOnMission Hasn't Obama's warranty runs out yet? // It was a limited warranty covering nothing substantial anyway! • The above tweet has also been tagged as a trending-topic tweet under the trending topic *Obama*. • The opinion-target and polarity-target pairs generated for the above tweet using the five rules that we discussed are as follows: • tweet_id: 746 limited-warranty warranty- • tweet_id: 746 substantial-nothing nothing+ • The tweet_id matches in both outputs, which gives us the opinion-target and polarity-target pairs that will be used for argument detection and profile construction.
Drawback • The polarity that we talked about is the prior polarity. It does not take the *context of the sentence* into consideration. • However, Opinion-Finder does! • Why is prior polarity not always effective? • Explained with an example in later slides
Opinion-Finder • Software developed at the University of Pittsburgh to predict the contextual polarity of a sentence. It was designed mainly for documents and has limitations on the size of the file it can process, as well as in its sentence splitting module. • We modified the software to suit our purpose, i.e. tweets. • It is extremely slow. For example, it processes a 27M file in approximately 12 hours (the total size of the tweet files is 18GB).
How is Opinion-Finder different from the conventional Opinion-Target/Polarity-Target pair • Consider a tweet: <No one is happy with Barack Obama’s healthcare plan…> • Intuitively, we can say that the above tweet has a negative connotation. • How is it depicted in Opinion-Finder? • Output of Opinion-Finder: <MPQASENT autoclass1="unknown" autoclass2="obj" diff="3.1">No one is <MPQASD><MPQAPOL autoclass="negative">happy</MPQAPOL></MPQASD> with Barack <MPQASRC>Obama</MPQASRC>'s healthcare plan</MPQASENT>. • Output of the conventional 5-rule system: • Tweet:1 happy-one one+
What does the Contextual-Polarity output look like? • <MPQASENT autoclass1="obj" autoclass2="obj" diff="21.9">@A_ClayChillin what the <MPQAPOL autoclass="negative">hell</MPQAPOL> did she do, <MPQASD>push</MPQASD> him out the truck?</MPQASENT> • <MPQASENT autoclass1="unknown" autoclass2="subj" diff="1.7"><MPQASRC>i</MPQASRC> <MPQASD>think</MPQASD> its time for me to go back to bed, dnl going to bed at 3 and waking up at 7. yuck.</MPQASENT> • <MPQASENT autoclass1="obj" autoclass2="obj" diff="25.2">Quebecor veut vendre Jobboom - secteurs-d-activite - LesAffaires.com - http://bit.ly/5dlkE5</MPQASENT> • <MPQASENT autoclass1="unknown" autoclass2="obj" diff="7.4">wak @mirandamia maacii ya keik boltantemnyaaa,, *senaaaaannggg*</MPQASENT> • <MPQASENT autoclass1="subj" autoclass2="subj" diff="12.5">90% of any <MPQAPOL autoclass="negative">pain</MPQAPOL> comes from trying to keep the <MPQAPOL autoclass="negative">pain</MPQAPOL> secret. You cannot keep a secret and let it go.</MPQASENT>
Status as of now • Opinion-Finder is slow. • It runs a pipeline of internal modules, such as Document Preprocessing, Sentence Splitting, Tokenization, POS Tagging, Feature Finder, Source Finder etc. • About 20% of the tweets have been processed with Opinion-Finder so far.
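The contextual polarity annotations can be pulled out of this markup with a regular expression, as sketched below. The pattern is based only on the tag layout visible in the sample outputs (an `MPQAPOL` element with an `autoclass` attribute); it is an assumption that all output follows this layout, and the function name is hypothetical.

```python
import re

# Matches <MPQAPOL autoclass="...">word</MPQAPOL> as seen in the samples
POLARITY_TAG = re.compile(r'<MPQAPOL autoclass="([^"]+)">(.*?)</MPQAPOL>')

def contextual_polarities(annotated_sentence):
    """Return a list of (word, polarity) pairs found in one annotated sentence."""
    return [(word, polarity)
            for polarity, word in POLARITY_TAG.findall(annotated_sentence)]

sample = ('<MPQASENT autoclass1="subj" autoclass2="subj" diff="12.5">90% of any '
          '<MPQAPOL autoclass="negative">pain</MPQAPOL> comes from trying to keep the '
          '<MPQAPOL autoclass="negative">pain</MPQAPOL> secret.</MPQASENT>')
found = contextual_polarities(sample)
```

Pairs in this (word, polarity) form can be compared side by side with the lexicon-based prior polarity for the same tweet.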
Argument Detection • Argument detection finds the arguments people use to support or oppose an issue. • Example: • Obama is bankrupting Americans, he does nothing to improve the economy just drain it. • “Obama” is the issue • “bankrupting Americans” and “does nothing to improve the economy just drain it” are the arguments
Argument Detection • Motivation • To discover the reason why people show a positive or negative opinion towards an issue. • If people suddenly change their opinion towards an issue because of a particular event, we can infer what exactly the event is from the arguments they use. • Arguments will be used as an attribute in the user’s profile. • One step in argument detection is to classify the polarity of the tweet towards the issue. From the result of polarity classification, we can infer the public’s attitude towards the issue.
Argument Detection Approach • Step 1. Given a trending topic, retrieve all the tweets associated with that trending topic. (Output from trending topic detection) • Treating trending topics as issues, we will detect people’s arguments about the issue from the tweets which are relevant to that topic. • Step 2. Determine whether a tweet is subjective or neutral about the topic. (Ongoing) • Though some tweets belong to a certain trending topic, they don’t show any subjective opinion about the topic. Example: • Barack Obama makes banks an offer they can’t refuse.
Argument Detection Approach • Step 3. Polarity classification to decide whether the tweet is positive or negative towards the topic. (Ongoing) • After we get the arguments from the tweets, we need to know whether each argument is used to support or oppose the issue, so we should know the polarity of the tweet first. • Step 4. Get all opinion-target pairs from the tweets which show a positive or negative opinion, separately. (Output from opinion-target pairs) • We will collect the arguments from those opinion-target pairs.
Argument Detection Approach • Step 5. Determine whether an opinion-target pair can be used as an argument. (Ongoing) • There are some opinion-target pairs which can’t be used as arguments. Example: • Tweet: I envy you guys with a leader like Obama. • Opinion-target pairs: • envy-I I- (this opinion-target pair can’t be used as an argument) • Use mention co-reference and Mutual Information (MI) to find useful opinion-target pairs.
Argument Detection Approach • Find those targets which are co-referenced with the topic. Example: • Tweet: Obama is still the best president you Americans have had in a very long time:) • Opinion-target pairs: • best-president president+ • Argument • Positive president (best) (Obama and president are co-referenced)
Argument Detection Approach • MI is a quantity that measures the mutual dependence of two random variables. Calculate the MI between the topic and the target from each opinion-target pair. If the value of MI exceeds a certain threshold, then consider this opinion-target pair. Example: • Tweet: Obama’s nasty army. They aren’t funded yet but take a good look… • Opinion-target pairs: • nasty-army army- • Argument: • Negative army (nasty and little). (MI between Obama and army is high)
Argument Detection Approach • Step 6. Argument clustering (Ongoing) • Use WordNet to cluster arguments which have the same meaning into the same cluster. • The targets of opinion-target pairs may have similar meanings, so we can cluster them. Example: (we can combine troops and army into the same cluster) • Argument: • Negative troops (Little) • Negative army (Nasty)
Future work • Trending Topic Detection • Two more algorithms to overcome the shortcomings of the naïve approach. • Opinion-Target Pairs • Run Opinion-Finder to parse all the tweets. • Compare the results of lexicon polarity and contextual polarity. • Argument Detection • Identify whether a tweet is subjective or objective towards a topic. • Identify the polarity of the subjective tweets. • Identify the opinion targets that are useful for argument detection. • Cluster the opinion targets.
Future work • User profile construction. A user’s profile will include the following content: • All the tweets the user has published • Location and description from the user’s Twitter account profile • Predicted gender • Predicted interests • Opinion-target pairs • Trending topics the user has discussed, and the user’s opinion towards those trending topics. • Arguments the user employs to support or oppose a topic. • …..
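The slides do not say how MI between a topic and a target is estimated. One common choice for a pair of specific words is pointwise mutual information over tweet-level co-occurrence, which is what this sketch computes; the estimator, the whitespace tokenisation, and the function name `pmi` are all assumptions.

```python
import math

def pmi(tweets, topic, target):
    """Pointwise mutual information (base 2) between two words.

    PMI = log2( P(topic, target) / (P(topic) * P(target)) ), with
    probabilities estimated as fractions of tweets containing the word(s).
    """
    token_sets = [set(t.lower().split()) for t in tweets]
    n = len(token_sets)
    n_topic = sum(topic in s for s in token_sets)
    n_target = sum(target in s for s in token_sets)
    n_both = sum(topic in s and target in s for s in token_sets)
    if not (n_topic and n_target and n_both):
        return float("-inf")  # never co-occur: treat as below any threshold
    return math.log2(n_both * n / (n_topic * n_target))

toy = ["obama army is nasty", "obama army again", "obama speech", "nice weather"]
score = pmi(toy, "obama", "army")
```

A pair would then be kept when `score` exceeds a threshold tuned on held-out data; the threshold value itself is left open here, as it is in the slides.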