Deriving Marketing Intelligence from Online Discussion Natalie Glance and Matthew Hurst CMU Information Retrieval Seminar, April 19, 2006 © 2006 Nielsen BuzzMetrics, A VNU business affiliate
Overview
• Motivation
• Content Segment: The Blogosphere
• Structural Aspects
• Topical Aspects
• Deriving market intelligence
• Conclusion
Motivation
• Social Media
• Mobile phone data
• “The celly 31 is awesome, but the screen is a bit too dim.”
The Blogosphere
Profile Analysis
Hurst, “24 Hours in the Blogosphere”, 2006 AAAI Spring Symposium on Computational Approaches to Analysing Weblogs.
Hypotheses
• Different hosts attract users with different capacity to disclose profile information (?)
• Blogspot users are more disposed to disclose information (?)
• Different interface implementations perform differently at extracting/encouraging information from users (?)
Per Capita: Spaces
• variance in average age
• variance in profiles with age
• variance in per capita bloggers
Per Capita: Blogspot
The graphical structure of the blogosphere
Graphical Structure of the Blogosphere
• Citations between blogs indicate some form of relationship, generally topical.
• A link is certainly evidence of awareness; consequently, reciprocal links are evidence of mutual awareness.
• Mutual awareness suggests some commonality, perhaps common interests.
• The graph of reciprocal links can be considered a social network.
• Areciprocal links suggest topical relationships, but not social ones.
Graph Layout
• Hierarchical Force Layout
• Graph has 2 types of links: reciprocal links and areciprocal links
• Create a set of partitions P, where each partition is a connected component in the reciprocal graph.
• Create a graph whose nodes are the members of P and whose edges are formed from areciprocal links between (nodes within) members of P.
• Lay out the partition graph.
• Lay out each partition.
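The partitioning steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `partition_graph` and the input format (a list of directed `(source, target)` blog pairs) are hypothetical, and the force-directed layout itself is left out.

```python
from collections import defaultdict


def partition_graph(links):
    """Split a directed blog graph into reciprocal partitions.

    links: iterable of directed (source, target) pairs (assumed input format).
    Returns (partitions, partition_edges), where partitions is a list of node
    sets and partition_edges holds directed pairs of partition indices.
    """
    links = set(links)
    nodes = {node for edge in links for node in edge}

    # A link is reciprocal when the reverse link also exists.
    reciprocal = {(a, b) for (a, b) in links if (b, a) in links}

    # Adjacency over reciprocal links only (both directions are present).
    adjacency = defaultdict(set)
    for a, b in reciprocal:
        adjacency[a].add(b)

    # Connected components of the reciprocal graph form the partitions P.
    component = {}
    partitions = []
    for start in sorted(nodes):
        if start in component:
            continue
        index = len(partitions)
        members = set()
        stack = [start]
        component[start] = index
        while stack:
            node = stack.pop()
            members.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component[neighbour] = index
                    stack.append(neighbour)
        partitions.append(members)

    # Areciprocal links between different partitions become partition edges.
    partition_edges = set()
    for a, b in links - reciprocal:
        if component[a] != component[b]:
            partition_edges.add((component[a], component[b]))
    return partitions, partition_edges
```

A layout routine would then position the partition graph first and each partition's members second, as the slide describes.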
[Figure: graph layout, r = 2, p = 25; labeled clusters: Japanese, cooking, knitting]
[Figure: graph layout, r = 2, p = 1; labeled blogs: kbcafe/rss, scoble, engadget, instapundit, boingboing, gizmodo, powerline, michellemalkin, crooksandliars]
[Figure: graph layout, r = 3, p = 100; labeled clusters: technology, social/politics] The English blogosphere is political.
Political Blogosphere
L. Adamic and N. Glance, “The Political Blogosphere and the 2004 U.S. Election: Divided They Blog”, 2nd Annual Workshop on the Weblogging Ecosystem, Chiba, Japan, 2005.
Political Blogs & Readership
• Pew Internet & American Life Project Report, January 2005, reports:
  • 63 million U.S. citizens use the Internet to stay informed about politics (mid-2004, Pew Internet Study)
  • 9% of Internet users read political blogs preceding the 2004 U.S. Presidential Election
• 2004 Presidential Campaign Firsts
  • Candidate blogs: e.g. Dean’s blogforamerica.com
  • Successful grassroots campaign conducted via websites & blogs
  • Bloggers credentialed as journalists & invited to nominating conventions
Research Goals & Questions
• Are we witnessing a cyberbalkanization of the Internet?
  • Linking behavior of blogs may make it easier to read only like-minded bloggers
  • On the other hand, bloggers systematically react to and comment on each other’s posts, both in agreement and disagreement (Balkin 2004)
• Goal: study the linking behavior & discussion topics of political bloggers
  • Measure the degree of interaction between liberal and conservative bloggers
  • Find any differences in the structure of the two communities: is there a significant difference in “cohesiveness” between the two communities?
The Greater Political Blogosphere
• Citation graph of greater political blogosphere
  • Front page of each blog crawled in February 2005
  • Directed link from blog A to blog B if A links to B
  • Method is biased toward blogroll/sidebar links (as opposed to links in posts)
• Results
  • 91% of links point to a blog of the same persuasion (liberal vs. conservative)
  • Conservative blogs show greater tendency to link
    • 82% of conservative blogs are linked to at least once; 84% link to at least one other blog
    • 67% of liberal blogs are linked to at least once; 74% link to at least one other blog
  • Average # of links per blog is similar: 13.6 for liberal; 15.1 for conservative
  • Higher proportion of liberal blogs that are not linked to at all
Citations between blogs extracted from posts (Aug 29th – Nov 15th, 2004)
• All citations between A-list blogs in 2 months preceding the 2004 election
• Citations between A-list blogs with at least 5 citations in both directions
• Edges further limited to those exceeding 25 combined citations
Only 15% of the citations bridge communities.
Are political blogs echo chambers?
• Performed pairwise comparison of URL citations and phrase usage from blog posts
• Link-based similarity measure
  • Cosine similarity: cos(A,B) = (vA · vB) / (||vA|| ||vB||), where vA is a binary vector whose entries are 1 or 0, depending on whether blog A cites a particular URL
  • Average similarity: cos(L,R) = 0.03; cos(R,R) = 0.083; cos(L,L) = 0.087
• Phrase-based similarity measure
  • Extracted a set of phrases that are informative with respect to a background model
  • Entries in vA are the TF*IDF weight for each phrase = (# of phrase mentions by blog) * log[(# blogs) / (# blogs citing the phrase)]
  • Average similarity: cos(L,R) = 0.10; cos(R,R) = 0.54; cos(L,L) = 0.57
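Both similarity measures follow directly from the formulas on this slide. The sketch below is illustrative only: the function names and the dictionary-based sparse vectors are assumptions, and the phrase extraction step (choosing phrases that are informative against a background model) is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_u * norm_v)


def binary_vector(cited_urls):
    """Link-based vector: entry is 1 for every URL the blog cites."""
    return {url: 1.0 for url in set(cited_urls)}


def tfidf_vectors(phrase_counts):
    """Phrase-based vectors.

    phrase_counts: {blog: {phrase: number of mentions by that blog}}
    Weight = mentions * log(number of blogs / number of blogs using the phrase).
    """
    n_blogs = len(phrase_counts)
    blogs_using = {}
    for counts in phrase_counts.values():
        for phrase in counts:
            blogs_using[phrase] = blogs_using.get(phrase, 0) + 1
    return {
        blog: {
            phrase: mentions * math.log(n_blogs / blogs_using[phrase])
            for phrase, mentions in counts.items()
        }
        for blog, counts in phrase_counts.items()
    }
```

Averaging `cosine` over all left-right, right-right, and left-left blog pairs would give figures comparable to the averages quoted on the slide.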
Influence on mainstream media
Notable examples of blogs breaking a story
• Swiftvets.com anti-Kerry video
  • Bloggers linked to this in late July, keeping accusations alive
  • Kerry responded in late August, bringing mainstream media coverage
• CBS memos alleging preferential treatment of Pres. Bush during the Vietnam War
  • Powerline broke the story on Sep. 9th, launching a flurry of discussion
  • Dan Rather apologized later in the month
• “Was Bush Wired?”
  • Salon.com asked the question first on Oct. 8th, echoed by Wonkette & PoliticalWire.com
  • MSM follows up the next day
Deriving Market Intelligence
N. Glance, M. Hurst, K. Nigam, M. Siegler, R. Stockton and T. Tomokiyo. Deriving Marketing Intelligence from Online Discussion. Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2005).
Automating Market Research
• Brand managers want to know:
  • Do consumers prefer my brand to another?
  • Which features of my product are most valued?
  • What should we change or improve?
  • Alert me when a rumor starts to spread!
Comparative mentions: Halo 2
[Figure; label: ‘halo 2’]
Case Study: PDAs
• Collect online discussion in target domain (order of 10K to 10M posts)
• Classify discussion into domain-specific topics (brand, feature, price)
• Perform base analysis over combination of topics: buzz, sentiment/polarity, influencer identification
Dell Axim, 11.5% buzz, 3.4 polarity
Interactive analysis
• Top-down approach: drill down from aggregate findings to drivers of those findings
• Global view of data used to determine focus
• Model parent and child slice
• Use data-driven methods to identify what distinguishes one data set from the other
SD card
Social network analysis for discussion about the Dell Axim
Drilling down to sentence level
• Discussion centers on poor quality of sound hardware & IR ports
  • “It is very sad that the Axim’s audio AND Irda output are so sub-par, because otherwise it is a great Pocket PC.”
  • “Long story made short: the Axim has a considerably inferior audio output than any other Pocket PC we have ever tested.”
  • “When we tested it we found that there was a problem with the audio output of the Axim.”
  • “The Dell Axim has a lousy IR transmitter AND a lousy headphone jack.”
• Note: these examples are automatically extracted.
Technology
• Data Collection:
  • Document acquisition and analysis
  • Classification (relevance/topic)
• Topical Analysis:
  • Topic classification using a hierarchy of topic classifiers operating at sentence level.
  • Phrase mining and association.
• Intentional Analysis:
  • Interpreting sentiment/polarity
  • Community analysis
  • Aggregate metrics
Topical Analysis
• Hierarchy of topics with specific ‘dimensions’:
  • Brand dimension
    • Pocket PC:
      • Dell Axim
      • Toshiba
        • e740
    • Palm
      • Zire
      • Tungsten
  • Feature dimension:
    • Components
      • Battery
Topical Analysis
• Each topic is a classifier, e.g. a boolean expression with sentence- and/or message-scoped sub-expressions.
• Measured precision of classifier allows for projection of raw counts.
• Intersection of typed dimensions allows for a basic approach to association (e.g. find sentences discussing the battery of the Dell Axim).
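As an illustration of a boolean topic classifier with mixed scopes, the sketch below intersects a message-scoped brand test with a sentence-scoped feature test. Everything here is a hypothetical stand-in: the term lists, the substring matching, the naive sentence splitter, and the function names are assumptions rather than the system's actual rules.

```python
import re

# Hypothetical term lists for one brand topic and one feature topic.
BRAND_AXIM = ("dell axim", "axim")
FEATURE_BATTERY = ("battery", "batteries")


def split_sentences(message):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", message) if s.strip()]


def mentions(text, terms):
    """True if any term occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def battery_of_axim(message):
    """Sentences about the battery inside a message about the Axim.

    Boolean rule: (message mentions brand) AND (sentence mentions feature).
    """
    if not mentions(message, BRAND_AXIM):
        return []
    return [s for s in split_sentences(message) if mentions(s, FEATURE_BATTERY)]


def project_count(raw_count, precision):
    """Estimate how many of the raw matches are truly on topic."""
    return raw_count * precision
```

For example, if the rule fires on 200 sentences and its measured precision is 0.9, `project_count(200, 0.9)` projects about 180 truly on-topic sentences.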
Polarity: What is it?
• Opinion, evaluation/emotional state with respect to some topic.
  • It is excellent.
  • I love it.
• Desirable or undesirable condition
  • It is broken (objective, but negative).
• We use a lexical/syntactic approach.
• Cf. related work on boolean document classification task using supervised classifiers.
Polarity Identification
Sentence:             This   car   is    really   great
POS:                  DT     NN    VB    RR       JJ
Lexical orientation:  0      0     0     0        +
Chunking:             BNP          BVP   BADJP
Interpretation (parsing): Positive
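The lexical orientation step can be illustrated with a toy lexicon lookup. The sketch below is a deliberate simplification and not the system described in the talk: it skips POS tagging, chunking, and parsing entirely, and replaces the syntactic interpretation with a naive rule that flips the next polar word after a negator. The lexicon, the negator list, and the function names are all hypothetical.

```python
# Hypothetical mini-lexicon: +1 for positive words, -1 for negative words.
LEXICON = {
    "great": 1, "excellent": 1, "awesome": 1, "love": 1,
    "lousy": -1, "broken": -1, "dim": -1, "inferior": -1,
}
# Hypothetical negator list used by the naive flip rule.
NEGATORS = {"not", "never", "didn't", "doesn't", "isn't"}


def tokenize(sentence):
    """Lowercase, normalise apostrophes, drop final punctuation, split on spaces."""
    cleaned = sentence.lower().replace("’", "'").rstrip(".!?")
    return cleaned.split()


def lexical_orientation(tokens):
    """Per-token orientation: 1, -1, or 0 when the word is not in the lexicon."""
    return [LEXICON.get(token, 0) for token in tokens]


def interpret(sentence):
    """Label a sentence positive, negative, or neutral.

    A negator flips the orientation of the next polar word (naive stand-in
    for the syntactic analysis shown on the slide).
    """
    tokens = tokenize(sentence)
    score = 0
    negated = False
    for token, orientation in zip(tokens, lexical_orientation(tokens)):
        if token in NEGATORS:
            negated = True
            continue
        if orientation:
            score += -orientation if negated else orientation
            negated = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

On the slide's example, `interpret("This car is really great")` yields `"positive"`, matching the per-token orientations 0 0 0 0 +.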
Polarity Challenges
• Methodological: ‘She told me she didn’t like it.’
• Syntactic: ‘His cell phone works in some buildings, but in others it doesn’t.’
• Valence:
  • ‘I told you I didn’t like it’,
  • ‘I heard you didn’t like it’,
  • ‘I didn’t tell you I liked it’,
  • ‘I didn’t hear you liked it’:
  • Many verbs (tell, hear, say, …) require semantic/functional information for polarity interpretation.
• Association
Polarity Examples
Polarity Metric
• Function of counts of polar statements on a topic: f(size, f_{top}, f_{top+pos}, f_{top+neg})
• Use empirical priors to smooth observed counts (helps with low counts)
• Use precision/recall (P/R) of system to project true counts and provide error bars (requires labeled data)
• Example: +/- ratio metric maps ratio to 0-10 score
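The slide does not give the exact form of f, so the following is only one plausible shape for such a metric, with every formula and parameter labeled as an assumption: positive and negative counts are smoothed toward a prior share using pseudo-counts, the smoothed positive share is scaled to 0-10, and observed counts are projected with the common estimate observed × precision / recall. Error bars are not computed here.

```python
def smoothed_positive_share(pos, neg, prior_share=0.5, prior_weight=10.0):
    """Positive share of polar statements, pulled toward a prior for low counts.

    prior_share and prior_weight are hypothetical values; in practice they
    would come from empirical data, as the slide suggests.
    """
    return (pos + prior_weight * prior_share) / (pos + neg + prior_weight)


def polarity_score(pos, neg, prior_share=0.5, prior_weight=10.0):
    """Map the smoothed positive share to a 0-10 score (assumed linear mapping)."""
    return 10.0 * smoothed_positive_share(pos, neg, prior_share, prior_weight)


def project_true_count(observed, precision, recall):
    """Project the true count from an observed count and measured P/R."""
    if recall <= 0:
        raise ValueError("recall must be positive")
    return observed * precision / recall
```

With these assumed defaults, a topic with 2 positive and 0 negative statements scores about 5.8 rather than a perfect 10, which is the effect smoothing is meant to have on low counts.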
Predicting Movie Sales from Blogger Sentiment
G. Mishne and N. Glance, “Predicting Movie Sales from Blogger Sentiment,” 2006 AAAI Spring Symposium on Computational Approaches to Analysing Weblogs.
Blogger Sentiment and Impact on Sales
• What we know:
  • There is a correlation between references to a product in the blogspace and its financial figures
    • Tong 2001: Movie buzz in Usenet is correlated with sales
    • Gruhl et al. 2005: Spikes in Amazon book sales follow spikes in blog buzz
• What we want to find out:
  • Does taking into account the polarity of the references yield a better correlation?
• Product of choice: movies
• Methodology: compare correlation of references to sales with the correlation of polar references to sales
Experiment
• 49 movies
  • Budget > $1M
  • Released between Feb. and Aug. 2005
• Sales data from IMDB
  • “Income per Screen” = opening weekend sales / screens
• Blog post collection
  • References to the movies in a 2-month window
  • Used IMDB link + simple heuristics
• Measure:
  • Pearson’s r between the Income per Screen and {references in blogs, positive/polar references in blogs}
  • Applied to various context lengths around the reference
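The measurement itself is a standard Pearson correlation between two per-movie series. The sketch below shows the computation in plain Python; the function names are assumptions, and any numbers passed to it in an example would be made up, since the per-movie data is not part of this deck.

```python
import math


def income_per_screen(opening_weekend_sales, screens):
    """'Income per Screen' as defined on the slide."""
    return opening_weekend_sales / screens


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)
```

The comparison in the talk amounts to calling `pearson_r` twice per setting: once with raw reference counts and once with positive (or polar) reference counts, each against the same income-per-screen series.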
Results
Income per screen vs. positive references
• For 80% of the movies, r > 0.75 for pre-release positive sentiment
• 12% improvement compared with correlation of movie sales with simple buzz count (0.542 vs. 0.484)
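Reading the 12% figure as a relative improvement of the sentiment-based correlation over the buzz-count correlation, the two quoted values are consistent with it:

```latex
\frac{0.542 - 0.484}{0.484} = \frac{0.058}{0.484} \approx 0.12
```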
Conclusion
• The intersection of Social Media and Data/Text Mining algorithms presents a viable business opportunity set to replace traditional forms of market research/social trend analysis/etc.
• Key elements include topic detection and sentiment mining.
• The success of the blogosphere has driven interest in a distinct form of online content which has a long history but is becoming more and more visible.
• The blogosphere itself is a fascinating demonstration of social content and interaction and will enjoy many applications of traditional and novel analysis.