I256: Applied Natural Language Processing. Marti Hearst, Nov 8, 2006. Today: Comparing term clustering and category output; Clustering in Weka; Data mining from blogs. LDA: Latent Dirichlet Allocation (Blei, Ng, Jordan, JMLR 03). LDA is a hierarchical probabilistic model of documents.
I256: Applied Natural Language Processing Marti Hearst Nov 8, 2006
Today • Comparing term clustering and category output • Clustering in Weka • Data mining from blogs
LDA • Latent Dirichlet Allocation • Blei, Ng, Jordan, JMLR 03. • LDA is a hierarchical probabilistic model of documents. • “LDA allows you to analyze of corpus, and extract the topics that combined to form its documents.” • http://www.cs.princeton.edu/~blei/lda-c/ • Not really clustering, but in the “soft clustering” ballpark.
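To make "hierarchical probabilistic model of documents" concrete, here is a minimal sketch of the generative story LDA assumes: each document draws its own topic mixture from a Dirichlet prior, then each word draws a topic from that mixture and a word from that topic. The topics, vocabulary, and `alpha` values below are made up for illustration; this is not the lda-c code, and it only generates documents rather than fitting a model.

```python
import random

# Hypothetical topics: each maps words to probabilities (illustrative only).
TOPICS = [
    {"flour": 0.5, "sugar": 0.3, "butter": 0.2},   # a "baking" topic
    {"garlic": 0.4, "onion": 0.4, "pepper": 0.2},  # a "savory" topic
]

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document following the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

rng = random.Random(0)
theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], length=10, rng=rng)
print(theta, words)
```

Because every document mixes several topics in different proportions, a document is never assigned to exactly one group, which is why the result sits in the "soft clustering" ballpark.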
LDA on Recipes http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-newblei/Flamenco
CastaNet • (Semi)automated facet creation • Stoica & Hearst • Build up from WordNet • Algorithm is fully automatic but we think you can improve results manually afterwards.
CastaNet on Recipes http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/recipes-automated/Flamenco
TopicSeek on Enron Email • Technique: pLSI (probabilistic LSI, Hofmann 99) • Hand-picked example for website • http://topicseek.com/enron.html
TopicSeek on Medline • Technique: pLSI (probabilistic LSI, Hofmann 99) • Hand-picked example for website • http://topicseek.com/pubmed.html
CastaNet on Medline Journal Titles http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/medicine-automated/Flamenco
Looking at Clustering Results • Weka lets you save cluster results to an ARFF file • I wrote some Python code to process this file and pull out the Subject headings for each newsgroup posting in each cluster.
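The processing script itself is not shown in the slides, so the following is only a sketch of what such a script might look like. It assumes the saved ARFF file has a string attribute named `subject` and a cluster-assignment attribute named `Cluster`; both names, and the sample rows, are assumptions for illustration. It also assumes attribute names contain no spaces.

```python
import csv
from collections import defaultdict

def subjects_by_cluster(arff_lines, subject_attr="subject", cluster_attr="Cluster"):
    """Group the subject field of each data row by its cluster label."""
    names, data_rows, in_data = [], [], False
    for line in arff_lines:
        line = line.strip()
        if not line or line.startswith("%"):      # skip blanks and comments
            continue
        if in_data:
            data_rows.append(line)
        elif line.lower().startswith("@attribute"):
            # The second token is the attribute name (assumed to have no spaces).
            names.append(line.split(None, 2)[1].strip("'\""))
        elif line.lower().startswith("@data"):
            in_data = True
    s_idx, c_idx = names.index(subject_attr), names.index(cluster_attr)
    clusters = defaultdict(list)
    # ARFF data rows are comma-separated; strings are wrapped in single quotes.
    for row in csv.reader(data_rows, quotechar="'", escapechar="\\",
                          skipinitialspace=True):
        clusters[row[c_idx]].append(row[s_idx])
    return dict(clusters)

# Made-up sample standing in for a saved cluster-results file.
SAMPLE = [
    "@relation news_clustered",
    "@attribute subject string",
    "@attribute length numeric",
    "@attribute Cluster {cluster0,cluster1}",
    "@data",
    "'Re: Mac, or PC?',120,cluster0",
    "'Hockey playoffs',80,cluster1",
    "'Mac SE/30 memory',95,cluster0",
]

for cluster, subjects in sorted(subjects_by_cluster(SAMPLE).items()):
    print(cluster, subjects)
```

For a real file, passing `open(path)` in place of `SAMPLE` should work, since the function only iterates over lines.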
Blog Analysis • What’s special about blogs?
Blog analysis sites • http://dijest.com/bc/ • Called blogcount; lots of stats and news about blogs • http://blogcensus.net/?page=tools • Language, location, marketshare • http://www.perseus.com/blogsurvey/ • Stats about biggest blogs, demographics • http://www.weblogs.com/ • Notify when new content posted • http://blogpulse.com/ • Trends and recent popular topics
Blogs vs. Newsgroups • Posting about products … what can we tell? • Blog: • Newsgroup: Example from Glance, Hurst, and Tomokiyo ‘04
Analyzing Blogs for Market Data • Idea: examine comments about a product (or a product’s competition or market) in an automated fashion. • Application area: handheld electronic devices. Figure from Glance, Hurst, Nigam, Siegler, Stockton, & Tomokiyo, KDD’05
Analyzing Blogs for Market Data Figure from Glance, Hurst, Nigam, Siegler, Stockton, & Tomokiyo, KDD’05
Technology used • Post segmentation • Important phrases • Foreground vs. background corpus • Background: text about product • Foreground: certain negative paragraphs about product • Sentiment classification • What do people talk about when saying negative things about product X? • Social network analysis (on discussion boards) • What does this group of people talk about when saying negative things about product X? • Author dispersion • Many people talking about it, or just a few?
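The foreground vs. background idea can be sketched as a phrase-scoring step: count phrases in the negative paragraphs (foreground) and in all text about the product (background), then rank phrases by how much more likely they are in the foreground. The slides do not say which scoring function the authors used, so the smoothed log-ratio below, the choice of bigrams as "phrases", and the toy sentences are all assumptions.

```python
import math
from collections import Counter

def bigrams(text):
    """Split text on whitespace and return adjacent word pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def phrase_scores(foreground_docs, background_docs, alpha=1.0):
    """Score each foreground bigram by a smoothed log-ratio of relative frequency."""
    fg = Counter(p for d in foreground_docs for p in bigrams(d))
    bg = Counter(p for d in background_docs for p in bigrams(d))
    vocab_size = len(set(fg) | set(bg))
    fg_total = sum(fg.values()) + alpha * vocab_size   # add-alpha smoothing
    bg_total = sum(bg.values()) + alpha * vocab_size
    return {
        p: math.log((fg[p] + alpha) / fg_total) - math.log((bg[p] + alpha) / bg_total)
        for p in fg
    }

# Toy data: foreground = negative paragraphs, background = all product text.
FOREGROUND = ["the battery life is terrible", "battery life is too short"]
BACKGROUND = ["the screen is bright", "the battery life is fine",
              "the keyboard is small", "the screen is large"]

scores = phrase_scores(FOREGROUND, BACKGROUND)
for phrase in sorted(scores, key=scores.get, reverse=True):
    print(" ".join(phrase), round(scores[phrase], 3))
```

Phrases that are common everywhere in the product text score near zero, while phrases concentrated in the negative paragraphs rise to the top. On corpora this small the ranking is noisy; the technique needs real volume to be meaningful.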
Example • What common phrases do people use when saying negative things about product X?
Example • What do people in this group say when saying negative things about product X?
Predicting Film Sales • Idea: • Use discussion before a film's release to predict its opening weekend box office scores • Use discussion afterwards to predict longer-term sales • Separate out topic labels from sentiment labels • Outcome: • Good predictor for opening weekend, but not for longer term • Observation: the nature of the discussion changes (and thus gets harder to analyze) after the film has been out a while. Example from Mishne & Glance, 2006
Predicting Film Sales Example from Mishne & Glance, 2006
Analyzing Political Blogs • Analyze: • Who links to whom • What the popularity profile looks like • A powerlaw/Zipf/Pareto, of course • Look at structure of topic-specific blogs • By #inbound links Image from blogosphere ecosystem via Shirky
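One rough way to check whether a popularity profile looks like a power law is to rank blogs by inbound links and fit a straight line to log(count) against log(rank); a slope near -1 is the classic Zipf shape. The sketch below assumes links arrive as (source, target) pairs and uses a plain least-squares fit, which is a crude estimator chosen here for brevity, not something the slides prescribe. The blog names are placeholders.

```python
import math
from collections import Counter

def inbound_counts(links):
    """Count inbound links per blog from (source, target) pairs."""
    return Counter(target for _, target in links)

def rank_slope(counts):
    """Least-squares slope of log(count) vs. log(rank); counts must be positive."""
    ranked = sorted(counts, reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Placeholder link data: (linking blog, linked-to blog).
LINKS = [("b", "a"), ("c", "a"), ("d", "a"), ("c", "b"), ("d", "b"), ("a", "c")]
counts = inbound_counts(LINKS)
print(counts.most_common())
print(rank_slope(list(counts.values())))
```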
Analyzing Political Blogs • Earlier work examined books bought together in pairs at major retailers • Krebs, Divided we Stand??? http://www.orgnet.com/leftright.html • In other domains the groupings are more distributed.
Analyzing Political Blogs • Study by Adamic and Glance, 2005 • Analyzed 40 most popular political blogs • 2 months preceding 2004 US presidential election • Also studied 1000 political blogs in a one-day snapshot • Findings for the latter: • Liberal and conservative blogs had distinct lists of favorite news sources, people, and topics, with some overlap on current news • Use labels from aggregator sources • Linking patterns were indeed pretty internal (91% stayed within political leaning) • More and more frequent linking among conservatives • 82% of conservative blogs linked out vs. 74% of liberal blogs
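A figure like "91% stayed within political leaning" comes down to a simple ratio over labeled links. The sketch below shows that computation on placeholder data; the blog names, labels, and links are invented, and how the study handled unlabeled blogs is not stated in the slides (here they are simply skipped).

```python
def within_leaning_share(links, leaning):
    """Fraction of links whose source and target share a political label."""
    # Keep only links where both ends have a known label (an assumption).
    labeled = [(s, t) for s, t in links if s in leaning and t in leaning]
    if not labeled:
        return 0.0
    same = sum(1 for s, t in labeled if leaning[s] == leaning[t])
    return same / len(labeled)

# Placeholder labels and links.
LEANING = {"a": "liberal", "b": "liberal", "c": "conservative", "d": "conservative"}
BLOG_LINKS = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c"), ("a", "c")]

print(within_leaning_share(BLOG_LINKS, LEANING))
```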
Analyzing Political Blogs • For the 40 most popular blogs: • Looked for “echo chamber” effect • The conservative blogs are more tightly interlinked. • Question: do they repeat the same concepts more? • Measured textual similarity among blog posts • Slightly stronger within a political leaning than between, but not one orientation more than the other. • Looked for interaction with “mainstream” media • Found strong distinctions in which sources were cited
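The within-vs-between comparison can be sketched with cosine similarity over bag-of-words vectors, averaged over pairs of posts. The slides do not name the similarity measure used in the study, so cosine similarity is an assumption here, as are the labels and toy posts.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two texts using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def mean_similarity(posts, same_leaning):
    """Average cosine over post pairs that do (or do not) share a leaning."""
    sims = [cosine(p1, p2)
            for (l1, p1), (l2, p2) in combinations(posts, 2)
            if (l1 == l2) == same_leaning]
    return sum(sims) / len(sims) if sims else 0.0

# Toy posts as (leaning, text) pairs.
POSTS = [
    ("L", "tax cuts debate"),
    ("L", "tax policy debate"),
    ("R", "border security debate"),
    ("R", "border security vote"),
]

print("within: ", mean_similarity(POSTS, same_leaning=True))
print("between:", mean_similarity(POSTS, same_leaning=False))
```

To ask whether one orientation repeats itself more than the other, the same average can be computed separately over the pairs within each leaning and the two numbers compared.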
Next Time • Sentiment and Opinion Analysis