Information Extraction from Social Media. Tim Finin 10 October 2006. Overview. Motivation Blogs and feeds UMBC research Seedling opportunities Conclusion. Motivation.
Information Extraction from Social Media Tim Finin 10 October 2006
Overview • Motivation • Blogs and feeds • UMBC research • Seedling opportunities • Conclusion
Motivation “Social media describes the online tools and platforms that people use to share opinions, insights, experiences, and perspectives with each other.” (Wikipedia, Sept. 2006) It’s a dynamic and growing area that includes blogs, wikis, forums, photo and video sharing sites, etc.
Motivation • We started looking at blogs a year ago because they were rich in metadata • Encoded in RDF and other formats • We’ve found that blogs and other social media are a rich source of problems and opportunities, including • Information integration on the Web • Modeling trust • Extracting facts, opinions and sentiment • Event and trend detection • If static pages form the Web’s long term memory, then the Blogosphere is its stream of consciousness
Overview • Motivation • Blogs and feeds • UMBC research • Seedling opportunities • Conclusion
State of the Blogosphere • 52 million blogs • Doubling in size every six months • 40 new blog posts per second • 57% of online US teens generate content, 40% read blogs, 20% have them • 53% of companies are blogging • One third of blog posts are in English • Sources: State of the Blogosphere (Technorati), Fortune 500 Business Blogging Wiki, Pew (11/05), Guideware (10/05), UMBC studies
50,000,000 Weblogs (July 2006) • Doubling in size every 6 months for the past 3 years • [Chart: cumulative weblogs, 03/03 – 07/06]
Feeds • RSS: Really Simple Syndication, Rich Site Summary or RDF Site Summary • 1997: David Winer introduced an XML syndication format for blogs • 1999: Netscape defined RSS using RDF • Very important for blogs and other social media • An efficient way to distribute new items, changes, updates • Simplifies infrastructure, obviating crawling • Google Blog Search is really Google feed search • Feeds for “most recent” blog posts, Wikipedia changes, news articles, sensor information, photos, data elements, etc.
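To make the "obviating crawling" point concrete, here is a minimal sketch of reading new items straight from a feed with Python's standard library. The sample feed, its URLs, and the function name `feed_items` are made up for illustration; real feeds vary (RSS 1.0 and Atom use XML namespaces this sketch does not handle).

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; a real one would be fetched from a feed URL.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>First post</title><link>http://example.org/1</link></item>
    <item><title>Second post</title><link>http://example.org/2</link></item>
  </channel>
</rss>"""


def feed_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


for title, link in feed_items(SAMPLE_RSS):
    print(title, link)
```

A consumer only has to poll one small document per blog to see what changed, rather than crawl and diff the site's pages.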
Overview • Motivation • Blogs and feeds • UMBC research • Seedling opportunities • Conclusion
Relevant UMBC Research • Splog detection • Feeds that matter • BlogVox: Extracting opinions from blogs • Modeling influence in blog communities • Semnews: NLP for information extraction on the Web • Semdis: Modelling trust in social networks
Knowing and influencing the market • Your goal is to market Apple’s iPod phone • How can you track the buzz about it? • What are the relevant communities and blogs? • Which communities are fans, which are suspicious, which are put off by the hype? • Is your advertising having an effect? The desired effect? • Which bloggers are influential in this market? Of these, which are already onboard and which are lost causes? • To whom should you send details or evaluation samples?
Modeling influence in social media • Key individuals in a social network are those that are influential • Influential nodes often rely on connectors and information propagators for new topics • Influence is topical • Aggregated beliefs and opinions of the masses can have an influence • Influence is polar • Influence is temporal
Influence on the Blogosphere • [Figure: an example post that was influenced by NPR and eWeek]
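Influence Models for Blogs • The blog graph is turned into a weighted influence graph • U links to V ⇒ U is influenced by V • Edge weight: $W_{u,v} = C_{u,v} / d_v$ • [Figure: a five-node blog graph beside its influence graph, with edge weights such as 1/3, 2/5, 1/5 and 1/2]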
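The edge-weight formula on this slide can be sketched in a few lines. The slide does not define its symbols, so this sketch assumes that C(u,v) is the number of links from blog u to blog v and that d(v) is the total number of links v receives. Both readings, and the function name `influence_weights`, are assumptions rather than the authors' definitions.

```python
from collections import Counter


def influence_weights(links):
    """Compute W[u, v] = C[u, v] / d[v] from a list of (u, v) link pairs.

    Assumption: C[u, v] counts links from u to v, and d[v] is the total
    number of links pointing at v. The slide does not spell this out.
    """
    link_counts = Counter(links)                    # C[u, v]
    links_received = Counter(v for _, v in links)   # d[v]
    return {(u, v): c / links_received[v]
            for (u, v), c in link_counts.items()}


# Blog "a" links to "v" twice and blog "b" links to "v" once.
weights = influence_weights([("a", "v"), ("a", "v"), ("b", "v")])
print(weights)
```

Under this reading the weights attached to any one blog v sum to 1, which matches the fractions shown in the figure (three edges of 1/3, or 2/5 + 2/5 + 1/5).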
Basic Influence Models • Linear Threshold Model: node v becomes active when $\sum_{u} w_{u,v} \ge \theta_v$, where the sum runs over the active neighbors u of v • Cascade Model: $p_{u,v}$ is the probability with which an active node u can activate its neighbor v, independent of history • [Figure: influence graph with active and inactive nodes and a threshold $\theta_v$]
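Both models can be simulated directly. The sketch below is illustrative only: the dictionary layout (`weights[(u, v)]` is what active neighbor u contributes toward v), the function names, and the toy graph are assumptions, not the authors' implementation.

```python
import random


def linear_threshold(weights, thresholds, seeds):
    """Activate v once the summed weight of its active neighbors reaches theta_v."""
    active = set(seeds)
    changed = True
    while changed:  # repeat until no further node can be activated
        changed = False
        for v, theta in thresholds.items():
            if v in active:
                continue
            total = sum(w for (u, x), w in weights.items()
                        if x == v and u in active)
            if total >= theta:
                active.add(v)
                changed = True
    return active


def independent_cascade(probs, seeds, rng):
    """Each newly active node u gets one chance to activate v with probs[(u, v)]."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        u = frontier.pop()
        for (a, v), p in probs.items():
            if a == u and v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return active


# Toy graph: a and b each contribute 0.3 toward c; c contributes 0.6 toward d.
weights = {("a", "c"): 0.3, ("b", "c"): 0.3, ("c", "d"): 0.6}
thresholds = {"c": 0.5, "d": 0.5}
print(linear_threshold(weights, thresholds, {"a", "b"}))
print(independent_cascade(weights, {"a"}, random.Random(0)))
```

In the threshold run, neither a nor b alone can activate c, but together they pass its threshold and the activation then carries on to d.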
Greedy Node Selection Heuristic • At each time step, select the next node to add to the target set so that it maximizes the number of ‘influential’ nodes • Adding the new node must cause an increase in the activated node set • Results are consistent with Technorati rank • [Figure: distribution of Technorati ranks among the 100 most frequently selected nodes using greedy heuristics (averaged over 50+ runs)]
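A greedy selection loop of this kind might look like the following sketch. It assumes a caller-supplied `spread` function that returns the set of nodes activated by a seed list; the `reachable` helper stands in for a real influence model and is purely hypothetical.

```python
def greedy_seeds(nodes, spread, k):
    """Pick up to k seeds, each time adding the node with the largest
    increase in the activated set; stop early if no node adds anything."""
    chosen = []
    for _ in range(k):
        base = len(spread(chosen))
        best, best_gain = None, 0
        for n in nodes:
            if n in chosen:
                continue
            gain = len(spread(chosen + [n])) - base
            if gain > best_gain:
                best, best_gain = n, gain
        if best is None:
            break
        chosen.append(best)
    return chosen


def reachable(edges):
    """Stand-in spread model: everything reachable from the seeds is activated."""
    def spread(seeds):
        active, stack = set(seeds), list(seeds)
        while stack:
            u = stack.pop()
            for v in edges.get(u, ()):
                if v not in active:
                    active.add(v)
                    stack.append(v)
        return active
    return spread


edges = {"a": ["b", "c"], "b": ["c"], "d": ["e"]}
print(greedy_seeds(["a", "b", "c", "d", "e"], reachable(edges), 2))
```

With a randomized model such as the cascade model, `spread` would normally average over many simulation runs, which is presumably why the slide reports results averaged over 50+ runs.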
Modeling influence in social media • Key individuals in a social network are those that are influential • Influential nodes often rely on connectors and information propagators for new topics • Influence is topical • Aggregated opinions of the masses can have an influence • Influence is polar • Influence is temporal
Influence is topical • Gizmodo is very popular • It’s influential for consumer electronics, e.g., PDAs, mobile phones, gadgets • DailyKOS is very popular • It’s influential for politics, especially liberal politics • What’s a good ontology for blog topics? • How can we categorize blogs w.r.t. a topic ontology?
Readership Based Influence Feeds That Matter: http://ftm.umbc.edu/ • 83K publicly listed subscribers • 2.8M feeds, 500K are unique • 26K users (35%) use folders to organize subscriptions • Data collected in May 2006
Tag Merging • Folder names are used as topics • A lower-ranked folder is merged into a higher-ranked folder if there is an overlap and a high cosine similarity
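The merging rule can be sketched as follows, assuming each folder name is represented by a vector of how many subscribers filed each feed under it. The 0.5 threshold, the input layout, and the function names are illustrative assumptions; the slide gives no specific values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_folders(folders, threshold=0.5):
    """folders: list of (name, {feed: count}), highest-ranked folder first.

    A folder is folded into the first higher-ranked folder it resembles;
    otherwise it is kept as a topic of its own.
    """
    merged = []
    for name, vec in folders:
        for _, target in merged:
            if cosine(vec, target) >= threshold:
                for feed, count in vec.items():
                    target[feed] = target.get(feed, 0) + count
                break
        else:
            merged.append((name, dict(vec)))
    return merged


folders = [
    ("politics", {"kos": 3, "redstate": 2}),
    ("political", {"kos": 1, "redstate": 1}),
    ("gadgets", {"gizmodo": 4}),
]
print([name for name, _ in merge_folders(folders)])
```

Vectors with no feed in common have a cosine of 0, so the overlap condition on the slide is implied by any positive threshold.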
Finding Influential Feeds using “Co-Citations” • Feed recommendations • Leading blogs about “Politics” • The seed set is the top blogs in “politics” from Bloglines, and the blog graph used is from the Blogpulse dataset.
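One simple way to read "co-citation" here: a blog is a good recommendation if the same sources that link to a seed blog also link to it. The sketch below scores candidates that way. The scoring rule, the function name, and the toy links are assumptions; the slide does not say exactly how the co-citation scores were computed.

```python
from collections import Counter


def cocitation_scores(links, seeds):
    """Rank non-seed blogs by how often they are cited alongside seed blogs.

    links: iterable of (citer, cited) pairs from the blog graph.
    """
    seeds = set(seeds)
    cited_by = {}
    for citer, cited in links:
        cited_by.setdefault(citer, set()).add(cited)

    scores = Counter()
    for cited in cited_by.values():
        seed_hits = len(cited & seeds)
        if seed_hits:
            for blog in cited - seeds:
                scores[blog] += seed_hits
    return scores.most_common()


links = [
    ("x", "kos"), ("x", "tpm"),
    ("y", "kos"), ("y", "tpm"), ("y", "gizmodo"),
    ("z", "gizmodo"),
]
print(cocitation_scores(links, {"kos"}))
```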
Modeling influence in social media • Key individuals in a social network are those that are influential. • Influential nodes often rely on connectors and information propagators for new topics. • Influence is topical. • Aggregated facts and opinions of the masses can have an influence (‘wisdom of the crowds’) • Influence is polar. • Influence is temporal.
Extracting facts and opinions • 2006 TREC blog track: finding opinionated blog posts about a given topic • SemNews: extracting facts from Web documents using the OntoSem NLP system • Note: there are several startups and other companies trying to commercialize opinion mining
TREC Opinion Extraction • Finding opinionated posts, either positive or negative, about a query • 2006 TREC Blog corpus: • 80K blogs • 300K posts • 50 test queries
BlogVox: Opinion Extraction • Input: query terms and Lucene search results • Result scoring: (1) Query Word Proximity Scorer, (2) Query Word Count Scorer, (3) Title Word Scorer, (4) First Occurrence Scorer, (5) Context Words Scorer, (6) Lucene Relevance Score • An SVM Score Combiner merges the scores into opinionated ranked results • Supporting lexicons: positive word list, negative word list • External resources: Google context words, Amazon review words
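To give a feel for the kind of per-post features such scorers could feed into a combiner, here is a small sketch of three of them. The slide only names the scorers, so the definitions below (raw counts, relative position of the first match, title overlap) and the function name `post_features` are guesses for illustration, not BlogVox's actual scoring functions.

```python
def post_features(query, title, body):
    """Compute three illustrative features for one post and one query."""
    query_words = query.lower().split()
    title_words = title.lower().split()
    body_words = body.lower().split()

    # How many times the query words occur in the post body.
    count = sum(body_words.count(w) for w in query_words)

    # Relative position of the first query word (0.0 = start, 1.0 = absent).
    positions = [i for i, w in enumerate(body_words) if w in query_words]
    first = positions[0] / len(body_words) if positions else 1.0

    # How many query words appear in the title.
    title_hits = sum(1 for w in query_words if w in title_words)

    return {"query_word_count": count,
            "first_occurrence": first,
            "title_words": title_hits}


print(post_features("ipod", "My iPod review", "I love my ipod the ipod is great"))
```

A trained combiner would then take vectors like this, together with lexicon-based scores, and output a single ranking score per post.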
Spam in the Blogosphere • Types: comment spam, ping spam, splogs • Akismet: “87% of all comments are spam” • 75% of update pings are spam (ebiquity 2005) • 56% of blogs are spam (ebiquity 2005) • 20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005) • Spam blogs (splogs) are weblogs used to promote affiliated websites or host ads • “Spings, or ping spam, are pings that are sent from spam blogs” (Wikipedia)
Some queries returned mostly splogs • Examples: “hybrid cars”, “cholesterol”
Post Content Identification • Baseline Heuristic • SVM Method
Modeling influence in social media • Key individuals in a social network are those that are influential • Influential nodes often rely on connectors and information propagators for new topics • Influence is topical • Aggregated opinions of the masses can have an influence • Influence is polar • Influence is temporal
Link Polarity / Citation Signal • Linking alone is not an indicator of influence • Polarity can indicate the type of influence • Not all links are equal: post, comment, trackback, blogroll, advertising • Polarity is also useful in other applications such as trust and bias • [Figure: bloggers A, B, C and D connected by links labeled with a topic and a polarity: <books, -0.9>, <Movies, +0.9>, <food, +0.3>, <cars, +0.5>, <Movies, +0.8>, <Music, -0.6>]
Modeling influence in social media • Key individuals in a social network are those that are influential • Influential nodes often rely on connectors and information propagators for new topics • Influence is topical • Aggregated opinions of the masses can have an influence • Influence is polar • Influence is temporal
Unwind the Influence in Time • Who started the initial wave? • Who jumped on the story at the same time? • How far did the wave propagate? • [Figure: a story spreading outward from a source S at times t1 through t5]
SemNews: News to OWL • Semantically search and browse news • Aggregators collect the RSS news descriptions from various sources • The sentences are processed by OntoSem and converted into TMRs • And then into RDF and OWL • Provides intelligent agents with the latest news in a machine-readable format • http://semnews.umbc.edu/
[Figure: architecture diagram with numbered data flows 1 to 15] • Stages: data aggregators, language processing, fact repository interface, Semantic Web tools, knowledge editor environment • Components shown: RSS aggregator, news feeds, OntoSem, TMRs, OntoSem2OWL, OntoSem ontology (OWL), inferred triples, semantic RSS, Swoogle index, RDQL query, FR text search, ontology & instance browser, Dekade editor