Web Communities: The World Online

Raghu Ramakrishnan, Chief Scientist for Audience and Cloud Computing, Research Fellow, Yahoo! (On leave, Univ. of Wisconsin-Madison)

Evolution of Online Communities

Rate of content creation
• Estimated growth of content
• Published content from traditional sources: 3-4 Gb/day
• Professional web content: ~2 Gb/day
• User-generated content: 8-10 Gb/day
• Private text content: ~3 Tb/day (200x more)
• Upper bound on typed content: ~700 Tb/day
(Towards a PeopleWeb, Ramakrishnan & Tomkins, IEEE Computer, August 2007)

Metadata
• Estimated growth of metadata
• Anchortext: 100Mb/day
• Tags: 40Mb/day
• Pageviews: 100-200Gb/day
• Reviews: Around 10Mb/day
• Ratings: <small>
Drove most advances in search from 1996 to the present
Increasingly rich and available, but not yet useful in search
This is despite the fact that interactions on the web are currently limited because each site is essentially a silo

PeopleWeb: Site-Centric People-Centric
Global Object Model; Portable Social Environment
• Common web-wide id for objects (incl. users)
• Even common attributes? (e.g., pixels for camera objects)
• As users move across sites, their personas and social networks will be carried along
• Increased semantics on the web through community activity (another path to the goals of the Semantic Web)
Community Search
(Towards a PeopleWeb, Ramakrishnan & Tomkins, IEEE Computer, August 2007)

Content Access and Ownership (Slide courtesy Andrew Tomkins)

Facebook Apps, Open Social
• Web site provides canvas
• Third party apps can paint on this canvas
• “Paint” comes from data on and off-network
• Via APIs that each site chooses to expose
What is the core asset of a web portal?
• What are the computational implications?
• App hosting and caching
• Dynamic, personalized content
• Searching over “spaghetti” information threads

Trends in Search

Search and Content Supply
• Premise:
• People don’t want to search
• People want to get tasks done
Example task (Start to Finish): “I want to book a vacation in Tuscany.”
(Broder 2002, A Taxonomy of web search)

Structure / Intent
Query: “seafood san francisco”
Inferred for the query: Category: restaurant; Location: San Francisco
Ad (Category: restaurant; Location: San Francisco): Reserve a table for two tonight at SF’s best Sushi Bar and get a free sake, compliments of OpenTable!
Listing (Category: restaurant; Location: San Francisco): Alamo Square Seafood Grill - (415) 440-2828, 803 Fillmore St, San Francisco, CA - 0.93mi - map
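The query annotation on this slide can be illustrated with a toy lookup. This is only a sketch under loud assumptions: the two dictionaries and the function name `annotate_query` are hypothetical, and a real engine would use far richer models than keyword matching.

```python
# Hypothetical, hand-made gazetteers; not taken from any real system.
CATEGORY_TERMS = {"seafood": "restaurant", "sushi": "restaurant"}
LOCATIONS = {"san francisco": "San Francisco"}


def annotate_query(query):
    """Map a free-text query to a small structured 'intent' record."""
    text = query.lower()
    intent = {}
    # Match multi-word location phrases first, then remove them from the text.
    for phrase, location in LOCATIONS.items():
        if phrase in text:
            intent["location"] = location
            text = text.replace(phrase, " ")
    # Any remaining token that names a known category sets the category.
    for token in text.split():
        if token in CATEGORY_TERMS:
            intent["category"] = CATEGORY_TERMS[token]
            break
    return intent


if __name__ == "__main__":
    print(annotate_query("seafood san francisco"))
```

Ads and listings tagged with the same category and location can then be matched against this record rather than against the raw keywords.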

Y! Shortcuts

Google Base

Search as Killer App for Web Data Semantics • Publishers and search engine collaborate • Example: Abstracts surfacing structured content • Users see richer search experience • Accomplish their tasks faster and more effectively

Social Search

Social Search
• Explicitly open up search
  • Enable communities, sites and consumers to explicitly re-define search results (e.g., SearchMonkey, BOSS)
  • What is the right unit for a “search result”? Can we intelligently “stitch together” more informative abstracts, possibly from multiple sources?
  • Facilitate creation of specialized ranking engines based on different kinds of tasks, or aimed at different communities of users
• Implicitly leverage socially engaged users and their interactions
  • Learning from shared community interactions, and leveraging community interactions to create and refine content
  • Expanding search results to include sources of information
    • E.g., Experts, sub-communities of shared interest, particular search engines (in a world with many, this is valuable!)
Reputation, Quality, Trust, Privacy

Opening Up Yahoo! Search
Phase 1: Giving site owners and developers control over the appearance of Yahoo! Search results.
Phase 2: BOSS takes Yahoo!’s open strategy to the next level by providing Yahoo! Search infrastructure and technology to developers and companies to help them build their own search experiences.
(Slide courtesy Prabhakar Raghavan)

What Is It?
An open platform for using structured data to build more useful and relevant search results
(Figure: search result before and after)
(Slide courtesy Amit Kumar)

What’s New? media product images business photos profile pictures task links buy this user reviews best trips user choice remove report spam send result share this rich result with others favicon structured data review ratings product prices hours of operation (Slide courtesy Amit Kumar)

How Does It Work?
1. Site owners/publishers share structured data with Yahoo!.
2. Site owners & third-party developers build SearchMonkey apps.
3. Consumers customize their search experience with Enhanced Results or Infobars.
Figure labels: Acme.com’s Web Pages; Page Extraction; RDF/Microformat Markup; Index; Acme.com’s database; DataRSS feed; Web Services
(Slide courtesy Amit Kumar)

Publishing Structured Data: Support for Emerging Semantic Web Standards
• Markup (crawl): Microformats (hCard, hEvent, hReview, hAtom, XFN; more as they get adopted); RDFa and eRDF
• APIs (pull): OpenSearch + extensions to return structured data
• Push: Atom/RSS Feeds + extensions to embed structured data
(Slide courtesy Andrew Tomkins)
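To make the "markup (crawl)" path concrete, here is a minimal sketch of pulling a few hCard properties out of a page with Python's standard `html.parser`. It is an illustration only, not how Yahoo! extracted data: the class `HCardParser`, the helper `parse_hcard`, and the sample markup are hypothetical, and the sketch handles only the `fn`, `org` and `tel` properties on well-nested markup without void tags.

```python
from html.parser import HTMLParser


class HCardParser(HTMLParser):
    """Collect the text of elements whose class names an hCard property."""

    PROPERTIES = {"fn", "org", "tel"}  # small subset chosen for this sketch

    def __init__(self):
        super().__init__()
        self.card = {}
        self._stack = []  # one entry per open tag: its property name or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        prop = next((c for c in classes if c in self.PROPERTIES), None)
        self._stack.append(prop)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute text to the innermost enclosing property, if any.
        prop = next((p for p in reversed(self._stack) if p), None)
        if prop and data.strip():
            self.card[prop] = (self.card.get(prop, "") + data).strip()


def parse_hcard(html):
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.card


SAMPLE = (
    '<div class="vcard">'
    '<span class="fn">Alamo Square Seafood Grill</span> '
    '<span class="tel">(415) 440-2828</span>'
    '</div>'
)

if __name__ == "__main__":
    print(parse_hcard(SAMPLE))
```

The point of the markup route is that the publisher's existing page carries the structure, so a crawler can recover fields like name and phone number without a separate feed.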

Infobars: Integrating 3rd Party Data Pull in data from any web service (Slide courtesy Amit Kumar)

Search Results of the Future yelp.com Gawker babycenter New York Times epicurious LinkedIn answers.com webmd (Slide courtesy Andrew Tomkins)

BOSS Offerings
BOSS offers two options for companies and developers and has partnered with top technology universities to drive search experimentation, innovation and research into next generation search.
• API: A self-service, web services model for developers and start-ups to quickly build and deploy new search experiences.
• CUSTOM: Working with 3rd parties to build a more relevant, brand/site specific web search experience. This option is jointly built by Yahoo! and select partners.
• ACADEMIC: Working with the following universities to allow for wide-scale research in the search field: University of Illinois Urbana Champaign, Carnegie Mellon University, Stanford University, Purdue University, MIT, Indian Institute of Technology Bombay, University of Massachusetts
(Slide courtesy Prabhakar Raghavan)

BOSS Could Enable Custom Search Experiences Social Search Vertical Search Visual Search (Slide courtesy Prabhakar Raghavan)

Partner Examples

Web Search Results for “Lisa” Latest news results for “Lisa”. Mostly about people because Lisa is a popular name 41 results from My Web! Web search results are very diversified, covering pages about organizations, projects, people, events, etc.

Save / Tag Pages You Like
• Enter your note for personal recall and sharing purposes
• You can save / tag pages you like into My Web from toolbar / bookmarklet / save buttons
• You can pick tags from the suggested tags based on collaborative tagging technology
• Type-ahead based on the tags you have used
• You can specify a sharing mode
• You can save a cache copy of the page content
(Courtesy: Raymie Stata)

My Web 2.0 Search Results for “Lisa” Excellent set of search results from my community because a couple of people in my community are interested in Usenix Lisa-related topics

Google Co-Op
Query-based direct-display, programmed by Contributor
This query matches a pattern provided by Contributor, so the SERP displays (query-specific) links programmed by Contributor.
Subscribed Link: edit | remove
Users opt in by “subscribing” to them

Tech Support at COMPAQ “In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That’s how we found QUIQ. It’s exactly the philosophy we’re looking for.” “Tech support people can’t keep up with generating content and are not experts on how to effectively utilize the product … Mass Collaboration is the next step in Customer Service.” – Steve Young, VP of Customer Care, Compaq

How It Works
Participants: Customer; Support Agent; Partner Experts; Customer Champions; Employees
Flow labels: QUESTION; KNOWLEDGE BASE; SELF SERVICE; ANSWER
Answer added to power self service

Timely Answers
77% of answers provided within 24h
• No effort to answer each question
• No added experts
• No monetary incentives for enthusiasts
Chart: Questions: 6,845 (74% answered)
Answers provided in 3h: 40% (2,057)
Answers provided in 12h: 65% (3,247)
Answers provided in 24h: 77% (3,862)
Answers provided in 48h: 86% (4,328)

Power of Knowledge Creation
SHIELD 1: Self-Service, ~80% *)
SHIELD 2: Customer Mass Collaboration, 5-10% *)
SUPPORT: Support Incidents, Agent Cases
Knowledge Creation
*) Averages from QUIQ implementations

Mass Contribution
Users who on average provide only 2 answers provide 50% of all answers
Answers: 100% (6,718) in total; 50% (3,329) contributed by mass of users; the rest by top users
Contributing Users: top users 7% (120); mass of users 93% (1,503)
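The headline claim can be checked against the slide's own counts. The sketch below uses only the four figures on the slide; the one assumption is that top users contributed every answer not attributed to the mass of users. The function name `contribution_summary` is hypothetical.

```python
# Figures read off the slide.
TOTAL_ANSWERS = 6718
MASS_ANSWERS = 3329   # answers contributed by the mass of users
TOP_USERS = 120
MASS_USERS = 1503


def contribution_summary():
    # Assumption: top users account for all answers not attributed to the mass.
    top_answers = TOTAL_ANSWERS - MASS_ANSWERS
    all_users = TOP_USERS + MASS_USERS
    return {
        "mass_share_of_answers": MASS_ANSWERS / TOTAL_ANSWERS,
        "answers_per_mass_user": MASS_ANSWERS / MASS_USERS,
        "answers_per_top_user": top_answers / TOP_USERS,
        "top_share_of_users": TOP_USERS / all_users,
    }


if __name__ == "__main__":
    for name, value in contribution_summary().items():
        print(f"{name}: {value:.3f}")
```

Under that assumption the mass of users averages a little over 2 answers each and the top users about 28 each, which is consistent with the slide's "only 2 answers" and "50% of all answers".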

Interesting Problems • Question categorization • Detecting undesirable questions & answers • Identifying “trolls” • Ranking results in Answers search • Finding related questions • Estimating question & answer quality (Byron Dom: SIGIR talk)

Supplying Structured Search Content • Semantic Web? • Unleash community computing—PeopleWeb! • Three ways to create semantically rich summaries that address the user’s information needs: • Editorial, Extraction, UGC Challenge: Design social interactions that lead to creation and maintenance of high-quality structured content

Better Search via Information Extraction
• Extract, then exploit, structured data from raw text:
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
PEOPLE
Name | Title | Organization
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | Founder | Free Soft..
Query: Select Name From PEOPLE Where Organization = ‘Microsoft’
Result: Bill Gates, Bill Veghte
(from Cohen’s IE tutorial, 2003)
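The "exploit" half of this slide can be reproduced with an in-memory SQLite database. This is a sketch, not the tutorial's code: the rows are typed in by hand from the slide's PEOPLE table (with the truncated organization spelled out from the quoted text), the extraction step itself is not shown, and the function name `microsoft_people` is hypothetical.

```python
import sqlite3

# Rows copied by hand from the slide's PEOPLE table; extraction is not shown.
PEOPLE = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Software Foundation"),
]


def microsoft_people():
    """Run the slide's query against an in-memory copy of the table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, title TEXT, organization TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", PEOPLE)
    rows = conn.execute(
        "SELECT name FROM people WHERE organization = ? ORDER BY name",
        ("Microsoft",),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(microsoft_people())
```

Once the text has been turned into rows, a question like "who works at Microsoft?" becomes an exact query instead of a keyword search over the article.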

Community Information Management (CIM) • Many real-life communities have a Web presence • Database researchers, movie fans, stock traders • Each community = many data sources + people • Members want to query and track at a semantic level: • Any interesting connection between researchers X and Y? • List all courses that cite this paper • Find all citations of this paper in the past one week on the Web • What is new in the past 24 hours in the database community? • Which faculty candidates are interviewing this year, where?

DBLife • Integrated information about a (focused) real-world community • Collaboratively built and maintained by the community • Semantic web via extraction & community

DBLife • Faculty: AnHai Doan & Raghu Ramakrishnan • Students: P. DeRose, W. Shen, F. Chen, R. McCann, Y. Lee, M. Sayyadian • Prototype system up and running since early 2005 • Plan to release a public version of the system in Spring 2007 • 1164 sources, crawled daily, 11000+ pages / day • 160+ MB, 121400+ people mentions, 5600+ persons • See DE overview article, CIDR 2007 demo

DBLife Papers • Efficient Information Extraction over Evolving Text Data, F. Chen, A. Doan, J. Yang, R. Ramakrishnan. ICDE-08. • Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach, P. DeRose, W. Shen, F. Chen, A. Doan, R. Ramakrishnan. VLDB-07. • Declarative Information Extraction Using Datalog with Embedded Extraction Predicates, W. Shen, A. Doan, J. Naughton, R. Ramakrishnan. VLDB-07. • Source-aware Entity Matching: A Compositional Approach, W. Shen, A. Doan, J.F. Naughton, R. Ramakrishnan: ICDE 2007. • OLAP over Imprecise Data with Domain Constraints, D. Burdick, A. Doan, R. Ramakrishnan, S. Vaithyanathan. VLDB-07. • Community Information Management, A. Doan, R. Ramakrishnan, F. Chen, P. DeRose, Y. Lee, R. McCann, M. Sayyadian, and W. Shen. IEEE Data Engineering Bulletin, Special Issue on Probabilistic Databases, 29(1), 2006. • Managing Information Extraction, A. Doan, R. Ramakrishnan, S. Vaithyanathan. SIGMOD-06 Tutorial.

DBLife
• Integrate data of the DB research community
• 1164 data sources, crawled daily: 11000+ pages = 160+ MB / day

Entity Extraction and Resolution co-authors = A. Doan, Divesh Srivastava, ... Raghu Ramakrishnan

Resulting ER Graph
Nodes: “Proactive Re-optimization” (paper); Pedro Bizarro; Shivnath Babu; David DeWitt; Jennifer Widom; SIGMOD 2005
Edge labels: write; coauthor; advise; PC-member; PC-Chair

Challenges
• Extraction
  • Domain-level vs. site-level extraction “templates”
  • Compositional, customizable approach to extraction planning
  • Blending extraction with other sources (feeds, wiki-style user edits)
• Maintenance of extracted information
  • Managing information extraction
  • Incremental maintenance of “extracted views” at large scales
  • Mass Collaboration: community-based maintenance
• Exploitation
  • Search/query over extracted structures in a community
  • Search across communities: Semantic Web through the back door!
  • Detect interesting events and changes

Mass Collaboration
We want to leverage user feedback to improve the quality of extraction over time.
Maintaining an extracted “view” on a collection of documents over time is very costly; getting feedback from users can help.
In fact, distributing the maintenance task across a large group of users may be the best approach.

Mass Collaboration: A Simplified Example Not David! Picture is removed if enough users vote “no”.
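The voting rule in this example can be sketched in a few lines. The slide says only that the picture is removed if "enough" users vote "no", so the threshold of 3 and the function name `should_remove` are assumptions made for illustration. Votes are keyed by user so that each user counts at most once.

```python
def should_remove(votes_by_user, min_no_votes=3):
    """Return True once at least `min_no_votes` distinct users voted "no".

    `votes_by_user` maps a user id to that user's latest vote ("yes" or "no").
    The default threshold is an assumed value, not taken from the slide.
    """
    no_votes = sum(1 for vote in votes_by_user.values() if vote == "no")
    return no_votes >= min_no_votes


if __name__ == "__main__":
    votes = {"user1": "no", "user2": "no", "user3": "yes", "user4": "no"}
    print(should_remove(votes))
```

A fixed count treats all voters alike, so a rule this simple cannot by itself tell a careful user from a mischievous one.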

Mass Collaboration Meets Spam Jeffrey F. Naughton swears that this is David J. DeWitt
