Data Cloud. Yury Lifshits Yahoo! Research http://yury.name. My Beliefs. The key challenge in web search is structured search Part 1: What is structured search? The key challenge in structured search is collecting data Part 2: Data distribution & idea of Data Cloud
Data Cloud Yury Lifshits Yahoo! Research http://yury.name
My Beliefs The key challenge in web search is structured search Part 1: What is structured search? The key challenge in structured search is collecting data Part 2: Data distribution & idea of Data Cloud Part 3: Demo: numeric data distribution The key challenge in collecting data is incentive design Part 4: Economics of data distribution
Structured Search
Data = data of entities + data of content Structured data Entity unit: • Identifier • Metadata: • Explicit key-value pairs • Relational properties • Evaluation Semi-structured data Content unit: • Body: text, video, audio, or image • Metadata: • Explicit key-value pairs • Relational properties • Evaluation
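A minimal sketch of the two unit types as Python dataclasses. The class and field names are illustrative assumptions, not taken from the talk, and grouping "evaluation" under metadata is one possible reading of the flattened slide.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    # Explicit key-value pairs, e.g. {"price": 15.0}
    key_values: dict = field(default_factory=dict)
    # Relational properties as (relation, target identifier) pairs
    relations: list = field(default_factory=list)
    # Evaluation signals such as ratings or votes (assumed to live here)
    evaluation: dict = field(default_factory=dict)


@dataclass
class EntityUnit:
    """Structured data: an identifier plus metadata."""
    identifier: str
    metadata: Metadata = field(default_factory=Metadata)


@dataclass
class ContentUnit:
    """Semi-structured data: a body plus metadata."""
    body: str
    media_type: str = "text"  # "text", "video", "audio", or "image"
    metadata: Metadata = field(default_factory=Metadata)


# Hypothetical example values, for illustration only.
concert = EntityUnit(
    identifier="event:indie-night",
    metadata=Metadata(
        key_values={"city": "SF", "price": 15.0},
        relations=[("venue", "place:some-club")],
        evaluation={"rating": 4.5},
    ),
)
review = ContentUnit(
    body="Great show.",
    metadata=Metadata(relations=[("about", concert.identifier)]),
)
```

The only difference between the two types in this sketch is that an entity is keyed by an identifier while a content unit carries a body; both share the same metadata shape.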
Structured Search Factoid search “what's the value of property X of object Y” • Entity hubs • Domain hubs Structured object search “all concerts this weekend in SF under $20 sorted by popularity” • Time focus • Ranking focus • Relations focus Structured content search “all videos with Tom Brady” “all comments and blog posts about Bing”
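The structured object query above ("all concerts this weekend in SF under $20 sorted by popularity") can be read as a filter over typed fields followed by a sort. A minimal sketch, assuming events are plain dictionaries; the sample records, field names, and the `search` function are hypothetical, and the weekend is passed in as explicit dates.

```python
from datetime import date

# Hypothetical sample data, for illustration only.
EVENTS = [
    {"name": "Indie Night", "type": "concert", "city": "SF",
     "date": date(2009, 6, 13), "price": 15.0, "popularity": 80},
    {"name": "Jazz Brunch", "type": "concert", "city": "SF",
     "date": date(2009, 6, 14), "price": 10.0, "popularity": 95},
    {"name": "Arena Show", "type": "concert", "city": "SF",
     "date": date(2009, 6, 13), "price": 60.0, "popularity": 99},
    {"name": "Open Mic", "type": "concert", "city": "Oakland",
     "date": date(2009, 6, 13), "price": 5.0, "popularity": 40},
    {"name": "Tech Talk", "type": "education", "city": "SF",
     "date": date(2009, 6, 13), "price": 0.0, "popularity": 70},
]


def search(events, *, type_, city, start, end, max_price):
    """Filter on type, city, date range, and price; rank by popularity."""
    hits = [
        e for e in events
        if e["type"] == type_
        and e["city"] == city
        and start <= e["date"] <= end      # time focus
        and e["price"] < max_price
    ]
    # Ranking focus: most popular first.
    return sorted(hits, key=lambda e: e["popularity"], reverse=True)


results = search(
    EVENTS,
    type_="concert",
    city="SF",
    start=date(2009, 6, 13),
    end=date(2009, 6, 14),
    max_price=20.0,
)
print([e["name"] for e in results])
```

Each clause of the query maps to one field, which is only possible when the data arrives with explicit key-value metadata rather than free text.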
Yury’s Wishlist Business-generated data • Products, services, news, wishlists, contact data Reality stream, sensors • What has happened where Expert knowledge • Glossary, issues, typical solutions, object databases, related objects graph Events • Sport, concerts, education, corporate, community, private Market graph & signals • Like, interested, use, following, want to buy; votes and ratings
Search as a Platform [Diagram; recoverable labels: Query analysis, Post analysis, Classic search, Structured Data, Web index, App 1, App 2, App 3, App 4]
Data Cloud How to collect all structured data in one place?
Data Producers • People: forums, wiki, mail groups, blogs, social networks • Enterprises: product profiles, corporate news, professional content • Sensors: GPS modules, web cameras, traffic sensors, RFID • Transactional data
Data Distributors A data distributor is any technical solution that accumulates, organizes, and provides access to structured and semi-structured data Data publisher: the original distributor of some data Data retailer: a consumer-facing distributor of some data
Data Consumers • Humans • Email • Aggregators: news, friend feeds, RSS readers • Search • Browsing / random walks • Intelligence projects • Recommendation systems • Trend mining
Data Cloud Data Cloud is a centralized fully-functional data distribution service Success metric for data cloud strategy = the total “value” of data on the cloud
To-Cloud Solutions • Extraction • DBpedia.org, “web tables” • Semantic markup, data APIs • Yahoo! SearchMonkey • Feeds • Yahoo! Shopping • Disqus.com, js-kit.com, Facebook Connect • Direct publishing
On-Cloud Solutions • Ontology maintenance • Freebase • Normalization, de-duplication, antispam • Named entity recognition, metadata inference, ranking • Data recycling (cross-references) • Amazon Public Data Sets • Viral license • Hosted search • Yahoo! BOSS
From-Cloud Solutions • Search, audience • Y! SearchMonkey, Google Base • Data API, dump access, update stream • Custom notifications • Gnip.com • Data cloud as a primary backend • Access control • Ad distribution. (AT&T and Yahoo! Local deal)
Demo: webNumbr.com Joint work with Paul Tarjan
webNumbr.com: Import • Crawl numbers from the web: URL + XPath + regex • Create “numbr pages” • Update their values every hour • Keep the history Anyone can create a numbr: http://webnumbr.com/create
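A minimal sketch of the "XPath + regex" extraction step, using only the Python standard library. This is not webNumbr's actual code: the function name, the default pattern, and the sample markup are assumptions, the page is an inline string instead of a fetched URL, and `xml.etree.ElementTree` handles only well-formed markup and a limited XPath subset.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page standing in for a fetched URL.
SAMPLE = (
    "<html><body>"
    "<div class='stats'><span id='views'>Views: 1,234</span></div>"
    "</body></html>"
)


def extract_number(markup, xpath, pattern=r"-?\d+(?:\.\d+)?"):
    """Select a node with XPath, then pull the first number out with a regex."""
    root = ET.fromstring(markup)
    node = root.find(xpath)
    if node is None:
        return None
    # Join all text under the node and drop thousands separators.
    text = "".join(node.itertext()).replace(",", "")
    match = re.search(pattern, text)
    return float(match.group(0)) if match else None


# "Keep the history": append one (label, value) sample per update.
history = []
value = extract_number(SAMPLE, ".//span[@id='views']")
history.append(("sample-1", value))
print(history)
```

The XPath narrows the page to one element and the regex isolates the number inside it, so a small layout change around the element does not break the extraction.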
webNumbr.com: Export • Embed code • Graphs • Search & browse • RSS
Economics of Data Distribution Joint work with Ravi Kumar and Andrew Tomkins
Network Effect in Two-Sided Markets Two-sided market = every product serves consumers of two types, A and B Cross-side network effect: the more type-A users product X has, the more attractive it is for type-B consumers, and vice versa Examples: operating systems, credit cards, e-commerce marketplaces Reference: “Two-sided network effects: A theory of information product design”, G. Parker, M.W. Van Alstyne, N. Bulkley, M. Van Alstyne
Basic model • Distributors D1, … Dk • Producer/consumer joins only one distributor • Initial shares (p1,c1) … (pk,ck) • New consumer selects a distributor with a probability proportional to pi • New producer selects a distributor with probability proportional to ci
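The basic model can be simulated directly. A minimal sketch, assuming shares are tracked as integer counts and that each newcomer is a producer or a consumer with equal probability (the slide does not state the arrival mix); the function names are illustrative.

```python
import random


def simulate(producers, consumers, steps, seed=0):
    """Grow a two-sided market one newcomer at a time."""
    rng = random.Random(seed)
    p = list(producers)
    c = list(consumers)
    k = len(p)
    for _ in range(steps):
        if rng.random() < 0.5:  # assumption: both newcomer types equally likely
            # New consumer: picks distributor i with probability proportional to p_i.
            i = rng.choices(range(k), weights=p)[0]
            c[i] += 1
        else:
            # New producer: picks distributor i with probability proportional to c_i.
            i = rng.choices(range(k), weights=c)[0]
            p[i] += 1
    return p, c


def shares(counts):
    """Normalise counts to market shares."""
    total = sum(counts)
    return [x / total for x in counts]


final_p, final_c = simulate([5, 3, 2], [4, 4, 2], steps=10_000, seed=1)
print(shares(final_p), shares(final_c))
```

Running it with different seeds gives a rough feel for how much the final shares depend on early arrivals.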
Basic model [Diagram; recoverable labels: a1, a2, a3, a4, each appearing twice]
Market Shares Dynamics Theorem 1 Market shares will stabilize Theorem 2 With a superlinear preference rule, the market will tip toward one of the distributors Theorem 3 With a sublinear preference rule, market shares will flatten
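The contrast between Theorems 2 and 3 can be illustrated numerically. A minimal sketch, assuming the preference rule is "weight proportional to share raised to a power alpha" (alpha > 1 superlinear, alpha < 1 sublinear) and collapsing the two sides into one share vector that is updated in expectation; both simplifications are assumptions, not the talk's exact model.

```python
def mean_field(shares, alpha, steps=10_000):
    """Track expected shares when newcomers pick with weight share**alpha."""
    s = list(shares)
    for n in range(1, steps + 1):
        weights = [x ** alpha for x in s]
        total = sum(weights)
        # Expected share after one more newcomer joins a market of size n.
        s = [x + (w / total - x) / (n + 1) for x, w in zip(s, weights)]
    return s


start = [0.5, 0.3, 0.2]
tipped = mean_field(start, alpha=2.0)  # superlinear: the leader pulls ahead
flat = mean_field(start, alpha=0.5)    # sublinear: shares move together
print(tipped, flat)
```

With alpha = 1 the update leaves the expected shares unchanged, which is the boundary case between the two regimes.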
External Factor Preference rule with external factor: ei + ci/(c1+…+ck) Theorem 4 Market shares will stabilize at e1 : e2 : … : ek
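One way to see why the limit is e1 : e2 : … : ek: if a share s_i is stable, it must equal the probability of being picked, (e_i + s_i)/(E + 1) with E = e1 + … + ek, which gives s_i = e_i/E. The sketch below checks this numerically under stated assumptions: shares are updated in expectation rather than by random arrivals, and the same rule with p_i in place of c_i is assumed for the other side. The function name and parameters are illustrative.

```python
def stable_shares(e, p_share, c_share, steps=20_000, start_size=10):
    """Iterate expected shares under the rule e_i + (opposite-side share)."""
    total_e = sum(e)
    p, c = list(p_share), list(c_share)
    for n in range(start_size, start_size + steps):
        # Pick probabilities, normalised: sum of (e_i + share_i) is E + 1.
        pick_by_c = [(ei + ci) / (total_e + 1) for ei, ci in zip(e, c)]
        pick_by_p = [(ei + pi) / (total_e + 1) for ei, pi in zip(e, p)]
        # New producers follow consumer shares; new consumers follow producer shares.
        p = [pi + (q - pi) / (n + 1) for pi, q in zip(p, pick_by_c)]
        c = [ci + (q - ci) / (n + 1) for ci, q in zip(c, pick_by_p)]
    return p, c


e = [3.0, 2.0, 1.0]
p, c = stable_shares(e, [0.1, 0.1, 0.8], [0.8, 0.1, 0.1])
target = [x / sum(e) for x in e]
print(p, c, target)
```

The starting shares are deliberately far from the target to show that, in this sketch, the initial split washes out and only the external factors remain.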
Coalition Data Cloud
Coalitions Theorem 5 If all market shares are below 1/sqrt(k), a coalition (sharing data) is profitable for all distributors Corollary Coalitions are not monotone Example: 5 : 4 : 1 : 1
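The condition in Theorem 5 is simple to check for a given market. A minimal sketch that tests only the 1/sqrt(k) threshold and does not model profit; the function name is an assumption, and the 5 : 4 : 1 : 1 ratio is reused from the slide purely as input (the slide gives it as an example for the corollary, not for this check).

```python
from math import sqrt


def below_coalition_threshold(sizes):
    """Return True if every distributor's share is below 1/sqrt(k)."""
    k = len(sizes)
    total = sum(sizes)
    threshold = 1 / sqrt(k)
    return all(size / total < threshold for size in sizes)


print(below_coalition_threshold([5, 4, 1, 1]))
```

For 5 : 4 : 1 : 1 the largest share is 5/11, about 0.45, and the threshold for k = 4 is 0.5, so the condition holds.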
Model Variations • Same-side network effect • Different p-to-c and c-to-p rules • Multi-homing (overlapping audiences) • n^2 vs. n log n revenue models • Mature market: newcomer rate = departing rate • Diverse market (many types of producers and consumers) • Entering and departing distributors • Directed coalitions
Marketing • Data demand? • Data offerings? • Requirements for distribution technology?
Incentive design • Incentives for data sharing? • Centralized or distributed? • For profit or non-profit? • Data licensing and ownership? • Monetizing data cloud?
More Challenges Prototyping: • Data marketplace: open data & data demand • Search plugins: related objects, glossaries, object timelines • Publishing tools for structured data • Data client: structured news, bookmarking, notifications Tech design: • Access management • Namespace design User interface: • Structured search UI • Discovery UI
Thanks! Follow my research: http://twitter.com/yurylifshits http://yury.name/blog