Web Data Management Raghu Ramakrishnan
QUIQ Lessons
• Structured data management powers scalable collaboration environments
• ASP
• Multi-tenancy
• Massively distributed
• Fine-grained permissions, hierarchical ACLs
• RDBMSs were a lousy fit
Cloud Computing: Computing as a Service
• Packaged Software
• Cloud Computing
• CPU Intensive: High-throughput (e.g., Condor)
• Data Intensive: “Transactional” Storage & Serving (e.g., PNUTS, S3, SSDS, UDB); Analytic (e.g., SSDS, Hadoop)
Implications
• Data management as a service
• Scientists and others who’ve resisted (installing, maintaining, and) using DBMSs will find it much easier to reap the benefits
• “Data centers” and “computing centers” will come into vogue again
• Hosted back-ends and RAD tools will make Web application development accessible to all
• The Web is becoming open (e.g., OpenSocial, OpenID)
• Ideas will be the most valuable currency, not the wherewithal to build complex systems
• Paradigm shifts are possible in how we do research in many fields
• Build applications that embed your algorithms and test them directly in the field; computer scientists can interact directly with users (ironically, this would still be a breakthrough of sorts after four decades!)
• Many other disciplines (e.g., sociology, microeconomics) can design and conduct online experiments involving unprecedented numbers of participants
PNUTS: DB in the Cloud
CREATE TABLE Parts ( ID VARCHAR, StockNumber INT, Status VARCHAR … )
• Indexes and views
• Geographic replication
• Parallel database
• Structured, flexible schema
• Hosted, managed infrastructure
[Figure: the sample table below is drawn three times on the slide.]
A  42342  E
B  42521  W
C  66354  W
D  12352  E
E  75656  C
F  15677  E
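The slide's labels can be made concrete with a small sketch. Everything beyond the three declared columns is an assumption for illustration: that ID is the primary key, that "flexible schema" means declared columns may be absent from a record, that records are hash-partitioned within a region, and the region names and sample values themselves. It is not PNUTS's actual API.

```python
import hashlib
from typing import Dict, List, Optional

# Columns declared by the slide's CREATE TABLE (the slide elides the rest).
PARTS_COLUMNS = {"ID": str, "StockNumber": int, "Status": str}


def validate(record: dict) -> dict:
    """Type-check declared columns; any column except ID may be absent (assumed)."""
    if "ID" not in record:
        raise KeyError("ID is required (assumed to be the primary key)")
    for name, value in record.items():
        if name not in PARTS_COLUMNS:
            raise KeyError(f"unknown column: {name}")
        if not isinstance(value, PARTS_COLUMNS[name]):
            raise TypeError(f"{name} must be {PARTS_COLUMNS[name].__name__}")
    return record


def partition_for(key: str, n_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the same partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions


class Region:
    """One geographic copy of the table, split into partitions (hypothetical)."""

    def __init__(self, name: str, n_partitions: int = 4) -> None:
        self.name = name
        self.partitions: List[Dict[str, dict]] = [{} for _ in range(n_partitions)]

    def put(self, record: dict) -> None:
        p = partition_for(record["ID"], len(self.partitions))
        self.partitions[p][record["ID"]] = dict(record)  # store a private copy

    def get(self, key: str) -> Optional[dict]:
        p = partition_for(key, len(self.partitions))
        return self.partitions[p].get(key)
```

A usage sketch: validate a record once, then `put` it into each `Region` so that every region can serve reads locally. Real replication would be asynchronous; here the copies are written in a simple loop.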
Basic Consistency Model
Goal:
• Make it easier for applications to reason about updates and cope with asynchrony: an alternative to “transactions” in an asynchronous world
• What happens to a record with primary key “Brian”?
Guarantees:
• Every reader will always see some consistent, but possibly stale, version
• Readers can request a more up-to-date version, but may pay extra latency
• Special case: critical read (writer/readers see their own writes)
• Writers can verify that the record is still at the version they expect
[Figure: timeline for a single record along a time axis, showing the events record inserted, update, and delete; the versions v. 1 through v. 4; and Generation 1, Generation 2, and Generation 3.]
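The four guarantees above can be sketched as a toy per-record versioned store. This is a minimal illustration under stated assumptions, not the PNUTS implementation: the class and method names are hypothetical, a single in-process "master" copy orders all writes, and replicas only catch up when `propagate()` is called explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, Optional[dict]]  # (version, value); version 0 means "no record yet"


@dataclass
class Replica:
    """One copy of the table, mapping primary key to (version, value)."""
    rows: Dict[str, Row] = field(default_factory=dict)


class RecordStore:
    """Toy store: writes go to the master, replicas lag until propagate()."""

    def __init__(self, n_replicas: int = 3) -> None:
        self.master = Replica()
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending: List[Tuple[str, int, Optional[dict]]] = []  # not yet shipped

    def write(self, key: str, value: Optional[dict]) -> int:
        """Apply an update at the master and return the new version number."""
        version = self.master.rows.get(key, (0, None))[0] + 1
        self.master.rows[key] = (version, value)
        self.pending.append((key, version, value))
        return version

    def test_and_set_write(self, key: str, expected_version: int,
                           value: Optional[dict]) -> Optional[int]:
        """Write only if the record is still at the version the writer expects."""
        current = self.master.rows.get(key, (0, None))[0]
        if current != expected_version:
            return None  # someone else updated the record first
        return self.write(key, value)

    def propagate(self) -> None:
        """Ship pending updates to every replica, in the order they were made."""
        for key, version, value in self.pending:
            for replica in self.replicas:
                replica.rows[key] = (version, value)
        self.pending.clear()

    def read_any(self, key: str, replica: int = 0) -> Row:
        """Fast read from a replica: consistent, but possibly stale."""
        return self.replicas[replica].rows.get(key, (0, None))

    def read_critical(self, key: str, required_version: int, replica: int = 0) -> Row:
        """Return a version at least as new as required_version."""
        row = self.read_any(key, replica)
        if row[0] >= required_version:
            return row
        return self.master.rows.get(key, (0, None))  # slower path: go to the master

    def read_latest(self, key: str) -> Row:
        """Most up-to-date version, at the cost of always contacting the master."""
        return self.master.rows.get(key, (0, None))
```

For the record with primary key “Brian”: after `v = store.write("Brian", {...})`, a `read_any` may still return the older version until `propagate()` runs, while `read_critical("Brian", v)` returns the writer's own update, and a `test_and_set_write` with an outdated expected version returns `None` instead of overwriting.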
Lots of Issues to Re-think
• Massive distribution & replication
• Asynchrony
• Availability
• Consistency
• DBA to the world
• Auto-tuning
• Multi-tenancy
• Access control (granularity, online IDs)
• Encryption
• App-support
• Caching
Querying the Web
• Search will become more semantic, with best-effort match-making between query intent (NLP, query logs …) and interpreted web content
• The deep web has a lot of structured data
• How we get a handle on it is an interesting problem
• But this is only part of the problem … lots of data not here
• Semantic web isn’t working
• Site-wrapping doesn’t scale
• Solutions?
• Domain-wrapping
• Mass collaboration
• ??