The Claremont Report on Database Research

The Claremont Report on Database Research, SIGMOD 2008. In May 2008, prominent DB researchers, architects, users, and pundits met in Berkeley, CA at the Claremont Resort, the seventh such meeting in 20 years. The report is based on their discussion of new directions in DBs and describes a turning point in DB research.

elaine

Presentation Transcript


  1. The Claremont Report on Database Research SIGMOD 2008

  2. What is it? • In May 2008, prominent DB researchers, architects, users, and pundits met in Berkeley, CA at the Claremont Resort • Seventh such meeting in 20 years • Report based on discussion of new directions in DBs

  3. Turning point in DB Research • New opportunities for technical advances, impact on society, etc. 1. Big Data • Not only traditional enterprises, but also e-science, digital entertainment, natural language processing, social network analysis • Design new custom data management solutions from simpler components

  4. 2. Data analysis as profit center • Barriers between IT dept. and business units dropping • Data is the business • Data capture, integration, etc. keys to efficiency and profit • BI vendors - $10B (only front-end) • Also need better analytics, sophisticated analysis • non-technical decision makers want data

  5. 3. Ubiquity of structured and unstructured data • Structured data – extracted from text, SW logs, sensors and deep web crawl • Semi-structured – blogs, Web 2.0 communities, instant messaging • Publish and curate structured data • Develop techniques to extract useful data, enable deeper explorations, connect datasets

  6. 4. Expanded developer demands • Adoption of relational DBMSs and query languages has grown • MySQL, PostgreSQL, Ruby on Rails • Yet developers show less interest in SQL and view a DBMS as too much to learn relative to other open source components • Need new programming models for data management

  7. 5. Architectural shifts in computing • Computing substrates for DM are shifting • Macro: rise of cloud computing • Democratizes access to parallel clusters • Micro: shift from increasing chip clock speed to increasing the number of cores and threads • Changes in memory hierarchy • Power consumption • New DM technologies

  8. Research Opportunities • Impact of DB research has not evolved beyond traditional DBs • Reformation • Reform data centric ideas for new applications and architectures • Synthesis • Data integration, information extraction, data privacy • Some topics not mentioned, because still part of significant effort • Must continue with these efforts • Also must continue with • Uncertain data, data privacy and security, e-science, human-centric interactions, social networks, etc.

  9. DB Engines • Big-market relational DBs have well-known limitations • Peak performance: • OLTP with lots of small, concurrent transactions (debit/credit workloads) • OLAP with a few read-mostly, large join and aggregation workloads • Bad for: • Text indexing, serving web pages, media delivery

  10. DB engine technology could be useful in sciences and Web 2.0 applications, but not as bundled in current DB systems • Petabytes of storage and 1000s of processors, but current DBs cannot scale • Need schema evolution, versioning, etc. • Currently, many DB engine startup companies

  11. 1. Broaden range for multi-purpose DBs 2. Design special-purpose DBs • Topics in DB engine area: • Systems for clusters of many processors • Exploit remote RAM and Flash as persistent media • Treat query opt. and data layout as continuous tasks • Integrate data compression and encryption with data layout and optimization • Embrace non-relational DB models • Trade off consistency/availability for performance • Design power-aware DBMSs

  12. Declarative programming for emerging platforms • Programmer productivity is important • Non-experts must be able to write robust code • Data-centric programming techniques • MapReduce – language and data parallelism • Declarative languages – Datalog • Enterprise application programming – Ruby on Rails, LINQ
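
The MapReduce style named on this slide can be illustrated with a minimal single-process sketch. It is not taken from the report: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real framework would run the map and reduce calls in parallel across machines rather than in one loop.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # Each map call touches only its own document, which is what
    # would let a real system distribute the calls across nodes.
    mapped = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

The programmer writes only the per-record map and per-key reduce logic; grouping and distribution are left to the platform, which is the productivity argument the slide makes.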

  13. New challenges – programming across multiple machines • Data independence valuable, no assumptions about where data stored • XQuery for declarative programming? • Also need language design, efficient compilers, optimize code across parallel processors and vertical distribution of tiers • Need more expressive languages • Attractive syntax, development tools, etc • Data management – not only storage service, but programming paradigm

  14. Interplay of Structured and Unstructured Data • Data behind forms – Deep Web • Data items in HTML • Data in Web 2.0 services (photo, video sites) • Transition from traditional DBs to managing structured, semi-structured and unstructured data in enterprises and on the web • Challenge of managing dataspaces

  15. On the web • Vertical search engines • Domain independent technology for crawling • Within the enterprise • Discover relationships between structured and unstructured data

  16. Extract structure and meaning from un- and semi-structured data • Information extraction technology – pull entities and relationships from unstructured text • Need: apply and manage predictions from independent extractors • Algorithms to determine correctness of extraction • Join with IR and ML communities
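
One simple way to manage predictions from several independent extractors is a noisy-OR combination of their confidence scores. The sketch below is an illustration only, not a method from the report: it assumes each extractor reports a (fact, confidence) pair, that extractor errors are independent, and the name `combine_predictions` is hypothetical.

```python
from collections import defaultdict


def combine_predictions(predictions):
    """Merge (fact, confidence) pairs from independent extractors.

    Noisy-OR: a fact is wrong only if every extractor that reported
    it was wrong, so combined = 1 - product(1 - confidence_i).
    """
    miss_probability = defaultdict(lambda: 1.0)
    for fact, confidence in predictions:
        miss_probability[fact] *= 1.0 - confidence
    return {fact: 1.0 - miss for fact, miss in miss_probability.items()}
```

Two extractors that each report the same fact with confidence 0.5 yield a combined confidence of 0.75. The independence assumption is strong; correlated extractors would make this score overconfident.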

  17. Better DB technology needed to manage data in context • Discover implicit relationships, maintain context through storage and computation • Query and derive insight from heterogeneous data • Answer keyword queries over heterogeneous data sources • Analysis to extract semantics • Cannot assume semantic mappings exist or that the domain is known
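
Keyword queries over heterogeneous sources can be sketched with a single inverted index that covers both structured rows and free text. This is a toy illustration under stated assumptions: structured records are Python dicts, unstructured items are strings, and the names `tokens`, `build_index`, and `keyword_query` are hypothetical.

```python
from collections import defaultdict


def tokens(item):
    """Lowercased words of a text string or of a record's field values."""
    if isinstance(item, dict):
        text = " ".join(str(value) for value in item.values())
    else:
        text = str(item)
    return {t.strip(".,;:!?").lower() for t in text.split()} - {""}


def build_index(sources):
    """Map each token to the (source name, position) pairs containing it."""
    index = defaultdict(set)
    for source_name, items in sources.items():
        for position, item in enumerate(items):
            for token in tokens(item):
                index[token].add((source_name, position))
    return index


def keyword_query(index, query):
    """Return the items that contain every keyword in the query."""
    hits = [index.get(word.lower(), set()) for word in query.split()]
    return set.intersection(*hits) if hits else set()
```

No schema mapping between the sources is needed to answer the query, which matches the slide's point that semantic mappings cannot be assumed; ranking and relationship discovery are left out of this sketch.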

  18. Develop algorithms to provide best-effort services on loosely integrated data • Pay as you go as semantic relationships are discovered • Develop index structures to support querying hybrid data • New notions of correctness and consistency

  19. Innovate on creating data collections • Ad-hoc communities to collaborate • Schema will be dynamic • Consensus to guide users • Need easy-to-use visualization tools for creating data • Data created with such tools may be easier to extract info from

  20. Cloud Data Services • Infrastructures providing software and computing facilities as a service • Efficient for applications • Limit up-front capital expenses • Reduce cost of ownership over time • Services hosted in a data center • Shared commodity hardware for computation and storage

  21. Cloud services available today • Application services (salesforce.com) • Storage services (Amazon S3) • Compute services (Google App Engine, Amazon EC2) • Data services (Amazon SimpleDB, SQL Server Data Services, Google’s Datastore)

  22. Cloud data services offer APIs more restricted than traditional DBs • Minimalist query languages, limited consistency • More predictable services • Would be difficult if a full-function SQL data service had to be provided • Manageability important in cloud environments • Limited human intervention • High workloads • Variety of shared infrastructures

  23. No DBA or system admin • Management done automatically by the platform • Large variations in workloads • Economical to use more resources for short bursts • Service tuning depends upon virtualization • HW virtual machines as programming interface (EC2) • Multi-tenant hosting of many independent schemas in a single managed DBMS (salesforce.com)
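
Multi-tenant hosting in a single managed DBMS can be sketched with a shared table keyed by tenant. This is one common approach, shown here as an assumption; the slide does not say how any particular service implements it. The table name `accounts`, the `tenant_id` column, and the helper functions are hypothetical.

```python
import sqlite3


def open_shared_db():
    """Create one in-memory database shared by all tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "tenant_id TEXT NOT NULL, name TEXT NOT NULL, balance REAL NOT NULL)"
    )
    # Index on tenant_id so per-tenant queries do not scan other tenants' rows.
    conn.execute("CREATE INDEX idx_accounts_tenant ON accounts (tenant_id)")
    return conn


def add_account(conn, tenant_id, name, balance):
    conn.execute(
        "INSERT INTO accounts VALUES (?, ?, ?)", (tenant_id, name, balance)
    )


def accounts_for(conn, tenant_id):
    """Every query is scoped to one tenant, keeping tenants isolated."""
    rows = conn.execute(
        "SELECT name, balance FROM accounts WHERE tenant_id = ? ORDER BY name",
        (tenant_id,),
    )
    return rows.fetchall()
```

The platform manages one database instead of one per customer, which is where the manageability gain comes from; the cost is that isolation depends on every query carrying the tenant filter.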

  24. Need for manageability • Adaptive online techniques • New architectures and APIs • Depart from SQL and transactional semantics where possible • SQL DBs cannot scale to thousands of nodes • Different transactional implementation techniques or different storage semantics?

  25. Query processing and optimization • Cannot exhaustively search the plan space with 1000s of sites • More work needed to understand scaling realities • Data security and privacy • No longer physical boundaries of machines or networks
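
A rough count shows why exhaustive plan search breaks down at this scale. The model below is a deliberate simplification and an assumption, not the report's analysis: it counts only left-deep join orders (n! for n relations) and lets each of the n - 1 joins run on any one site. The function names are hypothetical.

```python
from math import factorial


def left_deep_orders(num_relations):
    """Number of left-deep join orders over the given relations."""
    return factorial(num_relations)


def placement_choices(num_relations, num_sites):
    """Ways to assign each of the n - 1 joins to one site."""
    return num_sites ** (num_relations - 1)


def plan_space(num_relations, num_sites):
    """Join orders times site placements under this simplified model."""
    return left_deep_orders(num_relations) * placement_choices(
        num_relations, num_sites
    )
```

Under this model a 5-way join on a single site has 120 plans, while the same join over 1000 sites has 120 × 1000^4 = 1.2 × 10^14, so an optimizer has to prune or sample rather than enumerate.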

  26. New scenarios • Specialized services with pre-loaded data sets (stock prices, weather) • Combine data from private and public domains • Reaching across clouds (scientific grids) • Federated cloud architectures

  27. Mobile applications and virtual worlds • Manage massive amounts of diverse user-created data, synthesize intelligently and provide real-time services • Mobile space • Large user bases • Emergence of mobile search and social networks • Timely information to users depending on location, preferences, social circles, extraneous factors, and the context in which they operate • Synthesize user input and behavior to determine location and intent

  28. Virtual worlds – Second Life • Began as simulations for multiple users • Blur distinction with real world • Co-space, for both virtual and physical worlds • Events in physical world captured by sensors, materialized in virtual world • Events in virtual world can affect physical world • Need to process heterogeneous data streams • Balance privacy against sharing personal real-time info • Virtual actors require large-scale parallel programs • Efficient storage, data processing, power sensitive

  29. Moving Forward • DB research community doubled in size over the last decade • Increasing technical scope makes it difficult to keep track of the field • Review load for papers growing • Quality of reviews decreasing over time • Need more technical books, blogs, wikis • Open source software development in DB • Competition: system components for cloud computing • Large-scale information extraction
