1 / 37

The Future of Data Management or The Structure of (Computer) Scientific Revolutions

The Future of Data Management or The Structure of (Computer) Scientific Revolutions. Michael Franklin UC Berkeley & Amalgamated Insight, Inc. EECS BEARS Conference February 2007. Semi-Structured (schema-later). Unstructured (schema-never). Structured (schema-first).

Download Presentation

The Future of Data Management or The Structure of (Computer) Scientific Revolutions

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Future of Data ManagementorThe Structure of (Computer) Scientific Revolutions Michael Franklin UC Berkeley & Amalgamated Insight, Inc. EECS BEARS Conference February 2007

  2. Semi-Structured (schema-later) Unstructured (schema-never) Structured (schema-first) Relational Database XML Plain Text Tagged Text/Media Media Formatted Messages The Structure Spectrum Michael Franklin EECS BEARS Conference - February 2007

  3. Structured Data Management Michael Franklin EECS BEARS Conference - February 2007

  4. A “Modern” View of Data Management Michael Franklin EECS BEARS Conference - February 2007

  5. Whither Structured Data? • Conventional Wisdom: only 20% of data is structured. • Decreasing due to: • Consumer applications • Enterprise search • Media applications Michael Franklin EECS BEARS Conference - February 2007

  6. Structured Data Management Two reasons why this is where the future is: • The Data Integration quagmire: The perennial IT problem. Structure provides crucial cues. Michael Franklin EECS BEARS Conference - February 2007

  7. Structured Data Management Two reasons why this is where the future is: • The Data Integration quagmire: The perennial IT problem. Structure provides crucial cues. • The “Data Industrial Revolution*”: Data used to be hand-crafted, now it’s machine-generated! * Credit to Prof. Joe Hellerstein for this analogy. Michael Franklin EECS BEARS Conference - February 2007

  8. Mediated Schema Semantic mappings wrapper wrapper wrapper wrapper wrapper Courtesy of Alon Halevy Reason 1: Data Integration • The ultimate schema-first problem. • In the future, required for all applications. • Structure is both an enabler and a key impediment. Michael Franklin EECS BEARS Conference - February 2007

  9. Why Structure? What if you wanted to find out which actors donated to John Kerry’s 2004 presidential campaign… Michael Franklin EECS BEARS Conference - February 2007

  10. Why Structure? Michael Franklin EECS BEARS Conference - February 2007

  11. Why Structure? What if you wanted to find out which actors donated to John Kerry’s 2004 presidential campaign… Michael Franklin EECS BEARS Conference - February 2007

  12. Why Structure? • Text “Search” can return only what’s been previously “stored”. Michael Franklin EECS BEARS Conference - February 2007

  13. What if you wanted to… • find out the average donation of actors to each candidate? • compare actor donations this campaign to the last one? • find out who gave the most to each candidate? • organize the information by source or age? Michael Franklin EECS BEARS Conference - February 2007

  14. A“Deep-Web” Query Approach SELECT y.name,f.occupation,… FROM Yahoo_Actors y, FECInfo f WHERE y.name = f.name Michael Franklin EECS BEARS Conference - February 2007

  15. Did it Work? Michael Franklin EECS BEARS Conference - February 2007

  16. What’s Missing? • Common Schema • Any Schema • Strong Identifiers (keys) • Data Independence • Metadata • Consistency Guarantees • Access Control Michael Franklin EECS BEARS Conference - February 2007

  17. Functionality Time (and cost) The Fundamental Tradeoff Structure enables computers to help users manipulate and maintain the data. Semi-Structured (schema-later) Structured (schema-first) Unstructured (schema-less) Michael Franklin EECS BEARS Conference - February 2007

  18. “Flexible” Structure: Dataspaces* • Deal with all the data from an enterprise – in whatever form • Data co-existence no integrated schema, no single warehouse • Pay-as-you-go services • Keyword search is bare minimum. • Data manipulation and increased consistency as you add work. * “From Databases to Dataspaces: A New Abstraction for Information Management”, Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005. Michael Franklin EECS BEARS Conference - February 2007

  19. Data Coexistence Autonomous Sources Search, Browse, Approximate Answer Structured Query Best Effort Guarantees Single Schema Centralized Administration Structured Query Strict Integrity Constraints Databases vs. Dataspaces Michael Franklin EECS BEARS Conference - February 2007

  20. The World of Dataspaces Web Search Far Virtual Organization Administrative Proximity Federated DBMS Near Desktop Search DBMS High Low Semantic Integration Michael Franklin EECS BEARS Conference - February 2007

  21. DataSpace Technology • Probabilistic Databases • Schema Matching • Judicious use of User Input • Approx. Query Answering • Probabilistic Reasoning • Uncertainty Management • Data Model Learning • Structured & Unstructured Search Michael Franklin EECS BEARS Conference - February 2007

  22. Reason 2: Data Industrial Revolution Bell’s Law:Every decade, a new, lower cost, class of computers emerges, defined by platform, interface, and interconnect • Mainframes 1960s • Minicomputers 1970s • Microcomputers/PCs 1980s • Web-based computing 1990s • Devices (Cell phones, PDAs, wireless sensors, RFID) 2000’s Enabling a new generation of applications for Operational Visibility, monitoring, and alerting. Michael Franklin EECS BEARS Conference - February 2007

  23. Data Streams  Data Flood PoS System Barcodes Phones Sensors RFID • Exponential data growth • New challenges: continuous, inter-connected, distributed, physical • Shrinking business cycles • More complex decisions Inventory Transactional Systems Telematics Clickstream Michael Franklin EECS BEARS Conference - February 2007

  24. Device Data Management • Devices generate streams of structured data. • Wide-spread deployment will lead to huge data volumes. • Can we develop the right infrastructure to support large-scale data streaming apps? • Can we incorporate devices into existing (legacy) IT infrastructure? Michael Franklin EECS BEARS Conference - February 2007

  25. High Fan In Systems* • A data management infrastructure for large-scale data streaming environments. • UniformDeclarative Framework • Every node is a SQL data stream processor stream-oriented queries at all levels • Hierarchical, stream-based views as an organizing principle. • Can impose a “view” over messy devices. *Design Considerations for High Fan In Systems - The HiFi Approach; CIDR 2005 Michael Franklin EECS BEARS Conference - February 2007

  26. HiFi - Taming the Data Flood Hierarchical Aggregation • Spatial • Temporal Headquarters Regional Centers In-network Stream Query Processing and Storage Warehouses, Stores Fast Data Path vs. Slow Data Path Dock doors, Shelves Receptors Michael Franklin EECS BEARS Conference - February 2007

  27. “Virtual Device (VICE) API” VICE: Virtual Device Interface[Jeffery et al., Pervasive 2006, VLDBJ 07] Vice API is a natural place to hide much of the complexity arising from physical devices. Michael Franklin EECS BEARS Conference - February 2007

  28. Device Issues: example Shelf RIFD Test - Ground Truth Michael Franklin EECS BEARS Conference - February 2007

  29. Actual RFID Readings “Restock every time inventory goes below 5” Michael Franklin EECS BEARS Conference - February 2007

  30. Query-based Data Cleaning Smooth CREATE VIEW smoothed_rfid_stream AS (SELECT receptor_id, tag_id FROM cleaned_rfid_stream [range by ’5 sec’, slide by ’5 sec’] GROUP BY receptor_id, tag_id HAVING count(*) >= count_T) Point Michael Franklin EECS BEARS Conference - February 2007

  31. Query-based Data Cleaning Arbitrate CREATE VIEW arbitrated_rfid_stream AS (SELECT receptor_id, tag_id FROM smoothed_rfid_stream rs [range by ’5 sec’, slide by ’5 sec’] GROUP BY receptor_id, tag_id HAVING count(*) >= ALL (SELECT count(*) FROM smoothed_rfid_stream [range by ’5 sec’, slide by ’5 sec’] WHERE tag_id = rs.tag_id GROUP BY receptor_id)) Smooth Point Michael Franklin EECS BEARS Conference - February 2007

  32. After Query-based Cleaning “Restock every time inventory goes below 5” Michael Franklin EECS BEARS Conference - February 2007

  33. SQL Abstraction Makes it Easy • “Soft Sensors” • Quality and lineage • Optimization (power, etc.) • Pushdown of external validation information • Data archiving • Imperative processing • … Michael Franklin EECS BEARS Conference - February 2007

  34. “Operational” BI/BAM Centralized Distributed Predictive Analytics Data Analytics Products Data Mining Next-Generation Business Intelligence Event-Driven Query-Driven Analysis Appliance Accelerators In-Memory Reporting Data Warehouse Database/Data Warehouse Products RDBMS Amalgamated Insight: The Company Complexity Performance Michael Franklin EECS BEARS Conference - February 2007

  35. Integrated Event Handling and Alerting Interfaces to Operational Systems Drill Down, Replay, Reports Stream Query Processing is the Key “What’s happening now?” “Tell me when something happens.” Visibility Intelligent Action Notification “Automatically react when things happen.” “Why is it happening and how to improve it?” Learning Michael Franklin EECS BEARS Conference - February 2007

  36. Technology Company Overview • Founded November 2005 • Headquarters in Foster City, CA • Series A Financing: May 2006 • 10 Employees (and growing!) • Breakthrough technology for stream query processing • Proven software base – leveraging open source platform • Used in demanding high-volume networked applications • Boyd Pearce, President and CEO • Michael Franklin, Ph.D., CTO • Michael Trigg, EVP, Marketing • Sailesh Krishnamurthy, Ph.D., Chief Architect • Robert Krauss, VP, Business Development Key Team Members Michael Franklin EECS BEARS Conference - February 2007

  37. Conclusions • Structured data increasingly important. • In fact, there will be lots more of it. • and it must be processed as fast as it is created. • Traditional (structured) database technology is not up to the task. • Great opportunities for innovation. • HiFi, Dataspaces (and Amalgamated Insight!) are examples. http://www.cs.berkeley.edu/~franklin Michael Franklin EECS BEARS Conference - February 2007

More Related