Data Hub: A Collaborative Data Analytics and Visualization Platform 
Presentation Transcript


  1. Data Hub: A Collaborative Data Analytics and Visualization Platform 
 Sam Madden madden@csail.mit.edu With a cast of many….

  2. BIG Data

  3. Example: Medical Costs
 MGH Cancer Center “Super-Database” (Dr. James Michaelson, PhD, MGH, Harvard Medical School)
 • Largest cancer database in the world (173,301 patients)
 • Based on the national tumor registry
 • Cross-linked with the death registry
 • Includes billing, reports, labs, imagery, genome SNPs
 Question: What are the factors driving costs for lung cancer patients?
 Some results:
 • No correlation of cost with stage of presentation or survival
 • Strong correlation of cost with oncologist!

  4. Challenge: Making Data Accessible
 Beyond scalable platforms (super-duper indexes, main-memory DBs, column-oriented DBs, MapReduce):
 • What does the data look like?
 • How do I correlate it with other data sets?
 • How do I present it to users/execs?
 • Where are these anomalies and outliers coming from?

  5. Challenge: Making Data Accessible
 Introducing Datahub: DB technology + GitHub-style collaboration (Octocat, the GitHub mascot)

  6. Introducing Datahub
 • A data commons
 • Secure, hosted data storage (“database service”)
 • Selective sharing and access control
 • Easy to find, combine, and clean data sets
 • Ability to browse, visualize, and query data in situ

  7. Lots of other places to find data!
 But those are often just a bunch of zip files, versus open, linked data (Tim Berners-Lee’s taxonomy):
 ★ make your stuff available on the Web under an open license
 ★★ make it available as structured data
 ★★★ use non-proprietary formats (e.g., CSV instead of Excel)
 ★★★★ use URIs to denote things, so that people can point at your stuff
 ★★★★★ link your data to other data to provide context
 Datahub: a “five-star” integrated, browsable, and queryable repository of linked data.

  8. Datahub Interface (Anant Bhardwaj)

  9. Datahub Interface

  10. Datahub Interface

  11. “Wrangling” Features
 Wrangler: Interactive Visual Specification of Data Transformation Scripts (Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer)
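To make the wrangling idea concrete, here is a minimal sketch of the kind of transformation script a Wrangler-style tool records from a user's interactive edits, written with pandas. The column names (location, year, monthly value columns) are hypothetical and not from the slides; this is an illustration of the pattern, not Wrangler's actual output format.

```python
# Hypothetical transform script of the kind Wrangler-style tools generate:
# each step mirrors one interactive edit (split, fill, unpivot) the user demonstrated.
import pandas as pd

def wrangle(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    # Split a combined "city, state" column into two columns.
    df[["city", "state"]] = df["location"].str.split(",", n=1, expand=True)
    df["state"] = df["state"].str.strip()
    df = df.drop(columns=["location"])
    # Fill down a sparsely populated header column (common in report-style spreadsheets).
    df["year"] = df["year"].ffill()
    # Unpivot the remaining monthly columns into (month, value) rows for easier querying.
    tidy = df.melt(id_vars=["city", "state", "year"],
                   var_name="month", value_name="value")
    return tidy.dropna(subset=["value"])

# Usage: tidy = wrangle(pd.read_csv("report.csv"))
```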

  12. Post-Wrangling

  13. More Datahub Interface Versions: Browsing and Visualization

  14. MIT Living Lab: A Dogfood-Eating Exercise
 Goal: allow the MIT community to access, selectively share, and use data about itself, using DataHub.

  15. MIT Living Lab
 Goal: allow the MIT community to access, selectively share, and use data about itself, using DataHub. The MIT Data Hub combines organizational, personal, and public data:
 • Personal data: location/GPS, calendar, video/pictures, exercise/physio data, application usage, meetings…
 • MIT data: ID card swipes, network packets, expense reports, medical data, payroll, parking garages, buses and cars, course catalogs, registrar, benefits, on-campus events/seminars
 • Infrastructure: energy, HVAC, maintenance, etc.
 • Academic/research: publications, presentations, research data…
 • Public data (relevant linked data): local transit/transport data, crime data, nearby restaurants, events, etc.

  16. What Will Data Hub Enable at MIT?
 • Campus “quantification”: is going to class correlated with better grades? Which dining facilities are most popular amongst different groups?
 • Transportation planning: bus utilization and on-demand routing, parking lot utilization, carpool finding, etc.
 • Health + medical: campus-wide public health (e.g., flu tracking); observing who is missing class or depressed; health signals such as exercise and eating habits, partners; outpatient care
 • Research: expert finding; data sharing between groups

  17. Challenges: It’s Not All Fuzzy Stuff
 (We also don’t want our research to be like this guy…)
 Projects: Monomi, MapD, Scorpion
 Platform challenges:
 • How to efficiently store thousands or millions of databases?
 • How to anonymize data, control access, etc.?
 • How to keep data private while still allowing querying over it?
 Challenges in improving interaction with databases:
 • Data cleaning and integration
 • Interactive data presentation
 • Understanding why results are the way they are
 • How to leverage experts in an organization

  18. Private Data Problem
 • Confidential data leaks: in 2012, hackers extracted 6.5 million hashed passwords from LinkedIn’s database
 • Threat: passive DB server attacks. Users issue SQL through an application to a DB server (e.g., Datahub) holding sensitive content; system administrators and hackers with server access can read that content.

  19. How to protect data confidentiality?
 • Encrypt the data? Then the server may not be able to process queries!
 • Instead: compute on encrypted data, without giving the server the encryption key
 • This general approach has been proposed several times…

  20. Monomi / CryptDB
 Threat 1: passive DB server attacks
 • Process SQL queries on encrypted data
 • Hide the DB from sysadmins; outsource the DB to the cloud
 • Modest overhead
 • No changes to the DBMS (e.g., Postgres, MySQL) and no changes to applications
 With Raluca Popa, Stephen Tu, Hari Balakrishnan, Frans Kaashoek, Nickolai Zeldovich

  21. SQL Queries on Encrypted Data: Equality Example
 The application issues SELECT * FROM emp WHERE salary = 100. The trusted proxy deterministically encrypts the literal and rewrites the query as SELECT * FROM table1 WHERE col3 = x5a8c34 (emp is stored as table1 with encrypted columns col1/rank, col2/name, col3/salary). Because deterministic encryption maps equal plaintexts to equal ciphertexts, the server can find the matching rows without the key; other columns use randomized encryption.
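A minimal sketch of the equality case: a keyed HMAC stands in for CryptDB's deterministic encryption (the real scheme is a reversible deterministic cipher so the proxy can also decrypt returned values), and the proxy rewrites the literal so the server can match ciphertexts without ever holding the key. The key and helper names are illustrative, not CryptDB's actual interfaces.

```python
# Minimal sketch of proxy-side query rewriting for equality predicates.
# A keyed HMAC stands in for deterministic encryption: equal plaintexts map
# to equal tokens, so the server can match them without the key.
import hmac, hashlib

KEY = b"proxy-held secret key"          # never leaves the trusted proxy

def det_token(value: str) -> str:
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def rewrite_equality(table: str, column: str, literal: str) -> str:
    # emp.salary is stored server-side as table1.col3 holding det_token(salary)
    return f"SELECT * FROM {table} WHERE {column} = '{det_token(literal)}'"

print(rewrite_equality("table1", "col3", "100"))
# -> SELECT * FROM table1 WHERE col3 = '<deterministic token for 100>'
```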

  22. SQL Queries on Encrypted Data: Range Example (OPE)
 The application issues SELECT * FROM emp WHERE salary >= 100. The proxy encrypts the literal with order-preserving encryption (OPE) and rewrites the query as SELECT * FROM table1 WHERE col3 >= x638e54. Because OPE preserves plaintext order in the ciphertexts, the server can evaluate the range predicate directly on encrypted values (salaries 60, 100, 800 encrypt to ciphertexts in the same order).
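A toy sketch of the order-preserving idea, assuming a small known integer domain: a keyed, strictly increasing mapping lets the server evaluate range predicates on ciphertexts. Real OPE schemes are far more careful about construction and leakage; this only illustrates the property that x < y implies enc(x) < enc(y).

```python
# Toy order-preserving encoding: a seeded, strictly increasing mapping over a
# known integer domain, so ciphertext order equals plaintext order and the
# server can evaluate "col3 >= enc(100)" without the key.
import random

def build_ope_table(domain_max: int, seed: int = 1234) -> dict:
    rng = random.Random(seed)                 # the seed plays the role of the key
    code, table = 0, {}
    for v in range(domain_max + 1):
        code += rng.randint(1, 1000)          # strictly increasing steps
        table[v] = code
    return table

enc = build_ope_table(1000)
assert enc[60] < enc[100] < enc[800]          # order is preserved
# Proxy rewrite: salary >= 100  ->  col3 >= enc[100]
print(f"SELECT * FROM table1 WHERE col3 >= {enc[100]}")
```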

  23. Monomi: Protecting Data in Datahub
 • Extensions to CryptDB to efficiently support OLAP queries
 • Show how to run all of TPC-H, rather than just 4 of 22 queries
 • Key insight: split queries: run as much as possible on the untrusted DBMS, compute the remainder on the trusted client
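A rough sketch of the split-execution idea under simplifying assumptions: only the part the encryption can support (here, a deterministic-equality filter) is pushed to the untrusted server, while the trusted client decrypts the returned rows and evaluates the remaining predicate and the final aggregate. The `server` and `decrypt` callables are placeholders, not Monomi's actual planner or interfaces.

```python
# Illustrative sketch of Monomi-style split execution (not the real planner):
# push the server-computable part of the query to the untrusted DBMS, then
# decrypt on the trusted client and finish the rest there.

def run_split_query(server, decrypt):
    # 1. Server side: only the deterministic-equality filter is computable
    #    on ciphertexts, so only that predicate is pushed down.
    encrypted_rows = server.execute(
        "SELECT col2, col3 FROM table1 WHERE col1 = 'x5a8c34'"
    )
    # 2. Client side: decrypt and apply the remainder of the query
    #    (an arithmetic predicate plus the final aggregate).
    rows = [(decrypt(name), decrypt(salary)) for name, salary in encrypted_rows]
    kept = [salary for _, salary in rows if salary * 1.1 > 100]   # residual predicate
    return sum(kept) / len(kept) if kept else None                # final AVG on client
```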

  24. Monomi vs. Plaintext: TPC-H SF10 on Postgres
 Takeaway: median runtime overhead of 1.24x versus plaintext.
 See Stephen explain how it really works right after this talk!

  25. Many Open Problems
 • Understanding performance more broadly
 • How to reason about the security of non-randomized schemes?
 • Auditing, information flow, etc.

  26. DataHub Research Challenges
 Projects: Monomi, MapD, Scorpion
 Platform challenges:
 • How to efficiently store thousands or millions of databases?
 • How to anonymize data, control access, etc.?
 • How to keep data private while still allowing querying over it?
 Challenges in improving interaction with databases:
 • Data cleaning and integration
 • Interactive data presentation
 • Understanding why results are the way they are
 • How to leverage experts in an organization

  27. Interactive Large-Scale Visualization Using a GPU Database

  28. The Need for Interactive Analytics
 • DataHub needs to support browsing massive data sets
 • Browsing is best supported through visualization, which requires ad-hoc analytics with millisecond response times

  29. MapD: GPU-Accelerated SQL Database
 • Key insight: GPUs have enough memory that a cluster of them can store substantial amounts of data
 • Not an accelerator, but a full-blown query processor!
 • Massive parallelism enables interactive browsing interfaces
 • 4x GPUs can provide >1 TB/sec of bandwidth and 12 Tflops of compute
 • Order-of-magnitude speedups over CPUs when data is resident on the GPU
 • “Shared nothing” arrangement (each GPU owns its partition of the data; see the sketch below)
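A CPU-side sketch of the shared-nothing pattern, with NumPy arrays standing in for GPU-resident column buffers: each device owns a horizontal partition of a column, computes a partial aggregate independently, and a coordinator merges the partials. This illustrates the data layout only; MapD's actual scans run as GPU kernels.

```python
# Shared-nothing sketch: each "device" owns one horizontal partition of a
# column, scans it independently, and the coordinator merges partial results.
import numpy as np

NUM_DEVICES = 4
salary = np.random.default_rng(0).integers(20_000, 200_000, size=1_000_000)

# Partition the column across devices.
partitions = np.array_split(salary, NUM_DEVICES)

# Each device computes a partial (count, sum) for:
#   SELECT AVG(salary) FROM emp WHERE salary > 100000
partials = [((p > 100_000).sum(), p[p > 100_000].sum()) for p in partitions]

# Coordinator merges the partials into the final answer.
total_count = sum(c for c, _ in partials)
total_sum = sum(s for _, s in partials)
print("AVG =", total_sum / total_count)
```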

  30. Demo

  31. Next Steps
 • Scale out to many nodes; automate layout algorithms
 • Add various advanced analytics (e.g., machine learning algorithms)
 • Generalize visualization beyond maps

  32. DataHub Research Challenges
 Projects: Monomi, MapD, Scorpion
 Platform challenges:
 • How to efficiently store thousands or millions of databases?
 • How to anonymize data, control access, etc.?
 • How to keep data private while still allowing querying over it?
 Challenges in improving interaction with databases:
 • Data cleaning and integration
 • Interactive data presentation
 • Understanding why results are the way they are
 • How to leverage experts in an organization

  33. Visual Provenance: Scorpion (Eugene Wu)
 • Visualization is the most common form of big-data analysis
 • Common problem: outliers
 • It would be nice to have a tool that identifies why outliers exist

  34. Definition of “Why”
 Setup: input data i produces an output visualization; p denotes a predicate over the inputs; one output group is an outlier.
 Given an outlier group, find a predicate over the inputs that makes the output no longer an outlier.

  35. Definition of “Why” (continued)
 Given an outlier group, find a predicate over the inputs that makes the output no longer an outlier.

  36. Definition of “Why”: Example
 The US group is an outlier because of the records {Bill Gates, Steve Ballmer}; removing the rows matched by the predicate makes the US no longer an outlier. What are the common properties of those records? p: Company = MSFT.
 Given an outlier group, find a predicate over the inputs that makes the output no longer an outlier.

  37. Why is this hard?
 • Exponential search space over records and attributes
 • In general, each candidate predicate requires re-running the aggregation

  38. Why is this hard? AVG(rows) = 2.7
 • Exponential search space over records and attributes
 • In general, each candidate predicate requires re-running the aggregation

  39. Why is this hard? AVG(rows) = 2.9
 • Exponential search space over records and attributes
 • In general, each candidate predicate requires re-running the aggregation

  40. Why is this hard? AVG(rows) = 2.2
 • Exponential search space over records and attributes
 • In general, each candidate predicate requires re-running the aggregation

  41. Why is this hard? AVG(rows) = 3.3
 • Exponential search space over records and attributes
 • In general, each candidate predicate requires re-running the aggregation

  42. Why is this hard? AVG(rows) = 3.1 …
 • Exponential search space over records and attributes
 • In general, each candidate predicate requires re-running the aggregation
 • Desire for simple, understandable predicates and a general-purpose visualization framework
 See Eugene explain how it really works this afternoon!
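The naive baseline these slides describe can be sketched as follows: enumerate simple attribute = value predicates, remove the matching input rows, re-run the aggregate, and keep predicates that pull the outlier group back toward the expected value. This is not Scorpion's actual algorithm (whose point is to avoid exhaustive re-aggregation); the rows and the expected value below are made up to mirror the MSFT example.

```python
# Naive predicate search: try each attribute=value predicate, drop the
# matching rows, recompute the aggregate, and keep predicates that bring
# the outlier group back in line with the expected value.
from statistics import mean

rows = [  # (country, company, net_worth) feeding AVG(net_worth) GROUP BY country
    {"country": "US", "company": "MSFT", "net_worth": 80.0},   # Bill Gates
    {"country": "US", "company": "MSFT", "net_worth": 20.0},   # Steve Ballmer
    {"country": "US", "company": "Acme", "net_worth": 0.1},
    {"country": "US", "company": "Acme", "net_worth": 0.2},
]
expected = 0.15   # roughly what the non-outlier groups report

def explain_outlier(rows, expected, tol=1.0):
    candidates = []
    for attr in ("company",):                       # searchable attributes
        for val in {r[attr] for r in rows}:
            kept = [r["net_worth"] for r in rows if r[attr] != val]
            if kept and abs(mean(kept) - expected) < tol:
                candidates.append((attr, val))      # this predicate fixes the outlier
    return candidates

print(explain_outlier(rows, expected))   # -> [('company', 'MSFT')]
```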

  43. Next Steps
 • A general-purpose visualization language for expressing visualizations with provenance support (references to the underlying data set)

  44. Conclusion
 • Big Data is a cry for help from non-DB people
 • Lots of exciting work on scalable systems
 • The DB community should be doing a much better job of helping users use data; we risk losing mindshare
 • Datahub aims to make data easy to find, visualize, and query, securely and efficiently
 • Many fascinating, hard problems! (Monomi, MapD, Scorpion)
