
Enabling data management in a big data world


Presentation Transcript


  1. Enabling data management in a big data world (Craig Soules, Garth Goodson, Tanya Shastri)

  2. The problem with data management
  • Hadoop is a collection of tools
  • Not tightly integrated
  • Everyone’s stack looks a little different
  • Everything falls back to files

  3. Agenda
  • Traditional data management
  • Hadoop’s eco-system
  • Natero’s approach to data management

  4. What is data management?
  • What do you have?
    • What data sets exist?
    • Where are they stored?
    • What properties do they have?
  • Are you doing the right thing with it?
    • Who can access data?
    • Who has accessed data?
    • What did they do with it?
    • What rules apply to this data?
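The questions on this slide are exactly what a metadata catalog answers. A minimal sketch follows; the DataSet and Catalog classes and their fields are hypothetical illustrations, not anything from the talk:

```python
# Hypothetical sketch of a catalog that can answer the questions above.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DataSet:
    name: str                    # referenced by name, not location
    location: str                # where is it stored? (e.g., an HDFS path)
    schema: dict                 # what properties does it have?
    rules: list = field(default_factory=list)  # what rules apply to it?

class Catalog:
    def __init__(self):
        self.datasets = {}       # what data sets exist?
        self.audit_log = []      # who accessed what, when, and how?

    def register(self, ds: DataSet):
        self.datasets[ds.name] = ds

    def record_access(self, user: str, name: str, action: str):
        self.audit_log.append((datetime.utcnow(), user, name, action))

catalog = Catalog()
catalog.register(DataSet("clickstream", "hdfs://warehouse/clicks",
                         {"user_id": "bigint", "url": "string"}))
catalog.record_access("alice", "clickstream", "read")
```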

  5. Traditional data management
  [Diagram: External Data Sources feed an Extract-Transform-Load pipeline into the Data Warehouse (SQL data processing over integrated storage), which serves Users]

  6. Key lessons of traditional systems
  • Data requires the right abstraction
    • Schemas have value
    • Tables are easy to reason about
    • Referenced by name, not location (see the sketch below)
  • Narrow interface
    • SQL defines the data sources and the processing
    • But not where and how the data is kept!
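To make "referenced by name, not location" concrete, here is a minimal sketch of the indirection a warehouse keeps between logical table names and physical storage; the table name and path are invented for illustration:

```python
# Hypothetical name-to-location mapping owned by the system, not the user.
TABLE_LOCATIONS = {
    "sales": "hdfs://warehouse/db/sales",  # the system may move this freely
}

def resolve(table_name: str) -> str:
    """Translate a logical table name into its current physical location."""
    return TABLE_LOCATIONS[table_name]

# A user writes "SELECT * FROM sales"; only the engine ever sees the path.
print(resolve("sales"))
```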

  7. Hadoop eco-system
  [Diagram: External Data Sources flow through Sqoop + Flume into the HDFS storage layer and HBase; the Map-Reduce processing framework runs above storage, driven by HiveQL, Pig, and Mahout; Oozie, the Hive Metastore (HCatalog), and Cloudera Navigator sit alongside the stack; Users sit at the top]

  8. Key challenges: more varied data sources, with many more access / retention requirements
  [Same Hadoop eco-system diagram as slide 7]

  9. Key challenges: data accessed through multiple entry points
  [Same Hadoop eco-system diagram as slide 7]

  10. Key challenges: lots of new consumers of the data
  [Same Hadoop eco-system diagram as slide 7]

  11. Key challenges: one access control mechanism, files
  [Same Hadoop eco-system diagram as slide 7]
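A short sketch of why a single file-granular mechanism falls short: a file either opens or it does not, so intent like "alice may read every column except ssn" has to be enforced above the file system. All names here (FILE_ACL, COLUMN_POLICY, the two functions) are hypothetical:

```python
# What the file system can express: whole-file yes or no.
FILE_ACL = {"hdfs://warehouse/users.csv": {"alice", "bob"}}
# What the policy actually wants: column-level control.
COLUMN_POLICY = {"alice": {"deny": {"ssn"}}}

def check_file_acl(user, path):
    # All HDFS-style permissions can enforce on their own.
    return user in FILE_ACL.get(path, set())

def read_columns(user, path, columns):
    if not check_file_acl(user, path):
        raise PermissionError("no file access")
    denied = COLUMN_POLICY.get(user, {}).get("deny", set())
    # Column filtering must happen in the processing layer, not in the files.
    return [c for c in columns if c not in denied]

print(read_columns("alice", "hdfs://warehouse/users.csv", ["name", "ssn"]))
# -> ['name']  (a file-level ACL alone would have exposed the ssn too)
```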

  12. Steps to data management
  1. Provide access at the right level
  2. Limit the processing interfaces
  3. Schemas and provenance provide control
  4. Enforce policy
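Steps 2 and 4 can be illustrated together: if workflows are built from a narrow set of operations, policy can be checked before anything runs. A hedged sketch, with ALLOWED_OPS, GRANTS, and compile_workflow all invented for illustration:

```python
# Hypothetical narrow interface: no raw file I/O, only these operations.
ALLOWED_OPS = {"filter", "project", "join", "aggregate"}
GRANTS = {"alice": {"clickstream"}}

def user_can_read(user, table):
    return table in GRANTS.get(user, set())

def compile_workflow(user, steps):
    """steps: list of (operation, table) pairs produced by a UI or script."""
    for op, table in steps:
        if op not in ALLOWED_OPS:
            raise ValueError(f"operation {op!r} is outside the interface")
        if not user_can_read(user, table):
            raise PermissionError(f"{user} may not read {table!r}")
    return steps  # in a real system: an executable plan

compile_workflow("alice", [("filter", "clickstream")])   # ok
# compile_workflow("alice", [("filter", "payroll")])     # PermissionError
```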

  13. Case study: Natero
  • Cloud-based analytics service
    • Enable business users to take advantage of big data
    • UI-driven workflow creation and automation
  • Single shared Hadoop eco-system
    • Need customer-level isolation and user-level access controls
  • Goals:
    • Provide the appropriate level of abstraction for our users
    • Finer granularity of access control
    • Enable policy enforcement
      • Users shouldn’t have to think about policy
    • Source-driven policy management (see the sketch below)
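A sketch of what source-driven policy management could look like: rules are attached to the sources, and anything derived from them inherits those rules, so end users never handle policy directly. SOURCE_POLICY and inherit_policy are hypothetical names, and the merge rule (strictest retention wins) is an assumption:

```python
# Hypothetical policy attached to a data source by IT, not by end users.
SOURCE_POLICY = {"eu_customers": {"retention_days": 30, "region": "EU-only"}}

def inherit_policy(parent_sources):
    """A derived dataset inherits the union of its parents' policies,
    taking the strictest value when two parents disagree."""
    merged = {}
    for src in parent_sources:
        for key, value in SOURCE_POLICY.get(src, {}).items():
            if key == "retention_days":
                merged[key] = min(merged.get(key, value), value)
            else:
                merged[key] = value
    return merged

# A report built from EU customer data automatically carries its rules.
print(inherit_policy(["eu_customers"]))
# -> {'retention_days': 30, 'region': 'EU-only'}
```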

  14. Natero application stack
  [Diagram: the slide 7 stack (External Data Sources, Sqoop + Flume, HDFS storage layer, HBase, Map-Reduce processing framework, Pig / HiveQL / Mahout) augmented with Natero components: an access-aware workflow compiler between Users and the processing interfaces, Schema Extraction on the ingest path, a Policy and Metadata Manager, and a provenance-aware scheduler; callouts 1-4 tie the components back to the four steps on slide 12]

  15. Natero execution example
  [Diagram: the Natero UI hands jobs to the compiler, which consults the Metadata Manager; the Scheduler executes jobs against the Sources]
  • Fine-grained access control
  • Auditing
  • Enforceable policy
  • Easy for users
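One way to read this slide as code: the compiled job carries its inputs, the scheduler records lineage with the metadata manager, and every run leaves an audit trail, which is what makes the access control, auditing, and policy bullets enforceable. A minimal sketch; the classes and methods are hypothetical, not Natero's API:

```python
from datetime import datetime

class MetadataManager:
    def __init__(self):
        self.lineage = {}   # output dataset -> input datasets
        self.audit = []     # who ran what, over which data, when

    def record_run(self, user, job_name, inputs, output):
        self.lineage[output] = list(inputs)
        self.audit.append((datetime.utcnow(), user, job_name, inputs, output))

class Scheduler:
    def __init__(self, metadata):
        self.metadata = metadata

    def run(self, user, job_name, inputs, output):
        # ... submit the compiled job to the processing framework ...
        self.metadata.record_run(user, job_name, inputs, output)

mm = MetadataManager()
Scheduler(mm).run("alice", "daily_rollup", ["clickstream"], "click_summary")
print(mm.lineage["click_summary"])  # ['clickstream'] -> auditable provenance
```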

  16. The right level of abstraction
  • Our abstraction comes with trade-offs
    • More control, compliance
    • No more raw Map-Reduce
    • Possible to integrate with Pig/Hive
  • What’s the right level of abstraction for you?
    • Kinds of execution

  17. Hadoop projects to watch
  • HCatalog: data discovery / schema management / access
  • Falcon: lifecycle management / workflow execution
  • Knox: centralized access control
  • Navigator: auditing / access management

  18. Lessons learned
  • If you want control over your data, you also need control over data processing
  • File-based access control is not enough
  • Metadata is crucial
  • Users aren’t motivated by policy
    • Policy shouldn’t get in the way of use
    • But you might get IT to reason about the sources
