Enabling data management in a big data world

Craig Soules, Garth Goodson, Tanya Shastri

The problem with data management
• Hadoop is a collection of tools
• Not tightly integrated
• Everyone’s stack looks a little different
• Everything falls back to files

Agenda
• Traditional data management
• Hadoop’s eco-system
• Natero’s approach to data management

What is data management?
• What do you have?
  • What data sets exist?
  • Where are they stored?
  • What properties do they have?
• Are you doing the right thing with it?
  • Who can access data?
  • Who has accessed data?
  • What did they do with it?
  • What rules apply to this data?

Traditional data management
[Diagram: external data sources feed an Extract / Transform / Load step into a data warehouse that combines data processing with integrated storage; users access it through SQL.]

Key lessons of traditional systems
• Data requires the right abstraction
• Schemas have value
• Tables are easy to reason about
• Referenced by name, not location
• Narrow interface
• SQL defines the data sources and the processing
• But not where and how the data is kept!
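To make "referenced by name, not location" concrete, here is a minimal sketch of a name-based catalog. It is only an illustration: the `Catalog` and `TableEntry` classes, the table name, and the HDFS paths are all hypothetical and do not come from any warehouse or Hadoop API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TableEntry:
    name: str
    columns: tuple   # (column name, type) pairs: the schema users reason about
    location: str    # physical location, kept out of the user-facing interface


class Catalog:
    """Hypothetical catalog: callers use table names, never storage paths."""

    def __init__(self):
        self._tables = {}

    def register(self, entry):
        self._tables[entry.name] = entry

    def schema(self, name):
        return self._tables[name].columns

    def relocate(self, name, new_location):
        # Storage can move without any change visible to name-based callers.
        self._tables[name] = replace(self._tables[name], location=new_location)

    def _resolve(self, name):
        # Internal only: used by the storage layer, not by users.
        return self._tables[name].location


catalog = Catalog()
catalog.register(TableEntry(
    name="orders",
    columns=(("order_id", "int"), ("total", "float")),
    location="hdfs://cluster-a/warehouse/orders",   # made-up path
))
schema_before = catalog.schema("orders")
catalog.relocate("orders", "hdfs://cluster-b/warehouse/orders")
schema_after = catalog.schema("orders")
```

Because queries go through `schema("orders")` rather than a path, moving the data changes nothing for the user, which is the property the narrow SQL interface provides.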

Hadoop eco-system
[Diagram components: users; external data sources; Oozie; Hive Metastore (HCatalog); Cloudera Navigator; HiveQL, Pig, and Mahout; Sqoop + Flume; processing framework (Map-Reduce); HBase; HDFS storage layer.]

Key challenges
• More varied data sources with many more access / retention requirements
• Data accessed through multiple entry points
• Lots of new consumers of the data
• One access control mechanism: files
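A small sketch of why a single, file-based access control mechanism is limiting: a file grant is all-or-nothing, so it cannot hide one column from a user who needs the others. The path, user name, column names, and grant tables below are made up for illustration; this is not how HDFS permissions are represented internally.

```python
# Hypothetical data set: one file holding several columns.
FILE_COLUMNS = {"/data/customers.csv": ["customer_id", "email", "region"]}

# File-level grants: the only question a file permission can answer is
# "may this user read this file?".
FILE_GRANTS = {("analyst", "/data/customers.csv")}

# Column-level grants: what a finer-grained policy would like to express.
COLUMN_GRANTS = {
    ("analyst", "/data/customers.csv", "customer_id"),
    ("analyst", "/data/customers.csv", "region"),
}


def readable_with_file_acl(user, path):
    # Access to the file implies access to every column in it.
    if (user, path) in FILE_GRANTS:
        return list(FILE_COLUMNS[path])
    return []


def readable_with_column_acl(user, path):
    # Only explicitly granted columns are visible.
    return [c for c in FILE_COLUMNS[path] if (user, path, c) in COLUMN_GRANTS]


file_view = readable_with_file_acl("analyst", "/data/customers.csv")
column_view = readable_with_column_acl("analyst", "/data/customers.csv")
```

With only the file grant, the analyst sees `email` as well; the column-level check is the kind of control that needs schema information above the file layer.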

Steps to data management
1. Provide access at the right level
2. Limit the processing interfaces
3. Schemas and provenance provide control
4. Enforce policy

Case study: Natero
• Cloud-based analytics service
• Enable business users to take advantage of big data
• UI-driven workflow creation and automation
• Single shared Hadoop eco-system
• Need customer-level isolation and user-level access controls
• Goals:
  • Provide the appropriate level of abstraction for our users
  • Finer granularity of access control
  • Enable policy enforcement
  • Users shouldn’t have to think about policy
  • Source-driven policy management
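One way to read "source-driven policy management" is that policy is attached once, at the source, and every derived data set inherits it through provenance, so users never set policy themselves. The sketch below illustrates that reading only; it is not Natero's implementation, and the `Dataset` class, data set names, and policy tags are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    parents: list = field(default_factory=list)  # provenance: direct inputs
    policy: set = field(default_factory=set)     # tags set on sources only


def effective_policy(dataset):
    """Union of the data set's own tags and those of all its ancestors."""
    tags = set(dataset.policy)
    for parent in dataset.parents:
        tags |= effective_policy(parent)
    return tags


# Policy is declared once per source (for example by IT), not per user job.
crm = Dataset("crm_export", policy={"pii", "retain-90d"})
clicks = Dataset("clickstream", policy={"internal"})

# Derived data sets carry no tags of their own, only provenance links.
features = Dataset("churn_features", parents=[crm, clicks])
report = Dataset("churn_report", parents=[features])

report_policy = effective_policy(report)
```

Here `churn_report` ends up subject to the `pii` and `retain-90d` rules even though nobody tagged it, which is why provenance tracking matters for enforcement.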

Natero application stack
[Diagram components: users; access-aware workflow compiler; policy and metadata manager; provenance-aware scheduler; schema extraction; external data sources; Pig, HiveQL, and Mahout; Sqoop + Flume; processing framework (Map-Reduce); HBase; HDFS storage layer. The original diagram carries numbered callouts 1 to 4.]

Natero execution example
[Diagram components: Natero UI; job; compiler; metadata manager; scheduler; sources.]
• Fine-grained access control
• Auditing
• Enforceable policy
• Easy for users
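The flow on this slide (UI builds a job, the compiler consults the metadata manager, the result goes to the scheduler) can be sketched as follows. This is an assumption-laden illustration, not Natero's code: `Job`, `MetadataManager`, `compile_job`, and `AccessDenied` are hypothetical names, and the grant model is deliberately simplified to (user, data set) pairs.

```python
from dataclasses import dataclass


class AccessDenied(Exception):
    pass


@dataclass(frozen=True)
class Job:
    user: str
    inputs: tuple   # names of data sets the job reads
    output: str     # name of the data set the job writes


class MetadataManager:
    """Hypothetical policy store that also records every access decision."""

    def __init__(self, grants):
        self._grants = grants   # set of (user, data set) pairs
        self.audit_log = []     # (user, data set, allowed) tuples

    def check(self, user, dataset):
        allowed = (user, dataset) in self._grants
        self.audit_log.append((user, dataset, allowed))
        return allowed


def compile_job(job, metadata):
    # Check every input, so the audit log records each decision,
    # then refuse to produce a plan if any input was denied.
    denied = [d for d in job.inputs if not metadata.check(job.user, d)]
    if denied:
        raise AccessDenied(f"{job.user} may not read {denied}")
    # The plan is what a scheduler would receive; here it is a plain dict.
    return {"user": job.user, "reads": list(job.inputs), "writes": job.output}


metadata = MetadataManager({("ana", "sales"), ("ana", "web_logs")})
plan = compile_job(Job("ana", ("sales", "web_logs"), "weekly_report"), metadata)

try:
    compile_job(Job("bob", ("sales",), "side_copy"), metadata)
    bob_denied = False
except AccessDenied:
    bob_denied = True
```

Because the compiler is the only entry point, access control and auditing happen in one place, and users never interact with policy directly.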

The right level of abstraction
• Our abstraction comes with trade-offs
• More control, compliance
• No more raw Map-Reduce
• Possible to integrate with Pig/Hive
• What’s the right level of abstraction for you?
• Kinds of execution

Hadoop projects to watch
• HCatalog
  • Data discovery / schema management / access
• Falcon
  • Lifecycle management / workflow execution
• Knox
  • Centralized access control
• Navigator
  • Auditing / access management

Lessons learned
• If you want control over your data, you also need control over data processing
• File-based access control is not enough
• Metadata is crucial
• Users aren’t motivated by policy
• Policy shouldn’t get in the way of use
• But you might get IT to reason about the sources
