
How to Protect Big Data in a Containerized Environment


Presentation Transcript


  1. How to Protect Big Data in a Containerized Environment • Thomas Phelan • Chief Architect, BlueData • @tapbluedata

  2. Outline • Securing a Big Data Environment • Data Protection • Transparent Data Encryption • Transparent Data Encryption in a Containerized Environment • Takeaways

  3. In the Beginning … • Hadoop was used to process public web data • No compelling need for security • No user or service authentication • No data security

  4. Then Hadoop Became Popular • Security became important.

  5. Layers of Security in Hadoop • Access • Authentication • Authorization • Data Protection • Auditing • Policy (protect from human error)

  6. Hadoop Security: Data Protection • Reference: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/sg_edh_overview.html

  7. Focus on Data Security • Confidentiality • Confidentiality is lost when data is accessed by someone not authorized to do so • Integrity • Integrity is lost when data is modified in unexpected ways • Availability • Availability is lost when data is erased or becomes inaccessible Reference: https://www.us-cert.gov/sites/default/files/publications/infosecuritybasics.pdf

  8. Hadoop Distributed File System (HDFS) • Data Security Features • Access Control • Data Encryption • Data Replication

  9. Access Control • Simple • Identity determined by host operating system • Kerberos • Identity determined by Kerberos credentials • One realm for both compute and storage • Required for HDFS Transparent Data Encryption
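
  As a quick illustration (the principal name here is hypothetical), a Kerberized HDFS client authenticates before issuing commands:

  $ kinit user@CORP.ENTERPRISE.COM    # obtain a ticket-granting ticket from the KDC
  $ klist                             # confirm the ticket was issued
  $ hdfs dfs -ls /                    # subsequent HDFS requests authenticate via Kerberos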

  10. Data Encryption • Transforming cleartext into ciphertext so the data cannot be read without the key

  11. Data Replication • 3-way replication • Can survive any 2 failures • Erasure Coding • Can survive more than 2 failures, depending on the parity configuration
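
  As a hedged illustration (the paths and policy name are examples; erasure coding requires Hadoop 3), both schemes can be managed from the command line:

  $ hdfs dfs -setrep -w 3 /data/file      # set 3-way replication on a file
  $ hdfs ec -listPolicies                 # list the available erasure coding policies
  $ hdfs ec -setPolicy -path /data/archive -policy RS-6-3-1024k   # 6 data + 3 parity blocks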

  12. HDFS with End-to-End Encryption • Confidentiality • Data Access • Integrity • Data Access + Data Encryption • Availability • Data Access + Data Replication

  13. Data Encryption • How to transform the data?
  Cleartext:  101011100010010001011100010100011101010101010100011101010101110
  Ciphertext: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

  14. Data Encryption – At Rest • Data is encrypted while on persistent media (disk)

  15. Data Encryption – In Transit • Data is encrypted while traveling over the network

  16. The Whole Process • Combining the two: data remains ciphertext both at rest and in transit

  17. HDFS Transparent Data Encryption (TDE) • End-to-end encryption • Data is encrypted/decrypted at the client • Data is protected at rest and in transit • Transparent • No application level code changes required

  18. HDFS TDE – Design • Goals: • Only an authorized client/user can access cleartext • HDFS never stores cleartext or unencrypted data encryption keys

  19. HDFS TDE – Terminology • Encryption Zone • A directory whose file contents will be encrypted upon write and decrypted upon read • An EZKEY is generated for each zone

  20. HDFS TDE – Terminology • EZKEY – encryption zone key • DEK – data encryption key • EDEK – encrypted data encryption key

  21. HDFS TDE - Data Encryption • The same key is used to encrypt and decrypt data • The size of the ciphertext is exactly the same as the size of the original cleartext • EZKEY + DEK => EDEK • EDEK + EZKEY => DEK
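
  A minimal command-line sketch of these key relationships, using OpenSSL as a stand-in (HDFS actually performs these operations inside the KMS and client libraries; the fixed IV is for illustration only and would be unsafe in practice):

  $ EZKEY=$(openssl rand -hex 32)           # encryption zone key (held by the KMS)
  $ DEK=$(openssl rand -hex 32)             # per-file data encryption key
  $ IV=00000000000000000000000000000000     # fixed IV, illustration only
  $ # EZKEY + DEK => EDEK: wrap the DEK with the EZKEY
  $ echo -n "$DEK" | openssl enc -aes-256-ctr -K "$EZKEY" -iv "$IV" -out dek.edek
  $ # EDEK + EZKEY => DEK: unwrap the EDEK with the same EZKEY
  $ openssl enc -d -aes-256-ctr -K "$EZKEY" -iv "$IV" -in dek.edek
  $ # AES-CTR is a stream mode, so the ciphertext is exactly the size of the cleartext
  $ openssl enc -aes-256-ctr -K "$DEK" -iv "$IV" -in cleartext.txt -out file.enc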

  22. HDFS TDE - Services • HDFS NameNode (NN) • Kerberos Key Distribution Center (KDC) • Hadoop Key Management Server (KMS) • Key Trustee Server

  23. HDFS TDE – Security Concepts • Division of Labor • KMS creates the EZKEY & DEK • KMS encrypts/decrypts the DEK/EDEK using the EZKEY • HDFS NN communicates with the KMS to create EZKEYs & EDEKs to store in the extended attributes in the encryption zone • HDFS client communicates with the KMS to get the DEK using the EZKEY and EDEK.

  24. HDFS TDE – Security Concepts • The name of the EZKEY is stored in the HDFS extended attributes of the directory associated with the encryption zone • The EDEK is stored in the HDFS extended attributes of the file in the encryption zone $ hadoop key … $ hdfs crypto …

  25. HDFS Examples • Simplified for the sake of clarity: • Kerberos actions not shown • NameNode EDEK cache not shown

  26. HDFS – Create Encryption Zone • The KMS creates the EZKEY • The EZKEY name (EZKEYNAME) is stored in the extended attributes of /encrypted_dir
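
  The corresponding commands look roughly like this (the key name and path are examples):

  $ hadoop key create ezkey1               # create the EZKEY in the KMS
  $ hdfs dfs -mkdir /encrypted_dir         # the zone directory must exist and be empty
  $ hdfs crypto -createZone -keyName ezkey1 -path /encrypted_dir
  $ hdfs crypto -listZones                 # verify the new encryption zone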

  27. HDFS – Create Encrypted File • 1. Client asks the NameNode to create /encrypted_dir/file • 2. NameNode requests an EDEK • 3. KMS creates the EDEK • 4. NameNode stores the EDEK in the file's extended attributes • 5. Success is returned to the client

  28. HDFS TDE – File Write Work Flow • 1. Client opens /encrypted_dir/file for write • 2. NameNode returns the EDEK stored in the file's extended attributes • 3. Client requests the DEK from the KMS using the EDEK & EZKEYNAME • 4. KMS decrypts the DEK from the EDEK • 5. KMS returns the DEK • The client reads unencrypted data from the application and writes encrypted data to HDFS

  29. HDFS TDE – File Read Work Flow • 1. Client opens /encrypted_dir/file for read • 2. NameNode returns the EDEK stored in the file's extended attributes • 3. Client requests the DEK from the KMS using the EDEK & EZKEYNAME • 4. KMS decrypts the DEK from the EDEK • 5. KMS returns the DEK • The client reads encrypted data from HDFS and returns unencrypted data to the application
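
  Because the encryption is transparent, ordinary HDFS commands exercise both workflows (file names are examples; the /.reserved/raw view requires superuser privileges):

  $ hdfs dfs -put report.csv /encrypted_dir/report.csv        # client encrypts on write
  $ hdfs dfs -cat /encrypted_dir/report.csv                   # client decrypts on read: cleartext
  $ hdfs dfs -cat /.reserved/raw/encrypted_dir/report.csv     # bytes as stored: ciphertext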

  30. Bring in the Containers (i.e. Docker) • Issues with containers are the same for any virtualization platform • Multiple compute clusters • Multiple HDFS file systems • Multiple Kerberos realms • Cross-realm trust configuration

  31. Containers as Virtual Machines • Note – this is not about using containers to run Big Data tasks

  32. Containers as Virtual Machines • This is about running Hadoop / Big Data clusters in containers

  33. Containers as Virtual Machines • A true containerized Big Data environment

  34. KDC Cross-Realm Trust • Different KDC realms for corporate, data, and compute • Must interact correctly in order for the Big Data cluster to function • CORP.ENTERPRISE.COM – end users • COMPUTE.ENTERPRISE.COM – Hadoop/Spark service principals • DATALAKE.ENTERPRISE.COM – HDFS service principals

  35. KDC Cross-Realm Trust • Different KDC realms for corporate, data, and compute • One-way trust • The compute realm trusts the corporate realm • The data realm trusts the corporate realm • The data realm trusts the compute realm (see the sketch below)
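
  A hedged sketch of how one leg of such a one-way trust is typically established with MIT Kerberos (realm names are from the slides; the shared password is a placeholder): the same cross-realm krbtgt principal is created in both KDCs with an identical key.

  $ # Run on both the CORP KDC and the COMPUTE KDC, with the same password:
  $ kadmin.local -q "addprinc -pw <shared-secret> krbtgt/COMPUTE.ENTERPRISE.COM@CORP.ENTERPRISE.COM"
  $ # Repeat for the other one-way trusts:
  $ kadmin.local -q "addprinc -pw <shared-secret> krbtgt/DATALAKE.ENTERPRISE.COM@CORP.ENTERPRISE.COM"
  $ kadmin.local -q "addprinc -pw <shared-secret> krbtgt/DATALAKE.ENTERPRISE.COM@COMPUTE.ENTERPRISE.COM"

  Clients also need matching [capaths] (or domain_realm) entries in krb5.conf so they can find the path between realms.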

  36. KDC Cross-Realm Trust • CORP.ENTERPRISE.COM realm – KDC: CORP.ENTERPRISE.COM; user@CORP.ENTERPRISE.COM • COMPUTE.ENTERPRISE.COM realm – KDC: COMPUTE.ENTERPRISE.COM; Hadoop cluster; rm@COMPUTE.ENTERPRISE.COM • DATALAKE.ENTERPRISE.COM realm – KDC: DATALAKE.ENTERPRISE.COM; HDFS: hdfs://remotedata/ • Hadoop Key Management Service

  37. Key Management Service • Must be enterprise quality • Examples: Java KeyStore KMS • Cloudera Navigator Key Trustee Server

  38. Containers as Virtual Machines • A true containerized Big Data environment: multiple containerized compute clusters (COMPUTE.ENTERPRISE.COM Hadoop/Spark service principals) serving CORP.ENTERPRISE.COM end users, with multiple DataLakes (DATALAKE.ENTERPRISE.COM HDFS service principals)

  39. Key Takeaways • Hadoop has many security layers • HDFS Transparent Data Encryption (TDE) is best of breed • Security is hard (complex) • Virtualization / containerization only makes it potentially harder • Compute and storage separation with virtualization / containerization can make it even harder still

  40. Key Takeaways • Be careful with a build vs. buy decision for containerized Big Data • Recommendation: buy one already built • There are turnkey solutions (e.g. BlueData EPIC) Reference: www.bluedata.com/blog/2017/08/hadoop-spark-docker-ten-things-to-know

  41. @tapbluedata www.bluedata.com BlueData Booth #1508 in Strata Expo Hall
