
How to Protect Big Data in a Containerized Environment




  1. How to Protect Big Data in a Containerized Environment • Thomas Phelan • Chief Architect, BlueData • @tapbluedata

  2. Outline • Securing a Big Data Environment • Data Protection • Transparent Data Encryption • Transparent Data Encryption in a Containerized/Virtualized Environment • Takeaways

  3. In the beginning … • Hadoop was used to process public web data • No compelling need for security • No user or service authentication • No data security

  4. Then Hadoop Became Popular • Security is important.

  5. Hadoop: Security in Depth

  6. Layers of Security in Hadoop • Perimeter • Authentication • Authorization • Container/OS • Data Protection • Big Data as a Service (BDaaS)

  7. Hadoop: Security in Depth

  8. Focus on Data Security • Confidentiality • Confidentiality is lost when data is accessed by someone not authorized to do so • Integrity • Integrity is lost when data is modified in unexpected ways • Availability • Availability is lost when data is erased or becomes inaccessible Reference: https://www.us-cert.gov/sites/default/files/publications/infosecuritybasics.pdf

  9. Hadoop Distributed File System • Data Security Features • Access Control • Data Encryption • Data Replication

  10. Access Control • Simple • Identity determined by host operating system • Kerberos • Identity determined by Kerberos credentials • Most common to have one realm for both compute and storage
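With Kerberos access control, a client must hold a valid ticket before HDFS will honor its requests. A minimal illustration, assuming a Kerberized cluster and a hypothetical user principal alice@CORP.ENTERPRISE.COM:

```shell
# Obtain a Kerberos ticket, then access HDFS as that authenticated identity.
kinit alice@CORP.ENTERPRISE.COM
klist                      # show the ticket-granting ticket just acquired
hdfs dfs -ls /user/alice   # the request is authorized against the Kerberos identity
```

Without the ticket, a Kerberized NameNode rejects the request outright.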

  11. Data Encryption • Transforming data • cleartext -> ? -> ciphertext

  12. Data Replication • 3-way replication • Can survive any 2 failures • Erasure Coding • New in Hadoop 3.0 • Can survive > 2 failures, depending on the parity block configuration
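In Hadoop 3, erasure coding is enabled per directory with the `hdfs ec` subcommand. A sketch against a hypothetical path `/colddata`: with the built-in RS-6-3-1024k policy each stripe holds 6 data and 3 parity blocks, so any 3 losses are survivable.

```shell
hdfs ec -listPolicies                                    # policies shipped with Hadoop 3
hdfs ec -setPolicy -path /colddata -policy RS-6-3-1024k  # Reed-Solomon: 6 data + 3 parity
hdfs ec -getPolicy -path /colddata                       # confirm the policy on the directory
```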

  13. HDFS with End to End Encryption • Confidentiality • Data Access • Integrity • Data Access + Data Encryption • Availability • Data Access + Data Replication

  14. Data Encryption • What is data encryption? • Cleartext: 101011100010010001011100010100011101010101010100011101010101110 • Ciphertext: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

  15. Data Encryption used in HDFS • Symmetric-key encryption • The same key is used to encrypt and decrypt data • Iterated block cipher • The cipher is applied to a fixed-size unit (block) of data; the size of the ciphertext is the same as the size of the original cleartext • Kerberos access control is required for HDFS TDE
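The length-preserving property can be seen with the `openssl` CLI (assumed installed). HDFS TDE uses AES in CTR mode; the snippet below mimics that property only, it is not the HDFS code path, and the fixed IV is for this demo alone:

```shell
# Encrypt a small record with AES-256-CTR and compare sizes.
printf 'account=42,balance=1000' > clear.txt   # cleartext record
DEK=$(openssl rand -hex 32)                    # 256-bit data key (hex)
IV=00000000000000000000000000000000            # fixed IV: demo only, never reuse in practice
openssl enc -aes-256-ctr -K "$DEK" -iv "$IV" -in clear.txt -out cipher.bin
openssl enc -d -aes-256-ctr -K "$DEK" -iv "$IV" -in cipher.bin -out round.txt
wc -c clear.txt cipher.bin                     # CTR mode: ciphertext length == cleartext length
cmp clear.txt round.txt && echo "round trip ok"
```

Because CTR turns the block cipher into a keystream, no padding is added and byte offsets in the ciphertext line up with the cleartext, which is what lets HDFS clients do random reads on encrypted files.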

  16. Data Encryption – At Rest • Data is encrypted while on persistent media (disk)

  17. Data Encryption – In Transit • Data is encrypted while traveling over the network

  18. HDFS Transparent Data Encryption • End-to-end encryption • Data is encrypted/decrypted at the client • Data is protected at rest and in transit • Transparent • No application level code changes required

  19. End-to-End Encryption • (diagram: data remains ciphertext from the client, across the network, to disk)

  20. HDFS TDE - Design • Goals: • Only an authorized client/user can access cleartext • HDFS never stores cleartext or unencrypted data encryption keys

  21. HDFS TDE – Terminology I • Encryption Zone • A directory whose file contents will be encrypted upon write and decrypted upon read • An EZKEY is generated for each zone

  22. HDFS TDE – Terminology II • EZKEY – encryption zone key • DEK – data encryption key • EDEK – encrypted data encryption key • Symmetric-key encryption • EZKEY + DEK => EDEK • EDEK + EZKEY => DEK
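The key relationships above can be mimicked with the `openssl` CLI (assumed installed). Real deployments wrap keys inside the KMS with proper IV handling; this is only a sketch of the algebra EZKEY + DEK => EDEK and EDEK + EZKEY => DEK:

```shell
EZKEY=$(openssl rand -hex 32)          # encryption zone key (held by the KMS)
DEK=$(openssl rand -hex 32)            # per-file data encryption key
IV=00000000000000000000000000000000    # fixed IV for illustration only
# EZKEY + DEK => EDEK: wrap the DEK under the zone key
printf '%s' "$DEK" | openssl enc -aes-256-ctr -K "$EZKEY" -iv "$IV" -out edek.bin
# EDEK + EZKEY => DEK: unwrap with the same zone key
RECOVERED=$(openssl enc -d -aes-256-ctr -K "$EZKEY" -iv "$IV" -in edek.bin)
[ "$RECOVERED" = "$DEK" ] && echo "DEK recovered"
```

The point of the indirection: HDFS can store the EDEK freely (it is ciphertext), while only the KMS, which holds the EZKEY, can turn it back into a usable DEK.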

  23. HDFS TDE - Services • HDFS NameNode (NN) • Hadoop Key Management Server (KMS) • Key Trustee Server • Kerberos Key Distribution Center (KDC)

  24. HDFS TDE – Security Concepts • KMS creates the EZKEY & DEK • KMS encrypts/decrypts the DEK/EDEK using the EZKEY

  25. HDFS TDE – Security Concepts • The name of the EZKEY is stored in the HDFS extended attributes of the directory associated with the encryption zone • The EDEK is stored in the HDFS extended attributes of the file in the encryption zone $ hadoop key … $ hdfs crypto …
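The two CLI entry points on the slide map to concrete commands. A sketch against a hypothetical cluster (the key name `zone1key` and path `/secure` are illustrative, and Kerberos credentials are assumed):

```shell
hadoop key create zone1key -size 256      # create the EZKEY in the KMS
hdfs dfs -mkdir /secure                   # zone directory must exist and be empty
hdfs crypto -createZone -keyName zone1key -path /secure
hdfs crypto -listZones                    # the zone's key name now lives in extended attributes
```

Files subsequently written under `/secure` are encrypted transparently, each with its own EDEK stored in the file's extended attributes.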

  26. HDFS TDE – Security Concepts • The HDFS NN communicates with the KMS to create EZKEYs & EDEKs to store in the extended attributes in the encryption zone • The HDFS client communicates with the KMS to get the DEK using the EZKEY and EDEK.

  27. HDFS Examples • Simplified diagrams to avoid confusion/distraction: • Kerberos actions not shown • NameNode EDEK cache not shown

  28. HDFS - Encryption Zone Create • (workflow diagram; step shown: 3. Generate EZKEY)

  29. HDFS TDE – File Create Work Flow Using EZKEY

  30. HDFS TDE – File Write Work Flow

  31. HDFS TDE – File Read Work Flow Using EZKEY

  32. Bring in the Containers • Issues are the same for any virtualization platform • Multiple Compute Clusters • Multiple HDFS File Systems • Multiple Kerberos Realms • Cross realm trust configuration

  33. Containers as Virtual Machines • This is not using containers to run Big Data tasks:

  34. Containers as Virtual Machines • This is running Big Data clusters in containers: (cluster diagram)

  35. Containers as Virtual Machines • A true containerized Big Data environment:

  36. KDC Cross Realm Trust • Different KDC Realms for corporate, data, and compute • Must interact correctly in order for the Big Data cluster to function • CORP.ENTERPRISE.COM – End Users • COMPUTE.ENTERPRISE.COM – Hadoop/Spark Service Principals • DATALAKE.ENTERPRISE.COM – HDFS Service Principals

  37. KDC Cross Realm Trust • Different KDC Realms for corporate, data, and compute • One way trust • Compute Realm trusts the Corporate Realm • Data Realm trusts Corporate Realm • Data Realm trusts the Compute Realm
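In MIT Kerberos, a one-way trust "realm B trusts realm A" is implemented by creating the principal krbtgt/B@A, with identical keys, in both realms' KDCs. A sketch of the three trusts above (passwords and enctypes are deployment-specific, and each command must be run in both the trusting and the trusted KDC):

```shell
# Compute realm trusts the Corporate realm:
kadmin.local -q "addprinc -pw <secret> krbtgt/COMPUTE.ENTERPRISE.COM@CORP.ENTERPRISE.COM"
# Data realm trusts the Corporate realm:
kadmin.local -q "addprinc -pw <secret> krbtgt/DATALAKE.ENTERPRISE.COM@CORP.ENTERPRISE.COM"
# Data realm trusts the Compute realm:
kadmin.local -q "addprinc -pw <secret> krbtgt/DATALAKE.ENTERPRISE.COM@COMPUTE.ENTERPRISE.COM"
```

The asymmetry matters: nothing here lets a compute-realm principal mint credentials in the corporate realm, which keeps a compromised compute cluster from impersonating end users.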

  38. KDC Cross Realm Trust • (diagram) • CORP.ENTERPRISE.COM Realm – KDC: CORP.ENTERPRISE.COM; user@CORP.ENTERPRISE.COM • COMPUTE.ENTERPRISE.COM Realm – KDC: COMPUTE.ENTERPRISE.COM; Hadoop Cluster; rm@COMPUTE.ENTERPRISE.COM • DATALAKE.ENTERPRISE.COM Realm – KDC: DATALAKE.ENTERPRISE.COM; Hadoop Key Management Service; HDFS: hdfs://remotedata/

  39. Key Management Service • Must be enterprise quality • Java KeyStore KMS • Cloudera Navigator Key Trustee Server

  40. Containers as Virtual Machines • A true containerized Big Data environment: (diagram: multiple compute clusters and data lakes, each pairing CORP.ENTERPRISE.COM end users, COMPUTE.ENTERPRISE.COM Hadoop/Spark service principals, and DATALAKE.ENTERPRISE.COM HDFS service principals)

  41. Key Takeaways • Hadoop has many security layers • HDFS Transparent Data Encryption is best of breed • Security is hard (complex) and virtualization only makes it harder • Compute and Storage separation with virtualization makes it harder still

  42. Tom Phelan @tapbluedata www.bluedata.com Visit BlueData Booth #211 in Strata Expo Hall
