How to Protect Big Data in a Containerized Environment. Thomas Phelan Chief Architect, BlueData @tapbluedata. Outline. Securing a Big Data Environment Data Protection Transparent Data Encryption Transparent Data Encryption in a Containerized/Virtualized Environment Takeaways.
How to Protect Big Data in a Containerized Environment • Thomas Phelan • Chief Architect, BlueData • @tapbluedata
Outline • Securing a Big Data Environment • Data Protection • Transparent Data Encryption • Transparent Data Encryption in a Containerized/Virtualized Environment • Takeaways
In the beginning … • Hadoop was used to process public web data • No compelling need for security • No user or service authentication • No data security
Then Hadoop Became Popular • Security is important
Layers of Security in Hadoop • Perimeter • Authentication • Authorization • Container/OS • Data Protection • Big Data as a Service (BDaaS)
Focus on Data Security • Confidentiality • Confidentiality is lost when data is accessed by someone not authorized to do so • Integrity • Integrity is lost when data is modified in unexpected ways • Availability • Availability is lost when data is erased or becomes inaccessible • Reference: https://www.us-cert.gov/sites/default/files/publications/infosecuritybasics.pdf
Hadoop Distributed File System • Data Security Features • Access Control • Data Encryption • Data Replication
Access Control • Simple • Identity determined by host operating system • Kerberos • Identity determined by Kerberos credentials • Most common to have one realm for both compute and storage
Data Encryption • Transforming data • cleartext -> ? -> ciphertext
Data Replication • 3-way replication • Can survive the loss of any 2 replicas • Erasure Coding • New in Hadoop 3.0 • Can survive more than 2 failures, depending on the parity configuration
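The trade-off between replication and erasure coding can be made concrete with a little arithmetic. The sketch below is only an illustration: the `Scheme` class and its field names are hypothetical, and the Reed-Solomon layouts (6 data + 3 parity, 10 data + 4 parity) are assumed examples of a parity configuration, not something taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """Hypothetical description of a redundancy layout."""
    name: str
    data_units: int       # units that hold user data
    redundant_units: int  # extra replicas or parity units

    @property
    def tolerated_failures(self) -> int:
        # Any `redundant_units` units can be lost without losing data.
        return self.redundant_units

    @property
    def storage_overhead(self) -> float:
        # Raw bytes stored per byte of user data.
        return (self.data_units + self.redundant_units) / self.data_units


# 3-way replication: one original plus two extra copies.
replication_3x = Scheme("3-way replication", data_units=1, redundant_units=2)
# Assumed erasure-coding layouts for illustration.
rs_6_3 = Scheme("Reed-Solomon 6+3", data_units=6, redundant_units=3)
rs_10_4 = Scheme("Reed-Solomon 10+4", data_units=10, redundant_units=4)

for s in (replication_3x, rs_6_3, rs_10_4):
    print(f"{s.name}: survives {s.tolerated_failures} failures, "
          f"{s.storage_overhead:.2f}x raw storage")
```

Under these assumptions, erasure coding tolerates more failures than 3-way replication while storing roughly half as many raw bytes.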
HDFS with End to End Encryption • Confidentiality • Data Access • Integrity • Data Access + Data Encryption • Availability • Data Access + Data Replication
Data Encryption • What is data encryption? • [Figure: a cleartext bit string shown alongside its ciphertext]
Data Encryption used in HDFS • Symmetric-key encryption • The same key is used to encrypt and decrypt data • Iterated block cipher • The cipher is applied to a fixed-size unit (block) of data • The size of the ciphertext is the same as the size of the original cleartext • Kerberos access control required for HDFS TDE
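Two of the properties listed above, that one key both encrypts and decrypts and that ciphertext is the same size as cleartext, can be shown with a toy cipher. This sketch deliberately does not use AES: it XORs the data with a keystream derived from SHA-256 in a counter construction, as a standard-library stand-in for a block cipher run in counter mode. The function names are hypothetical and the code is not suitable for protecting real data.

```python
import hashlib
import secrets


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key, nonce, and a counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(out[:length])


def xor_crypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR with the keystream is its own inverse."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, nonce, len(data))))


key = secrets.token_bytes(16)    # the single shared (symmetric) key
nonce = secrets.token_bytes(16)  # per-message value, stored with the ciphertext

cleartext = b"customer_id,balance\n42,1000\n"
ciphertext = xor_crypt(key, nonce, cleartext)
recovered = xor_crypt(key, nonce, ciphertext)  # same key, same function

print(len(cleartext), len(ciphertext))  # the two lengths match
print(recovered == cleartext)
```

Because the ciphertext is exactly as long as the cleartext, file sizes and byte offsets are unchanged by encryption in this construction.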
Data Encryption – At Rest • Data is encrypted while on persistent media (disk)
Data Encryption – In Transit • Data is encrypted while traveling over the network
HDFS Transparent Data Encryption • End-to-end encryption • Data is encrypted/decrypted at the client • Data is protected at rest and in transit • Transparent • No application level code changes required
End-to-End Encryption • [Diagram; surviving label: Ciphertext]
HDFS TDE - Design • Goals: • Only an authorized client/user can access cleartext • HDFS never stores cleartext or unencrypted data encryption keys
HDFS TDE – Terminology I • Encryption Zone • A directory whose file contents will be encrypted upon write and decrypted upon read • An EZKEY is generated for each zone
HDFS TDE – Terminology II • EZKEY – encryption zone key • DEK – data encryption key • EDEK – encrypted data encryption key • Symmetric-key encryption • EZKEY + DEK => EDEK • EDEK + EZKEY => DEK
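The two relations above (EZKEY + DEK => EDEK, and EDEK + EZKEY => DEK) describe wrapping one symmetric key with another. A minimal sketch follows, assuming a toy SHA-256-based XOR stream in place of the real cipher; `wrap_dek`, `unwrap_dek`, and the per-wrap `iv` are hypothetical names introduced only for illustration.

```python
import hashlib
import secrets


def _xor_stream(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher (NOT AES): XOR data with a SHA-256-derived stream."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


def wrap_dek(ezkey: bytes, dek: bytes) -> tuple[bytes, bytes]:
    """EZKEY + DEK => EDEK (returned together with the IV used)."""
    iv = secrets.token_bytes(16)
    return iv, _xor_stream(ezkey, iv, dek)


def unwrap_dek(ezkey: bytes, iv: bytes, edek: bytes) -> bytes:
    """EDEK + EZKEY => DEK."""
    return _xor_stream(ezkey, iv, edek)


ezkey = secrets.token_bytes(16)  # one per encryption zone
dek = secrets.token_bytes(16)    # one per file

iv, edek = wrap_dek(ezkey, dek)
print(unwrap_dek(ezkey, iv, edek) == dek)
```

Only the EDEK needs to be stored next to the data; without the EZKEY it does not reveal the DEK.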
HDFS TDE - Services • HDFS NameNode (NN) • Hadoop Key Management Server (KMS) • Key Trustee Server • Kerberos Key Distribution Center (KDC)
HDFS TDE – Security Concepts • KMS creates the EZKEY & DEK • KMS encrypts/decrypts the DEK/EDEK using the EZKEY
HDFS TDE – Security Concepts • The name of the EZKEY is stored in the HDFS extended attributes of the directory associated with the encryption zone • The EDEK is stored in the HDFS extended attributes of the file in the encryption zone $ hadoop key … $ hdfs crypto …
HDFS TDE – Security Concepts • The HDFS NN communicates with the KMS to create EZKEYs & EDEKs to store in the extended attributes in the encryption zone • The HDFS client communicates with the KMS to get the DEK using the EZKEY and EDEK.
HDFS Examples • Simplified diagrams to avoid confusion/distraction: • Kerberos actions not shown • NameNode EDEK cache not shown
HDFS - Encryption Zone Create • [Diagram; surviving label: 3. Generate EZKEY]
HDFS TDE – File Create Work Flow Using EZKEY
HDFS TDE – File Read Work Flow Using EZKEY
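Since the workflow diagrams did not survive, the create and read flows described in the Security Concepts slides can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: `ToyKMS`, `ToyNameNode`, `client_write`, and `client_read` are hypothetical names, the cipher is a toy SHA-256 XOR stream rather than the real one, one IV is reused for both the EDEK and the file data for brevity, and Kerberos and the NameNode EDEK cache are left out, as in the slides.

```python
import hashlib
import secrets


def xor_stream(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher (NOT AES); the same call encrypts and decrypts."""
    stream, counter = b"", 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


class ToyKMS:
    """Holds EZKEYs; they never leave this object."""

    def __init__(self):
        self._ezkeys = {}  # EZKEY name -> key bytes

    def create_key(self, name):
        self._ezkeys[name] = secrets.token_bytes(16)

    def generate_edek(self, key_name):
        dek, iv = secrets.token_bytes(16), secrets.token_bytes(16)
        return iv, xor_stream(self._ezkeys[key_name], iv, dek)

    def decrypt_edek(self, key_name, iv, edek):
        return xor_stream(self._ezkeys[key_name], iv, edek)


class ToyNameNode:
    """Stores only key names, EDEKs, and ciphertext; never cleartext or DEKs."""

    def __init__(self, kms):
        self.kms = kms
        self.zone_xattr = {}  # directory -> EZKEY name
        self.file_xattr = {}  # file path -> (EZKEY name, IV, EDEK)
        self.blocks = {}      # file path -> ciphertext (stand-in for DataNodes)

    def create_zone(self, directory, key_name):
        self.zone_xattr[directory] = key_name

    def create_file(self, path):
        key_name = self.zone_xattr[path.rsplit("/", 1)[0]]
        iv, edek = self.kms.generate_edek(key_name)
        self.file_xattr[path] = (key_name, iv, edek)
        return self.file_xattr[path]


def client_write(nn, kms, path, cleartext):
    key_name, iv, edek = nn.create_file(path)
    dek = kms.decrypt_edek(key_name, iv, edek)  # client obtains the DEK from the KMS
    nn.blocks[path] = xor_stream(dek, iv, cleartext)  # encrypted at the client


def client_read(nn, kms, path):
    key_name, iv, edek = nn.file_xattr[path]
    dek = kms.decrypt_edek(key_name, iv, edek)
    return xor_stream(dek, iv, nn.blocks[path])  # decrypted at the client


kms = ToyKMS()
kms.create_key("ezkey1")
nn = ToyNameNode(kms)
nn.create_zone("/secure", "ezkey1")
client_write(nn, kms, "/secure/report.csv", b"quarterly numbers")
print(client_read(nn, kms, "/secure/report.csv"))
```

In this sketch the storage side only ever sees the key name, the EDEK, and ciphertext, which mirrors the design goal that HDFS never stores cleartext or unencrypted data encryption keys.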
Bring in the Containers • Issues are the same for any virtualization platform • Multiple Compute Clusters • Multiple HDFS File Systems • Multiple Kerberos Realms • Cross realm trust configuration
Containers as Virtual Machines • This is not using containers to run Big Data tasks:
Containers as Virtual Machines • This is running Big Data clusters in containers: cluster
Containers as Virtual Machines • A true containerized Big Data environment:
KDC Cross Realm Trust • Different KDC Realms for corporate, data, and compute • Must interact correctly in order for the Big Data cluster to function • CORP.ENTERPRISE.COM: End Users • COMPUTE.ENTERPRISE.COM: Hadoop/Spark Service Principals • DATALAKE.ENTERPRISE.COM: HDFS Service Principals
KDC Cross Realm Trust • Different KDC Realms for corporate, data, and compute • One-way trust • Compute Realm trusts the Corporate Realm • Data Realm trusts the Corporate Realm • Data Realm trusts the Compute Realm
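The three one-way trusts listed above can be modeled as directed pairs to check which principals a realm's services would accept. This is a simplified sketch that considers direct trust only, not transitive paths; the function names and the `hdfs@DATALAKE.ENTERPRISE.COM` principal are hypothetical. The `krbtgt/<service realm>@<client realm>` naming follows the usual Kerberos convention for cross-realm principals as I understand it, so treat it as an assumption to verify against your KDC documentation.

```python
CORP = "CORP.ENTERPRISE.COM"
COMPUTE = "COMPUTE.ENTERPRISE.COM"
DATALAKE = "DATALAKE.ENTERPRISE.COM"

# (trusting realm, trusted realm): services in the first realm accept
# principals that authenticated in the second realm.
ONE_WAY_TRUSTS = {
    (COMPUTE, CORP),      # Compute Realm trusts the Corporate Realm
    (DATALAKE, CORP),     # Data Realm trusts the Corporate Realm
    (DATALAKE, COMPUTE),  # Data Realm trusts the Compute Realm
}


def realm_of(principal: str) -> str:
    """Return the realm part of a principal such as user@REALM."""
    return principal.rsplit("@", 1)[1]


def can_authenticate(principal: str, service_realm: str) -> bool:
    """True if the service realm is the principal's own realm or directly trusts it."""
    client_realm = realm_of(principal)
    return client_realm == service_realm or (service_realm, client_realm) in ONE_WAY_TRUSTS


def cross_realm_tgt_principal(service_realm: str, client_realm: str) -> str:
    """Assumed name of the cross-realm principal backing one trust direction."""
    return f"krbtgt/{service_realm}@{client_realm}"


print(can_authenticate("user@CORP.ENTERPRISE.COM", DATALAKE))      # end user reaching HDFS
print(can_authenticate("rm@COMPUTE.ENTERPRISE.COM", DATALAKE))     # compute service reaching HDFS
print(can_authenticate("hdfs@DATALAKE.ENTERPRISE.COM", COMPUTE))   # reverse direction is not trusted
for trusting, trusted in sorted(ONE_WAY_TRUSTS):
    print(cross_realm_tgt_principal(trusting, trusted))
```

Because each trust is one-way, a principal from the data realm is not accepted by compute or corporate services in this model.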
KDC Cross Realm Trust • [Diagram labels] • CORP.ENTERPRISE.COM Realm (KDC: CORP.ENTERPRISE.COM; user@CORP.ENTERPRISE.COM) • COMPUTE.ENTERPRISE.COM Realm (KDC: COMPUTE.ENTERPRISE.COM; Hadoop Cluster; rm@COMPUTE.ENTERPRISE.COM) • DATALAKE.ENTERPRISE.COM Realm (KDC: DATALAKE.ENTERPRISE.COM; HDFS: hdfs://remotedata/) • Hadoop Key Management Service
Key Management Service • Must be enterprise quality • Key Trustee Server • Java KeyStore KMS • Cloudera Navigator Key Trustee Server
Containers as Virtual Machines • A true containerized Big Data environment: • [Diagram labels, each repeated three times in the figure] • DataLake • CORP.ENTERPRISE.COM: End Users • COMPUTE.ENTERPRISE.COM: Hadoop/Spark Service Principals • DATALAKE.ENTERPRISE.COM: HDFS Service Principals
Key Takeaways • Hadoop has many security layers • HDFS Transparent Data Encryption is best of breed • Security is hard (complex) and virtualization only makes it harder • Compute and Storage separation with virtualization makes it harder still
Tom Phelan @tapbluedata www.bluedata.com Visit BlueData Booth #211 in Strata Expo Hall