How to Protect Big Data in a Containerized Environment. Thomas Phelan Chief Architect, BlueData @tapbluedata. Outline. Securing a Big Data Environment Data Protection Transparent Data Encryption Transparent Data Encryption in a Containerized Environment Takeaways.
How to Protect Big Data in a Containerized Environment • Thomas Phelan • Chief Architect, BlueData • @tapbluedata
Outline • Securing a Big Data Environment • Data Protection • Transparent Data Encryption • Transparent Data Encryption in a Containerized Environment • Takeaways
In the Beginning … • Hadoop was used to process public web data • No compelling need for security • No user or service authentication • No data security
Then Hadoop Became Popular • Security is important
Layers of Security in Hadoop • Access • Authentication • Authorization • Data Protection • Auditing • Policy (protect from human error)
Hadoop Security: Data Protection Reference: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/sg_edh_overview.html
Focus on Data Security • Confidentiality • Confidentiality is lost when data is accessed by someone not authorized to do so • Integrity • Integrity is lost when data is modified in unexpected ways • Availability • Availability is lost when data is erased or becomes inaccessible Reference: https://www.us-cert.gov/sites/default/files/publications/infosecuritybasics.pdf
Hadoop Distributed File System (HDFS) • Data Security Features • Access Control • Data Encryption • Data Replication
Access Control • Simple • Identity determined by host operating system • Kerberos • Identity determined by Kerberos credentials • One realm for both compute and storage • Required for HDFS Transparent Data Encryption
Data Encryption • Transforming data
Data Replication • 3-way replication • Can survive any 2 failures • Erasure Coding • Can survive more than 2 failures, depending on the number of parity blocks configured
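The trade-off between the two schemes can be made concrete with a little arithmetic. The following is a minimal sketch, not something taken from the slides: the RS(6,3) layout (6 data units, 3 parity units) is an assumed example configuration, and the `Scheme` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """A redundancy scheme described by its data and redundancy units."""
    name: str
    data_units: int        # units holding the original data
    redundancy_units: int  # extra copies (replication) or parity units (erasure coding)

    def tolerated_failures(self) -> int:
        # Any `redundancy_units` units can be lost and the data is still recoverable.
        return self.redundancy_units

    def storage_overhead(self) -> float:
        # Extra raw storage relative to the size of the original data.
        return self.redundancy_units / self.data_units


# 3-way replication: one data block plus two extra copies.
replication = Scheme("3-way replication", data_units=1, redundancy_units=2)
# Assumed example layout: Reed-Solomon with 6 data units and 3 parity units.
erasure = Scheme("RS(6,3) erasure coding", data_units=6, redundancy_units=3)

for s in (replication, erasure):
    print(f"{s.name}: survives {s.tolerated_failures()} failures, "
          f"{s.storage_overhead():.0%} extra storage")
```

Under these assumptions, erasure coding tolerates one more failure than 3-way replication while using a quarter of the extra storage.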
HDFS with End-to-End Encryption • Confidentiality • Data Access • Integrity • Data Access + Data Encryption • Availability • Data Access + Data Replication
Data Encryption • How to transform the data? • [Figure: a bit stream of cleartext shown alongside its ciphertext]
Data Encryption – At Rest • Data is encrypted while on persistent media (disk)
Data Encryption – In Transit • Data is encrypted while traveling over the network
The Whole Process • [Figure: Ciphertext]
HDFS Transparent Data Encryption (TDE) • End-to-end encryption • Data is encrypted/decrypted at the client • Data is protected at rest and in transit • Transparent • No application level code changes required
HDFS TDE – Design • Goals: • Only an authorized client/user can access cleartext • HDFS never stores cleartext or unencrypted data encryption keys
HDFS TDE – Terminology • Encryption Zone • A directory whose file contents will be encrypted upon write and decrypted upon read • An EZKEY is generated for each zone
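An encryption zone is just a directory, so deciding whether a file is encrypted comes down to finding the zone directory that contains it. The sketch below illustrates that lookup in plain Python; the zone table, the key names, and the `zone_for` helper are all hypothetical and do not reflect how the NameNode implements it.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional

# Hypothetical zone table: zone root directory -> name of its EZKEY.
ZONES: Dict[str, str] = {
    "/encrypted_dir": "ezkey_a",
    "/projects/secret": "ezkey_b",
}


def zone_for(path: str, zones: Dict[str, str] = ZONES) -> Optional[str]:
    """Return the zone root that contains `path`, or None if it is outside every zone."""
    p = PurePosixPath(path)
    # Compare whole path components, walking from the path itself up to "/".
    # A plain string-prefix test would wrongly match "/encrypted_dir_other".
    for candidate in (p, *p.parents):
        if str(candidate) in zones:
            return str(candidate)
    return None


print(zone_for("/encrypted_dir/file"))        # inside a zone
print(zone_for("/encrypted_dir_other/file"))  # outside every zone
```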
HDFS TDE – Terminology • EZKEY – encryption zone key • DEK – data encryption key • EDEK – encrypted data encryption key
HDFS TDE - Data Encryption • The same key is used to encrypt and decrypt data • The size of the ciphertext is exactly the same as the size of the original cleartext • EZKEY + DEK => EDEK • EDEK + EZKEY => DEK
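The three properties on this slide (one key for both directions, ciphertext the same size as cleartext, and the DEK wrapped by the EZKEY) can be demonstrated with a toy cipher. This is only a sketch: `keystream_xor` is a hypothetical stand-in built from SHA-256 in a counter-style construction, it is not secure, and it is not the cipher HDFS uses. The per-file `iv` is likewise an assumption for illustration.

```python
import hashlib
import os


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with SHA-256(key | nonce | counter) blocks.

    Illustration only, NOT secure. Applying it twice with the same key and
    nonce restores the input, so one function both encrypts and decrypts.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


ezkey = os.urandom(32)  # encryption zone key
dek = os.urandom(32)    # data encryption key for one file
iv = os.urandom(16)     # hypothetical per-file nonce

# EZKEY + DEK => EDEK
edek = keystream_xor(ezkey, iv, dek)
# EDEK + EZKEY => DEK
recovered_dek = keystream_xor(ezkey, iv, edek)

cleartext = b"customer_id,balance\n42,1000\n"
ciphertext = keystream_xor(dek, iv, cleartext)            # encrypt with the DEK
roundtrip = keystream_xor(recovered_dek, iv, ciphertext)  # decrypt with the same DEK

print(len(cleartext), len(ciphertext))  # the two lengths are equal
```

A stream-style cipher is used here because it is what makes the ciphertext length match the cleartext length exactly, with no padding.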
HDFS TDE - Services • HDFS NameNode (NN) • Kerberos Key Distribution Center (KDC) • Hadoop Key Management Server (KMS) • Key Trustee Server
HDFS TDE – Security Concepts • Division of Labor • KMS creates the EZKEY & DEK • KMS encrypts/decrypts the DEK/EDEK using the EZKEY • HDFS NN communicates with the KMS to create EZKEYs & EDEKs to store in the extended attributes in the encryption zone • HDFS client communicates with the KMS to get the DEK using the EZKEY and EDEK.
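The division of labor can be sketched as a small class in which the EZKEY never leaves the KMS: the NameNode only ever receives an EDEK, and only an authorized client receives the DEK. Everything here is hypothetical and simplified. `ToyKMS`, its method names, the per-key user list, and the insecure XOR cipher are assumptions made for illustration, not the Hadoop KMS API.

```python
import hashlib
import os
from typing import Dict, Set, Tuple


def _xor_stream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy cipher (NOT secure) for inputs of at most 32 bytes."""
    pad = hashlib.sha256(key + nonce).digest()
    return bytes(b ^ p for b, p in zip(data, pad))


class ToyKMS:
    """Holds EZKEYs privately; hands out EDEKs and, to authorized users, DEKs."""

    def __init__(self) -> None:
        self._ezkeys: Dict[str, bytes] = {}
        self._allowed: Dict[str, Set[str]] = {}

    def create_key(self, name: str, allowed_users: Set[str]) -> None:
        # The KMS creates the EZKEY; it is never returned to callers.
        self._ezkeys[name] = os.urandom(32)
        self._allowed[name] = set(allowed_users)

    def generate_edek(self, key_name: str) -> Tuple[bytes, bytes]:
        """NameNode-facing call: returns (nonce, EDEK), never the DEK itself."""
        dek = os.urandom(32)
        nonce = os.urandom(16)
        return nonce, _xor_stream(self._ezkeys[key_name], nonce, dek)

    def decrypt_edek(self, user: str, key_name: str,
                     nonce: bytes, edek: bytes) -> bytes:
        """Client-facing call: returns the DEK only if the user is authorized."""
        if user not in self._allowed[key_name]:
            raise PermissionError(f"{user} may not use {key_name}")
        return _xor_stream(self._ezkeys[key_name], nonce, edek)


kms = ToyKMS()
kms.create_key("ezkey", {"alice"})
nonce, edek = kms.generate_edek("ezkey")              # what the NameNode would store
dek = kms.decrypt_edek("alice", "ezkey", nonce, edek)  # what the client would use
```

Because the NameNode path returns only the EDEK, a compromise of HDFS metadata alone would not expose a usable key in this model.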
HDFS TDE – Security Concepts • The name of the EZKEY is stored in the HDFS extended attributes of the directory associated with the encryption zone • The EDEK is stored in the HDFS extended attributes of the file in the encryption zone • $ hadoop key … • $ hdfs crypto …
HDFS Examples • Simplified for the sake of clarity: • Kerberos actions not shown • NameNode EDEK cache not shown
HDFS – Create Encryption Zone • 3. Create EZKEY • /encrypted_dir xattr: EZKEYNAME • EZKEYNAME = KEY
HDFS – Create Encrypted File • 1. Create file • 2. Create EDEK • 3. Create EDEK • 4. Store EDEK • 5. Return Success • /encrypted_dir/file encrypted data • /encrypted_dir/file xattr: EDEK
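The metadata bookkeeping for creating a zone and then a file inside it can be sketched as follows. `ToyNameNode` is a hypothetical model, the xattr names are simply the labels from the slides, and the KMS is reduced to a callable passed in by the caller, so no real Hadoop API is implied.

```python
from typing import Any, Callable, Dict

# Attribute names as labelled on the slides; the real xattr names may differ.
XATTR_EZKEYNAME = "EZKEYNAME"
XATTR_EDEK = "EDEK"


class ToyNameNode:
    """Tracks only metadata: per-path extended attributes, never cleartext keys."""

    def __init__(self, generate_edek: Callable[[str], bytes]) -> None:
        self._generate_edek = generate_edek  # stands in for the call to the KMS
        self.xattrs: Dict[str, Dict[str, Any]] = {}

    def create_zone(self, directory: str, ezkey_name: str) -> None:
        # Only the NAME of the EZKEY is recorded on the zone directory.
        self.xattrs[directory] = {XATTR_EZKEYNAME: ezkey_name}

    def create_file(self, directory: str, filename: str) -> str:
        ezkey_name = self.xattrs[directory][XATTR_EZKEYNAME]
        edek = self._generate_edek(ezkey_name)  # steps 2-3: create EDEK
        path = f"{directory}/{filename}"
        self.xattrs[path] = {XATTR_EDEK: edek}  # step 4: store EDEK
        return path                             # step 5: return success


def fake_kms(key_name: str) -> bytes:
    """Placeholder KMS: returns an opaque blob instead of a real EDEK."""
    return b"edek-for-" + key_name.encode()


nn = ToyNameNode(fake_kms)
nn.create_zone("/encrypted_dir", "KEY")
path = nn.create_file("/encrypted_dir", "file")
print(nn.xattrs)
```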
HDFS TDE – File Write Work Flow • /encrypted_dir/file xattr: EDEK • 3. Request DEK from EDEK & EZKEYNAME • 4. Decrypt DEK from EDEK • 5. Return DEK • /encrypted_dir/file • write encrypted data • read unencrypted data
HDFS TDE – File Read Work Flow • /encrypted_dir/file xattr: EDEK • 3. Request DEK from EDEK & EZKEYNAME • 4. Decrypt DEK from EDEK • 5. Return DEK • /encrypted_dir/file • read encrypted data • write unencrypted data
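The write and read work flows mirror each other: in both, the client looks up the EDEK, asks the KMS for the DEK, and then transforms the data itself, so storage only ever holds ciphertext. The sketch below models that from the client side. `ToyClient`, the `get_dek` callable standing in for the KMS round trip, the lookup table of DEKs, and the insecure `toy_cipher` are all assumptions made for illustration.

```python
import hashlib
from typing import Any, Callable, Dict


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy symmetric stream cipher (NOT secure); applying it twice restores the input."""
    out = bytearray()
    for i in range(0, len(data), 32):
        pad = hashlib.sha256(key + (i // 32).to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[i:i + 32], pad))
    return bytes(out)


class ToyClient:
    """Encrypts on write and decrypts on read, entirely on the client side."""

    def __init__(self, xattrs: Dict[str, Dict[str, Any]],
                 get_dek: Callable[[str, bytes], bytes]) -> None:
        self.xattrs = xattrs                # metadata as served by the NameNode
        self.get_dek = get_dek              # stands in for the KMS round trip
        self.stored: Dict[str, bytes] = {}  # stands in for the DataNodes

    def _dek_for(self, path: str) -> bytes:
        zone = path.rsplit("/", 1)[0]
        key_name = self.xattrs[zone]["EZKEYNAME"]
        edek = self.xattrs[path]["EDEK"]
        return self.get_dek(key_name, edek)  # steps 3-5: request, decrypt, return DEK

    def write(self, path: str, cleartext: bytes) -> None:
        self.stored[path] = toy_cipher(self._dek_for(path), cleartext)

    def read(self, path: str) -> bytes:
        return toy_cipher(self._dek_for(path), self.stored[path])


xattrs = {
    "/encrypted_dir": {"EZKEYNAME": "KEY"},
    "/encrypted_dir/file": {"EDEK": b"opaque-edek"},
}
# Hypothetical table of what a KMS would return for each (key name, EDEK) pair.
deks = {("KEY", b"opaque-edek"): b"\x01" * 32}

client = ToyClient(xattrs, lambda name, edek: deks[(name, edek)])
client.write("/encrypted_dir/file", b"hello big data")
print(client.read("/encrypted_dir/file"))
```

Note that the application calling `write` and `read` handles only cleartext, which is the sense in which the encryption is transparent.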
Bring in the Containers (i.e. Docker) • Issues with containers are the same for any virtualization platform • Multiple compute clusters • Multiple HDFS file systems • Multiple Kerberos realms • Cross-realm trust configuration
Containers as Virtual Machines • Note – this is not about using containers to run Big Data tasks:
Containers as Virtual Machines • This is about running Hadoop / Big Data clusters in containers:
Containers as Virtual Machines • A true containerized Big Data environment:
KDC Cross-Realm Trust • Different KDC realms for corporate, data, and compute • Must interact correctly in order for the Big Data cluster to function CORP.ENTERPRISE.COM End Users COMPUTE.ENTERPRISE.COM Hadoop/Spark Service Principals DATALAKE.ENTERPRISE.COM HDFS Service Principals
KDC Cross-Realm Trust • Different KDC realms for corporate, data, and compute • One-way trust • Compute realm trusts the corporate realm • Data realm trusts corporate realm • Data realm trusts the compute realm
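The one-way trust relationships listed here form a small directed graph, and checking whether a service accepts a principal is a lookup in that graph. The sketch below is a simplified model, not Kerberos itself: it considers only direct trust (no transitive paths), and the `hdfs@DATALAKE.ENTERPRISE.COM` principal used in the example is hypothetical.

```python
from typing import Dict, Set

CORP = "CORP.ENTERPRISE.COM"
COMPUTE = "COMPUTE.ENTERPRISE.COM"
DATALAKE = "DATALAKE.ENTERPRISE.COM"

# Trusting realm -> realms whose principals it accepts (one-way edges).
TRUSTS: Dict[str, Set[str]] = {
    COMPUTE: {CORP},            # compute realm trusts the corporate realm
    DATALAKE: {CORP, COMPUTE},  # data realm trusts corporate and compute realms
    CORP: set(),                # corporate realm trusts no other realm
}


def realm_of(principal: str) -> str:
    """Extract the realm from a principal such as 'user@CORP.ENTERPRISE.COM'."""
    return principal.rsplit("@", 1)[1]


def accepts(service_realm: str, principal: str) -> bool:
    """True if a service in `service_realm` accepts `principal` (direct trust only)."""
    client_realm = realm_of(principal)
    return (client_realm == service_realm
            or client_realm in TRUSTS.get(service_realm, set()))


print(accepts(DATALAKE, "rm@COMPUTE.ENTERPRISE.COM"))    # data realm trusts compute
print(accepts(COMPUTE, "hdfs@DATALAKE.ENTERPRISE.COM"))  # trust is one-way
```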
KDC Cross-Realm Trust • CORP.ENTERPRISE.COM Realm • KDC: CORP.ENTERPRISE.COM • user@CORP.ENTERPRISE.COM • COMPUTE.ENTERPRISE.COM Realm • KDC: COMPUTE.ENTERPRISE.COM • Hadoop Cluster • rm@COMPUTE.ENTERPRISE.COM • DATALAKE.ENTERPRISE.COM Realm • KDC: DATALAKE.ENTERPRISE.COM • HDFS: hdfs://remotedata/ • Hadoop Key Management Service
Key Management Service • Must be enterprise quality • Key Trustee Server • Java KeyStore KMS • Cloudera Navigator Key Trustee Server
Containers as Virtual Machines • A true containerized Big Data environment: • DataLake • CORP.ENTERPRISE.COM End Users • COMPUTE.ENTERPRISE.COM Hadoop/Spark Service Principals • DATALAKE.ENTERPRISE.COM HDFS Service Principals
Key Takeaways • Hadoop has many security layers • HDFS Transparent Data Encryption (TDE) is best of breed • Security is hard (complex) • Virtualization / containerization can make it harder • Compute and storage separation with virtualization / containerization can make it harder still
Key Takeaways • Be careful with a build vs. buy decision for containerized Big Data • Recommendation: buy one already built • There are turnkey solutions (e.g. BlueData EPIC) Reference: www.bluedata.com/blog/2017/08/hadoop-spark-docker-ten-things-to-know
@tapbluedata www.bluedata.com BlueData Booth #1508 in Strata Expo Hall