Learn how to deploy a secure Spark cluster for sensitive data analytics, ensure HIPAA compliance, and automate security mechanisms with Sahara on OpenStack. Discover the safeguards and design of a secure Spark service with detailed privacy and security rules under HIPAA.
Automating the Deployment of a Secure, Multi-user HIPAA Enabled Spark Cluster using Sahara • Michael Le, Jayaram Radhakrishnan, Daniel Dean, Shu Tao • OpenStack Summit, Austin, 2016
Analytics on Sensitive Data • Personal sensitive data: • Health information, e.g., lab reports, medication records, hospital bills, etc. • Financial information, e.g., income, investment portfolio, bank accounts, etc. • Analytics platform: Secure Hadoop YARN • Regulations: • Numerous governmental regulations ensure data privacy, integrity, and access control, e.g., HIPAA/HITECH, PCI, etc. • They require generation of audit trails: “who has access to what data and when?”
Challenging to Deploy Secure Analytics Platform • Mapping regulations to security mechanisms is not always straightforward • Security mechanisms have many items to configure: • User accounts, access control lists, keys, certificates, key/certificate databases, security zones, etc. • Deployment needs to be fast and repeatable • Approach: use Sahara to expose security building blocks and automate deployment • Prototype with Spark on YARN • Focus on HIPAA
Outline • Introduction • HIPAA and Spark platform • Deploying with Sahara • Automating security enablement • Supporting multi-user cluster • Supporting both Spark/Hadoop jobs using Sahara API • Supporting deployment on SoftLayer • Summary, lessons and improvements
Health Insurance Portability and Accountability Act (HIPAA) • What data is protected? • Any Protected Health Information (PHI): e.g., medication, diseases, health-related measurements, etc. • Who does this apply to? • Entities that access or store the data, e.g., health care clearinghouses, health care providers, and business associates • HIPAA has two rules: Privacy and Security Rules • Privacy rules protect privacy of PHI, access rights, and use of PHI • Security rules operationalize the Privacy rules for electronic PHI (e-PHI) • In general (must maintain with respect to PHI): • Confidentiality, integrity, availability • Must protect against anticipated threats to security and integrity • Must protect against anticipated impermissible uses or disclosures • Ensure compliance by workforce • Continuous risk analysis and management (extension with HITECH) • Mechanisms to log and audit access to PHI (Breach Notification Rule) • Continuously evaluate and maintain appropriate security protections (Enforcement Rule)
HIPAA Safeguards for Security Rules • Administrative Safeguards • Physical Safeguards • Technical Safeguards • Access controls - ensure only authorized persons can access PHI • Audit controls - record and examine access and other activities in information systems containing PHI • Integrity controls - ensure and confirm PHI not improperly altered or destroyed • Transmission security - prevent unauthorized access to PHI being transmitted over network
Secure Spark Service • Goal: Allow Spark to be pluggable into HIPAA-enabled platform • Assumptions • Two types of users: trusted admin and user of Spark • Multi-user (multiple doctors in Hospital A running analytics on multiple patients) • One YARN cluster per customer (e.g. Hospital A, Institution B) • YARN clusters created by different customers are not necessarily separated by VLANs • External data sources should provide drivers to encrypt data ingestion and outputs
Secure Spark Service – Design • Spark Cluster • Each user has a separate account on the cluster • Spark executors run inside secure containers created under the credentials of the authenticated user • Data files isolated from different users • Isolation of compute resources among containers, min/max CPU and RAM commitment • Security among Spark components: • Encrypt communication among Spark executors using SSL and data shuffling via SASL • Spark components authenticate among each other using shared secrets distributed by YARN • Auto renewal of Kerberos credentials to access HDFS • Secure access to Spark: • Access to YARN components must go through Kerberos; prevents unauthorized VM joins, impersonations, resource access, job submissions [Diagram: Sahara (Spark Service Manager) exposes a Cluster Mgmt API and a Job Mgmt API to a Spark cluster in which Spark executors run in secure containers on YARN + HDFS (Resource Manager, Name Node, Node Managers, Data Nodes) inside a Kerberos realm]
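The Spark-side settings in this design can be sketched as a small generator for spark-defaults.conf lines. This is only an illustration, not the configuration the authors shipped: the function name and the keystore/truststore paths are hypothetical, and the property names are the ones documented for Spark 1.x security (password properties are deliberately left out).

```python
def spark_security_conf(keystore_path, truststore_path):
    """Return spark-defaults.conf lines for the hardening described on this slide.

    Hypothetical helper: the paths are supplied by the caller and the
    keystore/truststore passwords are intentionally omitted.
    """
    props = {
        # Shared-secret authentication; on YARN the secret is generated
        # and distributed to the executors automatically.
        "spark.authenticate": "true",
        # SASL encryption for authenticated connections (Spark 1.x property).
        "spark.authenticate.enableSaslEncryption": "true",
        # SSL for Spark's internal communication.
        "spark.ssl.enabled": "true",
        "spark.ssl.keyStore": keystore_path,
        "spark.ssl.trustStore": truststore_path,
    }
    return ["%s %s" % (key, value) for key, value in sorted(props.items())]


if __name__ == "__main__":
    for line in spark_security_conf("/etc/spark/ssl/keystore.jks",
                                    "/etc/spark/ssl/truststore.jks"):
        print(line)
```

Keeping the properties in one dictionary makes it easy to switch individual mechanisms on or off per cluster template, which is the kind of componentization the deck argues for later.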
Sahara Provisioning Flow 0. Create base VM image, upload to Glance, register with Sahara. 1. Define VM and cluster templates. [Diagram, steps 2-5: launch through the Sahara Interface to the Sahara Provisioning Engine; Heat template; commands issued to OpenStack components (Nova, Neutron, Glance); VMs instantiated; SSH to VMs to start daemons and configure nodes]
Sahara on SoftLayer – Creating Node/Cluster Template [Screenshot callouts: HDFS Encryption; Secure Mode; Kerberos Server Location; Kerberos; KMS; iPython; Spark Job Server]
Sahara on SoftLayer - Launching Cluster [Screenshot callouts: credentials to IaaS; base VM image to use]
PoC - HIPAA Enabled Spark Service [Diagram components: Horizon; KeyStone; Sahara; Heat (w/ SoftLayer resource extension); cluster for analytics on SoftLayer CCI; SoftLayer Swift Object Storage]
Sahara Provisioning Flow – SoftLayer 0. Create base VM image, upload to SoftLayer. 1. Define VM and cluster templates. [Diagram, steps 2-6: launch through the Sahara Interface to the Sahara Provisioning Engine; Heat template; SL::Compute::VM Heat resource plugin; SL Python client bindings; VMs created on SoftLayer; SSH to VMs to start daemons and configure nodes]
Automating Secure Spark/YARN Deployment • Steps performed through Sahara: • Create VM image with base binaries • Create VM node and cluster templates • Instantiate VMs using the IaaS API • Configure security options for YARN and Spark • Create Kerberos credentials and SSL certificates • Set up user accounts and per-user local/HDFS folders [Diagram: Sahara configuring the Master VM, Worker VMs, and Kerberos Server]
Preparing VM Image for Deployment • VM image for Spark cluster: • Hadoop YARN 2.7.1 • Spark 1.5.2 • Kerberos • Spark Job Server • Jupyter (iPython) • Current: creating the VM image for SoftLayer is a manual process • Future: extend Disk Image Builder and automate uploading to SoftLayer
Authentication using Kerberos [Diagram: users or client processes; Kerberos Server with Authentication Server and Ticket Granting Server; Key Manager (KDC); YARN/Hadoop service] Flow: 1. Ticket request 2. User lookup 3. TGS session key, TGT 4. Service request 5. Service lookup 6. Service session ticket 7. Access service • Users and YARN/Hadoop processes are authenticated with keytabs (encrypted keys associated with principal passwords) • Keytabs are created while installing YARN/Hadoop and stored on the local file system
Kerberos Server Configuration [Diagram: Sahara managing Spark Cluster 1, Spark Cluster 2, ..., Spark Cluster N, each with a Master VM and Worker VMs, showing where the Kerberos Server is placed relative to the clusters]
Security Configuration Operations
• Create cluster
  • Per node:
    • Configure for Kerberos:
      • Fix up Kerberos configuration: /etc/krb5.conf
      • Create principals: host and each service (RM, NM, NN, DN, KMS)
      • Create keytabs for each service/host pair
    • Configure secure container:
      • Fix up Hadoop configuration
      • Set correct permissions on Hadoop files and folders
  • Per user:
    • Configure for Kerberos:
      • Create principals: each user in cluster
      • Create keytabs for each principal
      • Copy keytabs to node that interacts with YARN
    • Create user account on all nodes
    • Create HDFS directory
    • Generate HDFS encryption key and upload to KMS
  • Per cluster:
    • Configure SSL:
      • On ‘master’ node:
        • Create SSL keystore for each node by obtaining a certificate from CA (self-signed for testing)
        • Create single SSL truststore for entire cluster
      • Copy keystore to respective node
      • Copy truststore to all nodes
• Scale out
  • Configure Kerberos:
    • Perform “Create cluster, per node” operations
    • Copy user keytabs to node if needed to interact with YARN
  • Configure SSL:
    • Per node being added:
      • Create SSL keystore for each node
      • Copy keystore to respective node
• Scale in
  • Configure Kerberos:
    • Per node being removed: remove all principals associated with node
[Diagram: Sahara, Master VM, Worker VMs, Kerberos Server]
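The per-node Kerberos steps above (create principals, create keytabs, remove principals on scale in) could be driven by a command generator along these lines. It is a sketch under assumptions, not the authors' Sahara code: the short service names, the keytab directory, and both function names are hypothetical, while `addprinc -randkey`, `ktadd -k`, and `delprinc -force` are standard MIT Kerberos kadmin commands.

```python
# Hypothetical short names for the services listed on the slide
# (RM, NM, NN, DN, KMS); real deployments choose their own.
SERVICES = ("rm", "nm", "nn", "dn", "kms")


def node_principals(hostname, realm):
    """Principals for one node: the host itself plus each service."""
    names = ("host",) + SERVICES
    return ["%s/%s@%s" % (name, hostname, realm) for name in names]


def kerberos_node_commands(hostname, realm,
                           keytab_dir="/etc/security/keytabs"):
    """kadmin.local invocations that create principals and keytabs for a node.

    keytab_dir is an assumed location, not taken from the slides.
    """
    commands = []
    for principal in node_principals(hostname, realm):
        service = principal.split("/", 1)[0]
        keytab = "%s/%s.keytab" % (keytab_dir, service)
        commands.append(
            ["kadmin.local", "-q", "addprinc -randkey %s" % principal])
        commands.append(
            ["kadmin.local", "-q", "ktadd -k %s %s" % (keytab, principal)])
    return commands


def kerberos_remove_node_commands(hostname, realm):
    """Scale in: remove every principal associated with the node."""
    return [["kadmin.local", "-q", "delprinc -force %s" % principal]
            for principal in node_principals(hostname, realm)]


if __name__ == "__main__":
    for argv in kerberos_node_commands("worker1.example.com", "EXAMPLE.COM"):
        print(" ".join(argv))
```

Returning argument lists instead of executing them keeps the sketch testable; a provisioning engine would run each list on the Kerberos server, for example over SSH.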
Multi-user Cluster Requirements • A single ‘Spark’ user is not sufficient for privacy, because one user's analytics job can potentially access other users’ data • Expose a new API to make on-boarding of new users easier: • Create user account on each node of cluster • Create per-user local directories and HDFS secure directories
Multi-user Cluster Support • Extend REST API in v10.py: /clusters/add_user/<user_name> • This operation creates a new user account on the specified cluster: • Create new Unix user account on each host • Create HDFS directory and secure zone key • Create and distribute Kerberos key • Specific to our use case: /clusters/add_user/<user_name_port> • Creates a new user account on the specified cluster and starts a new iPython server process • Returns IP of node running iPython • Challenge: Sahara hardcodes the ‘hadoop’ user!
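What an add_user handler has to do on the cluster can be sketched as an ordered plan of commands. This is an assumed outline rather than the authors' implementation: the function name, the key name, the HDFS home path, the keytab path, and the "hdfs"/"kdc" target labels are all hypothetical. The commands themselves (`useradd`, `hadoop key create`, `hdfs crypto -createZone`, kadmin `addprinc`/`ktadd`) are standard tools.

```python
def add_user_plan(user, hosts, realm):
    """Return (target, argv) pairs for on-boarding one user.

    'target' is either a host name, or the hypothetical labels "hdfs"
    (a node with HDFS client access) and "kdc" (the Kerberos server).
    """
    home = "/user/%s" % user                            # assumed HDFS layout
    key = "%s-zone-key" % user                          # assumed key name
    keytab = "/etc/security/keytabs/%s.keytab" % user   # assumed path
    principal = "%s@%s" % (user, realm)

    # 1. Unix account on every host of the cluster.
    plan = [(host, ["useradd", "-m", user]) for host in hosts]

    # 2. HDFS directory and encryption zone; the key must exist in the KMS
    #    and the directory must be empty before the zone is created.
    plan += [
        ("hdfs", ["hadoop", "key", "create", key]),
        ("hdfs", ["hdfs", "dfs", "-mkdir", "-p", home]),
        ("hdfs", ["hdfs", "crypto", "-createZone",
                  "-keyName", key, "-path", home]),
        ("hdfs", ["hdfs", "dfs", "-chown", user, home]),
    ]

    # 3. Kerberos principal and keytab, to be copied afterwards to the
    #    node that interacts with YARN.
    plan += [
        ("kdc", ["kadmin.local", "-q", "addprinc -randkey %s" % principal]),
        ("kdc", ["kadmin.local", "-q",
                 "ktadd -k %s %s" % (keytab, principal)]),
    ]
    return plan


if __name__ == "__main__":
    for target, argv in add_user_plan("alice", ["master", "worker1"],
                                      "EXAMPLE.COM"):
        print("%-8s %s" % (target, " ".join(argv)))
```

Separating the plan from its execution makes the ordering constraints visible and lets the same plan be dry-run before touching a live cluster.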
Submit Spark Jobs on YARN Through Sahara • Extend Sahara to allow both Spark and Hadoop jobs to be submitted to a vanilla Hadoop cluster • No changes to REST API in Sahara • Implementation: detect the Spark job type and construct the spark-submit string based on the plugin “engine” name • On cluster creation: configure spark-env.sh to point to Hadoop’s configuration files • Modify: plugins/vanilla/vx_x_x/versionhandler.py • Return Spark EDP engine when job type is Spark • Modify: run_job() in Spark EDP engine.py: • Check cluster type (plugin name, e.g., vanilla or stand-alone Spark or even Mesos!) • Craft the spark-submit string to match the platform
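Crafting the spark-submit string per platform might look roughly like the following. This is a sketch, not the modified run_job() from Sahara: the function name and signature are hypothetical, "mesos" is an assumed plugin name, and the ports are the usual defaults for a standalone Spark master (7077) and a Mesos master (5050). The `yarn-cluster` master value is the Spark 1.x form that matches the Spark 1.5.2 image used in this deck.

```python
def build_spark_submit(plugin_name, app_jar, main_class,
                       master_host=None, app_args=()):
    """Build a spark-submit argument list that matches the cluster type."""
    if plugin_name == "vanilla":
        # Vanilla Hadoop cluster: submit to YARN (Spark 1.x master value).
        master = "yarn-cluster"
    elif plugin_name == "spark":
        # Stand-alone Spark cluster.
        master = "spark://%s:7077" % master_host
    elif plugin_name == "mesos":
        # Hypothetical plugin name for a Mesos-backed cluster.
        master = "mesos://%s:5050" % master_host
    else:
        raise ValueError("unsupported plugin: %s" % plugin_name)

    return (["spark-submit", "--class", main_class, "--master", master,
             app_jar] + list(app_args))


if __name__ == "__main__":
    argv = build_spark_submit("vanilla", "analytics.jar",
                              "org.example.Analytics", app_args=["input"])
    print(" ".join(argv))
```

Because only the `--master` value changes between platforms, the REST API can stay the same while the engine picks the right target from the plugin name.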
Supporting Deployment on SoftLayer • Modification to Sahara at Heat template creation • Create new Heat plugin: SL_compute • Create new resource type: SL::Compute::VM • Communicates using SoftLayer’s (SL) REST API to create CloudLayer Computing Instances (CCIs) • Translates OpenStack VM properties to SL’s equivalents • Generate SL Heat template • Add new fields in Sahara to accept and use SoftLayer credentials, data centers, object stores, etc. • Add new Heat template resource files
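The translation from OpenStack VM properties to an SL::Compute::VM resource can be illustrated with a small mapping function. It is a sketch under assumptions: the function name and its arguments are hypothetical, the flavor is represented as a plain dictionary with Nova's `vcpus` and `ram` (MB) attributes, and the output property names follow the example Heat template output shown in this deck.

```python
import json


def to_softlayer_resource(name, flavor, image_id, datacenter, domain):
    """Map OpenStack-style VM properties onto an SL::Compute::VM resource.

    Hypothetical helper; only a subset of the template properties is set.
    """
    properties = {
        "host_name": name,
        "domain": domain,
        # The example template carries sizes as strings.
        "cpus": str(flavor["vcpus"]),
        "memory": str(flavor["ram"]),
        "image_id": image_id,
        "datacenter": datacenter,
    }
    return {name: {"Type": "SL::Compute::VM", "Properties": properties}}


if __name__ == "__main__":
    resource = to_softlayer_resource(
        name="spark-worker-001",
        flavor={"vcpus": 2, "ram": 4096},
        image_id="00000000-0000-0000-0000-000000000000",  # placeholder id
        datacenter="hou02",
        domain="example.com",
    )
    print(json.dumps({"Resources": resource}, indent=2, sort_keys=True))
```

The remaining properties from the example output (VLANs, SSH keys, credentials, user data) would be filled in the same way from the new SoftLayer fields added to Sahara.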
Example Sahara Heat Template Output
{
  "AWSTemplateFormatVersion" : "2010-09-09",
  "Description" : "Data Processing Cluster by Sahara on SoftLayer",
  "Resources" : {
    "test-spark1-6-0-1-allwkrbsec-001" : {
      "Type" : "SL::Compute::VM",
      "Properties" : {
        "host_name" : "test-spark1-6-0-1-allwkrbsec-001",
        "domain" : "whc.softlayer.com",
        "cpus" : "2",
        "memory" : "4096",
        "hourly" : "True",
        "disk_type" : "SAN",
        "datacenter" : "hou02",
        "nic_speed" : "1000",
        "image_id" : "ccca6a66-f8ce-4ff1-a647-a42bacfxdadsf",
        "public_vlan": "Spark-FE-1",
        "private_vlan": "Spark-BE-1",
        "ssh_keys": [],
        "sl_user": "user1",
        "sl_apikey": "xxxxxxxyyyyyyyyyyzzzzzzzzzzz",
        "user_data": { "Fn::Join" : ["\n", ["#!/bin/bash", "echo \"ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC5KsWE2BjPsz7ubVw+UDVtpCUsFiYOcKdxx2uEHem8qUGoF5HvIu0hTs43KiELB9v8PJ8XwOuJCBmKgW1unUkmaUfoF7fzMlhn77tq1KOc3+8sDdxdgYIGi+xGEASPQbBvh4MziTRx0cz6+PXRaUGvDqBUpXwbzyVCHVGlaLWd9oDsoqghUPPYRjBTKpztINasdfasdfasdfasdfasdfasdfasdfasdfasdfasdfadsfasdfP0PtC8RRnf1BsVR+gEXAG4mDpw/z/GCx5Fnwi4BLBiXm6DkRmNEKd6pD8mr+7aeXDdcDk7rR7YN8c7it9X/CDvfFMvLR+uP Generated by Sahara", "\" >> /root//.ssh/authorized_keys", ""]]}
      }
    }
  },
  "Outputs": {
    "test-spark1-6-0-1-allwkrbsec-001_ip" : {
      "Description": "IP address of the instance",
      "Value": { "Fn::GetAtt": ["test-spark1-6-0-1-allwkrbsec-001", "PublicIp"] }
    },
    "test-spark1-6-0-1-allwkrbsec-001_priv_ip" : {
      "Description": "Private IP address of the instance",
      "Value": { "Fn::GetAtt": ["test-spark1-6-0-1-allwkrbsec-001", "PrivateIp"] }
    },
    "test-spark1-6-0-1-allwkrbsec-001_password" : {
      "Description": "login password for the instance",
      "Value": { "Fn::GetAtt": ["test-spark1-6-0-1-allwkrbsec-001", "passwords"] }
    }
  }
}
Summary • Showed how HIPAA requirements are mapped to the security mechanisms needed in an analytics cluster • Extended Sahara to provide an easy way to deploy a HIPAA enabled Spark cluster on the cloud • Automated security configuration (authentication, encryption, access control) • Provided the ability to add new users to a running system • Extended Sahara to submit both Spark and Hadoop jobs to a single vanilla YARN cluster • Extended Sahara to make use of the SoftLayer Heat extension
Lessons Learned • Not necessary to enable “all” security mechanisms to satisfy regulations • Enabling all security features drastically reduces performance • Not always clear what minimum “security” is needed to be compliant • Componentization of security mechanisms is useful! • Can we break down security features even more to allow fine-grained security mechanism selection and comparison? • Benefit: allows easy/fast testing of different combinations of security components and assessment of tradeoffs (performance vs. additional security) • Providing a few “desirable” cluster templates is very useful • Users are daunted by the myriad of options (even with a GUI)
Further Improvements • Sahara: • Further componentize security features and expose those features as part of cluster creation options, e.g., no authentication needed but encrypt all data on disk, specify amount of logs collected, etc. • Add new user account management mechanisms: • Delete users, change permissions, etc. • Remove assumption about hardcoded ‘hadoop’ user • API to retrieve data from analytics run • Better support for other clouds like SoftLayer • Security mechanisms: • Confirm integrity of data, e.g. automated checksum and verification • Manage keytab files better