Flint: Making Sparks (and Sharks and HDFSs too!) Jim Donahue | Principal Scientist | Adobe Systems Technology Lab
Flint: Bring BDAS to the AWS Masses @ Adobe • How to effectively evangelize BDAS @ Adobe? • Looking for intrepid, curious users who want to experiment • Curiosity is always tempered by the cost of startup • Most of the data for experimental applications is likely already in AWS
Flint: Design Principles • Shared Nothing • Get your own AWS account and go • Simple Configuration • Write a little JSON, run a couple of scripts • Efficient, flexible scaling • As simple or complex as you want/need • Full access to tools • Batch, Spark/Shark shells, Shark Server, web UIs, … • Access to all the Spark/Shark tuning parameters • Very simple hardwired “spark-env.sh” • Tuned to Adobe environment • Port choices determined by our firewall
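The hardwired "spark-env.sh" might look roughly like the sketch below. The variable names are standard Spark standalone-mode settings; the values and port numbers are placeholders, not Adobe's actual firewall-driven choices.

```shell
# Hypothetical spark-env.sh sketch -- variable names are standard
# Spark standalone settings; the values are illustrative placeholders,
# not Adobe's actual (firewall-determined) port choices.
export SPARK_MASTER_PORT=7077        # fixed port so the firewall can allow it
export SPARK_MASTER_WEBUI_PORT=8080  # master web UI
export SPARK_WORKER_PORT=7078        # fixed worker port, same reason
export SPARK_WORKER_WEBUI_PORT=8081  # worker web UI
export SPARK_WORKER_MEMORY=4g        # tuned to the slave instance type
```

Hardwiring the ports (rather than letting Spark pick ephemeral ones) is what makes the firewall configuration predictable.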
Flint: Architecture • Local Spark/Shark and slaves can use S3 storage for files • Remote access runs shells on the SSH server • Components use S3 and SimpleDB for state management • Flint distributes shared AWS credentials among components • Flint manages master and SSHServer startup • Slave elasticity managed by the master, can leverage spot pricing [Diagram: local Flint server, Spark master, SSHServer (shells), Spark slave(s), S3, AWS SimpleDB]
Flint: Setup • Flint instance manages encrypted AWS credentials • Create S3 buckets to hold JAR files • Create SimpleDB tables to hold state • Create key pair and security group for instances [Diagram: local Flint server, S3, AWS SimpleDB]
Flint: Provisioning • Define clusters through a JSON spec (“master instance configuration is x, slave instance configuration is y, scaling rule is …”) • Define configurations through a JSON spec (“Spark master uses AMI x, running service y, with properties a, b, …”) and a JAR file containing services code • A “getting started” set of clusters and configurations is provided • An AMI is provided with all the requisite Spark / Shark / Hadoop / Kafka bits [Diagram: local Flint server, S3, AWS SimpleDB]
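A cluster spec and a configuration spec in the spirit of this slide might look like the sketch below. Every field name here is illustrative — the talk does not show Flint's actual schema, only that clusters and configurations are separate JSON documents.

```python
import json

# Hypothetical JSON specs in the spirit of the slides -- all field
# names are made up for illustration, not Flint's actual schema.
cluster_spec = json.loads("""
{
  "name": "getting-started",
  "master": {"configuration": "spark-master", "instanceType": "m1.large"},
  "slave":  {"configuration": "spark-worker", "instanceType": "m1.xlarge"},
  "scaling": {"min": 1, "max": 8, "useSpotInstances": true}
}
""")

config_spec = json.loads("""
{
  "name": "spark-master",
  "ami": "ami-xxxxxxxx",
  "services": ["SparkMasterService", "ScalingService"],
  "properties": {"spark.executor.memory": "4g"}
}
""")

# A cluster names the configurations its instances run; the
# configuration names the AMI, services, and tuning properties.
print(cluster_spec["master"]["configuration"])  # -> spark-master
print(cluster_spec["scaling"]["max"])           # -> 8
```

Splitting cluster shape (instance types, scaling) from configuration (AMI, services, properties) is what lets the “getting started” configurations be reused across differently sized clusters.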
Flint: Cluster Start • Local Flint instance launches the “master” instance (using the cluster definition in SimpleDB) • Master reads SimpleDB and S3 for configuration and code, installs master services • Starting the services launches the Spark and/or HDFS masters through the command line • Master puts its “connect URL” in SimpleDB [Diagram: local Flint server, Spark master, S3, AWS SimpleDB]
Flint: Slave(s) Start • Master “scaling service” launches slave instance(s) • Slave reads SimpleDB and S3 for configuration and code, installs worker services • Slave gets the master “connect URL” from SimpleDB • Slave launches Spark and/or HDFS workers through the command line [Diagram: local Flint server, Spark master, Spark slave(s), S3, AWS SimpleDB]
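The connect-URL handshake in the last two slides can be sketched as below, with a plain dict standing in for the SimpleDB domain. The attribute names and hostnames are made up for illustration; only the protocol shape (master advertises, slave polls) comes from the slides.

```python
# Sketch of the master/slave connect-URL handshake, with a dict
# standing in for a SimpleDB domain. Names are illustrative only.
simpledb = {}  # stand-in for the AWS SimpleDB state table

def master_startup(cluster_name, host, port=7077):
    """Master starts Spark, then advertises its connect URL in SimpleDB."""
    connect_url = f"spark://{host}:{port}"
    simpledb[(cluster_name, "connectUrl")] = connect_url
    return connect_url

def slave_startup(cluster_name):
    """Slave reads the master's connect URL, then starts a worker."""
    url = simpledb.get((cluster_name, "connectUrl"))
    if url is None:
        raise RuntimeError("master not registered yet -- retry later")
    # ...here the real slave would launch a Spark worker against `url`
    # through the command line...
    return url

master_startup("demo-cluster", "ec2-10-0-0-1.compute-1.amazonaws.com")
print(slave_startup("demo-cluster"))
```

Because the rendezvous goes through SimpleDB rather than direct addressing, the scaling service can launch slaves without knowing where the master landed.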
Flint: Client Start • Flint instance launches a “client” instance (using the cluster definition in SimpleDB) • Client reads SimpleDB and S3 for configuration and code, installs (SSHServer) services • Client reads SimpleDB for authentication info and the master connect URL • Service startup starts the SSHServer connected to the right “shell factory” [Diagram: local Flint server, Spark master, SSHServer (shells), Spark slave(s), S3, AWS SimpleDB]
Flint: Client Connect (Remote Shells) • Flint server finds an “appropriate client” • An SSH client is launched to connect • The SSHServer connects to the master on the client’s behalf [Diagram: local Flint server, Spark master, SSHServer (shells), Spark slave(s), S3, AWS SimpleDB]
Flint: Client Asynchronous Requests • Flint clients can also make asynchronous requests • Each Flint master runs a service that pulls requests from an SQS queue • Request progress/results are stored in SimpleDB • Requests include: • Move data between HDFS and S3 • Mount an EBS volume and cache it in HDFS (AWS public data sets) • Run a batch job • A client can make a request even if the cluster is not alive • Simplifies startup sequencing • Can use monitoring of “cluster queues” to start a cluster “on demand”
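The request service described above can be sketched as a simple poll loop. Here `queue.Queue` stands in for the cluster's SQS queue and a dict for the SimpleDB results table; the request shapes are illustrative, not Flint's actual protocol.

```python
import queue

# Sketch of the asynchronous-request service: the master pulls requests
# from a queue (SQS in real Flint) and records progress/results in a
# store (SimpleDB in real Flint). Request fields are made up.
requests = queue.Queue()   # stand-in for the cluster's SQS queue
results = {}               # stand-in for the SimpleDB results table

def handle(request):
    kind = request["type"]
    if kind == "s3-to-hdfs":
        return f"copied {request['src']} to {request['dst']}"
    if kind == "batch-job":
        return f"ran {request['jar']}"
    raise ValueError(f"unknown request type: {kind}")

def poll_once():
    """One iteration of the master's request-pulling service."""
    try:
        req = requests.get_nowait()
    except queue.Empty:
        return False
    results[req["id"]] = {"status": "done", "result": handle(req)}
    return True

# A client can enqueue a request whether or not the cluster is alive;
# the request just waits in the queue until a master polls it.
requests.put({"id": "r1", "type": "s3-to-hdfs",
              "src": "s3://bucket/data", "dst": "hdfs:///data"})
poll_once()
print(results["r1"]["status"])  # -> done
```

Decoupling clients from the cluster through the queue is what enables the "on demand" pattern: a monitor that sees a non-empty queue for a dead cluster can simply start that cluster.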
Flint: Where We Are Now • Have some intrepid, curious users • The big issue is always “Do I really want to use Spark/Shark?” • SQL is a big selling point • Scala is a mild put-off • Spark Streaming may help settle the issue • Open Sourcing is under discussion • If you’re interested, let me know!