Garuda: A Cloud-based Job Scheduler • Ashish Patro, MinJae Hwang, Thanumalayan S., Thawan Kooburat
Agenda • Overview • Job Scheduler Characteristics • Scheduling Prototypes • Performance Data
Introduction • Key Idea • Centralized job scheduler in the Cloud • Benefits and Motivation • Simplified deployment and maintenance • Deploy only the worker daemon • Scalability of the Cloud • Infinite scalability and pay-as-you-go model • Simplified system design • Reliable services
Platform Choice • Amazon – EC2 • On-demand VMs • Google App Engine • Web application hosting platform • Reliable and scalable storage: DataStore • Automatic load balancing • Other services: • Memcached, Instant Messaging, Email, Cron Jobs, …
Job Scheduler Characteristics • Condor Job Scheduling (revisited) • [Figure: Condor architecture. Labels: Central Manager, Match Maker, ClassAd Storage, Negotiator, Collector, Schedd, Job Queue, Startd, Worker Node]
Job Scheduler Characteristics • A job scheduler needs to process a large amount of data • [Figure. Labels: Job Request, Daemon Process, Google App Engine, Servlet, Memcache, Datastore]
Memory Hierarchy • [Comparison chart; cell values not recoverable. Properties compared: Volatile, Serialization Cost, Query, Global Namespace. Storage tiers: Local memory (static variable), Memcached, DataStore]
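One way to read the three tiers on this slide is as a read-through lookup: try process-local memory first, then the shared cache, then durable storage. The sketch below illustrates that idea only. `FakeMemcache`, `FakeDatastore`, `LOCAL`, and `lookup` are hypothetical stand-ins written for this example; they are not the App Engine APIs, and the slides do not say the authors structured their code this way.

```python
import pickle

class FakeMemcache:
    """In-memory stand-in (assumption) for a shared, volatile cache.
    Values are pickled to mimic the serialization cost of leaving the process."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        blob = self._data.get(key)
        return None if blob is None else pickle.loads(blob)

    def set(self, key, value):
        self._data[key] = pickle.dumps(value)

class FakeDatastore:
    """In-memory stand-in (assumption) for durable storage; counts reads."""
    def __init__(self, rows):
        self._rows = dict(rows)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._rows.get(key)

# Process-local tier: no serialization, but private to one process
# and lost whenever that process is killed.
LOCAL = {}

def lookup(key, memcache, datastore):
    """Return the value for key, filling the faster tiers on the way back."""
    if key in LOCAL:
        return LOCAL[key]
    value = memcache.get(key)
    if value is None:
        value = datastore.get(key)
        if value is not None:
            memcache.set(key, value)
    if value is not None:
        LOCAL[key] = value
    return value
```

After a process restart (modelled here by clearing `LOCAL`), the shared cache still answers the lookup, so the durable tier is not read again.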
Scheduling Prototypes • [Figure. Prototypes: Batch ClassAd, Online ClassAd, Online + Batch, DB Query. Storage tiers shown: Local memory (static variable), Memcached, DataStore. Remaining labels: M, C, and J markers]
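The slide only names the prototypes, so the following is a rough sketch of what "online" and "batch" matchmaking could mean, under stated assumptions: online matching handles one job as it arrives, and batch matching sweeps the whole queue in one pass. The dictionary layout, the `requires` field, and the function names are all hypothetical; real ClassAds use a richer expression language than the minimum-value checks used here.

```python
def matches(job, machine):
    """True when the machine meets every minimum the job asks for.
    Simplified stand-in (assumption) for ClassAd requirement evaluation."""
    return all(machine.get(attr, 0) >= minimum
               for attr, minimum in job["requires"].items())

def match_online(job, idle_machines):
    """Online: match a single job on arrival and claim the first fit.
    The claimed machine is removed from idle_machines."""
    for i, machine in enumerate(idle_machines):
        if matches(job, machine):
            return idle_machines.pop(i)
    return None

def match_batch(queued_jobs, idle_machines):
    """Batch: one pass over the whole queue, in queue order.
    Returns a mapping of job id to the name of the claimed machine."""
    assignments = {}
    for job in queued_jobs:
        machine = match_online(job, idle_machines)
        if machine is not None:
            assignments[job["id"]] = machine["name"]
    return assignments
```

A combined "Online + Batch" variant could call `match_online` on arrival and leave unmatched jobs for the next `match_batch` pass, though the slides do not confirm that this is how the prototype worked.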
DataStore • DataStore API calls consume many CPU cycles • Easy to reach the hard DataStore limit (20 CPU-sec/sec) • Storing each ClassAd takes 0.22 CPU-sec • DataStore rejects requests on high-contention cells • DataStore is faster for retrieving large amounts of data • Query predicates have to match pre-defined indices
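The two figures on this slide imply a ceiling on how fast ClassAds can be stored. The short calculation below works it out, assuming ClassAd writes are the only DataStore consumer and that cost scales linearly; the function and constant names are made up for this example.

```python
DATASTORE_CPU_BUDGET = 20.0    # CPU-seconds per wall-clock second (limit from the slide)
CPU_PER_CLASSAD_WRITE = 0.22   # CPU-seconds to store one ClassAd (from the slide)

def max_writes_per_second(budget=DATASTORE_CPU_BUDGET, cost=CPU_PER_CLASSAD_WRITE):
    """Upper bound on ClassAd writes per second before hitting the limit."""
    return budget / cost

def seconds_to_store(n_classads, budget=DATASTORE_CPU_BUDGET, cost=CPU_PER_CLASSAD_WRITE):
    """Minimum wall-clock time to store n_classads at the CPU limit."""
    return n_classads * cost / budget

if __name__ == "__main__":
    print(f"{max_writes_per_second():.1f} ClassAd writes/sec at the limit")
    print(f"{seconds_to_store(1000):.1f} s to store 1000 ClassAds")
```

Under these assumptions the ceiling is about 90 ClassAd writes per second, so refreshing the ads of 1000 machines would take at least 11 seconds of the full budget.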
Memcache • Memcached size limit • 10K entries of 5K ClassAds • Memcached latency • Retrieving 1.5K entries of 5K ClassAds takes 30 secs (about 20 ms per entry) • Memcached parallelism • Multiple concurrent requests to memcached do not degrade performance • Provides only a get/set interface • Cannot traverse/query memcached
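Because the cache offers only get/set and cannot be traversed, one common workaround is to keep a separate index entry listing the keys of interest. The sketch below shows that pattern against a hypothetical `GetSetCache` stand-in; the key names and helper functions are assumptions for illustration, and the slides do not state that the authors used this approach.

```python
class GetSetCache:
    """In-memory stand-in (assumption) for a cache exposing only get and set."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

INDEX_KEY = "classad:index"  # hypothetical key holding the list of ClassAd names

def put_classad(cache, name, ad):
    """Store one ClassAd and record its name in the index entry.
    Note: this read-modify-write of the index is not atomic, so concurrent
    writers could lose updates without extra coordination."""
    cache.set("classad:" + name, ad)
    names = cache.get(INDEX_KEY) or []
    if name not in names:
        cache.set(INDEX_KEY, names + [name])

def all_classads(cache):
    """Emulate traversal: read the index, then fetch each entry by key."""
    ads = {}
    for name in cache.get(INDEX_KEY) or []:
        ad = cache.get("classad:" + name)
        if ad is not None:  # an entry may have been evicted since it was indexed
            ads[name] = ad
    return ads
```

Each traversal costs one get for the index plus one get per entry, which is where the per-entry latency on the slide would add up.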
Hosting Platform • GAE dynamically spawns JVM processes • Spawns a new process only when all existing processes are busy • Each JVM has only 1 thread • Maximum of 10 JVM processes for a free account • 10 requests at a time • Memory limit per JVM is around 110 MB • JVM processes are short-lived • Killed after 110 seconds idle • Use a Cron job to keep JVM processes alive
Other Useful Services • Instant messaging: XMPP • Convenient communication protocol • Between CM and worker • Between CM and users (Google Talk) • Email • Offline notification
Testimonial • Google App Engine is a crippled J2EE+RDBMS • Scales with money • Good testing platform • Looks promising in the documentation
[Figure. Labels: Job Request, Daemon Process, Google App Engine, Servlet, Memcache, Datastore]
[Figure: Condor architecture. Labels: Central Manager, Match Maker, ClassAd Storage, Negotiator, Collector, Schedd, Job Queue, Startd, Worker Node]