Common Execution Infrastructure (CEI) Subsystem
OOI CI System Architecture Team:
CEI Developers
• Patrick Armstrong, CEI Developer, University of Chicago
• Pierre Riteau, CEI Senior Developer, University of Chicago
• John Bresnahan, CEI Developer, Argonne National Lab (part-time)
• Pierre Riteau, CEI Developer, University of Chicago (part-time)
8/31/2014
Subsystem Purpose
• Allow OOI applications and the OOI system to
  • Provide Highly Available (HA) services
  • Scale to demand
  • Enact OOI deployment policies in an elastic environment
• Provide a deployment foundation for OOI CI
CEI Scope
• Elastic Computing Services
  • Implement elastic computing services to provide on-demand scaling and high availability.
• Execution Engine Catalog & Repository Services
  • Work with operations and ITV to develop and refine tools to upload and sync the different deployable type representations adapted to each site.
• Process Management Services
  • Provide the management services for policy-based process execution within specified deployable types. These services are intended to support the data distribution services, so the processes are sequential and primarily require matching a process to a resource.
• Process Catalog & Repository Services
  • Maintain process definitions as well as lists of active processes.
• Integration with the National Computing Infrastructure
  • Provide the capability to deploy OOI processing on Amazon cloud services as well as on academic clouds.
High Availability and Scaling
• High Availability
  • Towards an always-on service model
  • Failures in outsourced resources
  • Providing a pool of replenishable compute resources
• Autoscaling
  • Provide resources for peaks in demand
  • Ensure good utilization during “valleys” in demand
  • Flexible resource mix
Resources for HA and Scaling
• Cloud resources are available on-demand, but any particular resource may fail at any time
• Applications/processes can absorb new resources
• Applications/processes can tolerate failures
• EPU Management: monitor and regulate set properties based on system-specific and application-specific metrics
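The "monitor and regulate" idea above can be sketched as a small reconciliation function. This is only an illustration under stated assumptions, not the CEI Decision Engine: the function names, the queue-length metric, and the instance limits are all hypothetical.

```python
import math


def desired_instances(queue_length, per_instance_capacity,
                      min_instances=1, max_instances=10):
    """Hypothetical policy: size the pool to the backlog, within fixed bounds."""
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    needed = math.ceil(queue_length / per_instance_capacity)
    # Clamp so a floor of workers stays up and a ceiling caps the cost.
    return max(min_instances, min(max_instances, needed))


def reconcile(running, failed, queue_length, per_instance_capacity, **limits):
    """Return (to_launch, to_terminate) for one regulation pass.

    Failed instances are not counted as healthy, so they are replaced
    (high availability) in the same pass that handles demand (scaling).
    """
    healthy = running - failed
    target = desired_instances(queue_length, per_instance_capacity, **limits)
    delta = target - healthy
    return (max(delta, 0), max(-delta, 0))
```

For example, with 4 instances of which 1 has failed, a backlog of 23 work items, and an assumed capacity of 5 items per instance, `reconcile(4, 1, 23, 5)` asks for 2 launches and no terminations. With 6 healthy instances and a backlog of 3, it asks for 5 terminations.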
Elastic Processing Unit (EPU) Management
[Diagram; labels: AMQP; EPU Management ×3; Provisioner; DTRS; Decision Engine; "create instance"; IaaS; CB; Other; EE ioncore 1.2; EE matlab 6.1; EE ioncore 1.3; ou-agent ×3; context-agent ×3]
Making the EPU HA
[Diagram; labels: AMQP; Other; "create instance"; Bootstrap EPU; Provisioner/DTRS; Dedicated DE; IaaS; cloudinit.d; ou-agent ×3; EPU Worker ×9]
Creating a Process I
[Diagram; labels: AMQP; Other; Process Dispatcher; "enter"; Process Instance Registry; EE type A instance; "launch"; ee-agent; Decision Engine; "lookup"; Process Definition Registry; "request to activate process X"]
Creating a Process II
[Diagram; labels: "create instance"; AMQP; EPU Management; Other; Provisioner/DTRS; IaaS; "request instance"; Process Dispatcher; "enter"; Process Instance Registry; EE type A instance; "launch"; ee-agent; Decision Engine; "lookup"; Process Definition Registry; "request to activate process X"]
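One possible reading of the two "Creating a Process" slides is sketched below: a request to activate a process triggers a lookup in the definition registry, the process is placed on a free execution engine (EE) of the required type, and if none is free a new instance is requested. The ordering of the steps is an assumption inferred from the diagram labels, and the `Dispatcher` class with its fields is hypothetical, not the CEI implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Dispatcher:
    """Toy stand-in for a process dispatcher (all names are hypothetical)."""
    # Process name -> required EE type (plays the Process Definition Registry).
    definitions: dict
    # EE type -> list of free EE instance ids.
    engines: dict = field(default_factory=dict)
    # Process name -> EE instance id (plays the Process Instance Registry).
    instances: dict = field(default_factory=dict)
    # EE types for which a new instance has been requested.
    pending_requests: list = field(default_factory=list)

    def activate(self, process_name):
        """Place a process on a free EE, or request a new instance."""
        ee_type = self.definitions[process_name]      # "lookup"
        free = self.engines.get(ee_type, [])
        if not free:
            # No matching resource: "request instance" and wait for it.
            self.pending_requests.append(ee_type)
            return None
        instance_id = free.pop(0)                     # "launch" on that EE
        self.instances[process_name] = instance_id    # "enter" in the registry
        return instance_id
```

With `Dispatcher(definitions={"X": "A"}, engines={"A": ["ee-1"]})`, the first `activate("X")` returns `"ee-1"` (the case drawn in the first slide). A second call finds no free EE of type A, records a pending instance request, and returns `None` (the case drawn in the second slide).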
Inside an Execution Engine
Legend: C – create, M – monitor, R – restart, K – kill, O – I/O
[Diagram; labels: AMQP; Other; EPU Management; Process Dispatcher; Package Server; EE type A instance; CC instance ×2; supervisord ×3; context-agent; ou-agent; ee-agent; Matlab script; process (adapter); datastream subscription; result; operation annotations: C, M, CMR, CMK, CMKO]
Adventures in Availability
• Availability: A = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair
• Time to repair (TTR)
  • Diagnosis
  • Time to scale (TTS)
    • PENDING (request)
    • STARTED (deployment)
    • RUNNING (contextualization)
[Figure: TTS, preliminary results for 2,000 VMs provisioned on AWS EC2]
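As a worked illustration of the availability formula and the TTS phases, the sketch below computes both from raw numbers. The function names are hypothetical, the sample values are made up for the example rather than taken from the measurements, and the mapping of instance states to phases (PENDING covers the request, STARTED the deployment, RUNNING the contextualization) is an assumed reading of the slide.

```python
def availability(mtbf, mttr):
    """A = MTBF / (MTBF + MTTR); both arguments in the same time unit."""
    if mtbf < 0 or mttr < 0 or mtbf + mttr == 0:
        raise ValueError("need non-negative times with a positive sum")
    return mtbf / (mtbf + mttr)


def time_to_scale(pending_at, started_at, running_at, ready_at):
    """Split time to scale into phases from four timestamps (seconds).

    Each timestamp marks when an instance entered a state; ready_at is
    when contextualization finished and the instance could take work.
    """
    if not pending_at <= started_at <= running_at <= ready_at:
        raise ValueError("timestamps must be in non-decreasing order")
    return {
        "request": started_at - pending_at,          # time spent PENDING
        "deployment": running_at - started_at,       # time spent STARTED
        "contextualization": ready_at - running_at,  # RUNNING until ready
        "total": ready_at - pending_at,
    }
```

For instance, an MTBF of 99 hours with an MTTR of 1 hour gives an availability of 0.99, which shows why shortening repair time (including TTS) raises availability even when the failure rate is unchanged.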
R3 Scope
• Process management
  • Activation and validation
  • New execution site registration
• Integration with National Infrastructure
  • Framework for integration of academic cloud providers, TeraGrid and OSG
  • Integration with Microsoft cloud
R3 Activities
• Refine/change scope to achieve a complete and maintainable system
• Decide on specific solutions for R3 scope