1.01k likes | 1.23k Views
Monitoring, Configuration and Resource Management of Service Workflows in Virtualized Clusters and Clouds. Yi Wei Advisor: Prof. M. Brian Blake and Prof. Greg Madey April 11, 2013. Outlines. Introduction and Motivations Backgrounds and Definitions Technical Approaches and Evaluations
E N D
Monitoring, Configuration and Resource Management of Service Workflows in Virtualized Clusters and Clouds Yi Wei Advisor: Prof. M. Brian Blake and Prof. Greg Madey April 11, 2013
Outlines • Introduction and Motivations • Backgrounds and Definitions • Technical Approaches and Evaluations • Workflow Configuration • Service Monitoring • Resource Management • Cross-workflow Coordination • Conclusion and Future Work
The Cloud - illustrated Image from cloudtweaks.com
Introduction: Managing Service Workflows in Cloud Environments Cloud B Long-standing Workflows Cloud A VM VM VM VM VM Cloud C Cloud-oriented Resources
Real World Distributed Service Workflows A workflow for detecting tumor types through microarray analysis (part of the CancerGrid environment).
Evolution of the Topic • Integrate Service-oriented • Computing and Cloud • Computing by the means of an • improved service lifecycle • model, a semi-automated • software framework and its • accompanying discovery and • management methods and tools. • SoC service workflows • Service modeling, discovery and management management of service workflows • Quantitative evaluations
A More Focused Thesis Deployment and management of service workflows across virtualized clusters and clouds. Relevance: New requirements from hosting complex applications in clouds Illusion of infinite resource pool is just an illusion Benefit of separation of concerns
Major Challenge in Resource Allocation L L • How to differentiate services in a workflow and determine the amount of resources allocated to each service? • Existing approach: SLA or QoS-based allocation (Van et. al, Zhu et. al) • My approach: infer service ranks based on workflow structure and use these values to guide resource allocation S1 S2 S4 S3 M S L M
Major Challenge in Resource Management T0 T1 T2 T3 • How to fulfill the dynamic resource requirements of services in the workflow? • Existing approach: reactive or policy-based scaling (Amazon Cloud) • My approach: proactive scaling, use idle resources from underutilized services to satisfy the needs of busy services, and cross-workflow coordination S1 M S M L S2 M M
Overall Novelty of Dissertation • Considers service workflows instead of individual services • Provides an end-to-end solution for workflow deployment and management • Deploys and manages workflows from a resource perspective • Offers mechanisms for cross-workflow coordination
Contributions Overview: Model, algorithms, and generic, agent-based framework to monitor, configure, and allocate service workflows across virtualized resources in a cloud. • A conceptual framework for managing service workflows • An algorithm to deploy workflows onto a virtualized resource pool • An algorithm to adaptively monitor deployed services • An algorithm to dynamically manage running workflows • An algorithm to realize cross-workflow coordination
Target: Services and Service Workflows • A service is a self-contained software artifact equipped with standardized APIs, designed to finish a specific task. • A service workflow is a series of interdependent and loosely coupled services. Its goal is to accomplish complex business logics or scientific processes.
Environment: Virtualized Clusters and Clouds User Requests A highly virtualized resource pool managed for the users and providing on-demand provisioning based on user requests and resource availabilities. Virtualized Resource Pool PM1 PM2 V1 V2 V3 V4 PM3 … V5 V6 V7 V8
Virtualized Resources • A portion of the hosting physical machine’s capacity • Multiple predefined configurations (different CPU number, RAM size, etc) • Support on-demand creation and termination PM PM XLarge Large Large PM … M M M M
Operations: Deployment and Management L L • Deploy new workflows • Monitor deployed services • Manage resources for deployed workflows S1 S2 S4 S3 L S M M 16
Why Agents? • Distributed and autonomous nature • Manifest self-organization and self-steering • Model behaviors of entities of different (maybe conflicting) interests and their interactions with the environment and between themselves
The Management Framework Cloud Cloud Management Agent Resource adjustment requests Decisions Workflow Workflow Management Agent Resource adjustment requests Decisions Service Service Management Agent Monitoring request Status information Instance/VM Instance Monitoring Agent
Workflow Deployment The goal is to produce a mapping plan from services to VMs, then to PMs. S1 S2 Workflow L M M VM PM P1 P2 P3
Considerations and Constraints • Budget of the workflow • Service priority • Capacity of each PM at deployment time • Predicted capacity of each PM
Configuration Algorithm Steps • Service sorting and ranking • PM sorting based on predictive and current free capacities • Expendable budget calculation and VM size search • Residual budget distribution • VM to PM mapping Complete Algorithm VM Splitting
Experiment Configuration • Simulation on traces from real world clusters • Randomly generated DAG workflows as input • Number of overloaded PMs as performance metric • FirstFit and BestFit as baseline comparisons
Algorithm Comparison Additional Results
Motivation Different services have different degrees of availability, so it is unnecessary to check them using the same interval.
The Check Period Relaxation (CPR) Algorithm • Inspired by congestion control mechanism in TCP protocol • Successful status check doubles next check interval (to a certain extent) • Failed status check half next check interval
Finite State Automaton for CPR Chk=F | (Chk=S& CP>=FRL) Chk=F & FailCount = 3 FinalChk=F Chk=S & CP<FRL FR CR INV FAIL FinalChk=S Chk=S& SuCount=3 Chk: Check results SuCount: Successful check count FailCount: Fail check count S: Successful F: Fail FinalChk: Final check Chk=S| (Chk=F& FailCount<3) FR: Fast Relax StateCR: Cautious Relax State INV: Inactive FRL: Fast Relax Limit CP: Check Period
Observations • CPR can reduce message count across different availability values. • Performance is not very good at 80% to 90% availability levels.
Two Separate Modifications • Allow one additional failure before reducing the check period to filter out transient errors (CPR_2e) • Add an additional state to filter out unstable services (M_CPR)
New Results Additional Results
The Problem Dynamically manage resources (allocate or release) of a workflow so that the load levels of its component services stay within the specified range.
Limitation of Existing Approaches State-of-the-art approaches usually rely on reactive scaling or predefined rules. These approaches can be inflexible and inefficient under various situations.
Resource Reallocation Use idle resources of underutilized services to meet the needs of busy services. Workflow S1 S2 L M M VM PM P1 P2 P3
Billing and Management Cycles billing cycle management cycle m1 m2 m3 m4 A releases V1 to workflow agent, B request a VM V1 is allocated to B V1 started and allocated to A B releases V1
Management Algorithm Overview • Service level prediction and decision making • Workflow level matching • Cloud level allocation
Algorithm Process Service Load Prediction Resource Adjustment Calculation Resource Request Processing Service Service Workflow Internal VM Mapping and Request Forwarding Resource Assignment Cloud Workflow
Synthetic Data Generation • Four simple generators • Complex patterns are the linear combinations of its component generators • The generated data is called required capacity
Workflow Level Comparison Additional Results
Side Effect : Average VM size Shrinkage Total capacity is 273 at step 801 Total capacity is 242 at step 401
VM Merge Mechanism • Happens when a load level is stable • One merge per request • Merge requests have a lower priority than allocation requests
Motivations • Same service is used in multiple workflows • Different workflows have different load patterns • For the same service, resources from one instance group can be used to serve another instance group