Monitoring, Configuration and Resource Management of Service Workflows in Virtualized Clusters and Clouds

Yi Wei. Advisors: Prof. M. Brian Blake and Prof. Greg Madey. April 11, 2013

Outline • Introduction and Motivations • Background and Definitions • Technical Approaches and Evaluations • Workflow Configuration • Service Monitoring • Resource Management • Cross-workflow Coordination • Conclusion and Future Work

The Cloud - illustrated Image from cloudtweaks.com

Introduction: Managing Service Workflows in Cloud Environments [Figure: long-standing workflows running on VMs across Cloud A, Cloud B, and Cloud C, backed by cloud-oriented resources]

Real World Distributed Service Workflows A workflow for detecting tumor types through microarray analysis (part of the CancerGrid environment).

Evolution of the Topic • Integrate Service-oriented Computing and Cloud Computing by means of an improved service lifecycle model, a semi-automated software framework, and its accompanying discovery and management methods and tools. • SoC → service workflows • Service modeling, discovery and management → management of service workflows • Quantitative evaluations

A More Focused Thesis Deployment and management of service workflows across virtualized clusters and clouds. Relevance: • New requirements from hosting complex applications in clouds • The infinite resource pool is just an illusion • Benefit of separation of concerns

Major Challenge in Resource Allocation • How to differentiate services in a workflow and determine the amount of resources allocated to each service? • Existing approach: SLA- or QoS-based allocation (Van et al., Zhu et al.) • My approach: infer service ranks from the workflow structure and use these ranks to guide resource allocation [Figure: a workflow of services S1, S2, S3, S4 assigned VMs of different sizes (S, M, L)]

Major Challenge in Resource Management • How to fulfill the dynamic resource requirements of services in the workflow? • Existing approach: reactive or policy-based scaling (Amazon Cloud) • My approach: proactive scaling, reuse of idle resources from underutilized services to satisfy the needs of busy services, and cross-workflow coordination [Figure: VM sizes (S, M, L) of services S1 and S2 changing over time steps T0 to T3]

Overall Novelty of Dissertation • Considers service workflows instead of individual services • Provides an end-to-end solution for workflow deployment and management • Deploys and manages workflows from a resource perspective • Offers mechanisms for cross-workflow coordination

Contributions Overview: Model, algorithms, and generic, agent-based framework to monitor, configure, and allocate service workflows across virtualized resources in a cloud. • A conceptual framework for managing service workflows • An algorithm to deploy workflows onto a virtualized resource pool • An algorithm to adaptively monitor deployed services • An algorithm to dynamically manage running workflows • An algorithm to realize cross-workflow coordination

Background & Definitions

Target: Services and Service Workflows • A service is a self-contained software artifact equipped with standardized APIs, designed to complete a specific task. • A service workflow is a series of interdependent, loosely coupled services. Its goal is to carry out complex business logic or scientific processes.

Environment: Virtualized Clusters and Clouds A highly virtualized resource pool that is managed on behalf of its users and provisions resources on demand, based on user requests and resource availability. [Figure: user requests served by a virtualized resource pool of physical machines PM1, PM2, PM3, … hosting VMs V1 to V8]

Virtualized Resources • A portion of the hosting physical machine’s capacity • Multiple predefined configurations (different CPU counts, RAM sizes, etc.) • Supports on-demand creation and termination [Figure: physical machines (PMs) partitioned into VMs of predefined sizes: XLarge, Large, M]

Operations: Deployment and Management • Deploy new workflows • Monitor deployed services • Manage resources for deployed workflows [Figure: a workflow of services S1, S2, S3, S4 with VMs of sizes S, M, and L]

Agent-based Management Framework

Why Agents? • Distributed and autonomous nature • Manifest self-organization and self-steering • Model the behaviors of entities with different (possibly conflicting) interests, and their interactions with the environment and with each other

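The Management Framework • Cloud level: Cloud Management Agent • Workflow level: Workflow Management Agent (exchanges resource adjustment requests and decisions with the Cloud Management Agent) • Service level: Service Management Agent (exchanges resource adjustment requests and decisions with the Workflow Management Agent) • Instance/VM level: Instance Monitoring Agent (exchanges monitoring requests and status information with the Service Management Agent)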

Workflow Configuration

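Workflow Deployment The goal is to produce a mapping plan from services to VMs, and then from VMs to PMs. [Figure: workflow services S1 and S2 mapped to VMs of sizes L, M, M, which are in turn mapped to physical machines P1, P2, P3]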

Considerations and Constraints • Budget of the workflow • Service priority • Capacity of each PM at deployment time • Predicted capacity of each PM

Configuration Algorithm Steps • Service sorting and ranking • PM sorting based on predicted and current free capacities • Expendable budget calculation and VM size search • Residual budget distribution • VM to PM mapping

Experiment Configuration • Simulation on traces from real world clusters • Randomly generated DAG workflows as input • Number of overloaded PMs as performance metric • FirstFit and BestFit as baseline comparisons
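The two baselines named above are standard bin-packing heuristics. The following is a minimal sketch of how they and the overloaded-PM metric could be computed; it is not the configuration algorithm from the dissertation. The integer capacity units, the `later_load` trace values, and all function names are assumptions made for illustration.

```python
from typing import List, Optional


def first_fit(vm_sizes: List[int], pm_free: List[int]) -> List[Optional[int]]:
    """Place each VM on the first PM that has enough free capacity."""
    free = list(pm_free)  # copy so the caller's list is not modified
    plan: List[Optional[int]] = []
    for size in vm_sizes:
        chosen = next((i for i, cap in enumerate(free) if cap >= size), None)
        if chosen is not None:
            free[chosen] -= size
        plan.append(chosen)  # None means the VM could not be placed
    return plan


def best_fit(vm_sizes: List[int], pm_free: List[int]) -> List[Optional[int]]:
    """Place each VM on the PM that leaves the least free capacity."""
    free = list(pm_free)
    plan: List[Optional[int]] = []
    for size in vm_sizes:
        candidates = [i for i, cap in enumerate(free) if cap >= size]
        chosen = min(candidates, key=lambda i: free[i] - size) if candidates else None
        if chosen is not None:
            free[chosen] -= size
        plan.append(chosen)
    return plan


def count_overloaded(plan: List[Optional[int]], vm_sizes: List[int],
                     pm_capacity: List[int], later_load: List[int]) -> int:
    """Count PMs whose later background load plus placed VMs exceeds capacity.

    later_load is a hypothetical stand-in for the load a trace reports
    after deployment time.
    """
    used = list(later_load)
    for size, pm in zip(vm_sizes, plan):
        if pm is not None:
            used[pm] += size
    return sum(1 for u, cap in zip(used, pm_capacity) if u > cap)
```

Because both heuristics look only at free capacity at deployment time, a PM that fits a VM now can still become overloaded once its background load grows, which is what the metric counts.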

Algorithm Comparison

Service Monitoring

Motivation Different services have different degrees of availability, so it is unnecessary to check them using the same interval.

The Check Period Relaxation (CPR) Algorithm • Inspired by the congestion control mechanism in TCP • A successful status check doubles the next check interval (up to a limit) • A failed status check halves the next check interval
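The doubling and halving rules above can be sketched as follows. This is a minimal illustration of the interval update only, not the full CPR algorithm: the class name, the `min_period` and `max_period` bounds, and their default values are assumptions, and the automaton states are not modeled.

```python
class CheckPeriod:
    """Tracks the interval between status checks for one service."""

    def __init__(self, period: float = 1.0,
                 min_period: float = 1.0, max_period: float = 64.0):
        # min_period and max_period are hypothetical bounds; the slides only
        # say that doubling applies "to a certain extent".
        self.period = period
        self.min_period = min_period
        self.max_period = max_period

    def update(self, check_succeeded: bool) -> float:
        """Return the interval to wait before the next check."""
        if check_succeeded:
            # Relax: an available service is checked less often.
            self.period = min(self.period * 2, self.max_period)
        else:
            # Tighten: a failing service is checked more often.
            self.period = max(self.period / 2, self.min_period)
        return self.period
```

A service that keeps answering successfully is therefore checked at intervals of 2, 4, 8, and so on, which is where the reduction in monitoring messages comes from.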

Finite State Automaton for CPR [Figure: automaton over the states FR, CR, INV, and FAIL] Transition labels: • Chk=F | (Chk=S & CP>=FRL) • Chk=F & FailCount=3 • FinalChk=F • Chk=S & CP<FRL • FinalChk=S • Chk=S & SuCount=3 • Chk=S | (Chk=F & FailCount<3) Legend: • Chk: check result • SuCount: successful check count • FailCount: failed check count • S: successful • F: fail • FinalChk: final check • FR: Fast Relax state • CR: Cautious Relax state • INV: Inactive • FRL: Fast Relax Limit • CP: Check Period

Message Count Comparison

Observations • CPR reduces the message count across different availability values. • CPR performs less well at availability levels of 80% to 90%.

Two Separate Modifications • Allow one additional failure before reducing the check period to filter out transient errors (CPR_2e) • Add an additional state to filter out unstable services (M_CPR)
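The first modification can be sketched as a small change to the interval update: a failure only shortens the check period once the tolerated number of failures has been exceeded. This is a minimal sketch under assumptions, not the CPR_2e implementation: the slides do not say whether failures are counted consecutively, so the reset on success, the class name, and the period bounds are all hypothetical.

```python
class TolerantCheckPeriod:
    """Check period that ignores one failure before tightening."""

    def __init__(self, period: float = 8.0, min_period: float = 1.0,
                 max_period: float = 64.0, tolerated_failures: int = 1):
        self.period = period
        self.min_period = min_period          # hypothetical lower bound
        self.max_period = max_period          # hypothetical upper bound
        self.tolerated_failures = tolerated_failures
        self.consecutive_failures = 0

    def update(self, check_succeeded: bool) -> float:
        """Return the interval to wait before the next check."""
        if check_succeeded:
            # Assumption: a success resets the failure counter.
            self.consecutive_failures = 0
            self.period = min(self.period * 2, self.max_period)
        else:
            self.consecutive_failures += 1
            # Only halve once the tolerated failures are used up, so a
            # single transient error leaves the period unchanged.
            if self.consecutive_failures > self.tolerated_failures:
                self.period = max(self.period / 2, self.min_period)
        return self.period
```

With `tolerated_failures=0` the class falls back to the plain doubling and halving behavior.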

New Results

Workflow Resource Management

The Problem Dynamically manage resources (allocate or release) of a workflow so that the load levels of its component services stay within the specified range.

Limitations of Existing Approaches State-of-the-art approaches usually rely on reactive scaling or predefined rules, which can be inflexible and inefficient in many situations.

Resource Reallocation Use idle resources of underutilized services to meet the needs of busy services. Workflow S1 S2 L M M VM PM P1 P2 P3

Billing and Management Cycles [Figure: timeline showing a billing cycle and management cycles m1 to m4] • V1 is started and allocated to A • A releases V1 to the workflow agent; B requests a VM • V1 is allocated to B • B releases V1

Management Algorithm Overview • Service level prediction and decision making • Workflow level matching • Cloud level allocation

Algorithm Process • Service level: Service Load Prediction, then Resource Adjustment Calculation • Workflow level: Resource Request Processing, then Internal VM Mapping and Request Forwarding • Cloud level: Resource Assignment
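The workflow-level step can be illustrated with a small matching routine: releases from underutilized services are used first, and only the unmet remainder is forwarded to the cloud level. This is a minimal sketch, not the dissertation's management algorithm. It assumes capacity can be moved in arbitrary integer units, whereas the slides reassign whole VMs, and the function name and the largest-request-first order are hypothetical.

```python
from typing import Dict, List, Tuple


def match_within_workflow(
    adjustments: Dict[str, int],
) -> Tuple[List[Tuple[str, str, int]], Dict[str, int]]:
    """Match releasable capacity to requests inside one workflow.

    adjustments maps a service name to capacity units: a positive value is
    a request, a negative value is capacity the service can release.
    Returns (transfers, forwarded): transfers are (donor, receiver, units)
    and forwarded holds the unmet requests to send to the cloud level.
    """
    donors = [[name, -units] for name, units in adjustments.items() if units < 0]
    requests = [(name, units) for name, units in adjustments.items() if units > 0]
    # Assumption: serve the largest request first.
    requests.sort(key=lambda item: -item[1])

    transfers: List[Tuple[str, str, int]] = []
    forwarded: Dict[str, int] = {}
    for receiver, need in requests:
        for donor in donors:
            if need == 0:
                break
            give = min(donor[1], need)
            if give > 0:
                transfers.append((donor[0], receiver, give))
                donor[1] -= give
                need -= give
        if need > 0:
            forwarded[receiver] = need  # left for cloud-level allocation
    return transfers, forwarded
```

For example, `match_within_workflow({"S1": -3, "S2": 5, "S3": 1})` moves three units from S1 to S2 and forwards the remaining two units for S2 and one unit for S3 to the cloud level.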

Synthetic Data Generation • Four simple generators • Complex patterns are linear combinations of the simple generators • The generated data is called the required capacity
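The idea of building complex load patterns as linear combinations of simple generators can be sketched as follows. The slides do not name the four generators, so the constant, linear, periodic, and noise generators below are assumptions chosen for illustration, as are the weights and the clamping of negative values to zero.

```python
import math
import random
from typing import Callable, List, Sequence

Generator = Callable[[int], float]  # maps a time step to a load value


def constant(level: float) -> Generator:
    return lambda t: level


def linear(slope: float) -> Generator:
    return lambda t: slope * t


def periodic(amplitude: float, period: int) -> Generator:
    return lambda t: amplitude * math.sin(2 * math.pi * t / period)


def noise(scale: float, seed: int = 0) -> Generator:
    rng = random.Random(seed)  # seeded so a run is repeatable
    return lambda t: rng.uniform(-scale, scale)


def combine(generators: Sequence[Generator],
            weights: Sequence[float]) -> Generator:
    """Linear combination: the weighted sum of the component generators."""
    def combined(t: int) -> float:
        return sum(w * g(t) for g, w in zip(generators, weights))
    return combined


def required_capacity(pattern: Generator, steps: int) -> List[float]:
    """Sample a pattern; negative values are clamped to zero (assumption)."""
    return [max(0.0, pattern(t)) for t in range(steps)]
```

For instance, `required_capacity(combine([constant(10), linear(0.5)], [1.0, 2.0]), 3)` yields a steadily growing load starting at 10 units.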

Workflow Level Comparison

Average VM Lifespan Comparison

VM Creation/Termination Comparison

Allocation Request Fulfillment Composition

Side Effect: Average VM Size Shrinkage • Total capacity is 273 at step 801 • Total capacity is 242 at step 401

VM Merge Mechanism • Happens when a load level is stable • One merge per request • Merge requests have a lower priority than allocation requests

Evaluation of the Merge Mechanism

Cross-Workflow Coordination

Motivations • The same service is used in multiple workflows • Different workflows have different load patterns • For the same service, resources from one instance group can be used to serve another instance group
