Zodiac: Efficient Impact Analysis for Storage Area Networks
Aameek Singh* (Georgia Tech), Madhukar Korupolu (IBM Almaden Research Center), Kaladhar Voruganti (IBM Almaden Research Center)
* Work done at IBM Almaden
Outline • Storage Management • Change management • Policy based management • Zodiac Architecture • Efficient Policy Evaluation Algorithms • Performance Analysis • Related Work • Summary of Contributions • Conclusions and Future Work
Storage Management
Concerns: Security, Availability, Disaster Recovery, Future Growth, Performance, Change Management, Inter-operability, Policy based management
"I don't know about you, but the storage industry makes me crazy! … Storage is perhaps the last vestige of the Wild West of information technology" – Storage Management Forum
Change Management
Why and how to manage changes to your SAN?
WHY: 70-80% of all changes resulting in downtime are initiated by people within the organization [Gartner]
HOW: Reactive (ad-hoc techniques) vs. Proactive (analyzing impact on a. Performance, b. Security, c. Availability, DR, …); with increasing complexity/scale, the shift is from reactive toward proactive
Policy based Management • Policy a.k.a. best practices, rules, rules of thumb, constraints, threshold violations, goals, service classes • Policy [Webster's Dictionary]: a definite course or method of action selected in light of given conditions to determine present and future decisions • Mainframes in the 80s (DFSMS), networking in the 90s (PBNM), … • Abstracting goals from specific implementations • Expressed as a 4-tuple in the Event-Condition-Action (ECA) style: < Scope, If, Then, Priority > • Example: SCOPE: Zone; IF: Linux host included in a Windows-exclusive zone; THEN: Alert administrator / automatic rezone; PRIORITY: High
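To make the < Scope, If, Then, Priority > tuple concrete, here is a minimal Python sketch of one way such a policy could be represented and evaluated. The class and field names (Entity, Policy, condition, action) are illustrative assumptions for this example, not Zodiac's actual data model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical SAN entity, e.g. a host together with the zone it sits in.
@dataclass
class Entity:
    kind: str         # e.g. "host", "switch", "controller"
    attributes: dict  # e.g. {"os": "linux", "zone": "win_zone_1"}

# A policy as a <Scope, If, Then, Priority> 4-tuple (illustrative sketch).
@dataclass
class Policy:
    scope: str                           # which entity kind to evaluate, e.g. "host"
    condition: Callable[[Entity], bool]  # the "If" predicate
    action: Callable[[Entity], None]     # the "Then" action
    priority: str                        # e.g. "High"

def evaluate(policy: Policy, entities: List[Entity]) -> None:
    """Apply the policy's action to every in-scope entity whose condition holds."""
    for e in entities:
        if e.kind == policy.scope and policy.condition(e):
            policy.action(e)

# Example: alert when a Linux host is a member of a Windows-exclusive zone.
windows_exclusive_zones = {"win_zone_1"}
hosts = [Entity("host", {"os": "linux", "zone": "win_zone_1"})]
alert_policy = Policy(
    scope="host",
    condition=lambda e: e.attributes["zone"] in windows_exclusive_zones
                        and e.attributes["os"] == "linux",
    action=lambda e: print(f"ALERT (High): Linux host in Windows-exclusive zone {e.attributes['zone']}"),
    priority="High",
)
evaluate(alert_policy, hosts)
```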
Impact Analysis in Policy-enabled SANs
[SAN topology figure: hosts with HBAs and HBA ports, connected through switch ports and switches to controller ports, a controller, and volumes]
- Policy evaluation may require repeated traversals of the graph
- Policy violations may trigger actions that further impact the SAN
- Policy addition would require evaluation for the whole SAN
• Keep paths? Too many!
• Keep host-to-storage-array information? What if there are additional conditions (e.g., connected through a Vendor-S switch)?
Three dimensions of complexity: SAN size, # policies, action chaining
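To illustrate why keeping all paths quickly becomes infeasible and why repeated graph traversals are costly, here is a minimal sketch of enumerating host-to-controller paths over an adjacency-list view of the SAN; the graph shape and node names are made up for this example.

```python
def all_paths(graph, node, targets, path=None):
    """Enumerate every simple path from `node` to any node in `targets`.
    `graph` is a dict: node -> list of neighbours (host -> HBA port ->
    switch -> controller), a toy stand-in for the SAN topology graph."""
    path = (path or []) + [node]
    if node in targets:
        yield path
        return
    for nxt in graph.get(node, []):
        if nxt not in path:  # avoid cycles
            yield from all_paths(graph, nxt, targets, path)

# Tiny illustrative fabric: one host, two HBA ports, two switches, one controller.
san = {
    "host1": ["hba1", "hba2"],
    "hba1": ["switch1", "switch2"],
    "hba2": ["switch1", "switch2"],
    "switch1": ["ctrl1"],
    "switch2": ["ctrl1"],
}
# Even this tiny fabric already has 4 host-to-controller paths; real fabrics
# have far more, so materializing every path does not scale.
print(len(list(all_paths(san, "host1", {"ctrl1"}))))
```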
ZODIAC
• Proactive Impact Analysis: what happened? what can happen?
  • What-if I deploy this new application? (e.g., performance requirements will not be met)
  • What-if I add that host to this zone? (e.g., cannot add a Linux host to a Windows-exclusive zone)
  • What-if I add this new SAN policy? (e.g., Hosts 1, 2, 3 violate it)
• Fast and efficient evaluation of policies: optimization techniques
• Temporal Analysis: impact after N hours
ZODIAC – Architecture
External modules:
• SAN Monitor: obtain SAN state
• Workload Schedule: temporal analysis
• Policies DB: system policies
• Resource Model: device models
Internal modules:
• SAN State: SAN snapshot
• Session: impact analysis session
• Processing Engine: evaluation
• Optimizations (Classification, Cache, Aggregates): for efficient evaluation
• Visualization Engine: what-if GUI and impact visualization
ZODIAC: What-if run-through
Example proposed action: deploying 10 new hosts running Application-A
• Input the proposed action and the defined workload schedule
• Initialization: obtain a SAN snapshot from the SAN Monitor and initialize the session
• Find the relevant region of the SAN and the relevant policies (user-defined policies from the Policies DB)
• Evaluate policies using the optimizations (classification, cache, aggregates) and device resource models (expert-generated/observed)
• If policies are triggered, simulate the resulting actions
• Temporal analysis and impact visualization in the GUI
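The run-through above can be summarized as a loop: apply the proposed change to a snapshot, evaluate the relevant policies, and simulate any triggered actions until the chain settles. A minimal, hypothetical sketch of that loop follows; the method names (apply, is_relevant_to, is_violated_by, triggered_actions) are assumptions for illustration, not Zodiac's actual API.

```python
def run_what_if(snapshot, proposed_action, policies, max_rounds=10):
    """Simulate a proposed change on a copy of the SAN state and collect the
    policies it violates, following action chaining for a bounded number of rounds."""
    state = snapshot.copy()          # work on a copy; never touch the live SAN
    pending = [proposed_action]      # queue of actions still to be simulated
    all_violations = []

    for _ in range(max_rounds):      # bound action chaining to avoid infinite loops
        if not pending:
            break
        action = pending.pop(0)
        state = action.apply(state)  # simulate the change on the snapshot

        # Only evaluate policies whose class says this change can affect them.
        relevant = [p for p in policies if p.is_relevant_to(action)]
        violations = [p for p in relevant if p.is_violated_by(state)]
        all_violations.extend(violations)

        # Violated policies may trigger further actions (action chaining).
        for p in violations:
            pending.extend(p.triggered_actions(state))

    return all_violations
```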
Policy Evaluation • Current approaches • Scopes and sub-scopes (consider: a Vendor-A host should be connected only to a Vendor-S storage controller) • Run periodically, in configuration-validation mode only • Do not exploit locality of data across policies • Zodiac Optimizations • Policy Classification: identify the relevant region of the SAN • Intermediate Result Caching: reusing results • Aggregation: utilizing aggregates
Policy Classification (1)
• Understand what policies are: Agrawal et al. contains 64 types of policies collected from field experts
• Analyze from the evaluation perspective, not the specification!
• Entity Class (EC): dependent on attributes of entities of the same type (e.g., HBAs)
  • Individual (EC-Ind): dependent on attributes of the affected entity only (e.g., the added HBA)
  • Collection (EC-Col): dependent on the entire entity class
• Single Path (SPTH): policy holds on each and every path -> only new paths need to be checked
• Multiple Path (MPTH): policy needs evaluation over multiple paths
• Zoning/LUN Masking (ZL): similar to EC policies, with entities being zones or LUN Mask sets
  • Individual (ZL-Ind): dependent on attributes of entities within the affected zone only
  • Collection (ZL-Col): dependent on attributes of entities in other zones also
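These classes matter because they bound how much of the SAN must be re-examined after a change. A minimal sketch of that dispatch follows; the helper calls on the san and change objects are hypothetical names for this example, not Zodiac's implementation.

```python
from enum import Enum, auto

class PolicyClass(Enum):
    EC_IND = auto()   # Entity Class, Individual
    EC_COL = auto()   # Entity Class, Collection
    SPTH = auto()     # Single Path
    MPTH = auto()     # Multiple Path
    ZL_IND = auto()   # Zoning/LUN Masking, Individual
    ZL_COL = auto()   # Zoning/LUN Masking, Collection

def evaluation_region(policy_class, change, san):
    """Return only the part of the SAN the policy must be re-evaluated on
    after `change` (illustrative helpers on `san` and `change`)."""
    if policy_class == PolicyClass.EC_IND:
        return [change.affected_entity]                           # just the changed entity
    if policy_class == PolicyClass.EC_COL:
        return san.entities_of_type(change.affected_entity.kind)  # the whole entity class
    if policy_class == PolicyClass.SPTH:
        return san.new_paths_created_by(change)                   # only newly created paths
    if policy_class == PolicyClass.MPTH:
        return san.all_paths_touching(change)                     # multiple paths must be checked
    if policy_class == PolicyClass.ZL_IND:
        return [change.affected_zone]                             # only the affected zone
    return san.all_zones()                                        # ZL_COL: other zones matter too
```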
Policy Classification (2) • WHY do this ? • Identify policies that require evaluation on only a few specific regions • Earlier approaches always took the worst-case for correct evaluation! • Does it work ? • Able to classify all policies in Agrawal et al • Experiments discussed later • Other Benefits • Policy evaluation (not just impact analysis) • Another perspective on policy specification • How to use specification to understand policy class? (ongoing work)
Result Caching
[Figure: the SAN topology from before, with repeated getController() calls issued while evaluating policies for different hosts; later calls hit the cache (HIT) instead of re-traversing the graph]
Data reuse: locality across policies
• Execution of a policy for different instances of SAN elements
• Multiple executions of the same policy
• Supports filters for conditions like "Vendor-A hosts connected to Vendor-S storage through a Vendor-W switch"
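A minimal sketch of the intermediate-result caching idea, assuming a hypothetical get_controllers traversal and an optional switch-vendor filter; the cache keying, the reachable_controllers helper, and the via_switch_vendor attribute are assumptions for this example, not Zodiac's actual design.

```python
# Minimal sketch of intermediate-result caching (illustrative, not Zodiac's design).
def _traverse_to_controllers(san, host_id):
    # Expensive graph walk: host -> HBA -> switch-port -> switch ->
    # controller-port -> controller. Stubbed via an assumed helper on `san`.
    return san.reachable_controllers(host_id)

class ResultCache:
    """Caches traversal results so that different policies, and repeated
    evaluations of the same policy, reuse them instead of re-walking the graph."""
    def __init__(self, san):
        self.san = san
        self._controllers = {}  # host_id -> list of reachable controllers

    def get_controllers(self, host_id, switch_vendor=None):
        if host_id not in self._controllers:  # MISS: traverse once
            self._controllers[host_id] = _traverse_to_controllers(self.san, host_id)
        result = self._controllers[host_id]   # HIT on later calls
        if switch_vendor is not None:         # optional filter, e.g. "Vendor-W"
            # `via_switch_vendor` is an assumed attribute on the cached entries.
            result = [c for c in result if c.via_switch_vendor == switch_vendor]
        return result
```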
Aggregation • IDEA: Keep intermediate aggregates for more frequent policies • Unique Policies • For example, all devices must have unique world wide names (WWN), all fibre channel switches must have unique domain IDs • Keep a hash table for the unique attribute • Counts Policies • For example, all zones must have at least M and at most N ports, the number of zones in the fabric should be less than N • Keep a count for the attribute • Transformable • Convert policies into lower complexity policies • For example, all storage should be from same vendor (EC-Col) -> all storage should be from Vendor-S (EC-Ind)
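As a concrete illustration of the unique and counts aggregates, here is a minimal Python sketch; the attribute choices (a WWN set, a ports-per-zone counter) follow the examples above, while the class and method names are assumptions for this example.

```python
from collections import defaultdict

class Aggregates:
    """Incrementally maintained aggregates so that unique/count policies can be
    checked in O(1) instead of rescanning the whole SAN (illustrative sketch)."""
    def __init__(self):
        self.wwn_seen = set()                   # for "all WWNs must be unique"
        self.ports_per_zone = defaultdict(int)  # for "each zone has between M and N ports"

    def add_device(self, wwn):
        """Return False if adding this device would violate the unique-WWN policy."""
        if wwn in self.wwn_seen:
            return False
        self.wwn_seen.add(wwn)
        return True

    def add_port_to_zone(self, zone, min_ports=1, max_ports=64):
        """Return whether the zone's port count stays within [min_ports, max_ports]."""
        self.ports_per_zone[zone] += 1
        return min_ports <= self.ports_per_zone[zone] <= max_ports

# Example: checking a proposed device addition against the unique-WWN policy.
agg = Aggregates()
agg.add_device("50:05:07:68:01:40:AB:CD")         # True, first time seen
print(agg.add_device("50:05:07:68:01:40:AB:CD"))  # False -> violation
```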
Experimental Evaluation • Policy Set • No publicly available trace of SAN policies • Micro-benchmark of eight different classes of policies • 4 EC, 1 SPTH, 1 MPTH, 2 ZL • SAN Simulation • Sizes given as hosts/controllers: 250/50, 500/100, 750/150 and 1000/200 (up to 1000 hosts and 200 controllers) • Policy Evaluation implementations • base: existing evaluation approaches with no optimizations • class: base + classification optimization (able to identify the relevant SAN region) • cach: base + caching optimization • agg: base + aggregation optimization • all: base + class + cach + agg (fully optimized implementation)
Results (1)
[Charts: policy evaluation time vs. SAN size, with hosts/controllers on the x-axis from 250/50 up to 1000/200]
(SPTH) An ESS storage array is not available to open systems if an iSeries host is configured to the array
• base implementation checks for all devices
• class implementation identifies the SPTH policy and checks only newly created paths
• cach implementation uses intermediate result caching
• agg implementation performs the same as base
• all utilizes both classification and caching
(EC & Unique) All devices must have unique WWNs
• base implementation checks for all devices
• class implementation identifies the EC policy
• cach implementation performs the same as base
• agg implementation utilizes a hash table
• all utilizes both classification and the hash table
Results (2)
[Charts: policy evaluation time vs. SAN size, with hosts/controllers on the x-axis from 250/50 up to 1000/200]
Policies: the number of ports of type X in the fabric is less than N; no two host types should exist in the same zone
• Each optimization has a niche of policy classes it works best for
• Together, the optimizations significantly reduce response times (the all implementation)
• Exploring parallel policy evaluation and MPTH evaluation
Related Work • Autonomic Management • Self-* Framework (CMU) • Planning: Minerva, Hippodrome, Ergastulum, Appia (HP) • Disaster Recovery Planning (Keeton et al., HP) • What-if Analysis • Auto-Admin (Microsoft), SAN and disk simulators, some planning work, IBM Volume Placement Advisor • What-if for self-* (Thereska et al., CMU) • Onaro SANscreen™ • Policy based Storage Management • All major vendors (EMC, IBM, HP, Veritas) • Agrawal et al. (IBM)
Summary of Contributions • Impact analysis for policy enabled SANs • Impact of user actions or new policies • Impact due to triggered user actions • Temporal impact analysis • Efficient Policy Evaluation • Classification • Caching • Aggregation • Policy Classification Framework
Conclusions and Future Work • Change management • Proactive impact analysis is important for reducing lead times • Analysis needs to incorporate policy based management • Policy based management • Increasingly accepted management philosophy • Tools, like Zodiac, are needed to support its deployment • Policy specification • Should hint at how to evaluate … • Possible static analysis • Other optimizations • SMI-S layer • Batch policy evaluation (similar to SQL query optimization?)
THANK YOU
Comments and suggestions are welcome: aameek@cc.gatech.edu, {madhukar, kaladhar}@us.ibm.com
[Temporal-analysis figure: four time-series panels showing Controller Capacity Util (%), Link-1 Traffic (MB/s), Total Policy Violations (#), and Perf. Policy Violations (#) over time, with the events below marked a-i on the curves]
a. Due to a workload burst, the utilization threshold is reached. A policy is violated. The action throttles the workloads.
b. Workload throughput drops. Some performance policies are violated.
c. All violated policies are initial performance policies.
d. Link utilization drops after workload throttling.
e. More performance policies get violated.
f. The next set of violated policies trigger some operations, e.g., rezoning to create more flows. Some zoning policies also get violated.
g. The violated zoning policy shuts down the system.
h. Storage utilization stops when the system shuts down.
i. Zero link traffic when the system shuts down.