Autonomic Runtime System: Design and Evaluation for SAMR Applications*
Salim Hariri
High Performance Distributed Computing Laboratory, The University of Arizona
http://www.ece.arizona.edu/~hpdc
* Supported by NSF, DOE, DARPA, Intel, Raytheon and AOL grants
Outline
• Motivation and objectives
• Autonomia: An Autonomic Control and Management Environment
• Self-Optimization
• Self-Protection
• Concluding Remarks
Information Technology and Biology Convergence
• Our system design methods and management tools seem inadequate for handling the complexity, size, and heterogeneity of today's and future information systems
• Biological systems have evolved strategies to cope with dynamic, complex, highly uncertain constraints
Current Design and Development of Computing Systems
• Different fields evolved separately and targeted only a few domains/applications
New System Construction: Part-to-the-Whole Approach
• Adds complexity
• High cost
• Interoperability issues
Autonomic Computing System: Holistic Approach
[Figure: autonomic building blocks (self-healing, self-optimizing, self-configuring, and self-protecting components) composed into autonomic computing systems. Example systems shown: a secure, fault-tolerant system and a high-performance, fault-tolerant system.]
Autonomia: An Autonomic Control and Management Environment
• Provide dynamically programmable control and management services to support the development and deployment of autonomic applications
• Provide Autonomic Runtime Services (self-healing, self-configuring, self-protecting, self-optimizing)
• Provide automated deployment, registration, and discovery of autonomic components
• Provide automated configuration of autonomic applications and system resources
[Figure: Autonomia architecture diagram. Labeled elements: User's Application; Application Management Editor; Autonomic Runtime System; Autonomic Runtime Services (Self Configuring, Self Healing, Self Optimizing, Self Protecting); Knowledge (AIK Repository, ACA Specifications, Policy, Component State, Resource State); Application Runtime Manager (ARM); Policy Engine; Planning Engine; Scheduling Engine; Monitoring & Analysis Engine; Event Server; Coordinator; CRM (Component Runtime Manager); computational components ACA1 ... ACAm; VEE (Virtual Execution Environment) instances VEE(App1), VEE(App2), ..., VEE(Appn); High Performance Computing Environment (HPCE).]
Autonomia Process Flow
[Figure: the Autonomia architecture (User's Application, Application Management Editor, Autonomic Runtime System with its runtime services and knowledge, ARM, CRMs, and per-application VEEs over the HPCE) annotated with process-flow steps 1 through 4.]
[Figure: Wildfire Autonomic Runtime Manager (WARM) with the Dynamic Data Driven Wildfire Model. Labeled data sources: terrain characteristics, regional weather, local weather (temperature, humidity, wind speed, wind direction, clouds, precipitation, lightning), sensors, survey flights, fuel conditions, firefighting activities, smoke locations and concentration, GPS, satellite. Labeled result: predicted fire behavior (location, intensity, geometry, propagation). Labeled WARM elements: analysis objectives, natural region characterization (NR1 burned, NR2 burning, NR3 unburned), active performance model, planning engine, knowledge repository, resource state, application state, monitor, system capability module (CPU, memory, bandwidth, availability, access policy), resource history module, autonomic scheduling, execution, virtual computation units (VCUs), virtual resource units. Labeled platforms in the heterogeneous, dynamic computational environment: Beowulf cluster, IBM SP2, Linux, MPP. The actual wildfire is shown alongside the model development environment.]
Forest Fire Cell Space: Dynamic Repartitioning
[Figure: initial partitioning of the cell space into natural regions (NR2, NR3, NR5), followed by repartitioning with finer gridding in the burning zone and coarser gridding in the burned zone.]
Wild Fire Simulation Physics
• The entire area is represented as a 2-D cell space. The weather and vegetation conditions are assumed to be uniform within a cell, but may vary across the cell space.
• When a cell is ignited, its state changes from "unburned" to "burning". During its "burning" phase, the fire propagates to its eight neighbors along the eight directions.
• As the simulation time advances, the fire propagates from the first ignition cell to other cells.
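The cell-space rules above can be sketched as a small cellular automaton. This is only an illustration under stated assumptions: the slides do not give the spread-rate model, so the sketch assumes deterministic spread to every unburned neighbor and a fixed burn duration (`burn_steps`). The function names `ignite` and `step` are hypothetical.

```python
# Minimal sketch of fire spread on a 2-D cell space.
# ASSUMPTIONS (not from the slides): spread is deterministic and every
# burning cell burns for a fixed number of steps (burn_steps).
UNBURNED, BURNING, BURNED = 0, 1, 2

# The eight neighbor directions: N, NE, E, SE, S, SW, W, NW.
DIRECTIONS = [(-1, 0), (-1, 1), (0, 1), (1, 1),
              (1, 0), (1, -1), (0, -1), (-1, -1)]


def ignite(rows, cols, r0, c0, burn_steps=1):
    """Create an unburned cell space with one ignition cell at (r0, c0)."""
    grid = [[UNBURNED] * cols for _ in range(rows)]
    burn_left = [[0] * cols for _ in range(rows)]
    grid[r0][c0] = BURNING
    burn_left[r0][c0] = burn_steps
    return grid, burn_left


def step(grid, burn_left, burn_steps=1):
    """Advance the cell space by one time step; returns new copies."""
    rows, cols = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]
    new_left = [row[:] for row in burn_left]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != BURNING:
                continue
            # A burning cell ignites each unburned neighbor.
            for dr, dc in DIRECTIONS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNBURNED:
                    new_grid[nr][nc] = BURNING
                    new_left[nr][nc] = burn_steps
            # The cell itself burns out once its burn time is used up.
            new_left[r][c] -= 1
            if new_left[r][c] <= 0:
                new_grid[r][c] = BURNED
    return new_grid, new_left
```

Counting the `BURNING` cells after each call to `step` gives the per-step computational load that the later slides try to balance across processors.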
Parallel Wild Fire Simulation Analysis
• The composition of execution time at time step t for 4 processors.
• To decrease T(t), make the computation time on each processor as even as possible, which minimizes the synchronization time.
• The Imbalance Ratio (IR) characterizes the degree of imbalance.
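A short sketch may help make the relationship between computation time, synchronization time, and imbalance concrete. The slide's exact formulas for T(t) and IR are not reproduced in this transcript, so the definitions below are assumptions chosen for illustration: the step time is taken as the slowest processor's computation time plus a communication term, and the imbalance ratio is taken as (max - min) / mean of the per-processor computation times.

```python
def step_time(comp_times, comm_time=0.0):
    """Time of one step: every processor waits for the slowest one.
    ASSUMED model, not the slide's formula."""
    return max(comp_times) + comm_time


def sync_times(comp_times):
    """Idle (synchronization) time of each processor in one step."""
    slowest = max(comp_times)
    return [slowest - t for t in comp_times]


def imbalance_ratio(comp_times):
    """ASSUMED definition: spread of computation times relative to the mean."""
    mean = sum(comp_times) / len(comp_times)
    return (max(comp_times) - min(comp_times)) / mean
```

With computation times `[4.0, 2.0, 1.0, 1.0]` on 4 processors, three processors sit idle for 2 to 3 time units per step; with `[2.0, 2.0, 2.0, 2.0]` the same total work finishes in half the step time and the ratio drops to zero.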
Fire Simulation Example
• The example above shows the imbalance ratio at different time steps (t = 1, t = N, t = 2N). As the simulation advances, the imbalance gets worse.
Self-Optimization
• Monitor the state of the fire simulation to obtain the computation load at any time step
• Monitor the state of the underlying system to obtain the computation capacity
• Monitor the imbalance ratio at any time step
• If the imbalance ratio is larger than a given threshold, dynamically adjust the workload among processors at run time
Self-Optimization Algorithm
• Obtain the total workload at time t
• Estimate the computation time of one burning cell on processor p, taking the system load into account, where L(p,t) is the length of the CPU queue on processor p at time t
• Calculate the average execution time of one burning cell
Self-Optimization Algorithm (cont'd)
• To balance the load on each processor, the processor allocation factor (PAF) is defined as inversely proportional to the processor execution time with respect to the average execution time.
• Calculate the Processor Load Ratio (PLR), which characterizes the capacities of the processors. Note that:
• Calculate the workload assigned to processor p at time step t, workload(p,t)
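The algorithm steps can be sketched as follows. The slide formulas are missing from this transcript, so every formula here is an assumption made for illustration: the per-cell time is modeled as a base time scaled by (1 + CPU queue length), PAF is the average per-cell time divided by the processor's per-cell time, and PLR is PAF normalized so the ratios sum to 1. All function names are hypothetical.

```python
def cell_times(base_times, queue_lengths):
    """Estimated time for one burning cell on each processor.
    ASSUMED model: base time scaled by (1 + CPU queue length L(p, t))."""
    return [b * (1 + q) for b, q in zip(base_times, queue_lengths)]


def processor_load_ratios(times):
    """PAF is inversely proportional to a processor's per-cell time relative
    to the average; PLR normalizes PAF so the ratios sum to 1 (assumed)."""
    avg = sum(times) / len(times)
    paf = [avg / t for t in times]
    total = sum(paf)
    return [a / total for a in paf]


def assign_workload(total_cells, ratios):
    """Split an integer number of burning cells according to the ratios.
    Leftover cells go to the largest fractional shares."""
    shares = [total_cells * r for r in ratios]
    cells = [int(s) for s in shares]
    leftover = total_cells - sum(cells)
    order = sorted(range(len(shares)),
                   key=lambda p: shares[p] - cells[p], reverse=True)
    for p in order[:leftover]:
        cells[p] += 1
    return cells
```

For example, per-cell times of `[1, 2, 2, 4]` and 90 burning cells yield an assignment of `[40, 20, 20, 10]` cells, so every processor needs the same 40 time units for the step.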
Fire Simulation Example with Self-Optimization Algorithm
• With the self-optimization algorithm, the imbalance decreases dramatically at every time step shown (t = 1, t = N, t = 2N).
Wildfire Autonomic Runtime Manager
[Figure: WARM control loop with numbered steps 1 through 8. Labeled elements: Run (DDWM); Online Monitoring and Analysis (monitor, active performance model, natural regions NR1 burned, NR2 burning, NR3 unburned); Online Planning (planning engine, knowledge repository, resource state, application state); Autonomic Scheduling (scheduler, system capability module with CPU, memory, bandwidth, availability, and access policy; resource history module); Execution (DDWM VRUs); virtual computation units (VCUs); heterogeneous, dynamic computational environment (Beowulf cluster, IBM SP2, Linux, MPP).]
Experimental results
• Problem size is 64K and the number of processors is 8
• With self-optimization, the imbalance ratio is kept close to the threshold. Without self-optimization, the imbalance ratio grows as the simulation advances
Experimental results (cont'd)
• Problem size is 64K and the number of processors is 8.
• Without self-optimization, the per-processor execution times for one time step become increasingly uneven as the simulation advances.
• With self-optimization, the per-processor execution times for one time step stay almost evenly distributed as the simulation advances.
Experimental results (cont'd)
• Problem size (256*256 = 64K)
  Number of Processors   Execution Time with Static Partition (s)   Execution Time with Dynamic Partition (s)   Percentage Improvement
  8                      2441.88                                    1540.58                                     36.91%
  16                     1824.43                                    1132.79                                     37.91%
• Problem size (512*512 = 256K)
  Number of Processors   Execution Time with Static Partition (s)   Execution Time with Dynamic Partition (s)   Percentage Improvement
  8                      16868.04                                   11244.40                                    33.34%
  16                     11121.66                                   7859.89                                     29.33%
  32                     9093.39                                    6092.23                                     33%
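The percentage improvement figures are consistent with the relative reduction in execution time, (static - dynamic) / static. The small check below recomputes them from the reported times; the function name `improvement` is just illustrative.

```python
def improvement(static_s, dynamic_s):
    """Percentage reduction in execution time from static to dynamic partitioning."""
    return 100.0 * (static_s - dynamic_s) / static_s


# (static seconds, dynamic seconds, reported improvement in percent)
reported = [
    (2441.88, 1540.58, 36.91),
    (1824.43, 1132.79, 37.91),
    (16868.04, 11244.40, 33.34),
    (11121.66, 7859.89, 29.33),
    (9093.39, 6092.23, 33.0),
]

for static_s, dynamic_s, pct in reported:
    print(f"{static_s:>9.2f} -> {dynamic_s:>9.2f}: "
          f"{improvement(static_s, dynamic_s):.2f}% (reported {pct}%)")
```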
Memory-based Proactive Runtime Partitioning
• Optimize performance using a memory-based approach
• Minimize the number of page faults and balance work among processors
• Memory function model for RM3D
• W is the application workload, a_i are PF-based heuristics
• Memory-based processor grouping and workload partitioning
• Lightly (X-), moderately (X), or heavily (X+) loaded groups based on a 2-level threshold, with N-, N, and N+ processors respectively
• Work in group X- transferred to X+ with unit of work being
• Sort processors in X+ in ascending order of available memory
• Checks are made for processors with the corresponding least available memory
• Threshold conditions for work transfers must be met
• After work transfers, new memory-based work partitioning ratios are computed as
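The grouping and sorting steps can be illustrated with a minimal sketch. The slides do not state the load measure or the threshold values, so the sketch assumes that load is the fraction of memory in use and that the two thresholds are 0.4 and 0.8; both the names and the numbers are hypothetical, and the work-transfer formulas are not reproduced here.

```python
def group_by_memory_load(used_fraction, low=0.4, high=0.8):
    """Place processors in lightly (X-), moderately (X), or heavily (X+)
    loaded groups using a two-level threshold.
    ASSUMPTION: load is the fraction of memory in use; low/high are made up."""
    groups = {"X-": [], "X": [], "X+": []}
    for proc, used in enumerate(used_fraction):
        if used < low:
            groups["X-"].append(proc)
        elif used < high:
            groups["X"].append(proc)
        else:
            groups["X+"].append(proc)
    return groups


def sort_by_available_memory(procs, available_mb):
    """Order processors by ascending available memory (least memory first)."""
    return sorted(procs, key=lambda p: available_mb[p])
```

The group sizes `len(groups["X-"])`, `len(groups["X"])`, and `len(groups["X+"])` correspond to N-, N, and N+ on the slide.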
Memory-based Proactive Runtime Partitioning
• Better performance in moderately and heavily loaded scenarios
• Most processors have less available memory
• Frequent page faults result in long application delays
• The memory-based algorithm yields better performance
[Figure: memory-based proactive adaptation performance gain for the RM3D application with base grid size 128*32*32 on 8 processors]
CPU-based Proactive Runtime Partitioning
• The adaptive system-sensitive partitioner uses system capacities and the obtained performance function to compute the relative computational capacity of each processor
• System capacity calculation
• With N processors, the total work to be assigned is L
• The runtime monitors the application and system state
• Application state: level of refinement; number, shape, and aspect ratio of refined patches
• System state: computational load, memory availability, link bandwidth
• The performance engine selects the appropriate performance function (PF) to predict the execution time of the application for the next time step
• The predicted quantity is the execution time on processor k
• The PF of RM3D on processor k for a given load X1 and AMR level X2 is empirically defined as:
CPU-based Proactive System-Sensitive Runtime Partitioning
[Figure: CPU-based proactive partitioning performance gain on 16 processors (base grid size: 64*16*16)]
Autonomia Self-Healing
[Figure: self-healing architecture. Labeled elements: user application; Application Management Editor; Autonomic Runtime System; Autonomic Middleware Services; Self-Healing Service; AIK; Application Runtime Manager; Application Fault Manager with a self-healing monitoring and analyzing engine (monitoring, analyzer) and a self-healing planning and execution engine (planning, execution); knowledge; event server; mobile agent system; Component Fault Managers; heterogeneous environment.]
Self-Protection Methodology
[Figure: labeled elements are online monitoring, adaptive analysis, data mining, statistic engine, self-healing engine, and the real network running environment.]
Measurement Attributes for Different Protocols
• Inside a network element, the measurement attributes can be monitored at different protocol layers.
• During an attack (DoS attack, SQL Slammer worm, email worm, etc.), significant deviations in these attributes can be observed.
Illustrative Network Example
• Server networks: 12 routers and 30 servers
• Client networks: 150 clients and 30 routers
• Router-to-router links are 100 Mbps; router-to-client-node links are 30 Mbps and 10 Mbps
• Networks shown: Client Net 0, Client Net 1, Client Net 2, Client Net 3, Server Net 1, Server Net 2
• Traffic configuration:
  • Legitimate client traffic through the same interface as attack traffic to other servers
  • Legitimate client traffic through a different interface to the attacked server
  • Legitimate client traffic through the same interface to the attacked server and towards attack targets
  • Legitimate server traffic (heavy) through a different interface and towards other clients
  • Attack traffic
Abnormality Distance (AD)
• The Abnormality Distance of the measurement attributes is used as an abnormality metric for profile modeling of component behavior. It is computed from the online measurement of attribute k together with the mean and variance of that attribute under normal operating conditions.
• The figure (ADtcp_out versus packet number) shows the AD based on a single measurement attribute; a larger magnitude of ADtcp_out indicates abnormal behavior that might be due to an attack.
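As an illustration of a single-attribute abnormality distance, the sketch below fits the mean and variance from samples taken under normal operation and scores an online measurement against them. The slide's formula is not reproduced in this transcript; the variance-normalized squared deviation used here is an assumed form, and the function names are hypothetical.

```python
from statistics import fmean, pvariance


def fit_profile(normal_samples):
    """Mean and variance of one attribute under normal operation."""
    return fmean(normal_samples), pvariance(normal_samples)


def abnormality_distance(x, mean, variance):
    """ASSUMED form: squared deviation from the normal mean,
    normalized by the normal variance (variance must be non-zero)."""
    return (x - mean) ** 2 / variance


# Example: outgoing TCP segment rate observed during normal operation.
mean, var = fit_profile([98, 100, 102, 100])
print(abnormality_distance(100, mean, var))  # measurement at the normal mean
print(abnormality_distance(110, mean, var))  # measurement far from the normal mean
```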
Multivariate Analysis Techniques on Network Attack Detection
• Measurement attributes
  • tcpOut: rate of legitimate outgoing TCP segments
  • tcpTotal: rate of legitimate plus spoofed outgoing TCP segments
• NRC: Normal Region Center, which is the baseline profile for the normal state
• AD: Abnormality Distance
[Figure: normal region in the tcpOut/tcpTotal plane, bounded by LCLtcpout, UCLtcpout, LCLtcptotal, and UCLtcptotal and centered at NRC, with a point A outside the region at abnormality distance AD.]
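A two-attribute normal region can be sketched as a box of per-attribute limits around the normal region center. The sketch assumes that LCL and UCL denote lower and upper control limits set at the mean plus or minus three standard deviations of normal-operation samples, and that the distance from the NRC is measured in units of each attribute's half-width; neither choice is confirmed by the slides, and the function names are hypothetical.

```python
from math import sqrt
from statistics import fmean, pstdev


def control_limits(samples, k=3.0):
    """Return (LCL, center, UCL) for one attribute.
    ASSUMPTION: limits are mean +/- k standard deviations."""
    m, s = fmean(samples), pstdev(samples)
    return m - k * s, m, m + k * s


def classify(point, limits):
    """point: {attribute: value}; limits: {attribute: (LCL, center, UCL)}.
    Returns (inside_normal_region, distance_from_NRC).
    Requires non-constant samples so that each half-width is non-zero."""
    inside = all(limits[a][0] <= v <= limits[a][2] for a, v in point.items())
    dist = sqrt(sum(
        ((v - limits[a][1]) / ((limits[a][2] - limits[a][0]) / 2)) ** 2
        for a, v in point.items()))
    return inside, dist


limits = {
    "tcpOut": control_limits([90, 100, 110, 100]),
    "tcpTotal": control_limits([95, 105, 115, 105]),
}
# Spoofed traffic leaves tcpOut unchanged but inflates tcpTotal.
print(classify({"tcpOut": 100, "tcpTotal": 400}, limits))
```

In this toy example the point looks normal along tcpOut alone; only the joint view with tcpTotal places it outside the normal region.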
Validation on Attacker Side – Spoofed TCP SYN Attack
• Attack intensity and duration are adjustable
• TCP SYN attack traffic is spoofed
• The number of incoming/outgoing packets alone does not reveal the attack
• Analyzing it jointly with the total TCP network activity can reveal the attack
Autonomia Self-Protection Architecture
[Figure: labeled elements are online monitoring (raw traffic with respect to metrics 1 through n), the analysis engine (information theory, abnormality function with respect to metrics 1..m, normal/abnormal characterization), the policy translator, and the autonomic runtime engine, whose actions are to change the network topology and to change network configuration parameters.]
Working Flow of the Analysis Engine
• Information theory is used to identify the most important features that can be extracted from network data.
• A genetic algorithm is run on the training data to obtain the threshold and coefficients used by the linear detection rule.
• The threshold and coefficients are then used to detect a wide range of attacks during the testing period.
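Two pieces of this workflow lend themselves to a short sketch: ranking features with an information-theoretic score, and applying a linear rule with a threshold. The slides do not say which measure is used, so information gain (reduction in label entropy) is assumed here; the coefficients and threshold in the example are made-up values standing in for what the genetic algorithm would produce, and all function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - conditional


def linear_rule(features, coefficients, threshold):
    """Flag a record as an attack when the weighted sum exceeds the threshold."""
    return sum(c * x for c, x in zip(coefficients, features)) > threshold


labels = ["normal", "normal", "attack", "attack"]
print(information_gain(["tcp", "tcp", "icmp", "icmp"], labels))  # informative feature
print(information_gain(["a", "b", "a", "b"], labels))            # uninformative feature
```

Features with the highest gain would be kept, and the linear rule would then be evaluated on those features only.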
Network Attack Feature Extraction
• Discrete features
• The base dataset has a larger sample size
• Discrete features provide little semantic information
[Figure panels: Total Dataset, Probe + Normal, DoS + Normal, R2L + Normal, U2R + Normal]
Network Attack Feature Extraction (Cont.)
• Continuous features
• Compared with the discrete features, some continuous features provide more information for the final detection
• The information provided by the continuous features is much more meaningful
• A partition strategy is used to discretize the continuous features
• Heuristic algorithms (e.g., a genetic algorithm) are used to determine the optimal partition
• Combining both discrete and continuous features provides a better detection rate
[Figure panels: Discrete Features on Total Dataset, Continuous Features on Total Dataset]
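Discretizing a continuous feature by partitioning can be illustrated with a single cut point. Note the substitution: instead of the genetic algorithm mentioned on the slide, this sketch uses an exhaustive search over candidate cut points and scores each one by information gain, which is an assumed criterion. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Find the cut point that splits a continuous feature into two bins
    with the highest information gain (exhaustive search, not a GA)."""
    n = len(labels)
    base = entropy(labels)
    distinct = sorted(set(values))
    # Candidate cuts lie midway between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best_point, best_gain = None, -1.0
    for cut in candidates:
        left = [y for x, y in zip(values, labels) if x <= cut]
        right = [y for x, y in zip(values, labels) if x > cut]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best_gain:
            best_point, best_gain = cut, gain
    return best_point, best_gain


values = [1, 2, 3, 10, 11, 12]
labels = ["normal"] * 3 + ["attack"] * 3
print(best_cut(values, labels))
```

A heuristic search such as a genetic algorithm becomes attractive when several cut points across many features have to be chosen jointly, where exhaustive enumeration is no longer practical.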
Experimental Results
• We compare the results of our approach, which is based on discrete features, with those of a fuzzy classifier evolved using Ctree and those of the winning group in the KDDCup'99 contest.
Results – Discrete vs. Cont. & Combined
• We compare the results of using discrete features, continuous features, and both combined
Summary and Concluding Remarks
• Increased complexity, heterogeneity, uncertainty, and scale require new paradigms to design, control, and manage systems and applications
• Systems and applications need to operate reliably, securely, efficiently, and cost-effectively
• A holistic approach is needed that can dynamically integrate and address all these issues simultaneously across the layers of the system and application hierarchy
• Autonomic computing provides an interesting, pragmatic approach to address these issues
• Many challenges lie ahead, including composing and analyzing in real time the operations and states of systems and applications; new bio-inspired metrics are needed that accurately characterize and quantify the normal and abnormal states of systems and applications