This workshop focuses on managing the complex and dynamic nature of grid and web services. It addresses the challenges of configuration, monitoring, and fault tolerance, while ensuring scalability, performance, interoperability, and usability. The workshop also explores the use of messaging infrastructure to provide a robust management architecture.
Managing Grid and Web Services and their exchanged messages. OGF19 Workshop on Reliability and Robustness, Friday Center, Chapel Hill NC, January 31 2007. Authors: Harshawardhan Gadgil (his PhD topic), Geoffrey Fox, Shrideep Pallickara, Marlon Pierce, Community Grids Lab, Indiana University. Presented by Geoffrey Fox, gcf@indiana.edu
Management Problem I • Characteristics of today’s (Grid) applications • Increasing complexity • Components widely dispersed and disparate in nature and access • Span different administrative domains • Under differing network / security policies • Limited access to resources due to presence of firewalls, NATs etc… (major focus in prototype) • Dynamic • Components (Nodes, network, processes) may fail • Services must meet • General QoS and Life-cycle features • (User defined) Application specific criteria • Need to “manage” services to provide these capabilities • Dynamic monitoring and recovery • Static configuration and composition of systems from subsystems
Management Problem II • Management Operations* include • Configuration and Lifecycle operations (CREATE, DELETE) • Handle RUNTIME events • Monitor status and performance • Maintain system state (according to user defined criteria) • Protocols like WS-Management/WS-DM define inter-service negotiation and how to transfer metadata • We are designing/prototyping a system that will manage a general worldwide collection of services and their network links • Need to address Fault Tolerance, Scalability, Performance, Interoperability, Generality, Usability • We are starting with our messaging infrastructure as • we need it to be robust in the Grids we use it in (sensor and material science) • we use it in the management system itself • and it has critical network requirements • *From WS – Distributed Management http://devresource.hp.com/drc/slide_presentations/wsdm/index.jsp
Core Features of Management Architecture • Remote Management • Allow management irrespective of the location of the resource (as long as that resource is reachable via some means) • Traverse firewalls and NATs • Firewalls complicate management by disabling access to some transports and access to internal resources • Utilize tunneling capabilities and multi-protocol support of messaging infrastructure • Extensible • Management capabilities evolve with time. We use a service oriented architecture to provide extensibility and interoperability • Scalable • Management architecture should scale as the number of managees increases • Fault-tolerant • Management itself must be fault-tolerant. Failure of transports OR management components should not cause the management architecture to fail.
Management Architecture built in terms of • Hierarchical Bootstrap System – Robust itself by Replication • managees in different domains can be managed with separate policies for each domain • Periodically spawns a System Health Check that ensures components are up and running • Registry for metadata (distributed database) – Robust by standard database techniques and our system itself for Service Interfaces • Stores managee specific information (User-defined configuration / policies, external state required to properly manage a managee) • Generates a unique ID per instance of registered component
Architecture: Scalability: Hierarchical distribution • Passive Bootstrap Nodes • Only ensure that all child bootstrap nodes are always up and running • Active Bootstrap Nodes • E.g. /ROOT/EUROPE/CARDIFF • Responsible for maintaining a working set of management components in the domain • Always the leaf nodes in the hierarchy • [Figure: a replicated ROOT node with child domains EUROPE, US, … and leaf nodes CGL, FSU, CARDIFF; each parent spawns its children if not present and ensures they are up and running]
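The parent-watches-child behaviour on this slide can be sketched roughly as follows. This is a minimal illustration, not the prototype's code: the `BootstrapNode` class, the `running` flag (a stand-in for a real process check and spawn) and the `/ROOT/US/CGL` path are assumptions made for the example; only `/ROOT/EUROPE/CARDIFF` appears on the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BootstrapNode:
    """Hypothetical model of one bootstrap node in the hierarchy."""
    path: str
    children: List["BootstrapNode"] = field(default_factory=list)
    running: bool = True  # stand-in for "process is alive"

    @property
    def is_active(self) -> bool:
        # Active bootstrap nodes are always the leaves of the hierarchy.
        return not self.children


def health_check(node: BootstrapNode,
                 respawned: Optional[List[str]] = None) -> List[str]:
    """Each parent respawns any child that is down; returns respawned paths."""
    if respawned is None:
        respawned = []
    for child in node.children:
        if not child.running:
            child.running = True  # stand-in for spawning the child process
            respawned.append(child.path)
        health_check(child, respawned)
    return respawned


# Example hierarchy (placement of CGL under US is assumed for illustration).
cardiff = BootstrapNode("/ROOT/EUROPE/CARDIFF")
cgl = BootstrapNode("/ROOT/US/CGL")
root = BootstrapNode("/ROOT", [
    BootstrapNode("/ROOT/EUROPE", [cardiff]),
    BootstrapNode("/ROOT/US", [cgl]),
])

cardiff.running = False          # simulate a failed leaf node
print(health_check(root))        # paths that had to be respawned
```

In this sketch a passive node does nothing except watch its children, which matches the slide's description; what an active node does to maintain its management components is left out.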
Management Architecture built in terms of • Messaging Nodes form a scalable messaging substrate • Message delivery between managers and managees • Provides transport protocol independent messaging between distributed entities • Can provide secure delivery of messages • Managers – Active stateless agents that manage managees. • Both general and managee-specific management threads perform the actual management • Multi-threaded to improve scalability with many managees • Managees – what you are managing (managee / service to manage) – our system makes them robust • There is NO assumption that the managed system uses Messaging nodes • Wrapped by a Service Adapter which provides a Web Service interface • Assumed that ONLY modest state needs to be stored/restored externally. A managee could front-end a huge database and restore that database itself
Architecture: Conceptual Idea (Internals) • [Figure; recoverable labels follow] • Always ensure up and running • Periodically spawn • WS Management • Manager processes periodically check for available managees to manage, and also read/write managee-specific external state from/to the registry • Connect to Messaging Node for sending and receiving messages • User writes system configuration to registry
Architecture: User Component • “Managee Characteristics” are determined by the user. • Events generated by the Managees are handled by the manager • Event processing is determined via WS-Policy constructs • E.g. Wait for user’s decision on handling specific conditions • “Auto Instantiate” a failed service, but the service is responsible for doing this consistently even when the “failed” service has not failed but is just unreachable • Administrators can set up services (managees) by defining characteristics • Writing information to registry can be used to start up a set of services
Issues in the distributed system: Consistency • Examples of inconsistent behavior • Two or more managers managing the same managee • Old messages / requests arriving after new requests • Multiple copies of managees existing at the same time / Orphan managees leading to inconsistent system state • Use a Registry-generated, monotonically increasing Unique Instance ID (IID) to distinguish between new and old instances of Managers, Managees and Messages • Requests from manager thread A are considered obsolete IF IID(A) < IID(B), where B is a more recently registered manager thread for the same managee • Service Adapter stores the last known MessageID (IID:seqNo), allowing it to differentiate between duplicates AND obsolete messages • Periodic renewal with registry • IF IID(manageeInstance_1) < IID(manageeInstance_2) • THEN manageeInstance_1 is deemed OBSOLETE • SO EXECUTE Policy (e.g. instruct manageeInstance_1 to silently shut down)
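One way a Service Adapter might use the stored MessageID (IID:seqNo) is sketched below. The class and method names are hypothetical, and the ordering rule (compare IID first, then sequence number, and treat anything lower than the last accepted ID as obsolete) is an assumption made for illustration; the slides state only that the stored ID lets the adapter tell duplicates from obsolete messages.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ServiceAdapter:
    """Hypothetical adapter that remembers the last accepted MessageID."""
    last_message_id: Optional[Tuple[int, int]] = None  # (IID, seqNo)

    def classify(self, iid: int, seq_no: int) -> str:
        """Return 'accept', 'duplicate' or 'obsolete' for an incoming message."""
        incoming = (iid, seq_no)
        last = self.last_message_id
        if last is not None:
            if incoming == last:
                return "duplicate"   # same manager instance, same sequence number
            if incoming < last:      # tuple comparison: IID first, then seqNo
                return "obsolete"    # older manager instance or older request
        self.last_message_id = incoming
        return "accept"


adapter = ServiceAdapter()
for message in [(7, 1), (7, 1), (6, 9), (8, 1), (7, 2)]:
    print(message, adapter.classify(*message))
```

Because IIDs are issued by the registry in increasing order, a message carrying a lower IID than the last accepted one must come from an older manager instance, which is why the comparison puts the IID ahead of the sequence number.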
Issues in the distributed system: Security • Security – Provide secure communication between communicating parties (e.g. Manager <-> Managee) • Publish/Subscribe: Provenance, Lifetime, Unique Topics • Secure Discovery of endpoints • Prevent unauthorized users from accessing the Managers or Managees • Prevent malicious users from modifying messages (thus message interactions are secure when passing through insecure intermediaries) • Utilize NaradaBrokering’s Topic Creation and Discovery* and Security Scheme# • *NB-Topic Creation and Discovery (Grid2005) http://grids.ucs.indiana.edu/ptliupages/publications/NB-TopicDiscovery-IJHPCN.pdf • #NB-Security (Grid2006) http://grids.ucs.indiana.edu/ptliupages/publications/NB-SecurityGrid06.pdf
Implemented: • WS – Specifications • WS – Management (June 2005) parts (WS – Transfer [Sep 2004], WS – Enumeration [Sep 2004] and WS – Eventing) (could use WS-DM) • WS – Eventing (Leveraged from the WS – Eventing capability implemented in OMII) • WS – Addressing [Aug 2004] and SOAP v 1.2 used (needed for WS-Management) • Used XmlBeans 2.0.0 for manipulating XML in custom container. • Currently implemented using JDK 1.4.2 but will switch to JDK1.5 • Released on http://www.naradabrokering.org in February 2007
Performance Evaluation: Results • Extreme case with many catastrophic failures • Response time increases with increasing number of concurrent requests • Response time is MANAGEE-DEPENDENT and the shown times are typical • MAY involve 1 or more Registry accesses, which will increase overall response time • Increases rapidly once the number of managees exceeds roughly 150–200
Performance Evaluation: How much infrastructure is required to manage N managees? • N = Number of managees to manage • M = Max. no. of entities connected to a single messaging node • D = Max. no. of managees managed by a single manager process • R = Min. no. of registry service instances required to provide fault-tolerance • Assume every leaf domain has 1 messaging node. Hence we have N/M leaf domains. • Further, no. of managers required per leaf domain is M/D • Total components in lowest level = (R registry + 1 Bootstrap Service + 1 Messaging Node + M/D Managers) * (N/M such leaf domains) = (2 + R + M/D) * (N/M) • Thus percentage of additional infrastructure is = [(2 + R)/M + 1/D] * 100 %
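The step from the component count to the percentage is a division by N. Written out with the slide's symbols (the names $C_{\text{leaf}}$ and $C_{\text{total}}$ are introduced here only for readability):

```latex
\begin{aligned}
C_{\text{leaf}}  &= R + 1 + 1 + \frac{M}{D} = 2 + R + \frac{M}{D} \\
C_{\text{total}} &= \left(2 + R + \frac{M}{D}\right)\cdot\frac{N}{M} \\
\frac{C_{\text{total}}}{N} \times 100\,\%
  &= \left(\frac{2 + R}{M} + \frac{1}{D}\right) \times 100\,\%
\end{aligned}
```

The result no longer contains N: under these assumptions the relative overhead depends only on M, D and R.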
Performance Evaluation: Research Question: How much infrastructure is required to manage N managees? • Additional infrastructure = [(2 + R)/M + 1/D] * 100 % • A Few Cases • Typical values of D and M are 200 and 800 and assuming R = 4, then Additional Infrastructure = [(2+4)/800 + 1/200] * 100 % ≈ 1.2 % • Shared Registry => there is one registry interface per domain, R = 1, then Additional Infrastructure = [(2+1)/800 + 1/200] * 100 % ≈ 0.87 % • If NO messaging node is used (assume D = 200), then Additional Infrastructure = [(R registry + 1 bootstrap node + N/D managers)/N] * 100 % = [(1+R)/N + 1/D] * 100 % ≈ 100/D % (for N >> R) ≈ 0.5 %
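The three cases can be re-computed with a few lines of Python. The function names are made up for this sketch, and N = 100000 in the last case is an assumed value chosen only to satisfy N >> R.

```python
def overhead_with_messaging(R: int, M: int, D: int) -> float:
    """Additional infrastructure in percent: [(2 + R)/M + 1/D] * 100."""
    return ((2 + R) / M + 1 / D) * 100


def overhead_without_messaging(R: int, N: int, D: int) -> float:
    """Additional infrastructure in percent: [(1 + R)/N + 1/D] * 100."""
    return ((1 + R) / N + 1 / D) * 100


# Typical values from the slide: D = 200, M = 800.
print(f"R=4:              {overhead_with_messaging(4, 800, 200):.3f} %")   # 1.250 %
print(f"shared registry:  {overhead_with_messaging(1, 800, 200):.3f} %")   # 0.875 %
# No messaging node; N = 100000 is an assumed value with N >> R.
print(f"no messaging:     {overhead_without_messaging(4, 100_000, 200):.3f} %")  # 0.505 %
```

The exact values 1.25 % and 0.875 % correspond to the slide's ≈ 1.2 % and ≈ 0.87 %, and the last case approaches 100/D = 0.5 % as N grows.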
Performance Evaluation: XML Processing Overhead • XML processing overhead is measured as the total marshalling and un-marshalling time required. • In case of Broker Management interactions, typical processing time (includes validation against schema) ≈ 5 ms • Broker Management operations are invoked only during initialization and recovery from failure • Reading Broker State using a GET operation involves 5 ms overhead and is invoked periodically (e.g. every 1 minute, depending on policy) • Further, for most operations that change broker state, actual operation processing time >> 5 ms and hence the XML overhead of 5 ms is acceptable.
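As a back-of-envelope check on the periodic GET, assuming the example policy of one GET per broker per minute and the 5 ms figure above, the share of time spent on XML processing is:

```latex
\frac{5\ \text{ms}}{60\ \text{s}} = \frac{5}{60\,000} \approx 8.3 \times 10^{-5} \approx 0.008\,\%
```

This counts only marshalling and un-marshalling time; network and registry costs are not included.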
Prototype: Managing Grid Messaging Middleware • We illustrate the architecture by managing the distributed messaging middleware: NaradaBrokering • This example is motivated by the presence of a large number of dynamic peers (brokers) that need configuration and deployment in specific topologies • Runtime metrics provide dynamic hints on improving routing, which leads to redeployment of the messaging system (possibly) using a different configuration and topology • Can use (dynamically) optimized protocols (UDP v TCP v Parallel TCP) and go through firewalls • Broker Service Adapter • Note NB illustrates an electronic entity that didn’t start off with an administrative Service interface • So add a wrapper over the basic NB BrokerNode object that provides a WS – Management front-end • Allows CREATION, CONFIGURATION and MODIFICATION of brokers and broker topologies
Typical use of Grid Messaging in NASA: Sensor Grid implemented using NB • [Figure labels: GIS Grid, NB, Datamining Grid] • Community Grids Lab, Bloomington IN, CLADE 2006, June 19, 2006
NaradaBrokering Management Needs • NaradaBrokering Distributed Messaging System consists of peers (brokers) that collectively form a scalable messaging substrate • Optimizations and configurations include: • Where should brokers be placed and how should they be connected, e.g. RING, BUS, TREE, HYPERCUBE etc…; each TOPOLOGY has different resource utilization, routing, cost and fault-tolerance characteristics • Static topologies or topologies created using static rules may be inefficient in some cases • E.g. in CAN and Chord a new incoming peer randomly joins nodes in the network. Network distances are not taken into account and hence some lookup queries may span the entire diameter of the network • Runtime metrics provide dynamic hints on improving routing, which leads to redeployment of the messaging system (possibly) using a different configuration and topology • Can use (dynamically) optimized protocols (UDP v TCP v Parallel TCP) and go through firewalls, but there is no good way to make these choices dynamically • These actions are collectively termed Managing the Messaging Middleware
Prototype: Costs (Individual Managees are NaradaBrokering Brokers)
Recovery: Typical Time • Assuming 5 ms read time from registry per managee object
Prototype: Observed Recovery Cost per managee • Time for Create Broker depends on the number & type of transports opened by the broker • E.g. SSL transport requires negotiation of keys and would require more time than simply establishing a TCP connection • If brokers connect to other brokers, the destination broker MUST be ready to accept connections, else topology recovery takes more time.
Conclusion • We have presented a scalable, fault-tolerant management framework that • Adds acceptable cost in terms of extra resources required (about 1%) • Provides a general framework for management of distributed entities • Is compatible with existing Web Service specifications • We have applied our framework to manage Managees that are loosely coupled and have modest external state (important to improve scalability of management process) • Outside effort is developing a Grid Builder which combines BPEL and this management system to manage initial specification, composition, and execution of Grids of Grids (of Services)