"Grid middleware is easy to install, configure, secure, debug and manage - across multiple sites"

"One can't believe impossible things" UK OGSA Evaluation Project (UCL, Imperial, Newcastle, Edinburgh) Paul Brebner, University College London, P.Brebner@cs.ucl.ac.uk

Grid Complexity – The Grid will be BIG

Grid Complexity - growing

Grid Complexity – built on the internet

Grid Complexity – but more complex

Grid Simplicity – Start with something simple • OGSA • OGSI • GT3.2 – exemplar of a Grid SOA • Initially evaluate installation, configuration, and security • Then performance and scalability, deployment, architectural choices, etc.

Grid Realism – But realistic test-bed • Heterogeneous platforms • Linux, Solaris, Windows • Cross-organisational • Four nodes • Independently administered • Firewalls and access restrictions • Security • UK e-Science CA

Grid Confusion – What is Globus? • How is Globus intended to be used? • 1: Science as first-order services: Middleware for building and hosting Grid Applications, by exposing science code as Grid services. • 2: Middleware as services: As a set of high level Grid services, composed to provide new Grid functionality. Science isn’t a first-order service, but is managed by Grid services.

Grid Confusion – Science services or Grid services [Diagram, built up over three slides: clients 1 and 2 and the example science codes D=A+2B+C² and E=mc²]

Grid Confusion – How to evaluate • Do we evaluate GT3 as middleware for hosting Grid services, or as a toolkit for constructing Grid middleware? • If the first, only need GT3 Core – just the container. If the second, need “All Services” (and more – there’s no scheduler).

Grid Simplicity – Incremental • Start with Core Package • Add Security • Then try “All Services” • Simple enough – in theory

Grid Steps – single node • Install OS/HW • Install GT3 • Configure • Deploy • Run

Grid Steps – Multiple sites [Diagram: GT3 nodes at multiple sites] • Interoperate • Secure • Manage

Grid Reality – What we found • Port number management • Host access • Remote visibility of installation, container, services • Installation by System Administrators • Tomcat or Test container • Compilation issues on Solaris • Exponential increase in testing complexity as number of nodes increases.

Grid Reality – What we found • Port number management • Port number conflicts (with other services) • What port is the container running on?

Grid Reality – What we found • Host access • Is the container visible on that port externally? • From which machines? • For which users? • Non-trivial to test/debug if/when something goes wrong
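
The host access questions above can be checked from each client machine with a small probe. This is a minimal sketch, not part of GT3: the function names are illustrative, and the host names and port in the usage note are hypothetical. It only tests TCP reachability, so it says nothing about which users are allowed in.

```python
import socket

def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False

def visibility_report(hosts, port, check=port_is_reachable):
    """Map each host to True/False depending on whether the port is reachable."""
    return {host: check(host, port) for host in hosts}
```

Running `visibility_report(["node-a.example.org", "node-b.example.org"], 8080)` from each site in turn gives a first answer to "from which machines?". A False result does not distinguish a firewall from a stopped container.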

Grid Reality – What we found • Remote visibility of installation, container, services • What infrastructure is installed? • What packages and versions? • How is it configured? • What state is it in?

Grid Reality – What we found • Installation by System Administrators • Division of roles • Didn’t meet expectations • Extra effort to support multiple roles • System Administrators – install, configure and secure • Globus Administrators – test, maintain • Globus Developers – develop, deploy, test/use Grid services

Grid Reality – What we found • Tomcat or Test container • Differences in deployment, configuration, and management • With Tomcat, increased potential for centralised management, and sand-boxing of run-time environment

Grid Reality – What we found • Compilation issues on Solaris • Took longer than expected • Only Linux testing and support can be taken for granted

Grid Reality – What we found • Exponential increase in testing complexity as number of nodes increases • Testing (and maintaining) interoperability between m client machines, and n servers gets complicated. • How well will this scale for 100s, 1000s of nodes?

Grid Reality – Security • In theory just had to • obtain (and update) host, client, and CA certificates • convert • install • configure • generate (and update) proxies. • However, parts of “All Services” package also needed.

Grid Security - What we found • Interactions between security for multiple installations • Essential to test non-secure interoperability first • Windows client-side security • Testing and viewing security configuration • Debugging secure calls • Client side security is programmatic • Security management scalability • Construction and maintenance of user accounts and grid-map file entries.

Grid Security - What we found • Interactions between security for multiple installations • For testing may want • multiple versions, or duplicates (with different configurations) of same versions. • One container with no security, and another container with security • May want test/production environments

Grid Security - What we found • Essential to test non-secure interoperability first • Trying to test interoperability and security simultaneously wasn’t fun

Grid Security - What we found • Windows client-side security • Still haven’t got it working • Not obvious exactly what parts of Globus are needed for client side code with security (no “client plus security” package).

Grid Security - What we found • Testing and viewing security configuration • Need to be able to view/edit and check security configuration for containers and services • Confusion about hierarchical security settings • Virtual Organisations, clusters, servers, containers, factories, services, methods, and instances. • Remotely • Validate security deployment before run-time

Grid Security - What we found • Debugging secure calls (or any stateful service) • Proxy interceptor approach (e.g. TCPMON) won’t work with stateful services • As grid handle returned to client contains the port number of the instance, not the proxy • But proxies are an important design pattern for SOAs… • GT4/WS-RF may be different • Handle resolvers, WS-Addressing and WS-RenewableReferences
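
The proxy problem above can be illustrated by comparing the address the client was configured with against the address in the returned handle. This is a sketch only: the URLs are hypothetical, and the helper names are not part of any Globus API.

```python
from urllib.parse import urlsplit

def endpoint(url):
    """Return the (host, port) a client would connect to for this URL."""
    parts = urlsplit(url)
    return parts.hostname, parts.port

def bypasses_proxy(handle, proxy_url):
    """True if calls made through `handle` would not pass through the proxy."""
    return endpoint(handle) != endpoint(proxy_url)

# Hypothetical addresses: the interceptor listens on 8081 and forwards to a
# container on 8080. The factory call goes through the proxy, but the handle
# that comes back names the container's own port.
PROXY_URL = "http://localhost:8081/ogsa/services/example/CounterFactoryService"
RETURNED_HANDLE = "http://localhost:8080/ogsa/services/example/CounterFactoryService/instance-1"
```

Here `bypasses_proxy(RETURNED_HANDLE, PROXY_URL)` is True: every call on the instance goes straight to port 8080, so the interceptor never sees it.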

Grid Security - What we found • Client side security is programmatic • Client side code modifications required to call services/methods with required protocols • Should be declarative • Sensitive to server side security credentials

Grid Security - What we found • Security management scalability • Construction and maintenance of user accounts and grid-map file entries. • For each server, each user needs an account, and an entry in the container gridmap file (mapping client certificate to account) • May also need service specific gridmap files • Not scalable for large numbers of users, servers, services. • Alternatives? • Tool support • Role based authentication • Shared accounts or certificates
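
The maintenance burden above is easier to see with the file format in hand. The sketch below parses grid-map entries of the usual form, a quoted certificate subject followed by a local account name, and counts how many entries are needed. The subject names and accounts are made up, and the parser is an illustration, not the Globus implementation.

```python
import shlex

def parse_gridmap(text):
    """Parse grid-map text into {certificate subject: [local accounts]}."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = shlex.split(line)  # honours the quotes around the subject
        if len(fields) != 2:
            raise ValueError("malformed grid-map entry: %r" % raw)
        subject, accounts = fields
        mapping[subject] = accounts.split(",")
    return mapping

def entries_needed(n_users, n_servers):
    """One entry (and one account) per user on every server."""
    return n_users * n_servers

# Hypothetical entries for illustration only.
SAMPLE = '''
# container grid-map file
"/C=UK/O=eScience/OU=ExampleSite/CN=alice example" alice
"/C=UK/O=eScience/OU=ExampleSite/CN=bob example" bob,gridguest
'''
```

For 50 users on the four test-bed nodes, `entries_needed(50, 4)` is 200 entries to create and keep current, before any service specific gridmap files.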

Grid Recommendations • If Globus is middleware, then need: • Platform independent, automatic, installation. • Tool support for configuration and deployment creation, validation, viewing and editing. • Management console for grid, nodes, globus packages, containers and services. • Support for remote, location independent, cross-organisational, multiple role scenarios.

Grid Recommendations (continued) • If Globus is middleware, then need: • Remote deployment and management of services. • Remote distributed debugging of grid installations, services, and applications. • Tool support, and more scalable processes for security.

Grid Alternatives • Next we plan to evaluate the two architectural choices in more detail • Science exposed as services, vs science code managed by higher level grid services. • Explore alternative mechanisms for: • Load balancing and resource management • Directory services (service and resource discovery) • Data movement approaches (e.g. SOAP Attachments vs GridFTP)

Grid Performance • First approach (initial results) • Scientific benchmark (SciMark2.0) modified to measure throughput, and invoked as a Stateful Grid Service • Metric is Calls Per Minute (CPM) – one unit of work. • No data movement, just computation and memory load. • JVM: 512MB heap and -server (of course) • Good performance and scalability • Security has minimal overhead • Problem with client side timeouts as response times increase

Grid Performance Tomcat Fastest: 3.6s (Edinburgh) Slowest: 25s (UCL)

Grid Performance 95% of predicted maximum throughput

Grid Performance • Tomcat vs Test container • No difference on 3 out of 4 nodes • But 67% faster on one node (Newcastle, slowest Intel box) • Attachments will work with GT3 and Tomcat • But not with security • Limit of 1GB (DIME) • Bug in Axis – doesn’t clean up temporary files.

Grid Performance • Stateful instances can be problematic • Intermittent unreliability • On some runs, 1 exception in 300 calls (reliability of 0.9967) • But non-repeatable, SOAP/network related? • What is the safe response to exceptions? Can’t just retry. • Possible to kill container (relies on clients being well behaved): • By invoking same instance/method more than once. • By consuming container resources • But instances can be passivated/activated in theory • Could be used to enable fine-grain (per instance) control over resource usage.
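
The reliability figure above is simple arithmetic, sketched below. The second function goes one step further and assumes failures are independent from call to call. That is an assumption of this sketch, not a measured result, since the observed failures were non-repeatable.

```python
def observed_reliability(failures, calls):
    """Fraction of calls that completed without an exception."""
    return 1 - failures / calls

def chance_all_succeed(reliability, n_calls):
    """Probability that n_calls in a row all succeed, assuming independence."""
    return reliability ** n_calls

r = observed_reliability(1, 300)       # 1 exception in 300 calls
print(round(r, 4))                     # 0.9967
print(chance_all_succeed(r, 300))      # roughly 0.37 under the independence assumption
```

Under that assumption, a run of 300 calls completes cleanly only about a third of the time, which is why the lack of a safe retry for stateful instances matters.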

Grid Deployment • How to install and configure Grid infrastructure and services - scalably and securely? • Install GT3 infrastructure and security manually • MMJFS allows executable code to be staged automatically (But not services - could provide a deployment service). • Install bootstrapping code, and then install and deploy all other code and security automatically. • Using SmartFrog (HP) in the lab, and then test-bed. • Configuring GT3 security remotely is an open issue, as is “trust” with System Administrators.

Grid Dreams - Debugging • Debugging distributed systems is tricky • Need better support for cross-cutting non-functional concerns such as deployment and debugging. • (One) problem with debugging services is not knowing the context of errors (to aid diagnosis or cure) – a service is just an interface. • Deployment aware debugging: • Starting from functional work-flows, generate deployment-flows, which are executed prior to, or concurrent with, functional work-flows. • If failure in functional work-flow, then corresponding deployment-flow is examined to determine likely causes, and parts are re-executed.

Grid Dreams - Debugging • Backtrack through deployment steps (Like peeling an onion) • Some steps will need to be reversed • Track dependencies, and redundant operations. • This approach may fix an (interesting) sub-class of problems: • Those which can be fixed by simply redoing (or replicating) (part of) the installation, E.g. • Intermittent failure of container or services • Resource starvation or overload • Security problems that can be fixed with reconfiguration or refresh of certificates/proxies. • But not: • network, or all configuration and security/access problems.
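
The backtracking idea above might look like the following. This is a speculative sketch of one possible reading: all names are hypothetical, the deployment-flow is a plain ordered list of steps, and the reversal of steps and dependency tracking are not modelled. On failure it re-executes the last step, then the last two, and so on, until the functional check passes.

```python
def repair_by_backtracking(deployment_flow, works):
    """Peel back through the deployment-flow, re-executing ever longer tails.

    deployment_flow: ordered list of (step name, zero-argument action).
    works: zero-argument function, True when the functional work-flow succeeds.
    Returns the names of the steps re-executed in the attempt that fixed it.
    """
    if works():
        return []  # nothing to repair
    for start in range(len(deployment_flow) - 1, -1, -1):
        tail = deployment_flow[start:]
        for _name, action in tail:
            action()  # redo this deployment step
        if works():
            return [name for name, _action in tail]
    raise RuntimeError("re-deployment did not fix the failure")
```

For example, with the flow install, configure, refresh proxy, deploy service, an expired proxy is repaired on the second pass, which re-runs "refresh proxy" and "deploy service". A network fault raises the error after the whole flow has been redone, matching the "but not" cases on the slide.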
