Establishing an inter-organisational OGSA Grid: Lessons Learned
Wolfgang Emmerich
London Software Systems, Dept. of Computer Science
University College London, Gower St, London WC1E 6BT, U.K.
http://www.sse.ucl.ac.uk/UK-OGSA
An Experimental UK OGSA Testbed
• Established 12/03-12/04
• Four nodes:
  • UCL (coordinator)
  • NeSC
  • NEReSC
  • LeSC
• Deployed Globus Toolkit 3.2 throughout onto heterogeneous HW/OS:
  • Linux
  • Solaris
  • Windows XP
Experience with GT3.2 Installation
• Different levels of experience within the team
• Heterogeneity:
  • HW (Intel/SPARC)
  • Operating system (Windows/Solaris/Linux)
  • Servlet container (Tomcat/GT3 container)
• Interaction with previous GT versions
• Departure from web service standards prevented standard tool use:
  • JMeter
  • Development environments (Eclipse)
  • Exception management tools (AmberPoint)
• Interaction with system administration
• Platform dependencies
Performance and Scalability
• Developed GTMark:
  • Server-side load model: SciMark 2.0 (http://math.nist.gov/SciMark)
  • Client-side load model, configuration and metrics collection based on the J2EE benchmark StockOnline (see the load-driver sketch below)
• Configurable benchmark:
  • Static vs. dynamic discovery of nodes
  • Loads for a fixed period of time or until a steady state is obtained
  • Constant or varying number of concurrent requests
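GTMark's actual source is not shown on the slide; the snippet below is only a minimal sketch of the client-side load pattern it describes, assuming a hypothetical GridService stub, a fixed test duration and a constant number of concurrent clients (the simplest of the configurations listed above).

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Not GTMark itself: a minimal illustration of the client-side load pattern,
// with a constant number of concurrent clients invoking a service for a fixed
// period while throughput is recorded.
public class LoadDriverSketch {

    interface GridService { void invoke() throws Exception; }   // hypothetical stand-in for a GT3.2 service stub

    static double measureThroughput(GridService service, int concurrentClients,
                                    long durationMillis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(concurrentClients);
        AtomicLong completed = new AtomicLong();
        long deadline = System.currentTimeMillis() + durationMillis;

        for (int i = 0; i < concurrentClients; i++) {
            pool.submit(() -> {
                while (System.currentTimeMillis() < deadline) {
                    try {
                        service.invoke();                 // e.g. a SciMark-style compute operation
                        completed.incrementAndGet();
                    } catch (Exception e) {
                        // a reliability metric would count failures here
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(durationMillis + 5_000, TimeUnit.MILLISECONDS);
        return completed.get() / (durationMillis / 1000.0);   // invocations per second
    }
}
```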
Performance Results
• Performance and scalability of GT3.2 with Tomcat/Axis surprisingly good
• Performance overhead of security is negligible
• Good scalability: reached 96% of the theoretical maximum
• Tomcat performs better than the GT3.2 container on slow machines
• Surprising results on raw CPU performance
Reliability
• Tomcat more reliable than the GT3.2 container:
  • Tomcat container sustained 100% reliability under load
  • GT3.2 container failed once every 300 invocations (99.67% reliability)
• Denial-of-service attack possible by:
  • Concurrently invoking operations on the same service instance (they are not thread safe; see the sketch below)
  • Fully exhausting resources
• Problem of hosting more than one service in one container
• Trade-off between reliability and reuse of containers across multiple users/services
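The sketch below is not GT3.2 code; it is a generic illustration, with hypothetical names, of why concurrent invocations on a single non-thread-safe service instance are dangerous: unsynchronised per-instance state is silently corrupted under load.

```java
import java.util.concurrent.*;

// Generic illustration of the thread-safety hazard: one shared service
// instance, unsynchronised state, many concurrent callers.
public class SharedInstanceHazard {

    static class CounterService {               // imagine one service instance shared by all callers
        private int invocations = 0;            // not thread safe
        void handleRequest() { invocations++; } // lost updates under concurrent access
        int count() { return invocations; }
    }

    public static void main(String[] args) throws InterruptedException {
        CounterService instance = new CounterService();
        ExecutorService pool = Executors.newFixedThreadPool(50);
        for (int i = 0; i < 10_000; i++) {
            pool.submit(instance::handleRequest);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // Often prints less than 10,000 because some updates were lost.
        System.out.println("Recorded invocations: " + instance.count());
    }
}
```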
Security
• Interesting effect of firewalls on testing and debugging
• Accountability and audit trails demand that users be given individual accounts on each node
• Overhead of node and user certificates (they always expire at the wrong time)
• Current security model does not scale (see the back-of-envelope sketch below):
  • Assuming a cost of £18/admin hour
  • 10 users per node (site)
  • It will cost approx. £300,000 to set up a 100-node grid with 1,000 users
  • It will be prohibitively expensive to scale up to 1,000 nodes (with admin costs in excess of £6M)
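The slide gives only the hourly rate, the users-per-node figure and the resulting totals; the sketch below is a back-of-envelope reconstruction that assumes roughly ten minutes of admin time per user account per node, an assumption chosen because it reproduces the quoted £300,000 figure.

```java
// Back-of-envelope sketch of the admin-cost model on the slide.
// The 10-minutes-per-account effort is an assumption (not stated on the slide);
// rate, users per node and node counts come from the slide.
public class GridAdminCost {

    static double setupCost(int nodes, int usersPerNode,
                            double minutesPerAccount, double ratePerHour) {
        int users = nodes * usersPerNode;          // e.g. 100 nodes * 10 = 1,000 users
        long accounts = (long) users * nodes;      // every user needs an account on every node
        double hours = accounts * minutesPerAccount / 60.0;
        return hours * ratePerHour;
    }

    public static void main(String[] args) {
        // 100 nodes, 10 users/node, assumed 10 min per account, £18/hour: ~£300,000
        System.out.printf("100-node grid: GBP %.0f%n", setupCost(100, 10, 10, 18));
        // Cost grows quadratically with the number of nodes, so a 1,000-node grid
        // lands well in excess of the £6M quoted on the slide.
        System.out.printf("1,000-node grid: GBP %.0f%n", setupCost(1000, 10, 10, 18));
    }
}
```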
Deployment
• How do admins get grid middleware deployed systematically onto grid nodes?
• How can users get their services onto remote hosts?
• We tried out SmartFrog (http://www.smartfrog.org):
  • Worked very well inside a node
  • Impossible across organisations:
    • The SmartFrog daemon would need to execute actions with root privileges, which some site admins just did not agree to
    • Security paramount (SmartFrog would be the perfect virus distribution engine)
    • SmartFrog's security infrastructure incompatible with the GT 3.2 infrastructure
Looking Ahead
• Installation efforts need to be reduced significantly:
  • Binary distributions
  • For a few selected HW/OS platforms
• Standards compliance:
  • Track standards by all means
  • Otherwise no economies of scale
• Management console:
  • Add/remove grid hosts
  • Need to be able to monitor status of grid resources
  • Across organisational boundaries
• More lightweight security model needed:
  • Role-based access control (sketched below)
  • Trust delegation
• Deployment is a first-class citizen:
  • Avoid adding it as an afterthought
  • Needs to be built into the middleware stack
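As an illustration of the lighter-weight model the slide calls for, the sketch below shows a hypothetical role-based check: nodes authorise a small fixed set of roles rather than maintaining an individual account per user per node, with the user's home organisation vouching for the role (trust delegation). All names are illustrative and not part of any Globus or OMII API.

```java
import java.util.Map;
import java.util.Set;

// Minimal sketch of a role-based access check: the node only needs to know
// the caller's role, not the individual user, which avoids the per-user,
// per-node account setup costed on the Security slide.
public class RoleBasedGate {

    // Role -> operations that role may invoke on this node's services (illustrative).
    private final Map<String, Set<String>> rolePermissions = Map.of(
            "grid-user",  Set.of("submitJob", "queryStatus"),
            "grid-admin", Set.of("submitJob", "queryStatus", "deployService"));

    /** Returns true if a caller holding the given role may invoke the operation. */
    boolean isAllowed(String role, String operation) {
        return rolePermissions.getOrDefault(role, Set.of()).contains(operation);
    }

    public static void main(String[] args) {
        RoleBasedGate gate = new RoleBasedGate();
        // The user's home organisation vouches for the role (trust delegation).
        System.out.println(gate.isAllowed("grid-user", "deployService"));  // false
        System.out.println(gate.isAllowed("grid-admin", "deployService")); // true
    }
}
```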
Conclusions
• Very interesting experience
• Building a distributed system across organisational boundaries is different from building a system over a LAN
• Insights that might prove useful for:
  • OMII
  • Globus
  • ETF
• There is a lot more work to do before we realize the vision of the Grid!
Acknowledgements
• A large number of people have helped with this project, including:
  • Dave Berry (NeSC)
  • Paul Brebner (UCL, now CSIRO)
  • Tom Jones (UCL, now Symantec)
  • Oliver Malham (NeSC)
  • David McBride (LeSC)
  • Savas Parastatidis (NEReSC)
  • Steven Newhouse (OMII)
  • Jake Wu (NEReSC)
• For further details (including IGR) check out http://sse.cs.ucl.ac.uk/UK-OGSA