Asia Pacific Grid: Towards a Production Grid
Yoshio Tanaka
Grid Technology Research Center, Advanced Industrial Science and Technology, Japan
Contents
• Updates from PRAGMA 5
  • Demo at SC2003 (climate simulation using Ninf-G)
  • Joint demo with NCHC
  • Joint demo with TeraGrid
• Experiences and Lessons Learned
• Towards a production Grid
Why the climate simulation?
• The climate simulation is used as a test application to evaluate progress in resource sharing between institutions
• We can confirm achievements at two levels (a check sketch follows this slide)
  • Globus-level resource sharing
    • Globus is correctly installed
    • Mutual authentication based on GSI works
  • High-level middleware (GridRPC)-level resource sharing
    • The jobmanager works well
    • The network configuration of the cluster is suitable (note that most clusters use private IP addresses)
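As a rough illustration of how these two levels can be verified per site, the sketch below loops over a list of gatekeeper contacts and runs an authentication-only `globusrun` (Globus level) followed by a trivial Ninf-G test client (GridRPC level). The contact strings and the `ninfg_test_client` binary name are hypothetical placeholders, not part of the original testbed setup.

```python
#!/usr/bin/env python
# Hedged sketch: verify Globus-level and GridRPC-level resource sharing per site.
# Gatekeeper contacts and the `ninfg_test_client` binary are illustrative assumptions.
import subprocess

SITES = {
    "AIST":   "gatekeeper.example.aist.go.jp/jobmanager-pbs",
    "Titech": "gatekeeper.example.titech.ac.jp/jobmanager-pbs",
    "KISTI":  "gatekeeper.example.kisti.re.kr/jobmanager-pbs",
    "NCSA":   "gatekeeper.example.ncsa.edu/jobmanager-pbs",
}

def globus_level_ok(contact):
    """GSI mutual authentication only (globusrun -a does not submit a job)."""
    return subprocess.call(["globusrun", "-a", "-r", contact]) == 0

def gridrpc_level_ok(contact):
    """Run a trivial Ninf-G client (hypothetical binary) against one server."""
    return subprocess.call(["./ninfg_test_client", contact]) == 0

if __name__ == "__main__":
    for site, contact in SITES.items():
        g = globus_level_ok(contact)
        r = gridrpc_level_ok(contact) if g else False
        print(f"{site:8s} Globus-level: {'OK' if g else 'NG'}  GridRPC-level: {'OK' if r else 'NG'}")
```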
Behavior of the System
(Diagram: the Ninf-G client at AIST invokes simulation servers on the NCSA cluster (225 CPUs), the AIST cluster (50 CPUs), the Titech cluster (200 CPUs), and the KISTI cluster (25 CPUs).)
A terrible 3 weeks (PRAGMA 5 ~ SC2003)
• Increased resources
  • 14 clusters -> 22 clusters
  • 317 CPUs -> 853 CPUs
• Installed Ninf-G and the climate simulation on TeraGrid
  • The account was given on Nov. 4th
  • Ported Ninf-G2 to the IA64 architecture
Necessary steps for the demo
• Apply for an account at each site
• Get an entry added to the grid-mapfile
• Test globusrun (a pre-flight check sketch follows this slide)
  • Authentication
    • Is my CA trusted? Do I trust your CA?
    • Is my entry in the grid-mapfile?
  • DNS lookup
    • Reverse lookup is used for server authentication
  • Firewall / TCP wrappers
    • Can I connect to the Globus gatekeeper?
    • Can the Globus jobmanager connect back to my machine?
  • Jobmanager
    • Is the queuing system (e.g. PBS, SGE) installed appropriately?
    • Does the jobmanager script work as expected?
• In the case of TeraGrid
  • Obtained my user certificate from the TeraGrid CA (NCSA CA)
  • Asked Titech and KISTI to trust the NCSA CA
  • It was not feasible to ask TeraGrid to trust the AIST GTRC CA
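A minimal pre-flight sketch for some of these checks is shown below: grid-mapfile entry, reverse DNS lookup, and reachability of the gatekeeper port through the firewall. The grid-mapfile path and port 2119 follow the usual GT2 conventions; the DN and host names are placeholders.

```python
#!/usr/bin/env python
# Hedged sketch of pre-flight checks before the demo. Paths, port, DN, and
# host names are assumptions for illustration; the grid-mapfile check would
# be run on the server side (or delegated to the site admin).
import socket

GRID_MAPFILE = "/etc/grid-security/grid-mapfile"   # conventional location
GATEKEEPER_PORT = 2119                             # default GT2 gatekeeper port
MY_DN = "/O=Grid/OU=GTRC/CN=Yoshio Tanaka"         # placeholder DN

def dn_in_mapfile(dn, path=GRID_MAPFILE):
    """Is my certificate subject listed in the grid-mapfile?"""
    with open(path) as f:
        return any(dn in line for line in f)

def reverse_lookup_ok(host):
    """Server authentication relies on reverse DNS; make sure it resolves."""
    try:
        addr = socket.gethostbyname(host)
        name, _, _ = socket.gethostbyaddr(addr)
        return bool(name)
    except OSError:
        return False

def gatekeeper_reachable(host, port=GATEKEEPER_PORT, timeout=5):
    """Can I open a TCP connection through the firewall / TCP wrappers?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host = "gatekeeper.example.org"                # placeholder gatekeeper host
    print("grid-mapfile entry:", dn_in_mapfile(MY_DN))
    print("reverse DNS ok:    ", reverse_lookup_ok(host))
    print("gatekeeper port:   ", gatekeeper_reachable(host))
```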
Necessary steps for the demo (cont’d)
• Install Ninf-G2
  • A frequent problem was an inappropriate installation of the GT2 SDK
  • GT2 manual recommends:
    • GRAM and Data SDKs: gcc32dbg flavor
    • Information Services SDK: gcc32dbgpthr flavor
  • Asked sites for an additional installation of the Information Services SDK with the gcc32dbg flavor
• Test the Ninf-G application
  • Can the Ninf-G server program connect back to the client? (see the connectivity sketch after this slide)
  • If private IP addresses are used on the backend nodes, NAT must be available
  • These are application/middleware-specific requirements; they depend on the application and the middleware
  • The new Ninf-G application (TDDFT) needs the Intel Fortran Compiler
  • Another application needs GAMESS / Gaussian
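The "can the server connect back to the client?" question can be probed with a sketch like the one below, run on a backend node. The client host and port are placeholders, something must be listening on that port on the client side (e.g. a simple TCP listener), and a real Ninf-G deployment uses its own connect-back mechanism; this only probes raw outbound reachability through NAT.

```python
#!/usr/bin/env python
# Hedged sketch: run on a compute (backend) node to see whether an outbound TCP
# connection to the Ninf-G client host is possible, i.e. whether NAT works for
# privately addressed nodes. Host and port are placeholders; Ninf-G itself
# manages its own connect-back, so this is only a raw reachability probe.
import socket
import sys

CLIENT_HOST = "ninfg-client.example.aist.go.jp"   # placeholder client host
CLIENT_PORT = 4000                                # placeholder listener port

def can_connect_back(host=CLIENT_HOST, port=CLIENT_PORT, timeout=10):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as e:
        print(f"connect-back failed: {e}", file=sys.stderr)
        return False

if __name__ == "__main__":
    ok = can_connect_back()
    print("NAT / connect-back:", "OK" if ok else "NG (check NAT or firewall)")
```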
Lessons Learned
• Much effort is needed for the initial setup
• MDS is not scalable and is still unstable
  • Some parameters in grid-info-slapd.conf had to be modified
• The testbed was unstable
  • Unstable / poor network
  • System maintenance (including software upgrades) without notification
    • realized only when the application failed
    • "it worked well yesterday, but I'm not sure whether it works today" (a health-check sketch follows this slide)
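To catch the "it worked yesterday, does it work today?" situation early, one option is a daily health check that logs per-site results and flags regressions against the previous run. The sketch below uses an authentication-only `globusrun` as the probe and hypothetical site names; both are assumptions for illustration.

```python
#!/usr/bin/env python
# Hedged sketch: daily testbed health check that flags sites which worked on the
# previous run but fail today. Site contacts and the use of `globusrun -a` as
# the probe are illustrative assumptions.
import json
import os
import subprocess
from datetime import date

SITES = ["aist", "titech", "kisti", "ncsa"]                 # placeholder names
CONTACTS = {s: f"gatekeeper.{s}.example.org" for s in SITES}
STATE_FILE = os.path.expanduser("~/.apgrid_health.json")

def probe(contact):
    return subprocess.call(["globusrun", "-a", "-r", contact]) == 0

def load_previous():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}

def main():
    previous = load_previous()
    today = {site: probe(contact) for site, contact in CONTACTS.items()}
    for site, ok in today.items():
        if previous.get(site) and not ok:
            print(f"[{date.today()}] REGRESSION: {site} worked yesterday, fails today")
        else:
            print(f"[{date.today()}] {site}: {'OK' if ok else 'NG'}")
    with open(STATE_FILE, "w") as f:
        json.dump(today, f)

if __name__ == "__main__":
    main()
```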
Lessons Learned (cont’d)
• Difficulties caused by the grass-roots approach
  • It is not easy to keep the GT2 version coherent between sites
  • Different users have different requirements for the Globus Toolkit
• Most resources are not dedicated to the testbed
  • Resources may be busy / highly utilized
  • Need a Grid-level scheduler, or a fancy Grid reservation system?
  • (from the point of view of resource providers) we need flexible control of donated resources
    • e.g. 32 nodes for a default user, 64 nodes for specific groups, 256 nodes for my own organization (see the policy sketch after this slide)
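The per-group node limits mentioned above (32/64/256) can be expressed as a simple policy table. The sketch below is only an illustration with hypothetical group names, not a feature of any existing scheduler on the testbed.

```python
# Hedged sketch: a resource provider's per-group limit on donated nodes,
# matching the example in the slide (32 default, 64 for specific groups,
# 256 for the provider's own organization). Group names are hypothetical.
NODE_LIMITS = {
    "default":         32,
    "pragma-partners": 64,
    "aist-internal":   256,
}

def max_nodes(group: str) -> int:
    """Return the node limit for a user's group, falling back to the default."""
    return NODE_LIMITS.get(group, NODE_LIMITS["default"])

def admit(group: str, requested_nodes: int) -> bool:
    """Would a request for `requested_nodes` be admitted under this policy?"""
    return requested_nodes <= max_nodes(group)

# Example: a default user asking for 64 nodes is rejected,
# while a member of the provider's own organization is admitted.
assert not admit("default", 64)
assert admit("aist-internal", 64)
```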
Summary of current status (cont’d)
• What has been done?
  • Resource sharing between more than 20 sites (853 CPUs were used by the Ninf-G application)
  • Use of GT2 as the common software
• What hasn't?
  • Formalize "how to use the Grid testbed"
    • I could use it, but it is difficult for others
    • I was given an account at each site through personal communication
  • Provide documentation
  • Keep the testbed stable
  • Develop management tools
    • Browse information
    • CA / certificate management (see the sketch after this slide)
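As one example of the missing management tools, the sketch below walks the conventional GT2 trusted-CA directory and reports each CA certificate's expiry date via the openssl CLI. The directory path and hash-named `.0` files are the usual GT2 convention; error handling is deliberately minimal.

```python
#!/usr/bin/env python
# Hedged sketch of a small CA/certificate management aid: report the expiry
# date of every trusted CA certificate under the conventional GT2 directory.
# Assumes the `openssl` CLI is installed; the path is the usual convention.
import glob
import subprocess

TRUSTED_CA_DIR = "/etc/grid-security/certificates"

def cert_end_date(path):
    """Ask openssl for the notAfter date of one PEM certificate."""
    out = subprocess.run(
        ["openssl", "x509", "-in", path, "-noout", "-enddate"],
        capture_output=True, text=True,
    )
    return out.stdout.strip() if out.returncode == 0 else "unreadable"

if __name__ == "__main__":
    for cert in sorted(glob.glob(f"{TRUSTED_CA_DIR}/*.0")):
        print(f"{cert}: {cert_end_date(cert)}")
```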
Towards a production Grid
• Define minimum requirements for Grid middleware
  • The Resource WG has the responsibility
  • cf. the NMI and TeraGrid software stacks
  • Each site must follow the requirements (a compliance-check sketch follows this slide)
• Keep the testbed as stable as possible
• Understand that security is definitely essential for international collaboration
  • What is the security (CA) policy in the Asia Pacific region?
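One way to make "each site must follow the requirements" checkable is to compare what each site reports against a required software stack. The required versions and the reporting format in the sketch below are hypothetical examples in the spirit of the NMI / TeraGrid stacks, not an agreed requirement list.

```python
# Hedged sketch: compare a site's self-reported software stack against a
# minimum-requirements list. Versions and the reported data are hypothetical.
REQUIRED = {
    "globus": "2.4",     # assumed minimum GT2 release
    "ninf-g": "2.0",
}

def version_tuple(v: str):
    return tuple(int(x) for x in v.split("."))

def unmet_requirements(reported: dict) -> list:
    """Return the list of requirements the site does not meet."""
    missing = []
    for pkg, min_ver in REQUIRED.items():
        have = reported.get(pkg)
        if have is None or version_tuple(have) < version_tuple(min_ver):
            missing.append(f"{pkg} >= {min_ver} (found {have})")
    return missing

# Example usage with a hypothetical site report:
site_report = {"globus": "2.2", "ninf-g": "2.0"}
problems = unmet_requirements(site_report)
print("compliant" if not problems else "not compliant: " + ", ".join(problems))
```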
Towards a production Grid (cont’d)
• Draft the "Asia Pacific Grid Middleware Deployment Guide", a recommendation document for the deployment of Grid middleware
  • Minimum requirements
  • Configuration
• Draft the "Instruction of Grid Operation in the Asia Pacific Region", which describes how to run a Grid Operation Center that supports the management of a stable Grid testbed
• Launch the Asia Pacific Grid Policy Management Authority (http://www.apgridpma.org/)
  • Coordinate the security level in Asia
  • Interact with bodies outside Asia (DOEGrids PMA, EUGridPMA)
• A sophisticated users' guide is necessary
Towards a production Grid (cont’d)
• Each site should provide a document and/or web page for users, covering the following (a machine-readable variant is sketched after this slide)
  • Requirements for users
  • How to obtain an account
  • Available resources
    • Hardware
    • Software and its configuration
  • Resource utilization policy
  • Support and contact information
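The same per-site information could also be published in a machine-readable record alongside the web page, so that management tools can consume it. The record below is purely illustrative; every value is a placeholder.

```python
# Hedged sketch: a machine-readable version of the per-site user information,
# mirroring the items listed in the slide. All values are illustrative.
import json

site_info = {
    "site": "Example Cluster, Example Institute",
    "how_to_get_account": "https://www.example.org/grid/account",
    "requirements_for_users": [
        "certificate from a trusted CA",
        "acceptance of the usage policy",
    ],
    "resources": {
        "hardware": "64-node IA32 cluster, 2 CPUs/node",
        "software": ["Globus Toolkit 2.4 (gcc32dbg)", "Ninf-G2", "PBS"],
    },
    "utilization_policy": "max 32 nodes per job for external users",
    "support_contact": "grid-admin@example.org",
}

print(json.dumps(site_info, indent=2))   # could be served next to the web page
```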
Future Plan (cont’d)
• Should think about a GT3/GT4-based Grid testbed
• Each CA must provide a CP/CPS
• International collaboration
  • TeraGrid, UK e-Science, EUDG, etc.
• Run more applications to evaluate the feasibility of the Grid
  • a large-scale cluster + a fat link
  • many small clusters + thin links
Summary
• It is tough work to make resources available for applications
  • many steps are involved
• It is tough to keep the testbed stable
• Many issues remain to be solved towards a production Grid
  • Technical
    • local and global schedulers
    • dedication / reservation / co-allocation
  • Political
    • CA policy
    • How can I get an account on your site?
  • Both
    • Coordination of middleware
• More interaction between the Resource and Applications WGs is necessary
  • Need to establish the procedures necessary for resource sharing