UT Grid Project. Jay Boisseau, Texas Advanced Computing Center SURA Grid Application Planning & Implementations Workshop December 7, 2005. Outline. Overview Vision Strategy Approach Current Project Status, Near-Term Goals UT Grid Production Compute Resources Roundup Rodeo
UT Grid Project Jay Boisseau, Texas Advanced Computing Center SURA Grid Application Planning & Implementations Workshop December 7, 2005
Outline • Overview • Vision • Strategy • Approach • Current Project Status, Near-Term Goals • UT Grid Production Compute Resources • Roundup • Rodeo • Interfaces to production resources: • Grid User Portal • Grid User Node • Tools to support resources: • GridPort • GridShell • Metascheduling Prediction Services • Future Work and Plans
UT Grid Vision: A Powerful, Flexible, and Simple Virtual Environment for Research & Education The UT Grid project vision is to create a cyberinfrastructure for research and education in which people can develop and test ideas, collaborate, teach, and learn through applications that seamlessly harness the diverse campus compute, visualization, storage, data, and instruments as needed from their personal systems (PCs) and interfaces (web browsers, GUIs, etc.).
UT Grid: Develop and Provide a Unique, Comprehensive Cyberinfrastructure… The strategy of the UT Grid project is to integrate… • common security/authentication • scheduling and provisioning • aggregation and coordination diverse campus resources… • computational (PCs, servers, clusters) • storage (Local HDs, NASes, SANs, archives) • visualization (PCs, workstations, displays, projection rooms) • data collections (sci/eng, social sciences, communications, etc.) • instruments & sensors (CT scanners, telescopes, etc.) from ‘personal scale’ to terascale… • personal laptops and desktops • department servers and labs • institutional (and national) high-end facilities
…That Provides Maximum Opportunity & Capability for Impact in Research, Education …into a campus cyberinfrastructure… • evaluate existing grid computing technologies • develop new grid technologies • deploy and support appropriate technologies for production use • continue evaluation, R&D on new technologies • share expertise, experiences, software & techniques that provides simple access to all resources… • through web portals • from personal desktop/laptop PCs, via custom CLIs and GUIs to the entire community for maximum impact on • computational research in applications domains • educational programs • grid computing R&D
Texas Two-Step: Hub & Spoke Approach • Deploying a P2P campus grid requires overcoming two trust issues • grid software: reliability, security, and performance • each other: not to abuse one’s own resources • An advanced computing center presents an opportunity to build a centrally managed grid as a step toward a P2P grid • already has trust relationships with users • so, when facing both issues, install grid software centrally first • create centrally managed services • create spokes from central hub • then, when grid software is trusted • show usage and capability data to demonstrate opportunity • show policies and procedures to ensure fairness • negotiate spokes among willing participants
UT Grid: Logical View • Integrate a set of resources (clusters, storage systems, etc.) within TACC first TACC Compute, Vis, Storage, Data (actually spread across two campuses)
UT Grid: Logical View • Next add other UT resources using same tools and procedures TACC Compute, Vis, Storage, Data ACES Cluster ACES Data ACES PCs
UT Grid: Logical View • Next add other UT resources using same tools and procedures GEO Data GEO Cluster TACC Compute, Vis, Storage, Data GEO Cluster ACES Cluster ACES Data ACES PCs
UT Grid: Logical View BIO Data BIO Instrument • Next add other UT resources using same tools and procedures PGE Cluster GEO Data PGE Data GEO Cluster TACC Compute, Vis, Storage, Data PGE Instrument GEO Cluster ACES Cluster ACES Data ACES PCs
UT Grid: Logical View BIO Data BIO Instrument • Finally negotiate connections between spokes for willing participants to develop a P2P grid. PGE Cluster GEO Data PGE Data GEO Cluster TACC Compute, Vis, Storage, Data PGE Instrument GEO Cluster ACES Cluster ACES Data ACES PCs
Enhancing Grid Computing R&D and Deployment Expertise for UT and for IBM • Benefits for IBM • Increased knowledge of diverse grid user and application requirements in universities • Access to new software technologies developed for UT Grid • Early awareness of new distributed & grid computing R&D opportunities • Exposure & expertise in a variety of grid technologies, open source & commercial, which can be shared internally • Experience to be gained from maintaining large distributed production grids • Collaboration with UT in conducting new distributed & grid computing R&D activities, including publications, proposals • Exposure among TACC’s collaborators and peers for expertise in grid deployment services, capabilities
Enhancing Grid Computing R&D and Deployment Expertise for UT and for IBM • Benefits for UT Austin • greater access to all resources by entire community • more effective utilization of existing and future resources • unique capabilities presented by access, aggregation, coordination for research, education • enhanced collaborative capabilities among researchers, and among teachers & students • Additional Benefits for TACC • increased expertise in grid deployment issues • early awareness of new distributed & grid computing R&D opportunities • platform for conducting new distributed & grid computing R&D activities
Enhancing Grid Computing R&D and Deployment Expertise for UT and for IBM • Benefits for TACC Partners • UT Grid-supported technologies being integrated into TeraGrid: GridPort/user portal, GridShell/user node, etc. • Expertise being developed in scheduling will be used in TeraGrid • UT Grid developments will be used • in TIGRE and SURA Grid • by TACC partners in UT System, HiPCAT, U.S., Latin America • by TACC industrial partners • Benefits for Community • UT Grid producing IBM DeveloperWorks articles • UT Grid R&D will produce professional papers in Year 2 (and proposals)
TACC Grid Technology & Deployment Activities Provide Synergy Through Tech Transfer • UT Grid • creating new tools for integrating compute, vis, storage and data across campus, from ‘personal scale’ to terascale • will exchange tools, experiences with TeraGrid & TIGRE to advance both and be interoperable with each • TeraGrid • will utilize & promote UT Grid user portal & user node technologies, and scheduling & workflow results • will provide grid visualization and data collection services to UT Grid, benefiting TACC and IBM • TIGRE • will utilize, promote UT Grid results and expertise to other state institutions, including industry • will provide additional experiences with UT Grid technologies from users from across state, helping to refine technologies
UT Grid Compute Resources • PCs and workstations • Roughly 1/2 are Windows on Intel/AMD and 1/3 are Macs • Most of the rest are Linux on Intel/AMD • Networks of PCs and Workstations • Roundup: United Devices-managed network of PCs • Non-dedicated, heterogeneous compute resources across campus • Some managed by TACC, ITS, or other departments; some individually managed • Windows, Linux & Mac desktop PCs • Rodeo: Condor-managed network of PCs • Dedicated & non-dedicated, heterogeneous compute resources • Some managed by TACC, ITS, or other departments; some individually managed • Linux, Windows & Mac PCs, plus some workstations • Clusters • Lonestar: 1024-processor Linux cluster at TACC • Wrangler: 656-processor Linux cluster at TACC • Longhorn: 128 processors in 4-way IBM p655 nodes at TACC • Other smaller clusters at TACC • Various department/lab clusters from 4 to 128+ processors will be included • Resources have different resource managers (LSF, PBS, SGE) • High-end Servers • Longhorn: IBM system with 32 Power4 processors, 128 GB memory • Maverick: Sun system w/64 dual-core UltraSPARC 4 procs, 512 GB mem
Interfaces and Tools for these Resources • For a broad, diverse campus community, access must be easy and from local resources • Users access Grid User Portal with standard web browser • Grid User Portal submits to Rodeo via SOAP • UT-Grid Condor Web Services layer developed to facilitate • Condor portlet part of GridPort 4 release • Grid User Portal submits to Roundup via Hosted Applications • Users access Grid User Node with SSH • Grid User Node submits to Rodeo via GridShell • GridShell provides command line interface through shell façade • Abstracts user from underlying grid technology and complexity • Submits to specific resource or determines most appropriate resource using catalog services • Grid User Node submits to Roundup • Batch job submission supported via GridShell • CLI for submitting hosted application jobs
Accessing UT Grid Compute Resources: Hosted User Nodes & Portals
Roundup: Current Status • Roundup is a production UT Grid resource • Production system with over 1000 PCs distributed across campus • Automated account request and creation • Production-level consulting • Comprehensive user guide • Training classes offered at TACC • Client downloads available for Windows, Mac, Linux from UT Grid web site • Hosted Applications Installed • HMMER, BLAST, POV-Ray, Coorset, etc.
Roundup: Next Steps • Near-term goals (few months): • Support additional production users • GSI Integration • United Devices Grid MP has capability for multiple authentication schemes • Need to add support extension for GSI • Evaluate MP Insight data warehousing and report generation package • Test and evaluate screen saver feature and start development of UT-specific screen saver • Investigate possible solutions to enable sharing jobs across grids • Multi-grid agents or job forwarding
Rodeo: Current Status • Rodeo is a production UT Grid resource • Production system with over 500 PCs made up of dedicated clusters and PCs distributed on campus • Automated account request and creation • Production level consulting • Comprehensive user guide • Training classes offered at TACC • Client downloads available for Windows, Mac, Linux from UT Grid web site
Rodeo: Current Status • Currently the largest production users are: • UTCS (Department of Computer Sciences) • Graeme Henkelman (Chemistry) • Wolfgang Bangerth (Geosciences)
Rodeo: Next Steps • Near-term goals (few months): • Continue supporting production users • Expand the number of CPUs available to users • Explore ‘hosted’ application possibilities
UT Grid Interfaces • UT Grid will provide two types of interfaces: • Web-based Grid User Portal (GUP) accessible via any web browser • Customized desktop environments for Linux, Windows and Macintosh PCs to act as Grid User Nodes (GUN). • Users can access all UT Grid resources using either the GUP or GUNs managed by UT Grid. • They will also be able to download the necessary software to build and host their own customized grid user portals or convert their personal desktop systems into grid user nodes.
Motivation for a Grid User Portal • Lower the barrier to entry for novice users • Provide a centralized grid account management interface • Easy access to multiple resources through a single interface • Simple GUI for complex grid computing capabilities • Provide simple alternatives to CLI for advanced users • Present a “Virtual Organization” view of the Grid as a whole • Increase productivity of UT researchers – do more science!
Grid User Portal: Current Status • Added Roundup and Rodeo as production resources on TACC User Portal • Developed JSR-168-compliant portlets that can: • View information on resources within UT Grid, including status, load, jobs, queues, etc. • View network bandwidth and latency between systems, aggregate capabilities for all systems. • Submit user jobs • Manage files across systems, and move/copy multiple files between resources with transfer time estimates • These portlets contribute to GridPort 4 release • TACC leading portal effort in TeraGrid • This will impact TACC User Portal and therefore UT Grid
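The transfer time estimates mentioned above could, in the simplest case, be derived from the measured bandwidth and latency between two systems. The slides do not describe the portlet's actual method, so the following is only a minimal sketch under assumptions: the `Link` class, the function name, and the per-file "latency plus size over bandwidth" model are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Measured network characteristics between two systems (hypothetical model)."""
    bandwidth_mbps: float  # megabits per second
    latency_s: float       # per-transfer latency in seconds


def estimate_transfer_seconds(file_sizes_bytes, link):
    """Estimate the time to move files one after another over a link.

    Assumes each file costs one latency period plus size / bandwidth.
    """
    if link.bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    bytes_per_second = link.bandwidth_mbps * 1_000_000 / 8
    total = 0.0
    for size in file_sizes_bytes:
        total += link.latency_s + size / bytes_per_second
    return total
```

A real estimator would also need to account for protocol overhead, parallel streams, and competing traffic, none of which this sketch models.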
Grid User Portal: Next Steps • Near-term plans (few months): • Complete new TACC User Portal (TUP) based on GridPort 4 including UT Grid resources • UT Grid capabilities fully integrated into TUP • Ability to customize environment to only expose UT Grid resources • Migrate portlets to WebSphere to ensure compatibility(?) • Grid Account Management Portlets
Grid User Node • Current capabilities of the Linux GUN: • Information queries about grid resources • Job submission • Parallel computing jobs (Dedicated Cluster Resources) • Serial computing jobs (Roundup, Rodeo) • Monitoring job status • Reviewing job results • Resource brokering based on ClassAd catalogs • GridFTP enabled GSIFTP
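Resource brokering against a catalog of ClassAd-style resource descriptions can be illustrated with a heavily simplified matchmaker: keep the machines that satisfy the job's requirements, then order them by the job's rank expression. This sketch does not use the real ClassAd language or library. The dictionaries, attribute names, and machine entries below are hypothetical, and real Condor matchmaking is two-way (machines also state requirements on jobs).

```python
def match(job, machines):
    """Return names of machines that satisfy the job, best rank first."""
    candidates = [m for m in machines if job["requirements"](m)]
    # Higher rank is better, mirroring how a job's Rank expression is used.
    candidates.sort(key=job["rank"], reverse=True)
    return [m["name"] for m in candidates]


# Hypothetical catalog entries; attribute names are illustrative only.
machines = [
    {"name": "pc-01", "os": "LINUX", "memory_mb": 512, "mips": 900},
    {"name": "pc-02", "os": "WINDOWS", "memory_mb": 2048, "mips": 1500},
    {"name": "pc-03", "os": "LINUX", "memory_mb": 1024, "mips": 1200},
]

# A serial job that needs Linux with at least 512 MB and prefers faster CPUs.
job = {
    "requirements": lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 512,
    "rank": lambda m: m["mips"],
}

print(match(job, machines))  # ['pc-03', 'pc-01']
```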
Grid User Node: Current Status • Production Linux, development Windows and Mac GUNs • Need to decide whether to do GUI versions • Submission to Roundup and Rodeo • “On-Demand” glide-in of UD resources into Condor pool • Integrated “real-life” applications • NAMD • SNOOP3D • HMMER • POV-Ray
Grid User Node: Next Steps • Near term goals: • Investigating distribution of GUN software stack using VDT • Prepare and present training class before the end of the year.
GridPort: Current Status • GridPort 4 developed and released this month • Available to UT Grid and national users as a grid portal toolkit to download and create user and application portals • Based on JSR-168-compliant portlets • Leveraged technology and knowledge in UT Grid to create Condor and comprehensive file transfer portlets
GridPort: Next Steps • Near-term goals • GridPort 4 will be part of the TeraGrid User Portal, to be in production in 1Q06 • Preparing demonstration and lab at Grid Workshop in Venezuela in April 2006 • Continue evolution of GridPort to include: • Advanced job submission functionality • Advanced user customization, and more • Investigating demo portal based on WebSphere
GridShell: Current Status • GridShell developed and deployed on UT Grid (and TeraGrid) • Available to UT Grid users in the GUN software stack. • Able to submit jobs first to a Spoke (departmental cluster) and then to the Hub (TACC) if not enough resources are available at the Spoke. • Collaborating with researchers at PSC and Caltech, we have extended GridShell to provide a single job submission interface (Condor) to the heterogeneous clusters on the TeraGrid.
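The spoke-first, hub-second submission policy described above can be sketched as a small decision function. This is not GridShell code: the function name, the CPU-count inputs, and the rule for what counts as "enough resources" are assumptions made for illustration.

```python
def choose_target(cpus_needed, spoke_free_cpus, hub_free_cpus):
    """Pick where to submit: the Spoke first, then the Hub, else None.

    Assumes "enough resources" simply means enough free CPUs right now.
    """
    if cpus_needed <= 0:
        raise ValueError("cpus_needed must be positive")
    if spoke_free_cpus >= cpus_needed:
        return "spoke"  # departmental cluster can run the job
    if hub_free_cpus >= cpus_needed:
        return "hub"    # fall back to the central (TACC) resources
    return None         # neither site can run the job at the moment
```

For example, `choose_target(16, spoke_free_cpus=8, hub_free_cpus=256)` falls through to the hub because the spoke has only 8 free CPUs.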
GridShell: Next Steps • Near-term goals • Create a public download site for GridShell 1.0 (current version available only to NSF TeraGrid and UT Grid users). • Continue evolution of GridShell to include: • Support submitting jobs to clusters with firewalls • Need to hire an additional developer and develop partnerships with external developers (e.g. GridPort)
Metascheduling Prediction Services (MPS): Current Status • Goal is to reduce turnaround times of jobs by optimizing resource selection for data movements, queue wait times, and performance • Components • Prediction Services • Execution times, queue wait times, file transfer times • Resource Brokering • Immediately select resources based on job requirements • Including predictions • Metascheduling • Schedule complex jobs such as workflows • Workload management
MPS: Next Steps • Near term goals: • Create prediction web services • Based on existing R&D • Predictions based on • Historical information • Learning algorithms • Scheduling simulations • Integrate with Condor-G • Provide additional information about clusters • The clusters themselves (e.g. number of CPUs) • The jobs submitted to the clusters • Add call outs so matchmaker can request predictions • User requests minimizing predicted response time as part of ranking • Demonstration with Graham Carey (ICES / UT Austin) • Selecting which cluster at TACC to use • Matchmaking capability using MPS to rank systems based on user request
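Ranking systems by predicted response time, with predictions drawn from historical information, might look roughly like the sketch below. It is not the MPS implementation: it assumes response time is the sum of queue wait, execution, and file transfer time, and it predicts each term as the plain mean of past observations. The function names and the data layout are hypothetical.

```python
from statistics import mean


def predict_response_time(history):
    """Predict response time (seconds) as the sum of mean past component times."""
    return (
        mean(history["queue_wait_s"])
        + mean(history["run_time_s"])
        + mean(history["transfer_s"])
    )


def rank_clusters(histories):
    """Return cluster names ordered from shortest to longest predicted response."""
    return sorted(histories, key=lambda name: predict_response_time(histories[name]))
```

A matchmaker call-out could then use the position in this ordering, or the predicted time itself, as part of a job's ranking. Learning algorithms or scheduling simulations would replace the simple mean used here.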
Future Plans and Work • Complete MPS work and integrate campus clusters with TACC clusters • First, just ‘upload’ larger jobs • Later, share jobs among spokes • Integrate Maverick as remote visualization resource into UT Grid • Overlapping software stack with PCs • Remote vis software downloads (incl. file transfer) • Vis portal • Integrate campus data collections into UT Grid • Hosted collections in DBs • WebSphere Information Integrator? • Prepare NSF proposal?