Sun Grid Engine is an award-winning workload management system that offers dynamic resource management, job scheduling, resource monitoring, policy administration, user authentication and access control, and accounting and reporting. It is used in data centers, health sciences, financial services, manufacturing, entertainment/media, energy, software, and government/education for workloads such as medical imaging, risk analysis, crash test simulation, animation, and seismic processing. Sun Grid Engine matches jobs to resources, selects among the candidates, and schedules jobs according to user and job policies. It is also highly configurable, scales to large clusters, and offers integration points such as DRMAA.
SUN GRID ENGINE • Wolfgang Gentzsch • Formerly Senior Director of Grid Computing • Sun Microsystems, Inc.
Popular workload management systems
Google search term (including quotes): Google hits
• "grid engine": 138,000
• "load sharing facility": 11,200
• "portable batch system": 15,700
• SGE +Sun: 204,000
• LSF +"Platform Computing": 28,100
• PBS +Altair: 48,700
Award-winning Sun Grid Engine • Thousands of successful Grids • Excellence in Cluster Technology
Dynamic Resource Management Distributed Resource Manager Jobs Dispatch Results
Sun Grid Engine Overview • Dynamic Resource Management • Job Scheduling • Resource monitoring • Policy administration • User authentication and access control • Accounting and reporting
Target Industries & Typical Workloads
Industry: Computing tasks
• Data Centers: Resource assignment across persistent services
• Health Sciences: Medical imaging, bio-informatics
• Financial Services: Risk and portfolio analysis, Monte Carlo simulations
• Manufacturing: EDA, MCAD, fluid dynamics, crash test simulations, aerodynamics modeling
• Entertainment/Media: Digital content creation, animation, digital asset management
• Energy: Reservoir simulations, seismic processing
• Software: Build, test, verify
• Government/Edu: Weather analysis, nuclear yield simulation
Customers
Industry: Sample customers
• Health Sciences: One of the largest pharmaceutical companies in the USA
• Financial Services: Largest Wall Street financial institution, Deutsche Bank
• Manufacturing: Ford, Transmeta, Mentor Graphics, Monsanto
• Entertainment/Media: GlobeXplorer, Inc.
• Energy: Landmark Graphics
• Government/Edu: DoE INEEL, University of Leicester, University of Aachen
Resource Matching Selection Scheduling JOB User • User policies • Groups • Roles • Departments • Projects • Job policies • Resources • System characteristics • System status • Resources
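The match, select, schedule flow on this slide can be sketched in a few lines of Python. This is an illustrative toy, not Sun Grid Engine's scheduler: the `Host` and `Job` types, the resource names, and the "least loaded host wins" selection rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    resources: dict   # hypothetical attributes, e.g. {"mem_gb": 16, "arch": "amd64"}
    load: float       # current load, standing in for "system status"


@dataclass
class Job:
    name: str
    requests: dict = field(default_factory=dict)


def matches(host, job):
    """Resource matching: numeric requests are minimums, other values must be equal."""
    for key, wanted in job.requests.items():
        have = host.resources.get(key)
        if have is None:
            return False
        if isinstance(wanted, (int, float)) and not isinstance(wanted, bool):
            if have < wanted:
                return False
        elif have != wanted:
            return False
    return True


def select_host(hosts, job):
    """Selection: among matching hosts, pick the least loaded one (None if no match)."""
    candidates = [h for h in hosts if matches(h, job)]
    return min(candidates, key=lambda h: h.load, default=None)


hosts = [
    Host("node1", {"mem_gb": 8, "arch": "amd64"}, load=0.2),
    Host("node2", {"mem_gb": 32, "arch": "amd64"}, load=1.5),
    Host("node3", {"mem_gb": 64, "arch": "sparc"}, load=0.1),
]
job = Job("render", {"mem_gb": 16, "arch": "amd64"})
chosen = select_host(hosts, job)
print(chosen.name if chosen else "no match")
```

A real scheduler would also weigh the user and job policies listed on the slide; here they are left out to keep the matching step visible.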
Sun Grid Engine Components • Clients: qsub, qrsh, qlogin, qmon, qtcsh, applications via DRMAA • QMaster • Scheduler • Shadow Master • Execution Daemons • ARCo
Data Management • No explicit data management: script files are transferred, binaries are not • Shared file system: NFS “by default” • File staging: copy data in before job, copy results out after; not an inherent feature, configured via scripting hooks
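Since file staging is left to scripting hooks, a wrapper along the following lines is one way to picture it. This is a minimal sketch under stated assumptions, not a Sun Grid Engine hook: `run_with_staging`, the scratch-directory layout, and the `word_count` stand-in job are hypothetical names invented here.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_staging(inputs, job, results_dir):
    """Stage inputs into a scratch dir, run job(scratch), copy new files out."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp)
        # Stage in: copy data before the job starts.
        for src in inputs:
            shutil.copy2(src, scratch / Path(src).name)
        staged = {p.name for p in scratch.iterdir()}
        job(scratch)
        # Stage out: copy back only the files the job created.
        produced = []
        for p in sorted(scratch.iterdir()):
            if p.is_file() and p.name not in staged:
                shutil.copy2(p, results_dir / p.name)
                produced.append(p.name)
    return produced


def word_count(scratch):
    """Stand-in for the real job: count words in input.txt."""
    text = (scratch / "input.txt").read_text()
    (scratch / "count.txt").write_text(str(len(text.split())))


with tempfile.TemporaryDirectory() as work:
    src = Path(work) / "input.txt"
    src.write_text("one two three")
    out = Path(work) / "results"
    produced = run_with_staging([src], word_count, out)
    result_text = (out / "count.txt").read_text()

print(produced, result_text)
```

With a shared file system such as NFS the copy steps are unnecessary, which is why staging only matters when hosts do not share storage.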
Security • Access Control Lists: explicitly allow or disallow users and groups • Restricted operations: managers and operators; submit and admin hosts • Certificate-based encryption: hides and protects data, guarantees identity • Replace rsh with ssh
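The "explicitly allow or disallow" idea behind access control lists can be sketched as a small check. The function name, the `@group` notation for group entries, and the rule that a deny entry always wins are assumptions for illustration; the slides do not spell out the evaluation order.

```python
def is_allowed(user, groups, allow, deny):
    """Return True if the user may proceed under the given allow/deny lists.

    Entries are user names, or group names prefixed with "@" (assumed notation).
    An empty allow list means "everyone not explicitly denied".
    """
    principals = {user} | {"@" + g for g in groups}
    if principals & set(deny):
        return False          # an explicit deny wins in this sketch
    if allow:
        return bool(principals & set(allow))
    return True


print(is_allowed("alice", ["chem"], allow=["@chem"], deny=[]))
print(is_allowed("bob", ["chem"], allow=["@chem"], deny=["bob"]))
```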
Maximum Flexibility • Almost every behavior can be configured • Resources: load sensors • Hierarchical: hosts, host groups, queues, etc.; users, user groups, departments, projects, etc. • Script-based integration points: suspend/resume, job execution, checkpointing, parallel environments
Scalability • Sun Grid Engine 6.1 target: 10k+ hosts (hosts ≤ CPUs), 500k+ jobs (no task limit) • Sun Grid Engine 6.2 target: 90k+ cores • Sun Grid Engine 6.1: job round-trip 0.4s (mostly fork and exec), submit rate >120 jobs/sec (using DRMAA)
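To put the figures above in proportion, a rough calculation shows how long it would take to submit the 6.1 job target at the quoted submit rate. This treats ">120 jobs/sec" as exactly 120 and assumes the rate is sustained, so the result is an upper-bound estimate, not a measurement.

```python
TARGET_JOBS = 500_000   # 6.1 target from the slide (500k+ jobs)
SUBMIT_RATE = 120       # jobs per second, lower bound of ">120 jobs/sec"

submit_seconds = TARGET_JOBS / SUBMIT_RATE
submit_minutes = submit_seconds / 60
print(f"about {submit_minutes:.1f} minutes to submit {TARGET_JOBS:,} jobs")
```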
Policies and Priorities User 1 Project C Team B Enterprise-wide Resource Demand User 2 Department 1 Contractor X Project A Department 2 Departmental Resource Access Department 3 Department 5 Department 4
Sophisticated Scheduler • Align resource usage with business policies: historical usage tracking, time-based priorities, resource-based priorities, fine-grained quotas • Maximize utilization: hardware and software • Dynamic, continuously evaluated: changes take effect immediately, no restart
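Historical usage tracking combined with share-based priorities can be pictured as follows. This is a generic fair-share sketch, not the formula Sun Grid Engine uses: the exponential half-life decay, the function names, and the sample numbers are assumptions.

```python
def decayed_usage(samples, now, half_life):
    """Sum (time, cpu_seconds) samples, halving each sample's weight every half_life."""
    return sum(cpu * 0.5 ** ((now - t) / half_life) for t, cpu in samples)


def rank_by_share(shares, usage):
    """Order projects by how far below their entitled share they currently are."""
    total_share = sum(shares.values())
    total_usage = sum(usage.values()) or 1.0

    def deficit(name):
        entitled = shares[name] / total_share
        consumed = usage.get(name, 0.0) / total_usage
        return entitled - consumed

    return sorted(shares, key=deficit, reverse=True)


shares = {"A": 50, "B": 30, "C": 20}             # hypothetical policy shares
history = {
    "A": [(0, 1000.0), (90, 1000.0)],            # (timestamp, cpu seconds)
    "B": [(95, 100.0)],
    "C": [],
}
usage = {p: decayed_usage(s, now=100, half_life=10) for p, s in history.items()}
order = rank_by_share(shares, usage)
print(order)
```

Project A holds the largest share but has consumed most of the recent CPU time, so it ranks last until its usage decays.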
Grids in the Enterprise Accounting Production Development Running Jobs Siloed Grids Waiting Jobs Idle resources in some grids, Jobs waiting in others.
The Enterprise Grid Accounting Production Development Borrowed Resources Running Jobs Resources shared among departments. Policies can ensure “fair” usage.
Accounting and Reporting • ARCo: Accounting and Reporting Console • Fine-grained resource accounting: stored in RDBMS in well-defined schema; standard SQL access for 3rd party tools; customizable and extensible • Web-based console tool: generate reports, queries, etc.; customizable queries and report formats; spreadsheet report generation for offline analysis
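Because the accounting data is reachable through standard SQL, a third-party report can be a plain query. The sketch below uses an in-memory SQLite database as a stand-in for the RDBMS; the `job_usage` table and its columns are hypothetical and are not the ARCo schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only; not the real ARCo schema.
conn.execute(
    "CREATE TABLE job_usage (owner TEXT, project TEXT, cpu_seconds REAL)"
)
conn.executemany(
    "INSERT INTO job_usage VALUES (?, ?, ?)",
    [("alice", "A", 120.0), ("bob", "A", 30.0), ("carol", "B", 200.0)],
)

# CPU time per project, largest consumer first.
rows = conn.execute(
    "SELECT project, SUM(cpu_seconds) AS total "
    "FROM job_usage GROUP BY project ORDER BY total DESC"
).fetchall()
conn.close()

for project, total in rows:
    print(project, total)
```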
Accounting and Reporting Console • Result List • Save new results • View results generated offline • Query List • Run by ordinary users • Create, Edit by privileged users
Customizable Results View • Tables • Simple • Pivot • Definable fields • Customizable headings • Graphs • Line Chart • Bar Chart • Pie Chart • 3-D or flat
Distributed Resource Management Application API (DRMAA) • Standard from the Open Grid Forum • Submit, monitor, control jobs • Language & platform agnostic • ISVs: “Grid-enable” their applications, avoid DRM/Grid system lock-in • In-house developers: integrate Grid tasks into workflow, orchestration, online apps, etc.
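A submit-and-wait cycle through DRMAA might look like the sketch below. It assumes the third-party `drmaa` Python package (drmaa-python) and a DRMAA-enabled DRM library are installed, neither of which is mentioned in the slides; `run_and_wait` is a hypothetical helper name. The import is kept inside the function so the definition loads even where no grid is available.

```python
def run_and_wait(command, args=()):
    """Submit one job through DRMAA and block until it finishes.

    Returns (job_id, has_exited, exit_status).
    """
    import drmaa  # assumption: drmaa-python plus a DRMAA library are installed

    session = drmaa.Session()
    session.initialize()
    try:
        jt = session.createJobTemplate()
        jt.remoteCommand = command          # program to run on the grid
        jt.args = list(args)                # its command-line arguments
        job_id = session.runJob(jt)         # submit
        info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)  # monitor
        session.deleteJobTemplate(jt)
        return job_id, info.hasExited, info.exitStatus
    finally:
        session.exit()
```

On a host with a working grid, `run_and_wait("/bin/sleep", ["10"])` would submit a ten-second job and return once it completes. Because the code talks to the DRMAA interface rather than to Sun Grid Engine directly, the same function should work against another DRMAA-compliant system.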
User Interfaces • Browser (accounting) • Command-line • Graphical • Programmatic (DRMAA): C, Java • Sun Grid Engine
Sun Grid Engine Multi-Clustering I need resources I have 2 free Sun Grid Engine grid #1 Sun Grid Engine grid #2 Spare Pool Service Domain Manager
Sun Grid Engine Multi-Clustering I can spare some I still need resources Sun Grid Engine grid #1 Sun Grid Engine grid #2 Service Domain Manager
Sun Grid Engine Multi-Clustering Grids are monitored by Service Level Objectives Policies control relative grid priorities Sun Grid Engine grid #1 Sun Grid Engine grid #2 Service Domain Manager
Multi-Clustered Accounting Multiple grids can use the same ARCo database All accounting data available from the same web interface Sun Grid Engine grid #1 Sun Grid Engine grid #2 ARCo
SGE 6.2 Cloud Connectivity • Install a 0-node SGE system on your laptop or desktop and allocate nodes on the Cloud (EC2) on demand. • The SGE Cloud Connectivity feature is fully elastic: • Grid resources allocated through it can go from 0 to whatever is needed and covered by the user's budget. • And they can go back to 0, of course. • All policy controlled. • No user intervention required. • Secure communication: OpenVPN (part of the EC2 AMI and of the SGE instance running on the user's laptop or desktop)
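The "from 0 to whatever is needed and covered by the user's budget" rule can be expressed as a small sizing function. This is a sketch of the idea only, not the policy Sun Grid Engine applies: the function name, its parameters, and the prices in the example are assumptions.

```python
import math


def nodes_wanted(pending_slots, slots_per_node, node_hour_cost, budget_left, hours=1.0):
    """Number of cloud nodes to hold: enough for pending work, capped by budget."""
    needed = math.ceil(pending_slots / slots_per_node) if pending_slots > 0 else 0
    affordable = int(budget_left // (node_hour_cost * hours))
    return max(0, min(needed, affordable))


print(nodes_wanted(0, 8, 0.5, 100.0))     # idle grid: scale back to zero
print(nodes_wanted(20, 8, 0.5, 100.0))    # demand-bound
print(nodes_wanted(100, 8, 0.5, 2.0))     # budget-bound
```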
Open Source Project Foundation for Sun Grid Engine Development happens in open source Very widely adopted – strong community Active mailing lists Monitored by the development engineers Licensed under SISSL http://gridengine.sunsource.net/ http://gridengine.info/ By the community, for the community
Product Versus Open Source • Support and/or indemnification important? Licensed product • Exploring your options? Sun Download Center • Want to customize? Open source • Want to run on unsupported platforms? Open source • Want unsupported features? Open source
Tokyo Institute of Technology Largest Supercomputer in Asia top500.org Debuted at #7 on Top 500 List June 2006 #14 on June 2007 list Research grid Numeric simulations 25+ applications 6 different flavors of MPI, plus OpenMP, DDI, etc. TSUBAME Grid Cluster
TSUBAME Components 655 Nodes Sun Fire x4600 16 cores per node → 10480 cores ClearSpeed CSX600 accelerators 85TFlops theoretical / 48TFlops peak 21.4TB aggregate memory Infiniband network 8 Voltaire ISR 9288 Lustre file system 42 Sun Fire x4500 48 500GB SATA disks per node → ~1PB
Texas Advanced Computing Center First National Science Foundation Track2 system $30M acquisition budget $29M for support over 4 years Awarded September 2006 Production December 2007 TeraGrid member Over 3200 users Over 1000 projects From 48 states Physics, molecular biology, chemistry, astronomy, etc. Larger than current top 20 TeraGrid systems combined Ranger System
Ranger Components Nodes 82 Sun Blade 6048 – 3936 blades 16 cores per blade → 62976 cores 504TFlops peak 125TB aggregate memory Infiniband network 2 Sun Datacenter Switch 3456 Lustre/QFS/SAM file system 72 Sun Fire x4500 Largest file system is ~1PB
More Information Main product page: http://www.sun.com/gridware/ Open source project site: http://gridengine.sunsource.net/ Community site: http://gridengine.info/ Open source Service Domain Manager site: http://hedeby.sunsource.net/