230 likes | 358 Views
Alexandru V Staicu 1 , Jacek R. Radzikowski 1 Kris Gaj 1 , Nikitas Alexandridis 2 , Tarek El-Ghazawi 2 1 George Mason University 2 George Washington University. Effective Use of Networked Reconfigurable Resources. http://ece.gmu.edu/lucite. Problem :. Reconfigurable resources
E N D
Alexandru V Staicu1, Jacek R. Radzikowski1 Kris Gaj1, Nikitas Alexandridis2, Tarek El-Ghazawi2 1 George Mason University 2 George Washington University Effective Use of Networked Reconfigurable Resources http://ece.gmu.edu/lucite
Problem: • Reconfigurable resources • expensive and underutilized • Many of these resources available • over the network • It is desirable to leverage • networked reconfigurable resources • to help other users within the same • organization
Approach: • Select the most suitable existing Job • Management System (JMS) - identify and define functional requirements - rank known systems according to these requirements - identify which JMS is the easiest to extend • Extend this JMS to recognize and utilize • reconfigurable resources - add new dynamic resources - configure scheduling to be based on these new resources
An existing Job Management System Execution Host 1 Master Host SubmissionHost Task 1 Tasks 1, 2, 3 Task 2 Execution Host 2 Task 3 Execution Host 3
Networked Reconfigurable Resource Management System Execution Host 1 SubmissionHost Master Host Task 1 Tasks 1, 2, 3 Execution Host 2 Task 2 Task 3 Execution Host 3 FPGA boards
Heterogeneous network with FPGA-based accelerators SLAAC Research Reference Platform Sparc 10 Dell HP Dell WILDFORCE SLAAC WILDSTAR Dell WILDSTAR Ethernet Intelligent Hub 100 Mbps Ethernet Intelligent Hub 100 Mbps Myrinet SAN/LAN Switch Dell SLAAC Dell WILDSTAR Sparc 20 Gateway Dell Dell WILDFORCE SLAAC WILDFORCE
Functional units of a typical Job Management System scheduling policies Resource Manager resource requirements available resources Resource Monitor Job Scheduler User Server Job Dispatcher jobs & their requirements resource allocation and job execution
Classification of Investigated Systems (1) Distributed JMS w/o a Central Scheduler Distributed Operating System Centralized JMS • MOSIX • Globus • Legion • NetSolve • LSF • CODINE • PBS • Condor • RES
Classification of Investigated Systems (2) Resource Monitor and Forecaster Parameter Study Scheduler Distributed Computing Interface • AppLES • Compaq DCE • NWS
Operating system, flexibility, user interface RES CONDOR PBS Codine LSF pub com pub/com pub gov Distribution Source code OS Support Solaris Linux Tru64 NT User Interface GUI & CLI GUI & CLI CLI GUI & CLI GUI & CLI
Schedulingand Resource Management RES CONDOR PBS Codine LSF Batch jobs Interactive jobs Parallel jobs Accounting
Efficiency and Utilization RES CONDOR PBS Codine LSF Stage-in and stage-out Timesharing Process migration Dynamic load balancing Scalability
Fault Tolerance and Security RES CONDOR PBS Codine LSF Checkpointing Daemon fault recovery Authentication Authorization
Documentation and Technical Support RES CONDOR PBS Codine LSF Documentation Technical support
JMS features supporting extension to reconfigurable hardware • capability to define new dynamic resources • strong support for stage-in and stage-out • configuration bitstreams • executable code • input/output data • strong support for checkpointing, job migration, and • dynamic load balancing
Ranking of Centralized Job Management Systems (1) Capability to define new dynamic resources: Excellent:LSF, PBS, CODINE More difficult: CONDOR, RES Stage-in and stage-out: Excellent:LSF, PBS Limited: CONDOR No: CODINE, RES
Ranking of Centralized Job Management Systems (2) Checkpointing: Excellent:LSF, CONDOR External mechanisms: CODINE No: PBS, RES Job Migration: Yes:LSF, CODINE, CONDOR, PBS No: RES Dynamic Load Balancing: Yes:LSF, CODINE No: CONDOR, PBS, RES
Ranking of Centralized Job Management Systems (3) Overall suitability to extend to reconfigurable hardware: • LSF • CODINE • PBS • CONDOR • RES without changing the JMS source code requires changes to the JMS source code
Operation of LSF other hosts other hosts Execution host Submission host Master host 3 LIM LIM MLIM Load information 4 2 5 SBD MBD Batch API 11 8 9 Child SBD 1 7 6 queue 12 10 bsub app RES 13 LIM – Load Information Manager MLIM – Master LIM MBD – Master Batch Daemon SBD – Slave Batch Daemon RES – Remote Execution Server User job
Extension of LSF to reconfigurable hardware other hosts other hosts Execution host Submission host Master host 3 LIM LIM ELIM MLIM Load information 4 2 5 SBD MBD Batch API 11 8 9 Status of the board Child SBD 1 7 6 12 queue 10 bsub app RES 13 ELIM – External Load Information Manager ACS API – Adaptive Computing Systems API User job FPGA board 14 ACS API
Conclusions (1) 12 systems evaluated using 25 functional requirements + the suitability of extension to support reconfigurable hardware LSF, CODINE, PBS, and Condor ranked the highest in the functional requirements LSF, CODINE, and PBSPro found easy to extend without changes in their source codes LSF most suitable to support reconfigurable hardware
Conclusions (2) General software architecture of the extended system developed Experimental verification and the measurement of the speed-up of the extended system in progress