Harnessing Idle Computers with Condor at Notre Dame: Impact on Research in 2006. Prof. Douglas Thain CSE Department 9 Feb 2007. What is Condor?. Condor is software from UW-Madison that harnesses idle cycles and storage from existing machines. (ND workstations are 89% idle!)
Harnessing Idle Computers with Condor at Notre Dame: Impact on Research in 2006
Prof. Douglas Thain, CSE Department, 9 Feb 2007
What is Condor?
• Condor is software from UW-Madison that harnesses idle cycles and storage from existing machines. (ND workstations are 89% idle!)
• With the assistance of OIT/CSE staff, Condor has been installed on 379 CPUs in the Colleges of Engineering and Science since early 2005.
• Our Condor pool is expanding the ability of researchers in CSE, EE, AME, and Physics to perform CPU- and storage-intensive research.
• More users and contributors are welcome to join! http://www.nd.edu/~condor
Computing Environment
[Diagram: jobs, the Condor Match Maker, and the CPUs and disks of four machine groups: Miscellaneous CSE Workstations, Fitzpatrick Workstation Cluster, CVRL Research Cluster, and CCL Research Cluster. Example owner policies shown in the diagram:]
• "I will only run jobs between midnight and 8 AM."
• "I will only run jobs when there is no-one working at the keyboard."
• "I prefer to run a job submitted by a CSE student."
Scheduling Policy
• First, owners exercise absolute control:
  • Set who, what, and when can use the machine.
  • Can kick jobs off manually at any time.
• Default policy:
  • Start a job if the console has been idle > 15 minutes.
  • Suspend the job if the console is used or the CPU is busy.
  • Kick off the job if it has been suspended > 10 minutes.
• After satisfying that principle, the users split the available CPU hours equally.
It is a little more complicated than this; see the details here: http://www.cse.nd.edu/~ccl/operations/condor/policy.shtml
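The default policy on this slide can be read as a small state machine. The sketch below is only an illustration in Python, not Condor's configuration language or the pool's actual settings. The names `MachineState` and `next_action` are hypothetical, "console used" is modeled as zero idle minutes, and resuming a suspended job once the owner leaves is an assumption that the slide does not state.

```python
from dataclasses import dataclass

IDLE_BEFORE_START_MIN = 15  # slide: start job if console idle > 15 minutes
MAX_SUSPEND_MIN = 10        # slide: kick off job if suspended > 10 minutes


@dataclass
class MachineState:
    console_idle_min: float     # minutes since the last keyboard/mouse activity
    cpu_busy: bool              # True if the owner's own work is loading the CPU
    job_state: str              # "none", "running", or "suspended"
    suspended_min: float = 0.0  # how long the job has been suspended


def next_action(m: MachineState) -> str:
    """Return the action the default policy would take for this machine."""
    owner_active = m.console_idle_min == 0 or m.cpu_busy
    if m.job_state == "none":
        return "start" if m.console_idle_min > IDLE_BEFORE_START_MIN else "wait"
    if m.job_state == "running":
        return "suspend" if owner_active else "continue"
    if m.job_state == "suspended":
        if m.suspended_min > MAX_SUSPEND_MIN:
            return "evict"
        # Assumption: a suspended job resumes once the owner is gone again.
        return "stay_suspended" if owner_active else "resume"
    raise ValueError(f"unknown job state: {m.job_state}")


if __name__ == "__main__":
    print(next_action(MachineState(20, False, "none")))
    print(next_action(MachineState(0, False, "running")))
    print(next_action(MachineState(0, False, "suspended", suspended_min=11)))
```

Owner overrides (who, what, and when) would sit in front of this function, since owners exercise absolute control before any default applies.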
[Figures: CPU History; Storage History]
Flocking Between Universities
• Wisconsin: 1200 CPUs
• Purdue A: 541 CPUs
• Notre Dame: 379 CPUs
• Purdue B: 1016 CPUs
http://www.cse.nd.edu/~ccl/operations/condor/
Total Consumption in 2005
1128038 (100%) CPU-Hours Total
• 134978 (11%) CPU-Hours Consumed by Owner at Keyboard
• 350148 (31%) CPU-Hours Totally Unused
• 642912 (56%) CPU-Hours Harnessed by Condor
http://www.cse.nd.edu/~ccl/operations/condor/2006/users.html
Top Condor Users in 2005

User                      CPU Hours  Percent of Total  Max Jobs Running  Max Jobs in Queue  Dept  Advisor
Total                        642912           100.00%               327               2268
tdysart@nd.edu               548187            85.27%               204                740  CSE   Kogge
tfaltemi@nd.edu               51058             7.94%               163               2004  CSE   Flynn
johanes@nd.edu                27184             4.23%                 7                  7  CRC   -
nice-user.tdysart@nd.edu      22425             3.49%               100                100  CSE   Kogge
lxiao@nd.edu                  13236             2.06%                78                 85  EE    Fuja
dsalyers@nd.edu               10016             1.56%                28                688  CSE   Striegel
yjiang3@nd.edu                 7371             1.15%                24                800  CSE   Striegel
pbrenne1@nd.edu                6148             0.96%               112                120  CSE   Izaguirre
dcieslak@nd.edu                5619             0.87%               145               1814  CSE   Chawla
bnovak@nd.edu                  4116             0.64%                52                 52  ???   ???
dvonhand@nd.edu                1820             0.28%                32                 32  CSE   Izaguirre
jmcraven@nd.edu                1390             0.22%                40                 92  ???   ???

http://www.cse.nd.edu/~ccl/operations/condor/2005/users.html
Total Consumption in 2006
2376456 (100%) CPU-Hours Total
• 281003 (11%) CPU-Hours Consumed by Owner at Keyboard
• 934277 (39%) CPU-Hours Totally Unused
• 1161176 (48%) CPU-Hours Harnessed by Condor
http://www.cse.nd.edu/~ccl/operations/condor/2005/users.html
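The yearly breakdowns can be checked with a few lines of arithmetic. The sketch below uses only the CPU-hour figures from the two consumption slides; the dictionary layout and the function name `shares` are illustrative. The three categories sum exactly to the reported totals, and the percentages printed on the slides appear to be truncated rather than rounded (for example, 56.99% is shown as 56%).

```python
# CPU-hour figures copied from the 2005 and 2006 consumption slides.
USAGE = {
    2005: {"owner": 134978, "unused": 350148, "condor": 642912},
    2006: {"owner": 281003, "unused": 934277, "condor": 1161176},
}
REPORTED_TOTAL = {2005: 1128038, 2006: 2376456}


def shares(year):
    """Return (total CPU-hours, {category: percent of total}) for a year."""
    parts = USAGE[year]
    total = sum(parts.values())
    return total, {name: 100.0 * hours / total for name, hours in parts.items()}


for year in sorted(USAGE):
    total, pct = shares(year)
    # The three categories should account for every reported CPU-hour.
    assert total == REPORTED_TOTAL[year]
    print(year, total, {name: round(p, 2) for name, p in pct.items()})
```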
Top Condor Users in 2006
(More than 389 jobs can run at once because some migrate to Purdue and UW.)

User             CPU Hours  Percent of Total  Max Jobs Running  Max Jobs in Queue  Dept  Advisor
Total              1161176           100.00%              1156              61695
dcieslak@nd.edu     471126            40.57%              1142              60314  CSE   Chawla
tfaltemi@nd.edu     415972            35.82%               447              20030  CSE   Flynn
tdysart@nd.edu      186050            16.02%               213               1275  CSE   Kogge
johanes@nd.edu       31341             2.70%                21                 21  CRC   -
pbrenne1@nd.edu      27693             2.38%                64               1082  CSE   Izaguirre
dthain@nd.edu        23217             2.00%               299                300  CSE   Thain
apusane@nd.edu       20708             1.78%               160                192  EE    Costello
yjiang3@nd.edu        4842             0.42%                24                366  CSE   Striegel
lxiao@nd.edu          2741             0.24%                53                 66  EE    Fuja
npatel@nd.edu         1690             0.15%                 6                  6  AME   Renaud
gniederw@nd.edu        980             0.08%                29                 35  CSE   Thain
jwozniak@nd.edu        413             0.04%                30                 72  CSE   Izaguirre

http://www.cse.nd.edu/~ccl/operations/condor/2006/users.html
Research Projects Using Condor
• Data Mining and Applications
  • CSE: Chawla
• Multidimensional Biometric Imaging and Applications (NSF/DOJ)
  • CSE: Flynn and Bowyer
• High End Biometric Computing (NSF)
  • CSE: Thain and Flynn
• Architectures and Devices for Quantum Dot Cellular Automata (NSF)
  • EE and CSE: Kogge, Lent, Fay, Orlov
• GEMS: Grid Enabled Molecular Simulations (NSF)
  • CSE and Chem: Izaguirre, Striegel, Peng
• Delay-Constrained Multihop Transmission in Wireless Networks: Interaction of Coding, Channel Access, and Routing (NSF/NASA/Moto)
  • EE: Laneman, Costello, Fuja, Haenggi
• ND Design Automation Laboratory
  • AME: Renaud
• GRAND: Gamma Ray Astrophysics at Notre Dame
  • Physics: Poirier (Distributed Storage)
Recent Papers Supported by Cycles from Condor at ND (1)
CSE: Data Mining
• N. Chawla, D. Cieslak, L. Hall, A. Joshi, "Killing Two Birds with One Stone: Countering Cost and Imbalance," Data Mining and Knowledge Discovery, under review.
• D. Cieslak, N. Chawla, "The Calibration and Power of Probability Estimation Trees in Ensembles," 7th International Workshop on Multiclassifier Systems, under review.
• D. Cieslak, N. Chawla, "Reducing Loss and Improving ROC AUC Through Sampling," International Conference on Machine Learning, Corvallis, Oregon, 2007.
• N. Chawla, D. Cieslak, "Evaluating Calibration of Probability Estimation Trees," Proceedings of the AAAI Workshop on the Evaluation Methods in Machine Learning, Boston, July 2006.
• D. Cieslak, D. Thain, N. Chawla, "Troubleshooting Distributed Systems via Data Mining," Hot Topics Sessions: 15th IEEE International Symposium on High Performance Distributed Computing (HPDC-15), Paris, France, June 2006.
• D. Cieslak, N. Chawla, A. Striegel, "Combating Imbalance in Network Intrusion Datasets," IEEE International Conference on Granular Computing, Athens, Georgia, May 2006.
Recent Papers Supported by Cycles from Condor at ND (2)
CSE: Biometrics
• X. Chen, T. Faltemier, P. Flynn, and K. Bowyer, “Human Face Modeling and Recognition Through Multi-View High Resolution Stereopsis”, Biometrics: Theory, Applications, and Systems, 2006.
• D. Woodard, T. Faltemier, P. Yan, and P. Flynn, “A Comparison of 3D Biometric Modalities”, Biometrics: Theory, Applications, and Systems, 2006.
• T. Faltemier, P. Flynn, and K. Bowyer, “3D Face Recognition with Curvature Based Region Selection”, 3D Data Processing, Visualization, and Transmission, 2006.
• T. Faltemier, K. Bowyer, and P. Flynn, “Region Ensemble for 3D Face Recognition and Indexing”, under submission.
• T. Faltemier, K. Bowyer, and P. Flynn, “Using Multiple Gallery Images for 3D Face Recognition”, under submission.
EE/CSE: Quantum Comp
• Timothy J. Dysart. "Defect Properties and Design Tools for Quantum Dot Cellular Automata." Master's Thesis, 2005.
• Timothy J. Dysart, Peter M. Kogge, Craig S. Lent, and Mo Liu. "An Analysis of Missing Cell Defects in Quantum-Dot Cellular Automata." IEEE International Workshop on Design and Test of Defect-Tolerant Nanoscale Architectures (NANOARCH '05) in conjunction with the VLSI Test Symposium. Palm Springs, CA. May 1, 2005.
Recent Papers Supported by Cycles from Condor at ND (3)
EE: Signal Coding
• On Deriving Good LDPC Convolutional Codes, A. E. Pusane, R. Smarandache, P. O. Vontobel, and D. J. Costello, Jr., submitted to IEEE International Symposium on Information Theory, Nice, France, June 2007.
• A Comparison of ARA- and Protograph-Based LDPC Block and Convolutional Codes, D. J. Costello, Jr., A. E. Pusane, C. Jones, and D. Divsalar, to appear in Proc. Information Theory and Applications Workshop, San Diego, CA, USA, January 29-February 2, 2007.
• LDPC Convolutional Codes: What Are They? How Do They Work? Are They Any Good?, D. J. Costello, Jr. and A. E. Pusane, in Book of Abstracts, AMS Joint Mathematics Meetings, New Orleans, LA, USA, January 5-8, 2007.
• L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Algebraic Superposition of LDGM Codes for Cooperative Diversity," submitted to IEEE International Symposium on Information Theory (ISIT) 2007.
• L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Cooperative diversity based on code superposition," in IEEE International Symposium on Information Theory (ISIT), Seattle, WA, July 2006.
• L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Nested codes with multiple interpretations," in 40th Conference on Information Sciences and Systems (CISS), Princeton, NJ, March 2006.
Recent Papers Supported by Cycles from Condor at ND (4)
CSE: Network Simulation
• Yingxin Jiang, Aaron Striegel, "A Distributed Traffic Control Scheme based on Edge-Centric Resource Management," ACM Computer Communications Review, vol. 36, no. 2, pp. 5-16, April 2006.
• Effects of low-quality computation time estimates in policed schedulers, Justin M. Wozniak, Yingxin Jiang and Aaron Striegel, Proc. Annual Simulation Symposium, IEEE Computer Society, 2007.
• D. Salyers, A. Striegel, "A Novel Approach for Transparent Bandwidth Conservation," Proceedings of Networking 2005, Waterloo, Ontario, Canada, May 2005.
CSE: Scientific Databases
• Access Control for a Replica Management Database, Justin Wozniak, Paul Brenner, Douglas Thain, Aaron Striegel, Jesus Izaguirre, ACM Workshop on Storage Security and Survivability (StorageSS), October 2006.
• Generosity and Gluttony in GEMS: Grid Enabled Molecular Simulations, Justin Wozniak, Paul Brenner, Douglas Thain, Aaron Striegel, and Jesus Izaguirre, in Proceedings of the IEEE Symposium on High Performance Distributed Computing, July 2005.
Recent Papers Supported by Cycles from Condor at ND (5)
CSE: Grid Computing
• Challenges in Executing Data Intensive Biometric Workloads on a Desktop Grid, Christopher Moretti, Timothy Faltemier, Douglas Thain, and Patrick J. Flynn, Workshop on Large Scale and Volatile Desktop Grids, March 2006.
• Operating System Support for Space Allocation in Grid Storage Systems, Douglas Thain, IEEE Conference on Grid Computing, September 2006.
• The Consequences of Decentralized Security in a Cooperative Storage System, Douglas Thain, Chris Moretti, Paul Madrid, Phil Snowberger, and Jeff Hemmes, IEEE Workshop on Security in Storage (SISW), San Francisco, December 2005.
• Separating Abstractions from Resources in a Tactical Storage System, Douglas Thain, Sander Klous, Justin Wozniak, Paul Brenner, Aaron Striegel, and Jesus Izaguirre, in Proceedings of IEEE/ACM Supercomputing, Nov 2005.
• Patisserie: Support for Parameter Sweeps in a Fault-Tolerant, Massively Parallel, Peer-to-Peer Simulation Environment, Timothy Schoenharl, Scott Christley, and Douglas Thain, Workshop on Agent Directed Simulation (ADS), San Diego, California, April 2005.
How does Condor relate to CRC?
• Use the CRC clusters for:
  • CPU-intensive, fine-grained parallel codes.
  • The latest, fastest machines.
  • Professional, continuous support.
• Use the Condor pool for:
  • Coarse-grained, naturally parallel codes.
  • Harnessing college/dept level machines.
  • Integration with distributed storage.
  • Building and deploying novel systems for computer science research.
  • Self-service support at this point.
• (Some ambitious students use both!)
How does Condor relate to OSG?
• The Open Science Grid
  • A wide-area consortium of universities.
  • A mechanism (Condor+Globus) to access remote batch/storage systems over the WAN.
  • Interface (Condor-G) is one piece of Condor.
• The ND Condor Pool
  • A campus-scale collection of resources.
  • Could be made accessible via OSG interface.
  • Indirectly part of OSG/TeraGrid via Purdue.
Scalable I/O for Biometrics
• Computer Vision Research Lab in CSE
  • Goal: Develop robust algorithms for identifying humans from (non-ideal) images.
  • Technique: Collect lots of images. Think up clever new matching function. Compare them.
• How do you test a matching function?
  • For a set S of images,
  • Compute F(Si,Sj) for all Si and Sj in S.
  • Compare the result matrix to known functions.
Credit: Patrick Flynn at Notre Dame CSE
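The test procedure on this slide amounts to an all-pairs computation over the image set. A minimal Python sketch follows. The helper `all_pairs` and the stand-in matching function `toy_similarity` are hypothetical, and the sketch assumes F is symmetric, so each unordered pair is evaluated once and mirrored into the matrix; a real matching function may not have that property.

```python
from itertools import combinations_with_replacement


def all_pairs(images, F):
    """Build the matrix M[i][j] = F(images[i], images[j]) for a symmetric F."""
    n = len(images)
    matrix = [[0.0] * n for _ in range(n)]
    # Visit each unordered pair (i <= j) once and mirror the score.
    for i, j in combinations_with_replacement(range(n), 2):
        score = F(images[i], images[j])
        matrix[i][j] = score
        matrix[j][i] = score  # assumption: F(a, b) == F(b, a)
    return matrix


def toy_similarity(a: bytes, b: bytes) -> float:
    """Hypothetical stand-in for F: fraction of byte positions that match."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


if __name__ == "__main__":
    images = [b"abcd", b"abcf", b"zzzz"]  # placeholders for real image data
    for row in all_pairs(images, toy_similarity):
        print(row)
```

Each cell is independent of every other cell, which is what makes the workload naturally parallel and a fit for many separate Condor jobs.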
Computing Similarities
[Figure: the matching function F]
A Big Data Problem
• Data Size: 10k images of 1 MB = 10 GB
• Total I/O: 10k * 10k * 2 MB * 1/2 = 100 TB
• Would like to repeat many times!
• In order to execute such a workload, we must be careful to partition both the I/O and the CPU needs, taking advantage of distributed capacity.
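The estimate on this slide can be reproduced directly. The sketch below assumes decimal units (1 MB = 10^6 bytes) and that every comparison reads both of its images in full; the function names are illustrative.

```python
MB = 10**6
GB = 10**9
TB = 10**12


def dataset_bytes(num_images, image_bytes):
    """Size of one copy of the whole image set."""
    return num_images * image_bytes


def workload_io_bytes(num_images, image_bytes, symmetric=True):
    """Bytes read if every comparison loads both of its images from disk."""
    comparisons = num_images * num_images
    if symmetric:
        comparisons //= 2  # the 1/2 factor on the slide
    return comparisons * 2 * image_bytes


if __name__ == "__main__":
    n, size = 10_000, 1 * MB
    print(dataset_bytes(n, size) / GB, "GB of data")
    print(workload_io_bytes(n, size) / TB, "TB of I/O")
```

Without the symmetry factor the same formula gives 200 TB. The gap between 10 GB of data and 100 TB of reads is what motivates keeping data close to the jobs and reusing it.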
Conventional Solution
Move 200 TB at Runtime!
[Diagram: jobs, CPUs, and disks]
Using Tactical Storage
1. Break array into MB-size chunks.
2. Replicate data to many disks.
3. Jobs find nearby data copy, and make full use before discarding.
Result: achieve greater than 2 Gb/s of disk->application bandwidth on a large workload.
[Diagram: jobs, CPUs, and disks]
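The three steps on this slide can be sketched in a few functions. This is a toy illustration only, not the tactical storage system's API. The function names, the round-robin placement, and the rule "use the local disk if it holds a copy, otherwise the first replica" are all assumptions made for the example.

```python
def chunk(items, chunk_size):
    """Step 1: break the array into fixed-size chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def replicate(num_chunks, disks, copies):
    """Step 2: place each chunk on `copies` distinct disks (round-robin)."""
    if copies > len(disks):
        raise ValueError("cannot place more copies than there are disks")
    return {
        c: [disks[(c + k) % len(disks)] for k in range(copies)]
        for c in range(num_chunks)
    }


def nearest_copy(chunk_id, placement, local_disk):
    """Step 3: prefer a copy on the job's own disk, else fall back to a replica."""
    holders = placement[chunk_id]
    return local_disk if local_disk in holders else holders[0]


if __name__ == "__main__":
    chunks = chunk(list(range(10)), 4)
    placement = replicate(len(chunks), ["d0", "d1", "d2", "d3"], copies=2)
    print(placement)
    print(nearest_copy(1, placement, local_disk="d2"))
```

A job that holds a chunk locally would then run every comparison involving that chunk before discarding it, which is the "make full use" part of step 3.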
Technical Issues (1)
• Deployment
  • All codes and config live in AFS; just deploy a startup script in /etc/init.d.
  • A manual copy onto each node gets lost at the end of the semester; copy it into the image instead.
• Firewalls
  • TCP/UDP on ports 9000-1000 in both directions.
  • One firewalled machine can hang everyone!
  • Workaround: periodic check of TCP ports; manually disable Condor on FW nodes.
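The periodic TCP check in the firewall workaround could look roughly like the sketch below. It uses only Python's standard `socket` module; the function names are hypothetical, and which host names and port to probe are left to the caller because the slide does not say. A firewall that silently drops packets shows up here as a timeout, which is treated the same as a refused connection.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def unreachable_nodes(hosts, port, timeout=2.0):
    """List the hosts whose port cannot be reached, e.g. firewalled nodes."""
    return [h for h in hosts if not tcp_port_open(h, port, timeout)]


if __name__ == "__main__":
    # Hypothetical usage: the host list and port would come from the pool's config.
    print(unreachable_nodes(["127.0.0.1"], 9, timeout=1.0))
```

Run from cron, the resulting list gives the nodes on which Condor would then be disabled by hand. It checks TCP only, as on the slide.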
Technical Issues (2)
• Disappearing Servers
  • Problem: condor_master on each host disappears mysteriously; pool decays.
  • Diagnosis: AFS outage? Condor bug?
  • Solution: /etc/cron.hourly/restart_condor
• CPU Detection
  • Problem: Hyperthreaded machines appear to be multi-CPU machines on Linux.
  • Result: Condor overcommits the CPU.
  • Solution: Manual override NUM_CPUS=1
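One way to spot the hyperthreading mismatch is to compare logical and physical CPU counts. The sketch below parses text in the format of Linux's /proc/cpuinfo and counts distinct (physical id, core id) pairs. It assumes the kernel reports those two fields, which may not have been true of every machine in the pool; when they are missing it falls back to the logical count. The function name is hypothetical, and this is not how Condor itself detects CPUs.

```python
def count_cpus(cpuinfo_text):
    """Return (logical CPUs, physical cores) from /proc/cpuinfo-style text."""
    logical = 0
    physical = set()
    # /proc/cpuinfo separates one logical processor from the next with a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        if "physical id" in fields and "core id" in fields:
            physical.add((fields["physical id"], fields["core id"]))
    # Fall back to the logical count when core information is absent.
    return logical, (len(physical) if physical else logical)


if __name__ == "__main__":
    sample = (
        "processor\t: 0\nphysical id\t: 0\ncore id\t\t: 0\n\n"
        "processor\t: 1\nphysical id\t: 0\ncore id\t\t: 0\n"
    )
    logical, cores = count_cpus(sample)
    print(f"logical={logical} physical={cores}")
```

When the two numbers differ, the physical count is a candidate value for the manual NUM_CPUS override.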
Summary
• With your help, our Condor pool has provided significant benefits for both research and education. Thank you!
• Liaison between faculty and staff at the dept, college, and univ level is needed to keep the system working.
• Lots more info here:
  • http://www.nd.edu/~condor
  • condor-discuss@listserv.nd.edu