The IEEE CS Task Force on Cluster Computing (TFCC)
William Gropp, Mathematics and Computer Science, Argonne National Lab, www.mcs.anl.gov/~gropp
Thanks to Mark Baker, University of Portsmouth, UK, http://www.dcs.port.ac.uk/~mab
A Little History • In 1998 there was clearly enormous interest in clusters, so it seemed natural to set up a focused group in this area. • A Cluster Computing Task Force was proposed to the IEEE CS. • The TFCC was approved and started operating in February 1999 – it has now been running for just over two years. gropp@mcs.anl.gov
Proposed Activities • Act as an international forum to promote cluster computing research and education, and participate in setting up technical standards in this area. • Be involved with issues related to the design, analysis and development of cluster systems, as well as the applications that use them. • Sponsor professional meetings, produce publications, set guidelines for educational programs, and help co-ordinate academic, funding agency, and industry activities. • Organize events and hold a number of workshops spanning the range of activities sponsored by the Task Force. • Publish a bi-annual newsletter to help the community keep abreast of activities in the field. gropp@mcs.anl.gov
IEEE CS Task Forces • A TF is expected to have a finite term of existence, normally 2-3 years – continued existence beyond that point is generally not appropriate. • A TF is expected either to increase its scope of activities such that establishment of a Technical Committee (TC) is warranted, or to be merged into existing TCs. • The TFCC will submit an application to the CS to become a TC later this year. gropp@mcs.anl.gov
Why a separate TFCC? • It brings together all the activities/technologies used with Cluster Computing into one area – so instead of tracking four or five IEEE TCs there is one... • Cluster Computing is NOT just parallel computing, distributed computing, operating systems, or the Internet; it is a mix of them all, and consequently different. • The TFCC is an appropriate body for focusing activities and publications associated with Cluster Computing. gropp@mcs.anl.gov
http://www.ieeetfcc.org gropp@mcs.anl.gov
TFCC Mailing Lists • Three email lists have currently been set up: • tfcc-l@bucknell.edu – a discussion list open to anyone interested in the TFCC – see the TFCC page for information on how to subscribe. • tfcc-exe@port.ac.uk – a closed executive committee mailing reflector. • tfcc-adv@port.ac.uk – a closed advisory committee mailing reflector. gropp@mcs.anl.gov
Annual Conference – ClusterXY • 1st IEEE International Workshop on Cluster Computing (Cluster 1999), Melbourne, Australia, December 1999, about 105 attendees from 16 countries. http://www.clustercomp.org • 2nd IEEE International Conference on Cluster Computing (Cluster 2000), Chemnitz, Germany, November 2000, anticipated about 160 attendees. http://www.tu-chemnitz.de/cluster2000 • 3rd IEEE International Conference on Cluster Computing (Cluster 2001), Newport Beach, California, October 8-11, 2001, expecting 250-300 attendees. http://andy.usc.edu/cluster2001 gropp@mcs.anl.gov
Associated Events - GRID’XY • 1st IEEE/ACM International Workshop on Grid Computing (Grid2000), Bangalore, India, December 17, 2000 (attendees from 15 countries). http://www.gridcomputing.org • 2nd IEEE/ACM International Workshop on Grid Computing (Grid2001), at SC2001, November 2001 gropp@mcs.anl.gov
Supercomputing • “Birds of a Feather” sessions at SC99 and SC2000. • The aim of these meetings is to gather interested parties and bring them up to date, as well as to present a set of short talks and start discussion on a variety of topics… • There will probably be another at SC01, depending on community interest. gropp@mcs.anl.gov
Other Activities • Book donation program • Cluster Computing Archive • www.ieeetfcc.org/ClusterArchive.html • TopClusters Project • www.TopClusters.org • TFCC Whitepaper • www.dcs.port.ac.uk/~mab/tfcc/WhitePaper • TFCC Newsletter • www.eg.bucknell.edu/~hyde/tfcc gropp@mcs.anl.gov
TopClusters Project • http://www.TopClusters.org • TFCC collaboration with Top500 project. • Numeric, I/O, Web, Database, and Application level benchmarking of clusters. • Joint BOF with Top500 at SC2000 on Cluster-based benchmarking. • Ongoing effort… gropp@mcs.anl.gov
TFCC Whitepaper • A Whitepaper on Cluster Computing, submitted to the International Journal of High-Performance Applications and Supercomputing, November 2000. • A snapshot of the state of the art of Cluster Computing. • Preprint: www.dcs.port.ac.uk/~mab/tfcc/WhitePaper/ gropp@mcs.anl.gov
TFCC Membership • Over 300 registered members. • Membership is free and open to all, but a few benefits may be restricted (e.g. reduced conference registration fees for IEEE members). • Over 450 on the TFCC mailing list <tfcc-l@bucknell.edu>. gropp@mcs.anl.gov
Future Plans • We plan to submit an application to the IEEE CS Technical Activities Board (TAB) to attain full Technical Committee status. • The TAB sees the TFCC as a success, and we hope that our application will be successful. • Obviously, if we achieve TC status we will need the continuing assistance of the TFCC's current volunteers, and we will need to encourage new ones… gropp@mcs.anl.gov
Summary • A successful conference series has been started, with commercial sponsorship. • Promoting cluster-based technologies through TFCC sponsorship. • Helping the community with our book donation program. • Engendering debate and discussion through our mailing list. • Keeping the community informed with our information-rich TFCC Web site. gropp@mcs.anl.gov
Scalable Clusters • TopClusters.org list: • 26 clusters with 128+ nodes • 8 with 500+ nodes • 34 with 64-127 nodes • Most run Linux • Most dedicated to applications • Where are scalable tools developed and tested? • Caveats: • Does not include MPP-like systems (IBM SP, SGI Origin, Compaq, Intel TFLOPS, etc.) • Not a complete list • Only clusters explicitly contributed to TopClusters.org gropp@mcs.anl.gov
What is Scalability? • Most common definition in use: • Works for n+1 nodes if it works for n, for small n • Practical definition • Operations complete “fast enough” • 0.5 to 3 seconds for “interactive” • Operations are reliable • Approach to scalability must not be fragile gropp@mcs.anl.gov
Issues in Clusters and Scalability • Developing and Testing Tools • Requires convenient access to large-scale system • Can this co-exist with production computing? • Too many different tools • Why not adopt Unix philosophy? • Example solution: Scalable Unix Tools • Following slides thanks to Rusty Lusk and Emil Ong gropp@mcs.anl.gov
What Are the Scalable Unix Tools? • Parallel versions of common Unix commands like ps, ls, cp, …, with appropriate semantics • A few new commands in the same spirit but without a serial counterpart • Designed for users • New this spring: release of a high-performance implementation based on MPI • One of the original “official” Ptools projects • Original definition published • Proceedings of the Scalable High Performance Computing Conference • http://www.mcs.anl.gov/~gropp/papers/1994/shpcc-paper.ps gropp@mcs.anl.gov
Motivation • Basic Unix commands (ls, grep, find, …) are quintessential tools. • Simple syntax and semantics (except maybe find's syntax) • Same component interface (lines of text, stdin, stdout) • Unix redirection ( <, >, and especially | ) allows tools to be easily combined into powerful command lines (see the small example below) • “Old-fashioned”: no GUI, little interactivity gropp@mcs.anl.gov
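A small illustrative example of that composition; the commands are standard, but the process name being counted is just an assumption for the sketch:

  # How many mpd daemons are running on this node?
  # (the [m]pd pattern keeps grep from matching its own command line)
  ps aux | grep '[m]pd' | wc -l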
Motivation, continued • Many parallel machines have Unix and at least partially distinct file systems on each node. • A user needs simple and familiar ways to • Copy a file to local file space on each node • Find all processes running on all nodes • Test for conditions on all nodes • Avoid getting swamped with output • On large machines these commands are not useful unless they take advantage of parallelism in their execution. gropp@mcs.anl.gov
Design Goals • Familiar to Unix users • Similar names (we chose pt<Unix-name>) • Same arguments, similar semantics • Interact well with traditional Unix commands, facilitating construction of powerful command lines • Run at interactive speeds (requires scalability in parallel process manager startup and handling of I/O) gropp@mcs.anl.gov
Part I: Parallel Versions of Traditional Commands • ptcp ptmv ptrm ptln ptmkdir ptrmdir ptchmod ptchgrp ptchown pttest[ao] • Select nodes to run on by • -all • -m <file of hostnames> • -M <hostlist>, e.g. ‘donner dasher blitzen’ or ‘ccn%d@1-32,42,65-96’ (see the usage sketch below) gropp@mcs.anl.gov
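A minimal usage sketch of those node-selection options; only the flags listed above are used, and the file, directory, and node names are hypothetical:

  # Copy a data file to /tmp on every node of the cluster
  ptcp -all input.dat /tmp/input.dat

  # Remove it again, but only on nodes ccn1-ccn32 and ccn42
  ptrm -M 'ccn%d@1-32,42' /tmp/input.dat

  # Create a scratch directory on three explicitly named nodes
  ptmkdir -M 'donner dasher blitzen' /tmp/scratch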
Part II: Traditional Commands Producing Lots of Output • ptcat, ptls, ptfind • Have potential to produce lots of output, and the source node is also of interest • With the -h option, output is labelled by node:
  ptls -M node%d@1-3 -h
  [node1]
  myfile1
  [node2]
  [node3]
  myfile1
  myfile2
gropp@mcs.anl.gov
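Because the output is labelled by node, it can be post-processed with ordinary serial Unix tools. A hedged sketch; the directory and file pattern are hypothetical, and the find-style arguments assume ptfind follows its serial counterpart as described above:

  # List /tmp on every node, labelled by node, and page through the result
  ptls -all -h /tmp | less

  # Look for leftover core files in /tmp on nodes ccn1-ccn8
  ptfind -M 'ccn%d@1-8' -h /tmp -name 'core*'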
Performance of ptcp • Copying a single 10 MB file to 241 nodes in 14 seconds. • [Charts: Time to Copy 10 MB file; Total Bandwidth] gropp@mcs.anl.gov
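A rough consistency check on those figures: delivering 10 MB to each of 241 nodes in 14 seconds corresponds to an aggregate rate of about 241 × 10 MB / 14 s ≈ 172 MB/s, presumably what the total-bandwidth chart summarized.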
Watching ptcp
  # Copy a large file to every node:
  ptcp -all bigfile BIGFILE

  # Then repeatedly ask each node how much of BIGFILE has arrived:
  # ls -s reports the size in blocks, awk scales that to a rough
  # percentage (the divisor 98 matches this file's size), and the
  # per-host "percentage N blue red" lines are piped to ptdisp.
  while true; do
    ptexec -all 'echo "`hostname`: `ls -s BIGFILE | awk "{print \\"percentage \\" \$1/98 \\" blue red\\"}"`"' | ptdisp -h
  done
gropp@mcs.anl.gov
[Screenshots: ptdisp “Percentage of Completion” displays during the copy] gropp@mcs.anl.gov
Availability • Open source • Available from http://www.mcs.anl.gov/sut • All source and man pages included • Configure, make, on Linux, Solaris, Irix, AIX • Needs an MPI implementation with mpirun • Developed with Linux, MPICH, and MPD on Chiba City at Argonne gropp@mcs.anl.gov
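A minimal build sketch under those assumptions; the slide mentions only configure and make, so the install step and any configure options are hypothetical:

  # Requires an MPI implementation providing mpirun
  ./configure
  make
  make install   # hypothetical; not listed on the slide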
Chiba City Scalability Testbed • http://www-unix.mcs.anl.gov/chiba/ gropp@mcs.anl.gov
Some Other Efforts in Scalable Clusters • Large Programs • DOE Scientific Discovery through Advanced Computing (SciDAC) • NSF Distributed Terascale Facility (DTF) • OSCAR • Goal is a “cluster in a box” CD • PVFS (Parallel Virtual File System) • Many Smaller Efforts • www.beowulf.org, etc. • Commercial Efforts • Scyld, etc. gropp@mcs.anl.gov