AME: An Any-scale Many-task computing Engine. Zhao Zhang, University of Chicago; Daniel S. Katz, CI, University of Chicago & ANL; Matei Ripeanu, ECE, University of British Columbia; Michael Wilde, CI, University of Chicago & ANL; Ian Foster, CI, University of Chicago & ANL.
MTC application review • Sequenced execution of other programs (e.g., a Montage-style pipeline: mProject → mDiff → mFit → mConFit) • Involves several different programs • Large number of invocations (up to millions) • High degree of inter-task parallelism • Parallelism is enabled by file dependency • Programs exchange data via POSIX files
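The file-dependency parallelism described above can be sketched as a small task graph; a minimal Python sketch, with illustrative task and file names (the real Montage run has thousands of tasks per stage):

```python
# Minimal sketch of an MTC file-dependency graph, modeled on the
# Montage-style pipeline above (task/file names are illustrative).
# Each task lists the POSIX files it reads and the file it writes;
# a task becomes runnable once all of its input files exist.
tasks = {
    "mProject_1": {"reads": [], "writes": "proj_1"},
    "mProject_2": {"reads": [], "writes": "proj_2"},
    "mDiff_1_2":  {"reads": ["proj_1", "proj_2"], "writes": "diff_1_2"},
    "mFit_1_2":   {"reads": ["diff_1_2"], "writes": "fit_1_2"},
}

def runnable(tasks, produced):
    """Tasks whose input files have all been produced."""
    return [name for name, t in tasks.items()
            if all(f in produced for f in t["reads"])]

produced = set()
order = []
while len(order) < len(tasks):
    for name in runnable(tasks, produced):
        if name not in order:
            order.append(name)
            produced.add(tasks[name]["writes"])

print(order)  # -> ['mProject_1', 'mProject_2', 'mDiff_1_2', 'mFit_1_2']
```

Note how the two mProject tasks are independently runnable (inter-task parallelism), while mDiff and mFit are serialized behind the files they consume.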
Supercomputer review • Networks: control network, storage network, interconnect, and an optional data collection network • I/O nodes and login nodes mediate access to shared storage • Large number of compute nodes, with multiple cores per node • No local disk, limited RAM disk • Full Linux kernel
Gaps • Resource Provisioning • Task Management • Task Dispatching • Dependency Resolution • Load Balancing • Data Management • Resiliency
Task Management • Task Dispatching • All tasks are sent to and queued on workers • Each worker screens all of its tasks, determining the state and location of each task's input data • Workers subscribe to the FLS (File Location Lookup Service) for the files their tasks need • Tasks that can run immediately are pushed into a ready queue; the others are kept in a hash table • A task in the hash table is moved to the ready queue once its input files are ready.
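The worker-side dispatch logic above can be sketched as follows; the data structures and the FLS notification interface are assumptions for illustration, not AME's actual implementation:

```python
from collections import defaultdict, deque

class Worker:
    """Sketch of the worker-side dispatching described above:
    every submitted task is screened once; tasks whose inputs are
    all present go to a ready queue, the rest wait in a table keyed
    by the files they still need (FLS interface is simplified)."""

    def __init__(self):
        self.ready = deque()              # tasks runnable now
        self.waiting = defaultdict(list)  # missing file -> waiting tasks
        self.available = set()            # files known to be produced
        self.missing_count = {}           # task -> number of unmet inputs

    def submit(self, task, inputs):
        """Screen a task: ready queue if inputs exist, else wait."""
        missing = [f for f in inputs if f not in self.available]
        if not missing:
            self.ready.append(task)
            return
        self.missing_count[task] = len(missing)
        for f in missing:
            # In AME this is where the worker would subscribe to the FLS.
            self.waiting[f].append(task)

    def on_file_ready(self, filename):
        """FLS notification: the file has been produced somewhere."""
        self.available.add(filename)
        for task in self.waiting.pop(filename, ()):
            self.missing_count[task] -= 1
            if self.missing_count[task] == 0:
                self.ready.append(task)
```

Keying the waiting table by missing file (rather than scanning all waiting tasks on every notification) keeps each FLS update proportional to the number of tasks actually blocked on that file.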
Task Management • Task Dispatching • Test setup • Parameter sweep over scale and task length • Scale = {256, 512, 1024, 2048, 4096, 8192, 16384} cores • Task length = {0, 1, 4, 16, 64, 256} seconds • 16 tasks per core • Dispatch Rate = tasks dispatched / elapsed dispatch time
Task Management • Task Dispatching • Test setup • Parameter sweep over scale and task length • Scale = {256, 512, 1024, 2048, 4096, 8192, 16384} cores • Task length = {1, 4, 16, 64, 256} seconds • 16 tasks per core • Efficiency = ideal runtime / measured runtime, with ideal runtime = task length × 16 tasks per core
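The two metrics above are elided on the original slides; the sketch below uses the standard MTC definitions (tasks per second, and ideal-over-measured runtime), which are assumptions rather than AME's published formulas. The example numbers are illustrative, not measured results:

```python
def dispatch_rate(num_tasks, dispatch_seconds):
    # Tasks dispatched per second (assumed definition; the slide
    # elides the formula).
    return num_tasks / dispatch_seconds

def efficiency(task_len, tasks_per_core, measured_seconds):
    # Ratio of ideal per-core runtime (perfect overlap, zero
    # dispatch overhead) to measured runtime (assumed definition).
    return (task_len * tasks_per_core) / measured_seconds

# e.g. 256 cores x 16 tasks dispatched in 4 s (illustrative numbers):
print(dispatch_rate(256 * 16, 4.0))           # 1024.0 tasks/s
# 16-second tasks, 16 per core, finishing in 280 s:
print(round(efficiency(16, 16, 280.0), 3))    # 0.914
```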
Task Management • Dependency Resolution • States of Intermediate Files • Invalid: The file is not produced yet. • Remote: The file is produced, and stored at some peer node. • Local: The file has been moved to local storage. • Shared: The file has been moved to global shared file system.
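The four file states above form a small state machine; the legal transitions below are inferred from the state descriptions (the slides do not spell them out explicitly), so treat them as a sketch:

```python
from enum import Enum

class FileState(Enum):
    # The four intermediate-file states listed above.
    INVALID = "not produced yet"
    REMOTE  = "produced, stored at some peer node"
    LOCAL   = "moved to local storage"
    SHARED  = "moved to the global shared file system"

# Transitions inferred from the descriptions (an assumption, not
# stated on the slides): a file is produced remotely, then may be
# pulled local and/or pushed to the shared file system.
TRANSITIONS = {
    FileState.INVALID: {FileState.REMOTE},
    FileState.REMOTE:  {FileState.LOCAL, FileState.SHARED},
    FileState.LOCAL:   {FileState.SHARED},
    FileState.SHARED:  set(),
}

def advance(state, new_state):
    """Move a file to a new state, rejecting illegal transitions."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```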
Task Management • Dependency Resolution • [Figure: message sequences for querying a produced file vs. querying an invalid file]
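The two query paths in the figure (a produced file vs. an invalid one) might behave as in the sketch below; the FLS interface is an assumption built from the dispatching slide, not AME's actual API:

```python
class FLS:
    """Sketch of a File Location Lookup Service handling the two
    query cases above: a produced file is answered immediately with
    its location; a query for an invalid (not-yet-produced) file
    registers a subscription that is answered on the next update.
    (Interface is illustrative, not AME's actual API.)"""

    def __init__(self):
        self.locations = {}    # filename -> node currently holding it
        self.subscribers = {}  # filename -> callbacks awaiting it

    def query(self, filename, callback):
        if filename in self.locations:
            # Produced file: reply right away.
            callback(filename, self.locations[filename])
        else:
            # Invalid file: remember the asker, reply on update.
            self.subscribers.setdefault(filename, []).append(callback)

    def update(self, filename, node):
        """A producer registers a newly written file."""
        self.locations[filename] = node
        for cb in self.subscribers.pop(filename, ()):
            cb(filename, node)
```

This matches the worst-case producer/consumer tests: a consumer querying before its producer finishes pays one subscription round trip instead of polling.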
Task Management • Dependency Resolution • Test Setup: • Parameter Sweep over scales and running time, fixed file size at 10 bytes • Scale = {256, 512, 1024, 2048, 4096, 8192, 16384} cores • Running Time = {0, 1, 4, 16} seconds • Each core runs 16 tasks • 16 tasks are divided into 8 pairs, with a producer/consumer relation in each pair • Run the tests with the worst case
Task Management • Dependency Resolution – File size impact • Test Setup • Parameter sweep over scale and data size, with fixed running time of 16 seconds • Scale = {256, 512, 1024, 2048, 4096, 8192} cores • File size = {1KB, 1MB, 10MB} • Each core runs 8 tasks • 8 tasks are divided into 4 pairs, with a producer/consumer relation in each pair • Run the tests with the worst case
Task Management • Overhead Analysis • Query/update/transfer traffic congested in network transit • Saturated CPU • Query/update traffic congested at the server side • Congestion in the queue • Congestion from server-side synchronization • Test Setup • Scale: 256 cores • Running Time: 16 seconds • File Size: 10 bytes • Number of Jobs: 16 tasks per core • 16 tasks are divided into 8 pairs, with a producer/consumer relation in each pair
Data Management • Intermediate File Storage • Isolated file storage and processing vs. collocated • Test Setup • Parameter sweep over scale, with fixed running time of 16 seconds • Scale = {256, 1024, 4096, 16384} cores • Each core runs 16 tasks • 16 tasks are divided into 8 pairs, with a producer/consumer relation in each pair • Run the tests with the worst case
Application • Montage is an astronomy application that composes small telescope images into one large mosaic. It has been run successfully on supercomputers and grids, with MPI and Pegasus respectively.
Application • Test Setup • 6 degree x 6 degree mosaic centered at galaxy M101 • Input: 1319 files, each around 2MB • Output: 1 file, 3.7GB • Parallel Stages: mProjectPP, mDiffFit, mBackground • 512 cores, data management, no load-balancing
Summary • We identify and classify the gaps between MTC applications and supercomputers into six categories: resource provisioning, task dispatching, task dependency resolution, load balancing, data management, and resiliency. • We design and implement AME to bridge these gaps (some parts remain future work). • The results show that AME scales well up to 16,384 cores. • AME accelerates MTC applications, such as Montage, on supercomputers.