Automated Extraction of Skeleton Apps from Apps
February 2012
Daniel Quinlan (LLNL), Matt Sottile (Galois), Aaron Tomb (Galois)
Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551
Operated by Lawrence Livermore National Security, LLC, for the U.S. Department of Energy, National Nuclear Security Administration under Contract DE-AC52-07NA27344

What is a Skeleton and why you want one
• A skeleton is a reduced-size version of an application that focuses on one or more aspects of the behavior of the full original application. Examples include:
  • MPI usage and message passing patterns
  • memory traversal
  • I/O demands
• This is important for Exascale:
  • Provides inputs to simulators for evaluation of expected Exascale architectures and features (e.g. SST/macro)
  • Provides smaller applications for independent study
• A skeleton program will not get the same answer as the original application
• There is prior work in this area…
• I think we are the only ones with a distributed tool for this…

CoDesign Tool Flow: Automatic Generation of Skeletons for Rapid Analysis
[Figure: co-design tool flow diagram. This talk is about the arrows highlighted in the diagram.]

We can generate many skeletons from an App
• Many skeletons could be generated from a single application
• The process can work on full applications or smaller compact applications
[Figure: a single app with many files maps, through Aspect A, Aspect B, ..., Aspect X, to Skeleton A, Skeleton B, ..., Skeleton X: many skeleton apps, each with possibly many files.]

An Automated or Semi-Automated Process
• We treat this as a compiler research problem
• We are building tools to automate the generation of skeletons, but some questions are difficult to resolve:
  • May require dynamic analysis to identify important values
  • May require some user annotations to define some behavior
• We start with the original application and transform it, modifying and removing code, to produce the skeleton; this is an automated source-to-source solution

We are using the ROSE Source-To-Source Compiler to support this work
[Figure: ROSE-based Skeleton Generation Tool. Source code (Fortran/C/C++, OpenMP) enters the ROSE frontend and becomes the ROSE IR, where analyses, transformations, and optimizations are applied; the unparser emits the transformed source code. Analysis labels shown: control flow, control dependency, system dependency, sliced system dependency.]
Science & Technology: Computation Directorate

A Non-trivial problem to Automate
• Different aspects are related (they are not actually orthogonal)
  • Example: inter-message timings are a function of the computational work that an app does.
• Static analysis is not always precise, and dynamic analysis is not always complete
• Our research focuses on using static analysis and formal methods to generate plausible, realistic skeletons.

Example of Automated Skeleton Code Generation: Before/After

Before:

    do {
        if (rank < size - 1)
            MPI_Send( xlocal[maxn/size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
        if (rank > 0)
            MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
        if (rank > 0)
            MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
        if (rank < size - 1)
            MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );
        itcnt ++;
        diffnorm = 0.0;
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++) {
                xnew[i][j] = (xlocal[i][j+1] + xlocal[i][j-1] +
                              xlocal[i+1][j] + xlocal[i-1][j]) / 4.0;
                diffnorm += (xnew[i][j] - xlocal[i][j]) *
                            (xnew[i][j] - xlocal[i][j]);
            }
        for (i=i_first; i<=i_last; i++)
            for (j=1; j<maxn-1; j++)
                xlocal[i][j] = xnew[i][j];
        MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
        gdiffnorm = sqrt( gdiffnorm );
        if (rank == 0)
            printf( "At iteration %d, diff is %e\n", itcnt, gdiffnorm );
    } while (gdiffnorm > 1.0e-2 && itcnt < 100);

After:

    do {
        if (rank < size - 1)
            MPI_Send( xlocal[maxn / size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
        if (rank > 0)
            MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
        if (rank > 0)
            MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
        if (rank < size - 1)
            MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );
        itcnt ++;
        MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
    } while (gdiffnorm > 1.0e-2 && itcnt < 100);

Full source of the Before example:

    #include <stdio.h>
    #include <math.h>
    #include "mpi.h"

    /* This example handles a 12 x 12 mesh, on 4 processors only. */
    #define maxn 12

    int main( argc, argv )
    int argc;
    char **argv;
    {
        int        rank, size, i, j, itcnt;
        int        i_first, i_last;
        MPI_Status status;
        double     diffnorm, gdiffnorm;
        double     xlocal[(12/4)+2][12];
        double     xnew[(12/3)+2][12];

        MPI_Init( &argc, &argv );
        MPI_Comm_rank( MPI_COMM_WORLD, &rank );
        MPI_Comm_size( MPI_COMM_WORLD, &size );
        if (size != 4) MPI_Abort( MPI_COMM_WORLD, 1 );

        /* xlocal[][0] is lower ghostpoints, xlocal[][maxn+2] is upper */
        /* Note that top and bottom processes have one less row of interior points */
        i_first = 1;
        i_last  = maxn/size;
        if (rank == 0)        i_first++;
        if (rank == size - 1) i_last--;

        /* Fill the data as specified */
        for (i=1; i<=maxn/size; i++)
            for (j=0; j<maxn; j++)
                xlocal[i][j] = rank;
        for (j=0; j<maxn; j++) {
            xlocal[i_first-1][j] = -1;
            xlocal[i_last+1][j] = -1;
        }

        itcnt = 0;
        do {
            /* Send up unless I'm at the top, then receive from below */
            /* Note the use of xlocal[i] for &xlocal[i][0] */
            if (rank < size - 1)
                MPI_Send( xlocal[maxn/size], maxn, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD );
            if (rank > 0)
                MPI_Recv( xlocal[0], maxn, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status );
            /* Send down unless I'm at the bottom */
            if (rank > 0)
                MPI_Send( xlocal[1], maxn, MPI_DOUBLE, rank - 1, 1, MPI_COMM_WORLD );
            if (rank < size - 1)
                MPI_Recv( xlocal[maxn/size+1], maxn, MPI_DOUBLE, rank + 1, 1, MPI_COMM_WORLD, &status );

            /* Compute new values (but not on boundary) */
            itcnt ++;
            diffnorm = 0.0;
            for (i=i_first; i<=i_last; i++)
                for (j=1; j<maxn-1; j++) {
                    xnew[i][j] = (xlocal[i][j+1] + xlocal[i][j-1] +
                                  xlocal[i+1][j] + xlocal[i-1][j]) / 4.0;
                    diffnorm += (xnew[i][j] - xlocal[i][j]) *
                                (xnew[i][j] - xlocal[i][j]);
                }
            /* Only transfer the interior points */
            for (i=i_first; i<=i_last; i++)
                for (j=1; j<maxn-1; j++)
                    xlocal[i][j] = xnew[i][j];

            MPI_Allreduce( &diffnorm, &gdiffnorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
            gdiffnorm = sqrt( gdiffnorm );
            if (rank == 0)
                printf( "At iteration %d, diff is %e\n", itcnt, gdiffnorm );
        } while (gdiffnorm > 1.0e-2 && itcnt < 100);

        MPI_Finalize( );
        return 0;
    }

Example of Automated Skeleton Code Generation: Larger example
• Source-to-source transformation
• Def-use analysis of variables leading to MPI calls
• Future work will explore use of:
  • System Dependence Graph (SDG)
  • Data flow framework and defined concepts of dead-code elimination.
• Can be supplemented with dynamic information
• Can be applied to abstract other things than MPI use

Original Source Code: rank(int iteration)

    void rank( int iteration )
    {
        INT_TYPE    i, k;
        INT_TYPE    shift = MAX_KEY_LOG_2 - NUM_BUCKETS_LOG_2;
        INT_TYPE    key;
        INT_TYPE2   bucket_sum_accumulator, j, m;
        INT_TYPE    local_bucket_sum_accumulator;
        INT_TYPE    min_key_val, max_key_val;
        INT_TYPE    *key_buff_ptr;

        TIMER_START( T_RANK );

        /* Iteration alteration of keys */
        if(my_rank == 0 )
        {
            key_array[iteration] = iteration;
            key_array[iteration+MAX_ITERATIONS] = MAX_KEY - iteration;
        }

        /* Initialize */
        for( i=0; i<NUM_BUCKETS+TEST_ARRAY_SIZE; i++ )
        {
            bucket_size[i] = 0;
            bucket_size_totals[i] = 0;
            process_bucket_distrib_ptr1[i] = 0;
            process_bucket_distrib_ptr2[i] = 0;
        }

        /* Determine where the partial verify test keys are, load into */
        /* top of array bucket_size */
        for( i=0; i<TEST_ARRAY_SIZE; i++ )
            if( (test_index_array[i]/NUM_KEYS) == my_rank )
                bucket_size[NUM_BUCKETS+i] = key_array[test_index_array[i] % NUM_KEYS];

        /* Determine the number of keys in each bucket */
        for( i=0; i<NUM_KEYS; i++ )
            bucket_size[key_array[i] >> shift]++;

        /* Accumulative bucket sizes are the bucket pointers */
        bucket_ptrs[0] = 0;
        for( i=1; i< NUM_BUCKETS; i++ )
            bucket_ptrs[i] = bucket_ptrs[i-1] + bucket_size[i-1];

        /* Sort into appropriate bucket */
        for( i=0; i<NUM_KEYS; i++ )
        {
            key = key_array[i];
            key_buff1[bucket_ptrs[key >> shift]++] = key;
        }

        TIMER_STOP( T_RANK );
        TIMER_START( T_RCOMM );

        /* Get the bucket size totals for the entire problem. These
           will be used to determine the redistribution of keys */
        MPI_Allreduce( bucket_size, bucket_size_totals, NUM_BUCKETS+TEST_ARRAY_SIZE, MP_KEY_TYPE, MPI_SUM, MPI_COMM_WORLD );

        TIMER_STOP( T_RCOMM );
        TIMER_START( T_RANK );

        /* Determine Redistribution of keys: accumulate the bucket size totals
           till this number surpasses NUM_KEYS (which is the average number of keys
           per processor). Then all keys in these buckets go to processor 0.
           Continue accumulating again until surpassing 2*NUM_KEYS. All keys
           in these buckets go to processor 1, etc. This algorithm guarantees
           that all processors have work ranking; no processors are left idle.
           The optimum number of buckets, however, does not result in as high
           a degree of load balancing (as even a distribution of keys as is
           possible) as is obtained from increasing the number of buckets, but
           more buckets results in more computation per processor so that the
           optimum number of buckets turns out to be 1024 for machines tested.
           Note that process_bucket_distrib_ptr1 and ..._ptr2 hold the bucket
           number of first and last bucket which each processor will have after
           the redistribution is done. */
        bucket_sum_accumulator = 0;
        local_bucket_sum_accumulator = 0;
        send_displ[0] = 0;
        process_bucket_distrib_ptr1[0] = 0;
        for( i=0, j=0; i<NUM_BUCKETS; i++ )
        {
            bucket_sum_accumulator += bucket_size_totals[i];
            local_bucket_sum_accumulator += bucket_size[i];
            if( bucket_sum_accumulator >= (j+1)*NUM_KEYS )
            {
                send_count[j] = local_bucket_sum_accumulator;
                if( j != 0 )
                {
                    send_displ[j] = send_displ[j-1] + send_count[j-1];
                    process_bucket_distrib_ptr1[j] = process_bucket_distrib_ptr2[j-1]+1;
                }
                process_bucket_distrib_ptr2[j++] = i;
                local_bucket_sum_accumulator = 0;
            }
        }

        /* When NUM_PROCS approaching NUM_BUCKETS, it is highly possible
           that the last few processors don't get any buckets. So, we
           need to set counts properly in this case to avoid any fallouts. */
        while( j < comm_size )
        {
            send_count[j] = 0;
            process_bucket_distrib_ptr1[j] = 1;
            j++;
        }

        TIMER_STOP( T_RANK );
        TIMER_START( T_RCOMM );

        /* This is the redistribution section: first find out how many keys
           each processor will send to every other processor: */
        MPI_Alltoall( send_count, 1, MPI_INT, recv_count, 1, MPI_INT, MPI_COMM_WORLD );

        /* Determine the receive array displacements for the buckets */
        recv_displ[0] = 0;
        for( i=1; i<comm_size; i++ )
            recv_displ[i] = recv_displ[i-1] + recv_count[i-1];

        /* Now send the keys to respective processors */
        MPI_Alltoallv( key_buff1, send_count, send_displ, MP_KEY_TYPE,
                       key_buff2, recv_count, recv_displ, MP_KEY_TYPE,
                       MPI_COMM_WORLD );

        TIMER_STOP( T_RCOMM );
        TIMER_START( T_RANK );

        /* The starting and ending bucket numbers on each processor are
           multiplied by the interval size of the buckets to obtain the
           smallest possible min and greatest possible max value of any
           key on each processor */
        min_key_val = process_bucket_distrib_ptr1[my_rank] << shift;
        max_key_val = ((process_bucket_distrib_ptr2[my_rank] + 1) << shift)-1;

        /* Clear the work array */
        for( i=0; i<max_key_val-min_key_val+1; i++ )
            key_buff1[i] = 0;

        /* Determine the total number of keys on all other
           processors holding keys of lesser value */
        m = 0;
        for( k=0; k<my_rank; k++ )
            for( i= process_bucket_distrib_ptr1[k]; i<=process_bucket_distrib_ptr2[k]; i++ )
                m += bucket_size_totals[i];  /* m has total # of lesser keys */

        /* Determine total number of keys on this processor */
        j = 0;
        for( i= process_bucket_distrib_ptr1[my_rank]; i<=process_bucket_distrib_ptr2[my_rank]; i++ )
            j += bucket_size_totals[i];      /* j has total # of local keys */

        /* Ranking of all keys occurs in this section: */
        /* shift it backwards so no subtractions are necessary in loop */
        key_buff_ptr = key_buff1 - min_key_val;

        /* In this section, the keys themselves are used as their
           own indexes to determine how many of each there are: their
           individual population */
        for( i=0; i<j; i++ )
            key_buff_ptr[key_buff2[i]]++;  /* Now they have individual key */
                                           /* population */

        /* To obtain ranks of each key, successively add the individual key
           population, not forgetting the total of lesser keys, m.
           NOTE: Since the total of lesser keys would be subtracted later
           in verification, it is no longer added to the first key population
           here, but still needed during the partial verify test. This is to
           ensure that 32-bit key_buff can still be used for class D. */
        /* key_buff_ptr[min_key_val] += m; */
        for( i=min_key_val; i<max_key_val; i++ )
            key_buff_ptr[i+1] += key_buff_ptr[i];

        /* This is the partial verify test section */
        /* Observe that test_rank_array vals are */
        /* shifted differently for different cases */
        for( i=0; i<TEST_ARRAY_SIZE; i++ )
        {
            k = bucket_size_totals[i+NUM_BUCKETS];  /* Keys were hidden here */
            if( min_key_val <= k && k <= max_key_val )
            {
                /* Add the total of lesser keys, m, here */
                INT_TYPE2 key_rank = key_buff_ptr[k-1] + m;
                int failed = 0;

                switch( CLASS )
                {
                    case 'S':
                        if( i <= 2 )
                        {
                            if( key_rank != test_rank_array[i]+iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'W':
                        if( i < 2 )
                        {
                            if( key_rank != test_rank_array[i]+(iteration-2) )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'A':
                        if( i <= 2 )
                        {
                            if( key_rank != test_rank_array[i]+(iteration-1) )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-(iteration-1) )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'B':
                        if( i == 1 || i == 2 || i == 4 )
                        {
                            if( key_rank != test_rank_array[i]+iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'C':
                        if( i <= 2 )
                        {
                            if( key_rank != test_rank_array[i]+iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                    case 'D':
                        if( i < 2 )
                        {
                            if( key_rank != test_rank_array[i]+iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        else
                        {
                            if( key_rank != test_rank_array[i]-iteration )
                                failed = 1;
                            else
                                passed_verification++;
                        }
                        break;
                }
                if( failed == 1 )
                    printf( "Failed partial verification: "
                            "iteration %d, processor %d, test key %d\n",
                            iteration, my_rank, (int)i );
            }
        }

        TIMER_STOP( T_RANK );

        /* Make copies of rank info for use by full_verify: these variables
           in rank are local; making them global slows down the code, probably
           since they cannot be made register by compiler */
        if( iteration == MAX_ITERATIONS )
        {
            key_buff_ptr_global = key_buff_ptr;
            total_local_keys = j;
            total_lesser_keys = 0;  /* no longer set to 'm', see note above */
        }
    }

Generated Skeleton Code: rank(int iteration)

    void rank(int iteration)
    {
        INT_TYPE i;
        INT_TYPE k;
        INT_TYPE shift = (23 - 10);
        INT_TYPE key;
        INT_TYPE2 bucket_sum_accumulator;
        INT_TYPE2 j;
        INT_TYPE2 m;
        INT_TYPE local_bucket_sum_accumulator;
        INT_TYPE min_key_val;
        INT_TYPE max_key_val;
        INT_TYPE *key_buff_ptr;
        /* Get the bucket size totals for the entire problem. These
           will be used to determine the redistribution of keys */
        MPI_Allreduce(bucket_size,bucket_size_totals,((1 << 10) + 5),MPI_INT,MPI_SUM,MPI_COMM_WORLD);
        /* This is the redistribution section: first find out how many keys
           each processor will send to every other processor: */
        MPI_Alltoall(send_count,1,MPI_INT,recv_count,1,MPI_INT,MPI_COMM_WORLD);
        /* Now send the keys to respective processors */
        MPI_Alltoallv(key_buff1,send_count,send_displ,MPI_INT,key_buff2,recv_count,recv_displ,MPI_INT,MPI_COMM_WORLD);
    }

Static Analysis Drives Skeleton Generation
• First prototype:
  • Generate a skeleton representing message passing via static analysis (using the use-def analysis in ROSE)
• Basic concept, where MPI is the target aspect:
  • Identify message passing (MPI) operations.
  • Preserve MPI operations and the code that they depend on, removing superfluous code.
  • Aim to remove large blocks of computational code, replacing them with simpler surrogate code, to produce a skeleton of the app that contains the essential message passing structure without the actual work.
• Our research approach has been to explore four different forms of analysis to drive the skeleton generation:
  • Use-def analysis (to generate a form of program slice); works on the AST directly, without directly using the inter-procedural control flow graph (CFG)
  • Program slicing using ROSE's System Dependence Graph (SDG), which captures the def-use analysis and more on the inter-procedural control flow graph in ROSE
  • A new Data-Flow Framework in ROSE; another form of analysis using the inter-procedural control flow graph in ROSE
  • Connections to formal methods

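The "preserve MPI operations and the code that they depend on" step can be pictured as a backward pass over def-use information. The sketch below is a toy model written for illustration, not the ROSE implementation: it assumes straight-line code, represents the variables each statement defines and uses as bitmasks, and marks target-API calls with a flag. The names `Stmt` and `slice_keep` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy statement: bitmasks of the variables it writes and reads. */
typedef struct {
    unsigned defs;       /* variables written by this statement */
    unsigned uses;       /* variables read by this statement */
    int is_target_call;  /* nonzero if it calls the target API (e.g. MPI) */
} Stmt;

/* Backward pass over straight-line code: keep every target call and every
 * statement that defines a variable still needed by a kept statement.
 * keep[i] is set to 1 for statements that stay in the skeleton, else 0.
 * Returns the number of kept statements. */
size_t slice_keep(const Stmt *stmts, size_t n, int *keep)
{
    unsigned needed = 0;  /* variables whose definitions are still required */
    size_t kept = 0;

    assert(stmts != NULL && keep != NULL);
    for (size_t i = n; i-- > 0; ) {
        const Stmt *s = &stmts[i];
        if (s->is_target_call || (s->defs & needed)) {
            keep[i] = 1;
            kept++;
            needed &= ~s->defs;  /* this definition satisfies the need */
            needed |= s->uses;   /* its own inputs are now needed */
        } else {
            keep[i] = 0;         /* superfluous for the target aspect */
        }
    }
    return kept;
}
```

For a four-statement program in which only the last statement is an MPI call that reads `x`, the pass keeps the call and the definition of `x` and drops the unrelated computation. Real code needs control dependences and inter-procedural information as well, which this sketch deliberately leaves out.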
Static Analysis: Program Slicing

    int returnMe (int me) { return me; }
    int main (int argc, char ** argv) {
        int a = 1;
        int b;
        returnMe(a);
        b = returnMe(a);
    #pragma SliceTarget
        return b;
    }

• System (Inter-procedural) Dependence Analysis
• A sequence of directed edges defines a slice
• Can be used for model extraction

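As a worked illustration of the slide's example, the sketch below places the program next to what a backward slice on `return b` would plausibly retain. It is hand-derived under the assumption that the first call to `returnMe`, whose result is discarded, has no dependence edge reaching the slice target; it is not output from ROSE. The names `original_program`, `sliced_program`, and `check_slice` are hypothetical stand-ins used so both versions can live in one file.

```c
#include <assert.h>

static int returnMe(int me) { return me; }

/* The example from the slide, with main renamed for illustration. */
int original_program(void)
{
    int a = 1;
    int b;
    returnMe(a);      /* result discarded: nothing here reaches b */
    b = returnMe(a);
    return b;         /* slice target */
}

/* Hand-derived backward slice with respect to `return b`. */
int sliced_program(void)
{
    int a = 1;
    int b;
    b = returnMe(a);
    return b;
}

/* A slice must compute the same value at the slice target. */
void check_slice(void)
{
    assert(original_program() == sliced_program());
}
```

The declaration of `a`, the assignment to `b`, and the callee `returnMe` all stay because the target depends on them through data or call edges; only the unused call is removed.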
Data Flow as an alternative approach to Drive Skeleton Generation
• Future work will explore the use of a new Data Flow Framework in ROSE to support the analysis required to generate skeletons
  • May be an easier way (for users) to specify aspects
  • It is related to slicing in that it uses the same inter-procedural control flow graph internally
• Each form of analysis (Use-def, SDG, and Data-Flow) is an orthogonal direction of work, and all share the common infrastructure we have built for skeleton generation.
• The analysis and infrastructure are implemented using ROSE

A Generic API for Skeletonization
• Generalized skeletonization target APIs
  • Original work focused on skeletonizing relative to the MPI API.
  • Current code is extended to allow skeletons against any API (e.g., Visualization and Data Analysis, I/O and Storage, use of domain-specific abstractions, etc.)
  • Important for building skeletons to probe different aspects of program behavior: I/O, message passing, threading, app-specific libraries

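One way to picture skeletonizing against an arbitrary API is to make the set of preserved calls a parameter instead of hard-coding MPI. The sketch below assumes, purely for illustration, that an API is described by a list of function-name prefixes. `ApiSpec` and `is_target_call` are hypothetical names and do not reflect how the actual tool specifies target APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a target API as a set of name prefixes. */
typedef struct {
    const char *name;              /* label for the API, e.g. "MPI" */
    const char *const *prefixes;   /* callee-name prefixes to preserve */
    size_t num_prefixes;
} ApiSpec;

/* Returns 1 if a call to `callee` belongs to the target API and should be
 * preserved in the skeleton, 0 otherwise. */
int is_target_call(const ApiSpec *api, const char *callee)
{
    assert(api != NULL && callee != NULL);
    for (size_t i = 0; i < api->num_prefixes; i++) {
        const char *p = api->prefixes[i];
        if (strncmp(callee, p, strlen(p)) == 0)
            return 1;
    }
    return 0;
}
```

With a spec whose only prefix is `"MPI_"`, calls such as `MPI_Allreduce` are preserved and `sqrt` is not; swapping in an I/O spec (for example `fopen`, `fread`, `fwrite`, `fclose`) would yield an I/O skeleton from the same slicing machinery.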
Annotation guided skeletonization
• Previous work focused on purely dependency-based slicing. This led to problems:
  • Removal of computational code could cause loops to cease to converge (iterate forever).
  • Branching patterns are no longer meaningful with the computational code gone.
• Annotations let the user guide skeletonization by adding semantics to the skeleton that are impossible or difficult to infer statically.
  • Loop iteration counts; branching probabilities; variable initialization values.

Use of an Annotation Before/After

Before:

    int main() {
        int x = 0;
        int i;
        // execute exactly 10 times
    #pragma skel loopIterate 10
        for (i = 0; x < 100 ; i++) {
            if (x % 2) x += 5;
        }
        return x;
    }

After:

    int main() {
        int x = 0;
        int i;
        // execute exactly 10 times
    #pragma skel loopIterate 10
        int k = 0;
        for (i = 0; k < 10; k++) {
            {
                if ((x % 2) != 0) x += 5;
            }
            rose_label__1: i++;
        }
        return x;
    }

User Work Flow for Skeletonization
[Figure: workflow diagram. The original application program is annotated to give an annotated application program, which the skeleton extraction tool turns into a skeleton program. The behavior of the skeleton is then observed. Satisfactory behavior: keep the skeleton. Unsatisfactory behavior: modify or add annotations to tune the skeleton generator. Dynamic measurements of the program can supply branch probabilities, average loop iteration counts, and legitimate data values.]

Future work
• SDG version of analysis for skeletonization
• Using the new Data Flow framework in ROSE for skeletonization
• Galois will be working on adding formal-methods-based analysis to the skeleton generator to analyze regions of code to remove.
  • Floating point range analysis.
  • Symbolic execution.
• Formal methods will aim to answer questions to aid skeleton generation such as:
  • What range of values do we expect a complex computation to produce?
    • Allows us to automatically select surrogate values for populating data structures
    • Know when specific values are critical
  • Under specific input conditions, what code is reachable or not reachable?
    • Allows us to build skeletons for specific input circumstances, instead of generic skeletons
    • This is a connection to path feasibility analysis currently being developed in ROSE

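The question "what range of values do we expect a complex computation to produce?" can be illustrated with interval arithmetic, one common basis for range analysis. The sketch below is an assumption about the flavor of such an analysis, not the planned Galois implementation: it ignores rounding direction, which a sound floating-point analysis would need to handle, and the names `Interval`, `iv_add`, `iv_mul`, and `iv_stencil_average` are hypothetical.

```c
#include <assert.h>

/* Closed interval [lo, hi] of possible values. */
typedef struct { double lo, hi; } Interval;

static double min2(double a, double b) { return a < b ? a : b; }
static double max2(double a, double b) { return a > b ? a : b; }

/* Range of a + b when a and b range over the given intervals. */
Interval iv_add(Interval a, Interval b)
{
    assert(a.lo <= a.hi && b.lo <= b.hi);
    Interval r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

/* Range of a * b: the extremes occur at products of the endpoints. */
Interval iv_mul(Interval a, Interval b)
{
    assert(a.lo <= a.hi && b.lo <= b.hi);
    double p1 = a.lo * b.lo, p2 = a.lo * b.hi;
    double p3 = a.hi * b.lo, p4 = a.hi * b.hi;
    Interval r = { min2(min2(p1, p2), min2(p3, p4)),
                   max2(max2(p1, p2), max2(p3, p4)) };
    return r;
}

/* Range of a 4-point stencil average (n + s + e + w) / 4 when every
 * neighbor lies in x. */
Interval iv_stencil_average(Interval x)
{
    Interval quarter = { 0.25, 0.25 };
    Interval sum = iv_add(iv_add(x, x), iv_add(x, x));
    return iv_mul(sum, quarter);
}
```

If every mesh value lies in [-1, 3], the stencil average also lies in [-1, 3], so any value in that range would be a plausible surrogate for populating the array once the computation itself is removed.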
ROSE Compiler Design
[Figure: ROSE compiler architecture, in three layers.
 Front-End: general purpose languages used within DOE: C & C++, Fortran (F77-F2003), CUDA, UPC 1.1, OpenMP 3.0, Python; AST Builder API.
 Mid-End: High Level IRs (AST) with the High Level Analysis & Optimization Framework and the IR Extension API (ROSETTA); Low Level IR (LLVM) with Low Level Analysis & Optimization and existing LLVM Analysis & Optimization.
 Back-End: Unparser; LLVM Backend Code Generation; Exascale Vendor Compilers and Exascale Vendor Compiler Infrastructures; Exascale Architecture.]