Paraguin Compiler

Paraguin Compiler Examples

Examples • Matrix Addition (the complete program) • Traveling Salesman Problem (TSP) • Sobel Edge Detection

Matrix Addition The complete program

Matrix Addition (complete) #define N 512 #ifdef PARAGUIN typedef void* __builtin_va_list; #endif #include <stdio.h> #include <math.h> #include <sys/time.h> print_results(char *prompt, float a[N][N]); int main(int argc, char *argv[]) { int i, j, error = 0; float a[N][N], b[N][N], c[N][N]; char *usage = "Usage: %s file\n"; FILE *fd; double elapsed_time; struct timeval tv1, tv2;

Matrix Addition (complete) if (argc < 2) { fprintf (stderr, usage, argv[0]); error = -1; } if ((fd = fopen (argv[1], "r")) == NULL) { fprintf (stderr, "%s: Cannot open file %s for reading.\n", argv[0], argv[1]); fprintf (stderr, usage, argv[0]); error = -1; } #pragma paraguin begin_parallel #pragma paraguin bcast error if (error) return -1; #pragma paraguin end_parallel

Matrix Addition (complete) // Read input from file for matrices a and b. // The I/O is not timed because this I/O needs // to be done regardless of whether this program // is run sequentially on one processor or in // parallel on many processors. Therefore, it is // irrelevant when considering speedup. for (i = 0; i < N; i++) for (j = 0; j < N; j++) fscanf (fd, "%f", &a[i][j]); for (i = 0; i < N; i++) for (j = 0; j < N; j++) fscanf (fd, "%f", &b[i][j]); fclose (fd);

Matrix Addition (complete) ; #pragma paraguin begin_parallel // This barrier is here so that we can take a time stamp // Once we know all processes are ready to go. #pragma paraguin barrier #pragma paraguin end_parallel // Take a time stamp gettimeofday(&tv1, NULL); #pragma paraguin begin_parallel #pragma paraguin scatter a b // Parallelize the following loop nest assigning iterations // of the outermost loop (i) to different partitions. #pragma paraguin forall for (i = 0; i < N; i++) for (j = 0; j < N; j++) c[i][j] = a[i][j] + b[i][j];

Matrix Addition (complete) ; #pragma paraguin gather c #pragma paraguin end_parallel // Take a time stamp. This won't happen until after the master // process has gathered all the input from the other processes. gettimeofday(&tv2, NULL); elapsed_time = (tv2.tv_sec - tv1.tv_sec) + ((tv2.tv_usec - tv1.tv_usec) / 1000000.0); printf ("elapsed_time=\t%lf (seconds)\n", elapsed_time); // print result print_results("C = ", c); }

Matrix Addition (complete) print_results(char *prompt, float a[N][N]) { int i, j; printf ("\n\n%s\n", prompt); for (i = 0; i < N; i++) { for (j = 0; j < N; j++) { printf(" %.2f", a[i][j]); } printf ("\n"); } printf ("\n\n"); }

Matrix Addition • After compiling with the command: • scc –DPARAGUIN –D__x86_64__ matrixadd.c –cc mpicc \ –o matrixadd.out • This produces: • matrixadd.out • After compiling with the command: • scc –DPARAGUIN –D__x86_64__ matrixadd.c -.out.c • This produces: • matrixadd.out .c (MPI source code) All on one line

Traveling Salesman Problem (TSP)

The Traveling Salesman Problem is simply to find the shortest circuit (Hamiltonian circuit) that visits every city in a set of cities at most once

This problem falls into the class of “NP-hard” problems • What that means is that there is no known “polynomial” time (“big-oh” of a polynomial) algorithm that can solve it • The only know algorithm to solve it is to compare the distances of all possible Hamiltonian circuits. • But there are N! possible circuits of N cities.

Yes, heuristics can be applied to find a “good” solution fast, but there’s no guarantee it is the best • The “brute force” algorithm is to consider all possible permutations of the N cities • First we’ll fix the first city since there are N equivalent circuits where we rotate the cities • We will consider the reverse directions to be different circuits but that’s hard to account for

If we number the cities from 0 to N-1, and 0 is the origination city, then the possible permutations of 4 cities are: • 0->1->2->3->0 • 0->1->3->2->0 • 0->2->3->1->0 • 0->2->1->3->0 • 0->3->1->2->0 • 0->3->2->1->0 Notice that there are some permutations that are the reverse of other. These are equivalent permutations. Since we are fixing origination city, there are (N-1)! permutations instead of N!.

We can compute the distances between all pairs of locations (O(N2)) • This is the input

Problem: Iterating through the possible permutations is recursive, but we need a straight forward for loop to parallelize • Solution: Use a for loop to assign the first two cities • Since city 0 is fixed, there are n-1 choices for city 1 and n-2 choices for city 2 • That means there are (n-1)(n-2) = n2 – 3n + 2 combinations of the first two cities

Assignment of cities 0-2 N = n*n - 3*n + 2; // (n-1)(n-2) perm[0] = 0; for (i = 0; i < N; i++) { perm[1] = i / (n-2) + 1; perm[2] = i % (n-2) + 1; ...

// This structure is used for the "minloc" reduction. We want to find the // minimum distance as well as which processor found it so that we can // get the final minimum circuit. struct { float minDist; int rank; } myAnswer, resultAnswer; int main(intargc, char *argv[]) { inti, j, k, N, p; int perm[MAX_NUM_CITIES], minPerm[MAX_NUM_CITIES+1]; float D[MAX_NUM_CITIES][MAX_NUM_CITIES]; float dist, minDist, finalMinDist; int abort; To use minloc, we need a value to minimize and the location (rank)

abort = processArgs(argc, argv); if (!abort) { for (i = 0; i < n; i++) { D[i][i] = 0.0f; for (j = 0; j < i; j++) { fscanf (fd, "%f", &D[i][j]); D[j][i] = D[i][j]; } } } else { if (n <= 1) printf ("0 0 0\n"); else n = 0; } #pragmaparaguinbegin_parallel #pragmaparaguinbcast abort if (abort) return -1; Read in the lower triangular matrix of relative distances between cities. The values are mirrored across the diagonal.

#pragmaparaguinbcast n D perm[0] = 0; minDist = 9.0e10; // Near the largest value we can represent with a float if (n == 2) { perm[1] = 1; // If n = 2, the N = 0, and we are done. minPerm[0] = perm[0]; minPerm[1] = perm[1]; minDist = computeDist(D, n, perm); } N = n*n - 3*n + 2; // N = (n-1)(n-2)

#pragmaparaguinforall for (p = 0; p < N; p++) { perm[1] = p / (n-2) + 1; perm[2] = p % (n-2) + 1; if (perm[2] >= perm[1]) perm[2]++; initialize(perm, n, 3); do { dist = computeDist(D, n, perm); if (minDist > dist) { minDist = dist; for (i = 0; i < n; i++) minPerm[i] = perm[i]; } } while (increment(perm,n)); } After cities 0, 1, and 2 have been determined, initialize the rest of the cities for the first permutation. Keep the shortest circuit seen thus far. Move to the next permutation

myAnswer.minDist = minDist; myAnswer.rank = __guin_rank; #pragmaparaguin reduce minlocmyAnswerresultAnswer #pragmaparaguinbcastresultAnswer if (__guin_rank == resultAnswer.rank) { printf (" %f ", minDist); for (i = 0; i < n; i++) printf ("%d ", minPerm[i]); printf ("%d\n", minPerm[0]); } #pragmaparaguinend_parallel struct { float minDist; int rank; } myAnswer, resultAnswer; If the current processors is the one who found the solution, then report the solution.

Demonstration

Sobel Edge Detection

Sobel Edge Detection • Given an image, the problem is to detect where the “edges” are in the picture

Sobel Edge Detection

Sobel Edge Detection Algorithm /* 3x3 Sobel masks. */ GX[0][0] = -1; GX[0][1] = 0; GX[0][2] = 1; GX[1][0] = -2; GX[1][1] = 0; GX[1][2] = 2; GX[2][0] = -1; GX[2][1] = 0; GX[2][2] = 1; GY[0][0] = 1; GY[0][1] = 2; GY[0][2] = 1; GY[1][0] = 0; GY[1][1] = 0; GY[1][2] = 0; GY[2][0] = -1; GY[2][1] = -2; GY[2][2] = -1; for(x=0; x < N; ++x){ for(y=0; y < N; ++y){ sumx = 0; sumy = 0; // handle image boundaries if(x==0 || x==(h-1) || y==0 || y==(w-1)) sum = 0; else{

Sobel Edge Detection Algorithm //x gradient approx for(i=-1; i<=1; i++) for(j=-1; j<=1; j++) sumx += (grayImage[x+i][y+j] * GX[i+1][j+1]); //y gradient approx for(i=-1; i<=1; i++) for(j=-1; j<=1; j++) sumy += (grayImage[x+i][y+j] * GY[i+1][j+1]); //gradient magnitude approx sum = (abs(sumx) + abs(sumy)); } edgeImage[x][y] = clamp(sum); } } There are no loop-carried dependencies. Therefore, this is a Scatter/Gather pattern.

Loop Carried Dependency • A Loop-Carried Dependency is when a value is computed in one iteration and used in another for (i = 1; i < n; i++) { A[i] = f(A[i-1]); <1>: a[1] = f(a[0]); <2>: a[2] = f(a[1]); <3>: a[3] = f(a[2]); <4>: a[4] = f(a[3]); ... • If we run this loop as a forall, then inter-processors communication is needed

Sobel Edge Detection Algorithm • Inputs (that need to be broadcast or scattered): • GX and GY arrays • grayImage array • w and h (width and height) • There are 4 nested loops (x, y, i, and j) • The final answer is the array • edgeImage

Sobel Edge Detection Algorithm #pragma paraguin begin_parallel /* 3x3 Sobel masks. */ GX[0][0] = -1; GX[0][1] = 0; GX[0][2] = 1; GX[1][0] = -2; GX[1][1] = 0; GX[1][2] = 2; GX[2][0] = -1; GX[2][1] = 0; GX[2][2] = 1; GY[0][0] = 1; GY[0][1] = 2; GY[0][2] = 1; GY[1][0] = 0; GY[1][1] = 0; GY[1][2] = 0; GY[2][0] = -1; GY[2][1] = -2; GY[2][2] = -1; #pragma paraguin bcast grayImage w h #pragma paraguin forall 1 for(x=0; x < N; ++x){ for(y=0; y < N; ++y){ sumx = 0; sumy = 0; ... These are the inputs Partition the x loop (outermost loop) Using cyclic scheduling

Sobel Edge Detection Algorithm ... edgeImage[x][y] = clamp(sum); } } ; #pragma paraguin gather edgeImage Gather all elements of the edgeImage array

Questions

Paraguin Compiler

Paraguin Compiler

Presentation Transcript

One pass compiler Compiler Design

yacc Yet Another Compiler Compiler

Yacc Yet Another Compiler Compiler

YACC Yet Another Compiler-Compiler

Compiler

COMPILER

Compiler

Compiler

COMPILER

Compiler

Compiler

Compiler

Compiler

Paraguin Compiler

Using compiler-directed approach to create MPI code automatically Paraguin Compiler Continued