Big Data Technologies, Lecture 3: Algorithm Parallelization. Assoc. Prof. Marc FRÎNCU, PhD. Habil. marc.frincu@e-uvt.ro
Conceptually • Master-slave model • Master process • Starts a number of client processes (slaves) on other cores/CPUs/machines • Communicates with them • Sends data for processing and receives answers • Ensures all data are received and continues execution
Can we parallelize the algorithm? • Code • Understanding of the sequential code (if it exists) • Identifying critical points • Where are the computationally heavy code lines? • Profiling • Parallelize code only where intense computation is found (Amdahl's law) • Where are the bottlenecks? • Are there areas of slow code? • I/O • Use parallel optimized libraries • IBM ESSL, Intel MKL, AMD ACML, etc. • If many parallel versions of the same algorithm exist, ALL must be studied! • Data • Any data dependencies? • Can they be removed? • Can the data be partitioned?
Examples • Potential energy of each molecule; find the minimum energy configuration • Each computation can be done in parallel • The search for the minimum energy can also be parallelized • Parallel search • Fibonacci • F(n) depends on F(n-1) and F(n-2) • Cannot be parallelized
Automatic vs. manual parallelization • The process of parallelizing code is complex, iterative and error prone, requiring time until an efficient solution is found • Some compilers can parallelize code (pre-processing) • Automated • The compiler identifies parallel code sections as well as bottleneck areas • Offers an analysis of the benefits of parallelization • The targets are the iterative statements (do, for) • Programmer oriented • Use compilation directives and execution flags • The programmer is in charge of finding the parallel sections • Mostly used on shared memory machines • OpenMP (see the sketch below) • BUT • Performance can degrade • Less flexible than manual parallelization • Limited to certain code sections (loops) • Not all sequential code is parallelizable as is
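A minimal sketch of directive-based parallelization with OpenMP, assuming a loop whose iterations are independent; the array names and sizes are illustrative only.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];   // static: keeps the large arrays off the stack

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    // The programmer only marks the loop; the compiler and runtime create the threads.
    // Each iteration touches a distinct element, so the loop is safe to parallelize.
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}

Compiled with the -fopenmp flag shown on the OpenMP slide; without the flag the pragma is ignored and the code runs sequentially.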
Designing parallel algorithms • Partitioning • Communication • Synchronization • Data dependencies • Load balancing • Granularity • I/O • Debugging • Performance analysis and optimization
Problem and data partitioning • The first step in any parallel problem is to partition it in order to be handled by multiple parallel processes • Data partitioning (domain) • Algorithm partitioning
Interprocess communication • Do we need communication? • Embarrassingly parallel problems • Example: image negative • Problems where we need information from neighbors • Example: 2D heat dissipation
Interprocess communication • Communication overhead • Resources are used to store and send data • Communication requires synchronization • A process will wait for the other to send data • Network limitations • Latency vs. bandwidth • Latency: time needed to send information from point A to point B • Microseconds • Bandwidth: quantity of information sent per time unit • MB/s, GB/s • Many small messages make latency dominant, so pack them together as a single message (see the sketch below) • Communication visibility • In MPI messages are visible and under the programmer's control • The Data Parallel model hides the communication details (MapReduce)
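A minimal MPI sketch of message aggregation, assuming at least two ranks; the buffer size is illustrative. Sending one buffer of N values pays the latency once instead of N times.

#include <mpi.h>

#define N 1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[N];
    if (rank == 0) {
        for (int i = 0; i < N; i++) buf[i] = (double)i;
        // One message carrying N doubles instead of N one-double messages.
        MPI_Send(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}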
Interprocess communication • Synchronous vs. asynchronous • Synchronization blocks the code, as one task must wait for the other to finish and send its data • Asynchronous execution assumes that a task executes independently from the others (non-blocking communication, see the sketch below)
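A minimal sketch of non-blocking (asynchronous) communication with MPI, assuming exactly two ranks exchange one value; the variable names are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank * 1.0, remote = 0.0;
    MPI_Request reqs[2];

    if (rank == 0) {
        // Non-blocking send/receive return immediately...
        MPI_Isend(&local, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&remote, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &reqs[1]);
        // ...so useful work can overlap the communication here...
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   // block only when the data is needed
        printf("rank 0 received %f\n", remote);
    } else if (rank == 1) {
        MPI_Isend(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&remote, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}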
Interprocess communication • Scope of communication • Knowing how communication takes place is vital in parallel algorithm design • Point-to-point: task to task • Collective: task to many tasks
Interprocess communication • Efficiency of communication • An MPI implementation may give different results based on the underlying hardware architecture • Asynchronous communication may improve the execution of the parallel algorithm • Type and properties of the network • Overhead and complexity
Synchronization • Barrier • Each task executes until it reaches the barrier • Synchronization takes place when all tasks have reached the barrier • The Bulk Synchronous Parallel model • Semaphore • Used to serialize access to data • Only one task accesses the data at a time • When multiple tasks request access simultaneously, access is granted randomly or based on priorities • Java threads, MPI • Synchronous operations • Involve only the tasks engaged in active communication • Before sending data, a task must receive the OK from the other • MPI • (see the OpenMP sketch below for barriers and exclusive access)
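A minimal OpenMP sketch of the two mechanisms, assuming four threads: a critical section serializes access to a shared counter (semaphore-like exclusive access) and a barrier holds every thread until all have arrived.

#include <omp.h>
#include <stdio.h>

int main(void) {
    int shared_counter = 0;

    #pragma omp parallel num_threads(4)
    {
        int id = omp_get_thread_num();

        // Exclusive access: only one thread at a time updates the shared counter.
        #pragma omp critical
        shared_counter++;

        // Barrier: no thread continues until all threads have reached this point.
        #pragma omp barrier

        if (id == 0)
            printf("all threads passed the barrier, counter = %d\n", shared_counter);
    }
    return 0;
}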
Data dependencies • There is a dependency between two tasks if the order of their execution affects the outcome of the program • Dependencies are the main inhibitors of parallelism • They occur when the same data address is accessed by multiple tasks • How do we solve them? • Synchronization • Exclusive access to the shared memory • Example: the value of Y depends on X (which X?) • Example: the execution of A(j) depends on A(j-1) (see the loop sketch below)
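A short C sketch of the A(j) / A(j-1) case, with illustrative array names: the first loop carries a dependency between iterations, the second does not.

#include <stdio.h>
#define N 8

int main(void) {
    double a[N] = {1, 1, 1, 1, 1, 1, 1, 1}, b[N];

    // Loop-carried dependency: iteration j reads the value written in iteration j-1,
    // so the iterations cannot be distributed among tasks as they are.
    for (int j = 1; j < N; j++)
        a[j] = a[j - 1] * 2.0;

    // Independent iterations: each one touches only its own elements,
    // so this loop can be parallelized directly.
    for (int j = 0; j < N; j++)
        b[j] = a[j] + 1.0;

    printf("a[N-1]=%f b[N-1]=%f\n", a[N - 1], b[N - 1]);
    return 0;
}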
Load balancing • Uniform distribution of the workload per task • All tasks perform some work, none is idle • Optimize the usage as well as the number of tasks • How do we achieve it? • Partitioning the workload (data) • Matrix operations • Distribute the data uniformly across machines • Loops • Distribute the iterations uniformly across machines • For heterogeneous machines • Profile the code to determine unbalanced sections • Dynamic allocation of the workload • When a task finishes a work item it receives the next one (see the sketch below)
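A minimal sketch of dynamic work allocation using OpenMP's dynamic schedule; the simulated uneven workload (usleep) and the iteration count are illustrative.

#include <omp.h>
#include <stdio.h>
#include <unistd.h>

#define TASKS 16

int main(void) {
    // Iterations take different amounts of time (unbalanced workload).
    // With schedule(dynamic, 1) a thread grabs the next iteration
    // as soon as it finishes the current one.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < TASKS; i++) {
        usleep((i % 4) * 1000);          // simulate uneven work
        printf("work item %2d done by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}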
Granularity • Computation/communication ratio • Periods of computation separated by periods of communication • Fine grained parallelism • Little computation between communication events • Low computation/communication ratio • Good for load balancing • High overhead with few opportunities for optimization • Communication/synchronization can take longer than the computation • Coarse grained parallelism • High computation/communication ratio • Opportunity to optimize • Harder to load balance • How do we choose? • Depends on the algorithm and the hardware • Usually better to have coarse grained parallelism • The communication/synchronization overhead is usually too high compared to the computation cost • Fine grained parallelism can help with load balancing
I/O • Parallelism inhibitor • Requires lots of time • Parallel I/O systems are not widely available • HDFS (Hadoop Distributed File System) • Lustre (Linux servers) • IBM Spectrum Scale • ... • In a shared environment I/O can lead to file overwrites • Read operations can be affected by the capability of the file server to handle multiple requests • I/O over the network (NFS) can lead to congestion and file server blackouts
Debugging • Can be costly if code complexity is high • Several dedicated tools exist • TotalView • DDT • Inspector (Intel)
Performance analysis and optimization • Much more complex than for the sequential code • Valgrind (http://valgrind.org/) • Vampir (http://vampir.eu/) • Mpitrace (https://computing.llnl.gov/tutorials/bgq/index.html#mpitrace)
Practical example • Compute PI using the Monte Carlo method • Idea • Use random numbers to cover the surface • (x, y) random between -1 and 1 • Circle radius = 1 • The total number of points must be large enough • (a sequential baseline is sketched below)
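A sequential C baseline sketch of the Monte Carlo estimate, with an arbitrary point count and seed; the parallel versions later in the lecture distribute exactly this loop.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long npoints = 10000000;       // illustrative; must be large enough
    long circle_count = 0;

    srand48(12345);                      // arbitrary fixed seed
    for (long i = 0; i < npoints; i++) {
        double x = 2.0 * drand48() - 1.0;    // random x in [-1, 1]
        double y = 2.0 * drand48() - 1.0;    // random y in [-1, 1]
        if (x * x + y * y <= 1.0)            // inside the unit circle?
            circle_count++;
    }

    // area(circle) / area(square) = pi / 4, so pi is roughly 4 * inside / total
    printf("pi estimate: %f\n", 4.0 * (double)circle_count / (double)npoints);
    return 0;
}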
Practical example • Problem analysis • Any data dependencies? • Can we generate points in parallel? • Communication? • Do we need communication to generate random numbers? • Can we compute the percentage of points generated inside the circle on each processor without communication? • Load balancing? • Can we generate the same number of points on each processor? • Strategy • Divide et impera • G = N/P points per processor (N – total no. of points, P – no. of processors) • On each processor, check whether the generated point is inside the circle • SPMD model • Master-slave • A parent process (master) gathers the results from the child processes (slaves)
Pseudocode

npoints = 10000
circle_count = 0
p = number of tasks
num = npoints / p

// find out if I am MASTER or WORKER

do j = 1, num
  generate 2 random numbers between 0 and 1
  xcoordinate = random1
  ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle then
    circle_count = circle_count + 1
end do

if I am MASTER then
  receive from WORKERS their circle_counts
  compute PI (use MASTER and WORKER calculations)
else if I am WORKER then
  send to MASTER circle_count
endif
APIs for parallel shared and distributed memory algorithms • Shared memory • OpenMP • Distributed memory • Unified Parallel C • MPI • GPUs • CUDA • Distributed computing • MapReduce • Data flows: Storm, Spark • Graphs: Giraph, GraphX
OpenMP • Shared memory model • Requires minimal modifications of the sequential code • The specification is implemented in the compiler • g++ program.cpp -fopenmp -o program • fork-join model • When the compiler reaches an OpenMP parallel construct, it creates a team of threads which execute in parallel (fork) • The threads merge at the end of the parallel region (join)
OpenMP • How is the parallel code implemented? • A set of functions • Start with omp_ • omp_get_thread_num(): returns the number of the current thread • omp_get_num_threads(): returns the total number of available threads • Environment variables • Start with OMP_ • OMP_NUM_THREADS: sets the number of threads • $ export OMP_NUM_THREADS=4 • Pragma directives • Start with #pragma omp • #pragma omp parallel { // structured block } • (a minimal example follows below)
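A minimal fork-join sketch using the calls listed above; compile with g++ hello.cpp -fopenmp -o hello and run with, e.g., OMP_NUM_THREADS=4.

#include <omp.h>
#include <stdio.h>

int main(void) {
    // Fork: the master thread spawns a team of threads for the parallel region.
    #pragma omp parallel
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   // Join: the team merges back into a single thread here.
    return 0;
}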
PI in OpenMP

// firstprivate(x, y, z, i): makes these variables private to each thread
// reduction(+:count): sums up the per-thread values of count
// num_threads(numthreads): sets the number of threads

#include <omp.h>
...
#pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
{
  // give random() a seed value, different per thread
  srand48((int)time(NULL) ^ omp_get_thread_num());
  for (i = 0; i < niter; ++i)        // main loop
  {
    x = (double)drand48();           // gets a random x coordinate
    y = (double)drand48();           // gets a random y coordinate
    z = ((x*x) + (y*y));             // checks to see if the point is inside the unit circle
    if (z <= 1)
    {
      ++count;                       // if it is, consider it a valid random point
    }
  }
}
pi = ((double)count / (double)(niter * numthreads)) * 4.0;
printf("Pi: %f\n", pi);

Warning: we cannot use the library random functions here, they are NOT thread safe! Use a custom random function for x and y; see https://www.bnl.gov/bnlhpc2013/files/pdf/OpenMPTutorial.pdf for a correct implementation!
MPI • Message Passing Interface • API for writing message-based parallel applications • Distributed memory • Many implementations • MPICH2 • Function names start with MPI_ • MPI_Init(); • MPI_Finalize(); • MPI_Comm_size(MPI_COMM_WORLD, &world_size); • MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); • MPI_Get_processor_name(processor_name, &name_len); • MPI_Send(&offset, 1, MPI_INT, dest, BEGIN, MPI_COMM_WORLD); • MPI_Recv(&offset, 1, MPI_INT, source, msgtype, MPI_COMM_WORLD, &status); • (a minimal hello-world sketch follows below)
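A minimal MPI hello-world sketch using the calls listed above, in the spirit of the mpitutorial.com source cited at the end of the lecture; run with, e.g., mpirun -np 4.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_size, world_rank, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_size(MPI_COMM_WORLD, &world_size);          // how many processes
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);          // which one am I
    MPI_Get_processor_name(processor_name, &name_len);   // where am I running

    printf("Hello from rank %d of %d on %s\n",
           world_rank, world_size, processor_name);

    MPI_Finalize();
    return 0;
}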
PI in MPI

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
printf("MPI task %d has started...\n", taskid);

// set seed for random number generator equal to task ID
srandom(taskid);
avepi = 0;
for (i = 0; i < ROUNDS; i++) {
  // call the function which generates random numbers and counts how many are in the circle
  // see the OpenMP code for details
  // this method is called in MASTER and all SLAVES
  count = dboard(DARTS);
  rc = MPI_Reduce(&count, &pisum, 1, MPI_DOUBLE, MPI_SUM, MASTER, MPI_COMM_WORLD);
}

if (taskid == MASTER)
  pi = (pisum / (count * numtasks)) * 4.0;
printf("\nReal value of PI: 3.1415926535897 \n");

MPI_Finalize();
Unified Parallel C (UPC) • Extension of C for parallel computing • Distributed memory • Shared memory • SPMD model • Own compiler • upcc -o program program.upc • upcrun -n 4 program • Variables • shared int x • Own constructs • upc_forall • Constants • THREADS • MYTHREAD • Function names start with upc_ • Custom versions of existing standard C functions • upc_memcpy, ...
PI in UPC

shared int count[THREADS];

upc_forall (j = 0; j < THREADS; ++j; j)   // main loop; iteration j has affinity to thread j
{
  for (i = 0; i < niter; i++) {
    x = (double)drand48();                // gets a random x coordinate
    y = (double)drand48();                // gets a random y coordinate
    z = ((x*x) + (y*y));                  // checks to see if the point is inside the unit circle
    if (z <= 1) {
      ++count[MYTHREAD];                  // if it is, consider it a valid random point
    }
  }
}

upc_barrier;                              // ensure all threads are done

if (MYTHREAD == 0) {
  for (j = 0; j < THREADS; ++j)
    countHit += count[j];
  pi = ((double)countHit / (double)(niter * THREADS)) * 4.0;
  printf("Pi: %f\n", pi);
}
CUDA • NVIDIA • API and platform for GPUs • Direct access to accelerator instructions • Each core runs a kernel function (thread) • Threads are grouped in blocks • Can communicate through shared memory, synchronization primitives, and barriers • Parallel or sequential execution • Run on the same thread • Thread ID • Computed from the block and thread indices; determines the data each thread processes (see the kernel sketch below) • Blocks form a grid • Communication between blocks is not possible
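A minimal CUDA C sketch (illustrative kernel and array names) showing how the global thread ID is computed from the block and thread indices and used to select the data element each thread processes.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread scales one element; the global thread ID is computed
// from the block index, the block size, and the thread index within the block.
__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= alpha;
}

int main() {
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch a grid of blocks, 256 threads per block.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    return 0;
}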
MapReduce • Terms borrowed from functional programming (e.g., Lisp) • Integrated with Hadoop • (map square '(1 2 3 4)) • Output: (1 4 9 16) [processes each element independently] • (reduce + '(1 4 9 16)) • (+ 16 (+ 9 (+ 4 1))) • Output: 30 [processes the whole dataset together] • Divide et impera • Divides data in chunks for parallel execution • 64 MB per block • Aggregates the results • The programmer only implements the map and reduce functions • The platform takes care of the communication
MapReduce model

Map(k, v) -> (k', v')
Reduce(k', v'[]) -> (k'', v'')

Input -> Map -> Shuffle (handled by the MR system) -> Reduce -> Output

User-defined functions:

void map(String key, String value) {
  // do work
  // emit (key, value) pairs to the reducers
}

void reduce(String key, Iterator values) {
  // for each key, iterate through all values
  // aggregate the results
  // emit the final result
}
PI in MapReduce

void map(LongWritable size, Context context)
{
  int count = 0;
  for (long i = 0; i < size.get(); i++)
  {
    // generate random points in the unit square
    final double x = …;
    final double y = …;
    if (x*x + y*y <= 1)    // point inside the circle
    {
      ++count;
    }
  }
  // output map results
  context.write(new BooleanWritable(true), new LongWritable(count));
  context.write(new BooleanWritable(false), new LongWritable(size.get() - count));
}

void reduce(BooleanWritable isInside, Iterable<LongWritable> values, Context context)
{
  if (isInside.get())
  {
    for (LongWritable val : values)
    {
      numInside += val.get();
    }
  }
  else
  {
    for (LongWritable val : values)
    {
      numOutside += val.get();
    }
  }
}

// reduce done. Store results in HDFS
void cleanup(Context context) throws IOException
{
  …
  writer.append(new LongWritable(numInside), new LongWritable(numOutside));
  writer.close();
}
PI in MapReduce

public static BigDecimal estimatePi(int numMaps, long numPoints, Path tmpDir, Configuration conf)
    throws IOException, ClassNotFoundException, InterruptedException {
  …
  Job job = new Job(conf);
  job.setInputFormatClass(SequenceFileInputFormat.class);
  job.setOutputKeyClass(BooleanWritable.class);
  job.setOutputValueClass(LongWritable.class);
  job.setOutputFormatClass(SequenceFileOutputFormat.class);
  job.setMapperClass(QmcMapper.class);
  job.setReducerClass(QmcReducer.class);
  job.setNumReduceTasks(1);

  // generate an input file for each map task
  …
  // start the MapReduce job
  job.waitForCompletion(true);
  // read the outputs
  …
  reader.next(numInside, numOutside);
  // evaluate Pi
  pi = ((double)numInside / (double)(numMaps * numPoints)) * 4.0;
}
Lecture sources • https://computing.llnl.gov/tutorials/parallel_comp/#Designing • http://jcsites.juniata.edu/faculty/rhodes/smui/parex.htm • http://mpitutorial.com/tutorials/mpi-hello-world/ • http://upc.gwu.edu/tutorials/UPC-SC05.pdf • https://github.com/facebookarchive/hadoop-20/blob/master/src/examples/org/apache/hadoop/examples/PiEstimator.java
Next lecture • Algorithm scalability • Hardware + data + algorithm