
Big Data Technologies Lecture 3: Algorithm Parallelization



  1. Big Data Technologies Lecture 3: Algorithm Parallelization Assoc. Prof. Marc FRÎNCU, PhD. Habil. marc.frincu@e-uvt.ro

  2. Conceptually • Master-slave model • Master process • Starts a number of client processes (slaves) on other cores/CPUs/machines • Communicates with them • Sends data for processing and receives answers • Ensures all data are received and continues execution
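
A minimal sketch of the master-slave idea, assuming an MPI setting (MPI itself is only introduced later in this lecture); process 0 plays the master, the other ranks are slaves, and the "work" is just a placeholder squaring:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {                                   // master
        int sum = 0, answer;
        for (int s = 1; s < size; s++) {
            MPI_Send(&s, 1, MPI_INT, s, 0, MPI_COMM_WORLD);   // send data to slave s
            MPI_Recv(&answer, 1, MPI_INT, s, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                       // receive its answer
            sum += answer;
        }
        printf("master collected %d\n", sum);          // all data received, continue
    } else {                                           // slave
        int chunk, result;
        MPI_Recv(&chunk, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = chunk * chunk;                        // placeholder for the real processing
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}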

  3. Can we parallelize the algorithm? • Code • Understanding of the sequential code (if it exists) • Identifying critical points • Where are the computationally heavy code lines? • Profiling • Parallelize code only where intense computation is found (Amdahl's law) • Where are the bottlenecks? • Are there areas of slow code? • I/O • Use parallel optimized code • IBM ESSL, Intel MKL, AMD ACML, etc. • If many parallel versions of the same algorithm exist, ALL must be studied! • Data • Any data dependencies? • Can they be removed? • Can data be partitioned?
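
For reference, Amdahl's law (quoted above) bounds the speedup of a program whose parallelizable fraction is p when it runs on N processors:

    S(N) = 1 / ((1 - p) + p / N)

so a code that is 90% parallelizable can never be sped up by more than 1 / (1 - 0.9) = 10x, no matter how many processors are used; this is why profiling and parallelizing only the computationally heavy sections pays off.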

  4. Examples • Potential energy of each molecule. Find the minimum energy configuration • Each computation can be done in parallel • The search for the minimum energy can also be parallelized • Parallel search • Fibonacci • F(n) depends on F(n-1) and F(n-2) • Cannot be parallelized

  5. Automatic vs. manual parallelization • The process of parallelizing code is complex, iterative and error prone, requiring time until an efficient solution is found • Some compilers can parallelize code (pre-processing) • Automated • The compiler identifies parallel code sections as well as bottleneck areas • Offers an analysis of the benefits of parallelization • The targets are the iterative statements (do, for) • Programmer oriented • Use compilation directives and execution flags • The programmer is in charge of finding the parallel sections • They are mostly used on shared memory systems • OpenMP • BUT • Performance can degrade • Less flexible than manual parallelization • Limited to certain code sections (loops) • Not all sequential code is parallelizable as is
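
As a small illustration of the programmer-oriented (directive-based) approach, a minimal sketch with OpenMP, where the only change to the sequential loop is one compilation directive (build with -fopenmp):

#include <omp.h>
#include <stdio.h>

int main(void) {
    static double a[1000];
    #pragma omp parallel for            // the directive marks the loop as parallel
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * i;                 // independent iterations, safe to split across threads
    printf("a[999] = %f\n", a[999]);
    return 0;
}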

  6. Designing parallel algorithms • Partitioning • Communication • Synchronization • Data dependencies • Load balancing • Granularity • I/O • Debugging • Performance analysis and optimization

  7. Problem and data partitioning • The first step in any parallel problem is to partition it so that it can be handled by multiple parallel processes • Data partitioning (domain) • Algorithm partitioning (functional)

  8. Domain partitioning

  9. Functional partitioning

  10. Interprocess communication • Do we need communication? • Embarrassingly parallel problems • Example: image negative • Problems where we need information from neighbors • Example: 2D heat dissipation
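
A minimal sketch of the embarrassingly parallel image-negative example (assuming a hypothetical 8-bit grayscale buffer called pixels): every pixel is processed independently, so no communication between tasks is needed:

#include <omp.h>

// invert an 8-bit grayscale image: each output pixel depends only on itself
void negative(unsigned char *pixels, long npixels) {
    #pragma omp parallel for
    for (long i = 0; i < npixels; i++)
        pixels[i] = 255 - pixels[i];    // no data from neighbours is needed
}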

  11. Interprocess communication • Communication overhead • Resources are used to store and send data • Communication requires synchronization • A process will wait for the other to send data • Network limitations • Latency vs. bandwidth • Latency: time needed to send information from point A to point B • Microseconds • Bandwidth: quantity of information sent per time unit • MB/s, GB/s • Many small messages make latency dominant, so pack them together as a single message • Communication visibility • In MPI messages are visible and under the programmer's control • The Data Parallel model hides communication details (MapReduce)
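
A rough worked example (with assumed numbers: latency 50 µs, bandwidth 1 GB/s): sending 1000 messages of 1 KB costs about 1000 × (50 µs + 1 µs) ≈ 51 ms, while packing them into one 1 MB message costs about 50 µs + 1 ms ≈ 1.05 ms, roughly 50 times faster; this is why many small messages should be aggregated.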

  12. Interprocess communication • Synchronous vs. asynchronous • Synchronization blocks the code, as one task must wait for the other to finish and send its data • Asynchronous execution assumes that a task executes independently from the others (non-blocking communication)
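
A minimal sketch of non-blocking (asynchronous) communication in MPI, assuming at least two processes: the send is posted with MPI_Isend, useful work can overlap, and the task blocks only when it actually needs completion:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        data = 42;
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);  // returns immediately
        /* ... overlap useful local computation here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);                         // block only when completion is required
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}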

  13. Interprocess communication • Communication scope • Knowing how communication takes place is vital in parallel algorithm design • Point-to-point: one task to one task • Collective: one task to many tasks
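
A minimal sketch of collective communication with MPI_Bcast (the point-to-point case is the MPI_Send/MPI_Recv pair shown earlier): every task calls the same routine, and the root (rank 0) supplies the value:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) value = 7;                            // only the root sets the value
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);    // collective: one task to many tasks
    printf("rank %d sees %d\n", rank, value);
    MPI_Finalize();
    return 0;
}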

  14. Interprocess communication • Efficiency of communication • An MPI implementation may give different results depending on the underlying hardware architecture • Asynchronous communication may improve the execution of the parallel algorithm • Type and properties of the network • Overhead and complexity

  15. Synchronization • Barrier • Each task executes until it reaches the barrier • Synchronization takes place when all tasks have reached the barrier • The Bulk Synchronous Parallel model • Semaphore • Used to serialize access to data • Only one task accesses the data at a time • When multiple tasks request access simultaneously, access is granted randomly or based on priorities • Java threads, MPI • Synchronous operations • Involve only the tasks engaged in active communication • Before sending data a task must receive the OK from the other • MPI
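
A minimal sketch of the barrier mechanism using OpenMP (the MPI equivalent would be MPI_Barrier): no thread proceeds past the barrier until all threads have reached it:

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("thread %d: before the barrier\n", id);
        #pragma omp barrier                       // wait for every thread in the team
        printf("thread %d: after the barrier\n", id);
    }
    return 0;
}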

  16. Data dependencies There is a dependency between two tasks if the order of their execution affects the outcome of the program. • Dependencies are the main inhibitors of parallelism • Occur when a certain data address is accessed by multiple tasks • How do we solve them? • Synchronization • Exclusive access to the shared memory • Examples (see the sketch below): the value of Y depends on X (which X?); the execution of A(j) depends on A(j-1)
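
A minimal sketch of the two dependency patterns named on the slide, with assumed variable names:

#include <stdio.h>

int main(void) {
    double A[8] = {1};
    // loop-carried dependency: A[j] needs A[j-1] from the previous iteration,
    // so the iterations cannot simply be run in parallel as written
    for (int j = 1; j < 8; j++)
        A[j] = A[j-1] * 2.0;

    // shared-data dependency: the value of Y depends on which task last wrote X,
    // hence the need for synchronization / exclusive access
    double X = A[7];        // imagine this write happening in another task
    double Y = X + 1.0;
    printf("A[7] = %f, Y = %f\n", A[7], Y);
    return 0;
}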

  17. Load balancing • Uniform distribution of the workload per task • All tasks perform some work, none is idle • Optimize usage as well as the number of tasks • How do we achieve it? • Partitioning the workload (data) • Matrix operations • Distribute data uniformly across machines • Loops • Distribute iterations uniformly across machines • For heterogeneous machines • Profile the code to determine unbalanced sections • Dynamic allocation of the workload • When a worker finishes one task it receives the next one (see the sketch below)
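
A minimal sketch of dynamic workload allocation with OpenMP, assuming a hypothetical work() function whose cost grows with i: iterations are handed out in small chunks as threads become free, instead of being split statically:

#include <omp.h>
#include <stdio.h>

long work(int i) {                       // hypothetical task with uneven cost
    long s = 0;
    for (long k = 0; k < (long)i * 1000; k++) s += k;
    return s;
}

int main(void) {
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int i = 0; i < 1000; i++)
        total += work(i);                // a free thread grabs the next chunk of 4 iterations
    printf("total = %ld\n", total);
    return 0;
}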

  18. Granularity • Computation/communication ratio • Periods of computation separated by periods of communication • Fine grained parallelism • Little computation between communication events • Low computation/communication ratio • Good for load balancing • High overhead with few opportunities for optimization • Communication/synchronization can take longer than computation • Coarse grained parallelism • Large computation/communication ratio • Opportunity to optimize • Hard to load balance • How do we choose? • Depends on the algorithm and hardware • Usually better to have coarse grained parallelism • Communication/synchronization overhead is usually too high compared to the computation cost • Fine grained parallelism can help with load balancing

  19. I/O • Parallelism inhibitor • Operations require a lot of time • Parallel I/O systems are not widely available • HDFS (Hadoop Distributed File System) • Lustre (Linux servers) • IBM Spectrum Scale • ... • In a shared environment I/O can lead to files being overwritten • Read operations can be affected by the capability of the server to handle multiple requests • I/O over the network (NFS) can lead to congestion and file server outages

  20. Debugging • Can be costly if code complexity is high • Several dedicated tools exist • TotalView • DDT • Inspector (Intel)

  21. Performance analysis and optimization • Much more complex than for the sequential code • Valgrind (http://valgrind.org/) • Vampir (http://vampir.eu/) • Mpitrace (https://computing.llnl.gov/tutorials/bgq/index.html#mpitrace)

  22. Practical example • Compute PI using the Monte Carlo method • Idea • Use random numbers to cover the surface • (x, y) generated randomly between -1 and 1 • Circle radius = 1 • The total number of points must be large enough
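
The reasoning behind the method: the circle of radius 1 has area π, while the enclosing [-1, 1] × [-1, 1] square has area 4, so the fraction of uniformly random points that fall inside the circle approaches π/4, and therefore

    π ≈ 4 × (points inside the circle) / (total points)

which is the formula used in all the implementations that follow.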

  23. Practical example • Problem analysis • Any data dependencies? • Can we generate points in parallel? • Communication? • Do we need communication to generate random numbers? • Can we compute the percentage of points generated inside the circle on each processor without communication? • Load balancing? • Can we generate the same amount of points on each processor? • Strategy • Divide et impera • N/P points per processor (N – total no. of points, P – no. of processors) • On each processor, check if the generated point is inside the circle • SPMD model • Master-slave • A parent process (master) will gather results from the child processes (slaves)

  24. Pseudocode

npoints = 10000
circle_count = 0
p = number of tasks
num = npoints/p

// find out if I am MASTER or WORKER

do j = 1, num
    generate 2 random numbers between 0 and 1
    xcoordinate = random1
    ycoordinate = random2
    if (xcoordinate, ycoordinate) inside circle then
        circle_count = circle_count + 1
end do

if I am MASTER then
    receive from WORKERS their circle_counts
    compute PI (use MASTER and WORKER calculations)
else if I am WORKER then
    send to MASTER circle_count
endif

  25. APIs for parallel shared and distributed memory algorithms • Shared memory • OpenMP • Distributed memory • Unified Parallel C • MPI • GPUs • CUDA • Distributed computing • MapReduce • Data flows: Storm, Spark • Graphs: Giraph, GraphX

  26. OpenMP • Shared memory model • Requires minimal modifications of the sequential code • The specification is implemented in the compiler • g++ program.cpp -fopenmp -o program • Fork-join model • When execution reaches an OpenMP parallel construct, a team of threads is created and executes it in parallel (fork) • The threads merge at the end of their execution (join)

  27. OpenMP • How is the parallel code implemented? • A set of functions • Start with omp_ • omp_get_thread_num(): returns the number (id) of the current thread • omp_get_num_threads(): returns the total number of available threads • Environment variables • Start with OMP_ • OMP_NUM_THREADS: sets the number of threads • $ export OMP_NUM_THREADS=4 • Pragma directives • Start with #pragma omp • #pragma omp parallel { // structured block }
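
Putting these pieces together, a minimal self-contained sketch (compile with g++ hello.cpp -fopenmp -o hello and run, e.g., after export OMP_NUM_THREADS=4):

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel                 // fork: a team of threads runs the block
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(),     // id of the current thread
               omp_get_num_threads());   // total number of threads in the team
    }                                    // join: threads merge here
    return 0;
}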

  28. PI in OpenMP
• num_threads(numthreads) sets the number of threads, firstprivate(x, y, z, i) makes those variables private to each thread, and reduction(+:count) sums up the per-thread values of count

#include <omp.h>
...
#pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
{
    // seed initialization: give random() a seed value
    srand48((int)time(NULL) ^ omp_get_thread_num());
    for (i = 0; i < niter; ++i)          // main loop
    {
        x = (double)drand48();           // gets a random x coordinate
        y = (double)drand48();           // gets a random y coordinate
        z = ((x*x) + (y*y));             // checks to see if the point is inside the unit circle
        if (z <= 1)
        {
            ++count;                     // if it is, consider it a valid random point
        }
    }
}
pi = ((double)count / (double)(niter*numthreads)) * 4.0;
printf("Pi: %f\n", pi);

• Warning: we cannot use the srand48/drand48 library functions like this! THEY ARE NOT THREAD SAFE! x and y should come from a custom (thread-safe) random function instead.
• See https://www.bnl.gov/bnlhpc2013/files/pdf/OpenMPTutorial.pdf for a correct implementation!

  29. MPI • Message Passing Interface • API for writing message-based parallel applications • Distributed memory • Many implementations • MPICH2 • Function names start with MPI_ • MPI_Init(); • MPI_Finalize(); • MPI_Comm_size(MPI_COMM_WORLD, &world_size); • MPI_Comm_rank(MPI_COMM_WORLD, &world_rank); • MPI_Get_processor_name(processor_name, &name_len); • MPI_Send(&offset, 1, MPI_INT, dest, BEGIN, MPI_COMM_WORLD); • MPI_Recv(&offset, 1, MPI_INT, source, msgtype, MPI_COMM_WORLD, &status);
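
A minimal "hello world" sketch using the calls listed above (along the lines of the mpitutorial.com example cited in the lecture sources; compile with mpicc and run, e.g., with mpirun -np 4 ./hello):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int world_size, world_rank, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);                              // initialize the MPI environment
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);          // total number of processes
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);          // rank of this process
    MPI_Get_processor_name(processor_name, &name_len);   // name of the host
    printf("Hello from rank %d out of %d on %s\n",
           world_rank, world_size, processor_name);
    MPI_Finalize();                                      // clean up
    return 0;
}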

  30. PI in MPI

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
printf("MPI task %d has started...\n", taskid);

// set seed for random number generator equal to task ID
srandom(taskid);

avepi = 0;
for (i = 0; i < ROUNDS; i++)
{
    // call the function which generates random numbers and counts how many are in the circle
    // see the OpenMP code for details
    // this method is called in MASTER and all SLAVES
    count = dboard(DARTS);
    rc = MPI_Reduce(&count, &pisum, 1, MPI_DOUBLE, MPI_SUM, MASTER, MPI_COMM_WORLD);
}

if (taskid == MASTER)
    pi = (pisum / (count * numtasks)) * 4.0;

printf("\nReal value of PI: 3.1415926535897 \n");
MPI_Finalize();

  31. Unified Parallel C (UPC) • Extension of C for parallel computing • Distributed memory • Shared memory • SPMD model • Own compiler • upcc -o program program.upc • upcrun -n 4 program • Variables • shared int x • Own constructs • upc_forall • Constants • THREADS • MYTHREAD • Function names start with upc_ • Custom versions of existing standard C functions • upc_memcpy, ...

  32. PI in UPC

shared int count[THREADS];

upc_forall (j = 0; j < THREADS; ++j; j)    // main loop; the last field is the affinity expression
{
    for (i = 0; i < niter; i++)
    {
        x = (double)drand48();             // gets a random x coordinate
        y = (double)drand48();             // gets a random y coordinate
        z = ((x*x) + (y*y));               // checks to see if the point is inside the unit circle
        if (z <= 1)
        {
            ++count[MYTHREAD];             // if it is, consider it a valid random point
        }
    }
}

upc_barrier;                               // ensure all threads are done

if (MYTHREAD == 0)
{
    for (j = 0; j < THREADS; ++j)
        countHit += count[j];
    pi = ((double)countHit / (double)(niter*THREADS)) * 4.0;
    printf("Pi: %f\n", pi);
}

  33. CUDA • NVIDIA • API and platform for GPUs • Direct access to accelerator instructions • Each core runs a kernel function (thread) • Threads are grouped in blocks • Can communicate through shared memory, synchronization primitives, and barriers • Parallel or sequential execution • Run on the same multiprocessor • Thread ID • Computed based on the data layout (see the formula below) • Blocks form a grid • Communication between blocks is not possible
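
In the common 1D case the global thread index is computed from the built-in block and thread coordinates as blockIdx.x * blockDim.x + threadIdx.x, and that index is typically used to pick the data element the thread processes.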

  34. CUDA architecture

  35. PI in CUDA

  36. MapReduce • Terms borrowed from functional programming (e.g., Lisp) • Integrated with Hadoop

(map square '(1 2 3 4))
    Output: (1 4 9 16) [processes each element independently]
(reduce + '(1 4 9 16))
    (+ 16 (+ 9 (+ 4 1)))
    Output: 30 [processes the whole dataset together]

• Divide et impera • Divides data in chunks for parallel execution • 64 MB per block • Aggregates the results • The programmer only implements the map and reduce functions • The platform takes care of communication

  37. MapReduce model

Map(k, v) -> (k', v')
Reduce(k', v'[]) -> (k'', v'')
Input -> Map -> Shuffle (handled by the MR system) -> Reduce -> Output

User-defined functions:
void map(String key, String value)
{
    // do work
    // emit (key, value) pairs to reducers
}
void reduce(String key, Iterator values)
{
    // for each key, iterate through all values
    // aggregate results
    // emit final result
}

  38. PI in MapReduce

void map(LongWritable size, Context context)
{
    int count = 0;
    for (long i = 0; i < size.get(); i++)
    {
        // generate random points in the unit square
        final double x = …;
        final double y = …;
        if (x*x + y*y <= 1)     // point inside the circle
        {
            ++count;
        }
    }
    // output map results
    context.write(new BooleanWritable(true), new LongWritable(count));
    context.write(new BooleanWritable(false), new LongWritable(size.get() - count));
}

void reduce(BooleanWritable isInside, Iterable<LongWritable> values, Context context)
{
    if (isInside.get())
    {
        for (LongWritable val : values)
        {
            numInside += val.get();
        }
    }
    else
    {
        for (LongWritable val : values)
        {
            numOutside += val.get();
        }
    }
}

// reduce done. Store results in HDFS
void cleanup(Context context) throws IOException
{
    …
    writer.append(new LongWritable(numInside), new LongWritable(numOutside));
    writer.close();
}

  39. PI in MapReduce

public static BigDecimal estimatePi(int numMaps, long numPoints, Path tmpDir, Configuration conf)
        throws IOException, ClassNotFoundException, InterruptedException
{
    …
    Job job = new Job(conf);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(BooleanWritable.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setMapperClass(QmcMapper.class);
    job.setReducerClass(QmcReducer.class);
    job.setNumReduceTasks(1);
    // generate an input file for each map task
    …
    // start the MapReduce job
    job.waitForCompletion(true);
    // read the outputs
    …
    reader.next(numInside, numOutside);
    // evaluate Pi
    pi = ((double)numInside / (double)(numMaps * numPoints)) * 4.0;
}

  40. Lecture sources • https://computing.llnl.gov/tutorials/parallel_comp/#Designing • http://jcsites.juniata.edu/faculty/rhodes/smui/parex.htm • http://mpitutorial.com/tutorials/mpi-hello-world/ • http://upc.gwu.edu/tutorials/UPC-SC05.pdf • https://github.com/facebookarchive/hadoop-20/blob/master/src/examples/org/apache/hadoop/examples/PiEstimator.java

  41. Next lecture • Algorithm scalability • Hardware + data + algorithm
