
1. Comparing The Performance Of Distributed Shared Memory And Message Passing Programs Using The Hyperion Java Virtual Machine On Clusters
Mathew Reno
M.S. Thesis Defense

2. Overview
• For this thesis we wanted to evaluate the performance of the Hyperion distributed virtual machine, designed at UNH, against a preexisting parallel computing API.
• The results would indicate where Hyperion's strengths and weaknesses lie and possibly validate Hyperion as a high-performance computing alternative.

3. What Is A Cluster?
• A cluster is a group of low-cost computers connected with an "off-the-shelf" network.
• The cluster's network is isolated from WAN data traffic, and the computers on the cluster are presented to the user as a single resource.

4. Why Use Clusters?
• Clusters are cost effective when compared to traditional parallel systems.
• Clusters can be grown as needed.
• Software components are based on standards, allowing portable software to be designed for the cluster.

5. Cluster Computing
• Cluster computing takes advantage of the cluster by distributing the computational workload among the nodes of the cluster, thereby reducing total computation time.
• There are many programming models for distributing data throughout the cluster.

6. Distributed Shared Memory
• Distributed Shared Memory (DSM) allows the user to view the whole cluster as one resource.
• Memory is shared among the nodes: each node can access every other node's memory as if it were its own.
• Data coordination among nodes is generally hidden from the user.

7. Message Passing
• Message Passing (MP) requires explicit messages to distribute data throughout the cluster.
• The programmer must coordinate all data exchanges when designing the application through a language-level MP API.

8. Related: TreadMarks vs. PVM
• TreadMarks (Rice, 1995) implements a DSM model while PVM implements an MP model. The two approaches were compared with benchmarks.
• On average, PVM was found to perform two times better than TreadMarks.
• TreadMarks suffered from the excessive messages required by the request-response communication that its DSM model employed.
• TreadMarks was found to be more natural to program with, saving development time.

9. Hyperion
• Hyperion is a distributed Java Virtual Machine (JVM) designed at UNH.
• The Java language provides parallelism through its threading model. Hyperion extends this model by distributing the threads across the cluster.
• Hyperion implements the DSM model via DSM-PM2, which allows for lightweight thread creation and data distribution.
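
Because Hyperion executes ordinary multithreaded Java, a program written for it has the shape of a plain Java threads program. Below is a minimal, illustrative sketch (not code from the thesis; the class and variable names are invented): each worker thread sums a disjoint slice of a shared array, the kind of thread-per-slice decomposition that Hyperion would spread across cluster nodes.

```java
// Illustrative thread-per-slice decomposition in plain Java.
public class SliceSum implements Runnable {
    private final double[] data;   // logically shared array
    private final int lo, hi;      // this worker's slice [lo, hi)
    double partial;                // per-thread partial result

    SliceSum(double[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    public void run() {
        double s = 0.0;
        for (int i = lo; i < hi; i++) s += data[i];
        partial = s;
    }

    public static void main(String[] args) throws InterruptedException {
        int p = 4;                             // number of worker threads
        double[] data = new double[1 << 20];
        java.util.Arrays.fill(data, 1.0);

        SliceSum[] work = new SliceSum[p];
        Thread[] threads = new Thread[p];
        int chunk = data.length / p;
        for (int i = 0; i < p; i++) {
            int lo = i * chunk;
            int hi = (i == p - 1) ? data.length : lo + chunk;
            work[i] = new SliceSum(data, lo, hi);
            threads[i] = new Thread(work[i]);
            threads[i].start();               // on Hyperion, threads are placed on nodes round-robin
        }
        double total = 0.0;
        for (int i = 0; i < p; i++) { threads[i].join(); total += work[i].partial; }
        System.out.println("sum = " + total);
    }
}
```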

10. Hyperion, Continued
• Hyperion has a fixed memory size that it shares among all threads executing across the cluster.
• Hyperion uses page-based data distribution: if a thread accesses memory it does not have locally, a page fault occurs and the memory is transmitted from the owning node to the requesting node one page at a time.

11. Hyperion, Continued
• Hyperion translates Java bytecodes into native C code.
• A native executable is then generated by a native C compiler.
• The belief is that native executables are optimized by the C compiler and will benefit the application by executing faster than interpreted code.

12. Hyperion's Threads
• Threads are created in a round-robin fashion among the nodes of the cluster.
• Data is transmitted between threads via a request/response mechanism, which requires two messages.
• To respond to a request message, a response thread must be scheduled. This thread handles the request by sending back the requested data in a response message.

13. mpiJava
• mpiJava is a Java wrapper for the Message Passing Interface (MPI).
• The Java Native Interface (JNI) is used to translate between Java and native code.
• We used MPICH for the native MPI implementation.
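
For context, a minimal mpiJava program has the shape sketched below. It uses only the standard mpiJava bindings (MPI.Init, Rank, Size, Send, Recv, MPI.Finalize); the buffer contents and message tag are illustrative, and the code is not taken from the thesis.

```java
import mpi.*;

// Minimal mpiJava sketch: rank 0 sends a small buffer to rank 1.
public class Hello {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);                          // start the MPI runtime
        int rank = MPI.COMM_WORLD.Rank();        // this process's id
        int size = MPI.COMM_WORLD.Size();        // number of processes

        double[] buf = new double[4];
        int TAG = 99;                            // illustrative tag
        if (rank == 0 && size > 1) {
            for (int i = 0; i < buf.length; i++) buf[i] = i;
            MPI.COMM_WORLD.Send(buf, 0, buf.length, MPI.DOUBLE, 1, TAG);
        } else if (rank == 1) {
            MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.DOUBLE, 0, TAG);
            System.out.println("rank 1 received " + buf.length + " doubles");
        }
        MPI.Finalize();
    }
}
```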

14. Clusters
• The "Star" cluster (UNH) consists of 16 PIII 667MHz Linux PCs on a 100Mb Fast Ethernet network. TCP is the communication protocol.
• The "Paraski" cluster (France) consists of 16 PIII 1GHz Linux PCs on a 2Gb Myrinet network. BIP (DSM) and GM (MPI) are the communication protocols.

15. Clusters, Continued
• The implementation of MPICH on BIP was not stable in time for this thesis, so GM had to be used in place of BIP for MPICH. GM has not been ported to Hyperion, and a port would have been unreasonable at this time.
• BIP performs better than GM as the message size increases.

16. BIP vs. GM Latency (Paraski)

17. DSM & MPI In Hyperion
• For consistency, mpiJava was ported into Hyperion.
• Both DSM and MPI versions of the benchmarks could be compiled by Hyperion.
• The executables produced by Hyperion are then executed by the respective native launchers (PM2 and MPICH).

18. Benchmarks
• The Java Grande Forum (JGF) developed a suite of benchmarks to test Java implementations.
• We used two of the JGF benchmark suites: multithreaded and javaMPI.

19. Benchmarks, Continued
• Benchmarks used:
  • Fourier coefficient analysis
  • Lower/upper matrix factorization
  • Successive over-relaxation
  • IDEA encryption
  • Sparse matrix multiplication
  • Molecular dynamics simulation
  • 3D Ray Tracer
  • Monte Carlo simulation (only with MPI)

20. Benchmarks And Hyperion
• The multithreaded JGF benchmarks had unacceptable performance when run "out of the box".
• Each benchmark creates all of its data objects on the root node, causing all remote object access to go through this one node.
• This access pattern creates a performance bottleneck on the root node, which must service all requests while computing its own part of the algorithm.
• The solution was to modify the benchmarks to be cluster-aware.

21. Hyperion Extensions
• Hyperion makes up for Java's limited thread data management by providing efficient reduce and broadcast mechanisms.
• Hyperion also provides a cluster-aware implementation of arraycopy.

22. Hyperion Extension: Reduce
• Reduce blocks all enrolled threads until each thread has the final result of the reduction.
• This is done by neighboring threads exchanging their data and combining it, then exchanging with threads farther away, and so on until every thread holds the same answer.
• This operation is faster and scales better than performing the calculation serially: it completes in a logarithmic (log P) number of steps rather than the P - 1 steps of a serial reduction.
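
Hyperion's reduce API itself is not shown in the slides; the sketch below only illustrates the log-depth pairwise-exchange pattern described above, using plain Java threads, a CyclicBarrier, and a shared array of per-thread values. All names are invented, and the thread count is assumed to be a power of two.

```java
import java.util.concurrent.CyclicBarrier;

// Illustrative butterfly all-reduce: after log2(P) rounds every thread holds the global sum.
public class ButterflyReduce {
    static final int P = 8;                     // number of threads (power of two)
    static final double[] val = new double[P];  // one partial value per thread
    static final CyclicBarrier barrier = new CyclicBarrier(P);

    public static void main(String[] args) throws Exception {
        Thread[] ts = new Thread[P];
        for (int id = 0; id < P; id++) {
            final int me = id;
            val[me] = me + 1;                   // fake partial results: 1..P
            ts[me] = new Thread(() -> {
                try {
                    for (int dist = 1; dist < P; dist <<= 1) {
                        double partner = val[me ^ dist]; // read partner's current value
                        barrier.await();                 // everyone has read
                        val[me] += partner;              // combine locally
                        barrier.await();                 // everyone has written
                    }
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            ts[me].start();
        }
        for (Thread t : ts) t.join();
        System.out.println("every thread now holds " + val[0]); // 1+2+...+8 = 36
    }
}
```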

23. Hyperion Extension: Broadcast
• The broadcast mechanism transmits the same data to all enrolled threads.
• Like reduce, data is distributed to the threads in a logarithmic (log P) number of steps, which scales better than serial distribution of the data.

24. Hyperion Extension: Arraycopy
• The arraycopy method is part of the Java System class. The Hyperion version was extended to be cluster-aware.
• If data is copied between threads on different nodes, this version sends all of the data as one message instead of relying on the paging mechanism to access remote array data.
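
The call shape is the standard System.arraycopy(src, srcPos, dest, destPos, length). The sketch below shows how a worker might hand its slice back to a root-owned result array in one call; the array names and sizes are illustrative. On Hyperion, such a cross-node copy travels as one bulk message instead of many page faults.

```java
public class SliceWriteback {
    // A worker copies its locally computed slice into the shared result array.
    static void writeBack(double[] localSlice, double[] globalResult, int myOffset) {
        System.arraycopy(localSlice, 0, globalResult, myOffset, localSlice.length);
    }

    public static void main(String[] args) {
        double[] global = new double[8];
        double[] slice = {1.0, 2.0, 3.0, 4.0};
        writeBack(slice, global, 4);             // copy into the second half
        System.out.println(java.util.Arrays.toString(global));
    }
}
```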

25. Benchmark Modifications
• The multithreaded benchmarks had unacceptable performance.
• The benchmarks were modified to reduce remote object access and root-node bottlenecks.
• Techniques such as arraycopy, broadcast, and reduce were employed to improve performance.

26. Experiment
• Each benchmark was executed 50 times at each node size to provide a sample mean.
• Node sizes were 1, 2, 4, 8, and 16.
• Confidence intervals (95% level) were used to determine which version, MPI or DSM, performed better.
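
For reference, the usual interval for n = 50 timings with sample mean \bar{x} and sample standard deviation s is the textbook Student-t form below (the thesis's exact procedure is not restated here); two versions are then typically judged different when their intervals do not overlap.

\bar{x} \;\pm\; t_{0.025,\,n-1}\,\frac{s}{\sqrt{n}}, \qquad n = 50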

27. Results On The Star Cluster

28. Results On The Paraski Cluster

29. Fourier Coefficient Analysis
• Calculates the first 10,000 pairs of Fourier coefficients.
• Each node is responsible for calculating its portion of the coefficient array.
• Each node sends its array portion back to the root node, which accumulates the final array.
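
For reference, the k-th coefficient pair of a function f on an interval [0, L] has the standard form below (the benchmark's particular integrand is not restated here). Each pair depends only on k, which is why the 10,000 pairs split cleanly into independent per-node blocks.

a_k = \frac{2}{L}\int_0^L f(x)\,\cos\!\left(\frac{2\pi k x}{L}\right)dx,
\qquad
b_k = \frac{2}{L}\int_0^L f(x)\,\sin\!\left(\frac{2\pi k x}{L}\right)dx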

30. Fourier: DSM Modifications
• The original multithreaded version required all threads to update arrays located on the root node, flooding the root node with requests.
• The modified version used arraycopy to copy the local arrays back into the root thread's arrays.

31. Fourier: mpiJava
• The mpiJava version is similar to the DSM version.
• Each process is responsible for its portion of the arrays.
• MPI_Ssend and MPI_Recv were called to distribute the array portions to the root process.
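
A generic sketch of that gather pattern using the mpiJava Ssend/Recv bindings is below; the block-partition arithmetic, array name, and tag are illustrative rather than the benchmark's actual code.

```java
import mpi.*;

// Every non-root rank synchronously sends its block of the coefficient array
// to rank 0, which receives each block directly into place.
public class GatherCoefficients {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        int n = 10000;                        // total coefficient pairs
        int chunk = n / size;                 // illustrative block partition
        int lo = rank * chunk;
        int count = (rank == size - 1) ? n - lo : chunk;

        double[] a = new double[n];           // one of the coefficient arrays
        // ... each rank fills a[lo .. lo+count-1] locally ...

        int TAG = 7;
        if (rank != 0) {
            MPI.COMM_WORLD.Ssend(a, lo, count, MPI.DOUBLE, 0, TAG);
        } else {
            for (int src = 1; src < size; src++) {
                int slo = src * chunk;
                int scount = (src == size - 1) ? n - slo : chunk;
                MPI.COMM_WORLD.Recv(a, slo, scount, MPI.DOUBLE, src, TAG);
            }
        }
        MPI.Finalize();
    }
}
```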

32. Fourier: Results

33. Fourier: Conclusions
• Most of the time in this benchmark is spent in the computation.
• Network communication does not play a significant role in the overall time.
• MPI and DSM perform similarly on each cluster, scaling well when more nodes are added.

34. Lower/Upper Factorization
• Solves a 500 x 500 linear system with LU factorization followed by a triangular solve.
• The factorization is parallelized, while the triangular solve is computed serially.

35. LU: DSM Modifications
• The original version created the matrix on the root thread, and all access went through that thread, causing performance bottlenecks.
• The benchmark was modified to use Hyperion's broadcast facility to distribute the pivot information, and arraycopy was used to collect the final data for the solve.

36. LU: mpiJava
• MPI_Bcast is used to distribute the pivot information.
• MPI_Send and MPI_Recv are used so the root process can acquire the final matrix.
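
A generic sketch of the pivot broadcast with the mpiJava Bcast binding; the matrix size, the rank chosen as pivot owner, and the row contents are illustrative, not the benchmark's code.

```java
import mpi.*;

// The rank owning the current pivot row broadcasts it before every rank
// updates its own rows.
public class PivotBroadcast {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();

        int n = 500;
        double[] pivotRow = new double[n];
        int pivotOwner = 0;                   // rank holding the pivot row this step
        if (rank == pivotOwner) {
            for (int j = 0; j < n; j++) pivotRow[j] = j;  // stand-in for real data
        }
        // Every rank calls Bcast with the same root; non-roots receive the row.
        MPI.COMM_WORLD.Bcast(pivotRow, 0, n, MPI.DOUBLE, pivotOwner);

        // ... each rank now eliminates its own rows using pivotRow ...
        MPI.Finalize();
    }
}
```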

37. LU: Results

38. LU: Conclusions
• While the DSM version uses a data distribution mechanism similar to the MPI version's, significant overhead is exposed when these methods are executed in large loops.
• This overhead is minimized on the Paraski cluster due to the nature of Myrinet and BIP.

39. Successive Over-Relaxation
• Performs 100 iterations of SOR on a 1000 x 1000 grid.
• A "red-black" ordering allows array rows to be distributed to nodes in blocks.
• After the initial data distribution, only neighbor rows need to be communicated during the SOR.
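
The sketch below illustrates the red-black ordering on a single block, written sequentially; in the parallel versions each node sweeps only its own rows and exchanges boundary rows with its neighbors. The relaxation factor, grid size, and boundary condition are illustrative.

```java
// Sequential red-black SOR sweeps on an n x n grid (5-point stencil).
public class RedBlackSor {
    static void sweep(double[][] g, double omega, int color) {
        int n = g.length;
        for (int i = 1; i < n - 1; i++) {
            // start at the first interior column where (i + j) % 2 == color
            for (int j = 1 + ((i + 1 + color) % 2); j < n - 1; j += 2) {
                g[i][j] = (1 - omega) * g[i][j]
                        + omega * 0.25 * (g[i - 1][j] + g[i + 1][j]
                                        + g[i][j - 1] + g[i][j + 1]);
            }
        }
    }

    public static void main(String[] args) {
        int n = 10;
        double[][] g = new double[n][n];
        for (int j = 0; j < n; j++) g[0][j] = 1.0;  // simple fixed boundary row
        double omega = 1.25;
        for (int it = 0; it < 100; it++) {
            sweep(g, omega, 0);   // update "red" points
            sweep(g, omega, 1);   // then "black" points, whose neighbors are all red
        }
        System.out.println("g[n/2][n/2] = " + g[n / 2][n / 2]);
    }
}
```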

40. SOR: DSM Modifications
• Excessive remote thread object access made it necessary to modify the benchmark.
• The modified version uses arraycopy to update neighbor rows during the SOR.
• When the SOR completes, arraycopy is used to assemble the final matrix on the root thread.

41. SOR: mpiJava
• MPI_Sendrecv is used to exchange neighbor rows.
• MPI_Ssend and MPI_Recv are used to build the final matrix on the root process.
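
A generic halo-exchange sketch using the mpiJava Sendrecv binding; the row buffers, tags, and use of MPI.PROC_NULL at the top and bottom ranks are illustrative rather than the benchmark's code.

```java
import mpi.*;

// Each rank swaps boundary rows with the ranks above and below it; the
// combined send/receive avoids deadlock regardless of message ordering.
public class HaloExchange {
    public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.Rank();
        int size = MPI.COMM_WORLD.Size();

        int n = 1000;                               // row length
        double[] myTopRow = new double[n];          // first row of this rank's block
        double[] myBottomRow = new double[n];       // last row of this rank's block
        double[] haloAbove = new double[n];         // received from rank - 1
        double[] haloBelow = new double[n];         // received from rank + 1

        int up = (rank > 0) ? rank - 1 : MPI.PROC_NULL;
        int down = (rank < size - 1) ? rank + 1 : MPI.PROC_NULL;

        // Send my top row up; receive the row just above my block.
        MPI.COMM_WORLD.Sendrecv(myTopRow, 0, n, MPI.DOUBLE, up, 0,
                                haloAbove, 0, n, MPI.DOUBLE, up, 1);
        // Send my bottom row down; receive the row just below my block.
        MPI.COMM_WORLD.Sendrecv(myBottomRow, 0, n, MPI.DOUBLE, down, 1,
                                haloBelow, 0, n, MPI.DOUBLE, down, 0);

        MPI.Finalize();
    }
}
```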

42. SOR: Results

43. SOR: Conclusions
• The DSM version requires an extra barrier after neighbor rows are exchanged due to the "network reactivity" problem.
• A thread must be able to service all requests in a timely fashion. If the thread is busy computing, it cannot react quickly enough to schedule the response thread.
• The barrier blocks all threads until each reaches it, which guarantees that all nodes have their requested data and it is safe to continue with the computation.

44. IDEA Crypt
• Performs IDEA encryption and decryption on a 3,000,000-byte array.
• The array is divided among the nodes in blocks.
• Each node encrypts and decrypts its portion.
• When complete, the root node collects the decrypted array for validation.
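
The block decomposition is simple index arithmetic; a small illustrative sketch follows (the node count and the choice to give the remainder to the last node are assumptions, not details from the thesis).

```java
// Illustrative block partition of the 3,000,000-byte array across p nodes.
public class BlockPartition {
    public static void main(String[] args) {
        int total = 3_000_000;
        int p = 16;                                   // assumed node count
        int chunk = total / p;
        for (int i = 0; i < p; i++) {
            int offset = i * chunk;
            int len = (i == p - 1) ? total - offset : chunk;  // last node takes the remainder
            System.out.println("node " + i + ": offset=" + offset + " length=" + len);
        }
    }
}
```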

45. Crypt: DSM Modifications
• The original version created the whole array on the root thread and required each remote thread to page in its portion.
• The modified version used arraycopy to distribute each thread's portion from the root thread.
• When decryption finishes, arraycopy copies the decrypted portion back to the root thread.

46. Crypt: mpiJava
• The mpiJava version uses MPI_Ssend to send the array portions to the remote processes and MPI_Recv to receive them.
• When complete, MPI_Ssend is used to send each process's portion back, and MPI_Recv receives each portion on the root.

47. Crypt: Results

48. Crypt: Conclusions
• Results are similar on both clusters.
• There is a slight performance problem at 4 and 8 nodes with the DSM version.
• This can be attributed to a barrier in the DSM version that causes all threads to block before computing, while the MPI version does not block.

49. Sparse Matrix Multiplication
• A 50,000 x 50,000 unstructured matrix stored in compressed-row format is multiplied over 200 iterations.
• Only the final result is communicated, as each node holds its own portion of the data and the initial distribution is not timed.
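
For readers unfamiliar with the storage scheme, the sketch below shows a generic compressed-row (CRS) sparse matrix-vector multiply; the benchmark's actual data layout and loop structure may differ.

```java
// Generic CRS multiply: row[i]..row[i+1]-1 index the nonzeros of row i in val[]/col[].
public class CrsMultiply {
    static void multiply(double[] val, int[] col, int[] row, double[] x, double[] y) {
        for (int i = 0; i < row.length - 1; i++) {
            double s = 0.0;
            for (int k = row[i]; k < row[i + 1]; k++) {
                s += val[k] * x[col[k]];           // accumulate a_ij * x_j
            }
            y[i] = s;
        }
    }

    public static void main(String[] args) {
        // 3 x 3 example: [[2,0,1],[0,3,0],[4,0,5]]
        double[] val = {2, 1, 3, 4, 5};
        int[] col = {0, 2, 1, 0, 2};
        int[] row = {0, 2, 3, 5};
        double[] x = {1, 1, 1};
        double[] y = new double[3];
        multiply(val, col, row, x, y);
        System.out.println(java.util.Arrays.toString(y)); // [3.0, 3.0, 9.0]
    }
}
```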

50. Sparse: DSM Modifications
• This benchmark originally produced excessive network traffic through remote object access.
• The modifications removed the remote object access from the multiplication loop and used arraycopy to send the final result to the root thread.
