Safe and Efficient Cluster Communication in Java using Explicit Memory Management Chi-Chao Chang Dept. of Computer Science Cornell University
Goal High-performance cluster computing with safe languages • parallel and distributed applications Use off-the-shelf technologies • Java • safe: “better C++” • “write once run everywhere” • growing interest for high-performance applications (Java Grande) • User-level network interfaces (UNIs) • direct, protected access to network devices • prototypes: U-Net (Cornell), Shrimp (Princeton), FM (UIUC) • industry standard: Virtual Interface Architecture (VIA) • cost-effective clusters: new 256-processor cluster @ Cornell TC 2
Java Networking [Diagram: software stack — Apps; RMI, RPC; Sockets; Active Messages, MPI, FM; UNI; Networking Devices — with the upper layers in Java and the lower layers in C] Traditional “front-end” approach • pick favorite abstraction (sockets, RMI, MPI) and Java VM • write a Java front-end to custom or existing native libraries • good performance, re-use proven code • magic in native code, no common solution Interface Java with Network Devices • bottom-up approach • minimizes amount of unverified code • focus on fundamental data-transfer inefficiencies due to: 1. Storage safety 2. Type safety 3
Outline Thesis Overview • GC/Native heap separation, object serialization Experimental Setup: VI Architecture and Marmot Part I: Array Transfers (1) Javia-I: Java Interface to VI Architecture • respects heap separation (2) Jbufs: Safe and Explicit Management of Buffers • Javia-II, matrix multiplication, Active Messages Part II: Object Transfers (3) A Case For Specialization • micro-benchmarks, RMI using Javia-I/II, impact on application suite (4) Jstreams: in-place de-serialization • micro-benchmarks, RMI using Javia-III, impact on application suite Conclusions 4
(1) Storage Safety Java programs are garbage-collected • no explicit de-allocation: GC tracks and frees garbage objects • programs are oblivious to the GC scheme used: non-copying (e.g. conservative) or copying • no control over location of objects Modern Network and I/O Devices • direct DMA from/into user buffers • native code is necessary to interface with hardware devices 5
(1) Storage Safety Result: hard separation between GC and native heaps Pin-on-demand only works for send/write operations • for receive/read operations, GC must be disabled indefinitely... [Diagram: (a) Hard Separation: copy-on-demand between the GC heap and a pinned native heap before DMA to the NI; (b) Optimization: pin-on-demand] 6
(1) Storage Safety: Effect Best-case scenario: 10-40% hit in throughput • pick your favorite JVM, your fastest network interface, and a pair of 450MHz P-IIs with a commodity OS • pinning on demand is expensive... 7
(2) Type Safety Cannot forge a reference to a Java object • b is an array of bytes • in C: double *data = (double *)b; • in Java: double[] data = new double[1024/8]; for (int i=0,off=0; i<1024/8; i++,off+=8) { int upper = (((b[off]&0xff)<<24) + ((b[off+1]&0xff)<<16) + ((b[off+2]&0xff)<<8) + (b[off+3]&0xff)); int lower = (((b[off+4]&0xff)<<24) + ((b[off+5]&0xff)<<16) + ((b[off+6]&0xff)<<8) + (b[off+7]&0xff)); data[i] = Double.longBitsToDouble((((long)upper)<<32) + (lower&0xffffffffL)); } 8
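The slide's conversion loop, made self-contained and runnable (class and method names are illustrative, not from the thesis):

```java
// In C a byte buffer can be reinterpreted as double* with a cast; in Java
// the same reinterpretation requires an explicit conversion loop — the
// type-safety cost the slide is pointing at.
public class ByteToDouble {
    // Decode big-endian IEEE-754 doubles packed into b.
    static double[] toDoubles(byte[] b) {
        double[] data = new double[b.length / 8];
        for (int i = 0, off = 0; i < data.length; i++, off += 8) {
            long bits = 0;
            for (int j = 0; j < 8; j++) {
                bits = (bits << 8) | (b[off + j] & 0xffL);
            }
            data[i] = Double.longBitsToDouble(bits);
        }
        return data;
    }

    public static void main(String[] args) {
        // Round-trip one value: encode big-endian, decode with the loop above.
        byte[] b = new byte[8];
        long bits = Double.doubleToLongBits(3.14);
        for (int j = 0; j < 8; j++) {
            b[j] = (byte) (bits >>> (56 - 8 * j));
        }
        System.out.println(toDoubles(b)[0]);
    }
}
```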
(2) Type Safety Objects have meta-data • runtime safety checks (array-bounds, array-store, casts) In C: struct Buffer { int len; char data[1]; }; Buffer *b = malloc(sizeof(Buffer)+1024); b->len = 1024; In Java: class Buffer { int len; byte[] data; Buffer(int n) { data = new byte[n]; len = n; } } Buffer b = new Buffer(1024); [Diagram: in-memory layout — the Java Buffer object and its byte[] each carry vtable and lock words alongside the 1024-byte payload] 9
(2) Type Safety Result: Java objects need to be serialized and de-serialized across the network [Diagram: objects are serialized out of the GC heap, then copied into a pinned native buffer before DMA to the NI] 10
(2) Type Safety: Effect Performance hit of one order of magnitude: • pick your favorite high-level communication abstraction (e.g. Remote Method Invocation) • pick your favorite JVM, your fastest network interface, and a pair of 450MHz P-IIs 11
Thesis Use explicit memory management to improve Java communication performance • Jbufs: safe and explicit management of Java buffers • softens the GC/Native heap separation • preserves type and storage safety • “zero-copy” array transfers • Jstreams: extends Jbufs for optimizing serialization in clusters • “zero-copy” de-serialization of arbitrary objects [Diagram: a pinned, user-controlled buffer is shared by the GC heap and the NI, with DMA directly into it] 12
Outline Thesis Overview • GC/Native heap separation, object serialization Experimental Setup: Giganet cluster and Marmot Part I: Array Transfers (1) Javia-I: Java Interface to VI Architecture • respects heap separation (2) Jbufs: Safe and Explicit Management of Buffers • Javia-II, matrix multiplication, Active Messages Part II: Object Transfers (3) A Case For Specialization • micro-benchmarks, RMI using Javia-I/II, impact on application suite (4) Jstreams: in-place de-serialization • micro-benchmarks, RMI using Javia-III, impact on application suite Conclusions 13
Giganet Cluster Configuration • 8 P-II 450MHz, 128MB RAM • 8 1.25 Gbps Giganet GNN-1000 adapters • one Giganet switch GNN1000 Adapter: User-Level Network Interface • Virtual Interface Architecture implemented as a library (Win32 DLL) Base-line pt-2-pt Performance • 14us r/t latency, 16us with switch • over 100MBytes/s peak, 85MBytes/s with switch 14
Marmot Java System from Microsoft Research • not a VM • static compiler: bytecode (.class) to x86 (.asm) • linker: asm files + runtime libraries -> executable (.exe) • no dynamic loading of classes • most Dragon book opts, some OO and Java-specific opts Advantages • source code • good performance • two types of non-concurrent GC (copying, conservative) • native interface “close enough” to JNI 15
Outline Thesis Overview • GC/Native heap separation, object serialization Experimental Setup: Giganet cluster and Marmot Part I: Array Transfers (1) Javia-I: Java Interface to VI Architecture • respects heap separation (2) Jbufs: Safe and Explicit Management of Buffers • Javia-II, matrix multiplication, Active Messages Part II: Object Transfers (3) A Case For Specialization • micro-benchmarks, RMI using Javia-I/II, impact on application suite (4) Jstreams: in-place de-serialization • micro-benchmarks, RMI using Javia-III, impact on application suite Conclusions 16
Javia-I Basic Architecture • respects heap separation • buffer mgmt in native code • Marmot as an “off-the-shelf” system • copying GC disabled in native code • primitive array transfers only Send/Recv API • non-blocking • blocking • bypass ring accesses • pin-on-demand • alloc-recv: allocates new array on-demand • cannot eliminate copying during recv [Diagram: Java-side byte-array refs and a send/recv ticket ring sit above C-side descriptors, send/recv queues, and buffers in the VIA Vi] 17
Javia-I: Performance Basic Costs (PII-450, Windows2000b3): pin + unpin = (10 + 10)us, or ~5000 machine cycles Marmot: native call = 0.28us, locks = 0.25us, array alloc = 0.75us Latency (N = transfer size in bytes): • raw: 16.5us + (25ns) * N • pin(s): 38.0us + (38ns) * N • copy(s): 21.5us + (42ns) * N • copy(s)+alloc(r): 18.0us + (55ns) * N BW: 75% to 85% of raw for 16Kbytes 18
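The copy and pin-on-demand cost lines cross at a break-even transfer size, which is why pinning only pays off for large transfers. A minimal sketch, using the slide's constants (class and method names are illustrative):

```java
// Cost model from the slide, in microseconds:
//   copy(s):        21.5us + 0.042us per byte
//   pin-on-demand:  38.0us + 0.038us per byte
// Pinning has a higher fixed cost but a lower per-byte cost, so it wins
// only past some transfer size.
public class BreakEven {
    static double copyLatency(int n) { return 21.5 + 0.042 * n; }
    static double pinLatency(int n)  { return 38.0 + 0.038 * n; }

    // Smallest transfer size (bytes) at which pin-on-demand beats copying.
    static int breakEven() {
        int n = 0;
        while (pinLatency(n) >= copyLatency(n)) n++;
        return n;
    }

    public static void main(String[] args) {
        // Roughly 16.5us / 4ns-per-byte, i.e. around 4KB.
        System.out.println("break-even transfer size: " + breakEven() + " bytes");
    }
}
```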
jbufs Goal • provide buffer management capabilities to Java without violating its safety properties • re-use is important: amortizes high pinning costs jbuf: exposes communication buffers to Java programmers 1. lifetime control: explicit allocation and de-allocation 2. efficient access: direct access as primitive-typed arrays 3. location control: safe de-allocation and re-use by controlling whether or not a jbuf is part of the GC heap • heap separation becomes soft and user-controlled 19
jbufs: Lifetime Control public class jbuf { public static jbuf alloc(int bytes); /* allocates jbuf outside of GC heap */ public void free() throws CannotFreeException; /* frees jbuf if it can */ } 1. jbuf allocation does not result in a Java reference to it • cannot access the jbuf from the wrapper object 2. jbuf is not automatically freed if there are no Java references to it • free has to be explicitly called [Diagram: a handle points to the jbuf, which lives outside the GC heap] 20
jbufs: Efficient Access public class jbuf { /* alloc and free omitted */ public byte[] toByteArray() throws TypedException; /* hands out byte[] ref */ public int[] toIntArray() throws TypedException; /* hands out int[] ref */ . . . } 3. (Storage Safety) jbuf remains allocated as long as there are array references to it • when can we ever free it? 4. (Type Safety) jbuf cannot have two differently typed references to it at any given time • when can we ever re-use it (e.g. change its reference type)? [Diagram: a Java byte[] ref in the GC heap points into the jbuf outside it] 21
jbufs: Location Control public class jbuf { /* alloc, free, toArrays omitted */ public void unRef(CallBack cb); /* app intends to free/re-use jbuf */ } Idea: use GC to track references unRef: application claims it has no references into the jbuf • jbuf is added to the GC heap • GC verifies the claim and notifies application through callback • application can now free or re-use the jbuf Required GC support: change scope of GC heap dynamically [Diagram: after unRef, the jbuf and its Java byte[] refs are swept into the GC heap; the callBack hands the jbuf back to the application] 22
jbufs: Runtime Checks [State diagram: alloc → unref&lt;p&gt;; to&lt;p&gt;Array: unref&lt;p&gt; → ref&lt;p&gt; (to&lt;p&gt;Array loops on ref&lt;p&gt;); unRef: ref&lt;p&gt; → to-be-unref&lt;p&gt; (to&lt;p&gt;Array, unRef loop on to-be-unref&lt;p&gt;); GC*: to-be-unref&lt;p&gt; → unref&lt;p&gt;; free from unref&lt;p&gt;] Type safety: ref and to-be-unref states parameterized by primitive type GC* transition depends on the type of garbage collector • non-copying: transition only if all refs to array are dropped before GC • copying: transition occurs after every GC 23
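The runtime checks can be sketched as a small state machine; the following toy model is illustrative only (names, exception choices, and the GC hook are invented), not the thesis implementation:

```java
// Toy model of the jbuf states: unref<p> -> ref<p> -> to-be-unref<p>,
// with GC* returning the jbuf to unref<p> so it can be freed or re-typed.
public class JbufModel {
    enum State { UNREF, REF, TO_BE_UNREF }
    private State state = State.UNREF;
    private Class<?> type; // array type bound while the jbuf is referenced

    // to<p>Array: hand out a typed array view; first use binds the type,
    // and a differently typed view is rejected (type safety).
    void toArray(Class<?> t) {
        if (state == State.UNREF) type = t;
        else if (t != type) throw new IllegalStateException("typed access violation");
        state = State.REF;
    }
    // unRef: application claims it holds no more references.
    void unRef() {
        if (state == State.UNREF) throw new IllegalStateException("not referenced");
        state = State.TO_BE_UNREF;
    }
    // GC*: the collector verifies the claim; only then does the jbuf
    // return to unref<p> and lose its type binding.
    void gcVerified() {
        if (state == State.TO_BE_UNREF) { state = State.UNREF; type = null; }
    }
    // free: legal only while unreferenced (storage safety).
    void free() {
        if (state != State.UNREF) throw new IllegalStateException("CannotFree");
    }
    State state() { return state; }
}
```

Usage follows the slide's cycle: toArray, unRef, GC verification, then free or re-use under a new element type.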
Javia-II Exploiting jbufs • explicit pinning/unpinning of jbufs • only non-blocking send/recvs [Diagram: Java-side jbuf state and array refs with a send/recv ticket ring above C-side descriptors and send/recv queues in the VIA Vi] 24
Javia-II: Performance Basic Jbuf Costs: allocation = 1.2us, to*Array = 0.8us, unRefs = 2.3us, GC degradation = 1.2us/jbuf Latency (n = xfer size in bytes): • raw: 16.5us + (0.025us) * n • jbufs: 20.5us + (0.025us) * n • pin(s): 38.0us + (0.038us) * n • copy(s): 21.5us + (0.042us) * n BW within 1% of raw 25
MM: Communication pMM over Javia-II/jbufs spends at least 25% less time in communication for 256x256 matrices on 8 processors 26
MM: Overall Cache effects: better communication performance does not always translate to better overall performance 27
Active Messages
class First extends AMHandler {
  private int first;
  void handler(AMJbuf buf, …) {
    int[] tmp = buf.toIntArray();
    first = tmp[0];
  }
}
class Enqueue extends AMHandler {
  private Queue q;
  void handler(AMJbuf buf, …) {
    int[] tmp = buf.toIntArray();
    q.enq(tmp);
  }
}
Exercising Jbufs: • user supplies a list of jbufs • upon message arrival: • jbuf passed to handler • unRef is invoked after handler invocation • if pool is empty, reclaim existing ones • copying deferred to GC-time only if needed 28
AM: Performance Latency about 15us higher than Javia • synch access to buffer pool, endpoint header, flow-control checks, handler id lookup BW within 10% of peak for 16KByte messages 29
Jbufs: Experience Efficient access through arrays is useful: • no indirect access via method invocation • promotes code re-use of large numerical kernels • leverages compiler infrastructure for eliminating safety checks Limitations • still not as flexible as C buffers • stale references may confuse programmers Discussed in thesis: • the necessity of explicit de-allocation • implementation of Jbufs in Marmot’s copying collector • impact on conservative and generational collector • extension to JNI to allow “portable” implementations of Jbufs 30
Outline Thesis Overview • GC/Native heap separation, object serialization Experimental Setup: VI Architecture and Marmot Part I: Array Transfers (1) Javia-I: Java Interface to VI Architecture • respects heap separation (2) Jbufs: Safe and Explicit Management of Buffers • Javia-II, matrix multiplication, Active Messages Part II: Object Transfers (3) A Case For Specialization on Homogeneous Clusters • micro-benchmarks, RMI using Javia-I/II, impact on application suite (4) Jstreams: in-place de-serialization • micro-benchmarks, RMI using Javia-III, impact on application suite Conclusions 31
Object Serialization and RMI Standard JOS Protocol • “heavy-weight” class descriptors are serialized along with objects • type-checking: classes need not be “equal”, just “compatible” • protocol allows for user extensions Remote Method Invocation • object-oriented version of Remote Procedure Call • relies on JOS for argument passing • actual parameter object can be a sub-class of the formal parameter class [Diagram: writeObject copies objects out of the sender's GC heap onto the network; readObject rebuilds them in the receiver's GC heap] 32
JOS Costs 1. overheads in tens or hundreds of us: • send/recv overheads ~3us, memcpy of 500 bytes ~0.8us 2. double[] 50% more expensive than byte[] of similar size 3. overheads grow as object sizes grow 33
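The class-descriptor and copy overheads described above can be observed directly with standard JOS (the java.io object streams); the class name JosDemo is illustrative:

```java
import java.io.*;

// Standard Java Object Serialization: the wire form carries class
// descriptors alongside the data, and readObject allocates a fresh copy
// on the receiver — the two costs the slide quantifies.
public class JosDemo {
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] b) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(b))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        double[] data = new double[64];            // 512 bytes of payload
        byte[] wire = serialize(data);
        double[] copy = (double[]) deserialize(wire);
        System.out.println("payload: 512 bytes, wire: " + wire.length + " bytes");
        System.out.println("new allocation on receive: " + (copy != data));
    }
}
```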
Impact of Marmot Impact of Marmot's optimizations: • method inlining: up to 66% improvement (already deployed) • removing all synchronization: up to 21% improvement • removing all safety checks: up to 15% improvement, combined Better compilation technology is unlikely to reduce overheads substantially 34
Impact on RMI • Order of magnitude worse than Javia-I/II • round-trip latency drops to about 30us in a null RMI: no JOS! • peak bandwidth of 22MBytes/s, about 25% of raw 35
Impact on Applications A Case for Specializing Serialization for Cluster Applications: • serialization overheads an order of magnitude higher than send/recv and memcpy • RMI performance degraded by one order of magnitude • 5-15% “estimated” impact on applications • old adage: “specialize for the common case” 36
Optimizing De-serialization “In-place” object de-serialization • specialization for homogeneous clusters and JVMs Goal • eliminate copying and allocation of objects Challenges • preserve the integrity of the receiving JVM • permit de-serialization of arbitrary Java objects with unrestricted usage and without special annotations • independent of a particular GC scheme [Diagram: writeObject sends objects across the network directly into the receiver's GC heap, with no intermediate copy] 37
Jstreams: write public class Jstream extends Jbuf { public void writeObject(Object o) /* serializes o onto the stream */ throws TypedException, ReferencedException; public void writeClear() /* clears the stream for writing*/ throws TypedException, ReferencedException; } writeObject • deep-copy of objects: maintains in-memory layout • deals with cyclic data structures • swizzle pointers: offsets to a base address • replace object meta-data with 64-bit class descriptor • optimization: primitive-typed arrays in jbufs are not copied 38
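Pointer swizzling as described here — references written as offsets from a base address so the receiver can rebuild them wherever the buffer lands — can be illustrated with a toy list layout (the class name, record format, and field sizes are all invented for illustration):

```java
import java.nio.ByteBuffer;

// Toy swizzling demo: a linked list is written into one flat buffer, and
// each "next" pointer is stored as an offset from the buffer base.
// The reader walks offsets, so the buffer is position-independent —
// the property in-place de-serialization relies on.
public class SwizzleDemo {
    // Each node occupies 8 bytes: [int value][int nextOffset]; -1 ends the list.
    static ByteBuffer writeList(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * 8);
        for (int i = 0; i < values.length; i++) {
            buf.putInt(i * 8, values[i]);
            buf.putInt(i * 8 + 4, i + 1 < values.length ? (i + 1) * 8 : -1);
        }
        return buf;
    }

    // "Unswizzled" traversal: follow offsets instead of machine pointers.
    static int sum(ByteBuffer buf) {
        int total = 0;
        for (int off = 0; off != -1; off = buf.getInt(off + 4)) {
            total += buf.getInt(off);
        }
        return total;
    }

    public static void main(String[] args) {
        ByteBuffer wire = writeList(1, 2, 3);
        System.out.println(sum(wire)); // walks 1 -> 2 -> 3
    }
}
```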
Jstreams: read public class Jstream extends Jbuf { public Object readObject() throws TypedException; /* de-serialization */ public boolean isJstream(Object o); /* checks if o resides in the stream */ } readObject • replace class descriptors with meta-data • unswizzle pointers, array-bounds checking • after first readObject, add jstream to GC heap • tracks references coming out of read objects • unRef: user is willing to free or re-use [Diagram: after readObject, the jstream joins the GC heap; unRef and the GC callBack return it to the application, as with jbufs] 39
Jstreams: Runtime Checks [State diagram: alloc → unref; writeObject: unref → write mode (writeObject loops; writeClear returns to unref); readObject: unref → read mode (readObject, unRef loop; GC* returns to unref); free from unref] Modification to Javia-II: prevent DMA from clobbering de-serialized objects • receive posts not allowed if jstream is in read mode • no changes to Javia-II architecture
jstream: Performance De-serialization costs constant w.r.t. object size • 2.6us for arrays, 3.3us per list element. 41
jstream: Impact on RMI 4-byte round-trip latency of 45us (25us higher than Javia-II) 52MBytes/s for 16KBytes arguments 42
jstream: Impact on Applications 3-10% improvement in SOR, EM3D, FFT 10% hit in pMM performance • over 22,000 incoming RMIs, 1000 jstreams in receive pool, ~26 garbage collections: 15% of total execution time in GC • generational collection will alleviate GC costs substantially • receive pool size is hard to tune: tradeoffs between GC and locality 43
Jstreams: Experience Implementation of readObject and writeObject integrated into JVM • protocol is JVM-specific • native implementation is faster Limitations • not as flexible as Java streams: cannot read and write at the same time • no “extensible” wire protocols Discussed in thesis: • implementation of Jstreams in Marmot’s copying collector • support for polymorphic RMI: minor changes to the stub compiler • JNI extensions to allow “portable” implementations of Jstreams 44
Related Work Microsoft J-Direct • “pinned” arrays defined using source-level annotations • JIT produces code to “redirect” array access: expensive • Berkeley’s Jaguar: efficient code generation with JIT extensions • security concern: JIT “hacks” may break Java or byte-code Custom JVMs • many “tricks” are possible (e.g. pinned array factories, pinned and non-pinned heaps, etc): depend on a particular GC scheme • Jbufs: isolates minimal support needed from GC Memory Management • Safe Regions (Gay and Aiken): reference counting, no GC Fast Serialization and RMI • KaRMI (Karlsruhe): fixed JOS, ground-up RMI implementation • Manta (Vrije U): fast RMI but a Java dialect 45
Summary Use of explicit memory management to improve Java communication performance in clusters • softens the GC/Native heap separation • preserves type and storage safety • independent of GC scheme • jbufs: zero-copy array transfers • jstreams: zero-copy de-serialization of arbitrary objects Framework for building communication software and applications in Java • Javia-I/II • parallel matrix multiplication • Jam: active messages • Java RMI • cluster applications: TSP, IDA, SOR, EM3D, FFT, and MM 46