Efficient User-Level Networking in Java
Chi-Chao Chang, Dept. of Computer Science, Cornell University
(joint work with Thorsten von Eicken and the Safe Language Kernel group)
Goal
High-performance cluster computing with safe languages
• parallel and distributed applications
• communication support for operating systems
Use off-the-shelf technologies
• user-level network interfaces (UNIs)
  • direct, protected access to network devices
  • inexpensive clusters
  • U-Net (Cornell), Shrimp (Princeton), FM (UIUC), Hamlyn (HP)
  • Virtual Interface Architecture (VIA): emerging UNI standard
• Java
  • safe: “better C++”
  • “write once, run everywhere”
  • growing interest for high-performance applications (Java Grande)
Make the performance of UNIs available from Java
• Javia: a Java interface to VIA
Why a Java Interface to UNI?
[Figure: protocol stack, with Apps over RMI/RPC, Sockets, and Active Messages/MPI/FM, all over the UNI and networking devices; the upper layers are Java, the lower layers C]
A different approach to providing communication support for Java
Traditional “front-end” approach
• pick a favorite abstraction (sockets, RMI, MPI) and Java VM
• write a Java front-end to custom or existing native libraries
• good performance, re-use of proven code
• magic in native code, no common solution
Javia: exposes the UNI to Java
• minimizes the amount of unverified code
• isolates the bottlenecks in data transfer:
  1. automatic memory management
  2. object serialization
Contribution I
PROBLEM: lack of control over object lifetime/location due to GC
EFFECT: conventional techniques (data copying and buffer pinning) yield a 10% to 40% hit in array throughput
SOLUTION: jbufs, explicit and safe buffer management in Java
SUPPORT: modifications to the GC
RESULT: bandwidth within 1% of hardware, independent of transfer size
Contribution II
PROBLEM: linked, typed objects
EFFECT: serialization cost >> send/recv overheads (~1000 cycles)
SOLUTION: jstreams, in-place object unmarshaling
SUPPORT: object layout information
RESULT: serialization cost ~ send/recv overheads; unmarshaling overhead independent of object size
Outline
Background
• UNI: Virtual Interface Architecture
• Java
• Experimental Setup
Javia Architecture
• Javia-I: native buffers (baseline)
• Javia-II: jbufs (buffer management) and jstreams (marshaling)
Summary and Conclusions
UNI in a Nutshell
[Figure: traditional path, where all traffic goes through the OS to the NI, versus VIA, where virtual interfaces are multiplexed directly on the NI and the OS stays off the data path]
Enabling technology for networks of workstations
• direct, protected access to networking devices
Traditional
• all communication via the OS
VIA
• connections between virtual interfaces (Vis)
• apps send/recv through Vis, simple mux in the NI
• OS only involved in setting up Vis
Generic architecture
• implemented in hardware, software, or both
VI Structures
[Figure: application memory holds library buffers and descriptors on the send/recv queues; the adapter reaches them by DMA and is signaled through doorbells]
Key data structures
• user buffers
• buffer descriptors <addr, len>: layout exposed to the user
• send/recv queues: accessed only through API calls
Structures are
• pinned to physical memory
• address-translated in the adapter
Key points
• direct DMA access to buffers/descriptors in user space
• the application must allocate, use, re-use, and free all buffers/descriptors
• alloc&pin and unpin&free are expensive operations, but re-use is cheap
Java Storage Safety
class Buffer {
  byte[] data;
  Buffer(int n) { data = new byte[n]; }
}
No control over object placement
  Buffer buf = new Buffer(1024);
• cannot pin after allocation: the GC can move objects
No control over de-allocation
  buf = null;
• drop all references, then call or wait for the GC
Result: additional data copying in the communication path
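To make the consequence above concrete, a minimal sketch of the copy forced on the send path (illustrative only; the staging array stands in for pinned native memory reached through JNI):

// Illustrative sketch: because the GC may move heap arrays, the send path must
// stage data into storage that never moves. An ordinary array stands in here
// for a pinned native buffer reached through native code.
public class CopyingSendSketch {
    private final byte[] staging = new byte[32 * 1024];  // stands in for a pinned native buffer

    public void send(byte[] data, int len) {
        System.arraycopy(data, 0, staging, 0, len);  // the extra copy paid on every send
        // ... native code would now DMA the staging buffer to the network ...
    }
}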
Java Type Safety
[Figure: object layout, where a buf reference points to a Buffer object (vtable, lock, byte[] field), which points to an array object (vtable, lock, length 1024, elements 0, 1, 2, ...)]
Cannot forge a reference to a Java object
• e.g. cannot cast between byte arrays and objects
No control over object layout
• field ordering is up to the Java VM
• objects carry runtime metadata
• casting requires runtime checks:
  Object o = (Object) new Buffer(1024);  /* up cast: OK */
  Buffer buf = (Buffer) o;               /* down cast: runtime check */
• array bounds checks:
  for (int i = 0; i < 1024; i++) buf.data[i] = (byte) i;
Result: expensive object marshaling
Marmot
Java system from Microsoft Research
• not a VM
• static compiler: bytecode (.class) to x86 (.asm)
• linker: asm files + runtime libraries -> executable (.exe)
• no dynamic loading of classes
• most Dragon-book optimizations, some OO- and Java-specific optimizations
Advantages
• source code available
• good performance
• two types of non-concurrent GC (copying, conservative)
• native interface “close enough” to JNI
Example: Cluster @ Cornell
Configuration
• 8 P-II 450 MHz, 128 MB RAM
• 8 × 1.25 Gbps Giganet GNN-1000 adapters
• one Giganet switch
• total cost: ~$30,000 (with university discount)
GNN-1000 adapter
• mux implemented in hardware
• device driver for VI setup
• VIA interface in a user-level library (Win32 DLL)
• no support for interrupt-driven reception
Baseline pt-2-pt performance
• 14 us round-trip latency, 16 us with switch
• over 100 MBytes/s peak, 85 MBytes/s with switch
Outline
Background
Javia Architecture
• Javia-I: native buffers (baseline)
• Javia-II: jbufs and jstreams
Summary and Conclusions
Javia: General Architecture
[Figure: software stack, with applications on the Javia classes in Java (Marmot), over the Javia C library and the Giganet VIA library, over the GNN-1000 adapter]
Java classes + C library
Javia-I
• baseline implementation
• array transfers only
• no modifications to Marmot
• native library: buffer management + wrapper calls to VIA
Javia-II
• array and object transfers
• buffer management in Java
• special support from Marmot
• native library: wrapper calls to VI
Javia-I: Exploiting Native Buffers
[Figure: the Java side (GC heap) holds byte-array references and the send/recv ticket rings; the C side holds the descriptors, send/recv queues, and buffers handed to VIA]
Basic asynchronous send/recv
• buffers/descriptors live in the native library
• Java send/recv ticket rings mirror the VI queues
• # of descriptors/buffers == # of tickets in the ring
Send critical path
• get a free ticket from the ring
• copy from array to buffer
• free the ticket
Recv critical path
• obtain the corresponding ticket in the ring
• copy data from buffer to array
• free the ticket
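To make the ticket-ring bookkeeping concrete, a minimal sketch of a fixed-size ring (names and layout are illustrative, not Javia's actual code):

// Illustrative sketch of a fixed-size ticket ring (assumed names, not Javia's code).
// Each ticket mirrors one pinned descriptor/buffer pair owned by the native library.
public final class TicketRing {
    private final int[] tickets;      // indices of the native descriptor/buffer pairs
    private int head, count;

    public TicketRing(int size) {
        tickets = new int[size];
        for (int i = 0; i < size; i++) tickets[i] = i;
        count = size;
    }

    /** Send path, step 1: claim a free ticket (or fail if the ring is empty). */
    public synchronized int get() {
        if (count == 0) return -1;
        int t = tickets[head];
        head = (head + 1) % tickets.length;
        count--;
        return t;
    }

    /** Send path, step 3: return the ticket once the transfer has completed. */
    public synchronized void free(int t) {
        tickets[(head + count) % tickets.length] = t;
        count++;
    }
}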
Javia-I: Variants
Two send variants
Sync send + copy
• goal: bypass the send ring
• one ticket
• array -> buffer copy
• wait until the send completes
Sync send + pin
• goal: bypass the send ring, avoid the copy
• pin the array on the fly
• wait until the send completes
• unpin after the send
One recv variant
No-post recv + alloc
• goal: bypass the recv ring
• allocate the array on the fly, copy data
Javia-I: Performance
Basic costs
• VIA: pin + unpin = (10 + 10) us
• Marmot: native call = 0.28 us, locks = 0.25 us, array alloc = 0.75 us
Latency (N = transfer size in bytes)
• raw:                   16.5 us + 25 ns * N
• pin (send):            38.0 us + 38 ns * N
• copy (send):           21.5 us + 42 ns * N
• copy (send) + alloc (recv): 18.0 us + 55 ns * N
Bandwidth: 75% to 85% of raw; switch-over point between copy and pin at 6 KBytes
jbufs
Lessons from Javia-I
• managing buffers in C introduces copying and/or pinning overheads
• can be implemented in any off-the-shelf JVM
Motivation
• eliminate excess per-byte costs in latency
• improve throughput
jbuf: exposes communication buffers to Java programmers
1. lifetime control: explicit allocation and de-allocation of jbufs
2. efficient access: direct access to a jbuf as primitive-typed arrays
3. location control: safe de-allocation and re-use by controlling whether or not a jbuf is part of the GC heap
jbufs: Lifetime Control
public class jbuf {
  public static jbuf alloc(int bytes);            /* allocates a jbuf outside the GC heap */
  public void free() throws CannotFreeException;  /* frees the jbuf if it can */
}
1. jbuf allocation does not result in a Java reference into the buffer
  • cannot directly access the jbuf's memory through the wrapper object
2. a jbuf is not automatically freed when there are no Java references to it
  • free has to be called explicitly
[Figure: the jbuf sits outside the GC heap, reachable from C by pointer]
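A minimal usage sketch of the lifetime API above, assuming the jbuf class as declared on this slide (the buffer size and surrounding class are illustrative):

// Usage sketch for explicit jbuf lifetime control (jbuf/CannotFreeException as above).
public class JbufLifetimeExample {
    public static void main(String[] args) throws CannotFreeException {
        jbuf buf = jbuf.alloc(8 * 1024);  // allocated outside the GC heap, never moved
        try {
            // ... hand buf to the communication layer, post sends/recvs ...
        } finally {
            buf.free();  // must be called explicitly; throws if the jbuf cannot be freed yet
        }
    }
}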
jbufs: Efficient Access
public class jbuf {
  /* alloc and free omitted */
  public byte[] toByteArray() throws TypedException;  /* hands out a byte[] ref */
  public int[] toIntArray() throws TypedException;    /* hands out an int[] ref */
  . . .
}
3. (memory safety) a jbuf remains allocated as long as there are array references into it
  • when can we ever free it?
4. (type safety) a jbuf cannot have two differently typed references into it at any given time
  • when can we ever re-use it (e.g. change its reference type)?
[Figure: a Java byte[] reference on the GC heap points directly into the jbuf]
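A short zero-copy access sketch using the API above (same assumed jbuf class; sizes illustrative):

// Zero-copy access sketch: the array reference aliases the jbuf's storage.
public class JbufAccessExample {
    public static void main(String[] args) throws TypedException {
        jbuf buf = jbuf.alloc(1024);
        byte[] view = buf.toByteArray();  // no allocation, no copy: a typed view of the jbuf
        view[0] = 42;                     // ordinary array access, bounds-checked as usual
        // buf.free() here would throw: the byte[] reference keeps the jbuf alive,
        // which is exactly the memory-safety question raised on this slide
    }
}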
jbufs: Location Control
[Figure: a jbuf referenced by a Java byte[]; unRef moves the jbuf into the GC heap, and once the references are gone the GC fires the callBack]
public class jbuf {
  /* alloc, free, toArrays omitted */
  public void unRef(CallBack cb);  /* app intends to free/re-use the jbuf */
}
Idea: use the GC to track references
unRef: the application claims it has no references into the jbuf
• the jbuf is added to the GC heap
• the GC verifies the claim and notifies the application through the callback
• the application can now free or re-use the jbuf
Required GC support: change the scope of the GC heap dynamically
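A sketch of the unRef/callback handshake; the single-method shape of the CallBack interface is an assumption, since only the unRef signature appears on the slide:

// Sketch of recycling a jbuf via unRef (CallBack's exact shape is assumed).
public class JbufReuseExample {
    public static void recycle(final jbuf buf) {
        // The application drops its array views, then tells the runtime so:
        buf.unRef(new CallBack() {
            public void callBack(jbuf b) {
                // The GC has verified that no array references into b survive.
                // b can now be freed, or re-typed and handed out again.
                try {
                    b.free();
                } catch (CannotFreeException e) {
                    throw new IllegalStateException(e);
                }
            }
        });
    }
}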
jbufs: Runtime Checks
[State diagram: alloc creates a jbuf in the unref state; to<p>Array moves it to ref<p> (further to<p>Array calls and GC leave it there); unRef moves it to to-be-unref<p> (to<p>Array and unRef leave it there); GC* returns it to unref, from which free is allowed]
Type safety: the ref and to-be-unref states are parameterized by primitive type
The GC* transition depends on the type of garbage collector
• non-copying: the transition occurs only if all references to the array are dropped before the GC runs
• copying: the transition occurs after every GC
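The checks lend themselves to a small state machine; an illustrative sketch (names and structure are assumptions, not Marmot's actual runtime code):

// Illustrative sketch of the per-jbuf state checks (not the actual runtime code).
public class JbufStateSketch {
    enum State { UNREF, REF, TO_BE_UNREF }

    private State state = State.UNREF;   // set by alloc
    private Class<?> refType;            // the <p> parameter: byte[].class, int[].class, ...

    void onToArray(Class<?> arrayType) {
        if (state == State.UNREF) {
            state = State.REF;           // first typed view: unref -> ref<p>
            refType = arrayType;
        } else if (!arrayType.equals(refType)) {
            throw new RuntimeException("TypedException: jbuf already viewed as " + refType);
        }
        // REF and TO_BE_UNREF keep their state on further to<p>Array calls
    }

    void onUnRef()           { if (state == State.REF) state = State.TO_BE_UNREF; }

    // Corresponds to the GC* transition: the collector has verified the claim.
    void onGcVerifiedUnref() { if (state == State.TO_BE_UNREF) { state = State.UNREF; refType = null; } }

    void onFree() {
        if (state != State.UNREF) throw new RuntimeException("CannotFreeException: jbuf still referenced");
    }
}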
Javia-II: Exploiting jbufs
[Figure: the Java side holds the send/recv ticket rings, jbuf state, and array references on the GC heap; the C side holds only the descriptors and send/recv queues handed to VIA]
Send/recv with jbufs
• explicit pinning/unpinning of jbufs
• tickets point to pinned jbufs
• critical path: synchronized access to the rings, but no copies
Additional checks
• send posts are allowed only if the jbuf is in the ref<p> state
• recv posts are allowed only if the jbuf is in the unref or ref<p> state
• no outstanding send/recv posts in the to-be-unref<p> state
Javia-II: Performance
Basic costs
• allocation = 1.2 us, to*Array = 0.8 us, unRef = 2.5 us
Latency (n = transfer size in bytes)
• raw:         16.5 us + 0.025 us * n
• jbufs:       20.5 us + 0.025 us * n
• pin (send):  38.0 us + 0.038 us * n
• copy (send): 21.5 us + 0.042 us * n
Bandwidth: within the margin of error of raw (< 1%)
Parallel Matrix Multiplication
Goal: validate jbufs' flexibility and performance in Java apps
• matrices represented as arrays of jbufs (each jbuf accessed as an array of doubles)
• A, B, C distributed across processors in block columns
• comm phase: each processor sends its local portion of A to its right neighbor and receives a new A from its left neighbor
• comp phase: Cloc = Cloc + Aloc * Bloc’ (see the sketch below)
Preliminary results
• no fancy instruction scheduling in Marmot
• no fancy cache-conscious optimizations
• single processor, 128x128: only 15 Mflops
• cluster, 128x128: comm time about 10% of total time
The impact of jbufs will increase as the flop count increases
[Figure: C += A * B with the block columns of C, A, and B distributed over processors p0 through p3]
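A sketch of the comp phase on one processor, assuming a double[] view of each jbuf (the slides show toByteArray/toIntArray plus an ellipsis, so the double variant is an assumption), column-major block storage, and that Bloc’ denotes the matching k-by-k block of Bloc:

// Sketch of the local computation Cloc = Cloc + Aloc * Bloc' on one processor.
// Each block column is stored column-major: one double[] view of a jbuf per column.
public class MatMulSketch {
    // aCols: received A block (n rows, k columns); bBlock: k-by-k block of Bloc;
    // cCols: local C block (n rows, k columns).
    static void compPhase(double[][] aCols, double[][] bBlock, double[][] cCols) {
        int n = cCols[0].length;                  // rows per column
        int k = cCols.length;                     // local columns
        for (int j = 0; j < k; j++) {
            double[] cCol = cCols[j];             // column j of Cloc
            double[] bCol = bBlock[j];            // column j of the k-by-k block of Bloc
            for (int l = 0; l < k; l++) {
                double b = bCol[l];               // B(l, j)
                double[] aCol = aCols[l];         // column l of the received A block
                for (int i = 0; i < n; i++) {
                    cCol[i] += aCol[i] * b;       // C(i, j) += A(i, l) * B(l, j)
                }
            }
        }
    }
}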
Active Messages
Goal: exercise jbuf management
Implemented a subset of AM-II over Javia + jbufs:
• maintains a pool of free recv jbufs
• when a message arrives, the jbuf is passed to the handler
• AM calls unRef on the jbuf after the handler invocation
• if the pool is empty, either alloc more jbufs or invoke the GC (see the sketch below)
• no copying in the critical path; deferred to GC time if needed

class First extends AMHandler {
  private int first;
  void handler(AMJbuf buf, …) {
    int[] tmp = buf.toIntArray();
    first = tmp[0];
  }
}

class Enqueue extends AMHandler {
  private Queue q;
  void handler(AMJbuf buf, …) {
    int[] tmp = buf.toIntArray();
    q.enq(tmp);
  }
}
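A sketch of the receive-pool policy in the bullets above (pool structure, buffer size, and method names are assumptions, not the actual AM-II code; synchronization omitted for brevity):

// Illustrative sketch of the free recv-jbuf pool policy (assumed names and sizes).
import java.util.ArrayDeque;

public class RecvJbufPool {
    private static final int BUF_BYTES = 8 * 1024;             // assumed buffer size
    private final ArrayDeque<jbuf> free = new ArrayDeque<jbuf>();

    /** Hand out a jbuf to post for the next receive. */
    public jbuf get() {
        if (free.isEmpty()) {
            System.gc();                        // let unRef'd jbufs return via their callbacks
            if (free.isEmpty()) free.push(jbuf.alloc(BUF_BYTES));  // or just allocate more
        }
        return free.pop();
    }

    /** Called from a jbuf's unRef callback once the GC has verified it is free. */
    public void put(jbuf b) {
        free.push(b);
    }
}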
AM: Preliminary Numbers
Summary
• AM latency is about 15 us higher than Javia
  • synchronized access to the buffer pool, endpoint header, flow-control checks, handler-id lookup
  • room for improvement
• AM bandwidth is within 5% of peak for 16 KByte messages
jstreams
[Figure: writeObject marshals an object graph onto the network; a “typical” readObject allocates and copies on receipt, while the “in-place” readObject reads the objects directly out of the receive buffer]
Goal: efficient transmission of arbitrary objects
• assumption: optimizing for homogeneous hosts and Java systems
Idea: “in-place” unmarshaling
• defer copying and allocation to GC time if needed
jstream
• read/write access to a jbuf through the object-stream API
• no changes to the Javia-II architecture
jstream: Implementation
writeObject
• deep copy of the object graph, breadth-first
• deals with cyclic data structures
• replaces object metadata (e.g. the vtable) with a 64-bit class descriptor
readObject
• depth-first traversal from the beginning of the stream
• swizzles pointers, type checking, array-bounds checking
• replaces class descriptors with metadata
Required support
• some object layout information (e.g. per-class pointer-tracking info)
Minimal changes to existing stub compilers (e.g. rmic)
• jstream implements the JDK2.0 ObjectStream API
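A usage sketch assuming jstream mirrors java.io's ObjectOutput/ObjectInput methods, as the last bullet suggests; how a jstream wraps a jbuf is not shown on the slides, so the constructor below is hypothetical:

import java.io.IOException;

// Usage sketch: marshal into a send jbuf, unmarshal in place from a recv jbuf.
// The jstream(jbuf) constructor and the LinkedNode payload class are hypothetical.
public class JstreamExample {
    static class LinkedNode implements java.io.Serializable {
        int value;
        LinkedNode next;
    }

    static void sendSide(jbuf sendBuf, LinkedNode head) throws IOException {
        jstream out = new jstream(sendBuf);  // hypothetical: wrap a jbuf for writing
        out.writeObject(head);  // breadth-first deep copy; vtables become 64-bit class descriptors
        // ... post sendBuf via Javia-II ...
    }

    static void recvSide(jbuf recvBuf) throws IOException, ClassNotFoundException {
        jstream in = new jstream(recvBuf);   // hypothetical: wrap the received jbuf
        LinkedNode head = (LinkedNode) in.readObject();
        // In-place: pointers are swizzled and types checked inside the jbuf, no per-object copy.
        // The jbuf stays referenced until unRef, just like any other array view.
    }
}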
jstreams: Safety
[State diagram: states unref, unref w/obj, ref, and to-be-unref; alloc creates an unref jstream; writeObject moves it to unref w/obj and clearWrite back to unref; readObject moves it to ref and clearRead to to-be-unref, from which GC* returns it to unref; free is allowed only from unref; the annotations restrict posts per state (e.g. only send posts while writing, only recv posts before reading, and no send/recv posts outstanding while objects are referenced)]
Status
Implementation status
• Javia-I and Javia-II complete
• jbufs and jstreams integrated with Marmot's copying collector
Current work
• finish the implementation of AM-II
• full implementation of Java RMI
• integrate jbufs and jstreams with the conservative collector
• more investigation into deferred copying in higher-level protocols
Related Work
Fast Java RMI implementations
• Manta (Vrije U): compiler support for marshaling, Panda communication system
  • 34 us null RMI, 51 MBytes/s (85% of raw) on PII-200/Myrinet, JDK1.4
• KaRMI (Karlsruhe): ground-up implementation
  • 117 us null RMI, Alpha 500, ParaStation, JDK1.4
Other front-end approaches
• Java front-end for MPI (IBM), Java-to-PVM interface (GaTech)
Microsoft J-Direct
• “pinned” arrays defined using source-level annotations
• JIT produces code to “redirect” array accesses: expensive
Communication system design in safe languages (e.g. ML)
• Fox project (CMU): TCP/IP layer in ML
• Ensemble (Cornell): Horus in ML, buffering strategies, data-path optimizations
Summary
High-performance communication in Java: two problems
• buffer management in the presence of GC
• object marshaling
Javia: a Java interface to VIA
• uses native buffers as the baseline implementation
• jbufs: safe, explicit control over buffer placement and lifetime; eliminates bottlenecks in the critical path
• jstreams: a jbuf extension for fast, in-place unmarshaling of objects
Concluding remarks
• building blocks for Java apps and communication software
• should be an integral part of a high-performance Java system
Javia-I: Interface
package cornell.slk.javia;

public class ViByteArrayTicket {
  private byte[] data;
  private int len, off, tag;
  /* public methods to set/get fields */
}

public class Vi {                                         /* connection to a remote Vi */
  public void sendPost(ViByteArrayTicket t);              /* async send */
  public ViByteArrayTicket sendWait(int timeout);
  public void recvPost(ViByteArrayTicket t);              /* async recv */
  public ViByteArrayTicket recvWait(int timeout);
  public void send(byte[] b, int len, int off, int tag);  /* sync send */
  public byte[] recv(int timeout);                        /* post-less recv */
}
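A small usage sketch of the Javia-I interface above (establishing the Vi connection is not covered by this slide and is elided):

// Usage sketch of Javia-I's synchronous send and post-less receive.
public class JaviaIExample {
    static void exchange(Vi vi) {
        byte[] msg = new byte[1024];
        // sync send: Javia-I copies (or pins) the array into a native VIA buffer
        vi.send(msg, msg.length, 0, /* tag */ 0);

        // post-less recv: Javia-I allocates a fresh array and copies the data into it
        byte[] reply = vi.recv(/* timeout */ 1000);
        // ... use reply ...
    }
}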
Javia-II: Interface
package cornell.slk.javia;

public class ViJbuf extends jbuf {
  public ViJbufTicket register(Vi vi);      /* register + pin the jbuf */
  public void deregister(ViJbufTicket t);   /* deregister + unpin the jbuf */
}

public class ViJbufTicket {
  private ViJbuf buf;
  private int len, off, tag;
}

public class Vi {
  public void sendBufPost(ViJbufTicket t);  /* async send */
  public ViJbufTicket sendBufWait(int usecs);
  public void recvBufPost(ViJbufTicket t);  /* async recv */
  public ViJbufTicket recvBufWait(int usecs);
}
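A zero-copy send sketch with the Javia-II interface above; ViJbuf allocation is not shown on these slides, so the caller is assumed to obtain one in a way that mirrors jbuf.alloc:

// Zero-copy send sketch with Javia-II (how the ViJbuf was allocated is assumed).
public class JaviaIIExample {
    static void sendInts(Vi vi, ViJbuf buf) throws TypedException {
        ViJbufTicket ticket = buf.register(vi);   // register + pin once; re-use is cheap
        int[] payload = buf.toIntArray();         // direct, typed view of the jbuf (ref<int> state)
        payload[0] = 7;                           // fill the message in place

        vi.sendBufPost(ticket);                   // async send straight from the jbuf
        vi.sendBufWait(/* usecs */ 1000);         // wait for completion: no copies were made

        buf.deregister(ticket);                   // unpin when the buffer is no longer needed
    }
}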
jbufs: Implementation
[Figure: jbuf layout, with baseAddr, vtable, lock, native descriptor pointer, length, and the array body]
alloc/free: Win32 VirtualAlloc and VirtualFree
to{Byte,Int,...}Array: no allocation or copying
clearRefs:
• modification to the stop-and-copy (Cheney scan) GC
• clearRef adds the jbuf to a list of cleared jbufs
• after a GC, traverse the list to invoke the callbacks, then delete the list
[Figure: stack and globals with from-space and to-space, before and after a GC, with ref'd and unref'd jbufs shown alongside the heap]
State-of-the-Art Matrix Multiplication
[Figure: performance chart, courtesy of IBM Research]