Overview of An Efficient Implementation Scheme of Concurrent Object-Oriented Languages on Stock Multicomputers
Tony Chen, Sunjeev Sikand, and John Kerwin
CSE 291 - Programming Sensor Networks
May 23, 2003
Paper by: Kenjiro Taura, Satoshi Matsuoka, and Akinori Yonezawa
Background • Most of the work done on high performance, concurrent object-oriented programming languages (OOPLs) has focused on combinations of elaborate hardware and highly tuned, specially tailored software. • These software architectures (the compiler and the runtime system) exploit special features provided by the hardware in order to achieve: • Efficient intra-node multithreading • Efficient message passing between objects
Special Hardware Features • The hardware manages the thread scheduling queue, and automatically dispatches the next runnable thread upon termination of the current thread. • Processors and the network are tightly connected. • Processors can send a packet to the network within a few machine cycles. • Dispatching a task upon packet arrival takes only a few cycles.
Objective of this Paper • Demonstrate software techniques that can be used to achieve comparable intra-node multithreading, and inter-node message passing performance on conventional multicomputers, without special hardware scheduling and message passing facilities.
System Used to Demonstrate these Techniques • The authors developed a runtime environment for a concurrent object- oriented programming language called ABCL/onAP1000. • Used Fujitsu Laboratory’s experimental multicomputer called AP1000. • 512 SPARC chips running at 25 MHz • Interconnected with a 25 MB/s torus network
Computation/Programming Model • Computation is carried out by message transmissions among concurrent objects. • Concurrent objects are units of concurrency that become active when they accept messages. • Multiple message transmissions may take place in parallel, so objects may become active simultaneously. • When an object receives a message, the message is placed in its message queue, so that messages are processed one at a time.
Computation/Programming Model (cont.) • Messages can contain mail addresses of concurrent objects in addition to basic values such as numbers and booleans. • Each object has its own autonomous single thread of control, and its own encapsulated state variables. • Objects can be in dormant mode if they have no messages to process, active mode if they are executing a method, or waiting mode if they are waiting to receive a certain set of messages.
Possible Actions Within a Method • Message Sends to other concurrent objects • Past type – sender does not wait for a reply message • Now type – sender waits for a reply message • Reply messages are sent through a third object called a reply destination object, which resumes the original sender upon the reception of the reply message. • Creation of concurrent objects
Possible Actions Within a Method (cont.) • Referencing and Updating the contents of state variables • Waiting for a specified set of messages • Standard Operations (like arithmetic operations) on values stored in state variables
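The two send types and the reply-destination object can be illustrated with a toy C++ sketch; the names (`ReplyDestination`, `Adder`) are illustrative stand-ins, not ABCL's actual runtime interface, and real sends would cross object boundaries asynchronously.

```cpp
// Toy sketch of past-type vs. now-type sends (illustrative names only).
struct ReplyDestination {               // third object that carries the reply
    bool resumed = false;
    int value = 0;
    void reply(int v) {                 // reception of the reply message...
        value = v;
        resumed = true;                 // ...resumes the original sender
    }
};

struct Adder {
    int last = 0;
    // Past-type send: the sender does not wait for any reply.
    void add_past(int a, int b) { last = a + b; }
    // Now-type send: the result is routed back through the reply destination.
    void add_now(int a, int b, ReplyDestination& rd) { rd.reply(a + b); }
};
```

In the real model the sender of a now-type message blocks until the reply destination observes the reply; here that is collapsed into the `resumed` flag.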
Scheduling Process • Scheduling for sequential OOPLs simply involves a method lookup and a stack-based function call. • For concurrent OOPLs, scheduling of methods is not necessarily LIFO-based, since methods may be blocked to wait for messages, and resumed upon the arrival of a message. • Therefore, a naïve implementation must allocate invocation frames from the heap instead of the stack, and use a scheduling queue to keep track of pending methods.
Scheduling Process (cont.) • In addition, since it may not be possible for a receiver object to immediately process incoming messages, each object must have its own message queue to buffer incoming messages. • This can lead to substantial overhead for frame allocation/deallocation, and queue manipulation, for both the scheduling and message queues.
Example of a Naïve Scheduling Mechanism • A naïve implementation of message reception / method invocation for an object would require: • Allocation of an invocation frame to hold local variables and message arguments of the method. • Buffering a message into the frame. • Enqueueing the frame into the object message queue. • Enqueueing the object into the scheduling queue (if it is not already there).
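The four steps above can be sketched in C++ as follows; `Frame`, `Object`, and `Scheduler` are illustrative names, not the paper's actual runtime structures.

```cpp
#include <deque>
#include <vector>

// Sketch of the naive reception path: every send allocates a frame, buffers
// it, and goes through the scheduling queue, even for a dormant receiver.
struct Frame {                          // heap-allocated invocation frame
    int method_id;
    std::vector<int> args;              // buffered message arguments
};

struct Object {
    std::deque<Frame> message_queue;    // per-object buffer of pending messages
    bool scheduled = false;             // already on the scheduling queue?
    int processed = 0;
    void run_one() {                    // invoke the method for the oldest message
        message_queue.pop_front();
        ++processed;
    }
};

struct Scheduler {
    std::deque<Object*> ready;          // global scheduling queue

    void send(Object& o, int method_id, std::vector<int> args) {
        o.message_queue.push_back(Frame{method_id, std::move(args)});
        if (!o.scheduled) { ready.push_back(&o); o.scheduled = true; }
    }

    void drain() {                      // dispatch objects until nothing is pending
        while (!ready.empty()) {
            Object* o = ready.front(); ready.pop_front();
            while (!o->message_queue.empty()) o->run_one();
            o->scheduled = false;
        }
    }
};
```

Every send here pays for heap buffering and two queue operations, which is exactly the overhead the paper's stack-based scheme avoids in the common case.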
Key Observation for Intra-node Scheduling Strategy • In many cases, this full scheduling mechanism is not necessary, and we can use more efficient stack-based scheduling. • If an object is dormant, meaning it has no messages to be processed, its method can be invoked immediately upon message reception, without message buffering or schedule queue manipulation. • If it is active, then the message is buffered, and the method is invoked later via the scheduling queue.
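The observation can be sketched with an explicit mode test; note that this branch is exactly what the paper's virtual-function-table representation (described on the following slides) eliminates. All names here are illustrative.

```cpp
#include <deque>

// Sketch of the stack-based strategy with an explicit dormant/active check.
enum class Mode { Dormant, Active };

struct Obj {
    Mode mode = Mode::Dormant;
    std::deque<int> queue;              // messages buffered while active
    int handled = 0;
    void method(int msg) { (void)msg; ++handled; }  // stand-in method body
};

void send(Obj& o, int msg) {
    if (o.mode == Mode::Dormant) {
        o.mode = Mode::Active;
        o.method(msg);                  // invoked immediately on the caller's stack
        while (!o.queue.empty()) {      // process messages that arrived meanwhile
            o.method(o.queue.front());
            o.queue.pop_front();
        }
        o.mode = Mode::Dormant;
    } else {
        o.queue.push_back(msg);         // active: buffer for later scheduling
    }
}
```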
Scheduling Strategy Implementation • We need a mechanism to implement this strategy efficiently. • We cannot afford to perform an explicit runtime check on every intra-node message send to determine whether the receiver is dormant. • When a running object becomes blocked on the stack, we must be able to resume other objects.
Virtual Function Tables • A Virtual Function Table Pointer (VFTP) points to a Virtual Function Table, which contains the address of each compiled function (method) of the class.
Key Idea in Object Representation • Each class has multiple virtual function tables, each of which roughly corresponds to a mode (dormant, active, and waiting) of an object. • When an object is in dormant mode, its Virtual Function Table Pointer (VFTP) points to the table that contains the addresses of the method bodies. • When an object is active, the VFTP points to a virtual function table that holds tiny queueing procedures, which simply allocate a frame, store the message into the frame, and enqueue it on the object’s message queue.
Benefits of Multiple Virtual Function Tables • With multiple virtual function tables, a sender object does not have to do a runtime check of whether or not the receiver object is dormant. • Instead this check is built into the virtual function table look-up, which is already a necessary cost in object-oriented programming languages.
Benefits of Multiple Virtual Function Tables (cont.) • Can be used to implement selective message reception where acceptable messages trigger functions that restore the context of the object, and unacceptable messages trigger queueing procedures. • Can also be used to initialize an object’s state variables, by creating a table that points to initialization functions that initialize variables before calling a method body.
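The table-per-mode idea can be sketched in C++ with a hand-rolled function table (a real compiler would emit genuine virtual function tables); all names are illustrative. The point is that a send is just an indirect call through the object's current table, so no explicit "is the receiver dormant?" branch is ever executed.

```cpp
#include <deque>

struct Obj;
using Entry = void (*)(Obj&, int);

struct VTable { Entry handle_msg; };      // one entry per method; one method here

extern const VTable dormant_table;
extern const VTable active_table;

struct Obj {
    const VTable* vftp = &dormant_table;  // mode is encoded in the table pointer
    std::deque<int> queue;                // message queue, used only while active
    int executed = 0;
};

void run_body(Obj& o, int msg) {          // dormant-mode entry: run on the stack
    (void)msg;
    o.vftp = &active_table;               // now active: later sends get queued
    ++o.executed;                         // the compiled method body runs here
    while (!o.queue.empty()) {            // process messages buffered meanwhile
        o.queue.pop_front();
        ++o.executed;
    }
    o.vftp = &dormant_table;              // dormant again
}

void enqueue_msg(Obj& o, int msg) {       // active-mode entry: tiny queueing proc
    o.queue.push_back(msg);
}

const VTable dormant_table{run_body};
const VTable active_table{enqueue_msg};

// The sender performs an ordinary indirect call; which procedure runs is
// decided entirely by which table the receiver's VFTP currently points at.
void send(Obj& o, int msg) { o.vftp->handle_msg(o, msg); }
```

Selective reception and lazy initialization fit the same pattern: they are just additional tables whose entries restore a saved context or initialize state variables before falling through to a method body.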
Combining the Stack with the Scheduling Queue • When a method is invoked on a dormant object, an activation frame is allocated on the stack, thereby achieving fast frame allocation/deallocation. • If the invocation blocks partway through the method, it allocates another frame on the heap and saves its context into that frame, which survives until the method terminates. • The scheduling queue is used to schedule preempted objects that saved their context into a heap-allocated frame, or to invoke messages that were buffered in a message queue.
Inter-node Software Architecture • Important for message passing between objects on different nodes, and object creation on a remote node. • Assumes the hardware (or message passing libraries) provides an interface to send and receive messages asynchronously. • Uses an Active Message-like mechanism, where each message attaches its own self-dispatching message handler, which is invoked immediately after the delivery of the message.
Customized Message Handlers • Providing a customized message handler for each kind of remote message allows the system to achieve low overhead remote task dispatching. • Message handlers are classified into the following categories: • Normal message transmission between objects • Request for remote object creation • Reply to remote memory allocation request • Other services such as load balancing, garbage collection, etc.
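The self-dispatching handler idea can be sketched as follows; `Message`, `on_arrival`, and the handler names are illustrative assumptions, not the paper's actual interface. The receive loop needs no per-message-type switch, because each message carries the address of the code that should process it.

```cpp
#include <vector>

// Active-message-style sketch: the sender attaches the handler; the receiving
// node invokes it immediately upon delivery.
struct Message;
using Handler = void (*)(const Message&);

struct Message {
    Handler handler;                    // self-dispatching handler
    std::vector<int> payload;
};

int delivered = 0;                      // stand-in for real handler side effects

// Example handler for the "normal message transmission" category.
void object_send_handler(const Message& m) {
    delivered += static_cast<int>(m.payload.size());
}

// The receive path is a single indirect call, not a dispatch switch.
void on_arrival(const Message& m) { m.handler(m); }
```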
Remote Object Creation • A mail address of an object is represented as <processor number, pointer>. • This provides maximum performance for local object access, and avoids the overhead of export table management. • Object creation on a remote node requires a memory allocation on the remote node to generate a remote mail address.
Remote Object Creation (cont.) • Since the latency of remote communication is unpredictable, and the cost of context switching is high, it is unacceptable to wait for the remote node to allocate memory and return a pointer. • Therefore the system uses a prefetch scheme, where each node manages predelivered stocks of addresses of memory chunks on remote nodes, and these addresses are used for remote object allocation. • A node only has to wait for a remote address to be allocated if its local stock is empty.
Typical Remote Object Creation Sequence • The requester node obtains a unique mail address locally from the stock. • It sends a creation request message to the node specified by the mail address. • The target node performs class-specific initialization (such as initialization of the virtual function table) of the created object upon receipt of the creation message. • The target node allocates a replacement chunk of memory, and returns its address to the requester node. • The requester replenishes its stock upon receipt of the replacement address.
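The requester side of this sequence can be sketched in C++; `RemoteNode` stands in for the target node, its calls model messages that would really be asynchronous, and the names and chunk size are illustrative assumptions.

```cpp
#include <deque>

struct RemoteNode {
    int next_addr = 0x1000;
    int allocate_chunk() {              // memory allocation on the target node
        int a = next_addr;
        next_addr += 64;                // 64-byte chunks, chosen arbitrarily
        return a;
    }
};

struct AddressStock {
    RemoteNode* remote;
    std::deque<int> stock;              // predelivered addresses of remote chunks

    void refill(int n) {                // replenish the stock from the remote node
        for (int i = 0; i < n; ++i) stock.push_back(remote->allocate_chunk());
    }

    // Step 1: obtain a remote mail address locally from the stock.
    // We only wait on the remote node when the local stock is empty.
    int obtain_address() {
        if (stock.empty()) refill(1);   // slow path: round trip to the target
        int addr = stock.front();
        stock.pop_front();
        refill(1);                      // steps 4-5: replacement chunk (modeled
        return addr;                    // synchronously; really asynchronous)
    }
};
```

Because each `obtain_address` also requests a replacement, the stock level stays constant in the steady state, and remote creation usually costs no round-trip latency at all.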
Comparison of Send/Reply Latency • Send and reply latency for ABCL/onAP1000, running on a conventional multicomputer, is only about 4 times that of ABCL/onEM4 and about 2 times that of CST, both of which run on fine-grain machines.
Benchmark Statistics • To evaluate these techniques on real applications, the authors measured the performance of the N-queen exhaustive search algorithm for N = 8 and N = 13. • They compared these results to the results of running the same programs on a single-CPU SPARCstation 1+, which uses the same CPU that is used in the AP1000.
The Effect of Stack-based Scheduling • To demonstrate the effect of stack-based scheduling, they compared the performance of the N-queen program using stack-based scheduling, to its performance using a naïve scheduling mechanism that always buffers a message in the message queue of the receiver object, and schedules the object through the scheduling queue. • In these programs, approximately 75% of local messages are sent to dormant objects. • In general, they observed a speedup of approximately 30%.
Conclusions • The authors proposed a software architecture for concurrent OOPLs on conventional multicomputers that can compete with implementations on special-purpose, fine-grain architectures. • Their stack-based intra-node scheduling mechanism significantly reduces the average cost of intra-node method invocation. • Their Active Message-like messages, and address prefetch scheme minimize the cost of inter-node message passing, and remote object creation.
Discussion • The eternal question: How does this apply to sensor networks? • Low instruction count for intra-node scheduling • Power efficient remote object creation cuts down on communication
Flaws • Security problems with active messages: since each message carries its own handler, a user can run any code they desire on the receiving node. • Scalability of the address prefetching scheme: with thousands of nodes, maintaining per-node stocks requires a lot of inter-node communication, and the pre-allocated memory becomes a scarce commodity.