4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS’05).
QQ: Nanoscale Timing and Profiling
James Frye †*, James G. King †*, Christine J. Wilson *◊, Frederick C. Harris, Jr. †*
† Department of Computer Science and Engineering
* Brain Computation Lab
◊ Biomedical Engineering
University of Nevada, Reno, NV 89557
2005 IPDPS Conference: 19th IEEE International Parallel & Distributed Processing Symposium
What is QQ • QQ is a simple and efficient tool for measuring timing and memory use • Developed for the examination of a massively parallel program • Easily extensible to inspect other programs
QQ Development • QQ was developed to optimize a parallel program used to simulate cortical neurons – NeoCortical Simulator (NCS) • Our goal for the summer of 2002 was to simulate 10^6 neurons with 10^9 synapses within a realistic run time • Before optimization, NCS would run about 1.5 million synapses at a rate of 1 day per simulated second of synaptic activity • Clearly optimization of NCS was needed
NeoCortical Simulator (NCS) • Originated in the Brain Computation Lab led by Dr. Phil Goodman • Incorporates membrane dynamics • Utilizes simulated ion channels to modulate the membrane voltage changes (when applied) • Compartment based simulator • Allows for channel dynamics to drive the membrane voltage
NCS Biology • Neuron – a brain cell and the basic unit or compartment • Synapse – the region of communication between compartments • Channel – openings in the cellular membrane that allow the passage of various ions to induce a voltage gradient across the membrane • Action Potential – an electrical signal that translates to a chemical signal to the post-synaptic cell
NCS Biology • The membrane voltage determines the cell’s firing rate • Once the threshold voltage is reached, the cell sends an action potential to its connected synapses [Figure: Action Potential; membrane voltage (mV, from -45 to 30) versus Time (ms)]
2-Cell Model [Figure: voltage traces for the Pre-Synaptic Cell and the Post-Synaptic Cell; scale bar 0.2 mV; Time (ms) from 0 to 500]
No Channels Sustained firing at maximum rate during a continuous stimulus
Ka Channel Slows the initial response during a sustained stimulus
Km Channel Prevents continuous bursting during a continuous stimulus
Kahp Channel Dampens the effect while still allowing for some action potentials during a sustained stimulus
QQ Design • QQ is designed so that all of its routines can be selectively compiled into a program • In the QQ.h header file, each routine is defined with a preprocessor directive, so that if profiling is not enabled, the call reduces to an empty statement:
    #ifdef QQ_ENABLE
    void QQInit (int);
    #else
    #define QQInit(dummy)
    #endif
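The compile-out pattern described on this slide can be sketched as follows. This is a minimal illustration, not the actual QQ.h: the body of QQInit, the counter qq_init_calls, and the function run_simulation are hypothetical names introduced here.

```c
#include <assert.h>

/* Hypothetical stand-in for QQ.h. Build with -DQQ_ENABLE to compile
   the profiling call in; without it the call disappears entirely. */
#ifdef QQ_ENABLE
int qq_init_calls = 0;                  /* illustrative counter only */
void QQInit(int nodes) { (void)nodes; qq_init_calls++; }
#else
#define QQInit(dummy) ((void)0)         /* expands to an empty statement */
#endif

/* Application code calls QQInit unconditionally in both builds. */
int run_simulation(int nodes)
{
    assert(nodes > 0);
    QQInit(nodes);
    return nodes * 2;                   /* placeholder for the real work */
}
```

Note that there must be no space between the macro name and its parameter list; otherwise the preprocessor treats QQInit as an object-like macro. In the disabled build the macro argument is never evaluated, so arguments with side effects should be avoided.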
QQ Design • Memory profiling routines also use the C preprocessor to intercept library calls:
    #ifdef QQ_ENABLE
    #define malloc(arg) MemMalloc (MEM_KEY, arg)
    #endif
• The MemMalloc function records allocation information, calls the malloc function to do the actual allocation, and returns the result to the caller
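A wrapper of the kind described here might look like the sketch below. The slides do not show how MemMalloc stores its records, so the MemStats table, the key enumeration, and make_channel_table are assumptions made for illustration. Redefining a standard library name as a macro is not sanctioned by the C standard, although it is a common profiling trick and works with mainstream compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { KEY_CHANNEL, KEY_SYNAPSE, KEY_COUNT };   /* hypothetical keys */

typedef struct {
    size_t calls;   /* number of allocation calls recorded */
    size_t bytes;   /* total bytes requested */
} MemStats;

MemStats mem_stats[KEY_COUNT];

/* Record the request under `key`, then defer to the real malloc.
   This function is defined before the macro, so it sees the real one. */
void *MemMalloc(int key, size_t size)
{
    assert(key >= 0 && key < KEY_COUNT);
    void *p = malloc(size);
    if (p != NULL) {
        mem_stats[key].calls++;
        mem_stats[key].bytes += size;
    }
    return p;
}

/* From here on, plain malloc calls are charged to MEM_KEY. */
#define MEM_KEY KEY_CHANNEL
#define malloc(arg) MemMalloc(MEM_KEY, (arg))

void *make_channel_table(size_t n)
{
    return malloc(n * sizeof(double));   /* expands to MemMalloc(...) */
}

#undef malloc   /* stop intercepting after the profiled code */
```

Because the interception is purely textual, only calls compiled after the #define are recorded; allocations made inside precompiled libraries are not, which is consistent with the per-code-block accounting described on the QQ Memory slide.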
QQ Timing • Extremely accurate measurement of execution speed. • In theory fine-grained resolution to a single clock cycle. • In practice, measurements are accurate to tens of cycles
Timing Measurements • Measuring the impact of a one-line change in the calculation for the Km channel • From: I = unitaryG * strength * pow (m, mPower) * (ReversePot - CmpV); • To: I = unitaryG * strength * (ReversePot - CmpV); • For the Km-type channel, mPower is always 1, so we were able to change the equation to streamline the execution • Wrapping the line in calls to QQ, we measure the effect of this single change: QQStateOn (QQ_Km); I = unitaryG * strength * (ReversePot - CmpV); QQStateOff (QQ_Km);
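The slides do not say how QQ obtains its cycle counts. As a rough sketch of cycle-level timing around a single statement, the code below assumes an x86 build with GCC or Clang and reads the time stamp counter through the __rdtsc intrinsic; on other targets it falls back to the much coarser standard clock() function. The names read_cycles and time_km_current are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Assumption: the x86 time stamp counter stands in for QQ's clock. */
static uint64_t read_cycles(void) { return (uint64_t)__rdtsc(); }
#else
#include <time.h>
/* Coarse portable fallback: processor-time ticks, not CPU cycles. */
static uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

/* Time one evaluation of the simplified Km current. The two reads play
   the role of QQStateOn (QQ_Km) and QQStateOff (QQ_Km). */
uint64_t time_km_current(double unitaryG, double strength,
                         double ReversePot, double CmpV, double *I)
{
    assert(I != NULL);
    uint64_t start = read_cycles();
    *I = unitaryG * strength * (ReversePot - CmpV);
    uint64_t stop = read_cycles();
    return stop - start;
}
```

A measurement this short includes the cost of reading the counter itself, and the compiler or processor may reorder work around the reads, which is one reason results are accurate to tens of cycles rather than to a single cycle.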
Timing Measurements • Both code versions give similar cycle counts on the two processors, though the counts are more consistent and somewhat lower on the P4 than on the PIII • For similar cycle counts, elapsed time is inversely proportional to processor clock speed, as expected • The pow function call pays a heavy penalty on its first call: in this program it is only called by the Km channel code, so that time represents the first load of the function's code into cache
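The relation between cycle counts and elapsed time can be written out explicitly. Here N is a hypothetical cycle count, not a measured value, and f is the clock frequency of each processor as given on the following slides (800 MHz and 2200 MHz):

```latex
t = \frac{N}{f},
\qquad
\frac{t_{\mathrm{PIII}}}{t_{\mathrm{P4}}}
  = \frac{N / 800\,\mathrm{MHz}}{N / 2200\,\mathrm{MHz}}
  = \frac{2200}{800}
  = 2.75
```

So an equal cycle count should take about 2.75 times longer on the PIII than on the P4.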
Timing Measurements PIII – 800 MHz
Timing Measurements P4 – 2200MHz
Expanding Timing Information • QQ allows the user to record an additional item of information with the normal timing. • QQCount records an integer with the key • QQCount( eventKey, integer_of_interest ); • QQValue records a double precision floating point value with the key • QQValue( eventKey, double_of_interest ); • QQState records a state of on or off with the key • QQStateOn( eventKey ); QQStateOff( eventKey ); • These will be described during discussion of the output format
QQ Memory • Records memory allocation dedicated to the code-block, rather than the total allocation due to code and library calls, to single-byte accuracy
QQ Memory Example • NCS implementation of ion channels • Suppose we want to know the total memory used by all channels. Each channel function would require the channel key: #define MEM_KEY KEY_CHANNEL • Then at any point in the program execution, just call the MemPrint function to display memory use
Memory Usage Output
Memory Allocation: Total Allocated = 988 KBytes

Item          Object   Number    Number    Object   Alloc   Total   Max
              Size     Created   Deleted   KB       KB      KB      KB
Brain            120         1         0        1       0       1     1
CellManager       44         1         0        1       1       1     1
Cell              16       100         0        2       0       2     2
Channel          252       300         0       74       0      74    74
Compartment      324       100         0       32       2      33    33
MessageMgr        16         1         0        1     205     205   205
MessageBus         0         0         0        0       1       1     1
Report            80         1         0        1       1       1     1
Stimulus         252         1         0        1       1       1     1
Synapse           44     10000         0      430     118     547   547
-----------------------------------------------------------------------
(1)              (2)       (3)       (4)      (5)     (6)     (7)   (8)

Key
1 - Internal name given to recording category
2 - The size of the object being allocated; it is valid only if all allocations are the same size, as with "new Object"
3 - Number of allocation calls made: new, malloc, calloc, etc.
4 - Number of free or delete calls made
5 - KBytes allocated via object creation (new)
6 - KBytes allocated via *alloc calls
7 - Total memory currently allocated
8 - Max memory ever allocated = high-water mark
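Columns 3, 4, 7, and 8 of the report can be maintained with a few counters per key. The sketch below shows one way to track a running total and its high-water mark; the MemRecord type and the record_alloc and record_free functions are hypothetical, and it assumes the size of each freed block is known to the caller.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t created;   /* column 3: allocation calls */
    size_t deleted;   /* column 4: free or delete calls */
    size_t total;     /* column 7: bytes currently allocated */
    size_t max;       /* column 8: high-water mark in bytes */
} MemRecord;

/* Charge an allocation and raise the high-water mark if needed. */
void record_alloc(MemRecord *r, size_t bytes)
{
    r->created++;
    r->total += bytes;
    if (r->total > r->max)
        r->max = r->total;
}

/* Credit a release; the high-water mark never decreases. */
void record_free(MemRecord *r, size_t bytes)
{
    assert(bytes <= r->total);
    r->deleted++;
    r->total -= bytes;
}
```

In the report above, the Total and Max columns are equal for every row because nothing had been deleted when the snapshot was taken.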
QQ Applications • Brain Communication Server (BCS) • NCS
Brain Communication Server • Further experimentation with the simulator required another application be developed to coordinate communication between NCS and numerous potential clients: • virtual creatures • physical robots • visualization tools NCS BCS
Optimizing BCS Different applications make non-sequential requests. No single function was called in a loop with many iterations, so time had to be measured over the whole course of execution. QQ’s final output was then analyzed.
Parsing QQ’s output • QQ uses a straightforward layout for the final output file • The data can be easily extracted and displayed in a text report as shown on the previous slide or sent to a graphical display • The following slides describe the output format and how to manage the information
QQ file format • Header: Number of Keys (int), Key Name string length (int) • Key Table: for each Key – Key ID (int), Key type (int), Key name (char *) • Node Information: Number of nodes (int) • Node Table: for each Node – Byte offset to data (size_t), Number of entries (int), Starting Base Time (unsigned long long), MHz (double) • Data: for each Node, for each entry – item (QQItem)
QQ Format – Data Close Up • Previous Sections come first, followed by the Data section • Each node's data begins at that node's byte offset (Node 0 byte offset, Node 1 byte offset, Node 2 byte offset) • Node 0 – for each entry: Key (int), [Optional Info], Event Time (unsigned long long) • Node 1 – for each entry: Key (int), [Optional Info], Event Time (unsigned long long) • Node 2 – for each entry: … • Optional Info is the size of a double, but contains a State (int), a Count (int), or a Value (double)
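One natural way to get an Optional Info field that is the size of a double but holds an int or a double is a C union. The sketch below is a guess at such a layout, not the real QQItem definition: the type names QQItemSketch and NodeEntrySketch, the field order, and the padding are all assumptions. It also shows how an entry could be located from a node's byte offset within an in-memory image of the file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entry layout: key, optional info, event time. */
typedef struct {
    int key;
    union { int state; int count; double value; } info;  /* Optional Info */
    unsigned long long time;
} QQItemSketch;

/* Hypothetical node table row, mirroring the fields on the format slide. */
typedef struct {
    size_t offset;                  /* byte offset to this node's data */
    int entries;                    /* number of entries */
    unsigned long long base_time;   /* starting base time */
    double mhz;                     /* clock rate */
} NodeEntrySketch;

/* Copy entry i of a node out of an in-memory image of the file. */
QQItemSketch read_item(const unsigned char *image,
                       const NodeEntrySketch *node, int i)
{
    QQItemSketch item;
    assert(i >= 0 && i < node->entries);
    memcpy(&item, image + node->offset + (size_t)i * sizeof item,
           sizeof item);
    return item;
}
```

Using memcpy rather than a pointer cast avoids alignment problems when the byte offset is not a multiple of the structure's alignment.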
Gathering the Results • After reading a node’s data section, entries with the same key can be gathered • Using the key table, the user knows what is contained in the second block of a timing entry • Example entries: "2 1 109342759" and "2 0 109342768" • Key 2 has type “State”, so the second block contains integer 1 for “on” or integer 0 for “off” • By subtracting the event times, the length of time spent in the “on” state is determined
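The gathering step for a State key can be sketched as a single pass that pairs each "on" entry with the next "off" entry for the same key. The StateEvent type and the functions cycles_on and cycles_to_us are hypothetical, and the sketch assumes event times are cycle counts that can be converted with the MHz value from the node table.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int key;
    int state;                  /* 1 = on, 0 = off */
    unsigned long long time;    /* event time, assumed to be in cycles */
} StateEvent;

/* Sum the cycles during which `key` was in the "on" state. */
unsigned long long cycles_on(const StateEvent *ev, size_t n, int key)
{
    unsigned long long total = 0, on_time = 0;
    int is_on = 0;
    for (size_t i = 0; i < n; i++) {
        if (ev[i].key != key)
            continue;
        if (ev[i].state == 1 && !is_on) {
            on_time = ev[i].time;
            is_on = 1;
        } else if (ev[i].state == 0 && is_on) {
            assert(ev[i].time >= on_time);
            total += ev[i].time - on_time;
            is_on = 0;
        }
    }
    return total;
}

/* Convert cycles to microseconds given a clock rate in MHz. */
double cycles_to_us(unsigned long long cycles, double mhz)
{
    return (double)cycles / mhz;
}
```

For the two example entries on this slide, the result is 109342768 - 109342759 = 9 time units in the "on" state.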
Another example • Example entries: "4 -65.3477 109342735" and "4 -58.2367 109342819" • Key 4 has type “Value”, so the second block contains a double precision value passed in during execution • The value can be saved and displayed with timing information, or sent to a separate graph • Timing is obtained the same as before, by subtracting the event times
NCS Performance Measurement • QQ was able to home in on specific blocks of code and measure them at a resolution fine enough for easy interpretation
Optimization Targets • QQ analysis quickly identified two major targets within the code • Synapses • Message Passing
Synapses • Synapses were by far the most common element of any NCS model and accounted for the most memory usage • Active only when an action potential was processed through the synapse • Pass information between the nodes via message passing
Message Parsing Overhead • Using QQ we were able to identify areas for improvement within NCS 3 • Many unneeded fields requiring better encoding of their destination • Fixed number of messages pre-allocated, far more than needed by the program • Implemented a shared pool, buffers allocated as needed • Messages sent individually, processed multiple times • Implemented a packet scheme: process packet once for send, once for receive • Process messages only when used
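The shared pool with buffers allocated as needed can be illustrated with a simple free list. This is a generic sketch of the technique, not the NCS implementation: MsgBuf, MsgPool, the 64-byte payload size, and the pool_* functions are all hypothetical, and no thread safety is attempted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct MsgBuf {
    struct MsgBuf *next;
    unsigned char data[64];     /* payload size chosen arbitrarily */
} MsgBuf;

typedef struct {
    MsgBuf *free_list;          /* returned buffers awaiting reuse */
    size_t allocated;           /* buffers obtained from malloc so far */
} MsgPool;

/* Reuse a returned buffer if one exists; otherwise allocate a new one. */
MsgBuf *pool_get(MsgPool *p)
{
    MsgBuf *b = p->free_list;
    if (b != NULL) {
        p->free_list = b->next;
    } else {
        b = malloc(sizeof *b);
        if (b != NULL)
            p->allocated++;
    }
    if (b != NULL)
        b->next = NULL;
    return b;
}

/* Hand a buffer back to the pool for later reuse. */
void pool_put(MsgPool *p, MsgBuf *b)
{
    assert(b != NULL);
    b->next = p->free_list;
    p->free_list = b;
}

/* Release every buffer currently held on the free list. */
void pool_destroy(MsgPool *p)
{
    while (p->free_list != NULL) {
        MsgBuf *b = p->free_list;
        p->free_list = b->next;
        free(b);
    }
    p->allocated = 0;
}
```

Compared with pre-allocating a fixed number of messages, the pool only grows to the peak number of buffers in flight at once.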
Conclusions • QQ allows profiling of nanoscale timing of code segments and memory usage analysis • Fine-grained measurements of specific events • Ability to measure memory at an object or event level with a small memory and performance footprint • Simple and effective tool
Future Work • New Opteron cluster • BlueGene migration (how many processors?) • Robotic integration