Philip Mucci, Research Consultant Innovative Computing Laboratory/UTK mucci@cs.utk.edu. PAPI 3.0. In cooperation with: The PAPI team at ICL, UTK Nils Smeds and Per Eckman at PDC, KTH, Sweden. These slides are incomplete. Not all information is present.
These slides are incomplete. Not all information is present. Not all information is consistent. Use at your own risk. THIS IS A DRAFT PRESENTATION
Man Hours • We have roughly 1000-1100 man hours left before a release has to be cut. • We have >= 380 man hours for the must-haves, NOT including porting the substrates. • We have >= 350 man hours for the may-includes. • This leaves <= 370 hours to port substrates and get other APIs nailed down, which is not a lot of time. • Thus we need to prioritize: when not working on a substrate, grab one of the must-have projects first, then work on the may-haves.
PAPI 3.0 Low Level API Added: int PAPIf_encode_native_event(char *); int PAPI_encode_native_event(...); int PAPI_get_hwctr_map(int EventSet, unsigned *which); int PAPI_register_thread(void); int PAPI_freeze(void); int PAPI_thaw(void); int PAPI_num_events(int EventSet); int PAPI_get_thr_specific(int tag, void **ptr); int PAPI_set_thr_specific(int tag, void *ptr); Removed: int PAPI_describe_event(char *name, int *EventCode, char *description); int PAPI_label_event(int EventCode, char *label); int PAPI_query_event_verbose(int EventCode, PAPI_preset_info_t *info); const PAPI_preset_info_t *PAPI_query_all_events_verbose(void); const PAPI_mem_info_t *PAPI_get_memory_info(void); int PAPI_add_pevent(int *EventSet, int code, void *inout); long PAPI_get_dmem_info(int option);
PAPI 3.0 Changed Calls Changed: int PAPI_add_event(int EventSet, int Event); int PAPI_add_events(int EventSet, int *Events, int number); int PAPI_create_eventset(int *EventSet); int PAPI_destroy_eventset(int *EventSet); int PAPI_rem_event(int EventSet, int Event); int PAPI_rem_events(int EventSet, int *Events, int number); int PAPI_state(int EventSet); typedef void (*PAPI_overflow_handler_t)(int EventSet, unsigned *which, void *context);
PAPI 3.0 Definitions Definitions: #define PAPI_SHORT_STR_LEN 80 #define PAPI_LONG_STR_LEN LINE_MAX #define PAPI_MAX_STR_LEN PAPI_LONG_STR_LEN #define PAPI_HUGE_STR_LEN PATH_MAX Macros: #define LOOKUP_THREAD() #define LOCK(location) #define UNLOCK(location) #define PAPI_DEBUG(level,message) #define PAPI_IS_PRESET(event) #define PAPI_IS_NATIVE(event)
PAPI 3.0 High Level API Changed: int PAPI_mflips(float *rtime, float *ptime, long long *flpins, float *mflips); int PAPI_ipc(float *rtime, float *ptime, long long *ins, float *ipc); Removed: int PAPI_num_counters(void); # Unclear and redundant Added: int PAPI_high_init(int flags); # MULTIPLEX, THREADS PROBLEM: We've buried the initialization. We need to be able to allow Multiplexing and Thread Support at the high level.
Items Completed • PAPI_lock/PAPI_unlock • PAPI_NATIVE place holder
PAPI Presets • Add PAPI_NATIVE as placeholder for native/programmable events.
Locking int PAPI_lock(); int PAPI_unlock(); • Yikes! Currently, the internal locks conflict with the user locks! • Internally, now individual components/data structures are locked when necessary. • Externally, there is a global lock. • Internally these are implemented as MACROS. • These will not exist in Fortran. New Interface: void PAPI_lock(int lock); void PAPI_unlock(int lock);
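A minimal sketch of the indexed user-lock idea, assuming POSIX threads. The names sketch_lock, sketch_unlock, sketch_add and the two lock indices are hypothetical; the slide only fixes the shape void PAPI_lock(int lock). Keeping user locks in their own array is one way to stop them from colliding with internal locks.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical lock indices; the slide only says the argument is an int. */
enum { SKETCH_USR1_LOCK = 0, SKETCH_USR2_LOCK = 1, SKETCH_NUM_LOCKS = 2 };

/* User-visible locks live in their own array, separate from any internal locks. */
static pthread_mutex_t sketch_locks[SKETCH_NUM_LOCKS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

void sketch_lock(int lock)
{
    assert(lock >= 0 && lock < SKETCH_NUM_LOCKS);
    pthread_mutex_lock(&sketch_locks[lock]);
}

void sketch_unlock(int lock)
{
    assert(lock >= 0 && lock < SKETCH_NUM_LOCKS);
    pthread_mutex_unlock(&sketch_locks[lock]);
}

/* Example of user data guarded by one of the user locks. */
static long sketch_shared_total = 0;

long sketch_add(long n)
{
    long result;
    sketch_lock(SKETCH_USR1_LOCK);
    sketch_shared_total += n;
    result = sketch_shared_total;
    sketch_unlock(SKETCH_USR1_LOCK);
    return result;
}
```

Each thread would call sketch_add() with its own contribution; only holders of SKETCH_USR1_LOCK serialize against each other.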
Features that must be included in SC2003 Release • Cray X1 release (2 Weeks) • Webpage and documentation update/overhaul (>= 2 Weeks) • PAPI_is_initialized (<1 day) • Rename of #define get/set_opt - (<1 day) • High-level interface changes implemented – (>= 1 week) • Interface to list native events – (Already done?) • Interface to describe/list presets – (1-2 days) • Re-define PAPI_preset names, only add new events as time permits (1-2 days – have to update all substrates) • PAPI Debug support – Calls should be ported as we are porting substrates (1-4 days) • Changes to Mem info, but not the latencies and new info, though the current functionality should be ported to all platforms (1-2 days) • Update Overflow/Profile to support multiple overflows (1-2 Weeks) • Support variable profiling buckets - (<1 day)
PAPI_overflow() Changes • PAPI 3.0 supports overflowing: • On more than one counter. • On native events. • PAPI 3.0 does not support overflowing: • On derived events. • Hardware overflow on derived events has always been broken. We need to check for this and return the appropriate error code.
PAPI Overflow API Changes • Old: typedef void (*PAPI_overflow_handler_t)(int EventSet, int EventCode, int index, long_long *latest, int *threshold, void *context); void *PAPI_get_overflow_address(void *context); • New: typedef void (*PAPI_overflow_handler_t)(int EventSet, void *address, void *context); int PAPI_get_overflow_ctrs(int EventSet, void *context, int *papi_event_indices);
Overflow Handler and the PC • Problem: The address of the overflow is always used to record the PC. • Solution: We no longer require that the user call PAPI_get_overflow_address() to get this information. • It is now done implicitly by the substrate's signal handler in a MACRO.
Overflow Handler and Many Counters • Problem: When multiple counters are overflowing, we need to figure out which counter overflowed. • More than one overflow can happen in a single call to the handler. • Solution: The machine specific bits telling us which counter overflowed are hidden inside the context structure.
Single Counter Example extern ushort *profile_buffer; void my_overflow_handler(int EventSet, void *address, void *context) { unsigned bucket; bucket = my_addr_hash_fn(address); profile_buffer[bucket]++; }
Multiple Counter Example extern ushort **profile_buffers; void my_overflow_handler(int EventSet, void *address, void *context) { int j, bucket, num; int papi_event_idxs[PAPI_MAX_OVERFLOW_CNTRS]; ushort *prof_buf; num = PAPI_get_overflow_ctrs(EventSet, context, papi_event_idxs); for (j=0;j<num;j++) { prof_buf = profile_buffers[papi_event_idxs[j]]; bucket = my_addr_hash_fn(address); prof_buf[bucket]++; } }
PAPI_get_overflow_ctrs() int PAPI_get_overflow_ctrs(int EventSet, void *context, int *papi_event_indices) • Given information in the context structure, output an array of PAPI event locations corresponding to those that overflowed. • THIS MUST BE FAST. Internally, this should be constant time for every counter that overflows.
PAPI_get_overflow_ctrs() int PAPI_get_overflow_ctrs(int EventSet, void *context, int *papi_event_indices) { int i; unsigned bits, papi_index, total = 0; /* All CAPITALIZED functions are implemented as Macros! */ bits = GET_OVERFLOW_CTR_BITS(context); /* while rather than do-while: ffs(0) returns 0, which would lead to a shift by -1 */ while (bits) { i = ffs(bits) - 1; bits ^= (1u << i); papi_index = HASH_OVERFLOW_CTR_BITS_TO_PAPI_INDEX(i); papi_event_indices[total] = papi_index; total++; } return(total); }
Overflow on Linux/Perfctr Context is now an opaque pointer to a structure built on the stack by the substrate's signal handler. Why? Because we need both the siginfo_t and the ucontext structure. See perfctr/examples/signal/signal.c for details. typedef struct { siginfo_t *si; void *ucontext; } papi_hwd_context_t; void actual_sighandler(int sig, siginfo_t *si, void *puc) { papi_hwd_context_t ctx; ctx.si = si; ctx.ucontext = puc; overflow_handler(....,&ctx); }
Overflow MACROS on Linux/Perfctr Platform specific macros. See perfctr/examples/signal/signal.c for details. #define GET_OVERFLOW_CTR_BITS(context) (((papi_hwd_context_t *)context)->si->si_pmc_ovf_mask) #define HASH_OVERFLOW_CTR_BITS_TO_PAPI_INDEX(bit) (_papi_hwi_event_index_map[bit]) The above array is built incrementally by the PAPI_overflow() call. On Linux, the bits provided by the hardware go from 0 to 1 (0 to 3 on AMD) and thus require no adjustment to be used as a DIRECT index into the above array.
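The bit-to-index lookup described above can be tried without any hardware. This is only a sketch: sketch_index_map is a hypothetical stand-in for _papi_hwi_event_index_map, its contents are made up, and the overflow mask is passed in directly instead of being read from a signal context.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* ffs() */

#define SKETCH_MAX_CTRS 4

/* Hypothetical map from hardware counter bit to PAPI event index. */
static const int sketch_index_map[SKETCH_MAX_CTRS] = { 2, 0, 3, 1 };

/* Writes one PAPI event index per set bit in 'bits', lowest bit first.
 * Returns the number of indices written. */
int sketch_overflow_ctrs(unsigned bits, int *papi_event_indices)
{
    int total = 0;

    assert(papi_event_indices != NULL);
    bits &= (1u << SKETCH_MAX_CTRS) - 1u;   /* ignore bits with no table entry */
    while (bits) {
        int i = ffs((int)bits) - 1;         /* position of the lowest set bit */
        bits &= ~(1u << i);                 /* clear it */
        papi_event_indices[total++] = sketch_index_map[i];
    }
    return total;
}
```

The work done is one table lookup per overflowed counter, which matches the constant-time-per-counter requirement stated for PAPI_get_overflow_ctrs().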
PAPI_read() and PAPI_overflow() • Counters set to overflow now have an undefined value. • PAPI_read() returns undetermined values for those counters that have had overflowing enabled. • Users can now call PAPI_read() inside a signal handler if they want the values.
Profiling Extensions • Support profiling: • On more than one counter • On native events • Add 32 and 64 bit buckets. Current SVR4 limit is a 16-bit bucket with a maximum value of 65535. • PAPI_PROFIL_32 • PAPI_PROFIL_64
Profiling Extensions Old: int PAPI_profil(unsigned short *buf, unsigned bufsiz, unsigned long offset, unsigned scale, int EventSet, int EventCode, int threshold, int flags); typedef struct _papi_sprofil { unsigned short *pr_base; /* buffer base */ unsigned pr_size; /* buffer size */ unsigned long pr_off; /* pc offset */ unsigned pr_scale; /* pc scaling */ } PAPI_sprofil_t; New: int PAPI_profil(void *buf, unsigned bufsiz, unsigned long offset, unsigned scale, int EventSet, int EventCode, int threshold, int flags); typedef struct _papi_sprofil { void *pr_base; /* buffer base */ unsigned pr_size; /* buffer size */ unsigned long pr_off; /* pc offset */ unsigned pr_scale; /* pc scaling */ } PAPI_sprofil_t;
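With buf typed as void *, the bucket width has to come from the flags. The sketch below shows one way that dispatch could look. The flag values and the function name are hypothetical, and saturating at the maximum instead of wrapping is a choice made for this example, not something the slides specify.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; the slides name the flags but not their values. */
#define SKETCH_PROFIL_16 0x0
#define SKETCH_PROFIL_32 0x1
#define SKETCH_PROFIL_64 0x2

/* Increments one bucket of the width selected by 'flags'.
 * A full bucket stays at its maximum rather than wrapping to zero.
 * Returns 0 on success, -1 if 'flags' names no known width. */
int sketch_bump_bucket(void *buf, unsigned bucket, int flags)
{
    assert(buf != NULL);
    switch (flags) {
    case SKETCH_PROFIL_16: {
        uint16_t *b = buf;
        if (b[bucket] != UINT16_MAX) b[bucket]++;
        return 0;
    }
    case SKETCH_PROFIL_32: {
        uint32_t *b = buf;
        if (b[bucket] != UINT32_MAX) b[bucket]++;
        return 0;
    }
    case SKETCH_PROFIL_64: {
        uint64_t *b = buf;
        if (b[bucket] != UINT64_MAX) b[bucket]++;
        return 0;
    }
    default:
        return -1;
    }
}
```

A 16-bit bucket sitting at 65535 stays there, while the same count in a 32-bit bucket moves on to 65536, which is the case the wider buckets are meant to cover.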
PAPI profile/sprofile and overflow • Call multiple times to add overflow events. • Passing PAPI_NULL to the profil() or overflow() calls results in the normal, default profil() behavior on the system: we simply set up emulated overflow with an ITIMER_PROF timer, and every time the handler is executed, we record the PC. The context will have no bits set in the part that indicates which hardware counter overflowed. This yields data identical to that produced by gprof, so it can be used to generate gprof data simultaneously with PAPI data. Great for a performance tool. • Add configurable multiplexing interval. Should be a run-time and environment variable option. Value is in Hz. PAPI_MPX_HZ • Add configurable sampling interval. Should be a run-time and environment variable option. Value is in Hz. PAPI_SMPL_HZ • Overflow should use the last handler passed in, and NULL turns off the event as an overflow event.
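A value in Hz has to be turned into a timer period before it can drive ITIMER_PROF. The sketch below shows that conversion. The function names, the default of 100 Hz and the upper bound of 1000000 Hz are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/time.h>

#define SKETCH_DEFAULT_HZ 100L       /* assumed default, not from the slides */
#define SKETCH_MAX_HZ     1000000L   /* one tick per microsecond at most */

/* Parses a string such as "250". Falls back to the default when the
 * string is missing, malformed, or out of range. */
long sketch_parse_hz(const char *text)
{
    char *end;
    long hz;

    if (text == NULL || *text == '\0')
        return SKETCH_DEFAULT_HZ;
    hz = strtol(text, &end, 10);
    if (*end != '\0' || hz <= 0 || hz > SKETCH_MAX_HZ)
        return SKETCH_DEFAULT_HZ;
    return hz;
}

/* Builds a periodic timer specification that fires 'hz' times per second. */
struct itimerval sketch_hz_to_itimer(long hz)
{
    struct itimerval it;
    long usec;

    assert(hz > 0 && hz <= SKETCH_MAX_HZ);
    usec = 1000000L / hz;
    it.it_interval.tv_sec = usec / 1000000L;
    it.it_interval.tv_usec = usec % 1000000L;
    it.it_value = it.it_interval;    /* first expiry after one full period */
    return it;
}
```

A caller could combine the two as sketch_hz_to_itimer(sketch_parse_hz(getenv("PAPI_SMPL_HZ"))) and hand the result to setitimer(ITIMER_PROF, &it, NULL).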
PAPI_mem_info_t • PAPI_mem_info_t must change to hold 3 levels each for TLB and cache, each as an array with a length. • Each entry also has: (Not implemented yet) • Latency • Minimum cycles • Maximum cycles
Cache/TLB Information • Provides information on up to 3 levels of TLB and cache architecture. • Return structure will include our best effort at finding minimum and maximum latencies from the architecture manual. • Remove PAPI_get_memory_info() • Values will be returned from PAPI_get_hw_info()
PAPI Debug Support • Currently debugging messages are binary. • On/Off • Debugging messages should be a little like syslog for different PAPI layers. • Threads • Multiplexing • Overflowing • High level • Low level • Substrate • Substrate counter values
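One way to read "a little like syslog for different PAPI layers" is a bitmask with one bit per layer. The sketch below follows that reading. All names and bit values are hypothetical, and the real PAPI_DEBUG(level,message) macro may be shaped quite differently.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-layer bits, one for each layer listed on the slide. */
#define SKETCH_DBG_THREADS    0x01u
#define SKETCH_DBG_MULTIPLEX  0x02u
#define SKETCH_DBG_OVERFLOW   0x04u
#define SKETCH_DBG_HIGHLEVEL  0x08u
#define SKETCH_DBG_LOWLEVEL   0x10u
#define SKETCH_DBG_SUBSTRATE  0x20u
#define SKETCH_DBG_COUNTERS   0x40u   /* substrate counter values */

static unsigned sketch_debug_mask = 0;   /* everything off by default */

void sketch_set_debug(unsigned mask)
{
    sketch_debug_mask = mask;
}

/* Prints to stderr only if the message's layer is enabled.
 * Returns 1 if the message was printed, 0 if it was filtered out. */
int sketch_debug(unsigned layer, const char *fmt, ...)
{
    va_list ap;

    assert(fmt != NULL);
    if ((sketch_debug_mask & layer) == 0)
        return 0;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    return 1;
}
```

Enabling SKETCH_DBG_OVERFLOW | SKETCH_DBG_SUBSTRATE would show overflow and substrate messages while keeping thread and multiplexing chatter quiet.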
Preset Events • Remove all rate events: • Rates break PAPI calling semantics, • Rates need to be floating point numbers. • PAPI_FLOPS: OPS are defined differently. • PAPI_IPS: No one uses IPS as a metric.
Preset Events • Change TLB events to: • PAPI_L1_ITLB, PAPI_L1_DTLB, PAPI_L1_TTLB • Same for L2 and L3 • Add STALL events: • PAPI_FE_STL: Front end stall • PAPI_BE_STL: Back end stall • PAPI_MEM_STL
PAPI 3 Event Description API The next dozen slides describe changes to the API calls that describe events to the user. These changes are intended to provide a simplified common interface for describing both preset and native events.
Event Description API Old API int PAPI_describe_event(char *name, int *EventCode, char *description) int PAPI_query_event(int EventCode) int PAPI_query_event_verbose(int EventCode, PAPI_preset_info_t *info) const PAPI_preset_info_t *PAPI_query_all_events_verbose(void) int PAPI_label_event(int EventCode, char *label) int PAPI_event_code_to_name(int EventCode, char *out) int PAPI_event_name_to_code(char *in, int *out) New API int PAPI_query_event(int EventCode) int PAPI_get_event_info(PAPI_event_info_t *event_info) int PAPI_event_code_to_name(int EventCode, char *out) int PAPI_event_name_to_code(char *in, int *EventCode) int PAPI_enum_event(int *EventCode, int modifier)
API Philosophy 101 • K.I.S.S. – keep it simple, stupid • Avoid ‘convenience’ APIs • Fewer APIs == • Less to document • Less code • Less to break • Symmetry • GETs imply SETs • If it works for case A, see if it can work for case B(e.g. presets and native) • Break any of these rules to make things simpler
New Event Description Goals • Operate symmetrically on preset and native event tables • Separate exposed user structure from internal data structures • Eliminate ‘query all’ functions to decouple from internal structures and allow calling code to manage memory • Add an ‘enum’ function to scan valid event table entries
PAPI 3 Event Description API /* Returns event existence status */ int PAPI_query_event(int EventCode) /* Returns structure containing human-readable info for an EventCode */ int PAPI_get_event_info(int EventCode, PAPI_event_info_t *event_info) /* Returns name of given event code */ int PAPI_event_code_to_name(int EventCode, char *name) /* Returns event code for given name */ int PAPI_event_name_to_code(char *name, int *EventCode) /* Updates EventCode to next valid value, or returns error; modifier can specify {all / available} for presets, or other values for native tables and may be platform specific (Major groups / all mask bits; P / M / E chip, etc) */ int PAPI_enum_event(int *EventCode, int modifier)
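The "update EventCode to the next valid value, or return an error" contract is easiest to see against a small table. The sketch below imitates it with a mock event table. The table contents, event names, return codes and modifier values are all hypothetical and are not the real PAPI definitions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_OK          0
#define SKETCH_ENOEVNT   (-1)
#define SKETCH_ENUM_ALL    0   /* step through every table entry */
#define SKETCH_ENUM_AVAIL  1   /* skip entries not available on this platform */

typedef struct {
    int code;
    int avail;
    const char *name;
} sketch_event_t;

/* Mock event table; the middle entry is marked unavailable. */
static const sketch_event_t sketch_table[] = {
    { 0x10, 1, "SKETCH_TOT_CYC" },
    { 0x11, 0, "SKETCH_L3_TCM"  },
    { 0x12, 1, "SKETCH_TOT_INS" }
};
#define SKETCH_TABLE_LEN (sizeof sketch_table / sizeof sketch_table[0])

/* Advances *code to the next entry accepted by 'modifier'.
 * On failure *code is left unchanged and SKETCH_ENOEVNT is returned. */
int sketch_enum_event(int *code, int modifier)
{
    size_t i;
    int seen = 0;

    assert(code != NULL);
    for (i = 0; i < SKETCH_TABLE_LEN; i++) {
        if (seen && (modifier == SKETCH_ENUM_ALL || sketch_table[i].avail)) {
            *code = sketch_table[i].code;
            return SKETCH_OK;
        }
        if (sketch_table[i].code == *code)
            seen = 1;
    }
    return SKETCH_ENOEVNT;
}

/* Typical calling pattern: start from a known code and loop until error. */
int sketch_count_events(int first, int modifier)
{
    int code = first;
    int n = 1;

    while (sketch_enum_event(&code, modifier) == SKETCH_OK)
        n++;
    return n;
}
```

Because the caller owns the loop and the storage, nothing here exposes the table itself, which is the decoupling the 'query all' removal is after.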
PAPI 2 Event Description Structure(s) Hardware Dependent (e.g. Linux): typedef struct hwd_preset { unsigned char selector; unsigned char derived; unsigned char operand_index; struct perfctr_control counter_cmd; char note[PAPI_MAX_STR_LEN]; } hwd_preset_t; Hardware Independent: typedef struct pre_info { char *event_name; unsigned int event_code; char *event_descr; char *event_label; int avail; char *event_note; int flags; } PAPI_preset_info_t; In PAPI 2, one structure defined the hardware-independent parts of preset events, and another structure defined the hardware-dependent parts. There was no separate description of native events. This changes in PAPI 3.
PAPI 3 Event Description Structure(s) typedef struct preset_search { unsigned int preset; int derived; int natEvent[MAX_COUNTER_TERMS]; } preset_search_t; typedef struct hwi_preset { int derived; int metric_count; int natIndex[MAX_COUNTER_TERMS]; char operation[OPS]; char note[PAPI_MAX_STR_LEN]; } hwi_preset_t; typedef struct pre_info { char *event_name; unsigned int event_code; char *event_descr; char *event_label; int avail; char *event_note; int flags; } PAPI_preset_info_t; Currently, PAPI 3 has 3 hardware independent structures. The first (PAPI_preset_info_t) is identical to PAPI 2. The second (preset_search_t) is a dense structure containing the preset events defined for a specific platform. The third (hwi_preset_t) is a sparse array the same size as the first, into which the second array is copied at init time. The first and third structures can be merged.
PAPI 3 Event Description Structure(s) typedef struct preset_search { unsigned event_code; int derived; char operation[OPS]; int nativeEvent[MAX_COUNTER_TERMS]; char note[PAPI_MAX_STR_LEN]; } preset_search_t; typedef struct { char *symbol; char *short_descr; char *long_descr; unsigned event_code; int derived; char operation[OPS]; int nativeEvent[MAX_COUNTER_TERMS]; char event_note[PAPI_MAX_STR_LEN]; } PAPI_preset_event_info_t; Combining and rearranging produces the above structures. The first four fields are statically initialized in the hwi. The last four fields are dynamically initialized from static information in the hwd preset_search array. Event existence (avail) is signaled by a non-zero nativeEvent[0]. This captures the internal description of a PAPI preset event.
External Event Description Structure A pointer to a copy of this structure is passed to PAPI_get_event_info() after initializing the event_code field with the event of interest. Strings are constructed if necessary and copied into this structure. Memory management is handled by the caller. typedef struct { unsigned event_code; int avail; unsigned derived; char symbol[PAPI_MIN_STR_LEN]; char short_descr[PAPI_MIN_STR_LEN]; char long_descr [PAPI_MAX_STR_LEN]; char vendor_symbol[PAPI_MAX_STR_LEN]; char vendor_descr[PAPI_HUGE_STR_LEN]; char event_note[PAPI_MAX_STR_LEN]; } PAPI_event_info_t; Preset events may fill in all fields as appropriate, while native events will only fill in the ‘vendor’ and ‘avail’ fields.
PAPI 2 Native Event Support • Native events supported through a 32-bit binary code • Easy to implement • Flexible and powerful • Clumsy for end user • Newer architectures restricted by 32-bit limit
Programmable Event Interface • Consider the following: • In how many cycles did the processor retire more than 3 instructions? • In how many cycles did the memory request buffer have more than 2 entries? • What is the average number of cycles that memory requests are pending? • You can threshold: • Event (number of events) • Duration (cycles some event is happening)
PAPI 3 Native Event Support • Substrates contain internal native event table. • Similar to old preset table info • Vendor names & descriptions • Counter mappings for register allocation • Preset handling becomes hardware independent by referencing the native table • Self-documenting • Accommodates arbitrary event structure info • May not contain all possible variations
PAPI 3 Event Editing • Provide a mechanism to edit preset and native event tables. • First implementation may be binary only • {get,set}_name, descr, hwd_register_t • Provides low level hooks to modify tables • Tools could be built to translate binary into an ASCII internal representation • Could be used as part of a ‘config file’ mechanism • Later tools could support XML
PAPI_{en,de}code_event() int PAPI_encode_event(char *, int *eventCode) int PAPI_decode_event(int eventCode, char *) • Symmetric across encode/decode • Symmetric across PRESET/Native • Lets users: • explore preset definitions • (re)define presets • define new/custom native events • export preset / native events for: • Documentation • Later import via config file or custom program • Experiment with new or alternative events
PAPI_{en,de}code_event() int PAPI_encode_event(char *, int *eventCode) int PAPI_decode_event(int eventCode, char *) • Specification string could be: • Custom format (varargs or delimited text) • Simple(r) & quicker implementation • XML tag format • Well understood standard • Rich & expressive • Potentially useful elsewhere in PAPI • Language neutral • Possibly big and slow • Longer implementation
High-level Interface • Initialization • Passively check status of library • Thread safety • No knowledge of thread library • Rename high level PAPI_flops to PAPI_flips • Add PAPI_ipc like PAPI_flips • Add utility file that does rate accounting and arithmetic for you.
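The "rate accounting and arithmetic" utility could be as small as the two helpers sketched here. The function names are hypothetical, and returning 0 when no time or no cycles have elapsed is a choice made for this example.

```c
#include <assert.h>

/* Millions of events per second, e.g. Mflips from a floating point
 * instruction count and elapsed process time in seconds.
 * Returns 0 when no time has elapsed, to avoid dividing by zero. */
float sketch_mrate(long long events, float seconds)
{
    assert(events >= 0);
    if (seconds <= 0.0f)
        return 0.0f;
    return (float)((double)events / ((double)seconds * 1.0e6));
}

/* Instructions per cycle; returns 0 when no cycles were counted. */
float sketch_ipc(long long instructions, long long cycles)
{
    assert(instructions >= 0);
    if (cycles <= 0)
        return 0.0f;
    return (float)((double)instructions / (double)cycles);
}
```

Keeping this arithmetic in a utility layer lets the counters themselves stay integer valued, which is the reason given elsewhere in the deck for dropping rate presets.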
PAPI Library Options • All predefined options have “GET/SET” removed from their names. • These are implied by the PAPI_set_opt()/PAPI_get_opt() calls. • Remove all GET/SET from #defines in papi.h.
PAPI 3.0 Web Page • Designated web master. (Rotating?) • New reports: • PAPI Web Stats • Code coverage • Build status • Bug tracking/reporting • Improved processor documentation/links • Separate tools section with: • Small synopsis with screenshot • Links to main page • Links to reviews (Shirley)
Initialization Status • Some users need to know if PAPI has been initialized and which 'level' has been used. • Solution, a MACRO: PAPI_is_initialized() • Returns: • 0: False • 1: PAPI_LOW_LEVEL_INITED • 2: PAPI_HIGH_LEVEL_INITED
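A minimal sketch of how such a macro could sit on top of a single status variable. Every name here is hypothetical, and two behaviors are assumptions for this example: high-level initialization also counts as low-level initialization, and the recorded level never moves back down.

```c
#include <assert.h>

#define SKETCH_NOT_INITED         0
#define SKETCH_LOW_LEVEL_INITED   1
#define SKETCH_HIGH_LEVEL_INITED  2

static int sketch_init_level = SKETCH_NOT_INITED;

/* Passive check: reads the status without initializing anything. */
#define SKETCH_IS_INITIALIZED() (sketch_init_level)

/* Records an initialization level; the level only ever goes up. */
static void sketch_set_level(int level)
{
    assert(level >= SKETCH_LOW_LEVEL_INITED && level <= SKETCH_HIGH_LEVEL_INITED);
    if (level > sketch_init_level)
        sketch_init_level = level;
}

/* Stand-in for low-level initialization. Returns the current level. */
int sketch_library_init(void)
{
    sketch_set_level(SKETCH_LOW_LEVEL_INITED);
    return sketch_init_level;
}

/* Stand-in for high-level initialization. Returns the current level. */
int sketch_high_init(void)
{
    sketch_set_level(SKETCH_HIGH_LEVEL_INITED);
    return sketch_init_level;
}
```

A tool can test SKETCH_IS_INITIALIZED() before deciding whether to initialize the library itself, which is the passive status check the high-level interface slide asks for.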