dIP: A Non-Intrusive Debugging IP for Dynamic Data Race Detection in Many-core Chi-Neng Wen, Shu-Hsuan Chou and Tien-Fu Chen National Chung-Cheng University, Chia-Yi, Taiwan e-mail: {wcn93, csh93, chen}@cs.ccu.edu.tw
2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks. EICE team. Presenters: Shih-Tung Huang, Tsung-Cheng Lin, Kuan-Fu Kuo
Abstract (1) • Traditional debug facilities fall short of the debugging requirements of multicore parallel programming. Synchronization problems and bugs caused by race conditions are particularly difficult to detect with software debugging tools. This work presents a fast and feasible hardware-assisted solution for non-intrusive many-core debugging. The key idea is to keep track of data accesses to shared memory areas, and of the associated lock synchronization activity, using dedicated data structures in the proposed debugging IP (dIP). A page-based shared variable cache keeps shared variables as long as possible, and an inexpensive pluggable off-chip RAM can eliminate false positives efficiently.
Abstract (2) • To reduce debugging traffic blocking, this work provides a thread library that specifies shared memory/lock events and transmits those events to the dIP through a small hardware co-processor (eXtend dIP, XdIP) attached to each core. Our experimental results show the worst-case debugging traffic blocking as the number of cores increases, and that adding tolerance buffers in the XdIP eases it efficiently. Moreover, real workloads (SPLASH-2, MPEG-4, and H.264) run under the dIP's non-intrusive race detection with an average slowdown of only 4.7% to 12.2%. Finally, the hardware cost of the dIP stays low as the number of cores grows.
What’s the problem • Data race detection in multi-cores • Software methods • Cause the probe effect • Hardware methods • Need a lot of memory (or hardware area) to log core behavior • Cause false positives • This paper's proposed method • Not a software method • Uses related work [3] to avoid the probe effect • Uses centralized race detection: the hardware area does not grow much as cores are added
Related work • Figure: race detection (multi-core) divides into software methods [5][6] and hardware methods [7][8][9]; this paper's method builds on the lock-set algorithm [4] and related work [3] • The probe effect was introduced in related work [1] • Related work [4] is used for data race detection • Related work [3] separates the debugging data path from the usual data path to avoid the probe effect
Proposed MPSoC framework • Every core has an XdIP • The XdIP acts as a co-processor for its core • The XdIP sends debug events to the dIP through the Debug I/F • The interconnection follows the standard of related work [3] • The Data I/F is used for the usual data path • The Debug I/F is used for the debug event path
XdIP architecture • The architecture is quite simple • A filter passes debug events (lock and memory access information) into a buffer, where the packet & send unit packs them and they wait to be sent to the dIP • The filter is configured by software • Events are monitored and transferred in each core • When the buffer is full, the XdIP notifies the dIP to stall all cores until the events are transferred
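The filter, buffer, and stall behavior described above can be sketched as a toy software model. This is only an illustration under stated assumptions: the class name `XdIPSketch`, the event kind names, and the buffer depth are hypothetical and do not come from the slides, which describe a hardware block.

```python
from collections import deque

# Hypothetical event kinds; the slides only say "lock and memory access information".
LOCK, UNLOCK, MEM_READ, MEM_WRITE, OTHER = "lock", "unlock", "read", "write", "other"


class XdIPSketch:
    """Toy model of a per-core XdIP: filter -> bounded buffer -> send to dIP."""

    def __init__(self, core_id, capacity=4,
                 watched=(LOCK, UNLOCK, MEM_READ, MEM_WRITE)):
        self.core_id = core_id
        self.capacity = capacity      # buffer depth (assumed value)
        self.watched = set(watched)   # filter configuration, set by software
        self.buffer = deque()
        self.stall_requests = 0       # times the dIP was asked to stall all cores

    def observe(self, kind, address):
        """Handle one core event; return True if a stall was requested."""
        if kind not in self.watched:  # filter: drop events that are not debug events
            return False
        self.buffer.append((self.core_id, kind, address))
        if len(self.buffer) >= self.capacity:
            self.stall_requests += 1  # buffer full: ask the dIP to stall the cores
            return True
        return False

    def drain(self):
        """Send all buffered events to the dIP (modeled as returning a list)."""
        sent = list(self.buffer)
        self.buffer.clear()
        return sent
```

In this model a deeper buffer makes `observe` return `True` less often, which is the intuition behind adding tolerance buffers to reduce stalls.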
Data race detection flow • First, the Table manager accepts debug events from the XdIP and maintains the shared variable cache, lock-set table, and core-status table • Second, the Rule logic checks whether a data race has happened • If it has, the Alert signal is enabled to notify the exception handler so the race can be handled
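The two-step flow above (table update, then rule check) can be illustrated with a simplified lock-set check in the style of the lock-set algorithm the slides cite as [4]. This is a minimal sketch, not the dIP's actual rule logic: the class `LocksetDetector`, its field names, and the exact rule (alert when a written variable has been touched by more than one thread and no common lock remains) are assumptions made for illustration.

```python
class LocksetDetector:
    """Simplified lock-set race check: table update first, rule check second."""

    def __init__(self):
        self.held = {}         # thread -> set of locks currently held
        self.candidates = {}   # address -> locks held on every access so far
        self.accessors = {}    # address -> set of threads that accessed it
        self.written = set()   # addresses that have been written at least once
        self.alerts = []       # (thread, address) pairs flagged as races

    def on_event(self, thread, kind, target):
        locks = self.held.setdefault(thread, set())
        if kind == "lock":
            locks.add(target)
        elif kind == "unlock":
            locks.discard(target)
        elif kind in ("read", "write"):
            self._table_update(thread, kind, target, locks)
            self._rule_check(thread, target)

    def _table_update(self, thread, kind, addr, locks):
        if addr in self.candidates:
            self.candidates[addr] &= locks      # keep only locks common to all accesses
        else:
            self.candidates[addr] = set(locks)  # first access: copy the held locks
        self.accessors.setdefault(addr, set()).add(thread)
        if kind == "write":
            self.written.add(addr)

    def _rule_check(self, thread, addr):
        shared = len(self.accessors[addr]) > 1
        if shared and addr in self.written and not self.candidates[addr]:
            self.alerts.append((thread, addr))  # stands in for the Alert signal


detector = LocksetDetector()
for event in [("t1", "lock", "L"), ("t1", "write", "x"), ("t1", "unlock", "L"),
              ("t2", "write", "x")]:            # t2 writes x without holding L
    detector.on_event(*event)
```

After this sequence `detector.alerts` holds one entry for thread `t2` and variable `x`, because no lock protects every access to `x`.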
dIP architecture • Blocks 1~5 correspond to the data race detection flow • Block 6 orders debug events (SqID) • Block 7 is the external RAM used on a cache miss
Three tables • The page-based variable table records the latest access state of each variable • The lock-key table records how many lock-sets and how many lock-keys are available • The core-status table records the state of each core (thread, lock set, SqID) • Fully associative
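As a rough picture of how two of these tables might be laid out, here is a small sketch of a page-based variable table and a core-status entry. The field names, the 4 KiB page size, and the dictionary-based lookup are assumptions for illustration only; the slides do not give entry formats or sizes.

```python
from dataclasses import dataclass, field

PAGE_BITS = 12  # assumed 4 KiB pages; the slides do not state a page size


@dataclass
class VariableEntry:
    last_core: int        # core that made the latest access
    last_access: str      # "read" or "write"
    lock_set: frozenset   # locks held at the latest access


@dataclass
class CoreStatus:
    thread: int                                  # thread running on the core
    lock_set: set = field(default_factory=set)   # locks the thread holds
    sq_id: int = 0                               # sequence ID of its last event


class PageVariableTable:
    """Variables grouped by page, then indexed by offset inside the page."""

    def __init__(self):
        self.pages = {}   # page number -> {offset -> VariableEntry}

    @staticmethod
    def _split(address):
        return address >> PAGE_BITS, address & ((1 << PAGE_BITS) - 1)

    def record(self, address, core, access, lock_set):
        page, offset = self._split(address)
        entry = VariableEntry(core, access, frozenset(lock_set))
        self.pages.setdefault(page, {})[offset] = entry

    def lookup(self, address):
        page, offset = self._split(address)
        return self.pages.get(page, {}).get(offset)   # None models a miss
```

Grouping entries by page means variables that sit close together in memory share one page entry, which is one plausible reading of "page-based".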
Lock-key allocation/de-allocation • Allocation • Thread A executes W_lock S1, and the XdIP sends the event to the dIP • The dIP allocates a lock-key to thread A, and thread A saves the lock-key number with S1 • De-allocation • Thread A executes W_unlock S1, and the lock-key is sent to the dIP together with the event so that it can be de-allocated
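The allocate-on-lock and release-on-unlock handshake above can be sketched with a small free-list allocator. The class `LockKeyTable`, the number of keys, and the first-free allocation order are hypothetical choices; the slides only state that a lock-key is allocated on W_lock and returned on W_unlock.

```python
class LockKeyTable:
    """Toy lock-key allocator: hands out keys on lock, reclaims them on unlock."""

    def __init__(self, num_keys=8):           # table size is an assumed value
        self.free = list(range(num_keys))     # lock-keys still available
        self.owner = {}                       # lock-key -> (thread, lock name)

    def allocate(self, thread, lock_name):
        """Model W_lock: the dIP picks a free key and the thread remembers it."""
        if not self.free:
            raise RuntimeError("no lock-key available")
        key = self.free.pop(0)
        self.owner[key] = (thread, lock_name)
        return key

    def deallocate(self, key):
        """Model W_unlock: the thread sends its key back with the unlock event."""
        if key not in self.owner:
            raise KeyError(key)
        del self.owner[key]
        self.free.append(key)


table = LockKeyTable(num_keys=2)
key_a = table.allocate("A", "S1")   # thread A executes W_lock S1
table.deallocate(key_a)             # thread A executes W_unlock S1
```

Because the thread stores the key and returns it on unlock, the table does not need to search for the lock by name when de-allocating.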
Data race detection • Rule (figure: example accesses by core1 and core2)
Experiments • When the XdIP buffer is full, the dIP stalls all cores to stay non-intrusive • Stalling reduces system performance; an experiment with the SPLASH-2 benchmarks shows the stall ratio • Solution: add buffers in the XdIP
Experiments • Across four different benchmarks, the worst-case performance degradation is 12.25% • Compared with related work [9]