Application-Specific Signatures for Transactional Memory in Soft Processors
Martin Labrecque, Mark Jeffrey, Gregory Steffan
ECE Dept., University of Toronto
FPGAs for Systems-on-Chip
[Figure: FPGA containing a soft processor, a DDR controller, and Ethernet MAC controllers]
• Increasingly large Systems-on-Chip
• Many CPUs, accelerators, IP blocks
• Processors are easier to program than hardware
• FPGAs & multicores: similar parallel programming challenge
Why are parallel programs challenging?
Packet Processing Example
SINGLE-THREADED:
    packet = get_packet();
    …
    connection = database->lookup(packet);
    if(connection == NULL)
        connection = database->add(packet);
    connection->count++;
    …
    global_packet_count++;
MULTI-THREADED (the figure marks two regions of this code as Atomic):
    packet = get_packet();
    …
    connection = database->lookup(packet);
    if(connection == NULL)
        connection = database->add(packet);
    connection->count++;
    …
    global_packet_count++;
Challenges:
1- Must correctly delimit atomic operations
2- Improve performance by finer-grain locking
Packet Processing Example: Optimistic Parallelism across Connections
MULTI-THREADED:
    packet = get_packet();
    …
    connection = database->lookup(packet);
    if(connection == NULL)
        connection = database->add(packet);
    connection->count++;
    …
    global_packet_count++;
[Figure annotations on the code: Atomic (twice), Opportunity for Parallelism, No Parallelism]
Exploit Opportunity for Parallelism
• Allow more than 1 thread in a critical section
• Will succeed if threads access different data
• Transactional Memory
  • the new hot topic for multiprocessor computers
  • how to map TM to FPGAs?
Our Transactional Approach
• Modify main memory directly: reduce copies, faster commit
• Detect conflicts prior to corrupting main memory
• Undo changes on transaction abort
[Figure: processor1 and processor2 share a data cache backed by off-chip DDR; x marks on the data]
• How to efficiently detect conflicts?
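The in-place update with undo-on-abort described on this slide can be modelled in software as follows. This is only a minimal sketch, not the hardware design from the talk: the names `undo_log_t`, `tx_store`, `tx_commit`, `tx_abort` and the log depth `UNDO_CAPACITY` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define UNDO_CAPACITY 64  /* hypothetical log depth, not from the slides */

typedef struct {
    uint32_t *addr[UNDO_CAPACITY];  /* location of each speculative store */
    uint32_t  old[UNDO_CAPACITY];   /* value that the store overwrote */
    size_t    count;
} undo_log_t;

/* Speculative store: save the old value, then write memory in place.
 * Returns 0 on success, -1 if the log is full. */
int tx_store(undo_log_t *log, uint32_t *addr, uint32_t value)
{
    if (log->count == UNDO_CAPACITY)
        return -1;
    log->addr[log->count] = addr;
    log->old[log->count]  = *addr;
    log->count++;
    *addr = value;
    return 0;
}

/* Commit: memory already holds the new values, so only the log is dropped. */
void tx_commit(undo_log_t *log)
{
    log->count = 0;
}

/* Abort: restore old values newest-first, so that several stores to the
 * same location unwind back to the pre-transaction value. */
void tx_abort(undo_log_t *log)
{
    while (log->count > 0) {
        log->count--;
        *log->addr[log->count] = log->old[log->count];
    }
}
```

In this model a commit costs almost nothing because no data is copied, which is the "reduce copies, faster commit" trade-off; the price is that an abort must walk the log.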
Conflict Detection
[Figure: accesses of Transaction1 and Transaction2 compared pairwise]
  Read A vs. Read A: OK
  Read B vs. Write B: CONFLICT
  Write C vs. Read C: CONFLICT
  Write D vs. Write D: CONFLICT
• Tracking speculative reads and writes
• Compare accesses across transactions:
  • Must detect all conflicts for correctness
  • Reporting false conflicts is acceptable
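The comparison rule in the figure (two reads are compatible, any pair involving a write to the same location conflicts) can be written as an exact set check. This is an illustrative sketch using plain address arrays; the type `tx_sets_t` and the function names are hypothetical, and a real implementation would use signatures rather than exact sets.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Exact read and write sets of one transaction (hypothetical layout). */
typedef struct {
    const uint32_t *reads;   size_t n_reads;
    const uint32_t *writes;  size_t n_writes;
} tx_sets_t;

static bool contains(const uint32_t *set, size_t n, uint32_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == addr)
            return true;
    return false;
}

static bool overlap(const uint32_t *a, size_t na,
                    const uint32_t *b, size_t nb)
{
    for (size_t i = 0; i < na; i++)
        if (contains(b, nb, a[i]))
            return true;
    return false;
}

/* Conflict if either transaction wrote a location the other read or wrote.
 * Read/read overlap is deliberately not checked. */
bool tx_conflict(const tx_sets_t *t1, const tx_sets_t *t2)
{
    return overlap(t1->writes, t1->n_writes, t2->reads,  t2->n_reads)
        || overlap(t1->writes, t1->n_writes, t2->writes, t2->n_writes)
        || overlap(t1->reads,  t1->n_reads,  t2->writes, t2->n_writes);
}
```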
Related Work on Conflict Detection
• FPGAs: test speculative bits in the cache
  • Complex to evict cache lines
  • Lots of additional state
  • Too restrictive in terms of storage capacity
• ASIC: compare signatures
  • Signature: bit vector recording TM memory accesses
  • No previous signature FPGA implementation
Signatures are well suited to FPGA bitwise operations
How can signatures be efficiently implemented?
Conflict Detection with Signatures
• Hash of an address indexes into a bit vector
[Figure: loads and stores from processor1 pass through a hash function into Read and Write signatures, which are ANDed with the signatures of processor2]
• More bits per signature give more resolution
• FPGA timing and area limit the number of bits
• Hash functions have varying complexity/accuracy
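A signature of this kind can be sketched as a small bit vector. The 16-bit width and the placeholder hash (low bits of the word address) are assumptions made for illustration; `sig16_t`, `sig_hash`, `sig_insert` and `sig_intersect` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define SIG_BITS 16u           /* hypothetical signature size */

typedef uint16_t sig16_t;

/* Placeholder hash: low bits of the word address (assumed 4-byte words). */
static unsigned sig_hash(uint32_t addr)
{
    return (addr >> 2) % SIG_BITS;
}

/* Record an access by setting the bit that the address hashes to. */
void sig_insert(sig16_t *sig, uint32_t addr)
{
    *sig |= (sig16_t)(1u << sig_hash(addr));
}

/* Two signatures may share an address if their bitwise AND is non-zero.
 * Aliasing can report a false conflict but never hides a real one. */
bool sig_intersect(sig16_t a, sig16_t b)
{
    return (a & b) != 0;
}
```

With this placeholder hash, addresses 0x1000 and 0x1040 map to the same bit, so they are reported as a (false) conflict. Fewer such collisions is what "more resolution" buys.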
Goals of this Work
• Implement efficient signatures for TM on FPGAs
• Use FPGA reconfigurability for better/more-efficient TM
• Evaluate with a real system
Existing Hash Functions: Bit Selection
• A 4-bit hash indexes into 16 signature bits
[Figure: address bits, with a subset selected to form the hash]
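Bit selection can be sketched as concatenating chosen address bits into an index. Which bits are selected is an assumption here (the slide does not say); `bit_select_hash` and its parameters are hypothetical names.

```c
#include <stdint.h>

/* Build a hash index by concatenating selected address bits.
 * bit_pos[0] supplies the least-significant bit of the index.
 * With n_bits = 4 the result indexes a 16-bit signature. */
unsigned bit_select_hash(uint32_t addr, const unsigned *bit_pos,
                         unsigned n_bits)
{
    unsigned index = 0;
    for (unsigned i = 0; i < n_bits; i++)
        index |= ((addr >> bit_pos[i]) & 1u) << i;
    return index;
}
```

For example, selecting bits 2 to 5 of the address maps 0x18 to index 6. In hardware this hash is just wiring, which is why it is the cheapest option.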
Existing Hash Functions (continued)
• H3: XOR random address bits
• Multiple hash functions index different parts of the signature
• We use 4 hash functions to improve performance/length
[Figure: Hash_1 and Hash_2, each computed from a different set of address bits]
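One common way to express an H3-style hash is to make each output bit the XOR (parity) of the address bits picked by a fixed random mask. The sketch below follows that formulation; the masks are supplied by the caller and the names `parity32` and `h3_hash` are hypothetical.

```c
#include <stdint.h>

/* XOR of all bits of x (1 if an odd number of bits are set). */
static unsigned parity32(uint32_t x)
{
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* Output bit i is the XOR of the address bits selected by masks[i].
 * The masks would normally be chosen at random, once, at design time. */
unsigned h3_hash(uint32_t addr, const uint32_t *masks, unsigned n_bits)
{
    unsigned index = 0;
    for (unsigned i = 0; i < n_bits; i++)
        index |= parity32(addr & masks[i]) << i;
    return index;
}
```

Running several such functions with different mask sets, each indexing its own slice of the signature, gives the multiple-hash arrangement mentioned on the slide.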
Existing Hash Functions (continued)
• PBX: XOR high-order bits with low-order ones
• LE-PBX: XOR high-order bits with low-order ones, progressively omit low-order bits in hash functions
[Figure: Hash_1 and Hash_2 computed from the address bits]
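The PBX idea of folding high-order bits onto low-order ones can be sketched as an XOR of two adjacent address fields. The choice of fields (`low_shift`, `n_bits`) is an assumption for illustration, and the LE-PBX variant is not covered; `pbx_hash` is a hypothetical name.

```c
#include <stdint.h>

/* XOR an n_bits-wide low-order field with the field directly above it.
 * low_shift skips address bits that carry no information (for example a
 * word offset). n_bits must be between 1 and 16. */
unsigned pbx_hash(uint32_t addr, unsigned low_shift, unsigned n_bits)
{
    uint32_t field_mask = (1u << n_bits) - 1u;
    uint32_t low  = (addr >> low_shift) & field_mask;
    uint32_t high = (addr >> (low_shift + n_bits)) & field_mask;
    return (unsigned)(low ^ high);
}
```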
Signatures: an Opportunity for FPGAs
• ASIC hash functions on FPGA: very area consuming
• Due to locality:
  • applications access certain memory locations more frequently
  • certain locations will have more conflicts than others
• Via app-specific signatures:
  • increase tracking resolution of conflicting memory locations
  • decrease tracking resolution of others
• FPGAs allow a customized hash function for each application
Application-specific signatures!
Trie-based Hashing for Signatures
[Figure: binary addresses collected by profiling (000, 011, 100, 101, 110, 111) inserted into a trie: root; 1xx, 0xx; 11x, 10x, 01x, 00x; leaves 111, 110, 101, 100, 011, 000. Leaves are distinct addresses and map to signature bits]
• Trie gives control over the resolution for different memory regions
• Complete trie of all TM accesses is HUGE
• Which leaves in the trie can/cannot be merged?
Trie-Based Conflict Detection
[Figure: trie over address bits A2,A1,A0 (xxx; 1xx, 0xx; 11x, 10x, 01x, 00x; leaves 111, 110, 101, 100, 011, 000), compacted using simulation feedback. Each load/store address (A2 A1 A0) drives three signature bits: A2 & A0, A2 & !A0, !A2]
• 3 leaves in the trie: 3 signature bits encompass all accesses
• Compact the trie by evaluating only nodes with remaining branching
• Representation is very efficient!
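The three signature-bit expressions on this slide (A2 & A0, A2 & !A0, !A2) can be evaluated directly on a 3-bit address. The sketch below does that; the ordering of the bits within the returned mask is an assumption, and `trie_signature_mask` is a hypothetical name.

```c
/* Map a 3-bit address (A2 A1 A0) to a one-hot 3-bit signature mask using
 * the compacted trie from the slide. A1 is never examined because no
 * remaining branch depends on it.
 *   bit 0: !A2        (all of 0xx merged into one leaf)
 *   bit 1: A2 & !A0
 *   bit 2: A2 & A0
 * The bit positions are an illustrative assumption. */
unsigned trie_signature_mask(unsigned addr)
{
    unsigned a2 = (addr >> 2) & 1u;
    unsigned a0 = addr & 1u;
    unsigned bit0 = a2 ^ 1u;
    unsigned bit1 = a2 & (a0 ^ 1u);
    unsigned bit2 = a2 & a0;
    return bit0 | (bit1 << 1) | (bit2 << 2);
}
```

Every address sets exactly one bit, so the three expressions partition the address space, and the hash costs only a few gates.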
Trie-based Hash Function Evaluation
• Training packet trace is different from test packet trace
Multiprocessor System
[Figure: processor1 and processor2 (1 thread each, each with its own I$), a synchronization unit, instruction and data memories, input and output memories, an input buffer (packet input), an output buffer (packet output), and a shared data cache backed by off-chip DDR]
• NetFPGA: Virtex II Pro 50, 4 GigE + 1 PCI interfaces
• 2 processors @ 125 MHz (limited by FPGA)
• 64 MB DDR2 SDRAM @ 200 MHz
Real system executing real applications
Simulated Ratio of False Conflicts versus Number of Signature Bits
[Chart: NAT, percent false conflicts]
- Trie-based hashing function requires far fewer signature bits
Simulated Ratio of False Conflicts versus Number of Signature Bits
[Charts: UDHCP, NAT, Classifier, Intruder]
- Trie-based hashing function requires far fewer signature bits
Simulated Packet Rate Normalized to Ideal Conflict Detection vs Trie-Based Signature Length
[Chart: packet rate versus signature length, with the ideal as reference]
Signatures are Critical to Performance
2 Best Implementation Options (Maximum Design @ 125MHz)
• Registers: arbitrary hash function, ~100 signature bits per thread
  • We use trie-based signatures: they perform best at that size
• Block RAM: Bit-Select hash function, 2048 signature bits per thread
Let’s Compare!
Trie-based Hashing Normalized to Bit Selection: Area and Throughput
[Chart labels: +71%, +58%, +12%, +9%]
- At most 5% area overhead
- Significantly fewer rollbacks: packet rate increase
Conclusions
• Conflict detection significantly impacts performance
• Trie-based hashing reduces required signature bits
• Trie-based hashing can be implemented in LUTs
  • Preserve frequency, 5% area overhead
  • Retiming is required to implement in RAMs
• Increased performance (up to 71%) versus other best implementation (RAM-based bit-select)
- Application-specific signatures enable first fully integrated TM processor for FPGA
- We now have an extended version working with 8 threads
ECE Dept. University of Toronto Thank you! Martin Labrecque Mark Jeffrey Gregory Steffan martinL/markJ@eecg.utoronto.ca
Transactional Memory: Parallel Programming Made Easy
• Alleviate need for fine-grained synchronization
BEFORE:
    Bool val = f(shared_1);
    if(val) {
        Lock();
        if ( f(shared_1) )
            shared_1 = 0;
        Unlock();
    }
AFTER:
    Lock();
    if ( f(shared_1) )
        shared_1 = 0;
    Unlock();
• Reduce conservative synchronization overhead
    Lock();
    if (shared_1)
        array [ i ] = 0;
    Unlock();
Only serialized when truly necessary
Our Transactional Approach
• No program change required
• Modify main memory directly
• Detect conflicts prior to corrupting main memory
• Undo changes on transaction abort
[Figure: two processors share a data cache backed by off-chip DDR; x marks on the data]
Transactional Single-Threaded Processor (simplified)
[Figure: pipeline with PC (+4), instruction cache, register array, ALU, data cache, and hazard detection logic]
Hazard detection is too slow: use static hazard detection
Transactional Single-Threaded Processor (simplified)
[Figure: pipeline with two PCs (+4), two register arrays, instruction cache, ALU, data cache, conflict detection, and an undo log]
Transactional Packet Processing
• Hardware support to revert speculative changes to:
  • Register file
  • Program counter
  • Data memory
• To detect failed speculation:
  • Record read and write sets of speculative threads
  • Compare sets across threads
When does the set comparison take place?
Conflict Detection with Signatures
• Suited for FPGA bitwise operations
• Hash of an address sets bits in a bit vector
• Set comparison is an AND operation
• Clearing sets is done in 1 cycle
[Figure: each processor holds per-thread signatures made of write (W) and read (R) bit vectors. Signature Thread 0: W 01000000, R 00000000, W 00000000, R 00000000. Signature Thread 1: W 01000000, R 00000000, W 00000000, R 00000000]
• Requires many bits per thread
• Timing constraints allow read and write set tracking for 2 threads
• Made a single-threaded 2-processor implementation
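The read/write signature pair and the AND-based comparison listed on this slide can be sketched as below. The 8-bit width mirrors the bit vectors drawn in the figure; the placeholder hash (low 3 address bits) and the names `tx_sig_t`, `tx_sig_load`, `tx_sig_store`, `tx_sig_clear` and `tx_sig_conflict` are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-thread read and write signatures, 8 bits each as in the figure. */
typedef struct {
    uint8_t rd;
    uint8_t wr;
} tx_sig_t;

/* Placeholder hash: the low 3 address bits select one of 8 bits. */
static uint8_t bit_for(uint32_t addr)
{
    return (uint8_t)(1u << (addr & 7u));
}

void tx_sig_load(tx_sig_t *s, uint32_t addr)  { s->rd |= bit_for(addr); }
void tx_sig_store(tx_sig_t *s, uint32_t addr) { s->wr |= bit_for(addr); }

/* Clearing is two register resets, matching the 1-cycle clear. */
void tx_sig_clear(tx_sig_t *s)
{
    s->rd = 0;
    s->wr = 0;
}

/* Set comparison as AND operations: a write by one thread against any
 * access by the other. Read/read overlap is not a conflict. */
bool tx_sig_conflict(const tx_sig_t *a, const tx_sig_t *b)
{
    return ((a->wr & (b->rd | b->wr)) | (a->rd & b->wr)) != 0;
}
```

With this placeholder hash, two threads that both store to address 6 end up with W = 01000000, the pattern shown in the figure, and the comparison reports a conflict.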
[Figure: compacted trie with nodes root, 1xx, 0xx, 11x, 00x and leaves 111, 110, 000]
A New Meaning for Locks
• Optimistically consider locks
• No program change required
    Lock();
    if ( f( ) )
        shared_1 = a();
    else
        shared_2 = b();
    Unlock();
[Figure: timelines of Thread1 to Thread4 under LOCKS and under TRANSACTIONAL execution, the latter with an x mark]
• Reduce conservative synchronization overhead
• Reduce challenge of fine-grained synchronization
* can you list the apps? • emphasize that train != test in methodology page