Application-Specific Signatures for Transactional Memory in Soft Processors

ECE Dept. University of Toronto. Application-Specific Signatures for Transactional Memory in Soft Processors. Martin Labrecque Mark Jeffrey Gregory Steffan. FPGA. Soft Processor. DDR controller. Ethernet MAC controllers. FPGAs for Systems-on-Chip. Increasingly large Systems-on-Chip

Presentation Transcript


  1. Application-Specific Signatures for Transactional Memory in Soft Processors. Martin Labrecque, Mark Jeffrey, Gregory Steffan. ECE Dept., University of Toronto

  2. FPGAs for Systems-on-Chip (Figure labels: FPGA, Soft Processor, DDR controller, Ethernet MAC controllers) • Increasingly large Systems-on-Chip • Many CPUs, accelerators, IP blocks • Processors are easier to program than hardware • FPGAs & multicores: similar parallel programming challenge. Why are parallel programs challenging?

  3. Packet Processing Example. SINGLE-THREADED: packet = get_packet(); … connection = database->lookup(packet); if(connection == NULL) connection = database->add(packet); connection->count++; … global_packet_count++; MULTI-THREADED (two regions marked Atomic): packet = get_packet(); … connection = database->lookup(packet); if(connection == NULL) connection = database->add(packet); connection->count++; … global_packet_count++; Challenges: 1- Must correctly delimit atomic operations 2- Improve performance by finer-grain locking

  4. Packet Processing Example. MULTI-THREADED (two regions marked Atomic): packet = get_packet(); … connection = database->lookup(packet); if(connection == NULL) connection = database->add(packet); connection->count++; … global_packet_count++; (Annotations: Optimistic Parallelism across Connections; Opportunity for Parallelism; No Parallelism)

  5. Exploit Opportunity for Parallelism • Allow more than 1 thread in a critical section • Will succeed if threads access different data • Transactional Memory • the new hot topic for multiprocessor computers • how to map TM to FPGAs?

  6. Our Transactional Approach • Modify main memory directly: reduce copies, faster commit • Detect conflicts prior to corrupting main memory • Undo changes on transaction abort (Figure labels: processor1 and processor2, each accessing data x; Data Cache; Off-chip DDR) • How to efficiently detect conflicts?
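
The "modify main memory directly, undo on abort" idea on this slide can be sketched in software as an undo log. This is only an illustrative model, not the authors' hardware design: the class name `UndoLogMemory`, its methods, and the default value 0 for never-written addresses are all assumptions made for the example.

```python
class UndoLogMemory:
    """Toy memory that is written in place and can roll back on abort."""

    def __init__(self):
        self.mem = {}        # address -> value (stands in for main memory)
        self.undo_log = []   # (address, old_value) pairs, oldest first

    def tx_write(self, addr, value):
        # Record the old value first, then overwrite memory directly.
        # Assumption: an address never written before reads as 0.
        self.undo_log.append((addr, self.mem.get(addr, 0)))
        self.mem[addr] = value

    def commit(self):
        # Memory already holds the new values, so commit only drops the log.
        self.undo_log.clear()

    def abort(self):
        # Restore old values newest-first so repeated writes unwind correctly.
        while self.undo_log:
            addr, old = self.undo_log.pop()
            self.mem[addr] = old
```

Because the new values are already in place, a commit is cheap, while the cost of copying is only paid on the abort path.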

  7. Conflict Detection • Tracking speculative reads and writes • Compare accesses across transactions (Figure: accesses of Transaction1 and Transaction2 compared pairwise: Read A / Read A: OK; Read B / Write B: CONFLICT; Write D / Write D: CONFLICT; Write C / Read C: CONFLICT) Must detect all conflicts for correctness. Reporting false conflicts is acceptable
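
The rule illustrated on this slide (two reads are compatible; any pair involving a write to the same address conflicts) can be written down with exact read and write sets. The function name `conflicts` and the particular split of accesses between the two transactions are assumptions for illustration; the slide's figure does not survive well enough to say which transaction issued which access.

```python
def conflicts(reads_a, writes_a, reads_b, writes_b):
    """Return the addresses on which two transactions conflict.

    A conflict needs at least one write: write/write, write/read,
    or read/write on the same address. Read/read never conflicts.
    """
    return (writes_a & (reads_b | writes_b)) | (writes_b & reads_a)


# One possible reading of the slide's example (assumed assignment).
t1_reads, t1_writes = {"A", "B", "C"}, {"D"}
t2_reads, t2_writes = {"A"}, {"B", "C", "D"}
conflicting = conflicts(t1_reads, t1_writes, t2_reads, t2_writes)
```

Exact sets like these never report a false conflict, but they grow with the transaction, which is what motivates the fixed-size signatures discussed next.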

  8. Related Work on Conflict Detection • FPGAs: test speculative bits in the cache • Complex to evict cache lines • Lots of additional state • Too restrictive in terms of storage capacity • ASIC: compare signatures • Signature: bit vector recording TM memory accesses • No previous signature FPGA implementation. Signatures are well suited to FPGA bitwise operations. How can signatures be implemented efficiently?

  9. Conflict Detection with Signatures • Hash of an address indexes into a bit vector (Figure: loads and stores from processor1 pass through a Hash Function into Read and Write signatures, which are ANDed with the signatures of processor2) • More bits per signature → more resolution • FPGA timing and area limit the number of bits • Hash functions have varying complexity/accuracy
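
A minimal software model of signature-based conflict detection follows. It is a sketch under stated assumptions: the 16-bit signature length, the placeholder `hash_addr` (low bits of the word address), and the names `Signature` and `may_conflict` are all invented for the example and are not taken from the presentation.

```python
SIG_BITS = 16  # assumed signature length for the example


def hash_addr(addr):
    # Placeholder hash (assumption): low bits of the 4-byte word address.
    return (addr >> 2) % SIG_BITS


class Signature:
    """Read and write bit vectors for one transaction."""

    def __init__(self):
        self.read = 0
        self.write = 0

    def load(self, addr):
        self.read |= 1 << hash_addr(addr)

    def store(self, addr):
        self.write |= 1 << hash_addr(addr)

    def clear(self):
        # Both vectors are reset at once, as on commit or abort.
        self.read = 0
        self.write = 0


def may_conflict(a, b):
    # Bitwise AND of the vectors; any surviving bit signals a possible conflict.
    return bool((a.write & (b.read | b.write)) | (b.write & a.read))
```

Two different addresses that hash to the same bit produce a false conflict, which is safe but costs a rollback; a true conflict can never be missed because every access sets its bit.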

  10. Goals of this Work • Implement efficient signatures for TM on FPGAs • FPGA reconfigurability → better/more-efficient TM • Evaluate with real system

  11. Existing Hash Functions. Bit Selection: a 4-bit hash indexes into 16 signature bits (Figure: selected address bits are copied directly into the hash)
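
Bit selection can be sketched as picking a few address bits and concatenating them into an index. The function name `bit_select` and the choice of bit positions 2 to 5 are assumptions for illustration; the slide does not say which address bits are selected.

```python
def bit_select(addr, positions=(2, 3, 4, 5)):
    """Concatenate the selected address bits into a signature index.

    With 4 selected bits the index ranges over 16 signature bits.
    The default positions are an assumption for this example.
    """
    index = 0
    for i, pos in enumerate(positions):
        index |= ((addr >> pos) & 1) << i
    return index
```

In hardware this hash is just wiring, which is why it is cheap, but all addresses that agree on the selected bits alias to the same signature bit.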

  12. Existing Hash Functions (continued). H3: XOR random address bits. Multiple hash functions index different parts of the signature (Figure: Hash_1 and Hash_2, each computed from a different subset of the address bits). We use 4 hash functions to improve performance/length
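
An H3-style hash can be sketched as follows: each output bit is the XOR (parity) of a randomly chosen subset of address bits, and each of several hash functions sets one bit in its own slice of the signature. The helper names `make_h3` and `signature_bits`, the seeds, the 32-bit address width, and the slice layout are assumptions for this example.

```python
import random


def make_h3(out_bits, addr_bits=32, seed=0):
    """Build one H3-style hash: each output bit XORs a random subset of address bits."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(addr_bits) for _ in range(out_bits)]

    def h3(addr):
        index = 0
        for i, mask in enumerate(masks):
            # Parity of the selected bits equals their XOR.
            parity = bin(addr & mask).count("1") & 1
            index |= parity << i
        return index

    return h3


def signature_bits(addr, hashes, out_bits):
    """Each hash function sets exactly one bit in its own slice of the signature."""
    slice_len = 1 << out_bits
    sig = 0
    for k, h in enumerate(hashes):
        sig |= 1 << (k * slice_len + h(addr))
    return sig


# Four hash functions with 2 output bits each cover a 16-bit signature.
hashes = [make_h3(2, seed=s) for s in range(4)]
sig = signature_bits(0x1234, hashes, 2)
```

The XOR trees behind each output bit are what make this family more accurate than bit selection and also more expensive in FPGA area.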

  13. Existing Hash Functions (continued). PBX: XOR high-order bits with low-order ones. LE-PBX: XOR high-order bits with low-order ones, progressively omit low-order bits in hash functions (Figure: Hash_1 and Hash_2 computed from address bits for each scheme)
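
The two descriptions on this slide can be turned into a rough sketch. The field positions (`low_shift`, `high_shift`), the 4-bit output, and in particular the way `le_pbx` drops `k` low-order bits for the k-th hash function are one interpretation of the slide text, not a specification of the published schemes.

```python
def pbx(addr, out_bits=4, low_shift=2, high_shift=12):
    """XOR a field of high-order address bits with a field of low-order bits."""
    mask = (1 << out_bits) - 1
    low = (addr >> low_shift) & mask
    high = (addr >> high_shift) & mask
    return low ^ high


def le_pbx(addr, k, out_bits=4, low_shift=2, high_shift=12):
    """Like pbx, but the k-th hash function skips k more low-order bits (assumed reading)."""
    mask = (1 << out_bits) - 1
    low = (addr >> (low_shift + k)) & mask
    high = (addr >> high_shift) & mask
    return low ^ high
```

Under this reading, hash functions with larger `k` map neighbouring addresses to the same bit, so accesses with spatial locality set fewer distinct signature bits.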

  14. Signatures: an Opportunity for FPGAs • ASIC hash functions on FPGA: very area consuming • Due to locality: • applications access certain memory locations more frequently • certain locations will have more conflicts than others • Via app-specific signatures: • increase tracking resolution of conflicting memory locations • decrease tracking resolution of others • FPGAs allow customized hash function for each application Application-specific signatures!

  15. Trie-based Hashing for Signatures (Figure: binary addresses obtained by profiling are inserted into a trie with nodes root; 1xx, 0xx; 11x, 10x, 01x, 00x; and leaves 111, 110, 101, 100, 011, 000) Leaves are distinct addresses → signature bits • Trie gives control over the resolution for different memory regions • Complete trie of all TM accesses is HUGE • Which leaves in the trie can/cannot be merged?
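
Building the trie from profiled addresses can be sketched with nested dictionaries. The helper names `build_trie` and `count_leaves` are assumptions, and the six 3-bit addresses are the leaves listed on the slide.

```python
def build_trie(addresses, width):
    """Insert each address, most significant bit first, into a binary trie."""
    root = {}
    for addr in addresses:
        node = root
        for pos in range(width - 1, -1, -1):
            bit = (addr >> pos) & 1
            node = node.setdefault(bit, {})
    return root


def count_leaves(node):
    """A leaf is a node with no children; each leaf would need one signature bit."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


profiled = [0b000, 0b011, 0b100, 0b101, 0b110, 0b111]
trie = build_trie(profiled, width=3)
```

With one signature bit per leaf the resolution is perfect, but the leaf count grows with every distinct address, which is why the next step is deciding which subtrees can be merged.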

  16. Trie-Based Conflict Detection (Figure: trie from xxx through 1xx, 0xx and 11x, 10x, 01x, 00x down to leaves 111, 110, 101, 100, 011, 000, annotated "Simulation feedback"; each load/store address with bits A2, A1, A0 is mapped to one of three merged leaves: A2 & A0, A2 & !A0, !A2) 3 leaves in trie → 3 signature bits encompass all accesses. Compact trie by only evaluating nodes with remaining branching. Representation is very efficient!
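
The three leaf expressions on this slide (A2 & A0, A2 & !A0, !A2) can be evaluated directly as a hash that sets exactly one of three signature bits. The function name `trie_hash` and the assignment of leaves to bit positions are assumptions for illustration.

```python
def trie_hash(addr):
    """Map a 3-bit address to a one-hot, 3-bit signature using the compacted trie.

    Only A2 and A0 are examined; A1 no longer distinguishes any leaf.
    Bit positions per leaf are an arbitrary choice for this example.
    """
    a2 = (addr >> 2) & 1
    a0 = addr & 1
    if not a2:
        return 0b001  # leaf !A2
    if a0:
        return 0b100  # leaf A2 & A0
    return 0b010      # leaf A2 & !A0
```

Each leaf is a small AND of address bits, so the whole hash maps onto a handful of LUTs, and addresses that fall under the same merged leaf share a signature bit.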

  17. Trie-based Hash Function Evaluation. Training packet trace is different from test packet trace

  18. Multiprocessor System (Figure labels: processor1 and processor2, 1-thread each, each with an I$; Synch. Unit; Instr. and Data; Input mem. and Output mem.; Input Buffer, Shared Data Cache, Output Buffer; packet input, packet output; Off-chip DDR) • NetFPGA: Virtex II Pro 50, 4 GigE + 1 PCI interfaces • 2 processors @ 125 MHz (limited by FPGA) • 64 MB DDR2 SDRAM @ 200 MHz. Real system executing real applications

  19. Simulated Ratio of False Conflicts versus Number of Signature Bits (Plot: NAT, percent false conflicts) - Trie-based hashing function requires far fewer signature bits

  20. Simulated Ratio of False Conflicts versus Number of Signature Bits (Plots: UDHCP, NAT, Classifier, Intruder) - Trie-based hashing function requires far fewer signature bits

  21. Simulated Packet Rate Normalized to Ideal Conflict Detection vs Trie-Based Signature Length (Plot; reference line: Ideal) Signatures are Critical to Performance

  22. 2 Best Implementation Options for Signatures (Maximum Design @ 125MHz) • Registers: arbitrary hash function, ~100 signature bits per thread • Block RAM: Bit-Select hash function, 2048 signature bits per thread. Let's Compare! We use trie-based signatures: They perform best at that size

  23. Trie-based Hashing Normalized to Bit Selection: Area and Throughput (Chart labels: +71%, +58%, +12%, +9% packet rate increase) - At most 5% area overhead - Significantly fewer rollbacks

  24. Conclusions • Conflict detection significantly impacts performance • Trie-based hashing reduces required signature bits • Trie-based hashing can be implemented in LUTs • Preserve frequency, 5% area overhead • Retiming is required to implement in RAMs • Increased performance (up to 71%) versus other best implementation (RAM-based bit-select) - Application-specific signatures enable first fully integrated TM processor for FPGA - We now have an extended version working with 8 threads

  25. ECE Dept. University of Toronto Thank you! Martin Labrecque Mark Jeffrey Gregory Steffan martinL/markJ@eecg.utoronto.ca

  26. Transactional Memory: Parallel Programming Made Easy • Alleviate need for fine-grained synchronization (two versions, labeled BEFORE and AFTER): Bool val = f(shared_1); if(val) { Lock(); if ( f(shared_1) ) shared_1 = 0; Unlock(); } and Lock(); if ( f(shared_1) ) shared_1 = 0; Unlock(); • Reduce conservative synchronization overhead: Lock(); if (shared_1) array [ i ] = 0; Unlock(); Only serialized when truly necessary

  27. Our Transactional Approach • No program change required • Modify main memory directly • Detect conflicts prior to corrupting main memory • Undo changes on transaction abort (Figure labels: two processors, each accessing data x; Data Cache holding x; Off-chip DDR)

  30. Transactional Single-Threaded Processor (simplified) (Figure labels: PC, +4, Instr. Cache, Reg. Array, ALU, Data Cache, Hazard Detection Logic) Hazard detection is too slow: use static hazard detection

  31. Transactional Single-Threaded Processor (simplified) (Figure labels: two PCs, +4, Instr. Cache, two Reg. Arrays, ALU, Data Cache, Undo Log, Conflict Detection)

  32. Transactional Packet Processing • Hardware support to revert speculative changes to: • Register file • Program counter • Data memory • To detect failed speculation: • Record read and write sets of speculative threads • Compare sets across threads When does the set comparison take place?

  33. Conflict Detection with Signatures • Suited for FPGA bitwise operations • Hash of an address sets bits in a bit vector • Set comparison is an AND operation • Clearing sets is done in 1 cycle (Figure: Signature Thread 0 and Signature Thread 1, each with W and R bit vectors such as W 01000000, R 00000000, updated as each processor accesses x) • Requires many bits per thread • Timing constraints allow read and write set tracking for 2 threads • Made a single-threaded 2-processor implementation

  34. (Figure: trie with nodes root; 1xx, 0xx; 11x, 00x; and leaves 111, 110, 000)

  35. A New Meaning for Locks • Optimistically consider locks • No program change required (Figure: Thread1, Thread2, Thread3, Thread4 under LOCKS versus TRANSACTIONAL execution, accessing x) Lock(); if ( f( ) ) shared_1 = a(); else shared_2 = b(); Unlock(); • Reduce conservative synchronization overhead • Reduce challenge of fine-grained synchronization

  36. * can you list the apps? • emphasize that train != test in methodology page
