This paper presentation discusses the concept of Transactional Coherence and Consistency (TCC) and its application in hardware and software systems. It explores the current hardware architecture, TCC in hardware, TCC in software, performance evaluation, and provides a conclusion.
Coe-502 paper presentation 2 Transactional Coherence and Consistency Presenters: Muhammad Mohsin Butt. (g201103010)
Outline • Introduction • Current Hardware • TCC in Hardware • TCC in Software • Performance Evaluation • Conclusion
Introduction • Transactional Coherence and Consistency (TCC) provides a lock-free transactional model that simplifies parallel hardware and software. • Transactions, which are defined by the programmer, are the basic unit of parallel work. • Memory coherence, communication and memory consistency are implicit in a transaction.
Current Hardware • Provides the illusion of a single shared memory to all processors. • The problem is divided into parallel tasks that work on shared data held in shared memory. • Complex cache coherence protocols are required. • Memory consistency models are also required to ensure the correctness of the program. • Locks are used to prevent data races and serialize access to shared data. • Excessive locking overhead can degrade performance.
TCC in Hardware • Processors execute speculative transactions in a continuous cycle. • A transaction is a sequence of instructions, marked by software, that is guaranteed to execute and complete atomically. • Provides an "all transactions, all the time" model, which simplifies parallel hardware and software.
TCC in Hardware • While a transaction executes, it accumulates its writes as a block in a local buffer. • After the transaction completes, the hardware arbitrates system-wide for permission to commit it. • After acquiring permission, the node broadcasts the writes of the transaction as a single packet. • Transmission as a single packet reduces the number of inter-processor messages and arbitrations. • Other processors snoop on these write packets to detect dependence violations.
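The buffer-then-commit flow above can be sketched in plain C. This is a minimal software model, not the hardware design: the names `write_buffer`, `buffer_store` and `commit_packet`, and the sizes `MAX_WRITES` and `MEM_WORDS`, are assumptions made for illustration, and system-wide arbitration and snooping are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WRITES 16   /* hypothetical write-buffer capacity */
#define MEM_WORDS  32   /* hypothetical shared-memory size in words */

typedef struct { size_t addr; int value; } write_entry;

typedef struct {
    write_entry entries[MAX_WRITES];
    size_t count;
} write_buffer;

/* Record a speculative store in the local buffer only.
 * Returns 0 on success, -1 if the buffer is full (overflow). */
int buffer_store(write_buffer *wb, size_t addr, int value)
{
    assert(addr < MEM_WORDS);
    /* A later store to the same address overwrites the buffered value. */
    for (size_t i = 0; i < wb->count; i++) {
        if (wb->entries[i].addr == addr) {
            wb->entries[i].value = value;
            return 0;
        }
    }
    if (wb->count == MAX_WRITES)
        return -1;
    wb->entries[wb->count].addr = addr;
    wb->entries[wb->count].value = value;
    wb->count++;
    return 0;
}

/* Commit: apply every buffered write to shared memory as one packet,
 * then empty the buffer. Returns the number of writes in the packet.
 * Assumes commit permission has already been granted. */
size_t commit_packet(write_buffer *wb, int shared_mem[MEM_WORDS])
{
    size_t n = wb->count;
    for (size_t i = 0; i < n; i++)
        shared_mem[wb->entries[i].addr] = wb->entries[i].value;
    wb->count = 0;
    return n;
}
```

Shared memory stays unchanged until `commit_packet` runs, which is the property that lets a violated transaction be discarded simply by clearing the buffer.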
TCC in Hardware • TCC simplifies cache design. • Processors hold data in unmodified and speculatively modified forms. • During snooping, a line is invalidated if the commit packet contains only its address. • The line is updated if the commit packet contains both address and data. • Protection against data dependence violations. • If a processor has read from any address in the commit packet, its transaction is re-executed.
TCC in Hardware • Current CMPs need features that provide speculative buffering of memory references and commit arbitration control. • A mechanism for gathering all modified cache lines from each transaction into a single packet is required. • A write buffer completely separate from the cache. • An address buffer containing a list of tags for lines containing data to be committed.
TCC in Hardware • Read bits • Set on a speculative read during a transaction. • The current transaction is violated and restarted if the snoop protocol sees a commit packet containing the address of a location whose read bit is set. • Modified bits • During a transaction, stores set this bit to 1. • On a violation, lines with the modified bit set to 1 are invalidated.
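The read-bit and modified-bit rules above can be illustrated with a small C model. This is a sketch under stated assumptions rather than the actual cache design: `cache_line`, `tcc_cache`, `spec_load`, `spec_store` and `snoop_commit` are hypothetical names, each address maps to its own line, and the ordinary invalidate or update action for lines that were not read is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_LINES 8   /* hypothetical tiny cache: one line per address */

typedef struct {
    bool valid;
    bool read_bit;      /* set by a speculative load in this transaction */
    bool modified_bit;  /* set by a speculative store in this transaction */
    int  data;
} cache_line;

typedef struct {
    cache_line lines[NUM_LINES];
} tcc_cache;

/* Speculative load: remember that this transaction read the line. */
int spec_load(tcc_cache *c, size_t addr)
{
    assert(addr < NUM_LINES);
    c->lines[addr].read_bit = true;
    return c->lines[addr].data;
}

/* Speculative store: write the line and mark it as modified. */
void spec_store(tcc_cache *c, size_t addr, int value)
{
    assert(addr < NUM_LINES);
    c->lines[addr].data = value;
    c->lines[addr].valid = true;
    c->lines[addr].modified_bit = true;
}

/* Snoop one address from another processor's commit packet.
 * Returns true if the local transaction is violated and must restart. */
bool snoop_commit(tcc_cache *c, size_t addr)
{
    assert(addr < NUM_LINES);
    if (!c->lines[addr].read_bit)
        return false;   /* no dependence on this address */

    /* Violation: invalidate speculatively written lines, clear all bits. */
    for (size_t i = 0; i < NUM_LINES; i++) {
        if (c->lines[i].modified_bit)
            c->lines[i].valid = false;
        c->lines[i].read_bit = false;
        c->lines[i].modified_bit = false;
    }
    return true;
}
```

A caller would re-run the transaction body whenever `snoop_commit` returns true. Lines that were only read stay valid in this model, while lines the transaction wrote are dropped.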
TCC in Software • Programming with TCC is a three-step process. • Divide the program into transactions. • Specify the transaction order. • Can be relaxed if not required. • Tune performance. • TCC provides feedback on where in the program violations occur frequently.
Loop-Based Parallelization • Consider a histogram calculation over 1000 integer percentages (0 to 100). • /* input */ • int *data = load_data(); • int i, buckets[101] = {0}; /* counts start at zero */ • for (i = 0; i < 1000; i++) { • buckets[data[i]]++; • } • /* output */ • print_buckets(buckets);
Loop-Based Parallelization • The loop can be parallelized by replacing for with t_for: • t_for (i = 0; i < 1000; i++) • Each loop iteration becomes a separate transaction. • When two parallel iterations try to update the same histogram bucket, the TCC hardware causes the later transaction to violate, forcing it to re-execute. • A conventional shared-memory model would require locks to protect the histogram bins. • Can be further optimized, when iteration order does not matter, using • t_for_unordered()
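The histogram loop can be written as a function that compiles with an ordinary C compiler. In this sketch `t_for` is defined as a macro that expands to a plain `for`, which is an assumption made so that the code runs sequentially without TCC support; on a TCC system each iteration would instead run as its own transaction. The function name `histogram` and the constant `NUM_BUCKETS` are also illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: stand-in so the sketch builds without TCC support.
 * With TCC, t_for would run each iteration as a separate transaction. */
#ifndef t_for
#define t_for for
#endif

#define NUM_BUCKETS 101   /* percentages 0..100 */

void histogram(const int *data, size_t n, int buckets[NUM_BUCKETS])
{
    size_t i;

    for (i = 0; i < NUM_BUCKETS; i++)
        buckets[i] = 0;

    t_for (i = 0; i < n; i++) {
        assert(data[i] >= 0 && data[i] < NUM_BUCKETS);
        /* No lock is taken here: under TCC, two iterations that hit the
         * same bucket conflict and the later one is re-executed. */
        buckets[data[i]]++;
    }
}
```

The loop body contains no synchronization code at all, which is the point of the transactional version: correctness is the hardware's job, and the programmer only marks where transactions begin and end.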
Fork-Based Parallelization • t_fork() forces the parent transaction to commit and creates two completely new transactions. • One continues executing the remaining code. • The second starts executing the function passed as a parameter, e.g. • /* Initial setup */ • int PC = INITIAL_PC; • int opcode = i_fetch(PC); • while (opcode != END_CODE) { • t_fork(execute, &opcode, 1, 1, 1); • increment_PC(opcode, &PC); • opcode = i_fetch(PC); • }
Explicit Transaction Commit Ordering • Provides partial ordering. • Done by assigning two parameters to each transaction: • Sequence number and phase number. • Transactions with the same sequence number commit in an order defined by the programmer. • Transactions with different sequence numbers are independent. • The order among transactions with the same sequence number is set by the phase number. • The transaction with the lowest phase number commits first.
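The sequence and phase rule can be expressed as a small predicate in C. This is a sketch of the ordering rule only, assuming a software list of pending transactions; the `txn` struct and the function `may_commit` are hypothetical and do not correspond to a TCC API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  sequence;    /* transactions in the same sequence are ordered */
    int  phase;       /* lower phase commits first within a sequence */
    bool committed;
} txn;

/* A pending transaction may commit when no other uncommitted transaction
 * with the same sequence number has a lower phase number. Transactions
 * in other sequences never block it. */
bool may_commit(const txn *txs, size_t n, size_t idx)
{
    assert(idx < n);
    if (txs[idx].committed)
        return false;   /* already done, nothing left to commit */

    for (size_t i = 0; i < n; i++) {
        if (i == idx || txs[i].committed)
            continue;
        if (txs[i].sequence == txs[idx].sequence &&
            txs[i].phase < txs[idx].phase)
            return false;
    }
    return true;
}
```

Two transactions with the same sequence and the same phase may both commit under this predicate, in either order, which matches the partial ordering described above.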
Performance Evaluation • Maximize parallelization. • Create as many transactions as possible. • Minimize violations. • Keep transactions small to reduce the amount of work lost on a violation. • Minimize transaction overhead. • Do not make transactions too small. • Avoid buffer overflow. • Overflow can result in excessive serialization.
Performance Evaluation • Base case • Simple parallelization without any optimization. • Unordered • Find loops that can run unordered. • Reduction • Find areas that can exploit reduction operations. • Privatization • Privatize, per transaction, the variables that cause violations. • Using t_commit() • Break large transactions into smaller ones that still execute on the same processor. This reduces the work lost to violations and prevents buffer overflow. • Loop adjustments • Use the loop adjustment optimizations provided by the compiler.
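Privatization and reduction can be shown on the histogram example. The sketch below is ordinary sequential C and only illustrates the data layout: each chunk of the input, standing in for one transaction, counts into its own private array, and the private counts are then added into the shared result. The names `histogram_chunk` and `histogram_privatized` and the `chunk` parameter are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NUM_BUCKETS 101   /* percentages 0..100 */

/* Work of one hypothetical transaction over data[begin..end):
 * it touches only its private bucket array, so it cannot conflict. */
static void histogram_chunk(const int *data, size_t begin, size_t end,
                            int priv[NUM_BUCKETS])
{
    memset(priv, 0, NUM_BUCKETS * sizeof priv[0]);
    for (size_t i = begin; i < end; i++) {
        assert(data[i] >= 0 && data[i] < NUM_BUCKETS);
        priv[data[i]]++;
    }
}

void histogram_privatized(const int *data, size_t n, size_t chunk,
                          int buckets[NUM_BUCKETS])
{
    assert(chunk > 0);
    memset(buckets, 0, NUM_BUCKETS * sizeof buckets[0]);

    for (size_t begin = 0; begin < n; begin += chunk) {
        size_t end = (begin + chunk < n) ? begin + chunk : n;
        int priv[NUM_BUCKETS];

        histogram_chunk(data, begin, end, priv);

        /* Reduction step: merge the private counts into the shared result. */
        for (size_t b = 0; b < NUM_BUCKETS; b++)
            buckets[b] += priv[b];
    }
}
```

Only the short merge step touches shared data, so in a transactional setting most of the work would be done on state that no other transaction reads or writes.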
Performance Evaluation • Inner loops had too many violations. • Using outer loop_adjust improved the result. • Privatization and t_commit() improve performance.
Performance Evaluation • CMP performance is close to ideal TCC for a small number of processors.
Conclusions • Bandwidth limitation is still a problem for scaling TCC to more processors. • No support for nested for loops. • Dynamic optimization techniques are still required to automate performance tuning on TCC.