Some symbols
ISR, Task, Timer, Binary Semaphore, Mailbox, Message Queue, Counting Semaphore, Event Flag, Mutex Semaphore
William Sandqvist william@kth.se
6-5 Synchronization with semaphores
Task 1:
    …
    a = f1(…);
    // Synchronization point
    f2(b);
    …
Task 2:
    …
    b = g1(…);
    // Synchronization point
    g2(a);
    …
Operations: accessSem(Sem) and releaseSem(Sem)
Synchronize the code with binary semaphores!
Binary semaphores Sem1 and Sem2
Sem1 and Sem2 are created with the value "1" at start!
Task 1:
    …
    accessSem(Sem1);
    …
    a = f1(…);
    releaseSem(Sem1);
    // Synchronization point
    accessSem(Sem2);
    f2(b);
    releaseSem(Sem2);
    …
Task 2:
    …
    accessSem(Sem2);
    …
    b = g1(…);
    releaseSem(Sem2);
    // Synchronization point
    accessSem(Sem1);
    g2(a);
    releaseSem(Sem1);
    …
With both semaphores starting at 1, accessSem never has to wait for the other task's result, so this gives mutual exclusion but does not guarantee that b is computed before f2(b) runs.
Binary semaphores Sem1 and Sem2
Sem1 and Sem2 are created with the value "0" at start!
Task 1:
    …
    a = f1(…);
    releaseSem(Sem1);
    // Synchronization point
    accessSem(Sem2);
    f2(b);
    releaseSem(Sem2);
    …
Task 2:
    …
    b = g1(…);
    releaseSem(Sem2);
    // Synchronization point
    accessSem(Sem1);
    g2(a);
    releaseSem(Sem1);
    …
With both semaphores starting at 0, accessSem(Sem2) blocks Task 1 until Task 2 has computed b and released Sem2, and accessSem(Sem1) blocks Task 2 until a is ready.
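The two-semaphore pattern with initial value 0 can be tried out on a desktop system. The following is a minimal sketch that assumes POSIX threads and POSIX semaphores stand in for the RTOS tasks and for accessSem/releaseSem (sem_wait and sem_post). The values 10 and 20 are placeholders for f1(…) and g1(…), and the names run_rendezvous, seen_a and seen_b are made up for the illustration.

```c
#include <pthread.h>
#include <semaphore.h>

static sem_t sem1, sem2;     /* Sem1 and Sem2, both created with value 0 */
static int a, b;             /* shared results of f1 and g1 */
static int seen_a, seen_b;   /* what g2 and f2 actually received */

static void *task1(void *unused)
{
    (void)unused;
    a = 10;                  /* a = f1(...): placeholder value */
    sem_post(&sem1);         /* releaseSem(Sem1): a is now valid */
    sem_wait(&sem2);         /* accessSem(Sem2): block until b is valid */
    seen_b = b;              /* f2(b) */
    sem_post(&sem2);         /* releaseSem(Sem2), as on the slide */
    return NULL;
}

static void *task2(void *unused)
{
    (void)unused;
    b = 20;                  /* b = g1(...): placeholder value */
    sem_post(&sem2);         /* releaseSem(Sem2): b is now valid */
    sem_wait(&sem1);         /* accessSem(Sem1): block until a is valid */
    seen_a = a;              /* g2(a) */
    sem_post(&sem1);         /* releaseSem(Sem1), as on the slide */
    return NULL;
}

/* Runs both tasks once; returns 1 if each task saw the other's result. */
int run_rendezvous(void)
{
    pthread_t t1, t2;

    a = b = 0;
    seen_a = seen_b = -1;
    sem_init(&sem1, 0, 0);   /* initial value 0 */
    sem_init(&sem2, 0, 0);   /* initial value 0 */

    if (pthread_create(&t1, NULL, task1, NULL) != 0)
        return 0;
    if (pthread_create(&t2, NULL, task2, NULL) != 0) {
        /* No task will release Sem2, so release it here to let task1 finish. */
        sem_post(&sem2);
        pthread_join(t1, NULL);
        return 0;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    sem_destroy(&sem1);
    sem_destroy(&sem2);
    return seen_b == 20 && seen_a == 10;
}
```

Calling run_rendezvous() from main should return 1 regardless of which thread is scheduled first. On older toolchains the program may need to be built with the -pthread option.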
Task Triplet
P(max execution time, period, deadline)
Creating periodic tasks: a soft timer can release a semaphore periodically, and a task can access the semaphore before each execution.
Finite Impulse Response filter
You have programmed a FIR filter in LAB 2. Every filter stage needs a MAC operation. MAC = Multiply and ACcumulate.
    sample = input();
    x[oldest] = sample;
    y = 0;
    for (k = 0; k < N; k++) {
        y += h[k] * x[(oldest + k) % N];
    }
    oldest = (oldest + 1) % N;
    output(y);
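A self-contained version of the MAC loop can look like the sketch below. It is not the LAB 2 code: the tap count N = 4, the averaging coefficients in h, and the function name fir_step are assumptions chosen for the illustration, and the buffer is indexed from the newest sample backwards so that y[n] = h[0]·x[n] + h[1]·x[n-1] + … .

```c
#include <stddef.h>

#define N 4                        /* number of taps (assumed for the example) */

static const double h[N] = {0.25, 0.25, 0.25, 0.25}; /* moving-average taps */
static double x[N];                /* circular sample buffer, starts at zero */
static size_t newest = N - 1;      /* index of the most recent sample */

/* Store one input sample and return one output sample. */
double fir_step(double sample)
{
    double y = 0.0;
    size_t k;

    newest = (newest + 1) % N;     /* overwrite the oldest sample */
    x[newest] = sample;

    for (k = 0; k < N; k++) {
        /* One MAC per tap: h[k] pairs with the sample k steps back. */
        y += h[k] * x[(newest + N - k) % N];
    }
    return y;
}
```

Feeding the constant input 4.0 into an empty filter gives 1.0, 2.0, 3.0, 4.0 as the buffer fills, which is the expected ramp of a four-tap moving average.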
7-1 Hardware Accelerators
DSP application. 15% of the execution time is spent in calls to a function that performs a MAC operation (Multiply and ACcumulate). An alternative is to use another processor that has a MAC instruction. Suppose that the MAC instruction is 10 times faster than the function call (the ratio implied by the figures below). How much could the total execution time be reduced if the processor with the MAC instruction is used?
Without MAC: 15% + 85% = 100%
With MAC: 1.5% + 85% + 13.5% = 100%
The execution time is reduced by 13.5%, so a program could use the freed 13.5% of the time to do more work.
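The arithmetic on this slide is the usual Amdahl's law calculation, which the slide does not name. As a small sketch, with function and parameter names that are assumptions made for the illustration:

```c
/* Fraction of the original execution time that remains when a part
 * taking `fraction` of the time is made `factor` times faster. */
double remaining_time(double fraction, double factor)
{
    return (1.0 - fraction) + fraction / factor;
}

/* Overall speedup for the whole program (Amdahl's law). */
double overall_speedup(double fraction, double factor)
{
    return 1.0 / remaining_time(fraction, factor);
}
```

With fraction = 0.15 and factor = 10, remaining_time gives 0.865 (that is, 1.5% + 85%), and overall_speedup gives roughly 1.16.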
7-3 Hardware accelerator
X = A * B + C * D
Processor only
X = A * B + C * D
    load p1,A    # 2 time units
    load p2,B    # 2
    load p3,C    # 2
    load p4,D    # 2
    mul p5,p1,p2 # 8
    mul p6,p3,p4 # 8
    add p7,p5,p6 # 1
    store p7,X   # 2
Grand total = 27 time units
Can the Hardware Accelerator improve on this?
Processor and Accelerator
A data flow graph (DFG) reveals possible parallelism.
T = C * D
X = A * B + T
Processor:
    load p1,A    # 2
    load p2,B    # 2
    mul p3,p1,p2 # 8
Accelerator, in parallel with the above:
    load a1,C    # 2
    load a2,D    # 2
    mul a3,a1,a2 # 1
    store T,a3   # 2 (=7)
Processor:
    load p4,T    # 2
    add p5,p4,p3 # 1
    store p5,X   # 2
Grand total = 17 time units
Parallelism!
Speedup
Speedup = 27 / 17 ≈ 1.59
All mul's with the accelerator
S = A * B
T = C * D
X = S + T
Accelerator:
    load a1,A    # 2
    load a2,B    # 2
    load a3,C    # 2
    load a4,D    # 2
    mul a5,a1,a2 # 1
    mul a6,a3,a4 # 1
    store S,a5   # 2
    store T,a6   # 2
Processor:
    load p1,S    # 2
    load p2,T    # 2
    add p3,p2,p1 # 1
    store p3,X   # 2
Grand total = 21 time units
No parallelism!
Speedup
Speedup = 27 / 21 ≈ 1.29
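The three grand totals can be recomputed from the per-instruction costs in the listings. The sketch below assumes, as the listings suggest, that the processor and the accelerator only overlap in the second variant and that the processor waits for whichever branch finishes last. All function and constant names are made up for the illustration.

```c
/* Instruction costs in time units, taken from the listings. */
enum { LOAD = 2, STORE = 2, ADD = 1, MUL_CPU = 8, MUL_ACC = 1 };

static int max_int(int a, int b) { return a > b ? a : b; }

/* Processor only: four loads, two slow multiplies, add, store. */
int time_processor_only(void)
{
    return 4 * LOAD + 2 * MUL_CPU + ADD + STORE;
}

/* Processor computes A*B while the accelerator computes T = C*D. */
int time_processor_and_accelerator(void)
{
    int cpu_branch = 2 * LOAD + MUL_CPU;          /* 12 */
    int acc_branch = 2 * LOAD + MUL_ACC + STORE;  /* 7 */
    /* The final load/add/store starts when both branches are done. */
    return max_int(cpu_branch, acc_branch) + LOAD + ADD + STORE;
}

/* Both multiplies on the accelerator, then the processor adds: serial. */
int time_all_muls_on_accelerator(void)
{
    int acc_part = 4 * LOAD + 2 * MUL_ACC + 2 * STORE;  /* 14 */
    int cpu_part = 2 * LOAD + ADD + STORE;              /* 7 */
    return acc_part + cpu_part;
}

/* Speedup relative to a baseline time. */
double schedule_speedup(int before, int after)
{
    return (double)before / (double)after;
}
```

The functions return 27, 17 and 21 time units, which gives speedups of about 1.59 and 1.29 over the processor-only version. The accelerator's faster multiply helps less than the overlap does, because moving all the data through it adds loads and stores.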
Accelerators in the Cyclone II chip
The Cyclone II chip has embedded multipliers that can be used as hardware accelerators. (They can be connected to the embedded Nios II processor over the Avalon bus.) Up to 150 multiplier units of 18 bit × 18 bit can be used!
5-9 Cache performance
This is an example of a problem from part B of the written exam.
    int i;
    int y = 0;
    int u[60];
    int v[60];
    . . .
    for (i = 0; i < 60; i++)
        y += u[i] * v[i];
    . . .
Data cache size 128 bytes, cache line/block 32 bytes (8 int). u and v are located in sequence in memory. Variables i and y are stored in processor registers.
Hit rate estimation
Draw the memory and the cache organized in cache lines (blocks). A block is then 8 int. Vectors u and v each occupy 7.5 blocks in memory. We don't know whether the mapping looks exactly this way, but the conflicts will be the same.
    u[0] M, v[0] M
    u[1…3] HHH, v[1…3] HHH
    u[4] H, v[4] M
    u[5…7] MMM, v[5…7] MMM (conflict misses)
    …
Pattern: MM HHH HHH H M MMM MMM …
Hit rate ≈ 50% (the loop stops at 59, so elements 60…63 are not included and the hit rate will actually be > 50%).
Program changes for max hit rate
    int i;
    int y = 0;
    int u[72]; /* +12 dummy */
    int v[60];
    . . .
    for (i = 0; i < 60; i++)
        y += u[i] * v[i];
    . . .
v is moved 12 ints by extending u with dummy elements. The pattern for each cache line becomes MHHHHHHH, a hit rate of 88%. Is 100% possible? No, there must always be one cold miss for every cache line. The index i counts forwards: every int is used only once, and no int is reused!
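Both hit-rate estimates can be checked with a small simulation. The sketch below makes assumptions the problem text does not state: the cache is direct-mapped, u starts on a cache-line boundary at address 0, and v follows directly after u. The function name count_hits and its parameter are invented for the illustration.

```c
#define LINE_INTS  8    /* 32-byte cache line = 8 int */
#define NUM_LINES  4    /* 128-byte cache / 32-byte lines */
#define LOOP_COUNT 60   /* iterations of the dot-product loop */

/* Counts cache hits for the loop y += u[i] * v[i], where u has
 * u_len elements and v is placed directly after u in memory. */
int count_hits(int u_len)
{
    int tag[NUM_LINES];  /* block number held by each line, -1 = empty */
    int hits = 0;
    int i, j;

    for (j = 0; j < NUM_LINES; j++)
        tag[j] = -1;

    for (i = 0; i < LOOP_COUNT; i++) {
        /* Positions of u[i] and v[i], counted in ints from the start of u. */
        int index[2];
        index[0] = i;
        index[1] = u_len + i;

        for (j = 0; j < 2; j++) {
            int block = index[j] / LINE_INTS;
            int line = block % NUM_LINES;  /* direct mapping */
            if (tag[line] == block)
                hits++;
            else
                tag[line] = block;         /* miss: load the block */
        }
    }
    return hits;
}
```

Under these assumptions count_hits(60) returns 62 of 120 accesses (about 52%), and count_hits(72) returns 104 of 120 (about 87%). The second figure is slightly below the 7/8 per-line estimate because the last cache line of each vector is only half used.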
Good Luck!