
Implementing Associative Functions for ClearSpeed

Dr. Brian Sumner, ClearSpeed Training Session: Random Comments




  1. Implementing Associative Functions for ClearSpeed • Dr. Brian Sumner • ClearSpeed Training Session • Random Comments

  2. Comments • Dr. Brian Sumner used portions of the ClearSpeed Training slides but also covered material not in the slides. • Dr. Sumner projected online manuals on screen when covering material not in the slides. • My handwritten notes are based on Dr. Sumner's presentations, so material referenced may not always be in the slides. • I tried to identify the online references Dr. Sumner used, when possible. • More information on some topics may be available in my or Shannon’s handwritten slides.

  3. Assembly Inserts • Allow low-level instructions or highly optimized assembly routines to be targeted from Cn source code. • Use simple parameter passing between Cn and assembly code. • Avoid function call costs, since the assembly is executed inline. • Allow a user-defined assembler macro to be used as a custom instruction.

  4. Reductions • Sum, product, min, max • Reduction source code has been released • Not sure where it is located • Inefficient to go back to the host. • Use swazzle and have one PE see all pieces. • Then bring the result back into mono. • Mentioned async_function.h (??) and recommended opening it in an editor to see its code.
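
The "one PE sees all pieces" idea can be sketched as a host-side simulation in plain Python rather than Cn. A list stands in for a poly variable (one slot per PE), and `swazzle_shift` and `reduce_to_mono` are hypothetical names chosen for illustration, not ClearSpeed library calls. The one-hop neighbour shift is an assumed model of the swazzle path.

```python
# Simulation only: a Python list stands in for a poly variable (one slot per PE).
# swazzle_shift and reduce_to_mono are hypothetical names, not ClearSpeed APIs.

def swazzle_shift(values, fill):
    """Each PE receives the value held by its lower-numbered neighbour.
    PE 0 has no such neighbour and receives `fill`."""
    return [fill] + values[:-1]

def reduce_to_mono(values, op, identity):
    """Stream every PE's value along the chain one hop at a time.
    The last PE combines everything that passes by, and its total is
    returned as a single (mono) value. Requires at least one PE."""
    acc = list(values)      # per-PE running result
    moving = list(values)   # data currently travelling along the chain
    for _ in range(len(values) - 1):
        moving = swazzle_shift(moving, identity)
        acc = [op(a, m) for a, m in zip(acc, moving)]
    return acc[-1]          # the last PE has seen all pieces
```

The same loop covers sum, product, min and max by changing `op` and its identity value, for example `reduce_to_mono(data, min, float("inf"))`. No value travels to the host inside the loop; only the final result leaves the PE array.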

  5. My Second Day Notes • On the second day, Dr. Sumner seemed to focus in the morning on implementing associative operations • Initially focused on some new instructions • Ref: the CSX 600 Instruction Set Reference Manual was mentioned initially • Move & Cast Instructions • mov • Integer divide is expensive • Some cost a lot • Integer division can be replaced by multiplications
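
The notes do not record which replacement Dr. Sumner had in mind. One widely used general technique, shown here as a Python sketch and not taken from the ClearSpeed manuals, is to divide by a known constant using a multiply by a precomputed reciprocal followed by a shift. The example divides an unsigned 32-bit value by 10. The names `MAGIC_DIV10` and `div10` are illustrative.

```python
# General technique, not ClearSpeed-specific: division by the constant 10
# replaced by one multiply and one shift, valid for unsigned 32-bit inputs.

MAGIC_DIV10 = 0xCCCCCCCD  # ceil(2**35 / 10)

def div10(x):
    """Return x // 10 for 0 <= x < 2**32 without using a divide."""
    if not 0 <= x < 2**32:
        raise ValueError("x must fit in an unsigned 32-bit integer")
    return (x * MAGIC_DIV10) >> 35
```

On hardware, the product needs a 64-bit intermediate, so whether this wins depends on the relative cost of a wide multiply and a divide on the target.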

  6. Second Day - Instructions • Broadcast: • Mono register to poly broadcast • 2nd method: consolidated PIO • Breakover point at 16-32 bytes • Shifts • Stick with constant shifts if possible • Bit operations are the usual ones. Sticking to the smallest size on poly comes out better. • Compare mono value in loop, put in poly value and save time.

  7. Second Day - Instructions (cont-2) • Use ANY or ALL mono-results • If all (cond), go to some other place • Back path – mono slower • Any/All are mono • Like to run mono ahead of poly since slower • ANY/ALL are not cheap instructions • Aside: Go to installation directory • Go to debugger • See asm code for ALL • asm directory has INC file
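
As a rough illustration of how a mono-side loop can be steered by ANY or ALL, here is a plain-Python simulation. Lists stand in for poly values, and `any_pe`, `all_pe` and `halve_until_done` are hypothetical helpers, not ClearSpeed instructions. It is an assumption of this sketch that disabled PEs are ignored by both tests.

```python
# Simulation only: lists model per-PE (poly) values; the results are mono booleans.
# Assumption: PEs whose enable flag is off do not take part in ANY/ALL.

def any_pe(cond, enabled):
    """True if the condition holds on at least one enabled PE."""
    return any(c for c, e in zip(cond, enabled) if e)

def all_pe(cond, enabled):
    """True if the condition holds on every enabled PE
    (vacuously True when no PE is enabled)."""
    return all(c for c, e in zip(cond, enabled) if e)

def halve_until_done(values):
    """Mono loop driven by ANY: iterate while some PE still has work."""
    values = list(values)
    enabled = [True] * len(values)
    rounds = 0
    while any_pe([v > 1 for v in values], enabled):
        # Poly step: only PEs with v > 1 change their value.
        values = [v // 2 if v > 1 else v for v in values]
        rounds += 1
    return rounds, values
```

Each pass through the loop costs one ANY test. Since the notes say ANY/ALL are not cheap, it can pay to do more poly work per test when the iteration count is predictable.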

  8. Second Day - Instructions (cont-3) • Code for “any.enable” • AEO means “All Enables Off” • Takes time to drain poly; must wait on semaphore • End of swazzle movement can go into mono. • result_constant.ins gives some timings • andif.lt instruction • Used to turn a PE on or off • if.cry instruction looks at the status register

  9. Second Day - Instructions (cont-4) • Lots of “if.” commands • One cycle • Enable stack • Else: toggles the least significant bit of the enable stack • Can branch on any predicate bit • pred.set • pred.not
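
The enable-stack behaviour can be modelled in a few lines of Python. This is a sketch under stated assumptions, not a description of the hardware: it assumes a PE is enabled only when every bit on its stack is set, that an "if" pushes the condition as the newest (least significant) bit, and that "endif" pops it. The class `PE` and the `poly_if`, `poly_else` and `poly_endif` helpers are hypothetical names.

```python
# Simulation only. Assumptions: a PE is enabled when all enable bits are set;
# "if" pushes a bit, "else" toggles the newest bit, "endif" pops it.

class PE:
    def __init__(self):
        self.stack = []              # enable bits, newest last

    def enabled(self):
        return all(self.stack)       # empty stack means enabled

def poly_if(pes, conds):
    for pe, c in zip(pes, conds):
        pe.stack.append(bool(c))

def poly_else(pes):
    for pe in pes:
        pe.stack[-1] = not pe.stack[-1]   # toggle the newest bit only

def poly_endif(pes):
    for pe in pes:
        pe.stack.pop()

def sign_labels(values):
    """Run an if/else over all PEs and record which branch each PE executed."""
    pes = [PE() for _ in values]
    labels = [None] * len(values)
    poly_if(pes, [v < 0 for v in values])
    for i, pe in enumerate(pes):
        if pe.enabled():
            labels[i] = "negative"
    poly_else(pes)
    for i, pe in enumerate(pes):
        if pe.enabled():
            labels[i] = "non-negative"
    poly_endif(pes)
    return labels
```

Toggling only the newest bit is what makes nesting work in this model: a PE switched off by an outer "if" keeps its outer bit clear, so an inner "else" cannot switch it back on.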

  10. Second Day – Instructions (cont-4) • branch j.if.pred commands • Some nice instructions with low cycles • Load and store instructions • Can load 1-16 bits in powers of 2 in 6 cycles • Fld (poly with offset) • Forced. Works even if PE disabled • ID (mono with offset)

  11. Second Day – Instructions (cont-5) • Swazzle Instruction • Swazzle.low to high • Swazzle.swap.odd.up • This lets you do reductions pretty easily • Swazzle.lowtohighX4 • Saves 3 cycles over the x2 choice • Look at what is in the library • Swazzle_functions directory • Look for swazzle_up (almost anything)

  12. Building on Previous Background • Do “cd src” • cs_reduction • Use the vi editor to look at an instruction’s code • asm poly float • Just a move gets you the max • asm poly float_cs_min(poly float x, poly float y) • asm poly float_cs_swazzle_down_float • Next, look at sum reduce • mono float_cs_reduce_mono_sum_float(poly float a)

  13. Building on Previous Background -2 • Use swap_odd_up to move data past • Could improve using swazzlex4 • Can build all partial sums as some PEs see everything that goes by. • Mono float_cs_reduce_more_max_float (poly float a) • Location (???) • Location/3.0-1.1 ?? /src/ • cs_reduction • A working version • Inline assembler
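
The remark about partial sums can be illustrated with a small Python simulation. If every value is moved past its higher-numbered neighbours one hop at a time, PE k ends up having seen the values of PEs 0 through k, so it holds the k-th partial sum. The one-hop `shift_up` below is an assumed stand-in for the swazzle movement, and both function names are hypothetical.

```python
# Simulation only: a list models one poly float per PE.
# shift_up is an assumed one-hop model of the data movement, not a library call.

def shift_up(values, fill=0):
    """Every PE takes the value of its lower-numbered neighbour; PE 0 gets fill."""
    return [fill] + values[:-1]

def partial_sums(values):
    """Return the inclusive running sum held by each PE after all data has passed."""
    acc = list(values)
    moving = list(values)
    for _ in range(len(values) - 1):
        moving = shift_up(moving)
        acc = [a + m for a, m in zip(acc, moving)]
    return acc
```

For example, `partial_sums([1, 2, 3, 4])` gives `[1, 3, 6, 10]`, and the last entry is the full sum reduction. This version takes one step per PE; moving data several PEs per step, as the x4 swazzle suggests, would reduce the number of steps but is not modelled here.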

  14. PickOne • Can implement using a min reduction • Find bit that is turned on • Can concatenate PE number with bit number. • Reddaway code – several calls to ANY • Concatenates PE number with bit number • asm poly float_cs_min(poly float x, poly float y) • Play with above code • Could also use mono_cs_reduce_mono_sum_float(poly float a) • Also play with this code
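
A PickOne built on a min reduction might look like the following Python sketch. Each PE forms a key by concatenating its PE number (high part) with the index of its lowest set bit (low part); the minimum key then identifies a single responder. The 32-bit word width, the sentinel for PEs with no bit set, and the function names are all assumptions made for illustration, and Python's built-in `min` stands in for the poly-to-mono min reduction.

```python
# Simulation only: flag_words[i] is the bit field held by PE i.
# Assumptions: words are narrower than WORD_BITS; min() models the min reduction.

WORD_BITS = 32
NO_RESPONDER = float("inf")   # key used by PEs that have no bit set

def lowest_set_bit(word):
    """Index of the least significant 1 bit, or None when word is 0."""
    if word == 0:
        return None
    return (word & -word).bit_length() - 1

def pick_one(flag_words):
    """Return (pe_number, bit_number) of one responder, or None if there is none."""
    keys = []
    for pe, word in enumerate(flag_words):
        bit = lowest_set_bit(word)
        # Concatenate PE number and bit number into one comparable key.
        keys.append(NO_RESPONDER if bit is None else pe * WORD_BITS + bit)
    best = min(keys, default=NO_RESPONDER)
    if best == NO_RESPONDER:
        return None
    return divmod(best, WORD_BITS)
```

Because the PE number occupies the high part of the key, the lowest-numbered responding PE always wins, with ties inside a PE broken by the lowest bit. A single reduction replaces the repeated ANY calls mentioned for the Reddaway code.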

  15. Miscellaneous Instructions • penum • Sets dst to be the PE number of the PE • Gets the PE number that you are running on • thread.get • Normally only single-threaded use is programmed. • cycles.get • Sets dst to be the cycle count of the processor • May need to wait for the poly unit to have stopped • Mono can get way ahead of poly.

  16. Miscellaneous Instructions - 2 • enable.get • Forced function: always runs. • Sets dst to be the enable status of the PE, regardless of enable status • Strongly recommends using more than one thread, so using this is not recommended. • mutex.start • During a mutex section, no other thread can activate • Can keep I/O from interrupting.

  17. End of Focus on Assoc. Opns • Dr. Sumner went back into the training slides at this point. To see where this fits in, the next two topics were • ClearSpeed Vector Math Library • ClearSpeed Visual Profiler • However, some items mentioned still seemed to be related to implementing associative operations.

  18. Possibly Related Comments • Have to initialize vector math library operations each time used. Could avoid by using a global variable. • Before compiling, source a file in install directory, e.g., using bashrc or cshrc. • More comments in my notes, if needed • ANY or ALL: Simulator 1 may be a bit faster

  19. Possibly Related Comments -2 • Inline assembly function: must be careful not to abuse registers. • Performance evaluation: • Start with a paper estimate of how long it should take • FLOP rates • IPO rates • Work out by hand • Use the cycle counter to tell if this is wrong. • Tells at what cycle an operation executed but not when poly functions execute • Additional related comments on some topics in my notes, if needed.
