1 / 16

Example: Adding new instructions - prefetch

Example: Adding new instructions - prefetch. New PTX Instructions: Prefetch , Prefetchu. Support new instructions: prefetch.global.L1 [ address ] prefetchu.L1 [address] Modify PTX Internal Representation Add Parser and Emitter Support Implement instruction for devices PTX Emulator

abner
Download Presentation

Example: Adding new instructions - prefetch

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Example: Adding new instructions - prefetch

  2. New PTX Instructions: Prefetch, Prefetchu • Support new instructions: • prefetch.global.L1 [address] • prefetchu.L1 [address] • Modify PTX Internal Representation • Add Parser and Emitter Support • Implement instruction for devices • PTX Emulator • NVIDIA GPU • Other devices as needed

  3. Modify PTX Internal Representation: add opcodes class PTXInstruction { enumOpcode{ Prefetch, Prefetchu, }; enumCacheLevel{ L1, L2, CacheLevel_invalid }; static std::string toString(CacheLevel cache); CacheLevelcacheLevel; }; • Add opcodes and modifiers • ocelot/ir/interface/PTXInstruction.h

  4. Modify PTX Internal Representation: add emitters • Add new opcodes and modifiers to emitter • ocelot/ir/implementation/PTXInstruction.cpp std::string ir::PTXInstruction::toString(CacheLevel cache) { switch (cache) { L1: return "L1"; L2: return "L2"; default: return “”; } } std::string ir::PTXInstruction::toString() { switch (opcode) { case Prefetch: { return guard() + "prefetch." + PTXOperand::toString(addressSpace) + "." + PTXInstruction::toString(cacheLevel) + " " + d.toString(); } case Prefetchu: { return guard() + "prefetchu.L1 " + d.toString(); } } }

  5. Modify PTX Internal Representation: validation • Check for valid address space, cache level, and address mode of address operand • ocelot/ir/implementation/PTXInstruction.cpp std::string ir::PTXInstruction::valid () { switch (opcode) { case Prefetch: { if (!(cacheLevel == L1 || cacheLevel == L2)) { return "cache level must be L1 or L2"; } if (!(addressSpace == Local || addressSpace == Global)) { return "address space must be .local or .global, not " + toString(addressSpace); } if (!(d.addressMode == PTXOperand::Indirect || d.addressMode == PTXOperand::Address || d.addressMode == PTXOperand::Immediate)) { return "address mode of destination operand must be Indirect, Address, or Immediate. Not " + PTXOperand::toString(d.addressMode); } } break; } }

  6. Add Parser Support • ptx.ll • PTXParser • ptxgrammar.yy • ocelot/ • parser/ -- parser (to PTX IR)

  7. Add Parser Support: lexical analysis • Define lexical analysis rules for opcode and cache level tokens • ocelot/parser/implementation/ptx.ll ".L1" { yylval->value = TOKEN_L1; return TOKEN_L1; } ".L2" { yylval->value = TOKEN_L2; return TOKEN_L2; } "prefetch" { sstrcpy( yylval->text, yytext, 1024 ); \ return OPCODE_PREFETCH; } "prefetchu" { sstrcpy( yylval->text, yytext, 1024 ); \ return OPCODE_PREFETCHU; }

  8. Add Parser Support: modify class PTXParser • Enhance class PTXParser (translates tokens to PTX IR) • ocelot/parser/implementation/PTXParser.cpp class PTXParser { class State { void cacheLevel(inttoken ); } ir::PTXInstruction::CacheLeveltokenToCacheLevel(int token); } • ocelot/parser/implementation/PTXParser.cpp ir::PTXInstruction::CacheLevelPTXParser::tokenToCacheLevel(int token) { switch (token) { case TOKEN_L1: return ir::PTXInstruction::L1; case TOKEN_L2: return ir::PTXInstruction::L2; default: break; } return ir::PTXInstruction::CacheLevel_invalid; } void PTXParser::State::cacheLevel(int token ) { statement.instruction.cacheLevel = tokenToCacheLevel(token); }

  9. Add Parser Support: modify class PTXParser • Translate new opcodes from string to enumir::PTXInstruction::Opcode • ocelot/parser/implementation/PTXParser.cpp ir::PTXInstruction::OpcodePTXParser::stringToOpcode( std::string string ) { if( string == "prefetch" ) return ir::PTXInstruction::Prefetch; if( string == "prefetchu" ) return ir::PTXInstruction::Prefetchu; }

  10. Add Parser Support: modify PTX grammar • Define parse rules for prefetch and prefetchu instructions • ocelot/parser/implementation/ptx.ll %token<text> OPCODE_PREFETCH OPCODE_PREFETCHU %token<value> TOKEN_L1 TOKEN_L2 instruction : .... | prefetch | prefetchu | .... cacheLevel : TOKEN_L1 | TOKEN_L2 { state.cacheLevel( $<value>1 ); }; prefetch : OPCODE_PREFETCH addressSpacecacheLevel '[' memoryOperand ']' ';' { state.instruction( $<text>1 ); }; prefetchu : OPCODE_PREFETCHU cacheLevel '[' memoryOperand ']' ';' { state.instruction( $<text>1 ); };

  11. Supported devices: PTX Emulator • Add methods to evaluate prefetch and prefetchu instructions • ocelot/executive/interface/CooperativeThreadArray.h class CooperativeThreadArray { void eval_Prefetch(CTAContext &context, constir::PTXInstruction &instr); void eval_Prefetchu(CTAContext&context, constir::PTXInstruction &instr); } • ocelot/executive/implementation/CooperativeThreadArray.cpp void executive::CooperativeThreadArray::execute(constir::Dim3& block) { do { PTXInstruction & instr = instructions[PC]; switch (instr.opcode) { case PTXInstruction::Prefetch: eval_Prefetch(context, instr); break; case PTXInstruction::Prefetchu: eval_Prefetchu(context, instr); break; } } while (running); }

  12. Supported devices: PTX Emulator • Implement methods to evaluate prefetch and prefetchu • ocelot/executive/implementation/CooperativeThreadArray.cpp void executive::eval_Prefetch(CTAContext &context, constir::PTXInstruction &instr) { currentEvent.memory_size = 1; for (intthreadID = 0; threadID < threadCount; threadID++) { if (!context.predicated(threadID, instr)) { continue; } const char *source = 0; switch (instr.d.addressMode) { case PTXOperand::Indirect: source += getRegAsU64(threadID, instr.d.reg); break; case PTXOperand::Address: case PTXOperand::Immediate: source += instr.d.imm_uint; break; default: throw RuntimeException("unsupported", context.PC, instr); } source += instr.d.offset; currentEvent.memory_addresses.push_back((ir::PTXU64)source); } trace(); }

  13. Example: PTX Emulator • Sample PTX kernel augmented with prefetch instruction virtual void event(const trace::TraceEvent & event){ if (event.instruction->opcode == ir::PTXInstruction::Prefetch) { std::cout << event.instruction->toString() << "\n"; trace::TraceEvent::U64Vector::const_iterator address = event.memory_addresses.begin(); for (int tid = 0; tid < event.active.size(); tid++) { std::cout << " t" << tid << " - 0x" << std::hex << *address << std::dec << "\n"; ++address; } } } prefetch.global.L1 [%r4] t0 - 0x1e16800 t1 - 0x1e16804 t2 - 0x1e16808 t3 - 0x1e1680c __global__ void sequence(int*A) { A[threadIdx.x] *= 2; } .entry sequence(.param .u64 param) { .reg .s32 %r<6>; .reg .s64 %rl<5>; ld.param.u64 %rl1, [param]; cvta.to.global.u64 %rl2, %rl1; mov.u32 %r1, %tid.x; mul.wide.u32 %rl3, %r1, 4; add.s64 %rl4, %rl2, %rl3; prefetch.global.L1 [%rl4]; ld.global.u32 %r2, [%rl4]; shl.b32 %r4, %r2, 1; st.global.u32 [%rl4], %r4; ret; }

  14. Supported devices: NVIDIA GPU • No additional support required • prefetch and prefetchu instructions do not produce values • PTX emitter is sufficient to execute on native GPU • ir::PTXInstruction::toString( ) • ir::PTXInstruction::valid( ) • To support other devices • Multicore CPU: • Add translation rules to PTX-to-LLVM translator • Target LLVM prefetchintrinsics • AMD GPU: • Depends on support from CAL IL

  15. Example Application #include <fstream> #include <iostream> #include <ocelot/cuda/interface/cuda_runtime.h> #include <ocelot/api/interface/ocelot.h> #include <ocelot/trace/interface/TraceEvent.h> #include <ocelot/trace/interface/TraceGenerator.h> class TraceGen: public trace::TraceGenerator { public: virtual void event(const trace::TraceEvent & event){ if (event.instruction->opcode== ir::PTXInstruction::Prefetch) { std::cout << event.instruction->toString() << "\n"; } } }; int main() { TraceGentraceGen; ocelot::addTraceGenerator(traceGen, true); constint N = 4; int *devPtr; cudaMalloc((void **)&devPtr, sizeof(int)*N); std::ifstreamptxFile("example-sequence.ptx"); ocelot::registerPTXModule(ptxFile, "example.ptx"); cudaConfigureCall(dim3(1,1), dim3(N, 1)); cudaSetupArgument(&devPtr, sizeof(int *), 0); ocelot::launch("example.ptx", "sequence"); cudaFree(devPtr); return 0; }

  16. Questions?

More Related