170 likes | 354 Views
Example: Adding new instructions - prefetch. New PTX Instructions: Prefetch , Prefetchu. Support new instructions: prefetch.global.L1 [ address ] prefetchu.L1 [address] Modify PTX Internal Representation Add Parser and Emitter Support Implement instruction for devices PTX Emulator
E N D
New PTX Instructions: Prefetch, Prefetchu • Support new instructions: • prefetch.global.L1 [address] • prefetchu.L1 [address] • Modify PTX Internal Representation • Add Parser and Emitter Support • Implement instruction for devices • PTX Emulator • NVIDIA GPU • Other devices as needed
Modify PTX Internal Representation: add opcodes class PTXInstruction { enumOpcode{ Prefetch, Prefetchu, }; enumCacheLevel{ L1, L2, CacheLevel_invalid }; static std::string toString(CacheLevel cache); CacheLevelcacheLevel; }; • Add opcodes and modifiers • ocelot/ir/interface/PTXInstruction.h
Modify PTX Internal Representation: add emitters • Add new opcodes and modifiers to emitter • ocelot/ir/implementation/PTXInstruction.cpp std::string ir::PTXInstruction::toString(CacheLevel cache) { switch (cache) { L1: return "L1"; L2: return "L2"; default: return “”; } } std::string ir::PTXInstruction::toString() { switch (opcode) { case Prefetch: { return guard() + "prefetch." + PTXOperand::toString(addressSpace) + "." + PTXInstruction::toString(cacheLevel) + " " + d.toString(); } case Prefetchu: { return guard() + "prefetchu.L1 " + d.toString(); } } }
Modify PTX Internal Representation: validation • Check for valid address space, cache level, and address mode of address operand • ocelot/ir/implementation/PTXInstruction.cpp std::string ir::PTXInstruction::valid () { switch (opcode) { case Prefetch: { if (!(cacheLevel == L1 || cacheLevel == L2)) { return "cache level must be L1 or L2"; } if (!(addressSpace == Local || addressSpace == Global)) { return "address space must be .local or .global, not " + toString(addressSpace); } if (!(d.addressMode == PTXOperand::Indirect || d.addressMode == PTXOperand::Address || d.addressMode == PTXOperand::Immediate)) { return "address mode of destination operand must be Indirect, Address, or Immediate. Not " + PTXOperand::toString(d.addressMode); } } break; } }
Add Parser Support • ptx.ll • PTXParser • ptxgrammar.yy • ocelot/ • parser/ -- parser (to PTX IR)
Add Parser Support: lexical analysis • Define lexical analysis rules for opcode and cache level tokens • ocelot/parser/implementation/ptx.ll ".L1" { yylval->value = TOKEN_L1; return TOKEN_L1; } ".L2" { yylval->value = TOKEN_L2; return TOKEN_L2; } "prefetch" { sstrcpy( yylval->text, yytext, 1024 ); \ return OPCODE_PREFETCH; } "prefetchu" { sstrcpy( yylval->text, yytext, 1024 ); \ return OPCODE_PREFETCHU; }
Add Parser Support: modify class PTXParser • Enhance class PTXParser (translates tokens to PTX IR) • ocelot/parser/implementation/PTXParser.cpp class PTXParser { class State { void cacheLevel(inttoken ); } ir::PTXInstruction::CacheLeveltokenToCacheLevel(int token); } • ocelot/parser/implementation/PTXParser.cpp ir::PTXInstruction::CacheLevelPTXParser::tokenToCacheLevel(int token) { switch (token) { case TOKEN_L1: return ir::PTXInstruction::L1; case TOKEN_L2: return ir::PTXInstruction::L2; default: break; } return ir::PTXInstruction::CacheLevel_invalid; } void PTXParser::State::cacheLevel(int token ) { statement.instruction.cacheLevel = tokenToCacheLevel(token); }
Add Parser Support: modify class PTXParser • Translate new opcodes from string to enumir::PTXInstruction::Opcode • ocelot/parser/implementation/PTXParser.cpp ir::PTXInstruction::OpcodePTXParser::stringToOpcode( std::string string ) { if( string == "prefetch" ) return ir::PTXInstruction::Prefetch; if( string == "prefetchu" ) return ir::PTXInstruction::Prefetchu; }
Add Parser Support: modify PTX grammar • Define parse rules for prefetch and prefetchu instructions • ocelot/parser/implementation/ptx.ll %token<text> OPCODE_PREFETCH OPCODE_PREFETCHU %token<value> TOKEN_L1 TOKEN_L2 instruction : .... | prefetch | prefetchu | .... cacheLevel : TOKEN_L1 | TOKEN_L2 { state.cacheLevel( $<value>1 ); }; prefetch : OPCODE_PREFETCH addressSpacecacheLevel '[' memoryOperand ']' ';' { state.instruction( $<text>1 ); }; prefetchu : OPCODE_PREFETCHU cacheLevel '[' memoryOperand ']' ';' { state.instruction( $<text>1 ); };
Supported devices: PTX Emulator • Add methods to evaluate prefetch and prefetchu instructions • ocelot/executive/interface/CooperativeThreadArray.h class CooperativeThreadArray { void eval_Prefetch(CTAContext &context, constir::PTXInstruction &instr); void eval_Prefetchu(CTAContext&context, constir::PTXInstruction &instr); } • ocelot/executive/implementation/CooperativeThreadArray.cpp void executive::CooperativeThreadArray::execute(constir::Dim3& block) { do { PTXInstruction & instr = instructions[PC]; switch (instr.opcode) { case PTXInstruction::Prefetch: eval_Prefetch(context, instr); break; case PTXInstruction::Prefetchu: eval_Prefetchu(context, instr); break; } } while (running); }
Supported devices: PTX Emulator • Implement methods to evaluate prefetch and prefetchu • ocelot/executive/implementation/CooperativeThreadArray.cpp void executive::eval_Prefetch(CTAContext &context, constir::PTXInstruction &instr) { currentEvent.memory_size = 1; for (intthreadID = 0; threadID < threadCount; threadID++) { if (!context.predicated(threadID, instr)) { continue; } const char *source = 0; switch (instr.d.addressMode) { case PTXOperand::Indirect: source += getRegAsU64(threadID, instr.d.reg); break; case PTXOperand::Address: case PTXOperand::Immediate: source += instr.d.imm_uint; break; default: throw RuntimeException("unsupported", context.PC, instr); } source += instr.d.offset; currentEvent.memory_addresses.push_back((ir::PTXU64)source); } trace(); }
Example: PTX Emulator • Sample PTX kernel augmented with prefetch instruction virtual void event(const trace::TraceEvent & event){ if (event.instruction->opcode == ir::PTXInstruction::Prefetch) { std::cout << event.instruction->toString() << "\n"; trace::TraceEvent::U64Vector::const_iterator address = event.memory_addresses.begin(); for (int tid = 0; tid < event.active.size(); tid++) { std::cout << " t" << tid << " - 0x" << std::hex << *address << std::dec << "\n"; ++address; } } } prefetch.global.L1 [%r4] t0 - 0x1e16800 t1 - 0x1e16804 t2 - 0x1e16808 t3 - 0x1e1680c __global__ void sequence(int*A) { A[threadIdx.x] *= 2; } .entry sequence(.param .u64 param) { .reg .s32 %r<6>; .reg .s64 %rl<5>; ld.param.u64 %rl1, [param]; cvta.to.global.u64 %rl2, %rl1; mov.u32 %r1, %tid.x; mul.wide.u32 %rl3, %r1, 4; add.s64 %rl4, %rl2, %rl3; prefetch.global.L1 [%rl4]; ld.global.u32 %r2, [%rl4]; shl.b32 %r4, %r2, 1; st.global.u32 [%rl4], %r4; ret; }
Supported devices: NVIDIA GPU • No additional support required • prefetch and prefetchu instructions do not produce values • PTX emitter is sufficient to execute on native GPU • ir::PTXInstruction::toString( ) • ir::PTXInstruction::valid( ) • To support other devices • Multicore CPU: • Add translation rules to PTX-to-LLVM translator • Target LLVM prefetchintrinsics • AMD GPU: • Depends on support from CAL IL
Example Application #include <fstream> #include <iostream> #include <ocelot/cuda/interface/cuda_runtime.h> #include <ocelot/api/interface/ocelot.h> #include <ocelot/trace/interface/TraceEvent.h> #include <ocelot/trace/interface/TraceGenerator.h> class TraceGen: public trace::TraceGenerator { public: virtual void event(const trace::TraceEvent & event){ if (event.instruction->opcode== ir::PTXInstruction::Prefetch) { std::cout << event.instruction->toString() << "\n"; } } }; int main() { TraceGentraceGen; ocelot::addTraceGenerator(traceGen, true); constint N = 4; int *devPtr; cudaMalloc((void **)&devPtr, sizeof(int)*N); std::ifstreamptxFile("example-sequence.ptx"); ocelot::registerPTXModule(ptxFile, "example.ptx"); cudaConfigureCall(dim3(1,1), dim3(N, 1)); cudaSetupArgument(&devPtr, sizeof(int *), 0); ocelot::launch("example.ptx", "sequence"); cudaFree(devPtr); return 0; }