Query Reordering for Photon Mapping Rohit Saboo
Photon Mapping A two-step solution for global illumination: Step 1: Build the photon map. Step 2: Shoot eye rays and perform a “gather”.
Gather variants • Approximate • Accurate
Bandwidth estimate • 100 queries for one eye ray • 20 bytes per photon • 512x512 image • Super-sampling. Together these give a bandwidth estimate of about 50 GB. Because caches fetch data in blocks, and because of other factors, the bandwidth requirement could go up to 200 GB.
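The arithmetic behind such an estimate can be sketched as follows. The slide does not state how many photons each query reads or what super-sampling rate was used, so `photons_per_query` and `samples_per_pixel` are hypothetical parameters, left at 1 here; the function name is also an assumption.

```python
def gather_traffic_bytes(width, height, queries_per_ray, bytes_per_photon,
                         photons_per_query=1, samples_per_pixel=1):
    """Rough count of photon bytes read during the gather for one frame.

    photons_per_query and samples_per_pixel are hypothetical knobs:
    the slide gives no values for them.
    """
    eye_rays = width * height * samples_per_pixel
    return eye_rays * queries_per_ray * photons_per_query * bytes_per_photon


# Figures taken from the slide: 512x512 image, 100 queries per eye ray,
# 20 bytes per photon.
base = gather_traffic_bytes(512, 512, 100, 20)
print(base)  # 524288000 bytes, about 0.5 GB
```

With only the three stated figures the product is about 0.5 GB, so the 50 GB estimate implies a further combined factor of roughly 100 from photons read per query and super-sampling, which the slide does not break down.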
Reordering Queries Improve the locality of data and queries so that the caches are more effective. Two ways: • Generate the queries in some chosen order • Generate the queries, reorder them in some manner, and then run them.
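A minimal sketch of the second approach (generate, reorder, then run), assuming each query is tagged with the pixel coordinates of the eye ray that produced it. The tuple layout, the `lookup` callback, and the tile size are illustrative assumptions, not details from the talk.

```python
def tiled_row_key(query, tile=2):
    """Sort key: visit tiles row by row, and pixels row by row inside a tile."""
    px, py = query
    return (py // tile, px // tile, py % tile, px % tile)


def run_reordered(queries, key, lookup):
    """Reorder the generated queries by `key`, then run them in that order."""
    ordered = sorted(queries, key=key)
    return [(q, lookup(q)) for q in ordered]


# Queries generated in plain row order for a 4x4 image.
queries = [(x, y) for y in range(4) for x in range(4)]

# `lookup` stands in for the photon-map gather; here it is a dummy.
results = run_reordered(queries, tiled_row_key, lookup=lambda q: None)
order = [q for q, _ in results]
print(order[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Sorting costs time of its own, so any such scheme only pays off if the locality gained outweighs the reordering overhead.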
Cache Hierarchy A very naïve hierarchy [Diagram labels: RAM, L2 Cache, I-Cache, D-Cache, Processor]
Reordering methods • Row ordering • Tiled row ordering • Direction binning • Hashed • Tiled Direction binned hashed • Hilbert Curve • Tiled Hilbert curve
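For the Hilbert curve ordering, a common way to obtain a sort key is to map 2D cell coordinates to their position along the curve. The sketch below uses the standard iterative coordinate-to-index conversion. It assumes queries have already been quantized to integer cells on an `n` x `n` grid with `n` a power of two; the talk does not say how the mapping was actually done, so this is only an illustration.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve filling an n x n grid.

    n must be a power of two, and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level sees a canonical layout.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Order the cells of a 2x2 grid along the curve.
cells = [(x, y) for y in range(2) for x in range(2)]
cells.sort(key=lambda c: hilbert_index(2, c[0], c[1]))
print(cells)  # [(0, 0), (0, 1), (1, 1), (1, 0)]
```

Consecutive indices always land on neighbouring cells, which is why queries sorted this way tend to touch nearby photons one after another.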
The Cornell Box (illustrative image; not one of the results)
Performance monitoring • Intel Pentium M processor, 1.7 GHz (frequency scaling disabled) • FSB 533 MHz • 2 MB L2 cache • Separate I-cache and D-cache, each 32 KB and 8-way set associative • 768 MB RAM • Windows XP (with most services disabled) • pbrt • Intel C++ compiler with all optimizations • VTune performance analysis package
Results With plain irradiance caching: • Branch mispredictions account for ~25% of the time • The algorithm seems too complicated to be optimized successfully • Bus utilization factor: 0.0024 (number of times the bus was asserted busy, relative to clockticks), which is very low • ~10% of the time is lost to cache misses
Results… Naïve reordering: • Bus utilization: 0.0014, again very low • CPU load port: 0.54 loads per clocktick (the maximum I could achieve is 1.07) • ~7% of the time is lost to cache misses
Results… Hilbert curve: • Bus utilization: 0.00074 (roughly a third of the figure for plain irradiance caching) • 0.93 loads per clocktick (almost as high as one can get) • Not much impact from L2 cache misses
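To put the three sets of measurements side by side, the short sketch below computes ratios from the figures quoted on the results slides. The dictionary layout and names are illustrative; the numbers are the ones reported, with 1.07 loads per clocktick taken as the achievable peak.

```python
PEAK_LOADS_PER_TICK = 1.07  # maximum load rate reported as achievable

bus_utilization = {
    "irradiance caching": 0.0024,
    "naive reordering": 0.0014,
    "hilbert curve": 0.00074,
}
loads_per_tick = {
    "naive reordering": 0.54,
    "hilbert curve": 0.93,
}

# How much lower the Hilbert-curve bus utilization is than each other run.
bus_ratio = {
    name: round(value / bus_utilization["hilbert curve"], 2)
    for name, value in bus_utilization.items()
}

# Fraction of the achievable load-port throughput that each run reaches.
load_fraction = {
    name: round(value / PEAK_LOADS_PER_TICK, 2)
    for name, value in loads_per_tick.items()
}

print(bus_ratio)      # irradiance caching 3.24, naive 1.89, hilbert 1.0
print(load_fraction)  # naive 0.5, hilbert 0.87
```

Read this way, the Hilbert-curve run keeps the load port busy for most of the achievable rate while the bus stays almost idle.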
Multithreading • Multithreaded the kd-tree data structure • Simply starts two threads to do the search • Results show very small changes • Maybe some other threading approach would be better? • The cost of threading overshadows any gains
Possible Discrepancies • Pentium M processor vs. desktop processors: results are highly architecture-dependent (e.g., if the processor has more than one port connected to the D-cache) • The analysis was not run over the entire duration of the run
Conclusions • Bandwidth between L2 and memory is not the bottleneck • The bottleneck lies more in CPU-to-L1 accesses and computation • There is scope for improving performance • But this would need algorithms that have very little overhead, are simple enough for the compiler to optimize, and at the same time exploit cache coherency
References • Reordering for Cache conscious photon mapping – Josh Steinhurst • Realistic Image Synthesis using Photon Mapping – Jensen • IA-32 Intel Architecture Software Developer’s Manual, Volume 3: System Programming Guide. ftp://download.intel.com/design/Pentium4/manuals/25366814.pdf • VTune Performance Analyzer http://www.intel.com/software/products/vtune/vpa/