Efficient Run-Time Dispatching in Generic Programming with Minimal Code Bloat • Lubomir Bourdev, Advanced Technology Labs, Adobe Systems • Jaakko Järvi, Computer Science Department, Texas A&M University
Agenda • Context & problem statement • Background – previous approaches • Our approach to code bloat reduction • Code bloat reduction in run-time dispatch • Results & conclusion
Context: Image Manipulation • Images vary in many different ways • Writing generic and efficient image processing algorithms is challenging
Image Representations • Images differ in: color space (RGB, CMYK…), channel order (RGB vs. BGR), channel depth (8-bit, 16-bit…), planar vs. interleaved layout, and optional padding at the end of rows • Example: a 4x3 image in which the second pixel is highlighted, shown in interleaved form and in planar form [figures not reproduced]
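The difference between the two layouts can be sketched in a few lines. This is only an illustration, assuming 8-bit RGB and no row padding; `interleaved_rgb8`, `planar_rgb8` and `at()` are hypothetical names and not GIL's types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration types (not GIL classes).
// Interleaved: R G B R G B ... The channels of one pixel are adjacent.
struct interleaved_rgb8 {
    std::size_t width, height;
    std::vector<std::uint8_t> data;  // width * height * 3 bytes, zero-initialized
    interleaved_rgb8(std::size_t w, std::size_t h)
        : width(w), height(h), data(w * h * 3) {}
    std::uint8_t& at(std::size_t x, std::size_t y, std::size_t channel) {
        assert(x < width && y < height && channel < 3);
        return data[(y * width + x) * 3 + channel];
    }
};

// Planar: all R values first, then all G values, then all B values.
struct planar_rgb8 {
    std::size_t width, height;
    std::vector<std::uint8_t> data;  // three planes of width * height bytes
    planar_rgb8(std::size_t w, std::size_t h)
        : width(w), height(h), data(w * h * 3) {}
    std::uint8_t& at(std::size_t x, std::size_t y, std::size_t channel) {
        assert(x < width && y < height && channel < 3);
        return data[channel * width * height + y * width + x];
    }
};

// The algorithm is written once against at() and works for both layouts.
template <typename Image>
void invert(Image& img) {
    for (std::size_t y = 0; y < img.height; ++y)
        for (std::size_t x = 0; x < img.width; ++x)
            for (std::size_t c = 0; c < 3; ++c)
                img.at(x, y, c) = static_cast<std::uint8_t>(255 - img.at(x, y, c));
}
```

Both types store the same 36 bytes for a 4x3 image; only the position of each channel value in the buffer differs.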
Generic Image Library (GIL) • Adobe’s Open Source Image Library http://opensource.adobe.com/gil • Abstracts image representations from algorithms on images • Allows for writing the algorithm once & having it work on images of any representation, without loss of performance
Problem Statement • How do we write image processing algorithms that are: • Generic • Efficient • Compact • Run-Time Flexible
Image algorithms via inheritance & polymorphism
    struct pixel { virtual void invert() = 0; };
    struct rgb_pixel : public pixel { virtual void invert(); };
    struct gray_pixel : public pixel { virtual void invert(); };
    struct image {
        size_t size() const;
        pixel* operator[](size_t i);
    };
    void invert(image* img) {
        for (size_t i = 0; i < img->size(); ++i)
            (*img)[i]->invert();   // one virtual call per pixel
    }
• Generic: X • Efficient: X • Compact: √ • Run-Time Flexible: √ • Performance problem: dynamic dispatch once per pixel
Image Algorithms via Generic Programming
    struct rgb_pixel {…};
    struct gray_pixel {…};
    void invert_pixel(rgb_pixel&) {…}
    void invert_pixel(gray_pixel&) {…}
    template <typename Pixel> struct image {
        size_t size() const;
        Pixel& operator[](size_t i);
    };
    template <typename Image> void invert(Image& img) {
        for (size_t i = 0; i < img.size(); ++i)
            invert_pixel(img[i]);   // resolved at compile time
    }
• Generic: √ • Efficient: √ • Compact: √ • Run-Time Flexible: X
Generic Code Lacks Flexibility • We need run-time flexibility:
    typedef boost::mpl::vector<rgb8_image, gray8_image> images;
    gil::any_image<images> runtime_image;
    gil::jpeg_read_image(runtime_image, "test.jpg");
    invert(runtime_image);
• How can we do that without loss of performance? • Use a variant construct (see boost::variant) • runtime_image holds: index (the index of the currently held image type) and bits (a buffer containing the currently instantiated image) • To invoke an algorithm, go through a switch statement & cast • Efficient: dynamic dispatch is invoked only once per algorithm call
Variant invocation
    void invert_image(void* bits, int index) {
        switch (index) {
            case kLAB: invert(*(image<lab_pixel>*)(bits)); break;
            case kRGB: invert(*(image<rgb_pixel>*)(bits)); break;
        }
    }
Generic version:
    template <typename Op> void apply_operation(void* bits, int index, Op op) {
        switch (index) {
            case kLAB: op(*(image<lab_pixel>*)(bits)); break;
            case kRGB: op(*(image<rgb_pixel>*)(bits)); break;
        }
    }
• Generic: √ • Efficient: √ • Compact: X (every case instantiates the algorithm for one more image type) • Run-Time Flexible: √
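A self-contained version of the switch-based dispatch might look as follows. It is a sketch under assumptions: the pixel structs, the `image` template, the `image_kind` enum values and the body of `invert_op` are made up for illustration and are not GIL's definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pixel and image types for illustration only.
struct rgb_pixel { unsigned char c[3]; };
struct lab_pixel { unsigned char c[3]; };

template <typename Pixel>
struct image {
    std::vector<Pixel> pixels;
    std::size_t size() const { return pixels.size(); }
    Pixel& operator[](std::size_t i) { return pixels[i]; }
};

enum image_kind { kLAB = 0, kRGB = 1 };

// One run-time dispatch per algorithm call. Each case instantiates
// op for one concrete image type, which is the source of the code bloat.
template <typename Op>
void apply_operation(void* bits, int index, Op op) {
    switch (index) {
        case kLAB: op(*static_cast<image<lab_pixel>*>(bits)); break;
        case kRGB: op(*static_cast<image<rgb_pixel>*>(bits)); break;
        default:   assert(false && "unknown image type index");
    }
}

// The algorithm as a function object with a templated call operator.
struct invert_op {
    template <typename Image>
    void operator()(Image& img) const {
        for (std::size_t i = 0; i < img.size(); ++i)
            for (unsigned char& ch : img[i].c)
                ch = static_cast<unsigned char>(255 - ch);
    }
};
```

The inner loops run without any virtual calls; the only run-time decision is the single `switch` on `index`.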
Solution: Template Hoisting • Define a class hierarchy:
    template <int k> class k_channel_image {…};
    class rgb_image : public k_channel_image<3> {};
    class lab_image : public k_channel_image<3> {};
• Define the algorithm at the appropriate level of the hierarchy:
    template <int k> void invert(k_channel_image<k>&) {…}
• Generic: X (enforces a specific hierarchy; different algorithms may need different hierarchies) • Efficient: √ • Compact: with caveats (switch statement overhead remains; does not help when the function is inlined) • Run-Time Flexible: √
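To make the hoisting idea concrete, here is a small compilable sketch. The storage, the `channel()` accessor and `gray_image` are assumptions added for the example; only the class names `k_channel_image`, `rgb_image` and `lab_image` come from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Base class that captures the only property invert() depends on:
// the number of 8-bit channels per pixel.
template <int K>
class k_channel_image {
public:
    explicit k_channel_image(std::size_t num_pixels) : data_(num_pixels * K) {}
    std::size_t size() const { return data_.size() / K; }
    unsigned char& channel(std::size_t pixel, std::size_t c) {
        return data_[pixel * K + c];
    }
private:
    std::vector<unsigned char> data_;  // zero-initialized channel values
};

class rgb_image : public k_channel_image<3> {
public:
    using k_channel_image<3>::k_channel_image;
};
class lab_image : public k_channel_image<3> {
public:
    using k_channel_image<3>::k_channel_image;
};
class gray_image : public k_channel_image<1> {  // hypothetical extra type
public:
    using k_channel_image<1>::k_channel_image;
};

// Hoisted algorithm: rgb_image and lab_image both bind to
// invert<3>, so only one instantiation is generated for them.
template <int K>
void invert(k_channel_image<K>& img) {
    for (std::size_t p = 0; p < img.size(); ++p)
        for (std::size_t c = 0; c < static_cast<std::size_t>(K); ++c)
            img.channel(p, c) = static_cast<unsigned char>(255 - img.channel(p, c));
}
```

The sharing only works because every image type was placed in this particular hierarchy, which is the restriction the slide points out.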
Type Reduction • Every algorithm partitions the space of its argument types into a set of equivalence classes • Members of the same equivalence class result in the same assembly code when the algorithm is instantiated with them • The algorithm is instantiated with only one representative from each equivalence class
Type Reduction Implementation • Metafunction to define the partition:
    template <typename Op, typename T>
    struct reduce { typedef T type; };
• Generic algorithm invocation:
    template <typename Op, typename T>
    inline void apply_operation(const T& argument, Op op) {
        typedef typename reduce<Op,T>::type base_t;
        op(reinterpret_cast<const base_t&>(argument));
    }
Example: The invert algorithm • Define the algorithm as a function object:
    struct invert_op {
        template <typename Image> void operator()(Image&) {…}
    };
• Provide a function overload to invoke it:
    template <typename Image>
    inline void invert(Image& image) { apply_operation(image, invert_op()); }
• Inverting RGB and LAB images is assembly-level identical, so LAB reduces to RGB:
    template <> struct reduce<invert_op, lab8_image_t> { typedef rgb8_image_t type; };
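The pieces above can be assembled into a small compilable sketch. Several things here are assumptions made for the example: the `image8` template and its tags stand in for real image types, `apply_operation` takes a non-const reference so the image can be modified, and the global counter `rgb_instantiation_calls` is instrumentation added only to show which instantiation runs. The `reinterpret_cast` between distinct types is not sanctioned by the C++ standard; it relies on both types having the same layout, which is why the technique is described as unsafe.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-ins for two image types with identical layout.
struct rgb_tag {};
struct lab_tag {};
template <typename ColorSpace>
struct image8 { unsigned char bytes[6]; };  // two 3-channel 8-bit pixels
typedef image8<rgb_tag> rgb8_image_t;
typedef image8<lab_tag> lab8_image_t;

// Default partition: every type is its own representative.
template <typename Op, typename T>
struct reduce { typedef T type; };

// Invoke op on the representative of T's equivalence class.
template <typename Op, typename T>
inline void apply_operation(T& argument, Op op) {
    typedef typename reduce<Op, T>::type base_t;
    static_assert(sizeof(base_t) == sizeof(T), "representative must share the layout");
    op(reinterpret_cast<base_t&>(argument));  // unsafe by design
}

int rgb_instantiation_calls = 0;  // instrumentation for this example only

struct invert_op {
    template <typename Image>
    void operator()(Image& img) const {
        if (std::is_same<Image, rgb8_image_t>::value) ++rgb_instantiation_calls;
        for (std::size_t i = 0; i < sizeof(img.bytes); ++i)
            img.bytes[i] = static_cast<unsigned char>(255 - img.bytes[i]);
    }
};

// For invert_op, LAB images fall into the same class as RGB images.
template <>
struct reduce<invert_op, lab8_image_t> { typedef rgb8_image_t type; };

template <typename Image>
inline void invert(Image& image) { apply_operation(image, invert_op()); }
```

Calling `invert` on a `lab8_image_t` runs the `rgb8_image_t` instantiation of `invert_op::operator()`, so no LAB-specific copy of the loop is generated.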
The technique generalizes to multiple dimensions
    template <typename Op, typename T1, typename T2>
    void apply_operation(T1& arg1, T2& arg2, Op op) {
        // Reduce each argument on its own first
        typedef typename reduce<Op,T1>::type base1_t;
        typedef typename reduce<Op,T2>::type base2_t;
        // Then reduce the pair of already-reduced types
        typedef std::pair<base1_t*, base2_t*> pair_t;
        typedef typename reduce<Op,pair_t>::type base_pair_t;
        std::pair<void*,void*> p(&arg1,&arg2);
        op(reinterpret_cast<base_pair_t&>(p));
    }
    template <> struct reduce<copy_pixels_op, lab8_image_t> {…};
    template <> struct reduce<copy_pixels_op, std::pair<lab8_image_t*, lab8_image_t*> > {…};
Defining Reduce Specializations • Reduce dimensions separately, then combine:
    template <typename Image>
    struct reduce<invert_pixels_op, Image> {
        typedef typename reduce_cs<typename Image::color_space_t>::type cs;
        typedef typename reduce_ch<typename Image::channel_t>::type channel;
        typedef typename image_type<cs,channel,…>::type type;
    };
• Reuse structures via metafunction forwarding:
    template <typename T1, typename T2>
    struct reduce<resample_pixels_op, std::pair<T1,T2> >
        : public reduce<copy_pixels_op, std::pair<T1,T2> > {};
Reduction in variants
Input: a variant of
    input_types: [rgb8_image, lab8_image, cmyk16_image, rgba16_image]
    input_index: 2
• Step 1: Reduce each member of the vector:
    reduced_t: [rgb8_image, rgb8_image, rgba16_image, rgba16_image]
• Step 2: Remove duplicates:
    output_types_t: [rgb8_image, rgba16_image]
• Step 3: Create an index vector from reduced_t to output_types_t:
    indices_t: [0, 0, 1, 1]
• Step 4: Use indices_t to map the input index to an output index:
    output_index = indices_t[input_index] = indices_t[2] = 1
Invoke the algorithm on a variant of
    output_types_t: [rgb8_image, rgba16_image]
    output_index: 1
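The four steps can be mirrored in ordinary run-time code. The sketch below is not how a library would do it, since the real computation operates on type vectors at compile time; it uses enum tags instead of types, and `type_id`, `reduce_type`, `reduced_variant` and `reduce_types` are hypothetical names. The reduction rule in `reduce_type` is simply the one implied by the example data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Enum tags standing in for the image types of the example.
enum type_id { rgb8_image, lab8_image, cmyk16_image, rgba16_image };

// Assumed reduction for some operation, chosen to match the example.
type_id reduce_type(type_id t) {
    switch (t) {
        case lab8_image:   return rgb8_image;
        case cmyk16_image: return rgba16_image;
        default:           return t;
    }
}

struct reduced_variant {
    std::vector<type_id> output_types;  // unique representatives (step 2)
    std::vector<std::size_t> indices;   // input index -> output index (step 3)
};

reduced_variant reduce_types(const std::vector<type_id>& input_types) {
    reduced_variant r;
    for (type_id t : input_types) {
        type_id rep = reduce_type(t);  // step 1: reduce each member
        std::vector<type_id>::iterator it =
            std::find(r.output_types.begin(), r.output_types.end(), rep);
        if (it == r.output_types.end()) {  // step 2: keep first occurrence only
            r.output_types.push_back(rep);
            it = r.output_types.end() - 1;
        }
        // step 3: remember where this input type landed in the output
        r.indices.push_back(static_cast<std::size_t>(it - r.output_types.begin()));
    }
    return r;
}
```

Step 4 is then a single table lookup, `r.indices[input_index]`, performed once per algorithm call.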
Binary reduction in variants
• Step 1: Perform unary pre-reduction on each argument:
    [A1, A2, A3, A4] with index 2 -> [A1, A3] with out_index1 = 1
    [B1, B2, B3] with index 3 -> [B1, B2] with out_index2 = 0
• Step 2: Compute a vector of the cross-products of types:
    [(A1,B1), (A1,B2), (A3,B1), (A3,B2)]
• Step 3: Apply unary reduction on it:
    output_types_t = [(A1,B1), (A1,B2), (A3,B2)]
• Step 4: Compute the index in the output vector:
    out_index = out_index1 * size(Vec1) + out_index2
Invoke the algorithm on a single variant of
    output_types_t = [(A1,B1), (A1,B2), (A3,B2)]
    out_index
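The index arithmetic of the binary case can be sketched as below. Two assumptions are made and should be checked against the actual implementation. First, the cross-product is taken to be laid out in the order listed in Step 2, with the first argument varying slowest, so the stride is the size of the second reduced vector; the slide writes size(Vec1), and in this example both reduced vectors have size 2, so the number is the same either way. Second, the flat position is then mapped through the index table of the Step 3 reduction; the table used in the usage note, where (A3,B1) shares code with (A3,B2), is a guess consistent with the listed output. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position of the pair (i1, i2) in a cross-product listed with the first
// argument varying slowest: [(A1,B1), (A1,B2), (A3,B1), (A3,B2)].
std::size_t cross_product_index(std::size_t i1, std::size_t i2, std::size_t size2) {
    assert(i2 < size2);
    return i1 * size2 + i2;
}

// Map the cross-product position through the index table produced by the
// unary reduction of the pair vector (Step 3).
std::size_t reduced_pair_index(std::size_t i1, std::size_t i2, std::size_t size2,
                               const std::vector<std::size_t>& pair_indices) {
    const std::size_t flat = cross_product_index(i1, i2, size2);
    assert(flat < pair_indices.size());
    return pair_indices[flat];
}
```

With out_index1 = 1, out_index2 = 0 and a second reduced vector of size 2, the flat position is 2, which is the pair (A3,B1). Under the assumed table {0, 1, 2, 2}, that position maps to entry 2 of output_types_t, the pair (A3,B2).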
Tests
• Test sets
    • Set A: 90 types (10 color spaces, 3 channel types, other variations)
    • Set B: 10 types (4 color spaces, other)
    • Set C: 12 types (3 color spaces, planar/interleaved, step/nonstep)
• Tests
    • Test 1: copy_pixels on Set B (inlined binary algorithm)
    • Test 2: copy_pixels on Set C (inlined binary algorithm)
    • Test 3: resample_pixels on Set B (non-inlined binary algorithm)
    • Test 4: resample_pixels on Set C (non-inlined binary algorithm)
    • Test 5: invert_pixels on Set A (inlined unary algorithm)
Results • [Chart: reduction in code bloat] • [Chart: effect on compile time]
Conclusion
• Drawbacks
    • Unsafe
    • Requires intimate knowledge of the types and the algorithm
    • Some compilers can already optimize away most of the code bloat
• Benefits
    • Works even when functions are inlined
    • Simplifies the code generated by variants (especially double dispatch)
    • Does not impose a class hierarchy (essential for generic code!)
    • Works when algorithms differ in their requirements