Speeding up STL Set/Map Usage in C++ Applications
SIPEW 2008
Dibyendu Das, Madhavi Valluri, Michael Wong, Chris Cambly
dibyendu.das@in.ibm.com, mvalluri@us.ibm.com, michaelw@ca.ibm.com, ccambly@ca.ibm.com
Software/Systems Tech Group, Rational

An idea and implementation
• A way to speed up SPEC CPU 2006 dealII
• That can work for all compiler vendors
• Without violating C++ Standard Library rules
• Small increase in memory usage that does not change cache behavior
• IBM’s P5+/P6 shows ~20% improvement
• Delivered in IBM’s xlC C++ compiler V10.1

C++ Standard Template Library and Generic Programming
• A better data structure can provide the best speed gain
• Generic programming is about lifting common algorithms and data structures
• The C++ Standard Template Library unifies algorithms with data structures, glued together by iterators
• Effectively matches any algorithm with any data structure through the abstraction of iterators
• Universally supplied by all C++ compiler vendors
• vector, deque, list, set, map
• With limits on performance and memory usage
• Written by the best C++ programmers to be reusable and composable

The right tool for the right time
• What data structures are used in each SPEC CPU 2006 C++ benchmark

Which data structure to choose
• Depends on how to balance the cost of lookups, erasures, insertions, copies, and traversals (++/--)
• Found that dealII slows down due to long traversal time for set<> (++/-- is costly)
• In the traditional binary search tree implementation of set<>/map<>
• Which is optimized for a mixed combination of insertions and erasures, then some lookups and traversals, then maybe more insertions, etc.

What are we allowed to do?
• Can’t change the data structure in SPEC CPU benchmarks
• However, we are allowed to alter the underlying vendor implementation of libraries if we can sense how data is used
• Sometimes the usage is indeed chaotic
• Sometimes it is more organized
  • Setup through insertion
  • Lookup to find information
  • Traversal for doing something applicable to many elements
  • Reorganize to a more suitable set, then return to lookup

Details of normal set implementation
• Implemented as a balanced binary tree known as a red-black tree
• O(log n) for insertion and deletion
• O(log n) for lookups (find)
• O(1) amortized cost for traversals via ++/-- iterators
Figure: a set<int> iSet as a red-black tree
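
As a rough illustration of the three access patterns whose costs are listed above (insertion, lookup, traversal), the sketch below uses the standard std::set<int>. The keys are illustrative only, and the helper names make_example_set, sorted_contents and contains are hypothetical, not taken from the slides.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical helper: setup phase, one O(log n) insertion per key.
std::set<int> make_example_set() {
    const int keys[] = {8, 2, 12, 1, 5, 11, 14, 7, 15};  // illustrative keys
    std::set<int> s;
    for (int k : keys) s.insert(k);
    assert(s.size() == 9);  // all keys are distinct
    return s;
}

// Hypothetical helper: traversal phase, visiting elements in sorted order.
std::vector<int> sorted_contents(const std::set<int>& s) {
    std::vector<int> out;
    for (std::set<int>::const_iterator it = s.begin(); it != s.end(); ++it)
        out.push_back(*it);  // each ++it moves to the in-order successor
    return out;
}

// Hypothetical helper: lookup phase, one O(log n) find.
bool contains(const std::set<int>& s, int key) {
    return s.find(key) != s.end();
}
```

The traversal loop is the part that the rest of the deck targets: every ++it has to walk tree links to reach the next element.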

What does O(1) amortized cost for ++/-- mean?
Figure: a set<int> iSet as a red-black tree
• Starting with sitr = iSet.begin()
• Advance (++sitr) will put it on node (2) after 1 link
• Advance again will put it on node (5) after 2 links
• Advance again will put it on node (7) after 1 link
• Advance again will put it on node (8) after 1 link
• Advance again will put it on node (11) after 3 links
• Advance again will put it on node (12) after 2 links
• Advance again will put it on node (14) after 1 link
• Advance again will put it on node (15) after 1 link
• Total is 12 links after 9 traversals = 1.3 links/traversal
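
The link counting above can be reproduced with a small sketch. It assumes a plain, unbalanced binary search tree with parent pointers rather than a real red-black tree, so the tree shape (and therefore the exact count) need not match the figure. The names Tree, successor, WalkStats and walk_all are hypothetical. The amortized bound holds because a full walk crosses each of the n-1 tree edges at most twice, giving at most 2(n-1) links for n advances.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
    Node* parent = nullptr;
    explicit Node(int k) : key(k) {}
};

// Unbalanced binary search tree (an assumption of this sketch).
// std::deque keeps node addresses stable as nodes are appended.
struct Tree {
    std::deque<Node> pool;
    Node* root = nullptr;

    void insert(int key) {
        pool.emplace_back(key);
        Node* z = &pool.back();
        if (!root) { root = z; return; }
        Node* cur = root;
        for (;;) {
            Node*& child = (key < cur->key) ? cur->left : cur->right;
            if (!child) { child = z; z->parent = cur; return; }
            cur = child;
        }
    }
};

// Tree-based ++: returns the in-order successor and adds the links followed.
Node* successor(Node* x, std::size_t& links) {
    assert(x != nullptr);
    if (x->right) {                 // one step right, then left as far as possible
        x = x->right; ++links;
        while (x->left) { x = x->left; ++links; }
        return x;
    }
    Node* p = x->parent;            // otherwise climb until arriving from a left child
    while (p) {
        ++links;                    // one link per step up
        if (x == p->left) return p;
        x = p;
        p = p->parent;
    }
    return nullptr;                 // stepped past the maximum: end()
}

struct WalkStats { std::size_t advances; std::size_t links; };

// Walks from begin() to end() and totals the links followed by ++.
WalkStats walk_all(const Tree& t) {
    WalkStats st = {0, 0};
    Node* x = t.root;
    if (!x) return st;
    while (x->left) x = x->left;    // begin(): leftmost node, not counted
    while (x) {
        x = successor(x, st.links);
        ++st.advances;
    }
    return st;
}
```

Individual advances cost anywhere from one link to the height of the tree; only the average over the whole walk is constant.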

Our Implementation
• Add a doubly-linked list on top of the red-black tree
• Using _Next and _Prev pointers to the next sorted tree node in non-decreasing and non-increasing order respectively
• Now ++/-- operations are exactly Θ(1)
• But insert and delete have an added O(1) cost, still within the O(log n) required by the C++ Standard
• Copy adds O(1) for every copied node
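
A minimal sketch of the node layout this describes, not the xlC library source. The members next and prev stand in for the slide's _Next and _Prev (identifiers that begin with an underscore and a capital letter are reserved for the implementation, so user code should not declare them). The helper link_after is hypothetical and exists only to build a small list for the example.

```cpp
#include <cassert>

// Tree node extended with sorted-order list links.
struct Node {
    int key;
    Node* left = nullptr;    // tree links, unused in this sketch
    Node* right = nullptr;
    Node* parent = nullptr;
    Node* next = nullptr;    // successor in sorted order (slide: _Next)
    Node* prev = nullptr;    // predecessor in sorted order (slide: _Prev)
    explicit Node(int k) : key(k) {}
};

// Iterator ++ and --: a single pointer read each, whatever the tree shape.
inline Node* increment(Node* x) { return x->next; }
inline Node* decrement(Node* x) { return x->prev; }

// Hypothetical helper: links b directly after a in the sorted list.
inline void link_after(Node* a, Node* b) {
    assert(a != nullptr && b != nullptr);
    b->prev = a;
    b->next = a->next;
    if (a->next) a->next->prev = b;
    a->next = b;
}
```

The cost is two extra pointers per node, which is the small memory increase mentioned earlier in the deck.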

New in IBM xlC 10.1 compiler
• Just released June 2008 with many new features
• Compiler-defined flag to enable
  • -D __IBM_FAST_SET_MAP_ITERATOR
• Default is to not enable this behavior
• The entire application must be compiled with this flag, or it can behave erroneously

Results of our implementation
• dealII, xalancbmk, omnetpp all use set and map
• Only dealII and xalancbmk will benefit
• omnetpp's use of set is cold
• In peak mode (-O5 with profile-directed feedback enabled)
• Verified no cache effect

Other work and future investigation
• All commercial implementations use some form of red-black tree
• No commercial implementations use a doubly-linked list to augment the red-black tree
• Some research uses a B-tree
  • But it slows deletion compared to RB trees
• Advised the dealII author to switch to a sorted vector instead of an associative container
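
The sorted-vector alternative in the last bullet could look roughly like the sketch below. The class name SortedVectorSet is hypothetical and this is not code from dealII. The trade-off is that insertion becomes O(n) because elements are shifted, while lookup stays O(log n) and traversal walks contiguous memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical set-like wrapper around a vector kept in sorted order.
class SortedVectorSet {
public:
    // Inserts key if absent; returns true when the key was added.
    bool insert(int key) {
        std::vector<int>::iterator pos =
            std::lower_bound(data_.begin(), data_.end(), key);
        if (pos != data_.end() && *pos == key) return false;  // already present
        data_.insert(pos, key);  // O(n): shifts the tail of the vector
        assert(std::is_sorted(data_.begin(), data_.end()));
        return true;
    }

    // O(log n) lookup by binary search.
    bool contains(int key) const {
        return std::binary_search(data_.begin(), data_.end(), key);
    }

    // Traversal is a plain walk over contiguous storage.
    std::vector<int>::const_iterator begin() const { return data_.begin(); }
    std::vector<int>::const_iterator end() const { return data_.end(); }
    std::size_t size() const { return data_.size(); }

private:
    std::vector<int> data_;
};
```

This layout suits the organized usage pattern described earlier (set up once, then look up and traverse many times) better than a pattern with frequent interleaved insertions.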

BACKUP

Insertion
• Inserts a node _Z in a red-black tree
• If it is to the left of a node _Y
• Then update RB_Prev(_Y), _Y, and _Z
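
The list update on insertion can be sketched as follows. This assumes an unbalanced binary search tree for brevity; red-black rebalancing only rotates and recolors nodes, which leaves the sorted order and therefore the list links unchanged. The names ThreadedTree, first, last and in_order are hypothetical, and next/prev stand in for the slide's _Next/_Prev. The mirrored case, where the new node becomes a right child, is an addition of this sketch.

```cpp
#include <cassert>
#include <deque>
#include <vector>

struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
    Node* parent = nullptr;
    Node* next = nullptr;   // sorted-order successor
    Node* prev = nullptr;   // sorted-order predecessor
    explicit Node(int k) : key(k) {}
};

struct ThreadedTree {
    std::deque<Node> pool;  // owns nodes; addresses stay stable on append
    Node* root = nullptr;
    Node* first = nullptr;  // smallest key
    Node* last = nullptr;   // largest key

    Node* insert(int key) {
        Node* y = nullptr;  // parent of the new node
        bool goes_left = false;
        for (Node* cur = root; cur; ) {
            y = cur;
            if (key < cur->key)      { goes_left = true;  cur = cur->left; }
            else if (cur->key < key) { goes_left = false; cur = cur->right; }
            else return cur;         // key already present
        }
        pool.emplace_back(key);
        Node* z = &pool.back();
        z->parent = y;
        if (!y) { root = first = last = z; return z; }
        if (goes_left) {             // z sits between prev(y) and y
            y->left = z;
            z->prev = y->prev;
            z->next = y;
            if (y->prev) y->prev->next = z; else first = z;
            y->prev = z;
        } else {                     // z sits between y and next(y)
            y->right = z;
            z->prev = y;
            z->next = y->next;
            if (y->next) y->next->prev = z; else last = z;
            y->next = z;
        }
        assert(first != nullptr && last != nullptr);
        return z;
    }

    // Walks the list, one link per element.
    std::vector<int> in_order() const {
        std::vector<int> out;
        for (Node* x = first; x; x = x->next) out.push_back(x->key);
        return out;
    }
};
```

Only a constant number of pointers change per insertion, so the O(log n) search for the insertion point still dominates.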

Erasure
• Delete node 5
• Need to modify _Left, _Right, _Parent pointers
• Increment and decrement only need to follow 1 link instead of multiple links
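
For the list side of erasure, a minimal sketch under the same naming assumptions (next/prev for _Next/_Prev) is shown below. The tree pointers and the red-black rebalancing are deliberately left out, and OrderList, push_back and unlink are hypothetical names.

```cpp
#include <cassert>

// Only the list links are modeled here; tree links are omitted.
struct Node {
    int key;
    Node* next = nullptr;
    Node* prev = nullptr;
    explicit Node(int k) : key(k) {}
};

struct OrderList {
    Node* first = nullptr;
    Node* last = nullptr;

    // Hypothetical helper: appends a node that sorts after all others.
    void push_back(Node* z) {
        assert(z != nullptr);
        z->prev = last;
        z->next = nullptr;
        if (last) last->next = z; else first = z;
        last = z;
    }

    // O(1) removal of z from the sorted-order list.
    void unlink(Node* z) {
        assert(z != nullptr);
        if (z->prev) z->prev->next = z->next; else first = z->next;
        if (z->next) z->next->prev = z->prev; else last = z->prev;
        z->prev = nullptr;
        z->next = nullptr;
    }
};
```

The neighbours of the erased node are read directly from its own links, so no tree walk is needed to find them.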

Copy
• When we use = in C++, it will create a copy
• Allocate new nodes, copy contents from source to destination tree
• Scan from first to last node in the new tree in sorted order
• Set up _Prev and _Next pointers
• Traversal requires multiple links using the original increment and decrement
• Requires additional O(1) amortized time for every copied node
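
The scan that rebuilds the list after a copy might be sketched like this. Node allocation and content copying are assumed to have happened already; the function names leftmost, tree_successor and rethread are hypothetical, and next/prev again stand in for _Next/_Prev.

```cpp
#include <cassert>

struct Node {
    int key;
    Node* left = nullptr;
    Node* right = nullptr;
    Node* parent = nullptr;
    Node* next = nullptr;
    Node* prev = nullptr;
    explicit Node(int k) : key(k) {}
};

// Smallest node in the subtree rooted at x (nullptr for an empty tree).
Node* leftmost(Node* x) {
    while (x && x->left) x = x->left;
    return x;
}

// Original multi-link increment: uses only the tree pointers.
Node* tree_successor(Node* x) {
    assert(x != nullptr);
    if (x->right) return leftmost(x->right);
    Node* p = x->parent;
    while (p && x == p->right) { x = p; p = p->parent; }
    return p;
}

// Scans the freshly copied tree in sorted order and sets prev/next.
// Returns the first node of the resulting list.
Node* rethread(Node* root) {
    Node* first = leftmost(root);
    Node* before = nullptr;
    for (Node* x = first; x; x = tree_successor(x)) {
        x->prev = before;
        if (before) before->next = x;
        before = x;
    }
    if (before) before->next = nullptr;  // terminate the list at the maximum
    return first;
}
```

Because this scan is a full in-order walk with the tree-based increment, its total cost is linear in the number of nodes, which is where the O(1) amortized cost per copied node comes from.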