Effect of Node Size on the Performance of Cache-Conscious B+ Trees Written by: R. Hankins and J.Patel Presented by: Ori Calvo
Introduction • Who cares about cache improvement? • Traditional databases are designed to reduce I/O accesses. But… • Chips are cheap. • Chips are big. • Why not store the whole database in memory? • Reducing main-memory accesses is the next challenge.
Objectives • Introduction to cache-conscious B+Trees. • Provide a model to analyze the effect of node size. • Examine “real-life” results against our model’s conclusions.
B+Tree Refresher • A B+Tree of order d has between d and 2d keys in each node. • The root has between 1 and 2d keys. • Every node except the root must be at least half full. • 2*(d+1)^(h-1) <= N <= (2d+1)^h • Typical fill percentage is ln 2 ≈ 69%
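The bounds on N above are easy to evaluate for concrete values of d and h. The following is a minimal sketch; the function names are hypothetical, and it assumes the results fit in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Integer power base^exp; assumes the result fits in 64 bits. */
uint64_t ipow(uint64_t base, unsigned exp)
{
    uint64_t result = 1;
    while (exp-- > 0)
        result *= base;
    return result;
}

/* Lower bound from the slide: 2*(d+1)^(h-1). */
uint64_t bptree_min_n(unsigned d, unsigned h)
{
    assert(d >= 1 && h >= 1);
    return 2 * ipow((uint64_t)d + 1, h - 1);
}

/* Upper bound from the slide: (2d+1)^h. */
uint64_t bptree_max_n(unsigned d, unsigned h)
{
    assert(d >= 1 && h >= 1);
    return ipow(2 * (uint64_t)d + 1, h);
}
```

For example, with d = 7 and h = 3 the bounds are 2*8^2 = 128 and 15^3 = 3375.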
B+Tree Refresher (Cont…) • Good search performance. • Good incremental performance. • Better cache behavior than T-Tree. • What is the optimal node size ?
Improving B+Tree Question: Assuming node size = cache line size, how can we make the B+Tree algorithm use the cache better? Hint: Locality!
Pointer Elimination • Node size = cache line size. • Only half of a node is used for storing keys. • Get rid of pointers and store more keys. • Instead of pointers to child nodes use offsets.
Introducing CSB+Tree • Balanced search tree. • Each node contains m keys, where d<=m<=2d and d is the order of the tree. • All child nodes are put into a node group. • Nodes within a node group are stored contiguously. • Each node holds: • pFirstChild - pointer to first child • nKeys - number of keys • arrKeys[2d] - array of keys
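The node layout above can be sketched in C roughly as follows. The field names come from the slide; `CSB_ORDER`, `CsbNode`, and `csb_child_for` are hypothetical names, and the key-to-child convention (child i covers keys below `arrKeys[i]`) is an assumption. Because a node group is contiguous, child i is simply `pFirstChild + i`, which is why one pointer per node is enough.

```c
#include <assert.h>
#include <stddef.h>

#define CSB_ORDER 7                      /* d: illustrative value only */
#define CSB_MAX_KEYS (2 * CSB_ORDER)     /* 2d keys per node */

typedef struct CsbNode {
    struct CsbNode *pFirstChild;         /* first node of the child node group */
    int nKeys;                           /* number of keys currently in use */
    int arrKeys[CSB_MAX_KEYS];           /* keys in ascending order */
} CsbNode;

/* Return the child to descend into for `key`.
 * Assumed convention: child i holds keys k with arrKeys[i-1] <= k < arrKeys[i]. */
CsbNode *csb_child_for(const CsbNode *node, int key)
{
    assert(node != NULL && node->pFirstChild != NULL);
    int i = 0;
    /* Linear scan for clarity; counts the keys that are <= key. */
    while (i < node->nKeys && node->arrKeys[i] <= key)
        i++;
    /* Children are stored contiguously, so pointer arithmetic replaces
     * the per-child pointers of an ordinary B+Tree node. */
    return node->pFirstChild + i;
}
```

A real implementation would use binary search inside the node; the scan here only keeps the sketch short.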
CSB+Tree [Figure: CSB+Tree diagram of 13 nodes, each laid out as P N K1 K2]
CSB+Tree vs. B+Tree • Assuming node size = 64B • B+Tree: 7 Keys + 8 Pointers + 1 Counter • CSB+Tree: 1 Pointer + 1 Counter + 14 Keys • Results: • A cache line can satisfy almost one more level of comparisons • The fan-out is larger • Less space
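The key counts above follow from simple arithmetic if every key, pointer, and counter occupies 4 bytes, which is the assumption that makes both layouts total 64 bytes. A small sketch with hypothetical function names:

```c
#include <assert.h>

enum { WORD_BYTES = 4 };   /* assumed size of a key, a pointer, and the counter */

/* B+Tree node: k keys + (k+1) child pointers + 1 counter = 2k+2 words. */
unsigned bptree_keys_per_node(unsigned node_bytes)
{
    unsigned words = node_bytes / WORD_BYTES;
    assert(words >= 2);
    return (words - 2) / 2;
}

/* CSB+Tree node: k keys + 1 first-child pointer + 1 counter = k+2 words. */
unsigned csb_keys_per_node(unsigned node_bytes)
{
    unsigned words = node_bytes / WORD_BYTES;
    assert(words >= 2);
    return words - 2;
}
```

With `node_bytes = 64` these give 7 and 14 keys, matching the slide.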
CSS Tree • Can we do more elimination ?
Shaking our foundations • Should node size be equal to cache line size? • What about instruction count? • How can we measure the effect of node size on the overall performance?
Building Execution Time Model • We need to take into account: • Instructions executed. • Data cache misses. • Instruction cache misses (only 0.5%). • Mispredicted branches. • Model the above during an equality search. • Should be independent of implementation and platform details, but …
Execution Time Model T = I*cpi + M*miss_latency + B*pred_penalty
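The model is a plain weighted sum, so it can be written as one small function. In this sketch the struct and function names are hypothetical, and the result is in whatever time unit the three platform parameters use (cycles is assumed here).

```c
#include <assert.h>
#include <stddef.h>

/* Platform-specific parameters of the model. */
typedef struct {
    double cpi;            /* average cycles per instruction */
    double miss_latency;   /* cost of one data cache miss */
    double pred_penalty;   /* cost of one mispredicted branch */
} PlatformParams;

/* T = I*cpi + M*miss_latency + B*pred_penalty
 * I: instructions executed, M: data cache misses, B: mispredicted branches. */
double model_time(double I, double M, double B, const PlatformParams *p)
{
    assert(p != NULL);
    return I * p->cpi + M * p->miss_latency + B * p->pred_penalty;
}
```

For instance, with made-up parameters cpi = 1, miss_latency = 100 and pred_penalty = 10, a search with I = 10, M = 2 and B = 3 costs 10 + 200 + 30 = 240.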
CPI – 0.63 ? • Can be extracted from a processor’s design manual, but… • Modern processors are very complex • Some instructions take longer to retire than others • On the Pentium 3, CPI ranges from 0.33 to 14
Other PSV – Where do they come from? • Miss_latency • Same problems as CPI • Pred_penalty • The manual provides tight upper and lower bounds.
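PSV Experiment
for (i = 0; i < Queries; i++) {
    address = origin + random_offset;   /* a fresh random offset for each query */
    val = *address;                     /* memory access that may miss the cache */
    for (j = 0; j < Instructions; j++) {
        /* Computation involving "val" */
    }
}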
Calculate I • I depends on the actual implementation of the CSB+Tree • Two main components: • I_search - Searching inside a node • I_trav - Node traversals • Analyzing the code leads to the following conclusions: • I_search ~ 5 • I_trav ~ 30
Calculate I_search
BinarySearch:
    middle = (p1+p2)/2;
    comp *middle,key;
    jle less;
    p1 = middle;
    jump BinarySearch;
less:
    p2 = middle;
    jump BinarySearch;
Calculate I_trav
Node *Find(Node *pNode, int key)
{
    int *pKeysBegin = pNode->Keys;                              /* (1) */
    int *pKeysEnd = pNode->Keys + pNode->nKeys;                 /* (3) */
    int *pFoundKey, foundKey;
    pFoundKey = BinarySearch(pKeysBegin, pKeysEnd, key);        /* (8) ? */
    if (pFoundKey < pKeysEnd) { foundKey = *pFoundKey; }        /* (3,1) */
    else { foundKey = INFINITE; }                               /* (1) */
    int offset = (int)(pFoundKey - pKeysBegin);                 /* (2) */
    Node *pChild = NULL;
    if (key < foundKey) { pChild = pNode->pChilds + offset; }   /* (4,1) */
    else { pChild = pNode->pChilds + offset + 1; }              /* (3) */
    return pChild;
}                                                               /* total: (23-25) */
Calculate I (Finishing) • h - Height of the tree • f - Fill percentage • e - Max number of keys in a node
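One first-order way to combine these quantities is offered here as a sketch under stated assumptions, not as the presenters' exact formula. It assumes each of the h nodes on the search path holds about f·e keys and is searched by binary search, with I_search instructions per iteration and I_trav instructions per node visited:

```latex
I \approx h \cdot \left( I_{trav} + I_{search} \cdot \log_2 (f \cdot e) \right)
```

Under these assumptions a larger node raises the per-node search cost only logarithmically while reducing h, which is consistent with the instruction count falling as nodes grow.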
Calculate M • M_node – Cache misses while searching inside a node, where L is the number of cache lines in a node
Calculate M (Cont…) • Cache misses per tree traversal are bounded by: TreeHeight * M_node • What about q traversals?
Calculate M for q traversals • Let’s assume there are no cache conflicts and no capacity misses • On first traversal there are M_node cache misses per node access • On subsequent traversals • Nodes near the root will have high probability of being found in the cache • Leaf nodes will have substantially lower probability
Calculate M for q traversals (Cont..) • Suppose, • q is the number of queries • b is the number of blocks • Then, the number of Unique Blocks that are visited is:
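If the q accesses are assumed to be independent and uniformly distributed over the b blocks, the expected number of distinct blocks touched is b·(1 − (1 − 1/b)^q), a standard estimate often attributed to Cardenas. Whether the slide used exactly this expression is an assumption. A minimal sketch with hypothetical function names:

```c
#include <assert.h>

/* base^exp for a non-negative integer exponent, by repeated squaring. */
double pow_uint(double base, unsigned long exp)
{
    double result = 1.0;
    while (exp > 0) {
        if (exp & 1UL)
            result *= base;
        base *= base;
        exp >>= 1;
    }
    return result;
}

/* Expected number of unique blocks visited by q uniform random accesses
 * over b blocks: b * (1 - (1 - 1/b)^q). */
double unique_blocks(unsigned long q, unsigned long b)
{
    assert(b > 0);
    return (double)b * (1.0 - pow_uint(1.0 - 1.0 / (double)b, q));
}
```

The estimate behaves as the previous slide describes: with few blocks (levels near the root) it saturates at b almost immediately, so later traversals mostly hit the cache, while with many blocks (leaf levels) it stays close to q.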
Calculate M for q traversals (Finishing) • Assuming the q traversals perform q*M_node block accesses at each level of the tree, M is the sum of UB over all levels of the tree:
Calculate B • h - Height of the tree • f - Fill percentage • e - Max number of keys in a node
Mid year evaluation • We built a simple model • T = I*cpi + M*miss_latency + B*pred_penalty • Now, we want to use it
Our model’s prediction • We want to look at the performance behavior that our model predicts on Pentium 3 • The following parameters are used • 10,000,000 items • Number of queries = 10000 • Fill percentage = 67% • Cache line size = 32 bytes
Numbers • Best cache utilization at small node sizes: 64-256 bytes • For larger node sizes there are fewer instructions executed; the minimum is reached at 1632 bytes. • Optimal node size is 1632 bytes, which is 26% faster than a node size of 32 bytes.
Our Model Conclusions • Conventional wisdom suggests: Node size = Cache line size • We show: Using large node size can result in better search performance.
Experimental Setup • Pentium 3 • 768MB of main memory • 16KB of L1 data cache • 512KB of L2 data/instruction cache • 4-way set associative • 32-byte cache line • Linux, kernel version 2.4.13 • 10,000,000 entries in database • The database is queried 10,000 times
Final Conclusions • We investigated the performance of CSB+Tree • We introduced first-order analytical models • We showed that cache misses and instruction count must be balanced • Node size of 512 bytes performs well • Larger node sizes suffer from poor insert performance