Building an Efficient Integer Set Search Engine

Explore different data structures for storing and searching integers efficiently with a practical example in C++. Learn about array-based and STL implementations and their trade-offs. Enhance your search algorithms knowledge.

msamuel

Presentation Transcript


  1. Practical Example - Searching

  2. A practical example - searching
     • Searching problems come in many varieties:
       • A compiler looks up a variable name.
       • A spelling checker looks up a word in a dictionary.
       • A subscriber service looks up a subscriber's phone number.
     • We are interested in building a spell checker for English. We will assume:
       • 75,000 words are in the dictionary.
       • The average word length is 6 characters.
       • We have limited memory available (say a 64K address space).
     • Can we do this?
     • Let's use a simple example to explore the options…
     CS352 - Software Engineering (AY2005)

  3. A practical example - searching
     • Let's look at a very simple problem:
       • How do we best store a set of integers with no other associated data?
     • We want to generate a sorted sequence of m random integers in the range [0, maxval), chosen without duplicates.
           initialize set S to empty
           size = 0
           while size < m do
               t = bigrand() % maxval
               if t is not in S
                   insert t into S
                   size++
           print the elements of S in sorted order

  4. Searching
     • Let's call our data structure IntSet. We'll define the interface to be a C++ class with these public members:
           class IntSetImp {
           public:
               IntSetImp(int maxelements, int maxval);
               void insert(int t);
               int size();
               void report(int *v);
           };
     • This interface is clearly for instructional use only.
       • An industrial-strength class would include error handling and a destructor.
       • An expert C++ programmer would probably use an abstract class with virtual functions to specify the interface and then write each implementation as a derived class.
     • We will use the simpler (and sometimes more efficient) approach of using names such as IntSetArr for an array-based implementation and IntSetList for a list implementation. We will use IntSetImp to represent an arbitrary implementation.

  5. Searching
     • The following C++ code uses such a data structure to generate a sorted set of random integers.
           void gensets(int m, int maxval)
           {
               int *v = new int[m];
               IntSetImp S(m, maxval);
               while (S.size() < m)
                   S.insert(bigrand() % maxval);
               S.report(v);                  // copy the elements out in sorted order
               for (int i = 0; i < m; i++)
                   cout << v[i] << "\n";
               delete[] v;                   // release the output buffer
           }
     • If the insert function does not put duplicate elements into the set, we need not test whether the element is already in the set before we insert it.
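
The slides call bigrand() without defining it. A minimal stand-in, assuming only that it must return a large non-negative pseudo-random int, could be sketched with the standard <random> header. The generator, the fixed seed, and the range are all assumptions made for illustration, not the course's definition.

```cpp
#include <limits>
#include <random>

// Hypothetical stand-in for the bigrand() used in the slides: returns a
// non-negative pseudo-random int. The fixed seed is an arbitrary choice
// that makes runs repeatable.
inline int bigrand() {
    static std::mt19937 gen(12345);
    static std::uniform_int_distribution<int> dist(
        0, std::numeric_limits<int>::max());
    return dist(gen);
}
```

With this helper, bigrand() % maxval yields a value in [0, maxval), as the generation loop requires.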

  6. Searching – general solution
     • The easiest implementation of an IntSet uses the powerful and general set template from the C++ Standard Template Library.
           #include <set>
           using namespace std;
           class IntSetSTL {
           private:
               set<int> S;
           public:
               IntSetSTL(int maxelements, int maxval) { }
               int size() { return S.size(); }
               void insert(int t) { S.insert(t); }   // set ignores duplicates
               void report(int *v)
               {
                   int j = 0;
                   set<int>::iterator i;
                   for (i = S.begin(); i != S.end(); ++i)
                       v[j++] = *i;
               }
           };
     • The constructor ignores its two arguments.
     • The report function uses the standard iterator to write the elements of the set into the array, in sorted order.

  7. Searching – general solution
     • The general-purpose structure is good, but not perfect. We can get a factor of five improvement in both space and time efficiency!
     • The reason for the inefficiency is that the STL was written for generality, not application-specific efficiency.
     • It incorporates a number of features that we do not need, and these features cost both space and time.
     • So while it is easy and convenient to program with the STL, it does not produce an appropriate solution in this instance.

  8. Searching – linear structures
     • The simplest linear structure is an array of integers.
           class IntSetArray {
           private:
               int n, *x;
           public:
               IntSetArray(int maxelems, int maxval)
               {
                   x = new int[1 + maxelems];
                   n = 0;
                   x[0] = maxval;                 // sentinel
               }
               int size() { return n; }
               void insert(int t)
               {
                   int i;                         // declared here because it is used after the loop
                   for (i = 0; x[i] < t; i++)
                       ;
                   if (x[i] == t)
                       return;
                   for (int j = n; j >= i; j--)   // shift the tail, sentinel included
                       x[j+1] = x[j];
                   x[i] = t;
                   n++;
               }

  9. Searching – linear structures
               void report(int *v)
               {
                   for (int i = 0; i < n; i++)
                       v[i] = x[i];
               }
           };
     • Our class keeps the current number of elements in the integer n, and the integers themselves in the array x.
     • The constructor allocates an array with maxelems elements plus 1 for a sentinel value. It uses maxval as the sentinel value since it can never legally be inserted into the set. The constructor initializes n to 0.
     • Since we have to report the elements in sorted order, we store them that way at all times. The sentinel value is always at the end of the valid portion of the array.

  10. Searching – linear structures
     • The sentinel makes our insert code simpler (and faster) than testing whether we have run off the end of the array. Insertion still takes O(n) time.
     • The report function copies all of the elements to the output array in O(n) time.
     • Arrays are an excellent structure for sets when the size is known in advance.
     • Since the array is sorted, we can build a membership-test function using binary search that operates in O(log n) time.
     • If the size of the set is not known in advance, linked lists are a prime candidate for set representation.
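
The binary-search lookup mentioned on this slide is not shown in the deck. The following is a minimal sketch, written as a free function over a sorted int array. The name contains and its signature are assumptions for illustration.

```cpp
#include <cassert>

// Hypothetical helper (not from the slides): returns true if t occurs in
// the sorted range x[0..n-1]. Uses O(log n) comparisons.
bool contains(const int *x, int n, int t) {
    assert(n >= 0);
    int lo = 0, hi = n - 1;               // search the closed interval [lo, hi]
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;     // avoids overflow of lo + hi
        if (x[mid] < t)
            lo = mid + 1;                 // t can only be to the right
        else if (x[mid] > t)
            hi = mid - 1;                 // t can only be to the left
        else
            return true;
    }
    return false;
}
```

Binary search speeds up the lookup only. Inserting into the array still has to shift the larger elements, so insertion remains O(n).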

  11. Searching – linear structures
           class IntSetList {
           private:
               int n;
               struct node {
                   int val;
                   node *next;
                   node(int v, node *p) { val = v; next = p; }
               };
               node *head, *sentinel;
               node *rinsert(node *p, int t)
               {
                   if (p->val < t) {
                       p->next = rinsert(p->next, t);
                   } else if (p->val > t) {
                       p = new node(t, p);
                       n++;
                   }
                   return p;
               }

  12. Searching – linear structures
           public:
               IntSetList(int maxelems, int maxval)
               {
                   sentinel = head = new node(maxval, 0);
                   n = 0;
               }
               int size() { return n; }
               void insert(int t) { head = rinsert(head, t); }
               void report(int *v)
               {
                   int j = 0;
                   for (node *p = head; p != sentinel; p = p->next)
                       v[j++] = p->val;
               }
           };

  13. Searching – linear structures
     • Each node in the linked list has a value and a pointer to the next element. As a result it uses twice the storage of an array with the same number of elements.
     • Like the array implementation, we use a sentinel to denote the end of the list. The constructor builds such a node and sets head to point at it.
     • The report function simply walks the list and copies the values into the result array.
     • Inserting an item into a sorted list is relatively complex. Our insert function calls the private function rinsert, which recursively traverses the list.
     • To generate m random numbers, insert is called m times, and each search runs in time proportional to m (on average). Therefore, each structure takes time proportional to m².

  14. Searching – linear structures
     • We might suspect that the list version would run slightly faster than the array version: it uses extra space (pointers) to avoid shifting upper values of the array during an insert operation.
     • Below are some run times with n held fixed at 1,000,000 and m varying from 10,000 to 40,000.

  15. Searching – linear structures
     • The run-time of arrays grew quadratically, as expected.
     • The simple implementation of lists started out an order of magnitude slower than arrays and grew faster than n². SOMETHING WAS WRONG!
     • The first response is to blame recursion. There is an overhead for establishing and destroying activation records for each recursive call. Also, after recurring all the way down the list (O(n) depth), the code assigns the original value back into almost every pointer.
     • Replacing the recursive version with an iterative one dropped the run-time by a factor of three.
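
The deck does not show the iterative version. One way to write it, sketched here under the same node layout as IntSetList, walks a pointer-to-pointer down the list so that only the one link that changes is written. The class name IntSetListIter, the destructor, and the precondition check are assumptions added for this sketch.

```cpp
#include <cassert>

// Sketch of the list-based set with an iterative insert.
// IntSetListIter is a hypothetical name; the layout follows IntSetList.
class IntSetListIter {
private:
    int n;
    struct node {
        int val;
        node *next;
        node(int v, node *p) : val(v), next(p) {}
    };
    node *head, *sentinel;
public:
    IntSetListIter(int /*maxelems*/, int maxval) {
        sentinel = head = new node(maxval, nullptr);
        n = 0;
    }
    IntSetListIter(const IntSetListIter &) = delete;
    IntSetListIter &operator=(const IntSetListIter &) = delete;
    ~IntSetListIter() {
        while (head) { node *next = head->next; delete head; head = next; }
    }
    int size() const { return n; }
    void insert(int t) {
        assert(t < sentinel->val);        // the sentinel must stop the walk
        node **p = &head;                 // address of the link being examined
        while ((*p)->val < t)
            p = &(*p)->next;
        if ((*p)->val == t)
            return;                       // already present
        *p = new node(t, *p);             // splice in; only this link is written
        n++;
    }
    void report(int *v) const {
        int j = 0;
        for (node *p = head; p != sentinel; p = p->next)
            v[j++] = p->val;
    }
};
```

Unlike rinsert, this loop uses no call stack and does not reassign the links it passes over.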

  16. Searching – linear structures
     • The next response is to change the storage allocation so that a block of nodes is allocated initially and nodes are taken from this block rather than being allocated one at a time from the heap (storage allocation is about 2 orders of magnitude more expensive than most simple operations).
     • Also, if we allocate nodes as a block, each node consumes 8 bytes (4 for the integer and 4 for the pointer). 40,000 nodes then consume 320 KB, which fits comfortably into the Level-2 cache on the machine.
     • If we allocate nodes individually, then each node consumes 48 bytes. Collectively, their 1.92 MB overflow the Level-2 cache.
     • The code tuning performed gives different results on different machines, since it depends on the efficiency of the underlying system implementation.
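
Block allocation can be sketched as follows: the constructor allocates maxelems + 1 nodes in one call, and insert hands them out in order. The names IntSetListPool, pool, freenode, and make are hypothetical, and the capacity check is an addition for this sketch.

```cpp
#include <cassert>

// Sketch: list-based set whose nodes all come from one block allocated
// up front. Names are hypothetical; the list logic follows IntSetList.
class IntSetListPool {
private:
    struct node { int val; node *next; };
    int n, capacity;
    node *pool;       // one block holding every node, sentinel included
    node *freenode;   // next unused slot in the block
    node *head, *sentinel;
    node *make(int v, node *p) {          // take the next slot from the block
        freenode->val = v;
        freenode->next = p;
        return freenode++;
    }
public:
    IntSetListPool(int maxelems, int maxval) : n(0), capacity(maxelems) {
        pool = freenode = new node[maxelems + 1];
        sentinel = head = make(maxval, nullptr);
    }
    IntSetListPool(const IntSetListPool &) = delete;
    IntSetListPool &operator=(const IntSetListPool &) = delete;
    ~IntSetListPool() { delete[] pool; }  // a single deallocation frees all nodes
    int size() const { return n; }
    void insert(int t) {
        node **p = &head;
        while ((*p)->val < t)
            p = &(*p)->next;
        if ((*p)->val == t)
            return;
        assert(n < capacity);             // the block has room for maxelems values
        *p = make(t, *p);
        n++;
    }
    void report(int *v) const {
        int j = 0;
        for (node *p = head; p != sentinel; p = p->next)
            v[j++] = p->val;
    }
};
```

Because the nodes sit next to each other in memory, this layout also avoids the per-allocation overhead that inflates individually allocated nodes.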

  17. Searching – linear structures
     • The array-based implementation searches down the array to find the insertion point and then shifts all the greater values to make room for the new one. The list only has to do the search.
     • So if lists do half the work, why do they take twice the time?
     • Part of the reason is that they take twice the memory. Large lists must read 8-byte nodes into the cache to access each 4-byte integer.
     • Arrays access data with perfect predictability, while access patterns for lists bounce all over memory.

  18. Searching – BST
     • Let us consider a structure that supports fast search and insertion: a binary search tree.
           class IntSetBST {
           private:
               int n, *v, vn;
               struct node {
                   int val;
                   node *left, *right;
                   node(int v) { val = v; left = right = 0; }
               };
               node *root;
               node *rinsert(node *p, int t)
               {
                   if (p == 0) {
                       p = new node(t);
                       n++;
                   } else if (t < p->val) {
                       p->left = rinsert(p->left, t);
                   } else if (t > p->val) {
                       p->right = rinsert(p->right, t);
                   } // do nothing if p->val == t
                   return p;
               }

  19. Searching – BST
               void traverse(node *p)     // in-order walk writes values in sorted order
               {
                   if (p == 0)
                       return;
                   traverse(p->left);
                   v[vn++] = p->val;
                   traverse(p->right);
               }
           public:
               IntSetBST(int maxelems, int maxval) { root = 0; n = 0; }
               int size() { return n; }
               void insert(int t) { root = rinsert(root, t); }
               void report(int *x) { v = x; vn = 0; traverse(root); }
           };
     • We initialize the tree by setting the root to empty and perform other actions by calling recursive functions.

  20. Searching – BST
     • Since the elements in our application are inserted in random order, we do not need to worry about sophisticated balancing algorithms (exploit your knowledge of the application).
     • The table gives run-times for the Standard Template Library, BST and other structures. The maximum integer size is held at n = 10⁸, and m goes as high as possible until the system runs out of RAM.
       • STL = Standard Template Library
       • BST = Binary Search Tree
       • BST* = BST with optimizations, including the allocation of all nodes at once.
       • Bins = Hash Table (discussed next).
       • Bins* = Bins with optimizations
       • BitVec = Bit Vector (discussed shortly).

  21. Searching – BST
     These times do not include the time to print the output, which is slightly greater than the time of the STL implementation.

  22. Searching
     • Our simple BST implementation avoided the complex balancing scheme used by the STL and is therefore slightly faster and uses less space.
     • The STL started thrashing at about m = 1,600,000.
     • The BST started thrashing at about m = 1,900,000.

  23. Searching – exploiting knowledge
     • Let us use a bit vector. 1 represents the fact that the integer is present, 0 indicates that it is absent.
           class IntSetBitVec {
           private:
               enum { BITSPERWORD = 32, SHIFT = 5, MASK = 0x1F };
               int n, hi, *x;
               void set(int i)  {        x[i>>SHIFT] |=  (1<<(i&MASK)); }   // turn bit i on
               void clr(int i)  {        x[i>>SHIFT] &= ~(1<<(i&MASK)); }   // turn bit i off
               int  test(int i) { return x[i>>SHIFT] &   (1<<(i&MASK)); }   // nonzero if bit i is on
           public:
               IntSetBitVec(int maxelems, int maxval)
               {
                   hi = maxval;
                   x = new int[1 + hi/BITSPERWORD];
                   for (int i = 0; i < hi; i++)
                       clr(i);
                   n = 0;
               }

  24. Searching – exploiting knowledge
               int size() { return n; }
               void insert(int t)
               {
                   if (test(t))
                       return;
                   set(t);
                   n++;
               }
               void report(int *v)
               {
                   int j = 0;
                   for (int i = 0; i < hi; i++)
                       if (test(i))
                           v[j++] = i;
               }
           };

  25. Searching – exploiting knowledge
           enum { BITSPERWORD = 32, SHIFT = 5, MASK = 0x1F };
           x[i>>SHIFT] &= ~(1<<(i&MASK))
     • C++ refresher…
       • ~ is unary bitwise complement (NOT).
       • & is binary bitwise AND.
       • >> is a right shift (a right shift by 1 place is equivalent to dividing by 2).
       • << is a left shift (a left shift by 1 place is equivalent to multiplying by 2).
       • MASK has the 5 rightmost bits set; all others are 0.
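
To make the indexing concrete, the expression above can be split into its parts. The helper names wordIndex, bitOffset, and bitMask are hypothetical; the constants are the ones from the slide. The mask is built from an unsigned value here, an assumption of this sketch that avoids shifting into the sign bit.

```cpp
#include <cassert>
#include <cstdint>

enum { BITSPERWORD = 32, SHIFT = 5, MASK = 0x1F };

// Index of the 32-bit word holding bit i (same as i / 32 for i >= 0).
inline int wordIndex(int i) { assert(i >= 0); return i >> SHIFT; }

// Position of bit i inside its word (same as i % 32 for i >= 0).
inline int bitOffset(int i) { assert(i >= 0); return i & MASK; }

// A word with only that position set.
inline std::uint32_t bitMask(int i) {
    return std::uint32_t(1) << bitOffset(i);
}
```

For example, integer 70 lives in word 2 (which covers 64..95) at bit position 6, so its mask is 64.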

  26. Searching – exploiting knowledge
     • The constructor allocates the array and turns off all bits.
     • Report cycles through the bit vector and tests whether each integer value is present, adding it to the result array if it is.
     • The insert function turns the bit on and increments n, but only if the bit was previously off.
     • The bit vector requires half a gigabyte of main memory if n = 2³².
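
The half-gigabyte figure follows from one bit per possible value: 2³² bits divided by 8 bits per byte is 2²⁹ bytes. A small sketch of that arithmetic, with the hypothetical helper name bytesForBits:

```cpp
// Bytes needed to hold nbits bits, rounded up to a whole byte.
// bytesForBits is a hypothetical name used only for this estimate.
constexpr unsigned long long bytesForBits(unsigned long long nbits) {
    return (nbits + 7) / 8;
}

static_assert(bytesForBits(1ULL << 32) == (1ULL << 29),
              "2^32 bits occupy 2^29 bytes, i.e. half a gigabyte");
```

By the same arithmetic, the n = 10⁸ used in the timing runs needs 12.5 million bytes.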

  27. Searching – exploiting knowledge
     • Let us combine the strengths of lists and bit vectors by placing integers into a sequence of bins or buckets. As an example, if we have 4 integers in the range 0..99, we place them into 4 bins. Bin 0 contains integers in the range 0..24, bin 1 represents 25..49, bin 2 represents 50..74 and bin 3 represents 75..99.
     • The m bins can be viewed as a kind of hashing.
     • Since the integers are uniformly distributed, each linked list has an expected length of 1.

  28. Searching – exploiting knowledge
           class IntSetBins {
           private:
               int n, bins, maxval;
               struct node {
                   int val;
                   node *next;
                   node(int v, node *p) { val = v; next = p; }
               };
               node **bin, *sentinel;
               node *rinsert(node *p, int t)
               {
                   if (p->val < t) {
                       p->next = rinsert(p->next, t);
                   } else if (p->val > t) {
                       p = new node(t, p);
                       n++;
                   }
                   return p;
               }

  29. Searching – exploiting knowledge
           public:
               IntSetBins(int maxelems, int pmaxval)
               {
                   bins = maxelems;
                   maxval = pmaxval;
                   bin = new node*[bins];
                   sentinel = new node(maxval, 0);
                   for (int i = 0; i < bins; i++)
                       bin[i] = sentinel;
                   n = 0;
               }
               int size() { return n; }
               void insert(int t)
               {
                   int i = t / (1 + maxval/bins);   // bin index, computed without overflow
                   bin[i] = rinsert(bin[i], t);
               }
               void report(int *v)
               {
                   int j = 0;
                   for (int i = 0; i < bins; i++)
                       for (node *p = bin[i]; p != sentinel; p = p->next)
                           v[j++] = p->val;
               }
           };

  30. Searching – exploiting knowledge
     • The constructor allocates the array of bins and a sentinel element with a large value, and initializes each bin to point to the sentinel.
     • The insert function needs to place the integer t into its proper bin. The obvious mapping of t*bins/maxval can lead to numerical overflow (and some nasty debugging…). We instead use a safer mapping in the code.
     • The rinsert function is the same as the one we used for linked lists.
     • The report function is essentially the linked list code applied to every bin in turn.
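
The overflow risk can be seen by comparing the two mappings. In this sketch binSafe is the expression from the slides' insert function, binWide is the "obvious" formula evaluated with a 64-bit intermediate, and productOverflows reports whether t * bins would exceed the range of int. All three function names are hypothetical.

```cpp
#include <cassert>
#include <climits>

// Mapping used by IntSetBins::insert. It never forms the product t * bins,
// so it cannot overflow for 0 <= t < maxval and bins >= 1.
inline int binSafe(int t, int bins, int maxval) {
    assert(bins > 0 && maxval > 0 && t >= 0 && t < maxval);
    return t / (1 + maxval / bins);
}

// The "obvious" mapping, made safe by widening before multiplying.
inline int binWide(int t, int bins, int maxval) {
    assert(bins > 0 && maxval > 0 && t >= 0 && t < maxval);
    return static_cast<int>(static_cast<long long>(t) * bins / maxval);
}

// True if t * bins would not fit in an int (t >= 0, bins > 0).
inline bool productOverflows(int t, int bins) {
    return t != 0 && bins > INT_MAX / t;
}
```

Both mappings return an index in [0, bins) and preserve order, which is what report relies on. The safe mapping makes each bin slightly wider, so the highest-numbered bins may stay empty.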

  31. Searching
     • The table gives the average performance when m is small compared to n.
     • b denotes the number of bits per word.

  32. The real world
     • After that insight based on a simple problem (a set of integers), let's look at a real-world problem.
     • In 1978, Doug McIlroy wrote the Unix spell program. A requirement was that it had to fit within a 64 kilobyte address space on a PDP-11 computer.
     • See his paper "Development of a Spelling List" in the IEEE Transactions on Communications, January 1982, pp. 91-99, for more details than presented here.
     • The first problem McIlroy faced was representing the word list.
     • He started by intersecting an unabridged dictionary (for validity) with the million-word Brown University corpus (for currency).
     • This was a good start, but dictionaries do not deal with proper nouns.

  33. The real world
     • He added the 1,000 most common last names from a large telephone directory, a list of boys' and girls' names, famous names (like Dijkstra and Nixon) and mythological names from an index to Bulfinch.
     • After observing "mis-spellings" like Texaco and Xerox, he added companies on the Fortune 500 list. Publishing companies are rampant in bibliographies, so he added them also.
     • He also added all the nations and their capitals, the states and theirs, the hundred largest cities in the US and the world, as well as oceans, planets and stars.
     • He also added common names of animals and plants, and terms from chemistry, anatomy and computing.
     • He was careful not to add too much: he kept out valid words that tend to be real-life mis-spellings (like the geological term cwm), and included only one of several alternative spellings (hence traveling, but not travelling).

  34. The real world
     • McIlroy's trick was to examine spell's output from real runs. For some time, spell automatically mailed a copy of the output to him. (Privacy issues were viewed a little differently in bygone days.) Whenever he spotted a problem, he would apply the broadest possible solution.
     • The final result was a list of 75,000 words that made up the dictionary.

  35. The real world
     • McIlroy then used his knowledge of English to minimize the space required by the dictionary.
     • He used affix analysis to peel off prefixes and suffixes. This is both necessary and convenient. It is convenient because it reduces the size of the dictionary. It is necessary because there is no such thing as a complete word list for English: a spelling checker must either guess at the derivation of words like misrepresented or report as errors a lot of valid English words.
     • The goal of affix analysis is to reduce misrepresented down to sent, stripping off the mis-, re-, pre- and -ed.
     • The program's tables contain 40 prefix rules and 30 suffix rules.
     • A "stop list" of 1,300 exceptions halts good but incorrect guesses like reducing entend (a mis-spelling of intend) to en- + tend.
     • The analysis reduces the 75,000 word list to 30,000 words.

  36. The real world
     • McIlroy's program loops on each word, stripping affixes and looking up the result until it either finds a match or no affixes remain (and the word is declared to be erroneous).
     • A back-of-the-envelope analysis showed the importance of keeping the program in main memory.
     • Stripping prefixes and suffixes reduced the list to below one third of its original size, hashing discards 60% of the bits that remain, and data compression halves that again.
     • As a result, a list of 75,000 words (and roughly as many inflected forms) was represented in 26,000 16-bit computer words (52K).
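
The strip-and-look-up loop can be illustrated with a toy sketch. This is not McIlroy's code: the function name acceptWord, the greedy prefix-before-suffix order, and the tiny rule lists in the usage note are assumptions, and the stop list of exceptions is not modelled.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy sketch: returns true if word, or a form of it with affixes stripped,
// appears in dict. Affixes are plain strings; empty affixes are skipped.
bool acceptWord(std::string word, const std::set<std::string> &dict,
                const std::vector<std::string> &prefixes,
                const std::vector<std::string> &suffixes) {
    for (;;) {
        if (dict.count(word))
            return true;                          // found a match
        bool stripped = false;
        for (const std::string &p : prefixes) {   // try to peel one prefix
            if (!p.empty() && word.size() > p.size() &&
                word.compare(0, p.size(), p) == 0) {
                word.erase(0, p.size());
                stripped = true;
                break;
            }
        }
        if (!stripped) {
            for (const std::string &s : suffixes) {   // else peel one suffix
                if (!s.empty() && word.size() > s.size() &&
                    word.compare(word.size() - s.size(), s.size(), s) == 0) {
                    word.erase(word.size() - s.size());
                    stripped = true;
                    break;
                }
            }
        }
        if (!stripped)
            return false;                         // no affixes remain
    }
}
```

With the dictionary {sent}, prefixes {mis, re, pre} and suffix {ed}, the word misrepresented is reduced step by step to represented, presented, sented and finally sent, which is accepted.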

  37. The real world
     • McIlroy used hashing to represent 30,000 English words in 27 bits each.
     • Consider the toy list of words: a list of five words
     • The first hashing method uses an n-element hash table roughly the same size as the list and a hash function that maps a string into an integer in the range [0, n).
     • The ith entry of the table points to a linked list that contains all strings that hash to i.
     • If null lists are represented by empty cells and the hash function yields h(a) = 2, h(list) = 1, then a 5-element table might look like:
     [Figure: five-cell hash table holding the words of, list, a, words, five]

  38. The real world
     • To look up a word w, we perform a sequential search of the list pointed to by the h(w)th cell.
     • The next scheme uses a much larger table. Choosing n = 23 makes it likely that most hash cells contain just one element.
     • The spell program uses n = 2²⁷ (roughly 134 million), and all but a few of the non-empty lists contain just a single element.
     [Figure: larger hash table holding the words list, words, a, of, five, one per cell]

  39. The real world
     • The next step is daring! Instead of a linked list of words, McIlroy stores just a single bit in each table entry. This reduces space dramatically, but introduces errors.
     • To look up a word w, the program accesses the h(w)th bit in the table. If this bit is a 0, then the program correctly reports that the word is not in the table. Sometimes a bad word happens to hash to a valid bit, but the probability of such an error is just 30,000/2²⁷, or roughly 1/4,000.
     • On average, therefore, 1 in 4,000 bad words will sneak by undetected.
     • McIlroy observed that typical rough drafts rarely contained more than 20 errors, so this defect hampers at most one run in every hundred.
     [Figure: bit table with five bits set to 1]
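
The one-bit-per-entry idea can be sketched as follows. The slides do not say which hash function spell used, so std::hash<std::string> stands in for it here, and the class name OneBitHashTable and its member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Sketch of a hash table that stores one bit per cell and no words.
// std::hash is an assumed stand-in for the real hash function.
class OneBitHashTable {
    std::size_t nbits;
    std::vector<std::uint32_t> words;
    std::size_t slot(const std::string &w) const {
        return std::hash<std::string>{}(w) % nbits;
    }
public:
    explicit OneBitHashTable(std::size_t bits)
        : nbits(bits), words((bits + 31) / 32, 0) {
        assert(bits > 0);
    }
    void add(const std::string &w) {
        std::size_t h = slot(w);
        words[h >> 5] |= std::uint32_t(1) << (h & 31);
    }
    // false: the word is definitely absent.
    // true: the word is present, or another word hashed to the same bit.
    bool maybeContains(const std::string &w) const {
        std::size_t h = slot(w);
        return ((words[h >> 5] >> (h & 31)) & 1u) != 0;
    }
};
```

A lookup can never miss a word that was added. It can wrongly accept a word that was not, with probability close to the fraction of bits that are set.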

  40. The real world
     • Representing the hash table by a string of n = 2²⁷ bits would consume over 16 million bytes. The program therefore represents just the 1 bits; in the previous example it stores only 2, 8, 13, 15, 22.
     • The word is declared to be in the table if h(w) is present.
     • The obvious representation of these numbers uses 30,000 27-bit words, but McIlroy's machine only had 32,000 16-bit words in its address space.
     • He sorted the list and used a variable-length code to represent the differences between successive hash values. Assuming a starting value of zero, the above list is compressed to: 2, 6, 5, 2, 7.
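
The difference step can be sketched on its own. The function names toGaps and fromGaps are hypothetical, and the variable-length code that then packs the gaps into bits is not shown, since the slides do not describe it.

```cpp
#include <cassert>
#include <vector>

// Replace each value of a sorted list by its gap from the previous value,
// starting from an assumed previous value of 0.
std::vector<int> toGaps(const std::vector<int> &sorted) {
    std::vector<int> gaps;
    int prev = 0;
    for (int h : sorted) {
        assert(h >= prev);            // input must be sorted and non-negative
        gaps.push_back(h - prev);
        prev = h;
    }
    return gaps;
}

// Rebuild the original values by keeping a running sum of the gaps.
std::vector<int> fromGaps(const std::vector<int> &gaps) {
    std::vector<int> values;
    int sum = 0;
    for (int g : gaps) {
        sum += g;
        values.push_back(sum);
    }
    return values;
}
```

Applied to 2, 8, 13, 15, 22 this yields 2, 6, 5, 2, 7, and the running sum restores the original list. The gaps are small numbers, which is what lets a variable-length code store them in fewer bits than the full hash values.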

  41. The real world
     • McIlroy's program represents the differences in an average of 13.6 bits each.
     • This left a few hundred words free to point at useful starting points in the compressed list and thereby speed up the sequential search.
     • The result was a 64 kilobyte dictionary that has fast access times and rarely makes mistakes.

  42. Estimation
     • Before embarking on a significant programming job, it is often wise to estimate the resources you need to complete the task: the amount of memory, the execution time of the application, and so on.
     • Before developing a complete complexity analysis of the application to estimate average-case and worst-case time complexity (for example), it is often desirable to undertake a quick "back of the envelope" calculation to get a rough idea of the answer.
     • This will tell you if the plan you have is feasible, or if you need to undertake some refinements.

  43. Mississippi River
     • How much water flows out of the Mississippi River each day?

  44. Mississippi River
     • Estimate that the mouth of the river is about a mile wide.
     • Estimate that the average depth at the mouth is about 20 feet (one two hundred and fiftieth of a mile).
     • Guess that the rate of flow is about five miles an hour (one hundred and twenty miles a day).
     • Multiplying gives:
       • 1 mile × 1/250 mile × 120 miles/day ≈ ½ mile³/day.

  45. Mississippi River
     • It is always good to double check your calculations (especially quick ones). Two answers are better than one.
     • Estimate that the Mississippi River basin is 1000 miles × 1000 miles.
     • Estimate that the annual run-off from rainfall there is about one foot (one five thousandth of a mile).
     • That gives:
       • 1000 miles × 1000 miles × 1/5000 mile/year ≈ 200 miles³/year
       • 200 miles³/year / 400 days/year ≈ ½ mile³/day

  46. Mississippi River
     • As a cheating triple check, we could consult an almanac and discover that the river's discharge is 640,000 ft³/second.
     • Working from that gives:
       • 640,000 ft³/sec × 3,600 secs/hr ≈ 2.3 × 10⁹ ft³/hr
       • 2.3 × 10⁹ ft³/hr × 24 hrs/day ≈ 6 × 10¹⁰ ft³/day
       • 6 × 10¹⁰ ft³/day / (5000 ft/mile)³ ≈ 6 × 10¹⁰ ft³/day / (125 × 10⁹ ft³/mile³) ≈ 60/125 mile³/day ≈ ½ mile³/day
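
The three estimates can be reproduced in a few lines. This sketch uses only the rounded figures from the slides (including 5000 ft per mile and 400 days per year); the function names are hypothetical.

```cpp
// Each function returns an estimate of the daily outflow in cubic miles.

// Estimate 1: width x depth x distance travelled per day.
inline double fromCrossSection() {
    return 1.0 * (1.0 / 250.0) * 120.0;
}

// Estimate 2: basin area x annual run-off, spread over about 400 days.
inline double fromRunoff() {
    return 1000.0 * 1000.0 * (1.0 / 5000.0) / 400.0;
}

// Estimate 3: almanac discharge in cubic feet per second, converted to
// cubic miles per day with the rounded 5000 ft per mile.
inline double fromAlmanac() {
    double cubicFeetPerDay = 640000.0 * 3600.0 * 24.0;
    double cubicFeetPerCubicMile = 5000.0 * 5000.0 * 5000.0;
    return cubicFeetPerDay / cubicFeetPerCubicMile;
}
```

All three land between 0.4 and 0.5 cubic miles per day, which is the level of agreement a back-of-the-envelope check is meant to show.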

  47. Real-life
     • Such a back-of-the-envelope calculation was done to assess the feasibility of an e-mail system proposed for the Summer Olympic Games.
     • The proposal contained most of the numbers needed.
     • The time to send yourself a one-character mail message can be measured.
     • Performing calculations as simple as those for the Mississippi River revealed that the proposed system could only work if there were 120 seconds per minute!
     • The proposal was sent back to the drawing board and revised.
     • The system was built a year later and was used during the Olympic Games without a hitch.
     • The idea of estimating is standard fare in engineering schools and is the bread and butter of most practicing engineers.
     • As software engineers we need to be able to estimate resource (especially space and time) requirements.

  48. Rule of 72
     • If you invest your money for y years at an interest rate of r percent per year and y × r = 72, then you will roughly double your money.
     • For example, investing $1000 at 6% for 12 years gives $2012.
     • $1000 at 8% for 9 years gives you $1999.
     • The Rule of 72 is handy for estimating the growth of any exponential process. If a bacterial colony in a dish grows by 3%/hr, then it doubles in size every day.
     • Doubling brings programmers back to some familiar rules of thumb:
       • Since 2¹⁰ = 1024, 10 doublings is about 1000,
       • 20 doublings is about a million,
       • 30 doublings is about a billion.
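
The rule can be checked against exact compound growth. In this sketch the function names grow, doublingTimeRule72, and doublingTimeExact are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Value of principal after the given number of periods at ratePercent
// compound growth per period.
inline double grow(double principal, double ratePercent, int periods) {
    assert(ratePercent > -100.0);
    return principal * std::pow(1.0 + ratePercent / 100.0, periods);
}

// Doubling time estimated by the Rule of 72.
inline double doublingTimeRule72(double ratePercent) {
    return 72.0 / ratePercent;
}

// Exact doubling time: solve (1 + r/100)^y = 2 for y.
inline double doublingTimeExact(double ratePercent) {
    return std::log(2.0) / std::log(1.0 + ratePercent / 100.0);
}
```

For the bacterial colony growing at 3% per hour, the rule gives 24 hours, while the exact doubling time is about 23.4 hours, so the approximation is close at small rates.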

  49. Rule of 72
     • Suppose an exponential program takes 10 seconds to solve a problem of size n = 40.
     • Suppose that increasing n by 1 increases the run time by 12%.
     • The Rule of 72 tells us that the run time doubles when n increases by 6.
     • When n = 100, the program should therefore take about 10,000 seconds (a few hours).
     • What happens if n increases to 160? The time rises to 10⁷ seconds. How much time is that?
     • You might remember that 3.155 × 10⁷ seconds = 1 year, or you might remember Tom Duff's handy rule of thumb that, to within ½%, π seconds is a nanocentury.
     • Because the expected running time is 10⁷ seconds, we should be prepared to wait about 4 months!
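
The projection on this slide amounts to counting doublings. A minimal sketch, with the hypothetical names projectedSeconds and secondsToMonths, using the slide's figure of 3.155 × 10⁷ seconds per year:

```cpp
#include <cassert>
#include <cmath>

// Run time at problem size n, given the time at baseN and the assumption
// that the time doubles each time n grows by stepsPerDoubling.
inline double projectedSeconds(double baseSeconds, int baseN, int n,
                               int stepsPerDoubling) {
    assert(stepsPerDoubling > 0);
    double doublings = double(n - baseN) / stepsPerDoubling;
    return baseSeconds * std::pow(2.0, doublings);
}

// Convert seconds to months using 3.155e7 seconds per year.
inline double secondsToMonths(double seconds) {
    return seconds / 3.155e7 * 12.0;
}
```

Going from n = 40 to n = 100 is 10 doublings, so 10 seconds becomes 10,240 seconds. Going to n = 160 is 20 doublings, a little over 10⁷ seconds, or about 4 months.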

  50. Computer Science
     • Nodes in your data structure hold an integer and a pointer to a node:
           struct node { int i; struct node *p; };
     • Will 2 million nodes fit into the main memory of your 128 MB computer?
     • Looking at the system performance monitor of your 128 MB computer, you might find that you have about 85 MB free. (The operating system hogs a lot of space.)
     • How much memory will 2 million nodes take?
     • If you have a 32-bit machine, each integer and pointer will take 4 bytes. 2,000,000 × (4 + 4) = 16 MB.
     • If you have a 64-bit machine, each may take 8 bytes (the pointer is 8 bytes, and a 4-byte integer is typically padded to 8 for alignment). 2,000,000 × (8 + 8) = 32 MB.
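
Rather than assuming the sizes, the estimate can ask the compiler. This sketch uses the node structure from the slide; megabytesFor is a hypothetical helper that counts a megabyte as 10⁶ bytes.

```cpp
#include <cstddef>

struct node { int i; struct node *p; };

// Memory taken by count nodes stored contiguously, in units of 10^6 bytes.
// Separately allocated nodes usually cost more because of allocator
// overhead, which this estimate does not include.
inline double megabytesFor(std::size_t count) {
    return double(count) * sizeof(node) / 1e6;
}
```

On a typical 32-bit build sizeof(node) is 8 and megabytesFor(2000000) is 16. On a typical 64-bit build sizeof(node) is 16, giving 32. Either total fits in the 85 MB reported as free.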
