
Computer Science 112


adler

Presentation Transcript


  1. Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections

  2. What They Are • Bag - a collection of items in no particular order • Set - a collection of unique items in no particular order • Dictionary - a collection of values associated with unique keys

  3. Variations • SortedBag - a bag that allows clients to access items in sorted order • SortedSet - a set that allows clients to access items in sorted order • SortedDictionary - a dictionary that allows clients to access keys in sorted order

  4. Sorted Set and Dictionary Implementations • Array-based, using a sorted list • Linked, using a linked binary search tree • Must keep the tree balanced; searches, insertions, and removals will then all be logarithmic
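
The array-based strategy on slide 4 might look roughly like the sketch below, which uses the standard bisect module for binary search. The class name SortedListSet and its methods are hypothetical and are not the course's classes.

```python
from bisect import bisect_left

class SortedListSet:
    """Hypothetical sorted set backed by a sorted Python list."""

    def __init__(self):
        self._items = []

    def __contains__(self, item):
        # Binary search needs O(log n) comparisons
        i = bisect_left(self._items, item)
        return i < len(self._items) and self._items[i] == item

    def add(self, item):
        # Locating the position is O(log n); the insertion shifts items, O(n)
        i = bisect_left(self._items, item)
        if i == len(self._items) or self._items[i] != item:
            self._items.insert(i, item)

    def __iter__(self):
        # Items come back in sorted order
        return iter(self._items)

s = SortedListSet()
for x in [5, 1, 3, 1]:
    s.add(x)
print(list(s))  # [1, 3, 5]
```

Search is logarithmic here, but insertion into the list is still linear because items must shift; that cost is one reason to consider the balanced tree alternative.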

  5. Dictionary Interface
       d.isEmpty()
       len(d)
       iter(d)                            # Iterate through the keys
       str(d)
       key in d
       d.get(key, defaultValue = None)
       item = d[key]
       d[key] = item                      # Add or replace
       d.pop(key, defaultValue = None)
       d.entries()                        # A set of entries
       d.keys()                           # An iterator on the keys
       d.values()                         # An iterator on the values

  6. Dictionary Implementations • Array-based (like ArraySet and SortedSet) • Linked structure (like LinkedSet and LinkedSortedSet) • Both use an Entry class to contain the key/value pair

  7. Possible Organization I • Classes in the hierarchy diagram: AbstractCollection, AbstractBag, ArrayBag, LinkedBag, ArraySet, LinkedSet, ArrayDict, LinkedDict • Is a dictionary just a type of set with some additional methods?

  8. Possible Organization II • Classes in the hierarchy diagram: AbstractCollection, AbstractBag, AbstractDict, ArrayBag, LinkedBag, ArrayDict, LinkedDict, ArraySet, LinkedSet • Which methods are implemented in AbstractDict?

  9. The Entry Class
       class Entry(object):

           def __init__(self, key, value):
               self.key = key
               self.value = value

           # Entries are compared by key only
           def __eq__(self, other):
               if type(self) != type(other): return False
               return self.key == other.key

           def __lt__(self, other):
               if type(self) != type(other): return False
               return self.key < other.key

           def __le__(self, other):
               if type(self) != type(other): return False
               return self.key <= other.key
       • Goes in abstractdict.py, where all dictionaries can see it

  10. The AbstractDict Class
       from abstractcollection import AbstractCollection

       class AbstractDict(AbstractCollection):

           def __init__(self):
               AbstractCollection.__init__(self, None)

           def __str__(self):
               return "{" + ", ".join(map(lambda entry: str(entry.key) + \
                          ":" + str(entry.value), self.entries())) + "}"
       • Example result of str(d): {2:3, 6:7}
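
To make the array-based idea from slides 6, 9, and 10 concrete, here is a minimal, self-contained sketch. ListDict and its _find helper are hypothetical names, Entry is trimmed to its two fields, and a plain Python list stands in for the course's array; the real ArrayDict may be organized differently.

```python
class Entry:
    """Trimmed-down key/value pair, modeled on the Entry class of slide 9."""
    def __init__(self, key, value):
        self.key = key
        self.value = value

class ListDict:
    """Hypothetical array-based dictionary: a list of Entry objects."""

    def __init__(self):
        self._entries = []

    def _find(self, key):
        # Linear search for the entry with this key, or None
        for entry in self._entries:
            if entry.key == key:
                return entry
        return None

    def __len__(self):
        return len(self._entries)

    def __getitem__(self, key):
        entry = self._find(key)
        if entry is None:
            raise KeyError(key)
        return entry.value

    def __setitem__(self, key, value):
        # Add a new entry or replace the value of an existing one
        entry = self._find(key)
        if entry is None:
            self._entries.append(Entry(key, value))
        else:
            entry.value = value

    def __str__(self):
        # Same "{key:value, ...}" format as AbstractDict.__str__
        return "{" + ", ".join(str(e.key) + ":" + str(e.value)
                               for e in self._entries) + "}"

d = ListDict()
d[2] = 3
d[6] = 7
d[2] = 4          # replaces the value stored under key 2
print(d)          # {2:4, 6:7}
```

Every operation here performs a linear search, which is the cost that the hashing strategy in the following slides tries to avoid.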

  11. Can We Do Better? • If we could associate each unordered set element or each unordered dictionary key with a unique index position in an array, we could have • Constant-time search • Constant-time insertion • Constant-time removal

  12. Hashing • Each data element has a hash value, which is an integer and is ideally unique to that element • This value can be computed in constant time by a hash function • This computation can be performed on each insertion, access, and removal

  13. How Are the Elements Stored? • The hash value is used to locate the element’s index in an array, thus preserving constant-time access • How to compute this: hashValue % capacity of array • Position will be >= 0 and < capacity
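
As a small illustration of the index computation on slide 13, the sketch below maps a few integers to positions. The helper name index_for is hypothetical; small non-negative integers are used because CPython hashes them to themselves, which keeps the output predictable.

```python
def index_for(item, capacity):
    """Map an item's hash value to a position in the range [0, capacity)."""
    return abs(hash(item)) % capacity

capacity = 16
positions = [index_for(item, capacity) for item in [3, 21, 42, 100]]
print(positions)  # [3, 5, 10, 4]
```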

  14. A Sample Access Method (Set)
       def __contains__(self, item):
           index = abs(hash(item)) % len(self._array)
           # True if the slot is occupied (assumes no two items share an index)
           return self._array[index] is not None
       • self._array is an array of items • len(self._array) is the array’s current physical size • hash(item) is a function that returns an item’s hash value • Other access methods have a similar structure

  15. A Sample Mutator Method (Set)
       def add(self, item):
           if not item in self:
               index = abs(hash(item)) % len(self._array)
               self._array[index] = item

  16. Adding Items • mySet.add("A") • index = 10 • Array now holds: A

  17. Adding Items • mySet.add("B") • index = 5 • Array now holds: B A

  18. Adding Items • mySet.add("C") • index = 0 • Array now holds: C B A

  19. Adding Items • mySet.add("D") • index = 14 • Array now holds: C B A D

  20. Adding Items • Add 12 more items • Array currently holds: C B A D

  21. Adding Items • Array is full (figure shows 16 items: K Y G W D A B Q M C N E T L F I) • Resize the array and rehash all elements

  22. Performance • O(1) lookups, insertions, removals - wow! • Cost of resizing the array is amortized over many insertions and removals • Works as long as hashValue % capacity is not the same for two items
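
The "resize and rehash" step from slides 21 and 22 could be sketched as follows. The function name rehash is hypothetical, a Python list stands in for the array, and the sketch assumes (as the slides do at this point) that no two items land on the same index.

```python
def rehash(old_array, new_capacity):
    """Copy every item into a larger array at its newly computed index.

    Assumes no two items collide in the new array.
    """
    new_array = [None] * new_capacity
    for item in old_array:
        if item is not None:
            new_array[abs(hash(item)) % new_capacity] = item
    return new_array

old = [None] * 4
for item in [4, 1, 6, 3]:              # fills indexes 0, 1, 2, 3
    old[abs(hash(item)) % len(old)] = item

new = rehash(old, 8)
print(old)  # [4, 1, 6, 3]
print(new)  # [None, 1, None, 3, 4, None, 6, None]
```

Note that an item's index generally changes when the capacity changes, which is why every element must be rehashed rather than simply copied.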

  23. Problem: Collisions • As more elements fill the array, the likelihood that their hash values map to the same array position increases • A collision then occurs: that is, items compete for the same position in the array

  24. A Tester Program
       def testHash(arrayLength = 10, numberOfItems = 5):
           print(" Item hash code array index")
           for i in range(1, numberOfItems + 1):
               item = "Item" + str(i)
               code = hash(item)
               index = abs(code) % arrayLength
               print("%7s%12d%8d" % (item, code, index))

  25. Load Factor • An array’s load factor expresses the ratio of the number of elements to its capacity • Example: elements(10) / length(30) = .3333 • Try to keep load factor low to minimize collisions • Does waste some memory, though
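
A rough way to see the effect of load factor on collisions is sketched below. The function names load_factor and count_collisions are hypothetical, and the items are small integers chosen so that the indexes are easy to check by hand.

```python
def load_factor(number_of_items, capacity):
    """Ratio of the number of elements to the array's capacity."""
    return number_of_items / capacity

def count_collisions(items, capacity):
    """Count items whose index is already claimed by an earlier item."""
    used = set()
    collisions = 0
    for item in items:
        index = abs(hash(item)) % capacity
        if index in used:
            collisions += 1
        else:
            used.add(index)
    return collisions

items = [7, 17, 27, 12, 22]
print(load_factor(len(items), 10), count_collisions(items, 10))  # 0.5 3
print(count_collisions(items, 30))                               # 0
```

With capacity 10 the five items crowd into indexes 7 and 2; with capacity 30 each item gets its own index, at the price of a mostly empty array.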

  26. Collision Processing Strategies • Linear collision processing - search for the next available empty slot in the array, wrapping around if the end is reached • Can lead to clustering, where several elements that have collided now occupy consecutive positions • Several small clusters may coalesce into a large cluster and thus degrade performance
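
Linear collision processing, as described on slide 26, might be sketched like this. The function name linear_probe_add is hypothetical, a Python list stands in for the array, and the sketch assumes the array always has at least one empty slot.

```python
def linear_probe_add(array, item):
    """Insert item with linear probing and return the index used.

    Assumes the array contains at least one empty (None) slot.
    """
    capacity = len(array)
    index = abs(hash(item)) % capacity
    while array[index] is not None and array[index] != item:
        index = (index + 1) % capacity   # wrap around at the end
    array[index] = item
    return index

table = [None] * 8
used = [linear_probe_add(table, x) for x in [6, 14, 22, 7]]
print(used)  # [6, 7, 0, 1]
```

Items 6, 14, and 22 all hash to index 6, so they occupy 6, 7, and 0. Item 7 then finds its own home position taken by that cluster and ends up at index 1, which is the clustering effect the slide warns about.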

  27. Collision Processing Strategies • Rehashing - run one or more additional hash functions until a collision does not occur • Works well when the load factor is small • Multiple hash functions may contribute a large constant of proportionality to the running time
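
One way to picture rehashing is to try a list of hash functions in order until one yields a free slot. The sketch below does that; rehash_add and the two extra hash functions are made up purely for illustration and are not recommended hash functions.

```python
def rehash_add(array, item, hash_functions):
    """Try each hash function in turn until one yields an empty slot."""
    capacity = len(array)
    for h in hash_functions:
        index = abs(h(item)) % capacity
        if array[index] is None or array[index] == item:
            array[index] = item
            return index
    raise RuntimeError("every hash function produced a collision")

# Hypothetical hash functions, for illustration only
functions = [hash,
             lambda x: hash(x) // 7 + 1,
             lambda x: hash(x) * 31 + 5]

table = [None] * 10
used = [rehash_add(table, x, functions) for x in [12, 22, 32]]
print(used)  # [2, 4, 5]
```

Items 22 and 32 collide with 12 at index 2 under the first function, so each is placed by the second function instead. Every extra function call adds to the constant factor mentioned on the slide.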

  28. Collision Processing Strategies • Quadratic collision processing - Move a considerable distance from the initial collision • Does not require other rehashing functions • When k is the collision position, we enter a loop that repeatedly attempts to locate an empty position
       k + 1^2   // The first attempt to locate a position
       k + 2^2   // The second attempt to locate a position
       k + r^2   // The rth attempt to locate a position
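
The quadratic probe sequence k + 1^2, k + 2^2, ..., k + r^2 could be coded roughly as below. The function name quadratic_probe_add is hypothetical; wrapping the position with % capacity and giving up after capacity attempts are assumptions of this sketch, not details stated on the slide.

```python
def quadratic_probe_add(array, item):
    """Insert item with quadratic probing and return the index used."""
    capacity = len(array)
    k = abs(hash(item)) % capacity       # the collision position
    index = k
    r = 0
    while array[index] is not None and array[index] != item:
        r += 1
        if r > capacity:                 # assumed cutoff for this sketch
            raise RuntimeError("no empty position found")
        index = (k + r * r) % capacity   # the rth attempt: k + r^2
    array[index] = item
    return index

table = [None] * 11
used = [quadratic_probe_add(table, x) for x in [3, 14, 25, 36]]
print(used)  # [3, 4, 7, 1]
```

All four items hash to position 3, yet they end up spread across the array at 3, 4, 7, and 1 rather than in one consecutive run.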

  29. Collision Processing Strategies • Chaining • Each hash value specifies an index or bucket in the array • This bucket is at the head of a linked structure or chain of items with the same hash value

  30. Some Buckets and Chains • Figure: an array with indexes 0 through 4; each bucket heads a chain, and the items D1 through D8 are distributed among the chains
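
Before the linked implementation, the idea of buckets and chains can be sketched with a list of Python lists. The function name chain_add is hypothetical, and Python lists stand in here for the linked chains.

```python
def chain_add(buckets, item):
    """Add item to the chain in its bucket unless it is already there."""
    chain = buckets[abs(hash(item)) % len(buckets)]
    if item not in chain:
        chain.insert(0, item)   # place at the head of the chain

buckets = [[] for _ in range(5)]
for x in [10, 3, 15, 8, 20]:
    chain_add(buckets, x)
print(buckets)  # [[20, 15, 10], [], [], [8, 3], []]
```

Items that share an index simply share a chain, so a collision never forces a search for another slot in the array.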

  31. HashSet Data
       # Instance variables for locating data
       self._foundEntry   # Pointer to item just located
                          # undefined if not found
       self._priorEntry   # Pointer to item prior to one just located
                          # undefined if not found
       self._index        # Index of chain in which item was located
                          # undefined if not found

       # Instance variables for data
       self._array        # the array of collision lists
       self._size         # number of items in the set
       • Extra instance variables support pointer manipulations during insertions and removals

  32. HashSet Initialization
       from node import Node
       from arrays import Array   # assumed module name for the course's Array class
       from abstractset import AbstractSet
       from abstractcollection import AbstractCollection

       class HashSet(AbstractCollection, AbstractSet):

           DEFAULT_CAPACITY = 1000

           def __init__(self, sourceCollection = None):
               self._array = Array(HashSet.DEFAULT_CAPACITY)
               self._foundEntry = self._priorEntry = None
               self._index = -1
               AbstractCollection.__init__(self, sourceCollection)
       • Uses singly linked nodes for the collision lists

  33. HashSet Searching
       def __contains__(self, item):
           self._index = abs(hash(item)) % len(self._array)
           self._priorEntry = None
           self._foundEntry = self._array[self._index]
           while self._foundEntry != None:
               if self._foundEntry.data == item:
                   return True
               else:
                   self._priorEntry = self._foundEntry
                   self._foundEntry = self._foundEntry.next
           return False
       • If this method returns True, the instance variables _index, _foundEntry, and _priorEntry allow other methods to locate and manipulate an item in the array’s collision list efficiently

  34. HashSet Insertion
       def add(self, item):
           if not item in self:
               # Link to head of chain
               newEntry = Node(item, self._array[self._index])
               self._array[self._index] = newEntry
               self._size += 1
               return True
           else:
               return False

  35. HashSet Removal
       def remove(self, item):
           if not item in self:
               return False
           elif self._priorEntry == None:
               self._array[self._index] = self._foundEntry.next
           else:
               self._priorEntry.next = self._foundEntry.next
           self._size -= 1
           return True
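
The three methods on slides 33 to 35 depend on course modules (Node, Array, AbstractCollection) that are not shown here. The sketch below is a self-contained approximation so the pointer manipulations can be run directly: SimpleHashSet is a hypothetical name, the Node class is a minimal stand-in, and a Python list replaces the Array class.

```python
class Node:
    """Minimal singly linked node (stand-in for the course's Node class)."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

class SimpleHashSet:
    """Hypothetical, self-contained variant of the slides' HashSet."""

    def __init__(self, capacity=8):
        self._array = [None] * capacity      # the array of collision lists
        self._size = 0
        self._foundEntry = self._priorEntry = None
        self._index = -1

    def __len__(self):
        return self._size

    def __contains__(self, item):
        # Records _index, _priorEntry, and _foundEntry for add and remove
        self._index = abs(hash(item)) % len(self._array)
        self._priorEntry = None
        self._foundEntry = self._array[self._index]
        while self._foundEntry is not None:
            if self._foundEntry.data == item:
                return True
            self._priorEntry = self._foundEntry
            self._foundEntry = self._foundEntry.next
        return False

    def add(self, item):
        if item in self:
            return False
        # Link the new node at the head of the chain
        self._array[self._index] = Node(item, self._array[self._index])
        self._size += 1
        return True

    def remove(self, item):
        if item not in self:
            return False
        if self._priorEntry is None:         # item is at the head of its chain
            self._array[self._index] = self._foundEntry.next
        else:                                # unlink from the middle or end
            self._priorEntry.next = self._foundEntry.next
        self._size -= 1
        return True

s = SimpleHashSet(capacity=4)
for x in [1, 5, 9, 2]:       # 1, 5, and 9 all map to index 1
    s.add(x)
s.remove(5)                  # unlinks a node from the middle of a chain
print(len(s), 1 in s, 5 in s, 9 in s)  # 3 True False True
```

Because add and remove both start with a membership test, the search leaves _index, _priorEntry, and _foundEntry pointing at exactly the nodes that need relinking.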

  36. Performance of Chaining • If chains are evenly distributed across the array, close to O(1) • If one or two chains get very long, processing tends towards linear • Can use a large array but wastes memory • On the average and for the most part, close to O(1)

  37. For Friday Introduction to Graphs (Chapter 20)
