Algorithms and Data Structures

Algorithms and Data Structures Lecture 6

Agenda: • Hash tables • Collisions • Hash functions • Binary heap

Data Structures: hash tables • U: the set of possible keys • K: the set of keys actually used; K is a subset of U • T: the hash table • For a Direct Address Table (DAT), |T| = |U| • Building a DAT becomes memory-consuming or even impossible when |U| is very large • If |K| is much smaller than |U|, most of the DAT is unused

Data Structures: hash tables - sample • E.g. U is the set of 16-bit integers and K = { x | x ∊ U and x < 1024 } • |U| = 2^16, |K| = 2^10, so 2^16 - 2^10 = 64512 slots are unused; assuming a slot occupies 4 bytes, that is 258048 bytes = 252 KB of allocated but unused memory
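The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the constant names are made up for illustration, and the 4-byte slot size is the assumption stated on the slide.

```python
# Figures taken from the example above; the names are illustrative only.
UNIVERSE_SIZE = 2 ** 16   # |U|: all 16-bit keys
USED_KEYS = 2 ** 10       # |K|: keys below 1024
SLOT_BYTES = 4            # assumed size of one slot

unused_slots = UNIVERSE_SIZE - USED_KEYS
unused_bytes = unused_slots * SLOT_BYTES
unused_kb = unused_bytes // 1024

print(unused_slots, unused_bytes, unused_kb)  # 64512 258048 252
```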

Data Structures: hash tables • It is reasonable to constrain the cardinality of T (call it m) to be close to |K|; that is the idea behind a hash table • In a DAT, the position of a key k from U is determined by k itself; in other words, from the viewpoint of the DAT, k is a position • In a hash table T, the position is determined by some function h(k) • Here h is a hash function, which computes a position in T from the value of the key k

Data Structures: hash tables • h : U -> { 0, 1, …, m-1 }, so the domain of h is the set of possible keys and its range is the set of positions in T • The value h(k) is also called the hash value of k • A function f : X -> Y is single-valued (one-to-one) if and only if, for any x1 ∊ X and x2 ∊ X, f(x1) and f(x2) are not equal whenever x1 ≠ x2; otherwise f is nonsingle-valued

Data Structures: hash tables • If m (the cardinality of T) is less than |U|, h is a nonsingle-valued function: there exist two different keys k1, k2 with h(k1) = h(k2) • If the hash values of two different keys are equal, we say that there is a collision in the hash table • For a DAT, |U| = m and h(k) = k is a single-valued function, so there are no collisions in a DAT
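A small sketch of how a collision arises when m is smaller than the number of possible keys. The helper `find_collision` and the choice h(k) = k mod 8 are assumptions made for illustration; they are not part of the lecture.

```python
def find_collision(keys, h):
    """Return the first pair of different keys with equal hash values, or None."""
    seen = {}  # hash value -> first key that produced it
    for k in keys:
        pos = h(k)
        if pos in seen:
            return seen[pos], k
        seen[pos] = k
    return None

# 16 possible keys but only 8 positions: a collision is unavoidable.
print(find_collision(range(16), lambda k: k % 8))  # (0, 8)

# DAT case: h(k) = k and as many positions as keys, so no collision.
print(find_collision(range(16), lambda k: k))      # None
```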

Data Structures: hash tables

Data Structures: collisions • It is desirable to construct the hash function so that collisions are as improbable as possible • A hash function must be deterministic: it always produces the same value for the same input, however many times it is called • There are a number of collision resolution methods: chaining, open addressing and others

Data Structures: collisions – chaining • Each position of the hash table has an associated linked list holding the elements whose keys have the same hash value
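A minimal sketch of a hash table with chaining, under several assumptions: keys are non-negative integers, the hash function is k mod m, and Python lists stand in for the linked lists. The class and method names are hypothetical.

```python
class ChainedHashTable:
    """Hash table with chaining; Python lists stand in for linked lists."""

    def __init__(self, m):
        self.m = m                              # number of positions
        self.slots = [[] for _ in range(m)]     # one chain per position

    def _h(self, key):
        # Assumed hash function for this sketch: remainder of division by m.
        return key % self.m

    def add(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                        # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None                             # unsuccessful search

    def delete(self, key):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False


table = ChainedHashTable(4)
table.add(1, "one")
table.add(5, "five")        # 5 mod 4 == 1 mod 4: same chain as key 1
print(table.search(5))      # five
print(len(table.slots[1]))  # 2
```

Because `add` scans the chain for duplicates and `delete` searches by key, both cost as much as a search in this sketch; the O(1) bounds usually quoted assume insertion at the head of the list and deletion by reference to the element.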

Data Structures: collisions – chaining • Let n be the number of elements in a hash table T (with chains) and m the cardinality of T (the number of positions in the table) • α = n/m is the load factor of the hash table; for a non-empty table α ∊ [1/m; n] • E.g. the load factor of any DAT is always between 0 and 1, as n cannot exceed m • The load factor of an arbitrary hash table with chains may take any value between 0 and n

Data Structures: collisions – chaining • Let’s consider the running time of operations on a hash table (with chains) with n elements and m slots • In the worst case (a poorly constructed hash function) all keys have the same hash value, so all n elements are organized into a single list • The running times are then those of list operations: search is Θ(n); add is O(1) when inserting at the head of the list; delete is O(1) given a reference to the element in a doubly linked list
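The worst case and the load factor can be illustrated with a short sketch. It assumes the hash function h(k) = k mod m and a deliberately bad key set (all multiples of m); the function name `chain_lengths` is hypothetical.

```python
def chain_lengths(keys, m):
    """Count how many keys land in each of the m positions for h(k) = k mod m."""
    lengths = [0] * m
    for k in keys:
        lengths[k % m] += 1
    return lengths

m = 8
bad_keys = [i * m for i in range(16)]   # every key is a multiple of m
lengths = chain_lengths(bad_keys, m)
alpha = len(bad_keys) / m               # load factor n/m

print(lengths)       # [16, 0, 0, 0, 0, 0, 0, 0]
print(alpha)         # 2.0
print(max(lengths))  # 16: a search may have to scan all n elements
```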

Data Structures: collisions – chaining • In the favourable case (a well-constructed hash function) we assume that hash values are distributed uniformly, i.e. any key is equally likely to hash to any of the m positions; this is the hypothesis of simple uniform hashing • When evaluating the search operation it is useful to consider two cases: (a) unsuccessful search (no element with the given key is in the table T) and (b) successful search

Data Structures: collisions – chaining • Theorem 1: Let T be a hash table (with chains) with load factor α, and let the hypothesis of simple uniform hashing hold. Then an unsuccessful search (1) visits α elements on average and (2) takes Θ(1+α) time on average (including the calculation of the hash value). • (1) By the hypothesis, all positions of T are equally probable for the given key. An unsuccessful search therefore looks through one of the m lists to its end. The average length of a list is n/m = α. This proves statement (1).
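Statement (1) can be checked numerically. The sketch below is a simulation under assumptions: random positions from Python's `random` module stand in for a uniform hash function, and the function names are hypothetical. A missing key is taken to hash to each position with probability 1/m and to scan that whole list.

```python
import random

def chain_lengths_uniform(n, m, seed=0):
    """Place n elements into m positions chosen uniformly at random."""
    rng = random.Random(seed)
    lengths = [0] * m
    for _ in range(n):
        lengths[rng.randrange(m)] += 1
    return lengths

def expected_visits_unsuccessful(lengths):
    """Average list length over all positions: each position has probability 1/m."""
    return sum(lengths) / len(lengths)

lengths = chain_lengths_uniform(n=1000, m=100)
print(expected_visits_unsuccessful(lengths))  # 10.0, i.e. alpha = n/m
```

The individual list lengths vary from run to run, but their average over the m positions is always n/m, which is why the result does not depend on the seed.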

Data Structures: collisions – chaining • (2) By (1), the average time needed to look through a list of α elements is Θ(α); the time needed to calculate the hash value is Θ(1). Thus the average time of an unsuccessful search is Θ(1+α). This proves statement (2). • Theorem 2 (addition to Theorem 1): The average time of a successful search in table T is Θ(1+α). • The average search time is the sum of the times needed to find each element of the table T, divided by the number of elements.

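Data Structures: collisions – chaining • Number the elements 1, …, n in the order in which they were added, and assume each new element is appended to the end of its list. When the i-th element was added the table held i-1 elements, so the expected length of its list was (i-1)/m; the time needed to find the i-th element is therefore Θ(1+(i-1)/m) • The average time is the sum of all these times divided by the number of elements: • (1/n) ∑ [1+(i-1)/m], i=1, …, n • = (1/n) [n + (1/m) ∑(i-1)] • = 1 + (1/(nm)) ∑(i-1) = 1 + (1/(nm)) (n-1)n/2 • = 1 + α(n-1)/(2n) = 1 + α/2 – 1/(2m) • = Θ(1+α). This proves the theorem.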
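The closed form obtained in this derivation can be verified with exact arithmetic. This sketch uses Python's `fractions` module to avoid rounding; the function names are made up for illustration.

```python
from fractions import Fraction

def average_by_sum(n, m):
    """(1/n) * sum over i = 1..n of [1 + (i-1)/m], computed term by term."""
    return sum(1 + Fraction(i - 1, m) for i in range(1, n + 1)) / n

def average_closed_form(n, m):
    """1 + alpha/2 - 1/(2m), with alpha = n/m."""
    alpha = Fraction(n, m)
    return 1 + alpha / 2 - Fraction(1, 2 * m)

print(average_by_sum(10, 4))       # 17/8
print(average_closed_form(10, 4))  # 17/8
```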

Data Structures: collisions – chaining • Let’s assume that m is proportional to n, i.e. n = cm, where c is some constant. Then n = O(m) (by the definition of O). On the other hand, α = n/m = O(m)/m = O(1), so O(1+α) = O(1) • Statement: if m grows proportionally to n and the hypothesis of simple uniform hashing holds, the average search time does not depend on n and is O(1). • Other operations: add is O(1), delete is O(1)

Data Structures: hash functions • A good hash function must satisfy the assumption of uniform hashing: for a key k all m hash values must be equally probable, where m is the cardinality of the hash table T • It is usually assumed that the domain of a hash function (the set of keys) is the set of natural numbers • Keys that are not natural numbers (strings, for example) can usually be transformed to the required form

Data Structures: hash functions • E.g. a two-letter string key may be converted either to a natural number or to a pair of natural numbers • “pt” -> pair <112, 116>, where 112 and 116 are the ASCII codes of “p” and “t” respectively • “pt” -> <14452>, where 14452 = 112*128 + 116 is the string read as a base-128 number (the ASCII code of a standard character is always less than 128) • If the string may contain non-standard characters, we have to use base 256 instead: 112*256 + 116 = 28788
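The conversion generalizes to strings of any length by treating each character code as one digit. The sketch below assumes the function name `string_to_number` and that every character code is smaller than the chosen base.

```python
def string_to_number(s, base=128):
    """Read the character codes of s as the digits of a number in the given base."""
    value = 0
    for ch in s:
        value = value * base + ord(ch)   # shift left by one digit, add the next
    return value

print(string_to_number("pt"))            # 14452 = 112*128 + 116
print(string_to_number("pt", base=256))  # 28788 = 112*256 + 116
```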

Data Structures: hash functions - construction • Division method: the hash value of a key k is the remainder of dividing k by m (the cardinality of T); h(k) = k mod m • E.g. m=12, k=100, h(100)=4 • To construct a good hash function (one satisfying the assumption of uniform hashing), the value of m must be chosen carefully with respect to the set of keys • Counter-example: U = { x | x = 2^n, n = 0, 1, 2, … } = {1, 2, 4, 8, 16, 32, …}; let m = 32 = 2^5 • The function h(k) = k mod 32 does not provide uniform hashing: h(k) = 0 for every key k >= 32
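Both the example and the counter-example for the division method can be reproduced directly. The function name `h_div` is an assumption made for this sketch.

```python
def h_div(k, m):
    """Division method: remainder of dividing the key k by the table size m."""
    return k % m

print(h_div(100, 12))  # 4

# Counter-example: keys are powers of two and m = 32 = 2**5.
powers_of_two = [2 ** n for n in range(10)]         # 1, 2, 4, ..., 512
hashes = [h_div(k, 32) for k in powers_of_two]
print(hashes)  # [1, 2, 4, 8, 16, 0, 0, 0, 0, 0]
```

Every key from 32 upwards lands in position 0, so the positions are far from equally probable for this key set.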

Data Structures: hash functions - construction • Multiplication method: the hash value of a key k is calculated by h(k) = lmax[ m ( kA mod 1 ) ], where A is some constant, 0<A<1, and kA mod 1 is the fractional part of kA • lmax(x) is the floor function: it returns the greatest integer that is less than or equal to x (x may be any non-negative number) • The method is less dependent on the chosen value of m
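A sketch of the multiplication method in floating-point arithmetic. The function name `h_mul` is hypothetical, and the default A = (√5 - 1)/2 ≈ 0.618 is an assumption: it is a commonly suggested constant, not one given in the lecture.

```python
import math

def h_mul(k, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplication method: floor(m * fractional part of k*A), with 0 < A < 1."""
    fractional = (k * A) % 1          # kA mod 1
    return math.floor(m * fractional)

# The result always lies in 0 .. m-1, whatever m is chosen.
print([h_mul(k, 10) for k in range(1, 6)])
```

Floating-point rounding can matter for very large keys; implementations that need exact results typically use fixed-width integer arithmetic instead.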

Data Structures: hash table - sample

Data Structures: binary heap • A binary heap is an array of elements that can be viewed as a binary tree according to the following rules: • The 1st element of the array is the root of the tree • If a node has index j, its left and right child nodes (if any) have indexes 2j and 2j+1 respectively, and its parent node (if any) has index equal to the integer part of j/2 • The heap need not occupy the whole array • The size of the heap is less than or equal to the size of the array that holds it
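The index rules translate into three one-line helpers. This sketch keeps the 1-based numbering used on the slide; the function names are chosen for illustration.

```python
def parent(j):
    """Index of the parent of node j (1-based); integer part of j/2."""
    return j // 2

def left(j):
    """Index of the left child of node j (1-based)."""
    return 2 * j

def right(j):
    """Index of the right child of node j (1-based)."""
    return 2 * j + 1

print(left(3), right(3))     # 6 7
print(parent(6), parent(7))  # 3 3
```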

Data Structures: binary heap • The main property of the heap: every child element is less than or equal to its parent, so the largest element is at the root
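The heap property can be tested by comparing every non-root node with its parent. In this sketch the array is 1-based: index 0 holds an unused placeholder, which is a convention assumed here to match the slide's numbering. The function name and the sample values are illustrative.

```python
def is_max_heap(a, heap_size):
    """Check a[1..heap_size]; a[0] is an unused placeholder (1-based indexing)."""
    return all(a[j // 2] >= a[j] for j in range(2, heap_size + 1))

print(is_max_heap([None, 16, 14, 10, 8, 7, 9, 3, 2, 4, 1], 10))  # True
print(is_max_heap([None, 1, 2, 3], 3))                           # False
```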

Data Structures: binary heap-sample

Data Structures: binary heap • Heapify is O(log n) • Left, Right and Parent are O(1) • Buildheap is O(n) • Heapsort is O(n log n)
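The slides name these operations without showing them, so the following is a sketch of one common formulation rather than the lecture's own code. It assumes a max-heap stored 1-based (index 0 is an unused placeholder) and an iterative sift-down for Heapify.

```python
def heapify(a, j, heap_size):
    """Sift a[j] down until the subtree rooted at j satisfies the heap property."""
    while True:
        l, r = 2 * j, 2 * j + 1
        largest = j
        if l <= heap_size and a[l] > a[largest]:
            largest = l
        if r <= heap_size and a[r] > a[largest]:
            largest = r
        if largest == j:
            return
        a[j], a[largest] = a[largest], a[j]
        j = largest

def build_heap(a, heap_size):
    """Turn a[1..heap_size] into a heap, working from the last parent up to the root."""
    for j in range(heap_size // 2, 0, -1):
        heapify(a, j, heap_size)

def heapsort(values):
    """Return the values in ascending order."""
    a = [None] + list(values)        # a[0] is a placeholder for 1-based indexing
    n = len(values)
    build_heap(a, n)
    for end in range(n, 1, -1):
        a[1], a[end] = a[end], a[1]  # move the current maximum to its final place
        heapify(a, 1, end - 1)       # restore the heap on the shrunken prefix
    return a[1:]

print(heapsort([4, 1, 3, 2, 16, 9, 10, 14, 8, 7]))
# [1, 2, 3, 4, 7, 8, 9, 10, 14, 16]
```

Each `heapify` call follows one root-to-leaf path, which gives the O(log n) bound; `heapsort` calls it once per element after the initial build, giving O(n log n).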

Q&A
