Algorithms and data structures
Protected by http://creativecommons.org/licenses/by-nc-sa/3.0/hr/
Creative Commons
• You are free to:
  • share — copy and redistribute the material in any medium or format
  • adapt — remix, transform, and build upon the material
• Under the following terms:
  • Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
  • NonCommercial — You may not use the material for commercial purposes.
  • ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
  • No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
• Notices:
  • You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
  • No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
Text copied from http://creativecommons.org/licenses/by-nc-sa/3.0/
Algorithms and data structures, FER
Addressing techniques
• Basics
• Retrieval procedures
• Hashing
Basics
• Knowing the key of some record, the question arises how to find this record
• Primary key
  • Defines a record uniquely
  • E.g. StudentID
• Concatenated (composite) keys
  • Necessary for unique identification of some types of records
  • E.g. StudentID & CourseCode & ExamDate uniquely define a record of an examination (a possible examination before a committee due to a student's complaint is neglected!)
• Secondary key
  • Need not uniquely define the record, but points at some attribute value
  • E.g. YearOfStudy in the record with course data
Sequential search
• Searching the file record by record – the most primitive way
• Used for sequential files, where all the records have to be read anyhow
• Other terms: linear search, serial search
• The records need not be sorted
• On average n/2 records are read
• Complexity: O(n)
  • Best case: O(1)
  • Worst case: O(n)
• Repeat for all the records
  • If the current record equals the searched one
    • Record is found
    • Leave the loop
• IspisiTrazi (PrintSearch)
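The loop above can be sketched in C, with an array standing in for the file (the function name seq_search is ours, not from the original program IspisiTrazi):

```c
#include <stdio.h>

/* Linear (sequential) search: scan record by record until the key
   matches. Returns the index of the first match, or -1 if absent. */
int seq_search(const int *a, int n, int key) {
    for (int i = 0; i < n; i++)
        if (a[i] == key)
            return i;      /* record found: leave the loop */
    return -1;             /* all n records read, key not present */
}
```

In the worst case all n records are compared, which is where the O(n) bound comes from.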
Sequential searching of sorted records
• How to improve the sequential search? Sort the records according to a key!
• What are the complexities in the best, worst and average case when searching sorted records?
• Sort the records
• Repeat for all records
  • If the current record equals the searched one
    • Record is found
    • Leave the loop
  • If the current record is larger than the searched one
    • Record does not exist
    • Leave the loop
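A C sketch of the variant above: the scan over sorted records can stop early as soon as it passes the place where the key would have to be (the name seq_search_sorted is ours):

```c
/* Sequential search over an ascending-sorted array.
   Returns the index of the key, or -1 as soon as a larger
   element shows the key cannot be present. */
int seq_search_sorted(const int *a, int n, int key) {
    for (int i = 0; i < n; i++) {
        if (a[i] == key)
            return i;      /* record found */
        if (a[i] > key)
            return -1;     /* passed the key's position: record does not exist */
    }
    return -1;             /* key larger than every record */
}
```

Unsuccessful searches now cost n/2 comparisons on average instead of always n, although the complexity class stays O(n).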
Questions and exercises
• A colleague tells you that s/he wrote a sequential search algorithm with complexity O(log n). Should you congratulate him/her or laugh?
• In the best case, the record is found after the minimum number of comparisons. Where was this record located?
• Where was the record found after the maximum number of comparisons?
Block-wise reading
• In direct (random) access files (all the records are of the same length!) sorted by the primary key, it is not necessary to check all the records
• E.g. only every hundredth record is examined
• When the block containing the record with the searched key is located, that block is searched sequentially
• How to find the optimal block size?
Example – places in Croatia
• We are looking for the town of Malinska in a list of F = 6935 places; each page contains B = 60 places
• There are ⌈F / B⌉ = 116 leading records (and corresponding pages)
• [Figure: pages of alphabetically sorted place names, from Ada to Žužići; only the leading record of each page is compared until the page that must contain Malinska is located, and that page is then searched sequentially]
• CitanjePoBlokovima (ReadingBlockWise)
Optimal block size
• In case of F records and block size B, there are F / B leading records and blocks
• During the search by blocks, on average half of the existing leading records have to be read
• On average, the searched leading record will be found after (F / B) / 2 = F / (2B) readings
• Within the located block there are B records, so the searched record is expected to be found after on average B / 2 (sequential!) readings within that block
• The total expected number of readings is F / (2B) + B / 2
• Setting the derivative with respect to B to zero yields the optimal block size: B = √F
• What is the optimal block size for the list of places in Croatia?
Binary search
• The binary search starts at the half of the file/array and continues by constantly halving the interval where the searched record could be found
  • Prerequisite: the data are sorted!
• Average number of search steps: log2 n
• The procedure is inappropriate for disk memories with direct access, due to the time-consuming positioning of the reading heads; it is recommended for core (main) memory
• The fact that the array is sorted is exploited: in each step the search area is halved
• Complexity is O(log2 n)
• [Figure: number of elements = n; the searched element is located in log2 n search steps]
Example of binary search
• Looking for 25 in the sorted array: 2 5 6 8 9 12 15 21 23 25 31 39
Algorithm for binary search
• lower_bound = 0
• upper_bound = total_elements_count
• Repeat
  • Find the middle record
  • If the middle record equals the searched one
    • Record is found
    • Leave the loop
  • If the lower bound is greater than or equal to the upper bound
    • Record is not found
    • Leave the loop
  • If the middle record is smaller than the searched one
    • Set the lower bound to the position of the current record + 1
  • If the middle record is larger than the searched one
    • Set the upper bound to the position of the current record - 1
• BinarnoPretrazivanje (BinarySearch)
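The steps above can be sketched in C with inclusive bounds (the function name binary_search is ours; the original BinarnoPretrazivanje may differ in detail):

```c
/* Binary search over an ascending-sorted array of n elements.
   Returns the index of key, or -1 if not found. */
int binary_search(const int *a, int n, int key) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;  /* middle record; avoids lo+hi overflow */
        if (a[mid] == key)
            return mid;                /* record found */
        if (a[mid] < key)
            lo = mid + 1;              /* key must lie in the upper half */
        else
            hi = mid - 1;              /* key must lie in the lower half */
    }
    return -1;                         /* bounds crossed: record not found */
}
```

Each iteration halves the remaining interval, so at most about log2 n + 1 comparisons are made.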
Questions
• What are the execution times of binary search of n records in the best, worst and average case? The average case takes just one step fewer than the worst case. The proof can be seen at: http://www.mcs.sdsmt.edu/ecorwin/cs251/binavg/binavg.htm, March 24th, 2014
• For the search of places in Croatia (a list of 6935 places), what is the maximum number of steps necessary to locate the searched place?
• Will the binary search always be faster than the sequential one, even for a large set of data?
Problem
• Suppose that n unsorted data in a set can be sorted in time O(n log2 n). You have to perform n searches in this data set. What is better:
  • to use the sequential search?
  • to sort the data and then apply the binary search?
• Solution: it makes more sense to sort and then search binary!
  • n sequential searches: n * O(n) = O(n^2)
  • sort + binary: O(n log2 n) + n * O(log2 n) = O(n log2 n)
  • O(n log2 n) < O(n^2)
Index-sequential files
• Every record contains a key as a unique identifier
• If a file is sorted by the key, it is appropriate to form a look-up table
  • Input for the table is the key of the searched record
  • Output is information about the more precise location of the searched record
  • Such a table is called an index
• The index need not point to each record but only to a block
• In the example – places in Croatia, it was shown that the optimal block size was √F
• For large files there are indices on multiple levels – with the optimal organisation, in the worst case the same number of records is read in each of the indices and in the data file
• Optimal sizes of indices on 2 levels are F^(1/3) and F^(2/3), giving O(F^(1/3)) readings
• For k levels: F^(1/(k+1)), F^(2/(k+1)), F^(3/(k+1)), ..., F^((k-1)/(k+1)), F^(k/(k+1)), giving O(F^(1/(k+1))) readings
Index-sequential files
• Insertion and deletion
  • In a traditional sequential file (magnetic tape), they are possible only by copying the whole file with addition and/or deletion of records
  • In direct (random) access files, deletion is performed logically, i.e. a tag is written to mark a deleted record
• After a certain number of revisions, the file has to be reorganised
  • Data are written sequentially by the key value and indexing is repeated (so-called file maintenance)
Search procedures
• Index non-sequential files
  • If search by multiple keys is required, or if addition and deletion of records are frequent (volatile files), it is very difficult (if not impossible) in the first case, and difficult in the latter, to maintain the requirement that the records are sorted and contained within their initial block
  • In that case, the index should contain the address (relative or absolute) of each single record
• The key contains the address
  • The simplest case is to form the key so that some part of it contains the record address
  • E.g. at an entrance examination for enrolment, the application number can serve as the key and simultaneously be the ordinal number of the record in a direct access file
  • Very often such a simple procedure is not possible, because the coding scheme cannot be adapted to each single application
The idea of hashing
• Problem: a company employs about a hundred thousand employees; every person has his or her own unique identifier ID, generated from the interval [0, 1 000 000]. Record reads & writes must be fast. How to organise the file?
• A direct file with the key equal to the ID?
  • 1 000 000 x 4 bytes ~ 4 MB, and 90% of the space is unused!
• It is possible to devise procedures that transform the key into an address or, even better, into some ordinal number
  • The position of the record is stored under this ordinal number
  • This modification improves flexibility
Hashing
• Let us suppose that M buckets (blocks of records with the same starting address of the block) are available
• A pseudo-random number from the interval 0 to M-1 is calculated from the key value using a hash function
• This number is the address of a group of data (a bucket) that holds all the records whose keys are transformed into the same pseudo-random number
• A collision happens when two different keys are transformed into the same address
• If a bucket is full, it is possible to insert a pointer to the overflow area, or insertion is attempted in the next neighbouring bucket – the "bad neighbour" policy
• In hashing the following parameters can vary:
  • Bucket capacity
  • Packing density
Example
• Store the names into a hash table with 4 buckets (0, 1, 2, 3)
• hash-function = (sum of ASCII codes) % (number of buckets)
• Names: Vanja, Matija, Andrea, Doris, Saša, Alex, Sandi, Perica, Iva – into which bucket does each one go?
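The slide's hash function can be written out as a short C sketch (the name hash_name is ours; non-ASCII characters such as the "š" in "Saša" are simply summed byte by byte, which is one possible reading of the slide):

```c
/* Hash function from the example: sum of the character codes of the
   name, modulo the number of buckets. Bytes are taken as unsigned
   so that non-ASCII characters do not produce negative sums. */
int hash_name(const char *name, int buckets) {
    unsigned sum = 0;
    for (const char *p = name; *p != '\0'; p++)
        sum += (unsigned char)*p;
    return (int)(sum % (unsigned)buckets);
}
```

For example, "Iva" gives 73 + 118 + 97 = 288 and 288 % 4 = 0, so Iva lands in bucket 0.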
Bucket capacity
• A pseudo-random number is generated through a transformation of the key, yielding the bucket address
• If the bucket capacity equals 1, overflow is frequent
• With increasing bucket size, the probability of overflow decreases, but reading a single bucket is more time consuming and the amount of sequential search within a bucket increases
• It is recommended to match the bucket size with the physical size of the record block on external memory (disk block)
Packing density
• After the bucket size has been chosen, the packing density can be selected, i.e. the number of buckets needed to store the foreseen number of records
• To reduce the number of overflows, a larger capacity is chosen
• Packing density = number of records / total capacity
  • N = number of records to be stored
  • M = number of buckets
  • C = number of records within a bucket
  • Packing density = N / (M * C)
How to deal with overflow?
• Using the primary area
  • If a bucket is full, use the next one, etc.
  • After the last one comes the first one (cyclically)
  • Efficient if the bucket size exceeds 10
• Separate chaining
  • Buckets are organised as linear lists
Statistics of hashing
• Let M be the number of buckets and N the amount of input data. Each record lands in a given bucket with probability 1/M, so the probability of directing exactly x records into a certain bucket obeys the binomial distribution:
  • P(x) = (N choose x) * (1/M)^x * (1 - 1/M)^(N-x)
• Probability of Y overflows from a bucket of capacity C: P(C + Y)
• The expected number of overflows from a given bucket is the sum of Y * P(C + Y) over Y; the total expected number of overflows is M times this per-bucket expectation
• The average number of records entered into the hash table before the first collision ~ 1.25 √M
• The average total number of entered records before every bucket contains at least 1 record is M ln M
Transformation of the key into address
• Generally, the key is transformed into the address in 3 steps:
  • If the key is not numeric, it is transformed into a number, preferably without loss of information
  • An algorithm is applied to transform the key, as uniformly as possible, into a pseudo-random number with an order of magnitude of the bucket count
  • The result is multiplied by a constant (≤ 1) for transformation into the interval of relative addresses, equal to the number of buckets
  • Relative addresses are converted into absolute ones on a concrete physical unit; as a rule, that is the task of system programs
• An ideal transformation: the probability of transforming 2 different keys in a table of size M into the same address is 1/M
Characteristics of a good transformation
• The output value depends only on the input data
  • If it depended also on some other variable, that variable's value would have to be known at search time as well
• The function uses all the information from the input data
  • If it did not, a small variation of the input data would produce a large number of equal outputs – the distribution would depart from the desired one
• It distributes the output values uniformly
  • Otherwise efficiency decreases
• For similar input data, it yields very different output values
  • In reality, the input data are often very similar, while a uniform distribution of the output is required
• Which of these requirements are not fulfilled by our hash function from the previous example?
Usage of hashing
• When is it appropriate?
  • Compilers use it for recording declared variables
  • For spelling checkers and dictionaries
  • In games, to store the positions of players
  • For equality checks (in information security): if two elements result in different hash values, they must be different
  • When quick and frequent search is required
• When is it not appropriate?
  • When records are searched by a non-key attribute value
  • When the data need to be sorted
  • E.g. to find the minimum value of the key
Example: Transformation of key into address
• 6-digit key, 7000 buckets; key: 172148
• Method: middle digits of the key squared
• The key squared yields a 12-digit number; digits from the 5th to the 8th are used: 172148² = 029634933904 → 3493
• The middle 4 digits must be transformed into the interval [0, 6999]
• As the pseudo-random number takes values from the interval [0, 9999], while the bucket addresses are from the interval [0, 6999], the multiplication factor is 6999/9999 ≈ 0.7
• Bucket address = 3493 * 0.7 = 2445
• The result corresponds to the behaviour of a roulette wheel
Methods for transformation of key into address
• Root of the middle digits of the key squared
  • Like in the previous example, but after squaring, the square root of the 8 middle digits is calculated and truncated to obtain a four-digit integer:
  • sqrt(96349339) ≈ 9815
• Division
  • The key is divided by a prime number slightly less than or equal to the number of buckets (e.g. 6997)
  • The remainder after division is the bucket address
  • Bucket address = 172148 mod 6997 = 4220
  • Keys forming a sequence of values are well distributed
• Shifting of digits and addition
  • E.g. key = 17207359
  • 1720 + 7359 = 9079
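Two of the methods above, sketched in C for the running 6-digit-key / 7000-bucket example (the function names are ours):

```c
/* Division method: the remainder modulo a prime slightly below the
   bucket count is the bucket address. */
long addr_division(long key, long prime) {
    return key % prime;
}

/* Middle digits of the key squared: take digits 5..8 of the 12-digit
   square, then scale the 0..9999 value into 0..buckets-1. */
long addr_midsquare(long key, long buckets) {
    long long sq = (long long)key * key;      /* 172148^2 = 029634933904 */
    long mid = (long)((sq / 10000) % 10000);  /* digits 5..8 -> 3493 */
    return mid * buckets / 10000;             /* 3493 * 0.7 -> 2445 */
}
```

With key 172148 these reproduce the slide's results: 172148 mod 6997 = 4220 and a mid-square address of 2445.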
Methods for transformation of key into address
• Overlapping (folding)
  • Overlapping resembles shifting, but is more appropriate for long keys; the outer parts are reversed before addition (172 → 271, 359 → 953)
  • E.g. key = 172407359
  • 407 + 953 + 271 = 1631
• Change of the counting base
  • The number is interpreted as if written in another counting base B
  • E.g. B = 11, key = 172148
  • 1*11^5 + 7*11^4 + 2*11^3 + 1*11^2 + 4*11^1 + 8*11^0 = 266373
  • The necessary number of least significant digits is selected and transformed into the address interval: bucket address = 6373 * 0.7 = 4461
• The best method can be chosen after simulation for a concrete application
• Generally, division gives the best results
Determination of parameters
• Example:
  • There are 350 students enrolled in a study programme. Their ID – identification number (11 characters) and family name (14 characters) should be stored, under the requirement to retrieve the records quickly using the ID.
• Remark:
  • The ID contains 11 digits
  • The last digit is a control digit; it can, but need not, be stored if the rule for calculating it from the remaining digits is known
Solution
• A record contains 11+1 + 14+1 = 27 bytes
• Let the physical block on the disk be 512 bytes
  • The bucket size should be equal to or less than that
  • 512 / 27 = 18.96
• Therefore, a bucket shall contain data for 18 students, with 26 bytes of unused space
• A somewhat larger table capacity, e.g. 30% more, shall be provided to reduce the number of expected overflows
  • That means 350 / 18 * 1.3 ≈ 25 buckets
• The ID should be transformed into a bucket address from the interval [0, 24]
• The ID is rather long, so overlapping can be considered
  • Methods can be combined – after overlapping, division can be performed
• The address shall be calculated by division with a prime number close to the bucket count, e.g. 23
Writing of records into buckets of the hash table
• [Diagram: a record (ID, FamilyName) is passed through the HASH function, which yields a bucket address in the range 0 .. M-1; each of the M buckets holds C records and corresponds to a BLOCK on the disk]
Examples of key transformation
• ID = 5702926036x: 2926 + 630 + 075 = 3631; 3631 mod 23 = 20
• ID = 6702926036x: 2926 + 630 + 076 = 3632; 3632 mod 23 = 21
• If the bucket is full, entering into the next bucket is attempted cyclically (bad neighbour policy)
• ID = 6702926037x: 2926 + 730 + 076 = 3732; 3732 mod 23 = 6
• ID = 5702926037x: 2926 + 730 + 075 = 3731; 3731 mod 23 = 5
Algorithm
Create an empty table on a disk
Read ID and family name sequentially, while there are data
    If the control digit is not correct
        "Incorrect ID"
    Else
        Set the tag that the record is not written
        Calculate the bucket address
        Remember it as the initial address
        Repeat
            Read the existing records from the bucket
            Repeat for all the records in the bucket
                If the record is not empty
                    If the already written ID equals the input one
                        "Record already exists"
                        Set the tag that the record is written
                        Leave the loop
                Else
                    Write the input record
                    Set the tag that the record is written
                    Leave the loop
            If the record is not written
                Increment the bucket address by 1, modulo the bucket count
                If the obtained address equals the initial one
                    Table is full
        Until record written or table full
End Hash
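The core of the insertion algorithm above can be sketched as an in-memory C version with cyclic probing (a small table and the names Rec, hash_insert and M are ours; the original works on disk buckets of C records):

```c
#include <string.h>

#define M 7   /* small illustrative bucket count; the text uses e.g. 25 */

typedef struct {
    long id;         /* id == 0 marks an empty slot */
    char name[21];
} Rec;

static Rec table[M];   /* statics are zero-initialised: table starts empty */

/* Insert (id, name) using the "bad neighbour" policy: if the home
   bucket is taken, try the next one cyclically. Returns 1 on success,
   0 if the id already exists, -1 if the table is full. */
int hash_insert(long id, const char *name) {
    int start = (int)(id % M);   /* initial bucket address */
    int pos = start;
    do {
        if (table[pos].id == 0) {               /* empty slot: write record */
            table[pos].id = id;
            strncpy(table[pos].name, name, 20);
            table[pos].name[20] = '\0';
            return 1;
        }
        if (table[pos].id == id)                /* record already exists */
            return 0;
        pos = (pos + 1) % M;                    /* try the next bucket */
    } while (pos != start);
    return -1;                                  /* wrapped around: table full */
}
```

Search and deletion follow the same probe sequence, which is why the exercises below reuse this cyclic-probing pattern.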
Exercises
• Update Hash so that instead of JMBG (13-character ID) it uses OIB (11-character ID). Update the function "Kontrola" for checking the control digit. Implement deletion by ID.
• Let products be the name of an unformatted file organised using hashing. Each record consists of ID (int), name (50+1 char), quantity (int) and price (float). A record is considered empty if its ID equals zero. Count the number of non-empty records in the file. The size of the block on the disk is BLOCK, and the expected maximum number of records is MAXREC. These parameters are contained in parameters.h.
• Let an unformatted file be organised using hashing. Each record of the file consists of name (50+1 char), quantity (int) and price (float). A record is considered empty if its quantity equals zero. The size of the block on the disk is BLOCK, which is contained in parameters.h. Write the function that will find the packing density. The prototype of the function is: float density(const char *file_name);
Exercises
• Write the function for the insertion of an ID (int) and a name (20+1 char) into a memory-resident hash table with 500 buckets. Each bucket consists of a single record. If the bucket is full, the next bucket is used (cyclically). The input arguments are the bucket address (previously computed), the ID and the name. The function returns 1 if the insertion is successful, 0 if the record already exists, and -1 if the table is full so the record cannot be inserted.
• Write the function that will find an ID (int) and the company name (30+1 char) in a memory-resident hash table with 200 buckets. Each bucket consists of a single record. If the bucket is full and does not contain the searched key value, the next bucket is used (cyclically). The input arguments are the bucket address (previously computed) and the ID. The output argument is the company name. The function returns 1 if the record is found, and 0 if it is not found.
Exercises
• Let a key be a 7-digit telephone number. Write the function for transformation of the key into an address. The hash table consists of M buckets. The division method should be used. The prototype of the function is: int address(int m, long teleph);
• Let products be the name of an unformatted file organised using hashing. A record consists of ID (4-digit integer), name (up to 30 char) and price (float). A record is considered empty if its ID equals zero. Write the function for emptying the file. The size of the block on the disk is BLOCK, and the expected maximum number of records is MAXREC. These parameters are contained in parameters.h.
• Write the function that will compute the bucket address in a table of 500 buckets. The key is a 4-digit ID, and the method is the root of the middle digits of the key squared.