Hashing. Dr. Thomas Hicks, Computer Science Department, Trinity University.
Address Calculator
Hashing attempts to accomplish Insertion, Deletion, and Searching in Constant Time.
[Figure: an Address Calculator maps a key (an unsigned int) to a position in a table indexed 0 to N.]
Can Be Done!
Hashing Requires You To Make Two Important Decisions 1] What Hash Function To Use
Hashing Is A Two Decision Application
(1) The First Decision Is What Hash Function To Use:
Hash Function - A function that converts an item into an integer suitable to index an array or a direct access file where the item is to be stored.
We Are Going To Be Using Social Security Numbers In Our Hashing: 275-75-7575 becomes 275,757,575.
Hash Function Could Be Modulus: 275,757,575 MOD 20 + 1 = ?
[Figure: a table with positions up to 20 into which 275,757,575 is hashed.]
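The modulus hash on this slide can be tried out with a minimal Python sketch. The function names `ssn_to_int` and `hash_address` are illustrative only and do not come from the slides; the table size of 20 is the one used in the slide's example.

```python
def ssn_to_int(ssn: str) -> int:
    """Drop the dashes from an SSN string such as '275-75-7575'."""
    return int(ssn.replace("-", ""))


def hash_address(key: int, table_size: int = 20) -> int:
    """Modulus hash from the slide: key MOD table_size + 1.

    The result always falls in the range 1..table_size.
    """
    return key % table_size + 1


key = ssn_to_int("275-75-7575")
address = hash_address(key)
print(key, "hashes to position", address)
```

Adding 1 after the modulus shifts the result from 0..19 to 1..20, which matches a table whose first position is numbered 1.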
Social Security Numbers: 275-75-7575 becomes 275,757,575.
How Many Of You Think We Could Organize Our Data In Such A Way That We Could Find Any SSN In 1 Look?
[Figure: the Address Calculator maps an SSN to position SSN % 999,999,999 in a table indexed 0 to 999,999,999.]
Hashing Is For Large Populations Of Data
Hashing Is Designed For Large Collections Of Data! Would The Student Population Of UT Constitute A Large Collection Of Data? Yes. 50,000+ Is Generally Considered A Large Population Of Data.
Hash Function = SSN % 999,999,999
How Many Of You Think This Would Be A Hash Function For UT?
[Figure: the Address Calculator maps an SSN to a position in a table indexed 0 to 999,999,999.]
Hash Function = SSN % 999,999,999
This Would Not Be A Hash Function For UT!
Loading Factor = 50,000 / 1,000,000,000 = 5 / 100,000 = 1 / 20,000 = 0.005%
If Record Size = 10,000, The Data Itself Would Require 50,000 x 10,000 = 500,000,000 (About 1/2 Gig) Of Hard Drive Space.
How Would You Feel About Using 10,000 GB Of Space (1,000,000,000 x 10,000) For Your Data?
[Figure: a table indexed 0 to 999,999,999.]
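The space arithmetic on this slide can be rechecked with a short sketch. The student count, table size, and record size are the slide's figures; treating the record size as bytes is an assumption, and the constant and function names are illustrative.

```python
# Figures from the slide; the unit of RECORD_SIZE (bytes) is an assumption.
STUDENTS = 50_000
TABLE_SLOTS = 1_000_000_000
RECORD_SIZE = 10_000


def loading_factor_percent(items: int, slots: int) -> float:
    """Percentage of the table positions that actually hold a record."""
    return 100 * items / slots


percent_used = loading_factor_percent(STUDENTS, TABLE_SLOTS)
data_bytes = STUDENTS * RECORD_SIZE      # space the records themselves need
table_bytes = TABLE_SLOTS * RECORD_SIZE  # space the whole table reserves

# Table size that would give exactly an 80% loading factor for this data.
slots_for_80_percent = STUDENTS * 100 // 80

print(f"loading factor: {percent_used}%")
print(f"data: {data_bytes:,} bytes, table: {table_bytes:,} bytes")
print(f"slots needed at 80% loading: {slots_for_80_percent:,}")
```

The last figure shows the scale of the mismatch: roughly 62,500 positions would hold the same data at an 80% loading factor.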
Acceptable Hashing At Least 80% Loading Factor
Two Requirements Constitute Acceptable Hashing: 1] At Least 80% Loading Factor
Loading The Hash Table Using Linear Probing As A Strategy For Handling Collisions (Generally A File But Could Be An Array)
An Example Of A Perfect Hash Function: Could Use Modulus (%) To Distribute The Data
Suppose Our Hash Function = SSN % 5 + 1
Suppose We Have 4 Social Security Numbers:
454-13-3881 = 454,133,881 = 454133881
460-27-3802 = 460,273,802 = 460273802
450-27-3504 = 450,273,504 = 450273504
456-66-2055 = 456,662,055 = 456662055
4 Items In 5 Positions = 80% Loading Factor
454,133,881 % 5 + 1 = ___?___ = 2
Take 30 Seconds & Fill In Some More Of The Data
Position 5:
Position 4:
Position 3:
Position 2: 454133881
Position 1:
All Will Not Always Work Out So Nicely!
Hash Function = SSN % 5 + 1
Suppose We Have 4 Social Security Numbers:
=MOD(454133881,5)+1 = 2
=MOD(460273802,5)+1 = 3
=MOD(450273504,5)+1 = 5
=MOD(456662055,5)+1 = 1
Position   SSN         No Searches To Find
5          450273504   1
4
3          460273802   1
2          454133881   1
1          456662055   1
Average Search?
Average Search = Total Searches / # Items = 4 / 4 = 1
"Average Search" Is Also Called "An Access Quotient" In Hashing.
Average Search = Total Searches / # Items
All Will Not Always Work Out So Nicely!
Hash Function = SSN % 5 + 1
Suppose We Have These 4 Social Security Numbers:
=MOD(454133881,5)+1 = 2
=MOD(456662053,5)+1 = 4
=MOD(450273806,5)+1 = 2
=MOD(460273802,5)+1 = 3
"Clash" - "Collision" - The result when two or more items in a Hash Table hash out to the same position.
Position   SSN
5
4          456662053
3
2          454133881
1
450273806 Also Hashes To Position 2, Which Is Already Occupied. Where Does It Go?
Hashing Is A Two Decision Application
(1) The First Decision Is What Hash Function To Use
(2) The Strategy For Handling Collisions
Example Of A Strategy For Handling Collisions: "Linear Probing" - Place The Item In The Next Available Cell (Go Up - Wrap If Necessary)
All Will Not Always Work Out So Nicely!
Hash Function = SSN % 5 + 1
Suppose We Have 4 Social Security Numbers:
=MOD(454133881,5)+1 = 2
=MOD(456662053,5)+1 = 4
=MOD(450273806,5)+1 = 2
=MOD(460273802,5)+1 = 3
Linear Probing
Position   SSN         No Searches To Find
5
4          456662053   1
3          450273806   2
2          454133881   1
1
All Will Not Always Work Out So Nicely!
Hash Function = SSN % 5 + 1
Suppose We Have 4 Social Security Numbers:
=MOD(454133881,5)+1 = 2
=MOD(456662053,5)+1 = 4
=MOD(450273806,5)+1 = 2
=MOD(460273802,5)+1 = 3
Linear Probing
Position   SSN         No Searches To Find
5          460273802   3
4          456662053   1
3          450273806   2
2          454133881   1
1
Average Search = Total Searches / # Items = 7 / 4 = 1.75
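The loading sequence on this slide can be replayed with a minimal Python sketch of linear probing. The function name `insert_linear_probing` and the choice to leave index 0 of the list unused (so positions run 1 to 5 as on the slide) are assumptions made for illustration.

```python
def insert_linear_probing(table, key):
    """Insert key into a table whose positions run 1..size (index 0 unused).

    Returns the number of positions examined, which is also the number of
    searches later needed to find the key (as long as nothing is deleted).
    """
    size = len(table) - 1
    slot = key % size + 1          # hash function from the slides
    probes = 1
    while table[slot] is not None:
        if probes == size:
            raise OverflowError("hash table is full")
        slot = slot % size + 1     # go up one cell, wrapping from size to 1
        probes += 1
    table[slot] = key
    return probes


table = [None] * 6                 # positions 1..5
keys = [454133881, 456662053, 450273806, 460273802]
probes = [insert_linear_probing(table, k) for k in keys]
average_search = sum(probes) / len(probes)

for position in range(5, 0, -1):
    print(position, table[position])
print("searches per key:", probes, "average search:", average_search)
```

The third key collides at position 2 and moves up to 3; the fourth key then finds position 3 taken, and position 4 as well, so it lands in 5 after three looks.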
Acceptable Hashing: At Least 80% Loading Factor & Access Quotient Of 1.2 Or Better
Two Requirements Constitute Acceptable Hashing: 1] At Least 80% Loading Factor 2] No More Than 1.2 Access Ratio (Avg Search)
What Is A Hash Function? A Hash function is a function that converts an item into an integer suitable to index an array or a direct access file where the item is to be stored.
What Are The Two Requirements For Acceptable Hashing? 1] At Least 80% Loading Factor & 2] Access Quotient Of 1.2 Or Better
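The two requirements can be checked mechanically. The sketch below is one possible way to do so; the function name `is_acceptable_hashing` is hypothetical, and the comparisons use integer arithmetic only to avoid floating-point rounding at the 80% and 1.2 boundaries.

```python
def is_acceptable_hashing(items: int, slots: int, total_searches: int) -> bool:
    """Check both requirements from the slides.

    1] loading factor (items / slots) of at least 80%
    2] access quotient (total_searches / items) of no more than 1.2
    """
    loading_ok = items * 100 >= 80 * slots
    access_ok = total_searches * 10 <= 12 * items
    return loading_ok and access_ok


# 4 keys in 5 positions, every key found in 1 look (the perfect example).
perfect_case = is_acceptable_hashing(items=4, slots=5, total_searches=4)

# Same table, but 7 total looks (the linear probing example, 7 / 4 = 1.75).
collision_case = is_acceptable_hashing(items=4, slots=5, total_searches=7)

print(perfect_case, collision_case)
```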
Hashing Requires You To Make Two Important Decisions 1] What Hash Function To Use 2] What Strategy Do I Use To Handle Collisions/Clashes
What Did You Think Of The Hash Function: SSN % 5 + 1?
Suppose We Have 4 Social Security Numbers:
=MOD(454133881,5)+1 = 2
=MOD(460273802,5)+1 = 3
=MOD(450273504,5)+1 = 5
=MOD(456662055,5)+1 = 1
Position   SSN         No Searches To Find
5          450273504   1
4
3          460273802   1
2          454133881   1
1          456662055   1
Average Search = 4 / 4 = 1
Absolutely Awesome!
What Did You Think Of The Hash Function: SSN % 5 + 1?
Suppose We Have 4 Social Security Numbers:
=MOD(454133881,5)+1 = 2
=MOD(456662053,5)+1 = 4
=MOD(450273806,5)+1 = 2
=MOD(460273802,5)+1 = 3
Position   SSN         No Searches To Find
5          460273802   3
4          456662053   1
3          450273806   2
2          454133881   1
1
Average Search = 7 / 4 = 1.75
Really Good: Only One Collision
How Good Is The Linear Probing Strategy For Handling The Collisions?
What Did You Think Of The Strategy Selected To Handle Collisions: Linear Probing?
Suppose We Have 4 Social Security Numbers:
=MOD(454133881,5)+1 = 2
=MOD(460273802,5)+1 = 3
=MOD(450273504,5)+1 = 5
=MOD(456662055,5)+1 = 1
Position   SSN         No Searches To Find
5          450273504   1
4
3          460273802   1
2          454133881   1
1          456662055   1
Average Search = 4 / 4 = 1
The Hash Function Was So Good It Did Not Matter!
What Did You Think Of The Strategy Selected To Handle Collisions: Linear Probing?
Suppose We Have 4 Social Security Numbers:
=MOD(454133881,5)+1 = 2
=MOD(456662053,5)+1 = 4
=MOD(450273806,5)+1 = 2
=MOD(460273802,5)+1 = 3
Position   SSN         No Searches To Find
5          460273802   3
4          456662053   1
3          450273806   2
2          454133881   1
1
Average Search = 7 / 4 = 1.75
Perhaps We Can Find A Better One?
Suppose The Hash Function Did Not Distribute The Data Well! After all, the purpose of a good hash function is to randomize something that generally is not random (Part Name, Part No, etc.)
Consider The Following Set Of Social Security Numbers. What Is The Least Random Part Of This Collection Of Numbers? First Digit = 4. First Two Digits Often = 45 or 46. It Is Often Easier To Find A Successful Hash Function If You Can Chop Off (TRUNCATE) The Least Random Portion(s).
TRUNCATE The First Two Digits - Then Mod + 1
Key: The Key Can Be A Combination Of More Than One Data Field In The Record (i.e. Maybe Combine The Last Name & The Phone Number)
TRUNCATION: Chop Out/Remove The Non-Random Portion Of The Key Combination
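Truncation followed by "Mod + 1" can be sketched as below. The function name `truncate_then_mod` and the table size of 997 are hypothetical choices, not from the slides. One caveat worth knowing: when the table size divides a power of ten (5 or 20, for example), dropping leading digits does not change the remainder, so a different table size is used here to make the effect of truncation visible.

```python
def truncate_then_mod(ssn: int, table_size: int, drop_digits: int = 2) -> int:
    """Chop off the leading (least random) digits, then apply Mod + 1.

    The SSN is padded to 9 digits so leading zeros are not lost.
    """
    digits = f"{ssn:09d}"
    remaining = int(digits[drop_digits:])   # e.g. 454133881 -> 4133881
    return remaining % table_size + 1


# Hypothetical table of 997 positions.
address = truncate_then_mod(454133881, table_size=997)
print("454-13-3881 truncated to 4133881 hashes to", address)

# With a table size of 5 the truncation makes no difference to the result.
same_as_untruncated = truncate_then_mod(454133881, 5) == 454133881 % 5 + 1
print(same_as_untruncated)
```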
Suppose The Hash Function Did Not Distribute The Data Well! After all, the purpose of a good hash function is to randomize something that often is not random (Part Name, Part No, etc.)
Consider The Following Set Of Social Security Numbers
464 + 133 + 881           456 + 662 + 055
464 * 133 * 881           456 * 662 * 055
464 * 133 - 881           456 * 662 - 055
ABS(464 * (133 - 881))    ABS(456 * (662 + 055))
Folding: Partitioning The Sequence Of Digits & Performing Mathematical Constructs On The Subcomponents.
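The folding variants listed on this slide can be reproduced with a small sketch. It splits a nine-digit SSN into three three-digit parts; the helper name `fold_parts` and the variable names are illustrative, not from the slides.

```python
def fold_parts(ssn: int):
    """Partition a 9-digit SSN into three 3-digit subcomponents."""
    digits = f"{ssn:09d}"
    return int(digits[0:3]), int(digits[3:6]), int(digits[6:9])


a, b, c = fold_parts(464133881)    # (464, 133, 881)

fold_sum = a + b + c               # 464 + 133 + 881
fold_product = a * b * c           # 464 * 133 * 881
fold_mixed = a * b - c             # 464 * 133 - 881
fold_abs = abs(a * (b - c))        # ABS(464 * (133 - 881))

print(fold_sum, fold_product, fold_mixed, fold_abs)
```

Each result would still need a final step (such as Mod + 1) to bring it into the range of table positions.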
Suppose We Have 10,000 Social Security Numbers. Might Folding Of Three Digit Combinations Be OK? 464 + 133 + 881 + 1
[Figure: a table with positions 1 to 10,000.]
POOR SOLUTION - YUK! Your Hash Function Must Be Capable Of Generating All The Values (1 - 10,000) In Your Key Set.
Might Folding Of Three Digit Combinations Be OK? 464 + 133 + 881 + 1
Smallest: Does Anyone Have SSN 000-00-0000? 0 + 0 + 0 + 1 = 1
Largest SSN 500-99-9999 = 500 + 999 + 999 + 1 = 2,499 (~2,500)
[Figure: the Address Calculator maps 000-00-0000 to position 1 and 500-99-9999 to about position 2,500 in a table with positions 1 to 10,000.]
Use Common Sense & Your Knowledge Of Mathematics To Make Sure That All Values In The Hash Table/File Can Be Generated By Your Hash Function.
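The range check described here can be done before loading any data. The sketch below applies it to the three-digit folding function from the earlier slide; the function name `fold_sum_address` is hypothetical, and 500-99-9999 is taken from the slides as the largest SSN.

```python
def fold_sum_address(ssn: int) -> int:
    """Folding hash from the slides: sum of the three 3-digit parts, plus 1."""
    digits = f"{ssn:09d}"
    return int(digits[0:3]) + int(digits[3:6]) + int(digits[6:9]) + 1


TABLE_SIZE = 10_000

smallest = fold_sum_address(0)             # SSN 000-00-0000
largest = fold_sum_address(500_999_999)    # SSN 500-99-9999, largest per the slides
never_used = TABLE_SIZE - largest          # positions the function cannot reach

print("addresses range from", smallest, "to", largest)
print(never_used, "of", TABLE_SIZE, "positions can never be generated")
```

Because the first part is at most 500 and the other two are at most 999, no SSN in that range can produce an address above the one computed for 500-99-9999.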
Strategies To Resolve Collisions: Adding Data With Linear Probing
Linear Probing: Always 80% Loading Factor
[Figure: a table with positions up to 20.]