The Web is a Graph
URL: http://www.wellesley.edu/Resources/about/wdoc.html
• Access method: http
• Server and domain: www.wellesley.edu
• Path: /Resources/about/
• Document: wdoc.html
Directed Graph of Nodes and Arcs (directed edges)
• Nodes = web pages
• Arcs = hyperlinks from one page to another
• A graph can be explored
• A graph can be indexed
How Google (and the other search engines) Work
• Crawl the web
• Create inverted index
• Rank results
[Diagram labels: THE WEB, Document IDs, Search engine servers, Inverted index, user query]
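The "create inverted index" step can be sketched in miniature. This is only an illustration: the class name InvertedIndexSketch, the document IDs, and the page texts are made up, and crawling and ranking are left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: maps each word to the IDs of the documents containing it.
class InvertedIndexSketch {
    // word -> sorted set of document IDs
    private final Map<String, TreeSet<Integer>> index = new HashMap<>();

    // Index one document: record its ID under every word it contains.
    void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) continue; // skip the empty token from leading whitespace
            index.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
        }
    }

    // Answer a one-word query: IDs of all documents containing the word.
    List<Integer> lookup(String word) {
        TreeSet<Integer> ids = index.get(word.toLowerCase());
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addDocument(1, "green eggs and ham");
        idx.addDocument(2, "i do not like green eggs");
        idx.addDocument(3, "sam i am");
        System.out.println(idx.lookup("green")); // [1, 2]
        System.out.println(idx.lookup("sam"));   // [3]
    }
}
```

The index itself is a hash-based map, which is why the rest of the deck looks at how hashing works.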
Count Green Eggs and Ham!
Text: i am sam i am sam sam i am that sam i am that sam i am i do not like that sam i am do you like green eggs and ham i do not like them sam i am i do not like green eggs and ham would you like them here or there i would not like them here or there i would not like them anywhere i do not like green eggs and ham i do not like them sam i am would you like them in a house would you like them with a mouse i do not like them in a house i do not like them with a mouse i do not like them here or there i do not like them Anywhere i do not like green eggs and ham i do not like them sam i am would you eat them in a box would you eat them with a fox not in a box not with a fox not in a house not with a mouse i would not eat them here or there i would not eat them anywhere i would not eat green eggs and ham i do not like them sam i am would you could you in a car eat them eat them here they are i would not could not in a car
Word counts: a: 59, am: 16, and: 25, anywhere: 8, are: 2, be: 4, boat: 3, box: 7, car: 7, could: 14, dark: 7, do: 37, eat: 25, eggs: 11, fox: 7, goat: 4, …, try: 4, will: 21, with: 19, would: 26, you: 34
Let’s play darts (aka: let’s “hash” the keys)
• hash(“Brian”) → 1
• hash(“Stella”) → 5
• hash(“Ellen”) → 4
• hash(“Takis”) → 6
• hash(“Mark”) → 12
• hash(“Randy”) → 11
Table (indices 0 to 12): 1 “Brian”, 4 “Ellen”, 5 “Stella”, 6 “Takis”, 11 “Lyn”, 12 “Mark”
Pros and cons?
Hashing
• To search for an entry in the table, compute the hash function on the entry’s key.
• Use the value of the hash function as an index into the table.
• If two or more keys hash to the same index, employ some method of collision resolution.
What properties would we like in our hash function?
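The three steps above can be sketched as follows. This is a minimal illustration, not the deck's implementation: the class DirectHashSketch is hypothetical, it uses Java's built-in String.hashCode() as the hash function, and it only detects collisions rather than resolving them.

```java
// Hypothetical sketch: a table that hashes a key straight to one slot.
class DirectHashSketch {
    private final String[] slots;

    DirectHashSketch(int capacity) {
        slots = new String[capacity];
    }

    // Hash function: map a key to an index in [0, capacity).
    int indexOf(String key) {
        return Math.floorMod(key.hashCode(), slots.length);
    }

    // Store the key at its hashed index. Returns false on a collision
    // (the slot already holds a different key), which this sketch does not resolve.
    boolean insert(String key) {
        int i = indexOf(key);
        if (slots[i] != null && !slots[i].equals(key)) return false;
        slots[i] = key;
        return true;
    }

    // Search: hash the key and look only at that one slot.
    boolean contains(String key) {
        return key.equals(slots[indexOf(key)]);
    }

    public static void main(String[] args) {
        DirectHashSketch table = new DirectHashSketch(13);
        System.out.println(table.insert("Brian"));   // true: the table was empty
        System.out.println(table.contains("Brian")); // true
        System.out.println(table.contains("Randy")); // false: never inserted
    }
}
```

Searching touches a single slot no matter how many entries the table holds, which is the appeal of hashing; the cost is that collisions need a policy of their own.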
Pros and Cons
• Pros
  • Searching O( )
  • Inserting O( )
  • Deleting O( )
• Cons
  • Cannot keep adding new elements
    • Table size is fixed (like an array)
    • Needs expansion capabilities (costly)
  • A perfect hashing function would be nice, but many items may end up at the same location
    • Collisions need a resolution policy
Load Factor
• N/M = load factor of a hashtable
  • number of entries N in table
  • divided by the table capacity M
• Heuristics:
  • If you know N, make M = 1.5 * N
  • If you do not know N, provide for dynamic resizing
    • Create larger Hash Table
    • Insert old elements into new
Example table (indices 0 to 12): 1 “Brian”, 4 “Ellen”, 5 “Stella”, 6 “Takis”, 11 “Lyn”, 12 “Mark”
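A small sketch of the load factor and the two heuristics. The class LoadFactorSketch and its method names are hypothetical. To keep the resizing step independent of any particular collision policy, the table here is assumed to be a list of buckets, and keys are rehashed with String.hashCode().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the load factor and the sizing heuristics.
class LoadFactorSketch {
    // Load factor = number of entries N divided by table capacity M.
    static double loadFactor(int n, int m) {
        return (double) n / m;
    }

    // Heuristic when N is known: M = 1.5 * N, rounded up.
    static int capacityFor(int n) {
        return (int) Math.ceil(1.5 * n);
    }

    // Dynamic resizing: create a larger table and insert the old elements into it.
    // Every key is rehashed because its index depends on the capacity.
    static List<List<String>> rehash(List<List<String>> oldTable, int newCapacity) {
        List<List<String>> bigger = new ArrayList<>();
        for (int i = 0; i < newCapacity; i++) bigger.add(new ArrayList<>());
        for (List<String> bucket : oldTable) {
            for (String key : bucket) {
                bigger.get(Math.floorMod(key.hashCode(), newCapacity)).add(key);
            }
        }
        return bigger;
    }

    public static void main(String[] args) {
        // The example table holds 6 names in 13 slots: load factor about 0.46.
        System.out.println(loadFactor(6, 13));
        System.out.println(capacityFor(6)); // 9
    }
}
```

Resizing touches every stored entry, which is why the slides call expansion costly.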
What is a good Hash Function?
• Good: h(hashCode) = hashCode mod M
  • M: prime
  • Note: mod is not the same as % in Java
• Better: h(hashCode) = ((a*hashCode + b) mod p) mod M
  • p: prime >> N
  • a, b: positive integers
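Both formulas can be tried directly. The sketch below uses Math.floorMod for the mathematical mod, because Java's % keeps the sign of the dividend and can therefore produce a negative index. The class name and the sample constants (a = 3, b = 7, p = 101, M = 13) are assumptions chosen only for illustration.

```java
// Hypothetical sketch of the two compression functions.
class HashFunctionSketch {
    // "Good": hashCode mod M. floorMod never returns a negative result for M > 0.
    static int modHash(int hashCode, int m) {
        return Math.floorMod(hashCode, m);
    }

    // "Better": ((a*hashCode + b) mod p) mod M, computed in long arithmetic.
    static int betterHash(int hashCode, long a, long b, long p, int m) {
        long reduced = Math.floorMod(a * hashCode + b, p); // value in [0, p)
        return (int) (reduced % m);                        // non-negative, so % is safe here
    }

    public static void main(String[] args) {
        System.out.println(-7 % 5);                        // -2: not a valid array index
        System.out.println(modHash(-7, 5));                // 3
        System.out.println(betterHash(10, 3, 7, 101, 13)); // (37 mod 101) mod 13 = 11
    }
}
```

This matters in practice because hashCode() values are frequently negative.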
Hashing on array #1: Open Addressing
• Open addressing (M >> N):
• How are collisions resolved with this technique?
[Table: keys aa, ab, ..., zz stored at indices 0, 1, 2, 3, 4, ..., 673, 674, 675]
Hashing on array #2: Linear Probing
• When the index hashed to is occupied by a stranger, probe the next position.
• If that position is empty, we insert the entry; otherwise, we probe the next position and repeat.
Key:   H  A  S  H  I  N  G  I  S  F  U  N
Hash:  8  1 19  8  9 14  7  9 19  6 21 14
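The probing rule can be traced with the slide's own keys. In this sketch each letter hashes to its position in the alphabet (A = 1), matching the numbers shown above. The table capacity of 23 is an assumption, since the slide does not give one, and the class LinearProbingSketch is hypothetical. Repeated letters are stored as separate entries, as on the slide.

```java
// Hypothetical sketch of insertion with linear probing.
class LinearProbingSketch {
    private final char[] slots; // '\0' marks an empty slot
    private int size = 0;

    LinearProbingSketch(int capacity) {
        slots = new char[capacity];
    }

    // Assumed hash: position in the alphabet (A = 1), reduced mod capacity.
    int hash(char letter) {
        return (letter - 'A' + 1) % slots.length;
    }

    // Insert a letter and return the index where it finally landed.
    int insert(char letter) {
        if (size == slots.length) throw new IllegalStateException("table is full");
        int i = hash(letter);
        while (slots[i] != '\0') {      // occupied: probe the next position,
            i = (i + 1) % slots.length; // wrapping around at the end of the array
        }
        slots[i] = letter;
        size++;
        return i;
    }

    public static void main(String[] args) {
        LinearProbingSketch table = new LinearProbingSketch(23);
        for (char c : "HASHINGISFUN".toCharArray()) {
            System.out.println(c + " -> " + table.insert(c));
        }
        // The second H hashes to 8 but lands at 9; the first I then hashes to 9
        // and lands at 10.
    }
}
```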
There is a problem though: Clustering
• As the table begins to fill up, more and more entries must be examined before the desired entry is found.
• Insertion of one entry may greatly increase the search time for others. For example, consider H, S, H, I, ...
Key:   H  A  S  H  I  N  G  I  S  F  U  N
Hash:  8  1 19  8  9 14  7  9 19  6 21 14
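Clustering becomes visible when the number of slots examined per insertion is counted. The sketch below makes the same assumptions as the slide's example allows: letters hash to their alphabet position (A = 1), and the capacity of 23 is chosen arbitrarily because the slide does not state one. The class ClusteringSketch is hypothetical.

```java
// Hypothetical sketch: counts how many slots each linear-probing insertion examines.
class ClusteringSketch {
    private final char[] slots; // '\0' marks an empty slot
    private int size = 0;

    ClusteringSketch(int capacity) {
        slots = new char[capacity];
    }

    // Insert a letter and return the number of slots examined (1 = no collision).
    int insertCountingProbes(char letter) {
        if (size == slots.length) throw new IllegalStateException("table is full");
        int i = (letter - 'A' + 1) % slots.length; // assumed hash: alphabet position
        int probes = 1;
        while (slots[i] != '\0') {
            i = (i + 1) % slots.length;
            probes++;
        }
        slots[i] = letter;
        size++;
        return probes;
    }

    public static void main(String[] args) {
        ClusteringSketch table = new ClusteringSketch(23);
        int total = 0;
        for (char c : "HASHINGISFUN".toCharArray()) {
            int probes = table.insertCountingProbes(c);
            total += probes;
            System.out.println(c + ": " + probes + " probe(s)");
        }
        System.out.println("total: " + total); // 18 for these 12 keys
    }
}
```

The second I needs three probes even though only one earlier key shares its hash value: the H that spilled over from index 8 and the first I it displaced are both in its way.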
Separate Chaining
Keys are inserted one at a time; each table slot holds a chain of the keys that hash to it.
• hash(“Brian”) = 1
• hash(“Stella”) = 4
• hash(“Ellen”) = 4
• hash(“Lyn”) = 4
• hash(“Mark”) = 5
• hash(“Takis”) = 5
Resulting table (indices 0 to 6):
• 1: “Brian”
• 4: “Stella”, “Ellen”, “Lyn”
• 5: “Mark”, “Takis”
• all other slots empty
Load factor?
Load Factor for Separate Chaining
• N = the number of entries
• M = the capacity of the table
• In terms of N and M, what is the average size of a chain (linked list)?
• How long do searches take?
  • If N > M
  • If N < M
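A compact sketch of separate chaining that reproduces the six-name example. The class ChainingSketch is hypothetical, and because the slides do not say which hash function produced their values, the demo simply looks the slide's values up in a map. New keys are added at the front of the chain, which is one common choice and not necessarily the one drawn on the slides.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch: a hash table whose slots hold linked lists (chains).
class ChainingSketch {
    private final List<LinkedList<String>> table = new ArrayList<>();
    private final ToIntFunction<String> hash;
    private int size = 0;

    ChainingSketch(int capacity, ToIntFunction<String> hash) {
        for (int i = 0; i < capacity; i++) table.add(new LinkedList<>());
        this.hash = hash;
    }

    private LinkedList<String> chainFor(String key) {
        return table.get(Math.floorMod(hash.applyAsInt(key), table.size()));
    }

    // Colliding keys simply share a chain; no probing is needed.
    void insert(String key) {
        chainFor(key).addFirst(key);
        size++;
    }

    // Search walks only the one chain the key hashes to.
    boolean contains(String key) {
        return chainFor(key).contains(key);
    }

    int chainLength(int index) {
        return table.get(index).size();
    }

    // N / M, which is also the average chain length.
    double loadFactor() {
        return (double) size / table.size();
    }

    public static void main(String[] args) {
        // Hash values copied from the slides; unknown keys default to 0 (an assumption).
        Map<String, Integer> slideHash = Map.of(
            "Brian", 1, "Stella", 4, "Ellen", 4, "Lyn", 4, "Mark", 5, "Takis", 5);
        ChainingSketch t = new ChainingSketch(7, key -> slideHash.getOrDefault(key, 0));
        for (String name : new String[] {"Brian", "Stella", "Ellen", "Lyn", "Mark", "Takis"}) {
            t.insert(name);
        }
        System.out.println(t.chainLength(4)); // 3
        System.out.println(t.chainLength(5)); // 2
        System.out.println(t.loadFactor());   // 6 entries / 7 slots
    }
}
```

Unlike open addressing, the load factor here may exceed 1; searches then slow down in proportion to the average chain length.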
Implementation of Dictionary
public class Word {
    public static final int LETTERS = 26, WORDS = LETTERS * LETTERS;
    public String word;

    // Maps the first two letters to an index in 0..675 ("aa" -> 0, "zz" -> 675).
    // Assumes the word starts with two lowercase letters a-z.
    public int hashCode() {
        return LETTERS * (word.charAt(0) - 'a') + (word.charAt(1) - 'a');
    }
}

public class WordDictionary {
    private Definition[] defTable = new Definition[Word.WORDS];

    // Words that share their first two letters overwrite each other.
    public void insert(Word w, Definition d) {
        defTable[w.hashCode()] = d;
    }

    Definition find(Word w) {
        return defTable[w.hashCode()];
    }
}
The Java Hashtable<K,V> Class
• Located in java.util
• Methods
  • int size() // returns number of keys in table
  • V get(Object key) // returns value to which specified key is mapped in table
  • V put(K key, V value) // maps key to specified value in table
  • V remove(Object key) // removes key and corresponding value from table
  • ...
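A short usage sketch of the four methods listed above. The class name HashtableDemo and the sample keys are made up for illustration; the calls themselves are the standard java.util.Hashtable methods.

```java
import java.util.Hashtable;

// Hypothetical demo class exercising size, get, put and remove.
class HashtableDemo {
    static Hashtable<String, Integer> buildTable() {
        Hashtable<String, Integer> counts = new Hashtable<>();
        counts.put("sam", 1);  // returns null: "sam" had no previous mapping
        counts.put("ham", 1);
        counts.put("sam", 2);  // returns 1, the value being replaced
        counts.remove("ham");  // returns 1, the value that was removed
        return counts;
    }

    public static void main(String[] args) {
        Hashtable<String, Integer> counts = buildTable();
        System.out.println(counts.size());     // 1
        System.out.println(counts.get("sam")); // 2
        System.out.println(counts.get("ham")); // null: key no longer present
    }
}
```

Note that get returns null for a missing key, so callers that unbox the result to int should check containsKey first.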
Counting Occurrences
import java.io.*;
import java.util.*;

// The class name is illustrative; the original slide shows only the main method.
public class CountOccurrences {
    public static void main(String[] args) throws IOException {
        Hashtable<String, Integer> wordTable = new Hashtable<>();
        // set up variables for reading words from a text file
        BufferedReader br = new BufferedReader(new FileReader("GreenEggs.txt"));
        StreamTokenizer textwords = new StreamTokenizer(br);
        // process words from the text file
        while (textwords.nextToken() != StreamTokenizer.TT_EOF) {
            // skip tokens that are not plain words (for numbers, sval is null)
            if (textwords.ttype != StreamTokenizer.TT_WORD) continue;
            String word = textwords.sval; // get the next word
            if (wordTable.containsKey(word)) { // word is already in table
                wordTable.put(word, wordTable.get(word) + 1);
            } else { // word is new, so add new entry to table
                wordTable.put(word, 1);
            }
        }
        System.out.println(wordTable);
        br.close();
    }
}