Indexing The World Wide Web: The Journey So Far

Authors: Abhishek Das (Google Inc., USA), Ankit Jain (Google Inc., USA)
Presented by: Anamika Mukherji

Is Indexing Difficult? Yes!
• Words not known beforehand
• Content available in different languages
• Variations in grammar and style
• No structure: riddled with colors, fonts, images, etc.
• Various byte-encoding schemes

Answering The User’s Query
• Retrieval for a typical query
  • Find the terms in the dictionary
  • Start with the least frequent term, since its posting list will be the shortest
  • Fetch the corresponding posting lists
  • Intersect the lists on document identifiers to get the relevant documents
  • Rank and re-order the documents to present them to the user
• To get quality results as fast as possible, the usage of each resource must be understood
  • Disk space
  • Disk transfer
  • Memory
  • CPU time
• Choice of data structure impacts CPU and storage
  • A fixed-length array is wasteful if posting lists are kept in memory
  • A singly linked list allows cheap insertions and updates
  • A variable-length array requires less CPU time
  • A linked list of fixed-length arrays can be used for each term
  • Avoid pointers when storing the posting list in memory
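The retrieval steps on this slide (look up terms, start from the shortest posting list, intersect on document identifiers) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the in-memory dictionary `INDEX` and the function names are assumptions made for the example, the document identifiers are borrowed from the posting lists on the Compressing The Index slide, and ranking is left out.

```python
from typing import Dict, List


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Merge-intersect two sorted lists of document identifiers."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def conjunctive_query(index: Dict[str, List[int]], terms: List[str]) -> List[int]:
    """Return the documents that contain every query term (unranked)."""
    if not terms:
        return []
    postings = []
    for term in terms:
        if term not in index:  # a missing term means no document can match
            return []
        postings.append(index[term])
    # Least frequent term first: its posting list is the shortest,
    # so every intermediate result stays as small as possible.
    postings.sort(key=len)
    result = postings[0]
    for plist in postings[1:]:
        if not result:
            break
        result = intersect(result, plist)
    return result


# Hypothetical toy index: term -> sorted document identifiers.
INDEX = {
    "the": [1, 2, 3, 4, 5, 6],
    "to": [1, 3, 4, 5, 6],
    "john": [2, 4, 6],
}

print(conjunctive_query(INDEX, ["the", "to", "john"]))  # [4, 6]
```

Sorting by list length first means the rarest term bounds the size of every later intersection.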

Better Understanding of User Intent
• Check the proximity of different terms
• A positional index expands storage and slows down query processing
• Phrase-based indexing is expensive, and there is no accurate mechanism for identifying which phrases might be used
  • Use a good phrase.
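A proximity check over a positional index might look like the sketch below. The index layout (term, then document identifier, then sorted word positions) and the function name `within_k` are assumptions made for illustration; the slide does not specify a layout. Storing positions is what makes the index larger and the query slower than a plain document-level index.

```python
from typing import Dict, List

# Hypothetical layout: term -> {doc_id -> sorted word positions}.
PositionalIndex = Dict[str, Dict[int, List[int]]]


def within_k(index: PositionalIndex, t1: str, t2: str, k: int) -> List[int]:
    """Return documents where t1 and t2 occur within k positions of each other."""
    docs1 = index.get(t1, {})
    docs2 = index.get(t2, {})
    hits = []
    for doc in sorted(docs1.keys() & docs2.keys()):
        p1, p2 = docs1[doc], docs2[doc]
        i = j = 0
        # Two-pointer walk over both sorted position lists.
        while i < len(p1) and j < len(p2):
            if abs(p1[i] - p2[j]) <= k:
                hits.append(doc)
                break
            if p1[i] < p2[j]:
                i += 1
            else:
                j += 1
    return hits


index = {
    "new": {1: [0, 7], 2: [3]},
    "york": {1: [1], 2: [10]},
}
print(within_k(index, "new", "york", 1))  # [1]
```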

Document vs. Term Based Partitioning

Memory vs. Disk Storage

Compressing The Index
• Advantages of a compressed index
  • Faster transfer of data from disk to memory
  • Reduces disk seek time
• Compression schemes
  • Variable byte encoding
  • Bit-level encoding
• Using gaps: store the difference between consecutive document identifiers instead of the identifiers themselves (each entry is ⟨doc id, term frequency⟩)
  • Original posting lists:
    the: ⟨1, 9⟩ ⟨2, 8⟩ ⟨3, 8⟩ ⟨4, 5⟩ ⟨5, 6⟩ ⟨6, 9⟩
    to: ⟨1, 5⟩ ⟨3, 1⟩ ⟨4, 2⟩ ⟨5, 2⟩ ⟨6, 6⟩
    john: ⟨2, 4⟩ ⟨4, 1⟩ ⟨6, 4⟩
  • With gaps:
    the: ⟨1, 9⟩ ⟨1, 8⟩ ⟨1, 8⟩ ⟨1, 5⟩ ⟨1, 6⟩ ⟨1, 9⟩
    to: ⟨1, 5⟩ ⟨2, 1⟩ ⟨1, 2⟩ ⟨1, 2⟩ ⟨1, 6⟩
    john: ⟨2, 4⟩ ⟨2, 1⟩ ⟨2, 4⟩
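The gap transformation on this slide can be reproduced with a few lines of code. The function names `to_gaps` and `from_gaps` are made up for this sketch; each posting is assumed to be a (document identifier, term frequency) pair, and only the identifier is replaced by its difference from the previous one.

```python
from typing import List, Tuple

Posting = Tuple[int, int]  # (doc id or gap, term frequency)


def to_gaps(postings: List[Posting]) -> List[Posting]:
    """Replace each document identifier by its distance from the previous one."""
    gaps, prev = [], 0
    for doc_id, tf in postings:
        gaps.append((doc_id - prev, tf))
        prev = doc_id
    return gaps


def from_gaps(gaps: List[Posting]) -> List[Posting]:
    """Invert to_gaps by accumulating the gaps back into document identifiers."""
    postings, doc_id = [], 0
    for gap, tf in gaps:
        doc_id += gap
        postings.append((doc_id, tf))
    return postings


john = [(2, 4), (4, 1), (6, 4)]
print(to_gaps(john))             # [(2, 4), (2, 1), (2, 4)]
print(from_gaps(to_gaps(john)))  # [(2, 4), (4, 1), (6, 4)]
```

Gaps are small for frequent terms, which is what lets the variable-length codes on the following slides spend fewer bits per posting.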

Variable Byte Encoding
• Uses an integral but adaptive number of bytes, depending on the gap size.
• The first bit of each byte is a continuation bit.
• The remaining 7 bits in each byte encode part of the gap.
• To decode a gap:
  • Read the sequence of bytes until the continuation bit flips.
  • Extract and concatenate the 7-bit parts to get the magnitude of the gap.
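A minimal sketch of variable byte encoding is shown below. The slide does not say which value of the continuation bit marks the end of a gap; this example assumes one common convention in which the high bit is 1 on the last byte of a gap and 0 on all earlier bytes. The function names are illustrative.

```python
from typing import List


def vb_encode_number(n: int) -> bytes:
    """Encode one non-negative gap, 7 payload bits per byte."""
    if n < 0:
        raise ValueError("gaps must be non-negative")
    chunks = [n & 0x7F]
    n >>= 7
    while n:
        chunks.append(n & 0x7F)
        n >>= 7
    chunks.reverse()    # most significant 7-bit group first
    chunks[-1] |= 0x80  # assumed convention: high bit set on the final byte
    return bytes(chunks)


def vb_encode(gaps: List[int]) -> bytes:
    """Concatenate the encodings of a list of gaps."""
    return b"".join(vb_encode_number(g) for g in gaps)


def vb_decode(data: bytes) -> List[int]:
    """Read bytes until the continuation bit flips, then emit the gap."""
    gaps, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:  # last byte of this gap
            gaps.append(n)
            n = 0
    return gaps


encoded = vb_encode([824, 5, 214577])
print(len(encoded))        # 6
print(vb_decode(encoded))  # [824, 5, 214577]
```

Small gaps fit in a single byte, while a large gap such as 214577 takes three.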

Bit Level Encoding
• Used when disk space is at a premium.
• These codes adapt the length of the code at a finer-grained bit level.
• The codeword is divided into two parts: prefix and suffix.
  • The prefix indicates the binary magnitude of the value and tells the decoder how many bits there are in the suffix.
  • The suffix indicates the value of the number within the corresponding binary range.
• Query processing is more time consuming, because codewords must be decoded bit by bit.
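The slide does not name a specific bit-level code. As one example of the prefix/suffix scheme it describes, the sketch below uses the Elias gamma code: the prefix is the suffix length in unary (that many 1s followed by a 0), and the suffix is the binary value without its leading 1. Bits are kept as a string of "0" and "1" characters purely for readability; a real index would pack them.

```python
from typing import List


def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of bits."""
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    suffix = bin(n)[3:]               # binary digits of n without the leading 1
    prefix = "1" * len(suffix) + "0"  # unary length of the suffix
    return prefix + suffix


def gamma_decode(bits: str) -> List[int]:
    """Decode a concatenation of gamma codewords."""
    out, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # the prefix tells how many suffix bits follow
            length += 1
            i += 1
        i += 1                 # skip the terminating 0 of the prefix
        suffix = bits[i:i + length]
        i += length
        out.append(int("1" + suffix, 2))
    return out


stream = "".join(gamma_encode(g) for g in [1, 5, 13])
print(stream)                # 0110011110101
print(gamma_decode(stream))  # [1, 5, 13]
```

A gap of 1 costs a single bit here, compared with a whole byte under variable byte encoding, at the price of bit-by-bit decoding.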

Ordering by Highest Impact First
• Example posting list of ⟨doc id, term frequency⟩ pairs:
  ⟨12, 2⟩ ⟨17, 2⟩ ⟨29, 1⟩ ⟨32, 1⟩ ⟨40, 6⟩ ⟨78, 1⟩ ⟨101, 3⟩ ⟨106, 1⟩
• When the list is reordered by term frequency, it becomes:
  ⟨40, 6⟩ ⟨101, 3⟩ ⟨12, 2⟩ ⟨17, 2⟩ ⟨29, 1⟩ ⟨32, 1⟩ ⟨78, 1⟩ ⟨106, 1⟩
• The repeated frequency information can then be factored out into a prefix component, with a counter that indicates how many documents share the same frequency value (⟨frequency : count : doc ids⟩):
  ⟨6 : 1 : 40⟩ ⟨3 : 1 : 101⟩ ⟨2 : 2 : 12, 17⟩ ⟨1 : 4 : 29, 32, 78, 106⟩
• Not storing the repeated frequencies gives a considerable saving.
• Finally, if differences of document identifiers are taken within each segment, we get:
  ⟨6 : 1 : 40⟩ ⟨3 : 1 : 101⟩ ⟨2 : 2 : 12, 5⟩ ⟨1 : 4 : 29, 3, 46, 28⟩
• The document gaps within each equal-frequency segment of the list are now on average larger than when the whole list was sorted by document identifier, so they require more encoding bits/bytes.
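The transformation on this slide can be traced step by step in code. This is an illustrative sketch using the slide's own posting list; the function name `impact_order` and the tuple layout (frequency, count, gaps) are assumptions for the example.

```python
from itertools import groupby
from typing import List, Tuple


def impact_order(postings: List[Tuple[int, int]]) -> List[Tuple[int, int, List[int]]]:
    """Group (doc_id, tf) postings by descending tf and gap-encode each group."""
    # Highest term frequency first; ties broken by ascending document identifier.
    ordered = sorted(postings, key=lambda p: (-p[1], p[0]))
    groups = []
    for tf, group in groupby(ordered, key=lambda p: p[1]):
        doc_ids = [doc_id for doc_id, _ in group]
        # First identifier is stored as-is, the rest as differences.
        gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
        groups.append((tf, len(doc_ids), gaps))
    return groups


postings = [(12, 2), (17, 2), (29, 1), (32, 1), (40, 6), (78, 1), (101, 3), (106, 1)]
for group in impact_order(postings):
    print(group)
# (6, 1, [40])
# (3, 1, [101])
# (2, 2, [12, 5])
# (1, 4, [29, 3, 46, 28])
```

For comparison, the gaps of the same list sorted purely by document identifier are 12, 5, 12, 3, 8, 38, 23, 5, which are smaller on average than the within-segment gaps above.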

Managing Multiple Indices
• Multiple indices bucketed by rate of refreshing
  • The large, rarely refreshing pages index
  • The small, ever-refreshing pages index
  • The dynamic real-time/news pages index
• Waterfall approach
  • Pages discovered in one tier can be passed on to the next tier over time
  • Invalidate older index and crawl file entries

SCALING THE SYSTEM
• Web search engines use distributed indexing algorithms for index construction
• Distributed File System
  • To manage large amounts of data across large commodity clusters, a distributed file system is essential. It must provide efficient remote file access, file transfers, and the ability to carry out concurrent independent operations, while being extremely fault tolerant.
• Map-Shuffle-Reduce
  • Map: The master node chops up the problem into small chunks and assigns each chunk to a worker. The worker either processes the chunk of data with the mapper and returns the result to the master, or further chops up the input data and assigns it hierarchically.
  • Shuffle: Group the key-value pairs emitted by the mappers by key.
  • Reduce: Take the sub-answers and combine them to create the final output.
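The three phases can be illustrated with a single-process sketch that builds a small inverted index. Everything here is an assumption made for illustration: a real system distributes the map and reduce calls across workers and a distributed file system, and the function names and whitespace tokenization are not from the presentation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(doc_id: int, text: str) -> List[Tuple[str, int]]:
    """Map: emit one (term, doc_id) pair per token in a document chunk."""
    return [(term, doc_id) for term in text.lower().split()]


def shuffle_phase(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group the emitted values by key (term)."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term: str, doc_ids: List[int]) -> Tuple[str, List[int]]:
    """Reduce: combine the values for one term into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Run the three phases in sequence, standing in for a cluster."""
    pairs = []
    for doc_id, text in docs.items():  # each call could run on a different worker
        pairs.extend(map_phase(doc_id, text))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


docs = {1: "john likes the web", 2: "the web grows"}
index = build_index(docs)
print(index["the"])   # [1, 2]
print(index["john"])  # [1]
```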

FUTURE RESEARCH DIRECTIONS
• Real Time Data and Search: what can we do with each tweet?
  • Create a social graph
  • Extract and index links
  • Real-time related topics
  • Sentiment analysis
• Social and Personalized Web Search
  • Facebook, Twitter, etc.
  • Facebook users post a wealth of information
    • Static: book, movie interests
    • Dynamic: user locations, status updates, wall posts
  • Learning a user’s personal information can personalize search results
  • Facebook impacting the world of search
    • Opened data to third-party services
    • Search for 2 degrees of user

Pros and Cons
• What I liked about it
  • Delves into the history of search engines
  • Talks about future enhancements
  • Explains how a search engine works
• What I didn’t like
  • Skims the surface without going deep.
  • Includes very few examples, which makes understanding difficult.
  • The Compressing the Index section lacks structure, which makes it difficult to understand.
