
Why indexing?

Learn about building and using inverted files for efficient text indexing, retrieval, and document organization. Explore storage, update, and retrieval performance considerations.

jshulman

Presentation Transcript


  1. Why indexing? • For efficient searching of documents • Sequential text search: suitable for small documents and volatile text • Data structures (indexes): suitable for large, semi-stable document collections, where they enable efficient search

  2. Representation of Inverted Files • Index (word list, vocabulary) file: Stores the list of terms (keywords). Designed for searching and sequential processing, e.g., for range queries (lexicographic index). Often held in memory. • Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists and calculation of similarities. Each list is usually stored sequentially. • Document file: Stores the documents. Important for user interface design.

  3. Organization of Inverted Files [Figure: index file (columns: Term, Pointer to postings; terms: ant, bee, cat, dog, elk, fox, gnu, hog), postings file (inverted lists), and documents file]

  4. Decisions in Building Inverted Files: What is a Term? • Underlying character set, e.g., printable ASCII, Unicode, UTF-8. • Is there a controlled vocabulary? If so, what words are included? Is stemming applied? • List of stopwords. • Rules to decide the beginning and end of words, e.g., spaces or punctuation. • Character sequences not to be indexed, e.g., sequences of numbers.
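The decisions listed on this slide can be made concrete with a small tokenizer. This is only a sketch under stated assumptions: the stopword list, the rule that a word is a run of ASCII letters or digits, and the function name `tokenize` are all hypothetical choices for illustration, not something the slides specify.

```python
import re

# Hypothetical stopword list, chosen only for illustration.
STOPWORDS = {"a", "and", "the", "of", "to", "in", "is", "it", "was", "for"}

def tokenize(text, stopwords=STOPWORDS, skip_numbers=True):
    """Split text into lowercase terms, dropping stopwords and pure digit runs."""
    terms = []
    # Word boundary rule (assumption): a word is a maximal run of ASCII letters/digits.
    for raw in re.findall(r"[A-Za-z0-9]+", text):
        term = raw.lower()
        if term in stopwords:
            continue  # stopwords are not indexed
        if skip_numbers and term.isdigit():
            continue  # sequences of numbers are not indexed
        terms.append(term)
    return terms

print(tokenize("It was a dark and stormy night in 1999."))
```

Changing any one of these rules (for example, keeping numbers or adding stemming) changes the vocabulary, which is why they have to be fixed before the index is built.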

  5. Efficiency Criteria • Storage: Inverted files are big, typically 10% to 100% of the size of the document collection. • Update performance: It must be possible, with a reasonable amount of computation, to (a) add a large batch of documents and (b) add a single document. • Retrieval performance: Retrieval must be fast enough to satisfy users and must not use excessive resources.

  6. Data Structure Indexing Methods • Inverted index • Suffix trees and arrays • Signature files • Word oriented index structures based on hashing (usually not used for large texts)

  7. Inverted Index • This is the primary data structure for text indexes • Basically two elements: • (Vocabulary, Occurrences) • Main Idea: • Invert documents into a big index • Basic steps: • Make a “dictionary” of all the tokens in the collection • For each token, list all the docs it occurs in. • Possibly location in document • Compress to reduce redundancy in the data structure • Also reduces I/O and storage required

  8. Inverted File Types • Index file • Actual posting list for each distinct term in the collection • Document file • Information about each document; ID, name, when published, etc. • Weight file • Similarity between document and query

  9. Inverted Indexes We have seen “Vector files”. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows
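The "rows become columns" idea can be sketched as a transposition of a small document-by-term table. The nested-dictionary layout and the sample counts below are assumptions made for illustration.

```python
# Vector file (assumed layout): one row per document, mapping term -> frequency.
doc_vectors = {
    1: {"time": 1, "the": 2},
    2: {"time": 1, "dark": 1, "the": 2},
}

def invert(doc_vectors):
    """Transpose document rows into term rows: term -> {DID: frequency}."""
    inverted = {}
    for did in sorted(doc_vectors):
        for term, tf in doc_vectors[did].items():
            inverted.setdefault(term, {})[did] = tf
    return inverted

inverted = invert(doc_vectors)
print(inverted["dark"])
```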

  10. How Inverted Files Are Created • Documents are parsed one at a time to extract tokens. Each token is saved with its document ID: <token, DID> • Doc 1: Now is the time for all good men to come to the aid of their country • Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

  11. How Inverted Files Are Created • After all documents have been parsed, the inverted file is sorted alphabetically by token and, within each token, by document ID.

  12. How Inverted Files Are Created • Multiple term entries for a single document are merged. • Within-document term frequency information is compiled. • Result: <token,DID,tf>, e.g., <the,1,2>

  13. How Inverted Files Are Created • Then the file can be split into: • A Dictionary file: the file of unique tokens • A Postings file: the file recording which documents each token occurs in and how often, and sometimes where in the document the token occurs • Worst case O(n), where n is the size of the database.
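The four steps on slides 10 to 13 (parse, sort, merge, split) can be sketched end to end on the two sample documents. The tokenization rule (lowercased runs of letters), the function names, and the choice to store the document count in the dictionary are assumptions for illustration, not details given in the slides.

```python
import re
from itertools import groupby

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def parse(docs):
    """Step 1: emit one (token, DID) pair per token occurrence."""
    pairs = []
    for did, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            pairs.append((token, did))
    return pairs

def build(docs):
    # Step 2: sort by token, then by document ID.
    pairs = sorted(parse(docs))
    # Step 3: merge identical (token, DID) pairs into (token, DID, tf).
    triples = [(tok, did, len(list(group))) for (tok, did), group in groupby(pairs)]
    # Step 4: split into postings (token -> [(DID, tf), ...]) and a dictionary
    # (token -> number of documents containing it).
    postings = {}
    for tok, did, tf in triples:
        postings.setdefault(tok, []).append((did, tf))
    dictionary = {tok: len(plist) for tok, plist in postings.items()}
    return dictionary, postings

dictionary, postings = build(DOCS)
print(postings["the"])
```

With this tokenization, the entry for "the" carries a frequency of 2 in each document, which agrees with the <the,1,2> triple shown on slide 12.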

  14. Dictionary and Posting Files [Figure: dictionary table and postings table]

  15. Inverted indexes • Permit fast search for individual terms • For each term, you get a list consisting of: • document ID • frequency of term in doc (optional) • position of term in doc (optional) • <token,DID,tf,position> • <token,(DIDi,tf,positionij),…> • These lists can be used to solve Boolean queries: • country -> d1, d2 • manor -> d2 • country AND manor -> d2

  16. How Inverted Files Are Used • Query on “time” AND “dark” • 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file • 1 doc with “dark” in dictionary -> ID 2 from posting file • Therefore, only doc 2 satisfies the query. [Figure: dictionary table and postings table]
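A Boolean AND over postings lists amounts to intersecting sorted lists of document IDs. The sketch below uses a linear merge, which is one common way to do this; the slides do not say which method is intended, and the tiny hand-written `index` and the function names are assumptions.

```python
def intersect(a, b):
    """Merge two sorted lists of document IDs, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hand-written postings (document IDs only) for the two sample documents.
index = {"country": [1, 2], "manor": [2], "time": [1, 2], "dark": [2]}

def query_and(index, *terms):
    """Return the documents that contain every query term."""
    result = None
    for term in terms:
        plist = index.get(term, [])  # unknown term -> empty postings list
        result = plist if result is None else intersect(result, plist)
    return result or []

print(query_and(index, "time", "dark"))
```

Because both input lists are sorted by document ID, the merge touches each posting at most once, which is the reason postings lists are usually stored sequentially.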

  17. Inverted index • Associates a posting list with each term • POSTING LIST example • a (d1, 1) • … • the (d1,2) (d2,2) • Replace frequency with tfidf • Compress index and put hash links • Match query to index and rank

  18. Position in inverted file posting • POSTING LIST example • now (d1;1,1) • … • time (d1;1,10) (d2;1,126) • Doc 1 (position markers 1, 4, 6, 10): Now is the time for all good men to come to the aid of their country • Doc 2 (position marker 69): It was a dark and stormy night in the country manor. The time was past midnight
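A positional posting can be sketched as follows. Note the assumption: this example records 1-based word positions within each document, which is a different convention from the position numbers on the slide, so the values will not match the slide's.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def positional_index(docs):
    """Map each token to {DID: [1-based word positions]}."""
    index = defaultdict(dict)
    for did, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        for pos, token in enumerate(words, start=1):
            index[token].setdefault(did, []).append(pos)
    return index

pos_index = positional_index(DOCS)
# The term frequency is recoverable as the length of each position list.
print(pos_index["time"])
```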

  19. Change weight • Multiple term entries for a single document are merged. • Within-document term frequency information is compiled. • Replace term freq by tfidf.
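Replacing the term frequency with a tf-idf weight can be sketched as below. The slides do not give a formula, so this assumes one common variant, weight = tf × ln(N / df), where N is the number of documents and df is the number of documents containing the term; the function name and sample postings are also assumptions.

```python
import math

# Postings with raw term frequencies: token -> [(DID, tf), ...]
postings = {"the": [(1, 2), (2, 2)], "to": [(1, 2)], "dark": [(2, 1)]}
N_DOCS = 2

def tfidf_postings(postings, n_docs):
    """Replace each tf with tf * ln(n_docs / df) (assumed tf-idf variant)."""
    weighted = {}
    for term, plist in postings.items():
        idf = math.log(n_docs / len(plist))  # df = length of the postings list
        weighted[term] = [(did, tf * idf) for did, tf in plist]
    return weighted

weighted = tfidf_postings(postings, N_DOCS)
print(weighted["the"], weighted["dark"])
```

Under this variant a term that occurs in every document, such as "the" here, receives weight 0, while terms confined to one document keep a positive weight.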

  20. Documents File for Web Search System • For Web search systems: • A Document is a Web page. • The Documents File is the Web. • The Document ID is the URL of the document. • Indexes are built using a Web crawler, which retrieves each page on the Web (or a subset). After indexing, each page is discarded unless it is stored in a cache. (In addition to the usual index file and postings file, the indexing system stores special information.)

  21. Index Files • On disk: If an index is held on disk, search time is dominated by the number of disk accesses. • In memory: Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics, and a pointer to the inverted list, averaging 100 characters. The size of the index is then 100 megabytes, which can easily be held in the memory of a dedicated computer.
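The in-memory estimate is a direct multiplication. The sketch below reproduces it, assuming one byte per character and decimal megabytes (10^6 bytes); neither assumption is stated on the slide.

```python
# Figures from the slide.
distinct_terms = 1_000_000
avg_entry_chars = 100
bytes_per_char = 1  # assumption: single-byte encoding such as ASCII

index_bytes = distinct_terms * avg_entry_chars * bytes_per_char
index_megabytes = index_bytes / 1_000_000  # decimal megabytes

print(f"{index_megabytes:.0f} MB")
```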

  22. Index File Structures: Linear Index • Advantages: Can be searched quickly, e.g., by binary search, O(log n) • Good for sequential processing, e.g., comp* • Convenient for batch updating • Economical use of storage • Disadvantages: Index must be rebuilt if an extra term is added
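A linear index can be sketched as a sorted list of terms searched with Python's standard `bisect` module. The term list and the helper names `lookup` and `prefix_range` are hypothetical; the "comp" words are added only so that the comp* example has something to match.

```python
from bisect import bisect_left

# Sorted term list standing in for the linear index (hypothetical contents).
TERMS = sorted(["ant", "bee", "cat", "comb", "compile", "computer", "dog", "elk"])

def lookup(terms, term):
    """Binary search, O(log n): return the term's position, or -1 if absent."""
    i = bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else -1

def prefix_range(terms, prefix):
    """Sequential processing for a query like comp*: binary search to the first
    candidate, then scan forward while terms still start with the prefix."""
    lo = bisect_left(terms, prefix)
    hi = lo
    while hi < len(terms) and terms[hi].startswith(prefix):
        hi += 1
    return terms[lo:hi]

print(lookup(TERMS, "cat"), prefix_range(TERMS, "comp"))
```

The prefix query works because matching terms are adjacent in sorted order. The same property is behind the disadvantage on the slide: a new term has to be placed at its sorted position, which shifts every entry after it.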
