Why indexing?

• For efficient searching of a document
• Sequential text search
  • Small documents
  • Text volatile
• Data structures
  • Large, semi-stable document collection
  • Efficient search

Representation of Inverted Files
• Index (word list, vocabulary) file: Stores the list of terms (keywords). Designed for searching and sequential processing, e.g., for range queries (lexicographic index). Often held in memory.
• Postings file: Stores an inverted list (postings list) of postings for each term. Designed for rapid merging of lists and calculation of similarities. Each list is usually stored sequentially.
• Document file: Stores the documents. Important for user interface design.

Organization of Inverted Files
[Figure: the index file lists each term (ant, bee, cat, dog, elk, fox, gnu, hog) with a pointer to its inverted list in the postings file; a third panel shows the documents file.]

Decisions in Building Inverted Files: What is a Term?
• Underlying character set, e.g., printable ASCII, Unicode, UTF-8.
• Is there a controlled vocabulary? If so, what words are included? Stemming?
• List of stopwords.
• Rules to decide the beginning and end of words, e.g., spaces or punctuation.
• Character sequences not to be indexed, e.g., sequences of numbers.
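The decisions above can be sketched as a small tokenizer. This is only an illustration under stated assumptions: the stopword list, the word rule (runs of ASCII letters), and the name `tokenize` are all hypothetical choices made for the example, not part of the original slides.

```python
import re

# Hypothetical stopword list; a real system would choose its own.
STOPWORDS = {"a", "and", "is", "of", "the", "to"}

# Assumed word rule: a term is a maximal run of ASCII letters.
# Digits and punctuation act as separators, so number sequences are not indexed.
WORD_RE = re.compile(r"[a-z]+")

def tokenize(text):
    """Return the index terms found in text, in order of appearance."""
    words = WORD_RE.findall(text.lower())
    return [w for w in words if w not in STOPWORDS]

terms = tokenize("The time was past midnight, 12:05 to be exact.")
print(terms)
```

No stemming or controlled vocabulary is applied here; either could be added as a further step after the stopword filter.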

Efficiency Criteria
• Storage: Inverted files are big, typically 10% to 100% of the size of the collection of documents.
• Update performance: It must be possible, with a reasonable amount of computation, to (a) add a large batch of documents and (b) add a single document.
• Retrieval performance: Retrieval must be fast enough to satisfy users and must not use excessive resources.

Data Structure Indexing Methods
• Inverted index
• Suffix trees and arrays
• Signature files
• Word-oriented index structures based on hashing (usually not used for large texts)

Inverted Index
• This is the primary data structure for text indexes
• Basically two elements: (Vocabulary, Occurrences)
• Main idea: invert documents into a big index
• Basic steps:
  • Make a “dictionary” of all the tokens in the collection
  • For each token, list all the docs it occurs in, possibly with the location in the document
  • Compress to reduce redundancy in the data structure; this also reduces the I/O and storage required
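The first two basic steps can be sketched in a few lines. This is a minimal in-memory version, assuming whitespace tokenization and no compression; the function name `build_index` is hypothetical, and the two texts are the example documents used in this transcript (lowercased, punctuation removed).

```python
from collections import defaultdict

# Toy collection: document ID -> text.
docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

def build_index(collection):
    """Map each token to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for token in text.split():
            index[token].add(doc_id)
    # The keys form the dictionary; the values are the occurrence lists.
    return {token: sorted(ids) for token, ids in index.items()}

index = build_index(docs)
print(index["country"])
print(index["dark"])
```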

Inverted File Types
• Index file
  • Actual posting list for each distinct term in the collection
• Document file
  • Information about each document: ID, name, when published, etc.
• Weight file
  • Similarity between document and query

Inverted Indexes
We have seen “vector files”. An inverted file is a vector file “inverted” so that rows become columns and columns become rows.
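The row and column swap can be shown on a tiny vector file. The three terms and their counts below are sample data chosen for illustration, and the dictionary layout is an assumption, not a format defined in the slides.

```python
# Hypothetical vector file: one row per document, one column per term.
terms = ["country", "dark", "time"]
doc_vectors = {
    "d1": [1, 0, 1],
    "d2": [1, 1, 1],
}

doc_ids = list(doc_vectors)
# zip(*rows) transposes the matrix: each term column becomes a row.
columns = zip(*(doc_vectors[d] for d in doc_ids))
inverted = {term: dict(zip(doc_ids, col)) for term, col in zip(terms, columns)}

print(inverted["dark"])
```

After the transpose, each term owns a row that says which documents contain it, which is the shape an inverted file needs.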

How Inverted Files are Created
• Documents are parsed one document at a time to extract tokens. Each token is saved with its document ID (DID) as a pair <token, DID>.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
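A minimal sketch of this parsing step, assuming tokens are lowercased runs of letters (the slides do not fix a tokenization rule) and using a hypothetical function name `parse`:

```python
import re

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def parse(collection):
    """Emit one (token, DID) pair per token occurrence, one document at a time."""
    pairs = []
    for did in sorted(collection):
        for token in re.findall(r"[a-z]+", collection[did].lower()):
            pairs.append((token, did))
    return pairs

pairs = parse(docs)
print(pairs[:4])
print(len(pairs))
```

At this stage the pairs are still in parse order and repeated tokens appear once per occurrence.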

How Inverted Files are Created
• After all documents have been parsed, the inverted file is sorted alphabetically by token and, within each token, by document ID.

How Inverted Files are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.
• Result: <token, DID, tf>, e.g., <the, 1, 2>
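The sort and merge steps can be sketched as follows. The pair list is a shortened sample of the example documents, and `itertools.groupby` is just one convenient way to merge adjacent duplicates, not necessarily how the original system did it.

```python
from itertools import groupby

# A few <token, DID> pairs in parse order (a subset of the example, for brevity).
pairs = [
    ("now", 1), ("is", 1), ("the", 1), ("time", 1), ("the", 1), ("country", 1),
    ("the", 2), ("country", 2), ("the", 2), ("time", 2),
]

# Step 1: sort by token, then by document ID (tuples compare field by field).
pairs.sort()

# Step 2: identical (token, DID) entries are now adjacent; merge each run
# and use its length as the within-document term frequency (tf).
merged = [(token, did, len(list(group))) for (token, did), group in groupby(pairs)]

for entry in merged:
    print(entry)
```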

How Inverted Files are Created
• Then the file can be split into:
  • A dictionary file: the file of unique tokens
  • A postings file: the file of which documents each token is in and how often; sometimes also where the token is in the document
• Worst case O(n), where n is the size of the database.
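One way to picture the split is a dictionary that stores, for each token, where its postings start and how many there are. The layout below (an offset and a count into a flat list) is an assumption made for illustration; the slides do not specify a file format.

```python
# Merged <token, DID, tf> entries, already sorted by token and DID (sample data).
entries = [
    ("country", 1, 1), ("country", 2, 1),
    ("dark", 2, 1),
    ("the", 1, 2), ("the", 2, 2),
    ("time", 1, 1), ("time", 2, 1),
]

dictionary = {}  # token -> (offset into postings, number of postings)
postings = []    # flat list of (DID, tf), grouped by token

for token, did, tf in entries:
    if token not in dictionary:
        dictionary[token] = (len(postings), 0)
    start, count = dictionary[token]
    dictionary[token] = (start, count + 1)
    postings.append((did, tf))

def lookup(token):
    """Return the postings list for token, or an empty list if it is unknown."""
    start, count = dictionary.get(token, (0, 0))
    return postings[start:start + count]

print(dictionary["the"])
print(lookup("the"))
```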

Dictionary and Posting Files
[Figure with two panels: Dictionary, Postings]

Inverted indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
  • document ID
  • frequency of term in doc (optional)
  • position of term in doc (optional)
• <token,DID,tf,position>
• <token,(DIDi,tf,positionij),…>
• These lists can be used to solve Boolean queries:
  • country -> d1, d2
  • manor -> d2
  • country AND manor -> d2
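A Boolean AND over two posting lists can be computed with a linear merge, assuming both lists hold document IDs in ascending order. The function name `intersect` is hypothetical; the posting lists are the ones from the example.

```python
def intersect(a, b):
    """Intersect two posting lists of document IDs sorted in ascending order."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot match anything later in b
        else:
            j += 1  # b[j] cannot match anything later in a
    return result

postings = {"country": [1, 2], "manor": [2], "time": [1, 2], "dark": [2]}
print(intersect(postings["country"], postings["manor"]))
print(intersect(postings["time"], postings["dark"]))
```

Each list is scanned once, so the cost is proportional to the combined length of the two lists.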

How Inverted Files are Used
Query on “time” AND “dark”:
• 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file
• 1 doc with “dark” in dictionary -> ID 2 from posting file
• Therefore, only doc 2 satisfies the query.
[Figure with two panels: Dictionary, Postings]

Inverted index
• Associates a posting list with each term
• POSTING LIST example
  • a (d1, 1)
  • …
  • the (d1,2) (d2,2)
• Replace frequency with tfidf
• Compress index and put hash links
• Match query to index and rank

Position in inverted file posting
• POSTING LIST example
  • now (d1;1,1)
  • …
  • time (d1;1,10) (d2;1,126)
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
[Figure: position markers 1, 4, 6, 10 above the first words of Doc 1 and 69 at the start of Doc 2]
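A positional index can be sketched as a map from token to document to a list of positions. Note the assumption: this sketch records word positions counted from 1 within each document, which is a different numbering from the position values shown in the example above. The function name is hypothetical.

```python
import re
from collections import defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def build_positional_index(collection):
    """Map token -> {DID: [word positions]}, counting words from 1."""
    index = defaultdict(dict)
    for did, text in collection.items():
        words = re.findall(r"[a-z]+", text.lower())
        for pos, token in enumerate(words, start=1):
            index[token].setdefault(did, []).append(pos)
    return index

index = build_positional_index(docs)
print(index["time"])
print(index["the"])
```

The term frequency does not need to be stored separately in this layout: it is the length of each position list.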

Change weight
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.
• Replace term freq by tfidf.
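The slides do not say which tfidf formula is meant. The sketch below assumes one common variant, tf × ln(N / df), where N is the number of documents and df is the number of documents containing the term; the function name and sample postings are hypothetical.

```python
import math

N = 2  # number of documents in the toy collection

# Postings with raw term frequencies: token -> [(DID, tf), ...]
postings = {
    "dark": [(2, 1)],
    "the": [(1, 2), (2, 2)],
    "time": [(1, 1), (2, 1)],
}

def tfidf_postings(plist, n_docs):
    """Replace each tf by tf * ln(n_docs / df), with df = length of the list."""
    df = len(plist)
    idf = math.log(n_docs / df)
    return [(did, tf * idf) for did, tf in plist]

weighted = {token: tfidf_postings(plist, N) for token, plist in postings.items()}
print(weighted["dark"])
print(weighted["the"])
```

Under this variant a term that occurs in every document, such as "the" here, gets weight zero, while a term found in only one document keeps a positive weight.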

Documents File for Web Search System
For Web search systems:
• A Document is a Web page.
• The Documents File is the Web.
• The Document ID is the URL of the document.
Indexes are built using a Web crawler, which retrieves each page on the Web (or a subset). After indexing, each page is discarded, unless stored in a cache.
(In addition to the usual index file and postings file, the indexing system stores special information.)

Index Files
• On disk: If an index is held on disk, search time is dominated by the number of disk accesses.
• In memory: Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term, some basic statistics, and a pointer to the inverted list, on average 100 characters. The size of the index is then 100 megabytes, which can easily be held in the memory of a dedicated computer.

Index File Structures: Linear Index
Advantages
• Can be searched quickly, e.g., by binary search, O(log n)
• Good for sequential processing, e.g., comp*
• Convenient for batch updating
• Economical use of storage
Disadvantages
• Index must be rebuilt if an extra term is added
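Both advantages can be illustrated with a sorted list and Python's `bisect` module. The term list and the function names `find` and `prefix_range` are assumptions made for this sketch.

```python
from bisect import bisect_left

# Linear index: terms kept in sorted order (hypothetical sample terms).
terms = ["cat", "compact", "compare", "compute", "cone", "dog"]

def find(term):
    """Binary search, O(log n): return the position of term, or -1 if absent."""
    i = bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else -1

def prefix_range(prefix):
    """Answer a query such as comp*: matching terms form one contiguous run."""
    i = bisect_left(terms, prefix)
    matches = []
    while i < len(terms) and terms[i].startswith(prefix):
        matches.append(terms[i])
        i += 1
    return matches

print(find("dog"))
print(find("cow"))
print(prefix_range("comp"))
```

The prefix query needs one binary search to find the start of the run and then a sequential scan, which is why a sorted layout suits this kind of processing.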
