CS 430: Information Discovery

CS 430: Information Discovery Lecture 3 Inverted Files and Boolean Operations

Course Administration • Assignment 1 will be posted during the next couple of days. It is due on Friday, September 21 at 5 p.m.

Inverted File (Basic Definition)
• Inverted file: a list of the words in a set of documents and the documents in which they appear.
    Word      Documents
    abacus    3, 19, 22
    actor     2, 19, 29
    aspen     5
    atoll     11, 34
• Stop words are removed before building the index.

Inverted List
• Inverted list: all the entries in an inverted file that apply to a specific word, e.g.,
    abacus    3, 19, 22
• Posting: an entry in an inverted list, e.g., the postings for "abacus" are documents 3, 19, 22.
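The two definitions above can be illustrated with a minimal sketch. The stop-word list, the sample documents, and the function name build_inverted_file are all assumptions made for this example; they are not from the lecture.

```python
from collections import defaultdict

# Hypothetical stop-word list, chosen only for illustration.
STOP_WORDS = {"the", "a", "an", "has", "and", "of"}


def build_inverted_file(documents):
    """Map each non-stop word to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(doc_id)
    # Sort the terms and each inverted list so later merges can run sequentially.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}


# Hypothetical documents keyed by document number.
documents = {
    2: "the actor",
    3: "an abacus",
    19: "the actor has an abacus",
}
inverted_file = build_inverted_file(documents)
print(inverted_file["abacus"])  # [3, 19]
print(inverted_file["actor"])   # [2, 19]
```

Each value in the dictionary is one inverted list, and each document number in it is one posting.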

Keywords and Controlled Vocabulary
Keyword: A term that is used to describe the subject matter of a document. It is sometimes called an index term. Keywords can be extracted automatically from a document or assigned by a human cataloguer or indexer.
Controlled vocabulary: A list of words that can be used as keywords, e.g., in a medical system, a list of medical terms.
Inverted file (more complete definition): A list of the keywords that apply to a set of documents and the documents in which they appear.

Enhancements to Inverted Files
Location: The inverted file holds information about the location of each term within the document.
  Uses:
  • adjacency and near operators
  • user interface design, e.g., highlighting the location of a search term
Frequency: The inverted file includes the number of postings for each term.
  Uses:
  • term weighting
  • query processing optimization
  • user interface design

Inverted File (Enhanced)
    Word      Postings   Document   Location
    abacus    4          3          94
                         19         7
                         19         212
                         22         56
    actor     3          2          66
                         19         213
                         29         45
    aspen     1          5          43
    atoll     3          11         3
                         11         70
                         34         40
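An enhanced inverted file of this shape could be built roughly as follows. This is a sketch under assumptions: the slide does not say how locations are counted, so the code assumes a location is a 1-based word position, and the documents and the name build_positional_index are hypothetical.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Return {term: [(doc_id, location), ...]}, ordered by document, then location.

    Assumption: a location is the 1-based position of the word in the document.
    """
    index = defaultdict(list)
    for doc_id in sorted(documents):
        words = documents[doc_id].lower().split()
        for location, word in enumerate(words, start=1):
            index[word].append((doc_id, location))
    return dict(index)


# Hypothetical documents keyed by document number.
documents = {
    11: "atoll reef atoll",
    34: "coral atoll",
}
index = build_positional_index(documents)
atoll_postings = index["atoll"]
print(len(atoll_postings))  # 3, the "Postings" column for this term
print(atoll_postings)       # [(11, 1), (11, 3), (34, 2)]
```

The length of each list plays the role of the Postings column, so the frequency does not need to be computed at query time if it is stored alongside the term.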

Example: Boolean Queries
Boolean query: two or more search terms, related by logical operators, e.g., and, or, not.
Examples:
  abacus and actor
  abacus or actor
  (abacus and actor) or (abacus and atoll)
  not actor

Boolean Diagram
[Diagram of two sets, A and B, with the regions labeled: A and B, A or B, not (A or B)]

Evaluating a Boolean Query
Example: abacus and actor
  Postings for abacus: 3, 19, 22
  Postings for actor: 2, 19, 29
Document 19 is the only document that contains both terms, "abacus" and "actor".
To evaluate the and operator, merge the two inverted lists with a logical AND operation.
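One common way to carry out such a merge is a two-pointer pass over the two sorted lists. The sketch below assumes both inverted lists are sorted by document number; the function name intersect is chosen for illustration.

```python
def intersect(list_a, list_b):
    """Merge two sorted postings lists, keeping documents present in both (AND)."""
    i = j = 0
    result = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # advance the list with the smaller document number
        else:
            j += 1
    return result


abacus = [3, 19, 22]
actor = [2, 19, 29]
print(intersect(abacus, actor))  # [19]
```

Each list is read once from front to back, which suits inverted lists that are processed sequentially.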

Adjacent and Near Operators
abacus adj actor: The terms abacus and actor are adjacent to each other, as in the string "abacus actor".
abacus near 4 actor: The terms abacus and actor are near each other, as in the string "the actor has an abacus".
Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).

Evaluating an Adjacency Operation
Example: abacus adj actor
  Postings for abacus (document, location): (3, 94), (19, 7), (19, 212), (22, 56)
  Postings for actor (document, location): (2, 66), (19, 213), (29, 45)
Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent to each other.
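The adjacency check can be sketched with the same postings. Two assumptions are made here that the slides do not state: word order is ignored, and "near n" means the locations differ by at most n. The nested loop is written for clarity rather than speed, and the name within is hypothetical.

```python
def within(postings_a, postings_b, max_distance):
    """Return (document, location_a, location_b) for each pair of occurrences
    in the same document whose locations differ by 1..max_distance.

    Assumption: order is ignored, so either term may come first.
    """
    matches = []
    for doc_a, loc_a in postings_a:
        for doc_b, loc_b in postings_b:
            if doc_a == doc_b and 0 < abs(loc_a - loc_b) <= max_distance:
                matches.append((doc_a, loc_a, loc_b))
    return matches


abacus = [(3, 94), (19, 7), (19, 212), (22, 56)]
actor = [(2, 66), (19, 213), (29, 45)]

print(within(abacus, actor, 1))  # adj:    [(19, 212, 213)]
print(within(abacus, actor, 4))  # near 4: [(19, 212, 213)]
```

A production system would more likely merge the two sorted lists in a single pass instead of comparing every pair.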

Evaluation of Boolean Operators
Precedence of operators must be defined:
  adj, near    high
  and, not
  or           low
Example: A and B or C and B is evaluated as (A and B) or (C and B)
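The effect of precedence can be made visible with a small recursive-descent sketch that adds explicit parentheses to a query. It is a simplification: only and, or, and parentheses are handled (not, adj, and near are left out), and the name parenthesize is an assumption for this example.

```python
import re


def parenthesize(query):
    """Return the query fully parenthesized, with 'and' binding tighter than 'or'."""
    tokens = re.findall(r"\(|\)|\w+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():  # lowest precedence
        left = parse_and()
        while peek() == "or":
            advance()
            left = f"({left} or {parse_and()})"
        return left

    def parse_and():  # higher precedence than 'or'
        left = parse_operand()
        while peek() == "and":
            advance()
            left = f"({left} and {parse_operand()})"
        return left

    def parse_operand():
        token = advance()
        if token == "(":
            inner = parse_or()
            advance()  # consume the closing ")"
            return inner
        return token

    return parse_or()


print(parenthesize("A and B or C and B"))  # ((A and B) or (C and B))
```

Because parse_or calls parse_and for each of its operands, every and group is completed before or combines the results.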

Sizes of Inverted Files
    Set   Records   Unique Terms
    A     2,653     5,123
    B     38,304    c. 25,000
Set A has an average of 14 postings per term and a maximum of over 2,000 postings per term.
Set B has an average of 88 postings per record.
Examples from Harman and Candela, 1990.

Representation of Inverted Files
Index (vocabulary) file: Stores the list of terms (keywords). Designed for rapid searching and for processing range queries. Often held in memory.
Postings file: Stores a list of postings for each term. Designed for rapid merging of lists. Each list may be stored sequentially.
Document file: [Repositories for the storage of document collections are covered in CS 502.]

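Organization of Inverted Files
[Diagram with three parts: index (vocabulary) file, postings file, and documents file. Each index entry holds a term (ant, bee, cat, dog, elk, fox, gnu, hog) and a pointer to its postings; the postings file holds the inverted lists.]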

Decisions in Building Inverted Files
• Underlying character set, e.g., printable ASCII, Unicode, UTF-8.
• Whether to use a controlled vocabulary. If so, what words to include.
• List of stop words.
• Rules to decide the beginning and end of words, e.g., spaces or punctuation.
• Character sequences not to be indexed, e.g., sequences of numbers.

Efficiency Criteria
Storage: Inverted files are big, typically 10% to 100% of the size of the collection of documents.
Update performance: It must be possible, with a reasonable amount of computation, to:
  (a) Add a large batch of documents
  (b) Add a single document
Retrieval performance: Retrieval must be fast enough to satisfy users and must not use excessive resources.

Index File
If an index is held on disk, search time is dominated by the number of disk accesses.
Suppose that an index has 1,000,000 distinct terms. Each index entry consists of the term and a pointer to the inverted list, averaging 100 characters. The size of the index is then 100 megabytes, which can easily be held in memory.
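The size estimate is a single multiplication, sketched below. The one-byte-per-character figure is an assumption, and the probe count is an added illustration of what a binary search over the entries would cost if each probe were a disk access; neither appears in the lecture.

```python
import math


def index_size_bytes(distinct_terms, average_entry_chars, bytes_per_char=1):
    """Estimate index size. Assumption: each character occupies bytes_per_char bytes."""
    return distinct_terms * average_entry_chars * bytes_per_char


size = index_size_bytes(1_000_000, 100)
print(f"{size / 1_000_000:.0f} megabytes")  # 100 megabytes

# Worst-case probes for a binary search over 1,000,000 sorted entries.
probes = math.ceil(math.log2(1_000_000))
print(probes)  # 20
```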

Postings File
Since inverted lists may be very long, it is important to match postings efficiently.
Usually, the inverted lists will be held on disk. Therefore, algorithms for matching postings use sequential file processing.
For efficient matching, the inverted lists should all be sorted in the same sequence, usually alphabetical order (a "lexicographic index").
Merging inverted lists is the most computationally intensive task in many information retrieval systems.

Efficiency and Query Languages
Some query options may require huge computation, e.g.,
Regular expressions: If inverted files are stored in alphabetical order,
  comp* can be processed efficiently
  *comp cannot be processed efficiently
Boolean terms: If A and B are search terms,
  A or B can be processed by comparing two moderate-sized lists
  (not A) or (not B) requires two very large lists
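The difference between comp* and *comp can be sketched on a sorted vocabulary. The term list and both function names are assumptions for this example. A prefix query can jump to the first candidate by binary search and then read forward, while a suffix query has to examine every term.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """comp*: binary search to the first candidate, then scan until the prefix ends."""
    matches = []
    i = bisect_left(sorted_terms, prefix)
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


def suffix_matches(sorted_terms, suffix):
    """*comp: alphabetical order does not help, so every term is examined."""
    return [term for term in sorted_terms if term.endswith(suffix)]


# Hypothetical vocabulary, already in alphabetical order.
terms = ["compact", "compile", "compute", "decomp", "dog", "recomp"]

print(prefix_matches(terms, "comp"))  # ['compact', 'compile', 'compute']
print(suffix_matches(terms, "comp"))  # ['decomp', 'recomp']
```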

Index File Structures: Linear Index
[Diagram: a linear index in which each entry holds a term (ant, bee, cat, dog, elk, fox, gnu, hog) and a pointer to its list of postings, the inverted lists.]

Linear Index
Advantages:
  • Can be searched quickly, e.g., by binary search, O(log n)
  • Good for sequential processing, e.g., comp*
  • Convenient for batch updating
  • Economical use of storage
Disadvantages:
  • Index must be rebuilt if an extra term is added
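A linear index can be sketched as a sorted array of terms with a parallel array of pointers into one flat postings array. The terms come from the slides; the document numbers, the offsets layout, and the name lookup are hypothetical.

```python
from bisect import bisect_left

# Sorted terms, as in the slides.
terms = ["ant", "bee", "cat", "dog", "elk", "fox", "gnu", "hog"]

# Hypothetical postings file: all inverted lists stored one after another.
postings = [1, 4, 2, 1, 3, 7, 5, 2, 6, 4, 3, 8, 7]

# offsets[i] points to the start of the list for terms[i];
# offsets[i + 1] marks where that list ends.
offsets = [0, 2, 3, 6, 7, 9, 10, 12, 13]


def lookup(term):
    """Binary search the term array, then follow the pointer to the inverted list."""
    i = bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        return []  # term is not in the index
    return postings[offsets[i]:offsets[i + 1]]


print(lookup("cat"))  # [1, 3, 7]
print(lookup("cow"))  # []
```

Adding a term such as "cow" would mean shifting every later entry in all three arrays, which is one way to see why the slide lists rebuilding as the disadvantage.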
