WMES3103 : INFORMATION RETRIEVAL

WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING

INTRODUCTION
• A basic query can be searched for in 2 ways:
  • Scanning the text sequentially (sequential or online searching): finding the occurrences of a pattern in a text that has not been preprocessed
    • Good when the text is small, the text collection is volatile (modified frequently), or no indexing space is available
  • Building data structures over the text (indexes) to speed up the search
    • Worth building and maintaining when the text collection is large and semi-static (updated at reasonably regular intervals)
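The sequential option can be illustrated with a minimal sketch. The function name `find_occurrences` is hypothetical, and the use of Python's built-in `str.find` is an assumption made for brevity; the slide does not prescribe a particular scanning algorithm.

```python
def find_occurrences(text, pattern):
    """Return the 0-based start offsets of every occurrence of pattern in text."""
    if not pattern:
        return []
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping matches are also reported.
        start = text.find(pattern, start + 1)
    return positions
```

No preprocessing is done here: every query rescans the whole text, which is why this approach suits small or frequently modified collections.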

INDEXING
• Key weight: frequency dependent; determines the ranking (best match)
• tf*idf weighting
  • tf: frequency of the key in a document
  • idf: inverse document frequency, based on the inverse of the number of documents containing the key (keys found in fewer documents weigh more)
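A minimal sketch of tf*idf weighting follows. It assumes the common logarithmic form idf = log(N / df), where N is the number of documents and df is the number of documents containing the key; the slide only states that idf is an inverse, so the logarithm is an assumption. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {key: weight} dict per document."""
    n_docs = len(docs)
    # df[key] = number of documents that contain the key at least once.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # tf[key] = frequency of the key in this document
        # Assumed idf form: log(N / df). A key present in every document gets weight 0.
        weights.append({key: tf[key] * math.log(n_docs / df[key]) for key in tf})
    return weights
```

Under this assumed form, a key that appears in every document receives weight 0, so it does not influence the ranking.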

AUTOMATIC INDEXING PROCESS
[Figure: flowchart of the automatic indexing process. Box labels: Text; Recognize string; Delete stopwords; Identify stems; Replace stems by identifiers; Count posting; Use thesaurus and phrases; Weight; Text representation]

AUTOMATIC INDEXING PROCESS
• In the process:
  • Stem identification: word normalization, NLP
  • Short codes are used as identifiers
  • Thesaurus: rare stems are clustered
  • Phrases: frequent stems are combined into less frequent phrases
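The first stages of the process (recognize strings, delete stopwords, identify stems, replace stems by identifiers, count postings) can be sketched as below. Everything named here is an assumption for illustration: the stopword list is tiny and hypothetical, `toy_stem` is a crude suffix stripper rather than a real stemming algorithm, and integer identifiers stand in for the short codes. The thesaurus, phrase, and weighting stages are left out.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}  # hypothetical, tiny list
SUFFIXES = ("ing", "ed", "es", "s")  # toy stemmer, not a real stemming algorithm

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_document(text, stem_ids):
    """Return {stem identifier: count} for one document, extending stem_ids in place."""
    words = re.findall(r"[a-z]+", text.lower())         # recognize strings
    words = [w for w in words if w not in STOPWORDS]    # delete stopwords
    stems = [toy_stem(w) for w in words]                # identify stems
    # Replace stems by identifiers: a new stem gets the next free integer.
    ids = [stem_ids.setdefault(s, len(stem_ids)) for s in stems]
    return Counter(ids)                                 # count postings
```

Passing the same `stem_ids` dictionary to every call keeps the identifiers consistent across the whole collection.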

• Nowadays, medium-size databases (200 MB) combine online and indexed searching
• 3 main indexing techniques:
  • Inverted files: best choice for most applications
  • Suffix trees and arrays: faster for phrase searching but harder to build and maintain
  • Signature files: popular in the 1980s but outperformed by inverted files
• Will concentrate on inverted files only

INVERTED FILE
• Inverted file = inverted index = word-oriented mechanism for indexing a text collection in order to speed up the searching task
• Composed of 2 elements: vocabulary and occurrences
  • Vocabulary = set of all different words in the text
  • For each word, a list of all the text positions where the word appears is stored
  • Occurrences = the set of all those lists

Example
• A sample text and an inverted index built on it
• The words are converted to lower case and some are not indexed
• The occurrences point to character positions in the text
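Since the slide's figure is not reproduced here, the sketch below builds such an index on an assumed sample sentence. The sentence, the stopword set, and the choice of 1-based character positions are all assumptions made for illustration, and `build_inverted_index` is a hypothetical name.

```python
import re

STOPWORDS = {"this", "is", "a", "has", "are", "from"}  # assumed set of non-indexed words

def build_inverted_index(text):
    """Map each indexed word to the 1-based character positions where it starts."""
    index = {}
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()       # words are converted to lower case
        if word in STOPWORDS:              # some words are not indexed
            continue
        index.setdefault(word, []).append(match.start() + 1)
    return index

sample = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(sample)
# index["text"] is [11, 19] and index["words"] is [33, 40]
```

The keys of `index` form the vocabulary, and the position lists together form the occurrences.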

INVERTED FILE
• Positions can refer to words or characters
  • Word positions (e.g. position i refers to the i-th word) simplify phrase and proximity queries
  • Character positions (e.g. position i refers to the i-th character) facilitate direct access to matching text positions
• Space required for the vocabulary is small: e.g. for 1 GB of the TREC-2 collection the vocabulary has a size of 5 MB, which can be further reduced by stemming and other techniques
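To make the two kinds of position concrete, the following sketch records both for every word of an assumed sample sentence. The function name `word_and_char_positions` and the 1-based numbering are assumptions, and no stopwords are removed so that word positions stay consecutive.

```python
import re

def word_and_char_positions(text):
    """Return {word: [(word_position, char_position), ...]}, both 1-based."""
    index = {}
    for word_pos, match in enumerate(re.finditer(r"[A-Za-z]+", text), start=1):
        entry = (word_pos, match.start() + 1)
        index.setdefault(match.group().lower(), []).append(entry)
    return index

text = "A text has many words. Words are made from letters."
idx = word_and_char_positions(text)

# Word positions: "many" is word 4 and "words" is word 5, so they are adjacent.
adjacent = idx["words"][0][0] - idx["many"][0][0] == 1

# Character positions: slicing at the stored offset gives the match directly.
char_pos = idx["made"][0][1]
snippet = text[char_pos - 1 : char_pos - 1 + len("made")]
```

With word positions, adjacency is a simple difference of 1; with character positions, the matching text is reached by slicing without rescanning.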

INVERTED FILE
• Occurrences require more space: each word in the text is referenced once in the structure
• Building an inverted index from the sample text
  • Refer to the attached Word document

Searching on an inverted file
• Done via 3 basic steps:
  • Vocabulary search: the words and patterns present in the query are isolated and searched in the vocabulary
  • Retrieval of occurrences: the lists of occurrences of all the words found are retrieved
  • Manipulation of occurrences: the occurrences are processed to solve phrase, proximity, or Boolean operations
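The three steps can be sketched for a phrase query as follows. This is one possible arrangement under stated assumptions: the vocabulary is kept as a sorted list and searched by binary search, the occurrences hold 1-based word positions, and the names `lookup` and `phrase_search` as well as the small index are hypothetical.

```python
from bisect import bisect_left

def lookup(vocabulary, word):
    """Step 1, vocabulary search: binary search in a sorted word list."""
    i = bisect_left(vocabulary, word)
    return i if i < len(vocabulary) and vocabulary[i] == word else -1

def phrase_search(vocabulary, occurrences, phrase):
    """Return the word positions where the whole phrase starts."""
    lists = []
    for word in phrase.lower().split():
        i = lookup(vocabulary, word)
        if i == -1:
            return []                    # a query word is absent from the vocabulary
        lists.append(occurrences[i])     # step 2, retrieval of occurrences
    if not lists:
        return []
    # Step 3, manipulation of occurrences: the k-th following query word
    # must appear exactly k positions after the first one.
    later = [set(positions) for positions in lists[1:]]
    return [p for p in lists[0]
            if all(p + k in s for k, s in enumerate(later, start=1))]

# Hypothetical index with 1-based word positions for the text
# "a text has many words words are made from letters".
vocabulary = ["letters", "made", "many", "text", "words"]
occurrences = [[10], [8], [4], [2], [5, 6]]
```

For example, `phrase_search(vocabulary, occurrences, "many words")` returns `[4]`. Proximity and Boolean queries would change only step 3, replacing the exact-offset test with a distance bound or with set operations.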

TRIES
• Tries, or digital search trees, are multiway trees that store sets of strings. Every edge of the tree is labelled with a letter. To search for a string in a trie, one starts at the root and scans the string character by character, descending by the appropriate edge of the trie. This continues until a leaf is found (the string is present) or no appropriate edge exists (the string is not in the set).
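A minimal trie sketch using nested dictionaries is shown below. The class and method names are hypothetical, and the `"$"` end-of-string marker is an assumption (it presumes stored strings never contain `"$"`); it lets the trie tell a stored string apart from a mere prefix of one.

```python
class Trie:
    """Minimal trie over a set of strings; every edge is labelled with one letter."""

    END = "$"  # assumed end-of-string marker; stored strings must not contain it

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for letter in word:
            # Follow the edge for this letter, creating it if it does not exist.
            node = node.setdefault(letter, {})
        node[self.END] = True

    def contains(self, word):
        node = self.root
        for letter in word:
            if letter not in node:   # no edge to follow: the string is not stored
                return False
            node = node[letter]
        return self.END in node


trie = Trie()
for w in ["letters", "made", "many", "text", "words"]:
    trie.insert(w)
```

Here `trie.contains("many")` is true, while `trie.contains("man")` is false because no stored string ends at that node. The search cost depends on the length of the query string, not on the number of strings stored.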
