
WMES3103 : INFORMATION RETRIEVAL

Learn about the indexing process, key concepts such as tf*idf, automatic indexing methods, and the use of inverted files for efficient text retrieval. Covers the vocabulary, the occurrences, and the process of searching on an inverted file.

nbruce

Presentation Transcript


  1. WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING

  2. INTRODUCTION • A basic query can be answered in two ways: • Scanning the text sequentially (sequential or online searching): finding the occurrences of a pattern in a text that has not been preprocessed • Good when the text is small, the text collection is volatile (modified frequently), or no space is available for an index • Building data structures over the text (indexes) to speed up the search • Worth building and maintaining when the text collection is large and semi-static (updated at reasonably regular intervals)
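As an illustration of the first option, here is a minimal Python sketch of sequential (online) searching: the text is scanned directly, with no index built beforehand. The function name `sequential_search` and the sample string are assumptions made for this example, not part of the slides.

```python
def sequential_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based character offsets of every occurrence of pattern."""
    positions: list[int] = []
    if not pattern:
        return positions  # an empty pattern has no meaningful occurrences
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping occurrences are also reported.
        start = text.find(pattern, start + 1)
    return positions


sample = "this is a text. a text has many words."
print(sequential_search(sample, "text"))  # [10, 18]
```

Every query rescans the whole text, which is acceptable for small or frequently changing collections and is the cost that an index is meant to avoid.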

  3. INDEXING • Key weight – frequency dependent; determines the ranking → best match • tf*idf weighting • tf: frequency of the key in a document • idf: the inverse of the number of documents containing the key
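The weighting above can be sketched in a few lines of Python. By default the sketch follows the slide literally (idf = 1 / number of documents containing the key). The `use_log` switch selects the widely used logarithmic variant, log(N / df), which the slide does not mention and which is included here only as an assumption. The function name `tf_idf` and the toy documents are also hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]], use_log: bool = False) -> list[dict[str, float]]:
    """Return one {key: weight} dictionary per document."""
    n_docs = len(docs)
    # df[key] = number of documents that contain the key at least once.
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf[key] = frequency of the key in this document
        if use_log:
            # Assumed variant: idf = log(N / df).
            w = {key: freq * math.log(n_docs / df[key]) for key, freq in tf.items()}
        else:
            # Slide definition: idf = 1 / df.
            w = {key: freq / df[key] for key, freq in tf.items()}
        weights.append(w)
    return weights


docs = [["index", "index", "text"], ["text", "search"], ["text", "trie"]]
weights = tf_idf(docs)
print(weights[0]["index"])  # 2.0 (tf = 2, df = 1)
```

In this toy collection "text" occurs in every document, so it receives a low weight, while "index" is frequent in one document and absent from the others, so it receives a high weight there.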

  4. AUTOMATIC INDEXING PROCESS • Text → Recognize string → Delete stopwords → Identify stems → Replace stems by identifiers → Count posting → Use thesaurus and phrases → Weight → Text representation

  5. AUTOMATIC INDEXING PROCESS • In the process: • Stem identification – word normalization, NLP • Short codes are used as identifiers • Thesaurus – rare stems are clustered • Phrases – frequent stems are combined into less frequent phrases
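The first steps of the automatic indexing process (recognize strings, delete stopwords, identify stems, replace stems by identifiers, count postings) can be sketched as below. This is a toy illustration only: the stopword list, the suffix rules in `stem`, and the names `index_text` and `stem_ids` are assumptions made for this example. The suffix stripping is not a real stemming algorithm, and the thesaurus, phrase, and weighting steps are left out.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "are", "of", "from", "has"}  # illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # toy rules, not a real stemmer


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_text(text: str, stem_ids: dict[str, int]) -> Counter:
    """Return {stem identifier: count} for one text, extending stem_ids as needed."""
    words = re.findall(r"[a-z]+", text.lower())           # recognize strings
    stems = [stem(w) for w in words if w not in STOPWORDS]  # delete stopwords, identify stems
    # Replace each stem by a short integer identifier, assigned on first sight.
    ids = [stem_ids.setdefault(s, len(stem_ids)) for s in stems]
    return Counter(ids)                                    # count postings


stem_ids: dict[str, int] = {}
counts = index_text("Words are made from letters. A text has many words.", stem_ids)
print(stem_ids)
print(counts)
```

The shared `stem_ids` dictionary plays the role of the identifier table, so indexing further texts with the same dictionary reuses the identifiers already assigned.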

  6. Nowadays, medium-size databases (200 MB) combine online and indexed searching • 3 main indexing techniques: • Inverted files – best choice for most applications • Suffix trees and arrays – faster for phrase searching but harder to build and maintain • Signature files – popular in the 1980s but outperformed by inverted files • We will concentrate on inverted files only

  7. INVERTED FILE • Inverted file = inverted index = word-oriented mechanism for indexing a text collection in order to speed up the searching task • Composed of 2 elements – vocabulary and occurrences • Vocabulary = set of all different words in the text • For each word, a list of all the text positions where the word appears is stored • Occurrences = the set of all those lists

  8. Example • A sample text and an inverted index built on it • The words are converted to lower case and some are not indexed • The occurrences point to character positions in the text

  9. INVERTED FILE • Positions can refer to words or characters • Word positions (e.g. position i refers to the i-th word) simplify phrase and proximity queries • Character positions (e.g. position i refers to the i-th character) facilitate direct access to the matching text positions • Space required for the vocabulary is small – e.g. the vocabulary for 1 GB of the TREC-2 collection occupies 5 MB – and can be further reduced by stemming and other techniques

  10. INVERTED FILE • Occurrences require more space – each word in the text is referenced once in the structure • Building an inverted index from the sample text • Refer to the attached Word document.
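The slides' own sample text and the attached document are not part of this transcript, so the following Python sketch uses a hypothetical sample text and stopword list to show how such an index could be built. Words are lower-cased, stopwords are skipped, and each occurrence is stored as a 0-based character position. The name `build_inverted_index` is an assumption made for this example.

```python
import re
from collections import defaultdict


def build_inverted_index(text: str, stopwords: frozenset = frozenset()) -> dict[str, list[int]]:
    """Map each indexed word to the list of character positions where it starts."""
    occurrences: dict[str, list[int]] = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()      # words are converted to lower case
        if word not in stopwords:         # some words are not indexed
            occurrences[word].append(match.start())
    return dict(occurrences)


# Hypothetical sample text and stopword list.
text = "This is a text. A text has many words. Words are made from letters."
stopwords = frozenset({"this", "is", "a", "has", "are", "from"})

index = build_inverted_index(text, stopwords)
for word in sorted(index):               # the sorted keys form the vocabulary
    print(word, index[word])
# letters [59]
# made [49]
# many [27]
# text [10, 18]
# words [32, 39]
```

The dictionary keys are the vocabulary and the lists are the occurrences. Every indexed word of the text contributes exactly one entry to some list, which is why the occurrences take far more space than the vocabulary.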

  11. Searching on an inverted file • Done in 3 basic steps: • Vocabulary search – the words and patterns present in the query are isolated and searched in the vocabulary • Retrieval of occurrences – the lists of occurrences of all the words found are retrieved • Manipulation of occurrences – the occurrences are processed to resolve phrase, proximity, or Boolean operations
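The three steps can be sketched for a phrase query as follows. This is one possible arrangement under stated assumptions, not the method prescribed by the slides: the vocabulary is assumed to be a sorted list searched by binary search, occurrences are assumed to be word positions, and the names `phrase_search`, `vocabulary`, and `occurrences` are hypothetical.

```python
from bisect import bisect_left


def phrase_search(vocabulary: list[str],
                  occurrences: dict[str, list[int]],
                  phrase: list[str]) -> list[int]:
    """Return the word positions at which the whole phrase starts."""
    if not phrase:
        return []

    # Step 1: vocabulary search (binary search in the sorted vocabulary).
    for word in phrase:
        i = bisect_left(vocabulary, word)
        if i == len(vocabulary) or vocabulary[i] != word:
            return []  # a query word is missing, so the phrase cannot occur

    # Step 2: retrieval of occurrences.
    lists = [occurrences[word] for word in phrase]

    # Step 3: manipulation of occurrences. The phrase starts at position p
    # if the k-th following word of the phrase occurs at position p + k.
    later = [set(positions) for positions in lists[1:]]
    return [p for p in lists[0]
            if all(p + k + 1 in s for k, s in enumerate(later))]


# Hypothetical occurrences, expressed as word positions.
occurrences = {"inverted": [0, 5], "file": [1, 9], "index": [6]}
vocabulary = sorted(occurrences)

print(phrase_search(vocabulary, occurrences, ["inverted", "file"]))   # [0]
print(phrase_search(vocabulary, occurrences, ["inverted", "index"]))  # [5]
```

Word positions make step 3 a simple offset check. With character positions the same check would need the word lengths and separators as well.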

  12. TRIES • Tries, or digital search trees, are multiway trees that store a set of strings. Every edge of the tree is labelled with a letter. To search for a string in a trie, one starts at the root and scans the string character by character, descending by the appropriate edge of the trie. This continues until a leaf is found.
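A minimal Python sketch of such a trie is given below. One detail is an assumption beyond the slide: each node carries an `is_word` flag, so that a stored string which is a prefix of another stored string can still be found, instead of relying only on reaching a leaf. The class and method names are hypothetical.

```python
class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}  # one outgoing edge per letter
        self.is_word = False                       # assumed end-of-string marker


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            # Follow the edge labelled ch, creating it if it does not exist yet.
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word: str) -> bool:
        node = self.root
        for ch in word:  # scan the string character by character
            node = node.children.get(ch)
            if node is None:
                return False  # no edge for this letter
        return node.is_word


trie = Trie()
for w in ["many", "made", "text"]:
    trie.insert(w)

print(trie.contains("made"))  # True
print(trie.contains("ma"))    # False: a shared prefix, not a stored string
```

A search costs time proportional to the length of the query string, independent of how many strings are stored, which is what makes a trie suitable for holding the vocabulary of an inverted file.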
