Information Retrieval

Information Retrieval Ugochukwu Chimbo EJIKEME

Structured vs. Unstructured Data • Corporate information not stored in the database • In general: • * The structure of the data itself. • * The structure of the container that hosts the data. • * The structure of the access method used to access the data.

Information Retrieval Systems (IRS) • Information-retrieval systems are used to store and query textual data such as documents. They use a simpler data model than do database systems. Traditional examples of information-retrieval systems are online library catalogs and online document-management systems such as those that store newspaper articles.

Characteristics of IRS • Documents are typically described by a set of keywords. • Information in the database is organized simply as a collection of unstructured documents. • Cares less about transactional requirements.

Relevance Ranking • Using Terms (Keywords) • Ranking Using TF-IDF • Similarity-Based Retrieval • Hyperlinks (WEB) • Popularity Ranking (prestige ranking) • PageRank • Combining TF-IDF and Popularity Ranking Measures

Ranking using TF-IDF • Term Frequency (TF) – Relevance of a document (d) to a term (t). • “Multiple Keyword” queries: sum the term frequencies over the n query terms, $\sum_{i=1}^{n} TF(d, t_i)$
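
The sum over query terms can be sketched in Python as follows. The slide does not say how TF(d, t) itself is computed; the log-scaled form used here is an assumption (one common choice), and the function names are hypothetical.

```python
import math
from collections import Counter


def term_frequency(document_terms, term):
    """TF(d, t): log-scaled share of the document's words that equal `term`.

    The log scaling is an assumption; the slides only name TF(d, t).
    """
    if not document_terms:
        return 0.0
    counts = Counter(document_terms)
    return math.log(1 + counts[term] / len(document_terms))


def tf_relevance(document_terms, query_terms):
    """Relevance of a document to a multi-keyword query: sum of TF(d, t_i)."""
    return sum(term_frequency(document_terms, t) for t in query_terms)


doc = "information retrieval systems store and query textual information".split()
print(tf_relevance(doc, ["information", "retrieval"]))
```

A term that occurs more often in the document contributes more to the sum, and a term that does not occur contributes nothing.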

Inverse document frequency (IDF) • Query: “Facebook Ugo”? • Relevance therefore: • Proximity? The closer the query words are to each other in the document, the higher the rank.
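
A minimal sketch of why IDF matters for a query like “Facebook Ugo”: a term that appears in almost every document should count for less than a rare one. The weighting used here, dividing TF(d, t) by the number of documents containing t, is an assumption (one common formulation), as are the toy corpus and the function names.

```python
import math
from collections import Counter


def tf(doc_terms, term):
    """Log-scaled term frequency (an assumed definition of TF)."""
    if not doc_terms:
        return 0.0
    return math.log(1 + Counter(doc_terms)[term] / len(doc_terms))


def score(doc_terms, query_terms, corpus, use_idf=True):
    """Sum TF over query terms, optionally divided by document frequency n(t)."""
    total = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)  # documents containing t
        if n_t == 0:
            continue
        total += tf(doc_terms, t) / n_t if use_idf else tf(doc_terms, t)
    return total


def rank(corpus, query_terms, use_idf=True):
    """Return document indices ordered from most to least relevant."""
    return sorted(
        range(len(corpus)),
        key=lambda i: score(corpus[i], query_terms, corpus, use_idf),
        reverse=True,
    )


corpus = [
    "facebook facebook facebook".split(),
    "facebook ugo profile".split(),
    "facebook login help".split(),
]
query = ["facebook", "ugo"]
print(rank(corpus, query, use_idf=False))  # TF only
print(rank(corpus, query, use_idf=True))   # TF weighted by IDF
```

In this toy corpus, TF alone favors the document that merely repeats the common word, while the IDF-weighted score favors the only document that contains the rare query term.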

Similarity-Based Retrieval • Retrieve documents similar to a given document. • Similarity may be defined on the basis of common terms. • Cosine similarity metric • Relevance feedback – start a new search based on user feedback on a prior search.
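
The cosine similarity metric can be sketched as below. Each document is treated as a vector of term weights; raw term counts are used here as the weights, which is a simplifying assumption (TF-IDF weights are a common alternative).

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    vec_a, vec_b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


a = "ranking documents by relevance".split()
b = "ranking web documents".split()
c = "online library catalog".split()
print(cosine_similarity(a, b), cosine_similarity(a, c))
```

Documents that share many terms score close to 1, and documents with no terms in common score 0.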

Hyperlink • Popularity Ranking • Rank “popular” documents higher among the set of documents with specific keywords. • Determining “Popularity” • Access rate? • How to get accurate data? • Bookmarks? • Might be private? • Links to related pages? • Using a web crawler to analyze external links.

Transfer of Prestige • A link from a popular page x to a page y is treated as conferring more prestige on page y than a link from a not-so-popular page z.

PageRank • A measure of popularity of a page based on the popularity of pages that link to the page. • Understanding PageRank. • Random walk model: • The PageRank of a page is the probability that a random walker is visiting a page at any given point in time. • Drawback: • does not take query keywords into account.
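
The random walk model can be approximated by repeatedly redistributing rank along links until the values settle. This is a minimal sketch, not a production implementation: the damping factor of 0.85 (the chance the walker follows a link rather than jumping to a random page), the fixed iteration count, and the toy link graph are all assumptions, and every link target is assumed to also appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a graph given as {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


links = {"x": ["y"], "y": ["x"], "z": ["x", "y"]}
ranks = pagerank(links)
print(ranks)
```

Page z has no incoming links, so it only ever receives the random-jump share and ends up with the lowest rank.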

Other Measures of Popularity • Click fraction • The search engine provides an indirect link through the search engine site, which records the page click and transparently redirects the browser to the original link. • Anchor text + PageRank • Anchor text + PageRank + TF–IDF measures

The HITS algorithm: • compute popularity using set of related pages only. • Hubs and Authorities • Hub - A page that stores links to many related pages (may not in itself contain actual information on a topic) • Authority - A page that contains actual information on a topic (may not store links to many related pages). • Each page gets a prestige value as a hub (hub-prestige), and another prestige value as an authority (authority-prestige).
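
The mutual definition of hub-prestige and authority-prestige can be sketched as an iterative update: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The normalization step, the iteration count, and the page names below are assumptions made for illustration.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a graph {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority-prestige: sum of hub scores of the pages that link here.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub-prestige: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


links = {"directory": ["paper_a", "paper_b"], "blog": ["paper_a"]}
hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

Here "directory" links to the most authoritative pages and becomes the strongest hub, while "paper_a" is linked from both hubs and becomes the strongest authority.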

Search Engine Spamming • Practice of creating Web pages, or sets of Web pages, designed to get a high relevance rank for some queries, even though the sites are not actually popular sites.

Synonyms, Homonyms, and Ontologies • Synonyms • Define alternative words for keywords • E.g. Class room <==> (Class or Lecture) room • Homonyms • Single words with multiple meanings • Concept-based querying • Analyze each document to disambiguate each word in the document, and replace it with the concept that it represents; disambiguation is usually done by looking at other surrounding words in the document.
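
The synonym example above, (Class or Lecture) room, can be sketched as query expansion: each keyword becomes a set of acceptable alternatives, and a document matches if it contains at least one word from every set. The synonym table and function names are hypothetical.

```python
# Hypothetical synonym table; a real system would use a curated thesaurus.
SYNONYMS = {"class": {"lecture"}}


def expand_query(keywords, synonyms=SYNONYMS):
    """Turn each keyword into the set of the keyword plus its synonyms."""
    return [{k} | synonyms.get(k, set()) for k in keywords]


def matches(doc_terms, expanded_query):
    """True if the document contains at least one alternative per keyword."""
    terms = set(doc_terms)
    return all(terms & alternatives for alternatives in expanded_query)


expanded = expand_query(["class", "room"])
print(matches("the lecture room is full".split(), expanded))
print(matches("the board room is full".split(), expanded))
```

Expansion of this kind helps with synonyms only; homonyms need the disambiguation step described above.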

Ontologies are hierarchical structures that reflect relationships between concepts. • Common relationships include: is-a, part-of, etc.

Indexing of Documents • Inverted index • Maps each keyword Ki to a list Si of the documents that contain Ki. • Example entries for one keyword: • * Document 1 (d1): 56, 89, 201 • * Document 2 (d2): 12, 18, 19 • * Document 3 (d3): 5 • Inverted Index = “d1/56,89,201; d2/12,18,19; d3/5” • *May also include Term Frequency in documents.
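
A minimal sketch of building and querying an inverted index. The numbers stored per document are taken here to be word positions, which is an assumption about the slide's example; the sample documents and function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword to {doc_id: [word positions where it occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def search_all(index, keywords):
    """Return the ids of documents that contain every keyword."""
    doc_sets = [set(index.get(k.lower(), {})) for k in keywords]
    return set.intersection(*doc_sets) if doc_sets else set()


documents = {
    "d1": "database systems store data",
    "d2": "retrieval systems rank documents",
    "d3": "ranking documents",
}
index = build_inverted_index(documents)
print(dict(index["systems"]))
print(search_all(index, ["systems", "documents"]))
```

A multi-keyword query becomes an intersection of document lists, and the length of each position list gives the term's count in that document.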

Measuring Retrieval Effectiveness • Keywords are maintained in a compressed form (to keep space usage of the index low). • The index is sometimes stored such that retrieval is approximate: a few relevant documents may not be retrieved (called a false drop or false negative), or a few irrelevant documents may be retrieved (called a false positive).

Measurement metrics • Precision • Measures the percentage of retrieved documents that are relevant to a given query. • Recall • Measures the percentage of the documents relevant to the query that were retrieved.
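
Both metrics can be computed from two sets: the documents a query retrieved and the documents that are actually relevant. The sketch below returns fractions rather than percentages, and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0.0 and 1.0."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = retrieved & relevant  # relevant documents that were retrieved
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

In this example d3 and d4 are false positives, which lower precision, and d5 is a false negative, which lowers recall.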

Beyond Page Ranking • Information Extraction • Convert information from textual form to a more structured form. • Sample application: Google Scholar. • Question Answering • The system attempts to provide direct answers to questions posed by users.

Summary • Information-retrieval systems are used to store and query textual data such as documents. • Queries attempt to locate documents that are of interest by specifying, for example, sets of keywords. • Relevance ranking makes use of several types of information, such as: • ◦ Term frequency: how important each term is to each document. • ◦ Inverse document frequency. • ◦ Popularity ranking.

Search engine spamming attempts to get (an undeserved) high ranking for a page. • Synonyms and homonyms complicate the task of information retrieval. Concept-based querying aims at finding documents containing specified concepts, regardless of the exact words (or language) in which the concept is specified. Ontologies are used to relate concepts using relationships such as is-a or part-of. • Inverted indices are used to answer keyword queries. • Precision and recall are two measures of the effectiveness of an information retrieval system. • Techniques have been developed to extract structured information from textual data and to give direct answers to simple questions posed in natural language.
