This session provides an overview of knowledge and information retrieval, indexing, and relevance feedback. Understand the different data types, indexing methods, and techniques like stemming and lemmatization. Learn about natural language processing, relevance feedback, query expansion, and information needs. Discover the impact of explosive online data on predicting events like political elections. Explore structured, semi-structured, and unstructured data, as well as the rise of big data sources. Analyze how big data influences information retrieval systems. Gain insights into IR techniques, data indexing, and building search engines for effective information access.
Knowledge and Information Retrieval Session 4 Introduction to Knowledge and Information Retrieval and Relevance Feedback 2019
Agenda • Definition of Knowledge and Information Retrieval • Types of data and indexing • Dictionary • Stemming / lemmatization • Natural language processing • Relevance feedback and query expansion
Knowledge and Information Retrieval begins with a user's information need.
Information Need [Figure: conditional frequency distribution plot showing the number of female and male names ending with each letter of the alphabet; most names ending in a, e, or i are female; names ending in h and l are equally likely to be male or female; names ending in k, o, r, s, and t are likely to be male.]
Information Need The explosive amount of data available online, the result of significant social media usage, can be used as a data source to predict political election results. • Widodo Budiharto and Meiliana, Prediction and analysis of Indonesia Presidential election from Twitter using sentiment analysis, Journal of Big Data, 2018.
Case: Intelligent Chatbot (cont’d) TF-IDF is a weighting scheme that assigns each term in a document a weight based on its term frequency (TF) and inverse document frequency (IDF); it can be used in a chatbot. The term frequency of a term t in a document d is simply the number of occurrences of t in d.
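As a sketch of how these weights could be computed (the toy corpus and function names below are illustrative, not from the session):

```python
import math

# Toy corpus; in a chatbot these would be candidate responses (hypothetical data).
docs = [
    "how do I reset my password",
    "password reset link sent to your email",
    "contact support for billing questions",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc_tokens):
    # Raw term frequency: number of occurrences of the term in the document.
    return doc_tokens.count(term)

def idf(term):
    # Inverse document frequency: log(N / df), where df is the number of
    # documents in the collection that contain the term.
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(N / df) if df else 0.0

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)
```

A term such as "support", which occurs in only one of the three documents, gets a higher IDF than "password", which occurs in two; this is what lets the chatbot rank rarer, more discriminative terms higher.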
Types of Data • Structured data: a defined data type, such as OLAP cubes, RDBMS tables, CSV files, and spreadsheets • Semi-structured data: able to be parsed, such as XML • Quasi-structured data: needs effort to be processed, such as hyperlinks • Unstructured data: has no inherent structure, such as PDF files, images, and video
Data Evolution and Rise of Big Data Sources Huge volume of data and speed of new data creation and growth
Big Data and IR Systems Big Data is a broad term for data sets so large and complex that traditional data processing applications are inadequate. Big data can come in multiple forms (including structured and unstructured data), such as financial data, text files, multimedia files, PDF files, and genetic mappings. Hadoop (a common set of tools) can perform massively parallel ingest and custom analysis of web traffic and of massive unstructured data feeds from multiple sources → information retrieval purposes
Search Engine Andri Mirzal (2012). Design and Implementation of a Simple Web Search Engine, International Journal of Multimedia and Ubiquitous Engineering, Vol. 7, No. 1.
Big Data In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
What is Information Retrieval Definition: Information Retrieval (IR) is the task of finding relevant texts within a large amount of unstructured data. Relevant ≡ texts matching some specific criteria. Examples of IR tasks: searching the Internet for an event that occurred on a given date, etc. Examples of IR systems: www search engines. NB: Database Management Systems (DBMS) are different from IR systems (data stored in a DB are structured!)
Sec. 1.1 Basic assumptions of Information Retrieval Collection: A set of documents • Assume it is a static collection for the moment Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
Conceptual Model An IR task is a 3-step process: The goal of information retrieval is to provide users with those documents that will satisfy their information need.
State of the Art An IR system looks for data matching some criteria defined by the users in their queries. The language used to ask a question is called the query language. The basic unit of data is a document (a file, an article, a paragraph, etc.). A document corresponds to free text (may be unstructured).
Size of information These queries use keywords. All the documents are gathered into a collection (or corpus). Example: 1 million documents, each counting about 1000 words; if each word is encoded using 6 bytes: 10^6 × 1000 × 6 bytes = 6 × 10^9 bytes ≈ 6 GB
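A quick back-of-the-envelope check of this arithmetic:

```python
docs = 10**6            # 1 million documents
words_per_doc = 1000    # ~1000 words per document
bytes_per_word = 6      # each word encoded using 6 bytes

total_bytes = docs * words_per_doc * bytes_per_word
print(total_bytes)              # 6000000000 bytes = 6 x 10^9
print(total_bytes / 1024**3)    # about 5.6 GiB, i.e. roughly 6 GB
```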
Index How do we relate the user’s information need to some documents’ content? By using an index to refer to documents. Usually an index is a list of terms that appear in a document. The kind of index we use maps keywords to lists of documents; this is called an inverted index.
Inverted index The set of keywords is usually called the dictionary (or vocabulary). A document identifier appearing in the list associated with a keyword is called a posting. The list of document identifiers associated with a given keyword is called a posting list.
IR Technique A first IR technique: build an index and apply queries to this index. Example of input collection (Shakespeare’s plays): Doc1 I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me. Doc2 So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Loading documents Using concordance
Retrieval Models Retrieval models can be categorized as: Boolean retrieval model, vector space model, probabilistic model. The Boolean model of information retrieval is a classical IR model and is the first and most widely adopted one. It is used by virtually all commercial IR systems today.
Index construction Building an inverted index is called index construction. First we build the list of (keyword, docID) pairs:
Index construction (con’t) • Then the lists are sorted by keyword, and frequency information is added:
Index construction (con’t) Frequency distribution
Index construction (con’t) • Multiple occurrences of keywords are then merged to create a dictionary file and a postings file:
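The steps above (build the (keyword, docID) pairs, sort them, then merge duplicates into a dictionary file and a postings file) can be sketched as follows, using the two Shakespeare documents from the earlier slide; the variable names are illustrative:

```python
docs = {
    1: "I did enact Julius Caesar I was killed i the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: build the list of (keyword, docID) pairs.
pairs = [(term.lower(), doc_id)
         for doc_id, text in docs.items()
         for term in text.split()]

# Step 2: sort by keyword, then by docID.
pairs.sort()

# Step 3: merge multiple occurrences into a dictionary file
# (term -> document frequency) and a postings file
# (term -> sorted posting list of docIDs).
dictionary, postings = {}, {}
for term, doc_id in pairs:
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:   # skip duplicate (term, docID) pairs
        plist.append(doc_id)
    dictionary[term] = len(plist)

print(postings["caesar"])    # caesar appears in both documents
print(dictionary["killed"])  # killed appears only in Doc1
```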
Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is that not the answer? • Slow (for large corpora) • NOT Calpurnia is non-trivial • Other operations (e.g., find the word Romans near countrymen) are not feasible • Ranked retrieval (best documents to return) • Later lectures
Sec. 1.1 Term-document matrices 1 if play contains word, 0 otherwise Brutus AND Caesar BUT NOT Calpurnia
Sec. 1.1 Incidence vectors So we have a 0/1 vector for each term. To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) bitwise AND. • 110100 AND • 110111 AND • 101111 = • 100100
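The bitwise AND above can be reproduced directly by treating each six-play incidence vector as a bit string (the vectors are the ones shown on the slide):

```python
# Incidence vectors over six plays, leftmost bit = first play.
brutus    = int("110100", 2)
caesar    = int("110111", 2)
calpurnia = int("010000", 2)

# Brutus AND Caesar AND NOT Calpurnia:
# complement Calpurnia (keeping only six bits), then bitwise AND.
mask = (1 << 6) - 1
result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))   # 100100
```

The answer 100100 says the query is satisfied by the first and fourth plays.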
Sec. 1.1 Answers to query Antony and Cleopatra, Act III, Scene ii Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Sec. 4.2 Reuters RCV1 Shakespeare’s collected works definitely aren’t large enough for demonstrating many of the points in this course. The collection we’ll use isn’t really large enough either, but it’s publicly available and is at least a more plausible example. As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection. This is one year of Reuters newswire (part of 1995 and 1996)
Sec. 4.2 A Reuters RCV1 document
Initial stages of text processing Tokenization • Cut character sequence into word tokens • Deal with “John’s”, a state-of-the-art solution Normalization • Map text and query term to same form • You want U.S.A. and USA to match Stemming • We may wish different forms of a root to match • authorize, authorization Stop words • We may omit very common words (or not) • the, a, to, of
Sec. 1.2 Where do we pay in storage? • Terms and counts • Pointers • Lists of docIDs IR system implementation: • How do we index efficiently? • How much storage do we need?
Sec. 1.3 The index we just built Our focus: how do we process a query? • Later: what kinds of queries can we process?
Sec. 2.4.1 A first attempt: Biword indexes Index every consecutive pair of terms in the text as a phrase For example the text “Friends, Romans, Countrymen” would generate the biwords • friends romans • romans countrymen Each of these biwords is now a dictionary term Two-word phrase query-processing is now immediate.
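A minimal sketch of biword generation for the example above (the normalization policy, stripping punctuation and lowercasing, is an assumption):

```python
def biwords(text):
    # Normalize terms, then index every consecutive pair as a phrase.
    terms = [t.strip(",.").lower() for t in text.split()]
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```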
Sec. 2.4.1 Longer phrase queries Longer phrases can be processed by breaking them down stanford university palo alto can be broken into the Boolean query on biwords: stanford university AND university palo AND palo alto
Sec. 2.2.1 Tokenization Input: “Friends, Romans and Countrymen” Output: Tokens • Friends • Romans • Countrymen A token is an instance of a sequence of characters Each such token is now a candidate for an index entry, after further processing
Sec. 2.2.1 Tokenization Issues in tokenization: • Finland’s capital Finland AND s? Finlands? Finland’s? • Hewlett-Packard Hewlett and Packard as two tokens? • state-of-the-art: break up hyphenated sequence. • co-education • lowercase, lower-case, lower case ? • It can be effective to get the user to put in possible hyphens • San Francisco: one token or two? • How do you decide it is one token?
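One possible policy for the apostrophe and hyphen cases above, as a sketch (real systems choose different policies per language and application; this one keeps the hyphenated form and also indexes its parts):

```python
import re

def tokenize(text):
    # Match letter runs, allowing internal apostrophes and hyphens
    # (so "Hewlett-Packard's" stays one raw token).
    raw = re.findall(r"[A-Za-z]+(?:['\-][A-Za-z]+)*", text)
    tokens = []
    for tok in raw:
        tokens.append(tok.lower())
        if "-" in tok:
            # Also break up the hyphenated sequence into its parts.
            tokens.extend(part.lower() for part in tok.split("-"))
    return tokens

print(tokenize("Hewlett-Packard's state-of-the-art lab"))
```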
Sec. 2.2.1 Tokenization: language issues French • L'ensemble one token or two? • L ? L’? Le ? • Want l’ensemble to match with un ensemble • Until at least 2003, it didn’t on Google • Internationalization!
Sec. 2.2.1 Tokenization: language issues (Katakana, Hiragana, Kanji, Romaji) Chinese and Japanese have no spaces between words: • 莎拉波娃现在居住在美国东南部的佛罗里达。 • Not always guaranteed a unique tokenization Further complicated in Japanese, with multiple alphabets intermingled • Dates/amounts in multiple formats: フォーチュン500社は情報不足のため時間あた$500K(約6,000万円) End-user can express a query entirely in hiragana!
Sec. 2.2.1 Tokenization: language issues Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right. Words are separated, but letter forms within a word form complex ligatures. Example (the rendered sentence mixes right-to-left text with left-to-right numerals): ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’
Sec. 2.2.2 Stop words With a stop list, you exclude from the dictionary entirely the commonest words. Intuition: • They have little semantic content: the, a, and, to, be • There are a lot of them: ~30% of postings for top 30 words But the trend is away from doing this: • Good compression techniques (IIR 5) means the space for including stop words in a system is very small • Good query optimization techniques (IIR 7) mean you pay little at query time for including stop words.
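A minimal stop-list filter sketch (the stop list below is a tiny illustrative sample, not a standard list):

```python
# A tiny illustrative stop list; real systems use curated lists
# of the commonest words in the collection's language.
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "is", "it", "in"}

def remove_stop_words(tokens):
    # Exclude stop words from the token stream before indexing.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("to be or not to be".split()))   # ['or', 'not']
```

The example also shows the classic downside: the phrase query "to be or not to be" becomes unanswerable once its stop words are gone, which is one reason the trend is away from aggressive stopping.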
Sec. 2.2.3 Normalization to terms We may need to “normalize” words in indexed text as well as query words into the same form • We want to match U.S.A. and USA Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary We most commonly implicitly define equivalence classes of terms by, e.g., • deleting periods to form a term • U.S.A., USA → USA • deleting hyphens to form a term • anti-discriminatory, antidiscriminatory → antidiscriminatory
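These deletion-based equivalence classes can be sketched in a few lines (the case-folding step is an added assumption, not stated on the slide):

```python
def normalize(term):
    # Equivalence classing by deletion: drop periods and hyphens,
    # then case-fold, so all class members map to one dictionary entry.
    return term.replace(".", "").replace("-", "").lower()

assert normalize("U.S.A.") == normalize("USA") == "usa"
assert normalize("anti-discriminatory") == normalize("antidiscriminatory")
```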
Sec. 2.2.3 Normalization: other languages Accents: e.g., French résumé vs. resume. Umlauts: e.g., German Tuebingen vs. Tübingen • Should be equivalent Most important criterion: • How are your users likely to write their queries for these words? Even in languages that standardly have accents, users often may not type them • Often best to normalize to a de-accented term • Tuebingen, Tübingen, Tubingen → Tubingen
Sec. 2.2.4 Lemmatization Reduce inflectional/variant forms to base form E.g., • am, are, is → be • car, cars, car's, cars' → car • the boy's cars are different colors → the boy car be different color Lemmatization implies doing “proper” reduction to dictionary form
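A toy illustration of dictionary-based lemmatization (the lemma table is hand-written for this one sentence; real lemmatizers consult a full morphological dictionary such as WordNet):

```python
# Tiny hand-written lemma table standing in for a real morphological
# dictionary; words not in the table are returned unchanged.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
    "boy's": "boy", "colors": "color",
}

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

print(" ".join(lemmatize("the boy's cars are different colors".split())))
# the boy car be different color
```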
Sec. 2.2.4 Stemming Reduce terms to their “roots” before indexing. “Stemming” suggests crude affix chopping • language dependent • e.g., automate(s), automatic, automation all reduced to automat. Example: “for example compressed and compression are both accepted as equivalent to compress” stems to “for exampl compress and compress ar both accept as equival to compress”.
Sec. 2.2.4 Porter’s algorithm Commonest algorithm for stemming English • sses → ss • ies → i • ational → ate • tional → tion Weight-of-word-sensitive rules: (m>1) EMENT → (dropped) • replacement → replac • cement → cement
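The suffix rules listed above can be sketched as a single rewrite step. This is only a fragment of Porter's algorithm: the real algorithm runs several ordered rule phases and conditions some rules on the word measure m (as in the (m>1) EMENT rule), which is omitted here.

```python
def porter_step(word):
    # Four representative rewrite rules from Porter's algorithm;
    # the first matching suffix wins.
    for suffix, replacement in [("sses", "ss"), ("ies", "i"),
                                ("ational", "ate"), ("tional", "tion")]:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(porter_step("caresses"))    # caress
print(porter_step("ponies"))      # poni
print(porter_step("relational"))  # relate
print(porter_step("conditional")) # condition
```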
Typical rules in Porter (cont’d): implementation
Implementation Stemming and tokenization using a simple vocabulary
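A simple end-to-end sketch combining tokenization with a small suffix-stripping vocabulary (the suffix list and length threshold are illustrative and far cruder than full Porter):

```python
import re

# A small suffix vocabulary; each entry maps a suffix to its replacement.
SUFFIXES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", "")]

def stem(word):
    # Strip the first matching suffix, but only if enough of the
    # word remains (a crude stand-in for Porter's measure condition).
    for suffix, repl in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

def analyze(text):
    # Tokenize (lowercase letter runs only), then stem each token.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens]

print(analyze("Compressed files need compressing"))
# ['compress', 'files', 'need', 'compress']
```

Note how "compressed" and "compressing" both map to the stem "compress", so a query for either form retrieves documents containing the other.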