Chapter 2 Information Retrieval Part-1

Modern Information Retrieval • Document representation: using keywords; relative weight of keywords • Query representation: keywords; relative importance of keywords • Retrieval model: similarity between document and query; rank the documents • Performance evaluation of the retrieval process

Document Representation Transforming a text document to a weighted list of keywords

Stopwords Figure 2.2 A partial list of stopwords
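The first steps of turning a document into a list of keywords can be sketched as follows. This is a minimal illustration, not the procedure from the slides: the `STOPWORDS` set is a small made-up sample (the actual list is the one in Figure 2.2), and the function name `keyword_frequencies` is hypothetical.

```python
import re
from collections import Counter

# Assumed, illustrative stopword sample; the real list is the one in Figure 2.2.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def keyword_frequencies(text):
    """Lowercase the text, split it into words, drop stopwords, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

freqs = keyword_frequencies(
    "The retrieval of documents is the goal of retrieval systems."
)
print(freqs.most_common())
```

The raw counts produced here are only an intermediate result; stemming and weighting would normally follow.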

Activity: Document Representation Transform the text in the document given into a weighted list of keywords.

Stemming A given word may occur in a variety of syntactic forms • plurals • past tense • gerund forms (a noun derived from a verb) Example: the word connect may appear as • connector, connection, connections, connected, connecting, connects, preconnection, and postconnection.

Stemming A stem is what is left of a word after its affixes (prefixes and suffixes) are removed Suffixes • connector, connection, connections, connected, connecting, connects Prefixes • preconnection and postconnection Stem • connect

Porter’s Algorithm • Letters A, E, I, O, and U are vowels • A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y • The letter Y is a vowel if it is preceded by a consonant, otherwise it is a consonant • For example, Y in synopsis is a vowel, while in toy, it is a consonant • A consonant in the algorithm description is denoted by c, and a vowel by v
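The vowel and consonant rules above can be sketched in a few lines. The helper names `is_consonant` and `cv_pattern` are illustrative choices, not part of the algorithm's published description, and treating a word-initial Y as a consonant is an assumption drawn from the "otherwise" clause of the rule.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """Return True if the letter at position i counts as a consonant."""
    ch = word[i].lower()
    if ch in VOWELS:
        return False
    if ch == "y":
        # Y is a vowel only when the previous letter is a consonant;
        # at the start of a word there is no previous letter, so it is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_pattern(word):
    """Map each letter to 'c' (consonant) or 'v' (vowel)."""
    return "".join("c" if is_consonant(word, i) else "v" for i in range(len(word)))

print(cv_pattern("synopsis"))
print(cv_pattern("toy"))
```

In this sketch the Y in synopsis maps to v and the Y in toy maps to c, matching the example in the rules.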

Porter’s algorithm Step 1: plurals and past participles
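As a partial illustration of Step 1, the plural-handling rules of Porter's published algorithm (usually labelled Step 1a) can be sketched as below. This covers only plurals; the past-participle rules (-ed, -ing) are left out because they depend on further conditions not described here. The function name `step_1a` is an illustrative choice.

```python
def step_1a(word):
    """Apply the plural rules: SSES -> SS, IES -> I, SS -> SS, S -> (nothing)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats", "connections"]:
    print(w, "->", step_1a(w))
```

The rules are tried longest suffix first, so a word ending in "ss" is never shortened by the final single "s" rule.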

Porter’s algorithm Steps 2–4: straightforward stripping of suffixes

Porter’s algorithm Step 5: tidying up

Porter’s algorithm Suffix stripping of a vocabulary of 10,000 words (http://www.tartarus.org/~martin/)

For the Tutorial • Bring your laptop or use the lab • Make sure you have Java installed • Bring any English-language text document; the extension must be .txt • Length: no more than 1000 words

Document Representation

Term-Document Matrix • Term-document matrix (TDM) is a two-dimensional representation of a document collection. • Rows of the matrix represent various documents • Columns correspond to various index terms • Values in the matrix can be either the frequency or weight of the index term (identified by the column) in the document (identified by the row).
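A term-document matrix with documents as rows and index terms as columns could be built along these lines. The sample documents, the identifiers `d1` and `d2`, and the function name `build_tdm` are made up for illustration; the values stored are raw frequencies.

```python
def build_tdm(docs):
    """Return (terms, matrix): one row per document, one column per index term."""
    # Columns: every distinct term in the collection, in sorted order.
    terms = sorted({t for words in docs.values() for t in words})
    # Rows: frequency of each index term in the given document.
    matrix = {
        doc_id: [words.count(t) for t in terms]
        for doc_id, words in docs.items()
    }
    return terms, matrix

docs = {
    "d1": ["retrieval", "model", "retrieval"],
    "d2": ["query", "model"],
}
terms, tdm = build_tdm(docs)
print(terms)
for doc_id, row in tdm.items():
    print(doc_id, row)
```

Most entries in a realistic collection would be zero, since each document uses only a small share of the vocabulary.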

Term-Document matrix

Sparse Matrices: Triples

Sparse Matrices: Pairs
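The two sparse layouts can be sketched as follows. The slide contents are not reproduced in this transcript, so the interpretation is an assumption: "triples" is taken to mean one (document, term, value) entry per nonzero cell, and "pairs" to mean a list of (term, value) entries stored per document. The sample matrix and function names are made up.

```python
# Assumed sample data: documents as rows, index terms as columns.
terms = ["model", "query", "retrieval"]
tdm = {"d1": [1, 0, 2], "d2": [1, 1, 0]}

def to_triples(terms, tdm):
    """One (document, term, value) triple per nonzero cell."""
    return [
        (doc_id, term, value)
        for doc_id, row in tdm.items()
        for term, value in zip(terms, row)
        if value != 0
    ]

def to_pairs(terms, tdm):
    """Per document, a list of (term, value) pairs for the nonzero cells."""
    return {
        doc_id: [(term, value) for term, value in zip(terms, row) if value != 0]
        for doc_id, row in tdm.items()
    }

triples = to_triples(terms, tdm)
pairs = to_pairs(terms, tdm)
print(triples)
print(pairs)
```

Both layouts skip the zero cells; the pairs form also avoids repeating the document identifier in every entry.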

Normalization • Raw frequency values are not useful for a retrieval model • Prefer normalized weights, usually between 0 and 1, for each term in a document • A simple method of normalization is to divide every keyword frequency in a document by the largest frequency in that document
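Dividing by the largest frequency in the document can be sketched as below. The sample frequencies and the function name `normalize` are illustrative, and returning zeros for a document with no counts is an added assumption to avoid division by zero.

```python
def normalize(freqs):
    """Divide each keyword frequency by the largest frequency in the document."""
    largest = max(freqs.values(), default=0)
    if largest == 0:
        # Assumed convention for an empty or all-zero document.
        return {term: 0.0 for term in freqs}
    return {term: count / largest for term, count in freqs.items()}

weights = normalize({"retrieval": 4, "model": 2, "query": 1})
print(weights)
```

With this scheme the most frequent keyword in each document always receives weight 1, and every other weight falls between 0 and 1.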

Normalized Term-Document Matrix

Vector Representation of document d1 (word, frequency, normalized frequency)

Mini project (Survey): Arabic language stemmer design • Survey and compare existing Arabic language stemmers and write a research paper • Design an Arabic language stemmer Reading: Hints on writing technical reports and papers
