
Thanks to Bill Arms, Marti Hearst

This article explores the iterative process of information retrieval, with a focus on document collections. It discusses the importance of documents, index terms, and repositories in the field of IR. The article also covers the different types of text documents and their properties.

bradyn

Presentation Transcript


  1. Documents Thanks to Bill Arms, Marti Hearst

  2. Last time • Big O: growth of work • Size of information: continues to grow • IR is an old field, going back to the 1940s • The search engine is the most popular information retrieval model • New ones are still being built

  3. Focus on documents • A document is what we: • Crawl (harvest) • Index • Retrieve with a query • Evaluate • Rank • IR is an iterative process

  4. IR is an Iterative Process • Assume the search engine has been built and is working • [Diagram: Repositories, Goals, Workspace]

  5. [Diagram: User’s Information Need, text input, Parse, Query]

  6. [Diagram: Collections, Pre-process, Index]

  7. [Diagram: User’s Information Need, text input, Parse, Query; Collections, Pre-process, Index; Rank or Match]

  8. [Diagram: User’s Information Need, text input, Parse, Query; Collections, Pre-process, Index; Rank or Match; Evaluation; Query Reformulation]
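
The stages named on slides 5 to 8 can be sketched in a few lines of Python. This is only an illustration, not the lecture's implementation: the function names, the whitespace pre-processing, the three-document collection, and the "count matching query terms" scoring rule are all assumptions chosen to keep the example small.

```python
from collections import defaultdict


def preprocess(text):
    """Lowercase and split on whitespace (a deliberately crude, assumed step)."""
    return text.lower().split()


def build_index(collection):
    """Pre-process a collection into an inverted index: term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index


def rank_or_match(index, query):
    """Parse the query, then score each document by distinct query terms matched."""
    scores = defaultdict(int)
    for term in set(preprocess(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy collection, made up for this example.
collection = {
    "d1": "information retrieval is an iterative process",
    "d2": "a search engine builds an index over a collection",
    "d3": "users reformulate a query after evaluation",
}
index = build_index(collection)

# First attempt: the user evaluates the results and finds them too scattered.
first_try = rank_or_match(index, "retrieval index")

# Query reformulation: a second attempt with different terms.
second_try = rank_or_match(index, "search engine index")

print(first_try)
print(second_try)
```

The two calls to `rank_or_match` stand in for one turn of the loop: the index is built once, while the query is parsed, matched, evaluated by the user, and reformulated as often as needed.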

  9. Definitions: collections consist of documents • Document • The basic unit that we automatically index • Usually a body of text, which is a sequence of terms • Has to be digital • Tokens or terms • Basic units of a document, usually consisting of text • A semantic word or phrase, a number, a date, etc. • Collection, repository, or corpus • A particular collection of documents • Sometimes called a database • Query • A request for documents on a topic
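
One way to make these definitions concrete is to write them down as data types. The sketch below is an assumption-laden illustration: the class names mirror the slide's vocabulary, but the fields (`doc_id`, `text`, `name`) and the whitespace-based `tokens` method are hypothetical choices, not something the lecture specifies.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """The basic unit that gets indexed: here, an id plus a body of text."""
    doc_id: str
    text: str

    def tokens(self):
        # Tokens (terms) are the basic units of a document.
        # Splitting on whitespace is a simplifying assumption.
        return self.text.split()


@dataclass
class Collection:
    """A collection (repository, corpus) is a particular set of documents."""
    name: str
    documents: list = field(default_factory=list)

    def vocabulary(self):
        # Every distinct token that appears anywhere in the collection.
        return {token for doc in self.documents for token in doc.tokens()}


@dataclass
class Query:
    """A request for documents on a topic."""
    text: str


# Hypothetical example data.
corpus = Collection(
    "demo",
    [
        Document("d1", "IR dates to the 1940s"),
        Document("d2", "IR is iterative"),
    ],
)
query = Query("history of IR")

print(corpus.documents[0].tokens())
print(sorted(corpus.vocabulary()))
```

Note that a token need not be a word: in this toy data `1940s` is a date-like token, which matches the slide's "numbers, dates, etc."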

  10. Document Collections • Many on the web • From the text Search Engines: IR in Practice • Document collections • Collections • Corpus collections at UW • Some are searchable but cost money to download

  11. Collection vs documents vs terms • [Diagram: Collection, Document, Terms or tokens]

  12. What is a Document? • A document is a digital object with an operational definition • Indexable (usually digital) • Can be queried and retrieved. • Many types of documents • Text or part of text • Web page • Image • Audio • Video • Data • Email • Etc.

  13. Text Documents A digital text document consists of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms. A textual document can be: • Free text, also known as unstructured text, which is a continuous sequence of tokens. • Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup. • Mixed text, a combination of the above. Examples?
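
As a possible answer to "Examples?", the sketch below tokenizes one free-text sentence and one email-like record, where the headers act as fields and the body is free text (so the record as a whole is mixed text). The regular expression, the sample sentence, and the email fields are assumptions made up for illustration; real tokenizers handle many more cases.

```python
import re

# Assumed tokenization rule: a run of word characters is one token,
# and any single non-space symbol (such as punctuation) is its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and symbol tokens."""
    return TOKEN_PATTERN.findall(text)


# Free (unstructured) text: one continuous sequence of tokens.
free_text = "Documents are the atoms of IR."
free_tokens = tokenize(free_text)

# Fielded (structured) text: sections distinguished by field names.
# The hypothetical email below has fielded headers and a free-text body.
email = {
    "from": "alice@example.org",
    "subject": "Lecture notes",
    "body": "Slides attached, see you Monday.",
}
fielded_tokens = {name: tokenize(value) for name, value in email.items()}

print(free_tokens)
for name, tokens in fielded_tokens.items():
    print(name, tokens)
```

Keeping the tokens grouped by field means a later index could distinguish a term that appears in the subject from the same term in the body, which free text alone cannot express.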

  14. Why the focus on text? • Language is the most powerful query model • Language can be treated as text • Text has many interesting properties • Others?

  15. What we covered • Documents are the atoms of IR • Index terms or tokens in documents • Terms or tokens will be text • Interested in collections of documents • Repository • Corpus • Document collection
