
Information Retrieval and Extraction -- Course Introduction


Presentation Transcript


  1. Information Retrieval and Extraction -- Course Introduction • Chia-Hui Chang, National Central University, chia@csie.ncu.edu.tw

  2. Information Retrieval (IR) • Problem Definition: a generic IR system selects and returns to the user desired documents from a large set of documents, in accordance with criteria specified by the user • Functions • Document search: the selection of documents from an existing collection of documents • Document routing: the dissemination of incoming documents to appropriate users on the basis of user interest profiles

  3. Document Detection: Search

  4. Search: Text Operation • Document Corpus • the content of the corpus may have a significant impact on performance in some applications • Preprocessing of the Document Corpus • stemming • stop word removal • phrases, multi-term items • ...
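A minimal preprocessing sketch of the text operations above (tokenization, stop-word removal, stemming); the tiny stop list and the suffix-stripping "stemmer" are simplified stand-ins for what a real system (e.g. a full stop list and the Porter stemmer) would use:

```python
import re

# Illustrative stop list and crude suffix-stripping "stemmer"; a real system
# would use a full stop list and a proper stemmer such as Porter's.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are"}

def crude_stem(word: str) -> str:
    """Strip a few common English suffixes (illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    """Tokenize, drop stop words, and stem the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Routing of incoming documents to interested users"))
# -> ['rout', 'incom', 'document', 'interest', 'user']
```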

  5. Search: Indexing • Building an Index from Stems • a key place for optimizing run-time performance • building the index for a large corpus is costly • Document Index • a list of terms, stems, phrases, etc. • frequency of terms in the document and the corpus • frequency of co-occurrence of terms within the corpus • the index may be as large as the original document corpus
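A sketch of how such an inverted index might be built, reusing the `preprocess` helper from the sketch above; the two-document corpus and its IDs are invented purely for illustration:

```python
from collections import Counter, defaultdict

def build_index(corpus: dict[str, str]) -> dict[str, dict[str, int]]:
    """Map each stem to the documents that contain it, with term frequencies."""
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for doc_id, text in corpus.items():
        for term, freq in Counter(preprocess(text)).items():
            index[term][doc_id] = freq
    return index

corpus = {
    "d1": "document search selects documents from a collection",
    "d2": "document routing disseminates incoming documents to users",
}
index = build_index(corpus)
print(index["document"])   # {'d1': 2, 'd2': 2}
```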

  6. Search: Query Operation • Detection Need • the user's criteria for a relevant document • Convert the detection need to a system-specific query • the need is first transformed into a detection query, and then into a retrieval query • detection query: specific to the retrieval engine, but independent of the corpus • retrieval query: specific to both the retrieval engine and the corpus
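One way to picture the two-step conversion, continuing the running sketch: the detection query is engine-specific but corpus-independent (here just a bag of stems), while the retrieval query additionally binds corpus statistics (here, IDF weights computed from the index built above):

```python
import math

def to_detection_query(need: str) -> list[str]:
    """Engine-specific, corpus-independent form of the user's need."""
    return preprocess(need)

def to_retrieval_query(detection_query, index, n_docs):
    """Corpus-specific form: each stem weighted by its inverse document frequency."""
    return {
        term: math.log(n_docs / len(index[term]))
        for term in detection_query
        if term in index
    }

dq = to_detection_query("routing incoming documents")   # ['rout', 'incom', 'document']
rq = to_retrieval_query(dq, index, len(corpus))
print(rq)   # {'rout': 0.693..., 'incom': 0.693..., 'document': 0.0}
```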

  7. Search: Query Model • Compare query with index • Rank the list of relevant documents • Return the top ‘N’ documents
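A sketch of this matching step, continuing the running example: documents are scored by summing tf * idf over the query terms, then the top N are returned:

```python
from collections import defaultdict

def rank(query_weights: dict[str, float], index, top_n: int = 10):
    """Score each document by summing tf * idf over query terms; return the top N."""
    scores: dict[str, float] = defaultdict(float)
    for term, idf in query_weights.items():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(rank(rq, index))   # [('d2', 1.386...), ('d1', 0.0)]
```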

  8. Routing

  9. Routing: Detection Needs • Profile of Multiple Detection Needs • A Profile is a group of individual Detection Needs that describes a user's areas of interest. • All Profiles will be compared to each incoming document (via the Profile index). • If a document matches a Profile, the user is notified of the existence of a relevant document.

  10. Routing: Query Index • Convert detection needs to system-specific queries • Building an Index from Queries • The index is system specific and makes use of all the preprocessing techniques employed by a particular detection system • Routing Profile Index • similar to building the corpus index for searching • the quantity of source data (Profiles) is usually much smaller than a document corpus • Profiles may have more specific, structured data in the form of SGML-tagged fields

  11. Routing: Document Preprocessing • Documents to be routed • A stream of incoming documents is handled one at a time to determine where each should be directed • A routing implementation may handle multiple document streams and multiple Profiles • Preprocessing of the Document • A document is preprocessed in the same manner that a query would be set up in a search • The document and query roles are reversed compared with the search process

  12. Routing: Ranking • Compare Document with Index • Identify which Profiles are relevant to the document • Given a document, which of the indexed Profiles match it? • Resultant List of Profiles • The list of Profiles identifies which users should receive the document
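A routing sketch that reuses the search machinery from the sketches above with the roles reversed: the Profiles are indexed once, and each incoming document is treated as the "query" against that Profile index. The user names and profile texts are invented for illustration:

```python
from collections import defaultdict

profiles = {
    "alice": "information retrieval search engines",
    "bob": "document routing and user interest profiles",
}
profile_index = build_index(profiles)   # same indexing step as for search

def route(document_text: str, profile_index, threshold: int = 1):
    """Return users whose Profiles share at least `threshold` stems with the document."""
    matches: dict[str, int] = defaultdict(int)
    for term in set(preprocess(document_text)):
        for user in profile_index.get(term, {}):
            matches[user] += 1
    return [user for user, hits in matches.items() if hits >= threshold]

print(route("A new paper on routing documents to interested users", profile_index))
# -> ['bob']
```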

  13. Summary • Generate a representation of the meaning or content of each object based on its description. • Generate a representation of the meaning of the information need. • Compare these two representations to select those objects that are most likely to match the information need.

  14. Basic Architecture of an Information Retrieval System • [Diagram: Documents feed into a Document Representation, Queries into a Query Representation, and the two representations meet in a Comparison step]

  15. Research Issues • Issue 1 • What makes a good document representation? • How can a representation be generated from a description of the document? • What are retrievable units and how are they organized? • Issue 2: How can we represent the information need, and how can we acquire this representation? • from a description of the information need, or • through interaction with the user?

  16. Research Issues (Continued) • Issue 3: How can we compare representations to judge the likelihood that a document matches an information need? • Issue 4: How can we evaluate the effectiveness of the retrieval process?

  17. Information Extraction • Problem Definition: An information extraction system is a cascade of transducers or modules that, at each step, add structure and often lose information (hopefully irrelevant) by applying rules that are acquired manually and/or automatically.

  18. Example: Parser • Transducer: parser • Input: the sequence of words or lexical items • Output: a parse tree • Information added: predicate-argument and modification relations • Information lost: none • Rule form: unification grammars • Application method: chart parser • Acquisition method: manual

  19. Modules • Text Zoner: turns a text into a set of text segments • Preprocessor: turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes • Filter: turns a set of sentences into a smaller set of sentences by filtering out the irrelevant ones

  20. Modules (Continued) • Lexical Disambiguation: turns a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates • Coreference Resolution, or Discourse Processing: turns a tree-like structure into a network-like structure by identifying different descriptions of the same entity in different parts of the text • Template Generator: derives the templates from the semantic structures
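A toy cascade in the spirit of the module descriptions above; only the first three stages (zoner, preprocessor, filter) are sketched, and the segment/sentence splitting rules and the "acquired" scenario keyword are simplifications chosen purely for illustration:

```python
from typing import Any, Callable

Module = Callable[[Any], Any]

def zoner(text: str) -> list[str]:
    """Text Zoner: split a raw text into segments (here, blank-line separated)."""
    return [seg for seg in text.split("\n\n") if seg.strip()]

def preprocessor(segments: list[str]) -> list[list[str]]:
    """Preprocessor: turn each segment into a sequence of sentences (crude '.' split)."""
    return [[s.strip() for s in seg.split(".") if s.strip()] for seg in segments]

def filter_relevant(segmented: list[list[str]], keyword: str = "acquired") -> list[str]:
    """Filter: keep only the sentences relevant to the scenario of interest."""
    return [s for sentences in segmented for s in sentences if keyword in s.lower()]

def run_cascade(text: str, modules: list[Module]) -> Any:
    """Apply each transducer in turn: structure is added, irrelevant detail is lost."""
    result: Any = text
    for module in modules:
        result = module(result)
    return result

text = "Acme Corp announced quarterly results.\n\nAcme acquired Widget Inc last week."
print(run_cascade(text, [zoner, preprocessor, filter_relevant]))
# -> ['Acme acquired Widget Inc last week']
```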

  21. Input Document Types • Plain Text IE (sentence by sentence, plain narrative text) • relies on lexical and semantic analysis • AutoSlog (Riloff 93), LIEP (Huffman 95), CRYSTAL (Soderland 95), HASTEN (Krupka 95) • Semi-structured Web IE (semi-structured documents) • exploits the characteristics of HTML syntax, i.e., tags • heuristics obtained by observation: layout
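A minimal wrapper sketch for the semi-structured case: it relies entirely on the HTML tag layout rather than on sentence structure to extract (item, price) tuples. The page fragment and the tag pattern are invented for illustration; real wrapper-induction systems learn such patterns rather than hard-coding them:

```python
import re

page = """
<li><b>USB keyboard</b> <span class="price">$19.99</span></li>
<li><b>Wireless mouse</b> <span class="price">$9.50</span></li>
"""

# The extraction rule is just the observed tag layout around the target fields.
pattern = re.compile(r'<b>(.*?)</b>\s*<span class="price">\$([0-9.]+)</span>')

for item, price in pattern.findall(page):
    print(item, float(price))
# USB keyboard 19.99
# Wireless mouse 9.5
```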

  22. Text IE vs. IR • IR simply finds texts and presents them to the user • IE analyses texts and presents only the specific information from them that the user is interested in • IE systems are more difficult and knowledge-intensive to build • IE systems are tied to particular domains and scenarios

  23. IE Tasks • Plain Text • Named Entity recognition (NE) • Coreference Resolution (CO) • Template Element construction (TE) • Scenario Template production (ST) • Semi-structured Documents • The template is a k-tuple relation

  24. Web IE Applications • Meta Search Engines • Information Agents • oriented toward a specific purpose, e.g.: • news agents (news spiders) • gathering news articles • shopping price comparison • job search • ShopBot (Doorenbos 97) • Software LEGO (Hsu 99)

  25. Human & Computer Users • User Services: • Query • Monitor • Update • [Architecture diagram: an Information Integration Service layered from abstracted information at the top to unprocessed, unintegrated details at the bottom; agent/module coordination, mediation, and semantic integration are handled by Mediators, and translation and wrapping by Wrappers (via SQL, ORB, ...), over heterogeneous data sources such as relational databases, object & knowledge bases, hierarchical & network databases, and text, images/video, and spreadsheets]

  26. References • W. B. Frakes and R. Baeza-Yates (Eds.), Information Retrieval: Data Structures and Algorithms, Chapter 1, Englewood Cliffs, NJ: Prentice Hall, 1992. • R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Chapter 1, Addison-Wesley, 1999. • H. Cunningham, Information Extraction – a User Guide, http://www.dcs.shef.ac.uk • I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.
