Information Retrieval and Extraction Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University
Chapter 1 Introduction Hsin-Hsi Chen (陳信希) Department of Computer Science and Information Engineering, National Taiwan University
Motivation • Information retrieval • To retrieve information which might be useful or relevant to the user • Representation, Storage, Organization, Access • Information need (vs. query) • Find all the pages containing information on college tennis teams which • (1) are maintained by a university in the USA and • (2) participate in the NCAA tennis tournament. • To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.
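Information Need • Find news reports related to the Golden Horse Awards / Nobel Prize • Queries of increasing specificity: award • Golden Horse Awards / Nobel Prize • Golden Horse Awards / Nobel Prize winners • this year's Golden Horse Awards / Nobel Prize winners • this year's Golden Horse Award winner for Best Picture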
Information versus Data Retrieval • Data retrieval • Determine which documents of a collection contain the keywords in the user query • Retrieve all objects which satisfy clearly defined conditions expressed as a regular expression or a relational algebra expression • Data has a well-defined structure and semantics • A solution for the user of a database system • Information retrieval
Database Management • A specified set of attributes is used to characterize each item. EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO) • Exact match between the attributes used in query formulations and those attached to the record. SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'
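The exact-match behaviour of data retrieval can be tried out with a minimal sketch using Python's built-in sqlite3 module. The table layout and the SELECT statement follow the slide; the two sample rows and the helper name `lookup` are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database with the EMPLOYEE schema from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE ("
    "NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, SEX TEXT, SALARY REAL, DNO INTEGER)"
)

# Hypothetical sample records (not from the slides).
rows = [
    ("John Smith", "123456789", "1965-01-09", "731 Fondren, Houston, TX", "M", 30000, 5),
    ("Joyce English", "453453453", "1972-07-31", "5631 Rice, Houston, TX", "F", 25000, 5),
]
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def lookup(name):
    """Return (BDATE, ADDR) for records whose NAME attribute equals `name` exactly."""
    cursor = conn.execute(
        "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", (name,)
    )
    return cursor.fetchall()


print(lookup("John Smith"))   # the one exactly matching record
print(lookup("Smith"))        # partial names do not match: empty list
```

A record either satisfies the condition or it does not; there is no notion of a partially relevant answer, which is the contrast with information retrieval drawn in these slides.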
Basic Concepts • Content identifiers (keywords, index terms, descriptors) characterize the stored texts. • Retrieval depends on the degree of coincidence between the sets of identifiers attached to queries and to documents. [Diagram labels: user task, query formulation; logical view of the documents, content analysis]
The User Task • Convey the semantics of the information need • Retrieval and browsing [Diagram labels: Retrieval, Database, Browsing]
Logical View of Documents • Full text representation • A set of index terms • Elimination of stop-words • The use of stemming • The identification of noun groups • …
From full text to a set of index terms [Figure labels: document (text + structure); structure recognition; accents, spacing, etc.; stopwords; noun groups; stemming; automatic or manual indexing; logical views: full text, structure, index terms]
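The text operations named above (accent and spacing normalization, stop-word elimination, stemming) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a naive suffix stripper standing in for a real stemmer. Noun-group identification is left out.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for"}


def normalize(text):
    """Strip accents and lowercase the text."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.lower()


def crude_stem(word):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Reduce full text to a list of index terms."""
    tokens = re.findall(r"[a-z0-9]+", normalize(text))
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


print(index_terms("The café is indexing the documents"))
```

Each step discards detail from the full text, so the resulting logical view is more compact but less faithful to the original document.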
Indexing • indexing: assign identifiers to text items. • assign: manual vs. automatic indexing • identifiers: • objective vs. nonobjective text identifiers: cataloging rules define, e.g., author names, publisher names, dates of publication, … • controlled vs. uncontrolled vocabularies: instruction manuals, terminological schedules, … • single-term vs. term phrase
The retrieval process [Figure labels: user need; User Interface; Text; Text Operations; logical view; Query Operations; Indexing; DB Manager Module; user feedback; query; Searching; Index; Text Database; retrieved documents; Ranking; ranked documents]
Information Retrieval • A generic information retrieval system selects and returns to the user the desired documents from a large set of documents, in accordance with criteria specified by the user. • functions • document search: the selection of documents from an existing collection of documents • document routing: the dissemination of incoming documents to appropriate users on the basis of user interest profiles
Detection Need • Definition: a set of criteria specified by the user which describes the kind of information desired. • queries in the document search task • profiles in the routing task • forms • keywords • keywords with Boolean operators • free text • example documents • ...
Example
<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document Management
<des> Description: Document must identify a company who has the capability to produce document management system by obtaining a turnkey-system or by obtaining and integrating the basic components.
<narr> Narrative: To be relevant, the document must identify a turnkey document management system or components which could be integrated to form a document management system and the name of either the company developing the system or the company using the system. These components are: a computer, image scanner or optical character recognition system, and an information retrieval or text management system.
Example (Continued)
<con> Concepts:
1. document management, document processing, office automation electronic imaging
2. image scanner, optical character recognition (OCR)
3. text management, text retrieval, text database
4. optical disk
<fac> Factors:
<def> Definitions
Document Management-The creation, storage and retrieval of documents containing, text, images, and graphics.
Image Scanner-A device that converts a printed image into a video image, without recognizing the actual content of the text or pictures.
Optical Disk-A disk that is written and read by light, and are sometimes associated with the storage of digital images because of their high storage capacity.
Search vs. routing • The search process matches a single Detection Need against the stored corpus to return a subset of documents. • Routing matches a single document against a group of Profiles to determine which users are interested in the document. • Profiles are long-term expressions of user needs. • Search queries are ad hoc in nature. • A generic detection architecture can be used for both search and routing.
Search • retrieval of desired documents from an existing corpus • Retrospective search is frequently interactive. • Methods • indexing the corpus by keyword, stem, and/or phrase • applying statistical and/or learning techniques to better understand the content of the corpus • analyzing free-text Detection Needs to compare them with the indexed corpus or with a single document • ...
Document Detection: Search (Continued) • Document Corpus • the content of the corpus may have a significant effect on performance in some applications • Preprocessing of Document Corpus • stemming • a list of stop words • phrases, multi-term items • ...
Document Detection: Search (Continued) • Building Index from Stems • key place for optimizing run-time performance • cost to build the index for a large corpus • Document Index • a list of terms, stems, phrases, etc. • frequency of terms in the document and corpus • frequency of the co-occurrence of terms within the corpus • index may be as large as the original document corpus
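A document index holding the three kinds of statistics listed above can be sketched as follows. This is a minimal illustration, not a description of any particular system: the function name `build_index`, the choice of an inverted (term-to-document) layout, and counting co-occurrence at document level are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_index(corpus):
    """Build simple index statistics.

    corpus: dict mapping doc_id -> list of (already preprocessed) terms.
    Returns (postings, corpus_freq, cooccurrence).
    """
    postings = defaultdict(dict)   # term -> {doc_id: frequency in that document}
    corpus_freq = Counter()        # term -> frequency in the whole corpus
    cooccurrence = Counter()       # (term_a, term_b) -> number of documents containing both

    for doc_id, terms in corpus.items():
        for term, count in Counter(terms).items():
            postings[term][doc_id] = count
            corpus_freq[term] += count
        # Count each unordered pair of distinct terms once per document.
        for pair in combinations(sorted(set(terms)), 2):
            cooccurrence[pair] += 1

    return dict(postings), corpus_freq, cooccurrence


corpus = {
    "d1": ["text", "retrieval", "text"],
    "d2": ["text", "database"],
}
postings, corpus_freq, cooccurrence = build_index(corpus)
print(postings["text"], corpus_freq["text"], cooccurrence[("retrieval", "text")])
```

The co-occurrence table grows roughly with the square of the number of distinct terms per document, which is one reason an index can approach the size of the original corpus.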
Document Detection: Search (Continued) • Detection Need • the user's criteria for a relevant document • Convert Detection Need to System Specific Query • The Detection Need is first transformed into a detection query, and then into a retrieval query. • detection query: specific to the retrieval engine, but independent of the corpus • retrieval query: specific to the retrieval engine, and to the corpus
Document Detection: Search (Continued) • Compare Query with Index • Resultant Rank Ordered List of Documents • Return the top 'N' documents • Rank the list of relevant documents from the most relevant to the query to the least relevant
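Comparing a query with the indexed documents and returning the top N can be sketched as below. The slides do not say how relevance is scored; the TF-IDF weighting used here (term frequency times log(N/df)) is an assumption chosen for illustration, and the function name `rank` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def rank(query_terms, docs, top_n=3):
    """Return up to top_n (doc_id, score) pairs, most relevant first.

    docs: dict mapping doc_id -> list of preprocessed terms.
    Scoring is a simple TF-IDF sum (an assumed scheme, not taken from the slides).
    """
    n_docs = len(docs)
    term_counts = {doc_id: Counter(terms) for doc_id, terms in docs.items()}

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    scores = {}
    for doc_id, counts in term_counts.items():
        score = sum(
            counts[t] * math.log(n_docs / doc_freq[t])
            for t in set(query_terms)
            if t in counts
        )
        if score > 0:
            scores[doc_id] = score

    # Sort by descending score; break ties by doc_id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


docs = {
    "d1": ["tennis", "team", "ranking"],
    "d2": ["tennis", "coach"],
    "d3": ["football", "team"],
}
for doc_id, score in rank(["tennis", "ranking"], docs):
    print(doc_id, round(score, 3))
```

Documents that share no term with the query receive no score and are left out of the ranked list, so the result may contain fewer than N entries.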
Routing (Continued) • Profile of Multiple Detection Needs • A Profile is a group of individual Detection Needs that describes a user's areas of interest. • All Profiles will be compared to each incoming document (via the Profile index). • If a document matches a Profile, the user is notified about the existence of a relevant document.
Routing (Continued) • Convert Detection Need to System Specific Query • Building Index from Queries • similar to building the corpus index for searching • the quantity of source data (Profiles) is usually much less than a document corpus • Profiles may have more specific, structured data in the form of SGML-tagged fields
Routing (Continued) • Routing Profile Index • The index will be system specific and will make use of all the preprocessing techniques employed by a particular detection system. • Document to be routed • A stream of incoming documents is handled one at a time to determine where each should be directed. • Routing implementation may handle multiple document streams and multiple Profiles.
Routing (Continued) • Preprocessing of Document • A document is preprocessed in the same manner that a query would be set up in a search. • The document and query roles are reversed compared with the search process. • Compare Document with Index • Identify which Profiles are relevant to the document • Given a document, which of the indexed profiles match it?
Routing (Continued) • Resultant List of Profiles • The list of Profiles identifies which users should receive the document
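The routing steps (index the Profiles, then compare each incoming document against that index) can be sketched as follows. The matching rule is an assumption made for this example: a Detection Need matches when at least a fraction `min_overlap` of its terms occur in the document. The names `build_profile_index` and `route`, and the sample profiles, are hypothetical.

```python
from collections import defaultdict


def build_profile_index(profiles):
    """Index Profiles by term.

    profiles: dict mapping user -> list of Detection Needs, each a set of terms.
    Returns: term -> set of (user, need_number) pairs containing that term.
    """
    index = defaultdict(set)
    for user, needs in profiles.items():
        for need_number, need in enumerate(needs):
            for term in need:
                index[term].add((user, need_number))
    return index


def route(document_terms, profiles, index, min_overlap=1.0):
    """Return the sorted list of users who should receive the document."""
    # The document plays the role of the query: look up each of its terms.
    hits = defaultdict(int)
    for term in set(document_terms):
        for key in index.get(term, ()):
            hits[key] += 1

    users = set()
    for (user, need_number), matched in hits.items():
        need_size = len(profiles[user][need_number])
        if matched / need_size >= min_overlap:
            users.add(user)
    return sorted(users)


profiles = {
    "alice": [{"tennis", "ncaa"}, {"optical", "disk"}],
    "bob": [{"missile"}],
}
index = build_profile_index(profiles)
print(route(["ncaa", "tennis", "ranking"], profiles, index))
print(route(["missile", "optical"], profiles, index))
```

Because the Profile index is usually much smaller than a corpus index, each incoming document in the stream can be handled with a handful of lookups.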
Summary • Generate a representation of the meaning or content of each object based on its description. • Generate a representation of the meaning of the information need. • Compare these two representations to select those objects that are most likely to match the information need.
Basic Architecture of an Information Retrieval System [Diagram labels: Documents*; Queries; Document Representation; Query Representation; Comparison] *This model can be extended to information presented in other media.
Research Issues • Given a set of descriptions of the objects in the collection and a description of an information need, we must consider • Issue 1 • What makes a good document representation? • What are retrievable units and how are they organized? • How can a representation be generated from a description of the document?
Research Issues (Continued) • Issue 2: How can we represent the information need, and how can we acquire this representation either from a description of the information need or through interaction with the user? • Issue 3: How can we compare representations to judge the likelihood that a document matches an information need?
Research Issues (Continued) • Issue 4: How can we evaluate the effectiveness of the retrieval process?
Text Data Mining Tasks • Information extraction -- facts, fill database • Summarization • Categorization • Clustering • Associations • Temporal analysis of document collection
Information Extraction: Beyond Document Retrieval • Question Answering • Q: Who is the author of the book, "The Iron Lady: A Biography of Margaret Thatcher"? A: Hugo Young • Q: What was the monetary value of the Nobel Peace Prize in 1989? A: $469,000
Information Extraction • Generic Information Extraction System: An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically.
Information Extraction (Continued) • What are the transducers or modules? • What are their input and output? • What structure is added? • What information is lost? • What is the form of the rules? • How are the rules applied? • How are the rules acquired?
Example: Parser • transducer: parser • input: the sequence of words or lexical items • output: a parse tree • information added: predicate-argument and modification relations • information lost: none • rule form: unification grammars • application method: chart parser • acquisition method: manually
Modules • Text Zoner: turn a text into a set of text segments • Preprocessor: turn a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes • Filter: turn a set of sentences into a smaller set of sentences by filtering out the irrelevant ones • Preparser: take a sequence of lexical items and try to identify various reliably determinable, small-scale structures
Modules (Continued) • Parser: input a sequence of lexical items and perhaps small-scale structures (phrases) and output a set of parse tree fragments, possibly complete • Fragment Combiner: turn a set of parse tree or logical form fragments into a parse tree or logical form for the whole sentence • Semantic Interpreter: generate a semantic structure or logical form from a parse tree or from parse tree fragments
Modules (Continued) • Lexical Disambiguation: turn a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates • Co-reference Resolution, or Discourse Processing: turn a tree-like structure into a network-like structure by identifying different descriptions of the same entity in different parts of the text • Template Generator: derive the templates from the semantic structures
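The idea of a cascade, where each module's output is the next module's input, can be sketched for the first three modules only. Everything here is a simplifying assumption for illustration: segments are split on blank lines, sentences on end punctuation, the only lexical attribute is capitalization, and relevance is a keyword test. The function names are hypothetical, and the parsing and semantic modules are not attempted.

```python
import re


def text_zoner(text):
    """Turn a text into text segments (here: blocks separated by blank lines)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]


def preprocessor(segment):
    """Turn a segment into sentences, each a list of lexical items.

    A lexical item is a word plus its attributes (here just capitalization).
    """
    sentences = re.split(r"(?<=[.!?])\s+", segment)
    return [
        [{"word": w, "capitalized": w[0].isupper()}
         for w in re.findall(r"[A-Za-z0-9']+", sentence)]
        for sentence in sentences
        if sentence
    ]


def sentence_filter(sentences, keywords):
    """Keep only sentences that mention at least one keyword."""
    return [
        s for s in sentences
        if any(item["word"].lower() in keywords for item in s)
    ]


def run_cascade(text, keywords):
    """Apply the modules in order: zoner, then preprocessor, then filter."""
    sentences = []
    for segment in text_zoner(text):
        sentences.extend(preprocessor(segment))
    return sentence_filter(sentences, keywords)


text = "Headline\n\nThe company sells a scanner. It rained today."
for sentence in run_cascade(text, {"scanner"}):
    print([item["word"] for item in sentence])
```

Each stage adds structure (segments, sentences, lexical attributes) and the filter deliberately loses information, which mirrors the description of a generic extraction system given earlier in these slides.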
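<DOC> <DOCID> NTU-AIR_LAUNCH-中國時報-19970612-002 </DOCID> <DATASET> Air Vehicle Launch </DATASET> <DD> 1997/06/12 </DD> <DOCTYPE> Newspaper report </DOCTYPE> <DOCSRC> 中國時報 (China Times) </DOCSRC> <TEXT> [Combined wire reports from New York and Washington, the 11th] On the day after Washington announced its first sale of "Stinger" shoulder-fired air-defense missiles to South Korea, the United States and North Korea today resumed long-delayed talks in New York. The talks, scheduled to last three days, will focus on North Korea's missile development, including reports that North Korea is preparing to deploy the "Rodong-1" long-range missile, whose range covers almost all of Japan. U.S. State Department spokesman Burns said: "On the issue of North Korean missile proliferation, the United States does have a number of concerns." U.S. officials have also long suspected that North Korea is exporting missiles to Iran and Syria, and hope that Pyongyang will join the … banning the proliferation of such weapons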