Chapter 4 Query Language Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University
Introduction • Goals • Which queries can be formulated • How the formulation is related to underlying information retrieval models • Query languages
[Figure: taxonomy of query types. Labels: keywords and context; basic queries (words, phrases, proximity); Boolean queries; fuzzy Boolean; natural language; pattern matching (words, prefixes, suffixes, substrings, errors, regular expressions, extended patterns); structured queries]
Keyword-Based Querying • single-word queries • A query is formulated as a single word • A document is a long sequence of words • A word is a sequence of letters surrounded by separators • What are letters and separators? • e.g., ‘on-line’ • Chinese sentences are composed of characters without word boundaries • The division of the text into words is not arbitrary (this topic will be dealt with in a special talk on Chinese IR)
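The Word Segmentation Problem • Problem • Chinese sentences have no explicit delimiters between words. • 這名記者會說國語。 • 這 名 記者 會 說 國語。 (“This reporter can speak Mandarin.”) • 這 名 記者會 說 國語。 (reading 記者會 as “press conference”) • Definition of a word • A string that has an independent meaning and plays a specific grammatical role should be treated as one word. • Segmentation standards • Mainland China: 信息處理用現代漢語分詞規範 (Contemporary Chinese Word Segmentation Specification for Information Processing) • drafted in 1989 • submitted as a national standard in 1993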
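The Word Segmentation Problem (Continued) • Taiwan: 資訊處理用中文分詞標準草案 (Draft Chinese Word Segmentation Standard for Information Processing) • drafted in 1996 by the Computational Linguistics Society of the R.O.C. (中華民國計算語言學學會) • Basic principles • A string whose meaning cannot be obtained by simply adding up the meanings of its components should form one segmentation unit, e.g., 撞期 (to clash in schedule) vs. 撞山 (to crash into a mountain) • A string whose part of speech cannot be derived directly from its components should be merged into one segmentation unit, e.g., 好喝 (tasty, of a drink)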
Processing Model • A dictionary is an indispensable resource • List “all” possible words • 把他的確實行動作了分析 → 把, 他, 的, 確實, 實行, 行動, 動作, 了, 分析 • 電子計算機是會計算題目的機器 → 電子, 計算, 計算機, 電子計算機, 是, 會, 會計, 計算, 計算題, 題目, 目的, 的, 機器 • word lattice: 電 子 計 算 機 是 會 計 算 題 目 的 機 器
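The dictionary lookup step can be sketched in Python as follows. This is a minimal illustration, assuming the dictionary is a plain set of strings; the function name `candidate_words` and the `max_len` cap are hypothetical and not from the slides. Each returned triple is one edge of the word lattice.

```python
def candidate_words(sentence, dictionary, max_len=5):
    """Return (start, end, word) for every dictionary word found in the sentence."""
    lattice = []
    for start in range(len(sentence)):
        longest_end = min(len(sentence), start + max_len)
        for end in range(start + 1, longest_end + 1):
            word = sentence[start:end]
            if word in dictionary:
                lattice.append((start, end, word))
    return lattice


# Toy dictionary built from the candidate list on the slide (an assumption,
# not a real lexicon).
DICTIONARY = {
    "電子", "計算", "計算機", "電子計算機", "是", "會", "會計",
    "計算題", "題目", "目的", "的", "機器",
}
SENTENCE = "電子計算機是會計算題目的機器"

for start, end, word in candidate_words(SENTENCE, DICTIONARY):
    print(start, end, word)
```

Overlapping edges such as 計算機 and 電子計算機, or 會計 and 計算, are exactly the ambiguities a later disambiguation step has to resolve.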
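Processing Model (Continued) • Disambiguation mechanism • Pick the best combination • Strategies • Rule-based • Longest word first: 台灣大學 是 有名 的 學府 • A long word masks shorter words: 這 名 記者 會 說 國語。 • Remove word segments that break the path • Heuristics: prefer three-character words, ... • Parser • Statistical • Markov models, relaxation methods, ... • Performance: every system claims an accuracy above 95%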
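A minimal Python sketch of the longest-word-first rule, written as greedy forward maximum matching. The function name, the `max_len` cap, and both toy dictionaries are assumptions made for illustration; unknown characters fall back to single-character words.

```python
def longest_match_segment(sentence, dictionary, max_len=4):
    """Greedy left-to-right segmentation that always takes the longest dictionary word."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; a single character always succeeds.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


UNIVERSITY_DICT = {"台灣", "大學", "台灣大學", "是", "有名", "的", "學府"}
print(longest_match_segment("台灣大學是有名的學府", UNIVERSITY_DICT))

REPORTER_DICT = {"這", "名", "記者", "記者會", "會", "說", "國語"}
print(longest_match_segment("這名記者會說國語", REPORTER_DICT))
```

The second call shows the masking problem: because 記者會 is longer than 記者, the greedy rule picks it and the intended reading 記者 會 is lost.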
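Processing Model (Continued) • The problem • Does the dictionary contain every possible word? • e.g., A-錢 (slang: to embezzle money), 凍蒜 (from Taiwanese: to win an election) • Strategies • Morphological rules • (Semi-)automatic construction of new dictionaries • Unknown-word handling model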
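Morphological Rules • Formation of numerals and measure words • 一個個, 一條條 • Dates and times • 八十五年十月四日 (October 4 of year 85) • Prefixes or suffixes of nouns or verbs • 學生們 (students) • Special verbs • 丟丟看, 吃吃看, 寫寫看 (verb-verb-看: “try doing it”) • 高高興興, 歡歡喜喜, 漂漂亮亮, 迷迷糊糊 (AABB reduplication) • 打打球, 跑跑步, 寫寫字 (AAB reduplication) • ...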
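Some of these rules are regular enough to be checked with back-references. The sketch below is only an illustration of that idea: the pattern table, the labels AABB/AAB/ABB, and the function name are assumptions, and a real segmenter would combine such rules with a dictionary.

```python
import re

# Hypothetical rule table: each label names a reduplication shape over characters.
REDUPLICATION_PATTERNS = [
    ("AABB", re.compile(r"^(.)\1(?!\1)(.)\2$")),  # e.g. 高高興興
    ("AAB", re.compile(r"^(.)\1(?!\1)(.)$")),     # e.g. 打打球, 丟丟看
    ("ABB", re.compile(r"^(.)(?!\1)(.)\2$")),     # e.g. 一個個
]


def reduplication_type(word):
    """Return the label of the first matching shape, or None if none matches."""
    for label, pattern in REDUPLICATION_PATTERNS:
        if pattern.match(word):
            return label
    return None


for word in ["高高興興", "打打球", "一個個", "學生們"]:
    print(word, reduplication_type(word))
```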
Context Queries • definition • Search for words in a given context, e.g., near other words • types • phrase • a sequence of single-word queries • e.g., enhance retrieval • proximity • a sequence of single words or phrases, together with a maximum allowed distance between them • e.g., within distance (enhance, retrieval, 4) will match ‘… enhance the power of retrieval …’
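A minimal Python sketch of a proximity check. It assumes distance is measured as the difference in word positions and that the first word must precede the second; the slide does not fix either convention, and the function name `within_distance` is illustrative.

```python
import re


def within_distance(text, first, second, max_distance):
    """True if `second` occurs at most `max_distance` words after `first`."""
    tokens = re.findall(r"\w+", text.lower())
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(
        0 < j - i <= max_distance
        for i in first_positions
        for j in second_positions
    )


TEXT = "... enhance the power of retrieval ..."
print(within_distance(TEXT, "enhance", "retrieval", 4))
print(within_distance(TEXT, "enhance", "retrieval", 3))
```

Under this convention a phrase query is the special case where every consecutive pair of words is at distance exactly 1.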
Boolean Queries • definition • A syntax composed of atoms that retrieve documents, and of Boolean operators which work on their operands • e.g., translation AND (syntax OR syntactic) • [Figure: query syntax tree with AND at the root, whose children are ‘translation’ and an OR node over ‘syntax’ and ‘syntactic’]
Boolean Queries (Continued) • operators • (e1 OR e2) • Select all documents which satisfy e1 or e2. Duplicates are eliminated. • (e1 AND e2) • Select all documents which satisfy both e1 and e2. • (e1 BUT e2) • Select all documents which satisfy e1 but not e2. • “fuzzy Boolean” • Retrieve documents appearing in some operands (the AND may require them to appear in more operands than the OR)
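The three operators map directly onto set operations over an inverted index. The sketch below assumes whitespace tokenization and a toy document collection; `build_index` and `term` are hypothetical helper names.

```python
def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def term(index, word):
    """Atom: the set of documents containing `word` (empty if unseen)."""
    return index.get(word, set())


DOCS = {
    1: "machine translation and syntax",
    2: "syntactic analysis for translation",
    3: "syntax of programming languages",
    4: "translation memory tools",
}
INDEX = build_index(DOCS)

# translation AND (syntax OR syntactic): intersection of a set with a union
both = term(INDEX, "translation") & (term(INDEX, "syntax") | term(INDEX, "syntactic"))
# translation BUT syntax: set difference
without = term(INDEX, "translation") - term(INDEX, "syntax")
print(sorted(both), sorted(without))
```

Using sets makes duplicate elimination in OR automatic, which matches the definition above.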
Natural Language • generalization of “fuzzy Boolean” • A query is an enumeration of words and context queries. • All the documents matching a portion of the user query are retrieved.
Pattern Matching • A pattern is a set of syntactic features that must occur in a text segment • types • words • prefixes, e.g., ‘comput’ matches ‘computer’, ‘computation’, ‘computing’, etc. • suffixes, e.g., ‘ters’ matches ‘computers’, ‘testers’, ‘painters’, etc. • substrings, e.g., ‘tal’ matches ‘coastal’, ‘talk’, ‘metallic’, etc. • ranges (lexicographic order), e.g., between ‘held’ and ‘hold’ matches ‘hoax’ and ‘hissing’
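These four pattern types can be sketched over a small vocabulary with plain string operations. The vocabulary and function names are illustrative, and treating the range bounds as inclusive is an assumption.

```python
VOCABULARY = [
    "coastal", "computation", "computer", "computers", "held", "hissing",
    "hoax", "hold", "metallic", "painters", "talk", "testers",
]


def prefix_match(prefix, words):
    return [w for w in words if w.startswith(prefix)]


def suffix_match(suffix, words):
    return [w for w in words if w.endswith(suffix)]


def substring_match(fragment, words):
    return [w for w in words if fragment in w]


def range_match(low, high, words):
    # Python compares strings lexicographically; both bounds are included here.
    return [w for w in words if low <= w <= high]


print(prefix_match("comput", VOCABULARY))
print(suffix_match("ters", VOCABULARY))
print(substring_match("tal", VOCABULARY))
print(range_match("held", "hold", VOCABULARY))
```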
Pattern Matching (Continued) • allowing errors • Retrieve all text words which are ‘similar’ to the given word • edit distance: the minimum number of character insertions, deletions, and replacements needed to make two strings equal, e.g., ‘flower’ and ‘flo wer’ are at distance 1 (one inserted space) • maximum allowed edit distance: the query specifies the maximum number of allowed errors for a word to match the pattern
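The edit distance defined here can be computed with the standard dynamic-programming recurrence. This is a sketch of that textbook algorithm, not necessarily what any particular retrieval system uses; `matches_with_errors` is a hypothetical helper that filters a word list linearly.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # replace, or keep if equal
            ))
        previous = current
    return previous[-1]


def matches_with_errors(pattern, words, max_errors):
    """Words whose edit distance to the pattern is at most max_errors."""
    return [w for w in words if edit_distance(pattern, w) <= max_errors]


print(edit_distance("flower", "flo wer"))
print(matches_with_errors("flower", ["flower", "flo wer", "flowers", "tower"], 1))
```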
Pattern Matching (Continued) • regular expressions • union: if e1 and e2 are regular expressions, then (e1 | e2) matches what e1 or e2 matches • concatenation: if e1 and e2 are regular expressions, the occurrences of (e1 e2) are formed by the occurrences of e1 immediately followed by those of e2 • repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e • e.g., ‘pro (blem | tein) (s | ) (0 | 1 | 2)*’ matches ‘problem2’ and ‘proteins’
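The example pattern can be tried with Python's `re` module. The sketch assumes the spaces in the slide's pattern are only there for readability, so they are dropped, and it matches whole words with `fullmatch`.

```python
import re

# Union is '|', concatenation is juxtaposition, repetition is '*'.
PATTERN = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")


def matching_words(words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if PATTERN.fullmatch(w)]


WORDS = ["problem2", "proteins", "protein", "problems012", "program", "problem3"]
print(matching_words(WORDS))
```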
Pattern Matching (Continued) • extended patterns • subsets of the regular expressions expressed with a simpler syntax • classes of characters • conditional expressions • wild characters which match any sequence in the text • combinations
Structural Queries • mixing contents and structure in queries • contents: words, phrases, or patterns • structural constraints: containment, proximity, or other restrictions on structural elements • issues • what structure a text may have • what queries can be made on which structures • three main structures • form-like fixed structure • hypertext structure • hierarchical structure
Form-like fixed structure • Document: a fixed set of fields • For example, a mail has a sender, a receiver, a date, a subject, and a body field. • Search for the mails sent to a given person with “football” in the Subject field • [Figure: a document made up of text fields]
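The mail example can be sketched as a query over records with a fixed set of fields. The field names, the sample mails, and the function `search_mails` are all made up for illustration.

```python
# Each mail is a record with the same fixed set of fields (sample data is invented).
MAILS = [
    {"sender": "ann", "receiver": "bob", "date": "2001-03-01",
     "subject": "Football on Sunday", "body": "Are you coming?"},
    {"sender": "carl", "receiver": "bob", "date": "2001-03-02",
     "subject": "Meeting notes", "body": "We talked about football."},
    {"sender": "bob", "receiver": "ann", "date": "2001-03-03",
     "subject": "Re: Football on Sunday", "body": "Yes."},
]


def search_mails(mails, receiver, subject_word):
    """Mails sent to `receiver` whose subject field contains `subject_word`."""
    wanted = subject_word.lower()
    return [
        mail for mail in mails
        if mail["receiver"] == receiver
        and wanted in mail["subject"].lower().split()
    ]


for mail in search_mails(MAILS, "bob", "football"):
    print(mail["sender"], mail["subject"])
```

The second mail mentions football only in its body, so a query restricted to the subject field does not return it.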
Hypertext structure • A hypertext is a directed graph where • nodes hold some text (text contents) • links represent connections between nodes or between positions inside nodes (structural connectivity) • WebGlimpse: combines browsing and searching on the Web
WebGlimpse (http://tucson.com/webglimpse/) • WebGlimpse is a fast, flexible search engine for finding information in a related web of pages. • The ability to index pages on remote sites provides a level of power one step above most search engine tools. • You can define your own sub-area of the web simply by making a page of links to all relevant sites. • WebGlimpse will search by following your links, to whatever 'depth' you specify.
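The idea of searching by following links to a given depth can be illustrated with a breadth-first traversal over an in-memory graph. This is only a sketch of the general technique and says nothing about how WebGlimpse itself is implemented; the page names, contents, and the function `search_linked_pages` are invented.

```python
from collections import deque


def search_linked_pages(pages, links, start, word, max_depth):
    """Pages containing `word` that are reachable from `start` within `max_depth` links."""
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        page, depth = queue.popleft()
        if word in pages.get(page, "").lower().split():
            hits.append(page)
        if depth < max_depth:
            for target in links.get(page, []):
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return hits


# Nodes hold text; edges are the directed links between them.
PAGES = {
    "index": "links to relevant sites",
    "a": "query languages for retrieval",
    "b": "cooking recipes",
    "c": "structured retrieval models",
}
LINKS = {"index": ["a", "b"], "a": ["c"], "b": [], "c": ["index"]}

print(search_linked_pages(PAGES, LINKS, "index", "retrieval", 1))
print(search_linked_pages(PAGES, LINKS, "index", "retrieval", 2))
```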
Hierarchical Structure • Recursive decomposition of the text
[Figure: a sample page (Chapter 4, with “4.1 Introduction We cover in this chapter the different kinds of …” and “4.4 Structural Queries …”), its hierarchy (a chapter containing sections, each with a title, and a figure), and a query tree with the labels figure, in, section, with, title, “structural”]
Issues • static or dynamic structure • static: there are one or more explicit hierarchies • dynamic: the required elements are built on the fly using text markup • restrictions on the structure • The text or the answers may have restrictions about nesting and/or overlapping
Issues (Continued) • integration with text • integration of queries on text content with queries on text structure • query language • features • selection of areas that contain (or not) other areas • selection of areas that are contained (or not) in other areas • selection of areas that follow (or are followed by) other areas • selection of areas that are close to other areas • set manipulation • standardization, expressiveness taxonomy or formal categorization
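The area-selection features listed above can be sketched by modelling each area as a (start, end) character interval. This representation, the sample offsets, and the function names are assumptions for illustration; the hierarchical models named on the next slide each define their own.

```python
def contains(outer, inner):
    """True if `inner` lies inside `outer` and is not the same area."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner


def areas_containing(candidates, others):
    return [c for c in candidates if any(contains(c, o) for o in others)]


def areas_not_containing(candidates, others):
    return [c for c in candidates if not any(contains(c, o) for o in others)]


def areas_contained_in(candidates, others):
    return [c for c in candidates if any(contains(o, c) for o in others)]


def areas_following(candidates, others):
    """Areas that start at or after the end of some other area."""
    return [c for c in candidates if any(o[1] <= c[0] for o in others)]


# Invented offsets: two sections, one figure, two titles.
SECTIONS = [(0, 100), (100, 250)]
FIGURES = [(120, 180)]
TITLES = [(0, 10), (100, 115)]

sections_with_figure = areas_containing(SECTIONS, FIGURES)
print(sections_with_figure)
print(areas_contained_in(TITLES, sections_with_figure))
```

Because every function returns a list of areas, the results can be fed into further selections or combined with set manipulation.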
A Sample of Hierarchical Models • PAT Expressions • Overlapped Lists • Proximal Nodes • Tree Matching