Chapter 4: Query Languages Student: 曾寶樂 Student ID: 88522070 Instructor: 張嘉惠 Presentation date: 89/10/26
Outline • Keyword-Based Querying • Pattern Matching • Structural Queries • Query Protocols • Trends and Research Issues
Keyword-Based Querying A query is the formulation of a user information need. Keyword-based queries are popular: • 1. Single-Word Queries • 2. Context Queries • 3. Boolean Queries • 4. Natural Language
Single-Word Queries • A query is formulated as a single word • A document is a long sequence of words • A word is a sequence of letters surrounded by separators • What counts as a letter and what counts as a separator? e.g., ‘on-line’ • The division of the text into words is not arbitrary
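A minimal sketch of why the choice of letters and separators matters: the same text yields different words depending on whether the hyphen is treated as a letter. The function name `tokenize` and both character classes are illustrative assumptions, not something the slides define.

```python
import re

def tokenize(text, hyphen_is_letter):
    # Hypothetical helper: a word is a maximal run of "letters".
    # The flag decides whether '-' counts as a letter or as a separator.
    pattern = r"[A-Za-z-]+" if hyphen_is_letter else r"[A-Za-z]+"
    return re.findall(pattern, text)

text = "an on-line system"
print(tokenize(text, hyphen_is_letter=True))   # ['an', 'on-line', 'system']
print(tokenize(text, hyphen_is_letter=False))  # ['an', 'on', 'line', 'system']
```

Under the second definition, a single-word query for ‘on-line’ can never match, because no such word exists in the tokenized text.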
Context Queries • Definition - Search for words in a given context, e.g., near other words • Types - Phrase > a sequence of single-word queries > e.g., ‘enhance retrieval’ - Proximity > a sequence of single words or phrases, together with a maximum allowed distance between them > e.g., within distance (enhance, retrieval, 4) will match ‘…enhance the power of retrieval…’
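The proximity example can be sketched as follows. This assumes distance is measured as the difference in word positions and that the second word must follow the first; the slides do not fix either detail, and `within_distance` is a hypothetical helper name.

```python
def within_distance(words, first, second, max_distance):
    # Assumed semantics: `second` occurs after `first`,
    # at most `max_distance` word positions later.
    positions_first = [i for i, w in enumerate(words) if w == first]
    positions_second = [i for i, w in enumerate(words) if w == second]
    return any(0 < j - i <= max_distance
               for i in positions_first
               for j in positions_second)

words = "enhance the power of retrieval".split()
print(within_distance(words, "enhance", "retrieval", 4))  # True
print(within_distance(words, "enhance", "retrieval", 3))  # False
```

A phrase query is then the special case where each word must be exactly one position after the previous one.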
Boolean Queries • Definition - A syntax composed of atoms that retrieve documents, and of Boolean operators that work on their operands - e.g., translation AND syntax OR syntactic
Boolean Queries • Operators - (e1 OR e2) selects all documents which satisfy e1 or e2 - (e1 AND e2) selects all documents which satisfy both e1 and e2 - (e1 BUT e2) selects all documents which satisfy e1 but not e2 • “Fuzzy Boolean” - Retrieves documents in which only some of the operands appear (the AND may require the document to match more operands than the OR)
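One way to picture the three operators is as set operations on the sets of documents retrieved by each atom. The toy collection and the helper `matching` below are made up for illustration, and the grouping of the last query is an assumption, since the example on the slide is written without parentheses.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "translation of syntax",
    2: "syntactic analysis",
    3: "machine translation",
    4: "query languages",
}

def matching(term):
    # Atom: the set of documents that contain the term as a word.
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

e1, e2 = matching("translation"), matching("syntax")
print(e1 | e2)  # e1 OR e2  -> {1, 3}
print(e1 & e2)  # e1 AND e2 -> {1}
print(e1 - e2)  # e1 BUT e2 -> {3}

# One possible reading of "translation AND syntax OR syntactic":
print(e1 & (e2 | matching("syntactic")))  # {1}
```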
Natural Language • generalization of “fuzzy Boolean” • A query is an enumeration of words and context queries • All the documents matching a portion of the user query are retrieved
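A rough sketch of this idea: every document that matches at least one query word is retrieved. Ordering the results by the number of matched words is an added assumption, and `retrieve` and the sample documents are hypothetical.

```python
def retrieve(query_words, docs):
    # Score each document by how many distinct query words it contains.
    scores = {}
    for doc_id, text in docs.items():
        words = set(text.split())
        matched = sum(1 for w in set(query_words) if w in words)
        if matched > 0:  # matching only a portion of the query is enough
            scores[doc_id] = matched
    # Assumed ordering: more matched words first, ties broken by id.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    1: "query languages for text retrieval",
    2: "text compression",
    3: "parallel hardware",
}
print(retrieve(["text", "retrieval", "languages"], docs))  # [1, 2]
```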
Pattern Matching • A pattern is a set of syntactic features that must occur in a text segment • Types - Words - Prefixes, e.g., ‘comput’ -> ‘computer’, ‘computation’, ‘computing’, etc. - Suffixes, e.g., ‘ters’ -> ‘computers’, ‘testers’, ‘painters’, etc. - Substrings, e.g., ‘tal’ -> ‘coastal’, ‘talk’, ‘metallic’, etc. - Ranges, e.g., between ‘held’ and ‘hold’ -> ‘hoax’ and ‘hissing’
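The four non-word pattern types map directly onto basic string operations. A small sketch over a made-up vocabulary; treating the range as inclusive at both ends is an assumption.

```python
words = ["computer", "computation", "testers", "coastal", "talk",
         "metallic", "held", "hissing", "hoax", "hold", "house"]

prefix = [w for w in words if w.startswith("comput")]
suffix = [w for w in words if w.endswith("ters")]
substring = [w for w in words if "tal" in w]
# Range: words in lexicographic order between 'held' and 'hold'.
in_range = [w for w in words if "held" <= w <= "hold"]

print(prefix)     # ['computer', 'computation']
print(suffix)     # ['testers']
print(substring)  # ['coastal', 'talk', 'metallic']
print(in_range)   # ['held', 'hissing', 'hoax', 'hold']
```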
Pattern Matching • Allowing errors • Retrieve all text words which are ‘similar’ to the given word • Edit distance: the minimum number of character insertions, deletions, and replacements needed to make two strings equal, e.g., ‘flower’ and ‘flo wer’ • Maximum allowed edit distance: the query specifies the maximum number of allowed errors for a word to match the pattern
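The edit distance defined above can be computed with the standard dynamic-programming recurrence. This is a minimal sketch; the helper names `edit_distance` and `matches` are illustrative and not taken from the slides.

```python
def edit_distance(a, b):
    # prev[j] is the distance between the prefix of `a` processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or match
        prev = curr
    return prev[-1]

def matches(word, pattern, max_errors):
    # A word matches if it is within the maximum allowed edit distance.
    return edit_distance(word, pattern) <= max_errors

print(edit_distance("flower", "flo wer"))  # 1 (one inserted space)
print(matches("flo wer", "flower", 1))     # True
```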
Pattern Matching • Regular expressions • Union: if e1 and e2 are regular expressions, then (e1|e2) matches what e1 or e2 matches • Concatenation: if e1 and e2 are regular expressions, the occurrences of (e1e2) are formed by the occurrences of e1 immediately followed by those of e2 • Repetition: if e is a regular expression, then (e*) matches a sequence of zero or more contiguous occurrences of e • e.g., ‘pro(blem|tein)(s|ε)(0|1|2)*’ -> ‘problem2’ and ‘proteins’
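The example expression can be tried with Python's `re` module. Here the empty string ε is written as an empty alternative, `(s|)`, which is one common way to express it; the list of candidate words is made up.

```python
import re

# 'pro' followed by 'blem' or 'tein', an optional 's',
# then zero or more of the digits 0, 1, 2.
pattern = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")

candidates = ["problem2", "proteins", "problems12", "program", "protein3"]
matched = [w for w in candidates if pattern.fullmatch(w)]
print(matched)  # ['problem2', 'proteins', 'problems12']
```

`fullmatch` is used so that the whole word must match the expression, not just a part of it.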
Structural Queries • Mixing contents and structure in queries - Contents: words, phrases, or patterns - Structural constraints: containment, proximity, or other restrictions on structural elements • Three main structures - Fixed structure - Hypertext structure - Hierarchical structure
Fixed Structure • Document: a fixed set of fields • Example: a mail has a sender, a receiver, a date, a subject and a body field • Query example: search for the mails sent to a given person with “football” in the subject field
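The mail example can be sketched with a record type that has exactly the fields listed. The sample mails, the addresses and the `search` helper are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Mail:
    sender: str
    receiver: str
    date: str
    subject: str
    body: str

# Hypothetical sample data.
mails = [
    Mail("ann@example.org", "bob@example.org", "2000-10-01", "Football on Sunday", "See you at 3."),
    Mail("ann@example.org", "bob@example.org", "2000-10-02", "Meeting notes", "We also talked about football."),
    Mail("bob@example.org", "ann@example.org", "2000-10-03", "Re: football", "Fine."),
]

def search(mails, receiver, subject_word):
    # Content condition (the word) restricted to one field (the subject),
    # combined with an exact condition on another field (the receiver).
    return [m for m in mails
            if m.receiver == receiver
            and subject_word in m.subject.lower().split()]

for m in search(mails, "bob@example.org", "football"):
    print(m.date, m.subject)  # 2000-10-01 Football on Sunday
```

The second mail is not returned: it mentions football only in the body, which the query does not look at.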
Hypertext A hypertext is a directed graph in which • the nodes hold some text (text contents) • the links represent connections between nodes or between positions inside nodes (structural connectivity)
Hypertext: WebGlimpse • WebGlimpse combines browsing and searching on the Web
Hierarchical Structure Recursive decomposition of the text
Hierarchical Structure • PAT Expressions • Overlapped Lists • Lists of References • Proximal Nodes • Tree Matching
PAT Expressions • What is a PAT tree? • The areas of a region cannot nest or overlap
Overlapped Lists • The model allows the areas of a region to overlap, but not to nest • It is not clear whether overlapping is good or not for capturing the structural properties
Lists of References • Overlapping and nesting are not allowed • All elements must be of the same type, e.g., only sections, or only paragraphs • A reference is a pointer to a region of the database
Proximal Nodes • This model tries to find a good compromise between expressiveness and efficiency. • It does not define a specific language, but a model in which it is shown that a number of useful operators can be included achieving good efficiency.
Tree Matching • The leaves of the query can be not only structural elements but also text patterns, meaning that the ancestor of the leaf must contain that pattern.
Query Protocols • Z39.50 • WAIS (Wide Area Information Service)
Z39.50 • American National Standard Information Retrieval Application Service Definition • Can be implemented on any platform • Query bibliographical information using a standard interface between the client and the host database manager • Z39.50 protocol is part of WAIS
Z39.50 Brief history • Z39.50-1988 (version 1) • Z39.50-1992 (version 2) • Z39.50-1995 (version 3) • Version 4: development began in autumn 1995
Using Z39.50 over the WWW [Diagram; labels: WWW Client, WWW, Z39.50, Z39.50 Server, Repository, Digital library, Z39.50 Client]
WAIS (Wide Area Information Service) • Dates from the beginning of the 1990s • Used to query databases through the Internet
Trends and Research Issues
Relationship between types of queries and models
Model            Queries allowed
Boolean          word, set operations
Vector           words
Probabilistic    words
BBN              words
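Boolean Model • Boolean expressions have precise semantics, but how to express a document as a Boolean expression is itself a problem. • The model relies on binary comparison and lacks any comparison by “similarity” or “degree”, so it cannot support queries for similar documents.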
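Vector Model • Advantages: (1) term weighting improves retrieval performance; (2) it allows related documents to be retrieved; (3) it can compute the degree of similarity between documents, so the most similar documents can be found. • Disadvantages: the model assumes that terms are independent; if a keyword appears in every document, its weight will be 0, which ignores the meaning implied by the different frequencies with which the word appears in each document.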
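The zero weight for a keyword that appears in every document follows if one assumes the common tf-idf weighting, weight = tf × log(N / n), where N is the number of documents and n the number containing the term. The slide does not spell out the formula, so this is an assumption, and `tf_idf` is a hypothetical helper.

```python
import math

def tf_idf(tf, n_docs, docs_with_term):
    # Assumed weighting: term frequency times inverse document frequency.
    return tf * math.log(n_docs / docs_with_term)

# A term that occurs in all 100 documents gets weight 0,
# no matter how often it occurs in a particular document.
print(tf_idf(5, 100, 100))   # 0.0
print(tf_idf(50, 100, 100))  # 0.0

# A term that occurs in only 10 of the 100 documents keeps a positive weight.
print(tf_idf(5, 100, 10) > 0)  # True
```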
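Probabilistic Model • The main advantage of the probabilistic model is that it can compute similarity as a probability value, but it has several disadvantages: (1) the sets of relevant and non-relevant documents must be guessed; (2) the frequency with which a term occurs in a document is not taken into account; (3) index terms must be assumed to be mutually independent.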
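Bayesian Belief Network • A directed acyclic graph made up of a qualitative part and a quantitative part: the qualitative part is a directed graph formed by the domain variables and the interactions between them, and the quantitative part is the joint probability distribution of these domain variables. • In this directed graph, each node represents a random variable and each link indicates an interaction between two variables. In short, the directed graph is a factored representation of the joint probability distribution of these variables.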
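Bayesian Belief Network • Example: pregnancy (P) causes a change in hormones (H) (i.e., it affects the hormonal state) and a change in the scan shadow (S); the hormonal change in turn changes the results of the blood test (B) and the urine test (U).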
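Assuming the network contains exactly the links described in this example (P to H, P to S, H to B, H to U), the joint distribution factors into one term per node, each conditioned on that node's parents. \Pr is used for probability to avoid a clash with the variable P.

```latex
\Pr(P, H, S, B, U)
  = \Pr(P)\,\Pr(H \mid P)\,\Pr(S \mid P)\,\Pr(B \mid H)\,\Pr(U \mid H)
```

Each factor needs only a small conditional table, which is what makes the graph a compact representation of the full joint distribution.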
Trends and Research Issues The types of queries covered and how they are structured