Query Languages
Information Retrieval • Concerned with the: • Representation of • Storage of • Organization of, and • Access to • Information items.
Recap 1 • Important points • Data retrieval vs. information retrieval • Users specify their needs via an intermediary language. • Documents are represented by an abstraction of their content. • Traditional model vs. berry-picking model • Evaluation (precision/recall, single-value measures, human measures) • Task characteristics (question answering, open-ended analysis, ad-hoc vs. filtering) • Collection characteristics (size, document relations)
Recap 2 • Important points • Content types (text, descriptive/semantic metadata, multimedia) • Metadata (formats & sets) • Information Theory (entropy) • Models of symbol distribution (Zipf’s law, Heaps’ law) • Distance measures (Hamming distance, Levenshtein distance) • Markup languages (SGML, HTML, XML) • Multimedia formats (header + data)
Query Languages • Query language determines which queries can be formulated • User-oriented languages • System-oriented languages (protocols) • Language dependent on underlying information retrieval model • Systems may enhance the query using • Word expansion (thesaurus & stemming) • removing stopwords
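The query enhancement steps named on this slide (stopword removal, stemming, thesaurus expansion) can be sketched roughly as follows. This is a minimal illustration only: the stopword list, the thesaurus entries, and the suffix-stripping rule are hypothetical stand-ins, and a real system would use much larger resources and a proper stemmer.

```python
# Hypothetical resources for illustration; not taken from the slides.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to"}
THESAURUS = {"car": ["automobile"], "retrieval": ["search"]}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def enhance_query(query):
    """Drop stopwords, stem each remaining word, then add thesaurus synonyms."""
    expanded = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue
        stem = naive_stem(word)
        for term in [stem] + THESAURUS.get(stem, []):
            if term not in expanded:  # keep first occurrence, preserve order
                expanded.append(term)
    return expanded


print(enhance_query("the retrieval of cars"))
```

Stemming before the thesaurus lookup is a design choice of this sketch, so that "cars" and "car" share one thesaurus entry.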
Keyword-Based Querying • A query is composed of keywords • Documents containing keywords are returned • Intuitive, easy to express, fast ranking • Single-word and multi-word queries • Classification of keyword-based queries • single-word queries • context queries • Boolean queries • natural language queries
Single-Word Queries • Word query • the most elementary query that can be formulated • a word is a sequence of letters surrounded by separators • in many models, words are the only type of query allowed • Result of word queries • the set of documents containing at least one of the words of the query • the resulting documents are ranked according to their degree of similarity to the query • ranking uses term frequency and inverse document frequency
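As a rough sketch of the ranking described above, the following scores each document that contains at least one query word by a sum of term frequency times inverse document frequency. The exact weighting formula (length-normalized tf, logarithmic idf) and the sample documents are assumptions made for illustration; the slides do not fix a particular variant.

```python
import math


def rank_by_tf_idf(query, docs):
    """Return (doc_id, score) pairs, best first, for docs matching any query word."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for word in query.lower().split():
        containing = [d for d, tokens in tokenized.items() if word in tokens]
        if not containing:
            continue
        # Assumed idf variant: log(N / document frequency).
        idf = math.log(n_docs / len(containing))
        for d in containing:
            # Assumed tf variant: raw count divided by document length.
            tf = tokenized[d].count(word) / len(tokenized[d])
            scores[d] = scores.get(d, 0.0) + tf * idf
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": "information retrieval systems rank documents",
    "d2": "retrieval retrieval of text",
    "d3": "database systems store records",
}
ranking = rank_by_tf_idf("retrieval", docs)
print([doc_id for doc_id, _ in ranking])
```

Here d2 ranks above d1 because the query word makes up a larger share of its text, and d3 is not returned at all because it does not contain the word.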
Context Queries • Phrase query • a sequence of single-word queries • separators and uninteresting words (stopwords) are ignored; example: “…enhance the retrieval…” • ranked in a fashion analogous to single words • Proximity query • a phrase query with a maximum allowed distance (in characters or words) between the words in the query; example: distance = 4, “enhance the power of retrieval” • physical proximity has semantic value: words in the same paragraph are related in some way
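The two context queries can be sketched over a tokenized text as shown below. This is an illustration under stated assumptions: distance is measured as the difference in word positions (so "enhance" and "retrieval" in the slide's example are 4 apart), tokens are split on whitespace only, and the function names are hypothetical.

```python
def positions(tokens, word):
    """All positions at which a word occurs in a token list."""
    return [i for i, token in enumerate(tokens) if token == word]


def phrase_match(text, phrase, stopwords=frozenset()):
    """True if the phrase words occur consecutively once stopwords are dropped."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    words = [w for w in phrase.lower().split() if w not in stopwords]
    n = len(words)
    return n > 0 and any(
        tokens[i:i + n] == words for i in range(len(tokens) - n + 1)
    )


def proximity_match(text, first, second, max_distance):
    """True if `second` occurs at most max_distance word positions after `first`."""
    tokens = text.lower().split()
    return any(
        0 < j - i <= max_distance
        for i in positions(tokens, first)
        for j in positions(tokens, second)
    )


sentence = "enhance the power of retrieval"
print(proximity_match(sentence, "enhance", "retrieval", 4))
print(phrase_match("we enhance the retrieval quality", "enhance the retrieval", {"the"}))
```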
Boolean Queries • Boolean query • composed of atoms (basic queries) that retrieve documents, and of Boolean operators which work on their operands (sets of documents) • Query syntax tree • compositional scheme • leaves: basic queries • internal nodes: operators • example: translation AND (syntax OR syntactic)
Boolean Queries • Operators in Boolean queries • e1 OR e2: selecting all docs satisfying e1 or e2 • e1 AND e2: selecting all docs satisfying both e1 and e2 • e1 BUT e2: selecting all docs satisfying e1 but not e2 • Classic Boolean system • no ranking of the retrieved docs • does not allow partial matching • alternative: fuzzy Boolean set of operators • meaning of AND and OR can be relaxed (e.g., appearing in some operands)
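A query syntax tree with OR, AND and BUT maps directly onto set operations. The sketch below evaluates such a tree recursively; the tuple encoding of the tree and the small inverted index (word to set of document ids) are assumptions made for the example.

```python
def evaluate(node, index):
    """Evaluate a query syntax tree against an inverted index."""
    if isinstance(node, str):
        # Leaf: a basic query, here a single word.
        return index.get(node, set())
    # Internal node: (operator, left operand, right operand).
    operator, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if operator == "OR":
        return a | b
    if operator == "AND":
        return a & b
    if operator == "BUT":
        return a - b
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical index: each word maps to the ids of the docs containing it.
index = {"translation": {1, 2, 3}, "syntax": {2, 4}, "syntactic": {3, 5}}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(evaluate(query, index))
```

As in a classic Boolean system, the result is an unranked set: a document either satisfies the expression or it does not.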
Natural Language • Natural language query • blurring the distinction between AND and OR → query becomes an enumeration of words and context queries • higher ranking is assigned to those documents matching more parts of the query • Characteristics • retrieving all the documents close to the query • a complete document can be used as a query → leads to the use of relevance feedback techniques (user selects a document from the result, and submits it as a new query) • example system: AskJeeves
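One simple way to read "higher ranking for documents matching more parts of the query" is to count how many distinct query words each document contains. The sketch below does only that; it is an assumed simplification that ignores context queries and term weighting, and the sample collection is hypothetical.

```python
def rank_natural_language(query, docs):
    """Rank documents by the number of distinct query words they contain."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        matches = len(query_words & set(text.lower().split()))
        if matches:
            scored.append((doc_id, matches))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    "a": "query languages for text retrieval",
    "b": "text compression",
    "c": "cooking recipes",
}
print(rank_natural_language("text retrieval query", docs))

# Relevance feedback in this sketch: resubmit a whole document as the query.
print(rank_natural_language(docs["a"], docs))
```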
Pattern Matching • Pattern • a set of syntactic features that must occur in a text segment • Types of patterns • Words: string (sequence of characters) in the text • Prefixes: string forming the beginning of a text word (e.g., comput → computer, computation) • Suffixes: string forming the termination of a text word (e.g., ters → computers, painters) • Substrings: string appearing within a text word, or across words when word separators are allowed (e.g., tal → talk, metallic; any flow → many flowers)
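The four pattern types on this slide correspond to basic string tests. The sketch below applies them to a small hypothetical vocabulary; the function name and the vocabulary are illustrative only.

```python
def match_pattern(words, kind, pattern):
    """Return the words that match a pattern of the given kind."""
    tests = {
        "word": lambda w: w == pattern,
        "prefix": lambda w: w.startswith(pattern),
        "suffix": lambda w: w.endswith(pattern),
        "substring": lambda w: pattern in w,
    }
    return [w for w in words if tests[kind](w)]


vocabulary = ["computer", "computation", "computers", "painters", "talk", "metallic"]
print(match_pattern(vocabulary, "prefix", "comput"))
print(match_pattern(vocabulary, "suffix", "ters"))
print(match_pattern(vocabulary, "substring", "tal"))

# A substring pattern that spans a word separator is matched against the text itself.
print("any flow" in "many flowers")
```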
Pattern Matching (cont.) • More Types of Patterns • Ranges: a pair of strings that matches any word lying between them in lexicographical order (e.g., held to hold → hoax, hissing) • Allowing errors: retrieving words similar to a given word (e.g., flower → flo wer [edit distance = 1]) • Regular expressions: general patterns built up from simple strings and operators (e.g., pro (blem | tein) (s | ε) (0 | 1 | 2)* → problem02, proteins) • Extended patterns: classes of characters, conditional expressions, wild characters, combinations
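The slide's three examples can be checked with a short sketch: a lexicographic range test, a standard dynamic-programming Levenshtein distance, and the regular expression rewritten in Python's `re` syntax (the optional "s | ε" becomes `s?`). The helper names are hypothetical.

```python
import re


def in_range(word, low, high):
    """True if the word lies between low and high in lexicographical order."""
    return low <= word <= high


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


# pro (blem | tein) (s | ε) (0 | 1 | 2)* in Python regex syntax.
pattern = re.compile(r"pro(blem|tein)s?[012]*")

print([w for w in ["hoax", "hissing", "house"] if in_range(w, "held", "hold")])
print(levenshtein("flower", "flo wer"))
print(bool(pattern.fullmatch("problem02")), bool(pattern.fullmatch("proteins")))
```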
Structural Queries • Structural query • mixing contents and structure in queries • content constraints (words, phrases, patterns) • structural constraints (containment, proximity) and restrictions on structural elements (chapters, sections) • Type of structures of text • form-like fixed structure • hypertext structure • hierarchical structure
Structural Queries (cont.) • Types of structures (figure): form, hypertext, hierarchical
Fixed Structure • Traditional restrictions • documents had a fixed set of fields • each field had some text inside • only rarely could the fields appear in any order or repeat • fields were not allowed to nest or overlap • retrieval: specifying a given basic pattern to be found only in a given field • Characteristics • reasonable for retrieving text collections that have a fixed structure (e.g., mail archives) → inadequate for representing hierarchical structure such as that of HTML docs • expansion to relational DB model
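Retrieval over a fixed structure amounts to looking for a pattern inside one named field only. The sketch below shows this on a tiny mail archive; the records, field names, and the case-insensitive substring match are all assumptions made for illustration.

```python
def search_field(records, field, pattern):
    """Return the records whose given field contains the pattern."""
    needle = pattern.lower()
    return [r for r in records if needle in r.get(field, "").lower()]


# Hypothetical mail archive with a fixed, non-nested set of fields.
mail_archive = [
    {"from": "alice@example.org", "subject": "Query languages",
     "body": "Notes on Boolean operators."},
    {"from": "bob@example.org", "subject": "Lunch",
     "body": "Query me about the menu."},
]

print([m["from"] for m in search_field(mail_archive, "subject", "query")])
print([m["from"] for m in search_field(mail_archive, "body", "query")])
```

The same word gives different results depending on the field it is restricted to, which is the point of a field-based query.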
Hypertext • Hypertext (navigational) • a directed graph where the nodes hold some text and the links represent connections between nodes or between positions inside the nodes • Browsing / Searching in hypertext • retrieval from a hypertext: browsing (traversing the hypertext nodes by following links → navigational activity) • even on the web, one can search by the text contents of the nodes, but not by their structural connectivity • some search engines now allow searching for specific source or destination anchors (but not general structure + content queries)
Hierarchical Structure • Hierarchical structure • an intermediate structuring model lying between fixed structure and hypertext • represents a recursive decomposition of the text • a natural model for many text collections, e.g., books, articles, legal documents, structured programs, etc. • Hierarchical models • PAT Expressions, Overlapped Lists, Lists of References, Proximal Nodes, Tree Matching • Issues in hierarchical models • static or dynamic structure, restrictions on the structure, integration with text, query language
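To make the idea of mixing content and structure concrete, the sketch below models a document as a tree and finds structural elements of a given kind whose subtree contains a word. It is a generic tree walk written for illustration, not an implementation of any of the hierarchical models named above, and the `Node` class and sample book are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # structural element, e.g. "chapter"
    text: str = ""                 # text held directly by this element
    children: list = field(default_factory=list)


def full_text(node):
    """Concatenate the text of a node and all of its descendants."""
    return " ".join([node.text] + [full_text(c) for c in node.children]).strip()


def find(node, kind, word):
    """Return nodes of the given kind whose subtree text contains the word."""
    hits = []
    if node.kind == kind and word in full_text(node).lower().split():
        hits.append(node)
    for child in node.children:
        hits.extend(find(child, kind, word))
    return hits


book = Node("book", "", [
    Node("chapter", "Query Languages", [
        Node("section", "Boolean queries use operators"),
        Node("section", "Pattern matching"),
    ]),
    Node("chapter", "Indexing", [Node("section", "Inverted files")]),
])

print([n.text for n in find(book, "chapter", "operators")])
print([n.text for n in find(book, "section", "matching")])
```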
Query Protocols • Query protocols • query languages used to query text databases • standards intended not for human use but for querying library systems and CD-ROMs • Some important query protocols • Z39.50 • WAIS • CCL • CD-RDx • SFQL