Query Languages

Presentation Transcript


  1. Query Languages

  2. Information Retrieval • Concerned with the representation of, storage of, organization of, and access to information items.

  3. Recap 1 • Important points • Data retrieval vs. information retrieval • Users specify their needs via an intermediary language. • Documents are represented by an abstraction of their content. • Traditional model vs. berry-picking model • Evaluation (precision/recall, single-value measures, human measures) • Task characteristics (question answering, open-ended analysis, ad-hoc vs. filtering) • Collection characteristics (size, document relations)

  4. Recap 2 • Important points • Content types (text, descriptive/semantic metadata, multimedia) • Metadata (formats & sets) • Information Theory (entropy) • Models of symbol distribution (Zipf’s law, Heaps’ law) • Distance measures (Hamming distance, Levenshtein distance) • Markup languages (SGML, HTML, XML) • Multimedia formats (header + data)

  5. Query Languages • The query language determines which queries can be formulated • User-oriented languages • System-oriented languages (protocols) • The language depends on the underlying information retrieval model • Systems may enhance the query using • Word expansion (thesaurus & stemming) • Stopword removal
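
The query enhancement steps on this slide can be sketched roughly as follows. This is a minimal illustration, not a real stemmer or thesaurus: the stopword list, the `THESAURUS` entries, the suffix rules, and the function names are all assumptions made up for the example.

```python
# Hypothetical resources, chosen only for illustration.
STOPWORDS = {"the", "of", "a", "an", "and", "or", "in", "to", "for"}
THESAURUS = {"car": ["automobile"], "retrieval": ["search"]}
SUFFIXES = ("ing", "ers", "er", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix; a toy stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def enhance_query(query):
    """Drop stopwords, add thesaurus synonyms, then stem every term."""
    terms = [t for t in query.lower().split() if t not in STOPWORDS]
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(THESAURUS.get(term, []))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(crude_stem(t) for t in expanded))


print(enhance_query("the retrieval of computers"))
```

A production system would more likely use an established stemming algorithm and a curated thesaurus; the point here is only the order of the steps.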

  6. Keyword-Based Querying • A query is composed of keywords • Documents containing keywords are returned • Intuitive, easy to express, fast ranking • Single-word and multi-word queries • Classification of keyword-based queries • single-word queries • context queries • Boolean queries • natural language queries

  7. Single-Word Queries • Word query • the most elementary query that can be formulated • a word is a sequence of letters surrounded by separators • in many models, words are the only type of query allowed • Result of word queries • the set of documents containing at least one of the words of the query • the resulting documents are ranked according to their degree of similarity to the query • ranking uses term frequency and inverse document frequency
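
As a rough sketch of ranking by term frequency and inverse document frequency: the weighting below uses `tf * log(1 + N/df)`, which is one smoothed variant among many and is an assumption of this example, as are the function name `rank` and the whitespace tokenization.

```python
import math
from collections import Counter


def rank(query, docs):
    """Return (doc_id, score) pairs for docs containing at least one query word."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    query_terms = set(query.lower().split())
    # Document frequency: in how many documents each query term occurs.
    df = {t: sum(1 for words in tokenized.values() if t in words)
          for t in query_terms}
    scores = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)
        # Only terms present in the document contribute, so df[t] >= 1 here.
        score = sum(tf[t] * math.log(1 + n_docs / df[t])
                    for t in query_terms if tf[t] > 0)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "d1": "query languages for retrieval",
    "d2": "retrieval retrieval models",
    "d3": "cooking recipes",
}
print([doc_id for doc_id, _ in rank("retrieval", docs)])
```

Documents that contain none of the query words never appear in the result, matching the "at least one of the words" rule on the slide.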

  8. Context Queries • Phrase query • a sequence of single-word queries • separators and uninteresting words are ignored (example: “…enhance the retrieval…”) • ranked in a fashion analogous to single words • Proximity query • a phrase query with a maximum allowed distance (in characters or words) between the words in the query (example: with distance = 4, “enhance” and “retrieval” match “enhance the power of retrieval”) • physical proximity has semantic value: words in the same paragraph are related in some way
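
A minimal sketch of both context query types, under some assumptions: distance is measured as the difference between word positions, the stopword list stands in for "uninteresting words", and the function names are invented for this example.

```python
import re

# Hypothetical list of "uninteresting" words.
STOPWORDS = frozenset({"the", "of", "a", "an"})


def tokens(text, drop_stopwords=True):
    """Lowercase words with separators removed, optionally without stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if not (drop_stopwords and w in STOPWORDS)]


def phrase_match(text, phrase):
    """True if the phrase words occur consecutively, ignoring stopwords."""
    doc, query = tokens(text), tokens(phrase)
    n = len(query)
    return n > 0 and any(doc[i:i + n] == query
                         for i in range(len(doc) - n + 1))


def proximity_match(text, first, second, max_distance):
    """True if the two words occur at most max_distance word positions apart."""
    doc = tokens(text, drop_stopwords=False)
    pos_first = [i for i, w in enumerate(doc) if w == first.lower()]
    pos_second = [i for i, w in enumerate(doc) if w == second.lower()]
    return any(abs(a - b) <= max_distance
               for a in pos_first for b in pos_second)


print(phrase_match("We enhance the retrieval quality.", "enhance retrieval"))
print(proximity_match("enhance the power of retrieval", "enhance", "retrieval", 4))
```

Real systems usually answer such queries from a positional index rather than by scanning the text, but the matching condition is the same.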

  9. Boolean Queries • Boolean query • composed of atoms (basic queries) that retrieve documents, and of Boolean operators which work on their operands (sets of documents) • Query syntax tree • compositional scheme • leaves: basic queries • internal nodes: operators • [Figure: syntax tree for the query translation AND (syntactic OR syntax)]

  10. Boolean Queries • Operators in Boolean queries • e1 OR e2 : selecting all docs satisfying e1 or e2 • e1 AND e2 : selecting all docs satisfying both e1 and e2 • e1 BUT e2 : selecting all docs satisfying e1 but not e2 • Classic Boolean system • no ranking of the retrieved docs • does not allow partial matching • alternative: fuzzy Boolean set of operators • meaning of AND and OR can be relaxed (e.g., appearing in some operands)
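
The syntax tree and the three operators map naturally onto set operations. The sketch below assumes a query is written as a nested tuple `(operator, left, right)` with single words as leaves; that representation and the names `build_index` and `evaluate` are choices made for this example only.

```python
def build_index(docs):
    """Map each word to the set of ids of the documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def evaluate(node, index):
    """Evaluate a query tree: leaves are words, internal nodes are operators."""
    if isinstance(node, str):
        return set(index.get(node, set()))
    operator, left, right = node
    left_docs = evaluate(left, index)
    right_docs = evaluate(right, index)
    if operator == "OR":
        return left_docs | right_docs
    if operator == "AND":
        return left_docs & right_docs
    if operator == "BUT":
        return left_docs - right_docs
    raise ValueError(f"unknown operator: {operator}")


docs = {
    1: "translation syntax",
    2: "translation semantics",
    3: "syntactic analysis",
    4: "syntactic translation",
}
index = build_index(docs)
print(evaluate(("AND", "translation", ("OR", "syntactic", "syntax")), index))
```

The result is an unordered set, which reflects the classic Boolean model: a document either satisfies the query or it does not, with no ranking and no partial matching.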

  11. Natural Language • Natural language query • blurring the distinction between AND and OR → query becomes an enumeration of words and context queries • higher ranking is assigned to those documents matching more parts of the query • Characteristics • retrieving all the documents close to the query • a complete document can be used as a query → leads to the use of relevance feedback techniques (user selects a document from the result, and submits it as a new query) • example system: Ask Jeeves
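
One simple way to picture "matching more parts of the query" is to count how many distinct query words each document contains. This is only a sketch under that assumption; actual natural language systems use richer weighting, and `rank_by_overlap` is a name invented here.

```python
def rank_by_overlap(query, docs):
    """Rank documents by how many distinct query words they contain."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        matched = len(query_words & set(text.lower().split()))
        if matched:
            scored.append((doc_id, matched))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    "a": "fast boolean query evaluation",
    "b": "query languages",
    "c": "cooking",
}
print(rank_by_overlap("boolean query evaluation", docs))

# Relevance feedback in its simplest form: resubmit a whole document as the query.
print(rank_by_overlap(docs["b"], docs))
```

Because the query is just a bag of words, a complete document can be passed in unchanged, which is what makes the relevance feedback loop on the slide possible.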

  12. Query Languages: Patterns & Structures

  13. Pattern Matching • Pattern • a set of syntactic features that must occur in a text segment • Types of patterns • Words: string (sequence of characters) in the text • Prefixes: string forming the beginning of a text word (e.g., comput → computer, computation) • Suffixes: string forming the termination of a text word (e.g., ters → computers, painters) • Substrings: string appearing within a text word, or across words if word separators are allowed (e.g., tal → talk, metallic & any flow → many flowers)
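
The four pattern types above correspond closely to basic string operations. A small sketch, assuming the vocabulary is a plain list of words and using a hypothetical helper named `find_words`:

```python
def find_words(vocabulary, pattern, kind):
    """Return the vocabulary words matching a pattern of the given kind."""
    tests = {
        "word": lambda w: w == pattern,
        "prefix": lambda w: w.startswith(pattern),
        "suffix": lambda w: w.endswith(pattern),
        "substring": lambda w: pattern in w,
    }
    return [w for w in vocabulary if tests[kind](w)]


vocabulary = ["computer", "computation", "computers", "painters",
              "talk", "metallic"]
print(find_words(vocabulary, "comput", "prefix"))
print(find_words(vocabulary, "ters", "suffix"))
print(find_words(vocabulary, "tal", "substring"))

# A substring pattern may also span a word separator when matched
# against running text instead of single words.
print("any flow" in "so many flowers")
```

Scanning the vocabulary like this is linear in its size; index structures that answer prefix or substring queries faster are outside the scope of this sketch.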

  14. Pattern Matching (cont.) • More Types of Patterns • Ranges: a pair of strings that matches any word lying between them in lexicographical order (e.g., held to hold → hoax, hissing) • Allowing errors: retrieving words similar to a given word (e.g., flower → flo wer [edit distance = 1]) • Regular expressions: general patterns built up from simple strings and operators (e.g., pro (blem | tein) (s | ε) (0 | 1 | 2)* → problem02, proteins) • Extended patterns: classes of characters, conditional expressions, wild characters, combinations
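
The range, error-tolerant, and regular expression examples from this slide can be checked with a few lines of code. The sketch assumes Levenshtein distance as the error measure and translates the slide's expression into Python `re` syntax, where `(s | ε)` becomes `s?`; the function names are illustrative.

```python
import re


def in_range(word, low, high):
    """True if word lies between low and high in lexicographical order."""
    return low <= word <= high


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


# pro (blem | tein) (s | ε) (0 | 1 | 2)* in Python regex syntax.
PATTERN = re.compile(r"pro(blem|tein)s?[012]*")

print([w for w in ["hoax", "hissing", "house"] if in_range(w, "held", "hold")])
print(edit_distance("flower", "flo wer"))
print([w for w in ["problem02", "proteins", "protein3"] if PATTERN.fullmatch(w)])
```

`fullmatch` is used so that the whole word must fit the pattern; `search` would also accept words that merely contain a matching substring.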

  15. Structural Queries • Structural query • mixing contents and structure in queries • content constraints (words, phrases, patterns) • structural constraints (containment, proximity) and restrictions on structural elements (chapters, sections) • Types of text structure • form-like fixed structure • hypertext structure • hierarchical structure

  16. Structural Queries (cont.) • [Figure: types of structures: form, hypertext, hierarchical]

  17. Fixed Structure • Traditional restrictions • documents had a fixed set of fields • each field had some text inside • only rarely could the fields appear in any order or repeat • fields were not allowed to nest or overlap • retrieval: specifying a given basic pattern to be found only in a given field • Characteristics • suitable for text collections with a fixed structure (e.g., a mail archive) • inadequate for representing hierarchical structure such as HTML docs • can be extended toward the relational DB model
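
Retrieval over a fixed structure amounts to looking for a pattern inside one named field. A minimal sketch using a made-up mail archive; the field names, the addresses, and `search_field` are assumptions for illustration.

```python
# Hypothetical mail archive: every record has the same flat set of fields.
MAILS = [
    {"from": "alice@example.org", "subject": "Query languages",
     "body": "Notes on Boolean queries."},
    {"from": "bob@example.org", "subject": "Lunch",
     "body": "Any query about the menu?"},
]


def search_field(records, field, pattern):
    """Return the records whose given field contains the pattern."""
    pattern = pattern.lower()
    return [r for r in records if pattern in r.get(field, "").lower()]


print([m["from"] for m in search_field(MAILS, "subject", "query")])
print([m["from"] for m in search_field(MAILS, "body", "query")])
```

The same word gives different results depending on the field searched, which is the whole point of the structural restriction. Since fields cannot nest, a flat dictionary per document is enough.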

  18. Hypertext • Hypertext (navigational) • a directed graph where the nodes hold some text and the links represent connections between nodes or between positions inside the nodes • Browsing / Searching in hypertext • retrieval from a hypertext: browsing (traversing the hypertext nodes by following links → navigational activity) • even on the Web, one can search by the text contents of the nodes, but not by their structural connectivity • some search engines now allow searching for specific source or destination anchors (but not general structure + content queries)
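
The difference between browsing and content search can be sketched on a tiny directed graph. The pages, their text, and the function names below are all hypothetical.

```python
from collections import deque

# Hypothetical hypertext: node -> (text, outgoing links).
PAGES = {
    "home": ("welcome to the query language notes", ["boolean", "patterns"]),
    "boolean": ("boolean queries combine sets of documents", ["home"]),
    "patterns": ("pattern matching covers prefixes and suffixes", []),
    "orphan": ("boolean algebra background", []),
}


def browse(start):
    """Breadth-first traversal: every node reachable by following links."""
    seen, queue = [start], deque([start])
    while queue:
        node = queue.popleft()
        for target in PAGES[node][1]:
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


def search_text(word):
    """Content search: nodes whose text contains the word, ignoring links."""
    return sorted(name for name, (text, _) in PAGES.items()
                  if word in text.split())


print(browse("home"))
print(search_text("boolean"))
```

Content search finds the `orphan` page even though no link leads to it, while browsing from `home` never reaches it; neither function can answer a query that mixes content with link structure.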

  19. Hierarchical Structure • Hierarchical structure • an intermediate structuring model lying between fixed structure and hypertext • represents a recursive decomposition of the text • a natural model for many text collections, e.g., books, articles, legal documents, structured programs, etc. • Hierarchical models • PAT Expressions, Overlapped Lists, List of References, Proximal Nodes, Tree Matching • Issues in hierarchical models • static or dynamic structure, restrictions on the structure, integration with text, query language
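
The recursive decomposition can be pictured as a tree of typed nodes, with a structural query asking for nodes of one type whose subtree contains a given word. This sketch does not implement any of the named models (PAT Expressions, Proximal Nodes, and so on); the `BOOK` data and the function names are assumptions for illustration.

```python
# Hypothetical document tree: each node has a type, some text, and children.
BOOK = {
    "type": "book", "text": "",
    "children": [
        {"type": "chapter", "text": "query languages", "children": [
            {"type": "section", "text": "boolean queries and operators",
             "children": []},
            {"type": "section", "text": "pattern matching", "children": []},
        ]},
        {"type": "chapter", "text": "indexing", "children": [
            {"type": "section", "text": "inverted files", "children": []},
        ]},
    ],
}


def subtree_words(node):
    """All words in a node and, recursively, in its descendants."""
    words = set(node["text"].split())
    for child in node["children"]:
        words |= subtree_words(child)
    return words


def find(node, node_type, word):
    """Text of every node of node_type whose subtree contains the word."""
    hits = []
    if node["type"] == node_type and word in subtree_words(node):
        hits.append(node["text"])
    for child in node["children"]:
        hits.extend(find(child, node_type, word))
    return hits


print(find(BOOK, "chapter", "boolean"))
print(find(BOOK, "section", "matching"))
```

The first query combines a structural constraint (the node is a chapter) with a content constraint (some section inside it mentions "boolean"), which a flat fixed-field model cannot express.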

  20. Query Protocols • Query protocols • query languages used to query text databases • standards intended not for human use but for querying library systems and CD-ROMs • Some important query protocols • Z39.50 • WAIS • CCL • CD-RDx • SFQL
