Query Languages: Patterns & Structures
Pattern Matching
• Pattern
  • a set of syntactic features that must occur in a text segment
• Types of patterns
  • Words: a string (sequence of characters) that must occur as a word in the text
  • Prefixes: a string that forms the beginning of a text word (e.g., comput → computer, computation)
  • Suffixes: a string that forms the end of a text word (e.g., ters → computers, painters)
  • Substrings: a string that appears within a text word, or across words when it contains word separators (e.g., tal → talk, metallic; any flow → many flowers)
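
The four pattern types above can be illustrated with a minimal Python sketch. The helper names (`words`, `match_prefix`, and so on) and the tokenization rule (lowercase runs of letters and digits) are assumptions made for this example; they are not part of the slides.

```python
import re

def words(text):
    """Split text into lowercase words (assumed rule: runs of letters/digits)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def match_word(text, pattern):
    """Words equal to the pattern."""
    return [w for w in words(text) if w == pattern]

def match_prefix(text, pattern):
    """Words that begin with the pattern."""
    return [w for w in words(text) if w.startswith(pattern)]

def match_suffix(text, pattern):
    """Words that end with the pattern."""
    return [w for w in words(text) if w.endswith(pattern)]

def match_substring(text, pattern):
    """Start offsets of the pattern in the raw text.

    The raw text is searched (not the word list) so that a pattern
    containing a separator, such as "any flow", can span two words.
    """
    return [m.start() for m in re.finditer(re.escape(pattern.lower()), text.lower())]

sample = "Many flowers talk to metallic computers and painters about computation"
print(match_prefix(sample, "comput"))       # ['computers', 'computation']
print(match_suffix(sample, "ters"))         # ['computers', 'painters']
print(match_substring(sample, "any flow"))  # [1]
```

A real retrieval system would answer these queries from an index rather than by scanning the text; the scan is only meant to make the definitions concrete.
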
Pattern Matching (cont.)
• More types of patterns
  • Ranges: a pair of strings that matches any word lying between them in lexicographical order (e.g., held to hold → hoax, hissing)
  • Allowing errors: retrieves words similar to the given word, within an error threshold (e.g., flower → flo wer, edit distance = 1)
  • Regular expressions: general patterns built from simple strings and operators (e.g., pro (blem | tein) (s | ε) (0 | 1 | 2)* → problem02, proteins)
  • Extended patterns: classes of characters, conditional expressions, wild characters, combinations
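
The range, error-tolerant, and regular-expression examples above can be checked with a short Python sketch. It assumes plain string comparison for lexicographical order and the standard Levenshtein distance (insertions, deletions, substitutions) as the error measure; the slides do not fix either choice, and the function names are hypothetical.

```python
import re

def in_range(word, low, high):
    """True if word lies between low and high in lexicographical order."""
    return low <= word <= high

def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# pro (blem | tein) (s | ε) (0 | 1 | 2)*  written in Python's re syntax
PRO_PATTERN = re.compile(r"pro(blem|tein)s?[012]*")

print(in_range("hoax", "held", "hold"))           # True
print(edit_distance("flower", "flo wer"))         # 1
print(bool(PRO_PATTERN.fullmatch("problem02")))   # True
```

`fullmatch` is used so that the whole word must fit the expression; `search` would also accept words that merely contain a match.
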
Structural Queries
• Structural query
  • mixes content and structure in queries
  • content constraints (words, phrases, patterns)
  • structural constraints (containment, proximity) and restrictions on structural elements (chapters, sections)
• Types of text structure
  • form-like fixed structure
  • hypertext structure
  • hierarchical structure
Structural Queries (cont.)
• Types of structures (figure): form-like, hypertext, hierarchical
Fixed Structure
• Traditional restrictions
  • documents had a fixed set of fields
  • each field had some text inside
  • fields were only rarely allowed to appear in any order or to repeat
  • fields were not allowed to nest or overlap
  • retrieval: specifying a basic pattern that must be found in a given field
• Characteristics
  • reasonable for text collections that have a fixed structure (e.g., a mail archive)
  • inadequate for representing hierarchical structure, such as HTML documents
  • can be extended toward the relational DB model
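
Field-restricted retrieval over a fixed structure can be sketched in a few lines of Python. The mail archive, its field names (`from`, `subject`, `body`), and the function `search_field` are made up for illustration; the pattern is a plain case-insensitive substring.

```python
# A tiny mail archive: every document has the same flat set of fields.
MAIL = [
    {"from": "alice@example.org",
     "subject": "Query languages draft",
     "body": "Please review the section on pattern matching."},
    {"from": "bob@example.org",
     "subject": "Lunch",
     "body": "Is the query languages meeting still at noon?"},
]

def search_field(documents, field, pattern):
    """Return the documents whose given field contains the pattern."""
    pattern = pattern.lower()
    return [doc for doc in documents if pattern in doc.get(field, "").lower()]

# The same pattern gives different answers depending on the field searched.
print([d["from"] for d in search_field(MAIL, "subject", "query")])  # ['alice@example.org']
print([d["from"] for d in search_field(MAIL, "body", "query")])     # ['bob@example.org']
```

Because the fields cannot nest or overlap, a flat dictionary per document is enough here, which is also why this model cannot describe something like nested HTML elements.
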
Hypertext
• Hypertext (navigational)
  • a directed graph in which the nodes hold some text and the links represent connections between nodes or between positions inside the nodes
• Browsing / searching in hypertext
  • retrieval from a hypertext is browsing: traversing the hypertext nodes by following links (a navigational activity)
  • even on the Web, one can search by the text contents of the nodes, but not by their structural connectivity
  • some search engines now allow searching for specific source or destination anchors (but not general structure + content queries)
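
The distinction between browsing and content search can be made concrete with a small Python sketch that stores a hypertext as a directed graph. The node names, texts, and links are invented for this example, and breadth-first traversal is only one possible browsing order.

```python
from collections import deque

# Hypothetical hypertext: node id -> text, and node id -> outgoing links.
NODES = {
    "home": "Query languages overview",
    "patterns": "Pattern matching with prefixes and suffixes",
    "structure": "Structural queries over hierarchical text",
    "protocols": "Query protocols such as Z39.50",
    "orphan": "Pattern matching notes that nothing links to",
}
LINKS = {
    "home": ["patterns", "structure"],
    "patterns": ["structure"],
    "structure": ["protocols", "home"],
    "protocols": [],
    "orphan": ["home"],
}

def browse(start):
    """Browsing: nodes reachable from start by following links (breadth-first)."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in LINKS.get(node, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen

def search_content(word):
    """Searching: nodes whose text contains the word, ignoring the links."""
    word = word.lower()
    return sorted(n for n, text in NODES.items() if word in text.lower().split())

print(browse("home"))             # ['home', 'patterns', 'structure', 'protocols']
print(search_content("pattern"))  # ['orphan', 'patterns']
```

Content search finds the `orphan` node even though no link leads to it, while browsing from `home` never reaches it. Neither function can answer a query that combines both, such as "nodes about patterns that link to a node about protocols".
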
Hierarchical Structure
• Hierarchical structure
  • an intermediate structuring model lying between fixed structure and hypertext
  • represents a recursive decomposition of the text
  • a natural model for many text collections, e.g., books, articles, legal documents, structured programs
• Hierarchical models
  • PAT Expressions, Overlapped Lists, Lists of References, Proximal Nodes, Tree Matching
• Issues in hierarchical models
  • static or dynamic structure, restrictions on the structure, integration with text, query language
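
A recursive decomposition of a text, and a query that mixes a structural restriction with a content constraint, can be sketched as follows. This is a generic tree walk written for illustration; it is not an implementation of PAT Expressions, Overlapped Lists, Lists of References, Proximal Nodes, or Tree Matching, and the `Node` class and sample book are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                 # structural element type, e.g. "chapter"
    title: str
    text: str = ""            # text held directly by this element
    children: list = field(default_factory=list)

def full_text(node):
    """Text of a node plus the text of everything nested inside it."""
    return " ".join([node.text] + [full_text(c) for c in node.children])

def find(node, kind, word):
    """Titles of elements of the given kind that contain the word."""
    hits = []
    if node.kind == kind and word.lower() in full_text(node).lower().split():
        hits.append(node.title)
    for child in node.children:
        hits.extend(find(child, kind, word))
    return hits

book = Node("book", "IR", children=[
    Node("chapter", "Query Languages", children=[
        Node("section", "Pattern Matching", "prefixes suffixes substrings"),
        Node("section", "Structural Queries", "containment proximity"),
    ]),
    Node("chapter", "Indexing", children=[
        Node("section", "Inverted Files", "vocabulary occurrences"),
    ]),
])

print(find(book, "chapter", "proximity"))  # ['Query Languages']
print(find(book, "section", "proximity"))  # ['Structural Queries']
```

The same word matches at different levels depending on the structural element requested, which is the containment constraint described under structural queries.
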
Query Protocols
• Query protocols
  • query languages used to query a text database
  • standards intended not for direct human use but for querying library systems and CD-ROM collections
• Some important query protocols
  • Z39.50
  • WAIS
  • CCL
  • CD-RDx
  • SFQL