260 likes | 387 Views
Chapter 4 Query Languages. Introduction. Cover different kinds of queries posed to text retrieval systems Keyword-based query languages include simple words and phrases as well as Boolean operators Pattern matching complement keyword searching with data retrieval capabilities
E N D
Introduction • Cover different kinds of queries posed to text retrieval systems • Keyword-based query languages • include simple words and phrases as well as Boolean operators • Pattern matching • complement keyword searching with data retrieval capabilities • Structural queries • querying on structure of text
Keyword-Based Querying • Query is formulation of user information need • Keyword-based queries are popular • intuitive • easy to express • allow for fast ranking • Query can be simply a word • in general more complex combination of operations involving several words
Single-Word Queries • Most elementary query is a word • Word is sequence of letters surrounded by separators • some characters are not letters but do not split a word, e.g. hyphen in on-line • Result of word queries is • set of documents containing at least one of the words in query • resulting documents are ranked • term frequency (count of word in document) • inverse document frequency (count of no. of documents in which word appears)
Context Queries • Complement single-word queries with ability to search words in given context, I.e. near other words • words near each other signal higher likelihood of relevance than if they appear apart • form phrases of words or find words which are proximal in text
Phrase • Sequence of single-word queries • for instance, possible to search for word ‘enhance’ and then word ‘retrieval’ • uninteresting words in text are not considered at all • e.g. above example query could match text such as ‘…enhance the retrieval…’
Proximity • Sequence of single words or phrases is given together with maximum allowed distance between them • For instance, above query stated as • ‘enhance’ and ‘retrieval’ should occur within four words • a possible match could be ‘… enhance the power of retrieval…’ • Distance can be measured in characters or words
Boolean Queries • Oldest form of combining keyword queries is to use Boolean operators • Boolean query has following syntax • atoms (I.e. basic queries) that retrieve documents, and of • Boolean operators which work on their operands (sets of documents) • query syntax tree can be defined • leaves are basic queries • internal nodes are operators
Boolean Queries (Cont.) • Retrieve all documents that contain the word ‘translation’ as well as either the word ‘syntax’ or the word ‘syntactic’ AND OR translation syntactic syntax
Boolean Queries (Cont.) • No ranking of retrieved documents provided • document either satisfies query (retrieved) or does not (not retrieved) • does not allow partial matching between document and user query • to overcome this limitation, idea of ‘fuzzy Boolean’ set of operators proposed • instead of all the operands (AND) or at least in one of operands (OR), retrieve elements in some operands
Natural Language • Distinction between AND and OR completely blurred • simply an enumeration of words and context queries • all documents matching portion of user query are retrieved • higher ranking assigned to documents matching more parts of query • eliminated any reference to Boolean operators
Pattern Matching • Query formulation based on concept of pattern that allow retrieval of pieces of text that have some property • Pattern is set of syntactic features that occur in text segment • Segments satisfying pattern specification said to ‘match’ the pattern • We are interested in documents containing segments that match given search pattern
Pattern Matching (Cont.) • Most used types of pattern are • words • string (sequence of characters) that is a word in text • prefixes • string that form beginning of text word • prefix ‘comput’ retrieve documents with words such as ‘computer’, ‘computation’ • suffixes • string that form termination of word • suffix ‘ters’ retrieve documents with words such as ‘testers’, ‘computers’
substrings • string which can appear within word • substring ‘tal’ retrieve documents with words such as ‘coastal’, ‘talk’, ‘metallic’ • ranges • pair of strings that match any word lying between them in lexicographical order • alphabets sorted to order string into lexicographical order (dictionary order) • range between words ‘held’ and ‘hold’ retrieve strings such as ‘hoax’, ‘hissing’
allowing errors • word together with error threshold • retrieves all text words ‘similar’ to given word • pattern may have errors (typing, spelling) and documents with words with erroneous variants are retrieved (with edit distance) • if typing error splits ‘flower’ into ‘flo wer’, still found with one error • regular expression (r.e.) • r.e. is built up by simple strings and operators like union, concatenation and repetition • query like ‘pro (plem | tein) (s | ) (0 | 1 | 2)*’ will match words like ‘problem02’, ‘proteins’
Extended patterns • subset of regular expressions expressed with simpler syntax • classes of characters, I.e. some position in pattern matched by any character from pre-defined set (e.g. some characters must be digit, not a letter, vowel, etc.) • conditional expressions, I.e. part of pattern may or may not appear • wild characters, I.e. match any sequence in text (e.g. any word starts as ‘flo’ and ends with ‘ers’ which match ‘flowers’ as well as ‘flounders’
Structural Queries • Allowing user to query texts based on structure, and not content • mixing contents and structure in queries can pose powerful queries (much more expressive) • An example • select set of documents that satisfy certain constraints on content (using word, phrase, or patterns) and then • structural constraints expressed using containment, proximity, or chapters, sections present in documents
Types of structures • fixed structure • hypertext • hierarchical structure
Fixed Structure • Document has fixed set of fields • each field has some text inside • some fields not present in all documents • fields not allowed to nest or overlap • retrieval activity restricted to specifying that given pattern was to be found only in given fields • this model reasonable when text collection has fixed structure
Hypertext • Retrieval from hypertext began as navigational activity • user manually traverse hypertext nodes following links to search what he wanted • not possible to query hypertext based on its structure • WebGlimpse - interesting proposal to allow navigation plus ability to search by content in neighborhood of current node
Hierarchical Structure • Represent recursive decomposition of text • natural model for many text collections • Figure 4.3 shows example of hierarchical structure that consists of page of a book, its schematic view and parsed query to retrieve figure
Hierarchical ModelsPAT Expressions • Structure marked in the text by tags (as in HTML) • defined in terms of initial and final tags • each pair of initial and final tags defines a region, set of contiguous text areas • area of region cannot nest or overlap • possible to select areas containing other areas, contained in other areas, or followed by other areas
Overlapped Lists • Allows area of regions to overlap, but not to nest • considers use of inverted list where words and also regions are indexed • allows to perform set union, and to combine regions
List of References • Attempt to make definition and querying of structured text uniform, using common language • the language allows for querying on ‘path expressions’, which describe paths in structure tree • answers to queries are list of ‘references’ • reference is pointer to region of database
Proximal Nodes • Tries to find good compromise between expressiveness and efficiency • specifies fully compositional language where leaves of query syntax tree formed by basic queries on contents or names of structural elements (e.g. all chapters) • internal nodes combine results • for efficiency, operations at internal nodes must relate nodes close in text
Tree Matching • Relies on single primitive: tree inclusion • interpret structure of text database and query as trees • determine embedding of query into database which respects hierarchical relationships between nodes of query • simple queries return roots of the matches