
Chapter 4 Query Languages


cheung

Presentation Transcript


  1. Chapter 4 Query Languages

  2. Introduction • Cover different kinds of queries posed to text retrieval systems • Keyword-based query languages • include simple words and phrases as well as Boolean operators • Pattern matching • complement keyword searching with data retrieval capabilities • Structural queries • querying on structure of text

  3. Keyword-Based Querying • Query is formulation of user information need • Keyword-based queries are popular • intuitive • easy to express • allow for fast ranking • Query can be simply a word • in general more complex combination of operations involving several words

  4. Single-Word Queries • Most elementary query is a word • Word is sequence of letters surrounded by separators • some characters are not letters but do not split a word, e.g. hyphen in on-line • Result of word queries is • set of documents containing at least one of the words in query • resulting documents are ranked • term frequency (count of word in document) • inverse document frequency (decreases with the number of documents in which word appears, so rarer words weigh more)
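The ranking idea on this slide can be sketched in a few lines. This is a minimal illustration, not the chapter's own formula: the weighting tf × log(N / df) is one common variant and is an assumption here, as are the function name `rank` and the whitespace tokenizer.

```python
import math
from collections import Counter

def rank(query_words, docs):
    """Rank documents that contain at least one query word by a tf-idf score."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each word appears.
    df = Counter()
    for words in tokenized:
        df.update(set(words))
    scores = []
    for doc_id, words in enumerate(tokenized):
        tf = Counter(words)  # term frequency: count of each word in this document
        if not any(tf[w] > 0 for w in query_words):
            continue  # document contains none of the query words
        # Assumed weighting: tf * log(N / df); rarer words contribute more.
        score = sum(tf[w] * math.log(n_docs / df[w]) for w in query_words if tf[w] > 0)
        scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With this weighting, a word that occurs in every document gets weight log(1) = 0, so it does not influence the ranking.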

  5. Context Queries • Complement single-word queries with ability to search words in given context, i.e. near other words • words near each other signal higher likelihood of relevance than if they appear apart • form phrases of words or find words which are proximal in text

  6. Phrase • Sequence of single-word queries • for instance, possible to search for word ‘enhance’ and then word ‘retrieval’ • uninteresting words in text are not considered at all • e.g. above example query could match text such as ‘…enhance the retrieval…’

  7. Proximity • Sequence of single words or phrases is given together with maximum allowed distance between them • For instance, above query stated as • ‘enhance’ and ‘retrieval’ should occur within four words • a possible match could be ‘… enhance the power of retrieval…’ • Distance can be measured in characters or words
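Phrase and proximity queries can be sketched with word positions. The sketch below makes several assumptions that the slides do not state: distance is measured in words as the difference between positions, the second word must come after the first, and the stopword list `STOPWORDS` is a hypothetical stand-in for the "uninteresting words".

```python
STOPWORDS = {"the", "of", "a", "an"}  # hypothetical list of uninteresting words

def positions(words, target):
    """Return every index at which `target` occurs in `words`."""
    return [i for i, w in enumerate(words) if w == target]

def proximity_match(text, first, second, max_distance, drop_stopwords=False):
    """True if `second` occurs after `first`, at most `max_distance` words away."""
    words = text.lower().split()
    if drop_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return any(
        0 < j - i <= max_distance
        for i in positions(words, first)
        for j in positions(words, second)
    )

def phrase_match(text, first, second):
    """A two-word phrase: the words are adjacent once stopwords are ignored."""
    return proximity_match(text, first, second, 1, drop_stopwords=True)
```

Under these assumptions, `phrase_match("we enhance the retrieval", "enhance", "retrieval")` succeeds because "the" is skipped, and `proximity_match("enhance the power of retrieval", "enhance", "retrieval", 4)` succeeds because the two words are four positions apart.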

  8. Boolean Queries • Oldest form of combining keyword queries is to use Boolean operators • Boolean query has syntax composed of • atoms (i.e. basic queries) that retrieve documents, and • Boolean operators which work on their operands (sets of documents) • query syntax tree can be defined • leaves are basic queries • internal nodes are operators

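  9. Boolean Queries (Cont.) • Retrieve all documents that contain the word ‘translation’ as well as either the word ‘syntax’ or the word ‘syntactic’ • Query syntax tree: AND at the root, with children ‘translation’ and OR; OR has children ‘syntax’ and ‘syntactic’, i.e. translation AND (syntax OR syntactic)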
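The example query can be evaluated as a syntax tree over sets of documents. The following is a minimal sketch under assumptions: the nested-tuple tree encoding, the function name `evaluate`, and the toy index (word to set of document ids) are all illustrative, not part of the chapter.

```python
def evaluate(node, index):
    """Evaluate a Boolean query tree against an index of word -> set of doc ids."""
    if isinstance(node, str):
        # Leaf: a basic query (a single word) retrieves its set of documents.
        return index.get(node, set())
    operator, left, right = node
    left_docs = evaluate(left, index)
    right_docs = evaluate(right, index)
    if operator == "AND":
        return left_docs & right_docs  # documents present in both operands
    if operator == "OR":
        return left_docs | right_docs  # documents present in either operand
    raise ValueError(f"unknown operator: {operator}")

# Hypothetical index used only for illustration.
index = {"translation": {1, 2, 3}, "syntax": {2, 5}, "syntactic": {3, 4}}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
matching = evaluate(query, index)
```

The result is a plain set, which reflects the point made on the next slide: a document either satisfies the query or it does not, and no ranking is produced.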

  10. Boolean Queries (Cont.) • No ranking of retrieved documents provided • document either satisfies query (retrieved) or does not (not retrieved) • does not allow partial matching between document and user query • to overcome this limitation, idea of ‘fuzzy Boolean’ set of operators proposed • instead of requiring an element to appear in all operands (AND) or in at least one operand (OR), retrieve elements appearing in some operands

  11. Natural Language • Distinction between AND and OR completely blurred • simply an enumeration of words and context queries • all documents matching portion of user query are retrieved • higher ranking assigned to documents matching more parts of query • eliminated any reference to Boolean operators

  12. Pattern Matching • Query formulation based on concept of pattern that allows retrieval of pieces of text that have some property • Pattern is set of syntactic features that occur in text segment • Segments satisfying pattern specification are said to ‘match’ the pattern • We are interested in documents containing segments that match given search pattern

  13. Pattern Matching (Cont.) • Most used types of pattern are • words • string (sequence of characters) that is a word in text • prefixes • string that forms beginning of text word • prefix ‘comput’ retrieves documents with words such as ‘computer’, ‘computation’ • suffixes • string that forms termination of word • suffix ‘ters’ retrieves documents with words such as ‘testers’, ‘computers’

  14. substrings • string which can appear within word • substring ‘tal’ retrieves documents with words such as ‘coastal’, ‘talk’, ‘metallic’ • ranges • pair of strings that match any word lying between them in lexicographical order • alphabet symbols are ordered, so strings can be sorted in lexicographical (dictionary) order • range between words ‘held’ and ‘hold’ retrieves strings such as ‘hoax’, ‘hissing’
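The four pattern types above map directly onto string operations. This is a small sketch over a vocabulary list; the function names and the choice of inclusive range bounds are assumptions, and Python's default string comparison (by code point) stands in for lexicographical order, which is adequate for lowercase ASCII words.

```python
def prefix_matches(vocabulary, prefix):
    """Words that begin with `prefix`."""
    return [w for w in vocabulary if w.startswith(prefix)]

def suffix_matches(vocabulary, suffix):
    """Words that end with `suffix`."""
    return [w for w in vocabulary if w.endswith(suffix)]

def substring_matches(vocabulary, substring):
    """Words that contain `substring` anywhere."""
    return [w for w in vocabulary if substring in w]

def range_matches(vocabulary, low, high):
    """Words lying between `low` and `high` (inclusive, an assumption)."""
    return [w for w in vocabulary if low <= w <= high]

vocabulary = ["computer", "computation", "testers", "computers",
              "coastal", "talk", "metallic", "hoax", "hissing", "house"]
```

For example, `range_matches(vocabulary, "held", "hold")` keeps ‘hoax’ and ‘hissing’ but not ‘house’, which sorts after ‘hold’.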

  15. allowing errors • word together with error threshold • retrieves all text words ‘similar’ to given word • pattern may have errors (typing, spelling) and documents with words with erroneous variants are retrieved (with edit distance) • if typing error splits ‘flower’ into ‘flo wer’, still found with one error • regular expression (r.e.) • r.e. is built up from simple strings and operators like union, concatenation and repetition • query like ‘pro (blem | tein) (s | ) (0 | 1 | 2)*’ will match words like ‘problem02’, ‘proteins’
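Both ideas on this slide can be tried out directly. The sketch below assumes the usual edit distance (insertions, deletions and substitutions each cost one), which the slide does not spell out, and writes the regular expression with ‘blem’ so that it matches ‘problem02’; the names `edit_distance` and `PATTERN` are illustrative.

```python
import re

def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute or keep
            ))
        previous = current
    return previous[-1]

# Union (|), concatenation and repetition (*) as in the slide's example query.
PATTERN = re.compile(r"pro(blem|tein)(s|)(0|1|2)*")
```

The inserted space in ‘flo wer’ is a single edit, so a threshold of one error is enough to retrieve it when searching for ‘flower’. `PATTERN.fullmatch` is used rather than `search` so that the whole word has to match.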

  16. Extended patterns • subset of regular expressions expressed with simpler syntax • classes of characters, i.e. some position in pattern matched by any character from pre-defined set (e.g. some characters must be digit, not a letter, vowel, etc.) • conditional expressions, i.e. part of pattern may or may not appear • wild characters, i.e. match any sequence in text (e.g. any word that starts with ‘flo’ and ends with ‘ers’, which matches ‘flowers’ as well as ‘flounders’)
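Shell-style wildcards are one familiar example of such a simpler syntax. The sketch below uses Python's standard `fnmatch` and `re` modules as a stand-in; the concrete patterns `chapter[0-9]` and `colou?r` are hypothetical examples chosen for illustration and do not come from the slides.

```python
import re
from fnmatch import fnmatchcase

def wildcard_matches(vocabulary, pattern):
    """Words matching a shell-style pattern, where * matches any sequence."""
    return [w for w in vocabulary if fnmatchcase(w, pattern)]

# Wild characters: any word that starts with 'flo' and ends with 'ers'.
flo_words = wildcard_matches(["flowers", "flounders", "floor", "followers"], "flo*ers")

# Class of characters: the last position must be a digit.
has_digit = fnmatchcase("chapter4", "chapter[0-9]")

# Conditional expression: the 'u' may or may not appear.
optional_u = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colr")]
```

`fnmatchcase` is used instead of `fnmatch` so that matching is case-sensitive on every platform.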

  17. Structural Queries • Allowing user to query texts based on structure, and not content • mixing contents and structure in queries can pose powerful queries (much more expressive) • An example • select set of documents that satisfy certain constraints on content (using word, phrase, or patterns) and then • structural constraints expressed using containment, proximity, or chapters, sections present in documents

  18. Types of structures • fixed structure • hypertext • hierarchical structure

  19. Fixed Structure • Document has fixed set of fields • each field has some text inside • some fields not present in all documents • fields not allowed to nest or overlap • retrieval activity restricted to specifying that given pattern was to be found only in given fields • this model reasonable when text collection has fixed structure
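A fixed-structure collection can be modelled as records with flat fields. In this minimal sketch each document is a dictionary from field name to text; the field names, sample records and the function `search_field` are hypothetical.

```python
def search_field(documents, field, word):
    """Return ids of documents whose `field` contains `word`.

    Documents that lack the field are skipped, since some fields
    need not be present in every document.
    """
    return [
        doc_id
        for doc_id, doc in enumerate(documents)
        if word in doc.get(field, "").lower().split()
    ]

# Hypothetical records: fields are flat and never nest or overlap.
documents = [
    {"title": "Query Languages", "body": "keyword queries and pattern matching"},
    {"title": "Indexing", "body": "inverted files support query processing"},
    {"body": "a note without a title about query syntax"},
]
```

Restricting the search to one field is what distinguishes this from a plain word query: `search_field(documents, "title", "query")` finds only the first record, although all three mention the word in their body.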

  20. Hypertext • Retrieval from hypertext began as navigational activity • user manually traverses hypertext nodes, following links to find what they want • not possible to query hypertext based on its structure • WebGlimpse - interesting proposal to allow navigation plus ability to search by content in neighborhood of current node

  21. Hierarchical Structure • Represent recursive decomposition of text • natural model for many text collections • Figure 4.3 shows example of hierarchical structure that consists of page of a book, its schematic view and parsed query to retrieve figure

  22. Hierarchical Models: PAT Expressions • Structure marked in the text by tags (as in HTML) • defined in terms of initial and final tags • each pair of initial and final tags defines a region, a set of contiguous text areas • areas of a region cannot nest or overlap • possible to select areas containing other areas, contained in other areas, or followed by other areas
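The three selection operations can be illustrated by treating each area as a (start, end) pair of text positions. This is a rough sketch, not the PAT system's actual syntax or semantics: the function names are hypothetical, and "followed by" is read loosely as "some area of the other region starts after this one ends", whereas a real system may also bound the distance.

```python
def containing(outer_areas, inner_areas):
    """Areas from `outer_areas` that contain at least one area of `inner_areas`."""
    return [
        a for a in outer_areas
        if any(a[0] <= b[0] and b[1] <= a[1] for b in inner_areas)
    ]

def contained_in(inner_areas, outer_areas):
    """Areas from `inner_areas` that lie inside some area of `outer_areas`."""
    return [
        b for b in inner_areas
        if any(a[0] <= b[0] and b[1] <= a[1] for a in outer_areas)
    ]

def followed_by(first_areas, second_areas):
    """Areas from `first_areas` after which some area of `second_areas` starts."""
    return [
        a for a in first_areas
        if any(b[0] > a[1] for b in second_areas)
    ]

# Hypothetical regions: two chapters and one figure, as character offsets.
chapters = [(0, 100), (101, 200)]
figures = [(120, 130)]
```

Here `containing(chapters, figures)` selects the second chapter, and `followed_by(chapters, figures)` selects the first, because the figure starts after that chapter ends.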

  23. Overlapped Lists • Allows areas of a region to overlap, but not to nest • considers use of inverted list where both words and regions are indexed • supports set union and combining of regions

  24. List of References • Attempt to make definition and querying of structured text uniform, using common language • the language allows for querying on ‘path expressions’, which describe paths in structure tree • answers to queries are list of ‘references’ • reference is pointer to region of database

  25. Proximal Nodes • Tries to find good compromise between expressiveness and efficiency • specifies fully compositional language where leaves of query syntax tree formed by basic queries on contents or names of structural elements (e.g. all chapters) • internal nodes combine results • for efficiency, operations at internal nodes must relate nodes close in text

  26. Tree Matching • Relies on single primitive: tree inclusion • interpret structure of text database and query as trees • determine embedding of query into database which respects hierarchical relationships between nodes of query • simple queries return roots of the matches
