
ISP433/633 Week 3




  1. ISP433/633 Week 3 Query Structure and Query Operations

  2. Outline • More on weight assignment • Exercise (as a review) • BREAK • Query structure • Query operations • Boolean query parse • Vector query reformulation

  3. Vector Space Model Example • D1 = “computer information retrieval” • D2 = “computer, computer information” • Q1 = “information, information retrieval” • [Figure: D1, D2 and Q1 as vectors]
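A minimal sketch of how the three texts above could be turned into term-count vectors. The vocabulary order, the tokenizer, and the names `to_vector` and `dot` are assumptions made for illustration; the slides do not say how the vectors were built or compared.

```python
import re
from collections import Counter

# Assumed vocabulary order; the slides do not fix one.
VOCAB = ["computer", "information", "retrieval"]

def to_vector(text, vocab=VOCAB):
    """Return raw term counts for `text`, one entry per vocabulary term."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocab]

def dot(u, v):
    """Inner product, one simple way to compare a query with a document."""
    return sum(a * b for a, b in zip(u, v))

d1 = to_vector("computer information retrieval")
d2 = to_vector("computer, computer information")
q1 = to_vector("information, information retrieval")

print(d1, d2, q1)
print(dot(q1, d1), dot(q1, d2))
```

With raw counts, the query shares more weight with D1 than with D2, because D2 does not contain “retrieval”.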

  4. More on Weight Assignment • Boolean Model: binary weight • Term frequency as weight (freq): raw frequency of a term inside a document • Problems with using raw freq: Zipf’s law (non-distinguishing terms have high frequency); document length matters

  5. Zipf’s law • Rank × Frequency ≈ Constant

  6. Zipf’s law • [Figure: rank-frequency plots, one on a linear scale and one on a log scale]
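The relationship can be illustrated with synthetic data. The sketch below assumes an idealized distribution in which frequency is exactly C / rank, with an arbitrary constant C = 1200; real collections follow the law only approximately.

```python
import math

C = 1200  # assumed constant, chosen only for illustration

# Ideal Zipfian frequencies for ranks 1..6: freq = C / rank
ranks = [1, 2, 3, 4, 5, 6]
freqs = [C / r for r in ranks]

# Rank x frequency stays constant for this idealized data.
products = [r * f for r, f in zip(ranks, freqs)]

def slope(r1, f1, r2, f2):
    """Slope of the line through two points on a log-log plot."""
    return (math.log(f2) - math.log(f1)) / (math.log(r2) - math.log(r1))

print(products)
print(slope(ranks[0], freqs[0], ranks[-1], freqs[-1]))
```

On a linear scale such data forms a steep curve; on a log-log scale it falls on a straight line with slope close to -1.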

  7. Inverse Document Frequency (idf) • Inverse of the proportion of documents that have the term among all the documents in the collection • Deals with Zipf’s law • idf_i = log(N / n_i) • N: total number of documents in the collection • n_i: the number of documents that have term i

  8. Benefit of idf • idf provides high values for rare words and low values for common words
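A small sketch of the idf formula, idf_i = log(N / n_i). The four-document collection is hypothetical, and the natural logarithm is an assumption, since the slides do not state a base.

```python
import math

def idf(term, docs):
    """idf_i = log(N / n_i), where `docs` is a list of token lists."""
    n_docs = len(docs)
    n_with_term = sum(1 for doc in docs if term in doc)
    if n_with_term == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(n_docs / n_with_term)

# Hypothetical toy collection, not taken from the slides.
docs = [
    ["computer", "information", "retrieval"],
    ["computer", "computer", "information"],
    ["computer", "science"],
    ["computer", "retrieval", "systems"],
]

for term in ["computer", "information", "science"]:
    print(term, idf(term, docs))
```

A term that occurs in every document (“computer”) gets idf 0, while a term that occurs in only one document (“science”) gets the highest value.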

  9. Normalized Term Frequency (tf) • Deals with document length • The most frequent term m in document j has frequency freq_{m,j} • Term i’s normalized term frequency in document j: tf_{i,j} = freq_{i,j} / freq_{m,j} • For a query q: tf_{i,q} = 0.5 + 0.5 * freq_{i,q} / freq_{m,q}
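The two normalizations can be sketched as follows. The function names `tf_document` and `tf_query` are illustrative, and the inputs are assumed to be non-empty token lists.

```python
from collections import Counter

def tf_document(tokens):
    """tf_{i,j} = freq_{i,j} / freq_{m,j}, with m the most frequent term."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: count / max_freq for term, count in counts.items()}

def tf_query(tokens):
    """tf_{i,q} = 0.5 + 0.5 * freq_{i,q} / freq_{m,q}."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: 0.5 + 0.5 * count / max_freq for term, count in counts.items()}

# D2 and Q1 from the earlier vector space example
d2_tf = tf_document(["computer", "computer", "information"])
q1_tf = tf_query(["information", "information", "retrieval"])

print(d2_tf)
print(q1_tf)
```

The most frequent term always gets tf 1.0, and the query variant keeps every query term at 0.5 or above.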

  10. tf*idf as weight • Assign a tf * idf weight to each term in each document
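A sketch that combines the two factors into one weight per term per document. It assumes max-normalized tf, idf = log(N / n_i) with the natural logarithm, and a collection made up of only D1 and D2 from the earlier vector space example; `tfidf_weights` is an illustrative name.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: tf * idf} dict per document in `docs`."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_freq = max(counts.values())
        weights.append({
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["computer", "information", "retrieval"],   # D1
    ["computer", "computer", "information"],    # D2
]
w = tfidf_weights(docs)
print(w[0])
print(w[1])
```

In a collection this small, any term that appears in both documents gets weight 0, so only “retrieval” in D1 carries a non-zero weight.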

  11. Exercise • D1 = “computer information retrieval” • D2 = “computer retrieval” • Q = “information, retrieval” • Compute the tf*idf weight for each term in D2 and Q • BREAK

  12. Queries • Single-word queries • Context queries (phrases, proximity) • Boolean queries • Natural Language queries

  13. Pattern Match • Words • Prefixes • Suffixes • Substrings • Ranges • Regular expressions • Structured queries (e.g., XQuery to query XML, Z39.50)
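Several of these pattern types can be sketched against a small word list. The word list and the specific patterns are made up for illustration; structured queries are not covered here.

```python
import re

# Hypothetical index vocabulary
words = ["retrieval", "retrieve", "information", "inform", "computer", "compute"]

prefix = [w for w in words if w.startswith("retriev")]           # prefix match
suffix = [w for w in words if w.endswith("ion")]                 # suffix match
substring = [w for w in words if "put" in w]                     # substring match
in_range = [w for w in words if "d" <= w <= "j"]                 # lexicographic range
regex = [w for w in words if re.fullmatch(r"comput(e|er)", w)]   # regular expression

print(prefix, suffix, substring, in_range, regex, sep="\n")
```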

  14. Boolean Query Processing • The query must be parsed to determine what the following are and how they relate to one another: search words, optional field or index qualifications, Boolean operators • Typical parsing uses lexical analysers (like lex or flex) along with parser generators like yacc, bison or LLgen • These produce code to be compiled into programs.
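The lexical analysis step can be sketched by hand. This is a stand-in for what a lex- or flex-generated scanner would do, not generated code; the token names, the operator set, and the field names in `FIELDS` are assumptions.

```python
import re

OPERATORS = {"and", "or", "not"}          # assumed operator keywords
FIELDS = {"title", "subject", "author"}   # assumed field/index qualifiers

def tokenize(query):
    """Split a Boolean query into (kind, text) tokens."""
    tokens = []
    for text in re.findall(r"\(|\)|[^\s()]+", query):
        if text == "(":
            tokens.append(("LPAREN", text))
        elif text == ")":
            tokens.append(("RPAREN", text))
        elif text.lower() in OPERATORS:
            tokens.append(("OP", text.upper()))
        elif text.lower() in FIELDS:
            tokens.append(("FIELD", text.lower()))
        else:
            tokens.append(("WORD", text))
    return tokens

tokens = tokenize("Title XXX and (Subject YYY or Author ZZZ)")
for token in tokens:
    print(token)
```

A parser would then consume this token stream to work out how the search words, field qualifications, and operators relate to one another.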

  15. Z39.50 Query Structure (ASN.1 Notation)
      -- Query Definitions
      Query ::= CHOICE{
          type-0    [0]   ANY,
          type-1    [1]   IMPLICIT RPNQuery,
          type-2    [2]   OCTET STRING,
          type-100  [100] OCTET STRING,
          type-101  [101] IMPLICIT RPNQuery,
          type-102  [102] OCTET STRING}

  16. Z39.50 RPN Query (ASN.1 Notation)
      -- Definitions for RPN query
      RPNQuery ::= SEQUENCE{
          attributeSet  AttributeSetId,
          rpn           RPNStructure}

  17. RPN Structure
      RPNStructure ::= CHOICE{
          op        [0] Operand,
          rpnRpnOp  [1] IMPLICIT SEQUENCE{
              rpn1  RPNStructure,
              rpn2  RPNStructure,
              op    Operator }
          }

  18. Operand
      Operand ::= CHOICE{
          attrTerm    AttributesPlusTerm,
          resultSet   ResultSetId,
          -- If version 2 is in force:
          --   - If query type is 1, one of the above two must be chosen;
          --   - resultAttr (below) may be used only if query type is 101.
          resultAttr  ResultSetPlusAttributes}

  19. Operator
      Operator ::= [46] CHOICE{
          and      [0] IMPLICIT NULL,
          or       [1] IMPLICIT NULL,
          and-not  [2] IMPLICIT NULL,
          -- If version 2 is in force:
          --   - For query type 1, one of the above three must be chosen;
          --   - prox (below) may be used only if query type is 101.
          prox     [3] IMPLICIT ProximityOperator}
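The recursive shape of RPNStructure can be mirrored with two small Python classes. This is a simplified sketch, not a Z39.50 encoder: the attribute set, result sets, and the proximity operator are left out, and `Operand.index` is an assumed stand-in for the attributes in AttributesPlusTerm.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Operand:
    index: str   # stands in for the attributes of AttributesPlusTerm
    term: str

@dataclass
class RpnRpnOp:
    rpn1: "RPNStructure"
    rpn2: "RPNStructure"
    op: str      # "and", "or", or "and-not"

# An RPNStructure is either a single operand or two structures plus an operator.
RPNStructure = Union[Operand, RpnRpnOp]

def to_postfix(node):
    """List operands and operators in reverse Polish (postfix) order."""
    if isinstance(node, Operand):
        return [f"{node.index}={node.term}"]
    return to_postfix(node.rpn1) + to_postfix(node.rpn2) + [node.op]

query = RpnRpnOp(Operand("title", "XXX"), Operand("subject", "YYY"), "and")
print(to_postfix(query))
```

The operator comes after its two operands, which is what gives the RPN (reverse Polish notation) query its name.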

  20. Parse Result (Query Tree) • Z39.50 queries… • Query: Title XXX and Subject YYY
      Oper: AND
          left:  Operand: Index = Title, Value = XXX
          right: Operand: Index = Subject, Value = YYY

  21. Parse Results • Query: Subject XXX and (title yyy and author zzz)
      Op: AND
          Oper: Index: Subject, Value: XXX
          Op: AND
              Oper: Index: Title, Value: YYY
              Oper: Index: Author, Value: ZZZ
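A query tree like this can be produced by a small hand-written recursive-descent parser. The grammar is an assumption (an operand is either a parenthesized expression or an index name followed by a one-word value, and operators associate to the left), and the capitalization of index names and values is only there to match the notation on the slide.

```python
import re

OPERATORS = {"and", "or"}  # assumed operator keywords

def parse(query):
    """Parse a fielded Boolean query into a nested dict (query tree)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        return tok

    def operand():
        tok = take()
        if tok == "(":
            node = expression()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        # Assumed form: an index name followed by a single-word value.
        return {"index": tok.capitalize(), "value": take().upper()}

    def expression():
        node = operand()
        while peek() is not None and peek().lower() in OPERATORS:
            op = take().upper()
            node = {"op": op, "left": node, "right": operand()}
        return node

    tree = expression()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree

tree = parse("Subject XXX and (title yyy and author zzz)")
print(tree)
```

The parentheses in the query become nesting in the tree: the inner AND node is the right child of the outer AND node.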

  22. Relevance feedback • Popular query reformulation strategy • Used for: query expansion, term re-weighting • Type: manual, automatic • Scope: local, global

  23. Vector Model • D_r: set of relevant documents identified by user • D_n: set of non-relevant documents identified • Vec_q: vector of original query • Vec_q’: vector of expanded query • A common strategy of query reformulation is: Vec_q’ = Vec_q + (sum of the vectors of documents in D_r) - (sum of the vectors of documents in D_n)

  24. Example • Q = “safety minivans” • D1 = “car safety minivans tests injury statistics” - relevant • D2 = “liability tests safety” - relevant • D3 = “car passengers injury reviews” - non-relevant • What should be the reformulated Q’?
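One way to work through this example is to apply the reformulation strategy from the previous slide directly: add the vectors of the relevant documents to the query vector and subtract the vectors of the non-relevant ones. The sketch assumes raw term counts as weights and whitespace tokenization; with tf*idf weights the numbers would differ.

```python
from collections import Counter

def reformulate(query, relevant, non_relevant):
    """Vec_q' = Vec_q + sum(relevant vectors) - sum(non-relevant vectors)."""
    new_query = Counter(query)
    for doc in relevant:
        new_query.update(doc)      # add term counts
    for doc in non_relevant:
        new_query.subtract(doc)    # subtract term counts (may go negative)
    return dict(new_query)

q = "safety minivans".split()
d1 = "car safety minivans tests injury statistics".split()   # relevant
d2 = "liability tests safety".split()                         # relevant
d3 = "car passengers injury reviews".split()                  # non-relevant

q_new = reformulate(q, [d1, d2], [d3])
for term, weight in sorted(q_new.items(), key=lambda item: -item[1]):
    print(term, weight)
```

Under these assumptions “safety” ends up with the highest weight, new terms such as “tests” enter the query, and terms seen only in the non-relevant document get negative weights. Many systems would drop or clip the negative weights, which the formula as given does not do.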
