ISP433/633 Week 3 Query Structure and Query Operations
Outline • More on weight assignment • Exercise (as a review) • BREAK • Query structure • Query operations • Boolean query parse • Vector query reformulation
Vector Space Model Example • D1 = “computer information retrieval” • D2 = “computer, computer information” • Q1 = “information, information retrieval” • Each is represented as a vector of term weights
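The vectors for this example can be sketched as follows. This is a minimal illustration that assumes raw term counts as weights and a simplistic tokenizer; the names `tokenize`, `to_vector`, and `vocabulary` are hypothetical and not from the slides.

```python
from collections import Counter

def tokenize(text):
    # Simplistic tokenizer (an assumption): lowercase, drop commas, split on spaces.
    return text.lower().replace(",", " ").split()

texts = {
    "D1": "computer information retrieval",
    "D2": "computer, computer information",
    "Q1": "information, information retrieval",
}

# Shared vocabulary, sorted so every vector uses the same term order.
vocabulary = sorted({term for text in texts.values() for term in tokenize(text)})

def to_vector(text):
    # One raw-count entry per vocabulary term (0 when the term is absent).
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

vectors = {name: to_vector(text) for name, text in texts.items()}

print(vocabulary)
for name, vector in vectors.items():
    print(name, vector)
```

Raw counts are only the starting point; the following slides replace them with normalized and idf-scaled weights.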
More on Weight Assignment • Boolean model: binary weight • Term frequency (freq) as weight • Raw frequency of a term inside a document • Problems with using raw freq • Zipf’s law • Non-distinguishing terms have high frequency • Document length matters
Zipf’s law • Rank × Frequency ≈ Constant
Zipf’s law • [Figure: rank-frequency plots on a linear scale and on a log scale]
Inverse Document Frequency (idf) • Log of the inverse of the proportion of documents in the collection that contain the term • Deals with Zipf’s law • idf_i = log(N / n_i) • N: total number of documents in the collection • n_i: number of documents that contain term i
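A minimal sketch of the idf formula, applied to the two documents of the earlier vector space example. The slide does not state the logarithm base; base 2 is an assumption here (it only rescales the values), and the function name `idf` is illustrative.

```python
import math

def idf(term, documents, base=2):
    # N: total number of documents; n_i: documents containing the term.
    n_total = len(documents)
    n_with_term = sum(1 for doc in documents if term in doc)
    if n_with_term == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(n_total / n_with_term, base)

documents = [
    ["computer", "information", "retrieval"],   # D1
    ["computer", "computer", "information"],    # D2
]

for term in ("computer", "information", "retrieval"):
    print(term, idf(term, documents))
```

"retrieval" occurs in only one of the two documents and gets the highest value, while terms found in every document get an idf of 0.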
Benefit of idf • idf provides high values for rare words and low values for common words
Normalized Term Frequency (tf) • Deals with document length • Let m be the most frequent term in document j, with raw frequency freq_{m,j} • Term i’s normalized term frequency in document j: • tf_{i,j} = freq_{i,j} / freq_{m,j} • For a query q: • tf_{i,q} = 0.5 + 0.5 * freq_{i,q} / freq_{m,q}
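The two normalizations can be sketched as below, using D2 and Q1 from the earlier vector space example. The function names are hypothetical, and applying the query formula only to terms that actually occur in the query is an assumption.

```python
from collections import Counter

def document_tf(terms):
    # tf = raw frequency divided by the frequency of the most frequent term.
    freq = Counter(terms)
    max_freq = max(freq.values())
    return {term: count / max_freq for term, count in freq.items()}

def query_tf(terms):
    # Query variant: 0.5 + 0.5 * (raw frequency / max frequency).
    freq = Counter(terms)
    max_freq = max(freq.values())
    return {term: 0.5 + 0.5 * count / max_freq for term, count in freq.items()}

d2_tf = document_tf(["computer", "computer", "information"])
q1_tf = query_tf(["information", "information", "retrieval"])

print(d2_tf)
print(q1_tf)
```

The query variant keeps every query term's tf between 0.5 and 1, so a term mentioned once is not overwhelmed by a term that is repeated.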
tf*idf as weight • Assign a tf * idf weight to each term in each document
Exercise • D1 = “computer information retrieval” • D2 = “computer retrieval” • Q = “information, retrieval” • Compute the tf*idf weight for each term in D2 and Q
BREAK
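One way to check the exercise answers is the sketch below. It assumes a base-2 logarithm, a collection consisting of D1 and D2 only (N = 2), and query weights computed only for terms present in the query; none of these choices is stated on the slides, and the function names are illustrative.

```python
import math
from collections import Counter

documents = {
    "D1": ["computer", "information", "retrieval"],
    "D2": ["computer", "retrieval"],
}
query = ["information", "retrieval"]

def idf(term):
    # idf_i = log2(N / n_i); assumes the term occurs in at least one document.
    n_total = len(documents)
    n_with_term = sum(1 for terms in documents.values() if term in terms)
    return math.log2(n_total / n_with_term)

def tf_idf(terms, is_query=False):
    freq = Counter(terms)
    max_freq = max(freq.values())
    weights = {}
    for term, count in freq.items():
        tf = count / max_freq
        if is_query:
            tf = 0.5 + 0.5 * tf   # query normalization from the tf slide
        weights[term] = tf * idf(term)
    return weights

d2_weights = tf_idf(documents["D2"])
q_weights = tf_idf(query, is_query=True)

print(d2_weights)
print(q_weights)
```

Under these assumptions both terms of D2 get weight 0, because "computer" and "retrieval" occur in every document of the collection; only "information" in Q receives a non-zero weight.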
Queries • Single-word queries • Context queries • Phrases • Proximity • Boolean queries • Natural Language queries
Pattern Match • Words • Prefixes • Suffixes • Substrings • Ranges • Regular expressions • Structured queries (e.g., XQuery to query XML, Z39.50)
Boolean Query Processing • The query must be parsed to determine: • The search words • Optional field or index qualifications • The Boolean operators • How all of these relate to one another • Typical parsing uses lexical analysers (such as lex or flex) together with parser generators such as yacc, bison, or LLgen • These tools generate source code that is compiled into the program
Z39.50 Query Structure (ASN.1 Notation)
-- Query Definitions
Query ::= CHOICE{
    type-0    [0]   ANY,
    type-1    [1]   IMPLICIT RPNQuery,
    type-2    [2]   OCTET STRING,
    type-100  [100] OCTET STRING,
    type-101  [101] IMPLICIT RPNQuery,
    type-102  [102] OCTET STRING}
Z39.50 RPN Query (ASN.1 Notation)
-- Definitions for RPN query
RPNQuery ::= SEQUENCE{
    attributeSet  AttributeSetId,
    rpn           RPNStructure}
RPN Structure
RPNStructure ::= CHOICE{
    op        [0] Operand,
    rpnRpnOp  [1] IMPLICIT SEQUENCE{
                  rpn1  RPNStructure,
                  rpn2  RPNStructure,
                  op    Operator }
    }
Operand
Operand ::= CHOICE{
    attrTerm    AttributesPlusTerm,
    resultSet   ResultSetId,
        -- If version 2 is in force:
        --   - If query type is 1, one of the above two must be chosen;
        --   - resultAttr (below) may be used only if query type is 101.
    resultAttr  ResultSetPlusAttributes}
Operator
Operator ::= [46] CHOICE{
    and      [0] IMPLICIT NULL,
    or       [1] IMPLICIT NULL,
    and-not  [2] IMPLICIT NULL,
        -- If version 2 is in force:
        --   - For query type 1, one of the above three must be chosen;
        --   - prox (below) may be used only if query type is 101.
    prox     [3] IMPLICIT ProximityOperator}
Parse Result (Query Tree) • Z39.50 queries… • Title XXX and Subject YYY
Oper: AND
    left:  Operand: Index = Title, Value = XXX
    right: Operand: Index = Subject, Value = YYY
Parse Results • Subject XXX and (title yyy and author zzz)
Op: AND
    Oper: Index: Subject, Value: XXX
    Op: AND
        Oper: Index: Title, Value: YYY
        Oper: Index: Author, Value: ZZZ
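The query trees above can be reproduced with a small hand-written parser. This is only a sketch: it uses recursive descent rather than the lex/yacc tooling named earlier, assumes every operand has the form `<index> <value>`, treats all operators as left-associative with equal precedence, and uses hypothetical class and function names.

```python
import re
from dataclasses import dataclass
from typing import Union

OPERATORS = {"and", "or", "not"}

@dataclass
class Operand:
    index: str
    value: str

@dataclass
class Operation:
    op: str
    left: "Node"
    right: "Node"

Node = Union[Operand, Operation]

def tokenize(query):
    # Parentheses are separate tokens; everything else splits on whitespace.
    return re.findall(r"\(|\)|[^\s()]+", query)

def parse(query):
    tokens = tokenize(query)
    node, pos = _parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return node

def _parse_expr(tokens, pos):
    # expr := term (OPERATOR term)*
    left, pos = _parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos].lower() in OPERATORS:
        op = tokens[pos].upper()
        right, pos = _parse_term(tokens, pos + 1)
        left = Operation(op, left, right)
    return left, pos

def _parse_term(tokens, pos):
    # term := "(" expr ")" | INDEX VALUE
    if pos >= len(tokens):
        raise ValueError("unexpected end of query")
    if tokens[pos] == "(":
        node, pos = _parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        return node, pos + 1
    if pos + 1 >= len(tokens):
        raise ValueError("expected '<index> <value>'")
    return Operand(index=tokens[pos].lower(), value=tokens[pos + 1]), pos + 2

tree = parse("Subject XXX and (title yyy and author zzz)")
print(tree)
```

The nested `Operation` on the right-hand side corresponds to the parenthesized sub-query, which matches the shape of the second tree above.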
Relevance feedback • Popular query reformulation strategy • Used for: query expansion, term re-weighting • Type: manual, automatic • Scope: local, global
Vector Model • Dr: set of relevant documents identified by user • Dn: set of non-relevant documents identified • Vecq: vector of original query • Vecq’: vector of expanded query • A common strategy of query reformulation is: • Vecq’ = Vecq + (sum of VecDr) – (sum of VecDn)
Example • Q = “safety minivans” • D1 = “car safety minivans tests injury statistics” - relevant • D2 = “liability tests safety” - relevant • D3 = “car passengers injury reviews” - non-relevant • What should be the reformulated Q’?
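A possible answer can be worked out with the strategy Vecq’ = Vecq + (sum of VecDr) – (sum of VecDn). The sketch below assumes raw term counts as vector weights and no scaling coefficients on the three parts; the slides do not specify either, and the function names are illustrative.

```python
from collections import Counter

def term_vector(text):
    # Raw term counts as weights (an assumption; tf*idf weights could be used instead).
    return Counter(text.lower().split())

def reformulate(query, relevant, non_relevant):
    new_query = Counter(query)
    for doc in relevant:
        new_query.update(doc)      # add each relevant document vector
    for doc in non_relevant:
        new_query.subtract(doc)    # subtract; zero and negative weights are kept
    return dict(new_query)

q = term_vector("safety minivans")
d1 = term_vector("car safety minivans tests injury statistics")   # relevant
d2 = term_vector("liability tests safety")                        # relevant
d3 = term_vector("car passengers injury reviews")                 # non-relevant

q_new = reformulate(q, relevant=[d1, d2], non_relevant=[d3])
for term, weight in sorted(q_new.items(), key=lambda item: -item[1]):
    print(term, weight)
```

Under these assumptions "safety" ends up with the highest weight (3), "tests" enters the query through expansion, "car" and "injury" cancel out to 0, and "passengers" and "reviews" receive negative weights.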