ISP433/633 Week 3

ISP433/633 Week 3 Query Structure and Query Operations

Outline • More on weight assignment • Exercise (as a review) • BREAK • Query structure • Query operations • Boolean query parse • Vector query reformulation

Vector Space Model Example • D1 = “computer information retrieval” • D2 = “computer, computer information” • Q1 = “information, information retrieval” • Each document and the query is represented as a term vector
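The vectors for this example can be sketched as follows. This is a minimal illustration that assumes raw term counts as the vector components and an alphabetical term order; the helper names `tokenize` and `to_vector` are hypothetical, not part of the slides.

```python
import re
from collections import Counter

# Documents and query taken from the slide.
docs = {
    "D1": "computer information retrieval",
    "D2": "computer, computer information",
    "Q1": "information, information retrieval",
}

def tokenize(text):
    # Lowercase and keep alphabetic runs only, so commas are dropped.
    return re.findall(r"[a-z]+", text.lower())

# One fixed term order shared by every vector (assumption: alphabetical).
vocabulary = sorted({t for text in docs.values() for t in tokenize(text)})

def to_vector(text):
    # Component k is the raw count of vocabulary[k] in the text.
    counts = Counter(tokenize(text))
    return [counts[t] for t in vocabulary]

vectors = {name: to_vector(text) for name, text in docs.items()}
print(vocabulary)
for name, vec in vectors.items():
    print(name, vec)
```

With the term order computer, information, retrieval, D1 becomes [1, 1, 1], D2 becomes [2, 1, 0], and Q1 becomes [0, 2, 1].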

More on Weight Assignment • Boolean Model: binary weight • Term frequency as weight (freq) • Raw frequency of a term inside a document • Problems with using raw freq • Zipf’s law • Non-distinguishing terms have high frequency • Document length matters

Zipf’s law • Rank × Frequency ≈ Constant

Zipf’s law • [Figure: rank vs. frequency, plotted on a linear scale and on a log scale]

Inverse Document Frequency (idf) • Inverse of the proportion of documents that have the term among all the documents in the collection • Deals with Zipf’s law • idf_i = log(N / n_i) • N: total number of documents in the collection • n_i: the number of documents that have term i
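A minimal sketch of the idf formula above. The slide does not state the base of the logarithm, so natural log is an assumption here, and the three-document collection is hypothetical.

```python
import math

def idf(term, documents):
    """idf_i = log(N / n_i); natural log is an assumption."""
    n_docs = len(documents)                                   # N
    n_with_term = sum(1 for doc in documents if term in doc)  # n_i
    if n_with_term == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log(n_docs / n_with_term)

# Hypothetical collection; each document is reduced to its set of terms.
collection = [
    {"computer", "information", "retrieval"},
    {"computer", "retrieval"},
    {"computer", "science"},
]

print(idf("computer", collection))  # term occurs in all 3 documents
print(idf("science", collection))   # term occurs in 1 of 3 documents
```

A term that appears in every document gets idf 0, while the term found in only one document gets log(3).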

Benefit of idf • idf provides high values for rare words and low values for common words

Normalized Term Frequency (tf) • Deals with document length • The most frequent term m in document j has frequency freq_{m,j} • Term i’s normalized term frequency in document j: • tf_{i,j} = freq_{i,j} / freq_{m,j} • For the query: • tf_{i,q} = 0.5 + 0.5 * freq_{i,q} / freq_{m,q}
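The two normalizations above can be sketched as follows. The function names are hypothetical, and the inputs reuse D2 and Q1 from the earlier vector space example; both functions assume a non-empty term list.

```python
from collections import Counter

def normalized_tf(terms):
    # tf_{i,j} = freq_{i,j} / freq_{m,j}, m being the most frequent term.
    counts = Counter(terms)
    max_freq = max(counts.values())
    return {t: c / max_freq for t, c in counts.items()}

def query_tf(terms):
    # tf_{i,q} = 0.5 + 0.5 * freq_{i,q} / freq_{m,q}
    counts = Counter(terms)
    max_freq = max(counts.values())
    return {t: 0.5 + 0.5 * c / max_freq for t, c in counts.items()}

doc = ["computer", "computer", "information"]          # D2
query = ["information", "information", "retrieval"]    # Q1

print(normalized_tf(doc))
print(query_tf(query))
```

For the document, "computer" gets 1.0 and "information" 0.5; for the query, "information" gets 1.0 and "retrieval" 0.75, so query terms never drop below 0.5.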

tf*idf as weight • Assign a tf * idf weight to each term in each document
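As a rough illustration, the tf and idf definitions can be combined into one weight per term per document. The sketch below assumes natural log for idf and uses a two-document collection taken from the vector space example; `tf_idf_weights` is a hypothetical helper name.

```python
import math
from collections import Counter

docs = {
    "D1": ["computer", "information", "retrieval"],
    "D2": ["computer", "computer", "information"],
}

def tf_idf_weights(docs):
    n_docs = len(docs)  # N
    # n_i: number of documents that contain term i.
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    weights = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        max_freq = max(counts.values())  # freq of the most frequent term
        weights[name] = {
            # normalized tf times idf (natural log assumed)
            t: (c / max_freq) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        }
    return weights

weights = tf_idf_weights(docs)
for name, w in weights.items():
    print(name, w)
```

In this tiny collection "computer" and "information" occur in both documents, so their idf and therefore their weights are 0; only "retrieval" in D1 receives a non-zero weight, log(2).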

Exercise • D1 = “computer information retrieval” • D2 = “computer retrieval” • Q = “information, retrieval” • Compute the tf*idf weight for each term in D2 and Q

BREAK

Queries • Single-word queries • Context queries • Phrases • Proximity • Boolean queries • Natural Language queries

Pattern Match • Words • Prefixes • Suffixes • Substrings • Ranges • Regular expressions • Structured queries (e.g., XQuery to query XML, Z39.50)
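Several of the pattern types listed above can be illustrated against a small word list. The word list and the specific patterns are made up for this sketch, and the range match is assumed to be numeric.

```python
import re

# Hypothetical index terms.
words = ["retrieve", "retrieval", "information", "inform", "preform", "2003", "1998"]

# Prefix match: terms beginning with "retriev".
prefix = [w for w in words if w.startswith("retriev")]

# Suffix match: terms ending with "form".
suffix = [w for w in words if w.endswith("form")]

# Substring match: terms containing "form" anywhere.
substring = [w for w in words if "form" in w]

# Range match (assumed numeric): values between 2000 and 2005.
in_range = [w for w in words if w.isdigit() and 2000 <= int(w) <= 2005]

# Regular expression match: "retrieve" or "retrieval", whole term only.
regex = [w for w in words if re.fullmatch(r"retriev(e|al)", w)]

print(prefix, suffix, substring, in_range, regex, sep="\n")
```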

Boolean Query Processing • The query must be parsed to determine what the following are and how they relate to one another: • Search words • Optional field or index qualifications • Boolean operators • Typical parsing uses lexical analysers (like lex or flex) along with parser generators like yacc, bison, or LLgen • These produce code to be compiled into programs.
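The lexical analysis step can be sketched by hand in a few lines, standing in for what a generated lexer would do. The token categories, the operator set, and the field names below are assumptions for illustration only.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}          # assumed operator keywords
FIELDS = {"title", "subject", "author"}   # hypothetical index names

def tokenize(query):
    """Split a Boolean query into (kind, text) tokens."""
    tokens = []
    # A token is either a single parenthesis or a run of other non-space chars.
    for text in re.findall(r"[()]|[^\s()]+", query):
        if text == "(":
            tokens.append(("LPAREN", text))
        elif text == ")":
            tokens.append(("RPAREN", text))
        elif text.upper() in OPERATORS:
            tokens.append(("OP", text.upper()))
        elif text.lower() in FIELDS:
            tokens.append(("FIELD", text.lower()))
        else:
            tokens.append(("WORD", text))
    return tokens

print(tokenize("Title XXX and Subject YYY"))
```

A parser would then consume this token stream to work out how the search words, field qualifications, and operators relate to one another.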

Z39.50 Query Structure (ASN.1 Notation)
-- Query Definitions
Query ::= CHOICE{
    type-0    [0]    ANY,
    type-1    [1]    IMPLICIT RPNQuery,
    type-2    [2]    OCTET STRING,
    type-100  [100]  OCTET STRING,
    type-101  [101]  IMPLICIT RPNQuery,
    type-102  [102]  OCTET STRING}

Z39.50 RPN Query (ASN.1 Notation)
-- Definitions for RPN query
RPNQuery ::= SEQUENCE{
    attributeSet  AttributeSetId,
    rpn           RPNStructure}

RPN Structure
RPNStructure ::= CHOICE{
    op        [0]  Operand,
    rpnRpnOp  [1]  IMPLICIT SEQUENCE{
        rpn1  RPNStructure,
        rpn2  RPNStructure,
        op    Operator }
    }

Operand
Operand ::= CHOICE{
    attrTerm    AttributesPlusTerm,
    resultSet   ResultSetId,
        -- If version 2 is in force:
        --   - If query type is 1, one of the above two must be chosen;
        --   - resultAttr (below) may be used only if query type is 101.
    resultAttr  ResultSetPlusAttributes}

Operator
Operator ::= [46] CHOICE{
    and      [0]  IMPLICIT NULL,
    or       [1]  IMPLICIT NULL,
    and-not  [2]  IMPLICIT NULL,
        -- If version 2 is in force:
        --   - For query type 1, one of the above three must be chosen;
        --   - prox (below) may be used only if query type is 101.
    prox     [3]  IMPLICIT ProximityOperator}
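To make the recursive shape of RPNStructure concrete, it can be mirrored with two small Python classes. This is a heavily simplified sketch: `Operand` here holds only a hypothetical index name and a value in place of AttributesPlusTerm, result sets and the proximity operator are omitted, and none of this is a real Z39.50 encoding.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Operand:
    index: str   # stands in for the attributes of AttributesPlusTerm
    value: str   # the search term

@dataclass
class RpnOp:
    rpn1: "RPNStructure"   # first sub-structure
    rpn2: "RPNStructure"   # second sub-structure
    op: str                # "and", "or", or "and-not"

# RPNStructure is a CHOICE between a bare operand and an operator node.
RPNStructure = Union[Operand, RpnOp]

def to_postfix(node):
    """Flatten the structure in reverse Polish (operands first) order."""
    if isinstance(node, Operand):
        return [f"{node.index}={node.value}"]
    return to_postfix(node.rpn1) + to_postfix(node.rpn2) + [node.op]

query = RpnOp(Operand("Title", "XXX"), Operand("Subject", "YYY"), "and")
print(to_postfix(query))
```

The field order rpn1, rpn2, op in the ASN.1 SEQUENCE is what gives the query its reverse Polish form: both operands precede the operator.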

Parse Result (Query Tree) • Z39.50 queries… • Title XXX and Subject YYY
    Oper: AND
        left:  Operand: Index = Title, Value = XXX
        right: Operand: Index = Subject, Value = YYY

Parse Results • Subject XXX and (title yyy and author zzz)
    Op: AND
        Oper: Index: Subject, Value: XXX
        Op: AND
            Oper: Index: Title, Value: YYY
            Oper: Index: Author, Value: ZZZ
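A small hand-written recursive-descent parser can produce this kind of tree. The grammar below (an expression is a sequence of terms joined by AND or OR, and a term is either a parenthesized expression or an index name followed by one value) is an assumption made for this sketch; a real system would more likely generate the parser with the tools mentioned earlier.

```python
import re

def parse(query):
    """Parse 'index value AND (index value ...)' into a nested dict tree."""
    tokens = re.findall(r"[()]|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise ValueError("unexpected end of query")
        pos += 1
        return tok

    def parse_term():
        # term := "(" expr ")" | index value
        if peek() == "(":
            advance()
            node = parse_expr()
            if advance() != ")":
                raise ValueError("expected ')'")
            return node
        index = advance()
        value = advance()
        # Case is normalized only to match the slide's rendering of the tree.
        return {"index": index.capitalize(), "value": value.upper()}

    def parse_expr():
        # expr := term (("AND" | "OR") term)*, left-associative
        node = parse_term()
        while peek() is not None and peek().upper() in ("AND", "OR"):
            op = advance().upper()
            node = {"op": op, "left": node, "right": parse_term()}
        return node

    tree = parse_expr()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree

tree = parse("Subject XXX and (title yyy and author zzz)")
print(tree)
```

The parentheses in the query become nesting in the tree: the inner AND node ends up as the right child of the outer AND node.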

Relevance feedback • Popular query reformulation strategy • Used for • Query expansion • Term re-weighting • Type • Manual • Automatic • Scope • Local • Global

Vector Model • Dr: set of relevant documents identified by user • Dn: set of non-relevant documents identified • Vecq: vector of original query • Vecq’: vector of expanded query • A common strategy of query reformulation is: • Vecq’ = Vecq + (sum of VecDr) – (sum of VecDn)

Example • Q = “safety minivans” • D1 = “car safety minivans tests injury statistics” - relevant • D2 = “liability tests safety” - relevant • D3 = “car passengers injury reviews” - non-relevant • What should be the reformulated Q’?
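One way to work through this example is to apply the reformulation strategy Vecq’ = Vecq + (sum of VecDr) – (sum of VecDn) directly. The sketch below assumes raw term counts as the vector weights and no scaling coefficients on the three parts, which is the simplest reading of the formula and not necessarily the weighting intended for the exercise.

```python
from collections import Counter

def reformulate(query, relevant, non_relevant):
    """Vecq' = Vecq + sum(VecDr) - sum(VecDn), with raw term counts as weights."""
    new_q = Counter(query.split())
    for doc in relevant:
        new_q.update(doc.split())      # add relevant document vectors
    for doc in non_relevant:
        new_q.subtract(doc.split())    # subtract non-relevant document vectors
    return dict(new_q)

q = "safety minivans"
d1 = "car safety minivans tests injury statistics"   # relevant
d2 = "liability tests safety"                        # relevant
d3 = "car passengers injury reviews"                 # non-relevant

q_new = reformulate(q, [d1, d2], [d3])
for term, weight in sorted(q_new.items(), key=lambda kv: -kv[1]):
    print(term, weight)
```

Under these assumptions "safety" rises to 3, "minivans" and "tests" to 2, "car" and "injury" cancel out to 0, and "passengers" and "reviews" end up at -1; negative weights are often clipped to 0 in practice.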
