DYNAMIC ELEMENT RETRIEVAL IN A STRUCTURED ENVIRONMENT MAYURI UMRANIKAR
CONTENTS • Introduction • Retrieval Environment - The Vector Space Model - INEX Environment - Flexible Retrieval System • Method Used for Retrieval - Document Tree Construction - Ranking of Elements - Output • Experiments • Conclusions
INTRODUCTION • Extensible Markup Language (XML) is a preferred format for representing documents; as the number of XML documents grows, the issue of element retrieval arises • Focus is on retrieving relevant elements rather than entire documents • INEX – INitiative for the Evaluation of XML Retrieval • Flexible mechanisms • Different approaches • Term weighting
RETRIEVAL ENVIRONMENT • 2 factors – the issues that arise when the focus moves from documents to components, and Salton’s Vector Space Model • Vector Space Model – a term's weight reflects the number of times it occurs in the document • Fox’s Extended Vector Space Model – incorporates objective identifiers • A document vector consists of subvectors • Subvectors contain text that is independently indexed, weighted, searched and retrieved • Term weighting – applied within the subjective subvectors • Smart experimental retrieval system
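The extended vector idea can be pictured with a minimal sketch. It assumes a document is held as a dictionary with one term-frequency subvector per subjective field and the objective identifiers stored verbatim. The function names, the field names (`title`, `body`, `year`) and the sample values are hypothetical and are not taken from the Smart system.

```python
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercased, whitespace-separated token occurs."""
    return Counter(text.lower().split())


def build_extended_vector(subjective_fields, objective_fields):
    """Build one extended vector.

    Each subjective field becomes its own term-frequency subvector, so it
    can be indexed and weighted independently. Objective identifiers are
    kept as plain values that can later act as filters.
    """
    return {
        "subjective": {
            name: term_frequencies(text)
            for name, text in subjective_fields.items()
        },
        "objective": dict(objective_fields),
    }


# Hypothetical document: field names and values are illustrative only.
doc = build_extended_vector(
    {"title": "XML element retrieval",
     "body": "retrieval of XML elements and XML documents"},
    {"year": "2005"},
)
```

Keeping the subvectors separate is what lets a query be matched against the body alone, or filtered by an objective field, without re-indexing.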
INEX ENVIRONMENT • Content Only (CO) – queries ignore document structure; like typical queries, they specify only the content of the search • Content and Structure (CAS) – queries explicitly refer to structure; exhaustive and specific • CO query results go directly to the user; CAS requires additional filtering and a search of the body portion • CAS returns a rank-ordered list of elements • INEX-EVAL – uses measures of recall and precision (fig.: exhaustivity and specificity mapped to a single relevance value); results are ranked
FLEXIBLE RETRIEVAL SYSTEM • Smart format – documents and topics are translated and indexed as extended vectors • Subjective vectors – contain content-bearing terms • Objective vectors – serve as filters on the results returned by CAS queries • Extended vector – subjective vector, terms having a paragraph in body subvector • Lnu-ltu weighting • Dynamic flexible retrieval – tree representation; rank-ordered list by Lnu weights
METHOD FOR FLEXIBLE RETRIEVAL • Input – given a query Q and a paragraph index, retrieve a rank-ordered list of paragraphs (terminal nodes) • The n top-ranked paragraphs are selected as input • The set of paragraphs is used to identify documents; their elements are generated and returned as output • Document tree – needs information about the document structure - Terminal nodes - Pre-order traversal - Terminal nodes found in the paragraph index
CONSTRUCTION OF DOCUMENT TREE • For query Q, the n top-ranked paragraphs are used to build trees • Leaf elements (terminal nodes) are paragraph nodes • Each leaf is represented by a term-frequency weighted vector • 1st – gather all leaf nodes; the terminal nodes are then complete • 2nd – merge the children's vectors to form each parent • The document schema determines the merging • Parent – holds the unique terms of its children; the term-frequency weighted parent vector has the content of its children • The process is applied recursively
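The bottom-up merging described on this slide can be sketched as follows. This is a minimal illustration, assuming each leaf already carries a term-frequency vector and that a parent's vector is the sum of its children's vectors. The `Node` class, the `merge_up` function and the sample XPaths are hypothetical. In the actual system the document schema decides which children are merged into which parent.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    xpath: str                                    # element identifier
    children: list = field(default_factory=list)
    tf: Counter = field(default_factory=Counter)  # term-frequency vector


def merge_up(node):
    """Recursively fill in parent vectors from their children.

    Leaves keep the vector they were given. Every internal node gets the
    sum of its children's term frequencies, so it holds each unique term
    of its subtree exactly once.
    """
    if node.children:
        node.tf = Counter()
        for child in node.children:
            node.tf.update(merge_up(child))
    return node.tf


# Two paragraph leaves under one section (illustrative data).
p1 = Node("/article/bdy/sec[1]/p[1]", tf=Counter({"xml": 2, "retrieval": 1}))
p2 = Node("/article/bdy/sec[1]/p[2]", tf=Counter({"xml": 1, "index": 3}))
sec = Node("/article/bdy/sec[1]", children=[p1, p2])
merge_up(sec)
```

After the call, `sec.tf` contains the combined counts of both paragraphs, which is the raw material the Lnu weighting step needs.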
RANKING OF ELEMENTS • The set of elements of the document tree has been generated • Problem – structured retrieval must produce a rank-ordered list of elements • Method used for comparison – all-element index (a separate representation for each element of each document, with weighting information) • Lnu weights – suit elements of variable length and do not require global frequency information • Length normalization – failing to normalize results in biased values • Pivot – the document length at which the probability of relevance equals the probability of retrieval • Slope – the amount of tilting about the pivot • Pivoted normalization – reduces the difference between the two probabilities • Lnu term weights: ((1+log(term_freq))/(1+log(avg_term_freq)))/((1-slope)+slope*((no_unique_terms)/pivot))
• ltu weighting – N is the collection size, nk the number of elements containing the term: ((1+log(term_freq))*log(N/nk))/((1-slope)+slope*((no_unique_terms)/pivot)) • N and nk are element dependent and should be known through indexing • As we move up the tree, N is obtained by counting the elements of each type • nk – obtained from the inverted file entry in the paragraph index, by mapping identifiers to XPaths (given)
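The two weighting schemes can be written out as a short sketch. It assumes natural logarithms and applies log(N/nk) as a multiplicative idf factor, which is the usual SMART convention for ltu. The function and parameter names are illustrative, and the slope and pivot values in the example are placeholders rather than tuned settings.

```python
import math


def pivoted_norm(num_unique_terms, slope, pivot):
    """Pivoted unique-term normalization shared by Lnu and ltu."""
    return (1.0 - slope) + slope * (num_unique_terms / pivot)


def lnu_weight(term_freq, avg_term_freq, num_unique_terms, slope, pivot):
    """Lnu weight of one term in one element (needs no collection statistics)."""
    log_tf = (1.0 + math.log(term_freq)) / (1.0 + math.log(avg_term_freq))
    return log_tf / pivoted_norm(num_unique_terms, slope, pivot)


def ltu_weight(term_freq, collection_size, num_containing,
               num_unique_terms, slope, pivot):
    """ltu weight of one query term; collection_size is N, num_containing is nk."""
    idf = math.log(collection_size / num_containing)
    return ((1.0 + math.log(term_freq)) * idf) / pivoted_norm(
        num_unique_terms, slope, pivot)


# Placeholder parameters: slope=0.2 and pivot=10 are assumptions.
element_w = lnu_weight(term_freq=1, avg_term_freq=1,
                       num_unique_terms=10, slope=0.2, pivot=10)
query_w = ltu_weight(term_freq=1, collection_size=100, num_containing=10,
                     num_unique_terms=10, slope=0.2, pivot=10)
```

When an element has exactly `pivot` unique terms, the normalization factor is 1 and the weight is left unchanged. Longer elements are scaled down and shorter ones scaled up.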
OUTPUT OF FLEXIBLE RETRIEVAL • Select another leaf node, gather its siblings, construct the document tree, calculate Lnu term weights and correlate them with the ltu-weighted query to produce another rank-ordered list • After the n top-ranked paragraphs are exhausted and the last list is produced, merge the lists • Result – a single set of elements, rank-ordered by correlation with Q • Comparison – flexible retrieval and the all-element index give identical results when the set of n paragraphs input to flexible retrieval includes all paragraphs and the same values are used for Lnu-ltu
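The final scoring and merging step might look like the sketch below. It assumes the correlation is an inner product between an element's Lnu-weighted vector and the ltu-weighted query vector, and that each tree yields a list of `(xpath, score)` pairs. The function names and sample XPaths are hypothetical. Keeping the best score for a repeated XPath is a defensive assumption, since elements from distinct trees should not normally collide.

```python
def inner_product(element_weights, query_weights):
    """Correlate an element vector with the query vector over shared terms."""
    return sum(weight * query_weights[term]
               for term, weight in element_weights.items()
               if term in query_weights)


def merge_ranked_lists(ranked_lists):
    """Merge per-tree ranked lists into one list sorted by descending score."""
    best = {}
    for ranked in ranked_lists:
        for xpath, score in ranked:
            if xpath not in best or score > best[xpath]:
                best[xpath] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Illustrative per-tree results for two documents.
list_a = [("/a/sec[1]", 0.9), ("/a/sec[1]/p[1]", 0.4)]
list_b = [("/b/sec[2]", 0.7)]
merged = merge_ranked_lists([list_a, list_b])
```

The merged list interleaves elements from different documents purely by score, which is what allows a section from one document to outrank a paragraph from another.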
EXPERIMENTS • Paragraph index – the result is a set of extended vectors representing the paragraphs • CO – each subvector represents a subjective portion; the body subvector is the important one, since CO queries concern the content of an element and not its type • Tree representation
FACTORS OF INTEREST • Slope and pivot for Lnu-ltu • Needed for effective structured retrieval • Can be determined empirically and applied from one collection to another; generic • n – the number of paragraphs input; sets an upper bound on the number of trees per query • The actual number of trees depends on how many paragraphs belong to the same group or the same document
EXPERIMENTS DONE • All-element and dynamic/flexible retrieval experiments and results – body-only retrieval • Correlation between element and query vectors is computed over body elements only • Table 1
RESULTS • Tables
• Results are equivalent • Flexible retrieval is more efficient in file space; the time required for indexing is half • Dynamic retrieval costs more on a per-query basis, depending on n; the total number of trees required cannot be specified exactly in advance • Another factor – the value of nk
DISCUSSIONS AND CONCLUSIONS • Flexible retrieval dynamically produces a rank-ordered list of elements from a single indexing at the level of the basic indexing node (paragraph) • Basic functions – SMART; extended vector model • Results – demonstrate the flexible capabilities • Attempt to incorporate other subvectors, internal nodes and weights • INEX – exhaustivity and specificity; the results are exhaustive; research on specificity is ongoing; the results reflect this • It is a better way of retrieval than all-element indexing