Hierarchical Indexing and Flexible Element Retrieval for Structured Document

Hang Cui, School of Computing, NUS
Ji-Rong Wen, Microsoft Research Asia
Tat-Seng Chua, School of Computing, NUS

Presentation for ECIR’03, Pisa, Italy

Outline
• Motivations and problems
• Hierarchical index propagation and pruning
• Flexible element retrieval
• Evaluation
• Conclusions

Motivations
• More structured and semi-structured documents are appearing on the Web.
• Users want to explore more of the document structure.
  • Access only the relevant parts of a document, i.e. sections or paragraphs.
• Traditional IR can’t help.
  • The document is the smallest result unit.
  • This is not question answering!
  • It can’t provide views of the internal document structure.

Encarta Articles – An Example
• Online encyclopedia.
• Well-structured XML documents.
• Nodes (elements): documents, sections and paragraphs (leaf nodes).
• Text is contained in paragraphs, which constitute sections and documents.

Problems
• A document covers multiple aspects of a central topic.
  • Each aspect is represented by sections or paragraphs.
  • Users usually want just one of the aspects.
• How can this be achieved by utilizing the document structure?
  • Flexible element retrieval returns elements at an arbitrary level rather than only leaf nodes.
  • Each element at every level needs a proper keyword description.

Our Contributions
• Building an index with the same hierarchical structure as the document.
  • Not just indexing the leaf nodes.
• Keyword propagation mechanism.
  • Assign proper keywords to the nodes at each level (push broad-sense keywords to upper-level nodes).
  • Why not use a weight propagation technique?
  • Considering terms’ distributions.
• Flexible element retrieval according to queries.
  • With the hierarchical index, the system can access elements at an arbitrary level (documents, sections or paragraphs) w.r.t. queries.
  • Avoids assembling separate text fragments, as retrieval of leaf nodes only would require.

Hierarchical Indexing for Structured Documents
• Term weighting for the leaf nodes and the intermediate elements.
  • Combining the statistics of term occurrences and distributions.
  • Term selection threshold.
• Propagation and pruning of the index terms.

Term Weighting for Paragraphs
• Paragraphs are “atomic”, without child elements.
• Consider the term occurrences only: TFIDF measure.
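
The slide names the TFIDF measure but does not give the exact variant. As a minimal sketch, assuming whitespace tokenization and the common tf × log(N / df) form with each paragraph treated as a unit (both are assumptions, and `tfidf_weights` is a hypothetical helper name), paragraph-level weights could be computed like this:

```python
import math
from collections import Counter

def tfidf_weights(paragraphs):
    """Return one {term: weight} dict per paragraph using tf * log(N / df)."""
    n = len(paragraphs)
    tokenized = [p.lower().split() for p in paragraphs]
    # Document frequency: number of paragraphs that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Toy paragraphs (made up for illustration, not taken from Encarta).
paragraph_weights = tfidf_weights([
    "qing dynasty ruled china",
    "manchu founded qing dynasty",
    "fleet street london",
])
```

In this toy input, "china" occurs in one paragraph and "qing" in two, so "china" receives the higher weight within the first paragraph.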

Term Weighting for Intermediate Elements
• Document-level or section-level elements.
• Take into account the term distributions in the immediate-descendant elements.

Measuring Term Distributions
• Entropy-like measurement.
  • How evenly a term is distributed across all the immediate-descendant elements of an intermediate element.
• Normalization factor: the theoretical maximum entropy.
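
The exact formula is not preserved in this text, so the following is only one plausible reading: Shannon entropy of a term's occurrence counts across the child elements, divided by log(n), the maximum entropy for n children. The function name `distribution_evenness` is hypothetical.

```python
import math

def distribution_evenness(counts):
    """Normalized entropy of a term's occurrence counts across child elements.

    Returns a value in [0, 1]: 1 means a perfectly even spread, 0 means the
    term is concentrated in a single child (or there is only one child).
    """
    total = sum(counts)
    n = len(counts)
    if total == 0 or n < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(n)  # log(n) is the theoretical maximum entropy

even = distribution_evenness([2, 2, 2, 2])    # term spread over all four children
skewed = distribution_evenness([8, 0, 0, 0])  # term confined to one child
```

A term spread evenly over all children scores close to 1 and is a candidate for the parent's index; a term confined to one child scores 0 and is better left at the lower level.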

Term Selection
• Term weights are normalized to the range between 0 and 1 for the purpose of comparison.
• Compare the terms within one element.
• Select the terms whose weights exceed a threshold as the index terms for this element.
• Repeat this process from the bottom up.
  • Broader-sense terms can be propagated to upper-level elements.
  • Term pruning avoids duplication of index terms.

Terms Propagation and Pruning Algorithm
1. For each leaf element, i.e. paragraph, calculate the weights of all its terms.
2. For each composite element E_j at the next upper level, calculate the terms’ weights by measuring these terms’ occurrences in this element and their distributions in the immediate-descendant elements of E_j.
3. For term t_i, if Weight(t_i, E_j) >= average(E_j) + std_dev(E_j), then this term is selected as an index term of the element E_j, and all the descendant elements of E_j eliminate t_i from their index term lists. This process is called index term propagation and pruning.
4. Recursively perform step 2 onwards until the root node (i.e., the document) is reached.
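
The algorithm above can be sketched on a toy tree as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Element` class and all function names are hypothetical, leaf weights use normalized raw frequency as a stand-in for TFIDF, leaves keep all of their terms, and the weight of a term in a composite element is assumed to be its normalized frequency multiplied by a normalized-entropy evenness score (the slides do not preserve the actual formula).

```python
import math
from collections import Counter
from statistics import mean, pstdev

class Element:
    """A node in the document tree; leaves (paragraphs) carry text."""
    def __init__(self, name, text="", children=None):
        self.name = name
        self.text = text
        self.children = children or []
        self.index_terms = {}  # term -> weight, filled by build_index

def term_counts(el):
    """Term frequencies over all text below an element."""
    if not el.children:
        return Counter(el.text.lower().split())
    total = Counter()
    for child in el.children:
        total += term_counts(child)
    return total

def evenness(counts):
    """Normalized entropy in [0, 1] of a term's counts across children."""
    total, n = sum(counts), len(counts)
    if total == 0 or n < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(n)

def prune(el, term):
    """Remove a term from the index term list of every descendant."""
    for child in el.children:
        child.index_terms.pop(term, None)
        prune(child, term)

def build_index(el):
    """Bottom-up index term propagation and pruning."""
    counts = term_counts(el)
    top = max(counts.values(), default=1)
    if not el.children:
        # ASSUMPTION: normalized frequency stands in for TFIDF at the leaves.
        el.index_terms = {t: c / top for t, c in counts.items()}
        return
    for child in el.children:
        build_index(child)
    child_counts = [term_counts(c) for c in el.children]
    # ASSUMPTION: weight = normalized frequency * evenness across children.
    weights = {t: (c / top) * evenness([cc[t] for cc in child_counts])
               for t, c in counts.items()}
    if not weights:
        return
    values = list(weights.values())
    threshold = mean(values) + pstdev(values)  # average + std_dev, as in step 3
    for term, w in weights.items():
        if w > 0 and w >= threshold:
            el.index_terms[term] = w  # propagate the term to this element
            prune(el, term)           # and remove it from all descendants

p1 = Element("p1", "qing qing emperor")
p2 = Element("p2", "qing qing treaty")
section = Element("sec", children=[p1, p2])
build_index(section)
```

In this toy input, "qing" is frequent and evenly spread over both paragraphs, so it moves up to the section and is pruned from the paragraphs, while "emperor" and "treaty" stay at paragraph level.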

An Illustration of the Process (figure)

Flexible Element Retrieval
• No term duplications along one path.
• The path of an element includes all the elements from this node to the root.
• Ranking relevant elements is equivalent to ranking their paths.

Path Ranking Algorithm
1. Find all elements that contain at least one query term.
2. Get paths for all candidate elements and merge the paths, that is, merge two paths into one if one is a part of the other.
3. Assign the weights of the query terms for the elements to their paths respectively.
4. Rank these paths using the equation on the previous slide.
5. Return the elements corresponding to the ranked paths whose ranks satisfy the pre-defined threshold, in descending order.
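
A rough sketch of these five steps follows. The ranking equation is not preserved in this text, so the score below (the sum of query-term weights over all elements on a path) is an assumption, as are the dictionary-based data layout and the names `path_to_root` and `rank_paths`. The sketch returns ranked paths and leaves open how a path is mapped to the element finally shown to the user.

```python
def path_to_root(element, parent):
    """Return the tuple of element ids from `element` up to the root."""
    path = [element]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return tuple(path)

def rank_paths(query_terms, index, parent, threshold=0.0):
    # Step 1: candidate elements hold at least one query term in their index.
    candidates = [e for e, terms in index.items()
                  if any(q in terms for q in query_terms)]
    # Step 2: build paths and drop any path that is part of a longer one.
    paths = {path_to_root(e, parent) for e in candidates}
    merged = [p for p in paths
              if not any(p != q and q[-len(p):] == p for q in paths)]

    # Steps 3 and 4: ASSUMED score = sum of query-term weights along the path.
    def score(path):
        return sum(index.get(e, {}).get(q, 0.0)
                   for e in path for q in query_terms)

    ranked = sorted(((p, score(p)) for p in merged),
                    key=lambda item: item[1], reverse=True)
    # Step 5: keep only paths whose score satisfies the threshold.
    return [(p, s) for p, s in ranked if s >= threshold]

# Toy hierarchy and index (made-up weights, no term repeated along a path).
parent = {"doc": None, "sec1": "doc", "p1": "sec1", "p2": "sec1",
          "sec2": "doc", "p3": "sec2"}
index = {"doc": {"china": 0.9}, "sec1": {"qing": 0.8}, "p1": {"manchu": 0.6},
         "p2": {"treaty": 0.5}, "p3": {"london": 0.7}}
results = rank_paths(["manchu", "qing"], index, parent)
```

Here the section path (sec1, doc) is part of the paragraph path (p1, sec1, doc), so the two are merged and the surviving path collects the weights of both query terms.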

Result Browsing
• The prototype interface can
  • highlight the relevant parts of the selected document;
  • allow the user to browse results in the original document structure.
• Query example: “the Manchu Qing Dynasty”
  • A section in “China”
  • The whole document for “Qing Dynasty”

Prototype Interface (figure)

Evaluation
• Data set
  • 41,942 XML documents on various topics from the Encarta online encyclopedia.
• Ten experimental queries
  • Each can be answered by only parts of the relevant document, e.g. “Fleet Street in London” is answered by a paragraph of the document London.
  • Relevance judgments were made by human assessors: for each query, there is a group of paragraphs representing the relevant sections or paragraphs.
• Baseline system (TFIDF Para)
  • Indexes paragraph nodes only.
  • Applies the TFIDF measure to weight terms and uses cosine similarity to retrieve answers.
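
For reference, the cosine similarity used by the baseline can be sketched over sparse term-weight dictionaries. The representation and the name `cosine_similarity` are assumptions made for illustration; the slides do not describe the baseline's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors with made-up weights.
query = {"fleet": 1.0, "street": 1.0}
paragraph = {"fleet": 0.7, "street": 0.7, "newspapers": 0.4}
similarity = cosine_similarity(query, paragraph)
```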

Performance Evaluation
• Precision, recall and F-value are used as performance metrics.
• Two sets of hierarchical indices
  • One utilizing titles and one without considering titles.
• Answer selection threshold
  • Fixed values from 0.1 to 0.9, as used by most existing systems.
  • Dynamic thresholds: Avg and Avg + Std_Dev.
• Our system is compared with TFIDF Para under different answer selection thresholds.
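
As a hedged illustration of these metrics and thresholds, the sketch below computes precision, recall and F over sets of paragraph ids, and applies the two dynamic thresholds to a dictionary of answer scores. The balanced F (F1) and the population standard deviation are assumptions, since the slides do not state which variants were used; the function names are hypothetical.

```python
from statistics import mean, pstdev

def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall and balanced F-value."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def select_answers(scores, mode="avg"):
    """Keep answers scoring at least Avg, or Avg + Std_Dev, of all scores."""
    values = list(scores.values())
    cutoff = mean(values)
    if mode == "avg+std":
        cutoff += pstdev(values)  # ASSUMPTION: population standard deviation
    return [answer for answer, s in scores.items() if s >= cutoff]

# Toy data (made up for illustration).
p, r, f = precision_recall_f(["p1", "p2", "p3", "p4"], ["p1", "p2", "p5"])
scores = {"a": 0.9, "b": 0.6, "c": 0.1, "d": 0.2}
kept_avg = select_answers(scores, "avg")
kept_strict = select_answers(scores, "avg+std")
```

Unlike a fixed cutoff such as 0.5, a dynamic threshold adapts to the score distribution of each query.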

Results of Performance Comparison
• The figures are impressive.
  • Improvements in precision are 48.83% (w/ titles) and 41.67% (w/o titles) on average.
  • For F-values, improvements are 56.02% (w/ titles) and 40.89% (w/o titles).
  • Recall decreases slightly with some threshold settings (an overly strict threshold for index term selection).
• User feedback
  • Our system can find more meaningful units instead of separate paragraphs, including some paragraphs that do not actually contain query terms.
  • Users are clear about the context when browsing the answers within the original document structure.

Threshold Setting
• Our system is less sensitive to the answer selection threshold settings.
• A dynamic threshold is a good alternative for such structured document retrieval.

Conclusions
• A novel hierarchical index propagation and pruning mechanism generates a structured index.
• Flexible element retrieval, which returns relevant elements at an arbitrary level, is realized on the hierarchical index.
• It can satisfy users better than previous passage retrieval systems.
• More work can be done on generating hierarchical indices for federated search.

Thanks!
