Routing of Structured Queries in Large-Scale Distributed Systems

Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS_IR'08) @ ACM 17th CIKM 2008, Napa Valley, California, USA, Oct 2008.
Judith Winter, Institute for Informatics / Telematics Group, Goethe-University, Frankfurt am Main, Germany

Overview
• Introduction
• Concept & Architecture
• Routing
• Evaluation
• Questions and Discussion

1. Introduction

Proposed research:
• XML Information Retrieval in P2P systems
• Investigate the impact of using structural information when retrieving XML documents in a P2P network
• Challenge: not all information is accessible / scalability issues

How can query routing in a large-scale P2P system be performed and improved by using structural information?

XML Information Retrieval in Peer-to-Peer Systems:
• Challenges:
  • no central index
  • only selected information available
  • bandwidth consumption / communication overhead
  • efficiency vs. effectiveness
  • vague queries
  • relevance ranking

[Diagram with three overlapping areas: Information Retrieval, Peer-to-Peer, XML-Retrieval]
• structured documents
• more precise search
• based on c/s architectures
• distributed
• autonomous peers
• growing amount of XML documents

2. Concept & Architecture

Concept for a P2P search engine:
• Queries: content-and-structure (CAS)
• Indexing: include structure
• Hybrid indexing: globally or locally (distributing summaries), depending on peer status → index with posting lists (document level) and peer lists (peer level)
• Distributing global information into DHT
• Ranking: extended vector space model (using structure)
• Results / retrieval units: document or element retrieval

Concept for a P2P search engine:
• Routing:
  • Use peer lists and posting lists
  • Use of pre-computed posting lists for popular term combinations → highly discriminative keys (HDKs)
  • Use of pruned posting lists by considering structural information
  • Ordering of posting lists by a query-independent score (evidence from document, element, collection, and peer level)
  • Select top k results according to a pre-ranking based on the structural similarity between CAS query and posting key

[Architecture diagram; labels as shown]
• Data and network: Frequent XTerm index, HDK index, P2P network, DL, local documents
• APPLICATION: GUI (indexing; querying & result presentation), Querying component
• INFORMATION RETRIEVAL: Index storage component, Inverted index, Statistics index, Document index, Retrieval unit index, Similarity calculator, Retrieval component, Ranking component, Routing component, Weighting calculator, Source selector
• PEER-TO-PEER: P2P component, SpirixDHT, Peer metrics calculator, SimulationDHT, Chord

3. Routing

Example: q = {apple, \book}
[Figure: ring of peers P0 to P7]
• Peer P0 looks for books about apples
• Id i0 = hash(apple, \book) = hash(apple) is calculated
• Peer P5 assigned to i0 is located in log(n) hops
• Query q is sent to P5
• P5 selects the top k=2 postings for q; these relate to dok1 and dok2
• Id i1 = hash(dok1) and Id i2 = hash(dok2) are calculated, their peers located
• q is sent to P2 and P6 assigned to i1 and i2
• P2 and P6 calculate relevance for dok1 and dok2 plus their RUs
• P2 and P6 send back results to P0

Posting lists under the key hash(apple), to which q is assigned:
apple, \book → dok1(4.8), dok2(4.1), dok3(3.7)…
apple, \novel → dok2(12.9)
apple, \article\p\sec → ----

Document vectors: Dok1=(0,1,5,1,3,…), Dok2=(1,4,0,0,3,…)
Results sent back: Result = {(dok2,12.4), (dok2/chap, 11.2)} and Result = {(dok1/sec,5.4)}
Result list: (dok2,12.4), (dok2/chap, 11.2), (dok1/sec,5.4)
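The lookup steps of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPIRIX implementation: `peer_for`, `publish`, and `route` are hypothetical helper names, the DHT is replaced by a SHA-1 hash taken modulo the number of peers, and the resulting peer ids will therefore generally differ from the P5, P2, and P6 shown on the slide.

```python
import hashlib

NUM_PEERS = 8  # the example ring has peers P0..P7


def peer_for(key: str) -> int:
    """Stand-in for a DHT lookup: hash the key and map it onto a peer id."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PEERS


def publish(global_index, term, structure, postings):
    """Store a posting list at the peer responsible for hash(term)."""
    peer = peer_for(term)
    global_index.setdefault(peer, {})[(term, structure)] = postings


def route(global_index, term, structure, k=2):
    """Find the index peer for the term, pick the top-k postings for the
    (term, structure) key, and locate the peers holding those documents."""
    index_peer = peer_for(term)
    postings = global_index.get(index_peer, {}).get((term, structure), [])
    top_k = sorted(postings, key=lambda p: p[1], reverse=True)[:k]
    targets = {doc: peer_for(doc) for doc, _ in top_k}
    return index_peer, targets


# Posting lists from the slide (scores as shown there).
global_index = {}
publish(global_index, "apple", "\\book", [("dok1", 4.8), ("dok2", 4.1), ("dok3", 3.7)])
publish(global_index, "apple", "\\novel", [("dok2", 12.9)])

index_peer, targets = route(global_index, "apple", "\\book", k=2)
print(f"index peer: P{index_peer}")
for doc, peer in targets.items():
    print(f"forward q to P{peer} for {doc}")
```

As on the slide, the posting lists are keyed by the term alone (hash(apple, \book) = hash(apple)), so all structures of a term end up at the same index peer and the structure is only used to choose among its posting lists.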

Routing process: [figure]

Weighting of postings (query-independent, at indexing time):
• Entries sorted by score_t(d_i); choose the k best entries for XTerm t
• Considers document d_i, best retrieval unit ru_best, and peer p_i
• Weighting function w: BM25e-based
• PeerScore: high for peers with good collections regarding t and with good performance metrics
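The slide does not give the formula for score_t(d_i), so the following Python sketch only illustrates the selection step: keep the k best entries of a posting list under some query-independent score. The `Posting` fields, the `k_best` helper, and especially `placeholder_sum` are assumptions made for illustration; the plain sum is a placeholder and not the weighting function used in the talk.

```python
import heapq
from typing import Callable, Iterable, List, NamedTuple


class Posting(NamedTuple):
    doc: str
    doc_weight: float   # weight of document d_i for XTerm t (assumed precomputed)
    ru_weight: float    # weight of the best retrieval unit ru_best (assumed)
    peer_score: float   # PeerScore of the peer p_i holding the document (assumed)


def k_best(postings: Iterable[Posting], k: int,
           combine: Callable[[Posting], float]) -> List[Posting]:
    """Return the k postings with the highest combined score, best first."""
    return heapq.nlargest(k, postings, key=combine)


def placeholder_sum(p: Posting) -> float:
    """Placeholder combination of the three evidence sources (an assumption)."""
    return p.doc_weight + p.ru_weight + p.peer_score


postings = [
    Posting("dok1", 1.0, 2.0, 0.5),
    Posting("dok2", 3.0, 1.0, 0.2),
    Posting("dok3", 0.5, 0.5, 0.1),
]
best = k_best(postings, k=2, combine=placeholder_sum)
print([p.doc for p in best])
```

Passing the combination as a function keeps the truncation logic independent of whichever weighting is actually chosen.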

Selection of postings (query-dependent reordering):

Example: q = { (apple, \book\chapter), (chips, \section) }

Term   Structure          sim   Posting list
apple  \book\chapter      1     dok1(12.8), dok2(12.4)
apple  \article\p         0     dok2(25.3), dok3(12.7), dok4(10.7)
chips  \book\c1\section   0.7   dok4(18.4), dok2(3.1), dok1(2.3), dok3(1.5)

Final posting list = { dok2 (12.4*1 + 3.1*0.7 = 14.6), dok1 (12.8*1 + 2.3*0.7 = 14.4), dok4 (18.4*0.7 = 12.9), dok3 (1.5*0.7 = 1.1) }
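The arithmetic of this example can be reproduced with a short Python sketch. The similarity values are taken from the slide as given; how they are computed is not specified there, so `similarity` is a plain lookup table here and `merge_postings` is a hypothetical helper name.

```python
from collections import defaultdict

# Posting lists from the slide: (term, structure) -> [(document, score)].
posting_lists = {
    ("apple", "\\book\\chapter"): [("dok1", 12.8), ("dok2", 12.4)],
    ("apple", "\\article\\p"): [("dok2", 25.3), ("dok3", 12.7), ("dok4", 10.7)],
    ("chips", "\\book\\c1\\section"): [("dok4", 18.4), ("dok2", 3.1),
                                       ("dok1", 2.3), ("dok3", 1.5)],
}

# Structural similarity between each posting key and the query, as on the slide.
similarity = {
    ("apple", "\\book\\chapter"): 1.0,
    ("apple", "\\article\\p"): 0.0,
    ("chips", "\\book\\c1\\section"): 0.7,
}


def merge_postings(posting_lists, similarity, k=None):
    """Weight each posting by its key's similarity, sum per document,
    and return the documents sorted by the resulting score."""
    totals = defaultdict(float)
    for key, postings in posting_lists.items():
        sim = similarity.get(key, 0.0)
        if sim == 0.0:
            continue  # structurally dissimilar lists contribute nothing
        for doc, score in postings:
            totals[doc] += score * sim
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked if k is None else ranked[:k]


final = merge_postings(posting_lists, similarity)
for doc, score in final:
    print(f"{doc}: {score:.2f}")
```

The unrounded scores are 14.57, 14.41, 12.88, and 1.05, which match the rounded values on the slide and give the same order: dok2, dok1, dok4, dok3. Note that the \article\p list is skipped entirely because its similarity to the query structure is 0, even though it holds the highest individual score.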

4. Evaluation

Implementation:
• Implementation of SPIRIX: Search Engine for P2P Information Retrieval in XML-Documents
• P2P complex:
  • Based on OpenChord
  • Collects peer characteristics
  • Adapted to special requirements of XML IR
• Preliminary evaluation with the INEX collection

Evaluation:
• Evaluation with the INEX collection of 2007:
  • Wikipedia collection: 660,000 documents (4.6 GB)
  • 80 CAS queries (out of 123 topics)
  • run on 1 peer with SimulationDHT (measurement of #postings)
  • retrieval of the best 1500 results per query
  • PLmax set to unlimited (→ all HDKs are single XTerms)
  • different structural similarity functions
  • simple version of the proposed formulas (document-based)
• Goal: show the effect of using structural hints for routing
  • efficiency (#postings: 100, 500, 2000 postings)
  • effectiveness (precision at different recall levels)
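For readers unfamiliar with the effectiveness measure, the sketch below computes interpolated precision at fixed recall levels for a single ranked list. It is a simplified, document-level illustration: the function name, the chosen recall levels, and the toy data are assumptions, and it is not the official INEX 2007 evaluation procedure.

```python
def interpolated_precision(ranked, relevant, recall_levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """For each recall level, return the highest precision observed at any
    rank whose recall is at least that level (0.0 if never reached)."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("at least one relevant document is required")
    points = []  # (recall, precision) at each rank where a relevant doc appears
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return {
        level: max((p for r, p in points if r >= level), default=0.0)
        for level in recall_levels
    }


# Toy ranking: relevant documents "a" and "b" are found at ranks 1 and 3.
curve = interpolated_precision(["a", "x", "b", "y"], {"a", "b"})
for level, precision in curve.items():
    print(f"recall >= {level:.2f}: precision {precision:.3f}")
```

Running the same computation on result lists built from 100, 500, and 2000 postings would give one curve per posting budget, which is the kind of efficiency versus effectiveness comparison the goal above describes.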


[Results figure; annotated improvements: +7.2%, +8.7%, +5.5%]


Conclusion:
• Propose to take advantage of XML structure when routing in highly distributed environments such as P2P systems
• Provide an infrastructure for investigating the proposed techniques, which perform routing based on evidence from document, element, collection, and peer level
• For 80 CAS topics of INEX 2007, efficiency and effectiveness could be improved
• Future work to verify the observed improvement:
  • evaluate formulas in full version
  • runs with multimedia topics of INEX 2007; INEX 2008
  • measure bandwidth consumption (incl. #messages, message sizes)
  • run on different peers; split collection

5. Questions and Discussion
