
Hybrid Keyword Search across Peer-to-Peer Federated Data

This PhD dissertation defense from Florida State University discusses the problems with current keyword search approaches and introduces a hybrid search mechanism that combines semantic metadata with keyword search. It explores the challenges and benefits of this approach and presents a scalable architecture for hybrid search on distributed databases.


Presentation Transcript


  1. Hybrid Keyword Search across Peer-to-Peer Federated Data PhD Dissertation Defense Florida State University Jungkee (Jake) Kim

  2. Motivation Where is the Information? Internet

  3. Outline • Two Typical Search Paradigms • Problem Statements of Current Approaches • Hybrid Keyword Search • Hybrid Search on Distributed Databases • Hybrid Search across Peer-to-Peer Federated Databases

  4. Two Typical Search Paradigms • Searching over structured data: Relational Databases • Searching over unstructured data: Information Retrieval • Internet Environment • Semistructured Data – XML • Keyword Search in DB • Web Search Engines – Technologies from Information Retrieval • Hybrid Keyword Search?

  5. Current Approaches – Keyword-only Search • Web Search Engines • Web crawlers visit Web pages and collect the keyword based text indexes. • Fast information retrieval • Keyword Search in databases • Web integration on legacy DBMS • Dynamic Web publication through embedded DB • Easy to use without knowledge of DB schema

  6. Problems of Current Approaches – Keyword-based • Web Search Engines • Cannot collect every connected resource • Query results are often unrelated • Keyword Search in Databases • Losing the inherent meaning of the schema • Query results are not based on semantic schema

  7. Current Approaches – Semantic • Semantic Web • Multiple relation links form directed labeled graphs, so machines can understand the relationships between different resources • Describes metadata about resources • Represents the relations among objects on the Web; the object terms are defined under a specific description – an Ontology

  8. Problems of Current Approaches – Semantic Web • Ontology design is sophisticated • Lack of unified definition * • Limited adoption

  9. Our Approach • Hybrid search mechanisms – Semantic metadata + Keyword search • Semantic Solution: Semantic Web might be better than Hybrid search; Hybrid search must be better than Web search engines • Simplicity: Hybrid search is simpler than Semantic Web

  10. Hybrid Keyword Search Service • A search service retrieves target data in response to a search query. • Unstructured data: a file containing data – MS Word, PDF, PS documents • Metadata: structured or semistructured data – XML • We used an XML-enabled relational DBMS, and a native XML DB together with a text search library (Apache Xindice + Jakarta Lucene), to handle searches against metadata and text.

  11. How to Combine? (1) • Two entities and a relationship in relational DBMS • We can obtain the hybrid search result using a nested subquery
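The nested-subquery idea on slide 11 can be sketched roughly as follows. This is an illustration only, not the schema used in the dissertation: the table names (`metadata`, `document`, `describes`), the columns, and the use of `LIKE` in place of a real full-text index are all assumptions made for the example, which runs against an in-memory SQLite database.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database: two entities and one relationship (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE metadata (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
        CREATE TABLE document (id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE describes (meta_id INTEGER, doc_id INTEGER);
        INSERT INTO metadata VALUES (1, 'Paper A', 2004);
        INSERT INTO metadata VALUES (2, 'Paper B', 2004);
        INSERT INTO metadata VALUES (3, 'Paper C', 2002);
        INSERT INTO document VALUES (10, 'peer-to-peer keyword search');
        INSERT INTO document VALUES (20, 'grid services');
        INSERT INTO document VALUES (30, 'keyword search in databases');
        INSERT INTO describes VALUES (1, 10);
        INSERT INTO describes VALUES (2, 20);
        INSERT INTO describes VALUES (3, 30);
    """)
    return conn


def hybrid_search(conn, year, keyword):
    """Return titles whose metadata matches `year` and whose linked text contains `keyword`."""
    sql = """
        SELECT m.title
        FROM metadata AS m
        WHERE m.year = ?
          AND m.id IN (
              -- nested subquery: metadata ids whose document text matches the keyword
              SELECT r.meta_id
              FROM describes AS r
              JOIN document AS d ON d.id = r.doc_id
              WHERE d.body LIKE ?
          )
        ORDER BY m.title
    """
    return [row[0] for row in conn.execute(sql, (year, f"%{keyword}%"))]
```

Here the outer query carries the metadata condition and the subquery carries the keyword condition, so a row is returned only when both match.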

  12. How to Combine? (2) • A hash table is used for joining search results in non-DBMS based system (Apache Xindice + Lucene)
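When the metadata search and the keyword search run in separate engines, as on slide 12, their result lists can be joined in application code with a hash table. The sketch below assumes that both result lists share a document identifier; the tuple shapes and the name `hash_join` are hypothetical and do not come from the Xindice or Lucene APIs.

```python
def hash_join(metadata_hits, keyword_hits):
    """Join two result lists on a shared document id using a hash table.

    metadata_hits: iterable of (doc_id, metadata_record) pairs
    keyword_hits:  iterable of (doc_id, score) pairs
    Returns (doc_id, metadata_record, score) for ids present in both lists,
    in the order of keyword_hits.
    """
    # Build phase: index the metadata results by document id.
    table = {}
    for doc_id, record in metadata_hits:
        table[doc_id] = record

    # Probe phase: keep only keyword hits whose id is also in the table.
    joined = []
    for doc_id, score in keyword_hits:
        if doc_id in table:
            joined.append((doc_id, table[doc_id], score))
    return joined
```

The build and probe phases each make one pass over their input, so the cost grows with the sum of the two result sizes rather than their product.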

  13. Local Query Processing – XML (1) • Chart: Average XML Query Time • XML-enabled RDB • DBLP XML records (1,000 – 10,000) • Non-indexed matches, except the year match, are bound by the number of matches. • Combined query time depends on the number of year query results

  14. Local Query Processing – XML (2) • Chart: Average XML Query Time • Apache Xindice • DBLP XML records (1,000 – 10,000) • Indexed approximate matches for text elements in XML instances are as bad as non-indexed queries • Exact matches are bound by the number of matches.

  15. Local Query Processing – Hybrid (1) • Hybrid search query performance measurement • XML-enabled RDB • For 100,000 XML instances and 100,000 text documents • Small result set: 4 XML matches and a keyword match • Large result set: 7,752 XML and 41,889 documents

  16. Local Query Processing – Hybrid (2) • Hybrid search query performance measurement • Apache Xindice + Jakarta Lucene • For 10,000 XML instances and 10,000 text documents • Small result set: 2 XML matches and a keyword match • Large result set: 192 XML and 4,562 documents

  17. Discussion – Local Hybrid Search • The XML-enabled RDB responds acceptably except under some extreme query loads. • A native XML DB (Apache Xindice) had very limited scalability. (No accurate query result over 16,000 XML instances) • We will generalize hybrid search to a distributed environment.

  18. Hybrid Search on Distributed Databases • Data Independence: logically and physically independent; the same schema – no change, data encapsulation in each machine • Network Transparency: depends on MOM or P2P framework • No replication – restricted to a computer cluster • Fragment: full partition; horizontal fragmentation • The query result for the distributed databases is the collection of query results from individual database queries.
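The last point on slide 18, that the distributed result is the collection of the per-database results, can be illustrated with a small sketch. It assumes horizontal fragmentation with an identical record layout on every fragment; the in-memory lists standing in for databases, the thread pool, and the function names are assumptions for the example, not the system described in the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def search_fragment(fragment, predicate):
    """Run the same query (a predicate) against one horizontal fragment."""
    return [record for record in fragment if predicate(record)]


def distributed_search(fragments, predicate):
    """Query every fragment concurrently and collect the partial results."""
    workers = max(1, len(fragments))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields partial results in fragment order.
        partials = pool.map(lambda frag: search_fragment(frag, predicate), fragments)
        results = []
        for part in partials:
            results.extend(part)
    return results
```

Because there is no replication and the fragments are disjoint, the partial results can simply be concatenated without removing duplicates.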

  19. Scalable Hybrid Search Architecture on DDBS (diagram) • Client: publisher for a query topic; subscriber for a temporary topic • Message Broker: relays query messages and result messages • Search Services: subscriber for a query topic; publisher for a temporary topic
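The roles named on slide 19 (a client publishing to a query topic and subscribing to a temporary topic, with search services doing the reverse) follow a request/reply pattern over publish/subscribe. The toy in-process broker below sketches that flow under simplifying assumptions: delivery is synchronous, and the `Broker` class, topic names, and message fields are hypothetical and are not the NaradaBrokering or JMS API.

```python
import itertools
from collections import defaultdict


class Broker:
    """Minimal in-process topic broker (illustrative stand-in for a message broker)."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._ids = itertools.count(1)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver synchronously to every current subscriber of the topic.
        for callback in list(self._subscribers[topic]):
            callback(message)

    def temporary_topic(self):
        """Return a fresh topic name for one client's results."""
        return f"tmp-{next(self._ids)}"


def attach_search_service(broker, query_topic, local_records):
    """Search service: subscribes to the query topic, publishes hits to the reply topic."""
    def on_query(message):
        hits = [rec for rec in local_records if message["keyword"] in rec]
        broker.publish(message["reply_to"], hits)

    broker.subscribe(query_topic, on_query)


def client_query(broker, query_topic, keyword):
    """Client: subscribes to a temporary topic, then publishes the query."""
    reply_topic = broker.temporary_topic()
    collected = []
    broker.subscribe(reply_topic, collected.extend)
    broker.publish(query_topic, {"keyword": keyword, "reply_to": reply_topic})
    return collected
```

Giving each query its own temporary topic keeps result messages from different clients separate without the search services needing to know who is asking.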

  20. Cooperating Broker Network • Distributed Databases based on NaradaBrokering Network

  21. Query Processing – DDBS (1) • 100,000 XML and 100,000 Documents in 8 machines – 12,500 each • Few keyword matches (1-3) on 1 machine only • RDB – 0.04 sec. for few keyword matches • Chart: Avg. response time for an author exact match query over 8 search services

  22. Query Processing – DDBS (2) • 100,000 XML and 100,000 Documents in 8 machines – 12,500 each • RDB – half hour or 6.96 sec. (Hash table) • Chart: Avg. response time for a year match query over 8 search services

  23. Data Integration Hub • Partial integration – a possible method to increase the portion of data queried, cf. Supernode in P2P • We designed a partial integration architecture through a message-oriented middleware – the NaradaBrokering system • NaradaBrokering system • JMS-compliant topic-based communication • Scalability through hierarchical broker connections • Passive queries / Static binding • We attached an RDBMS to store the metadata and index the contents of the data

  24. Architecture of Data Integration Hub

  25. Coupling vs. Scalability • From ICDE 2002 Tutorial

  26. Query Propagate and Results back on a P2P Network

  27. Peer group architecture of the P2P Search

  28. Performance Test for Peer Group Communication (JXTA) (diagram) • Client Peer • Rendezvous Peer • Search Service Peers on Subnets A, B, C, ... • Group propagation • Point-to-point pipe connection

  29. Performance for Group Peer Communication – 1 Peer per Node Average Response Time for a Query

  30. Performance for Group Peer Communication – Multiple Peers per Node Allowed (1) Average Response Time for a Query with Multiple Peers per Node Allowed

  31. Performance for Group Peer Communication – Multiple Peers per Node Allowed (2) Message Response Time for 32 Group Peers

  32. Related Works (1) • Distributed lookup in routing to reduce unnecessary communication • Distributed Hash Table (DHT) – Chord, CAN, Pastry, and Tapestry • JXTA: DHT + multiple random walks • Look up peers based on reputation • Hristidis et al. – Exploiting context on an existing RDBMS while reducing the schema loss of keyword search in DB

  33. Related Works (2)

  34. Conclusion • We addressed the semantic loss of keyword-only search while remaining a simpler solution than the Semantic Web • Low-cost scalability over heterogeneous resources through customized overlay networks • A practical bridging role on the road towards the ideal of information represented by the Semantic Web?

  35. Contributions • Demonstration of a hybrid search – combining metadata search with a keyword search over unstructured context data • A way to increase locality and integrate several dispersed resources through a data integration hub • Extension of the scalability of a native XML database and performance improvement for some queries compared to those on a single machine • Generalization of our hybrid search architecture on potentially more scalable P2P overlay network

  36. Publications • J. Kim and G. Fox. Scalable Hybrid Search on Distributed Databases. Accepted for presentation in 3rd International Workshop on Autonomic Distributed Data and Storage Systems Management (ADSM) in conjunction with ICCS, To appear in Lecture Notes in Computer Science. May, 2005. • J. Kim and G. Fox. A Hybrid Keyword Search across Peer-to-Peer Federated Databases. In Proceedings of 8th East-European Conference on Advances in Databases and Information Systems (ADBIS), September, 2004. • J. Kim, O. Balsoy, M. Pierce, and G. Fox. Design of a Hybrid Search in the Online Knowledge Center. In Proceedings of IASTED International Conference on Information and Knowledge Sharing, November, 2002. • G. Aydin, H. Altay, M. S. Aktas, M. N. Aysan, G. Fox, C. Ikibas, J. Kim, A. Kaplan, A. E. Topcu, M. Pierce, B. Yildiz, and O. Balsoy. Online Knowledge Center Tools for Metadata Management. Technical report, DoD HPCMP Users Group Meeting, June, 2003. • O. Balsoy, M. S. Aktas, G. Aydin, M. N. Aysan, C. Ikibas, A. Kaplan, J. Kim, M. Pierce, A. Topcu, B. Yildiz, and G. Fox. The Online Knowledge Center: Building a Component Based Portal. In Proceedings of the International Conference on Information and Knowledge Engineering, June, 2002. • G. Fox, S. Ko, M. Pierce, O. Balsoy, J. Kim, S. Lee, K. Kim, S. Oh, X. Rao, M. Varank, H. Bulut, G. Gunduz, X. Qiu, S. Pallickara, A. Uyar, and C. Youn. Grid services for earthquake science. Concurrency and Computation: Practice and Experience, 14:371-393, May-June 2002.
