Information Retrieval on P2P Networking
Willie Yang
November 2004
What is Information Retrieval?
• Select and return to the user the desired documents from a large set of documents, in accordance with criteria specified by the user【Salton, 1989】
• Basic model【Belkin, 1992】
Research Issues
1. Format, Source, Type
2. Indexing
3. Query Expression
4. Query Model
5. Ranking
6. Feedback
More about Information Retrieval
• Concepts related to searching
  - Browsing
  - Filtering
• Technologies related to IR
  - Information extraction
  - Question answering
  - Classification
What is Peer-to-Peer Networking?
• Peer-to-peer is a way of structuring distributed applications such that the individual nodes have symmetric roles. Rather than being divided into clients and servers, each with quite distinct roles, in P2P applications a node may act as both a client and a server.【IETF/IRTF 2004】
Characteristics of P2P (1)
• Multiple peers participate in the network
• The number of roles is small.
• The number of peers is typically large.
• Every peer owns some resources and pays for its participation by providing access to them.
• Distributed and decentralized, with no distinguished roles
• Autonomous and self-controlled, with ad hoc participation
• Dynamic (e.g. peers come and go freely)
• Peers rely very little on the underlying infrastructure → they do most things on their own.
Characteristics of P2P (2)
• Difference from distributed computing
  - More dynamic (nodes join or leave, rather than merely fail or not)
  - Larger number of nodes
• Difference from distributed databases or grid computing
  - No centralized mechanism (e.g. an integrator or dispatcher)
• Research highlights
  - Resource sharing
  - Autonomy
  - Load balancing
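Search on Unstructured P2P
(Figure: a peer asks "Where is X?"; X is stored at another peer.)
• Example: Gnutella
• Solution: Broadcasting + TTL
• Constraint: the search is not guaranteed (a query whose TTL expires before it reaches X fails even though X exists)
• Research topics
  - Exploring strategies
  - Linking strategies
  - Routing strategies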
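The "Broadcasting + TTL" idea can be sketched as a breadth-first flood over the overlay. This is a minimal single-process simulation, not Gnutella's actual protocol; the overlay `neighbors`, the `holdings` map, and the function name `flood_search` are hypothetical.

```python
from collections import deque

def flood_search(neighbors, holdings, start, item, ttl):
    """Flood a query from `start`; each forwarding step uses up one unit of TTL."""
    visited = {start}
    frontier = deque([(start, ttl)])
    hits = []
    while frontier:
        peer, hops_left = frontier.popleft()
        if item in holdings.get(peer, ()):
            hits.append(peer)
        if hops_left == 0:
            continue  # TTL exhausted: this peer does not forward the query
        for nxt in neighbors[peer]:
            if nxt not in visited:  # drop duplicate copies of the query
                visited.add(nxt)
                frontier.append((nxt, hops_left - 1))
    return hits

# Toy overlay (assumption): a line a - b - c - d, with X stored only at d.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
holdings = {"d": {"X"}}

near = flood_search(neighbors, holdings, "a", "X", ttl=2)  # d is 3 hops away
far = flood_search(neighbors, holdings, "a", "X", ttl=3)
```

With `ttl=2` the query dies one hop short of `d`, which illustrates why the search is not guaranteed.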
Search on Structured P2P
(Figure: a ring of nodes with IDs 1 to 8; a peer asks "Where is X?"; X is stored at node 3.)
• Example: Chord, a kind of DHT-based P2P
• Solution: Consistent Hashing + Routing
• Constraint: supports only key-value pair lookup
• Node joining: assign a node ID
• Object publishing: hash(X) = 3
• Object lookup: the same
• Research topics
  - Topology and routing
  - Efficiency
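A minimal sketch of the consistent-hashing step, assuming a small identifier ring: publishing and lookup both hash the key and go to the first node at or after that ID. Chord's finger-table routing is left out, and the names `key_id` and `successor` are illustrative, not Chord's API.

```python
import hashlib
from bisect import bisect_left

def key_id(name, ring_size):
    """Map a key to an ID on the ring (SHA-1 reduced modulo the ring size)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % ring_size

def successor(node_ids, ident):
    """First node clockwise from `ident`, wrapping around the ring."""
    ids = sorted(node_ids)
    i = bisect_left(ids, ident)  # first node ID >= ident
    return ids[i % len(ids)]     # past the last node: wrap to the smallest ID

# Toy ring (assumption): only nodes 1, 4 and 7 have joined.
nodes = [1, 4, 7]
publish_node = successor(nodes, 3)  # an object with hash 3 is stored at node 4
lookup_node = successor(nodes, 3)   # lookup repeats the same computation
```

Because publishing and lookup run the same deterministic computation, a lookup reaches the node that stores the object, but only when the exact key is known.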
Keyword Search on Structured P2P
(Figure: a ring of nodes with IDs 1 to 8; the inverted lists over DOC 1 to DOC 8 for the keywords Taiwan (台灣), network (網路), and information management (資管) are stored at different nodes. Query: "Where is Taiwan & information management?" Result shown: DOC 1.)
• Example: Chord + Inverted List
• Solution: Routing + Merge Sort
• Constraints:
  (1) storage redundancy
  (2) unbalanced load → Zipf's law
  (3) single point of failure
  (4) huge traffic
  (5) results are hard to rank
Keyword Search in DHT-Based Peer-to-Peer Networks Yuh-Jzer Joung, Chien-Tse Fang, and Li-Wei Yang
Outline
• Background
• Some Preliminaries
• The Hypercube Index Scheme
• Simulation
• Conclusions and Related Work
Our Hypercube Indexing Scheme
• Assign each node an ID: an r-bit string
• Hash each keyword into the range [0, r-1] to construct a doc vector
• Publish each doc to the node whose ID equals its doc vector
Example (r = 7), with keywords Taiwan (台灣), network (網路), and information management (資管):
• Hash(Taiwan) = 2, Hash(network) = 0, Hash(information management) = 3
• Doc1 (keyword: Taiwan) → node 0010000
• Doc2 (keywords: Taiwan, network) → node 1010000
• Doc3 (keywords: Taiwan, network, information management) → node 1011000
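The mapping from a keyword set to a node ID can be sketched as follows. The hash values are the ones given on the slide, with the keywords rendered in English; the dictionary `toy_hash` and the function name `doc_vector` are illustrative.

```python
def doc_vector(keywords, h, r):
    """Build the r-bit doc vector: bit h(w) is set for every keyword w."""
    bits = ["0"] * r
    for w in keywords:
        bits[h(w)] = "1"  # positions are counted from the left, as on the slide
    return "".join(bits)

# Hash values from the slide's example (r = 7).
toy_hash = {"Taiwan": 2, "network": 0, "information management": 3}

doc1 = doc_vector(["Taiwan"], toy_hash.get, 7)
doc2 = doc_vector(["Taiwan", "network"], toy_hash.get, 7)
doc3 = doc_vector(["Taiwan", "network", "information management"], toy_hash.get, 7)
```

The three vectors match the node IDs that hold Doc1, Doc2, and Doc3 in the slide's example.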
Hypercube
(Figure: a 4-dimensional hypercube with nodes 0000 to 1111.)
• An r-dimensional hypercube H_r(V_r, E_r) has 2^r nodes. Each node u in V_r is represented by a unique r-bit binary string.
• Two nodes u, v in V_r are joined by an edge iff they differ in exactly one bit.
• An r-D hypercube can be constructed from two (r-1)-D hypercubes.
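As a small illustration of the edge rule, the neighbors of a node are the r strings obtained by flipping one bit. The helper names below are hypothetical.

```python
from itertools import product

def neighbors(u):
    """All r-bit strings that differ from u in exactly one bit."""
    flip = {"0": "1", "1": "0"}
    return [u[:i] + flip[u[i]] + u[i + 1:] for i in range(len(u))]

def hypercube_edges(r):
    """Edge set of the r-dimensional hypercube, as unordered node pairs."""
    nodes = ["".join(bits) for bits in product("01", repeat=r)]
    return {frozenset((u, v)) for u in nodes for v in neighbors(u)}

edges_4d = hypercube_edges(4)
```

Each of the 2^r nodes has r neighbors and every edge is shared by two nodes, so the 4-D hypercube has 4 · 2^3 = 32 edges.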
Spanning Binomial Tree
Search and broadcast in a hypercube can be done by traversing its spanning binomial tree.
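One common way to obtain such a tree is sketched below, as an assumption rather than the construction used in the talk. Nodes are integers, and a node forwards only along dimensions above the highest bit in which it already differs from the root, so every node is reached exactly once.

```python
from collections import deque

def children(v, root, r):
    """Children of v in a spanning binomial tree of the r-cube rooted at `root`."""
    diff = v ^ root
    first = diff.bit_length()  # dimensions below this were used on the path from the root
    return [v ^ (1 << i) for i in range(first, r)]

def broadcast_order(root, r):
    """Breadth-first traversal of the tree, i.e. one possible broadcast schedule."""
    order, queue = [], deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        queue.extend(children(v, root, r))
    return order

order = broadcast_order(0b0101, 4)
```

The root has r children, and the depth of a node equals the number of bits in which it differs from the root.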
Subhypercube
• A subhypercube of H_r(V_r, E_r) induced by u, denoted by H_r(u), is a subgraph G = (U, F) of H_r such that every node w ∈ V_r is in U if and only if w contains u, and every edge e ∈ E_r is in F if and only if its two endpoints are in U.
• Here w contains u if w has a 1 in every bit position where u has a 1.
(Figure: H_3 and H_4(0100).)
Outline
• Background
• Some Preliminaries
• The Hypercube Index Scheme
• Simulation
• Conclusions
Our Index Scheme
• A conceptual r-D hypercube is built over the DHT to index objects.
• Each object o with keyword set K_o = {w1, w2, …, wk} is mapped to a unique r-bit vector F_h(K_o) by a hash h: W → {0, 1, …, r-1} as follows: bit i of F_h(K_o) is 1 if and only if h(w) = i for some keyword w in K_o.
• The node F_h(K_o) in the hypercube is responsible for indexing o.
(Figure: the r-bit vector F_h(K_o), positions 0 to r-1; for example, h(w1) = 1 and h(w2) = 6 set bits 1 and 6.)
Object Insert/Delete/Pin Search
(Figure: a 4-dimensional hypercube with nodes 0000 to 1111.)
• To insert/delete an object o with keyword set K_o into/from the system:
  - Find the node F_h(K_o) that is responsible for o
  - Insert/delete the index information of o at that node
• Example: u publishes object A with K_A = {w1, w2}; F_h(K_A) = 0101, so A is indexed at node 0101.
• Pin search: x asks node F_h(K_A) = 0101 for any object of {w1, w2}.
Index table at node 0101:
  {w1, w2}: {(A, u), …}
  {w1, w7}: …
  …
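The three operations can be sketched in a single process, with one table per logical node. In a real deployment each table would live on a different DHT node. The class name, the method names, and the hash values (chosen so that {w1, w2} and {w1, w7} both map to 0101, as in the slide's index table) are assumptions.

```python
from collections import defaultdict

class HypercubeIndex:
    """Toy in-memory index: node ID -> keyword set -> set of (object, owner)."""

    def __init__(self, r, h):
        self.r, self.h = r, h
        self.tables = defaultdict(lambda: defaultdict(set))

    def node_for(self, keywords):
        """F_h(K): set bit h(w) for every keyword w in K."""
        bits = ["0"] * self.r
        for w in keywords:
            bits[self.h(w)] = "1"
        return "".join(bits)

    def insert(self, obj, owner, keywords):
        self.tables[self.node_for(keywords)][frozenset(keywords)].add((obj, owner))

    def delete(self, obj, owner, keywords):
        self.tables[self.node_for(keywords)][frozenset(keywords)].discard((obj, owner))

    def pin_search(self, keywords):
        """Objects indexed under exactly this keyword set; only one node is contacted."""
        return set(self.tables[self.node_for(keywords)][frozenset(keywords)])

# Assumed hash values for a 4-bit example.
toy_hash = {"w1": 1, "w2": 3, "w7": 3}
index = HypercubeIndex(4, toy_hash.get)
index.insert("A", "u", {"w1", "w2"})
found = index.pin_search({"w1", "w2"})
```

Insert, delete, and pin search each touch exactly one node, because the responsible node is determined by the keyword set alone.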
Superset Search
• To search for objects that can be described by a keyword set K (objects o with K_o ⊇ K), we need only search the subhypercube induced by the node F_h(K).
• E.g., to search for objects that can be described by K_A = {w1, w2}, we need to search all nodes of the form x1x1 (since F_h(K_A) = 0101).
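The set of nodes to visit can be enumerated by keeping the 1 bits of F_h(K) fixed and letting every 0 bit vary. This is a minimal sketch; the function name `subhypercube` is illustrative.

```python
from itertools import product

def subhypercube(u):
    """All node IDs w that contain u, i.e. have a 1 wherever u has a 1."""
    free = [i for i, bit in enumerate(u) if bit == "0"]
    nodes = []
    for choice in product("01", repeat=len(free)):
        w = list(u)
        for i, bit in zip(free, choice):
            w[i] = bit
        nodes.append("".join(w))
    return nodes

candidates = subhypercube("0101")
```

For 0101 this yields the four nodes of the form x1x1. In general a query whose vector has j one-bits touches 2^(r-j) nodes, so queries with more keywords tend to search fewer nodes.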
Flexible Superset Search
• The spanning binomial tree of the subhypercube can be visited in various ways:
  - Top-down: general objects first
  - Bottom-up: specific objects first
  - Priority can also be distinguished among nodes at the same depth
• Note that the hypercube is purely conceptual; each logical node corresponds directly to a physical node in the DHT. So the tree traversal can be flexible, as the underlying DHT provides the basic communication.
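A sketch of the two visiting orders, assuming the depth of a node is the number of extra 1 bits it has beyond F_h(K). The order within one depth is arbitrary here, and `visit_order` is a hypothetical name.

```python
from itertools import combinations

def visit_order(u, top_down=True):
    """Nodes of the subhypercube induced by u, grouped by depth below u."""
    free = [i for i, bit in enumerate(u) if bit == "0"]
    depths = range(len(free) + 1)
    order = []
    for depth in (depths if top_down else reversed(depths)):
        for extra in combinations(free, depth):  # which free bits to switch on
            w = list(u)
            for i in extra:
                w[i] = "1"
            order.append("".join(w))
    return order

general_first = visit_order("0101", top_down=True)
specific_first = visit_order("0101", top_down=False)
```

Top-down starts at F_h(K) itself, where objects with the fewest additional keywords are indexed. Bottom-up starts at the all-ones end of the subhypercube.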
Simulation
• Data set
  - 131,180 web site records from PCHome (http://www.pchome.com.tw)
  - Each web site record is maintained manually by experienced editors and contains the following fields: ID, Title, URL, Category, Description, Keyword
Keyword Frequency
(Figure: keyword frequency; logarithms are in base e.)
Object vs. Node Distribution
(Figure: X-axis: dimensionality r of the hypercube.)
Query Performance (cacheless)
(Figure: m: keyword set size.)
Conclusions
• Our hypercube index scheme has the following characteristics:
  - Load balancing
  - Fault tolerance
  - Efficient object insert/delete
  - Direct pin search
  - A variety of ways for superset search
    - Ranking can be based on this diversity
    - Personalization services can also be built
• The hypercube index scheme is decomposable
  - Multiple hypercubes can be built for multi-attribute search
Future Challenges
• Flexible Keyword Search
  - Boolean
  - Prefix / Range Query
  - Wildcard / Fuzzy Query
  - Semantic Query
• Semantic Routing
Two Types of Services
• White page service
  - search by names
  - “Lord of the rings.mpg”
• Yellow page service
  - search by attributes
  - “rings”, “lord”, “mpg”
• Keyword search is the basis for yellow page services
• Both services can be easily supported in unstructured P2Ps or P2Ps with a centralized server. Yellow page service, however, is not easy in DHTs.
Distributed Inverted Indexing
(Figure: each keyword's inverted list is stored at a different node.)
  w1: {A, B, D}
  w2: {A, C, E}
  w3: {B, E}
  w4: {C, E}
  w5: {B, D}
Query: Keywords = {w1, w5} → {A, B, D} ∧ {B, D} = {B, D}
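A multi-keyword query over an inverted index amounts to intersecting the posting lists of the keywords. The sketch below uses the lists as read from the slide's figure; the function name `keyword_search` is hypothetical. In a DHT each list sits on a different node, so one list has to travel over the network for each intersection step.

```python
def keyword_search(index, keywords):
    """Objects that contain every keyword: intersect the posting lists."""
    postings = sorted((index.get(w, set()) for w in keywords), key=len)
    if not postings:
        return set()
    result = set(postings[0])  # start from the shortest list to keep results small
    for posting in postings[1:]:
        result &= posting
    return result

# Posting lists as read from the slide's figure.
inverted = {
    "w1": {"A", "B", "D"},
    "w2": {"A", "C", "E"},
    "w3": {"B", "E"},
    "w4": {"C", "E"},
    "w5": {"B", "D"},
}

answer = keyword_search(inverted, ["w1", "w5"])
```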
Zipf's Law
• In a real-world corpus, keyword frequency (the count of a keyword's occurrences in objects) varies enormously. A few keywords occur very often while many others occur rarely (in a power-law relationship).
  - e.g., mp3, ring, lord
• Zipf's law implies that a straightforward distributed implementation of an inverted index results in an extremely imbalanced load.
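To get a feel for the imbalance, the sketch below assumes an idealized Zipf distribution in which the k-th most frequent keyword has frequency proportional to 1/k^s. The exponent s = 1 and the vocabulary size of 1000 are assumptions, not values measured from the data set.

```python
def zipf_shares(n, s=1.0):
    """Share of all keyword occurrences taken by each of n keywords, by rank."""
    weights = [1 / k**s for k in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

shares = zipf_shares(1000)
top_share = shares[0]                   # share handled by the hottest keyword's node
imbalance = shares[0] / shares[-1]      # hottest keyword vs. rarest keyword
```

Under these assumptions the node holding the most popular keyword handles more than a tenth of all index entries, about a thousand times the load of the node holding the rarest keyword.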
Other Problems
• Storage redundancy
  - An object o containing keywords {w1, w2, …, wk} is stored repeatedly at k different sites.
  - Increases insert/delete complexity
  - Decreases consistency
• Fault tolerance
  - A failure of a site would block all queries containing a keyword handled by that site.
  - Nodes handling hot keywords may be swamped.
• Object ranking is difficult
  - Ranking in general requires global knowledge, e.g. inverse document frequency (IDF)
Our Keyword Indexing Scheme
• The index entries of a single keyword are deterministically handled by a set of nodes.
  - Fault tolerance
• The size of this set depends on the popularity of the keyword.
  - Load balancing
• An object o with a keyword set K is indexed at exactly one node, and the node is determined uniquely by K.
  - No storage redundancy
  - Insert/delete is efficient
Ranking
• Given a keyword set K, the set S_K of nodes that may be responsible for a superset of K is fixed. The larger K is, the smaller S_K is.
• Within S_K, the nodes are distinguished according to the keyword sets they are responsible for, as follows:
  - K+{w1}, K+{w2}, K+{w3}, …
  - K+{w1,w2}, K+{w1,w3}, K+{w1,w4}, … K+{w2,w1}, …
  - K+{w1,w2,w3}, K+{w1,w2,w4}, K+{w1,w2,w5}, …
  - …
• So there is much leeway in visiting the nodes to retrieve objects in the order required by an application.