Comparing Hybrid Peer-to-Peer Systems
Based on an article by Hector Garcia-Molina and Beverly Yang, by Tudor Balan
P2P short survey
P2P advantages
• Resources of many computers can be gathered to form large pools of information and significant computing power.
• Network bandwidth use improves significantly because computers communicate directly.
P2P drawbacks
• Arise from the decentralized nature.
• Ex. Gnutella (network flooding & no scalability)
• Improvements
  • Ex. Napster (restricted server search, fractional indexing)
Goal
• Study the functionality of P2P systems in order to understand their tradeoffs.
• Concentrate on data sharing and hybrid P2P systems.
Data sharing overview
• Data sharing systems
  • Pure data sharing systems
  • Hybrid data sharing systems
• Hybrid data sharing systems are hugely popular, but … are they also well studied?
  • Which is the best way to organize server nodes?
  • Should indexes be replicated?
  • Which are the common queries asked by users?
  • How to treat disconnected users?
Problem analysis and treatment
• Present several architectures for P2P data sharing systems, either already in use or proposed.
• Give a probabilistic model for user queries and for the result size.
• Illustrate a model for evaluating system performance.
• Based on the above, make some comparisons.
Server architecture: General concepts
• Login
  • library
  • connecting
  • metadata upload
  • index
  • connection information (client IP, line speed)
  • local server
  • remote servers
  • local users
• Query
  • list of desired words
  • satisfied (max nr of results reached)
  • query processing (retrieve and intersect the lists for each query word)
• Download
  • library enrichment notification
  • index update
  • server notification when a remove/logoff occurs
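The query-processing step above (retrieve one list per query word, then intersect) can be sketched as a toy inverted index. This is a minimal illustration, not the system described in the article: the class `ServerIndex`, its method names, and the choice of splitting file names on whitespace are all assumptions made for the example.

```python
from collections import defaultdict


class ServerIndex:
    """Toy inverted index: word -> set of (user, filename) entries.

    Hypothetical structure for illustration only.
    """

    def __init__(self):
        self.lists = defaultdict(set)

    def login(self, user, library):
        # Metadata upload: index every word of every file name in the library.
        for filename in library:
            for word in filename.lower().split():
                self.lists[word].add((user, filename))

    def query(self, words, max_results):
        # Retrieve the list for each query word, then intersect them.
        per_word = [self.lists.get(w.lower(), set()) for w in words]
        if not per_word:
            return []
        matches = set.intersection(*per_word)
        # The query is "satisfied" once max_results entries are returned.
        return sorted(matches)[:max_results]


idx = ServerIndex()
idx.login("alice", ["blue moon", "blue sky"])
idx.login("bob", ["blue moon live"])
both_words = idx.query(["blue", "moon"], max_results=10)
capped = idx.query(["blue"], max_results=2)
```

Here `both_words` holds only the entries whose names contain both words, and `capped` stops at the requested maximum.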
Server architectures: Login policies
• Batch
  • Login: entire library metadata uploaded
  • Logoff: entire library metadata removed
  • Index = {metadata of active users}
  • Advantages
    • Small index size
    • Increased query efficiency
  • Disadvantages
    • Intense and expensive metadata updates
• Incremental
  • Metadata persists across sessions
  • Only the difference is uploaded
  • Advantages
    • Less effort at login/logoff
  • Disadvantages
    • Increased memory requirements
    • Penalty on query efficiency
    • Need to connect to the same server (sometimes)
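The difference between the two login policies can be sketched as follows. This is a simplified illustration under stated assumptions: the classes `BatchIndex` and `IncrementalIndex` are hypothetical, a library is reduced to a set of file names, and the "login cost" is simply the number of metadata entries that have to be transferred.

```python
class BatchIndex:
    """Batch policy: the index holds metadata of active users only."""

    def __init__(self):
        self.metadata = {}  # user -> set of file names

    def login(self, user, library):
        # The entire library is uploaded at every login.
        self.metadata[user] = set(library)
        return len(self.metadata[user])  # entries transferred

    def logoff(self, user):
        # The entire library is removed at logoff.
        self.metadata.pop(user, None)

    def size(self):
        return sum(len(files) for files in self.metadata.values())


class IncrementalIndex:
    """Incremental policy: metadata persists, only differences are sent."""

    def __init__(self):
        self.metadata = {}   # user -> set of file names (kept after logoff)
        self.active = set()  # users currently logged in

    def login(self, user, library):
        current = set(library)
        previous = self.metadata.get(user, set())
        changed = current ^ previous  # files added or removed since last session
        self.metadata[user] = current
        self.active.add(user)
        return len(changed)  # entries transferred

    def logoff(self, user):
        # Metadata stays in the index; the user is only marked inactive.
        self.active.discard(user)

    def size(self):
        return sum(len(files) for files in self.metadata.values())
```

In this sketch a returning user with one new file costs the batch index a full upload but costs the incremental index a single entry, while the incremental index keeps paying memory for users who are offline.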
Server architectures
• Chained Architecture
  • Linked server nodes
  • Login
    • Metadata uploaded to the local server
    • Other server nodes unaffected
  • Query
    • Submitted to the local server
    • While (not enough results AND not all servers have received and serviced the query)
      • local server contacts other servers
    • End While
  • Performance
    • Efficient logins and downloads (conversation with the local server only)
    • Expensive query processing (query forwarding, multiple query executions, results retrieval)
• Full Replication Architecture
  • Intended to overcome the previous disadvantages
  • Each server contains a complete index
  • Advantages
    • A single server is queried
    • Login at any server (even with the incremental policy)
  • Disadvantages
    • Logins are sent to all servers
    • Highly sensitive to login/logoff frequency
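The chained query loop can be sketched as below. It is a minimal illustration, assuming that each server is represented as a plain list of file names, that `servers[0]` is the user's local server, and that a file matches when its name contains every query word; the function names are hypothetical.

```python
def search(server_files, words):
    """Return the files on one server whose names contain every query word."""
    wanted = [w.lower() for w in words]
    return [
        name for name in server_files
        if all(w in name.lower().split() for w in wanted)
    ]


def chained_query(servers, words, max_results):
    """Query the local server first, then forward along the chain.

    Forwarding stops as soon as enough results are collected or every
    server has serviced the query. Returns (results, servers_contacted).
    """
    results = []
    contacted = 0
    for server_files in servers:
        if len(results) >= max_results:
            break  # query satisfied, no further forwarding
        contacted += 1
        remaining = max_results - len(results)
        results.extend(search(server_files, words)[:remaining])
    return results, contacted


servers = [["blue moon"], ["blue sky", "red moon"], ["blue moon live"]]
satisfied_early = chained_query(servers, ["blue"], max_results=2)
visits_all = chained_query(servers, ["moon"], max_results=5)
```

The second call shows the expensive case named above: the query cannot be satisfied locally, so it is executed on every server in the chain.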
Server architectures
• Hash Architecture
  • Login
    • Metadata words are hashed to the set of servers
    • A given server maintains the complete lists for a subset of all words
  • Query
    • Addressed to only one server
    • The addressed server asks other servers for the lists of the words it doesn't have
    • The addressed server merges all lists
  • Advantages
    • Limited nr of servers involved in processing each query
    • Limited nr of servers update metadata
    • No results traffic (only lists)
  • Disadvantages
    • High bandwidth for list manipulations
• Unchained Architecture
  • Set of independent servers
  • Login
    • To one isolated server
    • No other servers are affected
  • Query
    • Sent to the server the user has logged on to
  • Advantage
    • Scalability
  • Disadvantage
    • Partial functionality
    • Limited query results
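The hash architecture's word-to-server assignment can be sketched as follows. This is an illustration under assumptions: the class `HashCluster` and the function `server_for` are hypothetical, and CRC32 modulo the number of servers is just one stable hash chosen for the example (the source does not say which hash function is used).

```python
import zlib
from collections import defaultdict


def server_for(word, num_servers):
    # Assumption: any stable hash works; CRC32 is used so that the
    # assignment is identical across runs and across processes.
    return zlib.crc32(word.lower().encode("utf-8")) % num_servers


class HashCluster:
    """Each server keeps the complete lists for the words hashed to it."""

    def __init__(self, num_servers):
        self.num_servers = num_servers
        self.servers = [defaultdict(set) for _ in range(num_servers)]

    def add_file(self, user, filename):
        touched = set()
        for word in filename.lower().split():
            s = server_for(word, self.num_servers)
            self.servers[s][word].add((user, filename))
            touched.add(s)
        return touched  # only these servers had to update metadata

    def query(self, words):
        # The addressed server gathers one list per word (fetching from the
        # owning server where needed) and merges them by intersection.
        lists = []
        for word in words:
            owner = server_for(word, self.num_servers)
            lists.append(self.servers[owner].get(word.lower(), set()))
        return set.intersection(*lists) if lists else set()


cluster = HashCluster(num_servers=4)
touched = cluster.add_file("alice", "blue moon")
cluster.add_file("bob", "blue sky")
hits = cluster.query(["blue", "moon"])
```

Only whole lists travel between servers in this sketch, which mirrors the tradeoff stated above: no results traffic, but potentially large list transfers.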
Query model
• Needed for comparing systems
• Goals
  • Estimate the number of query results
  • Estimate the nr of servers needed to process a query
• Initial computations are done for the chained architecture (more complex)
• Computations for the other architectures are then derived (by relaxing or particularizing the chained architecture conditions)
Query model (continued): Chained architecture
• Assume a query universe q1, q2, …
• g = the probability function that describes query popularity, i.e., g(i) is the probability that a submitted query happens to be query i
• f = the probability density function that describes query selection power: if we take a given file in a user's library, it will match query i with probability f(i)
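Under these definitions, a rough expected result count follows from linearity of expectation: if an index holds N files and each matches query i with probability f(i), a random query matches N · Σ g(i) f(i) files on average. The sketch below computes that quantity over a finite query universe. It is a simplification and not the article's full derivation: it ignores the MaxResults cap and the per-server breakdown, and the function name and example numbers are hypothetical.

```python
def expected_results(g, f, num_files):
    """Expected number of matching files for a randomly submitted query.

    g[i]: probability that the submitted query is query i (must sum to 1).
    f[i]: probability that a given file matches query i.
    num_files: number of files in the index (assumed independent matches).
    """
    if len(g) != len(f):
        raise ValueError("g and f must cover the same query universe")
    if abs(sum(g) - 1.0) > 1e-9:
        raise ValueError("g must be a probability distribution")
    return num_files * sum(gi * fi for gi, fi in zip(g, f))


# Hypothetical three-query universe, for illustration only.
g = [0.5, 0.3, 0.2]
f = [0.10, 0.05, 0.01]
estimate = expected_results(g, f, num_files=1000)
```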
Query model (continued)
• Full replication
  • ExServ = 1: all results are local
  • ExRemoteResults = 0
• Unchained
  • ExServ = 1: all results are local
  • ExRemoteResults = 0
• Hash
  • ExRemoteResults = 0
Particularization
• In the case of music sharing, g and f might realistically both be taken as exponential distributions.
Performance model
• Illustrates how to measure the performance of a P2P system
• NumServers (LAN, WAN)
• Users (LAN, WAN)
• {LAN, WAN} × {LAN, WAN}
• Compute action costs in terms of:
  • CPU cycles
  • Interserver communication bandwidth
  • User-server communication bandwidth
CPU consumption
• CPU cost variations for the chained architecture (batch and incremental)
Interpretation
• CPU cost variations for the other architectures (relative to the chained one)
  • Unchained & Full replication
    • The query cost formula (batch & incremental) is the same
    • …but ExServ = 1 and ExRemoteResults = 0
  • Hash
    • Additional cost for list transfer (in the query costs)
Network consumption
• Client-server byte costs
• Interserver byte costs
  • Full replication
    • Each server sees each Login, AddFile, RemoveFile
    • LAN: each message is broadcast once
    • WAN: each message is sent NumServers-1 times by the local server
  • Hash
    • Each selected server sees each Login, AddFile, RemoveFile
    • LAN: each message is broadcast once
    • WAN: AddFile is sent once to each server containing lists for the words in the file's name
  • Unchained
    • No interserver communication
    • 0 login costs
  • Chained
    • Interserver communication for queries
    • No interserver communication for logins
    • 0 login costs
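The interserver rules above can be turned into a small message-count helper for a single AddFile message. This is only a sketch of the bullet points, not the article's byte-cost formulas: the function name, the architecture codes used as strings, and the `word_servers` parameter (the number of servers holding lists for the words in the file's name) are assumptions made for the example.

```python
def addfile_interserver_sends(architecture, network, num_servers, word_servers=0):
    """Number of interserver transmissions caused by one AddFile message.

    architecture: "CHN", "UCH", "FR" or "HASH"
    network: "LAN" (broadcast available) or "WAN"
    word_servers: for HASH on a WAN, servers holding lists for the file's words
    """
    if architecture not in ("CHN", "UCH", "FR", "HASH"):
        raise ValueError(f"unknown architecture: {architecture}")
    if network not in ("LAN", "WAN"):
        raise ValueError(f"unknown network: {network}")
    if architecture in ("CHN", "UCH"):
        return 0  # metadata changes stay on the local server
    if network == "LAN":
        return 1  # one broadcast reaches every interested server
    if architecture == "FR":
        return num_servers - 1  # local server sends to every other server
    return word_servers  # HASH on a WAN: once per server owning a word list
```

For example, with five servers on a WAN, full replication sends each AddFile four times, while the hash architecture sends it only to the servers owning the relevant word lists.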
Overall performance
• Hypothesis: the cost formula for each action is known
• Performance metric: UsersPerServer
• How to compute a global formula for UsersPerServer? (Directly? Too complex.)
• For each resource i
  • Assume infinite resources of the other 2 types
  • Compute UsersPerServer for the current resource (UsersPerServer_i)
• Take the minimum over all resources: min(UsersPerServer_i)
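The bottleneck computation described above reduces to taking a minimum over per-resource limits. A minimal sketch, assuming the three per-resource UsersPerServer values have already been computed; the function name, dictionary keys, and numbers are hypothetical placeholders, not results from the article.

```python
def users_per_server(per_resource_limits):
    """Return (UsersPerServer, bottleneck resource).

    per_resource_limits maps a resource name to the number of users a server
    could support if only that resource were finite.
    """
    if not per_resource_limits:
        raise ValueError("at least one resource limit is required")
    bottleneck = min(per_resource_limits, key=per_resource_limits.get)
    return per_resource_limits[bottleneck], bottleneck


# Placeholder values for illustration only.
limits = {
    "cpu": 54000,
    "interserver_bandwidth": 120000,
    "user_server_bandwidth": 80000,
}
capacity, bottleneck = users_per_server(limits)
```

Reporting the bottleneck alongside the capacity makes it clear which resource to relax when comparing strategies.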
Experiments
• Results of performance studies
  • Music sharing systems
  • Sharing systems for domains other than music
• Metric: maximum number of users (throughput, not response time)
• Architectures = {CHN, FR, HASH, UCH}
• Login policies = {batch, incremental}
• Strategies = Architectures × Login policies
Music share systems behaviour
• For MaxResults=100:
  • QueryLoginRatio
  • nr of logins/sec
  • users supported, available files, expected nr of results
• Ex: for QueryLoginRatio=1:
  • Incremental FR = 54203
  • Batch FR = 7281
• As QueryLoginRatio increases, logins/sec decrease and performance increases
• Incremental strategies outperform their batch counterparts
• CHN & UCH perform better than FR & HASH
• UCH → CHN (conserves performance but increases returned results)
• Paradox: UCH is more used than CHN
• QueryLoginRatio sensitivity
Memory analysis
• No previous treatment of memory implications
• Batch strategies are better than their incremental counterparts
• Memory = f(NumServers, ActiveFrac)
  • As NumServers increases, memory increases (for FR)
  • Memory of incremental = (1/ActiveFrac) × memory of batch
  • As ActiveFrac increases, incremental strategies come closer to batch
• Memory prices may eliminate worries about memory limitations
Small analysis
• Ex. 1: QueryLoginRatio = .75 (incremental & batch CHN comparison): (69708, 26828) vs (12268, 28828), take batch
• Ex. 2: QueryLoginRatio = .25 (incremental & batch CHN comparison): (52088, 9190) vs (12268, 9190), take incremental
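The memory relation on this slide can be checked with a few lines of arithmetic. The sketch reads the slide's relation as incremental memory = batch memory / ActiveFrac, where ActiveFrac is taken to be the fraction of users who are logged in; that reading, the function name, and the sample values are assumptions.

```python
def incremental_memory(batch_memory, active_frac):
    """Memory needed by an incremental index, given the batch index memory.

    The batch index stores only active users, so keeping everyone's metadata
    scales the requirement by 1 / active_frac.
    """
    if not 0.0 < active_frac <= 1.0:
        raise ValueError("active_frac must be in (0, 1]")
    return batch_memory / active_frac


# Illustrative units: the ratio shrinks toward 1 as more users are active.
ratios = [incremental_memory(100.0, a) / 100.0 for a in (0.1, 0.5, 1.0)]
```

The ratios fall from 10 to 2 to 1, which is the sense in which incremental strategies come closer to batch as ActiveFrac grows.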
Beyond music…
• In general we can compute
  • the expected nr of results of a query
  • the expected nr of servers needed to satisfy the query
• …using
  • g(): distribution of query frequency
  • f(): distribution of selection power
• f and g are the inputs of the general query model
• For music, f and g are exponential and positively correlated (the more popular a query is, the greater its selection power); all preceding results assume this
• What if we have a stock database?
  • Select * from Product where price>10 (rare query) returns as many results as
  • Select * from Product (common query)
  • No correlation
• What about an archive-driven company?
  • Rare queries (for old articles) return many results
  • Frequent queries (for new articles) return few results
  • Negative correlation
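The effect of the correlation between g and f can be illustrated numerically: with the same popularity distribution g, the average probability that a file matches a submitted query, Σ g(i) f(i), is highest when popular queries are also the most selective, and lowest when the ordering is reversed. The sketch below uses a truncated exponential shape over a finite universe; the universe size, scale values, and function names are hypothetical.

```python
import math


def normalized_exponential(n, scale):
    """Exponentially decaying weights over n queries, normalized to sum to 1."""
    weights = [math.exp(-i / scale) for i in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def average_match_probability(g, f):
    """Probability that a given file matches a randomly submitted query."""
    return sum(gi * fi for gi, fi in zip(g, f))


n = 100
g = normalized_exponential(n, scale=20.0)                   # query popularity
f_positive = [0.01 * math.exp(-i / 20.0) for i in range(n)]  # music-like case
f_none = [sum(f_positive) / n] * n                          # no correlation
f_negative = list(reversed(f_positive))                     # archive-like case

positive = average_match_probability(g, f_positive)
uncorrelated = average_match_probability(g, f_none)
negative = average_match_probability(g, f_negative)
```

All three selection-power lists have the same total, so the differences between `positive`, `uncorrelated`, and `negative` come only from how f lines up with g.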
Final Conclusions
• Chained
  • Best for music today
  • Good login, least memory
  • Poor if many servers are involved
• Full replication
  • Potentially good in the future, when connections are more stable
• Hash
  • Has high bandwidth requirements
  • Good in the future, or in systems where servers must not exchange large amounts of metadata
• Unchained
  • Not recommended
  • Few results for only a small performance improvement
  • Good when the nr of results is not important
• Incremental policy
  • Good for systems with negative correlation