Explore the evolution of Peer-to-Peer (P2P) Information Retrieval (IR) beyond Napster, the advantages and disadvantages of P2P systems, and the future directions for decentralized and traditional IR methods.
Peer-to-Peer Information Retrieval: Going beyond Napster
What is P2P IR? • No index on a central server • Content is distributed across all users of the system • Content is more than text • Binary files • Associated metadata
Why go P2P • Spiraling costs of maintaining indexes • Look at Google’s server farm • New content forces new thinking on IR • Large binary files are hard to index • Freedom of speech • Society is striving to communicate data which is being legislated against
First P2P Systems • Central hash of distributed content • Only the central hash was used for queries • Disadvantages: • Scalability • Known location of content • Single point of failure • Advantages • Quick searching • Deterministic search results
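The central-index design on this slide can be sketched as below. This is a minimal illustration, not the protocol of any real system: the `CentralIndex` class, its method names, and the choice to index whitespace-separated tokens of the file name are all assumptions made for the example. The files stay on the peers; the server only records who has what.

```python
from collections import defaultdict


class CentralIndex:
    """Toy central index (hypothetical): keyword -> set of (peer, filename)."""

    def __init__(self):
        self._entries = defaultdict(set)

    def register(self, peer, filename):
        # Index every whitespace-separated token of the file name.
        for token in filename.lower().split():
            self._entries[token].add((peer, filename))

    def unregister_peer(self, peer):
        # Drop all entries belonging to a peer that has left the network.
        for token in list(self._entries):
            kept = {e for e in self._entries[token] if e[0] != peer}
            if kept:
                self._entries[token] = kept
            else:
                del self._entries[token]

    def search(self, query):
        # Only the central index is consulted; no peer is contacted.
        tokens = query.lower().split()
        if not tokens:
            return []
        hits = set.intersection(*(self._entries.get(t, set()) for t in tokens))
        # Sorting makes the result order repeatable for the same index state.
        return sorted(hits)
```

A single dictionary lookup answers each query, which is where the quick, deterministic results come from. The same object is also the single point of failure and the one place that knows where all content lives.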
Bumps that caused change • Legal • Centralized services were easy targets • Owners of index could not claim they had no knowledge of content • Growth • Cost of maintaining service grew • Hardware requirements exploded
Decentralized P2P • Content spread between users with no explicit intent • Centralized server is replaced by a self-maintaining network • Every user is also a server • There is no index of content • How do we search?
Searching Decentralized P2P Systems • Many methods, none perfected yet • Broadcast search • Advantages • Every node takes part in the query • Disadvantages • As the system grows, network bandwidth and query time grow exponentially
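To put a rough number on the cost of broadcast search, the sketch below counts the messages one flooded query can generate. It rests on simplifying assumptions that are not in the slides: every node has the same number of neighbours (`degree`), each node forwards the query to all neighbours except the one it came from, and no two paths reach the same node. Under those assumptions the count is an upper bound, and the function name `flood_messages` is made up for the example.

```python
def flood_messages(degree, hops):
    """Upper bound on messages sent by one flooded query.

    Assumes every node has `degree` neighbours and the overlay is
    tree-like, so no node receives the query twice.
    """
    total = 0
    frontier = degree  # messages sent by the originator on the first hop
    for _ in range(hops):
        total += frontier
        # Each receiver forwards to every neighbour except the sender.
        frontier *= degree - 1
    return total


for h in (1, 2, 3, 7):
    print(h, flood_messages(4, h))
```

With four neighbours per node the totals are 4, 16, 52 and 4372 messages for 1, 2, 3 and 7 hops, so the traffic grows exponentially with the distance a query is allowed to travel.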
Intelligent P2P Crawls • Ways to improve decentralized P2P query • Intelligently place data (FreeNet) • By knowing the algorithm that distributes data, querying can be done more intelligently • Clustering (Fireworks model) • Clients with similar properties are logically grouped • Queries that don’t apply to a group will not be sent to that entire group of clients • Both change the paradigm of what kind of data is shared and the means of sharing
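The clustering idea can be illustrated with a small routing sketch. It is not the Fireworks model itself, only the general principle from the slide: peers are grouped by a shared property, and a query is sent only to the group it applies to. The topic labels, the `build_clusters` and `recipients` names, and the use of exact topic matching are assumptions for the example.

```python
from collections import defaultdict


def build_clusters(peer_topics):
    """Group peers by topic. `peer_topics` maps peer -> set of topic labels."""
    clusters = defaultdict(set)
    for peer, topics in peer_topics.items():
        for topic in topics:
            clusters[topic].add(peer)
    return clusters


def recipients(clusters, query_topic):
    """Peers that should receive a query; all other clusters are skipped."""
    return sorted(clusters.get(query_topic, set()))


peer_topics = {"p1": {"jazz"}, "p2": {"jazz", "rock"}, "p3": {"rock"}}
clusters = build_clusters(peer_topics)
print(recipients(clusters, "jazz"))
```

A jazz query here reaches `p1` and `p2` but never `p3`, which is the bandwidth saving the slide describes. The cost is that peers must publish or infer the properties used for grouping.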
Other improvements • Today, most networks still rely on brute-force search • CRC/MD5 hashing • A checksum of each file is computed • Instead of searching metadata, search for the file hash • Files that are identical, but mislabeled, are still returned
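Searching by checksum might look like the following sketch, which uses MD5 from Python's standard `hashlib` module. The helper names `file_digest` and `build_hash_index`, and the in-memory `shared` dictionary standing in for files on disk, are assumptions for illustration.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """MD5 checksum of a file's contents, as a hex string."""
    return hashlib.md5(data).hexdigest()


def build_hash_index(shared):
    """Map checksum -> list of file names. `shared` maps name -> bytes."""
    index = {}
    for name, data in shared.items():
        index.setdefault(file_digest(data), []).append(name)
    return index


shared = {
    "track01.mp3": b"same audio bytes",
    "mislabeled.mp3": b"same audio bytes",  # identical content, different label
    "other.mp3": b"different audio bytes",
}
index = build_hash_index(shared)
print(index[file_digest(b"same audio bytes")])
```

Both names that share the same bytes come back from a single hash lookup, regardless of how each file was labeled. The searcher does need to know the hash in advance, so this complements metadata search rather than replacing it.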
Query time limiting • To save inter-system bandwidth, searches terminate after X hops • Client ends query after 100 results • Searches time out after X seconds
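Two of these limits, the hop count and the result cap, can be sketched as a breadth-first flood over an in-memory graph. The function name `limited_flood`, the adjacency-list `graph`, and the substring match on file names are assumptions for the example. The time-out is left out because it needs real network I/O to be meaningful.

```python
from collections import deque


def limited_flood(graph, files, start, term, max_hops, max_results):
    """Breadth-first flood from `start`, bounded by hops and result count.

    `graph` maps node -> list of neighbours; `files` maps node -> file names.
    """
    results = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        for name in files.get(node, []):
            if term in name:
                results.append((node, name))
                if len(results) >= max_results:
                    return results  # result cap reached, stop the query
        if hops == max_hops:
            continue  # hop limit reached, do not forward any further
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return results


# A chain a - b - c - d, with matching files on b and d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
files = {"b": ["song.mp3"], "d": ["song2.mp3"]}
print(limited_flood(graph, files, "a", "song", max_hops=2, max_results=100))
```

With a limit of two hops the query never reaches `d`, so only the copy on `b` is found. That is the trade-off behind all three limits: less traffic, but content that exists in the network may not be returned.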
Distributed IR • Traditional IR with the advantages of distributed systems • A central server still stores the index • Multiple brokers allow access to the data repository • Multiple gatherers crawl data near to them • Advantages are seen in the data acquisition end
Future Directions • Next steps will be a drastic rethinking of content placement à la FreeNet • Donate X amount of bandwidth, Y amount of hard-disk space • Share Z directories of content • Actual content files are distributed to the network intelligently • Most requested files are blanketed • Unique files are still accessible
Future directions for Traditional IR • Large central repositories such as Google will fade • Internet will be fragmented into clusters of interest • Similar interest groups will have decentralized search facilities • An index of these groups will replace the Googles of today