A Portal-based P2P System for the Distribution and Management of Large Data Sets
Rahim Lakhoo (Raz) and Prof Mark Baker
ACET, University of Reading
E-mail: r.n.lakhoo@rdg.ac.uk
Web: http://acet.rdg.ac.uk/~rnl
Outline
• Motivation.
• A Portal-based P2P System:
  • High-level View,
  • Overview,
  • Components.
• P2P Simulators:
  • Our requirements,
  • Simulators investigated,
  • Issues,
  • Experiences.
• Summary.
• Conclusions.
Motivation
• The Sloan Digital Sky Survey (SDSS) uses a telescope to take optical images of the sky.
• Scientific projects such as SDSS are producing and working with very large data sets.
• Current methods for distributing the content involve:
  • Physically shipping disk drives,
  • Splitting the data and transferring it point-to-point from one location to another.
• Data sets are growing for projects like SDSS:
  • Currently, 5 Tbytes,
  • Set to be ~15 Tbytes by the end of the project.
• Storage and bandwidth are costly and limited, and the data sets will inevitably get larger.
• Managing and maintaining these large data sets is difficult, and will only become harder over time.
Motivation
• P2P is used by ordinary users to download multimedia.
• A popular example is BitTorrent.
• Its success stems from its protocol, which makes users share their bandwidth with other people trying to download the same file.
• BitTorrent Concepts:
  • Files are split into small pieces called ‘chunks’,
  • Chunks are seeded (uploaded) by a user,
  • Users download a ‘torrent’ file, which has information about a file,
  • A user loads the ‘torrent’ into an application, which then downloads chunks from different peers,
  • A ‘tracker’ tracks which peers have which chunks.
• Peer-to-Peer (P2P) systems offer a potential way to manage and distribute data sets.
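
The idea of downloading chunks from different peers can be illustrated with a small sketch. It is not BitTorrent's actual piece-selection algorithm; the function name plan_download, the peer names and the "least-loaded peer" rule are assumptions made purely for illustration. The tracker is modelled as a plain dictionary from chunk number to the peers known to hold that chunk.

```python
from collections import defaultdict


def plan_download(tracker, wanted):
    """Pick one source peer per wanted chunk, spreading requests across peers.

    tracker: dict mapping chunk id -> list of peers holding that chunk
             (a stand-in for the information a tracker provides).
    wanted:  iterable of chunk ids the downloader still needs.
    """
    load = defaultdict(int)  # number of chunks already requested from each peer
    plan = {}
    for chunk_id in wanted:
        peers = tracker.get(chunk_id, [])
        if not peers:
            continue  # no known holder; the chunk stays unplanned
        # Hypothetical policy: choose the peer with the fewest requests so far,
        # breaking ties by name so the result is deterministic.
        peer = min(peers, key=lambda p: (load[p], p))
        plan[chunk_id] = peer
        load[peer] += 1
    return plan


tracker = {
    0: ["peer-a", "peer-b"],
    1: ["peer-a"],
    2: ["peer-b", "peer-c"],
    3: ["peer-a", "peer-c"],
}
plan = plan_download(tracker, [0, 1, 2, 3, 4])
```

Here chunk 4 has no known holder, so it is left out of the plan; the remaining requests are shared between the three peers rather than all going to one source.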
High-level View
• Data sets such as SDSS are currently kept in a storage mechanism, such as a RAID array.
• A bootstrapping service is set up and has access to the SDSS data.
• The data is split into chunks and distributed to the Portal P2P services, hosted by different portals.
• Users who access the portal can contribute resources to help store and distribute the data. These are the Mini Peers.
• The Portal P2P services propagate the Mini Peers with parts of the data set.
• Any other project partners who want a copy of the data can join the P2P network and download parts of the data set from Portal and Mini Peers.
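
The distribution of chunks to the Portal P2P services could be sketched as follows. The slides do not say how chunks are mapped to portals, so the round-robin rule, the function name assign_chunks, the portal names and the replicas parameter are all hypothetical.

```python
def assign_chunks(num_chunks, portals, replicas=1):
    """Map chunk ids to portals in round-robin order.

    replicas is an assumed knob: each chunk is placed on that many
    distinct portals, so it must not exceed the number of portals.
    """
    if not portals:
        raise ValueError("at least one portal is required")
    if not 1 <= replicas <= len(portals):
        raise ValueError("replicas must be between 1 and the number of portals")
    assignment = {portal: [] for portal in portals}
    for chunk_id in range(num_chunks):
        for r in range(replicas):
            # Consecutive offsets guarantee the replicas land on distinct portals.
            portal = portals[(chunk_id + r) % len(portals)]
            assignment[portal].append(chunk_id)
    return assignment


portals = ["portal-1", "portal-2", "portal-3"]
assignment = assign_chunks(4, portals, replicas=2)
```

With four chunks, three portals and two replicas, every chunk ends up on two different portals, so the loss of a single portal would not remove any chunk from the network in this sketch.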
Overview
• Ideas are loosely based around the concepts of BitTorrent and Freenet.
• The P2P System consists of:
  • A distributed registry, which stores information for the network peers and also provides a tracker,
  • A Bootstrapping Service, which splits the data set into chunks to be distributed by the peers,
  • A Portal P2P Service, which provides storage and management of the data:
    • This service also propagates chunks to the Mini Peers.
  • Mini Peers, which donate bandwidth and disk space to the network.
Overview
• The registry (VR) provides the distributed tracker:
  • A tracker helps peers locate other peers with chunks to download.
• The Bootstrapper initiates the propagation of the data set to the peers.
• The Portal P2P service manages the Mini Peers.
• The portal has management and monitoring tools for the data set.
• All peers volunteer resources to the P2P network.
The Virtual Registry
• The Virtual Registry (VR) is provided by Tycho.
• Tycho is a wide-area asynchronous message passing system with an integrated distributed registry.
• The VR can store information which can be searched and retrieved by peers on the network.
• Tycho uses HTTP/HTTPS and Sockets/SSL for communications.
• The VR will provide the distributed P2P tracker service, for finding peers with chunks to download.
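
The tracker role of the VR can be pictured as a publish-and-lookup store keyed by chunk hash. The sketch below is an in-memory stand-in only: it does not use or reproduce Tycho's API, and the class name ChunkTracker and its methods publish, withdraw and lookup are hypothetical.

```python
class ChunkTracker:
    """In-memory stand-in for a tracker: chunk hash -> peers holding the chunk."""

    def __init__(self):
        self._holders = {}

    def publish(self, chunk_hash, peer):
        """Record that a peer holds the chunk with this hash."""
        self._holders.setdefault(chunk_hash, set()).add(peer)

    def withdraw(self, chunk_hash, peer):
        """Remove a peer from a chunk's holder list, e.g. when it leaves."""
        holders = self._holders.get(chunk_hash)
        if holders is None:
            return
        holders.discard(peer)
        if not holders:
            del self._holders[chunk_hash]

    def lookup(self, chunk_hash):
        """Return the peers known to hold the chunk, in sorted order."""
        return sorted(self._holders.get(chunk_hash, ()))


vr = ChunkTracker()
vr.publish("hash-1", "portal-reading")
vr.publish("hash-1", "mini-peer-7")
vr.publish("hash-2", "portal-reading")
vr.withdraw("hash-2", "portal-reading")
```

In the system described here the store would be distributed across the registry rather than held in one process; the sketch only shows the kind of query a peer would make.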
The Virtual Registry
• Tycho has a Service Oriented Architecture that uses the concept of producers and consumers.
• In our system, each Tycho mediator has a consumer and a producer for communications.
• Mediators provide the VR with a distributed data store, which uses HSQLDB as its database.
• Local communications are via Sockets/SSL and wide-area communications via HTTP/HTTPS.
The Bootstrapper
• A bootstrapping service is needed to propagate the Portal P2P service with parts of the data set.
• This service splits the data set into chunks.
• Each chunk has an associated hash value, which is stored in the Virtual Registry.
• The bootstrapping service needs access to the original data set(s).
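
Splitting a data set into chunks and attaching a hash value to each one could look like the sketch below. The slides do not name a hash function or a chunk size, so SHA-256 and the tiny 4-byte chunk size are assumptions chosen for the example, as is the function name split_into_chunks.

```python
import hashlib
import io


def split_into_chunks(stream, chunk_size):
    """Yield (index, hex_digest, data) for consecutive chunks of a binary stream.

    SHA-256 is an assumed choice; the final chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    index = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break  # end of the data set
        yield index, hashlib.sha256(data).hexdigest(), data
        index += 1


payload = b"abcdefghij"  # stands in for the original data set
records = list(split_into_chunks(io.BytesIO(payload), chunk_size=4))
# The (index, digest) pairs are what would be stored in the registry.
index_to_hash = {index: digest for index, digest, _ in records}
```

Reading from a stream rather than loading everything into memory matters at this scale: a multi-terabyte data set has to be processed one chunk at a time.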
The Bootstrapper
• The bootstrapping service needs to propagate different chunks to different Portals concurrently.
• Hash values and metadata about the data set and chunks are stored in the VR.
• This service is also used if a requested chunk is not found on the P2P network, due to chunk corruption. In this case, the missing chunk needs to be replaced in the P2P system.
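
The fallback to the bootstrapper for a missing or corrupt chunk might be sketched as follows. The function names, the use of SHA-256 and the modelling of each source as a callable that returns bytes or None are assumptions for illustration, not the system's actual interfaces.

```python
import hashlib


def sha256_hex(data):
    """Hex digest used to check a chunk against its registered hash (assumed SHA-256)."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(chunk_hash, peer_sources, bootstrapper):
    """Return (data, origin) for a chunk whose content matches chunk_hash.

    peer_sources: callables taking a hash and returning bytes or None.
    bootstrapper: callable with the same shape, backed by the original data set.
    """
    for fetch in peer_sources:
        data = fetch(chunk_hash)
        # Accept a peer's copy only if it hashes to the registered value.
        if data is not None and sha256_hex(data) == chunk_hash:
            return data, "peer"
    # No peer had a valid copy: fall back to the original data set.
    data = bootstrapper(chunk_hash)
    if data is None or sha256_hex(data) != chunk_hash:
        raise LookupError("chunk not available from peers or bootstrapper")
    return data, "bootstrapper"


original = b"chunk payload"
expected = sha256_hex(original)
corrupt_peer = {expected: b"chunk pay1oad"}.get  # holds a damaged copy
empty_peer = {}.get                              # does not hold the chunk
origin = {expected: original}.get                # stand-in for the bootstrapper
data, source = fetch_verified(expected, [corrupt_peer, empty_peer], origin)
```

In this run both peers fail the check, so the chunk is served by the bootstrapper stand-in; the valid copy could then be re-published to the network.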
The Portal P2P Service
• The Portal P2P service is a plug-in component for portals.
• This service stores and serves chunks of the data set to other peers in the network.
• The portal service propagates chunks to the Mini Peers.
• The monitoring and management of the data set is handled by the portlet tools and the P2P service.
• The portal service uses Tycho to synchronise management tools across all portals in the network.
The Portal P2P Service
• Each Portal P2P service needs access to a storage mechanism for parts of the data set.
• The storage resources provided by the portals provide space for a copy of the large data set.
• The Portal P2P service also provides parts of the data set to other peers in the P2P network.
• The Portal provides users with an environment for managing and monitoring the data set collaboratively between peers.
The Mini Peers
• Mini Peers donate bandwidth and storage space to the network.
• Mini Peers will interact with the P2P network via their Web browser.
• Mini Peers will store chunks that are useful for other peers.
• Mini Peers aim to help other peers download and distribute the data set.
The Mini Peers
• Client-side Web browser technologies such as Ajax and JavaScript will be used for the Mini Peer.
• They will utilise the VR to publish parts of the data set, to share with other peers in the network.
• Mini Peers will store chunks locally on a user's machine.
P2P Simulators - Requirements
• We wanted to use a simulator to help test and develop our P2P system with greater assurance.
• Running the P2P system in a simulator would allow us to configure scenarios for studying system behaviour.
• Our requirements for a simulator were:
  • Have support for customised P2P protocols,
  • Provide facilities for hierarchical topologies,
  • Provide visualisations,
  • Provide reasonably accurate results in terms of ‘real-world’ performance,
  • Have good support and documentation,
  • Be capable of interfacing with Java.
P2P Simulators
• There are many network simulators; some are more suited to P2P than others.
• Simulators investigated include:
  • NS-2 with NAM,
  • PeerSim,
  • PlanetSim,
  • OMNeT++ and OverSim,
  • General Purpose Simulator (GPS),
  • AgentJ,
  • P2PSim.
Issues
• We shortlisted three simulators:
  • General Purpose Simulator (GPS),
  • AgentJ,
  • OverSim.
• GPS
  • Difficult to implement our own protocol, as the simulator is tightly coupled to the BitTorrent protocol,
  • Stability issues were seen with larger simulations.
• AgentJ
  • Requires a normal Java application,
  • Does not support TCP in the simulation environment.
• OverSim
  • Java support is limited and restrictive. It is not possible to implement a whole simulation with the provided Java support.
Experiences
• No simulator completely fulfilled our requirements.
• We could not successfully implement our Portal-based P2P system in these simulators.
• Some of the simulators are complex and take extensive time to learn.
• Stability issues were seen with some of the simulators.
• Code written for a simulation is specific to a particular simulator. The code cannot be reused in the later stages of development.
• The time needed to implement our P2P system in a simulator is not justified by the advantages gained.
Summary
• We are developing a Portal-based P2P system to help the scientific community to manage, store and distribute large data sets.
• Our Portal-based P2P system introduces the concept of data sets being collaboratively downloaded and managed.
• The Portal-based P2P system has four main components:
  • Virtual Registry,
  • Bootstrapping service,
  • Portal P2P service,
  • Mini Peers.
• We attempted to simulate our design and idea with one of the P2P simulators.
• We have investigated and tested several P2P simulators for their suitability to emulate our design.
• We found that the simulators we studied were inflexible, unstable, and not easy to use; we would have spent more time fixing them than implementing and testing our design on a cluster.
Conclusions
• Distributing and managing large data sets is difficult for projects such as SDSS.
• P2P simulators are not as useful as first thought.
• We will implement our Portal-based P2P system and test it on a suitable test bed, i.e. a cluster.
• Once the development of our P2P system has reached a suitable stage, we may consider systems such as PlanetLab.
• PlanetLab provides time on a real network with hundreds of nodes, hosted by academic institutions.
• P2P systems are known to be an efficient way to distribute files and are becoming increasingly popular.
• Implementation should be at a suitable stage for preliminary testing in a few months.
References
• Tycho - http://acet.rdg.ac.uk/projects/tycho
• Further Information - http://acet.rdg.ac.uk/projects/vre/docs.php
Thank you for listening
Questions?