Hyperion: High Volume Stream Archival • Divya Muthukumaran
Area • Network Monitoring • Identify problems due to overloaded and/or crashed servers, network connections or other devices • Example: To determine the status of a webserver, monitoring software may periodically send an HTTP request to fetch a page
Live Monitoring • Packets are examined in real time • Compute and continually update traffic statistics • Discard the captured packet headers once examined • Why store packet headers at all? • Example: network forensics • Go back and examine the root cause of a problem • E.g., see how an intruder gained entry or how a worm infection happened
What is the need for such a system? • Querying and examining live data • Data archival • Capture the data at wire speed, then index and store it • Efficiently support retrieval and processing of archived data • Specifically designed to handle the needs of high-volume stream archival
Why not traditional databases? • Some statistics • A single gigabit link can generate over 100,000 packets per second and tens of MB per second of archival data • A monitor may record from multiple links
Design Principles • Support queries, not reads • Implies the need to maintain indexes • Writes are sequential and immutable • Archive locally, summarize globally • Scalability vs. the need to avoid flooding • Scalability favors local archiving and indexing to avoid network writes • The need to answer distributed queries favors sharing information across nodes
Hyperion: Three Key Components • Stream File System • High-volume archiving and querying • Multi-level index structure • High update rates + reasonable lookup performance • Distributed index layer • Distributes a summary of local indices to enable distributed querying
Design choices for the Hyperion Storage System • Storage of multiple high-speed traffic streams without loss • Support for concurrent read activity without loss of write performance • Re-use of storage in a buffer-like fashion
Stream File System • Stores streams as opposed to files • Characteristics • Recycled: when storage is full, new data replaces old data • In a general-purpose file system, new data is lost and old data is retained • Immutable • Record-oriented: data is written in fixed- or variable-length records
Can we use a general-purpose file system? • Need to map streams <=> files
Stream FS Organization • Log-structured FS • What problem? • Cleaning/garbage collection • StreamFS solves the cleaning problem • Guarantee: storage guarantee for each stream • Small segment size (1 or ½ MB) • Check if the next segment is surplus. If yes, overwrite it; otherwise skip it. • Advantages? • Storage reservation • Best-effort use of remaining storage
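The surplus check on this slide can be sketched roughly as follows. This is a minimal illustration, not StreamFS code: the class `SegmentLog`, its fields, and the rule that a segment counts as surplus when its owning stream holds more segments than its guarantee are assumptions made for the example. The sketch also assumes a stream may always overwrite one of its own segments.

```python
from collections import Counter


class SegmentLog:
    """Hypothetical circular log of fixed-size segments with per-stream guarantees."""

    def __init__(self, num_segments, guarantees):
        self.owner = [None] * num_segments   # stream that owns each segment
        self.guarantees = dict(guarantees)   # stream -> reserved segment count
        self.held = Counter()                # stream -> segments currently held
        self.head = 0                        # next segment the writer examines

    def _overwritable(self, idx, stream):
        owner = self.owner[idx]
        if owner is None or owner == stream:
            return True                      # free, or the writer's own data
        # Assumed definition of "surplus": owner holds more than its guarantee.
        return self.held[owner] > self.guarantees.get(owner, 0)

    def write_segment(self, stream):
        """Advance the write pointer, skipping guaranteed segments."""
        n = len(self.owner)
        for _ in range(n):
            idx = self.head
            self.head = (self.head + 1) % n
            if self._overwritable(idx, stream):
                old = self.owner[idx]
                if old is not None:
                    self.held[old] -= 1
                self.owner[idx] = stream
                self.held[stream] += 1
                return idx
        raise RuntimeError("no overwritable segment found")


log = SegmentLog(4, {"a": 2, "b": 1})
for _ in range(4):
    log.write_segment("a")       # "a" fills the disk, using best-effort space
for _ in range(3):
    log.write_segment("b")       # "b" reclaims only the surplus of "a"
print(log.owner, dict(log.held))
```

Because the writer only skips or overwrites whole segments, no separate cleaning pass is needed, and stream "a" never drops below its reservation.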
Reads • First get the index • Use the index to get the data • Persistent handles • Returned from each write operation • Passed to the read operation to retrieve data • What does the handle contain? • Disk location, approximate length • Allows data to be retrieved directly
Handle issues • Validate the handle. How? • Self-certifying record header • ID of the stream • Permissions of the stream • Record length • Hash (used for validating the handle)
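One way to picture handle validation is the sketch below. The header layout, the function names `write_record` and `read_record`, and the choice to hash the header fields together with the disk offset are all assumptions for illustration; the slides only state that the header carries a stream ID, permissions, a record length, and a hash.

```python
import hashlib
import struct

# Assumed header layout: stream id, permissions, record length, 8-byte hash.
HEADER = struct.Struct("<IHI8s")


def _digest(stream_id, perms, length, offset):
    # Assumption: the hash binds the header fields to the record's disk offset.
    raw = struct.pack("<IHIQ", stream_id, perms, length, offset)
    return hashlib.sha256(raw).digest()[:8]


def write_record(disk, offset, stream_id, perms, payload):
    """Write header + payload; return a handle (disk location, approximate length)."""
    digest = _digest(stream_id, perms, len(payload), offset)
    record = HEADER.pack(stream_id, perms, len(payload), digest) + payload
    disk[offset:offset + len(record)] = record
    return (offset, len(record))


def read_record(disk, handle):
    """Return (stream_id, payload) if the handle points at a valid record, else None."""
    offset, _approx_len = handle
    if offset < 0 or offset + HEADER.size > len(disk):
        return None
    stream_id, perms, length, digest = HEADER.unpack_from(disk, offset)
    if digest != _digest(stream_id, perms, length, offset):
        return None                      # stale or forged handle
    start = offset + HEADER.size
    if start + length > len(disk):
        return None
    return stream_id, bytes(disk[start:start + length])


disk = bytearray(64)                     # stand-in for the storage device
handle = write_record(disk, 0, 7, 0o644, b"pkt-header")
print(read_record(disk, handle))         # valid handle
print(read_record(disk, (5, handle[1]))) # handle pointing into the middle of a record
```

The point of the self-certifying header is that the read path can go straight to the disk location in the handle and still detect a handle that no longer refers to a record boundary.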
Stream FS Organization • Record • Variable length • On-disk record + header • Block • Fixed length • Multiple records of the same stream • Block Map • Every nth block • (stream ID + in-stream sequence number for each of the preceding n-1 blocks) • Used for easy write allocation
Indexing • Uses signature-based indices • One signature per segment • Can check whether a record with key k is present in the segment • Does not tell you where in the segment the record is
Multi-Level Indices • Uses a Bloom filter • Hash(key) -> b bits • Of the b bits, k are set to 1 • Hs (signature) = H(key1) | H(key2) | … | H(keyn), the bitwise OR of the key hashes • How to check for the presence of a record? • Compute the hash of its key kr, H(kr) • If any bit set in H(kr) is not set in Hs, the record is not present • Otherwise it may be present: false positives are possible
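The signature test above can be sketched in a few lines. The parameters `B_BITS` and `K_BITS`, the use of SHA-256 to pick bit positions, and the function names are assumptions for the example, not values taken from Hyperion.

```python
import hashlib

B_BITS = 256   # assumed signature width b
K_BITS = 4     # assumed number of bits set per key (fewer if positions collide)


def key_hash(key: bytes) -> int:
    """Map a key to a b-bit mask with at most k bits set."""
    mask = 0
    for i in range(K_BITS):
        digest = hashlib.sha256(bytes([i]) + key).digest()
        position = int.from_bytes(digest[:4], "big") % B_BITS
        mask |= 1 << position
    return mask


def segment_signature(keys) -> int:
    """Hs: bitwise OR of the hashes of every key stored in the segment."""
    signature = 0
    for key in keys:
        signature |= key_hash(key)
    return signature


def may_contain(signature: int, key: bytes) -> bool:
    """False means definitely absent; True means present or a false positive."""
    return key_hash(key) & ~signature == 0


keys = [b"10.0.0.1", b"10.0.0.2", b"192.168.1.9"]
sig = segment_signature(keys)
print(all(may_contain(sig, k) for k in keys))   # stored keys always match
print(may_contain(segment_signature([]), b"10.0.0.1"))
```

A signature answers only "could this segment hold the key?", so a hit still requires scanning the segment to locate the record, as the indexing slide notes.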
Distributed Index • How to handle distributed queries without flooding? • Maintain a distributed index • Integrated view of all nodes • A coarse-grained summary of the data at each node is needed • Can use the top-level index in Hyperion • One index node per time interval • All nodes send their top-level indices to this node • Temporally distributed index
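A rough sketch of the temporally distributed index follows. The slides do not say how the index node for an interval is chosen or how long an interval is, so the round-robin rule in `index_node_for`, the value of `INTERVAL_SECONDS`, the `Cluster` class, and the use of plain Python sets as stand-ins for top-level index summaries are all assumptions.

```python
INTERVAL_SECONDS = 60   # assumed interval length


def interval_of(timestamp):
    return int(timestamp) // INTERVAL_SECONDS


def index_node_for(interval, nodes):
    # Assumption: intervals are assigned to nodes round-robin.
    return nodes[interval % len(nodes)]


class Cluster:
    """Hypothetical in-memory model of monitors sharing top-level summaries."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # index node -> interval -> {monitor: summary}
        self.store = {node: {} for node in self.nodes}

    def publish(self, monitor, timestamp, summary):
        """A monitor sends its summary to the single index node for that interval."""
        interval = interval_of(timestamp)
        node = index_node_for(interval, self.nodes)
        self.store[node].setdefault(interval, {})[monitor] = summary

    def candidates(self, timestamp, matches):
        """Ask one index node which monitors may hold matching data."""
        interval = interval_of(timestamp)
        node = index_node_for(interval, self.nodes)
        summaries = self.store[node].get(interval, {})
        return sorted(m for m, s in summaries.items() if matches(s))


cluster = Cluster(["n0", "n1", "n2"])
cluster.publish("n0", 10, {"10.0.0.1"})
cluster.publish("n1", 20, {"10.0.0.2"})
cluster.publish("n2", 70, {"10.0.0.1"})
print(cluster.candidates(15, lambda s: "10.0.0.1" in s))
print(cluster.candidates(75, lambda s: "10.0.0.1" in s))
```

A query for a given time contacts one index node and then only the monitors it names, which is how the design avoids flooding every node.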