SAMSON Platform Architecture Streaming Big Data TELEFÓNICA I+D
Index
01 SAMSON Platform
02 File-system
03 Streaming MapReduce
04 Eco-system
05 Architecture
01 SAMSON Platform
Overview
• SAMSON is a distributed processing engine designed for efficient analytics over data streams
• Internal distributed file-system optimized for shared data processing
• Provides an extension of the MapReduce framework
  • More efficient MapReduce
  • Joins
  • Streaming MapReduce that allows for the incremental processing of data feeds
• Uses existing Big Data storage solutions such as Apache HDFS or MongoDB for fetching and storing data
• Built to be deployed on Ubuntu, Red Hat Linux and in virtual machines
Samson 0.6 DEB · Samson 0.6 RPM · User guide · Available for VM
Key Platform Components • File-system • Streaming MapReduce • Eco-system • Architecture
02 File-system
Distributed big-data platform for high-performance processing over unbounded streams of data
• SAMSON distributed file-system
• MapReduce for streamed processing
[Diagram: four nodes, each with its own disk (HD) and cores]
SAMSON distributed file-system
We periodically receive a set of documents. We want to compute the accumulated word-count each time we receive an update.
[Diagram, one Reduce step per input]
• First input: MapReduce 6, Redistribution 6
• Second input: MapReduce 12, Redistribution 6
• Third input: MapReduce 18, Redistribution 6
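The accumulated word-count can be sketched as a reduce step that keeps state between updates. The numbers on the slide (6, 12, 18 against a constant 6) appear to contrast reprocessing all documents with processing only the new input; that reading is an assumption. The names `WordCounts`, `countWords`, `updateState` and `demoThreeUpdates` are hypothetical and not part of the SAMSON API.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical state type (not SAMSON's): accumulated count per word.
using WordCounts = std::map<std::string, unsigned long>;

// Map-like step: split one document on whitespace and count its words.
WordCounts countWords(const std::string& document) {
    WordCounts counts;
    std::istringstream in(document);
    std::string word;
    while (in >> word) ++counts[word];
    return counts;
}

// Reduce-like step with state: only the new documents are read,
// and their counts are merged into the counts accumulated so far.
void updateState(WordCounts& state, const std::vector<std::string>& newDocuments) {
    for (const std::string& doc : newDocuments) {
        for (const auto& entry : countWords(doc)) {
            state[entry.first] += entry.second;
        }
    }
}

// Usage sketch: three updates arrive one after another.
inline void demoThreeUpdates() {
    WordCounts state;
    updateState(state, {"a b a"});  // first input
    updateState(state, {"b c"});    // second input
    updateState(state, {"a"});      // third input
    assert(state["a"] == 3);
    assert(state["b"] == 2);
    assert(state["c"] == 1);
}
```

In this sketch each update costs work proportional to the new documents only, because earlier inputs are summarised by the state instead of being read again.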
03 Streaming MapReduce
DELILAH and SAMSON
• Upload
• Download
• Run operations
[Diagram: DELILAH connected to three SAMSON nodes; label "4 Gb"]
DELILAH and SAMSON
• Run operations
• Upload new operations & data types
• Open API for 3rd party developers
Map Operation
[Diagram: the operation shown on each of three SAMSON nodes]
Reduce Operation
[Diagram: the operation shown on each of three SAMSON nodes, with the labels Input, State and Output]
04 Eco-system
Top level view…
• delilah: console-based client
  • Upload data
  • Download data
  • Run commands
  • Platform monitor
• samsonClient: C++ library to develop new plugins
  • Examples: samsonPush, samsonPop
• samsonPush / samsonPop: binaries to stream data into and out of SAMSON
• module: 3rd party C++ shared library
  • New data types
  • New operations
  • Tools provided for simplified development!!
Delilah client (delilah)
SAMSON Module example…

C++ parser:

    #include <cstdlib>   // atoll
    #include <cstring>   // strcmp
    #include <vector>

    class parser_cdrs : public samson::system::SimpleParser
    {
        std::vector<char*> words;   // Vector used to store words parsed at each line

        void parseLine( char * line , samson::KVWriter *writer )
        {
            // Split line in words
            split_in_words( line, words );

            // Expected format: USER_ID CDR X Y time
            if( words.size() < 5 )
                return;   // No content for a valid instruction

            if( strcmp( words[1] , "CDR" ) != 0 )
                return;   // Non valid format

            // Set the key (user identifier)
            key.value = atoll( words[0] );

            // Set the position (x, y, time)
            value.set( atoll( words[2] ) , atoll( words[3] ) , atoll( words[4] ) );

            // Emit the key-value on output 0
            writer->emit( 0 , &key, &value );
        }
    };

Module definition:

    Module simple_mobility
    {
        title   "Simple mobility example"
        author  "Andreu Urruela"
        version "0.1.1"
    }

    data UserArea
    {
        system.String name;
        system.UInt x;
        system.UInt y;
        system.UInt radius;
    }

    data Position
    {
        system.UInt x;
        system.UInt y;
        system.TimeUnix time;
    }

    …

    parser parser_cdrs
    {
        out system.UInt simple_mobility.Position
        helpLine "Parse input CDRs to get user-position"
    }
Demo
[Diagram: File system, Stream MapReduce, Ecosystem, Architecture]
05 Architecture
Communication Protocols…
• Goal: Platform messages
  • Why? Flexibility; backward compatibility
• Goal: Data serialization
  • Solution: Proprietary serialization format
  • Why? Maximum data compression (no field separator); best for fast sequential processing; easy job distribution
• Goal: Monitoring
  • Why? No recompilation needed; best tools for querying (XPath)
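The deck does not describe the proprietary serialization format itself. As an illustration of how values can be stored back to back with no field separator and still be read sequentially, here is a minimal sketch using base-128 variable-length integers. This technique is a stand-in chosen for the example, not SAMSON's actual format, and `writeVarint`, `readVarint` and `demoRoundTrip` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `value` as a base-128 varint: 7 payload bits per byte,
// high bit set on every byte except the last one.
void writeVarint(std::vector<std::uint8_t>& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Read one varint starting at `pos` and advance `pos` past it.
// The encoding is self-delimiting, so no separator byte is needed.
std::uint64_t readVarint(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint64_t value = 0;
    int shift = 0;
    while (true) {
        assert(pos < in.size());  // truncated input is a caller error in this sketch
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;
        shift += 7;
    }
    return value;
}

// Usage sketch: a key followed by three fields, written and read in order.
inline void demoRoundTrip() {
    std::vector<std::uint8_t> buffer;
    writeVarint(buffer, 34600123456ULL);  // key (user id, made-up value)
    writeVarint(buffer, 3);               // x
    writeVarint(buffer, 300);             // y
    writeVarint(buffer, 1350000000);      // time
    std::size_t pos = 0;
    assert(readVarint(buffer, pos) == 34600123456ULL);
    assert(readVarint(buffer, pos) == 3);
    assert(readVarint(buffer, pos) == 300);
    assert(readVarint(buffer, pos) == 1350000000);
    assert(pos == buffer.size());
}
```

Small values take a single byte, and a reader that walks the buffer front to back never needs to scan for delimiters, which matches the "fast sequential processing" goal stated on the slide.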
Worker
[Diagram] Engine library (independent development), running on the cores:
• Runtime engine and notification system
• Process Manager
• Network Manager
• Memory Manager
• Disk Manager
Engine library
• Disk Manager / Network Manager
  • Controllers to access the local disk and network connections
  • Asynchronous notifications using the engine notification system
  • Multiple threads are used if required
• Memory Manager (our retain-release model, similar to Objective-C)
  • Simple system to control memory usage
  • Used to optimize memory allocation under heavy load
• Process Manager (similar to Apple's Grand Central Dispatch library)
  • System to control independent "heavy" tasks to be executed
  • Automatic creation / destruction of threads
  • Optional "fork" mode with a shared-memory system to get the output
• Runtime Engine & Notification system
  • Inspired by the message-passing system implemented in Objective-C
  • Single loop to run all state-update operations
  • Thread protection to interact with the Disk/Network/Memory/Process Managers
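The "single loop" idea can be sketched as a queue of notifications that any manager thread may post to, while only one loop ever runs the state-update handlers. This is a minimal illustration under that assumption, not the engine library's code; `NotificationLoop`, `post`, `stop`, `run` and `demoLoop` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

class NotificationLoop {
public:
    // May be called from any thread (for example a disk or network worker).
    // The mutex is the "thread protection" around the shared queue.
    void post(std::function<void()> handler) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push(std::move(handler));
        }
        ready_.notify_one();
    }

    // Ask the loop to exit after everything queued before this call has run.
    void stop() {
        post([this] { running_ = false; });
    }

    // The single loop: handlers run here one at a time, so the state they
    // update needs no further locking.
    void run() {
        running_ = true;
        while (running_) {
            std::function<void()> handler;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return !pending_.empty(); });
                handler = std::move(pending_.front());
                pending_.pop();
            }
            handler();  // executed outside the lock
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> pending_;
    bool running_ = false;  // only touched by the loop thread
};

// Usage sketch: two notifications update the same state, then the loop stops.
inline int demoLoop() {
    NotificationLoop loop;
    int state = 0;
    loop.post([&state] { state += 1; });
    loop.post([&state] { state += 10; });
    loop.stop();
    loop.run();
    assert(state == 11);
    return state;
}
```

Keeping every state update on one loop trades some parallelism for simplicity: the heavy work stays in the managers' own threads, and only the short completion handlers are serialised.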
Worker
[Diagram]
• Block Manager: Disk-Memory balancer
• Engine library (independent development), multi-core:
  • Runtime engine and notification system
  • Process Manager
  • Network Manager
  • Memory Manager
  • Disk Manager
Block Manager
• Maintains a reference to all blocks of data contained in a Worker (on disk or in memory)
• Keeps a list sorted by when the blocks will be used (future operations)
  • Low priority blocks are flushed to disk first
  • High priority blocks are loaded from disk first
• Connected to the Disk Manager inside the engine using the EngineNotificationSystem
[Diagram: the Block Manager schedules write operations and read operations to the DiskManager]
• Important: Since the order of blocks changes continuously based on the scheduling of new processing operations, the Block Manager is made aware of the new order and is able to react accordingly.
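The ordering rule can be sketched as a list kept sorted by next use: the block needed latest is the first candidate to flush, and the block needed soonest is the first candidate to load. The `Block` fields and the `BlockManagerSketch` class are hypothetical; the real Block Manager and its interaction with the Disk Manager are not shown in the deck.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block record: a smaller nextUse means the block is needed sooner.
struct Block {
    int id;
    std::size_t nextUse;
    bool inMemory;
};

class BlockManagerSketch {
public:
    void add(const Block& block) {
        blocks_.push_back(block);
        resort();
    }

    // Called when new operations are scheduled and a block's position changes.
    void setNextUse(int id, std::size_t nextUse) {
        for (Block& b : blocks_) {
            if (b.id == id) b.nextUse = nextUse;
        }
        resort();
    }

    // Lowest priority block still in memory (needed latest), or -1 if none.
    int nextToFlush() const {
        for (auto it = blocks_.rbegin(); it != blocks_.rend(); ++it) {
            if (it->inMemory) return it->id;
        }
        return -1;
    }

    // Highest priority block still on disk (needed soonest), or -1 if none.
    int nextToLoad() const {
        for (const Block& b : blocks_) {
            if (!b.inMemory) return b.id;
        }
        return -1;
    }

private:
    void resort() {
        std::stable_sort(blocks_.begin(), blocks_.end(),
                         [](const Block& a, const Block& b) { return a.nextUse < b.nextUse; });
    }

    std::vector<Block> blocks_;  // kept sorted by nextUse, soonest first
};

// Usage sketch: rescheduling block 3 changes which block is flushed first.
inline void demoBlockOrder() {
    BlockManagerSketch manager;
    manager.add({1, 5, true});
    manager.add({2, 1, false});
    manager.add({3, 9, true});
    manager.add({4, 3, false});
    assert(manager.nextToFlush() == 3);  // in memory, needed last
    assert(manager.nextToLoad() == 2);   // on disk, needed first
    manager.setNextUse(3, 0);            // block 3 is now needed immediately
    assert(manager.nextToFlush() == 1);
    assert(manager.nextToLoad() == 2);
}
```

Re-sorting on every change is the simplest way to show the "made aware of the new order" point; a production structure would more likely use a heap or an ordered container.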
Worker
[Diagram]
• Input data
• Queues Manager: txt_cdrs, cdrs, users
• StreamOperations Manager: Operation A, Operation B, Operation C (ordered by priority)
• Block Manager: Disk-Memory balancer
• Engine library (independent development), multi-core:
  • Runtime engine and notification system
  • Process Manager
  • Network Manager
  • Memory Manager
  • Disk Manager
Queue & StreamOperations Manager
[Diagram: Queues Manager (txt_cdrs, cdrs, users); StreamOperations Manager (Operation A, Operation B, Operation C); priority]
• Both contain references to all the blocks held in queues and stream operations
• Both systems are connected to the Block Manager to inform it about the priority of blocks
• The StreamOperations Manager is connected to the ProcessManager to schedule 3rd party operations in the engine subsystem
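How queues and stream operations fit together can be sketched as named queues connected by operations that run whenever their input queue holds data. The chain txt_cdrs → Operation A → cdrs → Operation B → users is an assumed reading of the diagram, and `StreamOperation`, `StreamSketch` and `demoStream` are hypothetical names; block priorities and the ProcessManager are left out.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Queue = std::deque<std::string>;

// Hypothetical stream operation: reads one queue, writes another.
struct StreamOperation {
    std::string input;
    std::string output;
    std::function<std::string(const std::string&)> transform;
};

class StreamSketch {
public:
    void addOperation(StreamOperation op) { operations_.push_back(std::move(op)); }

    void push(const std::string& queue, const std::string& item) {
        queues_[queue].push_back(item);
    }

    // Run operations until no input queue has data left.
    // Assumes the operations form no cycle, otherwise this would not terminate.
    void runPending() {
        bool progressed = true;
        while (progressed) {
            progressed = false;
            for (const StreamOperation& op : operations_) {
                Queue& in = queues_[op.input];
                while (!in.empty()) {
                    const std::string item = in.front();
                    in.pop_front();
                    queues_[op.output].push_back(op.transform(item));
                    progressed = true;
                }
            }
        }
    }

    const Queue& queue(const std::string& name) { return queues_[name]; }

private:
    std::map<std::string, Queue> queues_;
    std::vector<StreamOperation> operations_;
};

// Usage sketch: one text line flows through two chained operations.
inline std::string demoStream() {
    StreamSketch stream;
    stream.addOperation({"txt_cdrs", "cdrs",
                         [](const std::string& line) { return "parsed(" + line + ")"; }});
    stream.addOperation({"cdrs", "users",
                         [](const std::string& cdr) { return "user(" + cdr + ")"; }});
    stream.push("txt_cdrs", "line1");
    stream.runPending();
    assert(stream.queue("txt_cdrs").empty());
    assert(stream.queue("cdrs").empty());
    return stream.queue("users").front();
}
```

Data pushed into the first queue ends up in the last one without the caller naming the intermediate steps, which is the behaviour the queue and operation wiring is meant to provide.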