Scalable Performance of System S for Extract-Transform-Load Processing. Toyotaro Suzumura, Toshihiro Yasue and Tamiya Onodera, IBM Research - Tokyo
Outline • Background and Motivation • System S and its suitability for ETL • Performance Evaluation of System S as a Distributed ETL Platform • Performance Optimization • Related Work and Conclusions
What is ETL? ETL = Extraction + Transformation + Loading • Extraction: handles the extraction of data from different distributed data sources • Transformation: cleanses and customizes the data for the business needs and rules while transforming the data to match the data warehouse schema • Loading: loads the data into the data warehouse [Figure: Data Sources → ETL (Extract, Transform, Load) → Data Warehouse]
Data Explosion in ETL • Data Explosion • The amount of data stored in a typical contemporary data warehouse may double every 12 to 18 months • Data Source Examples: • Logs for Regulatory Compliance (e.g. SOX) • POS (Point-of-sale) Transactions of Retail Stores (e.g. Wal-Mart) • Web Data (e.g. internet auction sites such as eBay) • CDR (Call Detail Records) for Telecom companies to analyze customer behavior • Trading Data
Near-Real Time ETL • Given the data explosion problem, there is a strong need for ETL processing to be as fast as possible so that business analysts can quickly grasp the trends of customer activities
Our Motivation: • Assess the applicability of System S, a data stream processing system, to ETL processing, considering both qualitative and quantitative ETL constraints • Thoroughly evaluate the performance of System S as a scalable and distributed ETL platform to achieve “Near-Real Time ETL” and solve the data explosion problem in the ETL domain
Outline • Background and Motivation • System S and its suitability for ETL • Performance Evaluation of System S as a Distributed ETL Platform • Performance Optimization • Related Work and Conclusions
Stream Computing and System S • System S: Stream Computing Middleware developed by IBM Research • System S is now productized as “InfoSphere Streams”. • Traditional Computing: fact finding with data-at-rest • Stream Computing: insights from data in motion
InfoSphere Streams Programming Model [Figure labels: Source Adapters, Operator Repository, Sink Adapters, Application Programming (SPADE), Platform-optimized compilation]
SPADE: Advantages of Stream Processing as Parallelization Model • A stream-centric programming language dedicated to data stream processing • Streams as first-class entities • Explicit task and data parallelism • Intuitive way to exploit multi-core and multi-node systems • Operator and data source profiling for better resource management • Reuse of operators across stored and live data • Support for user-customized operators (UDOP)
A simple SPADE example (Source → Aggregate → Functor → Sink)
[Application]
SourceSink trace
[Nodepool]
Nodepool np := ("host1", "host2", "host3")
[Program]
// virtual schema declaration
vstream Sensor (id : id_t, location : Double, light : Float, temperature : Float, timestamp : timestamp_t)
// a source stream is generated by a Source operator; in this case tuples come from an input file
stream SenSource ( schemaof(Sensor) )
  := Source( ) [ "file:///SenSource.dat" ] {}
  -> node(np, 0)
// this intermediate stream is produced by an Aggregate operator, using the SenSource stream as input
stream SenAggregator ( schemaof(Sensor) )
  := Aggregate( SenSource<count(100),count(1)> ) [ id . location ]
     { Any(id), Any(location), Max(light), Min(temperature), Avg(timestamp) }
  -> node(np, 1)
// this intermediate stream is produced by a Functor operator
stream SenFunctor ( id: Integer, location: Double, message: String )
  := Functor( SenAggregator ) [ log(temperature,2.0)>6.0 ]
     { id, location, "Node "+toString(id)+ " at location "+toString(location) }
  -> node(np, 2)
// result management is done by a Sink operator; in this case produced tuples are sent to a socket
Null := Sink( SenFunctor ) [ "udp://192.168.0.144:5500/" ] {}
  -> node(np, 0)
InfoSphere Streams Runtime • The optimizing scheduler assigns operators to processing nodes, and continually manages resource allocation [Figure: Processing Element Containers running over the Streams Data Fabric transport on heterogeneous hardware (x86 blades, x86 box, FPGA blade, Cell blade)]
System S as a Distributed ETL Platform • Can we use System S as a distributed ETL processing platform?
Outline • Background and Motivation • System S and its suitability for ETL • Performance Evaluation of System S as a Distributed ETL Platform • Performance Optimization • Related Work and Conclusions
Target Application for Evaluation • Inventory processing for multiple warehouses that includes most of the representative ETL primitives (Sort, Join, and Aggregate)
SPADE Program for Distributed Processing [Figure: data-flow diagram. On the data distribution host, three Source operators read WarehouseItems 1 to 3 (Warehouse_20090901_1.txt to Warehouse_20090901_3.txt) into a bundle, and a Functor and Split (Key=item, e.g. 0100-0300-00, 0100-0900-00) route tuples to compute hosts (1) to (N). Each compute host runs Sort, Join, Sort, ODBCAppend, Join, Sort, Aggregate, Functor, UDOP(SplitDuplicatedTuples), and Sink operators. A separate Source and Sort feed the Item Catalog to the Join operators. Other labels: 6 million, Around 60]
SPADE Program (1/2)
[Nodepools]
nodepool np[] := ("s72x336-00", "s72x336-02", "s72x336-03", "s72x336-04")
[Program]
vstream Warehouse1Schema(id: Integer, item : String, Onhand : String, allocated : String, hardAllocated : String, fileNameColumn : String)
vstream Warehouse2OutputSchema(id: Integer, item : String, Onhand : String, allocated : String, hardAllocated : String, fileNameColumn : String, description: StringList)
vstream ItemSchema(item: String, description: StringList)
##===================================================
## warehouse 1
##===================================================
bundle warehouse1Bundle := ()
for_begin @i 1 to 3
stream Warehouse1Stream@i(schemaFor(Warehouse1Schema))
  := Source()["file:///SOURCEFILE", nodelays, csvformat]{}
  -> node(np, 0), partition["Sources"]
warehouse1Bundle += Warehouse1Stream@i
for_end
## stream for computing subindex
stream StreamWithSubindex(schemaFor(Warehouse1Schema), subIndex: Integer)
  := Functor(warehouse1Bundle[:])[] { subIndex := (toInteger(strSubstring(item, 6,2)) / (60 / COMPUTE_NODE_NUM))-2 }
  -> node(np, 0), partition["Sources"]
for_begin @i 1 to COMPUTE_NODE_NUM
stream ItemStream@i(schemaFor(Warehouse1Schema), subIndex:Integer)
for_end
  := Split(StreamWithSubindex) [ subIndex ]{}
  -> node(np, 0), partition["Sources"]
for_begin @i 1 to COMPUTE_NODE_NUM
stream Warehouse1Sort@i(schemaFor(Warehouse1Schema))
  := Sort(ItemStream@i <count(SOURCE_COUNT@i)>)[item, asc]{}
  -> node(np, @i-1), partition["CMP%@i"]
stream Warehouse1Filter@i(schemaFor(Warehouse1Schema))
  := Functor(Warehouse1Sort@i)[ Onhand="0001.000000" ] {}
  -> node(np, @i-1), partition["CMP%@i"]
Nil := Sink(Warehouse1Filter@i)["file:///WAREHOUSE1_OUTPUTFILE@i", csvFormat, noDelays]{}
  -> node(np, @i-1), partition["CMP%@i"]
for_end
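The subIndex expression in the Functor on this slide decides which compute node receives each tuple. The following Python sketch mirrors that expression under two assumptions that the slides do not state: that strSubstring(item, 6, 2) uses a 0-based start offset, and that "/" on integers truncates. The function name sub_index and the second and third item codes are hypothetical, chosen only for illustration.

```python
def sub_index(item: str, compute_node_num: int, offset: int = 2) -> int:
    """Mirror the slide's subIndex expression for one item code."""
    # Assumption: strSubstring(item, 6, 2) is 0-based, i.e. the slice item[6:8].
    key = int(item[6:8])
    # 60 / COMPUTE_NODE_NUM on the slide: width of one bucket of key values.
    bucket_width = 60 // compute_node_num
    # The slide subtracts 2 so that the first bucket maps to index 0.
    return key // bucket_width - offset


# "0100-0300-00" appears on the slides; the other two codes are hypothetical.
items = ["0100-0300-00", "0100-0450-00", "0100-0890-00"]
buckets = [sub_index(item, compute_node_num=4) for item in items]
print(buckets)  # [0, 1, 3]
```

With these assumptions the constant 2 only lines up with index 0 when COMPUTE_NODE_NUM is 4, which matches the four hosts in the node pool on this slide.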
SPADE Program (2/2)
##====================================================
## warehouse 2
##====================================================
stream ItemsSource(schemaFor(ItemSchema))
  := Source()["file:///ITEMS_FILE", nodelays, csvformat]{}
  -> node(np, 1), partition["ITEMCATALOG"]
stream SortedItems(schemaFor(ItemSchema))
  := Sort(ItemsSource <count(ITEM_COUNT)>)[item, asc]{}
  -> node(np, 1), partition["ITEMCATALOG"]
for_begin @i 1 to COMPUTE_NODE_NUM
stream JoinedItem@i(schemaFor(Warehouse2OutputSchema))
  := Join(Warehouse1Sort@i <count(SOURCE_COUNT@i)>; SortedItems <count(ITEM_COUNT)>) [ LeftOuterJoin, {item} = {item} ]{}
  -> node(np, @i-1), partition["CMP%@i"]
##=================================================
## warehouse 3
##=================================================
for_begin @i 1 to COMPUTE_NODE_NUM
stream SortedItems@i(schemaFor(Warehouse2OutputSchema))
  := Sort(JoinedItem@i <count(JOIN_COUNT@i)>)[id, asc]{}
  -> node(np, @i-1), partition["CMP%@i"]
stream AggregatedItems@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Aggregate(SortedItems@i <count(JOIN_COUNT@i)>) [item . id]
     { Any(id), Any(item), Any(Onhand), Any(allocated), Any(hardAllocated), Any(fileNameColumn), Any(description), Cnt() }
  -> node(np, @i-1), partition["CMP%@i"]
stream JoinedItem2@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Join(SortedItems@i <count(JOIN_COUNT@i)>; AggregatedItems@i <count(AGGREGATED_ITEM@i)>) [ LeftOuterJoin, {id, item} = {id, item} ] {}
  -> node(np, @i-1), partition["CMP%@i"]
stream SortJoinedItem@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Sort(JoinedItem2@i <count(JOIN_COUNT@i)>)[id(asc).fileNameColumn(asc)]{}
  -> node(np, @i-1), partition["CMP%@i"]
stream DuplicatedItems@i(schemaFor(Warehouse2OutputSchema), count: Integer)
stream UniqueItems@i(schemaFor(Warehouse2OutputSchema), count: Integer)
  := Udop(SortJoinedItem@i)["FilterDuplicatedItems"]{}
  -> node(np, @i-1), partition["CMP%@i"]
Nil := Sink(DuplicatedItems@i)["file:///DUPLICATED_FILE@i", csvFormat, noDelays]{}
  -> node(np, @i-1), partition["CMP%@i"]
stream FilterStream@i(item: String, recorded_indicator: Integer)
  := Functor(UniqueItems@i)[] { item, 1 }
  -> node(np, @i-1), partition["CMP@i"]
stream AggregatedItems2@i(LoadNum: Integer, Item_Load_Count: Integer)
  := Aggregate(FilterStream@i <count(UNIQUE_ITEM@i)>) [ recorded_indicator ] { Any(recorded_indicator), Cnt() }
  -> node(np, @i-1), partition["CMP@i"]
stream AddTimeStamp@i(LoadNum: Integer, Item_Load_Count: Integer, LoadTimeStamp: Long)
  := Functor(AggregatedItems2@i)[] { LoadNum, Item_Load_Count, timeStampMicroseconds() }
  -> node(np, @i-1), partition["CMP@i"]
Nil := Sink(AddTimeStamp@i)["file:///final_result.out", csvFormat, noDelays]{}
  -> node(np, @i-1), partition["CMP@i"]
for_end
Qualitative Evaluation of SPADE • Implementation • Lines of SPADE: 76 lines • # of Operators: 19 (1 UDOP Operator) • Evaluation • With the built-in operators of SPADE, we could develop the given ETL scenario in a highly productive manner • The functionality of System S for running a SPADE program on distributed nodes was a great help
Performance Evaluation • Total Nodes: 14 nodes and 56 CPU cores • Spec. for Each Node: Intel Xeon X5365 3.0 GHz (4 physical cores with HT), 16GB memory, RHEL 5.3 64bit (Linux Kernel 2.6.18-164.el5) • Network: Infiniband Network (DDR 20Gbps) or 1Gbps Network • Software: InfoSphere Streams (beta version) • Data: 9 Million Records (1 record is around 100 bytes) [Figure: layout of the 14 nodes (4 cores each, host names e0101b0${n}e1) into data distribution, item sorting, and compute hosts (10 nodes, 40 cores)]
Node Assignment [Figure: assignment of the 14 nodes (4 cores each, host names e0101b0${n}e1) to data distribution, item sorting, and compute hosts (10 nodes, 40 cores), with compute operators 1 to 32 placed on the compute hosts in order; the remaining nodes are marked "Not used"]
Throughput for Processing 9 Million Records • Maximum throughput: around 180000 records per second (144 Mbps) [Figure: throughput and speed-up] Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: A
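The 144 Mbps figure follows from the record rate and the record size of around 100 bytes given on the Performance Evaluation slide. A minimal Python sketch of the conversion, assuming decimal megabits (10^6 bits); the function name is illustrative only:

```python
def throughput_mbps(records_per_sec: int, bytes_per_record: int = 100) -> float:
    """Convert a record rate into megabits per second (1 Mbps = 10**6 bits/s)."""
    bits_per_sec = records_per_sec * bytes_per_record * 8
    return bits_per_sec / 1_000_000


baseline = throughput_mbps(180_000)
print(baseline)  # 144.0
```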
Analysis (I-a): Breakdown of the Total Time • Data distribution is dominant [Figure: breakdown of the total time into data distribution and computation] Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: A
Analysis (I-b): Speed-up ratio against 4 cores when focusing only on the “computation part” • The computation part scales super-linearly [Figure: speed-up ratio of the computation part] Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: A
CPU Utilization at Compute Hosts [Figure: CPU utilization over time, with regions labeled Computation, Idle, Computation]
Outline • Background and Motivation • System S and its suitability for ETL • Performance Evaluation of System S as a Distributed ETL Platform • Performance Optimization • Related Work and Conclusions
Performance Optimization • The previous experiment shows that most of the time is spent in data distribution or I/O processing • For performance optimization, we implemented a SPADE program in which all the nodes participate in the data distribution, while each Source operator is responsible only for its own chunk of the data records, obtained by dividing the records by the number of Source operators
Performance Optimization • We modified the SPADE data-flow program so that multiple Source operators participate in the data distribution • Each data distribution node reads a chunk of the whole data [Figure: Original SPADE Program (a single data distribution host with Source operators, a bundle, and one Split feeding compute hosts (1) to (N)) versus Optimized SPADE Program (several data distribution hosts, each with its own Source and Split over WarehouseItems 1, feeding compute hosts (1) to (N)); each compute host runs Sort, Join, Sort, ODBCAppend, Join, Sort, Aggregate, Functor, UDOP(SplitDuplicatedTuples), and Sink]
Node Assignment • All 14 nodes participate in the data distribution • Each Source operator reads an equal share of the data: the total number of records (9M records) divided by the number of source operators • The node assignment for compute nodes is the same as in Experiment I [Figure: 14 nodes (4 cores each, host names e0101b0${n}e1), each with a local disk, acting as data distribution and compute hosts; source operators 1 to 24 are placed on the nodes in order]
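The per-operator share described above can be sketched in Python as follows. The slides only say that the 9M records are divided by the number of source operators; how a remainder is handled is not stated, so spreading it over the first few operators is an assumption of this sketch, as is the function name.

```python
def chunk_sizes(total_records: int, num_sources: int) -> list[int]:
    """Split total_records into num_sources chunks that differ by at most one."""
    base, extra = divmod(total_records, num_sources)
    # Assumption: the first `extra` operators each take one additional record.
    return [base + 1 if i < extra else base for i in range(num_sources)]


sizes_3 = chunk_sizes(9_000_000, 3)    # three source operators
sizes_14 = chunk_sizes(9_000_000, 14)  # one source operator per node
print(sizes_3)
print(sizes_14[:3], sum(sizes_14))
```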
Elapsed time with varying number of compute nodes and source operators [Figure: elapsed time versus # of source operators] Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: C
Throughput: Over 800000 records/sec Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: C
Scalability: Achieved Super-Linear Speed-up with Data Distribution Optimization Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: C
Outline • Background and Motivation • System S and its suitability for ETL • Performance Evaluation of System S as a Distributed ETL Platform • Performance Optimization • Related Work and Conclusions
Related Work • Near Real-Time ETL • Panos et al. reviewed the state of the art of both conventional and near real-time ETL [2008 Springer] • ETL Benchmarking • Wyatt et al. identify common characteristics of ETL workflows in an effort to propose a unified evaluation method for ETL [2009 Springer Lecture Notes] • TPC-ETL: formed in 2008 and still under development by the TPC subcommittee
Conclusions and Future Work • Conclusions • Demonstrated the software productivity and scalable performance of System S in the ETL domain • After the data distribution optimization, we achieved super-linear scalability, processing around 800000 records per second on 14 nodes • Future Work • Comparison with existing ETL tools/systems and various application scenarios (TPC-ETL?) • Automatic Data Distribution Optimization
Future Direction: Automatic Data Distribution Optimization • We were able to identify the appropriate number of source operators through a series of long-running experiments. • However, it is not wise for a distributed system such as System S to force users/developers to find the appropriate number of source nodes experimentally. • We will need an automatic optimization mechanism that maximizes the throughput by finding the best number of source nodes automatically and transparently to the user. [Figure: a node pool (1, 2, 3, ..., P) hosting source operators S1 to Sn, which read data chunks d1 to dn, and compute operators C1 to Cm, with links labeled n(S1, C1) to n(Sn, C3)]
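The simplest form of such a mechanism is to try a set of candidate source counts and keep the one with the highest measured throughput. The Python sketch below illustrates only that idea; it is not the optimizer proposed in the slides. Both best_source_count and the measure callback are hypothetical names, and toy_measure is an artificial stand-in, not measured data.

```python
from typing import Callable, Iterable


def best_source_count(candidates: Iterable[int],
                      measure: Callable[[int], float]) -> int:
    """Return the candidate number of source operators with the highest throughput."""
    best_n, best_throughput = None, float("-inf")
    for n in candidates:
        throughput = measure(n)  # e.g. run the job with n sources and time it
        if throughput > best_throughput:
            best_n, best_throughput = n, throughput
    if best_n is None:
        raise ValueError("no candidate source counts were given")
    return best_n


def toy_measure(n: int) -> float:
    """Artificial throughput curve with a single peak, for illustration only."""
    return -float((n - 9) ** 2)


chosen = best_source_count(range(3, 46, 3), toy_measure)
print(chosen)
```

An exhaustive sweep like this repeats the long-running experiments automatically; a practical optimizer would more likely rely on the operator statistics that the SPADE compiler already collects.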
Questions? Thank You
Towards Adaptive Optimization • The current SPADE compiler has a compile-time optimizer that uses statistical data such as tuple/byte rates and CPU ratio for each operator. • We would like to let users/developers write a SPADE program in the original (left-hand) form, without considering data partitioning and data distribution. • By extending the current optimizer, the system could automatically convert the left-hand program into the right-hand program, which achieves the maximum data distribution. [Figure: Original SPADE Program (left): a single Source Operator S reads data D and feeds compute operators C1 to Cm. Optimized SPADE Program (right): source operators S1 to Sn read chunks d1 to dn and feed C1 to Cm. A Data Distribution Optimizer converts the left program into the right one]
Executive Summary • Motivation: • Evaluate System S as an ETL platform in a large experimental environment, the Watson cluster • Understand the performance characteristics, such as scalability and performance bottlenecks, on such a large testbed • Findings: • A series of experiments showed that the data distribution cost is dominant in ETL processing • The optimized version shows that changing the number of data feed (source) operators dramatically increases the throughput and yields higher speed-ups than the other configurations • Using the Infiniband network is critical for an ETL workload like this one, which includes a barrier before aggregating all the data for the sorting operation; we achieved almost double the performance of the 1Gbps network [Figures: elapsed time for baseline; optimized version vs. others; throughput comparison between the 1Gbps network and the Infiniband network]
Node Assignment (B) for Experiment II • The experimental environment comprises 3 source nodes for data distribution, 1 node for item sorting, and 10 nodes for computation. • Each compute node has 4 cores, and we manually allocate each operator with the following scheduling policy: each operator is allocated to the adjacent node in order. • The diagram shows the case in which 32 operators are used for the computation. [Figure: 14 nodes (4 cores each, host names e0101b0${n}e1) assigned to data distribution, item sorting, and compute hosts (10 nodes, 40 cores), with compute operators 1 to 32 placed in order]
SPADE Program with Data Distribution Optimization • Since 3 nodes participate in the data distribution, the number of communication paths is at most 120 (3 x 40). [Figure: three source hosts (c0101b01, c0101b02, c0101b03), each running Source (WarehouseItems 2, Warehouse_20090901_2.txt), Split, Functor, and Sink, feed the compute hosts (c0101b05, c0101b06, c0101b07, ..., s72x336-14); each compute operator chain consists of Sort, Join, Sort, ODBCAppend, Join, Sort, Aggregate, Functor, UDOP(SplitDuplicatedTuples), and Sink]
Experiment I (original SPADE program)
bundle warehouse1Bundle := ()
for_begin @i 1 to 3
stream Warehouse1Stream@i(schemaFor(Warehouse1Schema))
  := Source()["file:///SOURCEFILE", nodelays, csvformat]{}
  -> node(np, 0), partition["Sources"]
warehouse1Bundle += Warehouse1Stream@i
for_end
New SPADE Program
for_begin @j 1 to COMPUTE_NODE_NUM
bundle warehouse1Bundle@j := ()
for_end
#define SOURCE_NODE_NUM 3
for_begin @i 0 to SOURCE_NODE_NUM-1
stream Warehouse1Stream@i(schemaFor(Warehouse1Schema))
  := Source()["file:///SOURCEFILE", nodelays, csvformat]{}
  -> node(SourcePool, @i), partition["Sources@i"]
stream StreamWithSubindex@i(schemaFor(Warehouse1Schema), subIndex: Integer)
  := Functor(Warehouse1Stream@i)[] { subIndex := (toInteger(strSubstring(item, 6,2)) / (60 / COMPUTE_NODE_NUM)) }
  -> node(SourcePool, @i), partition["Sources@i"]
for_begin @j 1 to COMPUTE_NODE_NUM
stream ItemStream@i@j(schemaFor(Warehouse1Schema), subIndex:Integer)
for_end
  := Split(StreamWithSubindex@i) [ subIndex ]{}
  -> node(SourcePool, @i), partition["Sources@i"]
for_begin @j 1 to COMPUTE_NODE_NUM
warehouse1Bundle@j += ItemStream@i@j
for_end
for_end
for_begin @j 1 to COMPUTE_NODE_NUM
stream StreamForWarehouse1Sort@j(schemaFor(Warehouse1Schema))
  := Functor(warehouse1Bundle@j[:])[]{}
  -> node(np, @j-1), partition["CMP%@j"]
stream Warehouse1Sort@j(schemaFor(Warehouse1Schema))
  := Sort(StreamForWarehouse1Sort@j <count(SOURCE_COUNT@j)>)[item, asc]{}
  -> node(np, @j-1), partition["CMP%@j"]
stream Warehouse1Filter@j(schemaFor(Warehouse1Schema))
  := Functor(Warehouse1Sort@j)[ Onhand="0001.000000" ] {}
  -> node(np, @j-1), partition["CMP%@j"]
for_end
Note: the parts after warehouse 1 (warehouse 2, 3, and 4) are omitted from this chart, but we executed them in the experiment.
Node Assignment (C) for Experiment III • All 14 nodes participate in the data distribution, and each Source operator is assigned in the manner shown in the diagram. For instance, when 24 Source operators are used, they are allocated to the nodes in order, and once 14 Source operators have been allocated to the 14 nodes, the next Source operator is allocated to the first node again. • Each Source operator reads an equal share of the data: the total number of records (9M records) divided by the number of source operators. This division is done in advance using the Linux tool “split”. • The node assignment for compute nodes is the same as in Experiment I [Figure: 14 nodes (4 cores each, host names e0101b0${n}e1), each with a local disk; with 24 Source operators, nodes 1 to 10 host two operators each (1 and 15, 2 and 16, ..., 10 and 24) and nodes 11 to 14 host one each]
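The placement rule for Source operators described on this slide is a round-robin over the 14 nodes. A minimal Python sketch, using 1-based operator and node numbers as in the diagram; the function name is illustrative only:

```python
def assign_sources(num_sources: int, num_nodes: int = 14) -> dict[int, list[int]]:
    """Place Source operators 1..num_sources on nodes 1..num_nodes in round-robin order."""
    placement: dict[int, list[int]] = {node: [] for node in range(1, num_nodes + 1)}
    for op in range(1, num_sources + 1):
        node = (op - 1) % num_nodes + 1  # wrap around to node 1 after the last node
        placement[node].append(op)
    return placement


placement = assign_sources(24)
print(placement[1], placement[10], placement[11])  # [1, 15] [10, 24] [11]
```

For 24 Source operators this gives two operators on nodes 1 to 10 and one on nodes 11 to 14, matching the example in the slide.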
Performance Result for Experiment II and Comparison with Experiment I • When 3 nodes participate in the data distribution, the throughput almost doubles compared with the result of Experiment I Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: B
Analysis (II-a): Optimization by changing the number of source operators • Motivation for this experiment • In the previous result, the throughput saturates at around 16 cores because the data feeding rate cannot keep up with the computation • Experimental Environment • We changed the number of source operators while keeping the total volume of data fixed (9M data records), and measured the throughput • We only tested 9MDATA-32 (32 operators for computation) • Experimental Results • This experiment shows that 9 source nodes obtain the best throughput [Figure: throughput versus number of source operators, with the best point marked; node assignment for 9 data distribution nodes among the 14 nodes (4 cores each, host names e0101b0${n}e1, one disk per node)] Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: B
Analysis (II-b): Increased Throughput by Data Distribution Optimization • The graph shows the overall results of applying the same optimization as in the previous experiment, that is, increasing the number of source operators. • 3 source operators are used for 4, 8, 12, and 16 cores, and 9 source operators are used for 20, 24, 28, and 32 cores. • At 32 cores we achieved a 5.84 times speed-up against 4 cores Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: B
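To put the 5.84 times speed-up in context, it can be compared with the ideal linear speed-up from 4 to 32 cores, which is 8 times. The Python sketch below computes that ratio (parallel efficiency); the function name is illustrative, and the only measured value used is the 5.84 from this slide.

```python
def parallel_efficiency(speedup: float, cores: int, baseline_cores: int = 4) -> float:
    """Ratio of the measured speed-up to the ideal linear speed-up over the baseline."""
    ideal_speedup = cores / baseline_cores  # 32 / 4 = 8 for this slide
    return speedup / ideal_speedup


efficiency = parallel_efficiency(5.84, cores=32)
print(round(efficiency, 2))  # 0.73
```

A value above 1.0 would indicate super-linear scaling; with 9 source operators this experiment reaches about 0.73 of linear at 32 cores.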
Analysis (II-c): Increased Throughput by Data Distribution Optimization • The yellow line shows the best performance, since 9 nodes participate in the data distribution for 20, 24, 28, and 32 cores. Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: B
Experiment (III): Increasing the Number of Source Operators • Motivation • In this experiment, we examine the performance characteristics when using more source operators than in the previous experiment (Experiment II). • We also compare the performance of the Infiniband network with that of the commodity 1Gbps network • Experimental Setting • We increase the number of source operators from 3 up to 45, and test this configuration against a relatively large number of compute nodes: 20, 24, 28, and 32. • The node assignment for data distribution and computation is the same as in the previous experiment (Experiment II)
Analysis (III-a): Throughput and Elapsed Time • The maximum total throughput, around 640 Mbps, is below the network bandwidth of both Infiniband and the 1Gbps LAN. • 800000 tuples/sec (1 tuple = 100 bytes) = 640 Mbps [Figures: throughput; elapsed time] Hardware Environment: 14 nodes, Intel Xeon 3.0GHz, 4 Cores, 16GB RAM, RHEL 5.3 64 bit, Infiniband Node Assignment: C
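The comparison against the network bandwidth can be made concrete by expressing the peak throughput as a fraction of each link's nominal rate. The Python sketch below takes the nominal rates from the Performance Evaluation slide (1 Gbps, and DDR Infiniband at 20 Gbps) and, as a simplifying assumption, treats the total throughput as if it crossed a single link; the function name is illustrative only.

```python
def link_utilization(throughput_mbps: float, link_mbps: float) -> float:
    """Fraction of a link's nominal bandwidth used by the given throughput."""
    return throughput_mbps / link_mbps


# 800000 tuples/sec at 100 bytes per tuple, in decimal megabits per second.
peak_mbps = 800_000 * 100 * 8 / 1_000_000

gigabit_share = link_utilization(peak_mbps, 1_000)      # 1 Gbps LAN
infiniband_share = link_utilization(peak_mbps, 20_000)  # DDR Infiniband, 20 Gbps
print(peak_mbps, gigabit_share, infiniband_share)
```

Under that assumption the peak corresponds to roughly 64% of the 1 Gbps rate and about 3% of the Infiniband rate, so neither link is saturated by raw volume.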