Black-box Determination of Cost Models’ Parameters for Federated Stream-Processing Systems
Michael Daum, Frank Lauterwald, Philipp Baumgärtel, Niko Pollner, Klaus Meyer-Wegener
2011-09-23, IDEAS 2011
Agenda
• Problem Statement
• Calibration of Cost Models
• Function Approximation
• Estimating the Costs of Single Operators
• Evaluation
• Summary
• Perspective: Cost Estimation for Federated DSMS
Problem Statement
• DSAM: heterogeneous distributed data stream processing
• Automatic cost-based query distribution
• Problem: hardware- and DSMS-specific cost models are needed
Calibration of Cost Models: Things we know a priori
• Operator graph: topology
• Stream characteristics: data rates, selectivity, distribution of certain values
• For some operators: cost model
Things we do not know a priori
• Hardware- and DSMS-specific parameters of cost models
• System costs
• For some operators: the cost model (handled by function approximation)
Calibration of Cost Models - Parameter Estimation
• Cost model consists of
  • Stream- and operator-dependent parameters
  • Constant values
  • Hardware-, system-, and implementation-dependent values
• Test queries and input streams
  • Different values for the stream- and operator-dependent parameters
• Cost measurements
• Least squares
• Outlier detection (e.g. RANSAC)
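The parameter estimation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a hypothetical linear cost model `cost = c0 + c1 * rate`, synthetic measurements, and a hand-written RANSAC loop (the function names, the threshold, and the iteration count are all assumptions made for the example).

```python
import numpy as np

def fit_least_squares(X, y):
    """Return the parameters p that minimize ||X p - y||^2."""
    p, *_ = np.linalg.lstsq(X, y, rcond=None)
    return p

def fit_ransac(X, y, n_iter=200, threshold=1.0, seed=0):
    """Minimal RANSAC: fit on random minimal subsets, keep the largest
    inlier set, then refit by least squares on those inliers only."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(n_iter):
        sample = rng.choice(n, size=k, replace=False)
        p = fit_least_squares(X[sample], y[sample])
        inliers = np.abs(X @ p - y) < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_least_squares(X[best_inliers], y[best_inliers]), best_inliers

# Synthetic measurements (assumed, not from the talk):
# cost = 5 + 0.02 * rate, with two measurements disturbed by outliers.
rates = np.linspace(100, 2000, 20)
costs = 5.0 + 0.02 * rates
costs[3] += 40.0
costs[15] += 60.0

# Design matrix: a column of ones for c0 and the data rate for c1.
X = np.column_stack([np.ones_like(rates), rates])
params, inliers = fit_ransac(X, costs)
naive_params = fit_least_squares(X, costs)  # distorted by the outliers
```

Comparing `params` with `naive_params` shows why an outlier detection step is listed next to least squares: the two disturbed measurements shift the plain least-squares fit, while the refit on the inlier set recovers the parameters used to generate the data.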
Function Approximation – Nonparametric Models
• No appropriate cost model
  • Operator without existing cost model
  • Existing cost models could not be fitted to a specific system
• Solution: function approximation
  • Radial Basis Function Network (RBFN)
  • Function approximation instead of interpolation
  • Fewer centers than input points
  • Least-squares solution via the Moore-Penrose pseudoinverse
• Improving the function approximation
  • Iterative approach
  • Naive function approximation
  • Improving areas of interest (e.g. discontinuities, high gradient)
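A radial basis function network with fewer centers than input points, solved through the pseudoinverse, can be sketched like this. The Gaussian basis, the added bias column, the evenly spaced centers, the width, and the step-shaped test data are assumptions chosen for illustration; the slides do not specify them.

```python
import numpy as np

def rbf_design(x, centers, width):
    """Gaussian basis activations, one column per center, plus a bias column."""
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))
    return np.column_stack([phi, np.ones(len(x))])

def fit_rbfn(x, y, centers, width):
    """Least-squares weights via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(rbf_design(x, centers, width)) @ y

def predict_rbfn(x, centers, width, weights):
    """Evaluate the fitted network at the points x."""
    return rbf_design(x, centers, width) @ weights

# Synthetic cost measurements with a step (assumed data).
x = np.linspace(0.0, 10.0, 50)
y = np.where(x < 5.0, 10.0, 20.0)

# 8 centers for 50 input points: approximation, not interpolation.
centers = np.linspace(0.0, 10.0, 8)
width = 1.0
weights = fit_rbfn(x, y, centers, width)
approx = predict_rbfn(x, centers, width, weights)
rmse = float(np.sqrt(np.mean((approx - y) ** 2)))
```

Because the system is overdetermined (50 equations, 9 weights), the pseudoinverse yields the least-squares weights rather than an exact fit. An iterative refinement along the lines of the last bullet could, for example, add centers where the residual is largest, but that step is not shown here.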
Estimating the Costs of Single Operators
• Assumptions
  • Only the system costs can be measured
  • The costs of a single operator are independent of other operators (additivity)
• System costs depend linearly on the number of operators
  • Parallel instances of the same operator
• Latency
  • With parallel operators, latency does not depend on the number of operators
  • Operators have to be connected in series
Evaluation
• Coral8
• Test setting
  • Synthetic input streams with constant properties (rate, attribute value distribution)
  • Every test query runs for two minutes
  • The test data collected in the first minute is discarded
• Measured values
  • Latency
  • Memory consumption (resident set size)
  • CPU usage
• Coral8 status stream
  • Input and output rate
  • Query latency
  • Application memory
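The warm-up rule from the test setting (two-minute runs, first minute discarded) might be applied to a series of samples as in the sketch below. The sample format, the function name, and the values are hypothetical; the slides do not describe how the measurements were stored or aggregated.

```python
from statistics import mean

def steady_state_mean(samples, warmup_s=60.0, run_s=120.0):
    """Average the measurements taken after the warm-up phase.

    samples: iterable of (seconds_since_query_start, value) pairs.
    """
    kept = [v for t, v in samples if warmup_s <= t <= run_s]
    if not kept:
        raise ValueError("no samples after the warm-up phase")
    return mean(kept)

# Hypothetical CPU samples, one every 10 s: high during start-up, then stable.
samples = [(t, 50.0 if t < 60 else 20.0) for t in range(0, 121, 10)]
cpu_steady = steady_state_mean(samples)
```

Discarding the first minute keeps start-up effects out of the value that is later fed into calibration or function approximation.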
Coral8 Measurements
• Filter operator
• Application memory
• CPU usage
• Unexpected behavior: steps and peaks
Costs of Single Operators
• CPU usage depends linearly on the number of operators
• The slope equals the costs of a single operator
[Plots; x-axis: Operators]
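Reading the per-operator cost off the slope amounts to a straight-line fit of system cost against operator count. The sketch below uses made-up CPU values purely for illustration; they are not measurements from the talk.

```python
import numpy as np

# Illustrative (made-up) system-level CPU usage in percent for queries
# containing n parallel instances of the same operator.
n_ops = np.array([10, 20, 40, 60, 80, 100])
cpu = np.array([3.1, 4.0, 6.1, 7.9, 10.0, 12.1])

# Straight-line fit: cpu ~ base_cost + per_operator_cost * n_ops
per_operator_cost, base_cost = np.polyfit(n_ops, cpu, deg=1)

def estimate_cpu(n):
    """Estimated system CPU usage for n operators under the additivity assumption."""
    return base_cost + per_operator_cost * n
```

The intercept absorbs the cost of the system itself, so only the slope is attributed to the operator. This relies on the additivity assumption: if operators influenced each other, the points would not lie on a line.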
Model Calibration and RBFN
• Application memory of the aggregate operator
• Left side: calibrated cost model
  • Linear cost model
• Right side: function approximation
  • Adapts to the steps
Cost Estimation for Operator Graphs
• Operator graph consisting of 100 parallel filter operators
• Cost estimation using function approximation
Summary
• Cost estimation for black-box systems without cost estimators
  • Calibration of a cost model
    • Default cost model
    • System-specific cost model
  • Function approximation
• Calibration of a cost model for unknown systems
  • Behavior conforming to the cost model is required
  • Nonconforming behavior can be detected (automatically) after some measurements
• Evaluation
  • CPU usage and memory consumption can be estimated
  • Latency: queuing theory
Application: Cost Estimation for Federated DSMS
• Cost formulas as metadata
  • Cost formulas containing constants, variables, and parameters
• Cost estimation
  • Hardware-dependent and system-dependent parameters loaded from the metadata catalog
  • Operator-dependent variables supplied by a metadata provider
  • Stream-dependent variables supplied by a monitoring component or an estimator
  • Interpreter to calculate costs
• Advantages
  • Both default and system-specific cost formulas possible
  • Cost models interchangeable at runtime
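One way to picture cost formulas stored as metadata and evaluated by an interpreter is the sketch below. Everything concrete in it is an assumption for illustration: the formula text, the catalog layout, the system key `coral8-node1`, and the parameter names are hypothetical, and the talk does not say how the interpreter was built. The sketch walks Python's `ast` and accepts only numbers, names, and the four basic arithmetic operators.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(formula, bindings):
    """Evaluate an arithmetic cost formula given as text against name bindings."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return bindings[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported element in cost formula: {ast.dump(node)}")
    return walk(ast.parse(formula, mode="eval"))

# Hypothetical metadata: a cost formula plus calibrated, system-dependent parameters.
formulas = {"filter.cpu": "c_base + c_tuple * rate * selectivity"}
catalog = {"coral8-node1": {"c_base": 2.0, "c_tuple": 0.001}}

# Operator-dependent variables (metadata provider) and
# stream-dependent variables (monitoring component or estimator).
operator_vars = {"selectivity": 0.5}
stream_vars = {"rate": 1000.0}

bindings = {**catalog["coral8-node1"], **operator_vars, **stream_vars}
cost = evaluate(formulas["filter.cpu"], bindings)
```

Since the formula is plain data, replacing the entry in `formulas` or the parameter set in `catalog` changes the cost model without touching the evaluator, which is the kind of runtime interchangeability the slide lists as an advantage.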
Any questions…?
Generating Test Data and Test Queries
• Identifying parameters
• Cost model based
  • Identifying query- or stream-dependent parameters
  • Generating a set of test data for the parameters
  • Mapping the parameters to the query language and stream properties
• Operator or query language based
  • No existing cost model
  • Function approximation
  • Identifying important parameters based on the query language and possible stream properties
  • Generating a set of test data
Problem Statement: Distributed Query Processing (SSDBM 2010)
• Relevant metadata about inner streams is unknown
[Figure: global query graph with operators Op1 to Op6 placed on Node 1, Node 2, and Node 3; the inputs Stream1 and Stream2 are annotated with data rate, density, and statistics; the inner streams and the output Out are marked "???"]
Michael Daum, Frank Lauterwald, Philipp Baumgärtel, Klaus Meyer-Wegener
Propagation of Densities (SSDBM 2010)
• Propagation of input streams' statistics
  • Propagation of statistics for inner streams between operators
  • Propagation of statistics for output streams
• Statistical objective: attribute value distribution (density)
• Analytic operator model
  • Accurate formulas
• Numerical operator model
  • Discrete mappings
  • Training of the mapping relation
[Figure: an operator model (analytic or numerical) maps the input stream's data rate, density, and statistics to those of the output stream]
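As a rough illustration of propagating rate and density through one operator, the sketch below handles a hypothetical filter `value >= threshold` over a histogram. The histogram representation, the function name, and the example numbers are assumptions made here; the actual operator models belong to the SSDBM 2010 work and are not reproduced on the slide.

```python
import numpy as np

def propagate_filter(rate, density, bin_edges, threshold):
    """Propagate (data rate, density) through a filter 'value >= threshold'.

    density holds the probability mass per histogram bin. Bins are kept or
    dropped whole, so the threshold is assumed to coincide with a bin edge.
    """
    keep = bin_edges[:-1] >= threshold
    selectivity = density[keep].sum()
    out_density = np.where(keep, density, 0.0)
    if selectivity > 0:
        out_density = out_density / selectivity  # renormalize to total mass 1
    return rate * selectivity, out_density

# Assumed input statistics: 1000 tuples/s, values spread over four bins.
edges = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
density = np.array([0.1, 0.2, 0.3, 0.4])
out_rate, out_density = propagate_filter(1000.0, density, edges, 20.0)
```

The output of one operator can then serve as the input statistics of the next, which is how the unknown values for inner streams in a query graph could be filled in step by step.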