Data-Driven Decision-Making The Good, the Bad, and the Ugly Ruda Kulhavý Honeywell International, Inc. Automation and Control Solutions Advanced Technology
Can We Generate More Value from Data? • Today, a typical “data mining” project is ad hoc, lengthy, costly, knowledge-intensive, and requires ongoing maintenance. • Although the project benefits can be quite significant, the resulting profit is often marginal. • The industry is in search of robust methods and reusable workflows that are easy to use, adapt to system and organizational changes, and require no special knowledge from the end user. • This is a tough target … What can we offer today?
Learning from Data • Data: di(k), i = 1,…,n, k = 1,…,N, forming a data matrix of n variables by N observations. • Independent variables: states (disturbance vars), actions (manipulated vars). • Dependent variables: responses (controlled vars), rewards (objective functions). • Goal: learn from the data how the responses and rewards depend on actions and states.
From Data to Probability • A relational database table (records k = 1,…,N, fields A, B, C) maps onto a multidimensional data hypercube with dimensions A, B, C and cells i = 1,…,L. • Each cell carries the count Ni of records falling into it; the normalized counts define the empirical probability over the cells.
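A minimal sketch of this mapping (the field names, value ranges, and bin counts below are illustrative assumptions, not from the talk): records of a table are binned into a hypercube, and the cell counts are normalized into an empirical probability.

```python
import numpy as np

# Illustrative table: N records with three fields (A, B, C).
rng = np.random.default_rng(0)
N = 1000
records = np.column_stack([
    rng.normal(0.0, 1.0, N),   # field A
    rng.normal(5.0, 2.0, N),   # field B
    rng.uniform(0.0, 1.0, N),  # field C
])

# Data hypercube: each dimension divided into d cells, L = d**n cells in total.
d = 10
counts, edges = np.histogramdd(records, bins=d)

# Empirical probability over the cells: p_i = N_i / N.
empirical_prob = counts / counts.sum()
print(empirical_prob.shape, empirical_prob.sum())  # (10, 10, 10) 1.0
```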
Probabilistic Data Mining • Database → data cube → empirical probability → smoothened probability. • Queries are answered through probability operations on the smoothened probability, with a possible Monte Carlo approximation.
What Makes Up ‘Problem Dimensionality’? Take a discrete perspective: • Number of data (N): N = 10^5 five-minute samples per year. • Number of cells (L): L = d^n cells, assuming n dimensions, each divided into d cells. • Number of models (M): M = d^m models, assuming m model parameters, each divided into d cells. • Can be cut down if strong prior info is available.
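A quick back-of-the-envelope check of these counts (the particular values of d, n, and m below are illustrative only):

```python
# Number of five-minute samples per year.
N = 365 * 24 * 12            # 105_120, i.e. roughly 10**5

# Number of cells for n dimensions, each split into d cells.
d, n = 10, 6
L = d ** n                   # 1_000_000 cells -- already more cells than data points

# Number of models for m model parameters, each split into d cells.
m = 4
M = d ** m                   # 10_000 candidate models
print(N, L, M)
```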
Macroscopic Prediction E. T. Jaynes, Macroscopic Prediction, 1985: • If any macrophenomenon is found to be reproducible, then it follows that all microscopic details that were not reproduced must be irrelevant for understanding and predicting it. • Gibbs’ variational principle is … "predict that final state that can be realized by Nature in the greatest number of ways, while agreeing with your macroscopic information."
Boltzmann’s Solution (1877) • To determine how N gas molecules distribute themselves in a conservative force field such as gravitation, Boltzmann divided the accessible 6-dimensional phase space of a single molecule into equal cells, with Ni molecules in the i-th cell. • The cells were considered so small that the energy Ei of a molecule did not vary appreciably within a cell, yet large enough that each cell could accommodate a large number Ni of molecules.
Boltzmann’s Solution (cont.) • Noting that the number of ways this distribution can be realized is the multinomial coefficient W = N! / (N1! N2! ⋯ NL!), he concluded that the "most probable" distribution is the one that maximizes W subject to the known constraints of his prior knowledge; in this case the total number of particles, Σi Ni = N, and the total energy, Σi Ni Ei = E.
Boltzmann’s Solution (cont.) • If the numbers Ni are large, the factorials can be replaced with the Stirling approximation, giving (1/N) log W ≈ −Σi (Ni/N) log(Ni/N), i.e. the Shannon entropy of the cell frequencies. • The solution maximizing log W can be found by Lagrange multipliers: Ni/N = C exp(−λ Ei), an exponential distribution, where C is a normalizing factor and the Lagrange multiplier λ is chosen so that the energy constraint is satisfied.
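A small numerical sketch of this construction (the cell energies and the prescribed mean energy below are made-up values): the Lagrange multiplier λ is found by bisection so that the exponential distribution C·exp(−λ·Ei) reproduces the prescribed mean energy.

```python
import numpy as np

E = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # energy of a molecule in each cell (illustrative)
mean_energy = 1.2                          # prescribed energy per particle (illustrative)

def maxent_dist(lam):
    """Exponential distribution p_i = C * exp(-lam * E_i)."""
    w = np.exp(-lam * E)
    return w / w.sum()

# Bisection on lambda: the mean energy decreases monotonically as lambda grows.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if maxent_dist(mid) @ E > mean_energy:
        lo = mid          # mean energy still too high -> increase lambda
    else:
        hi = mid

p = maxent_dist(0.5 * (lo + hi))
print(p, p @ E)   # maximum-entropy cell probabilities and their mean energy (~1.2)
```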
Why Does It Work? E.T.Jaynes, Where Do We Stand on Maximum Entropy?, 1979: • Information about the dynamics entered Boltzmann’s equations at two places: (1) the conservation of total energy; and (2) the fact that he defined his cells in terms of phase volume … • The fact that this was enough to predict the correct spatial and velocity distribution of the molecules shows that the millions of intricate dynamical details that were not taken into account, were actually irrelevant to the predictions …
Why Does It Work? (cont.) E. T. Jaynes, Where Do We Stand on Maximum Entropy?, 1979: • Boltzmann’s reasoning was super-efficient … • Whether by luck or inspiration, he put into his equations only the dynamical information that happened to be relevant to the questions he was asking. • Obviously, it would be of some importance to discover the secret of how this came about, and to understand it so well that we can exploit it in other problems …
General Maximum Entropy • Empirical probability mass function r(N), with components ri(N) = Ni / N. • Equivalence of probability mass functions for a given (vector) function h = (h1,…,hL): two pmfs are equivalent if they give the same expected value of h. • Equivalence class containing r(N): all pmfs s with Σi si hi = Σi ri(N) hi.
General Maximum Entropy (cont.) • Relative entropy (aka Kullback-Leibler distance): D(s ‖ s(0)) = Σi si log( si / si(0) ). • Minimum relative entropy w.r.t. a reference s(0), over the equivalence class of r(N). • Minimum relative entropy solution: si = C si(0) exp(λ · hi), where C is a normalizing factor and λ is chosen so that the expectation constraint Σi si hi = Σi ri(N) hi is satisfied. • With a uniform reference s(0), this reduces to maximum entropy.
Probability Approximation • Approximate the empirical probability vector r(N) with a member s(θ) of a more tractable family parameterized by a vector θ. • Taking a geometric perspective, this can be regarded as a projection of the point r(N) onto a surface of lower dimension.
Maximum Likelihood • Exponential family S(m) with a fixed "origin" s(0), canonical affine parameter θ, directional sufficient statistic h = (h1,…,hL), and normalizing factor C: si(θ) = C(θ) si(0) exp(θ · hi). • Minimize the relative entropy D(r(N) ‖ s(θ)) over θ. • By the definition of relative entropy, the task is equivalent to maximum likelihood.
Maximum Likelihood (cont.) • Minimum relative entropy solution: si(θ) = C(θ) si(0) exp(θ · hi), where C is a normalizing factor and θ is chosen so that the expected sufficient statistic under s(θ) matches the empirical one, Σi si(θ) hi = Σi ri(N) hi.
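A minimal sketch of this duality on a toy example (the reference distribution, sufficient statistic, and data below are illustrative assumptions): gradient ascent on the log-likelihood of an exponential family drives the model's expected sufficient statistic to the empirical one, so the maximum-likelihood fit is exactly the moment-matching projection.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 6
s0 = np.full(L, 1.0 / L)                 # reference ("origin") distribution, illustrative
h = np.arange(L, dtype=float)            # scalar sufficient statistic per cell, illustrative
counts = rng.multinomial(500, [0.05, 0.10, 0.15, 0.20, 0.25, 0.25])
r = counts / counts.sum()                # empirical probability r(N)

def s(theta):
    """Exponential family member s_i(theta) = C(theta) * s0_i * exp(theta * h_i)."""
    w = s0 * np.exp(theta * h)
    return w / w.sum()

# Gradient of the average log-likelihood w.r.t. theta is E_r[h] - E_s(theta)[h].
theta = 0.0
for _ in range(2000):
    grad = r @ h - s(theta) @ h
    theta += 0.05 * grad

print(theta, r @ h, s(theta) @ h)        # fitted theta; empirical and model moments agree
```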
Dual Projections • Maximum entropy and maximum likelihood appear as dual projections (figure).
Pythagorean Geometry • Dual parametrizations of the exponential family (figure: equivalence class, exponential family).
Dual Geometry • Maximum Entropy: the empirical probability is known only up to an equivalence class; the solution is found within an exponential family through a reference point. • Maximum Likelihood: the approximating probability is sought within an exponential family; the approximation is found by projecting the empirical probability. (Figure: exponential family, equivalence classes.)
Bayesian Estimation • Posterior probability vector over models i = 1,…,M: the posterior probability of each model is proportional to its prior probability times the likelihood of the observed data under that model, normalized over the M models.
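A minimal sketch of such a posterior vector (the candidate models and the observed counts are made up for illustration): each model assigns a multinomial likelihood to the observed cell counts, and the posterior is the normalized product of prior and likelihood.

```python
import numpy as np

# Observed cell counts (illustrative).
counts = np.array([12, 9, 23, 6])

# M candidate models, each a probability vector over the cells (illustrative).
models = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [0.20, 0.20, 0.40, 0.20],
    [0.10, 0.20, 0.50, 0.20],
])
prior = np.full(len(models), 1.0 / len(models))

# Multinomial log-likelihood of the counts under each model (constant term dropped).
log_lik = counts @ np.log(models).T

# Posterior: prior times likelihood, normalized (done in log space for stability).
log_post = np.log(prior) + log_lik
log_post -= log_post.max()
posterior = np.exp(log_post)
posterior /= posterior.sum()
print(posterior)
```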
What If the Model Is Too Complex? • For some real-life problems, the level of detail that needs to be collected on the empirical probability (and, correspondingly, the dimension of the exponential family) is too high, possibly infinite. • In such a case, we can either • sacrifice the closed-form solution, or • take a narrower view of the data, • modeling only the part of system behavior relevant to the problem in question, • while using a simpler, lower-dimensional model.
Relevance-Based Weighting of Data • The general idea of relevance weighting is to modify the empirical probability by reweighting the cell frequencies, ri(N) → wi ri(N) (suitably normalized), where the weight vector w reflects the relevance of the particular cells to the case at hand. • A popular choice of the weights wi for a given “query” vector x(0) is a kernel function of the distance between the cell location and x(0), as sketched below.
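A minimal sketch of kernel-based relevance weights (the Gaussian kernel, bandwidth, and cell locations are illustrative choices, not prescribed by the talk): cells close to the query point x(0) get weights near one, distant cells near zero, and the reweighted cell frequencies are renormalized into a query-specific empirical distribution.

```python
import numpy as np

# Cell centers x_i in a 2-D cube and their empirical frequencies r_i (illustrative).
grid = np.linspace(0.05, 0.95, 10)
cells = np.array([[a, b] for a in grid for b in grid])        # L = 100 cell centers
rng = np.random.default_rng(2)
r = rng.dirichlet(np.ones(len(cells)))                        # empirical probabilities

# Gaussian kernel weights w_i = exp(-||x_i - x0||^2 / (2 h^2)) for a query x0.
x0 = np.array([0.3, 0.7])
h = 0.15
w = np.exp(-np.sum((cells - x0) ** 2, axis=1) / (2 * h ** 2))

# Relevance-weighted empirical distribution: reweight and renormalize.
r_local = w * r
r_local /= r_local.sum()
print(r_local.max(), cells[r_local.argmax()])  # probability mass concentrates near the query
```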
Local Empirical Distributions • Projections of relevance-weighted empirical distributions onto an exponential family: query-specific empirical distributions are mapped into a query-independent model family. (Figure axes: predictor variable, response variable.)
Local Modeling • A relational database is organized into a multidimensional data cube. • Example: the forecasted variable is heat demand; the explanatory variables are time of day and outdoor temperature; predictions are made around a query point ("What if?").
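A minimal sketch of a query-point forecast in this spirit (the synthetic heat-demand data, kernel, and bandwidth are assumptions for illustration): a locally weighted linear regression is fitted around the "What if?" query using kernel weights over the explanatory variables.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000
time_of_day = rng.uniform(0, 24, N)                    # explanatory variable 1
outdoor_temp = rng.uniform(-10, 25, N)                 # explanatory variable 2
heat_demand = (60 - 2.0 * outdoor_temp                 # synthetic forecasted variable
               + 8.0 * np.sin(2 * np.pi * time_of_day / 24)
               + rng.normal(0, 3, N))

X = np.column_stack([time_of_day, outdoor_temp])
query = np.array([7.0, -5.0])                          # "What if?": 7 a.m., -5 deg C

# Kernel weights over the (scaled) explanatory variables.
scale = X.std(axis=0)
dist2 = np.sum(((X - query) / scale) ** 2, axis=1)
w = np.exp(-dist2 / (2 * 0.3 ** 2))

# Weighted least squares for a local linear model centered at the query point.
A = np.column_stack([np.ones(N), X - query])           # intercept = prediction at the query
W = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * W[:, None], heat_demand * W, rcond=None)
print("forecasted heat demand at query:", coef[0])
```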
Multiple Forecasting Applications Electricity Loads Heat Loads Gas Loads Process Yields
Data-Centric Technology
• Regression: continuous target variable (product demand, product property, performance measure); query point and neighborhood defined over state and/or action.
• Classification: categorical target variable (discrete event, system fault, process trip); query point and neighborhood defined over state and/or action.
• Novelty Detection: tested variable (corrupt values, unusual responses, new behavior); a tested point from new data compared against a neighborhood of past data.
• Optimization: reward (operating profit, production cost, target matching); an action (decision) sought in a neighborhood of the current state (operating conditions).
Increasingly Popular Approach • Statistical Learning • Locally-Weighted / Nonparametric Regression • Cleveland (Bell Labs) • Vapnik (AT&T Labs) • Artificial Intelligence • Lazy / Memory-Based Learning • Moore (Carnegie Mellon University) • Bontempi (University of Brussels) • System Identification • Just-in-Time / On-Demand Modeling • Cybenko (Dartmouth College) • Ljung & Stenman (Linköping University)
How Do Humans Solve Problems? • Sales rep: focus on recent experience! • Expert: take everything into account! • Engineer: use relevant information!
Corresponding Technologies Adaptive Regression Neural Network Local Regression
Pros and Cons
• Adaptive Regression: pros are simple adaptation, fast computation, data compression; cons are no actual learning, only a local description.
• Neural Network: pros are a global description, fast lookup, data compression; cons are slow learning, the interference problem, lack of adaptation, difficult to interpret.
• Local Regression: pros are minimum bias, inherent adaptation, easy to interpret; cons are no compact model, no data compression, slower lookup.
Limits of Local Modeling • As the cube dimension n increases, it becomes increasingly difficult to do relevance weighting, similarity search, neighborhood sizing … • The volume of a unit hypersphere becomes a vanishing fraction of the volume of a unit hypercube. • The length of the diagonal (√n) of a unit hypercube goes to infinity. • The hypercube increasingly resembles a spherical “hedgehog” (with 2^n spikes). • When uniformly distributed, most data appear near the cube edges.
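These geometric facts are easy to verify numerically; a small sketch using the standard Gamma-function formula for the hypersphere volume:

```python
import math

for n in (2, 3, 5, 10, 20):
    # Volume of the unit-diameter hypersphere inscribed in the unit hypercube.
    sphere = math.pi ** (n / 2) / math.gamma(n / 2 + 1) * 0.5 ** n
    diagonal = math.sqrt(n)            # length of the unit hypercube's diagonal
    print(f"n={n:2d}  inscribed-sphere volume={sphere:.2e}  diagonal={diagonal:.2f}")
# The sphere's share of the cube volume collapses toward zero,
# while the diagonal grows without bound.
```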
No “Local” Data in High Dimensions (Figure: retrieved data ratio vs. cube edge ratio, for data living on surfaces of dimension 1, 2, 3.) • However, in most real-life problems, the data is anything but uniformly distributed. • Thanks to technology design, integrated control & optimization, and human supervision, the actual number of degrees of freedom is often quite limited.
Local Modeling Revisited • Exploit the data dependence structure (“divide and conquer”): compare p(x1) p(x2) against p(x1, x2), as sketched below; make use of the Markovian property. • Discover low-dimensional manifolds on which the data live: feature selection; cross-validation. (Figure: query point neighborhood defined over an embedded manifold.)
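A minimal sketch of the p(x1)·p(x2) versus p(x1, x2) comparison (the synthetic data and bin count are illustrative): the mutual information estimated from a 2-D histogram is near zero when the product of marginals already explains the joint distribution, and clearly positive otherwise.

```python
import numpy as np

def mutual_information(x1, x2, bins=10):
    """Estimate I(X1; X2) from the empirical joint and marginal cell probabilities."""
    joint, _, _ = np.histogram2d(x1, x2, bins=bins)
    p12 = joint / joint.sum()
    p1 = p12.sum(axis=1, keepdims=True)
    p2 = p12.sum(axis=0, keepdims=True)
    mask = p12 > 0
    return float(np.sum(p12[mask] * np.log(p12[mask] / (p1 @ p2)[mask])))

rng = np.random.default_rng(4)
a = rng.normal(size=5000)
independent = rng.normal(size=5000)          # unrelated to a
dependent = a + 0.3 * rng.normal(size=5000)  # strongly tied to a
print(mutual_information(a, independent))    # close to 0: p(x1) p(x2) ~ p(x1, x2)
print(mutual_information(a, dependent))      # clearly positive: structure to exploit
```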
Local Modeling Revisited • Make use of multiple modes in the data: a tree of production or operating modes; a definition of similar modes over the tree. • Analyze patterns in how the cube cells are populated with data, including the occupancy numbers. • Estimate the probabilities of symbols generated by an information source, given an observed sequence of symbols; the symbols are defined by cube cell labels, in a proper encoding.
Cube Encoding • For every pair of populated cells i, i′, there exists a natural number n such that the difference of the cell labels is an n-fold multiple of a fixed step D (the single-step special case of the general “linear” condition below).
General “Linear” Case • There exist m numbers D1, D2, …, Dm such that for every two populated cells i, i′, the absolute difference of the cell labels can be expressed as a weighted sum |i − i′| = n1 D1 + n2 D2 + … + nm Dm, where the weights n1, n2, …, nm are natural numbers. • The number m defines the dimension of a “hyperplane” cutting the cube, on which the data live.
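A small sketch of checking this condition for a given set of steps (the populated cell labels, the candidate steps D1 and D2, and the search bound are illustrative assumptions; zero weights are allowed here): every pairwise label difference is tested for a representation n1·D1 + n2·D2 with nonnegative integer weights.

```python
from itertools import combinations, product

# Illustrative populated cell labels and candidate steps D1, D2 (m = 2).
labels = [3, 10, 17, 31, 38, 52]
D = (7, 14)
MAX_WEIGHT = 20                       # search bound for the integer weights

def decomposable(diff, steps, max_weight=MAX_WEIGHT):
    """Is diff = n1*D1 + ... + nm*Dm for some nonnegative integers n1..nm?"""
    for weights in product(range(max_weight + 1), repeat=len(steps)):
        if sum(n * d for n, d in zip(weights, steps)) == diff:
            return True
    return False

ok = all(decomposable(abs(i - j), D) for i, j in combinations(labels, 2))
print("data live on an m = 2 'hyperplane' through the cube:", ok)
```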
Symbolic Forecasting • The cube-encoding condition (for every pair of populated cells i, i′, there exists n such that the label difference matches the step structure) acts as a sequence template for the cell-label symbols.
Symbolic Forecasting More questions than answers at the moment: • What are proper “model” functions capturing population patterns and occupancy numbers? • What is a proper way of approaching the problem? Coding theory? Algebraic geometry? Harmonic analysis? • Quantization error … • Discrete-to-continuous transition …
Hypothesis Formulation … Two of the world’s leading economists present quite distinct views of globalization in their new books: • Joseph Stiglitz, Globalization and Its Discontents • Jagdish Bhagwati, In Defense of Globalization