Explores the combination of the MST and p-median problems in market graph analysis, using a spanning p-forest with p-stars for data interpretation. Includes applications in cell formation. Discusses the tolerance problem for the MST and experimental results.
Mixed Tools for Market Analysis and Their Applications Boris Goldengorin LATNA – Laboratory of Algorithms and Technologies for Network Analysis Higher School of Economics, Moscow, Russian Federation bgoldengorin@hse.ru Joint work with M. Batsyn, V. Kalyagin, A. Kocheturov, P.M. Pardalos, A. Vizgunov
Dedicated to Boris Mirkin's Birthday • Professor, Department of Applied Mathematics, Higher School of Economics, Moscow, RF • clustering • decision making • mathematical classification • evolutionary trees • data and text interpretation • Citation indices: all citations 3865 • h-index 28 • i10-index 50
Mirkin visited me in Alma-Ata, Kazakhstan, in 1981 • The USSR Workshop on Statistical and Discrete Analysis of Non-Numerical Information, Expert’s Estimations and Discrete Optimization. Abstracts. Moscow-Alma-Ata, VINITI AN SSSR, 1981, pp.356 (in Russian)
Abstract • Efficient daily trading requires aggregating positions that are correlated with each other according to one of the trader’s criteria. Aggregating positions is one possible way to increase an online trader’s capacity. • In this talk we analyse the well-known minimum spanning tree (forest) approach used for market graph analysis and combine it with the less-known pseudo-Boolean approach based on the p-median problem. • We illustrate our mixed tools (spanning p-forest combined with p-stars) by applying them to different sources of data, including market graphs and cell formation in group technology.
Outline of the talk • The Market Graph • The Minimum Spanning Tree (MST) Problem • MST and Its Tolerances • Stars and the p-Median Problem • Pseudo-Boolean polynomial • Mixed Boolean pseudo-Boolean Model (MBpBM) • Experimental results • Concluding Remarks • Directions for Future Research
Market Graph • Vertices are stocks, and an edge connects two stocks if the correlation between their price fluctuations over a certain period is greater than a specified threshold • ~6000 vertices (stocks)
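The construction described on this slide can be sketched as follows. This is a minimal illustration, assuming Pearson correlation of return series as the correlation measure; the tickers, the data and the helper names `pearson` and `market_graph` are hypothetical and not taken from the talk.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long series (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def market_graph(returns, threshold):
    """Return the edge set: pairs of stocks whose correlation exceeds the threshold."""
    edges = set()
    for a, b in combinations(sorted(returns), 2):
        if pearson(returns[a], returns[b]) > threshold:
            edges.add((a, b))
    return edges

# Hypothetical toy data: BBB moves with AAA, CCC moves against both.
returns = {
    "AAA": [0.01, -0.02, 0.03, 0.00, 0.01],
    "BBB": [0.02, -0.01, 0.04, 0.01, 0.02],
    "CCC": [-0.01, 0.02, -0.03, 0.00, -0.01],
}
edges = market_graph(returns, 0.5)
```

Raising the threshold can only remove edges, which is why the edge density of the market graph falls as the threshold grows.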
Market Graph • Correlation coefficients for the edges: Distribution of correlation coefficients in the US stock market for several overlapping 500-day periods during 2000–2002 (period 1 is the earliest, period 11 is the latest).
Market Graph • Market graph (all the considered instances for different correlation thresholds) follows the power-law model • Using the combination of heuristic and exact algorithms, the exact solution of the maximum clique problem was found (Boginski, Butenko & Pardalos, 2005)
Finding Cliques in the Market graph • Using the IP formulation of the maximum clique problem to find the exact solution:
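The formula on this slide did not survive extraction. For orientation, one standard integer programming formulation of the maximum clique problem on G = (V, E) is the edge formulation below; it is given here as an assumption and may differ from the formulation actually shown in the talk.

```latex
\begin{align*}
\max\ & \sum_{i \in V} x_i \\
\text{s.t. } & x_i + x_j \le 1 && \text{for all } i \ne j \text{ with } (i,j) \notin E, \\
& x_i \in \{0,1\} && \text{for all } i \in V.
\end{align*}
```

Here x_i = 1 means that vertex i is selected, and each constraint forbids selecting two non-adjacent vertices together.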
Maximum Clique size for different correlation thresholds • Large cliques despite very low edge density – confirms the idea about the “globalization” of the market
The Minimum Spanning Tree (MST) Problem • For a given simple weighted undirected graph G = (V, E, W), find a spanning tree T = (V, E(T)) such that the total sum of the edge weights w(e) over all e ∈ E(T) is minimized. It is well known that an MST is a connected acyclic graph containing exactly n-1 edges, and that it can be computed by means of Kruskal’s (greedy-type) algorithm. • At each step Kruskal’s algorithm selects a shortest edge such that the current graph remains a forest.
Examples of Spanning Trees • Weekly volatility before technology crash • Daily return before technology crash
Kruskal’s Algorithm for the MST • Repeat the following step until the forest T has n-1 edges (initially E(T) is empty): add to T a shortest edge that does not form a cycle with the edges already in E(T). • Assume that all m = |E| edges are ordered in non-decreasing order of weight, w(e1) ≤ w(e2) ≤ … ≤ w(em). Then Kruskal’s algorithm terminates with an MST in O(m log m) time, where m = n(n-1)/2 for a complete graph.
The tolerance problem for an MST • The problem is to find, for each e ∈ E, the maximum decrease l(e) and the maximum increase u(e) of the edge length w(e) that preserve the optimality of T, under the assumption that the lengths of all other edges remain unchanged. • The values l(e) and u(e) are called the lower and the upper tolerances, respectively, of an edge e ∈ E with respect to the given MST T and the edge length function w.
An optimal MST and Its Tolerances in O(m log m) time • Next we show that an MST together with all its upper and lower tolerances can be computed in O(m log m) time by a tiny modification of Kruskal’s algorithm. • Recall that adding a single edge y not in T to the chosen spanning subtree S(T) creates a unique cycle C = {e1, e2, …, ek, y}, where the tail of y is the head of ek and the head of y is the tail of e1, or vice versa.
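The step described on this slide can be sketched with a union-find structure that detects cycles. This is a minimal sketch, not the speaker's implementation; the function name `kruskal`, the edge tuple layout `(weight, u, v)` and the sample graph are assumptions made for illustration.

```python
def kruskal(n, edges):
    """Kruskal's algorithm on vertices 0..n-1.

    edges is a list of (weight, u, v) tuples.
    Returns (total_weight, tree_edges) with tree_edges in order of selection.
    """
    parent = list(range(n))

    def find(x):
        # Find the component representative, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):          # non-decreasing order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                       # edge joins two components: no cycle
            parent[ru] = rv
            tree.append((w, u, v))
            if len(tree) == n - 1:         # spanning tree is complete
                break
    return sum(w for w, _, _ in tree), tree

# Hypothetical 4-vertex example.
edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 1, 3), (3, 2, 3)]
total, tree = kruskal(4, edges)
```

The sort dominates the running time, which gives the O(m log m) bound quoted on the slide.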
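The cycle argument can be turned into a direct computation of all tolerances. The sketch below is a naive version that walks the tree path for every non-tree edge, so it runs in O(mn) time; it is not the O(m log m) modification of Kruskal's algorithm announced in the talk. It relies on the standard facts that a tree edge may be decreased without limit and a non-tree edge may be increased without limit, so only the upper tolerances of tree edges and the lower tolerances of non-tree edges are finite. The function name and the sample data are hypothetical.

```python
from collections import defaultdict
from math import inf

def mst_tolerances(tree, non_tree):
    """Tolerances with respect to a given MST of a connected simple graph.

    tree, non_tree: lists of (weight, u, v).
    Returns (upper, lower): upper tolerances of tree edges and lower
    tolerances of non-tree edges, keyed by (weight, min(u, v), max(u, v)).
    """
    def key(w, u, v):
        return (w, min(u, v), max(u, v))

    adj = defaultdict(list)
    for w, u, v in tree:
        adj[u].append((v, w))
        adj[v].append((u, w))

    def tree_path(src, dst):
        # Depth-first search from src, then walk back from dst.
        prev = {src: None}
        stack = [src]
        while stack:
            x = stack.pop()
            for y, w in adj[x]:
                if y not in prev:
                    prev[y] = (x, w)
                    stack.append(y)
        path, x = [], dst
        while prev[x] is not None:
            p, w = prev[x]
            path.append(key(w, p, x))
            x = p
        return path

    upper = {key(*e): inf for e in tree}
    lower = {}
    for w, u, v in non_tree:
        path = tree_path(u, v)             # tree edges of the cycle closed by y
        # y may drop until it ties with the heaviest tree edge on its cycle.
        lower[key(w, u, v)] = w - max(e[0] for e in path)
        # Each tree edge on the cycle may rise until it ties with y.
        for e in path:
            upper[e] = min(upper[e], w - e[0])
    return upper, lower

tree = [(1, 0, 1), (2, 1, 2), (3, 2, 3)]
non_tree = [(4, 0, 2), (5, 1, 3)]
upper, lower = mst_tolerances(tree, non_tree)
```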
Equivalent Problems • The clique problem and the independent set problem are complementary: a clique in G is an independent set in the complement graph of G, and vice versa. • The set {1,2,3,4} is the maximum clique; the set {0,2,5} is the maximum independent set
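The complementarity can be checked by brute force on a small graph. The sketch below uses a hypothetical 5-vertex graph, not the graph drawn on the slide, and exhaustive enumeration that is only practical for tiny instances.

```python
from itertools import combinations

def complement(n, edges):
    """Edges of the complement graph on vertices 0..n-1."""
    present = {frozenset(e) for e in edges}
    return {frozenset(p) for p in combinations(range(n), 2)} - present

def max_clique(n, edges):
    """Largest vertex set whose pairs are all edges (exhaustive search)."""
    present = {frozenset(e) for e in edges}
    for k in range(n, 0, -1):
        for s in combinations(range(n), k):
            if all(frozenset(p) in present for p in combinations(s, 2)):
                return set(s)
    return set()

def max_independent_set(n, edges):
    """A maximum independent set of G is a maximum clique of its complement."""
    return max_clique(n, complement(n, edges))

# Hypothetical graph: a triangle 0-1-2 with a path 2-3-4 attached.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
clique = max_clique(5, edges)
independent = max_independent_set(5, edges)
```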
The p-Median Problem (PMP) • I = {1,…,m} – a set of m facilities (location points) • J = {1,…,n} – a set of n users (clients, customers or demand points) • C = [cij] – an m×n matrix of distances travelled or costs incurred (measures of similarity or dissimilarity) • [Figure: costs matrix with location points as rows and clients as columns; legend: location point (cluster center), client (cluster point)]
The PMP: combinatorial formulation • The p-Median Problem (PMP) consists of determining p locations (the median points), with 1 ≤ p ≤ m, such that the sum of distances (or transportation costs) over all clients is minimal. • [Figure: example with p = 3, annotated with a complexity expression in m and p; legend: opened facility, location point, client]
The PMP: combinatorial formulation • I – set of locations • J – set of clients • cij – cost of serving the j-th client from the i-th location • p – number of facilities to be opened
The PMP: Applications • Facility location • Cluster analysis • Quantitative psychology • Telecommunications industry • Sales force territories design • Political and administrative districting • Optimal diversity management (assortment problems) • Cell formation in group technology (flexible manufacturing systems) • Vehicle routing • Topological design of computer and communication networks
The PMP: Applications • Facility location - consumer (client) - possible location of supplier (server) - supplier (server), e.g. supermarket, bakery, laundry, etc.
The PMP: Applications • Cluster analysis • Input: a finite set of objects and a measure of similarity • Output: clusters (cluster 1, cluster 2, cluster 3, cluster 4) with their “best” representatives, the p-medians
The PMP: Applications • Quantitative psychology patients symptoms (behavioural patterns) type 1 mentality features type 2 mentality features “leaders” or typical representatives
The PMP: Applications • Telecommunications industry
The PMP: Applications • Sales force territories design customers (groups of customers) entries of the costs matrix account for customers’ attitudes and spatial distance possible outlets for some product Goal: select p best outlets for promoting the product
The PMP: Applications • Political and administrative districting districts, cities, regions degree of relationship: political, cultural, infrastructural connectedness districts, cities, regions
The PMP: Applications • Optimal diversity management • given a variety of products (each having some demand, possibly zero) • select p products such that: • every product with a nonzero demand can be replaced by one of the p selected products • replacement overcosts are minimized
The PMP: Applications • Optimal diversity management • Example: wiring designs, p=3 configurations with zero demand
The PMP: Applications • Cell formation in group technology functional layout cellular layout see also video at http://www.youtube.com/watch?v=q_m0_bVAJbA - machines - products routes
The PMP: Applications • Vehicle routing - clients / storage - vehicle routes
The PMP: Applications • Topological design of computer and communication networks
Publications, more than 500 • Goldengorin et al., 2011, 2012; Elloumi, 2010; Brusco and Köhn, 2008; Belenky, 2008; Church, 2003, 2008; Avella et al., 2007; Beltran et al., 2006; Reese, 2006 (overview, Networks); ReVelle and Swain, 1970; Senne et al., 2005.
Brusco and Köhn, Psychometrika, Vol. 73, No. 1, 89–105 • There is evidence that the p-median model can, for certain data structures, provide better cluster recovery than alternative clustering procedures (Klastorin, 1985). Klastorin provided a limited comparison of misclassification rates of the complete linkage (Johnson, 1967), average linkage (Sokal & Sneath, 1963), minimum variance (Ward, 1963), K-means (Hartigan & Wong, 1979; MacQueen, 1967), and p-median methods (Mulvey & Crowder, 1979). For data generated based on squared Euclidean measures of dissimilarity, Ward’s method provided the lowest misclassification rates, followed by the p-median method. The p-median model, however, provided the lowest misclassification rates when the pairwise measure of dissimilarity was based on Euclidean distance.
The PMP: Boolean Linear Programming Formulation (ReVelle and Swain, 1970) • Constraints (s.t.): • each client is served by exactly one facility • exactly p facilities are opened • clients cannot be served by closed facilities • xij = 1 if the j-th client is served by the i-th facility; xij = 0 otherwise
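The formulas on this slide were lost in extraction. The textbook form of the model, written to match the three constraint descriptions in the order given, is sketched below. The opening variables y_i (y_i = 1 if facility i is opened) are an assumption introduced here; only the assignment variables xij are named in the surviving text.

```latex
\begin{align*}
\min\ & \sum_{i \in I} \sum_{j \in J} c_{ij}\, x_{ij} \\
\text{s.t. } & \sum_{i \in I} x_{ij} = 1 && \text{for all } j \in J, \\
& \sum_{i \in I} y_i = p, \\
& x_{ij} \le y_i && \text{for all } i \in I,\ j \in J, \\
& x_{ij},\, y_i \in \{0,1\} && \text{for all } i \in I,\ j \in J.
\end{align*}
```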
The PMP: alternative formulation, Cornuejols et al. 1980 • For each client j: sorted (distinct) distances (Kj – number of distinct distances for the j-th client)
The PMP: alternative formulation, Cornuejols et al. 1980 • For each client j: sorted (distinct) distances (Kj – number of distinct distances for the j-th client) • Decision variables • S – set of opened plants
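The idea behind this formulation, as it is usually presented, is that the cost of serving client j telescopes over its sorted distinct distances: the client pays the smallest distance, plus each successive increment for which no opened plant lies within the current distance. The sketch below checks that identity numerically. The indicator `z` and the function name are assumptions, since the slide's own symbols and variable definitions were lost, and the cost matrix is hypothetical.

```python
from itertools import combinations

def telescoped_cost(C, S, j):
    """Cost of client j under opened plants S, via sorted distinct distances."""
    D = sorted({C[i][j] for i in range(len(C))})   # D[0] < D[1] < ... (distinct)
    cost = D[0]
    for k in range(len(D) - 1):
        # z is True when no opened plant lies within distance D[k] of client j.
        z = all(C[i][j] > D[k] for i in S)
        cost += (D[k + 1] - D[k]) * z
    return cost

# Hypothetical cost matrix (rows: plants, columns: clients).
C = [
    [0, 2, 7, 8],
    [2, 0, 6, 7],
    [7, 6, 0, 1],
    [8, 7, 1, 0],
]

# The telescoped cost agrees with the direct minimum for every S and j.
agrees = all(
    telescoped_cost(C, S, j) == min(C[i][j] for i in S)
    for p in range(1, 5)
    for S in combinations(range(4), p)
    for j in range(4)
)
```

Because only distinct distances enter the sum, clients with many equal distances contribute fewer terms, which is what makes this representation compact.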