1 / 24

The Generalized MDL Approach for Summarization

The Generalized MDL Approach for Summarization. Laks V.S. Lakshmanan (UBC) Raymond T. Ng (UBC) Christine X. Wang (UBC) Xiaodong Zhou (UBC) Theodore J. Johnson (AT&T Research) (Work supported by NSERC and NCE/IRIS.). Overview. Introduction Motivation & Problem Statement

celina
Download Presentation

The Generalized MDL Approach for Summarization

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Generalized MDL Approach for Summarization Laks V.S. Lakshmanan (UBC) Raymond T. Ng (UBC) Christine X. Wang (UBC) Xiaodong Zhou (UBC) Theodore J. Johnson (AT&T Research) (Work supported by NSERC and NCE/IRIS.)

  2. Overview • Introduction • Motivation & Problem Statement • Spatial Case – MDL & GMDL • Experiments  X • Categorical Case • More Experiments  X • Related work • Summary and Related/Future Work

  3. Introduction • How best to convey large answer sets for queries? • Simple enumeration: accurate but not necessarily most useful • Summaries: not (necessarily) 100% accurate but can be more intuitive • Why is this problem interesting? • OLAP queries over multi-dimensional data typically produce data intensive answers

  4. Introduction (contd.) • Example: (i)customer segmentation based on buying pattern 10 frequency  t 9 • too many answers, • in general • solution: summarize • description via range • constraints • axis-parallel hyper-rectangles • most concise = MDL 8 7 salary K 6 5 4 3 age 20 25 30 35 40 45 50 55 60 65 70

  5. Introduction (contd.) clothes • Example: (ii) aggregate sales performance analysis men’s  2 * last year’s sales women’s • description via hierarchical • ranges = tuples of nodes • most concise = MDL dress pnts wmn’s jns men’s jns frml wear blouses jkts shorts skirts tops ties vancouver edmonton NW san jose san francisco minneapolis location MW chicago boston summit NE albany new york

  6. Motivation • Examples: (i)customer segmentation based on buying pattern 10 frequency  t 9 X frequency < t/2 “white” otherwise 8 white budget = 2 7 white budget  10 salary K 6 X X 5 4 X 3 age 20 25 30 35 40 45 50 55 60 65 70

  7. Motivation (contd.) clothes • Example: (ii) aggregate sales performance analysis men’s  2 * last year’s sales women’s • description via hierarchical • ranges = tuples of nodes • most concise = MDL dress pnts wmn’s jns men’s jns frml wear blouses jkts shorts skirts tops ties vancouver edmonton NW san jose san francisco minneapolis location MW chicago boston summit NE albany new york

  8. Motivation (contd.) clothes • Example: (ii) aggregate sales performance analysis men’s  2 * last year’s sales women’s white budget = 2 X <½ * last year’s sales dress pnts wmn’s jns men’s jns frml wear white budget  7 blouses jkts shorts skirts tops ties vancouver edmonton NW san jose X X san francisco minneapolis location MW chicago boston summit X NE albany new york

  9. GMDL Problem Statement (spatial case) • k totally ordered dimensions Di  S (set of all cells) • B (blue) and R (red) – colored cells • W = S – (B R) (white cells) • Find axis-parallel hyper-rectangles {R1, …, Rm} (i.e., GMDL covering) s.t.: • (R1 …  Rm)  R =  (validity) • |(R1 …  Rm)  W|  w (white budget) • m is the least possible (optimality)

  10. (G)MDL Problem Statement (hierarchical case) • k (tree) hierarchical dimensions • cell = tuple of leaves • region = tuple of nodes • region R covers cell c iff c is a descendant of R, component-wise • covering rules similar to spatial case • MDL/GMDL problem formulations analogous

  11. Algorithms for spatial GMDL • challenges for spatial: even MDL 2D is NP-hard, so we must turn to heuristics • important properties: • blue-maximality • non-redundancy • Algorithms for spatial GMDL: • bottom-up pairwise (BP) merging • R-tree splitting (RTS) [based on Garcia+98] • color-aware splitting (CAS) • CAS corner

  12. Algorithms for spatial GMDL (CAS) • build indices IR, IB for red and blue cells • start with C = region R covering all blue cells; curr-consum = # white cells in R • while ( RC containing a red cell) { • grow the red cell to a larger blue-free region (using IB) • split R into at most 2k regions (excluding the grown red region) • replace R by new regions } • while (curr-consum > w) { • split as above, but based on white cells } • return C

  13. CAS – An Example • trade-off • non-overlapping regions •  loss in quality • overlapping regions  • greater bookkeeping • overhead X X X • Algorithms RTS, the two • CAS’  non-redundant • valid/feasible solutions • BP  may produce • redundant solution; can be • made non-redundant

  14. Categorical Case – MDL •  key diff. between spatial and categorical? • optimal covering  non-redundant • optimal need not be blue-maximal, but can be expanded into one • is blue-maximal non-redundant MDL covering unique? what about their size?

  15. A spatial example two blue-maximal non-redundant coverings of diff. size

  16. Categorical – fundamentals • projection of regions on dimensions: e.g., (MW, women’s) – projection on location = {chicago, minneapolis}. • Claim: R, S any categorical regions (tree hierarchies); Ri – projection of R on dimension i; i, Ri  Si or Si  Ri or Ri  Si =  • see violation in “tough” spatial example • major factor in deciding complexity

  17. Categorical – fundamentals (contd.) • Theorem: space of k categorical dimensions with tree hierarchies  unique blue-maximal non-redundant MDL covering. • Corollary: (i) the said covering can be obtained on a per hierarchy basis. (ii) furthermore, it can be done in polynomial time.

  18. Categorical case – MDL algorithm illustrated i 2 propagate after redundancy check 2 g h before redundancy check 2 a b c d e f a c d c 1 a d i 7 2 a b c d e f g h i 9 3 X a d 4 X a c d a c d 8 5 b c b c 6 X a a 1 2 3 4 6 2 5 1 2 4 5 1 2 3 4 2 2 initialize

  19. Categorical case – MDL • Lemma: Optimal MDL covering for a categorical space with tree hierarchies can be obtained by visiting each node once and each node of last hierarchy twice. • Key idea: for tree hierarchies, finding all blue-maximal regions and removing redundant ones yields the optimal covering.

  20. Categorical case – GMDL • Basic idea: for each internal node, determine the cost and gain of involving it in a GMDL covering; sort candidates in decreasing gain order and increasing cost. Pick greedily. • Example: candidate (1,h) (2,h) (3,h) (4,h) (5,h) occurrence 2 4 1 2 1 max-gain 1 3 0 1 0 cost 2 0 3 X 3

  21. Categorical Case – GMDL (contd.) • Compile similar info. for other parents of leaves; sort and pick best w cells for color change. [drop candidates with cost X or 0.] • Run MDL on the new data.

  22. Related Work • Substantial work on using MDL for summarization principle in data compression [Ristad & Thomas 95], decision trees [Quinaln & Rivest 89, Mehta+ 95], learning of patterns [Kilpelinen 95], etc. • [Agrawal+ 98] – subspace clustering. • Summarizing cube query answers and (G)MDL on categorical spaces – novel.

  23. Summary & Future Work • summarization using MDL/GMDL as a principle • MDL on spatial – NP-complete even on 2D; utility of GMDL – trade compactness for quality (i.e., include “impurity” in answers) • Heuristic algorithms • Efficient algo. for MDL for categorical with tree hierarchies • Heuristics for GMDL • Experimental validation

  24. Future Work • What is the best we can do to summarize data with both spatial and categorical dimensions? • How far can we push the poly time complexity? (e.g., almost-tree hierarchies? Can we impose restrictions on “allowable” intervals even on spatial dimensions?)

More Related