
Partitioning – A Uniform Model for Data Mining

Explore the duality between element-based and space-based representations in data mining, focusing on partitioning data for efficient storage and retrieval. Learn about concept hierarchies, compression strategies, and handling size in data domains.

Presentation Transcript


  1. Partitioning – A Uniform Model for Data Mining Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo

  2. Motivation • Databases and data warehouses are currently separate systems Why? • Standard answer: • Details, details, details … • Our answer: • Fundamental issue of representation

  3. Relations Revisited • R(A1, A2, …, AN) • Set of tuples • Any choices at a fundamental level? Yes! • Duality between • Element-based representation • Space-based representation

  4. Duality • Element-based representation: Standard representation of tuples with all their attributes • Space-based representation: The existence (count?) of a tuple is represented in its attribute space

  5. Similar Dualities in Physics • Particle: Particles can be represented by their position • Field (more fundamental level): Particles can be 1 values in a grid of locations

  6. Space-Based Representation • Consider standard tuples as vectors in the space of attribute domains • Represent all possible attribute combinations as one bit: • 1 if data item is present • 0 if it isn’t • Allowing counts could be useful for projections (?)

  7. Space-Based Representation as a Partition • Partitions are mutually exclusive and collectively exhaustive sets of elements • The Space-Based Representation partitions attribute space into two sets: • Data item present in database (1) • Data item not present (0)
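
The two representations on slides 6 and 7 can be sketched on a toy relation. This is only an illustration under assumed names: the relation R(A1, A2), its domain sizes, and the sample tuples are all hypothetical.

```python
from itertools import product

# Hypothetical toy relation R(A1, A2) with small integer domains.
DOMAIN_A1 = range(4)   # A1 takes values 0..3
DOMAIN_A2 = range(3)   # A2 takes values 0..2

# Element-based representation: the usual set of tuples.
tuples = {(0, 1), (2, 2), (3, 0)}

# Space-based representation: one bit per point of the attribute space.
space = {point: int(point in tuples) for point in product(DOMAIN_A1, DOMAIN_A2)}

# The two cells of the partition: points present (1) and points absent (0).
present = {p for p, bit in space.items() if bit == 1}
absent = {p for p, bit in space.items() if bit == 0}
```

The set `present` is exactly the original relation, which is the duality the slides describe: either form can be rebuilt from the other.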

  8. Usefulness of Space-Based Representation • No indexes needed: instant value-based access • Index locking becomes dimensional locking • Aggregation very easy due to value-based ordering • Selections become “and”s What experience do we have with space-based representations?
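
A rough sketch of "selections become ands" and of value-based access, assuming the attribute space is flattened into one Python integer used as a bit vector. The helper names `cell` and `bitmap`, the domain sizes, and the data are hypothetical.

```python
N1, N2 = 4, 3  # hypothetical domain sizes of attributes A1 and A2

def cell(a1, a2):
    """Bit position of the point (a1, a2) in the flattened attribute space."""
    return a1 * N2 + a2

def bitmap(points):
    """Pack a collection of points into one integer used as a bit vector."""
    bits = 0
    for a1, a2 in points:
        bits |= 1 << cell(a1, a2)
    return bits

data = bitmap([(0, 1), (2, 1), (2, 2), (3, 0)])

# Predicate masks depend only on the domains, not on the stored data.
mask_a1_eq_2 = bitmap((2, a2) for a2 in range(N2))
mask_a2_ge_1 = bitmap((a1, a2) for a1 in range(N1) for a2 in range(1, N2))

# SELECT ... WHERE A1 = 2 AND A2 >= 1 becomes a bitwise AND of the bitmaps.
selected = data & mask_a1_eq_2 & mask_a2_ge_1

# COUNT(*) over the selection is a population count of the result.
count = bin(selected).count("1")

# Value-based access: testing for a tuple is a single bit lookup, no index.
has_3_0 = (data >> cell(3, 0)) & 1
```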

  9. Data Cube Representation • One value (e.g., sales) given in the space of the key attributes • Space-based with respect to key attributes • Element-based with respect to non-key attributes

  10. Properties of the Domain Space • Ideally space should have distance, norm, etc. • Especially important for data mining • Does that make sense for all domains? • Can any domain be mapped to integer?

  11. Can all Domains be Mapped to Integer? • Simplistic answer: yes! • All information in a computer is saved as bits • Any sequence of bits can be interpreted as an integer • Problems • Order may be irrelevant, e.g., hair-color • Order may be wrong, e.g., sign bit for int • Even if order is correct, spacing may vary, e.g., float (solution in paper: intervalization) • Domains may be very large, e.g., movies

  12. Categorical Attributes (Irrelevant Order) • We need more than one attribute for an appropriate representation • Data mining solution: • 1 attribute per domain value • Our solution: • 1 attribute per bit slice • Values are corners of a hypercube in log(domain size) dimensions • Distances are given through the MAX metric
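
A minimal sketch of the bit-slice encoding for a categorical attribute, assuming a hypothetical four-value hair-color domain and a base-2 logarithm. Each value becomes a corner of a hypercube, and under the MAX (Chebyshev) metric every pair of distinct corners is at distance 1, so no artificial order is imposed.

```python
from math import ceil, log2

hair_colors = ["black", "brown", "blond", "red"]  # hypothetical domain
n_bits = max(1, ceil(log2(len(hair_colors))))     # number of bit slices

def corner(value):
    """Hypercube corner for a value: one coordinate per bit slice."""
    code = hair_colors.index(value)
    return tuple((code >> (n_bits - 1 - i)) & 1 for i in range(n_bits))

def max_distance(u, v):
    """MAX metric: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))

corners = {color: corner(color) for color in hair_colors}
```

By contrast, the "1 attribute per domain value" encoding would need four attributes here instead of two.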

  13. Fundamental Partition (Space-Based Representation) • # of dimensions = Number of attributes • # of represented points = product of all domain sizes • Exponential in number of dimensions! • We badly need compression!
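
The growth can be checked with a quick calculation. The domain sizes below are hypothetical; the point is only that the number of represented points is the product of the domain sizes.

```python
from math import prod

# Hypothetical relation with four attributes of 256 values each.
domain_sizes = [256, 256, 256, 256]

n_dimensions = len(domain_sizes)
n_points = prod(domain_sizes)          # one bit per point of attribute space

# Adding a single further attribute multiplies the space by its domain size.
n_points_with_extra = prod(domain_sizes + [256])
```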

  14. How Do We Handle Size? • Problem is exponential in # of attributes • How can we reduce # of attributes? Review normalization: • We can decompose a relation into a set of relations, each of which contains the entire key and one other attribute • This decomposition is • lossless • dependency preserving (BCNF relations only)
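
The decomposition into "key plus one other attribute" can be sketched as follows, assuming a hypothetical relation R(key, a, b) with a single-attribute key. Joining the pieces on the key gives back the original rows, which is what lossless means here.

```python
# Hypothetical relation R(key, a, b), with `key` as the primary key.
rows = [(1, "x", 10), (2, "y", 20), (3, "x", 30)]

# Decompose into one binary relation per non-key attribute.
r_a = {key: a for key, a, _ in rows}
r_b = {key: b for key, _, b in rows}

# Natural join on the key reconstructs the original relation.
rejoined = sorted((key, r_a[key], r_b[key]) for key in r_a.keys() & r_b.keys())
```

Each piece now spans only the key dimensions plus one further dimension, instead of all attributes at once.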

  15. Compression for Non-Key Attributes • The fundamental partition contains only one non-zero data point in any non-key dimension • Represent the number by bit slices • Note: this works for numerical and categorical attributes • Original values can be regained by ANDing • Example: 5 (binary 101) is bit 0 & bit 1’ & bit 2, where ’ denotes the complement
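
A small sketch of bit slicing for one non-key attribute, assuming 3-bit values, keys numbered 0..5, and bit 0 as the most significant bit (matching the slide's example for 5). The sample values and function names are hypothetical.

```python
N_BITS = 3
values = [5, 3, 5, 0, 7, 4]      # hypothetical attribute value for keys 0..5
n_keys = len(values)
ALL_KEYS = (1 << n_keys) - 1     # bit vector with every key set

def bit_slice(i):
    """Bit vector over keys holding bit i of each value (bit 0 = high bit)."""
    shift = N_BITS - 1 - i
    vector = 0
    for key, value in enumerate(values):
        if (value >> shift) & 1:
            vector |= 1 << key
    return vector

slices = [bit_slice(i) for i in range(N_BITS)]

def keys_with_value(v):
    """AND each slice, or its complement, according to the bits of v."""
    result = ALL_KEYS
    for i, vector in enumerate(slices):
        wanted = (v >> (N_BITS - 1 - i)) & 1
        result &= vector if wanted else (~vector & ALL_KEYS)
    return [key for key in range(n_keys) if (result >> key) & 1]
```

For the value 5 this evaluates bit 0 & bit 1’ & bit 2, as on the slide, and returns the keys 0 and 2.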

  16. Concept Hierarchies • Bit-sliced representations have significant benefits beyond compression: • Bit slices can be combined into concept hierarchies: • Highest level: bit 0 • Next level: bit 0 & bit 1 • Next level: bit 0 & bit 1 & bit 2
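
One way to read this hierarchy for a numeric attribute: keeping only the leading bit slices groups values into intervals that narrow at each level. The sketch assumes 3-bit values with bit 0 as the most significant bit; the function name `concept` is hypothetical.

```python
N_BITS = 3

def concept(value, level):
    """Interval of values that share the first `level` bit slices with value."""
    dropped = N_BITS - level                 # low-order bits that are ignored
    low = (value >> dropped) << dropped      # clear the ignored bits
    return (low, low + (1 << dropped) - 1)

hierarchy_for_5 = [concept(5, level) for level in range(N_BITS + 1)]
```

For the value 5, level 1 (bit 0 only) gives the interval 4 to 7, level 2 gives 4 to 5, and level 3 identifies the value itself.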

  17. Compression for Key Attributes • Database state-independent compression could lead to information loss (counts > 1) • Database state-dependent compression: • Tree structure that eliminates pure subtrees => P-trees
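
The idea of eliminating pure subtrees can be illustrated with a deliberately simplified binary tree over a one-dimensional bit vector. This is an assumption-laden sketch, not the authors' P-tree format: it assumes the vector length is a power of two and stores a pure run as a single 0 or 1.

```python
def build(bits):
    """Return 0 or 1 for a pure run, else a (left, right) pair of subtrees."""
    if all(b == bits[0] for b in bits):
        return bits[0]                       # pure subtree: stop here
    mid = len(bits) // 2
    return (build(bits[:mid]), build(bits[mid:]))

def count_ones(node, size):
    """Count the 1 bits under a node that covers `size` positions."""
    if isinstance(node, tuple):
        left, right = node
        return count_ones(left, size // 2) + count_ones(right, size // 2)
    return size if node == 1 else 0

bits = [1, 1, 1, 1, 0, 0, 1, 0]              # hypothetical key-space bitmap
tree = build(bits)
total = count_ones(tree, len(bits))
```

The compression depends on the database state: the first half of `bits` collapses to one leaf only because all of its points happen to be present.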

  18. Other Ideas Compression is better if attribute values are dense within their domain • We could use extent domain • Compression good • Problems with insertion • Reorganization of storage • Index locking has to be reintroduced • …

  19. How Good is Compression so far? • If all domains are “dense”, i.e. all values occur • Size can easily be smaller than original relation • If non-key attributes are “sparse” • Not usually a problem: good compression • Problems only in extreme cases • E.g., movies as attribute values! • If key-attributes are “sparse” • Larger potential for problems, but also large potential for benefit (see data cubes)

  20. Are Key-Attributes Usually Sparse? • Many key attributes are dense (“structure” attributes as keys) • Automatically generated IDs are usually sequential • x and y in spatial data mining • Time in data streams • Keys in tables that represent relationships tend to be sparse (feature attributes as keys) • Student / course offering / grade • Data cubes!

  21. What Have We Gained? (Database Aspects) • Data simultaneously acts as index • No separate index locking • (unless extent domain is used) • All information saved as bit patterns • Easy “select” • Other database operations discussed in class

  22. What Have We Gained? (Feature Attribute Keys) • Direct mining possible on relations with feature attribute keys • E.g., student / course offering / grade • Rollup can be defined, etc. • Clustering, classification, ARM can make use of proximity inherent in representation • Bit-wise representation provides concept hierarchy for non-key attribute • Tree structure provides concept hierarchy for key attributes

  23. What Have We Gained? (Structure Attribute Keys) • For relations with structure attribute keys mining requires “and”ing • produces counts for feature attributes • Bit-wise representation provides concept hierarchy for non-key attribute • Duality: Concept hierarchies in this representation map exactly to tree structure when the attribute is a key

  24. Mapping Concept Hierarchies: Bit Slices <-> Tree • P-tree: take key attributes, e.g. x and y, and bit-interleave them: • x = 1 0 0 1 • y = 1 1 0 1 • interleaved: 1 1 0 1 0 0 1 1 • Each consecutive pair of these digits forms a level in the P-tree – or a level in a concept hierarchy
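
The interleaving on this slide can be reproduced in a few lines. The function name `interleave` is hypothetical; the inputs are the slide's own x and y, written most significant bit first.

```python
def interleave(x_bits, y_bits):
    """Alternate the bits of x and y, most significant bit first."""
    out = []
    for xb, yb in zip(x_bits, y_bits):
        out.extend([xb, yb])
    return out

x = [1, 0, 0, 1]
y = [1, 1, 0, 1]
z = interleave(x, y)

# Each consecutive (x bit, y bit) pair is one level of the hierarchy.
levels = [tuple(z[i:i + 2]) for i in range(0, len(z), 2)]
```

Reading `levels` from the front corresponds to descending from the coarsest level to the finest.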

  25. How Could We Use That Duality? • Join with other relations and project off key attributes (Meta P-trees) • Can we do that? • We lose uniqueness • We can use 1 to represent 1 or more tuples (equivalent to relational algebra) • Or we can introduce counts • Can be useful for data mining • Need for non-duplicate eliminating counts exists also in other applications
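
The two options after projecting off key attributes can be contrasted on a toy student / course / grade relation. The rows are hypothetical; the set plays the role of "1 represents 1 or more tuples", and the counter keeps duplicates.

```python
from collections import Counter

# Hypothetical relation (student, course, grade); (student, course) is the key.
rows = [("s1", "c1", "A"), ("s1", "c2", "B"), ("s2", "c1", "A")]

# Project onto grade with duplicate elimination, as in relational algebra.
grades_present = {grade for _, _, grade in rows}

# Project onto grade keeping counts, which is often what mining needs.
grade_counts = Counter(grade for _, _, grade in rows)
```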

  26. How Do Hierarchies Benefit us in Databases? • Multi-granularity Locking • Subtrees form suitable units for storage in a block • Fast access! Proportional to • # of levels in tree • # of bits for bit slices

  27. Summary • Space-based representation has many benefits • Value-based access and storage • No separate index needed • Rollups easy • P-Trees • Follow from systematic compression • Benefits from concept hierarchies
