270 likes | 295 Views
Explore the duality between element-based and space-based representations in data mining, focusing on partitioning data for efficient storage and retrieval. Learn about concept hierarchies, compression strategies, and handling size in data domains.
E N D
Partitioning – A Uniform Model for Data Mining Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo
Motivation • Databases and data warehouses are currently separate systems Why? • Standard answer: • Details, details, details … • Our answer: • Fundamental issue of representation
Relations Revisited • R(A1, A2, …, AN) • Set of tuples • Any choices at a fundamental level? Yes! • Duality between • Element-based representation • Space-based representation
Element-based representation: Standard representation of tuples with all their attributes Space-based representation: The existence (count?) of a tuple is represented in its attribute space Duality
Particles can be represented by their position More fundamental level: Particle Particles can be 1 values in a grid of locations Field Similar Dualities in Physics
Space-Based Representation • Consider standard tuples as vectors in the space of attribute domains • Represent all possible attribute combinations as one bit: • 1 if data item is present • 0 if it isn’t • Allowing counts could be useful for projections (?)
Space-Based Representation as a Partition • Partitions are mutually exclusive and collectively exhaustive sets of elements • The Space-Based Representation partitions attribute space into two sets: • Data item present in database (1) • Data item not present (0)
Usefulness of Space-Based Representation • No indexes needed: instant value-based access • Index locking becomes dimensional locking • Aggregation very easy due to value-based ordering • Selections become “and”s What experience do we have with space-based representations?
Data Cube Representation • One value (e.g., sales) given in the space of the key attributes • Space-based with respect to key attributes • Element-based with respect to non-key attributes
Properties of the Domain Space • Ideally space should have distance, norm, etc. • Especially important for data mining • Does that make sense for all domains? • Can any domain be mapped to integer?
Can all Domains be Mapped to Integer? • Simplistic answer: yes! • All information in a computer is saved as bits • Any sequence of bits can be interpreted as an integer • Problems • Order may be irrelevant, e.g., hair-color • Order may be wrong, e.g., sign bit for int • Even if order is correct, spacing may vary, e.g., float (solution in paper: intervalization) • Domains may be very large, e.g., movies
Categorical attributes (irrelevant order) We need more than one attribute for an appropriate representation • Data mining solution: • 1 attribute per domain value • Our solution: • 1 attribute per bit slice • Values are corners of a Hypercube in log(Domain Size) dimensions • Distances are given trough MAX metric
Fundamental Partition(Space-Based Representation) • # of dimensions = Number of attributes • # of represented points = product of all domain sizes • Exponential in number of dimensions! • We badly need compression!
How Do We Handle Size? • Problem exponential in #of attributes • How can we reduce #of attributes? Review normalization: • We can decompose a relation into a set of relations each of which contains the entire key and one other attribute • This decomposition is • loss less • dependency preserving (BCNF relations only)
Compression for Non-Key Attributes Fundamental partition contains one non-zero data-point in any non-key dimension only • Represent number by bit-slices Note: • This works for numerical and categorical attributes Original values can be regained by anding • Example 5 (binary 101) is bit 0 & bit 1’ & bit 2
Concept Hierarchies Bit sliced representation have significant benefits beyond compression: • Bit slices can be combined into concept hierarchies: • Highest level: bit 0 • Next level: bit 0 & bit 1 • Next level: bit 0 & bit 1 & bit 2
Compression for Key Attributes • Database state-independent compression could lead to information loss (counts > 1) • Database state-dependent compression: • Tree structure that eliminates pure subtrees => P-trees
Other Ideas Compression is better if attribute values are dense within their domain • We could use extent domain • Compression good • Problems with insertion • Reorganization of storage • Index locking has to be reintroduced • …
How Good is Compression so far? • If all domains are “dense”, i.e. all values occur • Size can easily be smaller than original relation • If non-key attributes are “sparse” • Not usually a problem: good compression • Problems only in extreme cases • E.g., movies as attribute values! • If key-attributes are “sparse” • Larger potential for problems, but also large potential for benefit (see data cubes)
Are Key-Attributes Usually Sparse? • Many key attributes are dense (“structure” attributes as keys) • Automatically generated IDs are usually sequential • x and y in spatial data mining • Time in data streams • Keys in tables that represent relationships tend to be sparse (feature attributes as keys) • Student / course offering / grade • Data cubes!
What Have We Gained?(Database Aspects) • Data simultaneously acts as index • No separate index locking • (unless extent domain is used) • All information saved as bit patterns • Easy “select” • Other database operations discussed in class
What Have We Gained?(Feature Attribute Keys) • Direct mining possible on relations with feature attributes keys • E.g., student / course offering / grade • Rollup can be defined, etc. • Clustering, classification, ARM can make use of proximity inherent in representation • Bit-wise representation provides concept hierarchy for non-key attribute • Tree structure provides concept hierarchy for key attributes
What Have We Gained?(Structure Attribute Keys) • For relations with structure attribute keys mining requires “and”ing • produces counts for feature attributes • Bit-wise representation provides concept hierarchy for non-key attribute Duality: • Concept hierarchies in this representation map exactly to tree structure when the attribute is a key
Mapping Concept HierarchiesBit Slices <-> Tree P-tree: • Take key attributes, e.g. x and y, and bit interleave them: • x = 1 0 0 1 • y = 1 1 0 1 • 1 1 0 1 0 0 1 1 • Any two of these digits form a level in the P-tree – or a level in a concept hierarchy
How Could We Use That Duality? • Join with other relations and project off key attributes (Meta P-trees) • Can we do that? • We lose uniqueness • We can use 1 to represent 1 or more tuples (equivalent to relational algebra) • Or we can introduce counts • Can be useful for data mining • Need for non-duplicate eliminating counts exists also in other applications
How Do Hierarchies Benefit us in Databases? • Multi-granularity Locking • Subtrees form suitable units for storage in a block • Fast access! Proportional to • # of levels in tree • # of bits for bit slices
Summary • Space-based representation has many benefits • Value-based access and storage • No separate index needed • Rollups easy • P-Trees • Follow from systematic compression • Benefits from concept hierarchies