SLIQ and SPRINT for disk resident data


SLIQ • SLIQ is a decision tree classifier that can handle both numerical and categorical attributes • Builds compact and accurate trees • Uses a pre-sorting technique in the tree growing phase • Suitable for classification of large disk-resident datasets.

Issues • There are two major performance-critical issues in the tree-growth phase: • How to find split points • How to partition the data • Well-known decision tree classifiers: • Grow trees depth-first • Repeatedly sort the data at every node • SLIQ: • Replaces this repeated sorting with a one-time sort • Uses a new data structure called the class-list • The class-list must remain memory-resident at all times

Some Data

SLIQ - Attribute Lists These are projections on (rid, attribute).

SLIQ - Sort Numeric, Group Categorical
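As a rough illustration of the data structures introduced so far, the sketch below builds per-attribute lists of (rid, value) pairs and an in-memory class list. The toy records, the column layout, and the function names `build_attribute_lists` and `build_class_list` are assumptions made for this example; they are not the data or code from the slides.

```python
# Hypothetical toy records: (age, salary, marital status, class label).
records = [
    (30, 65, "Single", "G"),
    (23, 15, "Married", "B"),
    (40, 75, "Single", "G"),
    (55, 40, "Married", "B"),
]


def build_attribute_lists(rows, numeric, categorical):
    """Project each attribute to a list of (rid, value) pairs.

    `numeric` and `categorical` map attribute names to column indexes.
    Numeric lists are sorted once by value; categorical lists are sorted
    by value too, which simply groups equal categories together.
    """
    lists = {}
    for name, col in {**numeric, **categorical}.items():
        pairs = [(rid, row[col]) for rid, row in enumerate(rows)]
        lists[name] = sorted(pairs, key=lambda pair: pair[1])
    return lists


def build_class_list(rows, class_col, root="N1"):
    """One entry per record, indexed by rid: [class label, current leaf]."""
    return [[row[class_col], root] for row in rows]


attribute_lists = build_attribute_lists(
    records, numeric={"age": 0, "salary": 1}, categorical={"marital": 2}
)
class_list = build_class_list(records, class_col=3)
```

Only `class_list` has to stay in memory in this scheme; each attribute list could be written to disk and streamed back one at a time.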

SLIQ - Class List N1

SLIQ - Histograms N1 age25 ? age30 ? Evaluate each split, using GINI or Entropy. ...

SLIQ - Histograms N1 age25 age30 Evaluate each split, using GINI or Entropy. ...

SLIQ - Histograms N1 salary20 salary30 Evaluate each split, using GINI or Entropy. ...

SLIQ - Histograms N1 Married Single Evaluate each split, using GINI or Entropy.
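To make "evaluate each split using GINI" concrete, here is a minimal sketch that scores a candidate binary split from two class histograms, one for the records that satisfy the test and one for the rest. The function names `gini` and `gini_split` and the example counts are illustrative assumptions.

```python
def gini(hist):
    """Gini index of one class histogram: 1 - sum of squared class fractions."""
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in hist.values())


def gini_split(below, above):
    """Size-weighted Gini index of a binary split; lower is better."""
    n_below, n_above = sum(below.values()), sum(above.values())
    n = n_below + n_above
    if n == 0:
        return 0.0
    return (n_below / n) * gini(below) + (n_above / n) * gini(above)


# A split that separates the classes perfectly scores 0.0,
# while a split that leaves both sides evenly mixed scores 0.5.
pure = gini_split({"B": 2}, {"G": 2})
mixed = gini_split({"G": 1, "B": 1}, {"G": 1, "B": 1})
```

The split with the lowest weighted index over all attributes and candidate values is the one chosen for the node.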

SLIQ - Perform best split and Update Class List N1 salary60 N2 N3

N1 salary60 N2 N3 SLIQ - Perform best split and Update Class List

N1 salary60 N2 N3 SLIQ - Histograms N1 N2 N1 age25 ? N2 Evaluate each split, using GINI or Entropy. ...

N1 salary60 N2 N3 SLIQ - Histograms N1 N2 N1 age25 N2 Evaluate each split, using GINI or Entropy. ...

SLIQ - Pseudocode • Split evaluation:

EvaluateSplits()
  for each numeric attribute A do
    for each value v in the attribute list do
      find the corresponding entry in the class list, and hence the corresponding class and the leaf node Ni
      update the class histogram in leaf Ni
      compute splitting score for test (A ≤ v) for Ni
  for each categorical attribute A do
    for each leaf of the tree do
      find subset of A with best split
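The numeric branch of the split-evaluation pseudocode above could look roughly like the following sketch. It scans one sorted attribute list once, looks up each rid in the class list, and keeps "below" and "above" class histograms per leaf. The function name, the Gini scoring, and the handling of tied values are assumptions of this example, not details given on the slides.

```python
from collections import Counter, defaultdict


def gini(hist):
    n = sum(hist.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in hist.values())


def evaluate_numeric_splits(attr_list, class_list):
    """Return {leaf: (best_score, best_v)} for tests of the form A <= v.

    attr_list:  (rid, value) pairs sorted by value.
    class_list: indexed by rid, each entry is [class label, leaf name].
    """
    above = defaultdict(Counter)  # records not yet passed by the scan
    for label, leaf in class_list:
        above[leaf][label] += 1
    below = defaultdict(Counter)  # records already passed by the scan
    best, pending = {}, set()
    for i, (rid, v) in enumerate(attr_list):
        label, leaf = class_list[rid]
        below[leaf][label] += 1
        above[leaf][label] -= 1
        pending.add(leaf)
        # Score only after the last record with value v, so ties stay together.
        if i + 1 < len(attr_list) and attr_list[i + 1][1] == v:
            continue
        for lf in pending:
            n_b, n_a = sum(below[lf].values()), sum(above[lf].values())
            if n_a == 0:
                continue  # every record of the leaf would go left: no real split
            score = (n_b * gini(below[lf]) + n_a * gini(above[lf])) / (n_b + n_a)
            if lf not in best or score < best[lf][0]:
                best[lf] = (score, v)
        pending.clear()
    return best


# Hypothetical example: four records, all still in the root leaf N1.
salary_list = [(1, 15), (3, 40), (0, 65), (2, 75)]
class_list = [["G", "N1"], ["B", "N1"], ["G", "N1"], ["B", "N1"]]
best = evaluate_numeric_splits(salary_list, class_list)
```

Because the histograms are kept per leaf, a single pass over the attribute list evaluates candidate splits for all current leaves at once, which is the point of growing the tree breadth-first.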

SLIQ - Pseudocode • Updating the class list:

UpdateLabels()
  for each split leaf Ni do
    let A be the split attribute for Ni
    for each (rid, v) in the attribute list for A do
      find the corresponding entry e in the class list (using the rid)
      if the leaf referenced by e is Ni then
        find the new leaf Nj to which (rid, v) belongs (by applying the splitting test)
        update the leaf pointer for e to Nj
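A minimal sketch of the class-list update, assuming numeric tests of the form A <= threshold only. The `splits` dictionary layout, the leaf names, and the threshold value are hypothetical choices for this example.

```python
def update_labels(class_list, attr_lists, splits):
    """Move each record of a split leaf to the child chosen by the split test.

    class_list: indexed by rid, each entry is [class label, leaf name].
    attr_lists: {attribute: [(rid, value), ...]}.
    splits:     {leaf: (attribute, threshold, left_child, right_child)};
                child names are assumed to be new, unused leaf names.
    """
    for leaf, (attr, threshold, left, right) in splits.items():
        for rid, value in attr_lists[attr]:
            entry = class_list[rid]
            if entry[1] == leaf:  # record currently belongs to the split leaf
                entry[1] = left if value <= threshold else right


# Hypothetical example: split root N1 on salary <= 60 into N2 and N3.
class_list = [["G", "N1"], ["B", "N1"], ["G", "N1"], ["B", "N1"]]
attr_lists = {"salary": [(1, 15), (3, 40), (0, 65), (2, 75)]}
update_labels(class_list, attr_lists, {"N1": ("salary", 60, "N2", "N3")})
```

Note that the attribute lists themselves are never rewritten here; only the leaf pointers in the memory-resident class list change.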

SLIQ - bottleneck • The class-list must remain memory-resident at all times! • Although this is rarely a problem with today's memory sizes, there may still be cases where it becomes a bottleneck. • So, what can we do when the class-list doesn't fit in main memory? • SPRINT is a solution...

SPRINT The main data structures used in SPRINT are: Attribute lists and Class histograms

SPRINT - Histograms age25 age30 Evaluate each split, using GINI or Entropy. ...

SPRINT - Histograms salary20 salary30 Evaluate each split, using GINI or Entropy. ...

SPRINT - Histograms Married Single Evaluate each split, using GINI or Entropy.

SPRINT - Performing Best Split • Once the best split point has been found for a node, we execute the split by creating child nodes. • Requires splitting the node’s lists for every attribute into two. • Partitioning the attribute list of the winning attribute (salary) is easy. • We scan the list, apply the split test, and move the records to two new attribute lists - one for each new child.

SPRINT - Performing Best Split • Unfortunately, for the remaining attribute lists of the node (age and marital), we have no test that we can apply to the attribute values to decide how to divide the records. • Solution: use the rids. • As we partition the list of the splitting attribute (i.e. salary), we insert the rids of each record into a probe structure (hash table), noting to which child the record was moved. • Once we have collected all the rids, we scan the lists of the remaining attributes and probe the hash table with the rid of each record. • The retrieved information tells us with which child to place the record.
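The rid-based split described above can be sketched as follows. The example assumes each attribute-list entry is a (value, class, rid) tuple and that the winning test is numeric, of the form A <= threshold; the function name `split_node` and the toy data are illustrative assumptions.

```python
def split_node(attr_lists, split_attr, threshold):
    """Split every attribute list of a node into left and right child lists.

    attr_lists: {attribute: [(value, class label, rid), ...]}.
    Returns (left_lists, right_lists) with the same layout.
    """
    left = {name: [] for name in attr_lists}
    right = {name: [] for name in attr_lists}
    probe = {}  # hash table: rid -> True if the record went to the left child

    # 1. Partition the winning attribute's list by applying the split test.
    for entry in attr_lists[split_attr]:
        value, _label, rid = entry
        goes_left = value <= threshold
        probe[rid] = goes_left
        (left if goes_left else right)[split_attr].append(entry)

    # 2. Partition the remaining lists by probing the hash table with each rid.
    for name, entries in attr_lists.items():
        if name == split_attr:
            continue
        for entry in entries:
            (left if probe[entry[2]] else right)[name].append(entry)
    return left, right


# Hypothetical node with two numeric attribute lists, each sorted by value.
node = {
    "salary": [(15, "B", 1), (40, "B", 3), (65, "G", 0), (75, "G", 2)],
    "age": [(23, "B", 1), (30, "G", 0), (40, "G", 2), (55, "B", 3)],
}
left, right = split_node(node, "salary", 60)
```

Each list is scanned in order and appended in order, so the children's lists stay sorted and no re-sorting is needed after the split.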

SPRINT - Performing Best Split • If the hash-table is too large for the memory, splitting is done in more than one step. • The attribute list for the splitting attribute is partitioned up to the attribute record for which the hash table will fit in memory; • Portions of attribute lists of non-splitting attributes are partitioned; and the process is repeated for the remainder of the attribute list of the splitting attribute.
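One way to picture the multi-step variant is the sketch below, where `max_rids` stands in for the number of rids the hash table can hold. Two simplifications are assumptions of this example rather than details from the slides: each step rescans the full non-splitting lists and keeps only the rids present in the current table, and the per-step outputs are merged at the end so that the children's lists remain sorted by value.

```python
import heapq


def split_in_steps(attr_lists, split_attr, threshold, max_rids):
    """Split a node's lists using a hash table limited to max_rids entries.

    attr_lists: {attribute: [(value, class label, rid), ...]}, sorted by value.
    """
    others = [name for name in attr_lists if name != split_attr]
    left, right = {split_attr: []}, {split_attr: []}
    left_runs = {name: [] for name in others}
    right_runs = {name: [] for name in others}
    split_list = attr_lists[split_attr]

    for start in range(0, len(split_list), max_rids):
        probe = {}  # holds at most max_rids entries in each step
        for entry in split_list[start:start + max_rids]:
            goes_left = entry[0] <= threshold
            probe[entry[2]] = goes_left
            (left if goes_left else right)[split_attr].append(entry)
        for name in others:
            l_run, r_run = [], []
            for entry in attr_lists[name]:
                side = probe.get(entry[2])
                if side is None:
                    continue  # this rid belongs to another step
                (l_run if side else r_run).append(entry)
            left_runs[name].append(l_run)
            right_runs[name].append(r_run)

    # Each run is sorted by value; merging the runs restores one sorted list.
    for name in others:
        left[name] = list(heapq.merge(*left_runs[name], key=lambda e: e[0]))
        right[name] = list(heapq.merge(*right_runs[name], key=lambda e: e[0]))
    return left, right


node = {
    "salary": [(15, "B", 1), (40, "B", 3), (65, "G", 0), (75, "G", 2)],
    "age": [(23, "B", 1), (30, "G", 0), (40, "G", 2), (55, "B", 3)],
}
left, right = split_in_steps(node, "salary", 70, max_rids=2)
```

With a smaller `max_rids` the result is the same, at the cost of more passes over the non-splitting lists.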
