Multimedia Data Mining using P-trees*
William Perrizo, William Jockheck, Amal Perera, Dongmei Ren, Weihua Wu, Yi Zhang
Computer Science Department, North Dakota State University, USA
MDM/KDD2002 Workshop, Edmonton, Alberta, Canada, July 23, 2002
* P-tree technology is patent pending by NDSU.
Outline
• Multimedia Data Mining
• Peano Count Trees (P-trees)
• Properties of P-trees
• Data Mining Techniques Using P-trees
• Implementation Issues and Performance
• Conclusion
Multimedia Data Mining
• Multimedia data mining extracts high-level information from large multimedia data sets.
• It is typically done in two steps:
  • Capture specific features of the data as feature vectors, or tuples, in a table or feature space.
  • Mine those tuples for information or knowledge, for example by association rule mining (ARM), clustering, or classification on the feature vectors.
Multimedia Data
• Remotely Sensed Imagery (RSI)
  • Usually 2-D (or 3-D) and relatively smooth
  • Large datasets (e.g., Landsat ETM+, ~100,000,000 pixels)
• Video and audio data mining
  • Usually results in high-dimensional feature spaces
  • Multimedia datasets are usually very large
• Text mining (the feature space is high-dimensional but sparse)
• P-trees are well suited to representing such feature spaces
  • Lossless compressed representation
  • Good at manipulating high-dimensional data sets
Precision Agriculture Dataset: TIFF image and other measurements (1320×1320)
[Figure: image panels labeled Yield, RGB, Nitrate, and Moisture]
The Peano Count Tree (P-tree)
A P-tree represents feature vector data bit by bit, in a recursive, quadrant-by-quadrant, losslessly compressed manner.
• First, given a feature vector space, vertically fragment it by column: the Storage Decomposition Model (SDM; e.g., Bubba, circa 1985).
  • In SDM, each column is a separate file that retains the original row order.
  • This is sometimes called the “Vertical Database Model”.
• Second, for P-trees, vertically fragment further by bit position.
  • Bit-SDM (bSDM): each bit position of each column is a separate file that retains the original row order.
  • Each resulting file is called a bit-SeQuential (bSQ) or bSDM file.
  • The high-order bSQ file IS data. The others are DELTAs (à la MPEG).
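The bit-position fragmentation described above can be sketched in a few lines. This is only an illustration: the function names `to_bsq` and `from_bsq` and the sample band values are hypothetical, and plain Python lists stand in for the bSQ files.

```python
def to_bsq(column, bits=8):
    """Split one column of unsigned ints into `bits` bit-sequential lists.

    Element 0 of the result holds the high-order bit of every row;
    each list keeps the original row order.
    """
    return [[(value >> (bits - 1 - j)) & 1 for value in column]
            for j in range(bits)]


def from_bsq(bsq):
    """Rebuild the column from its bit lists (the split loses nothing)."""
    bits = len(bsq)
    return [sum(bit << (bits - 1 - j) for j, bit in enumerate(row_bits))
            for row_bits in zip(*bsq)]


band = [200, 13, 255, 0]      # hypothetical 8-bit band values
planes = to_bsq(band)
print(planes[0])              # high-order bit of each row: [1, 0, 1, 0]
print(from_bsq(planes) == band)
```

Each of the eight lists would then be compressed into its own P-tree.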
An example of a P-tree
1-D bSQ file (row by row):
  11111100 11111000 11111100 11111110 11110000 11110000 11110000 01110000
2-D bSQ file (same file in 2-D raster order):
  1 1 1 1 1 1 0 0
  1 1 1 1 1 0 0 0
  1 1 1 1 1 1 0 0
  1 1 1 1 1 1 1 0
  1 1 1 1 0 0 0 0
  1 1 1 1 0 0 0 0
  1 1 1 1 0 0 0 0
  0 1 1 1 0 0 0 0
P-tree (1-bit counts per quadrant):
  Root count: 39
  Level 1: 16 8 15 0
  Level 2: 3 0 4 1 | 4 4 3 4
  Leaves: 1110 0010 1101
• Quadrant-based, with pure (pure-1/pure-0) quadrants
• Peano or Z-ordering
• Root count
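The counts on this slide can be reproduced with a short recursive sketch. The tuple representation `(count, children)` and the name `count_tree` are assumptions made for illustration, not the authors' storage format; the raster is the 8×8 example whose root count is 39.

```python
ROWS = ["11111100", "11111000", "11111100", "11111110",
        "11110000", "11110000", "11110000", "01110000"]
RASTER = [[int(ch) for ch in row] for row in ROWS]


def count_tree(bits, r0=0, c0=0, size=None):
    """Return (count, children) for the square quadrant at (r0, c0).

    children is None for a pure quadrant (all 0s or all 1s); otherwise
    it holds four subtrees in Peano (Z) order: upper-left, upper-right,
    lower-left, lower-right.
    """
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size))
    if count in (0, size * size):      # pure quadrant: stop recursing
        return (count, None)
    half = size // 2
    kids = [count_tree(bits, r0 + dr, c0 + dc, half)
            for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))]
    return (count, kids)


root = count_tree(RASTER)
print(root[0])                          # 39
print([kid[0] for kid in root[1]])      # [16, 8, 15, 0]
```

Pure quadrants are never expanded, which is where the compression comes from.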
P-tree terminology (example with root count 55)
  1 1 1 1 1 1 0 0
  1 1 1 1 1 0 0 0
  1 1 1 1 1 1 0 0
  1 1 1 1 1 1 1 0
  1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1
  0 1 1 1 1 1 1 1
P-tree:
  Root count: 55
  Level 1: 16 8 15 16
  Level 2: 3 0 4 1 | 4 4 3 4
  Leaves: 1110 0010 1101
• Peano or Z-ordering (quadrants numbered 0, 1, 2, 3)
• Pure-1/Pure-0 quadrant
• Root count
• Level
• Fan-out
• QID (Quadrant ID): the pixel at (7, 1) = (111, 001) has QID 10.10.11 = 2.2.3
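The QID in this example follows from interleaving the coordinate bits, high-order first. The sketch below assumes the first coordinate is the row and the second the column, and that quadrants are numbered 0 (upper-left), 1 (upper-right), 2 (lower-left), 3 (lower-right); the function name `qid` is hypothetical.

```python
def qid(row, col, levels):
    """Return the quadrant digits of a pixel, outermost quadrant first.

    At each level the digit is 2 * row_bit + col_bit.
    """
    digits = []
    for level in range(levels - 1, -1, -1):
        row_bit = (row >> level) & 1
        col_bit = (col >> level) & 1
        digits.append(2 * row_bit + col_bit)
    return digits


# (7, 1) = (111, 001) in an 8x8 image, which has 3 levels
print(".".join(str(d) for d in qid(7, 1, 3)))   # 2.2.3
```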
Peano Mask Tree or PM-tree (3-value logic)
  1 1 1 1 1 1 0 0
  1 1 1 1 1 0 0 0
  1 1 1 1 1 1 0 0
  1 1 1 1 1 1 1 0
  1 1 1 1 0 0 0 0
  1 1 1 1 0 0 0 0
  1 1 1 1 0 0 0 0
  0 1 1 1 0 0 0 0
PM-tree (1 = pure-1, 0 = pure-0, m = mixed):
  Root: m
  Level 1: 1 m m 0
  Level 2: m 0 1 m | 1 1 m 1
  Leaves: 1110 0010 1101
• Pure1-trees (most compressed, 2-value logic): a node is 1 if the quadrant is pure-1, else 0.
• Truth- or Predicate-trees (2-value logic: 1-bit = true, 0-bit = false): given any condition on a quadrant (e.g., pure-0, mixed, pure-1), the node is a 1-bit if the condition is true, else a 0-bit.
• All are lossless compressed representations of the dataset.
P-tree Operations
Each tree is listed level by level, with children in Peano order; m marks a mixed quadrant.
P-tree, 1st bit:
  Root: m
  Level 1: 1 m m 1
  Level 2: m 0 1 m | 1 1 m 1
  Leaves: 1110 0010 1101
P-tree, 2nd bit:
  Root: m
  Level 1: 1 0 m 0
  Level 2: 1 1 1 m
  Leaves: 0100
AND result:
  Root: m
  Level 1: 1 0 m 0
  Level 2: 1 1 m m
  Leaves: 1101 0100
OR result:
  Root: m
  Level 1: 1 m 1 1
  Level 2: m 0 1 m
  Leaves: 1110 0010
Count-tree and mask-tree of the same data:
  Count-tree: root 55; level 1: 16 8 15 16; level 2: 3 0 4 1 | 4 4 3 4; leaves: 1110 0010 1101
  Mask-tree: root m; level 1: 1 m m 1; level 2: m 0 1 m | 1 1 m 1; leaves: 1110 0010 1101
Complements:
  Count-tree: root 9; level 1: 0 8 1 0; level 2: 1 4 0 3 | 0 0 1 0; leaves: 0001 1101 0010
  Mask-tree: root m; level 1: 0 m m 0; level 2: m 1 0 m | 0 0 m 0; leaves: 0001 1101 0010
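The AND, OR, and COMPLEMENT results on this slide can be checked with a small recursive sketch. The representation is an assumption chosen for brevity: a node is `0` (pure-0), `1` (pure-1), or a list of four children in Peano order (a mixed node); `T1` and `T2` encode the two example trees.

```python
def norm(kids):
    """Collapse four identical pure children into a single pure node."""
    if all(k == 0 for k in kids):
        return 0
    if all(k == 1 for k in kids):
        return 1
    return kids


def p_and(a, b):
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    return norm([p_and(x, y) for x, y in zip(a, b)])


def p_or(a, b):
    if a == 1 or b == 1:
        return 1
    if a == 0:
        return b
    if b == 0:
        return a
    return norm([p_or(x, y) for x, y in zip(a, b)])


def p_not(a):
    return 1 - a if a in (0, 1) else [p_not(k) for k in a]


def root_count(a, size):
    """Number of 1-bits under node a, which covers a size x size quadrant."""
    if a in (0, 1):
        return a * size * size
    return sum(root_count(k, size // 2) for k in a)


T1 = [1, [[1, 1, 1, 0], 0, 1, [0, 0, 1, 0]], [1, 1, [1, 1, 0, 1], 1], 1]
T2 = [1, 0, [1, 1, 1, [0, 1, 0, 0]], 0]

print(p_and(T1, T2))                # [1, 0, [1, 1, [1, 1, 0, 1], [0, 1, 0, 0]], 0]
print(p_or(T1, T2))                 # [1, [[1, 1, 1, 0], 0, 1, [0, 0, 1, 0]], 1, 1]
print(root_count(T1, 8))            # 55
print(root_count(p_not(T1), 8))     # 9
```

A pure-0 operand ends the AND immediately and a pure-1 operand ends the OR, so whole quadrants are settled without visiting their bits.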
P-tree ANDing Operation
PM-tree1:
  Root: m
  Level 1: 1 m m 1
  Level 2: m 0 1 m | 1 1 m 1
  Leaves: 1110 0010 1101
PM-tree2:
  Root: m
  Level 1: 1 0 m 0
  Level 2: 1 1 1 m
  Leaves: 0100
Result:
  Root: m
  Level 1: 1 0 m 0
  Level 2: 1 1 m m
  Leaves: 1101 0100
Depth-first pure-1 path codes:
  PM-tree1: 0 100 101 102 12 132 20 21 220 221 223 23 3
  PM-tree2: 0 20 21 22 231
  Result (PM-tree1 & PM-tree2): 0 20 21 220 221 223 231
• Parallel software implementations on computer clusters are very fast.
• Hardware implementations are being developed.
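The path-code form of the AND can be sketched as a prefix match: a pure-1 path in one tree covers every longer path in the other tree that starts with it, and the longer path survives. The function name `and_paths` is hypothetical, and this quadratic loop is only an illustration, not the parallel implementation mentioned on the slide.

```python
def and_paths(a, b):
    """AND two trees given as lists of depth-first pure-1 path codes."""
    out = []
    for x in a:
        for y in b:
            if x.startswith(y):       # y's quadrant contains x's quadrant
                out.append(x)
            elif y.startswith(x):     # x's quadrant contains y's quadrant
                out.append(y)
    return sorted(out)


tree1 = "0 100 101 102 12 132 20 21 220 221 223 23 3".split()
tree2 = "0 20 21 22 231".split()
print(" ".join(and_paths(tree1, tree2)))   # 0 20 21 220 221 223 231
```

The sketch does not re-merge four sibling codes into their parent; this example never produces that case.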
Various P-trees
[Figure: derivation diagram; the P-tree kinds below are derived from one another with AND, OR, and COMPLEMENT, all computed by one multiway AND, OR, COMPLEMENT program]
• Basic P-trees Pi,j
• Value P-trees Pi(v)
• Interval P-trees Pi(v1, v2)
• Tuple P-trees P(v1, v2, …, vn)
• Cube P-trees P([v11, v12], …, [vN1, vN2])
• Predicate P-trees P(p)
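As a rough illustration of how the derived P-trees relate to the basic ones, the sketch below uses uncompressed bitmaps (Python integers, one bit per row) in place of real P-trees. It assumes that a value P-tree is the AND of the basic P-trees for the 1-bits of the value and the complements of the basic P-trees for its 0-bits, and that an interval P-tree is the OR of value P-trees; all function names and the sample column are hypothetical.

```python
def basic_bitmaps(column, bits):
    """maps[j] has bit r set iff bit j (0 = high-order) of column[r] is 1."""
    maps = [0] * bits
    for r, value in enumerate(column):
        for j in range(bits):
            if (value >> (bits - 1 - j)) & 1:
                maps[j] |= 1 << r
    return maps


def value_bitmap(maps, v, n_rows):
    """AND each basic bitmap, or its complement, according to the bits of v."""
    bits = len(maps)
    full = (1 << n_rows) - 1           # mask that keeps complements in range
    result = full
    for j in range(bits):
        if (v >> (bits - 1 - j)) & 1:
            result &= maps[j]
        else:
            result &= ~maps[j] & full
    return result


def interval_bitmap(maps, lo, hi, n_rows):
    """OR the value bitmaps for every value in [lo, hi]."""
    result = 0
    for v in range(lo, hi + 1):
        result |= value_bitmap(maps, v, n_rows)
    return result


def rc(bitmap):
    """Root count: the number of 1-bits."""
    return bin(bitmap).count("1")


column = [5, 3, 5, 7, 0, 5]            # hypothetical 3-bit band
maps = basic_bitmaps(column, bits=3)
print(rc(value_bitmap(maps, 5, len(column))))         # 3 rows hold the value 5
print(rc(interval_bitmap(maps, 3, 5, len(column))))   # 4 rows fall in [3, 5]
```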
Scalability of P-tree Operations
[Figure: software multi-way ANDing; time in ms (0 to 60) versus dataset size in millions of tuples (0 to 18)]
Platform: Beowulf cluster of 16 dual P2 266 MHz processors with 128 MB RAM.
Properties of P-trees
[Properties 1 to 3 (items 1a-b, 2a-d, 3a-d) are not preserved in the slide text.]
4. Root-count properties:
• $rc(P_1 \mid P_2) = 0 \Rightarrow rc(P_1) = 0 \text{ and } rc(P_2) = 0$
• $v_1 \neq v_2 \Rightarrow rc\{P_i(v_1) \,\&\, P_i(v_2)\} = 0$
• $rc(P_1 \mid P_2) = rc(P_1) + rc(P_2) - rc(P_1 \,\&\, P_2)$
• $rc\{P_i(v_1) \mid P_i(v_2)\} = rc\{P_i(v_1)\} + rc\{P_i(v_2)\}$, where $v_1 \neq v_2$
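The root-count properties in item 4 can be checked numerically. In this sketch, uncompressed bitmaps (Python integers) stand in for P-trees, `rc` is a popcount, and the bit patterns and the small column are hypothetical.

```python
def rc(bitmap):
    """Root count: the number of 1-bits."""
    return bin(bitmap).count("1")


# Inclusion-exclusion: rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)
p1 = 0b10110100
p2 = 0b00110011
print(rc(p1 | p2), rc(p1) + rc(p2) - rc(p1 & p2))    # 6 6

# Value bitmaps of distinct values are disjoint, so their counts add.
column = [5, 3, 5, 7]


def value_map(v):
    return sum(1 << r for r, x in enumerate(column) if x == v)


print(rc(value_map(5) & value_map(3)))               # 0
print(rc(value_map(5) | value_map(3)))               # 3 = 2 + 1
```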
Notations
• rc(P): root count of P-tree P
• N: number of pixels
• n: number of bands
• m: number of bits
• P1 & P2: P1 AND P2
• P1 | P2: P1 OR P2
• P´: COMPLEMENT of P
• Pi,j: basic P-tree for band i, bit j
• Pi(v): value P-tree for value v of band i
• Pi(v1, v2): interval P-tree for interval [v1, v2] of band i
• P0: the pure0-tree, a P-tree whose root node is pure0
• P1: the pure1-tree, a P-tree whose root node is pure1
Techniques Using P-trees
• DTI Classifiers
• Bayesian Classifiers
• ARM
• KNN and Closed KNN Classifiers
Techniques Using P-trees
• Decision tree induction (DTI) classifiers
  • For large amounts of multimedia data and data streams, standard DTI is very limited in effectiveness.
  • Measurements such as information gain can be calculated quickly through P-tree ANDing, which enables P-tree technology to handle large quantities of data and streaming data.
  • The P-tree based DTI classification method was shown to be significantly faster than existing DTI classification methods.
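To show how information gain reduces to root counts of ANDed P-trees, here is a minimal sketch. Uncompressed bitmaps (Python integers) stand in for P-trees, and the function names and sample bitmaps are hypothetical; it is not the authors' DTI implementation.

```python
from math import log2


def rc(bitmap):
    """Root count: the number of 1-bits."""
    return bin(bitmap).count("1")


def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def info_gain(class_maps, attr_maps):
    """class_maps and attr_maps hold one bitmap per class label / attribute value."""
    n = sum(rc(c) for c in class_maps)
    base = entropy([rc(c) for c in class_maps])
    remainder = 0.0
    for a in attr_maps:
        joint = [rc(a & c) for c in class_maps]   # one AND per (value, class) pair
        if sum(joint):
            remainder += sum(joint) / n * entropy(joint)
    return base - remainder


classes = [0b00001111, 0b11110000]                 # two classes over 8 rows
print(info_gain(classes, [0b00001111, 0b11110000]))   # attribute separates the classes
print(info_gain(classes, [0b00110011, 0b11001100]))   # attribute carries no information
```

Every count that enters the formula is a root count, so no scan of the original rows is needed once the P-trees exist.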
Techniques Using P-trees
• Bayesian Classifiers
  • Computing conditional probabilities can be prohibitive for many multimedia applications, since the data volume is often large.
  • From the very first paper in the 2002 KDD proceedings: “For massive datasets Bayesian methods still begin by a ‘load data into memory’ step, make compromising assumptions, or resort to subsampling to skirt the issue.”
  • Naïve Bayesian classification is used to minimize computational costs, but it can give poor results (a compromising assumption!).
  • P-tree technology avoids the need for Naïve Bayesian classification or subsampling, since conditional probability values derive directly from ANDing P-trees.
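A conditional probability estimate of this kind is a ratio of two root counts. The sketch below again uses uncompressed bitmaps (Python integers) in place of P-trees; the function names and bit patterns are hypothetical.

```python
def rc(bitmap):
    """Root count: the number of 1-bits."""
    return bin(bitmap).count("1")


def cond_prob(evidence_map, class_map):
    """Estimate P(evidence | class) as rc(evidence & class) / rc(class)."""
    return rc(evidence_map & class_map) / rc(class_map)


c = 0b00111100      # rows belonging to class c
a1 = 0b00011110     # rows where attribute 1 has the observed value
a2 = 0b00110110     # rows where attribute 2 has the observed value

joint = cond_prob(a1 & a2, c)                   # both attributes at once
naive = cond_prob(a1, c) * cond_prob(a2, c)     # independence assumption
print(joint, naive)                             # 0.5 0.5625
```

ANDing the attribute bitmaps first gives the joint estimate directly, so the independence assumption is not needed.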
Techniques Using P-trees
• Association Rule Mining (ARM)
  • In most cases, multimedia data sizes are too large to be mined in reasonable time using existing algorithms.
  • P-ARM, an efficient association rule mining algorithm built on P-tree techniques, has shown significant improvement compared with FP-growth and Apriori.
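The counting step behind association rule mining can be sketched as follows. This is not P-ARM itself, whose details are not given on the slide; it only shows support and confidence computed from the root count of ANDed bitmaps, with hypothetical names and data.

```python
from functools import reduce
from operator import and_


def rc(bitmap):
    """Root count: the number of 1-bits."""
    return bin(bitmap).count("1")


def support(item_maps, n_rows):
    """Fraction of rows that contain every item in the itemset."""
    return rc(reduce(and_, item_maps)) / n_rows


def confidence(antecedent, consequent, n_rows):
    """Confidence of the rule antecedent -> consequent."""
    return support(antecedent + consequent, n_rows) / support(antecedent, n_rows)


A = 0b10111101      # rows containing item A (6 of 8)
B = 0b00111100      # rows containing item B (4 of 8)
print(support([A, B], 8))           # 0.5
print(confidence([A], [B], 8))      # 4/6
```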
Techniques Using P-trees
• KNN and Closed KNN Classifiers
  • KNN classifiers typically have a very high cost associated with rebuilding the classifier when new data arrives (e.g., data streams).
  • Constructing the neighborhood is the high-cost operation.
  • P-tree technology finds closed-KNN neighborhoods quickly.
  • Experimental results have shown that P-tree closed-KNN yields higher classification accuracy as well as significantly higher speed.
  • Our P-KNN algorithm, combined with GAs, earned honorable mention in the 2002 KDD-cup competition (task 2) and actually won one of the two subproblems (the “broad classification problem”).
  • KDD-cup-2 data was very much multimedia: hierarchical categorical data, undirected graph data, and text data (Medline abstracts).
Closed-KNN
[Figure: target pixel T and its neighborhood]
• The black dot is the target pixel.
• For k = 3, to find the 3rd nearest neighbor, standard KNN arbitrarily selects one point from the boundary as the 3rd neighbor. Closed-KNN includes all points on the boundary.
• Closed-KNN yields higher classification accuracy than traditional KNN. P-KNN produces the closed neighborhood naturally, whereas traditional KNN requires another full dataset scan to find it. P-KNN is therefore both faster and more accurate.
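The difference between standard and closed KNN can be made concrete with a small sketch. It uses a plain scan and sort rather than P-trees, so it illustrates only the neighborhood definition, not the P-KNN algorithm; the names, the Manhattan distance choice, and the sample points are assumptions.

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def closed_knn(train, target, k, dist=manhattan):
    """Return every point no farther from target than its k-th nearest neighbor.

    Ties on the boundary are all kept, so the result may hold more than k points.
    """
    ranked = sorted(train, key=lambda p: dist(p, target))
    if len(ranked) <= k:
        return ranked
    radius = dist(ranked[k - 1], target)
    return [p for p in ranked if dist(p, target) <= radius]


points = [(1, 0), (0, 1), (2, 0), (0, 2), (-2, 0), (3, 3)]
neighbors = closed_knn(points, (0, 0), k=3)
print(len(neighbors))   # 5: two points at distance 1, three tied at distance 2
```

Standard KNN would keep only one of the three points tied at distance 2, and which one it keeps is arbitrary.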
Performance – Accuracy
[Figure: 1997 TIFF-Yield dataset; accuracy (%) versus training set size (256 to 262,144 pixels) for KNN-Manhattan (L1), KNN-Euclidean (L2), KNN-Max (L∞), KNN-Hobbit (high-order basic bit), P-tree: Perfect Centering (closed-KNN), and P-tree: Hobbit (closed-KNN)]
Performance - Accuracy (cont.)
[Figure: 1998 TIFF-Yield dataset; accuracy (%) versus training set size (256 to 262,144 pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree: Perfect Centering (closed-KNN), and P-tree: Hobbit (closed-KNN)]
Performance - Time
[Figure: 1997 dataset, both axes on a logarithmic scale; per-sample classification time (sec, 0.00001 to 1) versus training set size (256 to 262,144 pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree: Perfect Centering (closed-KNN), and P-tree: Hobbit (closed-KNN)]
Performance - Time (cont.)
[Figure: 1998 dataset, both axes on a logarithmic scale; per-sample classification time (sec, 0.00001 to 1) versus training set size (256 to 262,144 pixels) for KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree: Perfect Centering (closed-KNN), and P-tree: Hobbit (closed-KNN)]
Conclusion
• One of the major issues of multimedia data mining is the sheer size of the resulting feature space.
• The P-tree, a data-mining-ready structure, deals with this issue and facilitates efficient data mining of streams.
• P-tree methods can be faster and more accurate at the same time.
Questions?
William.Perrizo@ndsu.nodak.edu