
Multimedia Data Mining using P-trees*

William Perrizo, William Jockheck, Amal Perera, Dongmei Ren, Weihua Wu, Yi Zhang. Computer Science Department, North Dakota State University, USA. MDM/KDD2002 Workshop, Edmonton, Alberta, Canada, July 23, 2002. * P-tree technology is patent pending by NDSU.


Presentation Transcript


  1. Multimedia Data Mining using P-trees*. William Perrizo, William Jockheck, Amal Perera, Dongmei Ren, Weihua Wu, Yi Zhang. Computer Science Department, North Dakota State University, USA. MDM/KDD2002 Workshop, Edmonton, Alberta, Canada, July 23, 2002. *P-tree technology is patent pending by NDSU.

  2. Outline
     • Multimedia Data Mining
     • Peano Count Trees (P-trees)
     • Properties of P-trees
     • Data Mining Techniques Using P-trees
     • Implementation Issues and Performance
     • Conclusion

  3. Multimedia Data Mining
     • Extract high-level information from large multimedia data sets.
     • Typically done in two steps:
       • Capture specific features of the data as feature vectors or tuples in a table or feature space.
       • Mine those tuples for information/knowledge: association rule mining (ARM), clustering, or classification on the feature vectors.

  4. Multimedia Data
     • Remotely Sensed Imagery (RSI)
       • Usually 2-D (or 3-D) and relatively smooth
       • Large datasets (e.g., Landsat ETM+, ~100,000,000 pixels)
     • Video-audio data mining
       • Usually results in high-dimensional feature spaces
       • Multimedia datasets are usually very large
     • Text mining (feature space is high-dimensional but sparse)
     • P-trees are well suited for representing such feature spaces
       • Lossless compressed representation
       • Good at manipulating high-dimensional data sets

  5. Precision Agriculture Dataset: TIFF image and other measurements (1320×1320)
     [Figure panels: RGB, Yield, Nitrate, Moisture]

  6. The Peano Count Tree (P-tree)
     A P-tree represents feature vector data bit by bit, in a recursive, quadrant-by-quadrant, losslessly compressed manner.
     • First: given a feature vector space, vertically fragment by column. This is the Storage Decomposition Model (SDM; e.g., Bubba, circa 1985).
       • In SDM, each column is a separate file retaining the original row order.
       • Sometimes called the “Vertical Database Model”.
     • Second: for P-trees, we vertically fragment further by bit position.
       • Bit-SDM (bSDM): each bit position of each column is a separate file that retains the original row order.
       • Each resulting file is called a bit-SeQuential (bSQ) or bSDM file.
       • The high-order bSQ file IS data. The others are DELTAs (à la MPEG).
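
The vertical, bit-position decomposition described on this slide can be illustrated with a small sketch. It is not the authors' implementation: the function names `to_bsq` and `from_bsq` and the sample band values are hypothetical, and plain Python lists stand in for the bSQ files.

```python
def to_bsq(column, bits=8):
    """Split one column of unsigned ints into `bits` bit-sequential lists.

    planes[0] holds the high-order bit of every row and planes[bits - 1]
    the low-order bit; every plane keeps the original row order.
    """
    return [[(value >> (bits - 1 - j)) & 1 for value in column]
            for j in range(bits)]


def from_bsq(planes):
    """Reassemble the column from its bit planes (the split is lossless)."""
    bits = len(planes)
    n_rows = len(planes[0])
    return [sum(plane[row] << (bits - 1 - j) for j, plane in enumerate(planes))
            for row in range(n_rows)]


# Hypothetical 8-bit band values, one per pixel, in row order.
band = [200, 13, 255, 0, 128]
planes = to_bsq(band)
restored = from_bsq(planes)
```

Each entry of `planes` plays the role of one bSQ file; a P-tree would then be built over each plane separately.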

  7. An example of a P-tree
     2-D bSQ file (the 1-D bSQ file laid out in 2-D raster order):
       1 1 1 1 1 1 0 0
       1 1 1 1 1 0 0 0
       1 1 1 1 1 1 0 0
       1 1 1 1 1 1 1 0
       1 1 1 1 0 0 0 0
       1 1 1 1 0 0 0 0
       1 1 1 1 0 0 0 0
       0 1 1 1 0 0 0 0
     Peano count tree:
       root count: 39
       level 1: 16 8 15 0
       level 2: 3 0 4 1 (under 8); 4 4 3 4 (under 15)
       leaf bits: 1110, 0010, 1101
     • Quadrant-based; pure (pure-1/pure-0) quadrants
     • Peano or Z-ordering
     • Root count
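
The quadrant counts in this example can be reproduced with a short recursive sketch. It assumes quadrants are visited in the order upper-left, upper-right, lower-left, lower-right, and it represents each node as a hypothetical `(count, children)` pair; recounting each quadrant from scratch is done for clarity, not efficiency.

```python
# The 8x8 bSQ file from the slide, in raster order.
GRID = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
]


def count_tree(grid, row=0, col=0, size=None):
    """Return (count, children) for the size x size quadrant at (row, col).

    Pure quadrants (all 0s or all 1s) are not subdivided, so their
    children list is empty; single pixels are always pure.
    """
    if size is None:
        size = len(grid)
    count = sum(grid[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    if count == 0 or count == size * size:
        return (count, [])
    half = size // 2
    offsets = ((0, 0), (0, half), (half, 0), (half, half))  # Z-order
    children = [count_tree(grid, row + dr, col + dc, half)
                for dr, dc in offsets]
    return (count, children)


tree = count_tree(GRID)
root_count = tree[0]
level_one = [child[0] for child in tree[1]]
```

Here `root_count` is the number of 1-bits in the whole file and `level_one` lists the counts of the four top-level quadrants.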

  8. [Figure: 8×8 bSQ file with its Peano count tree and quadrant numbering 0, 1, 2, 3]
       1 1 1 1 1 1 0 0
       1 1 1 1 1 0 0 0
       1 1 1 1 1 1 0 0
       1 1 1 1 1 1 1 0
       1 1 1 1 1 1 1 1
       1 1 1 1 1 1 1 1
       1 1 1 1 1 1 1 1
       0 1 1 1 1 1 1 1
     Count tree: root count 55; level 1: 16 8 15 16; level 2: 3 0 4 1 (under 8) and 4 4 3 4 (under 15); leaf bits: 1110, 0010, 1101
     Example QID: pixel ( 7, 1 ) = ( 111, 001 ) has QID 2.2.3 = 10.10.11
     • Peano or Z-ordering
     • Pure-1/Pure-0 quadrant
     • Root Count
     • Level
     • Fan-out
     • QID (Quadrant ID)
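
The QID example on this slide, ( 7, 1 ) = ( 111, 001 ) giving 2.2.3 = 10.10.11, is consistent with interleaving the bits of the two coordinates. The sketch below assumes that ( 7, 1 ) is read as (row, column) and that quadrants are numbered 0 = upper-left, 1 = upper-right, 2 = lower-left, 3 = lower-right; the function names are hypothetical.

```python
def qid(row, col, levels):
    """Return the quadrant ID of a pixel as a list of base-4 digits.

    At each level, the row bit selects top/bottom and the column bit
    selects left/right, so the digit is 2 * row_bit + col_bit.
    """
    digits = []
    for level in range(levels - 1, -1, -1):
        row_bit = (row >> level) & 1
        col_bit = (col >> level) & 1
        digits.append(2 * row_bit + col_bit)
    return digits


def z_index(row, col, levels):
    """Position of the pixel in Peano (Z) order: the QID read in base 4."""
    index = 0
    for digit in qid(row, col, levels):
        index = index * 4 + digit
    return index


example_qid = qid(7, 1, levels=3)      # an 8x8 image has 3 levels
example_index = z_index(7, 1, levels=3)
```

Reading the digits of `example_qid` as two-bit groups gives the interleaved binary form shown on the slide.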

  9. Peano Mask Tree or PM-tree (3-value logic: 1 = pure-1, 0 = pure-0, m = mixed)
     bSQ file:
       1 1 1 1 1 1 0 0
       1 1 1 1 1 0 0 0
       1 1 1 1 1 1 0 0
       1 1 1 1 1 1 1 0
       1 1 1 1 0 0 0 0
       1 1 1 1 0 0 0 0
       1 1 1 1 0 0 0 0
       0 1 1 1 0 0 0 0
     PM-tree: root m; level 1: 1 m m 0; level 2: m 0 1 m (under the first m) and 1 1 m 1 (under the second m); leaf bits: 1110, 0010, 1101
     • Pure1-trees (most compressed, 2-value logic): a quadrant is 1 if it is a pure-1 quadrant, else 0
     • Truth- or Predicate-trees (2-value logic: 1-bit = true, 0-bit = false)
       • Given any condition on a quadrant (e.g., pure-0, mixed, pure-1), store a 1-bit if the condition is true, else a 0-bit.
     • All are lossless compressed representations of the dataset
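
As a rough illustration of the mask-tree idea, the sketch below derives a PM-tree from the bSQ file on this slide. The encoding is an assumption made for the example: a node is `0` (pure-0), `1` (pure-1), or a 4-tuple of child nodes (mixed, "m" on the slide), with children in the order upper-left, upper-right, lower-left, lower-right.

```python
GRID = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
]


def pm_tree(grid, row=0, col=0, size=None):
    """Return 0, 1, or a 4-tuple of sub-trees for the given quadrant."""
    if size is None:
        size = len(grid)
    count = sum(grid[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    if count == 0:
        return 0                      # pure-0 quadrant
    if count == size * size:
        return 1                      # pure-1 quadrant
    half = size // 2
    offsets = ((0, 0), (0, half), (half, 0), (half, half))
    return tuple(pm_tree(grid, row + dr, col + dc, half)
                 for dr, dc in offsets)


tree = pm_tree(GRID)
# Three-valued labels of the four top-level quadrants.
mask_labels = ["m" if isinstance(child, tuple) else child for child in tree]
# Two-valued "is this quadrant pure-1?" predicate for the same quadrants.
pure1_bits = [1 if child == 1 else 0 for child in tree]
```

The `pure1_bits` list shows how a single predicate turns the three-valued labels into a two-valued tree level; other predicates would be evaluated the same way.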

  10. P-tree Operations
      Notation: node( child 0, child 1, child 2, child 3 ) in Peano order; [bits] are the four leaf bits of a mixed 2×2 quadrant.
      P-tree-1st bit: m( 1, m( m[1110], 0, 1, m[0010] ), m( 1, 1, m[1101], 1 ), 1 )
      P-tree-2nd bit: m( 1, 0, m( 1, 1, 1, m[0100] ), 0 )
      AND-Result:     m( 1, 0, m( 1, 1, m[1101], m[0100] ), 0 )
      OR-Result:      m( 1, m( m[1110], 0, 1, m[0010] ), 1, 1 )
      Count-tree:     55( 16, 8( 3[1110], 0, 4, 1[0010] ), 15( 4, 4, 3[1101], 4 ), 16 )
      Mask-tree:      m( 1, m( m[1110], 0, 1, m[0010] ), m( 1, 1, m[1101], 1 ), 1 )
      Complements:
        Count-tree:   9( 0, 8( 1[0001], 4, 0, 3[1101] ), 1( 0, 0, 1[0010], 0 ), 0 )
        Mask-tree:    m( 0, m( m[0001], 1, 0, m[1101] ), m( 0, 0, m[0010], 0 ), 0 )
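
The AND, OR and complement results on this slide can be checked with a small sketch that works directly on the compressed trees. The encoding is an assumption for the example (a node is `0`, `1`, or a 4-tuple of children for a mixed quadrant), and the function names are hypothetical; this is not the authors' implementation.

```python
def collapse(children):
    """Merge four children into a single pure node when they all agree."""
    children = tuple(children)
    if all(child == 0 for child in children):
        return 0
    if all(child == 1 for child in children):
        return 1
    return children


def p_and(a, b):
    if a == 0 or b == 0:
        return 0                 # a pure-0 operand wins immediately
    if a == 1:
        return b                 # a pure-1 operand passes the other through
    if b == 1:
        return a
    return collapse(p_and(x, y) for x, y in zip(a, b))


def p_or(a, b):
    if a == 1 or b == 1:
        return 1
    if a == 0:
        return b
    if b == 0:
        return a
    return collapse(p_or(x, y) for x, y in zip(a, b))


def complement(a):
    if a == 0:
        return 1
    if a == 1:
        return 0
    return tuple(complement(child) for child in a)


def root_count(node, size):
    """Number of 1-bits under a node covering `size` pixels."""
    if node == 0:
        return 0
    if node == 1:
        return size
    return sum(root_count(child, size // 4) for child in node)


# The two mask trees from the slide (8x8 image, 64 pixels).
T1 = (1, ((1, 1, 1, 0), 0, 1, (0, 0, 1, 0)), (1, 1, (1, 1, 0, 1), 1), 1)
T2 = (1, 0, (1, 1, 1, (0, 1, 0, 0)), 0)

and_result = p_and(T1, T2)
or_result = p_or(T1, T2)
t1_complement = complement(T1)
```

Pure quadrants short-circuit the recursion, which is why the operations never need to decompress the trees.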

  11. P-tree ANDing Operation
      PM-tree1: m( 1, m( m[1110], 0, 1, m[0010] ), m( 1, 1, m[1101], 1 ), 1 )
      PM-tree2: m( 1, 0, m( 1, 1, 1, m[0100] ), 0 )
      Result:   m( 1, 0, m( 1, 1, m[1101], m[0100] ), 0 )
      Depth-first pure-1 path codes:
        PM-tree1: 0 100 101 102 12 132 20 21 220 221 223 23 3
        PM-tree2: 0 20 21 22 231
        RESULT:   0 20 21 220 221 223 231
      Parallel software implementations on computer clusters are very fast. Hardware implementations are being developed.
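
One way to read the path-code example: a pixel is 1 in the result exactly when it lies under a pure-1 path in both trees, so two paths contribute to the result when one is a prefix of the other, and the longer (more specific) path is kept. The sketch below follows that reading; the function name is hypothetical and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def and_path_codes(paths_a, paths_b):
    """AND two P-trees given as lists of pure-1 path codes (strings)."""
    result = []
    for p in paths_a:
        for q in paths_b:
            if p.startswith(q):
                result.append(p)      # q covers p: keep the finer quadrant p
            elif q.startswith(p):
                result.append(q)      # p covers q: keep the finer quadrant q
    return sorted(result)


# Pure-1 path codes copied from the slide.
TREE1 = ["0", "100", "101", "102", "12", "132",
         "20", "21", "220", "221", "223", "23", "3"]
TREE2 = ["0", "20", "21", "22", "231"]

result_paths = and_path_codes(TREE1, TREE2)
```

Because the pure-1 paths of a single tree never contain one another, each matching pair adds exactly one entry to the result.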

  12. Various P-trees
      [Diagram: P-tree types derived from one another with AND, OR and COMPLEMENT; one multiway AND, OR, COMPLEMENT program]
      • Basic P-trees Pi,j
      • Predicate P-trees P(p)
      • Value P-trees Pi(v)
      • Interval P-trees Pi(v1, v2)
      • Tuple P-trees P(v1, v2, …, vn)
      • Cube P-trees P([v11, v12], …, [vN1, vN2])
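
A sketch of how value, interval and tuple P-trees could be derived from basic P-trees with AND, OR and COMPLEMENT. It makes simplifying assumptions: uncompressed Python integers used as bit sets stand in for P-trees (bit r is set when row r satisfies the condition), and all names and sample values are hypothetical.

```python
def basic_ptrees(column, bits):
    """One bit set per bit position; index 0 is the high-order bit."""
    planes = []
    for j in range(bits):
        mask = 0
        for row, value in enumerate(column):
            if (value >> (bits - 1 - j)) & 1:
                mask |= 1 << row
        planes.append(mask)
    return planes


def value_ptree(planes, value, n_rows):
    """AND each basic P-tree, or its complement where the value bit is 0."""
    full = (1 << n_rows) - 1
    bits = len(planes)
    result = full
    for j, plane in enumerate(planes):
        bit = (value >> (bits - 1 - j)) & 1
        result &= plane if bit else (~plane & full)
    return result


def interval_ptree(planes, low, high, n_rows):
    """OR of the value P-trees for every value in [low, high]."""
    result = 0
    for value in range(low, high + 1):
        result |= value_ptree(planes, value, n_rows)
    return result


def rc(mask):
    """Root count: the number of rows whose bit is set."""
    return bin(mask).count("1")


band1 = [5, 3, 5, 7, 0, 5]          # hypothetical 3-bit band
band2 = [1, 1, 2, 2, 1, 1]          # hypothetical 2-bit band
n_rows = len(band1)
planes1 = basic_ptrees(band1, bits=3)
planes2 = basic_ptrees(band2, bits=2)

p1_5 = value_ptree(planes1, 5, n_rows)              # rows where band1 == 5
p1_3_to_5 = interval_ptree(planes1, 3, 5, n_rows)   # rows where 3 <= band1 <= 5
tuple_5_1 = p1_5 & value_ptree(planes2, 1, n_rows)  # rows equal to (5, 1)
```

A cube P-tree would follow the same pattern, ANDing one interval P-tree per band.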

  13. Scalability of P-tree Operations
      [Figure: software multi-way ANDing. x-axis: dataset size in million tuples (0 to 18); y-axis: time in ms (0 to 60).]
      Beowulf cluster of 16 dual P2 266 MHz processors with 128 MB RAM.

  14. Properties of P-trees
      [Properties 1 to 3: formulas not captured in this transcript]
      4. • rc(P1 | P2) = 0 ⇒ rc(P1) = 0 and rc(P2) = 0
         • v1 ≠ v2 ⇒ rc{Pi(v1) & Pi(v2)} = 0
         • rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)
         • rc{Pi(v1) | Pi(v2)} = rc{Pi(v1)} + rc{Pi(v2)}, where v1 ≠ v2
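
The root-count properties that survive on this slide can be checked numerically. In the sketch below, Python integers used as bit sets stand in for P-trees (an assumption for illustration only), and the sample bit patterns are made up.

```python
def rc(mask):
    """Root count: the number of 1-bits in the bit set."""
    return bin(mask).count("1")


# Two hypothetical P-trees over 8 rows, written as bit sets.
p1 = 0b10110100
p2 = 0b00111001

union_count = rc(p1 | p2)
by_inclusion_exclusion = rc(p1) + rc(p2) - rc(p1 & p2)

# Value P-trees for two different values of the same band cannot share a
# row, so their AND is empty and the counts simply add.
value_a = 0b00001011
value_b = 0b01100100
disjoint_union_count = rc(value_a | value_b)
```

With these inputs, `union_count` equals `by_inclusion_exclusion`, and `disjoint_union_count` equals `rc(value_a) + rc(value_b)`.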

  15. Notations
      rc(P): root count of P-tree P
      N: number of pixels
      n: number of bands
      m: number of bits
      P1 & P2: P1 AND P2
      P1 | P2: P1 OR P2
      P´: COMPLEMENT of P
      Pi,j: basic P-tree for band i, bit j
      Pi(v): value P-tree for value v of band i
      Pi(v1, v2): interval P-tree for interval [v1, v2] of band i
      P0: pure0-tree, a P-tree whose root node is pure0
      P1: pure1-tree, a P-tree whose root node is pure1

  16. Techniques Using P-trees
      • DTI Classifiers
      • Bayesian Classifiers
      • ARM
      • KNN and Closed KNN Classifiers

  17. Techniques Using P-trees
      • Decision tree induction (DTI) classifiers
        • For large amounts of multimedia data and data streams, standard DTI is very limited in effectiveness.
        • Fast calculation of measures such as information gain through P-tree ANDing enables P-tree technology to handle large quantities of data and streaming data.
        • The P-tree based decision tree induction classification method was shown to be significantly faster than existing DTI classification methods.
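
To make the "information gain through P-tree ANDing" point concrete, the sketch below computes information gain purely from root counts of ANDed bit sets. It assumes uncompressed Python integers as stand-ins for value P-trees and class P-trees; the function names and the toy data are hypothetical.

```python
from math import log2


def rc(mask):
    """Root count: the number of 1-bits in the bit set."""
    return bin(mask).count("1")


def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def information_gain(attribute_masks, class_masks):
    """Gain of splitting on an attribute, using only counts of ANDs.

    attribute_masks: one bit set per attribute value.
    class_masks: one bit set per class label.
    """
    n_rows = sum(rc(mask) for mask in class_masks)
    gain = entropy([rc(mask) for mask in class_masks])
    for attr in attribute_masks:
        # Class distribution inside this attribute value: AND, then count.
        part = [rc(attr & cls) for cls in class_masks]
        if sum(part):
            gain -= sum(part) / n_rows * entropy(part)
    return gain


# Toy data over 8 rows: two classes and two candidate attributes.
classes = [0b00001111, 0b11110000]
attribute_a = [0b00001111, 0b11110000]   # separates the classes perfectly
attribute_b = [0b01010101, 0b10101010]   # carries no class information

gain_a = information_gain(attribute_a, classes)
gain_b = information_gain(attribute_b, classes)
```

No pass over individual rows is needed once the bit sets exist; every quantity in the formula is a root count of an AND.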

  18. Techniques Using P-trees
      • Bayesian Classifiers
        • Computing conditional probabilities can be prohibitive for many multimedia applications, since the data volume is often large.
        • From the very first paper in the 2002 KDD proceedings: “For massive datasets Bayesian methods still begin by a ‘load data into memory’ step, make compromising assumptions, or resort to subsampling to skirt the issue”.
        • Naïve Bayesian classification is used to minimize computational costs, but can give poor results (compromising assumption!).
        • P-tree technology avoids the need to use Naïve Bayesian classification or subsampling, since conditional probability values derive directly from ANDing P-trees.
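
The last bullet can be sketched as a ratio of two root counts: the rows matching the evidence and the class, over the rows matching the evidence. As before, Python integers used as bit sets stand in for value P-trees, and the names and bit patterns are hypothetical.

```python
def rc(mask):
    """Root count: the number of 1-bits in the bit set."""
    return bin(mask).count("1")


def conditional_probability(class_mask, evidence_masks, n_rows):
    """Estimate P(class | all evidence) from counts of ANDed bit sets.

    Returns None when no row matches the evidence.
    """
    evidence = (1 << n_rows) - 1
    for mask in evidence_masks:
        evidence &= mask
    matches = rc(evidence)
    if matches == 0:
        return None
    return rc(evidence & class_mask) / matches


n_rows = 8
x1_is_v1 = 0b00111100      # rows where attribute 1 has value v1
x2_is_v2 = 0b01101110      # rows where attribute 2 has value v2
class_c = 0b00001111       # rows labelled with class c

p_c_given_evidence = conditional_probability(class_c, [x1_is_v1, x2_is_v2], n_rows)
```

The joint condition is evaluated directly, so no independence assumption between the attributes is made in this estimate.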

  19. Techniques Using P-trees
      • Association Rule Mining (ARM)
        • In most cases multimedia data sets are too large to be mined in reasonable time using existing algorithms.
        • P-ARM, an efficient association rule mining algorithm built on P-tree techniques, has shown significant improvement compared with FP-growth and Apriori.
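
The slide does not describe P-ARM itself, so the following is only a sketch of the underlying counting step under stated assumptions: the support of an itemset is taken as the root count of the AND of its items' bit sets, with Python integers standing in for P-trees. The names and data are hypothetical.

```python
def rc(mask):
    """Root count: the number of 1-bits in the bit set."""
    return bin(mask).count("1")


def support(item_masks, n_rows):
    """Fraction of rows that contain every item in the itemset."""
    combined = (1 << n_rows) - 1
    for mask in item_masks:
        combined &= mask
    return rc(combined) / n_rows


def confidence(antecedent_masks, consequent_masks, n_rows):
    """support(antecedent and consequent) / support(antecedent)."""
    return (support(antecedent_masks + consequent_masks, n_rows)
            / support(antecedent_masks, n_rows))


n_rows = 8
item_a = 0b11110000
item_b = 0b11001100

support_ab = support([item_a, item_b], n_rows)
confidence_a_to_b = confidence([item_a], [item_b], n_rows)
```

Candidate generation and pruning, which a full ARM algorithm needs, are left out of this sketch.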

  20. Techniques Using P-trees
      • KNN and Closed KNN Classifiers
        • KNN classifiers typically have a very high cost associated with rebuilding the classifier when new data arrives (e.g., data streams).
        • The construction of the neighborhood is the high-cost operation.
        • P-tree technology finds closed-KNN neighborhoods quickly.
        • Experimental results have shown that P-tree closed-KNN yields higher classification accuracy as well as significantly higher speed.
        • Our P-KNN algorithm, combined with GAs, earned honorable mention in the 2002 KDD-cup competition (task 2) and actually won one of the two subproblems (“broad classification problem”).
        • KDD-cup-2 data was very much multimedia (hierarchical categorical data, undirected graph data, text data (MEDLINE abstracts)).

  21. Closed-KNN
      [Figure: target pixel T (the black dot) and its neighbors]
      For k = 3, to find the 3rd nearest neighbor, standard KNN arbitrarily selects one point from the boundary as the 3rd neighbor. Closed-KNN includes all points on the boundary.
      Closed-KNN yields surprisingly higher classification accuracy than traditional KNN, and the closed neighborhood is yielded naturally by P-KNN, while traditional KNN requires another full dataset scan to find the closed neighborhood. Therefore, P-KNN is both faster and more accurate.
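
The difference between standard and closed KNN neighborhoods can be shown with a plain scan. This sketch is not P-KNN: it uses ordinary Manhattan distance over a list of points and only illustrates the "include all points on the boundary" rule; the function name and the sample points are hypothetical.

```python
def closed_knn(points, target, k):
    """Return indices of all points no farther than the k-th nearest one.

    Ties at the boundary distance are all kept, so the result can contain
    more than k points.
    """
    distances = sorted(
        (abs(x - target[0]) + abs(y - target[1]), index)
        for index, (x, y) in enumerate(points)
    )
    if k >= len(distances):
        return [index for _, index in distances]
    radius = distances[k - 1][0]          # distance of the k-th neighbor
    return [index for dist, index in distances if dist <= radius]


target = (0, 0)
points = [(1, 0), (0, 2), (3, 0), (0, -3), (2, 1), (5, 5)]

neighbors = closed_knn(points, target, k=3)
```

Here three points tie at the boundary distance, so the closed neighborhood has five members where standard KNN would keep an arbitrary three.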

  22. Performance - Accuracy
      [Figure: 1997 TIFF-Yield dataset. x-axis: training set size (no. of pixels), 256 to 262144; y-axis: accuracy (%), 40 to 80. Series: KNN-Manhattan (L1), KNN-Euclidean (L2), KNN-Max (L∞), KNN-Hobbit (Hi-order basic bit), P-tree: Perfect Centering (closed-KNN), P-tree: Hobbit (closed-KNN).]

  23. Performance - Accuracy (cont.)
      [Figure: 1998 TIFF-Yield dataset. x-axis: training set size (no. of pixels), 256 to 262144; y-axis: accuracy (%), 20 to 65. Series: KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree: Perfect Centering (closed-KNN), P-tree: Hobbit (closed-KNN).]

  24. Performance - Time
      [Figure: 1997 dataset, both axes in logarithmic scale. x-axis: training set size (no. of pixels), 256 to 262144; y-axis: per-sample classification time (sec), 0.00001 to 1. Series: KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree: Perfect Centering (closed-KNN), P-tree: Hobbit (closed-KNN).]

  25. Performance - Time (cont.)
      [Figure: 1998 dataset, both axes in logarithmic scale. x-axis: training set size (no. of pixels), 256 to 262144; y-axis: per-sample classification time (sec), 0.00001 to 1. Series: KNN-Manhattan, KNN-Euclidean, KNN-Max, KNN-Hobbit, P-tree: Perfect Centering (closed-KNN), P-tree: Hobbit (closed-KNN).]

  26. Conclusion
      • One of the major issues of multimedia data mining is the sheer size of the resulting feature space.
      • The P-tree, a data-mining-ready structure, deals with this issue and facilitates efficient data mining of streams.
      • P-tree methods can be faster and more accurate at the same time.

  27. Questions? William.Perrizo@ndsu.nodak.edu. Computer Science Department, North Dakota State University, USA. MDM/KDD2002 Workshop, Edmonton, Alberta, Canada.
