
Presentation Transcript


  1. Data Mining on a Data Warehouse vs. Transaction Processing on a Database. WORKLOAD? (read/write) REPOSITORY? (static/dynamic)
  • C.J. Date, circa 1980: transactions on a DBMS vs. file processing programs on file systems. "Use a DBMS instead of file systems! Unify data resources, centralize control, promote standards and consistency, eliminate redundancy, increase data value and usage, yadda, yadda."
  • Circa 1990: "Buy a separate DW for DM" (separate from your DBMS for TP). The result: 2 separate, quite redundant, non-sharing, inconsistent systems!
  • What happened? A great marketing success! (It sold more hardware and software.) And a great concurrency control R&D failure: we failed to integrate transactions and queries (OLTP and OLAP, i.e., updates and reads) in one system with acceptable performance. Marketing of DWs was so successful, nobody noticed the failure!

  2. OUTLINE
  • Some still hold out hope that DWs and DBs will eventually be unified again. I believe industry will demand it; already there is work on updating DWs. For now, let's just focus on DATA.
  • Data Mining (DM) is just less-structured querying, and on that side of the spectrum you run up against two curses immediately:
  • Curse of cardinality (solutions don't scale with data volume).
  • Curse of dimensionality (solutions don't scale with data dimension).
  • I will talk about techniques we use to address these curses:
  • Horizontal processing of vertically structured data (instead of the ubiquitous vertical processing of horizontal, record-oriented data).
  • Parallelize the DM engine: parallelize the software DM engine on clusters of computers, and parallelize the greyware DM engine on clusters of people (i.e., browser-enable all software for visual data mining).
  • Why do we need data mining? Data volume expands by Parkinson's Law: data volume expands to fill available data storage. Disk storage expands by Moore's Law: available storage doubles every 9 months!

  3. I had to make up these names! Projected data sizes are overrunning our ability to name those sizes. We're awash with data!
  • Network data: hi-speed, DWDM, all-optical (management, flow classification, QoS, security): 10 terabytes by 2004 ~ 10^13 B
  • US EROS Data Center (EDC) archives of Earth Observing System (EOS) Remotely Sensed Imagery (RSI), satellite and aerial photo data: 15 petabytes by 2007 ~ 10^16 B
  • National Virtual Observatory (aggregated astronomical data): 10 exabytes by 2010 ~ 10^19 B
  • Sensor data (including micro- and nano-sensor networks): 10 zettabytes by 2015 ~ 10^22 B
  • WWW (and other text collections): 10 yottabytes by 2020 ~ 10^25 B
  • Genomic/Proteomic/Metabolomic data (microarrays, genechips, genome sequences): 10 gazillabytes by 2030 ~ 10^28 B?
  • Stock market prediction data (prices plus all the above, especially astronomy data?): 10 supragazillabytes by 2040 ~ 10^31 B?
  Useful information must be teased out of these large volumes of data through data mining.

  4. More's Law: More's Less. The more volume you have, the less information you have. (AKA: Shannon's Canon.) A simple illustration: which phone book has more info? (Both involve the same 4 data granules.)

      BOOK-1                 BOOK-2
      Name   Number          Name   Number
      Smith  234-9816        Smith  234-9816
      Jones  231-7237        Smith  231-7237
                             Jones  234-9816
                             Jones  231-7237

  Data mining reduces volume and raises the information level.

  5. Precision Agriculture Data Mining
  • Yield prediction: the dataset consists of an aerial photograph (RGB TIFF image taken during the growing season) and a synchronized yield map (crop yield taken at harvest); thus 4 feature attributes (B, G, R, Y) and ~100,000 pixels.
  • Producers want the association between color intensities and yield. The rule hi_green & low_red → hi_yield is intuitive and could be made and verified without data mining (just querying). DM found a stronger rule, hi_NIR & low_red → hi_yield; now many producers use VIR instead of RGB cameras.
  • Grasshopper infestation prediction (again involving RSI data): grasshoppers cause significant economic loss each year, and early infestation prediction is key to damage control.
  • Pixel classification on remotely sensed imagery holds significant promise for early detection. Pixel classification (signaturing) has many applications: pest detection, forest fire detection, wetlands monitoring, ... (For signaturing we developed the SMILEY software/greyware system: http://midas.cs.ndsu.nodak.edu/~smiley)
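Rules like hi_NIR & low_red → hi_yield are judged by their support and confidence over the pixel table. A minimal sketch of that measurement in Python; the band thresholds, field names, and the two sample pixels are illustrative assumptions, not SMILEY's actual parameters:

```python
def rule_stats(pixels, antecedent, consequent):
    """pixels: iterable of dicts; antecedent/consequent: predicate functions."""
    n = ante = both = 0
    for p in pixels:
        n += 1
        if antecedent(p):
            ante += 1
            if consequent(p):
                both += 1
    support = both / n if n else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence

# Hypothetical 0-255 band intensities and a yield attribute Y:
pixels = [{"NIR": 210, "R": 40, "Y": 95}, {"NIR": 90, "R": 180, "Y": 30}]
s, c = rule_stats(pixels,
                  lambda p: p["NIR"] > 192 and p["R"] < 64,  # hi_NIR & low_red
                  lambda p: p["Y"] > 80)                     # hi_yield
print(f"support={s:.2f}, confidence={c:.2f}")  # support=0.50, confidence=1.00
```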

  6. Sensor Network Data Mining
  • Micro- and nano-scale sensor blocks are being developed for sensing biological agents, chemical agents, motion, coatings deterioration, RF-tagging of inventory, and structural materials fatigue.
  • There will be trillions++ of individual sensors creating mountains of data.
  • The data must be mined for its information.

  7. Sensor Network Application: CubE for Active Situation Replication (CEASR)
  • Nano-sensors are dropped from a carrier into the situation space. Wherever a threshold level is sensed (chemical, biological, thermal, ...), a ping is registered in one compressed Ptree for that location: each energized nano-sensor transmits a ping, the location is triangulated from the ping, and these locations are translated to 3-dimensional coordinates at the display.
  • Using Alien Technology's Fluidic Self-Assembly (FSA) technology, clear layers are laminated into a cube, with an embedded nano-LED at each voxel. The Ptree is transmitted to the cube, where the pattern is reconstructed (uncompress the Ptree, display on the cube): the corresponding voxel on the display lights up.
  • This is the expendable, one-time, cheap sensor version. A more sophisticated CEASR device could sense and transmit intensity levels, lighting up each display voxel with the same intensity.
  • The soldier sees a replica of the sensed situation prior to entering the space.

  8. Anthropology Application: Digital Archive Network for Anthropology (DANA). (Data mine anthropological artifacts: shape, color, discovery location, ...)

  9. The data mining process: a raw database passes through Data Cleaning/Integration (missing data, outliers, noise, errors) into a Data Warehouse (cleaned, integrated, read-only, periodic, historical); Selection (feature extraction, tuple selection) yields the task-relevant data ("smart files"); Data Mining proper (OLAP, Classification, Clustering, ARM) produces patterns; Pattern Evaluation and Assay feeds Visualization, with loop-backs at every stage.
  • Querying is asking specific questions and expecting specific answers.
  • Data Mining goes into MOUNTAINS of raw data for information gems. But it also finds fool's gold; relevance analysis assays the information and knowledge gems.

  10. Data Mining versus Querying. There is a whole spectrum of techniques for getting information from data, running from standard querying (SQL SELECT-FROM-WHERE), through complex queries (nested, EXISTS, ...), fuzzy queries, search engines and BLAST searches, OLAP (rollup, drilldown, slice/dice, ...), searching and aggregating, and data prospecting, to machine learning and data mining proper: association rule mining, supervised learning (classification, regression), unsupervised learning (clustering), fractals, ...
  • On the query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record '02).
  • On the data mining end, the surface has barely been scratched. But even those scratches had a great impact: one of the early scratchers (Walmart) became the biggest corporation in the world last year, while a non-scratcher (Kmart) filed for bankruptcy.
  • Our approach: vertical, compressed data structures, Predicate-trees or Peano-trees (Ptrees in either case)¹, processed horizontally (DBMSs process horizontal data vertically). Ptrees are data-mining-ready, compressed data structures that attempt to address the curses of cardinality and dimensionality.
  ¹ Ptree technology is patent pending by North Dakota State University.

  11. Vertical Data Structures History
  • In the 1980's, vertical data structures were proposed for record-based workloads:
  • Decomposition Storage Model (DSM, Copeland et al)
  • Attribute Transposed File (ATF)
  • Bit Transposed File (BTF, Wang et al); Viper
  • Band Sequential Format (BSQ) for Remotely Sensed Imagery
  • The DSM and BTF initiatives have disappeared. Why? (next slide)
  • Vertical auxiliary and system structures:
  • Domain & Request Vectors (DVA/ROLL/ROCC, Perrizo, Shi, et al): vertical system structures for query optimization & synchronization
  • Bit-Mapped Indexes (BMIs, very popular in Data Warehouses): all indexes are really vertical auxiliary structures; BMIs use bit maps (a positional approach to identifying records), while other indexes use RID lists (a keyword or value approach)

  12. Current practice: structure data into horizontal records and process them vertically (scans). Predicate tree technology instead: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic Ptree, and AND those Ptrees horizontally.
  The running example is the relation R(A1 A2 A3 A4) with 8 tuples of 3-bit attributes: (2,7,6,1), (3,7,6,0), (2,7,5,1), (2,7,5,7), (5,2,1,4), (2,2,1,5), (7,0,1,4), (7,0,1,4). Bit-slicing the binary form (010 111 110 001, 011 111 110 000, ...) gives 12 vertical bit files R11, R12, R13, ..., R41, R42, R43, where Rij is the j-th bit slice of attribute Ai. E.g., R11, the high-order bit slice of A1, is 0 0 0 0 1 0 1 1.
  Top-down construction of the 1-dimensional Ptree representation of R11, denoted P11, records the truth of the universal predicate "pure 1" in a tree, recursing on halves until purity is achieved:
  1. Is the whole slice pure1? false → 0.
  2. Is the left half (0 0 0 0) pure1? false → 0; but it is pure (pure0), so this branch ends.
  3. Is the right half (1 0 1 1) pure1? false → 0.
  4. Is the left half of the right half (1 0) pure1? false → 0; its halves, 1 and 0, are recorded and both branches end.
  5. Is the right half of the right half (1 1) pure1? true → 1, and since it is pure the branch ends.
  To count occurrences of (7,0,1,4), i.e., of the bit pattern 111 000 001 100, horizontally AND the basic Ptrees (primes denote complements): P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43. The root count of the result is 2 (the last two tuples match).
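A minimal sketch of the top-down construction and of reading a count off the result, using plain nested Python tuples for tree nodes (an illustration of the recursive-halving idea, not NDSU's patented implementation):

```python
def build_p1tree(bits):
    """Return 1 for a pure-1 segment, 0 for pure-0,
    or (0, left, right) for a mixed segment split in half."""
    if all(bits):
        return 1
    if not any(bits):
        return 0
    half = len(bits) // 2
    return (0, build_p1tree(bits[:half]), build_p1tree(bits[half:]))

def root_count(tree, size):
    """Number of 1-bits represented, computed without touching raw bits."""
    if tree == 1:
        return size
    if tree == 0:
        return 0
    _, left, right = tree
    return root_count(left, size // 2) + root_count(right, size // 2)

R11 = [0, 0, 0, 0, 1, 0, 1, 1]   # the high-order bit slice of A1 from the slide
P11 = build_p1tree(R11)
print(P11)                 # (0, 0, (0, (0, 1, 0), 1))
print(root_count(P11, 8))  # 3
```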

  13. Top-down construction of basic Ptrees is best for understanding, but bottom-up is much more efficient: bottom-up construction of the 1-dimensional P11 uses an in-order traversal of the leaf segments, collapsing pure siblings as it goes, so no fully expanded tree is ever materialized.
  Ptrees extend naturally to 2 dimensions (the natural dimension choice for images) by recursive quartering in Peano order. E.g., the high-order bit file of the Green band of an 8x8 image,
  1111110011111000111111001111111011110000111100001111000001110000
  is given here in spatial raster order; rearranged into 2-D Peano order, it becomes the leaf sequence of a 2-D Ptree.
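The raster-to-Peano rearrangement is just bit interleaving of the row and column coordinates. A small sketch (taking the row bit as the higher bit of each pair is an assumption; either convention gives a valid Peano/Z order):

```python
def peano_index(row, col, levels=3):
    """Interleave the bits of row and col (row bit first) into a Z-order index."""
    z = 0
    for i in reversed(range(levels)):
        z = (z << 2) | (((row >> i) & 1) << 1) | ((col >> i) & 1)
    return z

# The 8x8 Green-band bit file from this slide, in raster order:
raster = "1111110011111000111111001111111011110000111100001111000001110000"
peano = [""] * 64
for r in range(8):
    for c in range(8):
        peano[peano_index(r, c)] = raster[8 * r + c]
print("".join(peano))  # the same 64 bits, reordered for recursive quartering
```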

  14. Count tree (an alternative)? Counts are needed in data mining, but basic Ptrees are more compressed, and counts can be produced quickly from them. One can, however, construct a Count-tree in which each internal node counts the 1s in its quadrant:
  • Nodes are addressed by Quadrant ID (QID), e.g., 2.2.3: the example pixel (7,1) = (111, 001) has Peano address 10.10.11, i.e., QID 2.2.3.
  • In the example the root count is 55; the tree levels are 3, 2, 1, 0, and a node at level k is pure1 exactly when its count reaches 4^k (4^3, 4^2, 4^1, 4^0 respectively).
  • Fan-out = 2^dim = 2^2 = 4.
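A sketch of such a Count-tree over a Peano-ordered bit string with fan-out 4, collapsing pure quadrants exactly as the slide describes (the 4x4 example data is made up):

```python
def build_count_tree(bits, level):
    """bits: Peano-ordered 0/1 list of length 4**level. Each node is a
    (count, children) pair; children is None when the quadrant is pure."""
    count = sum(bits)
    if count in (0, len(bits)):              # pure0 or pure1: branch ends
        return (count, None)
    q = len(bits) // 4
    kids = [build_count_tree(bits[i*q:(i+1)*q], level - 1) for i in range(4)]
    return (count, kids)

bits = [1,1,1,0, 1,1,1,1, 0,0,0,0, 1,0,1,1]  # a 4x4 image in Peano order
root = build_count_tree(bits, 2)
print(root[0])   # 10, the root count
```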

  15. 3-Dimensional Ptrees (e.g., for the CEASR sensor network).

  16. Logical Operations on Ptrees (used to get counts of any pattern)
  • Ptree dimension is a user parameter and can be chosen to fit the data; the default is 1-D Ptrees (recursive halving). Images fit 2-D Ptrees (recursive quartering); 3-D solids fit 3-D Ptrees (recursive eighth-ing). Or the dimension can be chosen on other grounds (optimize compression, increase processing speed).
  • Ptree AND is faster than bit-by-bit AND since any pure0 operand node means the result node is pure0; e.g., in the slide's example only quadrant 2 needs to be loaded to AND Ptree1 and Ptree2. The more operands there are in the AND, the greater the benefit of this shortcut (more pure0 nodes).
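A sketch of the pure0 shortcut, reusing the nested-tuple Ptree form from the slide-12 sketch (1 = pure1, 0 = pure0, (0, left, right) = mixed); the second operand here is made up:

```python
def ptree_and(a, b):
    if a == 0 or b == 0:
        return 0                      # pure0 shortcut: skip the whole subtree
    if a == 1:
        return b                      # pure1 is the AND identity
    if b == 1:
        return a
    _, al, ar = a
    _, bl, br = b
    left, right = ptree_and(al, bl), ptree_and(ar, br)
    if left == 0 and right == 0:      # re-collapse if the result is pure
        return 0
    if left == 1 and right == 1:
        return 1
    return (0, left, right)

p1 = (0, 0, (0, (0, 1, 0), 1))        # P11 from the earlier sketch: 00001011
p2 = (0, (0, 0, 1), 1)                # made-up second operand:      00111111
print(ptree_and(p1, p2))              # (0, 0, (0, (0, 1, 0), 1)), i.e. 00001011
```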

  17. Horizontal Processing of Vertical Structures for Record-based Workloads (illustrated on the same R(A1 A2 A3 A4) and its bit slices R11, ..., R43):
  • For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post-processing.
  • For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result, where there is no reconstructive post-processing.

  18. Architecture for the DataMIME™ system (DataMIME™ = data mining, NO NOISE; PDMS = P-tree Data Mining System). Over the Internet, YOUR DATA enters through the Data Integration Language (DIL) into the DII (Data Integration Interface), and YOUR DATA MINING requests go through the Ptree (Predicates) Query Language (PQL) to the DMI (Data Mining Interface). The data repository underneath is a lossless, compressed, distributed, vertically-structured database.

  19. Generalized Raster and Peano Sorting: both generalize from images to any table with numeric attributes, starting from an unsorted relation.
  • Raster sorting: attributes first, bit position second.
  • Peano sorting: bit position first, attributes second.
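A sketch of the two sort orders on an arbitrary numeric table: raster order is plain lexicographic sorting, while generalized Peano order interleaves one bit plane at a time across all attributes (the 3-bit example rows are taken from the slide-12 relation):

```python
def peano_key(row, bits=8):
    """Interleave attribute bits, high-order bit plane first (Peano sorting:
    bit position first, attributes second)."""
    key = 0
    for i in reversed(range(bits)):   # bit plane, high to low
        for v in row:                 # then each attribute in order
            key = (key << 1) | ((v >> i) & 1)
    return key

table = [(2, 7, 6, 1), (7, 0, 1, 4), (2, 2, 1, 5), (5, 2, 1, 4)]
print(sorted(table))                                      # raster order
print(sorted(table, key=lambda r: peano_key(r, bits=3)))  # Peano order
```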

  20. [Chart: Generalized Peano sorting KNN speed improvement on 5 UCI Machine Learning Repository data sets (crop, adult, spam, function, mushroom); y-axis is time in seconds (0 to 120), comparing unsorted, generalized raster sorted, and generalized Peano sorted data.]

  21. Astronomy Application (National Virtual Observatory data): what Ptree dimension and what ordering should be used for astronomical data, where all bodies are assumed to lie on the surface of the celestial sphere (which shares its equatorial plane with the earth and has no specified radius)? The candidates:
  • Hierarchical Triangle Mesh tree (HTM-tree), which seems to be the accepted standard.
  • Peano Triangle Mesh tree (PTM-tree).
  • Peano Celestial Coordinate tree (PCC-tree), over RA = Right Ascension (the longitudinal angle) and dec = declination (the latitude angle).
  PTM is similar to the HTM used in the Sloan Digital Sky Survey project. In both, the sphere is divided into triangles whose sides are always great-circle segments. PTM differs from HTM in the way the triangles are ordered.

  22. [Figure: the first two levels of the triangulation, labeled 1; 1,0; 1,1; 1,2; 1,3; 1,1,0 through 1,3,3, shown side by side under the PTM-tree ordering and the HTM ordering.] The difference between HTM and PTM-trees is in the ordering. Why use a different ordering? (next slide)

  23. PTM Triangulation of the Celestial Sphere. The following ordering produces a sphere-surface-filling curve with good continuity characteristics at each level:
  • Start with an equilateral triangle (90° sector) bounded by longitudinal and equatorial line segments.
  • Traverse the next level of triangulation alternating left-turn, right-turn, left-turn, right-turn, ...
  • Traverse the southern hemisphere in the reverse direction (the identical pattern pushed down instead of pulled up), arriving at the southern neighbor of the start point.

  24. [Figure: PTM-triangulation at the next level; each triangle is refined with the alternating LRLR / RLRL traversal pattern of the previous slide.]

  25. Peano Celestial Coordinates. Unlike PTM-trees, which initially partition the sphere into the 8 faces of an octahedron, in the PCC-tree scheme the sphere is transformed to a cylinder, then into a rectangle, and then standard Peano ordering is used on the celestial coordinates: Right Ascension (RA) runs from 0° to 360° and declination (dec) from -90° to 90°. Sphere → Cylinder → Plane.
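A sketch of dropping a body into a PCC-tree cell: quantize (RA, dec) onto a 2^levels x 2^levels grid over the unrolled rectangle and take the Peano (Z) index; the grid resolution is an illustrative assumption:

```python
def pcc_index(ra_deg, dec_deg, levels=10):
    """Map RA in [0, 360) and dec in [-90, 90] to a Peano-order cell index."""
    n = 1 << levels
    x = min(int(ra_deg / 360.0 * n), n - 1)            # column from RA
    y = min(int((dec_deg + 90.0) / 180.0 * n), n - 1)  # row from dec
    z = 0
    for i in reversed(range(levels)):                  # interleave the bits
        z = (z << 2) | (((y >> i) & 1) << 1) | ((x >> i) & 1)
    return z

print(pcc_index(180.0, 0.0))   # cell index of a point on the celestial equator
```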

  26. PUBLIC (Ptree Unified BioLogical InformatiCs Data Cube and Dimension Tables):
  • Gene dimension table (g0..g3) with attributes such as SubCell-Location (Myta, Ribo, Nucl, Ribo), Function (apop, meio, mito, apop), StopCodonDensity (.1, .1, .1, .9) and PolyA-Tail (1, 1, 0, 0).
  • Organism dimension table: o0 human (Homo sapiens, Vert=1, genome size 3000 million bp), o1 fly (Drosophila melanogaster, 0, 185), o2 yeast (Saccharomyces cerevisiae, 0, 12.1), o3 mouse (Mus musculus, 1, 3000).
  • Experiment dimension table (MIAME): LAB, PI, UNV, STR, CTY, STZ, ED, AD, S, H, M, N.
  • A Gene-Organism dimension table (chromosome, length) and the Gene-Experiment-Organism cube, whose cell is 1 iff that gene from that organism expresses at a threshold level in that experiment: a many-to-many-to-many relationship.

  27. The original gene dimension table is recast as a Boolean (binary) gene dimension table: each categorical attribute (SubCell-Location: Myta, Ribo, Nucl; Function: apop, meio, mito) becomes one bitmap column per value, StopCodonDensity is bit-sliced into columns SCD1..SCD4, and PolyA-Tail is already binary. A protein-protein interaction pyramid over g0..g3 completes the dimension.

  28. Association for Computing Machinery KDD-Cup-02: the NDSU team.

  29. Network Security Application (network security through vertically structured data)
  • Network layers do their own partitioning (packets, frames, etc., usually independent of any intrinsic data structuring such as record structure): fragmentation/reassembly, segmentation/reassembly.
  • Data privacy is compromised when the horizontal (stream) message content is eavesdropped upon at the reassembled level (in the network).
  • A standard solution is to host-encrypt the horizontal structure so that any network-reassembled message is meaningless.
  • Alternative: vertically structure (decompose, partition) the data, e.g., into basic Ptrees. Send one Ptree per packet, send intra-message packets separately, and trick flow classifiers into thinking the multiple packets associated with a particular message are unrelated. The message is only meaningful after destination demux-ing.
  • Note: the only basic Ptree that holds actual information on its own is the high-order-bit Ptree. Therefore encrypt it!
  • There ought to be a whole range of killer ideas associated with using vertically structured data within network transmission units. Active networking? (AND basic Ptrees, or just certain levels of them, at active network nodes?)

  30. Nearest Neighbor Classification (AKA regression, case-based reasoning) is the most common method of data mining. We are given a table R(A1,...,An,C), where C is chosen as the class attribute of interest (e.g., in homeland security data mining, C={terrorist, non-terrorist}; in precision agriculture, C={low_yield, medium_yield, hi_yield}; in network flow classification, C={flow-1, flow-2, ...}; in network virus classification, C={DoS attack, SYN-flooding attack, ...}; in cancer research, C={cancerous, non-cancerous}) and the Ai are feature attributes on which the classification decision is based.
  NN classification amounts to using a training dataset (such as R above) and a distance or similarity function on the attributes A1,...,An to decide the best prediction of the class label for a new tuple a=(a1,...,an) that does not have a known class, by letting the training rows closest to a vote (this assumes the class labels are continuous in A1,...,An).
  That is, for the homeland security application, the NN classifier will classify an unknown individual I as a potential terrorist iff the known individuals whose characteristics (history, nationality, ...) are close to I's are predominantly terrorists. In precision agriculture, using last year's data as training data, we take an aerial photo of a field mid-growing-season; for a given point p in the field, we find the points in last year's data that are the nearest match on colors (e.g., R, G, B, NIR, ...) and let them vote on the probable yield to be expected at p this year. In network flow classification, we examine header fields to find near matches with packets that have been assigned to a given flow in the past. In virus classification, if a message or flow has nearly the same characteristics as previously identified attacks of a particular class, we reject it. In cancer research, we judge a gene to be cancer-causing if nearly the same expression patterns have been exhibited predominantly by known cancer-causing genes in past studies.
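A minimal sketch of the plain (non-Ptree) kNN vote used in the worked 3NN example on the next slides: Hamming distance on the relevant binary attributes, majority class among the k closest rows (the toy training rows are made up):

```python
from collections import Counter

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

def knn_classify(training, sample, k=3):
    """training: list of (features_tuple, class_label); sample: features_tuple."""
    ranked = sorted(training, key=lambda row: hamming(row[0], sample))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data in the style of the next slide (6 relevant binary attributes):
training = [((0, 0, 0, 1, 1, 0), 1), ((0, 0, 0, 1, 0, 0), 1),
            ((0, 0, 0, 1, 0, 0), 0), ((1, 1, 1, 0, 1, 0), 0)]
print(knn_classify(training, (0, 0, 0, 0, 0, 0), k=3))  # -> 1
```

The arbitrary tie-breaking at the cutoff distance, here hidden in the sort, is exactly what the closed-kNN variant on the following slides removes.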

  31. Suppose we consider a10 as the class attribute, C, and suppose we know that only a5, a6, a11, a12, a13, a14 are relevant to this classification. An unclassified sample with (a5 a6 a11 a12 a13 a14) = (0 0 0 0 0 0) has to be classified (we need a prediction of its most likely a10 value). In 3-Nearest-Neighbor (3NN) classification, we look for the 3 nearest rows in the training table below, count the occurrences of each a10 value among them, and let the predominant class be the prediction.

  Key  a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
  t12   1  0  1  0  0  0  1  1  0   1    0   1   1   0   1   1   0   0   0   1
  t13   1  0  1  0  0  0  1  1  0   1    0   1   0   0   1   0   0   0   1   1
  t15   1  0  1  0  0  0  1  1  0   1    0   1   0   1   0   0   1   1   0   0
  t16   1  0  1  0  0  0  1  1  0   1    1   0   1   0   1   0   0   0   1   0
  t21   0  1  1  0  1  1  0  0  0   1    1   0   1   0   0   0   1   1   0   1
  t27   0  1  1  0  1  1  0  0  0   1    0   0   1   1   0   0   1   1   0   0
  t31   0  1  0  0  1  0  0  0  1   1    1   0   1   0   0   0   1   1   0   1
  t32   0  1  0  0  1  0  0  0  1   1    0   1   1   0   1   1   0   0   0   1
  t33   0  1  0  0  1  0  0  0  1   1    0   1   0   0   1   0   0   0   1   1
  t35   0  1  0  0  1  0  0  0  1   1    0   1   0   1   0   0   1   1   0   0
  t51   0  1  0  1  0  0  1  1  0   0    1   0   1   0   0   0   1   1   0   1
  t53   0  1  0  1  0  0  1  1  0   0    0   1   0   0   1   0   0   0   1   1
  t55   0  1  0  1  0  0  1  1  0   0    0   1   0   1   0   0   1   1   0   0
  t57   0  1  0  1  0  0  1  1  0   0    0   0   1   1   0   0   1   1   0   0
  t61   1  0  1  0  1  0  0  0  1   0    1   0   1   0   0   0   1   1   0   1
  t72   0  0  1  1  0  0  1  1  0   0    0   1   1   0   1   1   0   0   0   1
  t75   0  0  1  1  0  0  1  1  0   0    0   1   0   1   0   0   1   1   0   0

  32. For the unclassified sample a, with (a5 a6 a11 a12 a13 a14) = (0 0 0 0 0 0), we scan for the 3 nearest neighbors (Hamming distance on the 6 relevant attributes). The first three rows seed the 3NN set:

  Key  a5 a6  a10=C  a11 a12 a13 a14  distance
  t12   0  0    1     0   1   1   0      2
  t13   0  0    1     0   1   0   0      1
  t15   0  0    1     0   1   0   1      2

  Scanning the remaining rows, every tuple at distance 2, 3, or 4 fails to improve the set (don't replace), until t53 arrives at d=1 and replaces t15. The 3NN set after the 1st scan is {t12, t13, t53}, so C=1 wins!
  Note that only 1 of the many training tuples at distance 2 from the sample got to vote. We didn't know that distance=2 was going to be the vote cutoff until the end of the 1st scan. Finding the other distance=2 voters (the Closed 3NN set, or C3NN) requires another scan.

  33. A 2nd scan finds the Closed 3NN set (C3NN set) for the unclassified sample. Does it change the vote? YES!
  The 3NN set after the 1st scan was t12 (d=2, C=1), t13 (d=1, C=1) and t53 (d=1, C=0), a vote for C=1. The 2nd scan includes every other training tuple within the distance-2 cutoff as a voter: t15, t16, t33, t51, t55, t57, t72 and t75 (all at d=2) are included, while t21, t27, t31, t32, t35 and t61 (d=3 or 4) are not. Over the full closed neighborhood the tally is 5 votes for C=1 and 6 for C=0, so C=0 wins.

  34. Find the Closed 3NN set using Ptrees: let all training points in D(s,0) (the disk about sample s of radius 0) vote first; if there are at least 3 of them, done; else go to D(s,1), etc. Each disk is obtained by constructing the sample's tuple Ptree, Ps, and ANDing it with PC and PC' (on the slide, black denotes an attribute complement and red an uncomplemented attribute). In the example, D(s,0) is empty (Ps has no 1-bits), so we proceed to S(s,1), the sphere of radius 1 about s.

  35. S(s,1): construct the Ptree PS(s,1) = OR_i Pi, where Pi = P(|si-ti|=1; |sj-tj|=0, j≠i) = OR_i P(S(si,1) ∧ S(sj,0) for all j≠i), with i ranging over the relevant attributes {5,6,11,12,13,14} and j over {5,6,11,12,13,14}-{i}. Since the sample is all zeros, each Pi is the AND of the complemented bit slice for attribute i with the uncomplemented slices for the other five (again, black = complement, red = uncomplemented on the slide). The resulting PD(s,1) has 1-bits exactly at t13 and t53, so the radius-1 disk holds only 2 voters and we must expand again.

  36. D(s,2): construct the Ptree PD(s,2) = OR of all double-dimension interval Ptrees: PD(s,2) = OR_{i,j} Pi,j, where Pi,j = P(S(si,1) ∧ S(sj,1) ∧ S(sk,0)), with i,j ranging over {5,6,11,12,13,14} and k over {5,6,11,12,13,14}-{i,j}. The fifteen operands are P5,6, P5,11, P5,12, P5,13, P5,14, P6,11, P6,12, P6,13, P6,14, P11,12, P11,13, P11,14, P12,13, P12,14 and P13,14.
  At this point we have 3 nearest neighbors, and we could quit and declare C=1 the winner. But we now also have the closed 3-neighborhood, and over it we declare C=0 the winner!
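A sketch of the whole closed-kNN loop in the spirit of slides 34-36, but over plain Python integer bitmaps rather than compressed Ptrees: ANDing attribute-match masks plays the role of the Ptree ANDs, and each Hamming ring D(s,r) is ORed in whole, so ties at the cutoff all get to vote:

```python
from itertools import combinations
from collections import Counter

def closed_knn(slices, classes, sample, k=3):
    """slices: {attr: integer bitmap with bit r set iff training row r has a 1
    in that attribute}; classes: class labels by row; sample: {attr: 0/1}.
    Expand the Hamming radius until at least k voters are in, keeping whole rings."""
    n_rows = len(classes)
    full = (1 << n_rows) - 1
    # match[a]: bitmap of rows that agree with the sample on attribute a
    match = {a: (m if sample[a] else ~m & full) for a, m in slices.items()}
    attrs = list(slices)
    voters = 0
    for radius in range(len(attrs) + 1):
        for flipped in combinations(attrs, radius):  # rows differing exactly on 'flipped'
            m = full
            for a in attrs:
                m &= (match[a] ^ full) if a in flipped else match[a]
            voters |= m
        if bin(voters).count("1") >= k:              # closed: the whole ring voted
            break
    votes = Counter(classes[r] for r in range(n_rows) if voters >> r & 1)
    return votes.most_common(1)[0][0]

# Toy usage: 4 training rows, 3 attributes, sample all zeros -> class 0 wins.
slices = {"a1": 0b0011, "a2": 0b0101, "a3": 0b1001}   # row 0 is the low bit
classes = [1, 1, 0, 0]
print(closed_knn(slices, classes, {"a1": 0, "a2": 0, "a3": 0}, k=3))  # 0
```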

  37. Justification for using vertical structures (once again), illustrated on the same R(A1 A2 A3 A4) and its bit slices:
  • For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post-processing.
  • For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result (e.g., a histogram), where there is no reconstructive post-processing and the actual data records need never be involved.

  38. Appendix: Run Lists, another way to handle vertical data. These generalize Ptrees using standard run-length compression of the vertical bit files (alternatively, Lempel-Ziv? Golomb? other?). A Run List records the purity type and start offset of each pure run. E.g., for R11 = 0 0 0 0 1 0 1 1:
  1. The 1st run is pure0 → 0:000 (truth : start offset).
  2. The 2nd run is pure1 → 1:100.
  3. The 3rd run is pure0 → 0:101.
  4. The 4th run is pure1 → 1:110.
  So RL11 = 0:000 1:100 0:101 1:110 (to complement a Run List, flip the purity bits). E.g., to count occurrences of 111 000 001 100, use "pure111000001100": RL11 ^ RL12 ^ RL13 ^ RL'21 ^ RL'22 ^ RL'23 ^ RL'31 ^ RL'32 ^ RL33 ^ RL41 ^ RL'42 ^ RL'43.
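A sketch of run lists as (purity, start-offset) pairs and their AND; for brevity the AND expands to bits internally, whereas a production version would merge run boundaries directly (RL12 below is made up):

```python
def runs_to_bits(rl, length):
    """Expand a run list [(purity, start), ...] back into a bit list."""
    bits = []
    for (p, s), (_, nxt) in zip(rl, rl[1:] + [(None, length)]):
        bits.extend([p] * (nxt - s))
    return bits

def rl_and(a, b, length):
    """AND two run lists (via bit expansion for brevity; a production version
    would sweep the merged run boundaries without expanding)."""
    bits = [x & y for x, y in zip(runs_to_bits(a, length), runs_to_bits(b, length))]
    out = []
    for i, bit in enumerate(bits):
        if not out or out[-1][0] != bit:
            out.append((bit, i))
    return out

RL11 = [(0, 0), (1, 4), (0, 5), (1, 6)]          # R11 = 0 0 0 0 1 0 1 1
RL12 = [(1, 0), (0, 2), (1, 4), (0, 6), (1, 7)]  # a made-up second operand
print(rl_and(RL11, RL12, 8))  # [(0, 0), (1, 4), (0, 5), (1, 7)]
```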

  39. RunListTrees (RLtrees)? To facilitate subsetting (isolating a subset) and processing, a Ptree structure can be constructed over the run list, recording at each node the truth of a predicate on the corresponding half of the bit file. E.g., for R11 = 0 0 0 0 1 0 1 1 (RL11 = 0:000 1:100 0:101 1:110), a Pure1 tree answers "is this half pure1?" at each node, while a separate NotPure0 index tree answers "does this half contain any 1?" (trees could be terminated at any level). To AND, first AND the NP0trees; only the 1-branches of the result need ANDing through list scans, and the more operands there are, the fewer 1-branches survive.

  40. Other indexes on Run Lists: we could put Pure0-Run, Pure1-Run, and even Mixed-Run (or LZV-Run) RunListIndexes on a Run List. E.g., for R11 = 0 0 0 0 1 0 1 1, a Pure0-run index (P0RI11) records the start and length of each pure0 run (000:4, 101:1), a Pure1-run index (P1RI11) records each pure1 run (100:1, 110:2), and a pattern index (PLZVRI11) can record a repeating pattern together with the number of consecutive replicas of that pattern.

  41. Best Pure1 tree implementation? (My guess, circa 04 Jan.) For n-D Pure1 trees:
  • At any node, if the number of 1-bits in the tuple set represented by the node is below a lower threshold, LT, then that node simply holds the 1List, the list of 1-bit positions (use a 0-bit if the count is 0), and has no children.
  • Else, if the tuple set represented by that node is below an upper threshold, UT = 2^nm, leave the bit slice uncompressed.
  Building such Ptrees bottom-up, using in-order ordering:
  • If the 1-count of the next UT-segment ≥ LT, install a P-sequence; else install a 1List.
  • If the current UT-segment node is numbered k·(2^n - 1), and it and all 2^n - 1 predecessors are 1Lists, and the cardinality of the union of those 1Lists is < LT, install the union in the parent node. Recurse this collapsing process upward to the root.
  Building such Ptrees top-down:
  • For datasets larger than UT, recursively build down the pure1 tree.
  • If ever a node has fewer than LT 1-bits, install the 1List and terminate that branch.
  • At the level where the represented tuple set = UT, install a 1List if the number of 1-bits is < LT, else install a P-sequence.
  Notes: this method should extend well to data streams. When the data stream exceeds the current allocation (which, for n-D Ptrees, will be a power of 2^n), just grow the root of each Ptree to a new level, using the current Ptree as node 0. Nodes 1, 2, 3, ..., 2^n - 1 of the new Ptree start as 0-nodes without children and grow as 1Lists until LT is reached, at which point they are converted to P-sequences.

  42. Ptrees are vertical, compressed, lossless structures that facilitate fast horizontal AND-processing. The jury is still out on parallelization: vertical (by relation), horizontal (by tree node), or some combination? Horizontal parallelization is pretty, but network multicast overhead is huge. Use active networking? Clusters of PlayStations?...
  Formally, P-trees can be defined as any of the following:
  • Partition-tree: a tree of nested partitions (a partition P(R) = {C1..Cn}; each component is partitioned by P(Ci) = {Ci,1..Ci,ni}, i = 1..n; each of those is partitioned by P(Ci,j) = {Ci,j,1..Ci,j,nij}; ...):

          R
        /  ...  \
      C1   ...   Cn
     / ... \    / ... \
  C1,1...C1,n1 Cn,1...Cn,nn

  • Predicate-tree: for a predicate on the leaf nodes of a partition-tree (which also induces predicates on internal nodes via quantifiers). Predicate-tree nodes can be truth values (Boolean P-tree), with the predicate quantified existentially (1 or a threshold %) or universally; or the nodes can count the number of true leaf children of that component (Count P-tree).
  • Purity-tree: a universally quantified Boolean Predicate-tree (e.g., if the predicate is "= 1", a Pure1-tree or P1tree): a 1-bit at a node iff the corresponding component is pure1 (universally quantified). There are many other useful predicates, e.g., NotPure0-trees, but we will focus on P1trees.
  All Ptrees shown so far were 1-dimensional (recursively partitioning bit files by halving), but they can be 2-D (recursive quartering, e.g., for 2-D images), 3-D (recursive eighth-ing), ..., or based on purity runs or LZW runs or ...
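A sketch of how the Count P-tree and Pure1-tree flavors relate over the simplest nested partition (recursive halving of a bit list): the purity truth value is just "count equals component size", which is why storing counts in a Purity-tree would be redundant (next slide):

```python
def count_tree(bits):
    """Count P-tree: (count, left, right) nested tuples; a leaf is its bit."""
    if len(bits) == 1:
        return bits[0]
    h = len(bits) // 2
    left, right = count_tree(bits[:h]), count_tree(bits[h:])
    def total(t):
        return t if isinstance(t, int) else t[0]
    return (total(left) + total(right), left, right)

def pure1(node, size):
    """Read a Pure1-tree truth value off a Count P-tree node."""
    return (node if isinstance(node, int) else node[0]) == size

bits = [0, 0, 0, 0, 1, 0, 1, 1]      # R11 again
t = count_tree(bits)
print(t[0], pure1(t, 8))             # 3 False
```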

  43. Ptrees continued. Further observations about Ptrees:
  • Partition-trees have set nodes.
  • Predicate-trees have either Boolean nodes (Boolean P-tree) or count nodes (Count P-tree).
  • Purity-trees, being universally quantified Boolean Predicate-trees, have Boolean nodes (since the count is always the "full" count of leaves, expressing Purity-trees as Count-trees is redundant).
  • A Partition-tree can be sliced at a given level if each partition at that level is labeled with the very same label set (e.g., the Month partition of Years).
  • A Partition-tree can be generalized to a Set-graph when the siblings of a node do not form a partition.
