Spatial-enabled Mining in Oracle

Ravi Kothuri, Spatial Technologies, Oracle USA

Oracle10g Spatial
Oracle Spatial: Store, Analyze and Visualize Spatial Data
• Spatial Data Types: Vector (feature/topological), Raster, Network types, Versioning
• Spatial Relationships
• Route Computation
• Raster Manipulation
• Visualization: Mapviewer
Scalability & Seamless Integration for Spatial Data

Oracle Spatial: Future Projects
• 3-D
• Extensions to SDO_GEOMETRY
• Composite Surface and Composite/Multi-Solid
• Support for different operators: Anyinteract, Filter, NN, Within_distance
• Scalable storage and management of PointCloud data: partitioning and visibility query (LOD)
• TIN generation: need to experiment with a variety of approaches
• Intelligent Map Caching, WFS,…

Oracle Data Mining
• Preprocessing, data clean-up: a number of transformations, normalization functions
• Binning, Spatial Binning,…
• Data Mining Functions:
  • Classification: Decision Trees, Adaptive Bayes,…
  • Clustering: KMeans, KModes, Oracle-specific
    • Spatial: BIRCH+Agglomerative Clustering
  • Association Rules: Apriori
  • Regression: SVM with linear kernel and more…
Robust Framework for Mining Data in Oracle

Spatial Data Mining
• Spatial data mining: where result patterns have a spatial component
  • Clustering
  • Colocation of data items
• Spatial-enabled mining: include spatial information in data mining
  • Information is implicit (not materialized)
  • What information to materialize?
  • Spatial correlation with target data (e.g., habitats of birds)
  • Spatial auto-correlation in regression
  • Target variable: $Y = aX + \rho W Y$, where $\rho$ is the spatial autocorrelation and $W$ is the neighborhood matrix
  • First step: materialize target variable estimates
• How to incorporate spatial auto-correlation
  • Materialize spatial information and estimates as additional attributes
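To make the term W Y in the regression formula concrete, the sketch below computes a spatial lag: for each region, the average of the target variable over its neighbors. It assumes a row-normalized 0/1 adjacency matrix as the neighborhood matrix, which is one common choice and not necessarily the one used in this work. The function names and the three-region data are hypothetical.

```python
def row_normalize(adjacency):
    """Scale each row of a 0/1 adjacency matrix so that it sums to 1."""
    normalized = []
    for row in adjacency:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def spatial_lag(weights, values):
    """Return W*Y: for each region, the weighted average of its neighbors' values."""
    return [sum(w * y for w, y in zip(row, values)) for row in weights]


# Hypothetical data: three regions in a row (0 - 1 - 2); region 1 touches both others.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
crime = [10.0, 20.0, 30.0]  # made-up target values Y

W = row_normalize(adjacency)
lag = spatial_lag(W, crime)
print(lag)
```

Once materialized, each entry of `lag` can be stored as an extra attribute of its region, so an ordinary regression on the augmented table sees the neighborhood effect.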

Materializing Neighborhood Influence
• Compute a weighted sum of interesting information (target variable, other attributes) from neighbors
• E.g., if you are estimating CRIME for a region/point T, take a “distance-based” weighted sum of the crime of its neighbors
• Additionally, you can also estimate population in a 10-mile radius (based on race), etc.
• Oracle Spatial provides specific functions to compute such neighborhood-based estimates
[Figure: target T with neighbors A and B]
$$C(T) = \frac{C(A)/d(A,T) + C(B)/d(B,T)}{1/d(A,T) + 1/d(B,T)}$$
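The formula for C(T) generalizes to any number of neighbors as an inverse-distance weighted average. The following is a minimal sketch of that calculation, not Oracle Spatial's own function. It assumes planar coordinates and Euclidean distance, whereas real block-group data would need geodetic distances. The name `idw_estimate` and the sample values are hypothetical.

```python
import math


def idw_estimate(target, neighbors):
    """Inverse-distance weighted estimate at `target`.

    target:    (x, y) tuple
    neighbors: list of ((x, y), value) pairs
    """
    numerator = 0.0
    denominator = 0.0
    for point, value in neighbors:
        d = math.dist(target, point)  # planar distance (assumption)
        if d == 0:
            return value  # a neighbor at the target location dominates
        numerator += value / d
        denominator += 1.0 / d
    if denominator == 0:
        raise ValueError("at least one neighbor is required")
    return numerator / denominator


# Made-up example: A is 1 unit from T with crime 10, B is 2 units away with crime 40.
T = (0.0, 0.0)
estimate = idw_estimate(T, [((1.0, 0.0), 10.0), ((-2.0, 0.0), 40.0)])
print(estimate)  # (10/1 + 40/2) / (1/1 + 1/2) = 20.0
```

The nearer neighbor A pulls the estimate toward its own value, which is the intended effect of weighting by 1/d.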

Spatial-enabled Mining
Table -> Neighborhood Estimates (e.g., population in 2 miles, crime in neighborhood,…) -> Augmented Table -> Oracle Data Mining -> Mining Results

Spatial-enabled Mining
• Mapviewer
• ODM applications: Classification, Regression, Association Rules,…
• Spatial Analysis (building blocks): Spatial Binning, Spatial Estimates, Clustering for polygons (BIRCH+agglomerative)

Case Study for Spatial-enabled Mining: How helpful are these estimates?
• Test on a specific dataset
  • US block groups from Census for CA (21K)
  • Crime data for US block groups (from a partner company)
  • Crimerate is the number of crimes per 1000 of population
• Separate the data into TRAINING data and TEST data
• Compute data mining models using TRAINING data

Evaluation
• Predict crime for TEST regions with and without spatial estimates using ODM’s mining functions
• Test regions: 450 locations in the San Francisco area
• Classification (Adaptive Bayes Network)
  • Create bins or “classes” of the data and results
  • Measure how well the model predicts the “class” for new test regions
• Regression (Support Vector Machines)
  • Predict the exact value of crimerate (regression analysis using SVM)
• Estimates for spatial neighborhood
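Turning crimerate into “classes” for the classification test requires some binning scheme. The slides do not say which one was used, so the sketch below assumes simple equal-width binning. The function name, the bin count, and the sample crimerates are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a class index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical: a single class
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Made-up crimerates (crimes per 1000 of population)
crimerates = [0, 10, 25, 40, 50, 100]
classes = equal_width_bins(crimerates, 4)
print(classes)  # [0, 0, 1, 1, 2, 3]
```

A classifier is then trained on the class index, while the regression test keeps the original numeric crimerate.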

Spatial Neighborhood
• How do you define neighborhood?
  • Buffer around the test location? Quarter mile to 10 miles
  • Nearest neighbors? 2 to 20
• Compute spatial estimates for crime
  • Can also be done for population (white, asian, black, hispanic,…)
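The two neighborhood definitions, a buffer of fixed radius and a fixed number of nearest neighbors, can be contrasted in a few lines. This sketch assumes planar point locations and brute-force search; a spatial database would use its spatial index and operators instead. The function names and coordinates are hypothetical.

```python
import math


def neighbors_within(target, points, radius):
    """Buffer definition: every other point within `radius` of the target."""
    return [p for p in points if p != target and math.dist(target, p) <= radius]


def k_nearest(target, points, k):
    """Nearest-neighbor definition: the k closest other points."""
    others = [p for p in points if p != target]
    return sorted(others, key=lambda p: math.dist(target, p))[:k]


# Made-up locations, in arbitrary planar units
locations = [(0, 0), (1, 0), (0, 2), (5, 5)]
test_location = (0, 0)

in_buffer = neighbors_within(test_location, locations, radius=2)
nearest = k_nearest(test_location, locations, k=1)
print(in_buffer, nearest)
```

A buffer can come back empty in sparse areas, while k nearest neighbors always returns k points but may reach far away, which is one reason to test both definitions over a range of sizes.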

Some Results:
• Classification:
  • Accuracy increases from 62% to 89% with 7 nearest neighbors
• Regression:
  • Root-mean-square error between predicted and actual value improves from ~25 to 8 (5-7 neighbors)
• Detailed results in a white paper on http://technet.oracle.com/products/spatial
• Visualize the results with Mapviewer
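For reference, the two metrics quoted here can be computed as follows. This is a generic sketch of classification accuracy and root-mean-square error, not ODM's own evaluation code. The toy predictions are made up and are unrelated to the case-study numbers.

```python
import math


def accuracy(predicted, actual):
    """Fraction of test regions whose predicted class equals the actual class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual numeric values."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Toy values only, for illustration
acc = accuracy([0, 1, 2, 1], [0, 1, 1, 1])  # 3 of 4 classes match
err = rmse([13.0, 7.0], [10.0, 10.0])       # both predictions are off by 3
print(acc, err)
```

Accuracy applies to the binned classification test and RMSE to the SVM regression test, so the two numbers are not directly comparable.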

Summary of the case study
• Adding neighborhood influence to data
  • Improves classification accuracy from 62% to 89%
  • Best neighborhood for this case study: 5-7 neighbors or 2-mile distance
• Details, additions: white paper on OTN
  • http://technet.oracle.com/products/spatial
• Recommendation for businesses: spatial-enable the data
  • Always geocode customer/business locations
  • Materialize demographic information from the spatial neighborhood
  • Test the data and perform mining tasks

More research needed…
• Current case study:
  • SVM without spatial estimates, although worse than with them, is still good: which attributes are helping?
• Colocation Mining
  • “Co-location” of items as opposed to “co-occurrence” in a transaction
  • E.g., which sets of items are colocated and what are the implications (interesting patterns)
  • One approach: identify items that co-occur within “tiled” regions
  • Needs tighter integration with association rule mining
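The tiling approach to colocation can be sketched by treating each tile as a “transaction” whose items are the feature types that fall inside it. The code below counts, for each pair of types, the number of tiles containing both. It assumes a square grid over planar coordinates; the function name, tile size, and feature data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def colocated_pairs(features, tile_size):
    """Count tiles in which each pair of item types occurs together.

    features: list of (item_type, x, y) tuples
    Returns {(type_a, type_b): tile_count} with type_a < type_b.
    """
    tiles = defaultdict(set)
    for item, x, y in features:
        key = (int(x // tile_size), int(y // tile_size))  # grid cell of the point
        tiles[key].add(item)

    counts = defaultdict(int)
    for items in tiles.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return dict(counts)


# Made-up features
features = [
    ("bank", 0.5, 0.5), ("atm", 0.7, 0.2),
    ("school", 3.1, 3.9), ("park", 3.5, 3.2), ("atm", 3.9, 3.9),
]
pairs = colocated_pairs(features, tile_size=1.0)
print(pairs)
```

The per-tile item sets could be passed to an association rule algorithm such as Apriori. One known weakness of tiling is that two nearby items on opposite sides of a tile boundary are not counted as colocated.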
