Geographic Knowledge Discovery in Spatial Interaction With Self-Organizing Maps. Ph.D. Dissertation Defense. Jun Yan, Geography Department, SUNY at Buffalo. July 29, 2004. Dissertation Committee: Dr. Jean-Claude Thill (Chair), Dr. Ling Bian, Dr. David Mark.
Geographic Knowledge Discovery in Spatial Interaction With Self-Organizing Maps Ph.D. Dissertation Defense Jun Yan Geography Department SUNY at Buffalo July 29, 2004 • Dissertation Committee: • Dr. Jean-Claude Thill (Chair) • Dr. Ling Bian • Dr. David Mark
Outline • Background • Spatial Interaction Data • Methodology • Self-Organizing Maps • Visual Data Mining • Case studies • Conclusions and Future Research
Background • Information technologies: • more tools available • more data available • Data-rich vs. computation-rich: • challenge? • opportunity! • Two legs!
Background (Cont.) • Data Mining & Knowledge Discovery: “useful information from large databases” • useful • novel • valid • understandable • Geographic data mining (GDM) and geographic knowledge discovery (GKD)?
Background (Cont.) • Mining techniques: statistics, pattern recognition, machine learning, visualization, high-performance computing … • Knowledge discovery process [Figure: knowledge discovery process. Labels: User, Controller, DB Interface, Target Data Selection, Data Mining, Evaluation, DBMS, Discoveries, Domain Knowledge, Knowledge Base]
Background (Cont.) • Finding all the patterns in a database autonomously is unrealistic • because the patterns could be numerous yet mostly uninteresting • Data mining: an iterative, interactive, semi-automated process • people direct what is to be mined • Visualization: Geovisualization (GVis) • visual data mining!
Visualization in KDD Process • Selecting application domain • Selecting target data: understanding basic data distribution, selecting meaningful target datasets • Processing data: locating missing data, noise removal, data smoothing • Extracting information/knowledge: parameter setting, process tracking, process steering • Interpretation and evaluation: interpretation, reporting, comparison, validity checking
Background (Cont.) • Machine learning & Neural Networks [Figure: a learning algorithm takes examples and (sometimes) background knowledge as inputs and outputs a concept description or other knowledge] [Figure: neural network with input layer, hidden layer, and output layer]
Background (Cont.) • Objectives: • Explore the effectiveness of neural networks in GKD • Examine the roles of GVis in GKD
Spatial Interaction Data • What is spatial interaction? • Pairs of places • Elemental: trips made by individuals • Aggregate: flows from origins to destinations • Examples: migration, freight shipment, movement of capital & information …
Spatial Interaction Data (Cont.) [Figure: three representations of interaction data. Elemental level: trip table (rows Trip 1, Trip 2, Trip 3; columns Origin, Destination, Distance). Aggregate level: basic O-D matrix (Region 1 to Region 3 by Region 1 to Region 3) and dyadic O-D matrix (rows Region 1>Region 1, Region 1>Region 2, Region 1>Region 3; columns Type 1, Type 2, Type 3)]
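The step from the elemental trip table to the two aggregate matrices can be sketched as below. This is only an illustration: the trip records, the use of an interaction type on each trip, and the function names `basic_od_matrix` and `dyadic_od_matrix` are assumptions, not taken from the dissertation.

```python
from collections import Counter, defaultdict

# Hypothetical elemental-level trip table: (origin, destination, interaction type).
trips = [
    ("Region 1", "Region 2", "Type 1"),
    ("Region 1", "Region 2", "Type 2"),
    ("Region 1", "Region 3", "Type 1"),
    ("Region 2", "Region 1", "Type 3"),
    ("Region 1", "Region 2", "Type 1"),
]

def basic_od_matrix(trips):
    """Basic O-D matrix: number of trips for each (origin, destination) pair."""
    return Counter((origin, dest) for origin, dest, _ in trips)

def dyadic_od_matrix(trips):
    """Dyadic O-D matrix: one row per O-D pair, one column per interaction type."""
    rows = defaultdict(Counter)
    for origin, dest, kind in trips:
        rows[f"{origin}>{dest}"][kind] += 1
    return rows

od = basic_od_matrix(trips)
dyads = dyadic_od_matrix(trips)
print(od[("Region 1", "Region 2")])        # flow from Region 1 to Region 2
print(dict(dyads["Region 1>Region 2"]))    # the same flow split by type
```

Each row of the dyadic matrix is one multidimensional record, which is the form a SOM would take as input.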
Spatial Interaction Data (Cont.) • Exploring the Patterns of Interaction • Very necessary!!! • Existing Exploratory Data Analysis (EDA): lack of interactivity • Challenges: • a large number of interactions • wide range of interaction magnitudes • multiple semantics
Spatial Interaction Data (Cont.) • Multidimensionality! [Figure: stack of O-D matrices; axes: Origin, Destination, Interaction semantics]
Spatial Interaction Data (Cont.) [Figure panels: Electronic products; Machinery; Vehicle and parts; Photographic products]
Methodology • Self-Organizing Maps (SOM) • Visual Data Mining (VDM): • SOM as core DM engine • Interactivity
Self-Organizing Maps • A crucial task of KDD: reduce data complexity • Data quantization: reduce the number of records, here the number of spatial interactions • Data projection: reduce the number of variables, here the number of interaction semantics • By reducing data complexity, identification of meaningful geographic structures becomes possible • Traditional multivariate statistical methods have their limitations
Self-Organizing Maps (Cont.) • A special type of competitive neural network • Based on some measure of dissimilarity in the attribute space • Capable of reducing data complexity on two dimensions simultaneously • Actually an unsupervised pattern classifier [Figure: input layer connected to a competitive output layer with one winning node and several losing nodes]
Self-Organizing Maps (Cont.) • The best matching unit (BMU) changes its value to fit the input data • Its neighboring nodes change their values to fit the input data as well, but the magnitude of the change decreases with distance from the BMU • Like a flexible net • Similar data are located close to each other in the mapping
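The update rule described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used in the dissertation: the Gaussian neighborhood, the linear decay of learning rate and radius, the parameter names, and the toy data are all assumptions.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):          # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                    # assumed linear decay
            sigma = sigma0 * (1.0 - frac) + 0.01       # radius shrinks, never zero
            bmu = best_matching_unit(weights, x)
            for pos, w in weights.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))  # 1 at the BMU, smaller farther away
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])     # pull node toward the input
            step += 1
    return weights

# Hypothetical toy data: two records near (0, 0) and two near (1, 1).
data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
som = train_som(data, rows=3, cols=3)
bmu_low = best_matching_unit(som, [0.0, 0.1])
bmu_high = best_matching_unit(som, [1.0, 0.9])
print(bmu_low, bmu_high)
```

Because neighbors of the BMU are dragged along with it, records that resemble each other tend to end up on the same or adjacent nodes, which is the "flexible net" behavior.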
Visual Data Mining • Framework [Figure: framework relating Interaction Forms and Visualization Forms. Labels: Dynamic linking, Assignment, Focusing, Operation, Brushing, Colormap manipulation]
Case Studies • Airline Origin and Destination Survey Market Table (DB1BMarket): http://www.bts.org • 10% of air flight itineraries • Geographic scale: airport level; 280 metros in the contiguous US • Temporal range: 1993 to 2002 • Two case studies on DB1BMarket • Cross-sectional analysis • Temporal changes
Clustering Analysis • A cluster is an area of low values (distance) surrounded by areas of high values (distance). • There are several clusters in the feature map [Figure: feature map with clusters labeled 1 to 9; cluster 9 is subdivided into 9-1 to 9-5]
Clustering Analysis (Cont.) A cluster is a valley in a 3-D map
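A distance surface of this kind is commonly computed as a U-matrix: for each node, the average distance between its weight vector and those of its grid neighbors. The sketch below assumes that the feature map on these slides is a display of that type; the function name `u_matrix` and the tiny example map are hypothetical.

```python
import math

def u_matrix(weights, rows, cols):
    """Mean distance from each node's weights to those of its 4-connected neighbors."""
    u = {}
    for r in range(rows):
        for c in range(cols):
            dists = [math.dist(weights[(r, c)], weights[(nr, nc)])
                     for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                     if 0 <= nr < rows and 0 <= nc < cols]
            u[(r, c)] = sum(dists) / len(dists) if dists else 0.0
    return u

# Hypothetical 1 x 4 map: two nodes near 0 and two nodes near 1.
weights = {(0, 0): [0.0], (0, 1): [0.1], (0, 2): [0.9], (0, 3): [1.0]}
u = u_matrix(weights, rows=1, cols=4)
for pos in sorted(u):
    print(pos, round(u[pos], 2))
```

In this toy map the two end nodes sit in low-distance areas, while the two middle nodes, which straddle the gap between the groups, take higher values: the ridge that separates the two valleys.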
Cluster Analysis (Cont.) [Figure: Market share contribution]
Cluster Analysis (Cont.) [Figure labels (airline codes): AA, MQ, CO, RU, NW, XJ, WN, ZW, UA, QX, DL, HP, EV, US, Multiple]
Cluster Analysis (Cont.) [Figure panels: Cluster 2; Markets represented by Cluster 2; Markets with US Airways market share >= 50%]
Cluster Analysis: Markets From Nashville [Figure labels (airline codes): CO, RU, WN, AA, NW, DL, UA, US, EV]
Cluster Analysis: Markets From Nashville (Cont.) [Figure labels (airline codes): CO, RU, WN, AA, NW, DL, UA, US, EV]
Association Analysis [Figure panels: Market share; Average airfare]
Temporal Changes (Cont.) [Figure panels: TWA 2001; AA 2001; AA 1993; AA 2002]
Temporal Changes (Cont.) [Figure panels: Northwest share; Continental share]
Temporal Changes: Trajectory • Market from Buffalo to DC [Figure panels: US Airways fare; US Airways share; Southwest share; trajectory points labeled by year: 93, 96, 98, 00, 01]
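One way to read a trajectory of this kind is as the sequence of best matching units that a single market visits when its yearly records are projected onto an already trained map. The sketch below assumes that reading; the three-node map, the two generic variables, and the yearly values are invented for illustration and are not results from the study.

```python
import math

def best_matching_unit(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))

def trajectory(weights, yearly_vectors):
    """Ordered list of (year, BMU position) for one market."""
    return [(year, best_matching_unit(weights, vec))
            for year, vec in sorted(yearly_vectors.items())]

# Hypothetical trained 1 x 3 map over two variables (values are made up).
weights = {(0, 0): [0.8, 0.1], (0, 1): [0.5, 0.4], (0, 2): [0.2, 0.7]}

# Hypothetical yearly records for one market (values are made up).
market = {1993: [0.85, 0.05], 1998: [0.50, 0.45], 2001: [0.25, 0.70]}

path = trajectory(weights, market)
print(path)
```

Drawing the resulting positions in year order on each component plane gives a path like the one labeled 93 to 01 on the slide.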
Conclusions • Data-rich environment: large databases and high dimensionality • Data complexity reduction is crucial • Results suggest that SOM: • summarizes the overall data distribution well • is capable of detecting clustered structures • can be used to analyze the properties of clustered structures • can be used to study the associations among input variables
Conclusions (Cont.) • Interactive visual data mining can: • examine subsets of the data more closely • study relationships among interaction types • analyze how detected clusters are distributed in actual geographic space • help us gain a better understanding of the factors and spatial processes behind the interactions
Future Research • SOM/VDM analysis • DB1BMarket • Other types of spatial interaction data • Data at elemental level • Improved VDM environment • Human subject testing • Seamlessly coupled
Background (Cont.) • Geographic databases fit the profile: • massive volume: GIS, GPS, Remote Sensing … • high dimensionality • Geographic data mining (GDM) and geographic knowledge discovery (GKD)? • Current topic in GIS research
Background (Cont.) • Roles of Visualization [Figure: roles of visualization along a time axis, from data driven to model driven. Stages: exploratory analysis, knowledge construction, analysis and modeling, evaluation of results. Visual roles: visual exploration & visual data mining; visual knowledge construction & refinement; visual model tracking, model steering; data presentation, visualization of uncertainty]
Modeling Flows • Spatial interaction models: “Gravity Models” • Other geographic factors: • Geographic relationships among origins? • Geographic relationships among destinations? • Association among types of interaction?
Modeling Flows • Spatial interaction models: “Gravity Models” • Push: origin • Pull: destination • Transportation cost: distance decay • $I_{ij} = k \frac{P_i P_j}{d_{ij}^{a}} = k P_i P_j d_{ij}^{-a}$
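The gravity formula on this slide can be evaluated directly. The sketch below is illustrative only: the function name `gravity_flow`, the default values of k and a, and the example masses and distances are assumptions.

```python
def gravity_flow(p_i, p_j, d_ij, k=1.0, a=2.0):
    """Predicted interaction I_ij = k * P_i * P_j / d_ij**a."""
    if d_ij <= 0:
        raise ValueError("distance must be positive")
    return k * p_i * p_j / d_ij ** a

# Hypothetical origin and destination masses, at two different distances.
near = gravity_flow(100, 200, 10)   # 100 * 200 / 10**2
far = gravity_flow(100, 200, 20)    # 100 * 200 / 20**2
print(near, far)
```

With a = 2, doubling the distance cuts the predicted flow to a quarter, which is the distance-decay effect named in the bullet above.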
Limitations of Traditional Multivariate Methods • Data Projection • Factor analysis • Projection pursuit • Multi-dimensional scaling • Data Quantization • Partitioning methods • Hierarchical methods • Limitations of these methods: • Linearity • Stationarity • Normal distribution • Limited data amount • One-dimension compression • Versus data that are: • Non-linear • Non-stationary • Distribution unknown • Sparse • Large data amount • Multi-dimensional