400 likes | 594 Views
Geographic Data Mining. Marc van Kreveld Seminar for GIVE Block 1, 2003/2004. About …. A form of geographical analysis Current topic of interest in GIS research (and database research and AI research) Finding hidden information in large collections of geographic data. This seminar.
E N D
Geographic Data Mining Marc van Kreveld Seminar for GIVE Block 1, 2003/2004
About … • A form of geographical analysis • Current topic of interest in GIS research (and database research and AI research) • Finding hidden information in large collections of geographic data
This seminar • Learning about a topic together • Presenting to each other + interaction • Added value by good examples: • for important concepts, algorithms • possibly self-thought of, or extended • referring to GIS data and issues (hence the GIS course prerequisite) • Written assignment: joint survey
Material • Book by Harvey Miller and Jiawei Han (editors): selected chapters • Possibly: papers from conference proceedings • Mostly provided by me
Weeks • Week 36-46 • Probably: • Not September 4 (this Thursday) • Not in week 40 (Sept. 29 & Oct. 2) • Not October 23 • The above depending on participation!
Overview of Geographic Data Mining & Knowledge Discovery Chapter 1 of the book • KDD: knowledge discovery in databases • Data warehouses • Data mining • Geographic aspects of the above
Knowledge Discovery in Databases (KDD) • Large databases contain interesting patterns: non-random properties and relationships that are: • valid (general enough to apply to new data) • novel (non-trivial and unexpected) • useful (leads to effective action: decision making or investigation) • ultimately understandable (simple, and interpretable by humans)
Knowledge Discovery in Databases (KDD) • Because of quantity of data nowadays • Because we want information, not data • Because computing power allows it
KDD opposed to statistics • Statistics • small and clean numeric database • scientifically sampled • specific questions in mind • KDD: none of the above
KDD techniques • Statistics • Machine learning • Pattern recognition • Numeric search (?) • Scientific visualization
Data warehouse • Large repository of data • For analytical processing (DB: transactional processing) • Heterogeneous: different sources and formats (DB: homogeneous) • Supports OLAP tools (OnLine Analytical Processing)
OLAP Example • Measure of interest: sales • Dimensions of interest: item, store, week • (item, store, week) money[quantity sold times price ]
OLAP Example • 2-dim. aggregation:(item, store, . ) money • Another 2-dim. aggregation: sales by store and by week • 1-dim. aggregation: sales by week (all items and stores) • Data cube: all 2d possible aggregations, different types of summaries
KDD steps • Data selection • Data pre-processing • Data enrichment • Data reduction and projection • Data mining • Interpretation and reporting Presence of steps and order not fixed
KDD steps • Data selection: which records, variables chosen?
KDD steps • Data selection • Data pre-processing: removing noise, duplicate records, handling missing data, …
KDD steps • Data selection • Data pre-processing • Data enrichment: combining the selected data with external data
KDD steps • Data selection • Data pre-processing • Data enrichment • Data reduction and projection: reduction in number, reducing dimension
KDD steps • Data selection • Data pre-processing • Data enrichment • Data reduction and projection • Data mining: uncovering information, interesting patterns
KDD steps • Data selection • Data pre-processing • Data enrichment • Data reduction and projection • Data mining • Interpretation and reporting: evaluating, understanding, communicating
Data mining • Segmentation • Dependency analysis • Deviation and outlier analysis • Trend detection • Generalization and characterization
Description: Clustering: finding a finite set of implicit classes Classification: mapping data items into pre-defined classes Techniques: Cluster analysis Bayesian classification Decision or classification trees Artificial neural networks DM - segmentation
given classes clustering classification DM - segmentation
Description: Finding rules to predict the value of some attribute based on other attributes Techniques: Bayesian networks Association rules DM – dependency analysis (4, 12, 0.24) (3, 14, 0.21) (7, 13, 0.43) (2, 9, 0.78) (11, 11, 0.55) (5, 11, ???) (???, 12, 0.51)
DM – dependency analysis • Confidence and support measures for association rules of the form:[ if X then Y ]:confidence = #(X and Y in DB) / #(X in DB)support = #(X and Y in DB) / #(all in DB)
Description: Finding data with unusual deviations (=errors, or data of particular interest) Techniques: Clustering, other mining methods Outlier analysis DM – deviation & outlier analysis
Description: Finding lines, curves, summarizing the database (often as a function over time) Techniques: Regression Sequential pattern extraction DM – trend detection
Description: Obtaining compact descriptions of the data Techniques: Summary rules Attribute-oriented induction higher level concept low level concept DM – generalization and characterization concept hierarchy
Visualization and knowledge discovery • KDD is difficult to automate steered by human intelligence • Visualization helps to understand the data and which data mining techniques to try
KD + geography • Special case of KDD • Other special cases • marketing • biology • astronomy • Main features: location, distance, dimen-sionality (with dependent dimensions)
KD + geography (attr1, attr2, attr3, attr4); attr’s are numbers and (relatively) independent: statistics (attr1, attr2, attr3, attr4); attr’s can also be on other measurement scales: KDD (attr1, attr2, attr3, attr4); attr’s are often dependent and can be shapes: KD + geography Often: (lat., long., attr1, attr2, …) or: (shape description, attr1, attr2, …)
KD + geography • Study of scalable versions of DM tasks (in lat. and long.) • Certain dimensions can be non-metric (travel time need not be symmetric) • DM in data that is not in the form of tuples: sets of thematic map layers
Geographic data mining • Spatial segmentation (clustering, classification) • Spatial dependency (spatial association rules) • Spatial trend detection • Geographic characterization and generalization
GDM – spatial association rules • Example: If a location is within 500 m from water and the average winter temperature is at least –2 degreesthen there are frogs around distance relationship
GDM – spatial trend detection • Patterns of change with respect to neighborhood of some object • Example: (North America) Further from Pacific ocean fewer earthquakes
GDM - applications • Map interpretation • Remote sensing interpretation • Environmental mapping (soil type, etc.) • Extracting spatio-temporal patterns for cyclones, crimes • Spatial interaction (movement/flow of people, capital, goods)
Conclusions • GDM & GKD is an extension of (tool for) geographical analysis • GDM is different from DM due to • Geographic spaces, not attribute space • Neighborhood is extremely important • Scale issues • Data is different • Applications (interesting patterns to mine for) are different
This seminar on GDM • First: chapters from the book • CH 1: GDM & KD: an overview (today) • CH 2: Paradigms for spatial and spatio-temporal DM(11-9) • CH 3: Fundamentals of spatial DW for GKD (15-9) • CH 7: Algorithms and applications of SDM (Ronny) (18-9) • CH 8: Spatial clustering in DM (22-9) • CH 6: Modeling spatial dependencies (25-9)(not: 29-9 and 2-10) • CH 9: Detecting outliers (6-10) • CH 10: Knowledge construction based on GVis and KDD • CH 14: Mining mobile trajectories
This seminar • All PowerPoint presentations on the Web page of the course • Survey paper or written exam; possible topics for survey: • Hierarchical clustering • Clustering with obstacles • Proximity relationship mining • … • Or: joint survey of (geometric) algorithms for GDM
Each presentation • The chapter contents • Additional (spatial) examples(from the Web links or self-constructed) • Detect and present algorithmic problems that appear together: report on algorithmic issues in GDM • Present your chapter; don’t be afraid of overlap with other chapters