Field Data and Spatial Warehouses: Trends in Database Management Systems

Chap8: Trends in DBMS 8.1 Database support for Field Entities 8.2 Content-based retrieval 8.3 Introduction to spatial data warehouses 8.4 Summary

Learning Objectives • Learning Objectives (LO) • LO1: Learn about field data • Why learn about field data type? • What is field data type? How is represented in SDBMS? • What are common operations on fit? • LO2 : Learn about storage and retrieval of field data • LO3: Learn about spatial data warehouses • Mapping Sections to learning objectives • LO1 - 8.1.1 • LO2 - 8.1.2, 8.2 • LO3 - 8.3

Why learn about Field data-sets? • Field data is timely and abundant • Sensors (e.g. satellite based ones) provide periodic snapshot of Earth • Most up-to-date data about current events (e.g. fires, flood) • Field data are useful • in creating, revising and evaluating vector data sets • digital archival of fragile historical paper maps • to manually get details not captured in vector interpretations • Example: Location selection for a facility (e.g. a grocery store) • Consider a set of Aerial photographs of different locations • Vector interpretation includes roads, water bodies, elevation • What other information can aerial imagery reveal for construction planning? • Trees (types and location), buildings, …

What are Field data-sets? • Field data set examples • Satellite images, aerial photographs • Digitized paper maps • Earth Science data-sets, e.g. rainfall, temperature maps • Data types of Spatial field data sets • Images • Satellite based, e.g. www.terradata.com • Aerial photographs • Measurements from a Geo-registered sensor networks, e.g. weather • Video, i.e. time series of images • Audio data • Focus: Primarily images, • though some discussion will apply to other data types

Fields and Rasters: An Sampling of Field values • Definitions • Field: a mapping from a spatial domain to a value domain • Image: a mapping from a rectangular grid to a value domain • A rectangular grid is a collection of cells called pixels • Raster is geo-registered image, i.e. grid axis have absolute spatial locations • Fields are often approximated as rasters • Example: Figure 8.1 • Identify spatial domain, field, rectangular grid, raster approximation • Fields can be approximated as images if relative spatial locations are adequate Fig 8.1

Computing with field data • Field data manipulated using operations of • map algebra • image algebra • An Algebra is a mathematical structure consisting of • Operands and Operations. • Map Algebra • Operand: rasters • Operations: Can be classified into four groups • Local, Focal, Zonal and Global • Image Algebra • Operand: images • Operations: crop, zoom, rotate

Local Operation A local operation maps a raster into another raster such that the value of a cell in the new raster depends only on the value of that cell in the original raster. Examples: unary operation : thresholding binary operation: point wise addition Fig 8.2

Focal Operation In a focal operation, the value of a cell in the new raster is dependent on the values of the cell and its neighboring cells in the original raster. Examples: unary operations: focal sum, gradient, … Neighborhoods: Rook, Bishop and Queen Fig 8.3

Zonal Operation In a global operation, the value of a cell in the new raster is a function of the location or values of all cells in the original or another raster. Examples: zonal sum, zonal average, ... Fig 8.4

Global Operation In a zonal operation, the value of a cell in the new raster is a function of the value of that cell in the original layer and the values of other cells which appear in the same zone specified in another raster. Example: distance from nearest facility Fig 8.5

Image Operations:Trim • Image Operations • ignore the absolute locations of pixels. • come from image processing literature • Ex. smoothing, low pass filter, high pass filter, • Example: A trim operation extracts an axis-aligned subset of the original raster. Fig 8.6

Learning Objectives • Learning Objectives (LO) • LO1: Learn about field data • LO2 : Learn about storage and retrieval of raster data • How is raster data stored on secondary storage? • What query families are used for retrieval? • What is content based retrieval (CBR)? Why is it interesting? • How is CBR computationally approached? • LO3: Learn about spatial data warehouses • Mapping Sections to learning objectives • LO1 - 8.1 • LO2 - 8.1.2, 8.2 • LO3 - 8.3

Storage and Retrieval of Raster Data - 1 • Traditional Approach • store raster data in a file system • use custom software to retrieve data-items of interest • Example: personal photographs stored on MS Windows • Q? What attributes can one attach to digital photographs ? • Q? Is there an easy way to retrieve all pictures taken in San Francisco? • Limitations • Rigid schema • Limited ability to add and manage additional attributes • Canned Queries only • Limited ability to support ad-hoc queries • Data quality • Limited ability to identify duplicates or similar data-items

Storage and Retrieval of Raster Data in a SDBMS • A database approach • Database tables store • raster data items • attributes (i.e. meta-data), e.g. creation date, geo-location, subject, ... • use SQL like query language to retrieve desired data-items • retrieve all raster data-items overlapping with city of San Francisco (Q1) • retrieve latest raster data-item within city of Paris (Q2) • retrieve raster data-items similar to a given image (Q3) • Pros: • table schema definition allows user defined attributes • improve ability to pose ad-hoc queries (Ex. Q1, Q2) • improve data reliability and quality • Example: Query Q3 may be used for duplicate reduction

Storage and Retrieval of Raster Data - Challenges • Challenges in database based approach • storage: size( raster data item) > size (disk blocks) • retrieval: raster has rich content • A picture is worth a thousand word! • Approaches to storage challenge • 1. Delegate storage to DBMS • Use Binary Large Object (BLOB) data-type create table my_picture( image: BLOB; creation_date: date; place: point; … ) • 2. Do-it-yourself • Divide a raster data-item into smaller slices • Q? Which way of slicing reduce disk I/Os for common queries?

8.1.2 How is raster data stored on secondary storage? • Slicing approaches • Linear, e.g. one row per disk block (see Fig. 8.8(b)) • Tiling - see Fig. 8.8(c ) • Tiling is preferred • for queries extracting rectangular sub-images • Example - terraserver.com Fig 8.8

8.2 How is raster data queried? • Retrieval challenge of rich content • A. Meta-data approach • B. Content based retrieval • Meta-data approach • select a set of descriptive attributes • simpler SQL data types, e.g. numeric, string, date, ... • Example: source, location, time stamp, subject, resolution, ... • Store values of descriptive attributes for each raster data-item • Allow SQL queries on the descriptive attributes • Limitation of meta-data approach • Restricts queries to content captured by descriptive attributes • Does not support “Similarity” based queries • Ex. Find all raster data-items similar to a given raster data item.

8.2 Content Based Retrieval (CBR) • Examples • Q1. Find all raster data-items similar to a given raster data item • Q2. Locate a photograph of a river in Minnesota with trees nearby. • Q3. Find all images of state parks which have a lake within them, are within a radius of one hundred miles from Chicago, and are southwest of Chicago. • State of the Art • However, few robust implementations of CBR are available as of 2002 • Several research prototypes address similarity query Q1 • Result quality is similar to those of web searches (e.g. www.google.com) • Some of the retrieved raster data-item are useful. • Many similar data item are not retrieved in the result • Usable in application domains such as publishing • Our goal is to understand a current approach to similarity queries • involving spatial similarities

8.2 Content Based Retrieval (CBR) • Spatial Similarity • Consider a pair of raster images with common objects (e.g. parks, lakes) • Spatial similarity between raster images can be defined based on • similarity of spatial relationships (e.g. topological, directional) • Q? Which pairs exhibit higher similarity? • P1: (inside, disjoint) or P2: (inside, covered by) • P3: (disjoint, touch) or P4: (disjoint, inside) • P5: (north west, north) or P6: (west, east) • A graph framework for comparing spatial relationships • Nodes = spatial relationships ; Edges = connect most similar nodes • Similarity metric = number of edge on shortest path between 2 nodes • See Figures 8.9 and 8.10

8.2.1 Topological Relationship Similarity • Study Fig. 8.9, pp. 234 • Nodes = topological relationships • Edges = most similar • Similarity measure = path length • Inference from Model • P2: (inside, covered by) more similar than P1: (inside, disjoint) • Do you agree? • Review Figure 2.3 (pp. 30) Fig 8.9

8.2.2 Direction Relationship Similarity • Study Fig. 8.10, pp. 235 • Nodes = topological relationships; Edges = most similar • Similarity measure = path length • Inference: P5 (north-west, north) more similar than P6 (west, east) Fig 8.10

8.2.3 Distance Similarity • Distance similarity is based on • Euclidean distance between the centroids of the objects. • Example: Image R is more similar to P than Q in Fig. 8.11 (pp. 235) Fig 8.11

8.2.4 A Computational Approach to CBR • Attribute Relation Graph (ARG) • Node = objects in a raster • Edges = relationships • Ex. Raster of Fig. 8.12(a) • ARG in Fig. Fig. 8.12(b) • Point object O3 • Rectangles O1, O2 • Edge (O1, O2) shows that they are disjoint, at 61 degree direction and 5.2 units distant. • Vector representation of ARG • Lists objects and edge properties • Ex. In Fig. 8.12 Fig 8.12

A Computational Approach to CBR • Steps: • 1.Represent each raster data item by its ARG vector • 2. Map query raster data item by its ARG vector • 3. Find most similar raster data-items in the database by comparing ARG vector representations. • Use a distance metric • Use a multi-dim. Index • Comment: Result quality is similar to those of web searches. Some of the retrieved raster data-item are useful. Fig 8.13

Learning Objectives • Learning Objectives (LO) • LO1: Learn about field data • LO2 : Learn about storage and retrieval of field data • LO3: Learn about spatial data warehouses • What are data warehouses? Why are they interesting? • What are aggregate functions? Which ones are easy to compute? • Mapping Sections to learning objectives • LO1 - 8.1 • LO2 - 8.1.2, 8.2 • LO3 - 8.3

8.3 Why are Data Warehouses Interesting? • Data Warehouse facilitate group decision making • Consider a dataset • 1 measure (i.e. Sales) • 3 dimensions (e.g. Company, Year, Region) • Analysis questions • Q1. Rank Regions by total sales. • Q2. Rank years by total sales. • Q3. Where are sales consistently growing? • Cross tabulates summaries reports used to analyze the trends • Example:

8.3 Generating cross-tabulation summaries • TraditionalApproach • Use custom software pulling data out of a DBMS • Limitations: redundant of work, inefficient use of resources • Data Warehouse approach • Cross-tab. Can be generated using a set of simple report • Each report is generated from a SQL “Select ... group by” statement • Example: Fig. 8.19 (pp. 244) and Table 8.3 (pp. 245) • Cross-tab example in last slide is a union of • SALES-L0-A, SALES-L1-A, SALES-L1-B and SALES-L2 • Table 8.3 shows SQL queries to compute each part • Advantage • Rest of SQL is available for pre/post processing of data • Performance gains by eliminating unnecessary copying of data

Example Data Warehouse (Fig. 8.19) Fig 8.19

8.3.4 Cross-tabulation vs.report hierarchy • Spreadsheet view of a report • Views a report a N-dim. Spreadsheet • N = number of dimension attributes • Each cell contains value of “measure” • Cross-tabulation view of a Report hierarchy • Example: report hierarchy for • SALES-L0-A, SALES-L2-A, SALES-L1-B, SALES-L2, Fig. 8.19 (pp. 244)

8.3 What is a Data Warehouse? • Data Warehouse is a special purpose database • Primarily used for specialized data analysis purposes • Facilitates generation and navigation of a hierarchy of reports • Special purpose data-sets and queries • Data consists of • a few measure attributes • a set of dimension attributes • The measure attribute depends on dimension attributes • Queries generate reports • Report measure for selected values of dimensions • Aggregate measure for given subset of dimensions • What is a spatial data warehouse? • Data warehouses with spatial measures or dimensions • Example: census data - census tract is a spatial dimension • Example: logistics data - route is a spatial dimension

8.3.4 Data Warehouse Operations • Operations on a data warehouse • Roll-up, Drill-down • Slice, Dice • Pivot • Roll-up • Inputs: A report R, A subset S of dimensions in R • Output: A sequence of reports summarizing R • Example 1: R = SALES-Base, S = (Year, Region) in Fig. 8.19 (pp. 244) • Output consists of reports SALES-L0-A, SALES-L1-B, SALES-L2 • Example 2: R = SALES-Base, S = (Region, Year) • Output consists of reports SALES-L0-A, SALES-L1-A, SALES-L2 • Drill-down • Inputs: A report R, A dimension D not in R • Output: A reports detailing R on D • Example: R = SALES-L1-B, D = Region in Fig. 8.19 (pp. 244) • Output : report SALES-L0-A

8.3.4 Data Warehouse Operations • Slice, Dice • Reduce dimensions in a table- (Fig. 8.7, pp 232). • Inputs: A report R, A value V for a dimension D in R • Output: A subset of R where D =V • Example: R = SALES-L0-A, D = Year, V = 1994 in Fig. 8.19 (pp. 244) • Output: Table 8.5 (pp. 246) • includes tuple (ALL, 1994, America, 35) Fig 8.7

8.3.4 Data Warehouse Operations • Pivot • For a spreadsheet view of reports • Transposes a spreadsheet • Example • Inputs: A spreadsheet view of a report R • Output: A transposed spreadsheet • Ex.: R= SALES-L0-A, Fig. 8.19 (pp. 244)

Logical Data Model of a DWH • Purpose of a logical data model • Specify a framework to specify computational structure • Allow extension of SQL to model new needs • Cube operation • Input : A fact table • Output: A set of summary reports covering all subsets of dimension columns • Equivalent to union of all tables and reports in Fig. 8.19 (pp. 244) • Ex. Fig. 8.18, pp. 243 • SELECT Company, Year, Region, Sum(Sales) AS Sales • FROM SALES • GROUP BY CUBE Company, Year, Region

Physical Data Model of a DWH • Purpose: Computationally efficient implementation • Ideas: • Pre-computation - • pre-compute some of reports and use those to compute other reports • New indexing methods, e.g. bit-map index • Query Processing Strategies • Strategies for aggregate functions • New strategies for multi-table joins • Let us look at strategies for aggregate functions

DWH Physical Model: Aggregate function strategies • Aggregate Functions • Compute summary statistics for a given set of values • Examples: sum, average, centroid (Table 8.1, pp. 238) • Strategies for efficient computation • Characterize easy to compute aggregate functions • 3 categories • Distributive • Algebraic • Holistic • First 2 categories can be computed easily in one scan of the dataset

Definitions of Aggregate Function Categories • Notation: • F, G, G1, G2, … Gn are aggregate functions where n is small • S is a set of values, e.g. S = (1, 2, 3, 4) • P = (S1, S2, …, Sp) is a partition of S, e.g. P = (S1, S2), S1 = (1, 2), S2 = (3, 4) • Distributive( F ) if there exists a G such that • F( S ) = G ( F(S1), F(S2), …, F(Sn) ) • Example: sum is distributive • Illustration: sum(1, 2, 3, 4) = sum ( sum(1, 2), sum(3, 4) • Algebraic( F ) if there exists G1, …, Gn, (where n is small) and • F( S ) = G ( G1(S1), …, Gn(S1), G2(S1), …, Gn(S2), …, G1(Sp), …, Gn(Sp) ) • Example: average is distributive • Illustration: average(1, 2, 3, 4) • = { count(1, 2) * average(1, 2) + count(3, 4) * average } / { count(1,2) + count(3,4) }

Example: Distributive Aggregate Function • Examples in cross-tabulation scenario (Fig. 8.14, pp.238): • Example 1. Min is distributive • Example 2. Count is distributive Fig 8.14

Examples: Algebraic Aggregate Functions • Examples in cross-tabulation scenario (Fig. 8.15, pp.239): • Average and Variance are algebraic Fig 8.15

Discussion - Spatial Data Warehouse • Example • Consider the example in Fig. 8.16, pp. 241 • A map interpretation may be attached to each report • Each row has a spatial footprint, which can be aggregated by geometric-union • The collection of maps may be called a mapcube • Issues: • What is needed in OGIS standard to support map-cube operation? • Hierarchical collection of maps in mapcube • What is an appropriate cartography to convey the relationship among maps?

Spatial Data Warehouses and Mapcube Fig 8.16

Summary • Field data • useful in many applications due to rich content • Represented as raster or image • Operations can be categorized into local, focal, zonal, and global • Field data storage and retrieval • Tiling is a preferred way to divide raster data into disk blocks • Meta-data based query is often used for retrieval • Content based retrieval may be used for similarity searches • Data warehouses support analysis e.g. cross-tabulation reports • SQL CUBE operator support generation of DWH reports • Distributive and Algebraic aggregate functions can be computed easily

Field Data and Spatial Warehouses: Trends in Database Management Systems