390 likes | 480 Views
Chapter 8: Trends in DBMS 8.1 Database Support for Field Entities 8.2 Content-based Retrieval 8.3 Introduction to Spatial Data Warehouses 8.4 Summary. Why Learn about Field Data-sets?. Field data is timely and abundant Sensors (e.g. satellite based ones) provide periodic snapshot of Earth
E N D
Chapter 8: Trends in DBMS8.1 Database Support for Field Entities8.2 Content-based Retrieval8.3 Introduction to Spatial Data Warehouses8.4 Summary
Why Learn about Field Data-sets? • Field data is timely and abundant • Sensors (e.g. satellite based ones) provide periodic snapshot of Earth • Most up-to-date data about current events (e.g. fires, flood) • Field data are useful • In creating, revising and evaluating vector data sets • Digital archival of fragile historical paper maps • To manually get details not captured in vector interpretations • Example: Location selection for a facility (e.g. a grocery store) • Consider a set of Aerial photographs of different locations • Vector interpretation includes roads, water bodies, elevation • What other information can aerial imagery reveal for construction planning? • trees (types and location), buildings, …
What are Field Data-sets? • Field data set examples • Satellite images, aerial photographs • Digitized paper maps • Earth Science data-sets, e.g. rainfall, temperature maps • Data types of Spatial field data sets • Images • satellite based, e.g. www.terradata.com • aerial photographs • measurements from a Geo-registered sensor networks, e.g. weather • Video, i.e. time series of images • Audio data • Focus: Primarily images though some discussion will apply to other data types
Fields and Rasters: a Sampling of Field Values • Definitions • Field: a mapping from a spatial domain to a value domain • Image: a mapping from a rectangular grid to a value domain • A rectangular grid is a collection of cells called pixels • Raster is geo-registered image, i.e. grid axis have absolute spatial locations • Fields are often approximated as rasters • Example: Figure 8.1 • Identify spatial domain, field, rectangular grid, raster approximation • Fields can be approximated as images if relative spatial locations are adequate Figure 8.1
Computing with Field Data • Field data manipulated using operations of • map algebra • image algebra • An Algebra is a mathematical structure consisting of • Operands and Operations • Map Algebra • Operand: rasters • Operations: Can be classified into four groups • Local, Focal, Zonal and Global • Image Algebra • Operand: images • Operations: crop, zoom, rotate
Local Operation • A local operation maps a raster into another raster such that the value of a cell in the new raster depends only on the value of that cell in the original raster • Examples: unary operation : thresholding • binary operation: point wise addition Figure 8.2
Focal Operation • In a focal operation, the value of a cell in the new raster is dependent on the values of the cell and its neighboring cells in the original raster • Examples: unary operations: focal sum, gradient, … Neighborhoods: Rook, Bishop and Queen Figure 8.3
Zonal Operation • In a global operation, the value of a cell in the new raster is a function of the location or values of all cells in the original or another raster • Examples: zonal sum, zonal average, ... Figure 8.4
Global Operation • In a zonal operation, the value of a cell in the new raster is a function of the value of that cell in the original layer and the values of other cells which appear in the same zone specified in another raster • Example: distance from nearest facility Figure 8.5
Image Operations: Trim • Image Operations • Ignore the absolute locations of pixels. • Come from image processing literature • Ex. smoothing, low pass filter, high pass filter • Example: A trim operation extracts an axis-aligned subset of the original raster Figure 8.6
Storage and Retrieval of Raster Data - 1 • Traditional Approach • Store raster data in a file system • Use custom software to retrieve data-items of interest • Example: personal photographs stored on MS Windows • Q? What attributes can one attach to digital photographs ? • Q? Is there an easy way to retrieve all pictures taken in San Francisco? • Limitations • Rigid schema • limited ability to add and manage additional attributes • Canned Queries only • limited ability to support ad-hoc queries • Data quality • limited ability to identify duplicates or similar data-items
Storage and Retrieval of Raster Data in a SDBMS • A database approach • Database tables store • raster data items • attributes (i.e. meta-data), e.g. creation date, geo-location, subject, ... • Use SQL like query language to retrieve desired data-items • retrieve all raster data-items overlapping with city of San Francisco (Q1) • retrieve latest raster data-item within city of Paris (Q2) • retrieve raster data-items similar to a given image (Q3) • Pros: • table schema definition allows user defined attributes • improve ability to pose ad-hoc queries (Ex. Q1, Q2) • improve data reliability and quality • example: Query Q3 may be used for duplicate reduction
Storage and Retrieval of Raster Data - Challenges • Challenges in database based approach • Storage: size( raster data item) > size (disk blocks) • Retrieval: raster has rich content • a picture is worth a thousand word! • Approaches to storage challenge • Delegate storage to DBMS Use Binary Large Object (BLOB) data-type create table my_picture( image: BLOB; creation_date: date; place: point; … ) • Do-it-yourself • divide a raster data-item into smaller slices • Q? Which way of slicing reduce disk I/Os for common queries?
How is Raster Data Stored on Secondary Storage? • Slicing approaches • Linear, e.g. one row per disk block (see Figure 8.8(b)) • Tiling - see Figure 8.8(c ) • Tiling is preferred • For queries extracting rectangular sub-images • Example - terraserver.com Figure 8.8
How is Raster Data Queried? • Retrieval challenge of rich content • Meta-data approach • Content based retrieval • Meta-data approach • Select a set of descriptive attributes • simpler SQL data types, e.g. numeric, string, date, ... • example: source, location, time stamp, subject, resolution, ... • Store values of descriptive attributes for each raster data-item • Allow SQL queries on the descriptive attributes • Limitation of meta-data approach • Restricts queries to content captured by descriptive attributes • Does not support “Similarity” based queries • example: Find all raster data-items similar to a given raster data item
Content Based Retrieval (CBR) • Examples • Q1. Find all raster data-items similar to a given raster data item • Q2. Locate a photograph of a river in Minnesota with trees nearby • Q3. Find all images of state parks which have a lake within them, are within a radius of one hundred miles from Chicago, and are southwest of Chicago • State of the Art • However, few robust implementations of CBR are available as of 2002 • several research prototypes address similarity query Q1 • Result quality is similar to those of web searches (e.g. www.google.com) • some of the retrieved raster data-item are useful • many similar data item are not retrieved in the result • usable in application domains such as publishing • Our goal is to understand a current approach to similarity queries • involving spatial similarities
Content Based Retrieval (CBR) • Spatial Similarity • Consider a pair of raster images with common objects (e.g. parks, lakes) • Spatial similarity between raster images can be defined based on • similarity of spatial relationships (e.g. topological, directional) • Q? Which pairs exhibit higher similarity? • P1: (inside, disjoint) or P2: (inside, covered by) • P3: (disjoint, touch) or P4: (disjoint, inside) • P5: (north west, north) or P6: (west, east) • A graph framework for comparing spatial relationships • Nodes = spatial relationships • Edges = connect most similar nodes • Similarity metric = number of edge on shortest path between 2 nodes • See Figures 8.9 and 8.10
Topological Relationship Similarity • Study Figure 8.9, pp.234 • Nodes = topological relationships • Edges = most similar • Similarity measure = path length • Inference from Model • P2: (inside, covered by) more similar than P1: (inside, disjoint) • Do you agree? • Review Figure 2.3 (pp.30) Figure 8.9
Direction Relationship Similarity • Study Figure 8.10, pp.235 • Nodes = topological relationships; Edges = most similar • Similarity measure = path length • Inference: P5 (north-west, north) more similar than P6 (west, east) Figure 8.10
Distance Similarity • Distance similarity is based on • Euclidean distance between the centroids of the objects. • Example: Image R is more similar to P than Q in Figure 8.11 (pp.235) Figure 8.11
A Computational Approach to CBR • Attribute Relation Graph (ARG) • Node = objects in a raster • Edges = relationships • Example: raster of Figure 8.12(a) • ARG in Figure 8.12(b) • point object O3 • rectangles O1, O2 • edge (O1, O2) shows that they are disjoint, at 61 degree direction and 5.2 units distant • Vector representation of ARG • Lists objects and edge properties • Example in Figure 8.12 Figure 8.12
A Computational Approach to CBR • Steps: • Represent each raster data item by its ARG vector • Map query raster data item by its ARG vector • Find most similar raster data-items in the database by comparing ARG vector representations • use a distance metric • use a multi-dimensional index • Comment: Result quality is similar to those of web searches. Some of the retrieved raster data-item are useful Figure 8.13
Why are Data Warehouses Interesting? • Data Warehouse facilitate group decision making • Consider a dataset • 1 measure (i.e. Sales) • 3 dimensions (e.g. Company, Year, Region) • Analysis questions • Q1. Rank Regions by total sales. • Q2. Rank years by total sales. • Q3. Where are sales consistently growing? • Cross tabulates summaries reports used to analyze the trends • Example:
Generating Cross-tabulation Summaries • Traditional Approach • Use custom software pulling data out of a DBMS • Limitations: redundant of work, inefficient use of resources • Data Warehouse approach • Cross-tab. Can be generated using a set of simple report • each report is generated from a SQL “Select ... group by” statement • Example: Fig. 8.19 (pp. 244) and Table 8.3 (pp. 245) • cross-tab example in last slide is a union of • SALES-L0-A, SALES-L1-A, SALES-L1-B and SALES-L2 • table 8.3 shows SQL queries to compute each part • Advantage • rest of SQL is available for pre/post processing of data • performance gains by eliminating unnecessary copying of data
Example Data Warehouse Figure 8.19
Cross-tabulation vs. Report Hierarchy • Spreadsheet view of a report • Views a report a N-dim. Spreadsheet • N = number of dimension attributes • Each cell contains value of “measure” • Cross-tabulation view of a Report hierarchy • Example: report hierarchy for SALES-L0-A, SALES-L2-A, SALES-L1-B, SALES-L2, Figure 8.19 (pp.244)
What is a Data Warehouse? • Data Warehouse is a special purpose database • Primarily used for specialized data analysis purposes • Facilitates generation and navigation of a hierarchy of reports • Special purpose data-sets and queries • Data consists of: • a few measure attributes • a set of dimension attributes • The measure attribute depends on dimension attributes • Queries generate reports • report measure for selected values of dimensions • aggregate measure for given subset of dimensions • What is a spatial data warehouse? • Data warehouses with spatial measures or dimensions • Example: census data - census tract is a spatial dimension • Example: logistics data - route is a spatial dimension
Data Warehouse Operations • Operations on a data warehouse • Roll-up, Drill-down • Slice, Dice • Pivot • Roll-up • Inputs: A report R, A subset S of dimensions in R • Output: A sequence of reports summarizing R • Example 1: R=SALES-Base, S=(Year, Region) in Figure 8.19 (pp. 244) • Output consists of reports SALES-L0-A, SALES-L1-B, SALES-L2 • Example 2: R=SALES-Base, S=(Region, Year) • Output consists of reports SALES-L0-A, SALES-L1-A, SALES-L2 • Drill-down • Inputs: A report R, A dimension D not in R • Output: A reports detailing R on D • Example: R = SALES-L1-B, D = Region in Figure 8.19 (pp.244) • Output : report SALES-L0-A
Data Warehouse Operations • Slice, Dice • Reduce dimensions in a table (Figure 8.7, pp.232) • Inputs: A report R, A value V for a dimension D in R • Output: A subset of R where D=V • Example: R=SALES-L0-A, D=Year, V=1994 in Figure 8.19 (pp.244) • output: Table 8.5 (pp.246) • includes tuple (ALL, 1994, America, 35) Figure 8.7
Data Warehouse Operations • Pivot • For a spreadsheet view of reports • Transposes a spreadsheet • Example • Inputs: A spreadsheet view of a report R • Output: A transposed spreadsheet • Example: R= SALES-L0-A, Figure 8.19 (pp.244)
Logical Data Model of a DWH • Purpose of a logical data model • Specify a framework to specify computational structure • Allow extension of SQL to model new needs • Cube operation • Input : A fact table • Output: A set of summary reports covering all subsets of dimension columns • equivalent to union of all tables and reports in Figure 8.19 (pp.244) • Example: Figure 8.18, pp.243 • SELECT Company, Year, Region, Sum(Sales) AS Sales • FROM SALES • GROUP BY CUBE Company, Year, Region
Physical Data Model of a DWH • Purpose: Computationally efficient implementation • Ideas: • Pre-computation - • pre-compute some of reports and use those to compute other reports • New indexing methods, e.g. bit-map index • Query Processing Strategies • strategies for aggregate functions • new strategies for multi-table joins • Let us look at strategies for aggregate functions
DWH Physical Model: Aggregate Function Strategies • Aggregate Functions • Compute summary statistics for a given set of values • Examples: sum, average, centroid (Table 8.1, pp.238) • Strategies for efficient computation • Characterize easy to compute aggregate functions • 3 categories: • distributive • algebraic • holistic • First 2 categories can be computed easily in one scan of the dataset
Definitions of Aggregate Function Categories • Notation: • F,G,G1,G2,…,Gn are aggregate functions where n is small • S is a set of values, e.g. S=(1,2,3,4) • P=(S1,S2,…,Sp) is a partition of S, e.g. P=(S1,S2), S1=(1,2), S2=(3,4) • Distributive (F) if there exists a G such that • F(S) = G(F(S1),F(S2),…,F(Sn)) • Example: sum is distributive • Illustration: sum(1,2,3,4) = sum(sum(1,2),sum(3,4)) • Algebraic (F) if there exists G1,…,Gn, (where n is small) and • F(S) = G(G1(S1),…,Gn(S1),G2(S1),…,Gn(S2),…,G1(Sp),…, Gn(Sp)) • Example: average is distributive • Illustration: average(1,2,3,4)={count(1,2)*average(1,2)+count(3,4)*average} / {count(1,2)+count(3,4)}
Example: Distributive Aggregate Function • Examples in cross-tabulation scenario (Figure 8.14, pp.238): • Example 1: Min is distributive • Example 2: Count is distributive Figure 8.14
Examples: Algebraic Aggregate Functions • Examples in cross-tabulation scenario (Figure 8.15, pp.239): • Average and Variance are algebraic Figure 8.15
Discussion - Spatial Data Warehouse • Example • Consider the example in Figure 8.16, pp.241 • A map interpretation may be attached to each report • each row has a spatial footprint, which can be aggregated by geometric-union • The collection of maps may be called a mapcube • Issues: • What is needed in OGIS standard to support map-cube operation? • Hierarchical collection of maps in mapcube • what is an appropriate cartography to convey the relationship among maps?
Spatial Data Warehouses and Mapcube Figure 8.16
Summary • Field data • Useful in many applications due to rich content • Represented as raster or image • Operations can be categorized into local, focal, zonal, and global • Field data storage and retrieval • Tiling is a preferred way to divide raster data into disk blocks • Meta-data based query is often used for retrieval • Content based retrieval may be used for similarity searches • Data warehouses support analysis e.g. cross-tabulation reports • SQL CUBE operator support generation of DWH reports • Distributive and Algebraic aggregate functions can be computed easily