330 likes | 451 Views
E N D
1. 1 There Goes the Neighborhood!Spatial (or N-Dimensional) Searchin a Relational World Jim Gray, Microsoft
Alex Szalay, Johns Hopkins U.
2. 2
3. 3 Background I have been working with Astronomy community to build the World Wide Telescope: all telescope data federated in one internet-scale DB
A great Web Services app
The work here joint with Alex Szalay
SkyServer.Sdss.Org is first installment,
SkyQuery.Netis second installment (federated web services)
4. 4 Outline How to do spatial lookup:
The old way: HTM
The new way: zoned lookup
A load workflow system
Embedded documentation:literate programming for SQL DDL
5. 5 Spatial Data Access – SQL extension Szalay, Kunszt, Brunner http://www.sdss.jhu.edu/htm
Added Hierarchical Triangular Mesh (HTM) table-valued function for spatial joins
Every object has a 20-deep Mesh ID
Given a spatial definition,routine returns up to 10 covering triangles
Spatial query is then up to 10 range queries
Fast: 1,000 triangles / second / Ghz
6. 6 A typical call
7. 7 Integration with CLR Makes it Nicer Peter Kukol converted 500 lines of external stored procedure “glue code” to 50 lines of C# code.
Now we are converting library to C#
Also, Cross Apply is VERY useful
select objID, count(*)from PhotoObj p cross apply dbo.fGetNearbyObjEq(p.ra, p.dec, 1)
8. 8 But… Wanted a faster way to do this: some computations were taking toooooo long (see below).
Wanted to define areas in relational form.
Wanted a portable way that works on any relational system.
So, developed a “constraint database” approach – see below.
9. 9 The Idea:Equations Define Subspaces For (x,y) above the lineax+by > c
Reverse the space by-ax + -by > -c
Intersect 3 half-spaces: a1x + b1y > c1 a2x + b2y > c2 a3x + b3y > c3
10. 10 The Idea:Equations Define Subspaces a1x + b1y > c1 a2x + b2y > c2 a3x + b3y > c3
11. 11 Domain is Union of Convex Hulls Simple volumes are unions of convex hulls.
Higher order curves also work
Complex volumes have holes and their holes have holes. (that is harder).
12. 12 Now in Relational Terms create table HalfSpace (
domainID int not null -- domain name
foreign key references Domain(domainID),
convexID int not null, -- grouping a set of ˝ spaces
halfSpaceID int identity(), -- a particular ˝ space
a float not null, -- the (a,b,..) parameters
b float not null, -- defining the ˝ space
c float not null, -- the constraint (“c” above)
primary key (domainID, convexID, halfSpaceID)
(x,y) inside a convex if it is inside all lines of the convex
(x,y) inside a convex if it is NOT OUTSIDE ANY line of the convex
Convexes containing point (@x,@y):
select convexID -- return the convex hulls
from HalfSpace -- from the constraints
where (@x * a + @y * b) < c -- point outside the line?
group by all convexID -- insist no line of convex
having count(*) = 0 -- is outside (count outside == 0)
13. 13 All Domains Containing this Point The group by is supported by the domain/convex index, so it’s a sequential scan (pre-sorted!).
select distinct domainID -- return domains
from HalfSpace -- from constraints
where (@x * a + @y * b) < c -- point outside
group by all domainID, convexID -– never happens
having count(*) = 0 -- count outside == 0
14. 14 The Algebra is Simple (Boolean) @domainID = spDomainNew (@type varchar(16), @comment varchar(8000))
@convexID = spDomainNewConvex (@domainID int)
@halfSpaceID = spDomainNewConvexConstraint (@domainID int, @convexID int,
@a float, @b float, @c float)
@returnCode = spDomainDrop(@domainID)
select * from fDomainsContainPoint(@x float, @y float)
Once constructed they can be manipulated with the Boolean operations.
@domainID = spDomainOr (@domainID1 int, @domainID2 int,
@type varchar(16), @comment varchar(8000))
@domainID = spDomainAnd (@domainID1 int, @domainID2 int,
@type varchar(16), @comment varchar(8000))
@domainID = spDomainNot (@domainID1 int,
@type varchar(16), @comment varchar(8000))
15. 15 What! No Bounding Box? Bounding box limits search.A subset of the convex hulls.
If query runs at 3M half-space/sec then no need for bounding box, unless you have more than 10,000 lines.
But, if you have a lot of half-spaces then bounding box is good.
16. 16 OK: “solved” Areas Contain Point? What about: Points near point? Table-valued function find points near a point
Select * from fGetNearbyEq(ra,dec,r)
Use Hierarchical Triangular Mesh www.sdss.jhu.edu/htm/
Space filling curve, bounding triangles…
Standard approach
13 ms/call… So 70 objects/second.
Too slow, so pre-compute neighbors: Materialized view.
At 70 objects/sec: takes 6 months to compute materialized view on billion objects.
17. 17 Zone Based Spatial Join Divide space into zones
Key points by Zone, offset(on the sphere this need wrap-around margin.)
Point search look in a few zones at a limited offset: ra ± r a bounding box that has 1-p/4 false positives
All inside the relational engine
Avoids “impedance mismatch”
Can “batch” all-all comparisons
33x faster and parallel6 days, not 6 months!
18. 18 In SQL: points near point
19. 19 Quantitative Evaluation: 7x faster than external stored proc: (linkage is expensive) time vs. radius for neighbors function @ various zone heights. Any small zone height is adequate.
time vs. best time @ various radius. A zoneHeight of 4” is near-optimal
20. 20 All Neighbors of All points(can Batch Process the Joins) A 5x additional speedup (35x in total)for @deltaZone in {-1, 0, 1} example ignores some spherical geometry details in paper
21. 21 Spatial Stuff Summary Easy
Point in polygon
Polygons containing points
(instance and batch)
Works in higher dimensions
Side note: Spherical polygons are
hard in 2-space
Easy in 3-space
22. 22 Spatial Stuff Summary Constraint databases are in
Streams (data is query, query is in DB)
Notification: subscription in DB, data is query
Spatial: constraints in DB, data is query
You can express constraints as rows
Then You
Can evaluate LOTS of predicates per second
Can do set algebra on the predicates.
Benefits from SQL parallelism
SQL == Prolog // DataLog? ?
23. 23 References
Representing Polygon Areas and Testing Point-in-Polygon Containment in a Relational Database http://research.microsoft.com/~Gray/papers/Polygon.doc
A Purely Relational Way of Computing Neighbors on a Sphere, http://research.microsoft.com/~Gray/papers/Neighbors.doc
24. 24 Outline How to do spatial lookup:
The old way: HTM
The new way: zoned lookup
A load workflow system
Embedded documentation:literate programming for SQL DDL
25. 25 Loading consists of Many Tasks Very simply: capture, analyze to produce catalog info, convert to sql, validate, import, index,
Each of these steps has many sub-steps
We learned from the TerraServer that (1) LOADING IS WHERE THE TIME GOES.(2) You get to load it again when you discover better data you discover a bug in the data you discover a better design.(3) Essentially, you are always loading or preparing to load.
Everyone “knows this” but you have to experience it to grasp it.
26. 26 The SkyServer Load Manager Built a workflow engine with SQL Agent (our batch job scheduler) and DTS
State machine is in databasevisible to all worker nodes
Workers at each node pull work
Step is a stored procedure
logs to load monitor database
3 levels of logging: job, step, phase
Logs and reporting are VERY useful
Automatic some manual backout to fix problems
Demo http://skyserver.pha.jhu.edu/admin/tasklist.asp?all=1
27. 27 Outline How to do spatial lookup:
The old way: HTM
The new way: zoned lookup
A load workflow system
Embedded documentation:literate programming for SQL DDL
28. 28 How do you document your schema?
29. 29 Knuth’s Literate Programming Put documentation in the program
Tool generates manual from program..In the new world: tool generates online hypertext from program.
We Annotated every table, view, function, .. with tags:unitscommentreference to other tables (flags, star schema).
Program scans DDL, generates web site.
http://skyserver.sdss.org/en/help/docs/browser.asp
30. 30
31. 31
32. 32 SkyServer Object Browserhttp://skyserver.sdss.org/en/help/docs/browser.asp VB program scans DDL, generates web site.
It is VERY usefulFree Text Search VERY useful
33. 33 Outline How to do spatial lookup:
The old way: HTM
The new way: zoned lookup
A load workflow system
Embedded documentation:literate programming for SQL DDL
Not shown: A C# web service (http://skyservice.pha.jhu.edu/SdssCutout) or better yet!
http://skyservice.pha.jhu.edu/dr1/imgcutout/chart.asp