1 / 33

There Goes the Neighborhood Spatial or N-Dimensional Search in a Relational World

aya
Download Presentation

There Goes the Neighborhood Spatial or N-Dimensional Search in a Relational World

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    1. 1 There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World Jim Gray, Microsoft Alex Szalay, Johns Hopkins U.

    2. 2

    3. 3 Background I have been working with Astronomy community to build the World Wide Telescope: all telescope data federated in one internet-scale DB A great Web Services app The work here joint with Alex Szalay SkyServer.Sdss.Org is first installment, SkyQuery.Net is second installment (federated web services)

    4. 4 Outline How to do spatial lookup: The old way: HTM The new way: zoned lookup A load workflow system Embedded documentation: literate programming for SQL DDL

    5. 5 Spatial Data Access – SQL extension Szalay, Kunszt, Brunner http://www.sdss.jhu.edu/htm Added Hierarchical Triangular Mesh (HTM) table-valued function for spatial joins Every object has a 20-deep Mesh ID Given a spatial definition, routine returns up to 10 covering triangles Spatial query is then up to 10 range queries Fast: 1,000 triangles / second / Ghz

    6. 6 A typical call

    7. 7 Integration with CLR Makes it Nicer Peter Kukol converted 500 lines of external stored procedure “glue code” to 50 lines of C# code. Now we are converting library to C# Also, Cross Apply is VERY useful select objID, count(*) from PhotoObj p cross apply dbo.fGetNearbyObjEq(p.ra, p.dec, 1)

    8. 8 But… Wanted a faster way to do this: some computations were taking toooooo long (see below). Wanted to define areas in relational form. Wanted a portable way that works on any relational system. So, developed a “constraint database” approach – see below.

    9. 9 The Idea: Equations Define Subspaces For (x,y) above the line ax+by > c Reverse the space by -ax + -by > -c Intersect 3 half-spaces: a1x + b1y > c1 a2x + b2y > c2 a3x + b3y > c3

    10. 10 The Idea: Equations Define Subspaces a1x + b1y > c1 a2x + b2y > c2 a3x + b3y > c3

    11. 11 Domain is Union of Convex Hulls Simple volumes are unions of convex hulls. Higher order curves also work Complex volumes have holes and their holes have holes. (that is harder).

    12. 12 Now in Relational Terms create table HalfSpace ( domainID int not null -- domain name foreign key references Domain(domainID), convexID int not null, -- grouping a set of ˝ spaces halfSpaceID int identity(), -- a particular ˝ space a float not null, -- the (a,b,..) parameters b float not null, -- defining the ˝ space c float not null, -- the constraint (“c” above) primary key (domainID, convexID, halfSpaceID) (x,y) inside a convex if it is inside all lines of the convex (x,y) inside a convex if it is NOT OUTSIDE ANY line of the convex Convexes containing point (@x,@y): select convexID -- return the convex hulls from HalfSpace -- from the constraints where (@x * a + @y * b) < c -- point outside the line? group by all convexID -- insist no line of convex having count(*) = 0 -- is outside (count outside == 0)

    13. 13 All Domains Containing this Point The group by is supported by the domain/convex index, so it’s a sequential scan (pre-sorted!). select distinct domainID -- return domains from HalfSpace -- from constraints where (@x * a + @y * b) < c -- point outside group by all domainID, convexID -– never happens having count(*) = 0 -- count outside == 0

    14. 14 The Algebra is Simple (Boolean) @domainID = spDomainNew (@type varchar(16), @comment varchar(8000)) @convexID = spDomainNewConvex (@domainID int) @halfSpaceID = spDomainNewConvexConstraint (@domainID int, @convexID int, @a float, @b float, @c float) @returnCode = spDomainDrop(@domainID) select * from fDomainsContainPoint(@x float, @y float) Once constructed they can be manipulated with the Boolean operations. @domainID = spDomainOr (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000)) @domainID = spDomainAnd (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000)) @domainID = spDomainNot (@domainID1 int, @type varchar(16), @comment varchar(8000))

    15. 15 What! No Bounding Box? Bounding box limits search. A subset of the convex hulls. If query runs at 3M half-space/sec then no need for bounding box, unless you have more than 10,000 lines. But, if you have a lot of half-spaces then bounding box is good.

    16. 16 OK: “solved” Areas Contain Point? What about: Points near point? Table-valued function find points near a point Select * from fGetNearbyEq(ra,dec,r) Use Hierarchical Triangular Mesh www.sdss.jhu.edu/htm/ Space filling curve, bounding triangles… Standard approach 13 ms/call… So 70 objects/second. Too slow, so pre-compute neighbors: Materialized view. At 70 objects/sec: takes 6 months to compute materialized view on billion objects.

    17. 17 Zone Based Spatial Join Divide space into zones Key points by Zone, offset (on the sphere this need wrap-around margin.) Point search look in a few zones at a limited offset: ra ± r a bounding box that has 1-p/4 false positives All inside the relational engine Avoids “impedance mismatch” Can “batch” all-all comparisons 33x faster and parallel 6 days, not 6 months!

    18. 18 In SQL: points near point

    19. 19 Quantitative Evaluation: 7x faster than external stored proc: (linkage is expensive) time vs. radius for neighbors function @ various zone heights. Any small zone height is adequate. time vs. best time @ various radius. A zoneHeight of 4” is near-optimal

    20. 20 All Neighbors of All points (can Batch Process the Joins) A 5x additional speedup (35x in total) for @deltaZone in {-1, 0, 1} example ignores some spherical geometry details in paper

    21. 21 Spatial Stuff Summary Easy Point in polygon Polygons containing points (instance and batch) Works in higher dimensions Side note: Spherical polygons are hard in 2-space Easy in 3-space

    22. 22 Spatial Stuff Summary Constraint databases are in Streams (data is query, query is in DB) Notification: subscription in DB, data is query Spatial: constraints in DB, data is query You can express constraints as rows Then You Can evaluate LOTS of predicates per second Can do set algebra on the predicates. Benefits from SQL parallelism SQL == Prolog // DataLog? ?

    23. 23 References Representing Polygon Areas and Testing Point-in-Polygon Containment in a Relational Database http://research.microsoft.com/~Gray/papers/Polygon.doc A Purely Relational Way of Computing Neighbors on a Sphere, http://research.microsoft.com/~Gray/papers/Neighbors.doc

    24. 24 Outline How to do spatial lookup: The old way: HTM The new way: zoned lookup A load workflow system Embedded documentation: literate programming for SQL DDL

    25. 25 Loading consists of Many Tasks Very simply: capture, analyze to produce catalog info, convert to sql, validate, import, index, Each of these steps has many sub-steps We learned from the TerraServer that (1) LOADING IS WHERE THE TIME GOES. (2) You get to load it again when you discover better data you discover a bug in the data you discover a better design. (3) Essentially, you are always loading or preparing to load. Everyone “knows this” but you have to experience it to grasp it.

    26. 26 The SkyServer Load Manager Built a workflow engine with SQL Agent (our batch job scheduler) and DTS State machine is in database visible to all worker nodes Workers at each node pull work Step is a stored procedure logs to load monitor database 3 levels of logging: job, step, phase Logs and reporting are VERY useful Automatic some manual backout to fix problems Demo http://skyserver.pha.jhu.edu/admin/tasklist.asp?all=1

    27. 27 Outline How to do spatial lookup: The old way: HTM The new way: zoned lookup A load workflow system Embedded documentation: literate programming for SQL DDL

    28. 28 How do you document your schema?

    29. 29 Knuth’s Literate Programming Put documentation in the program Tool generates manual from program.. In the new world: tool generates online hypertext from program. We Annotated every table, view, function, .. with tags: units comment reference to other tables (flags, star schema). Program scans DDL, generates web site. http://skyserver.sdss.org/en/help/docs/browser.asp

    30. 30

    31. 31

    32. 32 SkyServer Object Browser http://skyserver.sdss.org/en/help/docs/browser.asp VB program scans DDL, generates web site. It is VERY useful Free Text Search VERY useful

    33. 33 Outline How to do spatial lookup: The old way: HTM The new way: zoned lookup A load workflow system Embedded documentation: literate programming for SQL DDL Not shown: A C# web service (http://skyservice.pha.jhu.edu/SdssCutout) or better yet! http://skyservice.pha.jhu.edu/dr1/imgcutout/chart.asp

More Related