470 likes | 1.29k Views
Indexing a Sphere. Jim Gray, Microsoft Alex Szalay, Johns Hopkins University + Gyorgy Fekete, Maria Nieto Santisteban (JHU) April 2005. 3 Ways We Do Spatial?. Hierarchical mesh (extension to SQL) Uses table valued stored procedures Acts as a new “spatial access method”
E N D
Indexing a Sphere Jim Gray, Microsoft Alex Szalay, Johns Hopkins University + Gyorgy Fekete, Maria Nieto Santisteban (JHU) April 2005
3 Ways We Do Spatial? • Hierarchical mesh (extension to SQL) • Uses table valued stored procedures • Acts as a new “spatial access method” • Ported to Yukon CLR for a 3x speedup. • Zones: fits SQL well • Surprisingly simple & good. • Constraints: a novel idea • Lets us do algebra on regions. • Paper:There Goes the Neighborhood: Relational Algebra for Spatial Data Search
Sky Points and Regions Resources and Footprints • Sky coverage of surveys • Spatial pattern of observations • Regions with bad data (masks) Data Management • Organize objects by sky regions into buckets • Density of objects highly variable on the sky • Need a hierarchical partitioning • Need to map to (btree) index in databases • Search database by region
Typical Questions • Is this point in the survey? • Give me all objects in this region • What surveys covers this part of the sky? • What is the common area of these surveys? • Give me all “good” objects (not in bad areas) • Cross-match these two catalogs • Give me cumulative counts (Histogram) over area • Compute fast spherical transforms of densities • Interpolate sparsely sample functions (extinction maps, dust temperature, …)
Typical Calls -- find objects within 1 arcminute of (60,20) select objID, ra, dec from PhotoObj as p, fHtmCover(60,20,1) as triangle where p.htmID between triangle.startHtmID and triangle.endHtmID and <geometry test on (ra,dec) – (60,20) < 1 arcmin> -- or better yet select objID, ra, dec, distance from dbo.fGetNearbyObjEq(60,20,1) Coarse distance test careful distance test rejects false positives • -- Also, Cross Apply is VERY useful • select objID, count(*)from PhotoObj p cross apply dbo.fGetNearbyObjEq(p.ra, p.dec, 1)
y x=c/a x ax + by = c y=c/b y x Preliminaries:Equations Define Subspaces • For (x,y) above the lineax+by > c • Reverse the space by-ax + -by > -c • Intersect a 3 volumes: a1x + b1y > c1a2x + b2y > c2a3x + b3y > c3
Preliminaries • Spherical geometry is really hard • Distance acos((sin(radians(lat1))sin(radians(lat2)))+(cos(radians(lat1)))cos(radians(lat2))cos(radians(abs(lon1-lon2))))) • Area forget it! • But, astronomers have it easy • The geoid is not a sphere! • A trick: do everything in 3-space with planes
Cos(r) Escape From Surface of Sphere to 3D Avoids Spherical Geometry • Circle C defined by plane intersect sphere. Plane is cx,cy,cz unit vector & length = cos(radians(radius)) • Point P = (px,py,pz)Inside circle C ifP•C > length no transcendental functions – just 3 multiplies and a compare. • Arbitrary polygons are intersections of these regions.
Project Sphere onto OctahedronSzalay, Kunszt, Brunner • Every point belongs to one of 8 faces one of 8 triangles
1,2,2 1,2,1 1,2 1,2,3 1,2,0 1,1,3 1,1,0 1,1 1,3,2 1,0,2 1,1,1 1,3,1 1,0,1 1,1,2 1,0 1,3 1,3,3 1,0,3 1,3,0 1,0,0 Recursively Divide Triangles(into 4 sub-triangles) • This is a space-filling curveHausdorf, Hilbert, Peano, Z-transform… 1
1,2,2 [0.123,0.130) [0.120,0.121) [0.121,0.122) [0.122,0.123) [0.123,0.130) 1,2,1 1,2,3 1,2,0 1,1,3 1,1,0 [0.120,0.121) [0.121,0.122) 1,3,2 1,0,2 1,1,1 1,3,1 1,0,1 1,1,2 1,0,3 1,0,0 1,3,3 1,3,0 Space-Filling Curve:Each Triangle Maps to The Real Line [0..1] [0.12,0.13) So triangles correspond to ranges All points inside the triangle are inside the range.
How Know if point is inside triangle? • Answer: • Triangle sides are 3 great circles C1, C2, C3 • So (P•C1 > L1) & (P•C2 > L2) & (P•C3 > L3) Given an area (spherical polygon)what triangles cover it? • Answer: • Recursive descent • Triangle overlaps if any corner is inside polygonand/or any triangle edge intersects polygon. • Return list of start-stop pairs. HTM Library computes triangles overlapping an area.
Find Points Inside Area • fGetNearbyObjEq(ra,dec,r)fGetNearbyObjXyz(x,y,z,r)fGetNearestObjEq(ra,dec,r)fGetNearestObjEq(x,y,z,r) • fGetObjInside(areaSpec) -- string area description • Alex added Healpix,Igloo is also a nice iso-area decomposition
Digression: How does Search Work? • Coarse filter (index) reduces search space • Btree, QuadTree, HashTable, Bitmap, R*tree, … • Careful test discards false positives • Efficiency: Careful test True + false True
DigressionWhat is an Index? • New organization of the data • Quick way to go from attribute to values • A [alt key] → [primary key] map • HTM index is a quick (and clustered) way to go from HTM triangle ID (range) to all points in the triangle. [area] → {points}
DigressionB-Trees are Good at Range Lookups If you can tell me the triangles T (start, stop) covering an area: select * from points P join triangles Ton P.HTM_ID between T.start and t.stop Returns all points inside the triangles
B-Trees are Good at Range Lookups If you can tell me the triangles T (start, stop) covering an area: select * from points P join triangles dbo.fHTMcover(@ra,@dec,@r) Ton P.HTM_ID between T.start and t.stop Returns all points inside the triangles points Triangles Nested Loops Join Start Stop HTM Index
using System; using System.Collections; using System.Data; using System.Data.Sql; using System.Data.SqlTypes; namespace xphtm{ publicpartialclassHTM{ [SqlFunction(IsDeterministic = true, IsPrecise = true) ] publicstaticlong HTMlookup(int level, double RA, double dec) { double x, y, z; HtmTrixel ht = newHtmTrixel(); SpatialVector.radec2cartesian(RA, dec, out x, out y, out z); return ht.cartesian2HID(x, y, z, level); } [SqlFunctionAttribute(FillRowMethodName = "FillCoverRow",TableDefinition = "bigint HTMID_start, bigint HTMID_end") ] publicstaticIEnumerablefHTMcover(floatra, float dec, float radius) { ArrayList rowSetAnswer = newArrayList(); Int64[,] arrayAnswer = null; arrayAnswer = HtmCover.Circle(ra,dec,radius); int i = 0; for (int i = 0; i < (arrayAnswer.Length / 2); i++) { CoverRow rp = newCoverRow(arrayAnswer[i,0], arrayAnswer[i, 1]); rowSetAnswer.Add(rp); } return (IEnumerable) rowSetAnswer; } publicstaticvoid FillCoverRow(object source, outlong HTMIDstart, outlong HTMIDend) { CoverRow row = (CoverRow)source; HTMIDstart = row._HTMIDstart; HTMIDend = row._HTMIDend; } publicclassCoverRow { publiclong _HTMIDstart; publiclong _HTMIDend; public CoverRow(long HTMIDstart, long HTMIDend) { _HTMIDstart = HTMIDstart; _HTMIDend = HTMIDend; } } } } We are Done: Returns HTM ID of a point Returns table of HTM triangles SQL wrapper
History/Performance • HTM has ~½ false positive on circle (all sizes) • Pre-Yukon: • xp wrapper >500 lines C++ • Table linkage expensive (~400 μs), cover ~1.3ms • Yukon • Code MUCH simpler. • Scalar linkage: 18 μs, table linkage 80μs • HTM: lookup: 30 μs, cover: 500 μs Linkage costs: (HTM is 400 μs more)
Poster Child App • SkyServer lookups • Points near point • Points in polygon • Polygons overlap polygon • 2,000 lookups/second/cpu • Can add 10,000 points /second/cpu
HTM Library • Current implementation in C#, public domain. • automatic code generation for C++ and Java • A DB2 port (ROE), 2MASS using with Illustra • Used at over 20 places in the world • 15 SkyQuery surveys have HTM • Easy to add HTM to your data • In the VO, the ADQL SkyNode supports the spatial language http://www.sdss.jhu.edu/htm/
Summary of HTM • Good for Points in area. • Efficiency comes from going to 3-space. • Uses table-valued function as an index generator. • New approach to bounding boxes. • Nested-loop join with function on outside. • Not good for area-overlaps-area(but can simplify areas and test for empty)
y x=c/a x ax + by = c y=c/b y x To Repeat (for area algebra) Equations Define Subspaces • For (x,y) above the lineax+by > c • Reverse the space by-ax + -by > -c • Intersect a 3 volumes: a1x + b1y > c1a2x + b2y > c2a3x + b3y > c3
Domain is Union of Convex Areas Not a convex • Regions are unions of convex areas • Higher order curves also work • Complex areas have holes and their holes have holes. (that is harder). Holes are harder + It is the union of two convexs
Now in Relational Terms create table HalfSpace ( domainID int not null -- domain name foreign key references Domain(domainID), convexID int not null, -- grouping a set of ½ spaces halfSpaceID int identity(), -- a particular ½ space x float not null, -- the (a,b,..) parameters y float not null, -- defining the ½ space z float not null, c float not null, -- the constant (“c” above) primary key (domainID, convexID, halfSpaceID) (x,y,z) inside a convex if it is inside all lines of the convex (x,y,z) inside a convex if it is NOT OUTSIDE ANY line of the convex select convexID -- return the convex from HalfSpace -- from the constraints where @x * x + @y * y + @x * z < c -- point outside the line? group by all convexID -- consider all the lines of a convexID having count(*) = 0 -- count outside == 0
The Algebra is Simple (Boolean) @domainID = spDomainNew (@type varchar(16), @comment varchar(8000)) @convexID = spDomainNewConvex (@domainID int) @halfSpaceID = spDomainNewConvexConstraint (@domainID int, @convexID int, @x float, @y float, @z float, @c float) @returnCode = spDomainDrop(@domainID) select * from fDomainsContainPoint(@x float, @y float, @z float) Once constructed they can be manipulated with the Boolean operations. @domainID = spDomainOr (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000)) @domainID = spDomainAnd (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000)) @domainID = spDomainNot (@domainID1 int, @type varchar(16), @comment varchar(8000))
Defining Areas (Regions) • Circle: Center point + radiusRectangle: Corner points of squarePolygon: sequence corner points or edges Region: union of convexes • As with all spatial, many coordinate systems (equatorial, Cartesian, galactic, …) • Have a linear syntax: • CIRCLE J2000 RA DEC radius • CONVEX CARTESIAN 1 0 0 0 1 0 0 0 1… • Have advanced XML syntax for space-time. Spec: http://www.ivoa.net/Documents/latest/STC-X.htmlPerson: Arnold Rots http://hea-www.harvard.edu/~arots/ • Similar to GIS, but better suited for astronomy
String (easy) Areas (or struct in C# world) circleSpec := CIRCLE J2000 ra dec radArcMin | CIRCLE CARTESIAN x y z radArcMin rectSpec := RECT J2000 {ra dec}2 polySpec := POLY J2000 {ra dec}3+ | POLY CARTESIAN { x y z }3+ convexSpec := CONVEX { x y z d}+ regionSpec := REGION { convexSpec }+ areaSpec := circleSpec | rectSpec | polySpec | regionSpec
Regions • Easy to compare regions. • Easy to combine regions (union, difference, intersection, buffer,…) • Easy to find points in region • Easy to find regions containing points • Easy to find regions overlap region.
Some Details • HTM library used to simplify (detect nulls) • Compute (in TSQL now, C# later): • Corner points of region • Area of region • Bounding circle of region • Center point + bounding circle is spatial index. • See: “Simplify a Convex” Szalay/Gray.
What! No Bounding Box? • Bounding box limits search.A subset of the convexs . • If query runs at 3M halfspace/sec then no need for bounding box, unless you have more than 10,000 lines. • But, if you have a lot of half-spaces then bounding box is good. • We have bounding circles.
Things Can Get Complex This is convex!
Poster Child App • Used as footprint service. • Take many footprints • Fuzz them (buffer) to make coarser footprintconvex hull • See if two footprints overlap • ~20 lines of code +130 lines of logic/comments
Zones: Comparing Massive Point Datasets • Want to “cross match” 109 objects with another 109 objects • Using the HTM code, that is a LOT of calls (~106 seconds (~ a year)) • Want to do it in an hour.
Zone Approach • Divide space into zones • Key points by Zone, offset(on the sphere this need wrap-around margin.) • Point search look in a few zones at a limited offset: ra ± r a bounding box that has 1-π/4 false positives • All inside the relational engine • Avoids “impedance mismatch” • Can “batch” all-all comparisons • faster and 60,000x parallel1 hours, not 6 months! • This is Maria Nieto Santisteban’s PhD thesis r ra-zoneMax x √(r2+(ra-zoneMax)2) cos(radians(zoneMax)) zoneMax Ra ± x
In SQL: points near point Ignoring some spherical geometry details, see paper. select o1.objID -- find objects from zone o1 -- in the zoned table where o1.zoneID between -- where zone # floor((@dec-@r)/@zoneHeight) and -- overlaps the circle ceiling((@dec+@r)/@zoneHeight) and o1.ra between @ra - @r and @ra + @r -- quick filter on ra and o1.dec between @dec-@r and @dec+@r -- quick filter on dec and ( (sqrt( power(o1.cx-@cx,2)+power(o1.cy-@cy,2)+power(o1.cz-@cz,2)))) < @r -- careful filter on distance Bounding box Eliminates the ~ 21% = 1-π/4 False positives Competes with HTM (faster because no xp linkage), but …. What about comparing 1 B points with 1B points, cross match. Still too slow.
time vs. radius for neighbors function @ various zone heights. Any small zone height is adequate. time vs. best time @ various radius. A zoneHeight of 4” is near-optimal Quantitative Evaluation: 7x faster than external stored proc: (linkage is expensive)
All Neighbors of All points(can Batch Process the Joins) • A 5x additional speedup (35x in total)for @deltaZone in {-1, 0, 1} example ignores some spherical geometry details in paper insert neighbors -- insert one zone's neighbors select o1.objID as objID, -- object pairs o2.objID as NeighborObjID, .. other fields elided from zone o1 join zone o2 -- join 2 zones on o1.zoneID-@deltaZone = o2.zoneID -- using zone number and ra and o2.ra between o1.ra - @r and o1.ra + @r -- points near ra where -- elided margin logic, see paper. and o2.dec between o1.dec-@r and o1.dec+@r -- quick filter on dec andsqrt(power(o1.x-o2.x,2)+power(o1.y-o2.y,2)+power(o1.z-o2.z,2)) < @r -- careful filter on distance Many other speedups, and then there is parallelism.
Poster Child Apps • CrossMatch SDSS with itself (250M obj)Producing neighbors within 30 arcseconds • Was 4,000 hours (estimated) • Now 2 hours • MaxBCG (cross-match SDSS to find clusters) • 9,167 cpu hours with “files” (normalize to 2,300 hrs?) • 240 cpu hours, 44 elapsed hours with zones • See “DB meets Grid” paper • Crossmatch • USNO (US Naval Observatory Catalog): 1.2 B obj • 2MASS (2 meter all sky survey): 0.5B objects
More Poster ChildUSNO-2Mass CrossMatch • Sequential: first try, 22 hours • Optimize code: 6.25 hours • Note: • data on separate machine • part of zone time is “pull” of data
Zones allow 60,000 Parallel Jobs Partition Parallelism: 3.7 hours 2MASS:USNOBZone:ZoneComparison MergeAnswer Build Index Source Tables Zoned Tables 2MASS→USNOB 350 Mrec 12 GB 2MASS 471 Mrec 140 GB 0:-1 64 Mrec 2 GB 2MASS 471 Mrec 65 GB 0:0 260 Mrec 9 GB USNOB 1.1 Brec 233 GB USNOB 1.1 Brec 106 GB 350 Mrec 12 GB 0:+1 26 Mrec 1 GB USNOB→2MASS 2 hours 1.2 hour .5 hour
Pipeline Parallelism: 2.5 hours Or… as fast as we can read USNOB + .5 hours 2MASS:USNOBZone:ZoneComparison MergeAnswer Build Index Source Tables Zones 2MASS→USNOB 350 Mrec 12 GB 2MASS 471 Mrec 140 GB 0:-1 64 Mrec 2 GB Next zone 0:0 260 Mrec 9 GB USNOB 1.1 Brec 233 GB Next zone 350 Mrec 12 GB 0:+1 26 Mrec 1 GB USNOB→2MASS 2 hours .5 hour
Spatial Stuff Summary • Easy • Point in polygon • Polygons containing points • (instance and batch) • Works in higher dimensions • Side note: Spherical polygons are • hard in 2-space • Easy in 3-space
Spatial Stuff Summary • Constraint databases are “cool” • Streams (data is query, query is in DB) • Notification: subscription in DB, data is query • Spatial: constraints in DB, data is query • You can express constraints as rows • Then You • Can evaluate LOTS of predicates per second • Can do set algebra on the predicates. • Benefits from SQL parallelism • SQL == Prolog // DataLog?
In SQL select o1.objID -- find objects from zone o1 -- in the zoned table where o1.zoneID between -- where zone # floor((@dec-@r)/@zoneHeight) and -- overlaps the circle floor((@dec+@r)/@zoneHeight) and o1.ra between @ra - @r and @ra + @r -- quick filter on ra and o1.dec between @dec-@r and @dec+@r -- quick filter on dec and ( (sqrt( power(o1.cx-@cx,2)+power(o1.cy-@cy,2)+power(o1.cz-@cz,2)))) < @r -- careful filter on distance Bounding box Eliminates the ~ 21% = 1-π/4 False positives
References\\barcfs\public\Gray\Spatial There Goes the Neighborhood: Relational Algebra for Spatial Data Search, J. Gray; A. S. Szalay; G. Fekete; M. A. Nieto-Santisteban; W. O'Mullane; A. R. Thakar; G. Heber; A. H. Rotshttp://research.microsoft.com/research/pubs/view.aspx?tr_id=736 Roots of a Convex, Alex Szalay, Jim Gray, March 2005 Representing Polygon Areas and Testing Point-in-Polygon Containment in a Relational Database Jim Gray, Alex Szalay spRegion.SQLJim Gray, Alex Szalay SQL2005 CTP Table-Valued Function Performance vs SQL2000, Jim Gray, Peter Kukol Preliminary Study of SQL2005 C# HTM procedures Jim Gray, Peter Kukol, Gyorgy Fekete, Alex Szalay Computing the Match Table Jim Gray, Alex Szalay, Robert Lupton, Jeff Munn When Database Systems Meet the GridM. A. Nieto-Santisteban, J. Gray, A. S. Szalay, J. Annis, A. R. Thakar, W. J. O’Mullane Creating Sectors for SkyServer A. Szalay, G. Fekete, T. Budavari, J. Gray, A. Pope, A. Thakar Partitioning the Sky for Spatial Data TasksJim Gray, Maria Nieto Santisteban, Alex Szalay Spatial Clustering of Galaxies in Large DatasetsA. S. Szalay, T. Budavari, A. Connolly, J. Gray, T. Matsubara, A. Pope, I. Szapudi