Prototype Web Services Using SDSS DR1. Alex Szalay, Tamas Budavari, Sam Carlisle, Jim Gray, Vivek Haridas, Nolan Li, Tanu Malik, Maria Nieto-Santisteban, Wil O’Mullane, Ani Thakar
NVO: How Will It Work? • Define commonly used ‘core’ services • Build higher level toolboxes/portals on top • We do not build ‘everything for everybody’ • Use the 90-10 rule: • Define the standards and interfaces • Build the framework • Build the 10% of services that are used by 90% • Let the users build the rest from the components
Using SDSS DR1 • SDSS DR1 (Data Release 1) is now publicly available: http://skyserver.pha.jhu.edu/dr1/ • About 1TB of catalog data • Using MS SQL Server 2000 • Complex schema (72 tables) • About 80 million photometric objects • Two versions (TARGET/BEST) • Automated documentation • Raw data at FNAL file server with URL access
Loading DR1 • Automated table driven workflow system for loading • Included lots of verification code • Over 16K lines of SQL code • Loading process was extremely painful • Lack of systems engineering for the pipelines • Poor testing (lots of foreign key mismatches) • Data bugs were still being detected a month ago • Most of the time spent on scrubbing data • Fixing corrupted files (RAID5 disk errors) • Once data was clean, everything loaded in 3 days • Neighbors calculation took about 10 hours • Reorganization of data took about 1 week of experiments in partitioning/layouts
Reorganization • Introduced partitions and filegroups • Photo, Tag, Neighbors, Spectro, Frame, Other, Profiles • Keep partitions under 100GB • Vertical partitioning – tried and abandoned • Both partitioning and index build now table driven • Stored procedures to create/drop indices at various granularities • Tremendous improvement in performance when doing this on a large memory machine (24GB) • Also much better performance afterwards
Spatial Features • Precomputed Neighbors • All objects within 30” • Boundaries, Masks and Outlines • Stored as spatial polygons • Time Domain: • Precomputed Match • All objects within 1”, observed at different times • Found duplicates due to telescope tracking errors • Manual fix, recorded in the database • MatchHead • The first observation in the linked list, used as a unique id for the chain of observations of the same object
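The MatchHead idea can be illustrated with a small sketch. It assumes the precomputed Match table is available as pairs of object ids and that each observation has a time; the function name, the input layout and the use of union-find (instead of the linked list mentioned on the slide) are illustrative assumptions, not the SDSS implementation.

```python
from collections import defaultdict


def assign_match_heads(obs_time, matches):
    """Map every object id to the id of the earliest observation in its chain.

    obs_time: dict of object id -> observation time (hypothetical layout).
    matches:  iterable of (id_a, id_b) pairs, e.g. objects within 1 arcsec.
    """
    parent = {obj: obj for obj in obs_time}

    def find(obj):
        # Follow parent links to the group representative (with path halving).
        while parent[obj] != obj:
            parent[obj] = parent[parent[obj]]
            obj = parent[obj]
        return obj

    # Merge the groups of every matched pair.
    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for obj in obs_time:
        groups[find(obj)].append(obj)

    # The head of each chain is its first (earliest) observation.
    head_of = {}
    for members in groups.values():
        head = min(members, key=lambda o: (obs_time[o], o))
        for obj in members:
            head_of[obj] = head
    return head_of


# Toy data: three observations of one object plus one unmatched object.
obs_time = {101: 3.0, 102: 1.0, 103: 2.0, 200: 5.0}
matches = [(101, 102), (101, 103)]
heads = assign_match_heads(obs_time, matches)
```

Here objects 101, 102 and 103 all receive 102 as their head because it was observed first, while 200 is its own head.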
Spatial Algorithms • Updated HTM library • Automated depth for HTM_Cover • Output vertices • Simplify polygon • Boolean operations on regions • Part of VO data model (A. Rots) • Zones • Much better performance for bulk neighbors at a fixed radius • Footprint service in progress • Bool Contains(point) • Region Intersect(region)
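A minimal sketch of a zone-based neighbor search at a fixed radius follows. It assumes zones are declination stripes of fixed height, each kept sorted by right ascension so that a candidate range can be found by binary search. The zone height, function names and data layout are assumptions made for illustration, and the sketch ignores RA wraparound at 0/360 degrees and the regions near the poles.

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict

ZONE_HEIGHT_DEG = 30.0 / 3600.0  # assumed zone height: 30 arcsec


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def zone_of(dec, zone_height=ZONE_HEIGHT_DEG):
    return int(math.floor((dec + 90.0) / zone_height))


def build_zones(objects, zone_height=ZONE_HEIGHT_DEG):
    """Group (obj_id, ra, dec) rows by zone and sort each zone by RA."""
    zones = defaultdict(list)
    for obj_id, ra, dec in objects:
        zones[zone_of(dec, zone_height)].append((ra, dec, obj_id))
    for rows in zones.values():
        rows.sort()
    return zones


def neighbors(zones, ra, dec, radius_deg, zone_height=ZONE_HEIGHT_DEG):
    """Return sorted ids of objects within radius_deg of (ra, dec)."""
    lo_zone = zone_of(dec - radius_deg, zone_height)
    hi_zone = zone_of(dec + radius_deg, zone_height)
    # Approximate RA window, widened by 1/cos(dec); the exact test below
    # removes false positives.
    cos_dec = max(math.cos(math.radians(abs(dec) + radius_deg)), 1e-9)
    ra_half = radius_deg / cos_dec
    found = []
    for zone in range(lo_zone, hi_zone + 1):
        rows = zones.get(zone, [])
        start = bisect_left(rows, (ra - ra_half,))
        stop = bisect_right(rows, (ra + ra_half, float("inf")))
        for cand_ra, cand_dec, obj_id in rows[start:stop]:
            if angular_sep_deg(ra, dec, cand_ra, cand_dec) <= radius_deg:
                found.append(obj_id)
    return sorted(found)


# Toy catalog: ids 2 and 4 lie within 30 arcsec of id 1; id 3 does not.
objects = [(1, 10.0, 10.0), (2, 10.005, 10.0), (3, 10.02, 10.0), (4, 10.0, 10.006)]
zones = build_zones(objects)
near = neighbors(zones, 10.0, 10.0, 30.0 / 3600.0)
```

Because each zone is a contiguous, RA-sorted run, a bulk neighbor computation over the whole catalog touches the data mostly sequentially, which is the likely reason for the performance gain noted on the slide.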
Web Services in Progress • Registry • Harvesting and querying • Data Delivery • Query driven Queue management • Graphics and visualization • Query driven vs interactive • Show spatial objects (Chart/Navi/List) • Footprint/intersect • It is a “fractal” • Cross-matching • SkyQuery and SkyNode • Ferris-wheel • Distributed vs parallel
Registry: Easy Clients • Just use a SOAP toolkit (T. McGlynn & J. Lee have done a Perl client) • Easy in Java:
java org.apache.axis.wsdl.WSDL2Java "http://skyservice.pha.jhu.edu/devel/registry/registry.asmx?wsdl"
• Gives a set of classes for accessing the service • Gives classes for the XML which is returned (i.e. SimpleResource) • Still need to write a client, like:
RegistryLocator loc = new RegistryLocator();
RegistrySoap reg = loc.getRegistrySoap();
ArrayOfSimpleResource reses = null;
reses = reg.queryRegistry(args[0]);
http://skyservice.pha.jhu.edu/devel/registry/index.aspx
Generic Catalog Access • After 2 years of SDSS EDR and 6 months of DR1 usage, access patterns start to emerge • Lots of small users, requiring instant response • 1/f distribution of request sizes (tail of the lognormal) • How to make everybody happy? • No clear business model… • We need a separate interactive and batch server • We also need access to full SQL with extensions • Users want to access services via browsers • Other services will need SOAP access
Data Formats • Different data formats requested: • HTML, CSV, FITS binary, VOTABLE, XML, graphics • Quick browsing and exploration • Small requests, need to be nicely rendered • Needs good random access performance • Also simple 2D scatter plots or density plots required • Heavy duty statistical use • Aggregate functions on complex joins, lots of scans but small output, mostly want CSV • Successive Data Filter • Multi-step non-indexed filtering of the whole database, mostly want FITS binary
Data Delivery • Small requests (<100MB) • Putting data on the stream • Medium requests (<1GB) • Use DIME attachments to SOAP messages • Large requests (>1GB) • Save data in scratch area and use asynch delivery • Only practical for large/long queries • Iterative requests • Save data in temp tables in user space • Let user manipulate via web browser • Paradox: if we use web browser to submit, users want immediate response from batch-size queries
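The size tiers above amount to a simple dispatch rule. The sketch below is only an illustration: the function name and the returned labels are made up, and since the slide does not say which tier a request of exactly 1GB falls into, it is treated here as large.

```python
MB = 10**6
GB = 10**9


def choose_delivery(size_bytes):
    """Pick a delivery method from the estimated result size (hypothetical labels)."""
    if size_bytes < 100 * MB:
        return "stream"            # small: put the data on the response stream
    if size_bytes < 1 * GB:
        return "dime_attachment"   # medium: DIME attachment to the SOAP message
    return "scratch_async"         # large: save to scratch area, deliver asynchronously


methods = [choose_delivery(n) for n in (5 * MB, 500 * MB, 20 * GB)]
```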
How To Provide a UserDB • Goal: through several search/filter operations reduce data transfer to manageable sizes (1-100MB) • Today: people download tens of millions of rows, and then do their next filtering on client side, using F77 • Could be much better done in the database • But: users need to create/manage temporary tables • DOS attacks, fragmentation, who pays for it • Security, who can see my data (group access)? • Follow progress of long jobs • Who does the cleanup?
Query Management Service • Enable fast, anonymous access to small requests • Enable large queries, with ability to manage • Enable creation of temporary tables in user space • Create multiple ways to get query output • Needs to support multiple mirrors/load balancing • Do all this without logging in to Windows • Also need to support machine clients • Web Service: http://skyservice.pha.jhu.edu/devel/CasJobs/ • Two request categories: • Quick • Batch
Queue Management • Need to register batch ‘power users’ • Query output goes to ‘MyDB’ • Can be joined with source database • Results are materialized from MyDB upon request • Users can do: • Insert, Drop, Create, Select Into, Functions, Procedures • Publish their tables to a group area • Data delivery via the CASService (C# WS) • http://skyservice.pha.jhu.edu/devel/CasService/CasService.asmx
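The MyDB pattern (query output saved into a user database, then joined back against the source database) can be mimicked with SQLite as a stand-in. This is only an analogy under stated assumptions: the real service runs on MS SQL Server, the toy table and its rows are invented, and `CREATE TABLE ... AS SELECT` plays the role of T-SQL `SELECT INTO`.

```python
import sqlite3

# The main connection stands in for the source catalog database.
conn = sqlite3.connect(":memory:")
# A second, attached in-memory database stands in for the user's MyDB.
conn.execute("ATTACH DATABASE ':memory:' AS mydb")

conn.execute(
    "CREATE TABLE PhotoObj (objID INTEGER PRIMARY KEY, ra REAL, dec REAL, r REAL)"
)
conn.executemany(
    "INSERT INTO PhotoObj VALUES (?, ?, ?, ?)",
    [(1, 10.0, 0.1, 17.2), (2, 10.1, 0.2, 21.5), (3, 10.2, 0.3, 16.8)],
)

# Step 1: the query output goes to a table in MyDB instead of to the client.
conn.execute("CREATE TABLE mydb.bright AS SELECT objID FROM PhotoObj WHERE r < 18")

# Step 2: the saved table is joined with the source database for the next filter.
rows = conn.execute(
    "SELECT p.objID, p.ra, p.dec "
    "FROM mydb.bright AS b JOIN PhotoObj AS p ON p.objID = b.objID "
    "ORDER BY p.objID"
).fetchall()
```

Only the final, small result set in `rows` has to leave the server, which is the point of keeping intermediate tables in user space.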
Graphics Tools • Simple xy plots: http://skyservice.pha.jhu.edu/nli/wplot/ • Density plot: http://skyservice.pha.jhu.edu/devel/DensityMap/AllSkyView.aspx and http://skyservice.pha.jhu.edu/devel/DensityMap/PlotQuery.aspx • Chart/Navi/List: http://skyservice.pha.jhu.edu/dr1/imgcutout/getjpeg.asmx • Can be built into various applications
Archive Footprint • Footprint is a ‘fractal’ • Result depends on context • all sky, degree scale, pixel scale • Translate to web services • Footprint(): returns a single region that contains the archive • Intersection(region, tolerance): takes a region and returns its intersection with the archive footprint • Contains(point): returns yes/no (maybe fuzzy) depending on whether the point is inside the archive footprint
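The three footprint calls can be sketched with a deliberately simple region type. Everything here is an assumption for illustration: regions are plain RA/Dec boxes rather than the spherical polygons of the Region specification, RA wraparound is ignored, and `tolerance` is interpreted as the minimum side length (in degrees) of a piece worth returning, which the slide does not define.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned RA/Dec box in degrees (toy stand-in for a sky region)."""
    ra_min: float
    ra_max: float
    dec_min: float
    dec_max: float

    def contains(self, ra: float, dec: float) -> bool:
        return (self.ra_min <= ra <= self.ra_max
                and self.dec_min <= dec <= self.dec_max)

    def intersect(self, other: "Box") -> Optional["Box"]:
        ra_min = max(self.ra_min, other.ra_min)
        ra_max = min(self.ra_max, other.ra_max)
        dec_min = max(self.dec_min, other.dec_min)
        dec_max = min(self.dec_max, other.dec_max)
        if ra_min >= ra_max or dec_min >= dec_max:
            return None  # no overlap with positive area
        return Box(ra_min, ra_max, dec_min, dec_max)


class ArchiveFootprint:
    def __init__(self, pieces: List[Box]):
        self.pieces = list(pieces)

    def footprint(self) -> Box:
        """Single region that contains the whole archive (bounding box)."""
        return Box(min(p.ra_min for p in self.pieces),
                   max(p.ra_max for p in self.pieces),
                   min(p.dec_min for p in self.pieces),
                   max(p.dec_max for p in self.pieces))

    def intersection(self, region: Box, tolerance: float = 0.0) -> List[Box]:
        """Pieces of the archive inside region, dropping slivers below tolerance."""
        overlaps = (piece.intersect(region) for piece in self.pieces)
        return [o for o in overlaps
                if o is not None
                and min(o.ra_max - o.ra_min, o.dec_max - o.dec_min) >= tolerance]

    def contains(self, ra: float, dec: float) -> bool:
        return any(piece.contains(ra, dec) for piece in self.pieces)


# Toy archive made of two disjoint stripes.
archive = ArchiveFootprint([Box(0.0, 10.0, 0.0, 5.0), Box(20.0, 30.0, 0.0, 5.0)])
overlap = archive.intersection(Box(5.0, 25.0, 1.0, 2.0))
```

The toy archive also shows why the result depends on context: the point (15, 2) lies inside the region returned by `footprint()` but `contains(15.0, 2.0)` is false, because the bounding region is coarser than the actual coverage.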
Cross-Matching • SkyQuery – SkyNode • Currently lots of proprietary features • Data transmitted via .NET DataSet => VOTable • Query plan written in MS T-SQL => ADQL • Spatial operator restricted to a cone => VORegion • Made up metadata delivery => VORegistry • Data delivery in XML/HTML => VOTable • Catalogs in the near future • SDSS DR1, FIRST, 2MASS, INT • POSS-1, GSC-2, HST, ROSAT, 2dF • GALEX, IRAS, PSCZ
Spatial Cross-Match • For small areas HTM is close to optimal, but needs more speed • For all-sky surveys the zone algorithm is best • Current heuristic is a linear chain of all nodes • Easy to generalize to include precomputed neighbors • But for all-sky queries this means a very large number of random reads instead of sequential ones
SDSS Portal Ferris-Wheel • Sky split into buckets/zones • All archives scan in sync • Queries enter at bottom • Results come back after full circle • Only sequential access => buckets get into cache, then queries processed
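The Ferris-wheel schedule can be simulated in a few lines. The sketch assumes one bucket is read per tick in a fixed cyclic order, that a query may join at any tick, and that it finishes once it has seen every bucket exactly once; the function name, the tick model and the toy data are illustrative assumptions rather than the planned design.

```python
def ferris_wheel(buckets, arrivals):
    """Simulate a shared cyclic scan.

    buckets:  list of row lists, read in cyclic order, one per tick.
    arrivals: dict of tick -> list of (query_name, predicate) joining then.
    Returns a dict of query_name -> (finish_tick, sorted matching rows).
    """
    n = len(buckets)
    last_arrival = max(arrivals) if arrivals else -1
    active = {}    # name -> [predicate, buckets_seen, hits]
    finished = {}
    tick = 0
    while active or tick <= last_arrival:
        # Queries enter the wheel at its current position.
        for name, predicate in arrivals.get(tick, []):
            active[name] = [predicate, 0, []]
        # One sequential bucket read is shared by all active queries.
        rows = buckets[tick % n]
        for name in list(active):
            predicate, seen, hits = active[name]
            hits.extend(row for row in rows if predicate(row))
            active[name][1] = seen + 1
            if seen + 1 == n:  # full circle completed
                finished[name] = (tick, sorted(hits))
                del active[name]
        tick += 1
    return finished


buckets = [[1, 2], [3, 4], [5, 6]]
arrivals = {
    0: [("even", lambda row: row % 2 == 0)],
    1: [("big", lambda row: row >= 4)],
}
results = ferris_wheel(buckets, arrivals)
```

In this toy run the second query joins one tick late and finishes one tick after the first, yet both are served from the same sequence of bucket reads instead of two independent scans.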
Utilities • FITSLIB 1.10: C# library around the CFITSIO package, http://www.cs.jhu.edu/~haridas/tech/Fits/ • MIRAGE: Java wrapper around Mirage, can directly access the VORegistry and ConeSearch, http://skyservice.pha.jhu.edu/develop/vo/mirage/mirage.html • HTM2.0: Updated HTM library, conforming to the new Region specification, http://www.sdss.jhu.edu/htm/ • ADQL: Prototype service to convert back and forth between ADQL and SQL, http://skyservice.pha.jhu.edu/vivek/msdev/AstroDql/ws/ and http://skyservice.pha.jhu.edu/vivek/msdev/AstroDql/ws/Archive.asmx • SDSSQA: Java application, emulating MS Query Analyzer
Summary • Web Services have been remarkably easy to use • Now different platforms are interoperable • We have invested a lot of energy to develop various interface libraries (FITS, VOTable) • Integrating graphics into web services was very easy • Next: • Parallel queries • Finish query queue management • Upgrade SkyQuery • Bring in more archives • Ferris-Wheel experiment • On-demand database creation • 100TB parallel data access layer