260 likes | 384 Views
A System for Fast Low-Cost Access to Very Large Catalogs. Michael Riggs, Vashundara Puttagunta, Maitreyee Pasad, Konstantinos Kalpakis, and Jeanne Behnke CSEE, UMBC and NASA GSFC. URL: http://www.csee.umbc.edu/~kalpakis/monet. Introduction. Motivation
E N D
A System for Fast Low-Cost Access to Very Large Catalogs Michael Riggs, Vashundara Puttagunta, Maitreyee Pasad, Konstantinos Kalpakis, and Jeanne Behnke CSEE, UMBC and NASA GSFC URL: http://www.csee.umbc.edu/~kalpakis/monet
Introduction • Motivation • Many catalogs in astronomy, few old and most of the new ones, contain data for very large number of spatial objects • Fast methods to access these catalogs are needed • Budget constraints limit the amount of $$ allocated to S/W and H/W to support fast access to the catalogs • Thus, low-cost solutions that achieve efficient and effective access to very large catalogs are needed • We demonstrate • how to achieve fast access to catalog data on a low cost desktop for a very large catalog using an Object-Relational Database System (ORDBMS) • how to enhance the functionality of the system by extending the ORDBMS with IDL
Test Catalog Used • We use the USNO-A2.0 catalog (called the Monet catalog here) which has approximately 500 million stars (points) Image of the Monet catalog found at: http:/www.nofs.navy.mil/projects/pmm/universe.html
Schemas Used • We experiment with a sample from the Monet catalog and the following four approaches (schemas/datablades/indexes) • Relational only • Traditional relational datatypes with B-tree indexes (clustered) on dec-RA and RA-dec (B schemas) • Object-Relational • Datatypes from the Geodetic Datablade (created for the EOSDIS project for geo-location) with an R-tree index on geo-location (G schemas) • Datatypes from Shapes2 Informix datablade with R-tree index on points (S schemas) • Datatypes from our custom-build SimpleShape datablade with R-tree index on points built bottom up using Hilbert coding of the point’s coordinates (I schemas)
Types of Queries Tested • Spatial Window(Online catalog access) Find all stars within a user-defined bounding rectangle • Spatial Or-Window (Online catalog access) Find all the stars within at least one of two different bounding rectangles • Spatial Self-Join (Catalog mining) Find all stars that are within a specified box centered on each star • Spatial Multi-Join (Catalog Correlations) • Spatial chain-join Find a spatial region in a “chain-joined” sequence of tables (catalogs) • Spatial Star-join Find a spatial region in a “star-joined” sequence of tables (catalogs)
Our Custom Datablade • The custom-made datablade SimpleShape extends the Informix Dynamic Server 2000 by providing • a fixed length opaque type (UDT) storing 4 small floats (16 bytes total) • full R-tree index support including new bottom-up index building • full B-tree index support based on 2 dimensional Hilbert Curves • the SimplePoint UDT holds the coordinates of a 2 dimensional point • the SimpleBox UDT holds the lower left and upper right SimplePoints of a rectangle to make a box • input/output functions and constructors
Format of the USNO-A2.0 • Each star in the catalog comes packaged as three integers. • (int RA, int DEC, int MAG) • RA = (RA in decimal hours) * 15 * 3600 * 100 • DEC = ((DEC in decimal degrees) + 90) * 3600 * 100 • MAG = combination of field, blue, red values plus flags • The catalog data are stored in a database table with schema (coordinates SimplePoint, magnitude integer) • the actual data in the table are stored in their original (compressed) format as above • user-defined functions (UDFs) change coords and magnitude into meaningful fields
Specializing the SimpleShape to the USNO-A2.0 Catalog • the RA and DEC fields are stored in a SimplePoint point (dec, ra) of type SimplePoint • the data format is not changed from packaged monet data • data conversion takes place in the DBMS when displaying results • the MAG field is stored as a single integer field in DBMS directly • additional utility UDFs are created • Ra() takes a SimplePoint and returns RA in decimal hours • Dec() takes a SimplePoint and returns DEC in decimal degrees • MonetBox() contructs a SimpleBox from RA and DEC in decimal hours and decimal degrees respectively • Blue() takes MAG and returns the blue magnitude • Red() takes MAG and returns the red magnitude • Field() takes MAG and returns the field of the star
Custom Data Loader • Custom High Performance Loader • Built from the Virtual Table Interface (VTI) of Informix Server • Maps binary OS file into database table • Only select queries with NO where clause can be issued on table • Reasons for development • Lack of Informix High Performance Loader for PC platform • Bug in regular table load command for PC platform • USNO-A2.0 is distributed in binary format • absense of string conversions make loading fast
Indexing for the Monet Data • R-tree index on coordinates • built using new bottom-up algorithm and the 2nd order Hilbert values of the points • allows fast window queries on entire catalog • faster build times than B-tree index • takes about 1 day to index the entire catalog • Second order Hilbert values • space filling curve converts two dimensions into one dimension • points are sorted based on their Hilbert value and stored adjacently in R-tree leaf pages • Hilbert value code and R-tree bottom-up index building algorithms are built into SimpleShape software
Sample SQL Statements for the SimpleShape Datablade • Sample SQL statements using the SimpleShape datablade • create table test (product_id int, location SimplePoint) • insert into test values (1, SimplePoint(2.333,5.77)) • insert into test values (2, ‘point(4.123, 8.234)’ ) • create table products (product_id int, location_x int, location_y int) • insert into test select product_id, SimplePoint(location_x, location_y) from products • select * from test where within (location, ‘box(0, 0, 100, 100)’ ) • Sample SQL statements for the USNO-A2.0 specialization • create table monet_data(coords SimplePoint, mag int) • select Ra(coords), Dec(coords), Red(mag), Blue(mag), Field(mag) from monet_data where within (coords, monetbox (-3.22, 22.45, 1.23, 22.78) ) • select (Blue(mag) - Red(mag)) as energy, count(*) as count from monet_data where within (coords, monetbox (-3.22, 22.45, 1.23, 22.78) ) group by energy order by count desc
More Sample SQL Statements • Custom High Performance Loader • create table load_data(value1 int, value2 int) using vti_load(file=‘/tmp/data.bin’); • select * from load_data • insert into real_table select * from load_data • Loading Monet Data • create table monet_all (coords SimplePoint, mag int) • create table monet_load(ra int, dec int, mag int) using vti_load(file=‘/tmp/zone0000.cat’) • insert into monet_all select SimplePoint(dec,ra), mag from monet_load • Indexing Monet Data • create index monet_all_idx on monet_all (coords SIMPLE_OPS) using rtree • Selecting Monet Data • select * from monet_all where within (coords, MonetBox(dec,ra,dec,ra))
System Configuration • The experiments with our custom datablade were done on a system with the following configuration • Hardware (costing less than $2K) • AMD Duron 650MHz Processor • FIC model AZ11 motherboard w/IDE ATA-66 controller • 256MB of PC100 RAM • Two IBM Ultra/100 7200RPM IDE disks model DTLA307045 • Software • Linux Mandrake version 7.0 • Updated Linux Kernel version 2.4.0-test4 • Informix Dynamic Server 2000 version 9.20.UC1 • Custom datablade and data loading software
System Configuration • The experiments with other datablades and the traditional relational solution were done on a system with the following configuration • Hardware • Sun UltraSparc 60 with 512 MB memory • two 8GB and two 4GB SCSI disks • Software • Solaris 2.6 • Informix Dynamic Server 9.14 with the Universal DataServer Option • Geodetic and Shapes2 datablades
Experimental Results Elapsed time to load data from the Monet catalog Elapsed time in creating R-tree indexes for the three datablades
Experimental Results • Loading and Indexing all the 500M stars The numbers for the Relational, Geodetic, and Shapes2 schemas are estimates based on regression from measurements for up to 140K stars. • Beware: Results are not for the faint of heart!
Experimental Results Loading times for the Monet catalog using the SimpleShape datablade
Experimental Results Bulding an R-tree index for the Monet catalog using the SimpleShape datablade Page Reads Elapsed time
Experimental Results • Number of page reads for various queries and schemas for 60K stars • The query window(s) used are 0.0666x0.0666 decimal degrees
Experimental Results • Performance of the custom-datablade (schema) for queries against a table with 60K stars
Efficient storage and retrieval • Indexing and clustering of data • Concurrency control • Security Informix DBMS IDL user functions • Data-analysis routines • Visualization capabilities RPC Calls • Data-analysis routines • Visualization capabilities IDL server IDL Extension for the Informix Server • Powerful data-mining applications can be built by providing an interface between an object-relational DBMS (Informix) and a statistical/scientific-computing package (IDL). • The IDL extension to Informix allows users to tap the data-analysis and visualization capabilities of IDL through Informix. • Implementation is based on UDFs and the RPC mechanism.
Data type Last IDL Data type Cast returned statement returned function from IDL in script by function . Integer $RETURN pointer to int toInt . Double $RETURN pointer to float toFloat . Binary arr $BLOB LO_Handle toBlob . structure $RETURN pointer to ROW toRow of lvarchars 2D Array $RETURN ptr to collection toList of rows The return variable is a double, i.e. float in Informix. So we use toFloat as casting function. Columns of the table that are used as arguments by idl_exec.$1 x ; $2 y $RETURN used to return a double Select toFloat ( idl_exec (ROW( x , y ), "r = double( sqrt ( ($1)^2 + ($2)^2 ) ) ; $RETURN r ” ) ) from Points; IDL script • The IDL interface provides two main functions • idl_execis a simple user defined function that executes the idl-script for each row of the table • idl_aggis a user defined aggregate function. • The functions are used along with casting functions that cast the pointer returned, to appropriate data type. Computing distance from the origin using idl_exec
Clustering using idl_agg:We use the CLUST_WTS function in IDL that computes the weights (the cluster centers) of an mxn array and returns an array of cluster centers. idl_agg, in this case, returns a collection of rows that can be made to look like a virtual table. The return variable is a 2D-array. toList is casting function for the pointer to collection of rows that idl_exec returns The initialization IDL script ; executed once for each group Iteration script; executed once for every row in the group; here it assigns values to the matrix select toList ( idl_agg( ROW(ra, dec), ROW( "count=-1", "count=count+1 &if count eq 0 then y = [$1, $2] else y = [[y], [$1,$2]]", "w = CLUST_WTS(y,N_CLUSTERS=3); $RETURN w ” ) )) :: list(ROW(a float, b float) not null) from monet_temp ; Final script; executed once at the end of each group; here it clusters data using some IDL routines $RETURN used to return a collection of rows of two doubles each. Casts the collection into required list.
Since we have $BLOB, idl_agg returns an LO_Handle to the blob. Hence toBlob is used for casting. Iteration script select toBlob ( idl_agg (ROW( ra, dec, blue, red ) , ROW( 'flag=0&i=long(-1)&temp=dblarr(4,1000)', 'i=i+1&if i ge 1000 then begin if flag eq 0 then begin a=temp & flag=1&end else a=[[a],[temp]] & i=0 & end & temp(*,i) = [$1,$2,$3,$4]', 'if i gt 0 then begin if flag eq 0 then a=temp(*,0:i) else a=[[a], [temp(*,0:i)]] & end & b = MONET_JPEG(a(1,*),a(0,*),a(2,*),a(3,*)); $BLOB b’ ) ) ) from monet_all; Initialization script Final script $BLOB used to return a blob that contains the jpeg image Getting a plot using idl_agg: We use a IDL UDF, MONET_JPEG that creates a plot with 1st two arguments (ra and dec) as the axis and the next two (red & blue) determining the color of the point. MONET_JPEG returns the jpeg of the plot in binary. The binary array is returned as a blob by idl_agg.