100 likes | 184 Views
Mark Holliman Wide Field Astronomy Unit Institute for Astronomy University of Edinburgh. Database Deployment scenarios and performance on SSD arrays. Summary.
E N D
Mark Holliman Wide Field Astronomy UnitInstitute for AstronomyUniversity of Edinburgh Database Deployment scenarios and performance on SSD arrays
Summary • Some Astronomical survey databases are becoming so large that some curation tasks are approaching unreasonable timeframes. This is evident in such surveys as the Galactic Plane Survey (GPS) for UKIDSS, and the VISTA Variables in the Via Lactea (VVV). The VVV alone contains >10 of TB of data and includes a detection table with >1010 rows. While RDBMSs are capable of handling this much data, the execution times for curation activities can stretch into weeks at a time. To address this issue, there are two main approaches: • Switch from a RDBMS to a Key-Pair based model or Column oriented DB (i.e. Hadoop, MonetDB, etc) • Throw Hardware at it (i.e. SSDs, SAN, LUSTRE, GPFS, etc) • As you can probably guess, this talk is about #2
The Big Database Problem • In an RDBMS the main bottleneck on almost all operations is disk I/O. While parallel processing and increased RAM can address some performance issues, ultimate performance figures are dictated by how fast data can be moved to/from the storage medium. • For large databases (>1TB) the most cost efficient storage medium are RAID5/6/10 arrays of spinning disks (HDD). These disks range in size from 1TB to 4TB at present. • These arrays provide redundancy in case of disk failure, while at the same time speed up I/O by spreading disk operations across multiple devices simultaneously • The rotational speed on the hard drives is the main factor in determining an HDD’s performance. The faster the disk spins, the faster read/write operations can occur. Most enterprise disks run at 7200RPM, though 10000RPM and 15000RPM disks are available (at a serious jump in price) • SSDs are just beginning to approach the size/price ratio necessary for large DBs.
Test Server Details • Intel Xeon 2.8GHz, 24 cores • 16GB RAM • Disk Subsystem (6Gbps SAS RAID) • HDD1,2: 7 x 1TB HDD RAID5 arrays • SSD1: 1 x 512GB Crucial SSD • SSD2: 6 x 512GB Crucial SSD RAID5 Array • OS: Windows Server 2008 R2 • DBMS: SQL Server 2008 R2
Full DB Tests • The first tests involved placing an entire database on each particular disk subsystem and running a set of queries to measure performance. The queries were constructed to represent 3 specific use cases in order to identify exactly where the SSDs provide performance gains. • Database: 2MASS • Query 1, All Indexed Columns: Select * From twomass_psc Where ((ra> 240) AND (ra<242)) OR ((ra> 120) AND (ra<122)) AND ((dec> -47) AND (dec<-44)) OR ((dec> 120) AND (dec<123)) • Query 2, All Nonindexed Columns: Select * From twomass_psc Where (j_m>16) AND (h_m>15.6 AND h_m<15.8) AND (k_m>15.1 AND k_m<15.3) • Query 3, Mixed Index Columns: Select * From twomass_psc Where (j_m>16) AND (h_m>15.6 AND h_m<15.8) AND (k_m>15.1 AND k_m<15.3) AND ((ra> 230) AND (ra<241)) OR ((ra> 110) AND (ra<121)) AND ((dec> -47) AND (dec<-44)) OR ((dec> 120) AND (dec<123)) AND ((k_psfchi>1) OR (k_psfchi<0.4)) AND (dist_opt<.3) AND (k_msig_stdap<1)
SSD Cost Issue • While putting an entire Database on SSDs is certainly preferable, it is unfortunately cost prohibitive for very large archive databases (where 10’s of TB are required). • At recent pricing: 1TB on SSD = ~£500 vs 1TB on HDD = ~£87 • So with this in mind we ran a second battery of tests to assess whether sufficient performance gains can be achieved with a hybrid system, whereby some database files are placed on SSDs while the rest remain on standard HDD arrays
SSD File Location Tests • Database: UKIDSS DR5 • 4 different representative queries were run (email me if you want to see the actual SQL) • #1: produce light curves • #2: produce variable light curves • #3: JOIN Source and Detection Tables (single index) • #4: JOIN Source and Detection Tables (multi indices) • 4 different file/disk configurations • All files on HDD • Indices on SSD (RAID) • Source Tables on SSD (RAID) • Detection Tables on SSD (RAID)
Conclusions • The biggest performance gains from SSDs are seen on queries involving non-indexed columns • RAID arrays of SSDs do provide performance gains over individual disks, though these gains are not universal or consistently scaled • Placing Source or Detection tables on SSDs both result in large performance gains • Performance gains are heavily dependant on the query being run, and how much of the query uses the tables on the SSDs • Due to SSD size constraints, Source tables are more likely to fit as they tend to be smaller • Curation activities could be optimized by moving database files from HDD to SSD for intensive operations, then moved back to HDD once complete. This only works if the file movement speeds are high enough to overcome the extra time necessary for running operations on the HDD stored files • Index file operations are actually slower on SSD than on HDD • This is a serious surprise • Further testing is necessary to determine if this is due to the inherent sequential access method of HDDs, or if this is due to some quirk of the MS SQL index file format