Database File Systems in Support of eScience
Philip A. Adams – LLNL/National Ignition Facility
John C. Hax – Oracle Corporation
Science – A product of data analysis “Science does not result from the launch of a mission or the collection of data. Rather, science only occurs through the analysis and understanding of that data.” - Philosophy of the NASA Science Mission Directorate (SMD)
Questions to Ask • Are we building IT Systems that support Research and Analysis or Infrastructure that supports the collection of data?
Scientific Computing History: Scientific Systems vs. Commercial Relational Databases

Scientific Systems
• Scientific (minimal data shared)
• Raw Data
• Decentralized/Desktop Management
• Open source software
• Low quality of support/service
• Best Effort
• Mission critical operations
• Primarily file based – HDF5, Lustre
• Millions of Files
• Write once, read many
• Background processing
• Pipelines
• Computationally intensive applications
• Long running transactions
• Output of Large Data Sets
• Single application profile

Commercial Relational Databases
• Enterprise (all data shared)
• Metadata
• Centralized management
• Industrial strength software
• High quality of support/service
• SLA guarantees
• Mission critical operations
• Databases & files
• Read and Update
• Enforced data integrity
• Interactive processing
• Interactive workflows
• Transaction-intensive applications
• Short running transactions (<8 hours)
• Output of Individual Rows
• Mixed application profile
Filesystems and Legacy Databases – The Gap: Filesystem Benefits vs. Database Benefits

Filesystem Benefits
• Provide maximum scalability to meet data volume and ingestion requirements
  • HDF5
  • GFS (Google Filesystem)
  • Lustre
• Ubiquity of accessing filesystems
  • Number of protocols: NFS, SMB, CIFS and FTP
• Able to access the data right from the OS
  • Windows, Mac, Linux, Solaris, HP-UX
• Application programming interfaces support native access
  • file open (fopen), file close (fclose)
  • importing the java.io package
  • ifstream/ofstream C++ file I/O classes

Database Benefits
• Superior query/search capability over filesystems
  • SQL standard
• Easy manipulation of data
  • Functions
  • PL/SQL
  • Java, C, PHP, Perl
• Low latency, interactive data access suited for application access
• Provides a structured way of storing data and ensuring data integrity
  • Tables/Constraints
• Superior backup and recovery capabilities
  • RMAN, Redo/Archive logging
  • Block and Point-in-Time Recovery
  • Block Level Corruption Detection
• Institutional Resources
Data Challenges
• Physical Limitations
  • I/O Intensive - limitations on max IOPS
  • Network speeds - time to ship data to compute nodes
• Multiple Data Silos
  • Governance issues
  • Pedigree of the data
  • Multiple access policies to get to the data
  • Duplicate data stored in each silo
  • Need to scale disparate systems as data grows
  • Increased effort required for Scientists, Developers, Administrators
  • Correlating the data across data silos
  • Coordinated backup and recovery plan
• Multiple Data Aggregation Efforts
The Result: The Split Architecture – a step in the wrong direction
• Drawbacks of the split architecture include, but are not limited to:
  • Data curation
  • Security
  • Availability
  • Recoverability
  • Manageability
• Because no common database and filesystem access protocol was available, the burden shifted to the application developers and scientific researchers to make sense of the two silos of information.
How much of an issue is this?
• Level 0 (Raw) data is typically enriched with data from other sources.
• What happens when/if a diagnostic is found to have incorrect calibration data?
• Without strict relationships, this could be a nightmare. It may be easier to rerun analysis to reproduce the Level 1, 2 and 3 data. However, an unknown quantity of Level 4 content has been generated from this data and is stored on many researchers’ workstations and file shares.
Lack of pedigree in data analysis can result in instrument/machine damage, increased financial costs, or embarrassment to scientific researchers who rely on the data.
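The calibration question above is, in effect, a graph traversal: if every data product records what it was derived from, the set of affected products can be computed rather than guessed. A minimal sketch, assuming such pedigree records exist; the product names and the `derived_from` mapping are hypothetical and not taken from any NIF system:

```python
from collections import deque

# Hypothetical pedigree records: product -> the inputs it was derived from.
derived_from = {
    "level1_shot42": ["level0_shot42", "calibration_v3"],
    "level2_shot42": ["level1_shot42"],
    "level3_summary": ["level2_shot42"],
    "level4_paper_fig": ["level3_summary"],
    "level1_shot43": ["level0_shot43", "calibration_v4"],
}


def affected_products(bad_input, derived_from):
    """Return every product that depends, directly or indirectly, on bad_input."""
    # Invert the mapping so we can walk from an input to the products built on it.
    children = {}
    for product, inputs in derived_from.items():
        for source in inputs:
            children.setdefault(source, []).append(product)

    # Breadth-first walk over everything downstream of the bad input.
    affected = set()
    queue = deque([bad_input])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


print(sorted(affected_products("calibration_v3", derived_from)))
```

With the relationships stored in one place, the Level 4 products on individual workstations show up in the result only if their pedigree was recorded, which is the point the slide is making.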
Future of Scientific Computing and Analysis
Data Intensive + Collaborative = Data Intensive Collaborative Science
Data Intensive Collaborative Science
Drivers of collaboration: Cost, Complexity, Knowledge Base, Interdependence
Enablers of collaboration: Network Capacity, Standards, The Web, Clustering/Grid Technologies, Moore's Law
What’s driving the data volumes?
• Better and more diverse instrumentation
• Flexible optics
• Coordinated multi-instrument observatories
• Increased Precision
• Genomics
• Diverse types of data generated: SQL/Scalar, XML, Image, Monte Carlo simulations, Audio/Video, telemetry, and spectrometers
Database Filesystems
• Bridge the Gap between Filesystems and Relational Database Systems
• Maintain Filesystem Performance
• Leverage multiple access methods
• Single Security Mechanism
• Unified Administrative Tools
• Data Pedigree
• Unified Architecture and Skill sets
• Leverage Institutional Resources for IT
• Enabling Collaboration around Data
• Optimized for Data Access
Pedigree with a database filesystem
Modern databases have much to offer in the realm of data analysis • RDF/OWL can allow semantic searching of data • Predictive Analytics • Spatial Data Analysis • Text Mining of Unstructured Content
Some of the native data mining techniques and algorithms available
Techniques: Classification, Regression, Attribute Importance, Anomaly Detection, Clustering, Association, Feature Extraction
Algorithms: Logistic Regression, Naive Bayes, Support Vector Machine, Decision Tree, Multiple Regression, Minimum Description Length, One-Class Support Vector Machine, Enhanced K-Means, Orthogonal Partitioning Clustering, Apriori, Non-negative Matrix Factorization
Key Components of SecureFiles Architecture
• Delta Update
• Write Gather Cache
• Transformation Management
• Inode Management
• Space Management
• I/O Management
Finally the database can accept both structured and non-structured data in an efficient manner.
UCRL-PRES-236394
National Ignition Facility and 11g SecureFiles
NLIT 2009
Philip A. Adams, Sr. Systems Architect
National Ignition Facility, Lawrence Livermore National Laboratory
June 1-3 2009
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344
Overview of the National Ignition Facility
• The National Ignition Facility (NIF) is known as the world’s largest and most energetic laser
• When fully operational, its 192 beams will converge 1.8 MJ of laser energy onto a single target to achieve thermonuclear ignition
• NIF will enable experiments that produce temperatures and densities like those in the Sun or in an exploding nuclear weapon
NIF-1107-14129.ppt Oracle, 11/12/07
Overview of the National Ignition Facility
The 192 laser beams of NIF will generate:
• A peak power of 500 trillion watts, 1000 times the electric generating power of the United States
• A pulse energy of 1.8 million joules of ultraviolet light
• A pulse length of three to twenty billionths of a second
The Optics make NIF work
Optical components:
• 7500 large optics including 3072 laser glass slabs as well as large lenses, mirrors, and crystals
• More than 15,000 small optical components
Precision optics:
• Total area of 33,000 square feet (3/4 of an acre)
• More than 40 times the total precision optical surface in the world’s largest telescope (Keck Observatory, Hawaii)
Example of Optic Damage (figure; image labels: 3 ns, 2 µm)
On high quality optical surfaces initiated damage sites are very small
Performance Gains found in NIF with 11g SecureFiles
Test Environment
Database Server
• HP Blade Server w/ 4-way AMD Opteron CPUs
• RHEL 4 – 32-bit kernel
• 11g Oracle Database – 32-bit version
• Single Instance ASM
• Dual Port Fibre Channel Mezzanine Card (2 Gbit)
Application Server
• Dell PowerEdge 2650 w/ 2-way Intel Xeon CPUs
• RHEL 4 – 32-bit kernel
• 10g Oracle Application Server
• 10g Oracle CMSDK (Content Management Software Development Toolkit)
Performance Gains found in NIF with 11g SecureFiles
Test Environment
SAN Storage
• 3PAR S400
Production Environment
• 11g RAC Environment
• 10g CMSDK Clustered Application Server Environment
Measure the throughput of the environment
Perform dd tests to the disks to establish the theoretical max:
  WRITE > dd if=/dev/zero of=/dev/raw/raw6 count=10000 bs=1M
  READ > dd if=/dev/raw/raw6 of=/dev/null count=10000 bs=1M
  MONITOR > iostat -xdk 3 100
We saw 180 MB/sec Read/Write throughput to the disks.
Warning: Do not perform dd write tests on your ASM-configured storage; writing to those devices will overwrite and damage it.
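Where writing to a raw device is not an option, a rough ceiling can be estimated against an ordinary scratch file instead. The following is a minimal sketch of that idea, not the procedure used at NIF; the function name `sequential_throughput` and its default sizes are assumptions, and the read figure will usually be inflated by the operating system's page cache, so treat it as an upper bound.

```python
import os
import tempfile
import time


def sequential_throughput(path, total_mb=64, block_mb=1):
    """Write, then read back, zero-filled blocks; return (write, read) in MB/s."""
    block = bytes(block_mb * 1024 * 1024)
    blocks = total_mb // block_mb
    written_mb = blocks * block_mb

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push data to the device so it is included in the timing
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(len(block)):
            pass
    read_s = time.perf_counter() - start

    return written_mb / write_s, written_mb / read_s


if __name__ == "__main__":
    # The scratch file lives in a temporary directory and is removed afterwards.
    with tempfile.TemporaryDirectory() as scratch:
        w, r = sequential_throughput(os.path.join(scratch, "throughput.bin"))
        print(f"write {w:.0f} MB/s, read {r:.0f} MB/s")
```

Point the path at a filesystem on the storage under test, and use a total size well above the machine's RAM if the read number is meant to reflect the disks rather than the cache.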
Create a few test tables
Create a test table for BasicFiles and a test table for SecureFiles:

BasicFile Example:
  CREATE TABLE "FOO_BASICFILE_TABLE" (
    "PKEY" NUMBER(4) NOT NULL,
    "DOCUMENT" BLOB
  )
  TABLESPACE "LOB_DEMO"
  LOB ("DOCUMENT") STORE AS BASICFILE (TABLESPACE "LOB_DEMO");

SecureFiles Example:
  CREATE TABLE "FOO_SECUREFILE_TABLE" (
    "PKEY" NUMBER(4) NOT NULL,
    "DOCUMENT" BLOB
  )
  TABLESPACE "LOB_DEMO"
  LOB ("DOCUMENT") STORE AS SECUREFILE (TABLESPACE "LOB_DEMO");
Throughput Results of Table Tests • Speed tests from database server (Oracle 11.1.0 DB, using Oracle jdk 1.5.0_11 in $OH/jdk, using ojdbc5.jar) • Inserting twenty 32MB image files per test
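Each test moves 20 × 32 MB = 640 MB, so a measured elapsed time converts directly into MB/s and can be compared with the 180 MB/sec disk ceiling from the dd tests. A small sketch of that arithmetic; the function names are hypothetical and the 10-second elapsed time is an invented example, not a measured result:

```python
def load_throughput(files, file_mb, elapsed_s):
    """MB/s achieved when `files` files of `file_mb` MB each load in `elapsed_s` seconds."""
    return files * file_mb / elapsed_s


def best_case_seconds(files, file_mb, ceiling_mb_s):
    """Shortest possible load time if the given ceiling were fully used."""
    return files * file_mb / ceiling_mb_s


# Hypothetical run: the 640 MB test load finishing in 10 seconds.
print(load_throughput(20, 32, 10.0))               # 64.0
# Lower bound on load time at the 180 MB/sec disk ceiling.
print(round(best_case_seconds(20, 32, 180.0), 2))  # 3.56
```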
SecureFile vs. BasicFile Server Results
Measure the throughput of the network
Used a tool called iperf, available at: http://sourceforge.net/projects/iperf/
On Server run: ./iperf -s -f M
On Client run: ./iperf -f M -c blackstone
  ------------------------------------------------------------
  Client connecting to blackstone.llnl.gov, TCP port 5001
  TCP window size: 0.06 MByte (default)
  ------------------------------------------------------------
  [ 5] local XXX.XXX.XXX.XXX port 58590 connected with XXX.XXX.XXX.XXX port 5001
  [ ID] Interval Transfer Bandwidth
  [ 5] 0.0-10.0 sec 1120 MBytes 112 MBytes/sec
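The iperf figure is most useful next to the disk figure, since the lower of the two bounds any client-server load. A minimal sketch that pulls the bandwidth out of the summary line shown above and compares it with the 180 MB/sec measured with dd; the helper name `parse_iperf_bandwidth` is hypothetical, and the regular expression assumes the `-f M` output format:

```python
import re


def parse_iperf_bandwidth(output):
    """Return the bandwidth in MBytes/sec from the last iperf summary line, or None."""
    matches = re.findall(r"([\d.]+)\s+MBytes/sec", output)
    return float(matches[-1]) if matches else None


sample = "[  5]  0.0-10.0 sec  1120 MBytes  112 MBytes/sec"
network_mb_s = parse_iperf_bandwidth(sample)
disk_mb_s = 180.0  # read/write throughput reported by the dd tests

# The slower component is the ceiling for loads that cross the network.
limits = {"network": network_mb_s, "disk": disk_mb_s}
bottleneck = min(limits, key=limits.get)
print(bottleneck, limits[bottleneck])  # network 112.0
```

On these numbers the network, not the storage, caps client-server throughput, so client-side results should be judged against 112 MBytes/sec rather than 180.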
Throughput Results of Client-Server Tests • Speed tests from the client (Oracle 10.1.2 Client, using jdk 1.5.0_11 and ojdbc14.jar) • Inserting twenty 32MB image files per test
SecureFile Performance Benefits
• During our testing, we’ve seen a 2-20 times increase in performance using SecureFiles over traditional BasicFiles
• SecureFiles performed as well as or better than writing the same file to our NFS-mounted NetApp
Database Tuning to optimize for SecureFiles
• Create a separate tablespace for your LOB data
• Use Uniform Extents – 1M seems best overall
  • Tried 32M/64M extents with no performance increase; your mileage may vary
• Enable Automatic Segment Space Management on the tablespace
• Create large enough redo log files
  • We used 200M – 1024M to reduce log file switches during heavy loads
Database Tuning to optimize for SecureFiles
Take AWR snapshots before and after a SecureFile load and note the wait conditions:
  SQL> EXECUTE dbms_workload_repository.create_snapshot();
  PL/SQL procedure successfully completed.
Run the AWR report:
  $ORACLE_HOME/rdbms/admin/awrrpt.sql
Conclusion • The ultimate goal of science is to create new knowledge and new discoveries. • Database Filesystems have a number of features which can benefit the scientific community and ease the burden of pedigree, data management, and analysis. • Using a database filesystem will enable data intensive collaborative science. • As new discoveries are made and data volumes increase, it is imperative to have a robust database system that is not only capable of managing the pedigree of that data, but can also serve as a knowledge repository for the future.
For More Information: search for "SecureFiles" at http://search.oracle.com or visit http://www.oracle.com/