210 likes | 230 Views
Explore unstructured data storage in Sybase IQ, benefits, limitations, loading, optimizing, and new functions. Learn how IQ compresses and efficiently stores large objects.
E N D
5a Large Object Datatypes
Agenda • In this session, we will be discussing: • What Is Unstructured Data? • Why Should I Place Unstructured Data In Sybase IQ? • Understanding How Sybase IQ Stores The Data • Sybase IQ Limitations for Unstructured Data • Creating, Loading, and Extracting Unstructured Data • New Functions • Sybase IQ 12.6 Functionality • Tuning Sybase IQ for Unstructured Data • Summary • Question And Answer
What Is Unstructured Data? • Freeform binary or textual data that has meaning to a user or an application, but not to the database • Why is it unstructured? • Format is not known to the RDBMS • Cannot be indexed due to the unstructured nature • Large object types • Binary – Generally referred to as a BLOB (Binary Large Object) • Text – Generally referred to as a CLOB (Character Large Object) • Often referred to as a LOB or LOBs
What Is Unstructured Data? • Can be any type of data you desire • Images • Video • Sound • Scanned or electronic documents • Applications • Company specific data format • It has to make sense to you, not the database!
Why Should I Place Unstructured Data In Sybase IQ? • Single location for application data • The application can be written to access all structured and unstructured data using a single interface • The DBAs have control over all data for the application, not just the structured data • Filesystems are generally less efficient at handling large sequential files, causing more disk I/O • In order to achieve performance from filesystems, additional software and training is necessary • Requires tight coordination with system administrators • Application can be taken down for non-database problems • Filesystems going offline • Storage filling up
Why Should I Place Unstructured Data In Sybase IQ? • Generally requires less storage • Filesystems cannot naturally compress the datafiles • IQ will compress the LOB data just as it compresses other data • Each IQ data page is run through a compression algorithm to compress the contents of the page into one or more disk blocks • Filesystems can have as much as 10% overhead • Inode table • Reserved space for the “root” account • Reading from filesystem must be done sequentially • The application can be designed to return portions of the data in parallel • Sybase IQ loads are done in parallel
Understanding How Sybase IQ Stores The Data • Each object is stored in one or more data pages • If an object needs a portion of a page, the entire page is marked as in use for that LOB • Each page is then put through the IQ compression algorithms to squeeze as much storage savings as possible • No other indexing or changes take place on the LOB data
Sybase IQ Limitations for Unstructured Data • There is no database limit on the number of LOB columns that can exist • Limited only by the total number of tables and columns in a database • A table can have an unlimited number of LOB columns • Limited only by the total number of columns allowed on a table • Each LOB columns is virtually unlimited in size • Maximum LOB size is 4GB * page size • Absolute maximum is 4GB * 512KB = 2 PetaBytes • Most RDBMS engines are limited to 2-4 GB per entry
Creating, Loading, and Extracting Unstructured Data • Added new datatype “long binary” to support LOB data • Create a table with LOB support create table blob_data( file_id int primary key ,filename char(64) ,ext char(6) null ,file_size unsigned bigint ,lobcol long binary null ) LOB Column
Creating, Loading, and Extracting Unstructured Data • New syntax added to support loading LOB data • “BINARY FILE()” added to column specification • Data for the binary file location must be delimited • LOB data does not exist in primary data file • Primary data file is a pointer to LOB data being loaded
Creating, Loading, and Extracting Unstructured Data • Load a table with a LOB column set temporary option load_memory_mb=50; go load table blob_data ( file_id ',' ,filename ',‘ ,ext ',‘ ,file_size ',' ,lobcol binary file ( ',' ) ,filler(1) -- change this to filler(2) for Windows Data ) from ‘blob_file.dat’ format ascii preview on quotes off escapes off go commit go LOB Column
Creating, Loading, and Extracting Unstructured Data • Sample blob_file.dat data 1,boston,jpg,/s1/loads/lobs/boston.jpg,1234, 2,map_of_concord,bmp,/s1/loads/maps/concord.bmp,321, 3,zero length test,NULL,,123, 4,null test,NULL,NULL,456, • Notice that there is no LOB data in the file • The IQ load engine will open the secondary files and load them into the appropriate rows • Row 1: /s1/loads/lobs/boston.jpg • Row 2: /s1/loads/maps/concord.bmp • Rows 3 and 4 will be NULL
Creating, Loading, and Extracting Unstructured Data • More LOB loading notes… • If the file cannot be opened for any reason a NULL is loaded instead and no error is thrown • An error will be thrown if the column is created with “NOT NULL” • LOB data can only be loaded from an ASCII primary data file (e.g. blob_file.dat, map.jpg) • The ASCII file can be fixed width, however, the secondary file name column MUST have a delimiter after it
Creating, Loading, and Extracting Unstructured Data • Three methods to extracting LOB data • Return the entire contents of the column • Return a portion of the column • Extract the column to disk • To return the entire contents of the column to the client application • select lobcol from my_table where file_id = 1 • To return a portion of the column to the client application • select byte_substr64( lobcol, 1, 1000 ) from my_table where file_id = 1 • Will return bytes 1 through 1000 of the LOB • Extract the column to disk • select bfile ( ‘/tmp/my_lob.dat’, lobcol ) from my_table where file_id = 1 • Will directly write the entire row/column to disk on the IQ server host
New Functions • BYTE_LENGTH64( long binary ) • Returns the total bytes contained in the LOB column • select byte_length64( lobcol ) from my_table where file_id = 1 • BYTE_SUBSTR64( long binary, offset, length) • Returns a portion/substring of the LOB data • select byte_substr64( lobcol, 1, 1000 ) from my_table where file_id = 1 • BFILE( filename, long binary ) • Extract the contents of the LOB column to filename • Each row should have a unique filename • The filename can be any string manipulation • If the filename is non-unique the contents will be overwritten • select bfile( filename + ‘.’ + ext, lobcol ) from my_table where file_id = 1
Sybase IQ 12.6 Functionality • New domain BLOB created for the “long binary” datatype • New domain CLOB created for the “long varchar” datatype • Single- and Multi-byte support added for the CLOB datatype • New load table syntax • Secondary_File_Error option specifies the desired error handling when an error occurred during opening/reading a secondary file • Supported syntax • UPDATE • INSERT..VALUES • INSERT..SELECT • LOAD • DELETE • TRUNCATE • SELECT..INTO • INSERT..LOCATION • SELECT
Tuning Sybase IQ for Unstructured Data • CORE_Options14 • This sets the number of threads used to read an individual lob secondary file. Measurements have been inconsistent due to vast hardware differences and dependencies. A value much larger than 3 has shown to cause system time to be too high. Start with a value of 3 and tune from there. • CORE_Options15 • This sets the number of threads in the lob load team. A value of zero will use 1 thread per cpu, if available. A non-zero value of n sets the team size to n, if available. The lob load threads have to live within the other settings such as max threads per connection, max team size, etc. On a machine with many CPUs, this may need to be set manually.
Tuning Sybase IQ for Unstructured Data • FP_LOB_Workunit_MBSize • This defaults to 100 MB and there is probably no reason to change it. It is the amount of data each thread reads in a single unit of work when loading lob data. • Loading LOB data is very sensitive to I/O and CPUs • Have sustained load rates of over 2 GB per second! • That’s 7 TB an hour • Had 72 CPUs and enough I/O controllers to handle the throughput • Loading LOB data is not memory intensive • A minimally configured IQ server (1-2 GB RAM total) is sufficient to load LOB data • Load_Memory_MB is not used heavily for LOB data
Large Objects (LOBs) • Existing LONG BINARY columns created before 12.5ESD8 must be dropped before 12.6 installation • LONG BINARY size limit is between 512TB and 2PB (depending upon IQ PAGE size) – that is per row! • LONG VARCHAR now has the same “restrictions” as LONG BINARY • ASE TEXT can be inserted into LONG VARCHAR • A whole new manual entitled • Large Object Management in Sybase IQ
Summary • A fast, compressible way to store unstructured data in an RDBMS • Compression ratio will vary, but is guaranteed to be no larger than the original file without the O/S overhead • Data can be written and read in parallel • Remove the need to rely on resources outside the database administrators and developers • Guaranteed performance since all resources are under the DBA’s control