490 likes | 629 Views
Session id:40364. Handling Large Amounts of Biological Data. Xiaobin Guan, Ph.D. Senior Oracle DBA/Bioinformatician National Institutes of Health. Introduction. Bioinformatics In Silico Large Database DNA Sequence Using CLOB Using Partition Tables. NISC Database Environment.
E N D
Session id:40364 Handling Large Amounts of Biological Data Xiaobin Guan, Ph.D.Senior Oracle DBA/Bioinformatician National Institutes of Health
Introduction • Bioinformatics • In Silico • Large Database • DNA Sequence • Using CLOB • Using Partition Tables
NISC Database Environment • NIH Intramural Sequencing Center • Established in 1997 • A multi-disciplinary genomics facility • Large-scale DNA sequencing • Applied Biosystems (ABI) DNA Analyzers • Produce 10,000 DNA sequences per day
NISC Pipeline • The Laboratory Information Management System (LIMS). • Move the sequencing data from each PC to a partition (/area1) on our main Unix Server. • A Perl script is then running to validate the trace name and run folder name, and also check for duplicates. Then, moved to another partition (/area2). • Phred is run on each trace file to get rid of the low quality bases at the beginning and end of each read.
NISC Pipeline • Vector Screening is then performed on each read, and masked out where the vector is. • Contaminant Checking is to use BLAST to screen any contaminants. The information about contamination is then stored in the database. • QC Report is generated to show the quality and other information.
Why CLOB? • To store DNA sequences • Combination of ‘ACGT’ character strings • The length can be more or less than 4KB
A Simple Create Table Statement CREATE TABLE dna_sequence1 (base_id NUMBER(6), base_sequence CLOB) TABLESPACE example;
Specify the Segment Name, and LOB Storage CREATE TABLE dna_sequence2 (base_id NUMBER(6), base_sequence CLOB) LOB (base_sequence) STORE AS dna_seq_lob (TABLESPACE lob_seg_ts) TABLESPACE example;
Specify the Index Name and Index Storage CREATE TABLE dna_sequence3 (base_id NUMBER(6), base_sequence CLOB) LOB (base_sequence) STORE AS dna_seq_lob1 (TABLESPACE lob_seg_ts INDEX dna_seq_clob_idx( TABLESPACE nisc_index)) TABLESPACE example;
Check Segment and Index Name SELECT table_name, column_name, segment_name, index_name FROM user_lobs; TABLE_NAME COLUMN_NAME SEGMENT_NAME INDEX_NAME --------------- --------------- --------------------------- ------------------------ DNA_SEQUENCE1 BASE_SEQUENCE SYS_LOB0000040338C00002$$ SYS_IL0000040338C00002$$ DNA_SEQUENCE2 BASE_SEQUENCE DNA_SEQ_LOB SYS_IL0000040341C00002$$ DNA_SEQUENCE3 BASE_SEQUENCE DNA_SEQ_LOB1 DNA_SEQ_CLOB_IDX
Query the Table SELECT * FROM dna_sequence WHERE base_id = 20; 20 actcggtactgggacccatgtggtggatttctatccttgaagctgcacgtaaagacccggtttttgcgggtatctctgataatgccaccgctcaaatcgctacagcgtgggcaagtgcactggctgactacgccgcagcacataaatctatgccgcgtccggaaattctggcctcctgccaccagacgctggaaaactgcctgatagagtccacccgcaatagcatggatgccactaataaagcgatgctggaatctgtcgcagcagagatgatgagcgtttctgacggtgttatgcgtctgcctttattcctcgcgatgatcctgcctgttcagttgggggcagctaccgctgatgcgtgtaccttcattccggttacgcgtgaccagtccgacatctatgaagtctttaacgtggcaggttcatcttttggttcttatgctgctggtgatgttctggacatgcaatccgtcggtgtgtacagccagttacgtcgccgctatgtgctggtggcaagctccgatggcaccagcaaaaccgcaaccttcaagatggaagacttcgaaggccagaatgtaccaatccgaaaaggtcgcactaacatctacgttaaccgtattaagtctgttgttgataacggttccggcagcctacttcactcgtttactaatgctgctggtgagcaaatcactgttacctgctctctgaactacaacattggtcagattgccctgtcgttctccaaagcgccggataaaagcactgagatcgcaattgagacggaaatcaatattgaagccggctctgagctgatcccgctgatcacca
In-line or Out-of-line Storage • In-line • Out-of-line • Enable storage in row • Disable storage in row • Tablespaces
CLOB Usage • Table structure • This table contains two CLOB columns • BASECALLS stores DNA sequences • BASEQUALS stores the quality score of each sequence • The length of both fields varies between a few hundred to up to 6 thousand characters
Test Protocol • Create tablespaces • Four for 4 tables, and two for LOB storage • Create four test tables • T1, in-line, one tablespace • T2, in-line, two tablespaces • T3, out-of-line, one tablespace • T4, out-of-line, two tablespaces
Test Table 1 (T1) CREATE TABLE T1 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL1" LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL1" ENABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL1" ENABLE STORAGE IN ROW);
Test Table 2 (T2) CREATE TABLE T2 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL2" LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL_LOB1" ENABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL_LOB1" ENABLE STORAGE IN ROW);
Test Table 3 (T3) CREATE TABLE T3 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL3" LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL3" DISABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL3" DISABLE STORAGE IN ROW);
Test Table 4 (T4) CREATE TABLE T4 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL4" LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL_LOB2" DISABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL_LOB2" DISABLE STORAGE IN ROW);
Call Functions in SQL SELECT dbms_lob.getlength(base_sequence) FROM dna_sequence1 DBMS_LOB.GETLENGTH(BASE_SEQUENCE) --------------------------------- 878 1269 893 872 961 807 806 808 833 837 10 rows selected.
Call procedures in PL/SQL DECLARE v_dna_seq CLOB; v_seq_amt BINARY_INTEGER :=10; v_seq_buffer VARCHAR2(10); BEGIN v_dna_seq := 'atctcgagtagctgaagctccaatgntggtggaattcacgagttgctt'; DBMS_LOB.READ (v_dna_seq, v_seq_amt, 1, v_seq_buffer); DBMS_OUTPUT.PUT_LINE('The first 10 bases for this DNA sequence are: ' || v_seq_buffer); END; / The first 10 bases for this DNA sequence are: atctcgagta PL/SQL procedure successfully completed.
Substr vs. dbms_lob.substr • Substr(the_string, from_character, number_of_characters); • Dbms_lob.substr(the_string, number_of_characters, from_character).
Substr vs. dbms_lob.substr CREATE table substring (str varchar2(20), lob clob); INSERT INTO substring VALUES ('Oracle10G', 'Oracle10G'); SELECT substr (str, 7, 3), dbms_lob.substr(lob, 7, 3) lob FROM substring; ow03@NISCDEV.NHGRI.NIH.GOV> SUB LOB --- ---------- 10G acle10G 10G acle10G SELECT substr (str, 7, 3), dbms_lob.substr(lob, 3, 7) lob FROM substring; ow03@NISCDEV.NHGRI.NIH.GOV> SUB LOB --- ---------- 10G 10G 10G 10G
Lob Usage Limitation • Not in the ORDER BY, or GROUP BY or in an aggregate function. • Not in a SELECT... DISTINCT or SELECT... UNIQUE statement or in a join. • Not in ANALYZE... COMPUTE or ANALYZE... ESTIMATE statements. • Not as a primary key column. • Not select a LOB column through dblink. ORA-22992: cannot use LOB locators selected from remote tables.
Partitioning and Its Usage Scenarios at NISC
Partition Method • Range Partitioning, introduced in Oracle 8. • Hash Partitioning, introduced in 8i. • List Partitioning, introduced in 9i release 1. • Composite Partitioning. The range-hash partition was introduced in 8i, and the range-list partition was introduced in 9i release 2. This is a good example how Oracle adds functionalities to the new release.
Benefit of Partitioning • The amount of time for each operation can be significantly reduced because of the small segment. • Improve query performance. The I/O will be balanced among disks. • Reduce the downtime. • Part of the table can be put to read only mode. • Easy to implement.
When to Partition • When table becomes large. 2GB is considered as a general guideline. • When the data is kind of adding on, meaning new data will go to the new partition.
Work with Range Partition • Create table with range partitioning. • Convert a non-partition table to a partition table. • Merge/split partition. • Tablespace usage with partition. • Maintain range partition.
Partitioning Usage Examples • Create tablespace • Create table • Add partition • Drop partition • Exchange partition • Move partition • Merge partition • Split partition • Truncate partition • Rename partition
Create Partitioned Table CREATE TABLE dna_sequence (base_id NUMBER(6), base_sequence CLOB) LOB (base_sequence) STORE AS dna_seq_lob2 TABLESPACE example PARTITION BY RANGE (BASE_ID) (partition dna_sequence1 values less than (100) tablespace dna_sequence_p1, partition dna_sequence2 values less than (200) tablespace dna_sequence_p2, partition dna_sequence3 values less than (300) tablespace dna_sequence_p3);
Query the Partitioned Table SELECT table_name, partition_name, tablespace_name, high_value FROM user_tab_partitions ORDER BY partition_name; TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE ---------------- -------------------- -------------------- ---------- DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100 DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200 DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300
Add Partition ALTER TABLE dna_sequence ADD PARTITION dna_sequence4 VALUES LESS THAN (400) TABLESPACE dna_sequence_p1; TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE --------------- ----------------- -------------------- ---------- DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100 DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200 DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300 DNA_SEQUENCE DNA_SEQUENCE4 DNA_SEQUENCE_P1 400
Drop Partition ALTER TABLE dna_sequence DROP PARTITION dna_sequence4; Run partition.sql; TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE ---------------- ------------------- -------------------- --------- DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100 DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200 DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300
Exchange Partition CREATE TABLE dna_sep03 AS SELECT * FROM dna_sequence WHERE 1=2; ALTER TABLE dna_sequence EXCHANGE PARTITION dna_sequence3 WITH TABLE dna_sep03;
Move Partition ALTER TABLE dna_sequence MOVE PARTITION dna_sequence4 TABLESPACE dna_sequence_p2 NOLOGGING;
Split Partition ALTER TABLE dna_sequence SPLIT PARTITION dna_sequence4 AT (350) INTO ( PARTITION dna_sequence4 TABLESPACE dna_sequence_p1, PARTITION dna_sequence5 TABLESPACE dna_sequence_p2) PARALLEL ( DEGREE 5 ); TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE ----------------- -------------------- -------------------- ---------- DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100 DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200 DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300 DNA_SEQUENCE DNA_SEQUENCE4 DNA_SEQUENCE_P1 350 DNA_SEQUENCE DNA_SEQUENCE5 DNA_SEQUENCE_P2 400
Truncate Partition ALTER TABLE dna_sequence TRUNCATE PARTITION dna_sequence4 DROP STORAGE;
Rename Partition/Table • Rename partition • ALTER TABLE dna_sequence RENAME PARTITION dna_sequence4 TO dna_sequence5; • Rename table • ALTER TABLE dna_sequence RENAME TO dna_seq; • RENAME dna_seq TO dna_sequence;
Conclusion By proper use of the Oracle features such as CLOB, and partitioning table, it becomes a lot easier to manage the database containing large amounts of biological data.
Major Benefits using CLOB and Partitioning at NISC • Space Savings: Proper use of CLOB • Better performance: Put big tables into smaller segments • Better Maintenance: Easier backup and recovery; Less down time
Q & Q U E S T I O N S A N S W E R S A
Reminder – please complete the OracleWorld online session surveyThank you.Xiaobin Guan, Ph.D.NISC/NIHXiaobin_Guan@nih.gov