Handling Large Amounts of Biological Data

Session id:40364 Handling Large Amounts of Biological Data Xiaobin Guan, Ph.D.Senior Oracle DBA/Bioinformatician National Institutes of Health

Introduction • Bioinformatics • In Silico • Large Database • DNA Sequence • Using CLOB • Using Partition Tables

NISC Database Environment • NIH Intramural Sequencing Center • Established in 1997 • A multi-disciplinary genomics facility • Large-scale DNA sequencing • Applied Biosystems (ABI) DNA Analyzers • Produce 10,000 DNA sequences per day

NISC Pipeline • The Laboratory Information Management System (LIMS). • Move the sequencing data from each PC to a partition (/area1) on our main Unix Server. • A Perl script is then running to validate the trace name and run folder name, and also check for duplicates. Then, moved to another partition (/area2). • Phred is run on each trace file to get rid of the low quality bases at the beginning and end of each read.

NISC Pipeline • Vector Screening is then performed on each read, and masked out where the vector is. • Contaminant Checking is to use BLAST to screen any contaminants. The information about contamination is then stored in the database. • QC Report is generated to show the quality and other information.

Why CLOB? • To store DNA sequences • Combination of ‘ACGT’ character strings • The length can be more or less than 4KB

LOBs vs. Long/Long Raw

A Simple Create Table Statement CREATE TABLE dna_sequence1 (base_id NUMBER(6), base_sequence CLOB) TABLESPACE example;

Specify the Segment Name, and LOB Storage CREATE TABLE dna_sequence2 (base_id NUMBER(6), base_sequence CLOB) LOB (base_sequence) STORE AS dna_seq_lob (TABLESPACE lob_seg_ts) TABLESPACE example;

Specify the Index Name and Index Storage CREATE TABLE dna_sequence3 (base_id NUMBER(6), base_sequence CLOB) LOB (base_sequence) STORE AS dna_seq_lob1 (TABLESPACE lob_seg_ts INDEX dna_seq_clob_idx( TABLESPACE nisc_index)) TABLESPACE example;

Check Segment and Index Name SELECT table_name, column_name, segment_name, index_name FROM user_lobs; TABLE_NAME COLUMN_NAME SEGMENT_NAME INDEX_NAME --------------- --------------- --------------------------- ------------------------ DNA_SEQUENCE1 BASE_SEQUENCE SYS_LOB0000040338C00002$$ SYS_IL0000040338C00002$$ DNA_SEQUENCE2 BASE_SEQUENCE DNA_SEQ_LOB SYS_IL0000040341C00002$$ DNA_SEQUENCE3 BASE_SEQUENCE DNA_SEQ_LOB1 DNA_SEQ_CLOB_IDX

Query the Table SELECT * FROM dna_sequence WHERE base_id = 20; 20 actcggtactgggacccatgtggtggatttctatccttgaagctgcacgtaaagacccggtttttgcgggtatctctgataatgccaccgctcaaatcgctacagcgtgggcaagtgcactggctgactacgccgcagcacataaatctatgccgcgtccggaaattctggcctcctgccaccagacgctggaaaactgcctgatagagtccacccgcaatagcatggatgccactaataaagcgatgctggaatctgtcgcagcagagatgatgagcgtttctgacggtgttatgcgtctgcctttattcctcgcgatgatcctgcctgttcagttgggggcagctaccgctgatgcgtgtaccttcattccggttacgcgtgaccagtccgacatctatgaagtctttaacgtggcaggttcatcttttggttcttatgctgctggtgatgttctggacatgcaatccgtcggtgtgtacagccagttacgtcgccgctatgtgctggtggcaagctccgatggcaccagcaaaaccgcaaccttcaagatggaagacttcgaaggccagaatgtaccaatccgaaaaggtcgcactaacatctacgttaaccgtattaagtctgttgttgataacggttccggcagcctacttcactcgtttactaatgctgctggtgagcaaatcactgttacctgctctctgaactacaacattggtcagattgccctgtcgttctccaaagcgccggataaaagcactgagatcgcaattgagacggaaatcaatattgaagccggctctgagctgatcccgctgatcacca

In-line or Out-of-line Storage • In-line • Out-of-line • Enable storage in row • Disable storage in row • Tablespaces

CLOB Usage • Table structure • This table contains two CLOB columns • BASECALLS stores DNA sequences • BASEQUALS stores the quality score of each sequence • The length of both fields varies between a few hundred to up to 6 thousand characters

Test Protocol • Create tablespaces • Four for 4 tables, and two for LOB storage • Create four test tables • T1, in-line, one tablespace • T2, in-line, two tablespaces • T3, out-of-line, one tablespace • T4, out-of-line, two tablespaces

Test Table 1 (T1) CREATE TABLE T1 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL1" LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL1" ENABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL1" ENABLE STORAGE IN ROW);

Test Table 2 (T2) CREATE TABLE T2 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL2" LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL_LOB1" ENABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL_LOB1" ENABLE STORAGE IN ROW);

Test Table 3 (T3) CREATE TABLE T3 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL3" LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL3" DISABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL3" DISABLE STORAGE IN ROW);

Test Table 4 (T4) CREATE TABLE T4 (CALL_ID NUMBER(10) NOT NULL, TRACE_ID NUMBER(10) NOT NULL, BASECALLS CLOB NOT NULL, BASEQUALS CLOB) TABLESPACE "TEST_CALL4" LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL_LOB2" DISABLE STORAGE IN ROW) LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL_LOB2" DISABLE STORAGE IN ROW);

Results

DBMS_LOB Package

Functions/Procedures to Read or Return LOB Values

Functions/Procedures to Write LOB Values

Functions/Procedures for BFILEs

Call Functions in SQL SELECT dbms_lob.getlength(base_sequence) FROM dna_sequence1 DBMS_LOB.GETLENGTH(BASE_SEQUENCE) --------------------------------- 878 1269 893 872 961 807 806 808 833 837 10 rows selected.

Call procedures in PL/SQL DECLARE v_dna_seq CLOB; v_seq_amt BINARY_INTEGER :=10; v_seq_buffer VARCHAR2(10); BEGIN v_dna_seq := 'atctcgagtagctgaagctccaatgntggtggaattcacgagttgctt'; DBMS_LOB.READ (v_dna_seq, v_seq_amt, 1, v_seq_buffer); DBMS_OUTPUT.PUT_LINE('The first 10 bases for this DNA sequence are: ' || v_seq_buffer); END; / The first 10 bases for this DNA sequence are: atctcgagta PL/SQL procedure successfully completed.

Substr vs. dbms_lob.substr • Substr(the_string, from_character, number_of_characters); • Dbms_lob.substr(the_string, number_of_characters, from_character).

Substr vs. dbms_lob.substr CREATE table substring (str varchar2(20), lob clob); INSERT INTO substring VALUES ('Oracle10G', 'Oracle10G'); SELECT substr (str, 7, 3), dbms_lob.substr(lob, 7, 3) lob FROM substring; ow03@NISCDEV.NHGRI.NIH.GOV> SUB LOB --- ---------- 10G acle10G 10G acle10G SELECT substr (str, 7, 3), dbms_lob.substr(lob, 3, 7) lob FROM substring; ow03@NISCDEV.NHGRI.NIH.GOV> SUB LOB --- ---------- 10G 10G 10G 10G

Lob Usage Limitation • Not in the ORDER BY, or GROUP BY or in an aggregate function. • Not in a SELECT... DISTINCT or SELECT... UNIQUE statement or in a join. • Not in ANALYZE... COMPUTE or ANALYZE... ESTIMATE statements. • Not as a primary key column. • Not select a LOB column through dblink. ORA-22992: cannot use LOB locators selected from remote tables.

Partitioning and Its Usage Scenarios at NISC

Partition Method • Range Partitioning, introduced in Oracle 8. • Hash Partitioning, introduced in 8i. • List Partitioning, introduced in 9i release 1. • Composite Partitioning. The range-hash partition was introduced in 8i, and the range-list partition was introduced in 9i release 2. This is a good example how Oracle adds functionalities to the new release.

Benefit of Partitioning • The amount of time for each operation can be significantly reduced because of the small segment. • Improve query performance. The I/O will be balanced among disks. • Reduce the downtime. • Part of the table can be put to read only mode. • Easy to implement.

When to Partition • When table becomes large. 2GB is considered as a general guideline. • When the data is kind of adding on, meaning new data will go to the new partition.

Work with Range Partition • Create table with range partitioning. • Convert a non-partition table to a partition table. • Merge/split partition. • Tablespace usage with partition. • Maintain range partition.

Partitioning Usage Examples • Create tablespace • Create table • Add partition • Drop partition • Exchange partition • Move partition • Merge partition • Split partition • Truncate partition • Rename partition

Create Partitioned Table CREATE TABLE dna_sequence (base_id NUMBER(6), base_sequence CLOB) LOB (base_sequence) STORE AS dna_seq_lob2 TABLESPACE example PARTITION BY RANGE (BASE_ID) (partition dna_sequence1 values less than (100) tablespace dna_sequence_p1, partition dna_sequence2 values less than (200) tablespace dna_sequence_p2, partition dna_sequence3 values less than (300) tablespace dna_sequence_p3);

Query the Partitioned Table SELECT table_name, partition_name, tablespace_name, high_value FROM user_tab_partitions ORDER BY partition_name; TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE ---------------- -------------------- -------------------- ---------- DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100 DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200 DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300

Add Partition ALTER TABLE dna_sequence ADD PARTITION dna_sequence4 VALUES LESS THAN (400) TABLESPACE dna_sequence_p1; TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE --------------- ----------------- -------------------- ---------- DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100 DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200 DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300 DNA_SEQUENCE DNA_SEQUENCE4 DNA_SEQUENCE_P1 400

Drop Partition ALTER TABLE dna_sequence DROP PARTITION dna_sequence4; Run partition.sql; TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE ---------------- ------------------- -------------------- --------- DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100 DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200 DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300

Exchange Partition CREATE TABLE dna_sep03 AS SELECT * FROM dna_sequence WHERE 1=2; ALTER TABLE dna_sequence EXCHANGE PARTITION dna_sequence3 WITH TABLE dna_sep03;

Move Partition ALTER TABLE dna_sequence MOVE PARTITION dna_sequence4 TABLESPACE dna_sequence_p2 NOLOGGING;

Split Partition ALTER TABLE dna_sequence SPLIT PARTITION dna_sequence4 AT (350) INTO ( PARTITION dna_sequence4 TABLESPACE dna_sequence_p1, PARTITION dna_sequence5 TABLESPACE dna_sequence_p2) PARALLEL ( DEGREE 5 ); TABLE_NAME PARTITION_NAME TABLESPACE_NAME HIGH_VALUE ----------------- -------------------- -------------------- ---------- DNA_SEQUENCE DNA_SEQUENCE1 DNA_SEQUENCE_P1 100 DNA_SEQUENCE DNA_SEQUENCE2 DNA_SEQUENCE_P2 200 DNA_SEQUENCE DNA_SEQUENCE3 DNA_SEQUENCE_P3 300 DNA_SEQUENCE DNA_SEQUENCE4 DNA_SEQUENCE_P1 350 DNA_SEQUENCE DNA_SEQUENCE5 DNA_SEQUENCE_P2 400

Truncate Partition ALTER TABLE dna_sequence TRUNCATE PARTITION dna_sequence4 DROP STORAGE;

Rename Partition/Table • Rename partition • ALTER TABLE dna_sequence RENAME PARTITION dna_sequence4 TO dna_sequence5; • Rename table • ALTER TABLE dna_sequence RENAME TO dna_seq; • RENAME dna_seq TO dna_sequence;

Conclusion By proper use of the Oracle features such as CLOB, and partitioning table, it becomes a lot easier to manage the database containing large amounts of biological data.

Major Benefits using CLOB and Partitioning at NISC • Space Savings: Proper use of CLOB • Better performance: Put big tables into smaller segments • Better Maintenance: Easier backup and recovery; Less down time

Q & Q U E S T I O N S A N S W E R S A

Reminder – please complete the OracleWorld online session surveyThank you.Xiaobin Guan, Ph.D.NISC/NIHXiaobin_Guan@nih.gov

Handling Large Amounts of Biological Data

Handling Large Amounts of Biological Data

Presentation Transcript

GenBank Huge amounts of data, easily accessible

Data Handling

Handling Data

Data Handling

Handling Large Numbers of Entries

Dramatically increase performance for large amounts of data in Xaml GridView and ListView

Making sense of large amounts of molecular data

Answering biological questions using large genomic data collections

FILS Handling of Large Objects

FILS Handling of Large Objects

FILS Handling of Large Objects

Integrative Analysis of multiple large-scale molecular biological data

Data Handling

Handling Data

Large Biological Molecules

Data Handling

Unsupervised Training Using Large Amounts of Arabic Broadcast News Audio Data

Large Object Handling

DATA HANDLING

Handling large biological datasets Part 1

Data Handling

Techniques for Finding Patterns in Large Amounts of Data: Applications in Biology