Motif Space Database Design Kiranjit Sidhu
Outline • Schema Design • Content of Database • Functionality • Future Plans
Sample PDB File • Each PDB file is represented as a text file (~60K lines) • Inefficient for pattern matching • A relational database is required for the most efficient solution
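To illustrate why the flat text format is awkward to query, here is a minimal Python sketch that pulls the fields out of one fixed-width ATOM record. The column positions follow the published PDB format; the `AtomRecord` type, the `parse_atom_line` function, and the sample values are illustrative assumptions, not part of the original pipeline (which used Perl scripts).

```python
from typing import NamedTuple


class AtomRecord(NamedTuple):
    """Subset of the fields in a PDB ATOM/HETATM record (hypothetical type)."""
    serial: int
    atom_name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> AtomRecord:
    """Parse one fixed-width ATOM/HETATM line using PDB column positions."""
    if not line.startswith(("ATOM  ", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return AtomRecord(
        serial=int(line[6:11]),         # cols 7-11
        atom_name=line[12:16].strip(),  # cols 13-16
        res_name=line[17:20].strip(),   # cols 18-20
        chain_id=line[21],              # col 22
        res_seq=int(line[22:26]),       # cols 23-26
        x=float(line[30:38]),           # cols 31-38
        y=float(line[38:46]),           # cols 39-46
        z=float(line[46:54]),           # cols 47-54
    )


# A made-up sample record, assembled field by field so the widths are explicit.
SAMPLE_LINE = (
    "ATOM  "       # cols 1-6   record name
    + f"{1:>5}"    # cols 7-11  atom serial number
    + " "
    + " N  "       # cols 13-16 atom name
    + " "          # col 17     alternate location indicator
    + "MET"        # cols 18-20 residue name
    + " "
    + "A"          # col 22     chain identifier
    + f"{1:>4}"    # cols 23-26 residue sequence number
    + "    "       # col 27 insertion code, cols 28-30 blank
    + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"  # cols 31-54 x, y, z
)

if __name__ == "__main__":
    print(parse_atom_line(SAMPLE_LINE))
```

Every question about a structure requires this kind of positional parsing over tens of thousands of lines, which is the cost a relational schema removes.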
Structure of Database • DB divided into two major components: • Protein Data • Motif (Occurrence) Data • Protein Data • Obtained from PDB Files (Protein Data Bank) • Derived Data • Motif Data • Obtained from Luke’s FFSM technique • Derived Data
Tools Used • Obtaining Data • Perl Scripts • Database: • SQL Server 2000 and SQL Server 2005 • T-SQL (Bulk Import Data)
Obtaining Data • PDB File → Extract → CSV File → Import (T-SQL) → Temp Tables → Convert and Derive (T-SQL Procedures) → Final DB
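The flow on this slide (CSV file, untyped temp table, converted final tables) can be sketched as follows. This is a minimal sketch only: it uses Python's built-in `sqlite3` in place of SQL Server and T-SQL, and the table names (`staging_atom`, `protein_chain`, `atom`), the CSV columns, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical output of the extract step; columns are illustrative.
CSV_TEXT = """pdb_id,chain_id,atom_serial,atom_name,x,y,z
1ABC,A,1,N,27.340,24.430,2.614
1ABC,A,2,CA,26.266,25.413,2.842
1ABC,B,1,N,10.000,11.500,12.250
"""


def load_csv(conn: sqlite3.Connection, csv_text: str) -> None:
    """Import CSV rows into a staging table, then convert into typed tables."""
    cur = conn.cursor()

    # Step 1 (import): staging table with text-only columns, as in a raw bulk load.
    cur.execute(
        "CREATE TABLE staging_atom (pdb_id TEXT, chain_id TEXT, "
        "atom_serial TEXT, atom_name TEXT, x TEXT, y TEXT, z TEXT)"
    )
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r][1:]  # skip header
    cur.executemany("INSERT INTO staging_atom VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

    # Step 2 (convert and derive): normalized, typed final tables.
    cur.execute(
        "CREATE TABLE protein_chain (chain_key INTEGER PRIMARY KEY, "
        "pdb_id TEXT NOT NULL, chain_id TEXT NOT NULL, UNIQUE (pdb_id, chain_id))"
    )
    cur.execute(
        "CREATE TABLE atom (chain_key INTEGER NOT NULL "
        "REFERENCES protein_chain(chain_key), atom_serial INTEGER NOT NULL, "
        "atom_name TEXT NOT NULL, x REAL NOT NULL, y REAL NOT NULL, "
        "z REAL NOT NULL, PRIMARY KEY (chain_key, atom_serial))"
    )
    cur.execute(
        "INSERT INTO protein_chain (pdb_id, chain_id) "
        "SELECT DISTINCT pdb_id, chain_id FROM staging_atom"
    )
    cur.execute(
        "INSERT INTO atom (chain_key, atom_serial, atom_name, x, y, z) "
        "SELECT c.chain_key, CAST(s.atom_serial AS INTEGER), s.atom_name, "
        "CAST(s.x AS REAL), CAST(s.y AS REAL), CAST(s.z AS REAL) "
        "FROM staging_atom AS s JOIN protein_chain AS c "
        "ON c.pdb_id = s.pdb_id AND c.chain_id = s.chain_id"
    )

    # Step 3: the staging table is no longer needed.
    cur.execute("DROP TABLE staging_atom")
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_csv(connection, CSV_TEXT)
    print(connection.execute("SELECT COUNT(*) FROM atom").fetchone()[0])
```

Doing the conversion as set-based `INSERT ... SELECT` statements, rather than row by row, is the usual reason a staged import of this shape is fast.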
Uploading Protein Data • Input dataset: ~ 70,000 PDB/Chain Combinations • Entries in tables: • E.g. Approx. 800 Million Rows in the proteinchaindistance table • Initial version imported 10 PDB files in 1 day • Current version: under 3 minutes
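The figures above imply roughly 11,400 distance rows per PDB/chain combination on average (800 million rows over 70,000 combinations). The slides do not say what the proteinchaindistance table stores. Assuming, purely for illustration, that it holds pairwise distances between labelled points within one chain, the sketch below shows how such derived rows could be generated and why their number grows quadratically. The function name and the sample coordinates are hypothetical.

```python
import math
from itertools import combinations
from typing import Iterable, Iterator, Tuple

Point = Tuple[int, Tuple[float, float, float]]  # (label, (x, y, z))


def pairwise_distances(points: Iterable[Point]) -> Iterator[Tuple[int, int, float]]:
    """Yield (label_i, label_j, distance) for every unordered pair of points."""
    for (i, p), (j, q) in combinations(points, 2):
        yield i, j, math.dist(p, q)  # Euclidean distance (Python 3.8+)


def pair_count(n: int) -> int:
    """Number of unordered pairs among n points: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    sample = [
        (1, (0.0, 0.0, 0.0)),
        (2, (3.0, 4.0, 0.0)),
        (3, (3.0, 4.0, 12.0)),
    ]
    for row in pairwise_distances(sample):
        print(row)
    # Average rows per chain implied by the slide's totals.
    print(round(800_000_000 / 70_000))
```

Under this assumption a chain with n points contributes n(n-1)/2 rows, so row counts in the hundreds of millions follow quickly from a dataset of this size.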
Current Functionality • Protein (PDB) data has been completely uploaded into both: • Production Database (MotifSpace) • Development Database (MotifSpaceDev) • Protein structure can be visualized using data from the database (data available) • Data can be obtained from the server using SOAP or web services • Basic queries such as: • Which PDBs does a specific motif occur in? • Histograms to compute statistics
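The two basic queries named on this slide might look like the following against a relational schema. This is a sketch under stated assumptions: the `motif_occurrence` table, its columns, and the sample rows are hypothetical, and `sqlite3` stands in for SQL Server.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical occurrence rows: (motif_id, pdb_id, chain_id).
SAMPLE_OCCURRENCES = [
    (1, "1ABC", "A"),
    (1, "1ABC", "B"),
    (1, "2XYZ", "A"),
    (2, "2XYZ", "A"),
    (3, "1ABC", "A"),
    (3, "3DEF", "A"),
]


def build_db() -> sqlite3.Connection:
    """Create an in-memory database holding the sample motif occurrences."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE motif_occurrence (motif_id INTEGER NOT NULL, "
        "pdb_id TEXT NOT NULL, chain_id TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO motif_occurrence VALUES (?, ?, ?)", SAMPLE_OCCURRENCES
    )
    conn.commit()
    return conn


def pdbs_for_motif(conn: sqlite3.Connection, motif_id: int) -> List[str]:
    """Which PDBs does a specific motif occur in?"""
    rows = conn.execute(
        "SELECT DISTINCT pdb_id FROM motif_occurrence "
        "WHERE motif_id = ? ORDER BY pdb_id",
        (motif_id,),
    )
    return [pdb_id for (pdb_id,) in rows]


def pdb_count_histogram(conn: sqlite3.Connection) -> List[Tuple[int, int]]:
    """Histogram: for each PDB count k, how many motifs occur in exactly k PDBs."""
    return conn.execute(
        "SELECT pdb_count, COUNT(*) FROM ("
        "  SELECT motif_id, COUNT(DISTINCT pdb_id) AS pdb_count"
        "  FROM motif_occurrence GROUP BY motif_id"
        ") AS per_motif GROUP BY pdb_count ORDER BY pdb_count"
    ).fetchall()


if __name__ == "__main__":
    db = build_db()
    print(pdbs_for_motif(db, 1))
    print(pdb_count_histogram(db))
```

Both questions reduce to a filter or a `GROUP BY` over one table, which is the kind of pattern matching that is impractical against raw PDB text files.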