GUS 3.0: Implementation and Dependencies
June 19, 2002
Jonathan Crabtree
crabtree@pcbi.upenn.edu

Outline
• Schema "implementation"
  • what's done, what's not
• Dependencies
  • data migration and testing
  • other tasks
• Database implementation details
  • design decisions and implications
  • production & development dbs
• Future work/current schema issues

Implementation
• Implemented so far:
  • The schema itself (Pinney)
  • Updated Perl object layer (Brunk)
  • Revised GUS Application (GA) code (Schug)
  • Preliminary version of allgenes interface (Fischer)
• Not yet implemented:
  • Extensive testing of schema, interface, objects, etc.
  • Data migration from GUSdev to GUS 3.0

Migration Dependencies
• Instantiate/finalize GUS 3.0 schema (Pinney)
• Upgrade database server operating system
• Install and configure new RAID device
• Write scripts to migrate existing data
• Resolve any remaining inconsistencies
• Freeze access to database
• Annotator's interface (Diskin, Mazzarelli)
• Current allgenes update (Pinney, Fischer)

Migration Dependencies cont.
• Run scripts to migrate all existing data
• Fix any problems that arise
• Begin to "certify" plugins as 3.0-compliant
• Discuss: how much does the GUS 3.0 schema "implementation" depend on our data migration?
  • In other words, the 3.0 schema can be viewed as implemented but untested.
• Conflict with PlasmoDB final release date?

Migration Highlights
• Two "namespaces" (Oracle schemas) to five:
  • GUSdev, RADdev => Core, DoTS, SRes, RAD3, TESS
  • Certain tables are now shared in Core, SRes
  • Avoid primary key conflicts by reloading RAD3
• Restructuring of DoTS "central dogma" tables:
  • GeneInstance, RNAInstance, ProteinInstance
  • Also GO terms, new LOE and Complex tables
• Other pervasive changes:
  • e.g. ExternalDatabase => ExternalDatabaseRelease

Other Tasks
• Script(s) to automate schema creation:
  • schemas (in the Oracle sense)
  • tables
  • sequences (to generate primary key values)
  • views
  • "bootstrap" rows
  • populate other tables as desired? (Anatomy, etc.)
  • constraints
  • indexes
  • GRANT permissions as desired
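
One way to picture the schema-creation script is as a driver that always emits statements in the order listed on the slide, whatever order they were collected in. The sketch below is illustrative only: the phase names paraphrase the slide, and the example statements (including the role name gus_reader and the index name) are hypothetical, not the actual GUS 3.0 DDL. Placing data loading before constraints and indexes follows the slide's order and matches the usual practice of bulk loading before building indexes.

```python
# Phases in the order given on the slide; the names are paraphrased.
PHASES = [
    "schemas", "tables", "sequences", "views", "bootstrap rows",
    "other data", "constraints", "indexes", "grants",
]

def build_script(statements):
    """Return SQL statements ordered by phase.

    statements maps a phase name to a list of SQL strings.
    """
    unknown = set(statements) - set(PHASES)
    if unknown:
        raise ValueError(f"unknown phase(s): {sorted(unknown)}")
    script = []
    for phase in PHASES:
        for sql in statements.get(phase, []):
            # Normalize so each statement ends with exactly one semicolon.
            script.append(sql.strip().rstrip(";") + ";")
    return script

# Hypothetical statements, deliberately supplied out of order.
example = {
    "indexes": ["CREATE INDEX Core.TableInfo_ind01 ON Core.TableInfo (name)"],
    "tables": ["CREATE TABLE Core.TableInfo (table_id NUMBER(5), name VARCHAR2(30))"],
    "sequences": ["CREATE SEQUENCE Core.TableInfo_SQ"],
    "grants": ["GRANT SELECT ON Core.TableInfo TO gus_reader"],
}
script = build_script(example)
for line in script:
    print(line)
```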

Other Tasks II
• Complete schema documentation
• Convert plugins to new schema as needed
• Remove site-specific dependencies or standardize
  • e.g., hard-coded references to specific external_db_ids
  • Particularly for data loading plugins
  • make it easier to load and display sample dataset
• Formalize schema development process
  • Which changes are "major" or "minor"?
  • Automatically determine which plugins are affected?

Database Design Decisions
• GUS vs. other "plain" relational databases:
  • 1. subclassing (extra views)
  • 2. [blame] tracking/access control (extra columns)
  • 3. versioning (extra tables)
• Minimal reliance on database-specific features
  • no stored procedures
  • no server-side Java
  • no object-relational tables
• Generic links and naming conventions

1. Subclassing With Views
• Advantages
  • conceptual clarity
  • straightforward to query the superclasses
  • schema evolution; views are easier to change
• Implications
  • large tables (number of columns and rows)
  • complicates query optimization (number of rows)
  • slows row accesses (number of columns)
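
A minimal sketch of the view-based subclassing pattern, using Python's built-in sqlite3 rather than Oracle. Only the table name NASequenceImp comes from these slides. The discriminator column subclass_view, the generic columns string1/int1, and the two subclass views are assumptions made for illustration, not the actual GUS 3.0 definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One wide implementation table holds the rows of every subclass.
CREATE TABLE NASequenceImp (
    na_sequence_id INTEGER PRIMARY KEY,
    subclass_view  TEXT NOT NULL,   -- assumed discriminator column
    sequence       TEXT,            -- column shared by all subclasses
    string1        TEXT,            -- generic columns, reused per subclass
    int1           INTEGER
);

-- Each subclass is a view that filters on the discriminator and
-- gives the generic columns meaningful names.
CREATE VIEW ExternalNASequence AS
    SELECT na_sequence_id, sequence, string1 AS source_id
    FROM NASequenceImp
    WHERE subclass_view = 'ExternalNASequence';

CREATE VIEW Assembly AS
    SELECT na_sequence_id, sequence, int1 AS number_of_contained_sequences
    FROM NASequenceImp
    WHERE subclass_view = 'Assembly';
""")

conn.executemany(
    "INSERT INTO NASequenceImp VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ExternalNASequence", "ACGT", "AB000001", None),
        (2, "Assembly", "GGCC", None, 12),
        (3, "Assembly", "TTAA", None, 3),
    ],
)

# Querying the superclass is a plain query against the Imp table ...
all_ids = [row[0] for row in conn.execute(
    "SELECT na_sequence_id FROM NASequenceImp ORDER BY na_sequence_id")]

# ... while a subclass view only exposes its own rows and columns.
assembly_rows = conn.execute(
    "SELECT na_sequence_id, number_of_contained_sequences "
    "FROM Assembly ORDER BY na_sequence_id").fetchall()
```

Every row carries all of the generic columns whether or not its subclass uses them, which is the source of the wide rows noted under "Implications".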

Subclassing cont.
• Query optimization issues
  • Cost-based query optimization requires statistics
  • Confounded by coexistence of subclasses in table
  • Bigger tables make the worst case worse
• Physical I/O issues
  • Any row access must read the entire row, including a potentially large set of irrelevant column values
  • Also increases the likelihood of chaining

Subclassing - alternatives
• Use views for the superclass, not the subclasses?
  • Isolates subclasses from one another more
  • Requires changing tables rather than views
  • Superclass view will be a large SQL UNION
  • Queries likely less efficient over superclasses
• Keep existing system, but use partitions to specify physical placement of subclass rows
  • Solves some, but not all, of the problems
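
The first alternative can be sketched the same way: each subclass gets its own table and the superclass becomes a view over their union. As before, this uses sqlite3 in place of Oracle, and the table and column names are hypothetical. UNION ALL is used on the assumption that primary key values are unique across the subclass tables (for example, drawn from one shared sequence), which nothing in this sketch enforces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical subclass tables, each with only the columns it needs.
CREATE TABLE ExternalNASequence (
    na_sequence_id INTEGER PRIMARY KEY,
    sequence       TEXT,
    source_id      TEXT
);
CREATE TABLE Assembly (
    na_sequence_id INTEGER PRIMARY KEY,
    sequence       TEXT,
    number_of_contained_sequences INTEGER
);

-- The superclass is now a view: one SELECT per subclass table,
-- restricted to the shared columns.
CREATE VIEW NASequence AS
    SELECT na_sequence_id, sequence, 'ExternalNASequence' AS subclass
    FROM ExternalNASequence
    UNION ALL
    SELECT na_sequence_id, sequence, 'Assembly' AS subclass
    FROM Assembly;
""")

conn.execute("INSERT INTO ExternalNASequence VALUES (1, 'ACGT', 'AB000001')")
conn.execute("INSERT INTO Assembly VALUES (2, 'GGCC', 12)")

superclass_rows = conn.execute(
    "SELECT na_sequence_id, subclass FROM NASequence "
    "ORDER BY na_sequence_id").fetchall()
```

Adding a subclass column now means altering a table, and the view grows by one SELECT per subclass, which is why superclass queries are expected to be less efficient.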

"Large" Tables I • GUS: indexes=25G tables=~100G • NASequenceImp = 11G • AssemblySequenceVer = 8.6 G • SimilaritySpan = 8G / 74 million rows • Similarity = 4.8 G / 38 million rows • AssemblySequence = 3.6G • Evidence = 3G / 36 million rows • SimilarityVer = 2.8G • Approximately 10-20 quite large tables

"Large" Tables II • Tables with largest average row length: • GeneMapVer = 811 bytes • NASequenceImp = 747 bytes • AssemblySequence = 521 bytes • Tables with the most chained rows: • NASequenceImp = 384,524 rows • AssemblySequenceVer = 56,873 rows • AssemblySequence = 21,458 rows

2. Tracking/Access Control
• Advantages:
  • Enables DBA to disburse wrath appropriately
  • Aids in correcting errors
• Disadvantages:
  • Extra columns have foreign key constraints
  • Several small tables become bottlenecks for certain DDL and database update operations
  • Access controls not fully implemented
    • where and how should they be implemented?
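
A small sketch of what the extra tracking columns might look like, again with sqlite3 standing in for Oracle. The table name Evidence appears elsewhere in these slides, but UserInfo and the columns fact, row_user_id, and modification_date are assumed names for illustration. The point is that every data table gains a foreign key into the same small user table, which enables the "blame" query and also makes that small table a shared dependency.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE UserInfo (
    user_id INTEGER PRIMARY KEY,
    login   TEXT NOT NULL
);
-- Data table with assumed tracking columns appended.
CREATE TABLE Evidence (
    evidence_id       INTEGER PRIMARY KEY,
    fact              TEXT,
    modification_date TEXT NOT NULL,
    row_user_id       INTEGER NOT NULL REFERENCES UserInfo(user_id)
);
""")

conn.executemany("INSERT INTO UserInfo VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO Evidence VALUES (?, ?, ?, ?)",
                 [(1, "a", "2002-06-01", 1),
                  (2, "b", "2002-06-02", 2),
                  (3, "c", "2002-06-03", 2)])

# "Blame" query: who wrote how many rows?
blame = conn.execute("""
    SELECT u.login, COUNT(*)
    FROM Evidence e JOIN UserInfo u ON u.user_id = e.row_user_id
    GROUP BY u.login ORDER BY u.login
""").fetchall()

# The foreign key rejects rows attributed to an unknown user.
try:
    conn.execute("INSERT INTO Evidence VALUES (4, 'd', '2002-06-04', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```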

Tracking II

3. Versioning
• Advantages:
  • required for complete tracking
• Disadvantages:
  • space overhead, results in slower updates
  • requires application-level code to implement
  • may be unnecessary in some DBMSs
  • currently not used uniformly
• Different versions coexisting e.g., PlasmoDB
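
The "requires application-level code" point can be illustrated with a sketch in which the application copies the current row into a version table before updating it. The table names Similarity and SimilarityVer appear on the "Large" Tables slide, but the columns, the version_date column, and the update_score helper are hypothetical, and sqlite3 stands in for Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Similarity (
    similarity_id     INTEGER PRIMARY KEY,
    score             REAL,
    modification_date TEXT
);
-- Version table: same columns plus the date the row was superseded.
CREATE TABLE SimilarityVer (
    similarity_id     INTEGER,
    score             REAL,
    modification_date TEXT,
    version_date      TEXT
);
""")

def update_score(conn, similarity_id, new_score, now):
    """Archive the current row, then apply the update, in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO SimilarityVer "
            "SELECT similarity_id, score, modification_date, ? "
            "FROM Similarity WHERE similarity_id = ?",
            (now, similarity_id))
        conn.execute(
            "UPDATE Similarity SET score = ?, modification_date = ? "
            "WHERE similarity_id = ?",
            (new_score, now, similarity_id))

conn.execute("INSERT INTO Similarity VALUES (1, 10.0, '2002-06-01')")
update_score(conn, 1, 20.0, "2002-06-19")

current = conn.execute(
    "SELECT score, modification_date FROM Similarity").fetchall()
history = conn.execute("SELECT * FROM SimilarityVer").fetchall()
```

Each update costs one extra insert and one extra stored row, which is the space and speed overhead listed under "Disadvantages".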

Development => Production
• nemesis/8i (GUS) and erebus/9i (GUSdev)
• Release cycle based on whole-database copy
• Uses Oracle IMPORT and EXPORT utilities:
  • EXPORT over network to flat files
  • change owner/schema name
  • change physical placement of tables, indexes
• Alternatives:
  • transportable tablespaces, SQL-based copy

Future Work
• Issues with current schema from PlasmoDB
  • free text searching (and use of CLOB values)
  • more sophisticated schema for tracking session-oriented data (more on this tomorrow)
  • supporting queries for genome browser(s)
