480 likes | 611 Views
Database Design for Cyber Infrastructure. August 2004 Dave Archbell dave @ sdsc.edu Advanced Database Projects http://daks.sdsc.edu http://www.sdsc.edu dbd4ci() { ## begin. Database Design (large or complex). General Design Considerations Data Modeling / Application Workflow
E N D
Database Design for Cyber Infrastructure August 2004 Dave Archbell dave @ sdsc.edu Advanced Database Projects http://daks.sdsc.edu http://www.sdsc.edu dbd4ci() { ## begin
Database Design (large or complex) • General Design Considerations • Data Modeling / Application Workflow • Data Normalization • Data Integration • Performance • Tutorials (2) • Questions
General Design Considerations • Users • Legacy Systems/Data • Data Sources/Formats • Application Requirements • Questions!?
Users (who are they?) • Scientific • Administrative • Technical • Lay-person • Student
Understanding who theusers are can impact • Access Controls • Service Levels • Interfaces • Character Sets • Languages • Language Styles
Legacy Systems/Data • What systems are currently in place? • Where does the data come from and how is the data generated? • What does the data look like (format)? • What is it used for? • What of these is REQUIRED to remain consistent or static?
$$$ $$$ Reliability Reliability Functionality Functionality Performance Performance Application Requirements • Budget • Functional requirements (what must it do?) • Environment / Platform • Mandated Technologies (Politics) • Uptime
$$$ Reliability Functionality Performance Politics $(n) Budget
Application Workflow • Order of operations • Accounting • Auditing • Error handling • Logic • Reporting
Application Logic – Gotcha Where should the logic live? • In the application • In the database • Shared responsibility
Where should the logic live? • Is the application stand-alone or a collection of applications? • Where are the users defined? • Is data restricted by user, group, or role? • Will the application be transactional (OLTP)? • Is auditing of transactions required? • Is THIS the only database used by the application? • Do the application/business rules apply to the data specifically?
Data Modeling • What will the data need to look like in order to support required functionality? • Can the application support multiple data formats which are dependent on how it is used? • Conceptual • Logical • Physical • CASE Tools
Conceptual Model • The big picture • Describes concepts • Provides platform for designing the Logical Model
Logical Model • Describes the entities and attributes to be used in storing data • Describes relationships in a generalized way that does not reflect the physical characteristics of any particular back-end database • May reference diverse databases to be treated as a collection
Physical Model Assuming SQL • Specific to the database vendor • The actual DDL which creates the physical structures where the data is stored • Describes storage characteristics • Constrains the data
Vendor(s) Selection • What are the users currently using and familiar with? • What functionality is required? • Where will the application logic live? • How many users need to be supported (today and tomorrow)? • What other software needs to integrate with the back-end? (Globus?) • What is our budget?
Vendor(s) Selection (cont.) • MySQL • PostgreSQL • MS SQL • DB2 • Oracle • Sybase • Informix • FirstSQL
Vendor(s) Selection (cont.) • Does we need to use open-source? • Do we need transactions? • Do we need stored procedures? • Do we need roles or groups? • Will we need to operate in a GRID environment? • Do we need the ability to federate? • Do we have a budget for purchasing?
Data Normalization Latin norma – used for making right angles • Analyzing inherent relationships between data • Determine which form is appropriate for your data • Reduces and/or eliminates redundant data • Reduces and/or eliminates data integrity errors
First Normal Form • Break data into tables (from entities) • This should be the smallest meaningful value • No repeating groups • One separate table for each group of data • Identify and define a primary key for each table [A primary-key is the set of columns (attributes) which uniquely identify a row of data in the table]
Second Normal Form • Removes data columns (attributes) that rely only on part of the primary key • New tables for data that is to more than one record in any single table • Use Foreign-Key references to relate tables to one another [A foreign-key is a column (attribute) which relates to the primary key of another table]
Third Normal Form • Removes any data or column (attribute) which does not rely solely on the primary key for each table • Requires that primary/foreign key relationships be defined and utilized for applications and reports
Fourth Normal Form • Fourth Normal Form requires only that there be no multi-valued dependencies • It is rarely used • One should be careful to ensure that the application has no requirement for multi-valued dependencies before considering this form
Data Integration Federation
Federation • Providing a common interface to disparate data sources • Sources are generally related (ie: BIO, GEO, etc.) • Sources can be autonomous • Discovery. Provides a framework to build applications which can identify new relationships and take advantage of them
PDB EOL AfCS Federated Sources (BIO)
PDB EOL AfCS Federation Federation with Memory Caching FEDERATION with Cache Cache in Memory FEDERATION
Federation with Memory Caching • Twice as fast with no other changes • Network latency not an issue • Performance and/or reliability of remote data source not an issue • Synchronization - automatic • Joins are local, so no data movement across the wire • Joins are performed within the same instance providing for better use of indexes
Mediation Define domain-specific data in standardized terms to allow for relating it to other data • Data Modeling • Knowledge Representation • Query Processing for model-based mediation
Mapping: Science Domains in DICE DATA Grid Data Svcs Federation Mining VIZ Mediation Analysis Astronomy Biology Chemistry Ecology Engineering Geology Medical Neurology Particle Physics Zoology
Mapping: Science Domains in DICE DATA Grid Data Svcs Federation Mining VIZ Mediation Analysis Astronomy Biology Chemistry Ecology Engineering Geology Medical Neurology Particle Physics Zoology
Performance • Hardware • Add memory • Faster Disks / Controllers • Disk Arrays (RAID-10 in PLAID) • NAS –vs– SAN • Indexes • Increase Buffers (memory usage) • Update Statistics • Hints (in query) • Canned and Optimized Queries • Platform (32-bit, 64-bit) • Cache Data • …. uh, de-normalize?!
Performance (the easy way) Throw big hardware at the problem • Doesn’t scale • Costly • May require specialized skill to administer • Works great as a stop-gap
Real-World Examples • BIRN http://www.nbirn.net • PDB http://www.rcsb.org/pdb • WIISARD http://health.ucsd.edu/news/2003/10_23_WIISARD.html
Transportation assets Hospital #1 Attack site Warm zone Prevailing wind Compromised transportation corridor Hot zone Decon Transport station Field Treatment Station Triage X DisasterResponse Team Incident command center Hospital #2 (ER overloaded) _
Mobile CDMA-2000 Internet Hospital #1 WirelessLAN connected to Internet Hot zone Delivery of field treatment data to Hospitals Electronic record of field care Transport station Field Treatment Station Triage X Incident command center Patient monitoring And location tracking Hospital #2 _
Tutorials • Normalizing • A simple application
Tutorial - Normalizing Event Registration System • Folks • Contact Information • Event • Location • EXTRA Information • Attendees • Start Time/Date • End Time/Date • Description
Tutorial – Simple Application Event Registration
Continuing Education • UCSD Extension Data Management Courses DBA Certificate Program Database Application Developer Certificate Program
Books and Reference • Database Design for Mere Mortals, Michael J. Hernandez • Information Modeling and Relational Databases, Terry Halpin • IBM RedBooks and RedPapers http://www.redbooks.ibm.com/ • Oracle Documentation http://www.oracle.com/technology/documentation/database10g.html
Database Design for Cyber Infrastructure } ## end dbd4ci