380 likes | 473 Views
Introduction to Biological Data Retrieval. Dr Richard White. Session 2 in module “Informatics for Omic Biosciences”. Scope. Today and in other sessions in this module the emphasis will be on data retrieval and analysis
E N D
Introduction to Biological Data Retrieval Dr Richard White Session 2 in module “Informatics for Omic Biosciences”
Scope • Today and in other sessions in this module the emphasis will be on data retrieval and analysis • Database design and construction will be covered in the MSc Specialist Options module “Information Systems in Bioinformatics” starting in February/March • Some examples are taken from wider biological and biodiversity scenarios, reflecting my background, but including some references and practice with bioinformatics databases
Introduction to databases • There are many examples of different kinds of biological, bioinformatics and biodiversity databases. • You will meet a number of these during this module. • You will often have to ask yourself how the principles we discuss today apply to these databases, and how you might apply them to databases in your own area of interest. • In this lecture, we introduce databases in general, not specifically bioinformatics databases …
Overview • What is a database? • What are they used for? • Database system components • DBMS architectures • Methods to access data • Practical session
What is a database? We meet them everywhere. Everyone has a vague idea of what a database is. • What they are not • What they are • What are they used for?
What databases are not • unstructured piles of data (including heaps of web pages sometimes referred to as a “database”) • directories full of text files • data collected for one kind of analysis • spreadsheets • information systems
Spreadsheets versus databases (1) • A spreadsheet is typically viewed as an entire table of cells which may contain • numbers (data) • text (labels) • formulae (calculations producing results) • A database may be structured in various ways, usually so that a small subset of the data is presented as the result of a search
Spreadsheets - databases (2) • Can be used immediately with little preparation (or thought) • Data is visible • Data entry is simple • Require planning • Data is hidden • May require a program to help you retrieve data • May require a program to help you enter data
Spreadsheets - databases (3) • Little checking is carried out • Tables and graphs can be produced • Single user • Extensive integrity checks can be arranged • Reports can be programmed • Searches can be made • Can be multi-user • Can be put on the Web
What a database is • Multiple uses of the data are envisaged • So data is stored separately from any application programs which might use it • Designed for retrieval in various anticipated and unanticipated ways
What are they used for? (1) Contents of databases in bioinformatics (covered in Peter Kille’s lecture in session 1): • nucleotide sequence databases • protein sequence databases • protein structure databases • special-purpose databases
What are they used for? (2) Uses of databases in biodiversity (examples will be given in my session at the end of this module): • information about species names • data about species • data about biological specimens • data about areas, places, sampling sites, etc. (sometimes stored in Geographical Information Systems (GIS)
Database system components (1) A database management system (DBMS) has the following essential components: • Data tables (the data itself) • Database “engine” (stores data to and retrieves data from the tables) • User interface (for humans to enter, view and edit data) Some commercial general-purpose DBMSs, such as Microsoft Access, make the engine and the interface appear as one
Database system components (2) A DBMS usually also includes: • Database “drivers”, import & export modules, etc. (for programs to store, retrieve and alter data) • Application programs (using the above to connect with the data and do useful things with it, sometimes called “business logic”; may be general-purpose or specialised) • Report writer (a specialised application program) • Utilities (for back-ups, integrity checking, etc.)
DBMS types (database internal structure) What are the main types of database design? (The internal mechanics, not the information or the appearance of the database as seen by the user.)
DBMS types (database internal structure) • “Free text” - records not divided into fields • “Flat-file” - records have fields (one table with columns like a spreadsheet), common and easy to understand, often inefficient • Hierarchical, Network - now obsolete • Relational - several linked tables (with related records), usually the choice of the professional • Object-oriented - for the adventurous enthusiast
Example of simple relational database • Your CD collection database • Table of CDs (linked to record companies) • Table of record companies • Table of performers • Table of tracks (linked to performer and CD)
Methods for accessing databases • What methods exist to access bioinformatics databases? • Basically there are several approaches to the use of databases:
Methods for accessing databases • Downloading a copy of the database or files of data (e.g. by FTP) • Direct access to the database tables on your PC (e.g. using Microsoft Access) • Direct access to the database tables on the remote database server (e.g. using SQL) • Using a Web page form (which interfaces to a search program on the database server), maybe downloading a file of data • Using a search program (e.g. Perl, BioPerl) on your PC to connect to the database server search program
Database use 1: direct access to database tables • Run your own database on your own computer (e.g. MS Access) • Use a program on your PC which gives you direct access to the tables in the remote database (client-server database access) In both cases, you need instructions as to what the tables are and what they contain, such as SQL.
Data retrieval from a database • A database consists of one or more tables • Data retrieved from a database can be thought of as consisting of another (usually smaller) table • So how is this smaller table specified?
Specifying the result table • By “selecting” rows, by some property such as performer = “Nigel Kennedy” • By “projecting” (choosing a subset of) the columns required, as in title, performer, label • By “joining” two tables together, by means of a linking column such as performer • SQL (Structured Query Language), which you met briefly in the Computing module, is a commonly used language in which to make these requests
SQL statements • SQL (“Structured Query Language”) is a language for specifying the creation of databases and the updating and retrieval of information in them. • It is general and “portable” – so that it can be used with a variety of different database systems without having to learn a new language for each one. • The language goes far beyond this scope of this course. Briefly, it can be used to: • Specify the tables in the database and the fields (columns) they contain • Make additions and updates to the data in those tables • Retrieve information from one or more of the tables
SQL for data retrieval • A typical SQL statement for data retrieval would look something like this: SELECT <some fields> FROM <table> WHERE <condition>; • The condition effectively selects certain rows from the table. • Thus the result is often a smaller table than the one being queried. • Tables can be “joined” together to combine information from more than one table, for example when extracting a molecular sequence from one table and the bibliographic details of the reference to where it was published from another table.
Smallest ever guide to SQL • Database table definition: column names, data types, indexes, etc. • Data retrieval is based on the idea of selecting columns and rows to obtain a subset of a larger stored table, e.g. SELECT name, salary FROM Employee WHERE name LIKE ‘Smith%’; • Data may be retrieved from linked tables using “joins” • Data records may be inserted, altered or deleted
Database use 2: predefined operations Alternatively, you might have forms and queries already set up for you, which you can just run in order to perform predefined kinds of searches. These predefined operations can be made directly available to you by: • A special local database interface program, e.g. forms in Microsoft Access • Browsing a web page, typically containing a form, which gives you access to a database somewhere else. You’ve done this if you’ve ever bought anything on the Internet.
Downloading data files • In Session 1 Peter Kille discussed downloading data sets from data servers such as EMBL (EBI), GenBank (NCBI) • This is easy to do - you can find their web sites, which direct you to an FTP server, from where you can download DNA and protein sequence files in FASTA format, or in newer XML format which makes it easier to use associated metadata.
A problem • You’ve downloaded a load of files and got them sitting prettily on your very own PC. • You’re all set to amaze the world with your bioinformatic analyses, etc. • What happens to them next?
A problem • They get out-of-date, that’s what. • The sequence database is continually being updated, but your copy isn’t. • Of course, you can keep downloading new copies at intervals, but that’s tedious. Labs lucky enough to have tame lab technicians often do this.
Downloaded files • This is potentially risky, not only because of the problem of the files going stale, but also because the data source might change their format without warning, which would break your nice information extraction program (at least in principle; there would be uproar if they actually did this!)
Solution • Don’t keep your own stale copies lying around. • Get the data “live” from the database whenever you need it. • In the Computing module, you’ve already seen how to access a live database from Perl and Java programs
Downloaded files In Peter Kille’s lecture in session 1, he highlighted the disadvantages of downloaded files: • Local copies have to be kept up to date • Flat file downloads can lose important relational data structures • Interrogating large data sets is dependent on local resources • Local copies of unwanted data are retained in case they are needed
Database querying Conversely, database access has advantages: • The database is kept up-to-date (usually by someone else!) • The relational data structure is preserved • Interrogating large data sets is done on a server, not on your local PC • Dynamic querying reflects up-to-date information and means you do not need local copies of unwanted data
Database use 3: writing a program Using or even writing a small program (sometimes called a script to make it seem less scary) to fetch the data for you. This allows you to process the data in useful ways: • to search for features you’re interested in, • to summarise the data in the way you want, or • to extract data for statistical analysis to test hypotheses.
Database use 3: using predefined operations The predefined operations may be packaged as CGI programs or Web Services or in a variety of other ways, but basically you just send a request to the service, optionally with some ‘parameters’ to specify what you want, and wait for the reply. The reply may come back, usually, • in HTML (as a web page containing the data requested) or • as some other sort of file to be downloaded (i.e. stored on your PC), either • in one of a number of formats invented by the data providers, • in XML, a standard but flexible (and verbose) way to structure a data file, so that other programs (rather than humans) can process it easily.
NCBI Entrez In other sessions, you’ll be introduced to a number of bioinformatics databases, but today we will look at a popular way to make use of some of them, because you will explore this in this course this afternoon. • NCBI web site http://www.ncbi.nlm.nih.gov/ • Entrez utilities
Practical session • In the “Computing for Bioinformatics” module, some of you have already accessed bioinformatics and biodiversity databases using Perl and Java programs • Today, we’re going to do the same but • using Perl locally • to connect to remote database retrieval software • to access a sequence database • with emphasis on interpreting the results • In later sessions you’ll download and analyse other data sets
More information • For • a copy of this presentation • the programs to be used in the practical class • other information • Go to http://users.cs.cf.ac.uk/R.J.White/ • Or Blackboard http://cue.cf.ac.uk/ • And choose the module “Informatics for Omic Biosciences”, Session 2: Introduction to Biological Data Retrieval