File and Database Concepts

File and Database Concepts

Match the definitions

Learning Objectives • Explain how data is stored in files in the form of fixed length records comprising items in fields, with the key as a unique identifier • Describe the characteristics, advantages and disadvantages of serial, sequential, indexed sequential and random access to data, showing an understanding of the type of application suitable for each • Describe how data can be organised in secondary memory to facilitate different modes of access, including an explanation of the nature of hashing algorithms (with an example, illustrating what happens if there are collisions) to create addresses from keys, thereby facilitating direct access and the use of an index or set of indexes to facilitate indexed sequential access • Select appropriate data types for a given set of data, and explain the advantages and disadvantages of alternative data types • Describe the difference between flat files, relational and hierarchical database systems, discussing the comparative benefits and drawbacks of each • Be able to describe the following terms: table, key, field, record, relationship, primary key, foreign key, referential integrity, entity, attribute • Describe the different access levels required for on-line files and databases, identifying the need for the different levels of access with reference to user/supervisor modes and user IDs and passwords • Source: OCR

Records • “A record is an organised collection of information about an object or item” • Ted Postlethwaite10 West HillHighgateLondonN6 3PY • Fixed or variable length records: pros and cons?

Pros and Cons: Fixed and Variable Length Records • Fixed-Length Records • Easy to implement • Can predict where records will start and end so faster access • Wastes space • Can be inflexible as space allocated may become too small • Variable-Length Records • Uses disk space economically • Flexible • Difficult to implement • Slower access

Data types • Boolean • True/False • Integer • Whole numbers, e.g. 1, 2, 999 • Real • Fractional numbers, e.g. 123.456 • DateTime • Any date- or time-based value, e.g. 29/11/2005, 13:08, 0.001 seconds • String (text) • Names, addresses, emails, etc.

Why use data types? • Storage space and format • A Boolean type is often just one bit of storage space! • A 64-bit integer type will store 264 numbers. That’s anything from 0 to 18446744073709551615 or -9223372036854775808 to 9223372036854775807! • If you use a 64-bit integer to represent something that can only be between 1 and 10 then you’re wasting a lot of space! • Differences in operations • You can’t do fractional arithmetic with integers – it gives you nonsense results, e.g. 10  6 = 1 • And with text “x” + “y” = “xy”! (concatenation)

Five-Minute Task • What data types would you use for the following: • A field which holds your account balance • A field which shows how many times you’ve logged in to this website • A field which says whether you are a supervisor or not • A field which contains your phone number

Five-Minute Task Feedback • Account balance must be Real as it holds fractional numbers, e.g. £123.87 • How many times you’ve logged in will be an Integer, since you can’t log in, say, 6.34 times! • Supervisor field should be Boolean – you either are or you aren’t • Telephone numbers should be Strings • You will never do any arithmetic with them (when did you last add two telephone numbers?) • They start with a “0” which would be removed in a numeric type

Serial Access • Records not in order. • Search starts at the beginning of the file. • Records read in the order that they are stored.

Sequential Access • Records are in some order • Search starts at the beginning of the file

Serial vs Sequential? • Imagine you have a two files, each of 100,000 surnames, many of which are repeated. • One file is unordered, the other is alphabetical. • You want to find all occurrences of the name “Smith”.

Five-Minute Task • How much of the serial (unordered) file do you have to search? • How much of the sequential (ordered) file do you have to search? • Geek question: In terms of search time, how much more efficient are sequential files than serial files on average in this type of search?

Five-Minute Task Feedback: Serial File • With a serial file you would always have to search all of the file. • Why? Because you have no way of knowing that the last record of the file isn’t a “Smith”!

Five-Minute Task Feedback: Sequential File • You would only have to search until the end of the “Smiths” • Why? Because the file is in order, you know there are no more “Smiths” to look for.

Five-Minute Task Feedback: Geek Question • Sequential files would be twice as efficient on average. • Why? Of 100,000 sequenced records, you would sometimes need to search 1 records, sometimes 2, sometimes 3… sometimes 99,999, sometimes 100,000. The average of 1 … 100,000 is (100,001 / 2) ≈ 50,000. So on average you would have to search 50,000 records to find the block you want. With a serial file we know we always have to search all 100,000 records. That’s twice as many as the sequential file, so a sequential file is twice as efficient for this type of search!

Random Access • Direct access to a single record with no need to search. • Hashing algorithm creates disk address from record key. • “Collision” occurs when hashing algorithm tries to put two things in the same place!

Dealing with Collisions • Hashing algorithm gives ‘bucket’ address rather than record address. • Overflow to next bucket if target bucket is full. ‘Bucket’ FULL • Performance deteriorates over time as more overflows happen. Overflow

How Hashing Works • There are lots of ways, but we’ll look at N Mod M. • Modulo (Mod) is a mathematical operation like , , , , which gives the remainder when one number is divided by another. • Question: What is the highest possible remainder when a number is divided by M?

N Mod M We have 11 buckets,so M = 11 1234 1235 1236 1237 1238 1239 Record Keys Buckets 1240 1241 1242 1243 1234 Mod 11 = 2, so the record with the key 1234 goes into bucket number 2. 1244

Indexed Sequential • Supports direct access and sequential access. • A table with more than one column can have more than one index. • A very large index may have its own index. INDEX

The Power of Indexes • Contestant number one find… • From the phone book • Laguardia, G • Contestant number two find… • 020 7624 6921

Five-Minute Task • In groups, think carefully about exactly what you do when you look up someone’s name in the phone book. • Which access methods do you use? • What real-world objects correspond to: • The key • The record

Five-Minute Task Feedback • You open the book somewhere in the middle: Direct Access • You skip a few pages till you find the right one and then run your finger down the list to find the right name: Sequential Access • These two add up to: Indexed Sequential Access • The key must be the surname • The record could either be the address and phone number or, if you consider the address as a pointer, the house itself!

Review • So far we have looked at: • Serial Access • Sequential Access • Random Access (using hashing) • Sequential Access

Five-minute task: Fill in the table

Five-minute task feedback:File Organisation Summary

3 Types of Database • Flat-file database • One file • One record per line • Essentially a long list • Hierarchical database • Data held in a tree structure • Data held at “leaf nodes” • One-to-many relationship between parent and child nodes • Relational database • Data held in separate tables • Each table relates to an “entity” in the data model • Tables linked by foreign key relationships

Pros and Cons: Flat-file Databases • Can be very fast • Simple to design and implement • Inflexible – only useful for the application it was designed for • Data redundancy (same data repeated throughout the database)

Pros and Cons: Hierarchical Databases • Fast access to data • Especially good for hierarchical data, e.g. family tree • Inflexible structure, hard to reorganise • Can be hard to do “leaf traversal”  

Pros and Cons: Relational Databases • Extremely flexible, can model almost any system • No data redundancy • Complex software • Requires skilled administration

Relational Database Concepts • Table • Field • Record • Relationship • Foreign Key • Referential Integrity • Entity • Attribute • Primary Key You must know the definitions for these terms before the exam!

Access Rights and Restrictions • Different levels of access: • Read • Write • Append • Create • Delete • Different user types: • User • Day-to-day information use • Supervisor • Information maintenance • Technical skill required Discussion Point: The Managing Director is the boss of the whole company. Does that mean she should have Supervisor rights?

Five-Minute Task Define these access levels.

Five-Minute Task Feedback

File and Database Concepts