Chapter 5 – Managing Files of Records
What’s Up for This Chapter? • This Chapter’s Material • Accessing records in files • Record structures for access • File access methods vs. file organizations • Some real-world examples of file structures • File portability issues
The Central Problem • Locating Stored Data • Once the data has been stored into a file, how do you find it to retrieve it? • What does “find the data” even mean? • How do you decide what you want to find? • How do you look for it? • What if it’s not there? • What if something very much like it is there? • What if there are lots of “it” there? • And, of course, there are efficiency considerations • How fast is your search algorithm? • What would you have to do to the file to use a faster one? • Which will you do more often, add records or find them? • Bringing you back to the design of the file itself
Record Keys • What Is a Key? • Data stored in a record by which you look for the record • Can be one field or a set of fields • Examples – { name } or { last_name + first_name } • Two Types of Keys • Primary key • Key value, unique in entire file, by which an individual record can be located or determined to be absent • Secondary key • Key value by which one or more records can be located
Primary Keys • Required Characteristics • Unique across the entire file • Can never have 2 records with same primary key • Error to try to add record with duplicate primary key • In “canonical” form • Format precisely known, so search candidates can be brought into that same format before the search • Example – words (names, etc.) in all upper-case • Not often used any more: rather, program the system to do the search independently of case • Unchanging • Value for given record should never change • Given primary key value should always identify same record • Example – Texas Driver’s License number stays with you, even if you move away from Texas, then come back
Primary Keys, cont’d. • Implication on File Design • Don’t use possibly non-unique field(s) as primary key • Bad – name, birth date, etc. • Don’t use anything that can possibly change • Bad – name, address, etc. • What can we use? • Best – artificial identifier • Student number • Driver’s license number • Other artificially created unique value
Secondary Keys • Not Such Stringent Rules • Duplicates allowed • Still have to define what “find” means if duplicates allowed • Usually real data, as opposed to primary keys • The kinds of thing you’d want to search for in real life • Not used to impose any order on the file • Can return results based on secondary key(s) • Selected by secondary key value(s) • Sorted on secondary key value(s)
Searching • From 2325 – Two Major Methods • Sequential • Start at beginning, look until you find what you’re after • Choices: • Non-unique keys allowed? • Return first match or all of them? • Binary • Start in middle, remove half the list each time through • Requires: • Primary key values unique across file • File sorted on primary key • Records directly accessible • There are others, but …
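The two methods can be sketched in C++ as below. This is a minimal in-memory sketch, not code from the course: the `Record` layout and the function names `binary_search_by_id` and `find_all_by_name` are hypothetical, and a `std::vector` stands in for a file of directly accessible records.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Record {
    int id;            // primary key: unique, and the records are sorted on it
    std::string name;  // secondary key: duplicates allowed
};

// Binary search on the primary key. Returns the record's index, or -1 if absent.
int binary_search_by_id(const std::vector<Record>& recs, int id) {
    // Precondition from the slide: the "file" is sorted on the primary key.
    assert(std::is_sorted(recs.begin(), recs.end(),
        [](const Record& a, const Record& b) { return a.id < b.id; }));
    std::size_t lo = 0, hi = recs.size();        // search window is [lo, hi)
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (recs[mid].id == id) return static_cast<int>(mid);
        if (recs[mid].id < id) lo = mid + 1;     // discard the lower half
        else hi = mid;                           // discard the upper half
    }
    return -1;
}

// Sequential search on a secondary key. Returns the indices of all matches.
std::vector<int> find_all_by_name(const std::vector<Record>& recs,
                                  const std::string& name) {
    std::vector<int> hits;
    for (std::size_t i = 0; i < recs.size(); ++i) {
        if (recs[i].name == name) hits.push_back(static_cast<int>(i));
    }
    return hits;
}
```

The sequential version answers the "first match or all of them?" choice by returning all of them; stopping at the first hit would be a one-line change.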
Sequential Searching • Performance • It might take 1 try; it might take N tries • Average number of tries = N / 2 if: • Searching on a unique key • Returning first match • Average number of tries = N if: • Returning all matches
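The N / 2 figure can be checked with a short derivation, assuming the record is present and is equally likely to sit at any of the N positions:

```latex
\[
\text{average tries} \;=\; \frac{1}{N}\sum_{i=1}^{N} i \;=\; \frac{N+1}{2} \;\approx\; \frac{N}{2}
\]
```

When all matches must be returned, or when the key is not in the file at all, every one of the N records has to be examined.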
Sequential Searching • Performance • Big factor in disk access • Worst case: • File fragmented around the disk • Each program read takes one physical read • Best case: • File fairly contiguous on disk • I/O System buffers things so very few (1?) actual reads are done • In multi-user OSs, this seldom happens • However: • If read/write head didn’t move between accesses • Rotational latency & transfer times small compared to seek time • Multiple physical reads wouldn’t have as much of an impact • However, most OSs are multi-tasking now • Can’t rely on read/write head’s being where you left it • Must assume N physical reads take N full disk accesses
Improving Sequential Searches • Reduce Number of Physical Reads • We can’t do anything about: • File fragmentation • If file’s clusters scattered around disk, multiple seeks are necessary • Multi-tasking environment • Have to assume each program read causes a physical read • (May not be true, if I/O System has good internal caching) • So what do we do? • Increase the number of records pulled in by each physical read • Saw this with magnetic tape – group the records into blocks • Similar to way we collected fields into records, but … • Grouping fields into records is dependent on data characteristics • Grouping records into blocks is dependent on I/O system & disk • Block size should be: • Multiple of disk sector size • Compatible with I/O System’s ability to read
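The effect of blocking on the number of physical reads can be estimated with a little arithmetic. The sketch below assumes one physical read per block and records that never span a block boundary; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Number of whole records that fit in one block.
std::size_t records_per_block(std::size_t block_size, std::size_t record_size) {
    assert(record_size > 0 && block_size >= record_size);
    return block_size / record_size;
}

// Physical reads needed to scan num_records sequentially, one block per read.
std::size_t reads_for_scan(std::size_t num_records,
                           std::size_t block_size,
                           std::size_t record_size) {
    const std::size_t per_block = records_per_block(block_size, record_size);
    return (num_records + per_block - 1) / per_block;   // ceiling division
}
```

For example, 4000 records of 64 bytes take 4000 reads when read one record at a time, 500 reads with 512-byte blocks, and 63 reads with 4096-byte blocks.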
When to Use Sequential Searching • Sequential Searching is Good for: • Text files where you’re looking for a pattern • Unix ‘grep’ (“global regular expression print”) command • Small files • Like you use in labs here • Files that are searched very infrequently • Not worth the effort to sort to make binary search work • When you expect a large number of matches • Example – searching on a secondary key • It’s Not so Good for: • Binary files • Sorted files • Big files
Unix Tools for Sequential Access • cat • Seen this one – concatenate files • cat F1 F2 >F3 • wc • Word count (also character & line count) • wc article.txt • grep • Search file for occurrences of regular expression pattern • grep "Ames" personlist.txt • od • Octal dump – or hex, or … • od -ch list.dat
Direct Access • What is it? • Go straight to the record you want in the file • No searching • No unnecessary disk accesses • What’s its “order”? • Time to find a record is independent of number of records • However, it can be harder to do
Direct Access • How to Do It? • At I/O System level, seek to record • C++ seek operations go to relative byte address (RBA) in file • Variants: • Seek with “get” pointer vs. seek with “put” pointer • Relative to start or end of file (default: start) • But that still doesn’t answer the question • How do we know what RBA a particular record starts at? • We’ve talked about index files – but that’s for later • We could move the problem up one level • Use relative record number (RRN) • But that’s no real help • Still need some kind of index – way to find record’s RRN • Also requires use of fixed-length records: RBA = RRN * Record_Size (assuming, of course, that the first RRN is 0)
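The RBA = RRN * Record_Size rule, combined with a seek on the "get" pointer, can be sketched as follows. This is a minimal sketch assuming fixed-length records with no file header and a first RRN of 0; the names `rba_of` and `read_record` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this without a file
#include <string>

// Relative byte address of a fixed-length record (first RRN is 0).
std::streamoff rba_of(std::size_t rrn, std::size_t record_size) {
    return static_cast<std::streamoff>(rrn * record_size);
}

// Read record number rrn into out. Returns false if the record is not
// there, for example when the RRN lies past the end of the file.
bool read_record(std::istream& in, std::size_t rrn,
                 std::size_t record_size, std::string& out) {
    assert(record_size > 0);
    in.clear();                                          // drop any earlier EOF/fail state
    in.seekg(rba_of(rrn, record_size), std::ios::beg);   // move the "get" pointer
    out.assign(record_size, '\0');
    in.read(&out[0], static_cast<std::streamsize>(record_size));
    return in.gcount() == static_cast<std::streamsize>(record_size);
}
```

The same function works on a `std::ifstream` opened with `std::ios::binary`. Note that it still needs the RRN from somewhere, which is the indexing problem the slide points out.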
Building a File of Records • Like Building a Record of Fields • Same problem, up one level • Fixed-length or specified-length records? • How to directly access records? • But wait – there’s more: • Want to require software to know as few details about file as possible • To do that, those details need to be stored with (in) the file • File header records • Store file-specific information at start of file • Header record format • Constant across all file types within one system • Why?
File Header Records • Things a Header Record Might Contain • File structure • Type of record structure • Number of data records • Length of records (if fixed-length) • Record delimiter (if delimited) • Record structure (if records have consistent structure) • Number of fields • Length of each field or delimiter between each field • Format of each field • Key information – if needed • Primary key field • Secondary key field(s), if any • Date/time of most recent access • Date/time of most recent update
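Once a header sits at the start of the file, the direct-access arithmetic has to skip over it. The sketch below shows one possible shape, assuming fixed-length records; the `FileHeader` fields are a hypothetical subset of the items listed above, not a layout prescribed by the slides.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical header for a file of fixed-length records.
struct FileHeader {
    std::uint32_t header_length;   // bytes occupied by the header itself
    std::uint32_t record_count;    // number of data records
    std::uint32_t record_length;   // bytes per record (fixed-length)
};

// RBA of data record rrn (first RRN is 0), measured from the start of the file.
std::uint64_t record_rba(const FileHeader& h, std::uint32_t rrn) {
    assert(rrn < h.record_count);  // the header tells us whether the record exists
    return h.header_length + static_cast<std::uint64_t>(rrn) * h.record_length;
}
```

Because the record length and count come from the file rather than from the program, the same code can open files with different record sizes.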
File Header Records, continued • Header Record Format • Binary or character? • Depends – is it important for people to read it? • Here’s a place where HTML-style format might work • Lets files of different formats have different headers (in some ways) • Only invokes that parse overhead once per file
What’s the Difference? • File Organization • Format of the file itself • Fixed-length, specified-length, or delimited records • ASCII or binary character encoding • File Access Method • Way(s) software can get at contents of file • Sequential vs. direct • Indexed sequential
Designing a File • Access Affects Organization • If sequential access is all we need • Pretty much any organization is OK • Subject, of course, to application needs • If we need direct access • Need fixed-length records • Can also use indexed files, but that’s for later on • But Organization Also Affects Access • What if data to be stored in a record is wildly variable? • Fixed-length records would be extremely wasteful • But if we use specified-length records, how to do direct access? • Just about have to use indexing then
Metadata • Data About Data • Usually in the form of a file header • Example in text • Astronomy image storage format • HTML format (name = value) • But look on page 177: coding style makes a BIG difference • Parsing this kind of data • Read field name; read field value • Convert ASCII value to type required for storage & use • Store converted value into right variable • Why use this type of header?
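The parsing steps above (read name, read value, convert, store) can be sketched as below. This assumes one "name = value" pair per line; the field names used in the usage note and the helper names `trim`, `parse_header`, and `int_field` are hypothetical, not taken from the textbook example.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>   // std::istringstream, handy for trying this without a file
#include <string>

// Strip leading and trailing spaces, tabs, and carriage returns.
std::string trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Read "name = value" lines into a map from field name to ASCII value.
std::map<std::string, std::string> parse_header(std::istream& in) {
    std::map<std::string, std::string> fields;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;   // skip lines with no '='
        fields[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return fields;
}

// Convert an ASCII field value to the int needed for storage and use.
int int_field(const std::map<std::string, std::string>& fields,
              const std::string& name) {
    const auto it = fields.find(name);
    assert(it != fields.end());
    return std::stoi(it->second);
}
```

Given a header containing the lines `width = 500` and `height = 400`, `int_field(fields, "width")` yields 500. The parse cost is paid once per file, not once per record.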
More Metadata • PC Graphics Storage Formats • Data • Color values for each pixel in image • Data compression often used (GIF, JPG) • Different color “depth” possibilities • Metadata • Height, width • Number of bits per pixel (color depth) • If not true color (24 bits / pixel) • Color look-up table • Normally 256 entries • Indexed by values stored for each pixel (normally 1 byte) • Contains R/G/B values for color combination • Formatted to be loaded directly into PC graphics RAM
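The look-up step for a non-true-color image can be sketched as follows, assuming one byte per pixel and a table of up to 256 entries; the `Rgb` struct and `pixel_color` function are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Rgb {
    std::uint8_t r, g, b;   // red, green, blue components
};

// The stored pixel value is an index into the color look-up table.
Rgb pixel_color(const std::vector<Rgb>& table, std::uint8_t pixel) {
    assert(pixel < table.size());
    return table[pixel];
}
```

The image data then costs 1 byte per pixel instead of 3, at the price of storing the table as metadata.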
Mixing Data Objects in a File • Objective • Store different types of data in the same file • Textbook example – mix of astronomy data • “File” header (HTML-style) • “File” of notes – lines of ASCII text • “File” of image data – in whatever format • So our data file becomes a file of files • Each individual “file” (header, notes, or image) looks like a record in this new “mega-file” • These “mega-records” are of varying length • How do we store the “records” in the “mega-records”? • Could use another level of specified-length record software • Or, …
Our “Mega-File” • Organization • Mega-file layout: Mega-file Header, Notes Sub-file, Image Sub-file, Notes Sub-file, Image Sub-file, … • Notes sub-file layout: Notes Header, Text line, Text line, …, Text line, Terminator line • Image sub-file layout: Image Header, Image Data
More on Our Mega-File • Access • Can we just read it sequentially? • Why or why not? • What if we wanted to skip a notes sub-file? • What if some image didn’t even have a notes sub-file? • Can we access it directly? • What would the header have to include to allow that? • An index of the “records” in the file • We call the entries in that index “tags” • Each tag in the tag list has: • Type of sub-file referred to • Special-case type: end of file • RBA of sub-file in mega-file • Length of sub-file (not necessary, but helpful) • Key information, if any, for sub-file
More on Our Mega-File • Access, continued • So how do we access the mega-file now? • Read and process the header • Get whole-file information • Build in-memory tag table for sub-files • Sequential access • Same as before • May be able to program in some speed-ups from tag table • Direct access • Locate sub-file in tag table • Go right to it
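An in-memory tag table and a direct-access lookup might look like the sketch below. The type codes, field widths, and the `find_sub_file` function are assumptions made for illustration; the slides only say what each tag should hold.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class SubFileType { Notes, Image, EndOfFile };   // hypothetical type codes

struct Tag {
    SubFileType type;       // type of sub-file referred to
    std::uint64_t rba;      // where the sub-file starts in the mega-file
    std::uint64_t length;   // length in bytes (helpful, not required)
};

// Find the n-th (0-based) sub-file of the given type.
// Returns nullptr if the mega-file has no such sub-file.
const Tag* find_sub_file(const std::vector<Tag>& tags,
                         SubFileType type, std::size_t n) {
    assert(type != SubFileType::EndOfFile);
    for (const Tag& t : tags) {
        if (t.type == SubFileType::EndOfFile) break;   // special-case tag ends the list
        if (t.type != type) continue;
        if (n == 0) return &t;
        --n;
    }
    return nullptr;
}
```

With the tag in hand, direct access is a seek to `rba`; a sequential reader can also use `rba` and `length` to skip a sub-file it does not care about.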
Extensibility • Look at Our “Mega-File” Format Again • Header tells us things about the sub-files: • What kinds of files they are • Where to find them • Files themselves • To the mega-file processor, just random bytes • To the sub-file processor, meaningful information • What if we need a new type of sub-file? • Define a new type of header entry • Extend header processor to understand that entry • Write (or borrow or buy) code to handle new sub-file • Cardinal Rule: • Everything changes – file types, data types, ...
Factors Affecting Portability - 1 • Operating System Differences • Example – text lines • End with line-feed character • End with carriage-return and line-feed • Prefixed by a count of characters in the line • Natural Language Differences • Example – character coding • Single-byte coding – ASCII, EBCDIC • Multi-byte coding – Unicode (e.g., two-byte units in UTF-16) • Programming Language Differences • Pascal can’t directly process varying-length records • Different C++ compilers use different byte lengths for the standard data types
Factors Affecting Portability - 2 • Computer Architecture Differences • Byte order in 16-bit and 32-bit integer values • Big-endian – leftmost byte is most significant • Little-endian – rightmost byte is most significant • Example – bytes 0x15 0x32, in that order • Big-endian interpretation: 0x1532 • Little-endian interpretation: 0x3215 • Storage of data in memory • Some architectures require values that are N bytes long to start at a byte whose address is divisible by N
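The two interpretations of the bytes 0x15 0x32 can be reproduced with shifts, which behave the same on any host architecture. A minimal sketch; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Interpret two bytes as a 16-bit value, first byte most significant.
std::uint16_t from_big_endian(const unsigned char* b) {
    assert(b != nullptr);
    return static_cast<std::uint16_t>((b[0] << 8) | b[1]);
}

// Interpret two bytes as a 16-bit value, first byte least significant.
std::uint16_t from_little_endian(const unsigned char* b) {
    assert(b != nullptr);
    return static_cast<std::uint16_t>((b[1] << 8) | b[0]);
}
```

For the byte pair { 0x15, 0x32 }, `from_big_endian` gives 0x1532 and `from_little_endian` gives 0x3215.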
How to Port Files • Define Your Format C*A*R*E*F*U*L*L*Y • Once a format is defined, never change it • If you need a new format, add it so as not to invalidate the existing formats • If you need to change a format, add a new one instead, and let programs that need the new version use it • Decide on a standard format for data elements • Text lines • ASCII , EBCDIC, or Unicode? • Which character(s) to end lines? • Binary • Tightly packed or multiple-of-N addressing? • Which “endian”? • You can always write code to convert to & from the standard format on a new language, computer, etc.
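Conversion to and from a standard binary format can be sketched as below. Big-endian, tightly packed 32-bit integers are assumed here as the chosen file standard purely for illustration; the names `put_u32` and `get_u32` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Write v into out[0..3] in the file's standard byte order
// (assumed here: big-endian), whatever the host's own order is.
void put_u32(std::uint32_t v, unsigned char* out) {
    assert(out != nullptr);
    out[0] = static_cast<unsigned char>(v >> 24);
    out[1] = static_cast<unsigned char>(v >> 16);
    out[2] = static_cast<unsigned char>(v >> 8);
    out[3] = static_cast<unsigned char>(v);
}

// Read a value back from the file's standard byte order.
std::uint32_t get_u32(const unsigned char* in) {
    assert(in != nullptr);
    return (static_cast<std::uint32_t>(in[0]) << 24) |
           (static_cast<std::uint32_t>(in[1]) << 16) |
           (static_cast<std::uint32_t>(in[2]) << 8)  |
            static_cast<std::uint32_t>(in[3]);
}
```

Writing byte by byte, rather than dumping an in-memory struct, also sidesteps the tightly-packed vs. multiple-of-N alignment question.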
The Conversion Problem • Few Environments – can do directly • IBM ↔ VAX • Many Env’ts. – need intermediate form • IBM, VAX, IA-32, IA-64, ... each convert to and from XML (or some other standard format)