Managing Files of Records CS 3050, Spring 2007 4/4/2007 Dr Melanie Martin
Assume: • We have a file • The file is made up of records • The records are made up of fields • We want to access a specific record
Identifying the Record • RRN (relative record number) • Saw previously • Fixed-length records: access directly • Byte offset = RRN * record size in bytes • Variable-length records: use an index • The index itself holds fixed-length records • At RRN j, the index entry gives the byte offset of record j in the data file • Adds an extra look-up
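Not on the original slide, just a minimal sketch in C of the byte-offset computation for fixed-length records; the data file name records.dat, the 64-byte REC_SIZE, and the function name read_by_rrn are all assumptions made for illustration.

#include <stdio.h>
#include <stdlib.h>

#define REC_SIZE 64   /* assumed fixed record length in bytes */

/* Read the record at relative record number rrn into buf.
   Byte offset = RRN * record size, as on the slide.
   Returns 0 on success, -1 on failure. */
int read_by_rrn(FILE *fp, long rrn, char buf[REC_SIZE])
{
    long offset = rrn * (long)REC_SIZE;      /* byte offset of the record */
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    if (fread(buf, 1, REC_SIZE, fp) != REC_SIZE)
        return -1;                           /* short read: past end of file */
    return 0;
}

int main(void)
{
    FILE *fp = fopen("records.dat", "rb");   /* hypothetical data file */
    char buf[REC_SIZE];

    if (fp == NULL)
        return EXIT_FAILURE;
    if (read_by_rrn(fp, 5, buf) == 0)        /* fetch the record with RRN 5 */
        printf("first bytes: %.16s\n", buf);
    fclose(fp);
    return EXIT_SUCCESS;
}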
Identifying the Record • Key • Field or set of fields • Canonical • Rule for exact format • All caps • Remove or add ‘-’ in SSN or phone # • Distinct (unique) • Required for primary key • ISBN, SSN, Phone #
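As a small illustration of putting a key into canonical form (not from the slide itself): a sketch that applies the two rules the slide mentions, uppercasing letters and stripping '-' characters. The function name canonical_key and the buffer handling are assumptions.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Put a key into canonical form: uppercase all letters and drop '-'
   characters (e.g. for an SSN or phone number).
   Writes at most outsize-1 characters plus a terminating NUL. */
void canonical_key(const char *raw, char *out, size_t outsize)
{
    size_t j = 0;
    for (size_t i = 0; raw[i] != '\0' && j + 1 < outsize; i++) {
        if (raw[i] == '-')
            continue;                      /* strip dashes */
        out[j++] = (char)toupper((unsigned char)raw[i]);
    }
    out[j] = '\0';
}

int main(void)
{
    char key[32];
    canonical_key("555-12-3456", key, sizeof key);
    printf("%s\n", key);                   /* prints 555123456 */
    return 0;
}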
Identifying the Record • Keys come in two main flavors • Primary • Uniquely identifies a single record • Ex: your specific bank account • Secondary • Identifies a group of records • Ex: all bank customers in Turlock • Ex: all bank customers overdrawn
Finding the Record • Two extremes • Direct access • Sequential search • Lots of algorithms in between, but we’ll start with the extremes
Measuring Algorithm Performance • In general we'll count reads (seeks) • "Big O" • Asymptotic upper bound - worst case • g(n) = O(f(n)) means c*f(n) is an upper bound on g(n): there exist constants c > 0 and n0 such that to the right of n0 (i.e., for all n >= n0) g(n) stays below c*f(n) • Draw Picture
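The same definition written out in symbols (a standard statement, added here for reference):

g(n) = O(f(n)) \iff \exists\, c > 0,\ n_0 \ \text{such that}\ g(n) \le c \cdot f(n) \ \text{for all}\ n \ge n_0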
Direct Access • Just go get the record we want • O(1) • No matter how large the file is, we can get the record in one seek • See the previous discussion of using RRN for fixed-length records, or index + RRN for variable-length records
Sequential Access • Go through the records in the file sequentially until we find the one we're looking for • Match on RRN or key • Read one record at a time from disk • O(n), where n is the number of records in the file • I.e., time is proportional to the number of records in the file (average and worst case) • BUT what if we use blocks and read 100 records at a time? • STILL proportional to the number of records in the file (the constant factor shrinks, the growth rate does not)
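A sketch of the O(n) scan (not the course's own code): it reuses the 64-byte fixed-length records assumed above and further assumes the key occupies the first 9 bytes of each record; sequential_search and KEY_LEN are invented names.

#include <stdio.h>
#include <string.h>

#define REC_SIZE 64   /* assumed fixed record length */
#define KEY_LEN  9    /* assumed key: first 9 bytes of each record */

/* Scan the file from the start, one record at a time, until a record
   whose leading KEY_LEN bytes match `key` is found.
   Returns the RRN of the match, or -1 if no record matches. */
long sequential_search(FILE *fp, const char *key)
{
    char buf[REC_SIZE];
    long rrn = 0;

    rewind(fp);
    while (fread(buf, 1, REC_SIZE, fp) == REC_SIZE) {
        if (memcmp(buf, key, KEY_LEN) == 0)
            return rrn;
        rrn++;
    }
    return -1;   /* not found */
}

Called with an open FILE* and a canonicalized key, e.g. sequential_search(fp, "555123456"). Whether the loop pulls one record or a 100-record block per read, the work still grows with n, which is why the slide says the cost is STILL proportional to the number of records.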
Why would we ever do this? • Sequential search can be good when • There are few records • We rarely need to search • We're scanning ASCII files for patterns (e.g., grep) • Lots of records will match a secondary key
Pros and Cons • Sequential search • (+) easy to program • (+) only requires simple file structures • (-) takes too long • Soon we will start looking at ways to get around this and get closer to direct access
Some Miscellaneous Topics • Structure and length • Fixed-length fields (think of the inventory example) • Make sure the record size fits evenly into sectors • Ex: 512-byte sectors • 30-byte records -> increase to 32 bytes • Records never span sectors • More challenging with variable-length fields (and records) • Estimate the longest possible field values (wasted space if too big, truncation/data loss if too small) • Averaging effect • The longest name is unlikely to occur with the longest address in a mailing list
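A tiny helper, purely illustrative, for the 30 -> 32 byte example: round a raw record size up to the next size that divides the sector evenly, so a whole number of records fits per sector and no record spans a sector boundary. pad_record_size and its rounding rule are assumptions, not the textbook's method.

#include <stdio.h>

/* Round raw_size up to the smallest size that divides sector_size
   evenly.  E.g. 30-byte records with 512-byte sectors are padded
   to 32 bytes (16 records per sector). */
unsigned pad_record_size(unsigned raw_size, unsigned sector_size)
{
    unsigned size = raw_size;
    while (size <= sector_size && sector_size % size != 0)
        size++;
    return size;
}

int main(void)
{
    printf("%u\n", pad_record_size(30, 512));   /* prints 32 */
    return 0;
}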
Some Miscellaneous Topics • Distinguishing data from unused space • A length indicator at the beginning of the record • A special delimiter at the end of the data • A count of fields
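A sketch of the first option, a length indicator at the beginning of each record; the 2-byte prefix and the function names are assumptions made for this example.

#include <stdio.h>

/* Write one variable-length record: a 2-byte length prefix followed
   by `len` bytes of data.  The prefix tells a reader exactly how much
   of what follows is real data rather than unused space. */
int write_record(FILE *fp, const char *data, unsigned short len)
{
    if (fwrite(&len, sizeof len, 1, fp) != 1) return -1;
    if (fwrite(data, 1, len, fp) != len)      return -1;
    return 0;
}

/* Read the next record into buf (capacity bufsize).
   Returns the record length, or -1 at end of file / on error. */
long read_record(FILE *fp, char *buf, size_t bufsize)
{
    unsigned short len;
    if (fread(&len, sizeof len, 1, fp) != 1) return -1;
    if (len > bufsize)                       return -1;
    if (fread(buf, 1, len, fp) != len)       return -1;
    return (long)len;
}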
Some Miscellaneous Topics • Header records • Commonly used • At beginning of file • Might contain • # records • Length of records • Date and time of last update • Name of file • Need to be able to distinguish it from data
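One possible header layout, written as a fixed-size struct at byte 0 of the file; every field name and size here is invented for illustration (a real header would also need some way for a reader to distinguish it from the data records that follow).

#include <stdio.h>
#include <time.h>

/* A simple fixed-size header record stored at the start of the file.
   Data records then begin at offset sizeof(FileHeader). */
typedef struct {
    long   record_count;    /* # records currently in the file  */
    long   record_length;   /* length of each fixed-size record */
    time_t last_update;     /* date and time of last update     */
    char   file_name[32];   /* name of the file                 */
} FileHeader;

/* Rewrite the header in place at the start of the file. */
int write_header(FILE *fp, const FileHeader *hdr)
{
    if (fseek(fp, 0L, SEEK_SET) != 0)         return -1;
    if (fwrite(hdr, sizeof *hdr, 1, fp) != 1) return -1;
    return 0;
}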
Some Miscellaneous Topics • Metadata • Data that describes the primary data in the file • Ex: Astronomer with image data generated by telescopes • Mostly interested in the image • Need info about image • Where and when taken • Which telescope • Names of related files/images • Etc.