LEARNING OBJECTIVES • Data compression. • Reclaiming space in files. • Compaction. • Searching. • Sorting, Keysorting. • Indexing. CPSC 231 Organizing Files for Performance (D.H.)
Data Compression • Data compression is a technique for encoding the information in a file in such a way that it takes up less space. • Why perform data compression? • Using less storage results in cost savings. • Using less storage reduces the time needed to access the data.
Data Compression Techniques • Using a Different Notation • Suppressing Repeating Sequences • Assigning Variable-length codes • Irreversible Compression Techniques
Redundancy Reduction • Example: struct Person { char firstName[15]; char lastName[15]; char address[24]; char city[15]; char state[3]; }; In this example the field state takes up only three bytes per record. But since there are only 50 states, there is no need to use 24 bits (= 3 * 8); 6 bits are sufficient. WHY?
Redundancy Reduction (cont.) • Thus we can use one byte (instead of 3 or 2) to encode the state name and save 2/3 or 1/2 of the space for this field. • Fixed-length fields that take only a small set of values are often good candidates for this technique.
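The one-byte encoding described above can be sketched as a table lookup. This is only an illustration under stated assumptions: the table `STATE_TABLE` is hypothetical and lists just a few abbreviations (a real one would list all 50), and the function names `encode_state` and `decode_state` are not from the slides.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, abbreviated lookup table; a full table would hold all 50
   two-letter abbreviations, which still fit in one byte (50 <= 256). */
static const char *const STATE_TABLE[] = { "AL", "AK", "AZ", "CA", "NY", "TX" };
enum { STATE_COUNT = sizeof STATE_TABLE / sizeof STATE_TABLE[0] };

/* Returns the one-byte code for a two-letter abbreviation, or -1 if unknown. */
int encode_state(const char *abbrev)
{
    for (int i = 0; i < STATE_COUNT; i++) {
        if (strcmp(STATE_TABLE[i], abbrev) == 0)
            return i;
    }
    return -1;
}

/* Returns the abbreviation for a code produced by encode_state. */
const char *decode_state(unsigned char code)
{
    assert(code < STATE_COUNT);
    return STATE_TABLE[code];
}
```

With this table, `encode_state("CA")` yields code 3 and `decode_state(3)` yields "CA" again. Every program that reads the file needs the same table, which is the extra software complexity the technique brings.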
Pros and Cons of Redundancy Reduction • Cons: • The encoding is binary and thus unreadable by humans. • Encoding/decoding modules are required when processing the data, which adds complexity to the processing software. • Pros: • It can prove very beneficial (i.e., it can save a lot of space) for a particular application: • if the files are large • if the files are mostly processed by just one application
Suppressing Repeating Sequences • Example: • a black-and-white image of the sky • most of the sky is black • if the picture is represented as an array of pixels, then the black parts are represented by 0s (no color, or no brightness) • instead of repeating 0 many times, we can use an encoding that keeps track of the number of 0s • this picture is a sparse array (an array in which most entries are 0)
Suppressing Repeating Sequences (cont.) • Run-length encoding is a compression method in which a run of repeated codes is replaced by a run-length indicator, followed by the code that is repeated and a count of the number of repetitions. • Example: • 13 14 00 00 00 00 00 00 00 00 16 17 • can be encoded as: • 13 14 ff 00 08 16 17 • where ff is the run-length encoding indicator, 00 is the (pixel) value repeated, and 08 is the number of repetitions.
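A minimal sketch of such an encoder is shown below. It makes assumptions the slides do not state: runs of three or fewer values are copied verbatim (encoding them would not save space), a literal ff byte is always written in the three-byte form so that it cannot be mistaken for the indicator, and a run is capped at 255 so the count fits in one byte. The name `rle_encode` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define RLE_MARK 0xFF   /* run-length indicator, as in the example above */

/* Encodes in[0..n) into out and returns the number of bytes written.
   Assumption: out has room for 3*n bytes (the worst case). */
size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t w = 0;
    size_t i = 0;
    assert(in != NULL && out != NULL);
    while (i < n) {
        unsigned char value = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == value && run < 255)
            run++;
        if (run > 3 || value == RLE_MARK) {
            /* indicator, repeated value, repetition count */
            out[w++] = RLE_MARK;
            out[w++] = value;
            out[w++] = (unsigned char)run;
        } else {
            /* short runs are cheaper to copy verbatim */
            for (size_t k = 0; k < run; k++)
                out[w++] = value;
        }
        i += run;
    }
    return w;
}
```

Applied to the twelve bytes of the example, the function writes the seven bytes 13 14 ff 00 08 16 17. An input with no long runs but many ff bytes would grow instead of shrink, which is why the method does not guarantee space savings.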
Pros and Cons of Run-Length Encoding • Cons • it does not guarantee space savings • Pros • simple • in some applications (such as image processing) the space savings can be substantial
Variable-Length Encoding • Variable-length encoding is a scheme in which the codes are of different lengths. More frequently occurring values are given shorter codes, and less frequently occurring values are given longer codes. • Examples: • Morse code (the letters e and t are the most frequent in English, so they are assigned a single dot (.) and a single dash (-)) • Huffman encoding: a variable-length encoding in which the lengths of the codes are based on the probability of their occurrence (built with a binary tree structure)
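To make the Huffman idea concrete, the sketch below computes only the code lengths, not the bit patterns, by repeatedly merging the two least frequent nodes of the tree. The function name `huffman_lengths`, the `MAX_SYMBOLS` limit, and the simple quadratic search for the two lightest nodes are choices made for this illustration.

```c
#include <assert.h>

#define MAX_SYMBOLS 64

/* Computes Huffman code lengths (in bits) for n symbols with the given
   positive frequencies. Only lengths are produced, not the bit patterns. */
void huffman_lengths(const unsigned freq[], int n, int length[])
{
    unsigned weight[2 * MAX_SYMBOLS];
    int parent[2 * MAX_SYMBOLS];
    int used[2 * MAX_SYMBOLS];
    int nodes = n;

    assert(n >= 1 && n <= MAX_SYMBOLS);
    for (int i = 0; i < n; i++) {
        weight[i] = freq[i];
        parent[i] = -1;
        used[i] = 0;
    }
    /* Repeatedly merge the two lightest unmerged nodes into a new parent. */
    while (nodes < 2 * n - 1) {
        int a = -1, b = -1;   /* a = lightest, b = second lightest */
        for (int i = 0; i < nodes; i++) {
            if (used[i]) continue;
            if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
            else if (b < 0 || weight[i] < weight[b]) { b = i; }
        }
        weight[nodes] = weight[a] + weight[b];
        parent[nodes] = -1;
        used[nodes] = 0;
        parent[a] = parent[b] = nodes;
        used[a] = used[b] = 1;
        nodes++;
    }
    /* A symbol's code length is its depth in the tree. */
    for (int i = 0; i < n; i++) {
        int depth = 0;
        for (int p = parent[i]; p >= 0; p = parent[p])
            depth++;
        length[i] = depth;
    }
}
```

For six symbols with the made-up frequencies 45, 13, 12, 16, 9 and 5 (per 100 occurrences), the lengths come out as 1, 3, 3, 3, 4 and 4 bits. That is 224 bits per 100 symbols, compared with 300 bits for a fixed 3-bit code.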
Irreversible Compression • Irreversible compression techniques are based on losing (sacrificing) some information. • Example: a 400-by-400 pixel image is compressed to a 100-by-100 pixel image. • The original information cannot be restored once the data have been compressed using an irreversible compression technique.
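One way to shrink an image like this is to replace each block of pixels with its average brightness. The slides do not say which method is meant, so the block averaging below, the function name `downsample`, and the 8-bit grayscale row-major layout are all assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Shrinks a w-by-h grayscale image by `factor` in each dimension,
   replacing each factor-by-factor block with its average brightness.
   src holds w*h bytes in row-major order; dst receives
   (w/factor)*(h/factor) bytes.
   Assumption: w and h are multiples of factor. */
void downsample(const unsigned char *src, size_t w, size_t h,
                size_t factor, unsigned char *dst)
{
    assert(factor > 0 && w % factor == 0 && h % factor == 0);
    size_t out_w = w / factor, out_h = h / factor;
    for (size_t oy = 0; oy < out_h; oy++) {
        for (size_t ox = 0; ox < out_w; ox++) {
            unsigned long sum = 0;
            for (size_t dy = 0; dy < factor; dy++)
                for (size_t dx = 0; dx < factor; dx++)
                    sum += src[(oy * factor + dy) * w + (ox * factor + dx)];
            dst[oy * out_w + ox] = (unsigned char)(sum / (factor * factor));
        }
    }
}
```

Going from 400-by-400 to 100-by-100 corresponds to `factor` = 4: 160,000 pixel values become 10,000, and the 16 values that fed each average cannot be recovered from it.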
Reclaiming Space in Files • Problem • Once a variable-length record is deleted from a file, the space it occupied is left behind but cannot easily be reused. WHY? • File modification can take one of the following forms: • Record addition • Record updating • Record deletion
Record Deletion and Storage Compaction • This method consists of two steps: • marking the record for deletion • compacting the file later on, once a number of deleted records have accumulated • Example: (a file of records storing colors) • Original file: blue|magenta|red|green|yellow • After deleting the magenta record: blue|*agenta|red|green|yellow • After deleting the green record: blue|*agenta|red|*reen|yellow • After compaction: blue|red|yellow
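The compaction step of this example can be sketched as a single pass that copies every record not marked with "*". The sketch works on an in-memory string rather than a real file, and the function name `compact` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the '|'-delimited records of `in` to `out`, skipping every record
   whose first character is '*'. out must be at least as large as in
   (including the terminating '\0'). Returns the length of the result. */
size_t compact(const char *in, char *out)
{
    size_t w = 0;
    const char *p = in;
    assert(in != NULL && out != NULL);
    while (*p != '\0') {
        /* find the end of the current record */
        const char *end = p;
        while (*end != '\0' && *end != '|')
            end++;
        if (*p != '*') {
            if (w > 0)
                out[w++] = '|';   /* separator between kept records */
            while (p < end)
                out[w++] = *p++;
        }
        p = (*end == '|') ? end + 1 : end;
    }
    out[w] = '\0';
    return w;
}
```

Called on "blue|*agenta|red|*reen|yellow", it produces "blue|red|yellow". On disk the same pass would read the old file sequentially and write the surviving records to a new one.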
External Fragmentation and Compaction • External fragmentation is wasted space in a file that occurs outside of or in between records. (See the previous example.) • Compaction is a method of eliminating external fragmentation by sliding all the records together so that there is no space between them. (See the previous example.)
Deleting Fixed-Length Records • This method consists of the following steps: • marking the record for deletion • placing the deleted record on a list of available records: • the list is a linked list (e.g., implemented as a stack or a queue) • its links can be RRNs (relative record numbers), since the records are of fixed size • Example: • Head -> RRN=3 -> RRN=5 -> RRN=-1 (end of list)
Deleting Fixed-Length Records (cont.) • A list of available records is called an avail list. • You can dedicate the first field of a deleted record to indicate that the record is deleted by placing a special character there (e.g., "*"), and you can use another field to keep a pointer to (or the RRN of) the next available record.
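The avail list can be sketched as a stack threaded through the deleted slots themselves. In the sketch below the file is simulated by an in-memory array, and the type `RecordFile`, the constants, and the layout of a deleted slot ("*" followed by the binary RRN of the next free slot) are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define RECORD_SIZE 16
#define MAX_RECORDS 8
#define NO_RRN (-1)          /* end-of-list marker */

/* Hypothetical in-memory stand-in for a file of fixed-length records. */
struct RecordFile {
    char records[MAX_RECORDS][RECORD_SIZE];
    int count;               /* slots in use, including deleted ones */
    int avail_head;          /* RRN of the first free slot, or NO_RRN */
};

/* Marks slot rrn as deleted and pushes it on the avail list (a stack).
   Layout of a deleted slot: '*' followed by the RRN of the next free slot. */
void delete_record(struct RecordFile *f, int rrn)
{
    assert(rrn >= 0 && rrn < f->count);
    f->records[rrn][0] = '*';
    memcpy(&f->records[rrn][1], &f->avail_head, sizeof f->avail_head);
    f->avail_head = rrn;
}

/* Stores data in a reusable slot if one exists, else appends at the end.
   Returns the RRN used, or NO_RRN if the file is full. */
int add_record(struct RecordFile *f, const char *data)
{
    int rrn;
    if (f->avail_head != NO_RRN) {
        rrn = f->avail_head;                       /* pop the stack */
        memcpy(&f->avail_head, &f->records[rrn][1], sizeof f->avail_head);
    } else if (f->count < MAX_RECORDS) {
        rrn = f->count++;
    } else {
        return NO_RRN;
    }
    strncpy(f->records[rrn], data, RECORD_SIZE - 1);
    f->records[rrn][RECORD_SIZE - 1] = '\0';
    return rrn;
}
```

Deleting record 5 and then record 3 leaves the list as Head -> 3 -> 5 -> -1; the next two additions reuse slots 3 and 5, and only after that does the file grow again.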
Internal Fragmentation • Internal fragmentation is wasted (unused) space inside of records or sectors. • Fixed-length record structures often result in internal fragmentation.
Deleting Variable-Length Records • This method consists of the following steps: • marking the record for deletion (e.g., with "*") • placing the deleted record on the list of available records by: • recording the size of the record in the avail list • using the byte offset in the file to locate the record (not the RRN). WHY? • Example: • Head -> (offset, size) -> (offset, size) -> (-1, -1)
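A variable-length avail list entry therefore carries two numbers. The sketch below keeps the list in memory as an ordinary linked list; storing the links inside the deleted records on disk, as is done for fixed-length records, is equally possible. The names `Hole`, `avail_push` and `avail_free` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* One entry of the avail list: where the hole starts and how big it is. */
struct Hole {
    long offset;             /* byte offset of the deleted record */
    long size;               /* number of bytes it occupied */
    struct Hole *next;
};

/* Pushes a newly deleted record onto the front of the avail list and
   returns the new head (or the old head if allocation fails). */
struct Hole *avail_push(struct Hole *head, long offset, long size)
{
    struct Hole *h = malloc(sizeof *h);
    assert(offset >= 0 && size > 0);
    if (h == NULL)
        return head;
    h->offset = offset;
    h->size = size;
    h->next = head;
    return h;
}

/* Frees every entry of the list. */
void avail_free(struct Hole *head)
{
    while (head != NULL) {
        struct Hole *next = head->next;
        free(head);
        head = next;
    }
}
```

An RRN cannot serve as the link here because multiplying it by a record size no longer gives a position in the file: the records differ in length, so only a byte offset identifies where a hole starts.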
Storage Fragmentation • As stated before, a fixed-size record structure causes internal fragmentation. • A variable-size record structure does not cause internal fragmentation, but it causes external fragmentation.
Eliminating and Reducing External Fragmentation • Compaction (explained earlier) • Coalescing the holes = combining adjacent free slots (the space of deleted records) into a single larger slot • Using a suitable placement strategy: • first fit (use the first available slot that is big enough); acceptable when the leftover space stays inside the record as internal fragmentation • best fit (use the smallest available slot that is big enough); also acceptable when dealing with internal fragmentation • worst fit (use the biggest available slot, and put the rest of this slot back on the avail list)
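The three placement strategies differ only in which slot they select. The sketch below assumes the avail list is held as a plain array of hypothetical `Slot` entries; each function returns the index of the chosen slot.

```c
#include <assert.h>
#include <stddef.h>

struct Slot { long offset; long size; };   /* one free region in the file */

/* Each function returns the index of the chosen slot, or -1 if none fits. */

int first_fit(const struct Slot *s, size_t n, long need)
{
    assert(need > 0);
    for (size_t i = 0; i < n; i++)
        if (s[i].size >= need)
            return (int)i;
    return -1;
}

int best_fit(const struct Slot *s, size_t n, long need)
{
    int best = -1;
    assert(need > 0);
    for (size_t i = 0; i < n; i++)
        if (s[i].size >= need && (best < 0 || s[i].size < s[best].size))
            best = (int)i;
    return best;
}

int worst_fit(const struct Slot *s, size_t n, long need)
{
    int worst = -1;
    assert(need > 0);
    for (size_t i = 0; i < n; i++)
        if (s[i].size >= need && (worst < 0 || s[i].size > s[worst].size))
            worst = (int)i;
    return worst;
}
```

With free slots of 50, 20, 72 and 30 bytes and a 25-byte record to place, first fit picks the 50-byte slot, best fit the 30-byte slot, and worst fit the 72-byte slot, whose 47 leftover bytes go back on the avail list and are large enough to hold another record later.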
Finding Records Quickly in Files Using Keys • Sequential search = reading the records in the file in serial order until the sought record is found. • slow, but good enough for small files • requires reading n/2 records on average before the sought record is found (for n = 2000, this is 1000).
Binary Search • Binary search = locating the sought record in a sorted list of records by repeatedly selecting the middle element of the list and dividing the list in half until the sought record is found. • much faster than sequential search • requires reading at most 1 + ⌊log2 n⌋ records before the sought record is found (for n = 2000, this is 11). • COMPARE THIS WITH SEQUENTIAL SEARCH!
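The difference between the two searches can be checked by counting the records each one examines. The sketch below searches an in-memory array of integer keys instead of a file, and the `reads` counters are added purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential search: returns the index of key or -1; *reads counts the
   records examined. */
int sequential_search(const int *keys, size_t n, int key, size_t *reads)
{
    assert(reads != NULL);
    *reads = 0;
    for (size_t i = 0; i < n; i++) {
        (*reads)++;
        if (keys[i] == key)
            return (int)i;
    }
    return -1;
}

/* Binary search over keys sorted in ascending order. */
int binary_search(const int *keys, size_t n, int key, size_t *reads)
{
    size_t low = 0, high = n;      /* search the half-open range [low, high) */
    assert(reads != NULL);
    *reads = 0;
    while (low < high) {
        size_t mid = low + (high - low) / 2;
        (*reads)++;
        if (keys[mid] == key)
            return (int)mid;
        if (keys[mid] < key)
            low = mid + 1;
        else
            high = mid;
    }
    return -1;
}
```

For 2000 sorted keys, `binary_search` never examines more than 11 records, while `sequential_search` examines 2000 to reach the last key and about 1000 on average.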
Cons of Binary Search • It may require a lot of seek time because the records are NOT read sequentially. • The file has to be sorted, and keeping it sorted might prove expensive, especially if a lot of new records are being added. • An in-memory sort can be performed only on relatively small files.
Keysort • Keysort = a method of sorting a file in which only the keys and pointers to the records are held in main memory, NOT the entire file. The sorted list of keys is then used to sort the file on the disk by rewriting it to a new file. • Keysort's main disadvantage is that rewriting the entire file in sorted order requires a seek for almost every record, which can be much slower than reading the file sequentially.
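The in-memory part of keysort can be sketched with the standard `qsort` function. The type `KeyNode`, the fixed key length, and the use of RRNs as record pointers are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 16

/* What keysort keeps in memory for each record: its key and its RRN. */
struct KeyNode {
    char key[KEY_LEN];
    int rrn;
};

static int compare_keys(const void *a, const void *b)
{
    const struct KeyNode *x = a;
    const struct KeyNode *y = b;
    return strcmp(x->key, y->key);
}

/* Sorts the in-memory key list; the data records themselves are not moved.
   Reading the file in nodes[0].rrn, nodes[1].rrn, ... order afterwards
   yields the records in key order. */
void keysort(struct KeyNode *nodes, size_t n)
{
    assert(nodes != NULL || n == 0);
    qsort(nodes, n, sizeof nodes[0], compare_keys);
}
```

For the color file blue, magenta, red, green, yellow (RRNs 0 to 4), the sorted key list refers to RRNs 0, 3, 1, 2, 4. Writing the new file means fetching the records in that order, which is where the extra seeks come from.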
Pinned Records • A record is pinned when there are other records pointing to its physical location. • Another disadvantage of keysort is that it might move pinned records, resulting in "dangling pointers", i.e., pointers that point to nonexistent records or to the wrong records.
Indexing • Instead of rearranging the entire file, it is sufficient to write the sorted list of keys, with pointers to the records, to secondary storage. This list of sorted keys with pointers to the records in the data file is called an index. • Indexing solves most of the problems associated with binary searching and keysorting.
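A lookup through such an index can be sketched with the standard `bsearch` function: the index is binary-searched in memory, and only the matching record has to be fetched from the data file. The type `IndexEntry` and the function `index_lookup` are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One index entry: a key and the byte offset of its record in the data file. */
struct IndexEntry {
    const char *key;
    long offset;
};

static int compare_entry(const void *key, const void *entry)
{
    return strcmp((const char *)key, ((const struct IndexEntry *)entry)->key);
}

/* Looks key up in an index sorted by key. Returns the record's byte offset
   in the data file, or -1 if the key is absent. */
long index_lookup(const struct IndexEntry *index, size_t n, const char *key)
{
    assert(key != NULL);
    const struct IndexEntry *hit =
        bsearch(key, index, n, sizeof index[0], compare_entry);
    return hit != NULL ? hit->offset : -1;
}
```

For the data file blue|magenta|red|green|yellow, the sorted index holds (blue, 0), (green, 17), (magenta, 5), (red, 13), (yellow, 23). The data file itself stays in its original order, so pinned records are never moved.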