Learn about data compression, storage management, internal sorting, and binary search to optimize file performance. Topics include reclaiming space, record deletion, and dynamic space reclaiming.
Chapter 6 Organizing Files for Performance • Objectives: to get familiar with: • Data compression • Storage management • Internal sorting and binary search
Outline • Data compression • Reclaiming space in files • Record deletion • Dynamic space reclaiming for fixed-length record • Dynamic space reclaiming for variable-length record • Storage fragmentation • Internal sorting and binary search • Keysorting
Dish Party • Bring something delicious for yourself and 2 more people • Can be anything at any budget • Can't? Forgot? Just bring yourself • Come and join us • "A pleasant morsel is enough for a hundred." • "The food of one is enough for two, the food of two is enough for four, and the food of four is enough for eight." (noble hadith)
Improving Performance • Less Space (Ch. 6) • less storage • reclaiming space • defragmentation • Less Time (Ch. 7) • Indexing
Compression • Definition • Reduce size of data (number of bits needed to represent data) • Benefits • Reduce storage needed • Reduce transmission cost / latency / bandwidth
Sources of Compressibility • Redundancy • Recognize repeating patterns • Exploit using • Dictionary • Variable length encoding • Human perception • Less sensitive to some information • Can discard less important data
Types of Compression • Lossless • Preserves all information • Exploits redundancy in data • Applied to general data • Lossy • May lose some information • Exploits redundancy and human perception • Applied to audio, image, video
Effectiveness of Compression • Metrics • Bits per byte (a byte is 8 bits) • 2 bits / byte → ¼ of the original size • 8 bits / byte → no compression • Percentage • 75% compression → ¼ of the original size
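The two metrics above are simple arithmetic on the original and compressed sizes. A minimal sketch, assuming sizes are given in bytes (the function names are illustrative, not from the slides):

```python
def bits_per_byte(original_size: int, compressed_size: int) -> float:
    """Average number of output bits spent on each input byte."""
    return 8 * compressed_size / original_size


def percent_compression(original_size: int, compressed_size: int) -> float:
    """Percentage by which the data shrank."""
    return 100 * (1 - compressed_size / original_size)


# A 1000-byte file compressed to 250 bytes (one quarter of its size).
print(bits_per_byte(1000, 250))        # 2.0
print(percent_compression(1000, 250))  # 75.0
```

Both numbers describe the same result: 2 bits per byte and 75% compression each mean the output is one quarter of the input.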
Effectiveness of Compression • Depends on data • Random data is hard to compress • Example: 1001110100 → ? • Organized data is easy to compress • Example: 1111111111 → 1 × 10 • Corollary • No universally best compression algorithm
Data Compression • Data compression: encoding files so that they take less space. • Uses less storage, • Can be transmitted faster, • Can be processed faster sequentially. 1) Encoding with a different notation • The “State” field in the address file requires two bytes. However, 50 states can be encoded using 6 bits, which fit in a single byte: a 50% space saving for each occurrence of the state field. • The compact notation is a redundancy reduction technique. • Costs: • The file is not readable by humans. • The overhead of encoding and decoding operations.
Example • State: two-letter code, number, 6-bit encoding • New York: NY, 1, 000001 • California: CA, 2, 000010 • Florida: FL, 3, 000011 • … • Louisiana: LA, 50, 110010
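The compact notation can be sketched as a lookup table in each direction. The table below contains only the three complete rows of the example (the rest is elided in the slides), and storing the 6-bit code in one whole byte is an assumption made here to match the 50% saving:

```python
# Partial table: only the rows spelled out in the example above.
STATE_CODES = {"NY": 1, "CA": 2, "FL": 3}
CODE_TO_STATE = {code: abbr for abbr, code in STATE_CODES.items()}


def encode_state(abbr: str) -> bytes:
    """Replace the two-letter field by a one-byte code."""
    return bytes([STATE_CODES[abbr]])


def decode_state(field: bytes) -> str:
    """Recover the two-letter abbreviation from the one-byte code."""
    return CODE_TO_STATE[field[0]]


def six_bit_pattern(code: int) -> str:
    """Show the code as the 6-bit pattern used in the example."""
    return format(code, "06b")


packed = encode_state("CA")
print(len("CA"), "->", len(packed), "byte(s)")  # 2 -> 1 byte(s)
print(six_bit_pattern(STATE_CODES["CA"]))       # 000010
print(decode_state(packed))                     # CA
```

The decoding step is the overhead mentioned under costs: every program that reads the file needs the same table.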
Data Compression (cont’d) 2) Suppressing repeating sequences • Suitable for sparse arrays or images with regions of the same color. • Run-length encoding: choose an unused byte value to indicate that a run-length code follows that byte. • Encoding algorithm: • Read through the data (pixels or values) that make up the image or data content, copying the data values to the file in sequence, except where the same data value occurs more than once in succession.
Data Compression (cont’d) 2) Suppressing repeating sequences • Where the same value occurs more than once in succession, substitute the following three entries: • The special run-length code indicator, • The data value that is repeated, and • The number of times that the value is repeated. • Example, 50 51 52 52 52 52 52 53 54 54 54 54 54 54 54 55 52 52 53 53 53 54 The encoded sequence is: 50 51 ff 52 05 53 ff 54 07 55 ff 52 02 ff 53 03 54
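The run-length algorithm above can be sketched as follows. This is a minimal illustration, assuming the values in the example are hexadecimal bytes, that `0xFF` never occurs in the data, and that a run is at most 255 long so its count fits in one byte. Following the example, a run of only two values is also encoded, even though that costs three bytes instead of two.

```python
RUN_MARKER = 0xFF  # assumed to be a byte value that never occurs in the data


def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        # Count how many times the value repeats (capped so it fits one byte).
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        if run > 1:
            out += bytes([RUN_MARKER, value, run])  # marker, value, count
        else:
            out.append(value)                       # copy single values as-is
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == RUN_MARKER:
            value, run = encoded[i + 1], encoded[i + 2]
            out += bytes([value]) * run
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)


raw = bytes.fromhex(
    "50 51 52 52 52 52 52 53 54 54 54 54 54 54 54 55 52 52 53 53 53 54"
)
encoded = rle_encode(raw)
print(" ".join(f"{b:02x}" for b in encoded))
# 50 51 ff 52 05 53 ff 54 07 55 ff 52 02 ff 53 03 54
```

The 22 input bytes shrink to 17, and `rle_decode(encoded)` returns the original sequence.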
Data Compression (cont’d) 3) Variable length encoding • Letters with high frequency are encoded using shorter symbols. • Letters with low frequency are encoded using longer symbols. • Huffman code (for a set of seven letters): • up to four bits per letter (a fixed-length code needs a minimum of 3 bits). • The string “abefd” is encoded as “1010000100100000”. • Huffman codes are used in some UNIX systems for data compression.
Huffman Code (Was explained in lecture. Read about it.) • Approach • Variable length encoding of symbols • Exploit statistical frequency of symbols • Efficient when symbol probabilities vary widely • Principle • Use fewer bits to represent frequent symbols • Use more bits to represent infrequent symbols • Example input: A A B A A A B A
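The principle can be illustrated with a small Huffman code builder. This is a generic sketch of the standard greedy construction (repeatedly merge the two least frequent subtrees), not the specific code table used in the lecture; the exact bit patterns depend on how ties are broken.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict:
    """Return a {symbol: bit string} table built from symbol frequencies."""
    freq = Counter(text)
    # Heap entries: (weight, unique id, {symbol: code so far}).
    # The unique id breaks ties so the dicts are never compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single distinct symbol still needs one bit
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # least frequent subtree
        w2, _, right = heapq.heappop(heap)  # second least frequent subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text: str, codes: dict) -> str:
    return "".join(codes[ch] for ch in text)


text = "AABAAABA"
codes = huffman_codes(text)
bits = encode(text, codes)
print(codes, len(bits), "bits instead of", 8 * len(text))
```

With only two distinct symbols each gets a one-bit code, so the eight letters take 8 bits rather than 64. With more symbols and skewed frequencies, the frequent symbols end up with the shorter codes.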
Data Compression (cont’d) 4) Irreversible compression techniques • Voice coding • Some image coding schemes that change pixel granularity or reduce color quality
Reclaiming Space in Files • File organization with the following operations: • record insertion • record deletion • record modification • Space reclaiming is needed when • deleting fixed-length and variable-length records • modifying variable-length records • can be treated as a deletion followed by an insertion
Record Deletion • Identifying deleted records • Place a special mark in each deleted record. E.g., place an asterisk (*) as the first field in a deleted record. • Before deletion: Ames|John|123 Maple|Stillwater|OK|74075|... Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420| Brown|Martha|625 Kimbark|Des Moines|IA|50311|... • After deletion: Ames|John|123 Maple|Stillwater|OK|74075|... *|rrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420| Brown|Martha|625 Kimbark|Des Moines|IA|50311|...
Record Deletion • Keep the deleted records around for some time. • Delays the disk compaction. • Programs must be able to ignore the deleted records. • Allows records to be “undeleted”.
Record Deletion (cont’d) • Space reclamation: • Happens after a number of deleted records have accumulated. • A simple solution is to copy the file, skipping the deleted records. • Suitable for both fixed-length and variable-length records. • After space reclamation: Ames|John|123 Maple|Stillwater|OK|74075|... Brown|Martha|625 Kimbark|Des Moines|IA|50311|... • In-place space reclamation (without copying the file) is more complicated and time consuming.
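Reclaiming space by copying can be sketched in a few lines. For brevity this illustration assumes the records are already available as a list of strings, one per record, and that a deleted record starts with the asterisk mark; a real program would read from the old file and write to a new one.

```python
DELETE_MARK = "*"  # the mark placed at the start of a deleted record


def is_deleted(record: str) -> bool:
    return record.startswith(DELETE_MARK)


def reclaim_by_copying(records: list) -> list:
    """Copy the records to a new sequence, skipping the deleted ones."""
    return [record for record in records if not is_deleted(record)]


old_file = [
    "Ames|John|123 Maple|Stillwater|OK|74075|",
    "*|rrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420|",
    "Brown|Martha|625 Kimbark|Des Moines|IA|50311|",
]
new_file = reclaim_by_copying(old_file)
for record in new_file:
    print(record)
```

Only the Ames and Brown records reach the new file. Until this copy is made, the deleted record is still on disk and could be undeleted.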
Dynamic Space Reclaiming -- Fixed-Length Records • A naive approach: when inserting a new record, • search the file record by record; • if a deleted record is found, insert the new record in the place of the deleted record; • otherwise, insert the new record at the end of the file.
Dynamic Space Reclaiming -- Fixed-Length Records • Issues on reclaiming space quickly: • How to know immediately if there are empty slots in the file? • How to jump to one of those slots, if they exist? • Linking all deleted records together using a linked list: head pointer → deleted record → deleted record → ... → deleted record (pointer -1)
Dynamic Space Reclaiming -- Fixed-Length Records (cont’d) • Use the linked list of the deleted records as a stack: • Initially, RRN 5 and RRN 2 are deleted: head pointer → RRN 5 → RRN 2 → -1 • Add (push) a recently deleted record of RRN 3 to the top of the stack: head pointer → RRN 3 → RRN 5 → RRN 2 → -1 • Remove (pop) the free space of RRN 3 from the top of the stack for an inserted record: head pointer → RRN 5 → RRN 2 → -1
Dynamic Space Reclaiming -- Fixed-Length Records (cont’d) • Use the linked list of the deleted records as a stack: • Add (push) a recently deleted record of RRN 3 to the top of the stack: • Insert three new records into the space of the deleted records:
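The stack of deleted slots can be sketched with an in-memory list standing in for the file. The class and method names are illustrative assumptions; each deleted slot stores the asterisk mark together with the RRN of the next free slot, and -1 ends the list.

```python
class FixedLengthFile:
    """In-memory stand-in for a file of fixed-length records."""

    def __init__(self, records):
        self.slots = list(records)
        self.head = -1  # RRN of the first free slot, -1 if there is none

    def delete(self, rrn: int) -> None:
        # Push: the freed slot points at the old top of the stack.
        self.slots[rrn] = ("*", self.head)
        self.head = rrn

    def insert(self, record) -> int:
        if self.head == -1:          # no free slot: append at the end
            self.slots.append(record)
            return len(self.slots) - 1
        rrn = self.head              # pop the most recently freed slot
        _, self.head = self.slots[rrn]
        self.slots[rrn] = record
        return rrn


f = FixedLengthFile([f"record {i}" for i in range(7)])
f.delete(2)
f.delete(5)
f.delete(3)                              # head -> 3 -> 5 -> 2 -> -1
print(f.head, f.slots[3], f.slots[5], f.slots[2])
# 3 ('*', 5) ('*', 2) ('*', -1)
print([f.insert(f"new {k}") for k in range(4)])
# [3, 5, 2, 7]
```

Three insertions reuse RRN 3, 5, and 2 in that order; once the list is empty, the fourth record goes to the end of the file. Neither operation scans the file, which answers both questions raised earlier.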
Dynamic Space Reclaiming -- Variable-Length Records • An available list to store the deleted variable-length records: • How to link the deleted records together into a list? • How to add newly deleted records to the available list? • How to find and remove records from the available list when space is reclaimed?
Dynamic Space Reclaiming -- Variable-Length Records • An available list of variable-length records HEAD.FIRST_AVAILABLE:-1 40 Ames|John|123 Maple|Stillwater|OK|74075|64 Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420|45 Brown|Martha|625 Kimbark|Des Moines|IA|50311| • Delete the second record: HEAD.FIRST_AVAILABLE:43 40 Ames|John|123 Maple|Stillwater|OK|74075|64 *|-1.............................................................................................|45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
Dynamic Space Reclaiming -- Variable-Length Records (cont’d) • When inserting a new record, we need to search the available list for a deleted record with large enough record length: • The current available list: Size 47 → Size 38 → Size 72 → Size 68 → -1 • Insert a record of 55 bytes: Size 47 → Size 38 → Size 68 → -1 (new link; removed record: Size 72)
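Searching the available list and unlinking the chosen slot can be sketched with a singly linked list. The `Slot` class and function names are illustrative assumptions; a real file would store byte offsets in the deleted records rather than object references.

```python
class Slot:
    """One deleted record on the available list."""

    def __init__(self, size, nxt=None):
        self.size = size
        self.nxt = nxt  # next free slot, None marks the end (-1 in the file)


def take_first_large_enough(head, needed):
    """Unlink the first slot with size >= needed.

    Returns (new_head, slot); slot is None when nothing on the list fits.
    """
    prev, cur = None, head
    while cur is not None:
        if cur.size >= needed:
            if prev is None:
                head = cur.nxt        # the first slot was taken
            else:
                prev.nxt = cur.nxt    # new link that bypasses the taken slot
            cur.nxt = None
            return head, cur
        prev, cur = cur, cur.nxt
    return head, None


def sizes(head):
    out = []
    while head is not None:
        out.append(head.size)
        head = head.nxt
    return out


avail = Slot(47, Slot(38, Slot(72, Slot(68))))
avail, slot = take_first_large_enough(avail, 55)
print(slot.size, sizes(avail))  # 72 [47, 38, 68]
```

Unlike the fixed-length case, the list cannot be treated as a pure stack, because the slot on top may be too small for the new record.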
Storage Fragmentation • Internal fragmentation caused by fixed-length records: Ames|John|123 Maple|Stillwater|OK|74075|................................... Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|78420| Brown|Martha|625 Kimbark|Des Moines|IA|50311|......................... • Internal fragmentation caused by variable-length records: • The inserted record is shorter than the deleted record: HEAD.FIRST_AVAILABLE:-1 40 Ames|John|123 Maple|Stillwater|OK|74075|64 Ham|Al|28 Elm|Ada|OK|70332|.....................................................|45 Brown|Martha|625 Kimbark|Des Moines|IA|50311| • Reclaim the unused part of the deleted record: HEAD.FIRST_AVAILABLE:43 40 Ames|John|123 Maple|Stillwater|OK|74075|35 *|-1................................26 Ham|Al|28 Elm|Ada|OK|70332|45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
Storage Fragmentation (cont’d) • External fragmentation is caused by continuing to insert records until some free space becomes too fragmented to be useful: • Insert a record of 25 bytes: HEAD.FIRST_AVAILABLE:43 40 Ames|John|123 Maple|Stillwater|OK|74075|8 *|-1.....25 Lee|Ed|Rt 2|Ada|OK|74820|26 Ham|Al|28 Elm|Ada|OK|70332|45 Brown|Martha|625 Kimbark|Des Moines|IA|50311| • How to handle external fragmentation: • storage compaction: regenerate the file when external fragmentation becomes intolerable. • coalescing the holes: combine two record slots on the available list if they are physically adjacent. • placement strategy: adopt a placement strategy to minimize fragmentation.
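Coalescing the holes can be sketched by describing each free slot as an (offset, size) pair and merging pairs that touch. The representation and the byte offsets below are hypothetical, chosen only for illustration.

```python
def coalesce(holes):
    """Merge free slots that are physically adjacent in the file.

    holes: list of (byte offset, size in bytes) pairs, in any order.
    """
    merged = []
    for offset, size in sorted(holes):
        if merged and merged[-1][0] + merged[-1][1] == offset:
            # The previous hole ends exactly where this one starts: join them.
            prev_offset, prev_size = merged[-1]
            merged[-1] = (prev_offset, prev_size + size)
        else:
            merged.append((offset, size))
    return merged


# Hypothetical holes: the ones at offsets 40 and 50 touch, the third does not.
print(coalesce([(100, 20), (40, 10), (50, 15)]))
# [(40, 25), (100, 20)]
```

Two holes of 10 and 15 bytes, each too small for a 25-byte record, become one usable 25-byte slot.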
Placement Strategies • First-fit placement strategy: search for the first available space that is large enough for the inserted record. • Least amount of work when we place a newly available space on the list. • Best-fit placement strategy: search for the smallest available space that is large enough for the inserted record. • Order the available list in ascending order by size, then use the first-fit placement strategy. • After inserting the new record, the free area left over may be too small to be useful. This may cause serious external fragmentation. • The small free slots are placed at the beginning of the available list, which makes the search for the first-fit space increasingly long as time goes on. • Worst-fit placement strategy: • Order the available list in descending order by size, then use the first-fit placement strategy. • Always insert the new record into the first slot. If the first slot is not large enough, the new record is inserted at the end of the file. • Decreases the likelihood of external fragmentation.
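The three strategies can be compared side by side on the same available list. This sketch assumes the list is just a Python list of slot sizes in list order and returns the index of the chosen slot; instead of keeping the list sorted, it searches directly for the smallest or largest slot, which picks the same slot as first-fit on a sorted list.

```python
def first_fit(sizes, needed):
    """Index of the first slot that is large enough, or None."""
    for i, size in enumerate(sizes):
        if size >= needed:
            return i
    return None


def best_fit(sizes, needed):
    """Index of the smallest slot that is large enough, or None."""
    fitting = [i for i, size in enumerate(sizes) if size >= needed]
    return min(fitting, key=lambda i: sizes[i]) if fitting else None


def worst_fit(sizes, needed):
    """Index of the largest slot if it is large enough, else None.

    None means the record goes to the end of the file.
    """
    if not sizes:
        return None
    i = max(range(len(sizes)), key=lambda j: sizes[j])
    return i if sizes[i] >= needed else None


avail = [47, 38, 72, 68]
for strategy in (first_fit, best_fit, worst_fit):
    i = strategy(avail, 55)
    print(strategy.__name__, "-> slot of size", avail[i], "leftover", avail[i] - 55)
# first_fit -> slot of size 72 leftover 17
# best_fit -> slot of size 68 leftover 13
# worst_fit -> slot of size 72 leftover 17
```

Best-fit leaves the smallest leftover (13 bytes), which is the fragment most likely to be useless; worst-fit leaves the largest leftover it can.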