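File Structures by Folk, Zoellick and Riccardi
Chap 6. Organizing Files for Performance
Seoul National University, School of Computer Science and Engineering
Object-Oriented Systems Laboratory (SNU-OOPSLA-LAB)
Prof. Hyoung-Joo Kim (김형주)
SNU-OOPSLA Lab.
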
Chapter Objectives (1)
• Look at several approaches to data compression
• Look at storage compaction as a simple way of reusing space in a file
• Develop a procedure for deleting fixed-length records that allows vacated file space to be reused dynamically
• Illustrate the use of linked lists and stacks to manage an avail list
• Consider several approaches to the problem of deleting variable-length records
• Introduce the concepts associated with the terms internal fragmentation and external fragmentation

Chapter Objectives (2)
• Outline some placement strategies associated with the reuse of space in a variable-length record file
• Provide an introduction to the idea underlying a binary search
• Undertake an examination of the limitations of binary searching
• Develop a keysort procedure for sorting larger files; investigate the costs associated with keysort
• Introduce the concept of a pinned record

Contents
6.1 Data compression
6.2 Reclaiming space in files
6.3 Finding things quickly: An introduction to internal sorting and binary searching
6.4 Keysorting

6.1 Data Compression
Data Compression (1)
• Reasons for data compression
  • less storage
  • faster transmission, which decreases access time
  • faster sequential processing

Data Compression (2): Using a different notation
• Fixed-length fields are good candidates
• Decrease the number of bits by finding a more compact notation
  ex) the original state field takes 16 bits (two ASCII characters), but because there are only 50 states, a 6-bit code is enough
• Cons
  • unreadable by humans
  • cost in encoding time
  • decoding modules increase the complexity of the software
  => used only for particular applications

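The compact-notation idea can be sketched as a table lookup. This is only an illustration: the table `kStates` is a hypothetical stand-in that lists four abbreviations instead of all 50, and the names `encodeState` and `decodeState` are assumptions, not part of the text.

```cpp
#include <array>
#include <cassert>
#include <string>

// Hypothetical table: a real one would list all 50 two-letter abbreviations.
// Any index below 64 fits in 6 bits.
const std::array<std::string, 4> kStates = {"IA", "OK", "TX", "WA"};

// Returns the table index of an abbreviation, or -1 if it is not in the table.
int encodeState(const std::string& abbr) {
    for (std::size_t i = 0; i < kStates.size(); ++i)
        if (kStates[i] == abbr) return static_cast<int>(i);
    return -1;
}

// Inverse mapping. The decoder needs the same table as the encoder, which is
// the extra software complexity listed under the cons.
std::string decodeState(int code) {
    assert(code >= 0 && code < static_cast<int>(kStates.size()));
    return kStates[code];
}
```

The stored code is no longer readable by a person looking at the raw file, which is the first drawback the slide lists.
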
Data Compression (3): Suppressing repeating sequences
• Run-length encoding algorithm
  • read through the pixels, copying pixel values to the file in sequence, except where the same pixel value occurs more than once in succession
  • when the same value occurs more than once in succession, substitute the following three bytes
    • a special run-length code indicator (ex: ff)
    • the pixel value that is repeated
    • the number of times that value is repeated
• ex) 22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
  => 22 23 ff 24 07 25 ff 26 06 25 24

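A minimal sketch of the run-length encoder described above. It assumes that the indicator byte 0xff never occurs as a pixel value, as in the slide's example, and it caps a run at 255 so that the count fits in one byte; the name `rleEncode` is illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

const std::uint8_t kRunCode = 0xff;  // run-length code indicator

// A run of two or more identical values becomes three bytes:
// indicator, pixel value, repeat count. Single values are copied as-is.
std::vector<std::uint8_t> rleEncode(const std::vector<std::uint8_t>& pixels) {
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < pixels.size()) {
        assert(pixels[i] != kRunCode);  // assumption: 0xff is reserved
        std::size_t run = 1;
        while (i + run < pixels.size() && pixels[i + run] == pixels[i] && run < 255)
            ++run;
        if (run > 1) {
            out.push_back(kRunCode);
            out.push_back(pixels[i]);
            out.push_back(static_cast<std::uint8_t>(run));
        } else {
            out.push_back(pixels[i]);
        }
        i += run;
    }
    return out;
}
```

With this rule a run of exactly two values costs three bytes instead of two, so an image made mostly of short runs can come out larger than it went in.
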
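Representation of an image in a computer
[Figure: the screen is made up of pixels; the light intensity of each pixel, an electrical (analog) signal, is digitized into a (digital) number]
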
Representation of an image in a computer
[Figure: the screen stored as a grid of pixel values]
  12  8 12 33 99 12
  56  7 13 44 66 23
  12  4 34 57 99 12
  ...

Representation of a color image in a computer
[Figure: the screen stored as several layered grids of pixel values, each like the grid for a single image]
** Video: 25 to 30 still images are displayed in succession per second

Data Compression (3): Suppressing repeating sequences
• Run-length encoding (cont'd)
  • an example of redundancy reduction
  • cons
    • does not guarantee any particular amount of space savings
    • under some circumstances, the compressed image is larger than the original image
    • Why? Can you prevent this?

Data Compression (4): Assigning variable-length codes
• Morse code: the oldest and most common variable-length code scheme
• Some values occur more frequently than others
  • those values should take the least amount of space
• Huffman coding
  • based on probability of occurrence
  • determine the probability of each value occurring
  • build a binary tree with a search path for each value
  • more frequently occurring values are given shorter search paths in the tree

Data Compression (5): Assigning variable-length codes
• Huffman coding
  Letter:  a    b    c    d     e     f     g
  Prob:    0.4  0.1  0.1  0.1   0.1   0.1   0.1
  Code:    1    010  011  0000  0001  0010  0011
• ex) the string "abde" is encoded as 101000000001

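The code table above can be exercised with a small encoder and decoder. This is a sketch under the assumption that bits are kept as a text string of '0' and '1' characters for readability; a real implementation would pack them into bytes. The names `kCodes`, `huffmanEncode` and `huffmanDecode` are illustrative.

```cpp
#include <cassert>
#include <map>
#include <string>

// Code table copied from the slide.
const std::map<char, std::string> kCodes = {
    {'a', "1"},    {'b', "010"},  {'c', "011"}, {'d', "0000"},
    {'e', "0001"}, {'f', "0010"}, {'g', "0011"}};

// Concatenate the code of each letter.
std::string huffmanEncode(const std::string& text) {
    std::string bits;
    for (char ch : text) {
        auto it = kCodes.find(ch);
        assert(it != kCodes.end());  // only letters a to g are supported
        bits += it->second;
    }
    return bits;
}

// Grow a candidate code one bit at a time. No code in the table is a prefix
// of another, so the first exact match is the right letter.
std::string huffmanDecode(const std::string& bits) {
    std::string text, candidate;
    for (char bit : bits) {
        candidate += bit;
        for (const auto& entry : kCodes) {
            if (entry.second == candidate) {
                text += entry.first;
                candidate.clear();
                break;
            }
        }
    }
    assert(candidate.empty());  // input must end on a code boundary
    return text;
}
```

The four letters of "abde" take 12 bits here, and the most frequent letter, a, takes a single bit.
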
Huffman Tree
[Figure: Huffman tree for the letters a to g; each 0 or 1 is one branch taken from the root]
root
  1 -> a (1)
  0
    00
      000
        0000 -> d
        0001 -> e
      001
        0010 -> f
        0011 -> g
    01
      010 -> b
      011 -> c

Data Compression (6): Irreversible compression techniques
• Some information can be sacrificed
• Less common in data files
• Shrinking a raster image
  • 400-by-400 pixels to 100-by-100 pixels
  • 1 pixel for every 16 pixels
• Speech compression
  • voice coding (the lost information is of little or no value)

Compression in UNIX
• System V
  • pack and unpack use Huffman codes
  • after compressing a file, pack appends ".z" to the name of the packed file
• Berkeley UNIX
  • compress and uncompress use the Lempel-Ziv method
  • after compressing a file, compress appends ".Z" to the name of the compressed file

6.2 Reclaiming Space in Files
Record Deletion and Storage Compaction
• Storage compaction
  • record deletion: just mark each deleted record
  • compaction: reclaim the space of all deleted records at once
  => pros: delete and undelete operations take little effort
• Ex)
  Original file
    Ames|123|OK|…...|
    Morrison|9035|OK|
    Brown|625|IA|…...|
  After deleting the second record
    Ames|123|OK|…|
    *|rrison|9035|OK|
    Brown|625|IA|…|
  After compaction
    Ames|123|OK|…|
    Brown|625|IA|…|

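The mark-then-compact cycle can be sketched as follows, assuming the file is held as a vector of record strings and that a deleted record is marked by overwriting its first two bytes with "*|", as in the example. The function names `markDeleted` and `compact` are illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Deleting only marks the record, so it can still be undeleted later
// if the two overwritten bytes can be restored.
void markDeleted(std::string& record) {
    assert(record.size() >= 2);
    record[0] = '*';
    record[1] = '|';
}

// Compaction copies every record that is not marked into a new file.
// The space of all deleted records is reclaimed in one pass.
std::vector<std::string> compact(const std::vector<std::string>& records) {
    std::vector<std::string> kept;
    for (const std::string& r : records)
        if (r.empty() || r[0] != '*') kept.push_back(r);
    return kept;
}
```

Because compaction rewrites the whole file, it is usually run occasionally rather than after every deletion.
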
Deleting Fixed-length Records for Reclaiming Space Dynamically (1)
• Reuse the space from deleted records as soon as possible
  • deleted records must be marked in a special way
  • we must be able to find the deleted space
• To reuse record slots quickly, we need
  • a way to know immediately whether there are empty slots in the file
  • a way to jump directly to one of those slots if they exist
  => linked lists or stacks for the avail list
* avail list: a list made up of the slots of deleted records

Deleting Fixed-length Records for Reclaiming Space Dynamically (2)
[Figure: the avail list as a linked list and as a stack]
The linked list: head pointer -> ptr -> ptr -> ptr -> ptr -> -1
The stack
  (a) head pointer (5) -> RRN 5 (next: 2) -> RRN 2 (next: -1)
  (b) after pushing the record with RRN 3: head pointer (3) -> RRN 3 (next: 5) -> RRN 5 (next: 2) -> RRN 2 (next: -1)

Deleting Fixed-length Records for Reclaiming Space Dynamically (3)
• Linking and stacking deleted records
  • links are arranged and rearranged so that each available record slot points to the next
  • the second field of a deleted record holds the link to the next available record

• Sample file showing linked list of deleted records
  RRN:  0           1            2         3            4           5            6
  (a) List head (first available record) = 5, after deleting records 3 and 5
        Edwards...  Bates...     Wills...  *-1          Masters...  *3           Chavez...
  (b) List head (first available record) = 1, after deleting record 1
        Edwards...  *5           Wills...  *-1          Masters...  *3           Chavez...
  (c) List head (first available record) = -1, after inserting three new records
        Edwards...  1st new rec  Wills...  3rd new rec  Masters...  2nd new rec  Chavez...

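The sample file can be reproduced with a small in-memory model. This is a sketch, not the book's implementation: the class `FixedFile`, the struct `Slot`, and their members are hypothetical names, and the file is a vector rather than a disk file. Deleting pushes a slot onto the avail stack; inserting pops one, or appends if the stack is empty.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Slot {
    bool deleted = false;
    int next = -1;     // RRN of the next available slot (meaningful only if deleted)
    std::string data;
};

class FixedFile {
public:
    explicit FixedFile(int count) : slots_(count) {}
    int head() const { return availHead_; }
    int nextOf(int rrn) const { return slots_.at(rrn).next; }

    // Delete: mark the slot and push it onto the avail stack.
    void remove(int rrn) {
        assert(rrn >= 0 && rrn < static_cast<int>(slots_.size()));
        assert(!slots_[rrn].deleted);
        slots_[rrn].deleted = true;
        slots_[rrn].next = availHead_;
        availHead_ = rrn;
    }

    // Insert: pop the most recently freed slot, or append at the end
    // when the list head is -1. Returns the RRN that was used.
    int insert(const std::string& data) {
        int rrn = availHead_;
        if (rrn != -1) {
            availHead_ = slots_[rrn].next;
        } else {
            rrn = static_cast<int>(slots_.size());
            slots_.emplace_back();
        }
        slots_[rrn].deleted = false;
        slots_[rrn].next = -1;
        slots_[rrn].data = data;
        return rrn;
    }

private:
    std::vector<Slot> slots_;
    int availHead_ = -1;  // -1 means the avail list is empty
};
```

Deleting records 3, 5 and 1 in that order leaves the head at 1, so three inserts land in slots 1, 5 and 3, matching panel (c).
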
Deleting Variable-length Records
• Avail list of variable-length records
  • each record begins with a byte count of the record
  • links use byte offsets instead of RRNs
• Adding and removing records
  • when adding a record, search through the avail list for a slot of the right size (=> big enough)

Removal of a record from an avail list with variable-length records
[Figure]
(a) Before removal: head -> Size 47 -> Size 38 -> Size 72 -> Size 68 -> -1
(b) After removal:  head -> Size 47 -> Size 38 -> Size 68 -> -1
    (new link from the Size 38 slot to the Size 68 slot; the Size 72 record is removed)

Storage Fragmentation
• Internal fragmentation (in fixed-length records)
  • wasted space within a record
  • variable-length records minimize wasted space by doing away with internal fragmentation
• External fragmentation (in variable-length records)
  • unused space outside or between individual records
  • three possible solutions
    • storage compaction
    • coalescing the holes: combining adjacent holes into a single, larger record slot
    • minimizing fragmentation by adopting a placement strategy

Internal Fragmentation in Fixed-length Records
64-byte fixed-length records; the unused space (shown as dots) is internal fragmentation
Ames|John|123 Maple|Stillwater|OK|74075|........................
Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
Brown|Martha|625 Kimbark|Des Moines|IA|50311|...................

External Fragmentation in Variable-length Records
Each record is preceded by its record length
Record[1]: 40 Ames|John|123 Maple|Stillwater|OK|74075|
Record[2]: 64 Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
Record[3]: 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
ex) Delete Record[2] and insert new Record[i] in its place: 64 - 52 = 12 bytes of unused space (external fragmentation)
Record[i]: 52 Adams|Kits|3301 Washington D.C|Forest Village|IA|43563|

Placement Strategies
• First-fit
  • select the first available record slot that is big enough
  • suitable when lost space is due to internal fragmentation
• Best-fit
  • select the available record slot closest in size
  • avail list kept in ascending order of size
  • suitable when lost space is due to internal fragmentation
• Worst-fit
  • select the largest record slot
  • avail list kept in descending order of size
  • suitable when lost space is due to external fragmentation

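The three strategies differ only in which big-enough slot they pick. The sketch below assumes the avail list is given as an unsorted vector of slot sizes and scans all of it, rather than keeping the list in ascending or descending order as the slide suggests. `Strategy` and `chooseSlot` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Strategy { FirstFit, BestFit, WorstFit };

// Returns the index of the chosen slot in availSizes,
// or -1 if no slot is big enough for the new record.
int chooseSlot(const std::vector<int>& availSizes, int needed, Strategy strategy) {
    assert(needed > 0);
    int chosen = -1;
    for (std::size_t i = 0; i < availSizes.size(); ++i) {
        if (availSizes[i] < needed) continue;  // too small
        if (chosen == -1) {
            chosen = static_cast<int>(i);
            if (strategy == Strategy::FirstFit) break;  // first big-enough slot wins
            continue;
        }
        bool better = (strategy == Strategy::BestFit)
                          ? availSizes[i] < availSizes[chosen]   // tighter fit
                          : availSizes[i] > availSizes[chosen];  // larger slot
        if (better) chosen = static_cast<int>(i);
    }
    return chosen;
}
```

For an avail list with sizes 47, 38, 72, 68 and a 60-byte record, first-fit and worst-fit both pick the 72-byte slot, while best-fit picks the 68-byte slot.
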
6.3 Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching
Finding Things Quickly (1)
• Goal: minimize the number of disk accesses
• Finding things in simple field and record files may require many seeks
• Binary search algorithm for fixed-size records

int BinarySearch(FixedRecordFile &file, RecType &obj, KeyType &key)
// binary search for key; the file must be sorted by key
{
    int low = 0;
    int high = file.NumRecs() - 1;
    while (low <= high) {
        int guess = low + (high - low) / 2;    // midpoint of the remaining range
        file.ReadByRRN(obj, guess);
        if (obj.Key() == key) return 1;        // record found
        if (obj.Key() < key) low = guess + 1;  // search after guess
        else high = guess - 1;                 // search before guess
    }
    return 0; // loop ended without finding key
}

Classes and Methods for Binary Search

class KeyType {
public:
    int operator == (KeyType &);
    int operator < (KeyType &);
};
class RecType {
public:
    KeyType Key();
};
class FixedRecordFile {
public:
    int NumRecs();
    int ReadByRRN(RecType &Record, int RRN);
};

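To try the search without a disk file, the three classes can be given in-memory bodies. Everything inside the classes below is an assumption made for illustration (string keys, a vector standing in for the file, const-qualified methods); only the class and method names come from the slides. The search computes the midpoint as low + (high - low) / 2 and moves low up when the probed key is smaller than the search key.

```cpp
#include <cassert>
#include <string>
#include <vector>

class KeyType {
public:
    KeyType(const std::string& v = "") : value_(v) {}
    bool operator==(const KeyType& o) const { return value_ == o.value_; }
    bool operator<(const KeyType& o) const { return value_ < o.value_; }
private:
    std::string value_;
};

class RecType {
public:
    RecType(const std::string& key = "") : key_(key) {}
    KeyType Key() const { return KeyType(key_); }
private:
    std::string key_;
};

// Stand-in for a file of fixed-length records, sorted by key.
class FixedRecordFile {
public:
    explicit FixedRecordFile(const std::vector<RecType>& recs) : recs_(recs) {}
    int NumRecs() const { return static_cast<int>(recs_.size()); }
    int ReadByRRN(RecType& record, int rrn) const {
        assert(rrn >= 0 && rrn < NumRecs());
        record = recs_[rrn];  // on disk this would be one seek and one read
        return 1;
    }
private:
    std::vector<RecType> recs_;
};

// Returns 1 and leaves the record in obj if key is found, otherwise 0.
int BinarySearch(const FixedRecordFile& file, RecType& obj, const KeyType& key) {
    int low = 0;
    int high = file.NumRecs() - 1;
    while (low <= high) {
        int guess = low + (high - low) / 2;
        file.ReadByRRN(obj, guess);
        if (obj.Key() == key) return 1;
        if (obj.Key() < key) low = guess + 1;  // key is in the upper half
        else high = guess - 1;                 // key is in the lower half
    }
    return 0;
}
```

Each pass through the loop costs one call to ReadByRRN, which is why the number of probes, not the number of comparisons, dominates the cost on disk.
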
Finding Things Quickly (2)
• Binary search vs. sequential search
  • binary search
    • O(log n)
    • the list must be sorted by key
  • sequential search
    • O(n)

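To make the two growth rates concrete, the worst-case number of records read can be computed directly. This is a small sketch with illustrative function names; it counts probes only and ignores buffering.

```cpp
#include <cassert>

// Worst case for binary search on n sorted records: floor(log2 n) + 1 probes,
// computed here by counting how many times n can be halved.
int maxBinaryProbes(int n) {
    assert(n >= 0);
    int probes = 0;
    while (n > 0) {
        ++probes;
        n /= 2;
    }
    return probes;
}

// Worst case for sequential search: every record is read.
int maxSequentialProbes(int n) {
    assert(n >= 0);
    return n;
}
```

For a file of 1,000 records this gives at most 10 probes for binary search against 1,000 for sequential search.
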
Finding Things Quickly (3)
• Sorting a disk file in RAM
  • read the entire file from disk into memory
  • use an internal sort (= sort in memory)
  • the UNIX sort utility uses an internal sort
• Limitations of binary search and internal sort
  • binary search requires more than one or two accesses, c.f. a single access by RRN
  • keeping a file sorted is very expensive
  • an internal sort works only on small files

Internal Sort
[Figure: the entire unsorted file is read from disk into memory and sorted in memory to produce the sorted file]

6.4 Keysorting
Key Sorting & Its Limitations
• Also called "tag sort": only the keys are sorted
• Sorting procedure
  • read only the keys into memory
  • sort the keys
  • rearrange the records in the file according to the sorted keys
• Advantage
  • needs less RAM than an internal sort
• Disadvantages (= limitations)
  • the records must be read from disk twice
  • constructing the new (sorted) file requires a lot of seeking for records

[Figure: conceptual view of the KEYNODES array (in RAM) and the records (on secondary storage)]
Before sorting
  KEY       RRN    Records
  HARRISON  1      Harrison|Susan|387 Eastern....
  KELLOG    2      Kellog|Bill|17 Maple....
  HARRIS    3      Harris|Margaret|4343 West....
  ....      ....   ....
  BELL      k      Bell|Robert|8912 Hill....
After sorting keys in RAM (the records stay where they were)
  KEY       RRN    Records
  BELL      k      Harrison|Susan|387 Eastern....
  HARRIS    3      Kellog|Bill|17 Maple....
  HARRISON  1      Harris|Margaret|4343 West....
  ....      ....   ....
  KELLOG    2      Bell|Robert|8912 Hill....

Pseudocode for keysort (1)
Program: keysort
    open input file as IN_FILE
    create output file as OUT_FILE
    read header record from IN_FILE and write a copy to OUT_FILE
    REC_COUNT := record count from header record
    /* read in records; set up KEYNODES array */
    for i := 1 to REC_COUNT
        read record from IN_FILE into BUFFER
        extract canonical key and place it in KEYNODES[i].KEY
        KEYNODES[i].RRN := i
    (continued....)

Pseudocode for keysort (2)
    /* sort KEYNODES[].KEY, thereby ordering RRNs correspondingly */
    sort(KEYNODES, REC_COUNT)
    /* read in records according to sorted order, and write them out in this order */
    for i := 1 to REC_COUNT
        seek in IN_FILE to record with RRN of KEYNODES[i].RRN
        read the record into BUFFER from IN_FILE
        write BUFFER contents to OUT_FILE
    close IN_FILE and OUT_FILE
end PROGRAM

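The pseudocode can be sketched in C++ with vectors standing in for IN_FILE and OUT_FILE. Several things here are assumptions made for illustration: RRNs start at 0 rather than 1, there is no header record, and the canonical key is taken to be the first field converted to upper case. The names `KeyNode`, `canonicalKey`, `buildKeyNodes` and `keysort` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

struct KeyNode {
    std::string key;
    int rrn;
};

// Assumed canonical form: the text before the first '|', in upper case.
std::string canonicalKey(const std::string& record) {
    std::string key = record.substr(0, record.find('|'));
    std::transform(key.begin(), key.end(), key.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return key;
}

// First pass over the file: keep only (key, RRN) pairs, then sort by key.
std::vector<KeyNode> buildKeyNodes(const std::vector<std::string>& inFile) {
    std::vector<KeyNode> nodes;
    for (std::size_t i = 0; i < inFile.size(); ++i)
        nodes.push_back({canonicalKey(inFile[i]), static_cast<int>(i)});
    std::sort(nodes.begin(), nodes.end(),
              [](const KeyNode& a, const KeyNode& b) { return a.key < b.key; });
    return nodes;
}

// Second pass: fetch each record by RRN in key order and write it out.
// On disk, every fetch is a seek plus a read.
std::vector<std::string> keysort(const std::vector<std::string>& inFile) {
    std::vector<KeyNode> nodes = buildKeyNodes(inFile);
    std::vector<std::string> outFile;
    for (const KeyNode& node : nodes) {
        assert(node.rrn >= 0 && node.rrn < static_cast<int>(inFile.size()));
        outFile.push_back(inFile[node.rrn]);
    }
    return outFile;
}
```

The second loop reads records in key order, not in file order, which is where the heavy seeking comes from.
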
Two Solutions: why bother to write the file back?
• Write out the sorted KEYNODES[] array without writing the records back in sorted order
• The KEYNODES[] array is then used as an index file

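Using the sorted array as an index can be sketched as a binary search over (key, RRN) pairs followed by one direct read of the data file. The struct `IndexEntry` and the function `lookupRRN` are hypothetical names, the index is held in memory, and RRNs are assumed to start at 0.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct IndexEntry {
    std::string key;
    int rrn;  // position of the record in the unsorted data file
};

// Binary search of the index. Returns the RRN for key, or -1 if absent.
// The index must be sorted by key; the data file is never reordered.
int lookupRRN(const std::vector<IndexEntry>& index, const std::string& key) {
    assert(std::is_sorted(index.begin(), index.end(),
                          [](const IndexEntry& a, const IndexEntry& b) {
                              return a.key < b.key;
                          }));
    auto it = std::lower_bound(index.begin(), index.end(), key,
                               [](const IndexEntry& e, const std::string& k) {
                                   return e.key < k;
                               });
    if (it == index.end() || it->key != key) return -1;
    return it->rrn;
}
```

Only the small index has to be kept in key order, so the records themselves never move.
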
Relationship between the index file and the data file
[Figure]
  Index file            Original file
  KEY       RRN         Records
  BELL      k           Harrison|Susan|387 Eastern....
  HARRIS    3           Kellog|Bill|17 Maple....
  HARRISON  1           Harris|Margaret|4343 West....
  ....      ....        ....
  KELLOG    2           Bell|Robert|8912 Hill....

Pinned records (1)
• Pinned records: records whose physical locations are referenced by other records
• We are not free to alter the physical location of pinned records, because doing so would leave dangling references
• Pinned records make sorting more difficult and sometimes impossible
  • solution: use an index file, while keeping the actual data file in its original order

Pinned records (2)
[Figure: file with pinned records. Record(i) and Record(i+1) point to pinned records; deleting a pinned record leaves a dangling pointer]

Let's Review!
6.1 Data compression
6.2 Reclaiming space in files
6.3 Finding things quickly: An introduction to internal sorting and binary searching
6.4 Keysorting