File Structures by Folk, Zoellick, and Riccardi. Chap 7. Indexing. Seoul National University, Department of Computer Engineering, Object-Oriented Systems Laboratory (SNU-OOPSLA-LAB), Prof. Hyoung-Joo Kim. Chapter Objectives (1). Introduce concepts of indexing that have broad applications in the design of file systems
File Structures by Folk, Zoellick, and Riccardi
Chap 7. Indexing
Seoul National University, Department of Computer Engineering, Object-Oriented Systems Laboratory (SNU-OOPSLA-LAB), Prof. Hyoung-Joo Kim
SNU-OOPSLA Lab.
Chapter Objectives (1)
• Introduce concepts of indexing that have broad applications in the design of file systems
• Introduce the use of a simple linear index to provide rapid access to records in an entry-sequenced, variable-length record file
• Investigate the implications of the use of indexes for file maintenance
• Introduce the template features of C++ for object I/O
• Describe the object-oriented approach to indexed sequential files
Chapter Objectives (2)
• Describe the use of indexes to provide access to records by more than one key
• Introduce the idea of an inverted list, illustrating Boolean operations on lists
• Discuss when to bind an index key to an address in the data file
• Introduce and investigate the implications of self-indexing files
Contents (1)
7.1 What Is an Index?
7.2 A Simple Index for Entry-Sequenced Files
7.3 Using Template Classes in C++ for Object I/O
7.4 Object-Oriented Support for Indexed, Entry-Sequenced Files of Data Objects
7.5 Indexes That Are Too Large to Hold in Memory
Contents (2)
7.6 Indexing to Provide Access by Multiple Keys
7.7 Retrieval Using Combinations of Secondary Keys
7.8 Improving the Secondary Index Structure: Inverted Lists
7.9 Selective Indexes
7.10 Binding
7.1 What Is an Index?
Overview: Index (1)
• Index: a data structure that associates given key values with the corresponding record numbers
• It is usually physically separate from the data file (unlike indexed sequential files, where index and data are tightly bound)
• Linear indexes (like the index at the back of a book)
  • Index records are ordered by key value, as in an ordered relative file
  • The best algorithm for finding a record with a specific key value is binary search
  • Adding an entry requires reorganizing the index
7.1 What Is an Index?
Overview: Index (2)
[Figure: an index file holding the keys k1, k2, k4, k5, k7, k9, each pointing to a record in the data file (records AAA, ZZZ, CCC, XXX, EEE, FFF).]
7.1 What Is an Index?
Overview: Index (3)
• Tree indexes (like those of indexed sequential files)
  • Hierarchical: each level, beginning with the root level, points to the next level
  • Only the leaves point to the data file
• Examples
  • Indexed sequential file
  • Binary tree index
  • AVL tree index
  • B+ tree index
7.1 What Is an Index?
Roles of an Index
• Index: keys and reference fields
• Fast random access
• Uniform access speed
• Allows users to impose order on a file without actually rearranging the file
• Provides multiple access paths to a file
• Gives users keyed access to variable-length record files
7.2 A Simple Index for E-S Files
A Simple Index (1)
• Data file
  • entry-sequenced, variable-length records
  • primary key: unique for each entry in the file
• Searching a file by key (a common need)
  • binary search cannot be used on a variable-length record file, because the position of the middle record is unknown
  • instead, construct an index object for the file
  • index object: key field + byte-offset field
7.2 A Simple Index for E-S Files
A Simple Index (2)
Index file (sorted by key)
Key        Reference field
ANG3795    167
COL31809   353
COL38358   211
DG139201   396
DG18807    256
FF245      442
LON2312    32
MER75016   300
RCA2626    77
WAR23699   132
Data file (entry-sequenced)
Address of record   Actual data record
32                  LON|2312|Romeo and Juliet|Prokofiev . . .
77                  RCA|2626|Quartet in C Sharp Minor . . .
132                 WAR|23699|Touchstone|Corea . . .
167                 ANG|3795|Symphony No. 9|Beethoven . . .
211                 COL|38358|Nebraska|Springsteen . . .
256                 DG|18807|Symphony No. 9|Beethoven . . .
300                 MER|75016|Coq d'or Suite|Rimsky . . .
353                 COL|31809|Symphony No. 9|Dvorak . . .
396                 DG|139201|Violin Concerto|Beethoven . . .
442                 FF|245|Good News|Sweet Honey In The . . .
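The lookup that this index makes possible can be sketched as follows. This is a minimal illustration, not the book's TextIndex code: the names IndexEntry, findOffset, and sampleIndex are assumptions introduced here, and the entries are the key/offset pairs from the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical index entry: a key plus the byte offset of its record.
struct IndexEntry {
    std::string key;
    long offset;
};

// Binary search over entries sorted by key; returns -1 when the key is absent.
long findOffset(const std::vector<IndexEntry>& index, const std::string& key) {
    // The search is only valid if the index is kept in key order.
    assert(std::is_sorted(index.begin(), index.end(),
                          [](const IndexEntry& a, const IndexEntry& b) {
                              return a.key < b.key;
                          }));
    auto it = std::lower_bound(index.begin(), index.end(), key,
                               [](const IndexEntry& e, const std::string& k) {
                                   return e.key < k;
                               });
    return (it != index.end() && it->key == key) ? it->offset : -1;
}

// The ten key/offset pairs from the slide, in key order.
std::vector<IndexEntry> sampleIndex() {
    return {{"ANG3795", 167}, {"COL31809", 353}, {"COL38358", 211},
            {"DG139201", 396}, {"DG18807", 256}, {"FF245", 442},
            {"LON2312", 32},  {"MER75016", 300}, {"RCA2626", 77},
            {"WAR23699", 132}};
}
```

Calling findOffset(sampleIndex(), "LON2312") yields the byte offset at which the data file would be read; the data file itself never has to be sorted.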
7.2 A Simple Index for E-S Files
A Simple Index (3)
• Index file: fixed-size records (key + reference field), sorted by key
• Data file: not sorted, because it is entry-sequenced
• Record addition is quick (faster than with a sorted data file)
• The index can be kept in memory
  • finding a record through the index is then faster than searching a sorted data file
• Class TextIndex encapsulates the index data and the index operations
7.2 A Simple Index for E-S Files
Class TextIndex (see Figure 7.4)
    class TextIndex {
    public:
        TextIndex(int maxKeys = 100, int unique = 1);
        int Insert(const char* key, int recAddr);  // add to index
        int Remove(const char* key);               // remove key from index
        int Search(const char* key) const;         // search for key, return recAddr
        void Print(ostream&) const;
    protected:
        int MaxKeys;    // maximum number of entries
        int NumKeys;    // actual number of entries
        char** Keys;    // array of key values
        int* RecAddrs;  // array of record references
        int Find(const char* key) const;
        int Init(int maxKeys, int unique);
        int Unique;     // if true, each key must be unique
    };
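How Insert and Remove might keep such an index ordered can be sketched as below. This is an assumption-laden stand-in, not the book's Textind.cpp: SortedTextIndex is a hypothetical class that uses std::vector and std::string in place of the raw arrays, and it always enforces unique keys.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified counterpart of TextIndex (unique keys only).
class SortedTextIndex {
public:
    // Inserts the key at its sorted position; returns 0 if it already exists.
    int Insert(const std::string& key, int recAddr) {
        assert(recAddr >= 0);  // -1 is reserved for "not found"
        auto pos = std::lower_bound(entries_.begin(), entries_.end(), key, keyLess);
        if (pos != entries_.end() && pos->first == key) return 0;
        entries_.emplace(pos, key, recAddr);  // shifts the later entries up
        return 1;
    }

    // Removes the key; returns 0 if it is not in the index.
    int Remove(const std::string& key) {
        auto pos = std::lower_bound(entries_.begin(), entries_.end(), key, keyLess);
        if (pos == entries_.end() || pos->first != key) return 0;
        entries_.erase(pos);  // shifts the later entries down
        return 1;
    }

    // Returns the record address for the key, or -1 if it is absent.
    int Search(const std::string& key) const {
        auto pos = std::lower_bound(entries_.begin(), entries_.end(), key, keyLess);
        return (pos != entries_.end() && pos->first == key) ? pos->second : -1;
    }

    int NumKeys() const { return static_cast<int>(entries_.size()); }

private:
    using Entry = std::pair<std::string, int>;
    static bool keyLess(const Entry& e, const std::string& k) { return e.first < k; }
    std::vector<Entry> entries_;  // always sorted by key
};
```

Because every insertion shifts entries, this arrangement is only attractive while the whole index fits in memory.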
Index Implementation
• Pages 638, 639, 640
  • G.1 Recording.h
  • G.2 Recording.cpp
  • G.3 Makere.cpp
• Pages 641, 642
  • G.4 Textind.h
  • G.5 Textind.cpp
RetrieveRecording with the Index
• RetrieveRecording(key, ...) retrieves a single record by key from the data file. It combines the index search, file read, and buffer unpack operations into a single function.
    int RetrieveRecording(Recording& recording, char* key,
                          TextIndex& RecordingIndex, BufferFile& RecordingFile)
    // read and unpack the recording, return TRUE if it succeeds
    {
        int recAddr = RecordingIndex.Search(key);
        if (recAddr == -1) return FALSE;  // key not in index; do not fall through to a sequential read
        int result = RecordingFile.Read(recAddr);
        if (result == -1) return FALSE;
        result = recording.Unpack(RecordingFile.GetBuffer());
        return result;
    }
7.3 Using Template Classes in C++ for Object I/O
Template Class for I/O Object (1)
• Template class RecordFile
  • we want to make the following code possible
    • Person p; RecordFile pFile; pFile.Read(p);
    • Recording r; RecordFile rFile; rFile.Read(r);
  • with an ordinary class it is difficult to support files of different record types without modifying the class
• Solution: a template class derived from BufferFile
  • the actual declarations and calls
    • RecordFile<Person> pFile; pFile.Read(p);
    • RecordFile<Recording> rFile; rFile.Read(r);
7.3 Using Template Classes in C++ for Object I/O
Template Class for I/O Object (2)
• Template class RecordFile
    template <class RecType>
    class RecordFile : public BufferFile {
    public:
        int Read(RecType& record, int recaddr = -1);
        int Write(const RecType& record, int recaddr = -1);
        int Append(const RecType& record);
        RecordFile(IOBuffer& buffer) : BufferFile(buffer) {}
    };
    // The template parameter RecType must have the following methods:
    //   int Pack(IOBuffer&);    pack record into buffer
    //   int Unpack(IOBuffer&);  unpack record from buffer
7.3 Using Template Classes in C++ for Object I/O
Template Class for I/O Object (3)
• Adding I/O to an existing class with RecordFile
  • add methods Pack and Unpack to class Recording
  • create a buffer object to use in the I/O
    • DelimFieldBuffer Buffer;
  • declare an object of type RecordFile<Recording>
    • RecordFile<Recording> rFile(Buffer);
• Declarations and calls
  • Recording r1, r2;
  • rFile.Open("myfile");
  • rFile.Read(r1);
  • rFile.Write(r2);
• Result: the program can directly open a file and read and write objects of class Recording
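The idea that one template serves any record type offering Pack and Unpack can be sketched in a self-contained way. The sketch below is not the book's RecordFile: MiniRecordFile is a hypothetical class that keeps its "file" in an in-memory string, and its Pack/Unpack signatures work on std::string rather than on IOBuffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for RecordFile<RecType>; records are stored as
// newline-terminated strings in memory instead of in a BufferFile.
template <class RecType>
class MiniRecordFile {
public:
    // Appends the packed record and returns the byte offset where it starts.
    long Append(const RecType& record) {
        long recAddr = static_cast<long>(data_.size());
        data_ += record.Pack();
        data_ += '\n';
        return recAddr;
    }

    // Reads and unpacks the record that starts at byte offset recAddr.
    bool Read(RecType& record, long recAddr) const {
        if (recAddr < 0 || static_cast<std::size_t>(recAddr) >= data_.size()) return false;
        std::size_t start = static_cast<std::size_t>(recAddr);
        std::size_t end = data_.find('\n', start);
        assert(end != std::string::npos);  // every stored record ends with '\n'
        return record.Unpack(data_.substr(start, end - start));
    }

private:
    std::string data_;
};

// Any type with these two methods can be used as RecType.
struct Person {
    std::string name, city;
    std::string Pack() const { return name + "|" + city; }
    bool Unpack(const std::string& packed) {
        std::size_t bar = packed.find('|');
        if (bar == std::string::npos) return false;
        name = packed.substr(0, bar);
        city = packed.substr(bar + 1);
        return true;
    }
};
```

A declaration such as MiniRecordFile<Person> pFile; then mirrors the RecordFile<Person> pFile; line on the slide, and a second record type needs no change to the template.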
7.4 OO Support for Indexed, E-S Files of Data Objects
Object-Oriented Approach to I/O
• Class IndexedFile
  • adds indexed access to the sequential access provided by class RecordFile
  • extends RecordFile with Update, Append, and Read methods
    • Update and Append: maintain a primary key index of the data file
    • Read: supports access to an object by key
  • TextIndex + RecordFile ==> IndexedFile
• Issues of IndexedFile
  • how to make a persistent index of a file
  • how to guarantee that the index accurately reflects the contents of the data file
7.4 OO Support for Indexed, E-S Files of Data Objects
Basic Operations of IndexedFile (1)
• Create the original empty index and data files
• Load the index file into memory
• Rewrite the index file from memory
• Add records to the data file and index
• Delete records from the data file
• Update records in the data file
• Update the index to reflect changes in the data file
• Retrieve records
7.4 OO Support for Indexed, E-S Files of Data Objects
Basic Operations of TextIndexedFile (1)
• Creating the files
  • the index file and the data file are both created as empty files with header records
  • implementation (makeind.cpp in Appendix G): the Create method of class BufferFile
• Loading the index into memory
  • loading and storing objects is supported by the IOBuffer classes
  • a particular buffer class must be chosen for the index file (tindbuff.cpp in Appendix G)
  • class TextIndexBuffer is defined as a derived class of FixedFieldBuffer to support reading and writing of index objects
7.4 OO Support for Indexed, E-S Files of Data Objects
Basic Operations of TextIndexedFile (2)
• Rewriting the index file from memory
  • part of the Close operation on an IndexedFile
  • writes the index object back to the index file
  • the index should be protected against failures
  • changes are written only when the stored copy is out of date (use a status flag)
  • implementation: Rewind and Write operations of class BufferFile
• Record addition
  • add a new record to the data file, using RecordFile<Recording>::Write
  • add an entry to the index, using TextIndex::Insert
    • requires rearrangement of the index
    • no file access is needed if the index is in memory
7.4 OO Support for Indexed, E-S Files of Data Objects
Basic Operations of TextIndexedFile (3)
• Record deletion
  • data file: the remaining records need not be moved
  • index: either really delete the entry or just mark it as deleted
  • using TextIndex::Remove
• Record updating (2 categories)
  • the update changes the value of the key field
    • delete/add approach
    • may reorder both the index and the data file
  • the update does not affect the key field
    • no rearrangement of the index file
    • may need to reconstruct the data file
7.4 OO Support for Indexed, E-S Files of Data Objects
Class TextIndexedFile (1)
• Members
  • methods
    • Create, Open, Close, Read (sequential and indexed), Append, and Update operations
  • protected members
    • ensure the correlation between the index in memory (Index), the index file (IndexFile), and the data file (DataFile)
  • char* Key()
    • the template parameter RecType must have the Key method
    • used to extract the key value from the record
7.4 OO Support for Indexed, E-S Files of Data Objects
Class TextIndexedFile (2)
    template <class RecType>
    class TextIndexedFile {
    public:
        int Read(RecType& record);             // read next record
        int Read(char* key, RecType& record);  // read by key
        int Append(const RecType& record);
        int Update(char* oldKey, const RecType& record);
        int Create(char* name, int mode = ios::in | ios::out);
        int Open(char* name, int mode = ios::in | ios::out);
        int Close();
        TextIndexedFile(IOBuffer& buffer, int keySize, int maxKeys = 100);
        ~TextIndexedFile();  // close and delete
    protected:
        TextIndex Index;
        BufferFile IndexFile;
        TextIndexBuffer IndexBuffer;
        RecordFile<RecType> DataFile;
        char* FileName;  // base file name for file
        int SetFileName(char* fName, char*& dFileName, char*& IdxFName);
    };
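How an index and an entry-sequenced data file cooperate during Append, Read by key, deletion, and a key-changing Update can be sketched as follows. MiniIndexedFile is a hypothetical class, not the book's TextIndexedFile: it keeps the data "file" in a string, takes the key as a separate argument instead of calling a Key method, and uses std::map as the index in place of the sorted array of TextIndex.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical miniature of an indexed, entry-sequenced file.
class MiniIndexedFile {
public:
    // Appends the record at the end of the data area and indexes its key.
    bool Append(const std::string& key, const std::string& record) {
        if (index_.count(key)) return false;  // primary keys must be unique
        index_[key] = static_cast<long>(data_.size());
        data_ += record + '\n';
        return true;
    }

    // Reads a record by key through the index.
    bool Read(const std::string& key, std::string& record) const {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        std::size_t start = static_cast<std::size_t>(it->second);
        assert(start < data_.size());
        record = data_.substr(start, data_.find('\n', start) - start);
        return true;
    }

    // Deletion drops only the index entry; the record bytes are not moved.
    bool Remove(const std::string& key) { return index_.erase(key) == 1; }

    // An update that changes the key is handled as a delete followed by an add.
    bool Update(const std::string& oldKey, const std::string& newKey,
                const std::string& record) {
        if (!index_.count(oldKey)) return false;
        if (newKey != oldKey && index_.count(newKey)) return false;
        index_.erase(oldKey);
        return Append(newKey, record);
    }

private:
    std::map<std::string, long> index_;  // key -> byte offset
    std::string data_;                   // entry-sequenced records
};
```

The space of a deleted or replaced record is simply left behind in this sketch; reclaiming it is a separate concern.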
7.4 OO Support for Indexed, E-S Files of Data Objects
Enhancements to TextIndexedFile (1)
• Support other types of keys
  • Restriction: the key type is restricted to string (char*)
  • Relaxation: support a template class SimpleIndex with a parameter for the key type
• Support data object class hierarchies
  • Restriction: every object in a RecordFile must be of the same type
  • Relaxation: the type hierarchy supports virtual pack methods
7.4 OO Support for Indexed, E-S Files of Data Objects
Enhancements to TextIndexedFile (2)
• Support multirecord index files
  • Restriction: the entire index must fit in a single record
  • Relaxation: add protected methods Insert, Delete, and Search to manipulate the arrays of index objects
• Optimization of operations
  • Obvious: use binary search in the Find method
  • Active: add a flag to the index object so that the index record is not written back to the index file when it has not been changed
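The "changed" flag optimization can be sketched as below. FlaggedIndex and its Close method are hypothetical names, the index file is modeled as a plain string, and the one-entry-per-line text format is an assumption made only for this illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical index that remembers whether it differs from its stored copy.
class FlaggedIndex {
public:
    void Insert(const std::string& key, long recAddr) {
        assert(recAddr >= 0);
        entries_[key] = recAddr;
        changed_ = true;  // the stored copy is now out of date
    }

    // Rewrites indexFile only if the in-memory index has changed since the
    // last write. Returns true when a rewrite actually took place.
    bool Close(std::string& indexFile) {
        if (!changed_) return false;
        indexFile.clear();
        for (const auto& e : entries_)
            indexFile += e.first + ' ' + std::to_string(e.second) + '\n';
        changed_ = false;
        return true;
    }

private:
    std::map<std::string, long> entries_;
    bool changed_ = false;
};
```

A session that only reads records therefore closes without touching the index file at all.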
Where are we going?
• Plain stream file
• Persistency ==> buffer support ==> BufferFile
  • incremental approach: deriving BufferFile using various other classes
• Random access ==> index support ==> IndexedFile
  • incremental approach: deriving TextIndexedFile using RecordFile and TextIndex
7.5 Indexes That Are Too Large to Hold in Memory
Too Large Index (1)
• The index must be kept on secondary storage (large linear index)
• Disadvantages
  • binary searching of the index requires several seeks (not substantially faster than binary searching a sorted data file)
  • index rearrangement requires shifting or sorting records on secondary storage
• Alternatives (to be considered later)
  • hashed organization
  • tree-structured index (e.g. B-tree)
7.5 Indexes That Are Too Large to Hold in Memory
Too Large Index (2)
• Advantages over a data file sorted by key, even when the index is on secondary storage
  • binary search can be used, even though the data file has variable-length records
  • sorting and maintaining the index is less expensive than sorting and maintaining the data file
  • if there are pinned records, the keys can be rearranged without moving the data records
7.6 Indexing to Provide Access by Multiple Keys
Index by Multiple Keys (1)
• DB-Schema = (ID-No, Title, Composer, Artist, Label)
• Find the record with ID-No "COL38358" (primary key: ID-No)
• Find all the recordings of "Beethoven" (secondary key: Composer)
• Find all the recordings titled "Violin Concerto" (secondary key: Title)
7.6 Indexing to Provide Access by Multiple Keys
Index by Multiple Keys (2)
• Most people do not want to search only by primary key
• Secondary key
  • can be duplicated
• Secondary key index
  • maps a secondary key to a primary key (figure example: BEETHOVEN --> DG18807)
  • a search by secondary key therefore consults one additional index, the primary key index
7.6 Indexing to Provide Access by Multiple Keys
Secondary Index: Basic Operations (1)
• Record addition
  • similar to adding to the primary index
  • secondary keys are stored in canonical form
    • fixed length (so a key may be truncated)
    • the original name can be obtained from the data file
  • the index can contain duplicate keys
    • entries within the same key group are ordered locally
7.6 Indexing to Provide Access by Multiple Keys
Secondary Index: Basic Operations (2)
• Record deletion (2 cases)
  • the secondary index references the record directly
    • delete the entry from both the primary index and the secondary index
    • rearrange both indexes
  • the secondary index references the primary key
    • delete the entry from the primary index only
    • leave the secondary index reference to the deleted record intact
    • advantage: fast
    • disadvantage: references to deleted records take up space
7.6 Indexing to Provide Access by Multiple Keys
Secondary Index: Basic Operations (3)
• Record updating
  • the primary key index serves as a kind of protective buffer
  • the secondary index references the record directly
    • update every file that contains the record's location
  • the secondary index references the primary key (1)
    • the secondary index is affected only when the primary key or the secondary key changes
(continued)
7.6 Indexing to Provide Access by Multiple Keys
Secondary Index: Basic Operations (4)
• The secondary index references the primary key (2)
  • when the secondary key changes
    • rearrange the secondary key index
  • when the primary key changes
    • update all affected reference fields
    • may require reordering the secondary index
  • when the change is confined to other fields
    • the secondary key index is not affected
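The two-step lookup (secondary key to primary key, then primary key to record address) and the "delete from the primary index only" policy can be sketched together. The type aliases and findBySecondary are names assumed for this illustration; std::set and std::map stand in for the sorted index files.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// (secondary key, primary key) pairs, ordered by secondary key and then by
// primary key within each group of duplicates.
using SecondaryIndex = std::set<std::pair<std::string, std::string>>;
// primary key -> byte offset of the record in the data file
using PrimaryIndex = std::map<std::string, long>;

// Returns the record addresses for every primary key listed under `key`.
// References whose primary key is no longer in the primary index (deleted
// records) are skipped; the secondary index itself is left untouched.
std::vector<long> findBySecondary(const SecondaryIndex& secondary,
                                  const PrimaryIndex& primary,
                                  const std::string& key) {
    assert(!key.empty());
    std::vector<long> addresses;
    for (auto it = secondary.lower_bound({key, ""});
         it != secondary.end() && it->first == key; ++it) {
        auto hit = primary.find(it->second);
        if (hit != primary.end()) addresses.push_back(hit->second);
    }
    return addresses;
}
```

Deleting a record then amounts to erasing one entry from the primary index; the stale secondary reference costs space but is filtered out at lookup time.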
7.7 Retrieval Using Combinations of Secondary Keys
Retrieval of Records
• Types
  • primary key access
  • secondary key access
  • combination of the above
• Combination of keys
  • easy to do with secondary key indexes
  • Boolean operations (AND, OR) on the lists of primary keys they return
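Boolean combination of two secondary key lookups can be sketched with the standard set algorithms, assuming each lookup returns its primary keys in sorted order. The function names matchAll and matchAny are assumptions introduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

using KeyList = std::vector<std::string>;  // sorted list of primary keys

// AND: primary keys that appear in both lists.
KeyList matchAll(const KeyList& a, const KeyList& b) {
    assert(std::is_sorted(a.begin(), a.end()) && std::is_sorted(b.begin(), b.end()));
    KeyList out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// OR: primary keys that appear in either list, without duplicates.
KeyList matchAny(const KeyList& a, const KeyList& b) {
    assert(std::is_sorted(a.begin(), a.end()) && std::is_sorted(b.begin(), b.end()));
    KeyList out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}
```

With the sample data, composer BEETHOVEN gives {ANG3795, DG139201, DG18807, RCA2626} and title "Symphony No. 9" gives {ANG3795, COL31809, DG18807}; matchAll of the two leaves ANG3795 and DG18807, and only those two records need to be read from the data file.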
7.8 Improving the Secondary Index Structure
Inverted Lists (1)
• Inverted list
  • a secondary key leads to a set of one or more primary keys
• Disadvantages of the simple secondary index structure
  • the index must be rearranged every time a record is added
  • the secondary key is repeated in a separate entry for every duplicate
• Solution A: an array of references
• Solution B: a linked list of references
7.8 Improving the Secondary Index Structure
Array of References
Revised composer index
Secondary key          Set of primary key references
BEETHOVEN              ANG3795  DG139201  DG18807  RCA2626
COREA                  WAR23699
DVORAK                 COL31809
PROKOFIEV              LON2312
RIMSKY-KORSAKOV        MER75016
SPRINGSTEEN            COL38358
SWEET HONEY IN THE R   FF245
• no need to rearrange the index when a reference is added to an existing key
• the reference array has a limited size
• internal fragmentation
7.8 Improving the Secondary Index Structure
Inverted Lists (2)
• Guidelines for a better solution
  • no reorganization when adding
  • no limit on the number of duplicate keys
  • no internal fragmentation
• Solution B: linking the list of references
  • each secondary key points to a list of primary key references (figure example: PROKOFIEV --> ANG36193, LON2312)
  • secondary index record: secondary key field + relative record number of the first corresponding primary key reference
7.8 Improving the Secondary Index Structure
Linking List of References (1)
Improved revision of the composer index
Secondary Index file
RRN   Secondary key          First reference (RRN in Label ID List file)
0     BEETHOVEN              3
1     COREA                  2
2     DVORAK                 7
3     PROKOFIEV              10
4     RIMSKY-KORSAKOV        6
5     SPRINGSTEEN            4
6     SWEET HONEY IN THE R   9
Label ID List file
RRN   Primary key   Next reference (RRN, -1 = end of list)
0     LON2312       -1
1     RCA2626       -1
2     WAR23699      -1
3     ANG3795       8
4     COL38358      -1
5     DG18807       1
6     MER75016      -1
7     COL31809      -1
8     DG139201      5
9     FF245         -1
10    ANG36193      0
7.8 Improving the Secondary Index Structure
Linking List of References (2)
• The primary key references are kept in a separate, entry-sequenced file
• Advantages
  • the secondary index is rearranged only when a secondary key is added or changed
  • rearrangement is quick
  • less penalty for keeping the secondary index file on secondary storage (less need for sorting)
  • the Label ID List file does not need to be sorted
  • reusing the space of deleted records is easy
7.8 Improving the Secondary Index Structure
Linking List of References (3)
• Disadvantage
  • references for the same secondary key may not be physically grouped
    • lack of locality
    • following a list could involve a large amount of seeking
  • solution: keep the Label ID List file in memory
    • the same Label ID List file can hold the lists of several secondary index files
    • if it is too large for memory, only part of it can be loaded
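Following and extending such a linked list of references can be sketched as below. ListEntry, collectKeys, addReference, and labelIdList are names assumed for this illustration; labelIdList returns rows 0 to 9 of the Label ID List file as they stand before ANG36193 is added, when the PROKOFIEV list starts at row 0.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row of the Label ID List file: a primary key and the relative record
// number (RRN) of the next reference for the same secondary key (-1 = end).
struct ListEntry {
    std::string primaryKey;
    int next;
};

// Walks the list that starts at RRN `head` and returns its primary keys.
std::vector<std::string> collectKeys(const std::vector<ListEntry>& list, int head) {
    std::vector<std::string> keys;
    for (int rrn = head; rrn != -1; rrn = list[rrn].next) {
        assert(rrn >= 0 && rrn < static_cast<int>(list.size()));
        keys.push_back(list[rrn].primaryKey);
    }
    return keys;
}

// Appends a new reference at the end of the file and links it into the list
// in primary key order. Returns the (possibly new) head RRN, which the caller
// stores back in the secondary index record.
int addReference(std::vector<ListEntry>& list, int head, const std::string& key) {
    int newRrn = static_cast<int>(list.size());
    list.push_back({key, -1});
    int prev = -1;
    int cur = head;
    while (cur != -1 && list[cur].primaryKey < key) {
        prev = cur;
        cur = list[cur].next;
    }
    list[newRrn].next = cur;
    if (prev == -1) return newRrn;  // new entry becomes the head of the list
    list[prev].next = newRrn;
    return head;
}

// Rows 0..9 of the Label ID List file, before ANG36193 is added.
std::vector<ListEntry> labelIdList() {
    return {{"LON2312", -1}, {"RCA2626", -1}, {"WAR23699", -1}, {"ANG3795", 8},
            {"COL38358", -1}, {"DG18807", 1},  {"MER75016", -1}, {"COL31809", -1},
            {"DG139201", 5},  {"FF245", -1}};
}
```

Adding ANG36193 to the PROKOFIEV list appends row 10 and changes only the head pointer in the secondary index; no existing row moves.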
7.9 Selective Indexes
Selective Indexes
• Selective index: an index on a subset of the records
  • contains entries for only part of the file
  • provides a selective view
  • useful when the contents of a file fall into several categories
  • e.g. records with 20 < Age < 30 and $1000 < Salary
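Building a selective index can be sketched as indexing only the records that satisfy a predicate. The Employee record layout and the function buildSelectiveIndex are assumptions made for this illustration; the index maps a key to the relative record number of the record.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical record type for the Age/Salary example.
struct Employee {
    std::string id;
    int age;
    int salary;
};

// Indexes only the records for which `selected` returns true.
// The result maps each selected key to its relative record number.
std::map<std::string, std::size_t> buildSelectiveIndex(
        const std::vector<Employee>& file,
        const std::function<bool(const Employee&)>& selected) {
    assert(static_cast<bool>(selected));  // a predicate must be supplied
    std::map<std::string, std::size_t> index;
    for (std::size_t rrn = 0; rrn < file.size(); ++rrn) {
        if (selected(file[rrn])) index[file[rrn].id] = rrn;
    }
    return index;
}
```

Passing a predicate such as 20 < age && age < 30 && 1000 < salary produces an index that is smaller than a full one and answers queries for that category only.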
7.10 Binding
Index Binding (1)
• When should a key index be bound to the physical address of its associated record?
• Binding at file construction time (tight, in-the-data binding)
  • tight binding gives faster access
  • the usual case for the primary key
  • when a secondary key is also bound at that time
    • retrieval is simpler and faster
    • but reorganization of the data file requires modifying all bound index files
7.10 Binding
Index Binding (2)
• Postponing binding until a record is actually retrieved (retrieval-time binding)
  • minimal reorganization; a safe approach
  • mostly used for secondary keys
• Tight, in-the-data binding is good when
  • the file is static, with little or no change
  • rapid performance during retrieval is required
  • example: a mass-produced, read-only optical disk
Let's Review (1)
7.1 What Is an Index?
7.2 A Simple Index for Entry-Sequenced Files
7.3 Using Template Classes in C++ for Object I/O
7.4 Object-Oriented Support for Indexed, Entry-Sequenced Files of Data Objects
7.5 Indexes That Are Too Large to Hold in Memory
Let's Review (2)
7.6 Indexing to Provide Access by Multiple Keys
7.7 Retrieval Using Combinations of Secondary Keys
7.8 Improving the Secondary Index Structure: Inverted Lists
7.9 Selective Indexes
7.10 Binding