The Development of File Structures

Chin-Chen Chang, Ph.D., Chair Professor, Dept. of Information Engineering and Computer Science, Feng Chia University, Taiwan

Fig.1 Disk organization

Fig.2 Seven records in the file

Fig.3 Index organization SIR can read

Fig.4 Index sequential file

Fig.5 Index non-sequential file

Fig.6 Inserting records in an index non-sequential file (Fig.2); figure labels: Ant, Dinosaur, Kangaroo, Sheep

Fig.7 Index sequential file with reserved overflow area

Fig.8 Index Sequential file with empty positions left among the data records

Fig.9 Difference file

Fig.10 Binary tree

Fig.11 A simple file structure

Fig.12 A simple chained file

Fig.13 Chained file

Fig.14 A multi-list organization

Fig.15 A multi-list organization with cellular chains (Cell 0 to Cell 3)

Fig.16 Inverted list

Fig.17 Inverted list with indirect addressing

Fig.18 A bucket-resolved inverted-list organization (Bucket 0 to Bucket 9)

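1. Introduction
• Definitions
  • Multi-attribute file system: a file system whose records are characterized by more than one attribute.
  • Partial-match queries: queries of the following form: Retrieve all records where $A_{i_1} = a_{i_1}, A_{i_2} = a_{i_2}, \ldots, A_{i_j} = a_{i_j}$, i.e. values are specified for some of the attributes and the remaining attributes are left unspecified.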

Examples 1. Table 1(a)

Table 1(b) ANB = 4

2. Table 2(a)

Table 2(b) ANB = 2

3. Table 3(a)

Table 3(b) ANB = 2.5

Problem
• Given a set of records, arrange the records into buckets in such a way that the average number of buckets to be examined (ANB), over all possible partial-match queries, is minimized.
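The ANB of a given bucket arrangement can be computed by brute force. The sketch below is illustrative only: the function name `average_buckets` is hypothetical, and it assumes that "all possible partial-match queries" means every query that fixes a non-empty, proper subset of the attributes (at least one attribute specified, at least one left open). The slides do not spell this convention out.

```python
from itertools import combinations, product

def average_buckets(buckets, domains):
    """ANB over all queries that fix a non-empty proper subset of attributes.

    buckets: list of buckets, each a list of record tuples.
    domains: one iterable of values per attribute.
    Assumption: every such query is counted once, with equal weight.
    """
    n = len(domains)
    total_hits = 0
    num_queries = 0
    for k in range(1, n):                      # number of specified attributes
        for attrs in combinations(range(n), k):
            for values in product(*(domains[i] for i in attrs)):
                # A bucket must be examined if it holds at least one match.
                total_hits += sum(
                    any(all(rec[i] == v for i, v in zip(attrs, values))
                        for rec in bucket)
                    for bucket in buckets
                )
                num_queries += 1
    return total_hits / num_queries

D = "abcd"
# Arrangement 1: 2 x 2 blocks of the value grid.
grid = [list(product(x, y)) for x in ("ab", "cd") for y in ("ab", "cd")]
# Arrangement 2: one bucket per value of the first attribute.
rows = [[(x, y) for y in D] for x in D]
print(average_buckets(grid, [D, D]))  # 2.0
print(average_buckets(rows, [D, D]))  # 2.5
```

Both arrangements use four buckets of four records, yet the block arrangement needs fewer bucket accesses on average.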

2. The String-homomorphism Hashing (SHH) Method
Example
D1 = D2 = {a, b, c, d} = D
BZ = 4 = 2^2 = z^N
Divide D into D'1 = {a, b} and D'2 = {c, d}.
BK1 : D'1 × D'1 = {(a, a), (a, b), (b, a), (b, b)}
BK2 : D'1 × D'2 = {(a, c), (a, d), (b, c), (b, d)}
BK3 : D'2 × D'1 = {(c, a), (c, b), (d, a), (d, b)}
BK4 : D'2 × D'2 = {(c, c), (c, d), (d, c), (d, d)}
The ANB is the minimum. (Why?)

Theorem [Rivest 1976]
• Conditions
  • (1) The domains D1, D2, …, DN are identical: D1 = D2 = … = DN = D.
  • (2) The bucket size = z^N, where z is an integer and |D| / z = p is an integer.
  • (3) All of the possible records are present.

Theorem [Rivest 1976] (Cont.)
• Procedures
  • (1) Divide D into D'1, D'2, …, D'p, where D'i ∩ D'j = ∅ for i ≠ j and |D'i| = z.
  • (2) Store each set of records D'i1 × D'i2 × … × D'iN into one bucket.
ANB is minimized.
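The two-step procedure can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the name `shh_buckets` is hypothetical, and the domain is split into consecutive blocks of size z, which is only one of many valid ways to partition D.

```python
from itertools import product

def shh_buckets(domain, z, n):
    """Partition `domain` into blocks of size z, then form one bucket per
    N-fold combination of blocks (the Cartesian product of those blocks)."""
    domain = list(domain)
    if len(domain) % z != 0:
        raise ValueError("z must divide |D|")
    # Step (1): consecutive blocks D'1, ..., D'p (an arbitrary choice).
    blocks = [domain[i:i + z] for i in range(0, len(domain), z)]
    # Step (2): one bucket for each choice of N blocks.
    return [list(product(*choice)) for choice in product(blocks, repeat=n)]

buckets = shh_buckets("abcd", z=2, n=2)
for number, bucket in enumerate(buckets, start=1):
    print(f"BK{number}:", bucket)
```

With D = {a, b, c, d}, z = 2 and N = 2 this yields p^N = 4 buckets of z^N = 4 records each.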

Extension [Lin, Lee and Du 1979]
Example
D1 = {a, b, c, d}, D2 = {1, 2, 3, 4}, z = 2 and N = 2, i.e. BZ = z^N = 2^2 = 4
Divide D1 into D11 = {a, b} and D12 = {c, d}.
Divide D2 into D21 = {1, 2} and D22 = {3, 4}.
BK1 : {(a, 1), (a, 2), (b, 1), (b, 2)}
BK2 : {(a, 3), (a, 4), (b, 3), (b, 4)}
BK3 : {(c, 1), (c, 2), (d, 1), (d, 2)}
BK4 : {(c, 3), (c, 4), (d, 3), (d, 4)}
The ANB is still minimized.
※ Note: D1 ≠ D2 but |D1| = |D2|.

3. The Multi-key Hashing (MKH) Method
■ Example
g1(r) = 0 if 18 ≤ r ≤ 19, = 1 if 20 ≤ r ≤ 21; m1 = 2
g2(r) = 0 if 0 ≤ r ≤ 70, = 1 if 71 ≤ r ≤ 100; m2 = 2
g3(r) = 0 if A ≤ r ≤ B, = 1 if C ≤ r ≤ D; m3 = 2
• Steps:
  • (1) Choose a hash function gi : Di → {0, 1, 2, …, mi – 1} for each domain.
  • (2) Associate a bucket with each N-tuple [s1, s2, …, sN], 0 ≤ si ≤ mi – 1.
  • (3) Assign the record R = (a1, a2, …, aN) to Bucket [g1(a1), g2(a2), …, gN(aN)].
• Disadvantage
  • "overflow" problem
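The three steps can be illustrated with the hash functions of the example. The sketch below is a hedged illustration: the helper `bucket_of` and the sample records are hypothetical, and the domain checks simply reject values outside the ranges given above.

```python
from collections import Counter

def g1(r):
    if 18 <= r <= 19:
        return 0
    if 20 <= r <= 21:
        return 1
    raise ValueError(f"{r!r} is outside the first domain")

def g2(r):
    if 0 <= r <= 70:
        return 0
    if 71 <= r <= 100:
        return 1
    raise ValueError(f"{r!r} is outside the second domain")

def g3(r):
    if "A" <= r <= "B":
        return 0
    if "C" <= r <= "D":
        return 1
    raise ValueError(f"{r!r} is outside the third domain")

HASHES = (g1, g2, g3)

def bucket_of(record):
    """Step (3): the bucket address is the tuple of per-attribute hash values."""
    return tuple(g(a) for g, a in zip(HASHES, record))

# Hypothetical records, chosen to be skewed toward one corner of the domains.
records = [(18, 65, "A"), (19, 40, "B"), (18, 70, "A"), (21, 90, "D")]
load = Counter(bucket_of(r) for r in records)
print(load)
```

Three of the four sample records land in bucket (0, 0, 0) while most of the 2 × 2 × 2 = 8 buckets stay empty. Because the hash functions are fixed without looking at the data, a skewed file can fill some buckets beyond capacity, which is the overflow problem named above.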

4. The Multi-dimensional Directory (MDD) Method
■ Example
[Figure: scatter plot of the records, with D1 (skill-code, c1 to c10) on the horizontal axis and D2 (salary, 1000 to 2400) on the vertical axis; m1 = 4, m2 = 3]

• Steps:
  • (1) Divide D1 into D11, D12, …, D1m1, s.t. each subspace D1i1×D2×…×DN contains approximately 1/m1 of the records. These subspaces are the 1st-degree cubes.
  • (2) Divide each D1i1×D2×…×DN into m2 subspaces D1i1×D2i2×D3×…×DN, s.t. each subspace contains approximately 1/(m1·m2) of the records. These are the 2nd-degree cubes. Continue in the same way for the remaining domains until the Nth-degree cubes are generated.
  • (3) Assign each Nth-degree cube to a bucket.
  • (4) Generate the corresponding directory.
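The recursive slicing in steps (1) to (3) can be sketched as follows. This is a simplified illustration under stated assumptions: the names `split_evenly` and `mdd_partition` are hypothetical, the sample records are synthetic, records with equal attribute values may be separated at a cut (a real directory would cut on domain values), and step (4), building the directory itself, is left out.

```python
def split_evenly(items, parts):
    """Split a list into `parts` contiguous chunks whose sizes differ by at most one."""
    quotient, remainder = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + quotient + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def mdd_partition(records, m):
    """Slice along attribute 0 into m[0] groups, each of those along
    attribute 1 into m[1] groups, and so on; return the final cubes."""
    def slice_group(group, dim):
        if dim == len(m):
            return [group]              # an Nth-degree cube, i.e. one bucket
        ordered = sorted(group, key=lambda rec: rec[dim])
        cubes = []
        for chunk in split_evenly(ordered, m[dim]):
            cubes.extend(slice_group(chunk, dim + 1))
        return cubes
    return slice_group(list(records), 0)

# Synthetic (skill-code index, salary) pairs; not the data of the figure.
records = [(i % 10, 1000 + (i * 137) % 1400) for i in range(24)]
cubes = mdd_partition(records, m=(4, 3))
print(len(cubes), [len(c) for c in cubes])
```

Because every cut is placed by record count rather than by a fixed hash function, the 4 × 3 = 12 buckets receive the same number of records, which is how MDD sidesteps the overflow problem of multi-key hashing.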

5. The Multi-key Sorting (MKS) Method
• Sorting
  • single-key sorting
    • a1, a2, …, aL are sorted iff a1 ≤ a2 ≤ … ≤ aL or a1 ≥ a2 ≥ … ≥ aL (or iff $|a_1 - a_2| + |a_2 - a_3| + \ldots + |a_{L-1} - a_L|$ is minimal).
  • multi-key sorting
    • A set of records R1, R2, …, RL is sorted iff $d(R_1, R_2) + d(R_2, R_3) + \ldots + d(R_{L-1}, R_L)$ is minimal.

Hamming distance
• For records R = (a1, a2, …, aN) and R' = (a'1, a'2, …, a'N), the Hamming distance between R and R' is $d(R, R') = \sum_{i=1}^{N} \delta(a_i, a'_i)$, where δ(ai, a'i) = 0 if ai = a'i and δ(ai, a'i) = 1 if ai ≠ a'i.
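The definitions translate directly into code. The sketch below is illustrative: the function names and the sample records are hypothetical, and the exhaustive search over all permutations is only there to make "minimal" concrete for tiny inputs. It is not the practical MKS procedure.

```python
from itertools import permutations

def hamming(r, s):
    """Number of attribute positions in which two records differ."""
    return sum(a != b for a, b in zip(r, s))

def path_cost(records):
    """d(R1, R2) + d(R2, R3) + ... + d(R(L-1), RL) for the given order."""
    return sum(hamming(r, s) for r, s in zip(records, records[1:]))

def multi_key_sort(records):
    """Exhaustive search for an order of minimal total distance.
    Tries all L! orders, so it is usable only for very small L."""
    return min(permutations(records), key=path_cost)

records = [("a", 1), ("b", 2), ("a", 2), ("b", 1)]
best = multi_key_sort(records)
print(path_cost(records), path_cost(best))  # 5 3
```

In the sorted order, consecutive records differ in exactly one attribute, so records that answer the same partial-match query tend to sit next to each other.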

Example 1

Example 2
Advantage: Practical
Disadvantage: Cannot obtain the optimal solution

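6. Optimal Cartesian Product Files
• Problem: minimize $\sum_{\emptyset \neq S \subsetneq \{1, \ldots, N\}} \prod_{i \in S} z_i$, where zi is the subdivision size of the ith domain. For N = 3 the objective is z1+z2+z3+z1z2+z1z3+z2z3.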

Example
• Consider the case d1 = 8, d2 = 4, d3 = 9 and NB = 6. There are two feasible solutions:
  (1) z1 = 8, z2 = 2, z3 = 3
  (2) z1 = 4, z2 = 4, z3 = 3
By (1), z1+z2+z3+z1z2+z1z3+z2z3 = 59.
By (2), z1+z2+z3+z1z2+z1z3+z2z3 = 51.
Therefore, we conclude that (2) is the optimum solution.
In this case, m1 = 8/4 = 2, m2 = 4/4 = 1, m3 = 9/3 = 3.
Divide D1 into D11 and D12, D2 into one subset, and D3 into D31, D32 and D33.
BK1 : D11×D2×D31
BK2 : D11×D2×D32
BK3 : D11×D2×D33
BK4 : D12×D2×D31
BK5 : D12×D2×D32
BK6 : D12×D2×D33
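The example can be checked by enumerating every feasible choice of subdivision sizes. The sketch below is illustrative and rests on stated assumptions: the function names are hypothetical, a choice is taken to be feasible when each zi divides di and the product of the zi equals the bucket size d1·d2·…·dN / NB, and the objective sums the products of the zi over every non-empty proper subset of the domains (which for N = 3 is the six-term expression used above).

```python
from itertools import combinations, product
from math import prod  # available from Python 3.8

def divisors(n):
    return [k for k in range(1, n + 1) if n % k == 0]

def objective(z):
    """Sum of the products of z_i over every non-empty proper subset of indices."""
    n = len(z)
    return sum(prod(z[i] for i in subset)
               for k in range(1, n)
               for subset in combinations(range(n), k))

def feasible(domain_sizes, num_buckets):
    """All (z1, ..., zN) with zi dividing di and z1*...*zN equal to the bucket size."""
    bucket_size, remainder = divmod(prod(domain_sizes), num_buckets)
    if remainder:
        return []
    candidates = product(*(divisors(d) for d in domain_sizes))
    return [z for z in candidates if prod(z) == bucket_size]

solutions = feasible((8, 4, 9), 6)
for z in solutions:
    print(z, objective(z))
best = min(solutions, key=objective)
print("optimum:", best)  # optimum: (4, 4, 3)
```

The enumeration finds exactly the two feasible solutions listed in the example, with objective values 59 and 51.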

Definitions
• 1. A 2-tuple (a1, a2) is called a minimal 2-tuple if, for every other 2-tuple (a'1, a'2) where a1a2 = a'1a'2, a1 + a2 < a'1 + a'2.
• 2. An N-tuple (a1, a2, …, aN) is called a minimal N-tuple of C if a1a2…aN = C and, for 1 ≤ i, j ≤ N, (ai, aj) is a minimal 2-tuple.
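Both definitions can be tested mechanically. The sketch below is a hedged illustration: the function names are hypothetical, the tuples are assumed to consist of positive integers, and a pair is compared only against pairs that differ from it as unordered pairs, so (3, 4) is not counted as "another" 2-tuple for (4, 3).

```python
from itertools import combinations
from math import prod  # available from Python 3.8

def is_minimal_pair(a, b):
    """True if no different factorisation x * y == a * b has x + y <= a + b."""
    p = a * b
    for x in range(1, p + 1):
        if p % x == 0:
            y = p // x
            # Skip the pair itself (in either order); reject on any tie or smaller sum.
            if {x, y} != {a, b} and x + y <= a + b:
                return False
    return True

def is_minimal_tuple(t, c):
    """True if the entries multiply to c and every pair of entries is minimal."""
    return prod(t) == c and all(
        is_minimal_pair(a, b) for a, b in combinations(t, 2)
    )

print(is_minimal_pair(4, 4))              # True
print(is_minimal_pair(8, 2))              # False, since 4 + 4 < 8 + 2
print(is_minimal_tuple((4, 4, 3), 48))    # True
print(is_minimal_tuple((8, 2, 3), 48))    # False
```

A pair is minimal when its two factors are as close to each other as the product allows, which is why (4, 4, 3) passes the test and (8, 2, 3) does not.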

Theorem
• A CPF is an optimal CPF if the records of each bucket are of the form D'1 × D'2 × … × D'N, where D'i ⊆ Di, the size of D'i is zi, and the zi's satisfy:
  (1) z1z2…zN = C.
  (2) di / zi is an integer for every i.
  (3) (z1, z2, …, zN) is the only minimal N-tuple of BZ (the bucket size, i.e. C).
