The Development of File Structures
Chin-Chen Chang, Ph.D., Chair Professor, Dept. of Information Engineering and Computer Science, Feng Chia University, Taiwan
Fig.1 Disk organization
Fig.2 Seven records in the file
Fig.3 Index organization
Fig.4 Index sequential file
Fig.6 Inserting records in an index non-sequential file (Fig.2); figure labels: Ant, Dinosaur, Kangaroo, Sheep
Fig.8 Index sequential file with empty positions left among the data records
Fig.15 A multi-list organization with cellular chains (Cell 0 to Cell 3)
Fig.18 A bucket-resolved inverted-list with organization (Bucket 0 to Bucket 9)
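1. Introduction
• Definitions
• Multi-attribute file system
• A file system whose records are characterized by more than one attribute.
• Partial-match queries
• Queries of the following form: Retrieve all records where Ai1 = ai1, Ai2 = ai2, …, Aij = aij, with j < N. That is, values are specified for some of the N attributes and the remaining attributes are left unspecified.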
Examples 1. Table 1(a)
Table 1(b) ANB = 4
2. Table 2(a)
Table 2(b) ANB = 2
3. Table 3(a)
Table 3(b) ANB = 2.5
Problem
• Given a set of records, arrange the records so that the average number of buckets examined (ANB), taken over all possible partial-match queries, is minimized.
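The objective can be made concrete with a small brute-force sketch. It assumes that a partial-match query fixes at least one attribute and leaves at least one unspecified, and that all such queries count equally. The names `average_buckets` and `row_layout` are illustrative and do not come from the slides.

```python
from itertools import combinations, product

def average_buckets(layout, domains):
    """Average number of buckets (ANB) examined over all partial-match queries.

    layout  : dict mapping each record (a tuple) to a bucket id
    domains : list of the attribute domains, one per attribute
    Assumption: a query fixes a nonempty proper subset of the attributes.
    """
    n = len(domains)
    total_buckets = 0
    total_queries = 0
    for k in range(1, n):                          # number of specified attributes
        for attrs in combinations(range(n), k):    # which attributes are specified
            for values in product(*(domains[i] for i in attrs)):
                hit = {bucket for rec, bucket in layout.items()
                       if all(rec[i] == v for i, v in zip(attrs, values))}
                total_buckets += len(hit)
                total_queries += 1
    return total_buckets / total_queries

# Toy file: all 16 records over D = {a, b, c, d}, one bucket per first-attribute value.
D = "abcd"
row_layout = {(x, y): D.index(x) for x in D for y in D}
print(average_buckets(row_layout, [D, D]))  # 2.5
```

In this toy layout a query on the first attribute touches one bucket and a query on the second touches all four, which gives (4·1 + 4·4) / 8 = 2.5.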
2. The String-homomorphism Hashing (SHH) Method
Example
D1 = D2 = {a, b, c, d} = D, BZ = 4 = 2^2 = z^N
Divide D into D'1 = {a, b} and D'2 = {c, d}.
BK1 : D'1 × D'1 = {(a, a), (a, b), (b, a), (b, b)}
BK2 : D'1 × D'2 = {(a, c), (a, d), (b, c), (b, d)}
BK3 : D'2 × D'1 = {(c, a), (c, b), (d, a), (d, b)}
BK4 : D'2 × D'2 = {(c, c), (c, d), (d, c), (d, d)}
The ANB is the minimum. (Why?)
Theorem [Rivest 1976]
• Conditions
• (1) The domains D1, D2, …, DN are identical: D1 = D2 = … = DN = D.
• (2) The bucket size is z^N, where z is an integer and |D| / z = p is an integer.
• (3) All of the possible records are present.
Theorem [Rivest 1976] (Cont.)
• Procedures
• (1) Divide D into D'1, D'2, …, D'p, where the D'i are pairwise disjoint and |D'i| = z.
• (2) Store each set of records D'i1 × D'i2 × … × D'iN into one bucket.
ANB is minimized.
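The two-step procedure can be sketched as follows. This is a minimal illustration, assuming the blocks D'i are taken as consecutive slices of the domain (any partition into blocks of size z would satisfy the conditions); the function name `shh_buckets` is hypothetical.

```python
from itertools import product

def shh_buckets(domain, z, n):
    """Partition `domain` into blocks of size z and form one bucket per
    n-fold Cartesian product of blocks (requires that z divides len(domain))."""
    if len(domain) % z != 0:
        raise ValueError("z must divide |D|")
    # Step (1): D'_1, ..., D'_p as consecutive slices (an arbitrary choice).
    blocks = [domain[i:i + z] for i in range(0, len(domain), z)]
    buckets = []
    # Step (2): one bucket for every D'_i1 x D'_i2 x ... x D'_iN.
    for combo in product(blocks, repeat=n):
        buckets.append(set(product(*combo)))
    return buckets

buckets = shh_buckets(["a", "b", "c", "d"], z=2, n=2)
for k, bk in enumerate(buckets, start=1):
    print(f"BK{k}:", sorted(bk))
```

With D = {a, b, c, d}, z = 2 and N = 2 this yields the four buckets of the SHH example, each holding z^N = 4 records.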
Extension [Lin, Lee and Du 1979]
Example
D1 = {a, b, c, d}, D2 = {1, 2, 3, 4}, z = 2 and N = 2, i.e. BZ = z^N = 2^2 = 4
Divide D1 into D11 = {a, b} and D12 = {c, d}.
Divide D2 into D21 = {1, 2} and D22 = {3, 4}.
BK1 : {(a, 1), (a, 2), (b, 1), (b, 2)}
BK2 : {(a, 3), (a, 4), (b, 3), (b, 4)}
BK3 : {(c, 1), (c, 2), (d, 1), (d, 2)}
BK4 : {(c, 3), (c, 4), (d, 3), (d, 4)}
The ANB is still minimized.
※ Note: D1 ≠ D2 but |D1| = |D2|.
3. The Multi-key Hashing (MKH) Method ■ Example
g1(r) = 0 if 18 ≤ r ≤ 19, = 1 if 20 ≤ r ≤ 21; m1 = 2
g2(r) = 0 if 0 ≤ r ≤ 70, = 1 if 71 ≤ r ≤ 100; m2 = 2
g3(r) = 0 if A ≤ r ≤ B, = 1 if C ≤ r ≤ D; m3 = 2
• Steps:
• (1) Choose a hash function gi : Di → {0, 1, 2, …, mi – 1}
• (2) Associate with each N-tuple [s1, s2, …, sN] a bucket, 0 ≤ si ≤ mi – 1.
• (3) Assign the record R = (a1, a2, …, aN) into Bucket [g1(a1), g2(a2), …, gN(aN)]
• Disadvantage
• "Overflow" problem
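The steps above can be sketched with the three hash functions of the example. The helper names `bucket_of` and `buckets_for_query`, the sample record, and the use of `None` for an unspecified attribute are assumptions made for illustration; inputs are assumed to lie inside the stated domains.

```python
from itertools import product

# Hash functions from the example; each maps its domain onto {0, 1}.
def g1(r):            # first attribute, values 18..21
    return 0 if 18 <= r <= 19 else 1

def g2(r):            # second attribute, values 0..100
    return 0 if 0 <= r <= 70 else 1

def g3(r):            # third attribute, letters A..D
    return 0 if "A" <= r <= "B" else 1

HASHES = (g1, g2, g3)
M = (2, 2, 2)         # m1, m2, m3

def bucket_of(record):
    """Step (3): the bucket address is the tuple of per-attribute hash values."""
    return tuple(g(a) for g, a in zip(HASHES, record))

def buckets_for_query(query):
    """Buckets a partial-match query must examine; None marks an unspecified attribute."""
    choices = [range(m) if a is None else [g(a)]
               for g, m, a in zip(HASHES, M, query)]
    return list(product(*choices))

print(bucket_of((19, 85, "C")))              # (0, 1, 1)
print(buckets_for_query((19, None, "C")))    # [(0, 0, 1), (0, 1, 1)]
```

Nothing in this scheme limits how many records hash to the same address, which is where the overflow problem comes from.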
4. The Multi-dimensional Directory (MDD) Method
■ Example
[Figure: scatter plot of the records (╳) over D1 (skill-code, c1 to c10) and D2 (salary, 1000 to 2400); m1 = 4, m2 = 3]
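• Steps:
• (1) Divide D1 into D11, D12, …, D1m1 such that each subspace D1i1 × D2 × … × DN contains approximately 1/m1 of the records. These subspaces are the 1st-degree cubes.
• (2) Divide each D1i1 × D2 × … × DN into D1i1 × D2i2 × D3 × … × DN, 1 ≤ i2 ≤ m2, such that each subspace contains approximately 1/(m1·m2) of the records (the 2nd-degree cubes). Continue in the same way with the remaining domains until the Nth-degree cubes are generated.
• (3) Assign each Nth-degree cube to a bucket.
• (4) Generate the corresponding directory.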
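A minimal sketch of these steps, assuming the cut points of each domain are taken at equal-count positions of the sorted attribute values. The function names, the nested-dictionary directory, and the 24 sample records are all hypothetical; only m1 = 4 and m2 = 3 come from the example.

```python
import bisect

def cut_points(values, parts):
    """Choose parts-1 boundary values so each range holds roughly len(values)/parts items."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    return [ordered[(i * n) // parts] for i in range(1, parts)]

def build_directory(records, m, level=0):
    """Recursively split on attribute `level` into m[level] value ranges."""
    if level == len(m):
        return records                      # an Nth-degree cube, stored as one bucket
    cuts = cut_points([r[level] for r in records], m[level])
    slabs = [[] for _ in range(m[level])]
    for r in records:
        slabs[bisect.bisect_right(cuts, r[level])].append(r)
    return {"cuts": cuts,
            "children": [build_directory(s, m, level + 1) for s in slabs]}

def locate(directory, record):
    """Follow the directory from the top level down to the bucket for `record`."""
    node, level = directory, 0
    while isinstance(node, dict):
        node = node["children"][bisect.bisect_right(node["cuts"], record[level])]
        level += 1
    return node

# Hypothetical data: (skill-code number, salary) pairs.
records = [(i % 10 + 1, 1000 + 50 * i) for i in range(24)]
directory = build_directory(records, m=(4, 3))
sizes = [len(b) for slab in directory["children"] for b in slab["children"]]
print(sizes)
```

Because the cuts are value boundaries, records with the same attribute value always fall in the same subdivision, so bucket sizes are only approximately equal when values repeat.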
5. The Multi-key Sorting (MKS) Method
• Sorting
• single-key sorting
• a1, a2, …, aL are sorted iff a1 ≤ a2 ≤ … ≤ aL or a1 ≥ a2 ≥ … ≥ aL (or iff |a1 − a2| + |a2 − a3| + … + |a(L−1) − aL| is minimal).
• multi-key sorting
• A set of records R1, R2, …, RL is sorted iff d(R1, R2) + d(R2, R3) + … + d(R(L−1), RL) is minimal.
Hamming distance
• For records R = (a1, a2, …, aN) and R' = (a'1, a'2, …, a'N), the Hamming distance between R and R' is d(R, R') = δ(a1, a'1) + δ(a2, a'2) + … + δ(aN, a'N), where δ(ai, a'i) = 0 if ai = a'i and δ(ai, a'i) = 1 if ai ≠ a'i.
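The distance and the sorting criterion can be illustrated with a short sketch. The exhaustive search in `best_order` is not the MKS method itself, only a direct reading of the definition of "sorted" that is feasible for a handful of records; the sample records are hypothetical.

```python
from itertools import permutations

def hamming(r, s):
    """Number of attribute positions in which records r and s differ."""
    return sum(1 for a, b in zip(r, s) if a != b)

def total_distance(seq):
    """d(R1, R2) + d(R2, R3) + ... + d(R(L-1), RL) for a sequence of records."""
    return sum(hamming(seq[i], seq[i + 1]) for i in range(len(seq) - 1))

def best_order(records):
    """Exhaustive search for a minimum-distance ordering (tiny inputs only)."""
    return min(permutations(records), key=total_distance)

records = [("a", 1), ("b", 2), ("a", 2), ("b", 1)]
print(total_distance(records))            # 5
order = best_order(records)
print(order, total_distance(order))
```

The search examines all L! orderings, so a practical method has to settle for a heuristic ordering rather than the exact minimum.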
Example 2
• Advantage: practical
• Disadvantage: cannot obtain the optimal solution
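6. Optimal Cartesian Product Files
• Problem: minimize the sum of the products zi1 zi2 … zik taken over all nonempty proper subsets {i1, i2, …, ik} of {1, 2, …, N} (for N = 3: z1 + z2 + z3 + z1z2 + z1z3 + z2z3), where zi is the subdivision size of the ith domain.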
Example
• Consider the case d1 = 8, d2 = 4, d3 = 9 and NB = 6. There are two feasible solutions:
(1) z1 = 8, z2 = 2, z3 = 3
(2) z1 = 4, z2 = 4, z3 = 3
By (1), z1 + z2 + z3 + z1z2 + z1z3 + z2z3 = 59.
By (2), z1 + z2 + z3 + z1z2 + z1z3 + z2z3 = 51.
Therefore, we conclude that (2) is the optimum solution. In this case, m1 = 8/4 = 2, m2 = 4/4 = 1, m3 = 9/3 = 3.
Divide D1 into D11 and D12, D2 into one subset, and D3 into D31, D32 and D33.
BK1 : D11 × D2 × D31
BK2 : D11 × D2 × D32
BK3 : D11 × D2 × D33
BK4 : D12 × D2 × D31
BK5 : D12 × D2 × D32
BK6 : D12 × D2 × D33
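The example can be checked by enumeration. The sketch below assumes a feasible solution is any (z1, z2, z3) in which each zi divides di and the product of the mi = di / zi equals NB, and it takes the cost to be the expression evaluated in the example; the function names are illustrative.

```python
from itertools import combinations, product
from math import prod

def divisors(d):
    return [z for z in range(1, d + 1) if d % z == 0]

def cost(z):
    """Sum of the products of the z_i over all nonempty proper subsets of attributes."""
    n = len(z)
    return sum(prod(z[i] for i in subset)
               for k in range(1, n)
               for subset in combinations(range(n), k))

def feasible(d, nb):
    """All (z1, ..., zN) with z_i dividing d_i and prod(d_i / z_i) == nb."""
    return [z for z in product(*(divisors(di) for di in d))
            if prod(di // zi for di, zi in zip(d, z)) == nb]

d, nb = (8, 4, 9), 6
for z in feasible(d, nb):
    print(z, cost(z))          # (4, 4, 3) 51, then (8, 2, 3) 59
best = min(feasible(d, nb), key=cost)
print("optimum:", best)        # optimum: (4, 4, 3)
```

The enumeration finds exactly the two feasible solutions listed in the example and selects the one with cost 51.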
Definitions
• 1. A 2-tuple (a1, a2) is called a minimal 2-tuple if, for every other 2-tuple (a'1, a'2) where a1a2 = a'1a'2, a1 + a2 < a'1 + a'2.
• 2. An N-tuple (a1, a2, …, aN) is called a minimal N-tuple of C if a1a2…aN = C and, for 1 ≤ i, j ≤ N, (ai, aj) is a minimal 2-tuple.
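Both definitions can be tested mechanically. The sketch below rests on two assumptions that the slide does not state: the tuple entries are positive integers, and 2-tuples are compared as unordered pairs, so that (a2, a1) does not count as an "other" tuple. The function names are hypothetical.

```python
from itertools import combinations
from math import prod

def is_minimal_pair(a1, a2):
    """True if no factorisation x * y == a1 * a2 into positive integers has a
    smaller sum than a1 + a2 (pairs compared as unordered; an assumption)."""
    c = a1 * a2
    return all(x + c // x >= a1 + a2
               for x in range(1, c + 1) if c % x == 0)

def is_minimal_tuple(t, c):
    """True if the entries of t multiply to c and every pair of entries is minimal."""
    return prod(t) == c and all(is_minimal_pair(a, b)
                                for a, b in combinations(t, 2))

print(is_minimal_pair(4, 4))             # True:  4 + 4 = 8 beats 2 + 8 and 1 + 16
print(is_minimal_pair(8, 2))             # False: 4 + 4 = 8 is smaller than 8 + 2
print(is_minimal_tuple((4, 4, 3), 48))   # True
print(is_minimal_tuple((8, 2, 3), 48))   # False
```

Two distinct unordered pairs with the same product can never have the same sum, so the non-strict comparison in the code agrees with the strict inequality in the definition.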
Theorem
• A CPF is an optimal CPF if the records of each bucket are of the form D1i1 × D2i2 × … × DNiN, where each subdivision of the ith domain has size zi and the zi's satisfy:
(1) z1z2…zN = C;
(2) di / zi is an integer;
(3) (z1, z2, …, zN) is the only minimal N-tuple of BZ.