Discovering Functional Dependencies in Relational Databases Using Data Mining Techniques

Discovering Functional Dependencies in Relational Databases Using Data Mining Techniques AbrarFawazAlAbed-AlHaq Kent State University aalabeda@kent.edu October 28, 2011

Outline • Introduction. • Problem Statement. • Data Mining Techniques. • Discovering Functional Dependencies: Example. • Experimental Results. • Conclusion. • References.

Introduction • Nowadays there is a fast growing amount of data that are collected and stored in large databases. As a result, the databases may contain redundant or inconsistent data. • An important concept in relational schema design is that of a functional dependency. Functional dependencies (FD) plays key role in the design of relational databases, FD is a property of the semantic or meaning of the attributes. It helps in simplifying the structure of databases. Discovering functional dependencies (FDs) from an existing relational instance is an important technique in data mining and database design.

Introduction (cont.) • Functional dependencies are relationships between attributes of a database relation; a functional dependency states that the value of an attribute is uniquely determined by the value of some other attributes.

Problem Statement • The problem is to find all functional dependencies among attribute in a relation database. The early methods for discovering functional dependencies is based on repeatedly sorting and comparing tuples to determine whether or not these tuples meet FD definition, so this approach does not utilize the discovered FDs as knowledge to obtain new knowledge. As a result this approach is highly sensitive to the number of tuples and attributes, it is not practical for large database. Using FD_Mine and TANE algorithms, by using these algorithms, there is no need to sort on any attribute or compare any value.

Data Mining Techniques • Data mining is the process of producing useful knowledge and information, it uses an analysis tools to discover patterns and a relationships in large data that could be use. We can benefit from data mining algorithm for discovering functional dependencies in large databases to get useful information. Data mining have several techniques that we can use it for discovering functional dependencies such as Apriori algorithm. • We have two algorithms here: • FD_Mine • TANE

The FD_Mine Algorithm • The mechanism of the FD_Mine is as follow. It uses a level-wise search, where results from level k are used to explore level k +1. First, at level 1, all FDs X →Y where X and Y are single attributes are found and stored in FD_SET F1. The set of candidates that are considered at this level is denoted L1. F1 andL1 are used to generate the candidates Xi Xj of L2. At level 2, all FDs of the form Xi Xj → Y are found and stored in FD_SET F2, F1,F2, L1 and L2 are used to generate the candidates of L3, and so on, until the candidates at level Ln-1 have been checked or no candidates remain.

The TANE Algorithm • TANE searches the set containment lattice in a levelwise manner. A level LI is the collection of attributes sets of size I. TANE starts with L1 = {{A} | A R}, and computes L2 from L1, L3 from L2 and so on according to the information obtained during the algorithm. To computing partitions; in the beginning, partitions with respect to the singleton attribute sets are computed straight from relation r. A partition ∏{A}, where ∏denoted partition, is computed from the column r[A] as follows. First, the values of the column are replaced with integers 1, 2, 3… so the same values are replaced by same integers and different values with different integers. Then the value t[A] is the identifier of the equivalence class [t]{A} of ∏{A}, and ∏ {A} is then easy to construct.

Discovering functional dependencies: Example • Suppose that FD_Mine is applied to database D, as shown in Table, with R = {A, B, C, D, E}.

Discovering functional dependencies: Example • Next table Summarizes the actions of FD_Mine. In iteration 1, since |ПA| = |ПAD| = 2, Closure’(A) is set to D, and A→D is deduced. In same way, D→A is discovered, so the equivalence A↔D is obtained. As a result, we only need to combine A, B, C, and E to generate the next level candidates {AB, AC, AE, BC, BE, CE}. At the same time, the nontrivial closure of each generated candidate is computed. For example, the closure’(AB)=closure’(A) U closure’(B) = {D} U φ = {D}. In iteration 2, for candidate AB, only AB→C and AB→E need to be checked, because R–{A, B}–Closure’(AB)={A, B, C, D, E}–{A, B}–{D}={C, E}. Since |ПAB| = |ПABE| = 6, then AB→E is obtained. In the same way, at this level, BE→A and CE→A are also discovered, so the equivalence AB↔BE is obtained. As a result, we only need to combine AB, AC, AE, BC, and CE to form the level 3 candidates, which are {ABC, ACE}. Since CE→A, ACE is pruned by pruning rule 4. Since AB→E, then ABC→E. Since A↔D, then ABC→D, so ABC is a key, and ABC is also pruned by pruning rule 2. No other candidate remains, so the algorithm halts.

Discovering functional dependencies: Example

Discovering functional dependencies: Example • The result after removal is shown in table (b). (a) Before (b) After

Outline • Introduction. • Problem Statement. • Data Mining Techniques. • Discovering functional dependencies: Example. • Experimental Results. • Conclusion. • References.

Experimental Results • FD_Mine was applied to fifteen datasets, obtained from the UCI Machine Learning Repository and the results were compared to TANE. TANE was selected for comparison because it establishes the theoretical framework for the problem. For the dataset given in Table, Figure (a) shows the semi-lattice for FD_Mine, and Figure (b) shows that for TANE. Each node represents a combination of attributes. If an edge is shown between nodes X and XY, then X → Y needs to be checked. Hence, the number of edges is the number of FDs that need to be checked. Both semi-lattices shown in Figure(a) have fewer edges than the lattice. In addition, the semi-lattice for FD_Mine has fewer edges than that for TANE.

Experimental Results Figure ( a ) FD_Mine Figure ( b ) TANE Yao, H., Hamilton, H., and Butz, C. (2002), FD_Mine: Discovering Functional Dependencies in a Database Using Equivalences, Canada.

Experimental Results • Table 5.1 compares the number of FDs that are checked on data by FD_Mine and TANE for 15 UCI datasets. Figure 5.2 shows more detailed results for the Imports-85 dataset. At levels 1 through 5, both algorithms check approximately the same number of FDs, but at levels 6 through 11, FD_Mine checks fewer FDs than TANE, because it prunes more unnecessary candidates than TANE by using the equivalences and FDs discovered at previous levels. For more results.

Experimental Results

Conclusion • Discovering Functional Dependency (FD) is employed to extract the Functional Dependency from a datasets. This addresses two issues; one related to the discovering process of the FDs from the datasets and the second issues is to measure the discovering Functional Dependency (FDs) algorithm. • The empirical comparison on 15 UCI datasets between FD_Mine and TANE shows FD_Mine examined fewer FDs than TANE, because it prunes the unnecessary candidates than TANE by using the equivalences and FDs discovered at previous levels. The result based on the experiments on 15 UCI datasets show that the FD_Mine algorithm can prune more candidates than TANE algorithm and it also show that FD_Mine can discovering FDs more than TANE algorithm.

References [1] Yao, H., Hamilton, H., and Butz, C. (2002), FD_Mine: Discovering Functional Dependencies in a Database Using Equivalences, Canada. [2] Wyss, C., Giannella, C., and Robertson, E. (2001), FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances, U.S.A. [3] Huhtala, Y., Karkkainen, J., Porkka, P., and Toivonen, H., (1999), TANE: An Efficient Algorithm for discovering Functional and Approximate Dependencies, Computing Journal, V.42, No.20, pp.100-107. [4] Elmasri, R. and Navathe, S. (2004), Fundamentals of Database Systems, Addison Wesley, Fourth edition. [5] Mannila, H. (2000), Theoretical Frameworks for Data Mining, SIGKDD Explorations, V.1, No.2, pp.30-32. [6] Huhtala, Y., Karkkainen, J., Porkka, P., and Toivonen, H. (2000), Efficient Discovery of Functional and Approximate Dependencies Using Partitions, University of Helsinki, Finland. Thank You

Discovering Functional Dependencies in Relational Databases Using Data Mining Techniques