240 likes | 268 Views
GUI implementation for Supervised and Unsupervised SUBDUE System. Introduction. Data Mining is a very vast field, which is rapidly developing today. It is the use of artificial intelligence to find common, interesting, or previously unknown patterns in large databases.
E N D
GUI implementation for Supervised and Unsupervised SUBDUE System
Introduction • Data Mining is a very vast field, which is rapidly developing today. It is the use of artificial intelligence to find common, interesting, or previously unknown patterns in large databases. • One method for discovering knowledge in structural data is the identification of common substructures (concepts represented as graphs) within the data. 2
Introduction (Contd..) • The SUBDUE system, developed by Cook and Holder [Cook and Holder, 1999] performs data mining on databases represented as graphs, i.e. it discovers interesting substructures in structural data. • This project deals with the conversion of a textual representation of the data into a graphical visualization of the data The input data given to the program is the outputs provided by the existing SUBDUE. 3
The Subdue System • The SUBDUE system is a data-mining tool that discovers interesting substructures in structural data • By compressing previously discovered substructures in the data multiple passes of SUBDUE produce a hierarchical description of the structural regularities in the data. 4
Unsupervised Subdue Example of the input graph to the unsupervised version of the subdue and the discovered substructure. 5
Unsupervised (Contd..) • The substructure discovery algorithm used by SUBDUE is beam search. A substructure consists of a definition and a set of instances. The substructure definition is a connected set of vertices and edges that define a SUBGRAPH within G. An instance of a substructure is a SUBGRAPH of G that matches, graph theoretically, to the definition. The algorithm for the discovery is given below. 6
Unsupervised Contd.) • SUBDUE (Graph, BeamWidth, MaxBest, MaxSubSize, Limit) • ParentList={} • ChildList={} • BestList={} • ProcessedSubs=0 • Create a substructure from each unique vertex label and its single-vertex instances; insert the resulting substructures in ParentList • While ProcessedSubs < = Limit and ParentList is not empty do • While ParentList is not empty do • Parent = RemoveHead (ParentList) • Extend each instance of Parent in all possible ways • Group the extended instances into Child substructures • For each Child do • If SizeOf (Child) <= MaxSubSize then • Evaluate the Child 7
Unsupervised Contd.) • Insert Child in ChildList in order by value • If Length (ChildList)>BeamWidth then • Destroy the substructure at the end of the ChildList • ProcessedSubs = ProcessedSubs + 1 • Insert Parent in BestList in order by value • If Length (BestList) > MaxBest then • Destroy the substructure at the end of the BestList • Switch ParentList and ChildList • return BestList 8
Supervised • SUBDUE performs supervised graph based relational concept learning. The SUBDUE concept learner accepts both a positive graph and a negative graph , and evaluates substructures based on their compression of the positive graph and lack of compression on the negative graph. 9
Supervised (Contd.) • In the figure above each object has a shape and is related to other objects using the binary relation on and shape. The discovered substructure is as shown in the figure. As seen above the SUBDUE discovered substructure is from the positive example and not from the negative example. For this example the best substructure, which gives the maximum compression, is triangle on a square. 11
Subdue - Chemical Compounds • A DNA sequence can be represented as a very simple linear graph, and higher-level relationships between different parts of a sequence can be mapped to additional edges in graph. • SUBDUE system discovers patterns in the input graph in polynomial time. • SUBDUE system is capable of discovering known patterns in the DNA sequence of yeast,as well as patterns in yeast DNA that are known to be important in other organism, but which have not yet been shown to play a role in yeast. 12
Subdue - Chemical Compounds • A DNA sequence can be represented as a very simple linear graph, and higher-level relationships between different parts of a sequence can be mapped to additional edges in graph. • SUBDUE system discovers patterns in the input graph in polynomial time. • SUBDUE system is capable of discovering known patterns in the DNA sequence of yeast,as well as patterns in yeast DNA that are known to be important in other organism, but which have not yet been shown to play a role in yeast. 13
Subdue - Chemical Compounds • The figure shown above is the backbone representation which gives more meaningful graphs then the linear representation. • This representation separated the base names (A, C, T, G) from the vertices representing themselves. • The backbone representation mimics the actual chemical structure of the DNA molecule, in which the DNA bases are connected by deoxyribose sugars to a linear phosphate backbone. 14
GUI Design • Requirements • The requirements of the GUI are as follows. • File Dialog Boxes should be added for better user access to the input files. • The entire visual representation of the graphs needs to be shown on the screen. Sometimes these representations exceed the length of the screen. To accommodate these large graphs, scrollbars need to be incorporated into the design. 15
GUI Design • User interfaces must be provided so that the user can interact with the GUI for displaying the results of each and every iteration. So a button called “ Next Iteration”, which activates the display of substructures on the screen, needs to be incorporated. • A Button called the “Compress” button should be provided to the GUI. This button enables the user to see the compressed graph. 16
GUI Design • For the supervised version of SUBDUE, both negative and positive graphs need to be displayed. • The vertices of the graph should display their labels inside the vertices. • Since directed edges are used, the arrows with appropriate directions should be displayed. • The language to be used for implementation should be portable and be able to run from a browser. It should also have good GUI components. So JDK 1.2 was used to implement the program. 17
Implemenation • Unsupervised SUBDUE • Unsupervised SUBDUE GUI requires two input files. A position file which determines the graph position and the output file from the SUBDUE which is parsed as per the requirements of the GUI by the conversion program. • The driver class of the applet is the unsupervised class. This class initializes the applet. When user clicks the next iteration button canvas2 class is invoked and best substructures found in that iteration is displayed. These substructures are arranged by their MDL value. 18
Implemenation • Unsupervised SUBDUE • When user clicks the compress button then canvas1 class is invoked. This class compresses the input graph by replacing the instances of the best substructure of the iteration by single vertices. The compressed graph will be further compressed when the using the results of that iteration.when-compressed button is clicked. The flow diagrams are as shown below. 19
Implemenation • Supervised GUI Implemenation • Supervised SUBDUE GUI requires three input files.Two position files which determines the positive graph position and the negative graph position and the output file from the SUBDUE which is parsed as per the requirements of the GUI by the conversion program. • The driver class of the applet is the supervised class. This class initializes the applet. When user clicks the next iteration button canvas2 class is invoked and best substructures found in that iteration is displayed. These substructures are arranged by their MDL value. 20
Implemenation • Supervised GUI Implemenation • When user clicks the compress button then canvas1 class is invoked. This class compresses the input graph by replacing the instances of the best substructure of the iteration by single vertices. The compressed graph will be further compressed when the using the results of that iteration.when-compressed button is clicked. The flow diagrams are as shown below. 21
Implemenation • Parser for the GUI input. • The Conversion program parses the output from the subdue to make it compatible to the GUI program. This can parse both supervised and unsupervised GUI.The conversion program takes in longer phrases and replaces it by shorter ones It eliminates blank lines and arranges the output in the format required for the GUI input. • Each line starts with a small phrase for E.g. It(iteration number ), C(for compression), v(vertices),e(edges),Val(Value), It(instances). 22
Implementation • Parser for the GUI input. • The phrase “Best substructures” indicates the starting of the iteration and the phrase “Graph is compressed using best substructure.” Indicates a compressed graph information 23