March 19, 2004 Duke University D2K – Data To Knowledge
Outline
• Overview of Data Mining
• Overview of D2K Functionality
• D2K Toolkit
  • MAIDS – Mining Streaming Data
• D2K Driven Application
  • ThemeWeaver – Mining Text Data
  • MAEViz – Visualizing Earthquake Damage Analysis
• D2K Streamline (SL)
  • EMO – Finding Optimal Decisions
• D2K Web Service
  • Phylomat – Finding Motifs in Sequences
ALG Mission
The specific mission of the Automated Learning Group is:
• To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making
• To work closely with industrial, government, and academic partners to explore new application areas for such methods
• To transfer the resulting software technology into real-world applications
Overview of Knowledge Discovery
What Is It?
Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
• The understandable patterns are used to:
  • Make predictions about or classifications of new data
  • Explain existing data
  • Summarize the contents of a large database to support decision making
  • Create graphical data visualizations to aid humans in discovering complex patterns
Overview of Knowledge Discovery
Why Do We Need Data Mining?
• Data volumes are too large for classical analysis approaches:
  • Large number of records (10^8 – 10^12 bytes)
  • High-dimensional data (10^2 – 10^4 attributes)
  • How do you explore millions of records and tens, hundreds, or thousands of fields, and find patterns?
• As databases grow, using traditional query languages for the decision support process becomes infeasible
• Many queries of interest are difficult to state in a query language (the query formulation problem):
  • “Find all cases of fraud”
  • “Find all individuals likely to buy a Ford Explorer”
  • “Find all documents that are similar to this customer’s problem”
Overview of Knowledge Discovery Knowledge Discovery Process
Overview of Knowledge Discovery
Required Effort for each KDD Step
[Figure: bar chart of Effort (%), on a 0 to 60 scale, for each step: Objectives Determination, Data Preparation, Data Mining, Interpretation/Evaluation]
Arrows indicate the direction we want the effort to go
Overview of Knowledge Discovery
Three Primary Paradigms
• Predictive Modeling – supervised learning approach where classification or prediction of one of the attributes is desired
  • Classification is the prediction of predefined classes
    • Naive Bayesian, Decision Trees, and Neural Networks
  • Regression is the prediction of continuous data
    • Neural Networks and Decision (Regression) Trees
• Discovery – unsupervised learning approach for exploratory data analysis
  • Association Rules and Link Analysis
  • Clustering and Self Organizing Maps
• Deviation Detection – identifying outliers in the data
  • Visualization
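To make the predictive modeling paradigm concrete, here is a minimal sketch of a Naive Bayesian classifier over categorical attributes. It is not D2K code: the function names, the add-one smoothing, and the toy weather data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, label)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    values_seen = defaultdict(set)  # distinct values observed per attribute
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def predict(model, row):
    """Return the class with the highest log-posterior for one record."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Add-one smoothing so unseen values do not zero out the product.
            score += math.log((count + 1) / (n + len(values_seen[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("rainy", "cool")))
print(predict(model, ("sunny", "hot")))
```

The predefined classes here are "yes" and "no"; a regression model would instead return a continuous value.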
Importance of Data Mining Framework
• Provides capability to build custom applications
• Provides access to data management tools
• Contains data mining algorithms for prediction and discovery
• Provides data transformations for standard operations
• Supports an extensible interface for creating one’s own algorithms
• Provides means for building and applying models
• Provides integrated visualization components
• Provides access to distributed computing capabilities
D2K Overview D2K - Data To Knowledge D2K is a flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization
D2K Overview
D2K and Its Many Components
• D2K Infrastructure: D2K API, data flow environment, distributed computing framework, and runtime system
• D2K Modules: Computational units written in Java that follow the D2K API
• D2K Itineraries: Modules that are connected to form an application
• D2K Toolkit: User interface for specification of itineraries and execution that provides the rapid application development environment
• D2K-Driven Applications: Applications that use D2K modules with a custom user interface
• D2K Streamline (SL): Task-driven system that uses D2K modules
• D2K Web/Grid Services: Enables web deployment
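The idea of connecting modules into an itinerary in a data flow environment can be sketched as follows. This is an illustration only, not the D2K API: the `Module` class, `run_itinerary`, and the module names are hypothetical, and real D2K modules are written in Java.

```python
class Module:
    """A computational unit: a function plus the names of the modules feeding it."""
    def __init__(self, name, func, inputs):
        self.name = name
        self.func = func
        self.inputs = inputs

def run_itinerary(modules):
    """Fire each module as soon as all of its inputs have been produced."""
    results = {}
    pending = list(modules)
    while pending:
        ready = [m for m in pending if all(i in results for i in m.inputs)]
        if not ready:
            raise ValueError("itinerary has unmet or cyclic inputs")
        for m in ready:
            results[m.name] = m.func(*[results[i] for i in m.inputs])
            pending.remove(m)
    return results

# Modules are listed out of order on purpose: data availability drives execution.
itinerary = [
    Module("report", lambda mean: f"mean={mean:.1f}", ["mean"]),
    Module("mean", lambda xs: sum(xs) / len(xs), ["clean"]),
    Module("load", lambda: [4.0, None, 6.0, 8.0], []),
    Module("clean", lambda xs: [x for x in xs if x is not None], ["load"]),
]
results = run_itinerary(itinerary)
print(results["report"])
```

Because execution order follows the connections rather than the listing order, the same modules can be rewired into a different application without changing their code.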
D2K Overview
D2K Toolkit
Major features that D2K provides to an application developer include:
• Visual programming system employing a data flow paradigm
• Scalable distributed computing capabilities
• Flexible and extensible software development environment
• Multi-layered learning strategies
• Integrated environment for models and visualization
• Web service capabilities for deployment
D2K Overview
D2K Modules
• Input Module: Loads data from the outside world
  • Flat files, databases, etc.
• Data Prep Module: Performs functions to select, clean, or transform the data
  • Binning, Normalizing, Feature Selection, etc.
• Compute Module: Performs main algorithmic computations
  • Naïve Bayesian, Decision Tree, Apriori, etc.
• User Input Module: Requires interaction with the user
  • Data Selection, Input and Output selection, etc.
• Output Module: Saves data to the outside world
  • Flat files, databases, etc.
• Visualization Module: Provides visual feedback to the user
  • Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot
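As an illustration of what a data prep step does, the sketch below shows two of the transformations named above, normalizing and binning. The function names and the equal-width binning strategy are assumptions; the slides do not say which variants the D2K modules implement.

```python
def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value the index of one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 25, 40, 62, 90]  # hypothetical attribute column
normalized = min_max_normalize(ages)
bins = equal_width_bins(ages, 3)
print(normalized)
print(bins)
```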
D2K Overview
D2K Module Icon Description
• Module Progress Bar: Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when it is not.
• Input Port: Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent.
• Properties Symbol: If a “P” is shown in the lower left corner of the module, then the module has properties that can be set before execution.
• Output Port: Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.
Current ALG Projects
MAIDS: Mining Alarming Incidents in Data Streams
Stream Characteristics
• Huge volumes of continuous data, possibly infinite
• Fast changing, requiring fast, real-time response
• The data stream model fits today's data processing needs well
• Random access is expensive, so algorithms must use a single linear scan (only one look at each item)
• Store only a summary of the data seen thus far
• Most stream data are low-level or multi-dimensional in nature and need multi-level, multi-dimensional processing
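The single-scan, summary-only constraint can be illustrated with a running mean and variance (Welford's online method) that flags values far from what has been seen so far. This is a sketch under stated assumptions, not the MAIDS algorithm: the class name, the 3-standard-deviation threshold, the warm-up length, and the readings are all hypothetical.

```python
import math

class RunningStats:
    """Constant-size summary of a stream: count, mean, and sum of squared deviations."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        # Welford's update: each value is looked at once and then discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def scan(stream, threshold=3.0, warmup=5):
    """Make one linear pass, raising an alarm for values far from the running mean."""
    stats = RunningStats()
    alarms = []
    for position, x in enumerate(stream):
        far = abs(x - stats.mean) > threshold * stats.std
        if stats.n >= warmup and stats.std > 0 and far:
            alarms.append((position, x))
        stats.update(x)
    return alarms, stats

readings = [10, 11, 9, 10, 12, 10, 11, 50, 10, 9]
alarms, stats = scan(readings)
print(alarms)
```

Memory use stays fixed no matter how long the stream runs, since only three numbers are kept. One simplification to note: flagged values are still folded into the summary, which a production detector might prefer to avoid.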
Using D2K Toolkit MAIDS
Current ALG Projects
Text Mining
• Information Retrieval
  • Indexing and retrieval of textual documents
  • Finding a set of (ranked) documents that are relevant to the query
• Information Extraction
  • Extraction of partial knowledge in the text
• Web Mining
  • Indexing and retrieval of textual documents and extraction of partial knowledge using the web
• Classification
  • Predict a class for each text document
• Clustering
  • Generating collections of similar text documents
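Finding a ranked set of documents relevant to a query can be sketched with TF-IDF weighting and cosine similarity. The slides do not state which retrieval model T2K or ThemeWeaver uses, so the technique, the smoothed IDF formula, the whitespace tokenizer, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank(query, documents):
    """Return indices of documents with nonzero similarity, most relevant first."""
    doc_tokens = [tokenize(d) for d in documents]
    n = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))

    def vector(tokens):
        tf = Counter(tokens)
        # Smoothed IDF keeps weights positive and handles unseen query terms.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    q = vector(tokenize(query))
    scores = [(cosine(q, vector(tokens)), i) for i, tokens in enumerate(doc_tokens)]
    ordered = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ordered if score > 0]

documents = [
    "customer reports printer paper jam",
    "earthquake damage to bridge inventory",
    "printer driver fails after update for customer",
]
ranking = rank("printer paper jam", documents)
print(ranking)
```

The first document shares three terms with the query and ranks highest, the third shares one, and the second shares none and is left out.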
Using D2K Driven Application Text Mining: Views from T2K and ThemeWeaver
Using D2K Driven Application
MAEViz: Damage Synthesis Visualization
• Displays terrain map
• Loads hazard, inventory, and fragility data
• Shows contour map of ground acceleration (hazard)
• Displays cones/bars to indicate level of damage
• Overlays shapefiles of different information
• Uses VTK for 3D
• Uses CUBE at BI
D2K SL
D2K Streamline (D2K SL)
• Provides a step-by-step interface to guide the user in data analysis
• Supports return to earlier steps to run with different parameters
• Uses the D2K infrastructure transparently
• Uses the same D2K modules
• Provides a way to capture different experiments
Using D2K SL
EMO – Evolutionary Multiobjective Optimization
• Identify tradeoffs among complex objectives
• Apply a genetic algorithm (GA) optimization in a general framework
• Guide the user through discrete steps for defining decision variables, fitness functions, and constraints, and for setting up GA parameters
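What "identifying tradeoffs" means with several objectives can be shown with a Pareto dominance check, the comparison that multiobjective genetic algorithms commonly use to rank candidate solutions. The sketch below covers only that comparison, not a genetic algorithm, and it is not taken from EMO: the function names and the (cost, risk) values are hypothetical, and both objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    """Keep only the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical fitness values per candidate decision: (cost, risk), both minimized.
candidates = [(10, 9), (12, 4), (15, 3), (14, 6), (20, 3)]
front = pareto_front(candidates)
print(front)
```

Each point left on the front is a tradeoff: moving from one to another lowers cost only by raising risk, or the reverse. A GA would repeat this kind of ranking across generations while varying the decision variables.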
D2K Web Service Architecture
• Any web-enabled client can connect to and use the D2K Web Service by sending SOAP messages over HTTP.
• Itineraries and modules are stored on the web service machine and loaded over the network by the D2K Servers.
• Job results are also stored in the web service tier.
• Results are returned to clients upon request.
• A relational database is used by the web service to look up accounts, itineraries, servers, and jobs.
• Remote D2K Servers handle itinerary processing. If possible, modules should load any data from remote locations.
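A client request of the kind described above might be built as in the sketch below, which constructs a SOAP 1.1 envelope with the Python standard library. The operation name `runItinerary`, its parameters, and the service namespace are hypothetical: the slides do not document the actual D2K Web Service operations or which SOAP version it expects.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:d2k-web-service"             # hypothetical service namespace

def build_request(operation, params):
    """Wrap an operation call and its parameters in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

# Hypothetical call: ask the service to run a stored itinerary for an account.
request_xml = build_request("runItinerary",
                            {"itinerary": "example.itn", "account": "demo"})
print(request_xml)
```

A client would then send `request_xml` as the body of an HTTP POST to the service endpoint, typically with a `text/xml` content type, and later ask for the stored job results with a second message.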
Using D2K Web Service Phylomat (Motif Analysis Tool for Phylogenomics)
The ALG Team
Staff: Loretta Auvil, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge
Students: Ritesh Agrawal, Tyler Alumbaugh, John Cassel, Sang-Chul Lee, Xiaolei Li, Jeff Ng, Scott Ramon, Martin Urban, Bei Yu, Hwanjo Yu
Licensing D2K
• Faculty, staff, and students at US academic institutions will be able to license and use D2K for free by downloading from alg.ncsa.uiuc.edu
• Private Sector Partners who have provided funding for projects related to D2K will be able to license and use D2K for free
• Private Sector Partners who have not provided funding will be able to license and use D2K for a discounted fee
Contact
John McEntire
Office of Technology Management
308 Ceramics Building, MC-243
105 South Goodwin Avenue
Urbana, Illinois 61801-2901
(217) 333-3715
jmcentir@uiuc.edu