Database Research: Data Mining & Other Areas Dr. Aparna Varde Ph.D., Computer Science, WPI, MA Assistant Professor, Computer Science, VSU, VA Presentation at Montclair State University, NJ May 2, 2008
Agenda • Database Systems • Introduction to Databases and Research Areas • Data Mining • Research Problem in Graphical Data Mining • Other Areas • Data Warehousing • Web Databases
Data in Various Forms • Flat Files (Unprocessed) • Documents (Processed) • Raw Data (Handwritten) • Images (Complex) • Human Mind (Too much data) • Simple Tables (Organized)
Need for Databases • Integration of data • Efficient storage • Fast retrieval • Ease of modification • Security of information • Recovery from failures
Database System Environment [Diagram: Users → Application Programs / Queries → Database System, comprising the DBMS (Database Management System) and the Database]
Roles in the Database World • Database Administrator • Database Application Programmer • Database User • Database Researcher
Examples of Database Research Areas • Query Processing and Optimization • Privacy and Security • Storage and Indexing • Data Mining • Data Warehousing • Web Databases
Data Mining • Discovering knowledge from data • Non-trivial process of finding novel and interesting patterns in large datasets to guide future decisions • Types of Data • Numbers • Graphs • Images • Text
Data Mining Techniques • Association Rule Mining • Discovering relationships of the type A => B • Clustering • Grouping objects based on similarity • Classification • Predicting the class of a target
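Two of the techniques above can be illustrated in miniature. The sketch below is not from the presentation; the data, centroids, and labels are hypothetical, and real mining systems operate on far richer representations.

```python
# Minimal sketch of two mining techniques (illustrative data).
# Clustering: assign 1-D points to the nearest of two centroids.
# Classification: predict a point's class from its nearest labeled neighbor.

def cluster(points, centroids):
    """Assign each point the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            for p in points]

def classify(point, labeled):
    """1-nearest-neighbor: return the label of the closest training point."""
    return min(labeled, key=lambda pair: abs(point - pair[0]))[1]

labels = cluster([1.0, 1.2, 8.9, 9.1], centroids=[1.0, 9.0])
print(labels)  # [0, 0, 1, 1]
print(classify(8.5, [(1.0, "low"), (9.0, "high")]))  # high
```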
Graphical Data Mining Problem • Experimental results in scientific domains plotted as graphs • Users pose queries for predictive analysis: • Given input conditions, predict most likely graph • Given desired graph, predict most likely conditions • Need for mining graphical data to discover knowledge
Main Tasks • Task 1: AutoDomainMine, a Learning Strategy Integrating Clustering and Classification [AAAI-06 Poster, ACM SIGART's ICICIS-05] • Task 2: Learning Domain-Specific Distance Metrics for Graphs [ACM KDD's MDM-05, MTAP-06 Journal] • Task 3: Designing Semantics-Preserving Representatives for Clusters [ACM SIGMOD's IQIS-06, ACM CIKM-06]
Learning Distance Metrics for Graphs • Various distance metrics • Absolute position of points • Statistical observations • Critical features • Issues • It is not known which metrics apply • Multiple metrics may be relevant • Need for distance metric learning in graphs (example of a domain-specific problem)
Proposed Distance Metric Learning Approach: LearnMet • Given • Training set with actual clusters of graphs • Additional Input • Components: distance metrics applicable to graphs • LearnMet Metric • D = ∑wiDi
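The combined metric D = ∑ wiDi can be sketched as a weighted sum of component distances. The two component metrics below (position-based and statistical) are illustrative stand-ins for the components named on the earlier slide, and the graphs are modeled simply as lists of y-values.

```python
# Sketch of the LearnMet combined metric D = sum(w_i * D_i):
# each D_i is one component distance between two graphs, w_i its weight.

def position_distance(g1, g2):
    """Mean absolute difference between corresponding y-values."""
    return sum(abs(a - b) for a, b in zip(g1, g2)) / len(g1)

def statistical_distance(g1, g2):
    """Difference between the mean values of the two graphs."""
    return abs(sum(g1) / len(g1) - sum(g2) / len(g2))

def learnmet_distance(g1, g2, weights):
    components = [position_distance(g1, g2), statistical_distance(g1, g2)]
    return sum(w * d for w, d in zip(weights, components))

g1, g2 = [1.0, 2.0, 3.0], [2.0, 2.0, 4.0]
print(learnmet_distance(g1, g2, weights=[0.5, 0.5]))
```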
Evaluate Accuracy • Use pairs of graphs • A pair (ga, gb) is • TP - same predicted, same actual cluster: (g1, g2) • TN - different predicted, different actual clusters: (g2, g3) • FP - same predicted cluster, different actual clusters: (g3, g4) • FN - different predicted, same actual cluster: (g4, g5)
Evaluate Accuracy (Contd.) • How do we compute error for whole set of graphs? • For all pairs • Error Measure • Failure Rate FR • FR = (FP+FN) / (TP+TN+FP+FN) • Error Threshold (t) • Extent of FR allowed • If (FR < t) then clustering is accurate
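The pair-based evaluation above can be sketched directly: classify every pair of graphs as TP, TN, FP, or FN from its predicted and actual cluster labels, then compute FR = (FP + FN) / (TP + TN + FP + FN). The cluster assignments below are illustrative.

```python
# Sketch of the pair-based failure rate from the slides.
from itertools import combinations

def failure_rate(predicted, actual):
    """predicted/actual map each graph id to its cluster label."""
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_act = actual[a] == actual[b]
        if same_pred and same_act:
            tp += 1        # TP: same predicted, same actual cluster
        elif not same_pred and not same_act:
            tn += 1        # TN: different predicted, different actual
        elif same_pred:
            fp += 1        # FP: same predicted, different actual
        else:
            fn += 1        # FN: different predicted, same actual
    return (fp + fn) / (tp + tn + fp + fn)

predicted = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
actual    = {"g1": 0, "g2": 0, "g3": 1, "g4": 0}
print(failure_rate(predicted, actual))  # 0.5
```

With an error threshold of, say, t = 0.2, this clustering (FR = 0.5) would be judged inaccurate.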
Adjust the Metric • Weight Adjustment Heuristic: for each Di • New wi = wi – sfi (DFNi/DFN + DFPi/DFP) [KDD’s MDM-05]
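The weight-adjustment heuristic can be sketched by applying the slide's formula, new wi = wi - sfi (DFNi/DFN + DFPi/DFP), component by component. Here DFNi (DFPi) is taken to be component Di's contribution to the distance over false-negative (false-positive) pairs, DFN and DFP the totals, and sfi a per-component scaling factor; the numbers are illustrative, not from the paper.

```python
# Sketch of the LearnMet weight-adjustment heuristic:
#   new w_i = w_i - sf_i * (DFN_i / DFN + DFP_i / DFP)

def adjust_weights(weights, dfn_parts, dfp_parts, scale):
    dfn_total = sum(dfn_parts)   # DFN: total distance over FN pairs
    dfp_total = sum(dfp_parts)   # DFP: total distance over FP pairs
    return [w - sf * (dfn / dfn_total + dfp / dfp_total)
            for w, sf, dfn, dfp in zip(weights, scale, dfn_parts, dfp_parts)]

new_w = adjust_weights(weights=[0.5, 0.5],
                       dfn_parts=[3.0, 1.0],   # per-component FN distance
                       dfp_parts=[1.0, 1.0],   # per-component FP distance
                       scale=[0.1, 0.1])
print(new_w)
```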
Testing of LearnMet • Details: MTAP-06 • Effect of pairs per epoch (ppe) • G = number of graphs, e.g., G = 25 • GC2 (G choose 2) = total number of pairs, e.g., 300 • Select a subset of the GC2 pairs per epoch • Observations • Highest accuracy with middle range of ppe • Learning efficiency best with low ppe [Charts: Accuracy of Learned Metrics over Test Set; Learning Efficiency over Training Set]
User Surveys of the AutoDomainMine System • Formal user surveys in different applications • Evaluation Process • Compare estimation with real data in the test set • If they match, the estimation is accurate • Observations • Estimation accuracy around 90-95% [Charts: Accuracy: Estimating Conditions; Accuracy: Estimating Graphs]
Related Work • Similarity Search [HK-01, WF-00] • Non-matching conditions could be significant • Mathematical Modeling [M-95, S-60] • Existing models not applicable under certain situations • Case-based Reasoning [K-93, AP-03] • Adaptation of cases not feasible with graphs • Learning nearest neighbor in high-dimensional spaces: [HAK-00] • Focus is dimensionality reduction, do not deal with graphs • Distance metric learning given basic formula: [XNJR-03] • Deal with position-based distances for points, no graphs involved • Similarity search in multimedia databases [KB-04] • Use various metrics in different applications, do not learn a single metric • Image Rating: [HH-01] • User intervention involved in manual rating • Semantic Fish Eye Views: [JP-04] • Display multiple objects in small space, no representatives • PDA Displays in Levels of Detail: [BGMP-01] • Do not evaluate different types of representatives
Data Warehousing • Data Warehouse • Subject-oriented, integrated repository of relevant data from various information sources [Diagram: information sources IS1-IS3 with relations R11-R31 feeding a Mediator, which maintains the DW View]
Research Problem in Data Warehousing • View Maintenance (VM) • Keeping warehouse view consistent with respect to change in sources • Incremental VM • Update warehouse as the source data changes • Propagate only the updates, not all data • Concurrency Conflicts • Two or more sources / relations try to send updates at the same time • Problem • Solve concurrency conflicts in view maintenance in multi-source multi-relation environments
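The incremental idea, propagate only the updates, not all the data, can be sketched for a simple join view. This is an illustration of the principle only, not the MEDWRAP algorithm; relations are modeled as lists of tuples joined on their first attribute.

```python
# Sketch of incremental view maintenance: the warehouse view materializes a
# join of two source relations, and an insertion into one source is
# propagated by joining only the delta with the other relation,
# instead of recomputing the whole view.

def join(r, s):
    """Natural join on the first attribute."""
    return [(k, v, w) for (k, v) in r for (k2, w) in s if k == k2]

R = [(1, "a"), (2, "b")]
S = [(1, "x"), (2, "y")]
view = join(R, S)              # initial materialized view

delta_R = [(2, "c")]           # new tuple inserted into source R
view += join(delta_R, S)       # propagate only the delta
R += delta_R                   # source catches up

print(view)  # [(1, 'a', 'x'), (2, 'b', 'y'), (2, 'c', 'y')]
```

Concurrency conflicts arise when another source (say S) also changes while the delta query is being answered; compensation-based algorithms such as MEDWRAP correct for the interfering updates.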
Proposed Solution: MEDWRAP (MEDiator WRAPper compensation) [Diagram: each information source IS1-IS3 has a Wrapper running a single-source VM algorithm over its relation updates (rR11, rR21, rR22, rR31); the Mediator runs a multi-source VM algorithm over the source-level updates (rIS1-rIS3) to maintain the warehouse view V]
Advantages of MEDWRAP • Generic for any compensation-based algorithms • Allows sources to be semi-autonomous • Sources do not participate in maintenance beyond processing queries and reporting updates • No locking needed • Low Storage Cost • Additional views not stored at wrappers • Copies of source relations not stored at warehouse • Efficient Processing Time • No need to re-compute whole view • Details in DEXA-2002 paper
Related Work • RV: Re-computation of View (Traditional) • Rewrite all tuples, not only affected ones • Highly inefficient if done for every update • SM: Self Maintenance [Q-96, G-96] • DW stores copies of source relations for maintenance • Huge storage at warehouse • Version Control: [K-99, C-00] • Versions of transactions / tuples stored at wrappers • Latest version used to answer queries • Huge storage at source wrappers
Web Databases • Management of Data on the Web • XML, the eXtensible Markup Language • Widespread standard for storing and publishing data • Domain-specific markup languages designed with XML tag sets • Standardization bodies extend these to include additional semantics • Aspects such as domain knowledge and XML constraints are important
Domain-specific Markup Language • Medium of communication for potential users of the domain • Follows XML syntax • Encompasses the semantics of the domain • Examples • MML: Medical Markup Language • ChemML: Chemical Markup Language [Diagram: a markup language connecting Industries, Publishers, Consumers, Research Organizations, and Universities]
Markup Language Development Steps 1. Acquisition of Domain Knowledge - Familiarity with related markups 2. Data Modeling - E.g., Entity-Relationship models 3. Requirements Specification - E.g., interviews with domain experts 4. Ontology Creation - Analogous to pilot version of software 5. Revision of Ontology - Alpha version 6. Schema Definition - Beta version 7. Reiteration of Schema until Standardization - Release version [Figure: Snapshot of final schema with data storage]
Desired Features of Markup Languages • Avoidance of Redundancy • No duplicate information • Non-Ambiguous Presentation of Data • Issues such as synonymy & polysemy • Easy Interpretability of Data • E.g., in scientific domains, store experimental input conditions before results • Incorporation of Domain-Specific Requirements • E.g., conflicts such as: in financial domains, a person can be either insolvent or an asset-holder, but not both • Extensibility of the Markup • Users should be able to capture additional semantics
Application of XML Constraints • Sequence Constraint • To control the order of tags • Choice Constraint • To use either one tag or the other • Key Constraint • To identify an attribute as a unique primary key • Occurrence Constraint • To declare minimum and maximum occurrences
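Two of the constraints above can be checked programmatically. The sketch below uses Python's standard `xml.etree` library on an illustrative document (the tag and attribute names are hypothetical): a key constraint (every `id` unique) and an occurrence constraint (exactly one `<name>` per record). In practice such constraints would be declared in the schema itself rather than checked by hand.

```python
# Sketch: checking a key constraint and an occurrence constraint on XML data.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<records>
  <record id="r1"><name>alpha</name></record>
  <record id="r2"><name>beta</name></record>
</records>
""")

# Key constraint: the id attribute must be a unique identifier.
ids = [rec.get("id") for rec in doc.findall("record")]
key_ok = len(ids) == len(set(ids))

# Occurrence constraint: each record must contain exactly one <name>.
occ_ok = all(len(rec.findall("name")) == 1 for rec in doc.findall("record"))

print(key_ok, occ_ok)  # True True
```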
Convenient Access to Information • Data stored using XML based markup languages can be easily accessed using languages such as • XQuery: XML Query Language • XSLT: Extensible Stylesheet Language Transformations • XPath: XML Path Language • Details on markup language development • Chapter on "XML Based Markup Languages for Specific Domains" by Varde et al. in book "XML Based Support Systems", Springer 2008
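As a small illustration of XPath-style access, Python's standard `xml.etree` library supports a limited XPath subset (full XPath, XQuery, and XSLT require dedicated engines). The document and element names below are hypothetical.

```python
# Sketch: selecting data from an XML document with an XPath expression.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<experiments>"
    "<experiment id='e1'><result>42</result></experiment>"
    "<experiment id='e2'><result>7</result></experiment>"
    "</experiments>")

# XPath: the <result> of the experiment whose id attribute is 'e2'.
node = doc.find("experiment[@id='e2']/result")
print(node.text)  # 7
```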
Related Work • Semantic Extensions of XML for Advanced Applications [YKB-2001] • Versions and Standards of HTML [B-95] • The Latest MML (Medical Markup Language) Version 2.3 - XML based Standard for Medical Data Exchange/ Storage [GATSSTSNY-2003] • XQuery 1.0: An XML Query Language [BFFRS-2003] • Handbook of Modern Finance [SL-2004] • Propagating XML Constraints to Relations [DFHQ-2003]
Conclusions and Ongoing Work • Data Mining • Graphical Data Mining Area, AutoDomainMine approach • Ongoing Work • Feature Selection in Image Mining (with colleagues in VSU and WPI: NSF Grants involved) • Mining Genomic and Proteomic Data (with ISB: Institute of Systems Biology) • Data Warehousing • View Maintenance Area, MEDWRAP approach • Ongoing Work • Data Warehouse Maintenance in real time environments (with researchers at Microsoft Search Labs) • Web Databases • Book Chapter on XML Based Markup Languages for Specific Domains • Ongoing Work • Development of Domain-specific markups (with NIST: National Institute of Standards and Technology)