Database Research: Data Mining & Other Areas Dr. Aparna Varde Ph.D., Computer Science, WPI, MA Assistant Professor, Computer Science, VSU, VA Presentation at Montclair State University, NJ May 2, 2008
Agenda • Database Systems • Introduction to Databases and Research Areas • Data Mining • Research Problem in Graphical Data Mining • Other Areas • Data Warehousing • Web Databases
Data in Various Forms • Flat Files (Unprocessed) • Documents (Processed) • Raw Data (Handwritten) • Images (Complex) • Human Mind (Too much data) • Simple Tables (Organized)
Need for Databases • Integration of data • Efficient storage • Fast retrieval • Ease of modification • Security of information • Recovery from failures
Database System Environment [Diagram: Users → Application Programs / Queries → Database System, comprising the DBMS (Database Management System) and the Database]
Roles in the Database World • Database Administrator • Database Application Programmer • Database User • Database Researcher
Examples of Database Research Areas • Query Processing and Optimization • Privacy and Security • Storage and Indexing • Data Mining • Data Warehousing • Web Databases
Data Mining • Discovering knowledge from data • Non-trivial process of finding novel and interesting patterns in large datasets to guide future decisions • Types of Data • Numbers • Graphs • Images • Text
Data Mining Techniques • Association Rule Mining • Discovering relationships of the type A => B • Clustering • Grouping objects based on similarity • Classification • Predicting the class of a target
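Two of the techniques above can be illustrated in miniature. The sketch below is not from the presentation; the data, centroids, and labels are hypothetical, and real mining systems operate on far richer representations.

```python
# Minimal sketch of two mining techniques (illustrative data).
# Clustering: assign 1-D points to the nearest of two centroids.
# Classification: predict a point's class from its nearest labeled neighbor.

def cluster(points, centroids):
    """Assign each point the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            for p in points]

def classify(point, labeled):
    """1-nearest-neighbor: return the label of the closest training point."""
    return min(labeled, key=lambda pair: abs(point - pair[0]))[1]

labels = cluster([1.0, 1.2, 8.9, 9.1], centroids=[1.0, 9.0])
print(labels)  # [0, 0, 1, 1]
print(classify(8.5, [(1.0, "low"), (9.0, "high")]))  # high
```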
Graphical Data Mining Problem • Experimental results in scientific domains plotted as graphs • Users pose queries for predictive analysis: • Given input conditions, predict most likely graph • Given desired graph, predict most likely conditions • Need for mining graphical data to discover knowledge
Main Tasks • Task 1: AutoDomainMine, a Learning Strategy Integrating Clustering and Classification [AAAI-06 Poster, ACM SIGART's ICICIS-05] • Task 2: Learning Domain-Specific Distance Metrics for Graphs [ACM KDD's MDM-05, MTAP-06 Journal] • Task 3: Designing Semantics-Preserving Representatives for Clusters [ACM SIGMOD's IQIS-06, ACM CIKM-06]
Learning Distance Metrics for Graphs • Various distance metrics • Absolute position of points • Statistical observations • Critical features • Issues • It is not known which metrics apply • Multiple metrics may be relevant • Need for distance metric learning in graphs (example of a domain-specific problem)
Proposed Distance Metric Learning Approach: LearnMet • Given • Training set with actual clusters of graphs • Additional Input • Components: distance metrics applicable to graphs • LearnMet Metric • D = ∑wiDi
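The combined metric D = ∑ wiDi can be sketched as a weighted sum of component distances. The two component metrics below (position-based and statistical) are illustrative stand-ins for the components named on the earlier slide, and the graphs are modeled simply as lists of y-values.

```python
# Sketch of the LearnMet combined metric D = sum(w_i * D_i):
# each D_i is one component distance between two graphs, w_i its weight.

def position_distance(g1, g2):
    """Mean absolute difference between corresponding y-values."""
    return sum(abs(a - b) for a, b in zip(g1, g2)) / len(g1)

def statistical_distance(g1, g2):
    """Difference between the mean values of the two graphs."""
    return abs(sum(g1) / len(g1) - sum(g2) / len(g2))

def learnmet_distance(g1, g2, weights):
    components = [position_distance(g1, g2), statistical_distance(g1, g2)]
    return sum(w * d for w, d in zip(weights, components))

g1, g2 = [1.0, 2.0, 3.0], [2.0, 2.0, 4.0]
print(learnmet_distance(g1, g2, weights=[0.5, 0.5]))
```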
Evaluate Accuracy • Use pairs of graphs • A pair (ga, gb) is • TP - same predicted, same actual cluster: (g1, g2) • TN - different predicted, different actual clusters: (g2, g3) • FP - same predicted cluster, different actual clusters: (g3, g4) • FN - different predicted, same actual cluster: (g4, g5)
Evaluate Accuracy (Contd.) • How do we compute error for whole set of graphs? • For all pairs • Error Measure • Failure Rate FR • FR = (FP+FN) / (TP+TN+FP+FN) • Error Threshold (t) • Extent of FR allowed • If (FR < t) then clustering is accurate
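The pair-based evaluation above can be sketched directly: classify every pair of graphs as TP, TN, FP, or FN from its predicted and actual cluster labels, then compute FR = (FP + FN) / (TP + TN + FP + FN). The cluster assignments below are illustrative.

```python
# Sketch of the pair-based failure rate from the slides.
from itertools import combinations

def failure_rate(predicted, actual):
    """predicted/actual map each graph id to its cluster label."""
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_act = actual[a] == actual[b]
        if same_pred and same_act:
            tp += 1        # TP: same predicted, same actual cluster
        elif not same_pred and not same_act:
            tn += 1        # TN: different predicted, different actual
        elif same_pred:
            fp += 1        # FP: same predicted, different actual
        else:
            fn += 1        # FN: different predicted, same actual
    return (fp + fn) / (tp + tn + fp + fn)

predicted = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
actual    = {"g1": 0, "g2": 0, "g3": 1, "g4": 0}
print(failure_rate(predicted, actual))  # 0.5
```

With an error threshold of, say, t = 0.2, this clustering (FR = 0.5) would be judged inaccurate.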
Adjust the Metric • Weight Adjustment Heuristic: for each Di • New wi = wi – sfi (DFNi/DFN + DFPi/DFP) [KDD’s MDM-05]
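The weight-adjustment heuristic can be sketched by applying the slide's formula, new wi = wi - sfi (DFNi/DFN + DFPi/DFP), component by component. Here DFNi (DFPi) is taken to be component Di's contribution to the distance over false-negative (false-positive) pairs, DFN and DFP the totals, and sfi a per-component scaling factor; the numbers are illustrative, not from the paper.

```python
# Sketch of the LearnMet weight-adjustment heuristic:
#   new w_i = w_i - sf_i * (DFN_i / DFN + DFP_i / DFP)

def adjust_weights(weights, dfn_parts, dfp_parts, scale):
    dfn_total = sum(dfn_parts)   # DFN: total distance over FN pairs
    dfp_total = sum(dfp_parts)   # DFP: total distance over FP pairs
    return [w - sf * (dfn / dfn_total + dfp / dfp_total)
            for w, sf, dfn, dfp in zip(weights, scale, dfn_parts, dfp_parts)]

new_w = adjust_weights(weights=[0.5, 0.5],
                       dfn_parts=[3.0, 1.0],   # per-component FN distance
                       dfp_parts=[1.0, 1.0],   # per-component FP distance
                       scale=[0.1, 0.1])
print(new_w)
```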
Testing of LearnMet • Details: MTAP-06 • Effect of pairs per epoch (ppe) • G = number of graphs, e.g., G = 25 • GC2 (G choose 2) = total number of pairs, e.g., 300 • Select a subset of the GC2 pairs per epoch • Observations • Highest accuracy with middle range of ppe • Learning efficiency best with low ppe [Charts: Accuracy of Learned Metrics over Test Set; Learning Efficiency over Training Set]
User Surveys of the AutoDomainMine System • Formal user surveys in different applications • Evaluation Process • Compare estimation with real data in the test set • If they match, the estimation is accurate • Observations • Estimation accuracy around 90-95% [Charts: Accuracy: Estimating Conditions; Accuracy: Estimating Graphs]
Related Work • Similarity Search [HK-01, WF-00] • Non-matching conditions could be significant • Mathematical Modeling [M-95, S-60] • Existing models not applicable under certain situations • Case-based Reasoning [K-93, AP-03] • Adaptation of cases not feasible with graphs • Learning nearest neighbor in high-dimensional spaces: [HAK-00] • Focus is dimensionality reduction, do not deal with graphs • Distance metric learning given basic formula: [XNJR-03] • Deal with position-based distances for points, no graphs involved • Similarity search in multimedia databases [KB-04] • Use various metrics in different applications, do not learn a single metric • Image Rating: [HH-01] • User intervention involved in manual rating • Semantic Fish Eye Views: [JP-04] • Display multiple objects in small space, no representatives • PDA Displays in Levels of Detail: [BGMP-01] • Do not evaluate different types of representatives
Data Warehousing • Data Warehouse • Subject-oriented, integrated repository of relevant data from various information sources [Diagram: information sources IS1-IS3 with relations R11-R31 feeding a Mediator, which maintains the DW View]
Research Problem in Data Warehousing • View Maintenance (VM) • Keeping warehouse view consistent with respect to change in sources • Incremental VM • Update warehouse as the source data changes • Propagate only the updates, not all data • Concurrency Conflicts • Two or more sources / relations try to send updates at the same time • Problem • Solve concurrency conflicts in view maintenance in multi-source multi-relation environments
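The incremental idea, propagate only the updates, not all the data, can be sketched for a simple join view. This is an illustration of the principle only, not the MEDWRAP algorithm; relations are modeled as lists of tuples joined on their first attribute.

```python
# Sketch of incremental view maintenance: the warehouse view materializes a
# join of two source relations, and an insertion into one source is
# propagated by joining only the delta with the other relation,
# instead of recomputing the whole view.

def join(r, s):
    """Natural join on the first attribute."""
    return [(k, v, w) for (k, v) in r for (k2, w) in s if k == k2]

R = [(1, "a"), (2, "b")]
S = [(1, "x"), (2, "y")]
view = join(R, S)              # initial materialized view

delta_R = [(2, "c")]           # new tuple inserted into source R
view += join(delta_R, S)       # propagate only the delta
R += delta_R                   # source catches up

print(view)  # [(1, 'a', 'x'), (2, 'b', 'y'), (2, 'c', 'y')]
```

Concurrency conflicts arise when another source (say S) also changes while the delta query is being answered; compensation-based algorithms such as MEDWRAP correct for the interfering updates.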
Proposed Solution: MEDWRAP (MEDiator WRAPper compensation) [Diagram: each information source IS1-IS3 has a Wrapper running a single-source VM algorithm over its relation updates (rR11, rR21, rR22, rR31); the Mediator runs a multi-source VM algorithm over the source-level updates (rIS1-rIS3) to maintain the warehouse view V]
Advantages of MEDWRAP • Generic for any compensation-based algorithms • Allows sources to be semi-autonomous • Sources do not participate in maintenance beyond processing queries and reporting updates • No locking needed • Low Storage Cost • Additional views not stored at wrappers • Copies of source relations not stored at warehouse • Efficient Processing Time • No need to re-compute whole view • Details in DEXA-2002 paper
Related Work • RV: Re-computation of View (Traditional) • Rewrite all tuples, not only affected ones • Highly inefficient if done for every update • SM: Self Maintenance [Q-96, G-96] • DW stores copies of source relations for maintenance • Huge storage at warehouse • Version Control: [K-99, C-00] • Versions of transactions / tuples stored at wrappers • Latest version used to answer queries • Huge storage at source wrappers
Web Databases • Management of Data on the Web • XML, the eXtensible Markup Language • Widespread standard for storing and publishing data • Domain-specific markup languages designed with XML tag sets • Standardization bodies extend these to include additional semantics • Aspects such as domain knowledge and XML constraints are important
Domain-specific Markup Language • Medium of communication for potential users of the domain • Follows XML syntax • Encompasses the semantics of the domain • Examples • MML: Medical Markup Language • ChemML: Chemical Markup Language [Diagram: a markup language connecting Industries, Publishers, Consumers, Research Organizations, and Universities]
Markup Language Development Steps 1. Acquisition of Domain Knowledge - Familiarity with related markups 2. Data Modeling - E.g., Entity-Relationship models 3. Requirements Specification - E.g., interviews with domain experts 4. Ontology Creation - Analogous to pilot version of software 5. Revision of Ontology - Alpha version 6. Schema Definition - Beta version 7. Reiteration of Schema until Standardization - Release version [Figure: Snapshot of final schema with data storage]
Desired Features of Markup Languages • Avoidance of Redundancy • No duplicate information • Non-Ambiguous Presentation of Data • Issues such as synonymy & polysemy • Easy Interpretability of Data • E.g., in scientific domains, store experimental input conditions before results • Incorporation of Domain-Specific Requirements • E.g., conflicts such as: in financial domains, a person can be either insolvent or an asset-holder, but not both • Extensibility of the Markup • Users should be able to capture additional semantics
Application of XML Constraints • Sequence Constraint • To control the order of tags • Choice Constraint • To use either one tag or the other • Key Constraint • To identify an attribute as a unique primary key • Occurrence Constraint • To declare minimum and maximum occurrences
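Two of the constraints above can be checked programmatically. The sketch below uses Python's standard `xml.etree` library on an illustrative document (the tag and attribute names are hypothetical): a key constraint (every `id` unique) and an occurrence constraint (exactly one `<name>` per record). In practice such constraints would be declared in the schema itself rather than checked by hand.

```python
# Sketch: checking a key constraint and an occurrence constraint on XML data.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<records>
  <record id="r1"><name>alpha</name></record>
  <record id="r2"><name>beta</name></record>
</records>
""")

# Key constraint: the id attribute must be a unique identifier.
ids = [rec.get("id") for rec in doc.findall("record")]
key_ok = len(ids) == len(set(ids))

# Occurrence constraint: each record must contain exactly one <name>.
occ_ok = all(len(rec.findall("name")) == 1 for rec in doc.findall("record"))

print(key_ok, occ_ok)  # True True
```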
Convenient Access to Information • Data stored using XML based markup languages can be easily accessed using languages such as • XQuery: XML Query Language • XSLT: Extensible Stylesheet Language Transformations • XPath: XML Path Language • Details on markup language development • Chapter on "XML Based Markup Languages for Specific Domains" by Varde et al. in book "XML Based Support Systems", Springer 2008
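As a small illustration of XPath-style access, Python's standard `xml.etree` library supports a limited XPath subset (full XPath, XQuery, and XSLT require dedicated engines). The document and element names below are hypothetical.

```python
# Sketch: selecting data from an XML document with an XPath expression.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<experiments>"
    "<experiment id='e1'><result>42</result></experiment>"
    "<experiment id='e2'><result>7</result></experiment>"
    "</experiments>")

# XPath: the <result> of the experiment whose id attribute is 'e2'.
node = doc.find("experiment[@id='e2']/result")
print(node.text)  # 7
```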
Related Work • Semantic Extensions of XML for Advanced Applications [YKB-2001] • Versions and Standards of HTML [B-95] • The Latest MML (Medical Markup Language) Version 2.3 - XML based Standard for Medical Data Exchange/ Storage [GATSSTSNY-2003] • XQuery 1.0: An XML Query Language [BFFRS-2003] • Handbook of Modern Finance [SL-2004] • Propagating XML Constraints to Relations [DFHQ-2003]
Conclusions and Ongoing Work • Data Mining • Graphical Data Mining Area, AutoDomainMine approach • Ongoing Work • Feature Selection in Image Mining (with colleagues in VSU and WPI: NSF Grants involved) • Mining Genomic and Proteomic Data (with ISB: Institute of Systems Biology) • Data Warehousing • View Maintenance Area, MEDWRAP approach • Ongoing Work • Data Warehouse Maintenance in real time environments (with researchers at Microsoft Search Labs) • Web Databases • Book Chapter on XML Based Markup Languages for Specific Domains • Ongoing Work • Development of Domain-specific markups (with NIST: National Institute of Standards and Technology)