Intelligent Data Analysis: Discovery Techniques

CS245A – Syllabus (2005) • Knowledge Discovery in Databases • Query Processing With Domain Semantics • Capture Database Semantics by Rule Induction • Intensional Query Answering • Fault Tolerant DDBMS Via Data Inference • Intelligent Dictionary Directory • Uncertainty Management Using Rough Sets • Data Mining Techniques (Ch 4-7, H & K) • Active Databases • Mediators in Information Systems • KQML: A Language and Protocol for Knowledge and Information Exchange

CS 245A - Syllabus (cont’d) • CoBase • CoSent • Relaxation for XML Documents • Query Formation From High-level Concepts • Knowledge Acquisition for Query Relaxation • Principles of Case-based Reasoning • A Case-based Reasoning Approach to AQA • CoXML • Data Mining for Sequence Data • Extracting key features from Free Text • Knowledge based Approach for Free Text Retrieval • Content-based Information Retrieval • Digital Library

References • Course notes: Intelligent Information Systems, CS245A, Course Reader Material, 1141 Westwood Blvd, 310-443-3303 • Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, August 2000. • Wesley Chu & T.Y. Lin (eds.), Foundations and Advances in Data Mining, Springer, 2005.

CS 245A Intelligent Information Systems Wesley W. Chu Computer Science Department U. of California Los Angeles, CA

Knowledge Discovery In Databases • Information Explosion • Information doubles every 20 months • Increase in the number and size of DBs • NASA - Earth observation satellites, 1 picture/sec • Human genome - several billion genetic bases • US census data - lifestyle and subculture of the US • How to analyze these databases (raw data) • There is a gap between • Data generation and data understanding • Intelligent data analysis will be useful and valuable • AA uses frequent flyer DB to find its better customers for specific market promotions

Knowledge Discovery In Databases (Cont’d) • Banks use customers’ loan and credit information to derive better loan approval and bankruptcy protection • Package-goods manufacturers use scanned supermarket data to measure the effect of their promotions and to look for shopping patterns • Techniques • Machine Learning • Statistics • Information Theory • Fuzzy Sets

Knowledge Discovery • Extraction of implicit, previously unknown, and potentially useful information from data. Given a set of facts (data) F, a language L, and a measure of certainty C, a pattern is a statement S in L that describes the relationship among a subset Fs of F with certainty C, such that S is a simpler representation than the enumeration of all facts in Fs. Discovered Knowledge: The output of a program that monitors the set of facts in a DB and produces patterns.

Patterns • Expressed in a high-level language • Understood and used directly by people • Able to serve as input to another program (e.g. expert system) e.g. If age < 25 and Driver-Education-Course = No Then At-Fault-Accident = Yes with likelihood = 0.3
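The likelihood attached to a rule such as the one above can be read as the fraction of records matching the "If" part that also match the "Then" part. The following is a minimal sketch of that computation; the function name `rule_likelihood`, the field names, and the driver records are hypothetical and invented for illustration, not taken from the course material.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def rule_likelihood(
    records: List[Record],
    condition: Callable[[Record], bool],
    conclusion: Callable[[Record], bool],
) -> float:
    """Fraction of records satisfying `condition` that also satisfy `conclusion`."""
    matching = [r for r in records if condition(r)]
    if not matching:
        raise ValueError("no record satisfies the rule's condition")
    return sum(1 for r in matching if conclusion(r)) / len(matching)


# Hypothetical records, invented for illustration only.
drivers: List[Record] = [
    {"age": 19, "driver_ed": False, "at_fault": True},
    {"age": 22, "driver_ed": False, "at_fault": False},
    {"age": 24, "driver_ed": False, "at_fault": False},
    {"age": 21, "driver_ed": False, "at_fault": False},
    {"age": 23, "driver_ed": True, "at_fault": False},
    {"age": 40, "driver_ed": False, "at_fault": True},
]

# If age < 25 and Driver-Education-Course = No Then At-Fault-Accident = Yes
likelihood = rule_likelihood(
    drivers,
    condition=lambda r: r["age"] < 25 and not r["driver_ed"],
    conclusion=lambda r: r["at_fault"],
)
```

On this toy data four records satisfy the condition and one of them is an at-fault accident, so the rule holds with likelihood 0.25.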

Patterns (Cont’d) Patterns that are completely unrelated to current goals are not considered knowledge. e.g. A pattern relating at-fault accidents to a driver’s age is not useful for analyzing auto sales figures. Pattern + interesting results = knowledge. Age > 16 is not an interesting pattern for drivers, since all drivers are required to have age > 16.

Knowledge Discovery in DB Exhibits Four Main Characteristics: • High-Level Language • Understood by human users • Accuracy • Expressed by measure of uncertainty • Interesting Results • Patterns are novel and potentially useful • Efficiency • Running times for large-sized DB are predictable and acceptable

Efficiency The discovery process should be efficiently implemented on a computer. An algorithm is considered efficient if its run time and space are a low-degree polynomial function of the input length. e.g. efficient algorithms exist for restricted concept classes • Conjunctive concepts, (A ∧ B ∧ C) • Conjunctions of clauses, each a disjunction of no more than k literals (k-CNF), e.g. (A ∨ B) ∧ (C ∨ D) ∧ (E ∨ F), k = 2.
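As an illustration of why conjunctive concepts are an efficiently learnable class, the sketch below uses the classic elimination procedure for conjunctions of positive Boolean attributes: start with every attribute in the hypothesis and drop any attribute that is false in some positive example. The slide does not name a specific algorithm, so this choice, the function name `learn_conjunction`, and the example data are assumptions. The run time is linear in (number of examples × number of attributes).

```python
from typing import Dict, List, Set, Tuple

Example = Tuple[Dict[str, bool], bool]  # (attribute assignment, class label)


def learn_conjunction(examples: List[Example]) -> Set[str]:
    """Return the most specific conjunction of attributes consistent with
    the positive examples (elimination procedure)."""
    if not examples:
        raise ValueError("at least one example is required")
    # Start with the conjunction of all attributes.
    hypothesis = set(examples[0][0])
    for assignment, label in examples:
        if label:
            # An attribute that is false in a positive example
            # cannot be part of the target conjunction.
            hypothesis = {a for a in hypothesis if assignment[a]}
    return hypothesis


# Hypothetical training data for the target concept A AND B AND C.
training: List[Example] = [
    ({"A": True, "B": True, "C": True, "D": True}, True),
    ({"A": True, "B": True, "C": True, "D": False}, True),
    ({"A": True, "B": False, "C": True, "D": True}, False),
]

learned = learn_conjunction(training)
```

Only positive examples change the hypothesis in this procedure; negative examples are ignored, which is sufficient when the data are noise-free and the target really is a conjunction.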

Machine Learning • A learning algorithm takes the data set and its accompanying information as input and returns a statement (e.g., a concept) representing the results of the learning as output • Data sets can be a file of records in a DB • Problems in learning from DBs • DBs are • Dynamic • Incomplete • Noisy • Much larger than typical machine learning data sets • Much of the work in learning from DBs focuses on overcoming these complications!

Related Approaches • DB Management • Integrity • Querying in DB • Deduction in DB • OODBM • Expert Systems • Expert-generated knowledge is usually of higher quality than the data in a DB • Only covers the important cases • Experts are available to confirm the validity and usefulness of discovered patterns • Autonomy of discovery is lacking in expert systems

Related Approaches (Cont’d) • Statistics • Ill suited for the nominal and structured data types • Precluding the use of domain knowledge • Difficult to interpret • Require the guidance of the user to specify when and how to analyze the data

Scientific Discovery • DBKD is less purposeful and less controlled than SD • Scientists can reformulate and rerun their experiments should they find the initial design inadequate • Database managers rarely have the luxury of redesigning their data fields and recollecting the data

A Framework for Knowledge Discovery • Input • Raw data from DB • Information from data dictionary • Additional domain knowledge • User-defined biases that provide high-level focus • Output • New domain knowledge • Feedback of the discovered knowledge to generate new knowledge • DB issues • Dynamic data (time sensitive; e.g. weight, height, pulse rate) • Irrelevant fields (zip codes, pulse rate, sex) • Missing data • Noise and uncertainty • Missing field

Translation Between Database Management and Machine Learning Terms

Conflicting Viewpoints Between Database Management and Machine Learning

A Framework for Knowledge Discovery in Databases

Database and Knowledge • Domain knowledge assists discovery by narrowing the search scope • Data Dictionary • Inter-field knowledge • e.g., weight and height • Inter-instance knowledge • e.g., age + height = seniority; age + weight = seniority • Contradictory knowledge can rule out valuable discoveries: “Trucks don’t drive over water” eliminates the potentially interesting solution “Trucks drive over frozen lakes in winter.”

Discovered Knowledge • Form • Inter-field patterns - relate values of fields in the same record • e.g. (procedure = surgery implies days in hospital > 5) • Inter-record patterns - aggregate over groups of records or identify useful clusters (e.g., profit-making companies) • Rules: X > Y1, A => B form causal chains or networks

Discovered Knowledge (cont’d) • Representation • Discovery must be represented in a form appropriate for the intended user. • Human: natural language, formal logic, visual depictions of information • Computer program (expert system shells): Programming language, declarative formalisms • Discovery System: Feedback as domain knowledge • Need common representation • Uncertainty • Patterns are often probabilistic rather than deterministic • missing and erroneous data • inherent indeterminism of the underlying real world causes (50% chance of rain tomorrow) • sampling

Discovered Knowledge (cont’d) • Measures • Proof of success • Standard deviation • Belief measures • Linguistic uncertainty - fuzzy sets • Visual presentations by density, size, and shading • Sampling technique for large DBs: accuracy of results depends on sample size
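The dependence of accuracy on sample size can be made concrete with the standard error of a sampled proportion, sqrt(p(1 - p) / n). The sketch below assumes simple random sampling from the database; the function name and the chosen values (a pattern holding with likelihood 0.3, samples of 100 and 400 records) are illustrative assumptions.

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from a simple random sample of size n."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# A pattern that holds with likelihood 0.3, estimated from two sample sizes.
se_100 = proportion_standard_error(0.3, 100)
se_400 = proportion_standard_error(0.3, 400)
```

Quadrupling the sample size halves the standard error, so the uncertainty of a discovered pattern shrinks only with the square root of the number of sampled records.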

Discovery Algorithms • Machine Learning: • Unsupervised Learning • Supervised Learning • Unsupervised Learning: • Pattern identification: identifying interesting patterns and describing them in a concise and meaningful manner • Examples • customer with income > $25,000/yr • questionable insurance claims

Discovery Algorithms (Cont’d) • Methods: • Traditional Clustering • Minimize similarity between classes • Maximize similarity within classes Drawbacks • Based on Euclidean distance, works well only on numerical data • Inability to use background information such as likely cluster shape • Conceptual clustering • Based on attribute similarity and conceptual cohesiveness (defined by background information) • Interactive clustering • Combines the human user’s knowledge with the computational power of the computer
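k-means is one widely used example of traditional, Euclidean-distance clustering; the slide does not name a particular algorithm, so its use here is an assumption. The sketch below is a minimal version with caller-supplied initial centers, and the points are invented for illustration. It also shows the drawback noted above: the method needs numeric coordinates and has no way to take background knowledge into account.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty sequence of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def k_means(
    points: Sequence[Point], centers: Sequence[Point], max_iter: int = 100
) -> Tuple[List[Point], List[List[Point]]]:
    """Assign each point to its nearest center, recompute centers, repeat until stable."""
    centers = list(centers)
    clusters: List[List[Point]] = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster received no points.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical 2-D data with two visibly separate groups.
data: List[Point] = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centers, final_clusters = k_means(data, centers=[(0, 0), (10, 10)])
```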

Discovery Algorithms (Cont’d) • Supervised Learning: • Description process • Summarizes the relevant qualities of the identified class In discovery systems, user supervision can occur in either the identification or the description process.

Concept Description (Supervised Concept Learning) Discovery in large, complex databases requires both empirical methods to detect the statistical regularity of patterns and knowledge-based approaches to incorporate available domain knowledge. Discovery tasks • Summarization - Summarize class records by describing their common or characteristic features • Discrimination - Describe qualities sufficient to discriminate records of one class from another • Comparison - Describe the class in a way that facilitates comparison and analysis with other records
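A very simple reading of the first two tasks can be sketched over attribute-value records: summarization keeps the attribute values shared by every record of a class, and discrimination keeps those shared values that never occur in records outside the class. This is one possible interpretation under stated assumptions, not the method used in the course; the function names and company records are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]


def summarize(records: List[Record]) -> Record:
    """Attribute-value pairs common to every record of the class."""
    if not records:
        raise ValueError("class has no records")
    common = set(records[0].items())
    for r in records[1:]:
        common &= set(r.items())
    return dict(common)


def discriminate(target: List[Record], others: List[Record]) -> Record:
    """Common features of the target class that no other record exhibits."""
    seen_elsewhere = {pair for r in others for pair in r.items()}
    return {k: v for k, v in summarize(target).items() if (k, v) not in seen_elsewhere}


# Hypothetical records, invented for illustration only.
profitable = [
    {"sector": "software", "region": "west", "debt": "low"},
    {"sector": "retail", "region": "west", "debt": "low"},
]
unprofitable = [
    {"sector": "retail", "region": "west", "debt": "high"},
    {"sector": "software", "region": "east", "debt": "high"},
]

summary = summarize(profitable)
discriminant = discriminate(profitable, unprofitable)
```

Here the summary is region = west and debt = low, while only debt = low discriminates the class, because region = west also appears among the other records.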

Future Directions • Domain Knowledge - how to effectively use domain knowledge to discover knowledge • Efficient Algorithms • Restrict rule type • Heuristic and approximate algorithms • Sampling • Parallel computing • OODBM • Deductive DB • Incremental methods • Efficiently keep pace with changes in Data • Incremental discovery system, reuse their discoveries and make more complex discoveries

Future Directions (cont’d) • Interactive systems • Knowledge analyst included in the discovery loop • Use human judgement and machine computation power • Need information to be presented in a human-oriented form (text, sound, visuals) • Integration

Applications of Discovery in DB • Medicine • Finance • Agriculture • Social • Marketing & Sales • Insurance • Engineering • Physics & Chemistry • Military • Law Enforcement • Space Science • Publishing

Applications of Discovery in DB (Cont’d) • Discovery of Quantitative Laws • Data Driven Discovery of Quantitative Laws • Using Knowledge in Discovery • Data Summarization • Domain Specific Discovery Methods • Integrated & Multi-Paradigm Systems • Methodology and Application Issues

Query Processing With Domain Semantics Wesley W. Chu

Query Optimization Problem To find a sequence of operations, which has the minimal processing cost.

Conventional Query Optimization (CQO) For a given query: • Generate a set of queries that are equivalent to the given query • Determine the processing cost of each such query • Select the lowest-cost query processing strategy among these equivalent queries
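The three steps above can be sketched for one familiar case, choosing a join order: enumerate the equivalent left-deep join orders, cost each one, and keep the cheapest. The cost model (sum of estimated intermediate result sizes), the relation names, the cardinalities, and the selectivities are all toy assumptions made for illustration; real optimizers use far richer cost models and prune the search space.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Sequence, Tuple


def plan_cost(
    order: Sequence[str],
    cardinality: Dict[str, int],
    selectivity: Dict[FrozenSet[str], float],
) -> float:
    """Toy cost: sum of the estimated intermediate result sizes of a left-deep join."""
    size = float(cardinality[order[0]])
    joined = [order[0]]
    cost = 0.0
    for rel in order[1:]:
        size *= cardinality[rel]
        for prev in joined:
            # A missing entry means no join predicate (selectivity 1.0).
            size *= selectivity.get(frozenset((prev, rel)), 1.0)
        joined.append(rel)
        cost += size
    return cost


def cheapest_plan(
    relations: Sequence[str],
    cardinality: Dict[str, int],
    selectivity: Dict[FrozenSet[str], float],
) -> Tuple[str, ...]:
    """Generate every equivalent join order, cost each, and select the cheapest."""
    return min(
        permutations(relations),
        key=lambda order: plan_cost(order, cardinality, selectivity),
    )


# Hypothetical statistics, invented for illustration only.
cardinality = {"SHIP": 1000, "CLASS": 50, "TYPE": 10}
selectivity = {
    frozenset(("SHIP", "CLASS")): 1 / 50,
    frozenset(("CLASS", "TYPE")): 1 / 10,
}

best_order = cheapest_plan(["SHIP", "CLASS", "TYPE"], cardinality, selectivity)
best_cost = plan_cost(best_order, cardinality, selectivity)
```

With these numbers the cheapest orders join the two small relations first and bring in SHIP last, which keeps the first intermediate result small.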

Limitations of CQO There are certain queries that cannot be optimized by Conventional Query Optimization. For example, given the query: “Which ships have deadweight greater than 200 thousand tons?” A search of the entire database may be required to answer this query.

The Use of Knowledge • ASSUMING AN EXPERT KNOWS THAT: 1. The SHIP relation is indexed on ShipType, and there are about 10 different ship types, and 2. a ship must be a “SuperTanker” (one of the ShipTypes) if its deadweight is greater than 150K tons. • AUGMENTED QUERY: “Which SuperTankers have deadweight greater than 200K tons?” • RESULT: About 90% of the search time is saved. The technique of improving queries with semantic knowledge is called Semantic Query Optimization.
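The saving can be illustrated by counting how many records each strategy examines. The sketch below builds a synthetic SHIP relation with 10 equally sized ship types, in which only SuperTankers exceed 150K tons, and compares a full scan with a lookup through an index on ship type. The data, the field names, and the uniform distribution across types are assumptions chosen to mirror the slide's "about 10 types" scenario.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Ship = Dict[str, object]

SHIP_TYPES = ["SuperTanker"] + [f"Type{i}" for i in range(1, 10)]


def build_ships() -> List[Ship]:
    """Synthetic SHIP relation: 100 ships per type, deadweight in thousand tons."""
    ships: List[Ship] = []
    for ship_type in SHIP_TYPES:
        for i in range(100):
            # Only SuperTankers exceed 150K tons, as the constraint requires.
            deadweight = 151 + i if ship_type == "SuperTanker" else 50 + i
            ships.append({"name": f"{ship_type}-{i}", "type": ship_type, "deadweight": deadweight})
    return ships


def full_scan(ships: List[Ship], min_dw: int) -> Tuple[Set[str], int]:
    """Original query: examine every record."""
    answers = {s["name"] for s in ships if s["deadweight"] > min_dw}
    return answers, len(ships)


def indexed_scan(index: Dict[str, List[Ship]], min_dw: int) -> Tuple[Set[str], int]:
    """Augmented query: examine only the SuperTanker bucket of the ShipType index."""
    bucket = index["SuperTanker"]
    answers = {s["name"] for s in bucket if s["deadweight"] > min_dw}
    return answers, len(bucket)


ships = build_ships()
type_index: Dict[str, List[Ship]] = defaultdict(list)
for ship in ships:
    type_index[ship["type"]].append(ship)

original_answers, original_examined = full_scan(ships, 200)
augmented_answers, augmented_examined = indexed_scan(type_index, 200)
```

Both strategies return the same ships, but the augmented query examines 100 records instead of 1000, which corresponds to the roughly 90% saving stated above.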

Semantic Query Optimization (SQO) Uses domain knowledge to transform the original query into a more efficient query that still yields the same answer. Assuming a set of integrity constraints is available as the domain knowledge, • Represent each integrity constraint as Pi → Ci, where 1 ≤ i ≤ n. • Translate (augment) the original query Q into Q’ subject to C1, C2, ..., Cn, such that Q’ yields a lower processing cost than Q. • Query Optimization Problem: Find C1, C2, ..., Cm that yield the minimal query processing cost; that is, C(Q’) = min over Ci of C(Q ∧ C1 ∧ ... ∧ Cm)

Semantic Equivalence Domain knowledge of the database application may be used to transform the original query into semantically equivalent queries. Semantic Equivalence: Two queries are considered semantically equivalent if they result in the same answer in any state of the database that conforms to the integrity constraints. Integrity Constraints: A set of if-then rules that force the database to be an accurate instance of the real-world database application. Examples of constraints include: • state snapshot constraints: e.g., if deadweight > 150K then ShipType = “SuperTanker.” • state transition constraints: e.g., salary can only be increased, i.e., salary (new) > salary (old)
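The definition can be checked on small examples: two queries agree on any state that satisfies the snapshot constraint, but may disagree on a state that violates it. The sketch below encodes the two example constraints as predicates and compares the original and augmented ship queries. The record layout, function names, and sample states are hypothetical; deadweight is in thousand tons.

```python
from typing import Dict, List, Set

Ship = Dict[str, object]


def conforms(ships: List[Ship]) -> bool:
    """State snapshot constraint: deadweight > 150K implies ShipType = SuperTanker."""
    return all(s["deadweight"] <= 150 or s["type"] == "SuperTanker" for s in ships)


def salary_update_allowed(old_salary: float, new_salary: float) -> bool:
    """State transition constraint: salary (new) > salary (old)."""
    return new_salary > old_salary


def q_original(ships: List[Ship]) -> Set[str]:
    return {s["name"] for s in ships if s["deadweight"] > 200}


def q_augmented(ships: List[Ship]) -> Set[str]:
    return {s["name"] for s in ships if s["type"] == "SuperTanker" and s["deadweight"] > 200}


# Hypothetical database states, invented for illustration only.
valid_state: List[Ship] = [
    {"name": "Atlas", "type": "SuperTanker", "deadweight": 250},
    {"name": "Borea", "type": "SuperTanker", "deadweight": 180},
    {"name": "Carl", "type": "Freighter", "deadweight": 90},
]
invalid_state = valid_state + [{"name": "Rogue", "type": "Freighter", "deadweight": 220}]

same_on_valid = q_original(valid_state) == q_augmented(valid_state)
same_on_invalid = q_original(invalid_state) == q_augmented(invalid_state)
```

The two queries coincide on the conforming state and differ on the violating one, which is why semantic equivalence is defined only over states that conform to the integrity constraints.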

Limitations of Current Approach Current approach of SQO using: • Integrity constraints as knowledge • Conventional data models

Limitations of Integrity Constraints • Integrity constraints are often too general to be useful in SQO, because: • Integrity constraints describe every possible database state • The user is only concerned with the current database content. • Most databases do not provide integrity checking due to: • Unavailability of integrity constraints • Overhead of checking the integrity Thus, the usefulness of integrity constraints in SQO is quite limited.

Limitations Of Conventional Data Models Conventional data models lack the expressive capability needed for convenient modeling. Many useful semantics are ignored, so only limited knowledge is collected. FOR EXAMPLE: “Which employees earn more than 70K a year?” The integrity constraint: “The salary range of employees is between 20K and 90K.” is useless in improving this query.

Augmentation Of SQO With Semantic Data Models If the employees are divided into three categories, MANAGERS, ENGINEERS, and STAFF, and each category is associated with some constraints: • The salary range of MANAGERS is from 35K to 90K. • The salary range of ENGINEERS is from 25K to 60K. • The salary range of STAFF is from 20K to 35K. A better query can be obtained: “Which managers earn more than 70K a year?”
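The rewrite follows mechanically from the per-category salary ranges: any category whose upper bound does not exceed the query threshold cannot contribute an answer and can be dropped. A minimal sketch of that pruning step follows; the dictionary layout and function name are illustrative assumptions, while the ranges are the ones listed on the slide (in thousands per year).

```python
from typing import Dict, List, Tuple

# Salary ranges per category, in thousands per year, as given on the slide.
SALARY_RANGE: Dict[str, Tuple[int, int]] = {
    "MANAGERS": (35, 90),
    "ENGINEERS": (25, 60),
    "STAFF": (20, 35),
}


def categories_earning_more_than(threshold: int) -> List[str]:
    """Categories that can contain an employee with salary > threshold."""
    return [name for name, (_, high) in SALARY_RANGE.items() if high > threshold]


relevant = categories_earning_more_than(70)
```

For a threshold of 70K only MANAGERS remains, so the query over all employees can be restricted to managers without changing its answer.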

CLASS = (Type, Class, Name, Displacement, Draft, Enlist)

Rule Statistics • Columns: Rule Set, Rule Size, CM • Rule sets: Class → Type, Name → Type, Displacement → Type, Draft → Type, Enlist → Type • Values: 168 3 2 1 36 78 9 7 4 35

SQP Performance for Selected Database Structure
Type Hierarchy    Metric      CQP    SQP
order by Class    cpu (ms)    429    444
order by Class    #dio         12     10
order by Type     cpu (ms)    426    392
order by Type     #dio         12      7

Performance Improvement for Selected Attributes
Attribute    Metric      CQP    SQP
Class        cpu (ms)    505    432
Class        #dio         11     11
Enlist       cpu (ms)    129    130
Enlist       #dio          3      4

Summary Contributions: Providing a model-based methodology for acquiring knowledge from the database by rule induction. Applications: 1. Semantic Query Processing – use semantic knowledge to improve query processing performance. 2. Deductive Database Systems - use induced rules to provide intensional answers. 3. Data Inference Applications - use rules to improve data availability by inferring inaccessible data from accessible data.

Capture Database Semantics By Rule Induction Wesley W. Chu & Rei-Chi Lee
