Advanced Database Applications: Database Indexing and Data Mining • CS591-G1 -- Fall 2001 • George Kollios, Boston University
Prof. George Kollios Office: MCS 288 Office Hours: Monday 2:00pm-3:30pm Thursday 11:00am-12:30pm Mailing List: cs591g1
History of Database Technology • 1960s: • Data collection, database creation, IMS and network DBMS • 1970s: • Relational data model, relational DBMS implementation • 1980s: • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.) • 1990s—2000s: • Data mining and data warehousing, multimedia databases, and Web databases
Structure of a RDBMS • A DBMS is an OS for data! • A typical RDBMS has a layered architecture, from top to bottom: Query Optimization and Execution, Relational Operators, Files and Access Methods, Buffer Management, Disk Space Management, DB. • Modern database systems extend these layers. • This is one of several possible architectures; each system has its own variations.
Index Methods for RDBMS • Hashing Methods: • Linear Hashing, extendible hashing • B-tree family: • B+-trees and variations • Both of them are one-dimensional
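A minimal sketch of what "one-dimensional" means here, assuming a toy in-memory table: a Python dict stands in for a hash index and a sorted key list stands in for the ordered leaf level of a B+-tree. Neither is the disk-based structure itself, and the records and the name `range_query` are made up for illustration. Both structures organize records by a single key.

```python
import bisect

# Hypothetical records: (key, name). Only one attribute, the key, is indexed.
records = [(17, "Ann"), (42, "Bob"), (23, "Cy"), (8, "Dee")]

# Stand-in for a hash index: good for equality lookups on the key.
hash_index = {key: name for key, name in records}

# Stand-in for the sorted leaf level of a B+-tree: supports range scans.
sorted_keys = sorted(key for key, _ in records)

def range_query(lo, hi):
    """Return all indexed keys k with lo <= k <= hi."""
    left = bisect.bisect_left(sorted_keys, lo)
    right = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[left:right]

print(hash_index[42])       # equality lookup
print(range_query(10, 30))  # range lookup on the single indexed key
```

A query over two attributes at once (for example, a rectangle in x and y) cannot be answered directly by either structure, which is the gap that multi-dimensional index methods address.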
Overview of the course • Spatial Database Systems • GIS, CAD/CAM; e.g., the NASA EOSDIS project • Manage points, lines and regions • Temporal Database Systems • Billing, medical records • Spatio-temporal Databases • Moving objects, changing regions, etc.
Overview of the course • Multimedia and medical databases • A multimedia system can store and retrieve objects/documents with text, voice, images, video clips, etc • Time series databases • Stock market, ECG, trajectories, etc
Multimedia databases • Applications: • Digital libraries, entertainment, office automation • Medical imaging: digitized X-rays and MRI images (2 and 3-dimensional) • Query by content: (or QBE) • Efficient • ‘Complete’ (no false dismissals)
What is Data Mining? • Data mining (knowledge discovery in databases): • The efficient discovery of previously unknown, valid, potentially useful and understandable information or patterns from data in large databases • Alternative names: • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, etc.
DM Applications • Database analysis and decision support • Market analysis: target marketing, market basket analysis, market segmentation • Fraud detection and management • Biology and medicine • Text mining (news group, email, documents) and Web analysis.
Data Mining: Confluence of Multiple Disciplines • Database technology • Statistics • Machine learning • Visualization • Information science • Other disciplines
Overview of terms • Data: a set of facts (items) D, stored in a database • Pattern: an expression E in a language L, that describes a subset of facts • Attribute: a field in an item i in D • Interestingness: a function $I_{D,L}$ that maps an expression to a measure space M
The Data Mining Task • For a given dataset D, language of facts L, interestingness function $I_{D,L}$ and threshold c, efficiently find the expressions E such that $I_{D,L}(E) > c$.
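A minimal sketch of this task, under assumptions not stated in the slides: the facts are small item sets, the language L is the set of non-empty itemsets, and the interestingness function is the fraction of facts containing the itemset. The dataset and the names `interestingness` and `mine` are hypothetical. The enumeration is exhaustive, so it illustrates the definition rather than the "efficiently" requirement.

```python
from itertools import combinations

# Hypothetical toy dataset D: each fact is a set of items.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def interestingness(expression, data):
    """Assumed interestingness: fraction of facts in data that contain the itemset."""
    return sum(1 for fact in data if expression <= fact) / len(data)

def mine(data, c):
    """Return every non-empty itemset E whose interestingness exceeds c."""
    items = sorted(set().union(*data))
    found = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            expression = frozenset(combo)
            if interestingness(expression, data) > c:
                found.append(expression)
    return found

patterns = mine(D, c=0.4)
for pattern in patterns:
    print(sorted(pattern), interestingness(pattern, D))
```

The number of candidate expressions grows exponentially with the number of items, which is why practical algorithms prune the search space instead of enumerating it.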
How Data Mining is used • Identify the problem • Use data mining techniques to transform the data into information • Act on the information • Measure the results
DM Functionalities • Concept description: • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions • Association (correlation and causality): • Multi-dimensional vs. single-dimensional association • age(X, “20..29”) ∧ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%] • contains(T, “computer”) → contains(T, “software”) [support = 1%, confidence = 75%]
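The bracketed support and confidence figures can be computed as follows. This is a sketch using the standard definitions (support: fraction of transactions containing all items of the rule; confidence: that support divided by the support of the left-hand side). The transactions are invented for illustration and do not reproduce the percentages on the slide.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of its left-hand side.

    Assumes the antecedent occurs in at least one transaction.
    """
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Hypothetical transactions T for the rule contains(T, computer) -> contains(T, software).
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"software"},
]

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Here two of the five transactions contain both items, and two of the three transactions containing "computer" also contain "software".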
DM Functionalities • Cluster analysis • Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns • Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
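As an illustration of grouping unlabeled data, here is a small one-dimensional k-means sketch. K-means is one common clustering method and is not named in the slides; the house prices, the initial centers, and the function name `kmeans_1d` are assumptions made for the example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move centers to cluster means."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices (in thousands); no class labels are given.
house_prices = [100, 110, 120, 300, 310, 330]
centers, clusters = kmeans_1d(house_prices, centers=[100, 330])
print(centers)
print(clusters)
```

Values inside each resulting cluster lie close together (high intra-class similarity), while the two clusters lie far apart (low inter-class similarity).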
DM Functionalities • Classification and Prediction • Finding models (functions) that describe and distinguish classes or concepts for future prediction • E.g., classify countries based on climate, or classify cars based on gas mileage • Presentation: decision-tree, classification rule, neural network • Prediction: Predict some unknown or missing numerical values
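The gas-mileage example can be sketched with a one-level decision tree (a "decision stump") that learns a single threshold rule from labeled data. The training data, labels, and function names below are hypothetical, and a stump is only the simplest instance of the decision-tree presentation mentioned above.

```python
def fit_stump(samples):
    """Learn the rule 'mpg <= threshold -> low_label, else high_label' with fewest errors.

    samples: list of (mpg, label) pairs with at least two distinct mpg values.
    """
    values = sorted({x for x, _ in samples})
    labels = sorted({y for _, y in samples})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate split between neighbouring values
        for low_label in labels:
            for high_label in labels:
                errors = sum(
                    1 for x, y in samples
                    if (low_label if x <= threshold else high_label) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, low_label, high_label)
    return best[1:]

def predict(model, mpg):
    """Apply the learned threshold rule to a new car."""
    threshold, low_label, high_label = model
    return low_label if mpg <= threshold else high_label

# Hypothetical training data: (miles per gallon, mileage class).
cars = [(15, "low"), (18, "low"), (22, "low"), (30, "high"), (35, "high"), (40, "high")]
model = fit_stump(cars)
print(model, predict(model, 28), predict(model, 20))
```

Unlike clustering, the class labels are known in advance here, and the learned model is used to predict the class of cars that were not in the training data.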