Introduction to Biometrics. Dr. Bhavani Thuraisingham, The University of Texas at Dallas. Lecture #3: Information Management and Data Mining. August 29, 2005.
Objective of the Unit • This unit gives an overview of various information management technologies. In addition, some details of data mining are given.
Outline of the Unit • What is Information Management? • Some Information Management Technologies • Information Management Applications • Data Mining
Revisiting the DM/IM/KM Framework • Each layer builds on the technologies of the lower layers • [Figure: layered framework. Recoverable layer labels: Data Management Technologies (relational, object, distributed, heterogeneous database systems); Information Management Technologies (data warehouse, data mining, multimedia information systems, information retrieval, web, peer-to-peer and sensor information management); Knowledge Management Technologies (knowledge creation and acquisition, representation, manipulation and sharing; knowledge models, portals, expert systems and reasoning under uncertainty). The figure also names security-related topics such as biometrics, digital forensics, privacy, secure semantic web and secure digital libraries.]
What is Information Management? • Information management essentially analyzes the data and makes sense of it • Several technologies have to work together for effective information management • Data Warehousing: extracting relevant data and putting it into a repository for analysis • Data Mining: extracting previously unknown information from the data • Multimedia: managing different media, including text, images, video and audio • Web: managing the databases and libraries on the web
Data Warehouse • Data warehouse: data correlating employees with medical benefits and projects • Users query the warehouse • The warehouse could be any DBMS; usually based on the relational data model • Sources feeding the warehouse: Oracle DBMS for Employees, Sybase DBMS for Projects, Informix DBMS for Medical
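A minimal sketch of the warehouse idea on this slide, assuming three tiny tables stand in for the Oracle, Sybase and Informix sources. All table names, column names and rows are hypothetical, and Python's built-in sqlite3 module is used only for illustration; it is not the warehouse technology the lecture describes.

```python
import sqlite3

# Hypothetical extracts from the three source DBMSs (all names and rows are invented).
employees = [(1, "John"), (2, "Mary")]
projects = [(1, "Biometrics"), (2, "Data Mining")]
medical = [(1, "Plan A"), (2, "Plan B")]

# The in-memory database plays the role of the warehouse repository.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE project (emp_id INTEGER, project TEXT)")
conn.execute("CREATE TABLE medical (emp_id INTEGER, plan TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)", employees)
conn.executemany("INSERT INTO project VALUES (?, ?)", projects)
conn.executemany("INSERT INTO medical VALUES (?, ?)", medical)

# Users query the warehouse: correlate employees with projects and medical benefits.
rows = conn.execute(
    """
    SELECT e.name, p.project, m.plan
    FROM employee AS e
    JOIN project AS p ON p.emp_id = e.emp_id
    JOIN medical AS m ON m.emp_id = e.emp_id
    ORDER BY e.emp_id
    """
).fetchall()
print(rows)
```

The point of the sketch is only that the correlation query runs against one repository rather than against the three source systems.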
Data Mining • Also known as: Information Harvesting, Knowledge Mining, Knowledge Discovery in Databases, Data Dredging, Data Archaeology, Data Pattern Processing, Database Mining, Knowledge Extraction, Siftware • Definition: the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data, often previously unknown, using pattern recognition technologies and statistical and mathematical techniques (Thuraisingham 1998)
Multimedia Information Management • [Figure: Broadcast News Editor (BNE) and Broadcast News Navigator (BNN) pipeline] • Video source: segregate video streams into imagery, audio and closed caption text • Components shown: scene change detection, frame classifier, key frame selection, commercial detection, silence detection, speaker change detection, closed caption preprocess, token detection, named entity tagging, broadcast detection, correlation, story segmentation, story GIST, theme • Analyze and store video and metadata in a multimedia database management system • Web-based search/browse by program, person, location, ...
Semantic Web • [Figure: layer stack, top to bottom: Logic, Proof and Trust; Rules/Query; RDF, Ontologies; XML, XML Schemas; URI, UNICODE; with Trust, Privacy and Other Services labeled alongside the layers] • Adapted from Tim Berners-Lee's description of the Semantic Web • Some challenges: security and privacy cut across all layers; integration of services; composability
Semantic Web Technologies • Web Database/Information Management: information retrieval and digital libraries • XML, RDF and Ontologies: representing information • Information Interoperability: integrating heterogeneous data and information sources • Intelligent Agents: agents for locating resources, managing resources, querying resources and understanding web pages • Semantic Grids: integrating the semantic web with grid computing technologies
Secure Data Sharing Across Coalitions • [Figure: Agency A, Agency B and Agency C each hold their own component data/policy and export data/policy, which together form the data/policy for the coalition]
Some Emerging Information Management Technologies • Visualization: visualization tools enable the user to better understand the information • Peer-to-Peer Information Management: peers communicate with each other, share resources and carry out tasks • Sensor and Wireless Information Management: autonomous sensors cooperating with one another, gathering data, fusing data and analyzing the data; integrating wireless technologies with semantic web technologies
Information Management for Applications: Examples • Decision Support • E-Commerce • Collaboration • Training • Knowledge Management • Virtual Organizations and Dynamic Coalitions
Outline of Data Mining • What is Data Mining? • Steps to Data Mining • Need for Data Mining • Example Applications • Technologies for Data Mining • Why Data Mining Now? • Preparation for Data Mining • Data Mining Tasks, Methodology, Techniques • Commercial Developments • Status, Challenges, and Directions • Example Data Mining Technique
Data Mining • Also known as: Information Harvesting, Knowledge Mining, Knowledge Discovery in Databases, Data Dredging, Data Archaeology, Data Pattern Processing, Database Mining, Knowledge Extraction, Siftware • Definition: the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data, often previously unknown, using pattern recognition technologies and statistical and mathematical techniques (Thuraisingham 1998)
Steps to Data Mining • [Figure: flow starting from the data sources] • Integrate data sources • Clean/modify data sources • Mine the data • Examine results/prune results • Report final results
Need for Data Mining • Large amounts of current and historical data are being stored • As databases grow larger, decision-making directly from the data is no longer feasible; knowledge must be derived from the stored data • Data comes from multiple data sources and multiple domains: medical, financial, military, etc. • Need to analyze the data • Support for planning (historical supply and demand trends) • Yield management (scanning airline seat reservation data to maximize yield per seat) • System performance (detecting abnormal behavior in a system) • Mature database analysis (cleaning up the data sources)
Example Applications • A medical supplies company increases sales by targeting its advertising at physicians who are likely to buy its products • A credit bureau limits losses by selecting candidates who are unlikely to default on their payments • An intelligence agency detects abnormal behavior among its employees • An investigation agency finds fraudulent behavior by some people
Integration of Multiple Technologies • [Figure: Data Mining at the center, integrating the following technologies] • Artificial Intelligence • Machine Learning • Database Management • Statistics • Parallel Processing • Visualization
Why Data Mining Now? • Large amounts of data are being produced • Data is being organized • Technologies are developing for database management, data warehousing, parallel processing, machine intelligence, etc. • It is now possible to mine the data and get patterns and trends • Interesting applications exist
Preparation for Data Mining • Getting the data into the right format • Data warehousing • Scrubbing and cleaning the data • Some idea of application domain • Determining the types of outcomes • e.g., Clustering, classification • Evaluation of tools • Getting the staff trained in data mining
Some Types of Data Mining (Data Mining Tasks) • Classification – grouping records into meaningful subclasses • e.g., a marketing organization has a list of people living in Manhattan who all own cars costing over 20K • Sequence detection – finding events that regularly follow one another • e.g., John always buys groceries after going to the bank • Data dependency analysis – identifying potentially interesting dependencies or relationships among data items • e.g., if John, James, and Jane meet, Bill is also present • Deviation detection – discovery of significant differences between an observation and some reference • Anomalous instances • Discrepancies between observed and expected values
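As a rough illustration of the deviation detection task, the sketch below flags observations that differ strongly from a reference. The choice of the sample mean as the reference, the z-score test, the threshold of 2.0 and the login counts are all assumptions made for this example; the slides do not prescribe a particular method.

```python
from statistics import mean, stdev

def deviations(observed, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds the threshold.

    The reference is the sample mean of the observations (an assumption of this sketch).
    """
    mu = mean(observed)
    sigma = stdev(observed)
    if sigma == 0:
        return []  # all observations are identical, so nothing deviates
    return [(i, x) for i, x in enumerate(observed) if abs(x - mu) / sigma > threshold]

# Hypothetical daily login counts for one employee; day 6 is unusual.
daily_logins = [10, 12, 11, 9, 10, 11, 48, 10]
print(deviations(daily_logins))  # [(6, 48)]
```

In practice the reference would more likely come from historical data or an expected-value model than from the same sample being tested.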
Data Mining Methodology (or Approach) • Top-down • Hypothesis testing • Validate beliefs • Bottom-up • Discover patterns • Directed • Some idea what you want to get • Undirected • Start from fresh
Some Data Mining Techniques • Market Basket analysis • Decision Trees • Neural networks • Link Analysis • Genetic Algorithms • Automatic Cluster Detection • Inductive logic programming
Commercial Developments in Data Mining: Some Products • WizSoft - WizWhy • Hugin - Hugin • IBM - Intelligent Miner • Red Brick - DataMind • Neo Vista - Decision Series • Reduct Systems - Datalogic/R • IDIS - Information Discovery • Lockheed Martin - Recon • Nicesoft - Nicel • SAS - Enterprise Miner
Current Status, Challenges and Directions • Status • Data Mining is now a technology • Several prototypes and tools exist; Many or almost all of them work on relational databases • Challenges • Mining large quantities of data; Dealing with noise and uncertainty, reasoning with incomplete data • Directions • Mining multimedia and text databases, Web mining (structure, usage and content), Mining metadata, Real-time data mining
Example Data Mining Technique: What is Market Basket Analysis? • Market basket analysis is a collection of techniques that discover rules such as which items are purchased together • It has roots in point-of-sale transactions, but has gone beyond this application • E.g., who travels together, who is seen with whom, etc. • Market basket analysis is used as a starting point when transaction data is available and we are not sure of the patterns we are looking for • Find items that are purchased together • Essentially, market basket analysis produces association rules
Example • Person: Countries Visited • John: England, France • James: Germany, England, Switzerland • William: England, Austria • Mary: England, Austria, France • Jane: Switzerland, France
Co-Occurrence Table
              England  Switzerland  Germany  France  Austria
England          4          1          1       2        2
Switzerland      1          2          1       1        0
Germany          1          1          1       0        0
France           2          1          0       3        1
Austria          2          0          0       1        2
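The co-occurrence table for this example can be recomputed from the five trips. The sketch below is one straightforward way to do so; the function and variable names are hypothetical.

```python
# The five trips from the example slide.
trips = {
    "John": {"England", "France"},
    "James": {"Germany", "England", "Switzerland"},
    "William": {"England", "Austria"},
    "Mary": {"England", "Austria", "France"},
    "Jane": {"Switzerland", "France"},
}
countries = ["England", "Switzerland", "Germany", "France", "Austria"]

def co_occurrence(trips, items):
    """Count, for every pair (a, b), the trips that contain both a and b.

    The diagonal entry (a, a) is simply the number of trips that contain a.
    """
    table = {a: {b: 0 for b in items} for a in items}
    for visited in trips.values():
        for a in items:
            for b in items:
                if a in visited and b in visited:
                    table[a][b] += 1
    return table

table = co_occurrence(trips, countries)
for a in countries:
    print(f"{a:<12}", *(f"{table[a][b]:>3}" for b in countries))
```

The table is symmetric, so only the upper triangle actually needs to be counted; the full loop is kept here for readability.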
Example (Concluded) • England and France, and England and Austria, are more likely to be visited together than any other pair of countries • Austria is never visited together with Germany or Switzerland • Germany is never visited together with Austria or France • Rule: if a person travels to France, then he/she also travels to England • Support for this rule is 40%, since 2 trips out of 5 contain both France and England • Confidence for this rule is about 67%, since 2 of the 3 trips that contain France also contain England • That is, the rule "if France then England" has support 40% and confidence about 67% • Challenge: how to automatically generate the rules
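The support and confidence figures for the "if France then England" rule can be checked with a few lines of code. This is a sketch using the usual definitions (support = fraction of trips containing every item in the rule; confidence = support of the whole rule divided by support of the "if" part); the function names are hypothetical.

```python
# The five trips from the example, one set of countries per person.
trips = [
    {"England", "France"},                  # John
    {"Germany", "England", "Switzerland"},  # James
    {"England", "Austria"},                 # William
    {"England", "Austria", "France"},       # Mary
    {"Switzerland", "France"},              # Jane
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of its antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the "if" part never occurs, so the rule has no evidence
    return support(set(antecedent) | set(consequent), transactions) / base

print(f"support    = {support({'France', 'England'}, trips):.0%}")
print(f"confidence = {confidence({'France'}, {'England'}, trips):.1%}")
```

Note that the reverse rule, "if England then France", has the same support but a confidence of only 2 out of 4, which is why rules are directional.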
Basic Process • Choosing the right set of items • Need to gather the right set of transaction data at the right level of detail, ensuring data quality • Generating rules from the data • Generate the co-occurrence matrix for single items • Generate the co-occurrence matrix for 2 items and use this to find rules with 2 items • Generate the co-occurrence matrix for 3 items and use this to find rules with 3 items; etc. • Overcoming practical limits imposed by thousands of items • Avoid combinatorial explosion
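The level-by-level process on this slide (single items, then pairs, then triples) can be sketched as follows. The pruning used here, which builds larger candidates only from smaller itemsets that occur often enough, is the standard Apriori-style idea and is an assumption of this sketch; the slides do not name a specific algorithm. The function name and the minimum count of 2 are also hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Level-wise search: size-k candidates are built only from frequent size-(k-1) itemsets."""
    items = sorted(set().union(*transactions))
    current = [frozenset([item]) for item in items]  # level 1: single items
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate at this level.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        k += 1
        # Join step: merge pairs of survivors whose union has exactly k items.
        candidates = {a | b for a, b in combinations(survivors, 2) if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets survived.
        current = [
            c for c in candidates
            if all(frozenset(s) in survivors for s in combinations(c, k - 1))
        ]
    return frequent

trips = [
    {"England", "France"},
    {"Germany", "England", "Switzerland"},
    {"England", "Austria"},
    {"England", "Austria", "France"},
    {"Switzerland", "France"},
]
result = frequent_itemsets(trips, min_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On the travel example with a minimum count of 2, the search stops after the pair level: the only 3-item candidate (England, France, Austria) is pruned because France and Austria co-occur only once. This pruning is what keeps the number of candidates from exploding when there are thousands of items.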
Association Rules • Rules that find associations in data • An example of an association rule is {x1, x2, x3} → x4, meaning that if x1, x2, and x3 are purchased, x4 is also purchased • Association rules have confidence values • Strong rules are rules with a confidence value above a threshold • The challenge is to improve the algorithm • E.g., partition-based approach, sampling
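To make the notion of a strong rule concrete, the sketch below splits each given itemset into an "if" part and a "then" part in every possible way and keeps the rules whose confidence reaches a threshold. The function name, the 0.6 threshold and the choice of input itemsets are assumptions for illustration, reusing the travel example from the earlier slides.

```python
from itertools import combinations

def strong_rules(transactions, itemsets, min_confidence):
    """Return (antecedent, consequent, confidence) for rules at or above the threshold."""
    def count(items):
        return sum(items <= t for t in transactions)

    rules = []
    for itemset in itemsets:
        itemset = frozenset(itemset)
        # Try every non-empty proper subset as the "if" part of the rule.
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                lhs_count = count(lhs)
                if lhs_count == 0:
                    continue  # the antecedent never occurs
                conf = count(itemset) / lhs_count
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

trips = [
    {"England", "France"},
    {"Germany", "England", "Switzerland"},
    {"England", "Austria"},
    {"England", "Austria", "France"},
    {"Switzerland", "France"},
]
rules = strong_rules(trips, [{"England", "France"}, {"England", "Austria"}], 0.6)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"confidence {conf:.0%}")
```

With the 0.6 threshold, "France -> England" and "Austria -> England" are kept, while the reverse rules starting from England (confidence 50% each) are dropped.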
Challenges and Directions • Performance improvements • Applying techniques for web mining including web content mining, web structure mining and web usage mining • Finding associations in text • Associations between words in a document or multiple documents