Explore the security issues, data mining techniques, and the integration of heterogeneous data sources in data warehousing. Learn about metadata, indexing, and multilevel security.
Data and Applications Security: Developments and Directions
Dr. Bhavani Thuraisingham, The University of Texas at Dallas
Lecture #17: Data Warehousing, Data Mining and Security
March 3, 2008
Outline • Background on Data Warehousing • Security Issues for Data Warehousing • Data Mining and Security
What is a Data Warehouse? • A Data Warehouse is a: • Subject-oriented • Integrated • Nonvolatile • Time variant • Collection of data in support of management’s decisions • From: Building the Data Warehouse by W. H. Inmon, John Wiley and Sons • Integration of heterogeneous data sources into a repository • Summary reports, aggregate functions, etc.
Example Data Warehouse • [Figure: users query the warehouse, which is built from three source databases] • Data Warehouse: data correlating Employees with Medical Benefits and Projects; could be any DBMS, usually based on the relational data model • Sources: Oracle DBMS for Employees, Sybase DBMS for Projects, Informix DBMS for Medical
Some Data Warehousing Technologies • Heterogeneous Database Integration • Statistical Databases • Data Modeling • Metadata • Access Methods and Indexing • Language Interface • Database Administration • Parallel Database Management
Data Warehouse Design • Appropriate Data Model is key to designing the Warehouse • Higher Level Model in stages • Stage 1: Corporate data model • Stage 2: Enterprise data model • Stage 3: Warehouse data model • Middle-level data model • Possibly one model for each subject area in the higher level model • Physical data model • Include features such as keys in the middle-level model • Need to determine appropriate levels of granularity of data in order to build a good data warehouse
Distributing the Data Warehouse • Issues similar to distributed database systems • [Figure, two panels. Non-distributed Warehouse: Branch A, Branch B, Central Bank, Central Warehouse. Distributed Warehouse: Branch A with Branch A Warehouse, Branch B with Branch B Warehouse, Central Bank with Central Warehouse]
Indexing for Data Warehousing • Bit-Maps • Multi-level indexing • Storing parts or all of the index files in main memory • Dynamic indexing
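The bit-map bullet above can be illustrated with a small sketch. It assumes a toy in-memory table and uses Python integers as bit vectors; the table contents and the helper names `build_bitmap_index` and `matching_positions` are hypothetical and are not taken from the lecture.

```python
from collections import defaultdict


def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bit vector.

    Bit i is set when row i holds that value.
    """
    index = defaultdict(int)
    for position, row in enumerate(rows):
        index[row[column]] |= 1 << position
    return dict(index)


def matching_positions(bitmap):
    """Return the row positions whose bits are set in `bitmap`."""
    positions = []
    position = 0
    while bitmap:
        if bitmap & 1:
            positions.append(position)
        bitmap >>= 1
        position += 1
    return positions


# Hypothetical warehouse rows, for illustration only.
employees = [
    {"name": "Ann", "dept": "Sales", "plan": "Gold"},
    {"name": "Bob", "dept": "IT", "plan": "Silver"},
    {"name": "Cal", "dept": "Sales", "plan": "Silver"},
    {"name": "Dee", "dept": "IT", "plan": "Gold"},
]

dept_index = build_bitmap_index(employees, "dept")
plan_index = build_bitmap_index(employees, "plan")

# The predicate dept = 'Sales' AND plan = 'Silver' becomes one bitwise AND.
hits = matching_positions(dept_index["Sales"] & plan_index["Silver"])
sales_silver = [employees[i]["name"] for i in hits]
```

Bit-maps tend to suit low-cardinality columns such as the ones here, because each distinct value needs its own bit vector.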
Data Warehousing and Security • Security for integrating the heterogeneous data sources into the repository • e.g., Heterogeneous Database System Security, Statistical Database Security • Security for maintaining the warehouse • Query, Updates, Auditing, Administration, Metadata • Multilevel Security • Multilevel Data Models, Trusted Components
Security for Integrating Heterogeneous Data Sources • Integrating multiple security policies into a single policy for the warehouse • Apply techniques for federated database security? • Need to transform the access control rules • Security impact on schema integration and metadata • Maintaining transformations and mappings • Statistical database security • Inference and aggregation • e.g., Average salary in the warehouse could be unclassified while the individual salaries in the databases could be classified • Administration and auditing
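The inference and aggregation bullet can be made concrete with a minimal sketch. It assumes a hypothetical warehouse that releases only averages and enforces a minimum query-set size; the salary figures and the `MIN_QUERY_SET_SIZE` threshold are invented for illustration. Two permitted aggregate queries are still enough to recover one classified salary.

```python
# Hypothetical threshold: averages over fewer rows are refused.
MIN_QUERY_SET_SIZE = 3

# Hypothetical classified values held in the source databases.
salaries = {"Ann": 50000, "Bob": 60000, "Cal": 70000, "Dee": 80000}


def average_salary(names):
    """Release an average only when the query set is large enough."""
    selected = [salaries[name] for name in names]
    if len(selected) < MIN_QUERY_SET_SIZE:
        raise PermissionError("query set too small")
    return sum(selected) / len(selected)


# Both queries pass the size check, so both averages are released.
avg_all = average_salary(["Ann", "Bob", "Cal", "Dee"])
avg_without_dee = average_salary(["Ann", "Bob", "Cal"])

# Differencing the two released averages reveals an individual salary.
inferred_dee = avg_all * 4 - avg_without_dee * 3
```

A size restriction on its own does not stop this differencing attack, which is one reason inference control remains a separate problem from access control.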
Security Policy for the Warehouse • [Figure: layered policy hierarchy, bottom to top] • Component Policies for Components A, B and C • Generic Policies for Components A, B and C • Export Policies for Components A, B, B and C • Federated Policies for Federations F1 and F2 • Security Policy Integration and Transformation • Federated policies become warehouse policies?
Multi-Tier Architecture • [Figure: tiered stack, each layer builds on the previous layer; Schemas/Metadata/Policies at each tier] • Tier 1: Secure Data Sources • Tier 2: Builds on Tier 1 • ... • Tier N: Secure Data Warehouse, builds on Tier N-1
Administration • Roles of Database Administrators, Warehouse Administrators, Database System Security Officers, and Warehouse System Security Officers? • When databases are updated, can a trigger mechanism be used to automatically update the warehouse? • i.e., Will the individual database administrators permit such a mechanism?
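The trigger question above can be sketched with SQLite from the Python standard library. This is a simplification under a strong assumption: the source table and the warehouse summary table live in the same database, whereas the lecture's scenario has separate DBMSs under separate administrators. The table and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    dept TEXT,
    salary REAL
);

-- Hypothetical warehouse table holding per-department aggregates.
CREATE TABLE warehouse_dept_summary (
    dept TEXT PRIMARY KEY,
    headcount INTEGER,
    total_salary REAL
);

-- Each insert into the source table refreshes the warehouse summary.
CREATE TRIGGER refresh_summary AFTER INSERT ON employees
BEGIN
    INSERT OR IGNORE INTO warehouse_dept_summary VALUES (NEW.dept, 0, 0);
    UPDATE warehouse_dept_summary
       SET headcount = headcount + 1,
           total_salary = total_salary + NEW.salary
     WHERE dept = NEW.dept;
END;
""")

conn.executemany(
    "INSERT INTO employees (dept, salary) VALUES (?, ?)",
    [("Sales", 50000), ("Sales", 70000), ("IT", 60000)],
)

summary = conn.execute(
    "SELECT dept, headcount, total_salary "
    "FROM warehouse_dept_summary ORDER BY dept"
).fetchall()
```

The trigger runs with the privileges of the source database, which is why the slide asks whether the individual database administrators would permit it.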
Auditing • Should the Warehouse be audited? • Advantages • Keep up-to-date information on access to the warehouse • Disadvantages • May need to keep unnecessary data in the warehouse • May need a lower level granularity of data • May cause changes to the timing of data entry to the warehouse as well as backup and recovery restrictions • Need to determine the relationships between auditing the warehouse and auditing the databases
Multilevel Security • Multilevel data models • Extensions to the data warehouse model to support classification levels • Trusted Components • How much of the warehouse should be trusted? • Should the transformations be trusted? • Covert channels, inference problem
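One way to picture a data model extended with classification levels is a row-level label checked against the user's clearance. The sketch below assumes a simple "read at or below your clearance" rule; the level names, the rows, and the `visible_rows` helper are hypothetical and do not describe any particular multilevel product.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical ordered classification levels."""
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3


# Hypothetical warehouse rows, each carrying its own classification.
warehouse_rows = [
    {"fact": "average salary", "value": 65000, "level": Level.UNCLASSIFIED},
    {"fact": "project budget", "value": 900000, "level": Level.CONFIDENTIAL},
    {"fact": "salary of Dee", "value": 80000, "level": Level.SECRET},
]


def visible_rows(rows, clearance):
    """Return only the rows labeled at or below the user's clearance."""
    return [row for row in rows if row["level"] <= clearance]


unclassified_view = [r["fact"] for r in visible_rows(warehouse_rows, Level.UNCLASSIFIED)]
secret_view = [r["fact"] for r in visible_rows(warehouse_rows, Level.SECRET)]
```

Filtering by label addresses direct reads only; the covert channel and inference concerns on the slide are not handled by this check.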
Status and Directions • Commercial data warehouse vendors are incorporating role-based security (e.g., Oracle) • Many topics need further investigation • Building a secure data warehouse • Policy integration • Secure data model • Inference control
Data Mining Needs for Counterterrorism: Non-real-time Data Mining • Gather data from multiple sources • Information on terrorist attacks: who, what, where, when, how • Personal and business data: place of birth, ethnic origin, religion, education, work history, finances, criminal record, relatives, friends and associates, travel history, . . . • Unstructured data: newspaper articles, video clips, speeches, emails, phone records, . . . • Integrate the data, build warehouses and federations • Develop profiles of terrorists, activities/threats • Mine the data to extract patterns of potential terrorists and predict future activities and targets • Find the “needle in the haystack” - suspicious needles? • Data integrity is important • Techniques have to SCALE
Data Mining for Non Real-time Threats • [Figure: process flow with the following boxes] • Data sources with information about terrorists and terrorist activities • Integrate data sources • Clean/modify data sources • Build profiles of terrorists and activities • Mine the data • Examine results/prune results • Report final results
Data Mining Needs for Counterterrorism: Real-time Data Mining • Nature of data • Data arriving from sensors and other devices • Continuous data streams • Breaking news, video releases, satellite images • Some critical data may also reside in caches • Rapidly sift through the data, discarding irrelevant data and setting aside the rest for later use and analysis (non-real-time data mining) • Data mining techniques need to meet timing constraints • Quality of service (QoS) tradeoffs among timeliness, precision and accuracy • Presentation of results, visualization, real-time alerts and triggers
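The real-time alerts bullet can be illustrated with a sliding-window counter over a stream of timestamped events. This is a minimal sketch, not a technique named in the lecture: the class name `SlidingWindowAlert`, the window length, the threshold, and the event times are all assumptions chosen for illustration.

```python
from collections import deque


class SlidingWindowAlert:
    """Raise an alert when too many events fall inside a time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.timestamps = deque()

    def observe(self, timestamp):
        """Record one event; return True if the window holds >= threshold events."""
        self.timestamps.append(timestamp)
        # Drop events that have aged out of the window. The event just
        # appended always stays, so the deque is never empty here.
        while self.timestamps[0] <= timestamp - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.threshold


monitor = SlidingWindowAlert(window_seconds=10, threshold=3)

# Hypothetical arrival times in seconds, assumed to be non-decreasing.
event_times = [0, 4, 20, 21, 22, 40]
alerts = [t for t in event_times if monitor.observe(t)]
```

Each observation does a bounded amount of work and old events are discarded as they expire, which matches the slide's point that stream techniques must meet timing constraints rather than store everything.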
Data Mining for Real-time Threats • [Figure: process flow with the following boxes] • Data sources with information about terrorists and terrorist activities • Integrate data sources in real-time • Rapidly sift through data and discard irrelevant data • Build real-time models • Mine the data • Examine results in real-time • Report final results
Example Success Story - COPLINK • COPLINK developed at University of Arizona • Research transferred to an operational system currently in use by Law Enforcement Agencies • What does COPLINK do? • Provides integrated system for law enforcement; integrating law enforcement databases • If a crime occurs in one state, this information is linked to similar cases in other states • It has been stated that the sniper shooting case may have been solved earlier if COPLINK had been operational at that time
Where are we now? • We have some tools for • building data warehouses from structured data • integrating structured heterogeneous databases • mining structured data • forming some links and associations • information retrieval tools • image processing and analysis • pattern recognition • video information processing • visualizing data • managing metadata
What are our challenges? • Do the tools scale for large heterogeneous databases and petabyte-sized databases? • Building models in real-time; need training data • Extracting metadata from unstructured data • Mining unstructured data • Extracting useful patterns from knowledge-directed data mining • Rapidly forming links and associations; get the big picture for real-time data mining • Detecting/preventing cyber attacks • Mining the web • Evaluating data mining algorithms • Conducting risk analysis / economic impact • Building testbeds
IN SUMMARY: • Data Mining is very useful to solve Security Problems • Data mining tools could be used to examine audit data and flag abnormal behavior • Much recent work in Intrusion detection (unit #18) • e.g., Neural networks to detect abnormal patterns • Tools are being examined to determine abnormal patterns for national security • Classification techniques, Link analysis • Fraud detection • Credit cards, calling cards, identity theft, etc. • BUT CONCERNS FOR PRIVACY
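Examining audit data to flag abnormal behavior can be sketched with a simple robust statistic. The example below deliberately swaps in a median-based outlier rule (a modified z-score using the median absolute deviation) in place of the neural networks mentioned on the slide; the audit counts, the 3.5 cutoff, and the function name `flag_abnormal_users` are assumptions for illustration.

```python
from statistics import median

# Hypothetical audit data: warehouse queries issued per user in one day.
queries_per_user = {"ann": 12, "bob": 15, "cal": 11, "dee": 14, "eve": 95}


def flag_abnormal_users(counts, cutoff=3.5):
    """Flag users whose count is far from the median of all users.

    Uses the modified z-score 0.6745 * (x - median) / MAD, where MAD is
    the median absolute deviation. The cutoff is an assumed convention.
    """
    center = median(counts.values())
    mad = median(abs(value - center) for value in counts.values())
    if mad == 0:
        # All users behave almost identically; nothing can be ranked.
        return []
    return sorted(
        user
        for user, value in counts.items()
        if abs(0.6745 * (value - center) / mad) > cutoff
    )


flagged = flag_abnormal_users(queries_per_user)
```

A flagged user is only a candidate for review, and keeping per-user audit counts is itself the kind of data collection behind the privacy concern on the slide.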