Data and Databases
The Data Basics • Data • Facts concerning things such as people, objects, or events • Information • data that have been processed and presented in a form suitable for human interpretation • Database • a collection of interrelated, shared, and controlled data
Modern Database Systems Accounting Application Programs Accounting Integrated Database Finance Finance Application Programs DBMS Sales Sales Application Programs
Advantages of Modern Database Environments • Minimal data redundancy • Data consistency • Integration of data • Data sharing • Ease of application development • Security, privacy, and integrity controls • Data accessibility and responsiveness • Data independence • Reduced program maintenance
Components of the Modern Database Environment Data Administrators System Developers End-users Application Programs User Interface CASE Tools DBMS Repository Database
What is it? • Historically, the traditional database system was highly decentralized • Modern integrated databases brought back the concept of centralization • The distributed database concept is now pushing back toward decentralized data • Within the next 10 years, integrated databases may be an “antique curiosity”
Definitions • Database • a collection of interrelated, shared, and controlled data • Distributed database • a logically interrelated collection of shared and controlled data distributed over a computer network • Distributed DBMS (DDBMS) • software that manages a distributed database and makes that distribution transparent to the user
DDBMS • a single database split into several fragments • each fragment stored on a separate computer • each fragment controlled by a separate DBMS • each computer part of a single network • Users use applications which access the data • only local data ---> local application • data located elsewhere ---> global application • A DDBMS contains at least one global application
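The local/global distinction above can be illustrated with a minimal sketch. The site names, the fragment contents, and the functions `sites_needed`, `classify_application`, and `fetch` are all hypothetical; a real DDBMS does far more (query optimization, concurrency control, recovery).

```python
# Hypothetical fragments of one logical "customers" table, one fragment per site.
FRAGMENTS = {
    "site_a": {1: "Alice", 2: "Bob"},
    "site_b": {3: "Chandra"},
    "site_c": {4: "Dmitri", 5: "Eve"},
}


def sites_needed(keys):
    """Return the set of sites that hold at least one of the requested keys."""
    return {site for site, rows in FRAGMENTS.items() if any(k in rows for k in keys)}


def classify_application(home_site, keys):
    """'local' if every requested row lives at the home site, otherwise 'global'."""
    return "local" if sites_needed(keys) <= {home_site} else "global"


def fetch(keys):
    """Distribution transparency: the caller asks for keys and never names a site."""
    merged = {}
    for rows in FRAGMENTS.values():
        merged.update({k: v for k, v in rows.items() if k in keys})
    return merged
```

Here `classify_application("site_a", [1, 3])` is a global application because key 3 lives at another site, while `fetch([2, 4])` returns rows from two sites without the caller knowing where they are stored.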
A Sample topology Database 3 Database 1 Network Database 2
Homogeneous vs. Heterogeneous • Homogeneous • easier to design • provides for incremental growth (adding new sites is easy) • Heterogeneous • occurs when integration is considered post facto • translations required to communicate between different DBMS • relational DBMS sites use a gateway
Data Representation • Binary digit (bit) • String of bits (Byte) • EBCDIC vs. ASCII • Picture Element (Pixel)
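As a small illustration of these representations, the sketch below encodes the same character in ASCII and in EBCDIC and prints each byte as eight bits. It assumes Python's built-in `cp037` codec as a representative EBCDIC code page (there are several), and the 24-bit RGB pixel layout is likewise an assumption chosen for illustration.

```python
text = "A"

ascii_bytes = text.encode("ascii")   # ASCII encoding of "A"
ebcdic_bytes = text.encode("cp037")  # cp037: one common EBCDIC code page (assumed here)

# One byte written out as a string of 8 bits.
ascii_bits = format(ascii_bytes[0], "08b")
ebcdic_bits = format(ebcdic_bytes[0], "08b")

# Assumed layout: a 24-bit colour pixel stored as three bytes (red, green, blue).
pixel = bytes([255, 128, 0])
pixel_bit_count = len(pixel) * 8

print(ascii_bits, ebcdic_bits, pixel_bit_count)
```

The same letter has different bit patterns under the two schemes, which is why data moved between ASCII and EBCDIC systems has to be translated.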
How much does your data weigh? • If an 8GB hard disk weighs approximately one pound …include the weight of shared enclosure, power supply, and electronics … • 8 TB would theoretically weigh one ton! • Some companies • Aetna: 21.8 tons (174.6 TB across 4100 DASD) • Boeing: 50-150 TB (6 – 19 tons) • Atos Origin: 37.5 tons (300 TB) • Source: Computerworld, April 23, 2001
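The company figures above follow from the slide's ratio of 8 TB per ton. A minimal sketch of that arithmetic (the constant and function names are made up for illustration):

```python
TB_PER_TON = 8  # ratio stated on the slide: 8 TB weighs about one ton


def tons(terabytes):
    """Convert a storage volume in TB to tons using the slide's ratio."""
    return terabytes / TB_PER_TON


aetna = tons(174.6)
atos_origin = tons(300)
boeing_low, boeing_high = tons(50), tons(150)

print(round(aetna, 1), atos_origin, boeing_low, boeing_high)
```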
Data Storage • In the Web era, data is piling up quickly; space is at a premium • Storage solutions • Server-hosted storage • SCSI Arrays • Network Attached Storage (NAS) • Storage Area Networks (SAN)
Server-hosted storage • Both applications and storage on same server • Advantage • Server, OS, and storage all from the same vendor • Easy to replicate • Disadvantages • Expansion limited by server architecture (may need to replace existing media) • Free space on one server not easily accessed by another server • Maintenance affects server and storage (CPUs become obsolete before storage)
SCSI Arrays: Small Computer System Interface (“scuzzy”) • In a survey by InfoWorld, 67% were using SCSI arrays for storage • Often used with RAID • Advantages • Embedded computer to manage configuration and monitor performance • Can be made fault-tolerant • SCSI cable offers good throughput • Disadvantages • Expansion difficult once space is used • Significant costs of layout (SCSI cable limited in distance)
Network Attached Storage (NAS) • Devices that can be plugged into a LAN using standard network cables • Advantages • Easiest and cheapest • Pre-configured with OS tailored for data handling • Can be a few GB to several TB • Easy to connect • Faulty components can be changed without downtime • Disadvantages • Adds burden to LAN traffic • Access speed limited by bandwidth • Each NAS device has to be managed independently
Storage Area Networks (SAN) • Dedicated network of servers and storage devices • Uses hubs and switches • No limit to number of storage servers • Uses fiber – can extend long distances; good bandwidth (fibre channel) • Easy to set up – needs special adaptors • Works with any OS • Easy migration from old systems
SAN – why so few? • In the InfoWorld survey, only 14% had SAN • Problems cited • Lack of internal knowledge • High cost (can be several million dollars depending on size) • Perception that it is only for large companies • Computerworld projects 70% corporate data on SANs by 2005 (Jan 28, 2002 issue) • Possible solution • Storage Service Provider (SSP) • Manage data for you
The future of storage • Fibre Channel • network technology designed for storage and server clustering • iSCSI • SCSI commands encapsulated in IP packets for transmission over Ethernet networks • Fibre Channel over IP (FCIP) • Tunnels data between geographically dispersed SANs over IP networks • Internet Fibre Channel Protocol (iFCP) • Hybrid version of FCIP that sends FC data over IP networks using iSCSI protocols (to interconnect existing SANs) • InfiniBand • An I/O technology that overcomes problems with traditional PCI buses
Storage Virtualization • Software that links different storage devices into one virtual pool • Can link NAS, SAN, and DASD • Helps with storage management by introducing new layer of abstraction • Excellent for creating sense of homogeneity • Example is SANsymphony by DataCore Software
Data Requirements • Organizations need access to • operational data • historical data • legacy data • subscription databases • internet data • Organizations need to • combine data, slice and dice, do complex analysis...
Analytical Processing Requirements • Database systems need to support at least 4 levels of analysis within the firm • simple queries • “what if” analysis • causation • prognostication
The levels... • Simple queries • using historical and current data • typically done with spreadsheets or SQL • “What-if” • if labor costs increase by 5% next year, and sales are stagnant, what will happen to profits? • spreadsheets and database tools
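The “what-if” question above can be sketched as a tiny model. The baseline figures and the simple profit formula (sales minus labor minus other costs) are hypothetical, chosen only to show the mechanics a spreadsheet would perform.

```python
def profit(sales, labor_cost, other_costs):
    """Toy profit model: an assumption, not a real accounting formula."""
    return sales - labor_cost - other_costs


# Hypothetical baseline figures for the current year.
baseline = {"sales": 1_000_000, "labor_cost": 400_000, "other_costs": 450_000}

# Scenario from the slide: labor costs rise 5%, sales stay flat.
labor_increase_pct = 5
scenario = dict(
    baseline,
    labor_cost=baseline["labor_cost"] * (100 + labor_increase_pct) / 100,
)

base_profit = profit(**baseline)
new_profit = profit(**scenario)
change = new_profit - base_profit

print(base_profit, new_profit, change)
```

Changing one input and recomputing is all a what-if analysis is; the value comes from being able to try many scenarios quickly against the same data.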
Levels... • Causation • step back and analyze the past to see what caused the current state of events • why did cough syrup sales increase in the Northeast in January when they stayed constant elsewhere... influenza? ... competitors go bust? • Prognostication • what current conditions must change to increase profits by 5% next year?
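Prognostication runs the model backwards: fix the target and solve for the input. The sketch below is a minimal goal-seek under the same kind of toy profit model (sales minus total costs); the figures are hypothetical, and costs are assumed to stay constant.

```python
# Hypothetical current-year figures.
sales = 1_000_000
total_costs = 850_000
current_profit = sales - total_costs

# Goal from the slide: profits 5% higher next year.
target_profit = current_profit * 105 / 100

# Assuming costs stay constant, solve profit = sales - costs for sales.
required_sales = target_profit + total_costs
required_growth_pct = (required_sales - sales) * 100 / sales

print(target_profit, required_sales, required_growth_pct)
```

With these made-up numbers a sales increase of under one percent is enough, which shows why the answer depends heavily on the cost structure assumed.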
Data Warehouses • Aimed at supporting all levels of analysis and information formats • Decision support systems (DSS) have existed for many years • Labeled “data warehouse” in the 1990s, when top executives began to pay notice • Many different definitions (some relating to data, others to people or processes)
Simple Definition A data warehouse is a collection of integrated, subject-oriented databases designed to support the decision support function, where each unit of data is relevant to some moment in time.
Four Defining Concepts • Subject-oriented • Integrated • Time-variant • Non-volatile
Concepts.... • Subject-oriented • requires database design • revolves around specific business entities • many companies simply pull together old files • Integrated data • data warehouse database designed using a proper methodology • consistency in naming conventions for keys, relationships etc. • warehouses require large design effort
Data Mining “True genius resides in the capacity for evaluation of uncertain, hazardous, and often conflicting information” - Sir Winston Churchill
What is data mining? • Large databases can be searched for relationships, patterns, and trends which, prior to the search, were not known to exist. • Data mining is the process of asking a processing engine to show answers to questions that we do not know how to ask.
Data Mining techniques • Four major types of processing algorithms (or rules): • associations • clustering • classification • sequential patterns
Associations (Link Analysis) • Find correlations between one set of items or events and another such set • eg: 78% of all people who buy a desktop PC will also buy add-ons • eg: large percentage of buyers will buy potato chips if they are stacked near the beverages aisle...
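A figure like “78% of desktop PC buyers also buy add-ons” is the confidence of an association rule. The sketch below computes support and confidence over a handful of hypothetical baskets; the item names, the `ADD_ONS` set, and the function names are illustrative assumptions.

```python
# Hypothetical market baskets.
transactions = [
    {"desktop_pc", "monitor", "mouse"},
    {"desktop_pc", "printer"},
    {"desktop_pc"},
    {"laptop", "mouse"},
    {"desktop_pc", "monitor"},
]
ADD_ONS = {"monitor", "mouse", "printer"}


def support(baskets, antecedent, consequent_any):
    """Share of all baskets containing the antecedent and at least one consequent item."""
    hits = [b for b in baskets if antecedent in b and b & consequent_any]
    return len(hits) / len(baskets)


def confidence(baskets, antecedent, consequent_any):
    """Share of antecedent baskets that also contain at least one consequent item."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    hits = [b for b in with_antecedent if b & consequent_any]
    return len(hits) / len(with_antecedent)


rule_support = support(transactions, "desktop_pc", ADD_ONS)
rule_confidence = confidence(transactions, "desktop_pc", ADD_ONS)
print(rule_support, rule_confidence)
```

Real association mining (for example the Apriori family of algorithms) searches for such rules automatically rather than testing one rule chosen in advance.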
Clustering • Used to discover hitherto unknown or unsuspected classes of data • Defect analysis or group affinity analysis • E.g.: some particular common characteristic among good customers who cancel their own credit cards
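As an illustration of clustering, here is a minimal one-dimensional k-means sketch. The data (monthly spend of customers who cancelled their cards) and the starting centroids are hypothetical, and k-means is only one of many clustering techniques; the slides do not name a specific algorithm.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small k-means: assign each value to its nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend of customers who cancelled their cards.
monthly_spend = [20, 25, 30, 900, 950, 1000]
centers, groups = kmeans_1d(monthly_spend, centroids=[0, 500])
print(centers, groups)
```

No labels were supplied: the two groups (low spenders and high spenders) emerge from the data, which is the sense in which clustering discovers previously unsuspected classes.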
Classification • Discovers the rules that determine whether an item belongs to a particular subset of data (a subtype) • E.g.: Credit card approval • do a customer's characteristics put him/her in the subset of customers who can charge?
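A minimal sketch of rule discovery for the credit card example: learn a single income threshold from labelled past decisions, then apply it to new applicants. The training data, the use of income as the only characteristic, and the function names are hypothetical; real classifiers combine many characteristics.

```python
def learn_threshold(examples):
    """Pick the income threshold that misclassifies the fewest training examples."""
    best_threshold, best_errors = None, len(examples) + 1
    for threshold in sorted({income for income, _ in examples}):
        errors = sum(
            (income >= threshold) != approved for income, approved in examples
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


# Hypothetical past decisions: (annual income, was the card approved?).
training = [
    (18_000, False),
    (22_000, False),
    (35_000, True),
    (41_000, True),
    (60_000, True),
]
threshold = learn_threshold(training)


def classify(income):
    """Apply the learned rule to a new applicant."""
    return income >= threshold


print(threshold, classify(50_000), classify(20_000))
```

Unlike clustering, classification starts from known subsets (approved or not) and looks for the rule that separates them.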
Sequential Patterns • Mostly used for pattern analysis • uses historical data store of all transactions in a warehouse • Eg: Buyers who purchase window coverings and then buy linens within three months will purchase furniture within the next 12 months (new residence furnishings buying pattern)
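The furnishing pattern above can be checked against a transaction history with a short sketch. The purchase histories, category labels, and function names are hypothetical, and months are counted by calendar month for simplicity.

```python
from datetime import date


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later` (day of month ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def antecedent_date(history):
    """Date of a linens purchase within 3 months after window coverings, else None."""
    history = sorted(history)
    for d1, c1 in history:
        if c1 != "window coverings":
            continue
        for d2, c2 in history:
            if c2 == "linens" and d2 >= d1 and months_between(d1, d2) <= 3:
                return d2
    return None


def rule_holds(history):
    """True/False if the antecedent occurred; None if it never did."""
    start = antecedent_date(history)
    if start is None:
        return None
    return any(
        c == "furniture" and d >= start and months_between(start, d) <= 12
        for d, c in history
    )


# Hypothetical customer histories: lists of (purchase date, category).
customers = {
    "c1": [(date(2001, 1, 10), "window coverings"), (date(2001, 3, 2), "linens"),
           (date(2001, 9, 15), "furniture")],
    "c2": [(date(2001, 2, 1), "window coverings"), (date(2001, 8, 1), "linens")],
    "c3": [(date(2001, 4, 5), "window coverings"), (date(2001, 5, 20), "linens")],
}
results = {name: rule_holds(h) for name, h in customers.items()}
print(results)
```

Counting how often the rule holds among customers who show the antecedent gives the confidence of the sequential pattern, in the same way as for association rules.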