CS C446 Data Storage Technologies & Networks Agenda • Course Motivation • Storage Requirements (Models) • Large Data Cases • Data Explosion • Data Characteristics & Storage Characteristics
Course Focus Storage Requirements • From a (logical) computing perspective: • Transitory data • To be stored for the period of computation. • Persistent, Isolated data • To be stored across computations but usable only by (or through) a single computer • Persistent, Shared data • To be stored across computations and used by (or through) multiple computers • Persistent, Exportable data • To be stored beyond computations and possibly used by external (non-computing) systems • Question: Is sharing of transitory data meaningful? Sundar B.
Storage Requirements [2] • Persistent Isolated Data • Data is accessible to (or accessible through) a single computer and persistent across computations • (Input/Output on the) Storage is controlled by the computer • Types of (logical) data accesses: • Large streams – either text or binary (e.g. program code, multimedia) • Transactional units – records • Applications need not be aware of physical details of storage • Operating System provides a logical layer – File System • Special purpose logical layers are possible – Database System
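The two logical access types on this slide can be illustrated with a minimal sketch that goes through the file system layer only. The record layout (`RECORD`), the helper names, and the sample rows are hypothetical choices made for illustration; they do not come from the lecture.

```python
import os
import struct
import tempfile

# Hypothetical fixed-size record: 4-byte unsigned id + 16-byte padded name.
RECORD = struct.Struct("<I16s")


def write_records(path, rows):
    """Write (id, name) rows to a file as fixed-size binary records."""
    with open(path, "wb") as f:
        for rec_id, name in rows:
            f.write(RECORD.pack(rec_id, name.encode("ascii")))


def read_record(path, index):
    """Transactional-style access: seek straight to one record and read it."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        rec_id, raw = RECORD.unpack(f.read(RECORD.size))
        return rec_id, raw.rstrip(b"\x00").decode("ascii")


def stream_size(path, chunk_size=8):
    """Stream-style access: consume the whole file sequentially in chunks."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total


def demo():
    """Run both access styles against the same temporary file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "records.bin")
        write_records(path, [(1, "alice"), (2, "bob"), (3, "carol")])
        return read_record(path, 1), stream_size(path)
```

Neither function knows where the bytes physically live; the operating system's file system supplies that logical layer, which is the point the slide makes.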
Network Attached Storage vs. Storage Area Networks Storage Requirements [3] • Persistent Shared Data • Data is accessible to (or accessible through) multiple computers and persistent across computations • Storage is shared by multiple computers, i.e. available on a network • Question: Is the network the same as the “network of computers”? • Question: Is the network “transparent” to the computers?
Storage Requirements [4] • Network Attached Storage • Shared storage is on the same network as the computers • Computers are aware of the network and of the fact that storage is attached over the network • Translation: Data are accessed as files or logical units • Storage Area Network • Shared storage is on a different network from the computer network, but these networks are connected • Computers are not aware of this (storage) network • Translation: Data are accessed raw (as from direct storage)
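The difference between file-level (NAS) and raw (SAN) access can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the class names `BlockDevice` and `FileServer`, the 512-byte block size, and the one-file-per-block restriction. It is not a model of any real NAS or SAN protocol.

```python
BLOCK_SIZE = 512  # assumed block size for this toy model


class BlockDevice:
    """SAN-style view: storage is an array of raw, numbered blocks."""

    def __init__(self, num_blocks):
        self.data = bytearray(num_blocks * BLOCK_SIZE)

    def write_block(self, lba, payload):
        """Overwrite one block, zero-padding the payload to full size."""
        if len(payload) > BLOCK_SIZE:
            raise ValueError("payload larger than one block")
        start = lba * BLOCK_SIZE
        self.data[start:start + BLOCK_SIZE] = payload.ljust(BLOCK_SIZE, b"\x00")

    def read_block(self, lba):
        """Return the raw contents of one block."""
        start = lba * BLOCK_SIZE
        return bytes(self.data[start:start + BLOCK_SIZE])


class FileServer:
    """NAS-style view: clients name files; the server owns the block layout."""

    def __init__(self, device):
        self.device = device
        self.table = {}     # file name -> (block number, length)
        self.next_free = 0  # next unused block on the device

    def write_file(self, name, content):
        # Toy restriction: each file must fit in a single block.
        self.device.write_block(self.next_free, content)
        self.table[name] = (self.next_free, len(content))
        self.next_free += 1

    def read_file(self, name):
        lba, length = self.table[name]
        return self.device.read_block(lba)[:length]
```

A client of `FileServer` asks for `"xray.img"` by name and never sees block numbers, while a client of `BlockDevice` reads block 0 directly and must bring its own file system to interpret the bytes.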
Large Data – Cases • Case 1: Genome Database • http://csis/faculty/sundarb/courses/dstn/lectures/lec2-cases/genome.txt • Case 2: Mass General – X-Rays • http://csis/faculty/sundarb/courses/dstn/lectures/lec2-cases/hospital.txt • Case 3: Google’s replica of the web • http://csis/faculty/sundarb/courses/dstn/lectures/lec2-cases/google.txt
Data Explosion • Consider Case 2 (Mass General): • Mass General is moving to 3D images • (in technicolor?) • Increasing resolution • Consider Case 3 (Google): • Number of websites/pages is ever increasing • Google Library (Books) • Google Earth (Geographic/Cartographic images) • More generally, • Businesses are collecting more information, 24x365 • New (automated) technologies for data collection (e.g. RFID)
Data Explosion [2] • Examples of data collection*: • Birth certificates by hospitals in the U.S. • 1983: 280 bytes • 1996: 1864 bytes • Grocery store purchase entry • 1983: 32 bytes • 1996: 1272 bytes • *Reference: • L. Sweeney, Information Explosion. In: Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, L. Zayatz, P. Doyle, J. Theeuwes and J. Lane (eds), Urban Institute, Washington, DC, 2001.
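The figures on this slide can be turned into growth factors with a few lines of arithmetic. The annual rate below assumes smooth compound growth between 1983 and 1996, which is an assumption of this sketch and not a claim made by the reference; the function names are hypothetical.

```python
def growth_factor(old_bytes, new_bytes):
    """How many times larger the new record is than the old one."""
    return new_bytes / old_bytes


def annual_growth_rate(old_bytes, new_bytes, years):
    """Compound annual growth rate, assuming smooth exponential growth."""
    return (new_bytes / old_bytes) ** (1 / years) - 1


YEARS = 1996 - 1983  # 13 years between the two measurements

# Birth certificate record: 280 bytes (1983) -> 1864 bytes (1996)
birth_factor = growth_factor(280, 1864)
birth_rate = annual_growth_rate(280, 1864, YEARS)

# Grocery store purchase entry: 32 bytes (1983) -> 1272 bytes (1996)
grocery_factor = growth_factor(32, 1272)
grocery_rate = annual_growth_rate(32, 1272, YEARS)

print(f"birth certificate: x{birth_factor:.2f}, {birth_rate:.1%} per year")
print(f"grocery entry:     x{grocery_factor:.2f}, {grocery_rate:.1%} per year")
```

Under that assumption, the birth certificate record grows by a factor between 6 and 7 and the grocery entry by a factor of about 40 over the same 13 years, which is the per-record side of the data explosion, separate from any growth in the number of records.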
Data Explosion [3] • Data quality (resolution) • Refer to example data collections • Data Availability and Access • Copies/Replicas for simultaneous access or low latency access • All major websites are “Akamaized” • Copies/Replicas for fault tolerance / disaster recovery • Companies were up and running within a few hours after the WTC collapse • Regulatory Compliance • HIPAA (U.S. Govt.) requires 7 years of data to be stored by hospitals • Sarbanes-Oxley (U.S. Govt.) requires documentation on corporate governance • i.e. all corporate decisions / deliberations must be recorded
Data Explosion [4] • High volume requirements drive the market • Low cost storage • Leads to increased access to storage • Personal / Organizational storage is more affordable • Personal multimedia content (mp3 songs, digicam/mobile pictures/videos) is increasing • All course contents are online and growing; every technology has its website (dhcp.com, snia.org, …) • Mass storage services are feasible • Gmail gives 2+x GB, where x is monotonically increasing over time • Already > 1.2 billion email users (not all in Gmail but …) and growing • Blog sites, photo-sharing sites, {naukri, monster}.coms, pornographic sites, … • Nothing is ever deleted from websites – local or global! • Data is the new entropy!!!
Data & Storage Characteristics • Data • May be transactional or stream data • But 80% of data is “semi-structured” or “unstructured”: • An X-ray image does not have any structure • A website (in HTML) is semi-structured • Is business critical • Storage • Must be highly available • With redundancy/replication and across non-local networks • Must provide high data rates • Must support both streaming and transactional access!