CS597A: Managing and Exploring Large Datasets. Kai Li.
About This Seminar
• Goal:
  • Identify research directions and issues in managing and exploring large datasets
• Plan:
  • Overview of a few state-of-the-art storage systems
  • Reading some papers on a few research systems in storage systems, data management and data exploration
  • Discussions on wild ideas
  • Define, work on, and present course projects
Why Is This Area Interesting? (Where Are the Bottlenecks?)
[Diagram labels: Create, Transform, Transmit, Store and Retrieve, Network]
Computer Food Chains
• Computer systems in 1980s: Supercomputer (Cray, etc.), Mini-super (Convex, etc.), Mainframe (IBM 370), Minicomputer (VAX), WS (SUN), PC
• Computer systems in 1990s and 2000s: Supercomputer (Cray, etc.), Servers (IBM, SUN), PC, Laptop, PDA
Storage Arrays of Food Chains?
• Direct Attached Storage (DAS)
  • USB, Microdrive, Flash
  • ATA disks
  • “Super” SCSI RAID
  • ATA RAID
• Storage Area Network (SAN)
  • “Super” SAN storage (EMC, Hitachi, IBM)
  • “MiniSuper” SAN storage (HPQ, Startups)
  • iSCSI (Startups)
• Network Attached Storage (NAS)
  • PC storage (Dell, Snap!, MSFT SAK boxes)
  • “Super” NAS (NetApp, SUN)
  • “MiniSuper” NAS (Startups)
Typical General Infrastructures
[Diagram 1 labels: Clients, Network, File servers without disks, Storage Area Network, Mirrored storage (e.g. EMC), BCV or 3rd copy (e.g. EMC), Backup tape library]
[Diagram 2 labels: Clients, Network, File servers with disks, Storage Area Network, Backup tape library]
Exponential Growth (Courtesy Jim Gray, Turing Lecture 99)
• Performance/price doubles every 18 months
• 100x per decade
• Progress in next 18 months = ALL previous progress
  • New storage = sum of all old storage (ever)
  • New processing = sum of all old processing
[Chart label: 15 years ago]
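The arithmetic behind these claims can be checked with a short sketch. It assumes an idealized model in which price/performance doubles exactly once per 18-month period; the function names are illustrative and not from the lecture.

```python
# Assumption: price/performance doubles exactly once every 18 months.
DOUBLING_PERIOD_MONTHS = 18


def growth_factor(months: float) -> float:
    """Improvement factor after `months`, given one doubling per period."""
    return 2 ** (months / DOUBLING_PERIOD_MONTHS)


def capacity_per_period(periods: int) -> list[int]:
    """Relative amount of new capacity added in each doubling period."""
    return [2 ** k for k in range(periods)]


# A decade is 120 months, or about 6.67 doublings: roughly 100x.
decade = growth_factor(120)

# Ten past periods add 1, 2, 4, ..., 512 units; the next one adds 1024.
history = capacity_per_period(10)
all_previous = sum(history)
next_period = 2 ** len(history)

print(f"growth per decade: {decade:.1f}x")
print(f"all previous periods: {all_previous}, next period: {next_period}")
```

Under this model the next period always adds one unit more than every earlier period combined, because 1 + 2 + ... + 2^(n-1) = 2^n - 1. That is the sense in which the next 18 months match all previous progress.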
Raw Storage Is Cheap
• Disk drives beat tapes in $/TB in 2002 (IDC)
  • Disk $/TB declines 50%/year
  • Tape $/TB declines 29%/year
• But ATA arrays beat tape libraries in $/TB in 2006 (Gartner)
  • Disk system $/TB declines 40%/year
  • Tape library $/TB declines 29%/year
[Chart: $/TB, 2002 and 2006. Source: Gartner and IDC]
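The decline rates above determine how quickly a price gap closes. The sketch below is a minimal illustration that assumes constant annual decline rates; the starting price ratio of 4 is hypothetical and does not come from the IDC or Gartner figures.

```python
import math


def years_to_parity(price_ratio: float, fast_decline: float, slow_decline: float) -> float:
    """Years until a costlier, faster-declining medium matches a cheaper one.

    price_ratio:  current (expensive $/TB) / (cheap $/TB), must be >= 1
    fast_decline: fractional annual decline of the expensive medium (0.50 = 50%/year)
    slow_decline: fractional annual decline of the cheap medium
    """
    if price_ratio < 1:
        raise ValueError("price_ratio must be >= 1")
    if not 0 <= slow_decline < fast_decline < 1:
        raise ValueError("need 0 <= slow_decline < fast_decline < 1")
    # Each year the price ratio shrinks by this factor.
    yearly_gain = (1 - slow_decline) / (1 - fast_decline)
    return math.log(price_ratio) / math.log(yearly_gain)


HYPOTHETICAL_RATIO = 4.0  # assumed starting gap, for illustration only

raw_years = years_to_parity(HYPOTHETICAL_RATIO, 0.50, 0.29)     # raw disk vs tape
system_years = years_to_parity(HYPOTHETICAL_RATIO, 0.40, 0.29)  # disk system vs tape library

print(f"raw media reach parity in about {raw_years:.1f} years")
print(f"storage systems reach parity in about {system_years:.1f} years")
```

For the same starting gap, storage systems take about twice as long as raw media to reach parity, because a 40%/year decline gains much less on tape each year than a 50%/year decline does.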
Summary of Storage Trends
• Disk density beats Moore’s Law
• Data growth rate follows Moore’s Law
• Raw disks are cheap while storage systems are very expensive
• Crossover from tapes to disks
How Much Information Is There? (Courtesy Jim Gray, Turing Lecture 99)
• Soon everything can be recorded and indexed
• Most data will never be seen by humans
• Precious resource: human attention
• Auto-summarization and auto-search are key technologies
• www.lesk.com/mlesk/ksg97/ksg.html
[Chart scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta. Labeled points: A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, Everything Recorded]
[Small prefixes: milli (10^-3), micro (10^-6), nano (10^-9), pico (10^-12), femto (10^-15), atto (10^-18), zepto (10^-21), yocto (10^-24)]
How Much Information Is There? (Hal Varian, Peter Lyman et al. 2001)
• Web has a lot of documents
  • “Surface” web had 2.5B docs, adding 7.5M pages/day
  • “Deep” web had 550B docs, 95% publicly accessible
• Most websites are in English
  • 78% of all websites and 96% of e-commerce sites
• E-mail generates a large amount of information
  • A “white-collar” worker receives ~40 messages/day
  • E-mail information produced each year is 500x that of the web
Challenges in Managing and Exploring Datasets
• Disk’s behavior is like a big tape
  • Storage is indeed “infinitely” large
  • Ability to get information is slow
• Reliability is far from what we need
  • Disks do fail
  • Software and humans corrupt data
• Managing storage is difficult
  • Storage and data are both growing
• Retrieving data is difficult
  • Get what you want
  • See what you get
Properties of a Research Goal (Jim Gray, 1999)
• Simple to state
• Not obvious how to do it
• Clear benefit
• Progress and solution are testable
• Can be broken into smaller steps
  • So that you can see intermediate progress
Systems Challenges (Lampson, SOSP Keynote 99)
• Systems that work
  • Meeting their specs
  • Always available
  • Adapting to changing environment
  • Evolving while they run
  • Made from unreliable components
  • Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
  • Understanding when it doesn’t matter
What Should the “New World” Focus Be? (Hennessy, FCRC keynote 99)
• Availability
  • Both appliance & service
• Maintainability
  • Two functions:
    • Enhancing availability by preventing failure
    • Ease of SW and HW upgrades
• Scalability
  • Especially of service
• Cost
  • Per device and per service transaction
• Performance
  • Remains important, but it’s not SPECint
Tentative Syllabus
• Today: About the Course
• Week 2: Read several vision papers
• Week 3: Guest lecture on archival storage
• Week 4: Commercial storage systems (EMC, Veritas, NetApp)
• Week 5: Global-scale storage (OceanStore and the like)
• Week 6: Managing personal data (Coda, Bayou, Personal RAID)
• Week 7: Managing geographical data (TerraServer)
• Week 8: Guest lecture on managing astrophysical data (SkyServer)
• Week 9: Managing and exploring large scientific data
• Week 10: Managing medical data
• Week 11: Managing genomic data
• Week 12: Project reports and presentations
• A detailed, tentative reading list will be available this weekend