CyberBricks: The Future of Database and Storage Engines
Jim Gray
http://research.Microsoft.com/~Gray
Outline • What storage things are coming from Microsoft? • TerraServer: a 1 TB DB on the Web • Storage Metrics: Kaps, Maps, Gaps, Scans • The future of storage: ActiveDisks
New Storage Software From Microsoft • SQL Server 7.0: • Simplicity: Auto-most-things • Scalability on Win95 to Enterprise • Data warehousing: built-in OLAP, VLDB • NT 5: • Better volume management (from Veritas) • HSM architecture • Intellimirror • Active directory for transparency
Thin Client Support: TSO comes to NT
• "Hydra" server for dedicated Windows terminals, Net PCs, existing desktop PCs, and MS-DOS, UNIX, and Mac clients
• Lower per-client cost
• Huge centralized data stores
Windows NT 5.0Intelli-Mirror™ • Files and settings mirrored on client and server • Great for mobile users • Facilitates roaming • Easy to replace PCs • Optimizes network performance • Means HUGE data stores
Outline • What storage things are coming from Microsoft? • TerraServer: a 1 TB DB on the Web • Storage Metrics: Kaps, Maps, Gaps, Scans • The future of storage: ActiveDisks
Microsoft TerraServer: Scaleup to Big Databases • Build a 1 TB SQL Server database • Data must be • 1 TB • Unencumbered • Interesting to everyone everywhere • And not offensive to anyone anywhere • Loaded • 1.5 M place names from Encarta World Atlas • 3 M Sq Km from USGS (1 meter resolution) • 1 M Sq Km from Russian Space agency (2 m) • On the web (world’s largest atlas) • Sell images with commerce server.
Microsoft TerraServer Background
• Earth is 500 tera-square-meters; 100 Tm² of land lies between 70ºN and 70ºS (USA is 10 Tm²)
• We have pictures of 6% of it: 3 Tm² from USGS, 2 Tm² from the Russian Space Agency
• Compress 5:1 (JPEG) to 1.5 TB
• Slice into 10 KB chunks; store the chunks in the DB
• Navigate with Encarta™ Atlas: globe, gazetteer, StreetsPlus™ in the USA
• Image sizes: 1.8x1.2 km² tile, 10x15 km² thumbnail, 20x30 km² browse image, 40x60 km² jump image
• Someday: multi-spectral image of everywhere, once a day / hour
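A quick sizing sketch using nothing beyond the slide's own figures: 1.5 TB of compressed imagery sliced into 10 KB chunks gives the number of blob rows the database must hold.

```python
# Back-of-envelope from the slide: 1.5 TB of JPEG tiles stored as 10 KB chunks.
TB, KB = 1e12, 1e3

db_bytes = 1.5 * TB
chunk_bytes = 10 * KB
print(f"{db_bytes / chunk_bytes:,.0f} chunks")  # 150,000,000 rows of ~10 KB blobs
```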
USGS Digital Ortho Quads (DOQ)
• US Geological Survey imagery: 4 terabytes, 1x1 meter resolution, continental US
• Most data not yet published; new data coming
• Based on a CRADA; Microsoft TerraServer makes the data available
Russian Space Agency (SovInformSputnik) SPIN-2 (Aerial Images is worldwide distributor)
• 1.5 meter geo-rectified imagery of (almost) anywhere
• Almost equal-area projection
• De-classified satellite photos (from 200 km)
• More data coming (1 m)
• Selling imagery on the Internet
• Putting 2 Tm² onto Microsoft TerraServer
Demo
http://www.TerraServer.Microsoft.com/
Demo
• Navigate by coverage map to the White House; download the image; buy imagery from USGS
• Navigate by name to Venice; buy the SPIN-2 image & Kodak photo
• Pop out to the Expedia street map of Venice
• Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2)
The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400: 8x400 MHz Alpha CPUs, 10 GB DRAM
• 324 9.2 GB StorageWorks disks: 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (~14 TB)
• Windows NT 4 EE, SQL Server 7.0
Software
[Architecture diagram: web clients (HTML browser or Java viewer) reach the TerraServer web site over the Internet; Internet Information Server 4.0 runs Active Server Pages and Microsoft Site Server EE, calling the Microsoft Automap ActiveX Server and, via MTS, the TerraServer stored procedures in SQL Server 7, which holds the TerraServer DB; an Image Delivery Application on a separate SQL Server 7 loads imagery from the Image Provider Site(s)]
System Management & Maintenance • Backup and Recovery • STK 9710 Tape robot • Legato NetWorker™ • SQL Server 7 Backup & Restore • Clocked at 80 MBps (peak)(~ 200 GB/hr) • SQL Server Enterprise Mgr • DBA Maintenance • SQL Performance Monitor
Microsoft TerraServer File Group Layout
• Convert 324 disks to 28 RAID5 sets plus 28 spare drives
• Make 4 WinNT volumes (RAID 50), 595 GB per volume (drives E:, F:, G:, H:)
• Build 30 20-GB files on each volume
• DB is a file group of 120 files
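A minimal sanity check of the layout arithmetic, using only the figures on this slide; it ties out with the 2.4 TB of RAID5 on the hardware slide.

```python
# File-group arithmetic from the slide (file sizes are nominal 20 GB).
volumes = 4                 # WinNT RAID-50 volumes, ~595 GB each
files_per_volume = 30
file_gb = 20

total_files = volumes * files_per_volume
total_tb = total_files * file_gb / 1000
print(total_files, "files =", total_tb, "TB")   # 120 files = 2.4 TB
```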
Image Delivery and Load: incremental load of 4 more TB in next 18 months
• Imagery arrives on DLT tape ("tar" dropped into \Drop'N'); the LoadMgr DB tracks jobs (DoJob / Wait 4 Load)
• Cutting machines (two AlphaServer 4100s with 60 4.3 GB drives) run the load pipeline: 10: ImgCutter, 20: Partition, 30: ThumbImg, 40: BrowseImg, 45: JumpImg, 50: TileImg, 55: Meta Data, 60: Tile Meta, 70: Img Meta, 80: Update Place
• Tiles flow over a 100 Mbit Ethernet switch (\Drop'N', \Images) to the AlphaServer 8400 and the TerraServer enterprise storage array (3 x 108 9.1 GB drives), backed up via NTBackup to the STK DLT tape library
Some Tera-Byte Databases
• The Web: 1 TB of HTML
• TerraServer: 1 TB of images
• Several other 1 TB (file) servers
• Hotmail: 7 TB of email
• Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
• EOS/DIS (picture of planet each week): 15 PB by 2007
• Federal Clearinghouse: images of checks, 15 PB by 2006 (7-year history)
• Nuclear Stockpile Stewardship Program: 10 exabytes (???!!)
[Scale diagram, kilo through yotta: a letter, a novel, a movie, the Library of Congress (text), LoC (image), LoC (sound + cinema), all photos, all disks, all tapes, all information!]
Michael Lesk's Points (www.lesk.com/mlesk/ksg97/ksg.html)
• Soon everything can be recorded and kept
• Most data will never be seen by humans
• Precious resource: human attention
• Auto-summarization and auto-search will be key enabling technologies
Outline • What storage things are coming from Microsoft? • TerraServer: a 1 TB DB on the Web • Storage Metrics: Kaps, Maps, Gaps, Scans • The future of storage: ActiveDisks
Storage Latency: How Far Away is the Data? (in clock ticks)
• Registers: 1 (my head, 1 min)
• On-chip cache: 2 (this room)
• On-board cache: 10 (this campus, 10 min)
• Memory: 100 (Sacramento, 1.5 hr)
• Disk: 10^6 (Pluto, 2 years)
• Tape / optical robot: 10^9 (Andromeda, 2,000 years)
MetaMessage: Technology Ratios Are Important • If everything gets faster&cheaper at the same rate THEN nothing really changes. • Things getting MUCH BETTER: • communication speed & cost 1,000x • processor speed & cost 100x • storage size & cost 100x • Things staying about the same • speed of light (more or less constant) • people (10x more expensive) • storage speed (only 10x better)
Today's Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
Storage Ratios Changed in Last 20 Years
• Media price: 4000x better; bandwidth: only 10x; accesses/second: only 10x
• DRAM : disk $/MB ratio fell from 100:1 to 25:1
• Tape : disk $/GB ratio fell from 100:1 to 5:1
Disk Access Time
• Access time = SeekTime (6 ms, improving ~5%/year) + RotateTime (3 ms, ~5%/year) + ReadTime (1 ms, ~25%/year)
• Other useful facts:
• Power rises more than size^3, so small is indeed beautiful
• Small devices are more rugged
• Small devices can use plastics since forces are much smaller; e.g., bugs fall without breaking anything
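A sketch of what this model implies for effective bandwidth. The seek and rotate figures are the slide's; the ~8 MB/s media rate is my assumption (it makes the slide's 1 ms ReadTime an ~8 KB page).

```python
# Effective bandwidth under: access time = seek + rotate + transfer.
SEEK_MS, ROTATE_MS = 6.0, 3.0
MEDIA_KB_PER_MS = 8.0   # assumed media rate (~8 MB/s); not from the slide

def effective_mb_s(request_kb: float) -> float:
    """MB/s delivered when every request pays the full seek + rotate."""
    ms = SEEK_MS + ROTATE_MS + request_kb / MEDIA_KB_PER_MS
    return request_kb / ms          # KB per ms is numerically MB per s

for kb in (8, 64, 1024):
    print(f"{kb:5d} KB requests -> {effective_mb_s(kb):5.1f} MB/s")
# 8 KB ~0.8 MB/s, 64 KB ~3.8 MB/s, 1 MB ~7.5 MB/s: large transfers win
```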
Standard Storage Metrics
• Capacity:
• RAM: MB and $/MB: today at 100 MB & 1 $/MB
• Disk: GB and $/GB: today at 10 GB and 50 $/GB
• Tape: TB and $/TB: today at 0.1 TB and 10 $/GB (nearline)
• Access time (latency):
• RAM: 100 ns
• Disk: 10 ms
• Tape: 30 second pick, 30 second position
• Transfer rate:
• RAM: 1 GB/s
• Disk: 5 MB/s (arrays can go to 1 GB/s)
• Tape: 3 MB/s (not clear that striping works)
New Storage Metrics: Kaps, Maps, Gaps, SCANs
• Kaps: how many kilobyte objects served per second (the file server, transaction processing metric)
• Maps: how many megabyte objects served per second (the Mosaic metric)
• Gaps: how many gigabyte objects served per hour (the video & EOSDIS metric)
• SCANs: how many scans of all the data per day (the data mining and utility metric)
• And: $/Kaps, $/Maps, $/Gaps, $/SCAN
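A sketch computing these four metrics for a single disk, plugging in the round numbers from the "Standard Storage Metrics" slide (10 ms access, 5 MB/s transfer, 10 GB capacity); the definitions are the slide's, the code is illustrative.

```python
# Kaps / Maps / Gaps / SCANs for one disk (10 ms, 5 MB/s, 10 GB).
ACCESS_S, MB_PER_S, CAPACITY_MB = 0.010, 5.0, 10_000

def served_per_second(object_mb: float) -> float:
    """Objects served per second: one access plus one transfer each."""
    return 1.0 / (ACCESS_S + object_mb / MB_PER_S)

kaps = served_per_second(0.001)               # 1 KB objects per second
maps = served_per_second(1.0)                 # 1 MB objects per second
gaps = 3600 * served_per_second(1000.0)       # 1 GB objects per hour
scans = 24 * 3600 / (CAPACITY_MB / MB_PER_S)  # full scans per day

print(f"{kaps:.0f} Kaps, {maps:.1f} Maps, {gaps:.0f} Gaps, {scans:.0f} SCANs/day")
```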
How To Get Lots of Maps, Gaps, SCANs
• Parallelism: use many little devices in parallel
• At 10 MB/s it takes 1.2 days to scan 1 TB; 1,000-way parallel: 100 seconds/scan
• Parallelism: divide a big problem into many smaller ones to be solved in parallel
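The slide's arithmetic, spelled out (1 TB is assumed as the dataset, consistent with the talk's theme):

```python
# Scan time: serial vs 1,000-way parallel, per the slide.
TB, BYTES_PER_S = 1e12, 10e6    # 1 TB at 10 MB/s

serial_s = TB / BYTES_PER_S                           # 100,000 s
print(f"serial: {serial_s / 86400:.1f} days")         # ~1.2 days
print(f"1,000-way parallel: {serial_s / 1000:.0f} s") # 100 s per scan
```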
Tape & Optical: Beware of the Media Myth
• Optical is cheap: 200 $/platter, 2 GB/platter => 100 $/GB (5x cheaper than disc)
• Tape is cheap: 100 $/tape, 40 GB/tape => 2.5 $/GB (100x cheaper than disc)
Tape & Optical Reality: Media is 10% of System Cost
• Tape needs a robot (10 k$ ... 3 m$): 10 ... 1,000 tapes (at 40 GB each) => 20 $/GB ... 200 $/GB (1x ... 10x cheaper than disc)
• Optical needs a robot (50 k$): 100 platters = 200 GB (TODAY) => 250 $/GB (more expensive than disc)
• Robots have poor access times: not good for the Library of Congress (25 TB)
• Data motel: data checks in but it never checks out!
The Access Time Myth
• The myth: seek or pick time dominates
• The reality: (1) queuing dominates, (2) transfer dominates BLOBs, (3) disk seeks are often short
• Implication: many cheap servers are better than one fast expensive server: shorter queues, parallel transfer, lower cost/access and cost/byte
• This is obvious for disk & tape arrays
My Solution to Tertiary Storage: Tape Farms, Not Mainframe Silos
• Many independent tape robots (like a disc farm)
• One robot: 10 k$, 10 tapes, 400 GB, 6 MB/s, 25 $/GB; scan in 12 hours; 30 Maps, 15 Gaps, 2 Scans
• 100 robots: 1 M$, 40 TB, 25 $/GB; 3K Maps, 1.5K Gaps, 2 Scans
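A sketch of how one robot's numbers (from this slide) aggregate across the farm; note that SCANs/day does not grow, since capacity and aggregate bandwidth scale together.

```python
# Scale one tape robot's figures (from the slide) to a 100-robot farm.
robot = {"cost_k$": 10, "capacity_gb": 400, "maps": 30, "gaps": 15}
N = 100

farm = {k: v * N for k, v in robot.items()}
farm["scans_per_day"] = 2   # scans of ALL the data: data grows with bandwidth
print(farm)  # 1 M$, 40 TB, 3,000 Maps, 1,500 Gaps, 2 scans -- the slide's totals
```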
The Metrics: Disk and Tape Farms Win
[Chart: GB/k$, Kaps, Maps, and SCANs/day on a log scale (0.01 to 1,000,000) for a 1000x disc farm, a 100x DLT tape farm, and an STK tape robot with 6,000 tapes and 8 readers; the farms win on every metric]
• Data motel: data checks in, but it never checks out
Cost Per Access (3-year)
[Chart: Kaps/$, Maps/$, Gaps/$, and SCANs/k$ for a 1000x disc farm, a 100x DLT tape farm, and an STK tape robot with 6,000 tapes and 16 readers]
Storage Ratios Impact on Software • Gone from 512 B pages to 8192 B pages (will go to 64 KB pages in 2006) • Treat disks as tape: • Increased use of sequential access • Use disks for backup copies • Use tape for • VERY COLD data or • Offsite Archive • Data interchange
Summary • Storage accesses are the bottleneck • Accesses are getting larger (Maps, Gaps, SCANS) • Capacity and cost are improvingBUT • Latencies and bandwidth are not improving muchSO • Use parallel access (disk and tape farms) • Use sequential access (scans)
The Memory Hierarchy: Measuring & Modeling Sequential IO
• Where is the bottleneck?
• How does it scale with SMP, RAID, new interconnects?
• Goals: balance bottlenecks, low overhead, scale to many processors (10s) and many disks (100s)
[Diagram: app address space and file cache in memory over the memory bus; controller and adapter on SCSI and PCI]
PAP (Peak Advertised Performance) vs RAP (Real Application Performance)
• Goal: RAP = PAP / 2 (the half-power point)
[Diagram: system bus 422 MBps; PCI 133 MBps; SCSI 40 MBps; 7.2 MB/s per disk; application data 10-15 MBps through the file system buffers]
The Best Case: Temp File, NO IO
• Read / write a temp file in the file system cache
• Program uses a small (in-CPU-cache) buffer, so write/read time is the bus move time (3x better than copy)
• Paradox: the fastest way to move data is to write it, then read it
• This hardware is limited to 150 MBps per processor
Bottleneck Analysis
• Drawn to linear scale:
• Theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits
• Memory read/write: ~150 MBps
• MemCopy: ~50 MBps
• Disk r/w: ~9 MBps
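The point of drawing the stages to linear scale is that end-to-end throughput is capped by the slowest stage on the data path; a trivial sketch using the slide's numbers:

```python
# End-to-end throughput is the minimum over the pipeline stages.
stages_mbps = {
    "bus (theoretical)": 422,
    "memory read/write": 150,
    "memcopy": 50,
    "disk r/w": 9,
}
slowest = min(stages_mbps, key=stages_mbps.get)
print(f"bottleneck: {slowest} at {stages_mbps[slowest]} MBps")
```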
PAP vs RAP: Measured
• Reads are easy, writes are hard; async write can match WCE (write-cache enable)
[Diagram: system bus 422 -> 142 MBps; PCI 133 -> 72 MBps; SCSI 40 -> 31 MBps; disks 9 MBps; application data 10-15 MBps through the file system]
Bottleneck Analysis: NTFS Read/Write
• 9 disks, 2 SCSI buses, 1 PCI: ~65 MBps unbuffered read, ~43 MBps unbuffered write, ~40 MBps buffered read, ~35 MBps buffered write
[Diagram: each adapter ~30 MBps, PCI ~70 MBps, memory read/write ~150 MBps]
Peak Throughput on Intel/NT
• NTFS read/write, 24 disks, 4 SCSI buses, 2 PCI (64 bit): ~190 MBps unbuffered read, ~95 MBps unbuffered write
• So: 0.8 TB/hr read, 0.4 TB/hr write on a 25 k$ server
[Diagram: adapters ~30 MBps each, PCI ~70 MBps, memory read/write ~150 MBps]
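Converting the measured rates into the slide's hourly figures is simple unit arithmetic (the slide rounds up):

```python
# MBps -> TB/hr for the measured peak rates on this slide.
def tb_per_hour(mb_per_s: float) -> float:
    return mb_per_s * 3600 / 1e6

print(f"read : {tb_per_hour(190):.2f} TB/hr")   # 0.68 -> "0.8 TB/hr"
print(f"write: {tb_per_hour(95):.2f} TB/hr")    # 0.34 -> "0.4 TB/hr"
```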
Penny Sort Ground Rules (http://research.microsoft.com/barc/SortBenchmark)
• How much can you sort for a penny?
• Hardware and software cost, depreciated over 3 years
• Time (seconds) = 946,080 / SystemPrice ($)
• A 1 M$ system gets about 1 second; a 1 K$ system gets about 1,000 seconds
• Input and output are disk resident
• Input is 100-byte records (random data); key is the first 10 bytes
• Must create the output file and fill it with a sorted version of the input file
• Daytona (product) and Indy (special) categories
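The budget formula, worked through: a penny buys 0.01/price of the system's three-year depreciated life, which reproduces both examples on the slide.

```python
# Penny-sort time budget: the fraction of 3-year life a penny buys.
THREE_YEARS_S = 3 * 365 * 24 * 3600    # 94,608,000 seconds

def budget_seconds(system_price_usd: float) -> float:
    return THREE_YEARS_S * 0.01 / system_price_usd   # = 946,080 / price

print(f"$1,000,000 system: {budget_seconds(1e6):.2f} s")  # ~1 second
print(f"$1,000 system:     {budget_seconds(1e3):.0f} s")  # ~1,000 seconds
```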
PennySort • Hardware • 266 Mhz Intel PPro • 64 MB SDRAM (10ns) • Dual Fujitsu DMA 3.2GB EIDE • Software • NT workstation 4.3 • NT 5 sort • Performance • sort 15 M 100-byte records (~1.5 GB) • Disk to disk • elapsed time 820 sec • cpu time = 404 sec
Cluster Sort Conceptual Model
• Multiple data sources, multiple data destinations, multiple nodes
• Disks -> sockets -> disk -> disk
[Diagram: each of three nodes scans its local mix of A, B, and C records, sends each record to the node owning that key range, and each node writes a sorted run of its own letter]
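A minimal sketch of the diagram's one-pass flow: every node range-partitions its local records to the owning node, and each node sorts what it receives. The three-letter records and all names here are illustrative, not from the talk.

```python
from collections import defaultdict

def route(records, boundaries):
    """Phase 1 (disk -> socket): send each record to the node owning its key range."""
    out = defaultdict(list)
    for rec in records:
        out[sum(rec >= b for b in boundaries)].append(rec)
    return out

# Three nodes with ranges [..B), [B..C), [C..), mirroring the A/B/C diagram.
inputs = [["CAB", "ABE", "BOB"], ["BEA", "COD", "AXE"], ["ACE", "BIT", "CUE"]]
received = defaultdict(list)
for local in inputs:                        # each node scans its own disk
    for dest, recs in route(local, ["B", "C"]).items():
        received[dest].extend(recs)         # socket -> destination node

# Phase 2 (socket -> disk): each node sorts what it received.
print({node: sorted(recs) for node, recs in sorted(received.items())})
# {0: ['ABE', 'ACE', 'AXE'], 1: ['BEA', 'BIT', 'BOB'], 2: ['CAB', 'COD', 'CUE']}
```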
Outline • What storage things are coming from Microsoft? • TerraServer: a 1 TB DB on the Web • Storage Metrics: Kaps, Maps, Gaps, Scans • The future of storage: ActiveDisks
Crazy Disk Ideas
• Disk farm on a card: surface-mount disks
• Disk (magnetic store) on a chip: micro machines in silicon
• NT and BackOffice in the disk controller (a processor with 100 MB DRAM, on an ASIC)