
Store Everything Online In A Database

A talk by Jim Gray (Microsoft Research) on storing very large data sets online, on disk rather than tape, and in a database rather than in plain files. It covers how much data there is to store, what a petabyte disk store costs to build and to ship over a network, and why a database beats a do-it-yourself file scheme for metadata, search, and replication.

cnoble

Presentation Transcript


  1. Store Everything Online In A Database
     Jim Gray, Microsoft Research
     Gray@Microsoft.com
     http://research.microsoft.com/~gray/talks
     http://research.microsoft.com/~gray/talks/Gray_GriPhyN.ppt

  2. Outline
     • Store Everything
     • Online (Disk not Tape)
     • In a Database

  3. How Much is Everything?
     • Soon everything can be recorded and indexed.
     • Most bytes will never be seen by humans.
     • Data summarization, trend detection, and anomaly detection are key technologies.
     See Mike Lesk, "How much information is there": http://www.lesk.com/mlesk/ksg97/ksg.html
     See Lyman & Varian, "How much information": http://www.sims.berkeley.edu/research/projects/how-much-info/
     [Figure: scale labeled Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with markers for Everything Recorded, All Books MultiMedia, All LoC books (words), Movie, A Photo, A Book.]
     Small prefixes (negative powers of ten): 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli.

  4. Storage capacity beating Moore’s law
     3 k$/TB today (raw disk)
     1 k$/TB by end of 2002

  5. Outline
     • Store Everything
     • Online (Disk not Tape)
     • In a Database

  6. Online Data
     • Can build 1 PB of NAS disk for 5 M$ today.
     • Can SCAN (read or write) the entire PB in 3 hours.
     • Operate it as a data pump: continuous sequential scan.
     • Can deliver 1 PB for 1 M$ over the Internet.
     • Access charge is 300$/Mbps bulk rate.
     • Need to Geoplex data (store it in two places).
     • Need to filter/process data near the source, to minimize network costs.
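
The delivery and scan figures on this slide can be checked with a little arithmetic. The sketch below is only a back-of-envelope check: it assumes the 300$/Mbps access charge is a monthly fee for a link that is kept fully busy, a 30-day month, and decimal units (1 PB = 10^15 bytes). None of these assumptions is stated on the slide, and the function names are illustrative.

```python
# Back-of-envelope check of the slide's network and scan figures.
PB = 10**15                          # bytes; decimal petabyte (assumed)
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumed 30-day billing month


def cost_to_ship(num_bytes, dollars_per_mbps_month=300.0):
    """Dollars to move num_bytes over a link billed per Mbps per month,
    assuming the link is saturated the whole time."""
    bits = num_bytes * 8
    bits_per_mbps_month = 1_000_000 * SECONDS_PER_MONTH
    return bits / bits_per_mbps_month * dollars_per_mbps_month


def scan_bandwidth(num_bytes, hours):
    """Aggregate bytes/second needed to scan num_bytes in the given time."""
    return num_bytes / (hours * 3600)


print(f"ship 1 PB: about {cost_to_ship(PB) / 1e6:.2f} M$")
print(f"scan 1 PB in 3 h: about {scan_bandwidth(PB, 3) / 1e9:.0f} GB/s aggregate")
```

Under these assumptions shipping a petabyte comes to a little under 1 M$, in line with the slide, and a 3-hour scan needs roughly 93 GB/s summed over all disks, which is only plausible when thousands of drives are read in parallel.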

  7. The “Absurd” Disk
     1 TB, 100 MB/s, 200 Kaps
     • 2.5 hr scan time (poor sequential access)
     • 1 access per second / 5 GB (VERY cold data)
     • It’s a tape!
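
The two ratios on this slide follow directly from the three device figures. The sketch below reads "200 Kaps" as 200 small (kilobyte-sized) random accesses per second, which is an interpretation rather than something the slide spells out, and uses decimal units.

```python
# Ratios behind the "absurd disk": capacity grows faster than bandwidth
# or access rate, so the device behaves more and more like a tape.
TB, GB, MB = 10**12, 10**9, 10**6   # decimal units (assumed)


def scan_hours(capacity_bytes, bytes_per_sec):
    """Hours to read the whole device sequentially."""
    return capacity_bytes / bytes_per_sec / 3600


def gb_per_access_per_sec(capacity_bytes, accesses_per_sec):
    """Gigabytes of stored data behind each available access per second."""
    return capacity_bytes / accesses_per_sec / GB


print(f"scan time: {scan_hours(1 * TB, 100 * MB):.1f} hours")
print(f"data per access/sec: {gb_per_access_per_sec(1 * TB, 200):.0f} GB")
```

With decimal units the scan comes out at about 2.8 hours, which the slide rounds to 2.5, and each access per second has to cover 5 GB of stored data.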

  8. Disk vs Tape (Guesstimates)
     Disk                        Tape
     80 GB                       40 GB
     35 MBps                     10 MBps
     5 ms seek time              10 sec pick time
     3 ms rotate latency         30-120 second seek time
     3$/GB for drive             2$/GB for media
     2$/GB for ctlrs/cabinet     8$/GB for drive+library
     15 TB/rack                  10 TB/rack
     1 hour scan                 1 week scan
     Cern: 200 TB 3480 tapes; 2 col = 50 GB; Rack = 1 TB = 12 drives
     The price advantage of disk is growing; the performance advantage of disk is huge!
     At 10K$/TB, disk is competitive with nearline tape.
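
The per-device numbers in this comparison can be combined into cost per terabyte, time to reach a random byte, and time to read one device end to end. The sketch below is a minimal illustration using only the slide's figures; the `Device` class and its field names are hypothetical, and the tape access time uses the best-case 30 second seek.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    capacity_gb: float
    mb_per_sec: float
    access_sec: float       # time to reach a random byte
    dollars_per_gb: float   # drive or media plus controller, cabinet, or library

    def scan_minutes(self):
        """Minutes to read one device (one drive or one cartridge) end to end."""
        return self.capacity_gb * 1000 / self.mb_per_sec / 60

    def dollars_per_tb(self):
        return self.dollars_per_gb * 1000


# Figures from the slide; tape access is pick time plus best-case seek.
disk = Device("disk", 80, 35, 0.005 + 0.003, 3 + 2)
tape = Device("tape", 40, 10, 10 + 30, 2 + 8)

for d in (disk, tape):
    print(f"{d.name}: {d.dollars_per_tb():.0f} $/TB, "
          f"{d.access_sec:g} s access, {d.scan_minutes():.0f} min per-device scan")
print(f"tape is about {tape.access_sec / disk.access_sec:.0f}x slower to a random byte")
```

Tape works out to 10 k$/TB, which matches the slide's remark that disk becomes competitive at that price. The slide's "1 hour" and "1 week" scan times presumably describe a whole rack, where many cartridges share a few drives; this sketch does not model that.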

  9. Building a Petabyte Disk Store
     • Cadillac ~ 500k$/TB = 500M$/PB; plus FC switches plus… 800M$/PB
     • TPC-C SANs (Brand PC 18GB/…) 60M$/PB
     • Brand PC local SCSI 20M$/PB
     • Do it yourself ATA 5M$/PB

  10. Cheap Storage and/or Balanced System
      • Low cost storage (2 x 3k$ servers): 5K$/TB
        • 2 x (800 MHz, 256 MB + 8 x 80 GB disks + 100 MbE)
        • RAID5 costs 6K$/TB
      • Balanced server (5k$/.64 TB)
        • 2 x 800 MHz (2k$)
        • 512 MB
        • 8 x 80 GB drives (2K$)
        • Gbps Ethernet + switch (300$/port)
        • 9k$/TB, 18K$/mirrored TB
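
The dollars-per-terabyte figures on this slide can be reproduced approximately from the component prices. The sketch below is a rough check, not the author's calculation: it assumes RAID5 gives up one disk in eight to parity, that mirroring halves usable capacity, and that one 300$ switch port is added to the 5k$ balanced server. The slide appears to round its results upward.

```python
def dollars_per_tb(system_cost, disks, gb_per_disk, usable_fraction=1.0):
    """Cost per usable terabyte (decimal units) for a box full of disks."""
    usable_tb = disks * gb_per_disk / 1000 * usable_fraction
    return system_cost / usable_tb


# Low-cost storage: two 3k$ servers, each with 8 x 80 GB disks.
low_cost_raw = dollars_per_tb(2 * 3000, 2 * 8, 80)
low_cost_raid5 = dollars_per_tb(2 * 3000, 2 * 8, 80, usable_fraction=7 / 8)

# Balanced server: 5k$ box plus one 300$ Gbps Ethernet port (assumed).
balanced = dollars_per_tb(5000 + 300, 8, 80)
balanced_mirrored = dollars_per_tb(5000 + 300, 8, 80, usable_fraction=0.5)

for label, value in [("low cost, raw", low_cost_raw),
                     ("low cost, RAID5", low_cost_raid5),
                     ("balanced", balanced),
                     ("balanced, mirrored", balanced_mirrored)]:
    print(f"{label}: {value / 1000:.1f} k$/TB")
```

The results land at roughly 4.7, 5.4, 8.3, and 16.6 k$/TB, close to the slide's 5, 6, 9, and 18 k$/TB.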

  11. Next step in the Evolution
      • Disks become supercomputers
      • Controller will have 1 bips, 1 GB RAM, 1 GBps net
      • And a disk arm.
      • Disks will run full-blown app/web/db/os stack
      • Distributed computing
      • Processors migrate to transducers.

  12. It’s Hard to Archive a Petabyte
      It takes a LONG time to restore it.
      • At 1 GBps it takes 12 days!
      • Store it in two (or more) places online (on disk?): a geo-plex.
      • Scrub it continuously (look for errors).
      • On failure:
        • use the other copy until the failure is repaired,
        • refresh the lost copy from the safe copy.
      • Can organize the two copies differently (e.g., one by time, one by space).
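
The restore time and the scrub-and-repair cycle described on this slide can be sketched in a few lines. This is a toy illustration under stated assumptions: two in-memory dictionaries stand in for the two sites, SHA-256 checksums recorded at write time stand in for whatever error detection a real store would use, and all names are hypothetical.

```python
import hashlib


def restore_days(num_bytes, bytes_per_sec):
    """Days needed to stream num_bytes back at a sustained rate."""
    return num_bytes / bytes_per_sec / 86400


def digest(block):
    return hashlib.sha256(block).hexdigest()


def scrub(primary, replica, checksums):
    """Check every block of both copies against its recorded checksum and
    refresh a damaged block from the copy that is still good.
    Returns a list of (repaired_copy, block_id) pairs."""
    repaired = []
    for block_id, expected in checksums.items():
        primary_ok = digest(primary[block_id]) == expected
        replica_ok = digest(replica[block_id]) == expected
        if primary_ok and not replica_ok:
            replica[block_id] = primary[block_id]
            repaired.append(("replica", block_id))
        elif replica_ok and not primary_ok:
            primary[block_id] = replica[block_id]
            repaired.append(("primary", block_id))
        elif not primary_ok and not replica_ok:
            raise RuntimeError(f"block {block_id} is damaged in both copies")
    return repaired


print(f"restore 1 PB at 1 GBps: {restore_days(10**15, 10**9):.1f} days")

primary = {0: b"alpha", 1: b"beta"}
replica = dict(primary)
checksums = {block_id: digest(data) for block_id, data in primary.items()}
replica[1] = b"bxta"                      # simulate silent corruption at one site
print(scrub(primary, replica, checksums))
```

With decimal units the restore takes about 11.6 days, which the slide rounds to 12. The scrub pass repairs the damaged replica block from the primary; a block that is bad in both copies cannot be repaired, which is the argument for scrubbing continuously rather than waiting for a failure.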

  13. Outline
      • Store Everything
      • Online (Disk not Tape)
      • In a Database

  14. Why Not file = object + GREP?
      • It works if you have thousands of objects (and you know them all).
      • But it is hard to search millions/billions/trillions with GREP.
      • Hard to put all attributes in the file name.
      • Minimal metadata.
      • Hard to do chunking right.
      • Hard to pivot on space/time/version/attributes.
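
The contrast with GREP can be made concrete with a small example. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for a database; the table, column, and index names are made up for illustration and do not come from the talk. Attributes live in columns rather than in file names, and an index lets a search by region and time avoid touching every object.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE objects (
        id       INTEGER PRIMARY KEY,
        region   TEXT,      -- "space" attribute
        obs_time INTEGER,   -- "time" attribute
        version  INTEGER,
        payload  BLOB
    )""")
conn.executemany(
    "INSERT INTO objects (region, obs_time, version, payload) VALUES (?, ?, ?, ?)",
    [(f"r{i % 100}", i, i % 3, b"") for i in range(100_000)])

# One index serves searches by region and time range.
conn.execute("CREATE INDEX idx_region_time ON objects (region, obs_time)")


def find(region, t0, t1):
    """Ids of objects in a region observed within [t0, t1]."""
    return conn.execute(
        "SELECT id FROM objects "
        "WHERE region = ? AND obs_time BETWEEN ? AND ? ORDER BY obs_time",
        (region, t0, t1)).fetchall()


def count_by_version():
    """Pivot on a different attribute without reorganizing any files."""
    return conn.execute(
        "SELECT version, COUNT(*) FROM objects GROUP BY version ORDER BY version"
    ).fetchall()


print(len(find("r7", 1000, 2000)), "objects in region r7 between t=1000 and t=2000")
print(count_by_version())
```

The same table answers a question keyed on version with no change to how the data is stored, which is the "pivot" the slide says is hard to do with files and naming conventions.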

  15. The Reality: it’s build vs buy
      • If you use a file system you will eventually build a database system:
        • metadata,
        • query,
        • parallel ops,
        • security,
        • reorganize,
        • recovery,
        • distributed,
        • replication, …

  16. OK: so I’ll put lots of objects in a file
      Do It Yourself Database
      • Good news:
        • Your implementation will be 10x faster than the general purpose one, and easier to understand and use than the general purpose one.
      • Bad news:
        • It will cost 10x more to build and maintain.
        • Someday you will get bored maintaining/evolving it.
        • It will lack some killer features:
          • Parallel search
          • Self-describing via metadata
          • SQL, XML, …
          • Replication
          • Online update and reorganization
        • Chunking is problematic (what granularity, how to aggregate).

  17. Top 10 reasons to put Everything in a DB
      • Someone else writes the million lines of code.
      • Captures data and metadata.
      • Standard interfaces give tools and quick learning.
      • Allows schema evolution without breaking old apps.
      • Index and pivot on multiple attributes: space, time, attribute, version, …
      • Parallel terabyte searches in seconds or minutes.
      • Moves processing & search close to the disk arm (moves fewer bytes: questons return datons).
      • Chunking is easier (can aggregate chunks at server).
      • Automatic geo-replication.
      • Online update and reorganization.
      • Security.
      • If you pick the right vendor, ten years from now, there will be software that can read the data.
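
Two of these reasons, schema evolution and aggregating chunks at the server, are easy to illustrate. The sketch below again uses Python's `sqlite3` module as a stand-in for a database server; the `tiles` table and its columns are hypothetical. An "old application" query keeps working after a column is added, and a grouped query returns a handful of summary rows instead of every chunk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tiles (x INTEGER, y INTEGER, bytes INTEGER)")
conn.executemany(
    "INSERT INTO tiles VALUES (?, ?, ?)",
    [(x, y, 1000 + x + y) for x in range(10) for y in range(10)])


def old_app_total():
    """Query written against the original schema; it names its columns."""
    return conn.execute("SELECT SUM(bytes) FROM tiles").fetchone()[0]


before = old_app_total()

# Schema evolution: a new attribute appears, old queries are unaffected.
conn.execute("ALTER TABLE tiles ADD COLUMN version INTEGER DEFAULT 1")
after = old_app_total()

# Aggregation at the server: 10 summary rows come back instead of 100 tiles.
per_column = conn.execute(
    "SELECT x, SUM(bytes) FROM tiles GROUP BY x ORDER BY x").fetchall()

print(before, after, len(per_column))
```

The old query returns the same total before and after the schema change, and the grouped query moves ten rows across the interface rather than the hundred it summarizes.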

  18. DB Centric Examples
      • TerraServer
        • All images and all data in the database (chunked as small tiles).
        • www.TerraServer.Microsoft.com/
        • http://research.microsoft.com/~gray/Papers/MSR_TR_99_29_TerraServer.doc
      • SkyServer & Virtual Sky
        • Both image and semantic data in a relational store.
        • Parallel search & NonProcedural access are important.
        • http://research.microsoft.com/~gray/Papers/MS_TR_99_30_Sloan_Digital_Sky_Survey.doc
        • http://dart.pha.jhu.edu/sdss/getMosaic.asp?Z=1&A=1&T=4&H=1&S=10&M=30
        • http://virtualsky.org/servlet/Page?F=3&RA=16h+10m+1.0s&DE=%2B0d+42m+45s&T=4&P=12&S=10&X=5096&Y=4121&W=4&Z=-1&tile.2.1.x=55&tile.2.1.y=20

  19. OK… Why don’t they use our stuff?
      • Wrong metaphor: HDF with hyper-slab is a better match.
      • Impedance mismatch: getting stuff in/out of the DB is too hard.
      • We sold them OODBs and they did not work (unreliable, poor performance, no tools).
      • …

  20. So, why will the future be different?
      • They have MUCH more data (10^8 files?)
      • Java / C# eases impedance mismatch: rowsets == ragged arrays.
      • Tools are better
      • Optimizers are better
      • CPU and disk parallelism actually works now
      • Statistical packages are better.

  21. Outline
      • Store Everything
      • Online (Disk not Tape)
      • In a Database

  22. But… The title of the talk was… “The Future of Distributed Database Systems”
      Nobody wants to share his database.
      Blocks, files, tables are the wrong abstraction for networks (too low level).
      “Objects are the right abstraction.”
      So, UDDI / WSDL / SOAP is the solution (not SQL).
      XML is the wire format, XLANG is the workflow protocol.
      Query will be in there somewhere.

  23. DDB technology GREAT in a Cluster
      • Uniform architecture
      • Trust among nodes
      • High bandwidth-low latency communication
      • Programs have single system image
      • Queries run in parallel
      • Global optimizer does query decomposition

  24. But in a Distributed System
      • Heterogeneous architecture makes query planning much harder
      • No trust
      • Communication is slow and expensive (minimize it).
      • Higher-level abstraction to minimize round trips
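
The last point, raising the level of abstraction to cut round trips, can be shown with a toy model. The sketch below does no real networking: `RemoteStore` is a hypothetical stand-in that simply counts requests, and the 0.1 second round-trip latency is an assumed, illustrative figure.

```python
class RemoteStore:
    """Stand-in for a remote service; counts round trips instead of sleeping."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.round_trips = 0

    def get_row(self, key):
        """Low-level interface: one row per request."""
        self.round_trips += 1
        return self._rows[key]

    def sum_rows(self, keys):
        """Higher-level interface: one request, work done next to the data."""
        self.round_trips += 1
        return sum(self._rows[k] for k in keys)


ROUND_TRIP_SEC = 0.1   # assumed wide-area latency, for illustration only
keys = range(1000)

chatty = RemoteStore({k: k for k in keys})
chatty_total = sum(chatty.get_row(k) for k in keys)

batched = RemoteStore({k: k for k in keys})
batched_total = batched.sum_rows(keys)

for name, store in (("row at a time", chatty), ("single request", batched)):
    print(f"{name}: {store.round_trips} round trips, "
          f"about {store.round_trips * ROUND_TRIP_SEC:.1f} s of latency")
```

Both clients compute the same total, but the row-at-a-time client pays for a thousand round trips while the higher-level request pays for one.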

  25. DDB the Trust Issue
      DDB Grocery:
      • Customers serve themselves
      • Follow the rules posted on the door
      • No overhead, no staff!
      Client/Server Groceries:
      • Clerks serve customers
      • Take order, fill order, fill out invoice, collect money.
      • Overhead: staff, training, rules, …
