What Happens When Processing, Storage, Bandwidth are Free and Infinite? Jim Gray, Microsoft Research
Outline • Hardware CyberBricks • all nodes are very intelligent • Software CyberBricks • standard way to interconnect intelligent nodes • What next? • Processing migrates to where the power is • Disk, network, display controllers have full-blown OS • Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them • Computer is a federated distributed system.
A Hypothetical Question: Taking Things to the Limit • Moore's Law, 100x per decade: • Exa-instructions per second in 30 years • Exa-bit memory chips • Exa-byte disks • Gilder's Law of the Telecosm: 3x/year more bandwidth, 60,000x per decade! • 40 Gbps per fiber today
Grove’s Law • Link Bandwidth doubles every 100 years! • Not much has happened to telephones lately • Still twisted pair
Gilder's Telecosm Law: 3x bandwidth/year for 25 more years • Today: • 10 Gbps per channel • 4 channels per fiber: 40 Gbps • 32 fibers/bundle = 1.2 Tbps/bundle • In the lab: 3 Tbps/fiber (400 x WDM) • In theory: 25 Tbps per fiber • 1 Tbps = the entire 1996 USA WAN bisection bandwidth; 1 fiber = 25 Tbps
Thesis: Many Little Beat Few Big [Figure: price/size spectrum from mainframe ($1 million, 14") through mini ($100K, 9") and micro ($10K) down to nano and the 1 MM pico processor; storage hierarchy from 10-picosecond RAM through 10-nanosecond RAM, 10-microsecond RAM, and 10-millisecond disk to 10-second tape archive; capacities from 1 MB through 100 MB, 10 GB, and 1 TB to 100 TB; disk form factors shrink 14" > 5.25" > 3.5" > 2.5" > 1.8". Projections: 1M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM, event horizon on chip, VM reincarnated, multi-program cache, on-chip SMP.] • The "smoking, hairy golf ball" • How to connect the many little parts? • How to program the many little parts? • Fault tolerance?
Year 2000: The 4B Machine • The Year 2000 commodity PC: • 1 billion instructions/sec (1 Bips) processor • .1 billion bytes RAM • 1 billion bits/sec LAN/WAN • 10 billion bytes disk • 1 billion pixel display (3000 x 3000 x 24) • 1,000 $
4B PCs: The Bricks of Cyberspace • Cost: 1,000 $ • Come with: • OS (NT, POSIX, ...) • DBMS • High-speed net • System management • GUI / OOUI • Tools • Compatible with everyone else • These are CyberBricks.
Super Server: The 4T Machine • An array of 1,000 4B machines: • 1 Bips processors • 1 billion-byte DRAM • 10 billion-byte disks • 1 Bbps comm lines • 1 TB tape robot • A few megabucks • Challenge: • Manageability • Programmability • Security • Availability • Scalability • Affordability • As easy as a single system • The CyberBrick is a 4B machine; future servers are CLUSTERS of processors and disks • Distributed database techniques make clusters work
Functionally Specialized Cards • Storage, network, and display cards each carry a P-mips processor, M MB of DRAM, and an ASIC • Today: P = 50 mips, M = 2 MB • In a few years: P = 200 mips, M = 64 MB
It's Already True of Printers: Peripheral = CyberBrick • You buy a printer • You get: • several network interfaces • a PostScript engine • cpu, • memory, • software, • a spooler (soon) • and... a print engine.
System On A Chip • Integrate processing with memory on one chip • chip is 75% memory now • 1 MB cache >> 1960s supercomputers • 256 Mb memory chip is 32 MB! • IRAM, CRAM, PIM, ... projects abound • Integrate networking with processing on one chip • system bus is a kind of network • ATM, Fibre Channel, Ethernet, ... logic on chip • Direct IO (no intermediate bus) • Functionally specialized cards shrink to a chip.
All Device Controllers Will Be Cray 1's • TODAY • Disk controller is a 10-mips RISC engine with 2 MB DRAM • NIC is similar power • SOON • They will become 100-mips systems with 100 MB DRAM • They are nodes in a federation (you could run Oracle on NT in the disk controller) • Advantages: • Uniform programming model • Great tools • Security • Economics (CyberBricks) • Move computation to data (minimize traffic)
With Terabyte-Backplane Interconnect and Supercomputer Adapters • Processing is incidental to: • Networking • Storage • UI • The disk controller/NIC is: • faster than the device • close to the device • can borrow the device's package & power • So use the idle capacity for computation: run the app in the device.
Implications • Conventional: • Offload device handling to NIC/HBA • Higher-level protocols: I2O, NASD, VIA, ... • SMP and cluster parallelism is important • Radical: • Move the app to the NIC/device controller • Higher-higher-level protocols: CORBA / DCOM • Cluster parallelism is VERY important
How Do They Talk to Each Other? • Each node has an OS • Each node has local resources: a federation • Each node does not completely trust the others • Nodes use RPC to talk to each other: CORBA? DCOM? IIOP? RMI? One or all of the above • Huge leverage in high-level interfaces • Same old distributed-system story (a sketch follows below) [Figure: two application stacks, each running streams, datagrams, and RPC over VIAL/VIPL, connected by wire(s); the "?" is the RPC layer in between]
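A minimal sketch of the federation idea using Java RMI, one of the RPC options listed above; the DiskService interface and its names are illustrative inventions, not any product's API:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// A device controller (disk, NIC, display) exposes a high-level interface;
// peers invoke it by RPC instead of issuing low-level block commands.
interface DiskService extends Remote {
    byte[] readBlock(long blockNumber) throws RemoteException;
}

class DiskServiceImpl extends UnicastRemoteObject implements DiskService {
    DiskServiceImpl() throws RemoteException { super(); }
    public byte[] readBlock(long blockNumber) throws RemoteException {
        return new byte[4096]; // placeholder: a real controller reads the media
    }
}

public class FederationDemo {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.createRegistry(1099); // naming service
        registry.rebind("disk0", new DiskServiceImpl());         // node registers itself
        DiskService disk = (DiskService) registry.lookup("disk0"); // a peer finds it
        byte[] block = disk.readBlock(42);                       // RPC to the "device"
        System.out.println("read " + block.length + " bytes via RPC");
    }
}
```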
Objects! • It's a zoo • ORBs, COM, CORBA, ... • Object Relational Databases • Objects and 3-tier computing
History and Alphabet Soup [Timeline figure, 1985-1995:] • 1985: UNIX International and the Open Software Foundation (OSF) • OSF DCE: DCE RPC, GUIDs, IDL, DNS, Kerberos • 1990: X/Open (ODBC, XA/TX); the Object Management Group (OMG) and CORBA; DCE on NT and Solaris • 1995: the Open Group; Microsoft DCOM is based on OSF-DCE technology, and DCOM and ActiveX extend it
The Promise • Objects are Software CyberBricks • productivity breakthrough (plug-ins) • manageability breakthrough (modules) • Microsoft promises Cairo: distributed objects; secure, transparent, fast invocation • IBM/Sun/Oracle/Netscape promise CORBA + OpenDoc + Java Beans + ... • All will deliver; customers can pick the best one • Both camps share key goals: • Encapsulation: hide implementation • Polymorphism: generic ops, key to GUI and reuse • Uniform naming • Discovery: finding a service • Fault handling: transactions • Versioning: allow upgrades • Transparency: local/remote • Security: who has authority • Shrink-wrap: minimal inheritance • Automation: easy
The OLE-COM Experience • Macintosh had Publish & Subscribe • PowerPoint needed graphs: • plugged MS Graph in as a component • Office adopted OLE • one graph program for all of Office • Internet arrived • URLs are object references • Office was Web-enabled right away! • Office97 is smaller than Office95 because of shared components • It works!!
Linking And Embedding: Objects are data modules; transactions are execution modules • Link: a pointer to an object somewhere else • Think URL in the Internet • Embed: the bytes are here • Objects may be active; they can call back to subscribers
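A toy illustration of the distinction; the compound-document part and both class names are hypothetical:

```java
import java.net.URL;

// "Link": the document stores only a pointer to the object somewhere else.
class LinkedPart {
    URL reference;               // think URL in the Internet
    LinkedPart(URL u) { reference = u; }
}

// "Embed": the object's bytes are stored inside the document itself.
class EmbeddedPart {
    byte[] bytes;                // the bytes are here
    EmbeddedPart(byte[] b) { bytes = b; }
}
```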
Objects Meet Databases: the basis for universal data servers, access, & integration • Object-oriented (COM-oriented) interface to data • Breaks the DBMS into components • Anything can be a data source: database, spreadsheet, photos, mail, map, document • Optimization/navigation "on top of" other data sources • Makes an RDBMS an O-R DBMS, assuming the optimizer understands objects
The BIG Picture: Components and transactions • Software modules are objects • The Object Request Broker (a.k.a. Transaction Processing Monitor) connects objects (clients to servers) • Standard interfaces allow software plug-ins • A transaction ties execution of a "job" into an atomic unit: all-or-nothing, durable, isolated • ActiveX components are a 250 M$/year business.
Object Request Broker (ORB): Orchestrates RPC • Registers servers • Manages pools of servers • Connects clients to servers • Does naming and request-level authorization • Provides transaction coordination • Direct and queued invocation • Old names: • Transaction Processing Monitor • Web server • NetWare • (A sketch of the broker's job follows below.)
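A toy sketch of the broker's core job (registration, naming, pooling, dispatch); ToyOrb and its methods are invented for illustration, not a real ORB API, and transaction coordination and authorization are only marked as comments:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

interface Server { String handle(String request); }

// The broker keeps a pool of servers per service name and connects each
// client request to a pooled server.
class ToyOrb {
    private final Map<String, Queue<Server>> pools = new HashMap<>();

    void register(String service, Server server) {       // server registration
        pools.computeIfAbsent(service, k -> new ArrayDeque<>()).add(server);
    }

    String invoke(String service, String request) {      // naming + dispatch
        Queue<Server> pool = pools.get(service);
        if (pool == null || pool.isEmpty())
            throw new IllegalStateException("no server for " + service);
        Server s = pool.poll();                           // take a server from the pool
        try {
            // a real ORB would begin a transaction and check authorization here
            return s.handle(request);
        } finally {
            pool.add(s);                                  // return it to the pool
        }
    }
}
```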
The OO Points So Far • Objects are software Cyber Bricks • Object interconnect standards are emerging • Cyber Bricks become Federated Systems. • Next points: • put processing close to data • do parallel processing.
Three Tier Computing • Clients do presentation, gather input • Clients do some workflow (Xscript) • Clients send high-level requests to the ORB • The ORB dispatches workflows and business objects -- proxies for the client that orchestrate flows & queues • Server-side workflow scripts call on distributed business objects to execute the task • [Tiers: presentation | workflow + business objects | database]
The Three Tiers • Web client: HTML, VB, Java plug-ins, VBScript, JavaScript • Middleware: ORB, TP monitor, web server, ...; an object-server pool running a VB or Java script engine on a VB or Java virtual machine • Backend: legacy gateways, IBM object & data servers • Client talks to middleware over the Internet via HTTP+ and DCOM; middleware talks to backends via DCOM (OLE DB, ODBC, ...) and LU6.2
Transaction Processing Evolution to Three Tier: Intelligence migrated to clients • Mainframe batch processing (centralized): cards in, mainframe out • Dumb terminals & remote job entry: green-screen 3270s against a server • Intelligent terminals, database backends: TP monitor • Workflow systems, Object Request Brokers, application generators: active clients and an ORB
Web Evolution to Three Tier: Intelligence migrated to clients (like TP) • Character-mode clients, smart servers: green screen, archie, gopher, WAIS • GUI browsers - web file servers: Mosaic • GUI plug-ins - web dispatchers - CGI: Netscape & IE • Smart clients - web dispatcher (ORB) with pools of app servers (ISAPI, Viper), workflow scripts at client & server: the active web server
PC Evolution to Three Tier: Intelligence migrated to server • Stand-alone PC (centralized): local disk I/O • PC + file & print server: one message per I/O request/reply • PC + database server: one message per SQL statement • PC + app server: one message per transaction • ActiveX client, ORB, ActiveX server, Xscript
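A back-of-envelope sketch of why each step cuts network traffic; the workload numbers (5 records touched, 4 I/Os per record) are illustrative assumptions, not measurements:

```java
// Round trips needed to post one business transaction that touches 5 records.
public class RoundTrips {
    public static void main(String[] args) {
        int records = 5, iosPerRecord = 4;
        System.out.println("file server  (message per I/O):         " + records * iosPerRecord);
        System.out.println("database srv (message per SQL stmt):    " + records);
        System.out.println("app server   (message per transaction): " + 1);
    }
}
```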
Why Did Everyone Go To Three-Tier? • Manageability • Business rules must be with the data • Middleware operations tools • Performance (scalability) • Server resources are precious • The ORB dispatches requests to server pools • Technology & physics • Put UI processing near the user • Put shared-data processing near the shared data • Minimizes data moves • Encapsulation / modularity
Why Put Business Objects at the Server? • MOM's store (business objects): • Customer comes to the store with a list • Gives the list to the clerk • Clerk gets the goods, makes the invoice • Customer pays the clerk, gets the goods • Easy to manage; the clerk controls access; encapsulation • DAD's store (raw data): • Customer comes to the store • Takes what he wants • Fills out an invoice • Leaves money for the goods • Easy to build; no clerks
The OO Points So Far • Objects are software Cyber Bricks • Object interconnect standards are emerging • Cyber Bricks become Federated Systems. • Put processing close to data • Next point: • do parallel processing.
Parallelism: the OTHER half of Super-Servers • Clusters of machines allow two kinds of parallelism • Many little jobs: Online transaction processing • TPC A, B, C,… • A few big jobs: data search & analysis • TPC D, DSS, OLAP • Both give automatic Parallelism
Why Parallel Access To Data? • BANDWIDTH: at 10 MB/s it takes 1.2 days to scan a terabyte; with 1,000-way parallelism it is a 100-second scan • Parallelism: divide a big problem into many smaller ones to be solved in parallel.
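The arithmetic behind the claim, as a runnable check (1 TB at 10 MB/s serially, then the same scan divided across 1,000 disks):

```java
// Serial vs. parallel scan time for one terabyte.
public class ScanTime {
    public static void main(String[] args) {
        double terabyte = 1e12;            // bytes
        double rate = 10e6;                // 10 MB/s per disk
        double serial = terabyte / rate;   // 100,000 s ~ 1.2 days
        double parallel = serial / 1000;   // 100 s with 1,000-way parallelism
        System.out.printf("serial: %.1f days, parallel: %.0f s%n",
                serial / 86400, parallel);
    }
}
```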
Kinds of Parallel Execution • Pipeline: any sequential program feeds its output stream into the next sequential program • Partition: inputs are split N ways, many copies of the same sequential program run on the pieces, and the outputs merge M ways (see the sketch below)
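A minimal sketch of partition parallelism in plain Java threads; the summing job stands in for "any sequential program" (pipeline parallelism would instead chain stages through queues):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.LongStream;

// Split the input N ways, run the same sequential program on every
// partition, then merge the partial outputs.
public class PartitionDemo {
    public static void main(String[] args) throws Exception {
        int n = 4;                                     // degree of parallelism
        ExecutorService pool = Executors.newFixedThreadPool(n);
        List<Future<Long>> parts = new ArrayList<>();
        for (int p = 0; p < n; p++) {
            final long i = p;
            parts.add(pool.submit(() ->                // split N ways
                LongStream.rangeClosed(i * 250 + 1, (i + 1) * 250).sum()));
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get(); // merge the outputs
        pool.shutdown();
        System.out.println("sum 1..1000 = " + total);  // 500500
    }
}
```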
Why Are Relational Operators So Successful for Parallelism? • The relational data model gives uniform operators • on uniform data streams • closed under composition • Each operator consumes 1 or 2 input streams • Each stream is a uniform collection of data • Sequential data in and out: pure dataflow • Partitioning some operators (e.g., aggregates, non-equijoin, sort, ...) requires innovation • The result: AUTOMATIC PARALLELISM
Database Systems “Hide” Parallelism • Automate system management via tools • data placement • data organization (indexing) • periodic tasks (dump / recover / reorganize) • Automatic fault tolerance • duplex & failover • transactions • Automatic parallelism • among transactions (locking) • within a transaction (parallel execution)
SQL: a Non-Procedural Programming Language • SQL is a functional programming language: it describes the answer set • The optimizer picks the best execution plan • the dataflow web (pipelines, "rivers") • the degree of parallelism (partitioning) • other execution parameters (process placement, memory, ...) • [Figure: a GUI submits the query; the optimizer consults the schema and produces a plan; execution planning hands the plan to parallel executors, overseen by a monitor]
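A sketch of what "non-procedural" buys the application: it states WHAT it wants, and the engine owns the plan. The JDBC URL and the orders table are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// The query names the answer set; the optimizer may scan all partitions
// in parallel and merge the per-partition aggregates.
public class DeclarativeQuery {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection("jdbc:somedb://cluster/sales");
             Statement s = c.createStatement();
             ResultSet r = s.executeQuery(
                 "SELECT region, SUM(amount) FROM orders GROUP BY region")) {
            // No loops, no partitioning logic in the application.
            while (r.next())
                System.out.println(r.getString(1) + " " + r.getLong(2));
        }
    }
}
```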
Automatic Data Partitioning • Split a SQL table across a subset of the nodes & disks • Partition within the set by: • Range: good for equijoins, range queries, group-by • Hash: good for equijoins • Round robin: good to spread load • Shared-disk and shared-memory systems are less sensitive to partitioning; shared-nothing benefits from "good" partitioning (the sketch below shows the three placement functions)
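A sketch of the three placement functions, assuming each row is assigned to one of N nodes by a partitioning key; the class and its names are illustrative:

```java
// Three ways to map a row's partitioning key to one of N nodes.
public class Partitioners {
    // Range: splitPoints are ascending upper bounds; keeps key order,
    // so range queries and group-by hit few nodes.
    static int range(long key, long[] splitPoints) {
        int node = 0;
        while (node < splitPoints.length && key >= splitPoints[node]) node++;
        return node;
    }

    // Hash: equal keys land on the same node, so equijoins can be local.
    static int hash(Object key, int nodes) {
        return Math.floorMod(key.hashCode(), nodes);
    }

    // Round robin: ignores the key entirely and just spreads the load.
    static int next = 0;
    static int roundRobin(int nodes) {
        return next++ % nodes;
    }
}
```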
N x M way Parallelism N inputs, M outputs, no bottlenecks.
Parallel Objects? • How does all this DB parallelism connect to hardware/software CyberBricks? • To scale to large client sets • you need lots of independent parallel execution • That comes for free from the ORB • To scale to large data sets • you need intra-program parallelism (like parallel DBs) • That requires some invention.
Outline • Hardware CyberBricks • all nodes are very intelligent • Software CyberBricks • standard way to interconnect intelligent nodes • What next? • Processing migrates to where the power is • Disk, network, display controllers have full-blown OS • Send RPCs (SQL, Java, HTTP, DCOM, CORBA) to them • Computer is a federated distributed system. • Parallel execution is important
The Disk Farm On a Card • The 100 GB disk card: an array of disks in a 14" package • Can be used as: • 100 disks • 1 striped disk • 10 fault-tolerant disks • ... etc. • LOTS of accesses/second and bandwidth • Life is cheap, it's the accessories that cost ya: processors are cheap, it's the peripherals that cost ya (a 10 k$ disk card).
Parallelism: Performance is the Goal • The goal is 'good' performance: trade time for money • Law 1: a parallel system should be faster than the serial system • Law 2: a parallel system should give near-linear scaleup, near-linear speedup, or both • Parallel DBMSs obey these laws (see the sketch below)
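A sketch of the two metrics the laws refer to, with illustrative timings: speedup holds the problem fixed and grows the hardware; scaleup grows the problem and the hardware together:

```java
// Near-linear speedup: N nodes give a ratio close to N.
// Near-linear scaleup: an N-times-bigger job on N-times-more hardware
// gives a ratio close to 1.0.
public class ParallelLaws {
    static double speedup(double oneNodeTime, double nNodeTime) {
        return oneNodeTime / nNodeTime;
    }
    static double scaleup(double smallJobSmallSystemTime, double bigJobBigSystemTime) {
        return smallJobSmallSystemTime / bigJobBigSystemTime;
    }
    public static void main(String[] args) {
        System.out.println("speedup on 10 nodes: " + speedup(1000, 110)); // ~9.1
        System.out.println("scaleup:             " + scaleup(100, 105));  // ~0.95
    }
}
```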
Success Stories • Online transaction processing (many little jobs): • SQL systems support 50 k tpm-C (44 CPUs, 600 disks, 2 nodes) • Batch: decision support and utility (few big jobs, parallelism inside): • Scan data at 100 MB/s • Linear scaleup to 1,000 processors • [Charts: transactions/sec and records/sec both grow linearly with hardware]
The New Law of Computing • Grosch's Law: 2x $ is 4x performance • 1 MIPS for 1 $; 1,000 MIPS for 32 $ (.03 $/MIPS) • Parallel Law: 2x $ is 2x performance • 1 MIPS for 1 $; 1,000 MIPS for 1,000 $ • Needs linear speedup and linear scaleup • Not always possible
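The two laws restated as formulas; the slide's own numbers check out under Grosch's Law, since 32^2 = 1,024 ≈ 1,000:

```latex
% Grosch's Law: performance grows as the square of the price.
%   1 $ -> 1 MIPS;  32 $ -> 32^2 ~ 1,000 MIPS  (.03 $/MIPS)
% Parallel Law: performance grows linearly with the price.
%   1 $ -> 1 MIPS;  1,000 $ -> 1,000 MIPS      (1 $/MIPS)
\[
\text{Grosch: } \mathrm{perf} \propto \mathrm{price}^{2}
\qquad
\text{Parallel: } \mathrm{perf} \propto \mathrm{price}
\]
```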
Clusters being built • Teradata: 1,000 nodes (30 k$/slice) • Tandem, VMScluster: 150 nodes (100 k$/slice) • Intel: 9,000 nodes @ 55 M$ (6 k$/slice) • Teradata, Tandem, DEC moving to NT + a low slice price • IBM: 512-node ASCI @ 100 M$ (200 k$/slice) • PC clusters (bare-handed) at dozens of nodes: web servers (MSN, PointCast, ...), DB servers • THE KEY TECHNOLOGY HERE IS THE APPS: • Apps distribute data • Apps distribute execution
BOTH SMP and Cluster? • Grow UP with SMP: a 4xP6 is now standard • Grow OUT with clusters: a cluster has inexpensive parts • A cluster of PCs