Data Management in a Highly Connected World James Hamilton JamesRH@microsoft.com Microsoft SQL Server March 3, 2000
Agenda • Client Tier • Number of devices • Device interconnect fabric • Standard programming infrastructure • Client tier database issues • Resource requirements • Implementation language • Administrative cost implications • Development cost implications • Middle Tier • Server Tier • Summary
How Many Clients? • 1998 US WWW users (IDC) • US: 51M; World wide: 131M • 2001 estimates: • World Wide: 319M users • 515M connected devices • ½ billion connected Clients • Conservative estimate based upon conventional device counts
Other Device Types • TVs, VCRs, stoves, thermostats, microwaves, CD players, computers, garage door openers, lights, sprinklers, appliances, driveway de-icers, security systems, refrigerators, health monitoring, etc. • Sony evangelizing IEEE 1394 Interconnect • http://www.sel.sony.com/semi/iee1394wp.html • Microsoft & consortium evangelizing Universal Plug & Play • www.upnp.org • WAP: Wireless Application Protocol • http://www.wap.net/
Device Interconnect Infrastructure • Power line control • X10: http://www.x10.org • Sunbeam Thalia: • http://www.thaliaproducts.com/
Why Connect These Devices? • TV guide & auto VCR programming • CD label info & song list download • Sharing data & resources • Set clocks (flashing 12:00) • Fire and burglar alarms • Persist thermostat settings • Feedback & data sharing based systems: • Temperature control & power blind interaction • Occupancy directed heating and lighting
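The feedback-based systems above amount to a simple control loop over shared device state. Below is a minimal sketch, assuming hypothetical device readings; the RoomState fields and the control_step policy are illustrative only, not part of any actual home-automation standard.

```python
# Occupancy- and temperature-directed control of heating, lighting, and power
# blinds: one control tick over shared device state.
from dataclasses import dataclass

@dataclass
class RoomState:
    occupied: bool
    temp_c: float
    outside_light: float  # 0.0 (dark) .. 1.0 (full sun)

def control_step(state: RoomState, target_c: float = 21.0):
    """Return (heating_on, lights_on, blinds_open) for one control tick."""
    heating_on = state.occupied and state.temp_c < target_c
    # If the room is too warm and the sun is strong, close the blinds
    # instead of running the A/C.
    blinds_open = not (state.temp_c > target_c + 2 and state.outside_light > 0.7)
    lights_on = state.occupied and state.outside_light < 0.3
    return heating_on, lights_on, blinds_open

if __name__ == "__main__":
    print(control_step(RoomState(occupied=True, temp_c=18.5, outside_light=0.1)))
    # -> (True, True, True): occupied, cold, and dark
```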
Device Communication Implications • The need is there • Infrastructure is going in: • Wireless • Power line communications • Unused twisted pair (phone) bandwidth • Connectable devices & infrastructure arriving & being deployed • On order of billions of client devices
Device Interconnect Example • [Diagram: home network with a Windows NT Server and Ethernet hub in the den on a 56 kbps line; an Ethernet backbone and an X10 backbone reach the deck, living room, and bedroom, connecting a 660 gallon marine aquarium and its filtration plant, two 130 gallon F/W aquariums, and the home sprinklers]
Improvements For Example • Cooperation of lighting, A/C and power blind systems • Alarms and remote notification for failures in: • Circulation pump • Heating & cooling • Salinity & other water chemistry changes • Filtration system • Feedback directed systems
Palmtop Resource Trends • Palmtops I’ve purchased through the years • All about same cost & physical size • [Chart: palmtop RAM vs. Moore’s Law, 1990–2002: Sharp IQ7000 (0.125M), Sharp IQ8300 (0.25M), HP 95LX (0.5M), HP 100LX (1M), HP 200LX (2M), Everex A20 (4M), Casio E105 (32M)]
O/S Memory Requirements • Windows memory requirements over time • [Chart: desktop RAM vs. Moore’s Law, 1985–1999: Windows 1.0 (256K), Windows 2.0 (512K), Windows 3.0 (2M), WFW 3.1 (3M), Windows95 (4M), Windows98 (16M), Windows 2000 (64M)]
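The rough arithmetic behind the two charts above, and the "super-Moore" claim on the next slide, can be checked directly. The release years used here are approximations read off the plots, not exact dates.

```python
# Rough memory doubling-time arithmetic from the chart data points above.
import math

def doubling_months(mem_start, mem_end, years):
    doublings = math.log2(mem_end / mem_start)
    return 12 * years / doublings

# Windows desktop: 256 KB (Windows 1.0, ~1985) -> 64 MB (Windows 2000, ~1999)
print(round(doubling_months(0.25, 64, 14)))   # ~21 months per doubling

# Palmtop: 0.125 MB (Sharp IQ7000, ~1990) -> 32 MB (Casio E105, ~1999)
print(round(doubling_months(0.125, 32, 9)))   # ~14 months per doubling

# The classic Moore's-law rate is ~18 months per doubling, so palmtop memory
# at constant cost has been growing at a super-Moore rate.
```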
Smartcard Resource Trends • [Chart: smartcard memory size in bits, 1990–2004, growing from roughly 3K–10K bits today ("you are here") toward 300M bits] • Source: Denis Roberson, PIN/Card-Tech/NCR
Devices Smaller Than PDAs • Qualcomm PDQ • 2 MB total memory • Same mem curve as PDAs…just 2 to 5 years behind • Nokia 9000il • 8 MB total Memory
Resource Trend Implications • Device resources at constant cost are growing at super-Moore rates • Same curve, but 2 to 3 yrs behind desktop system growth • Same is true of each class of devices • Telephones trail PDAs but again grow at the same rate • Memory growth is not the problem • However, devices are always smaller than desktops • Devices are more specialized, so resource consumption is lower … they can still run a standard vertical app slice
Standard Infrastructure at Client • Clearly specialized user interface S/W needed • But we have the memory resources to support: • Standard communications stack (TCP/IP) • Standard O/S software • Standard data management S/W with query • Transparent replication • Symmetric multi-tiered infrastructure S/W: • Leverage best development environments • No need to rewrite millions of redundant lines of code • More heavily used & tested, so fewer bugs • Better productivity in programming to a richer platform • A full DBMS at client is both practical & useful
Client-Side Database Issues • “Honey I shrunk the database” (SIGMOD99): • DB Footprint • Implementation Language • Both issues either largely irrelevant or soon to be: • Resource availability trends support standard infrastructure S/W • Dominant costs: admin, operations & user training, and programming • Vertical slice of standard apps rather than full custom infrastructure
DB Implementation Language • Special DB implementation language (Java) argument: • centers on auto-installation of S/W infrastructure • Auto-install is absolutely vital, but independent of implementation language • Auto-install not enough: client should be a cache of recently used S/W and data • Full DBMS at client • Client-side cache of recently accessed data • Optimizer selected access path choice: • driven by accuracy & currency requirements • balanced against connectivity state & communications costs
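A minimal sketch of the access-path choice described above, under assumed names (CacheEntry, the cost parameters are hypothetical): serve from the client-side cache when the cached copy meets the currency requirement, otherwise weigh connectivity state and communication cost before going to the server.

```python
# Optimizer-style access path choice: client cache vs. server round trip.
import time

class CacheEntry:
    def __init__(self, rows, fetched_at):
        self.rows = rows
        self.fetched_at = fetched_at

def choose_access_path(entry, max_staleness_s, connected, comm_cost, cost_budget):
    """Return 'cache' or 'server' for a single query."""
    fresh = entry is not None and (time.time() - entry.fetched_at) <= max_staleness_s
    if fresh:
        return "cache"                       # accuracy/currency requirement met locally
    if connected and comm_cost <= cost_budget:
        return "server"                      # pay the communication cost for current data
    return "cache" if entry is not None else "server"  # degrade gracefully when offline

# Example: a 10-minute-old price list is fine for casual browsing.
entry = CacheEntry(rows=[...], fetched_at=time.time() - 600)
print(choose_access_path(entry, max_staleness_s=3600, connected=False,
                         comm_cost=5, cost_budget=1))   # -> 'cache'
```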
Admin Costs Still Dominate • 60’s large system mentality still prevails: • Optimizing precious machine resources is false economy • Admin & education costs more important • TCO education from the PC world repeated • Each app requires admin and user training…much cheaper to roll out 1 infrastructure across multiple form factors • Sony PlayStation has 3MB RAM & Flash • Nokia 9000IL phone has 8MB RAM • Trending towards 64M palmtop in 2001 • Vertical app slice resource requirement can be met
Dev Costs Over Memory Costs • Specialty RTOSs have weak dev environments • Quality & quantity of apps driven by: • Dev environment quality • Availability of trained programmers • Requirement for custom client development & configuration greatly reduces deployment speed • Same apps span a wide range of device form factors • Symmetric client/server execution environment • DB components and data treated uniformly • Both replicated to client as needed
Client Side Summary • On order of billions connected client devices • Most are non-conventional computing devices • All devices include standard DB components • Standard physical & logical device interconnect standards will emerge • DB implementation language irrelevant • Device DB resource consumption much less important than ease of: • Installation • Administration • Programming • Symmetric client/server execution environment
Agenda • Client Tier • Middle Tier • High Availability via redundant data & metadata • Fault Isolation domains • XML • Mid-tier Caching • Server Tier • Summary
Server Availability: Heisenbugs • Industry good at finding functional errors • Multi-user & application interactions hard: • Sequences of statistically unlikely events • Heisenbugs (http://research.microsoft.com/~gray/talks) • Testing for these is exponentially expensive • Server stack is nearing 100 MLOC • Long testing and beta cycles delay software release • System size & complexity growth inevitable, so tolerate faults rather than eliminate them: • Re-try operation (Microsoft Exchange) • Re-run operation against redundant data copy (Tandem) • Fail fast design approach is robust, but only acceptable with redundant access to redundant copies of data
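A minimal sketch of the three coping strategies listed above: retry the operation, re-run it against a redundant copy of the data, and otherwise fail fast so healthy replicas keep the system available. TransientError, the replica objects, and the operation callable are hypothetical placeholders.

```python
# Heisenbug handling: retry, fall over to redundant copies, then fail fast.
import sys

class TransientError(Exception):
    pass

def run_with_redundancy(operation, replicas, retries=3):
    for replica in replicas:                 # redundant copies of the data
        for _ in range(retries):             # retry masks statistically unlikely races
            try:
                return operation(replica)
            except TransientError:
                continue
    # Fail fast: don't limp along with possibly corrupt state; the redundant
    # replicas keep the data available while this node restarts.
    sys.exit("fail fast: operation failed on all replicas")
```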
The Inktomi Lesson • Inktomi web search engine (Brewer, SIGMOD’98) • Quickly evolving software: • Memory leaks, race conditions, etc. considered normal • Don’t attempt to test & beta until quality high • System availability of paramount importance • Individual node availability unimportant • Shared nothing cluster • Exploit ability to fail individual nodes: • Automatic reboots avoid memory leaks • Automatic restart of failed nodes • Fail fast: fail & restart when redundant checks fail • Replace failed hardware weekly (mostly disks) • Dark machine room • No panic midnight calls to admins • Mask failures rather than futilely attempting to avoid them
Apply to High Value TP Data? • Inktomi model: • Scales to 100’s of nodes • S/W evolves quickly • Low testing costs and no-beta requirement • Exploits ability to lose individual node without impacting system availability • Ability to temporarily lose some data w/o significantly impacting query quality • Can’t lose data availability in most TP systems • Redundant data allows node loss without losing data availability • Inktomi model with redundant data & metadata a potential solution
Redundant Data & Metadata • TP point access to data is a nearly solved problem • TP systems scale with user number, people on planet, or business size • All trending at sub-Moore rates • Data analysis systems growing far faster than Moore’s Law: • Greg’s law: 2x every 9 to 12 months (Patterson, SIGMOD’98) • Seriously super-Moore, implying that no single system can scale sufficiently: clusters are the only solution • Storage trending to free, with access speed the limiting factor • Detailed data distribution statistics need to be maintained • Improve access speed & availability using redundant data (indexes, materialized views, etc.) • Async update for stats, indexes, mat views • Data path choice based upon needed currency & accuracy
Affordable Availability • Web-enabled direct access model driving high availability requirements: • recent high profile failures at eTrade and Charles Schwab • Web model enabling competition in information access • Drives much faster server side software innovation, which negatively impacts quality • “Dark machine room” approach requires auto-admin and data redundancy • Inktomi model (Erik Brewer, SIGMOD’98) • 42% of system failures are admin error (Gray) • Paging admin at 2am to fix problem is dangerous
Connection Model/Architecture • Redundant data & metadata • Shared nothing • Single system image • Symmetric server nodes • Any client connects to any server • All nodes SAN-connected • [Diagram: clients connecting to symmetric server nodes in a server cloud]
Compilation & Execution Model • Query execution on many subthreads synchronized by a root thread • Server thread pipeline: lex analyze, parse, normalize, optimize, code generate, then query execute across the server cloud • [Diagram: client submits a query to a server thread, which compiles it and executes it against the server cloud]
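A minimal sketch of this model: a root server thread runs the compile pipeline and then fans the generated plan out to subthreads, one per plan fragment, before merging the results. The phase functions here are placeholders, not actual SQL Server internals.

```python
# Root thread compiles; subthreads execute plan fragments in parallel.
from concurrent.futures import ThreadPoolExecutor

def compile_query(sql):
    for phase in (lex_analyze, parse, normalize, optimize, code_generate):
        sql = phase(sql)
    return sql                                # the executable plan

def execute(plan, fragments):
    with ThreadPoolExecutor() as pool:        # subthreads across the server cloud
        results = list(pool.map(lambda f: run_fragment(plan, f), fragments))
    return merge(results)                     # root thread synchronizes & merges

# Placeholder phases so the sketch runs; each would be a real compiler stage.
lex_analyze = parse = normalize = optimize = code_generate = lambda x: x
run_fragment = lambda plan, f: [f]
merge = lambda rs: [row for r in rs for row in r]

print(execute(compile_query("SELECT 1"), fragments=[1, 2, 3]))  # -> [1, 2, 3]
```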
Node Loss/Rejoin • Lose a node with execution in progress: • Recompile • Re-execute • Rejoin: • Node local recovery • Rejoin cluster • Recover global data at rejoining node • Rejoin cluster • [Diagram: client and server cloud during node loss and rejoin]
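A small sketch of the node-loss path above: if a participating node drops out mid-execution, the root thread recompiles (so surviving copies of the data are chosen) and re-executes. NodeLostError and the compile/execute callables are hypothetical.

```python
# Recompile & re-execute when a node is lost during query execution.
class NodeLostError(Exception):
    pass

def run_query(sql, compile_query, execute, max_attempts=3):
    for _ in range(max_attempts):
        plan = compile_query(sql)             # access paths chosen from live nodes
        try:
            return execute(plan)
        except NodeLostError:
            continue                          # node dropped out: recompile & re-execute
    raise RuntimeError("query failed: repeated node loss")
```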
Redundant Data Update Model • Updates are standard parallel query plans • Optimizer manages redundant access paths • Query plan responsible for access plan management: • No significant new technology • Similar to materialized view & index updates today
Fault Isolation Domains • Trade single-node perf for redundant data checks: • Complex error recovery more likely to be wrong than original forward processing code • Many redundant checks are compiled out of “retail versions” when shipped • Fail fast rather than attempting to repair: • Bring down node for mem-based data structure faults • Don’t patch inconsistent data … copies keep system available • If anything goes wrong “fire” the node and continue: • Attempt node restart • Auto-reinstall O/S, DB and recreate DB partition • Mark node “dead” for later replacement
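A minimal sketch of this "fire the node" escalation, with hypothetical node-management calls standing in for whatever the cluster framework would provide.

```python
# Fail-fast escalation: restart, reinstall, then mark the node dead.
def handle_node_fault(node):
    if node.restart():                        # attempt node restart
        return "restarted"
    if node.reinstall_os_and_db() and node.recreate_partition():
        return "reinstalled"                  # auto-reinstall O/S, DB; recreate partition
    node.mark_dead()                          # replace hardware at leisure
    return "dead"

def redundant_check(expected, actual, node):
    # Fail fast on in-memory inconsistency rather than patching it; redundant
    # copies on other nodes keep the data available meanwhile.
    if expected != actual:
        handle_node_fault(node)
        raise SystemExit("fail fast: inconsistent in-memory structure")
```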
Data Structure Matters • Most internet content is unstructured text • restricted to simple Boolean search techniques • Docs have structure, just not explicit • Yahoo hand categorizes content • indexing limited & human involvement doesn’t scale well • XML is a good mix of simplicity, flexibility, & potential richness • Structure description language of the internet • DBMSs need to support it as a first-class datatype • Too few librarians in the world • so all information must be self-describing
Relational to XML • SELECT … FOR XML • FOR XML RAW (return an XML rowset) • FOR XML AUTO (exploit RI, name matching, etc.) • FOR XML EXPLICIT (maximal control) • Annotated Schema • Mapping between XML and relational schema expressed in XML • Templates • Encapsulated parameterized query • XSL/T support • XPATH support • Direct URL access (SQL “owned” virtual root) • SELECT … FOR XML • Annotated schema • Templates
XML to Relational • XML bulk load • Templates and Annotated Schema • SQL Server hosted XML tree • Directly insert document into SQL Server hosted XML tree • Select from server hosted XML tree rowset & insert into SQL tables • XML data type support • Hierarchical full text search
XML: Example http://SRV1/nwind?sql=SELECT+DISTINCT+ContactTitle+FROM+Customers+WHERE+ContactTitle+LIKE+'Sa%25'+ORDER+bY+ContactTitle+FOR+XML+AUTO Result set: <Customers ContactTitle="Sales Agent"/> <Customers ContactTitle="Sales Associate"/> <Customers ContactTitle="Sales Manager"/> <Customers ContactTitle="Sales Representative"/>
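The same direct-URL query, issued programmatically. This assumes the SRV1/nwind virtual root shown above is reachable; a real deployment would substitute its own server name and virtual root.

```python
# Issue the slide's FOR XML query over the direct URL access path.
from urllib.parse import quote_plus
from urllib.request import urlopen

sql = ("SELECT DISTINCT ContactTitle FROM Customers "
       "WHERE ContactTitle LIKE 'Sa%' ORDER BY ContactTitle FOR XML AUTO")
url = "http://SRV1/nwind?sql=" + quote_plus(sql)

with urlopen(url) as response:                # returns the <Customers .../> elements
    print(response.read().decode("utf-8"))
```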
Mid-Tier Cache Requirements • Non-proprietary multi-lingual programming • Symmetric mid-tier & server programming model • Non-connected, stateless programming model • High scale thread pool based • Efficient main memory DB support • Full query over local cache • Query over just cached data, or • Query over full corpus (server interaction reqd) • Ability to handle network partitions & server failure • Support for life-time attributed data: • Transactional (possibly multi-server) • Near real time • Every N time units • Read only
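A minimal sketch of lifetime-attributed caching at the mid-tier: each cached item carries a freshness policy, and lookups decide between the local copy and a server round trip. The class and policy names are illustrative, not a product API.

```python
# Mid-tier cache with per-item lifetime attributes.
import time

READ_ONLY, EVERY_N, NEAR_REAL_TIME, TRANSACTIONAL = range(4)

class CachedItem:
    def __init__(self, value, policy, period_s=None):
        self.value, self.policy, self.period_s = value, policy, period_s
        self.loaded_at = time.time()

    def usable(self, connected):
        if self.policy == READ_ONLY:
            return True                                  # never refetched
        if self.policy == EVERY_N:
            return time.time() - self.loaded_at < self.period_s
        if self.policy == NEAR_REAL_TIME:
            return not connected                         # prefer server when reachable
        return False                                     # TRANSACTIONAL: always go to server

def lookup(cache, key, connected, fetch_from_server):
    item = cache.get(key)
    if item is not None and item.usable(connected):
        return item.value                                # query over just cached data
    if not connected:
        raise RuntimeError("network partition and no usable cached copy")
    value = fetch_from_server(key)                       # server interaction required
    cache[key] = CachedItem(value, EVERY_N, period_s=60)
    return value
```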
Agenda • Client Tier • Middle Tier • Server Tier • Affordable computing by the slice • Everything online • Disks are actually getting slower • Processing moves to storage • Approximate answers quickly • Semi-structured storage support • Administrative issues • Summary
Server-Side Changes • Server databases more functionally rich than often required • Trend reversal: • Less at the server-tier with richer mid-tier • Focus at back-end shifts to: • Reliability, Availability, and Scalability • Reducing administrative costs • Server side trends: • Scalability over single-node performance • Everything online • Affordable availability in high scale systems
Compaq/Microsoft TPC-C Benchmark • Top 5 TPC-C results as of February 17, 2000 (tpmC, total system cost, $/tpmC): • ProLiant 8500 Cluster, Windows 2000, SQL Server 2000: 227,079 tpmC, $4,341,603, $19.12/tpmC • ProLiant 8500 Cluster, Windows 2000, SQL Server 2000: 152,207 tpmC, $2,880,431, $18.93/tpmC • IBM RS/6000 S80, AIX 4.3.3, Oracle v8.1.6: 135,815 tpmC, $7,156,910, $52.70/tpmC • Escala EPC2400, AIX 4.3.3, Oracle v8.1.6: 135,815 tpmC, $7,462,215, $54.94/tpmC • Enterprise 6500, Solaris 2.6, Oracle 8i v8.1.6: 135,461 tpmC, $13,153,324, $97.10/tpmC
Computing by the Slice Source: TPC report executive summary
Just Save Everything • Able to store all info produced on earth (Lesk): • Paper sources: less than 160 TB • Cinema: less than 166 TB • Images: 520,000 TB • Broadcasting: 80,000 TB • Sound: 60 TB • Telephony: 4,000,000 TB • These data total roughly 5,000 petabytes • Others estimate upwards of 12,000 petabytes • World wide 1998 storage production: 13,000 petabytes • No need to manage deletion of old data • Most data never accessed by a human • Access aggregations & analysis, not point fetch • More storage than data allows for greater redundancy: • indexes, materialized views, statistics, & other metadata
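The arithmetic behind "roughly 5,000 petabytes" from Lesk's per-source estimates above:

```python
# Sum Lesk's per-source estimates (all values in terabytes).
sources_tb = {
    "paper": 160, "cinema": 166, "images": 520_000,
    "broadcasting": 80_000, "sound": 60, "telephony": 4_000_000,
}
total_tb = sum(sources_tb.values())
print(total_tb, "TB =", round(total_tb / 1000), "PB")   # ~4,600,000 TB ~= 4,600 PB
# i.e. on the order of 5,000 petabytes, vs. ~13,000 PB of 1998 disk production.
```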
Disks are Becoming Black Holes • Seagate Cheetah 73 • Fast: 10k RPM, 5.6 ms access, 16 MB cache • But very large: 73.4 GB • Result? Black hole: 2.4 accesses/sec/GB • Large data caches required • Employ redundant access paths
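The "black hole" arithmetic above, computed from the drive's published capacity and access time:

```python
# Access density of a Seagate Cheetah 73 (73.4 GB, 5.6 ms average access).
capacity_gb = 73.4
access_time_s = 0.0056
accesses_per_sec = 1 / access_time_s                 # ~179 random accesses/sec
print(round(accesses_per_sec / capacity_gb, 1))      # ~2.4 accesses/sec/GB
# The bigger the disk, the smaller the fraction of it you can touch per second,
# hence large data caches and redundant access paths.
```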
Processing Moves Towards Storage • Trends: • I/O bus bandwidth is bottleneck • Switched serial nets support very high bandwidth • Processor/memory interface is bottleneck • Growing CPU/DRAM perf gap leading to most CPU cycles in stalls • Combine CPU, serial network, memory, & disk in single package • E.g. David Patterson ISTORE project
Processing Moves Towards Storage • Each disk forms part of a multi-thousand node cluster • Redundant data masks failure • RAID-like approach • Each cyberbrick is commodity H/W and S/W • O/S, database, and other server software • Each “slice” plugged in & personality set • (e.g., database or SAP app server) • No other configuration required • On failure of S/W or H/W, redundant nodes pick up workload • Replace failed components at leisure • Predictive failure models
Approximate Answers Quickly • DB systems focus on absolute correctness • As size grows, correct answer increasingly expensive • Text search systems depend upon quick approx answer • Approx answer with statistical confidence bound: • Steadily improve result until user satisfied • “Ripple Joins for Online Aggregation”(Hellerstein-SIGMOD99) • Allows rapid exploration of large search spaces: • Conventional full accuracy only when needed • Run query on incomplete mid-tier cache?
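A minimal sketch of the approximate-answer idea: estimate an aggregate from a growing random sample and report a confidence interval that tightens until the user is satisfied. This captures the online-aggregation idea in spirit only; it is not the ripple-join algorithm from the cited paper.

```python
# Online aggregation: running estimate with a shrinking confidence bound.
import random, statistics

def online_avg(rows, batch=1_000, z=1.96):
    """Yield (estimate, half_width_of_95pct_CI) as more rows are sampled."""
    random.shuffle(rows)                      # assume we can sample uniformly
    sample = []
    for i in range(0, len(rows), batch):
        sample.extend(rows[i:i + batch])
        mean = statistics.fmean(sample)
        stderr = statistics.stdev(sample) / len(sample) ** 0.5
        yield mean, z * stderr                # CI shrinks as the sample grows

if __name__ == "__main__":
    data = [random.gauss(100, 15) for _ in range(100_000)]
    for est, half in online_avg(data):
        print(f"avg ~ {est:.2f} +/- {half:.2f}")
        if half < 0.2:                        # stop once the answer is good enough
            break
```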