260 likes | 589 Views
Welcome to CIS 455 / 555 – Internet and Web Systems. Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems January 13, 2010. What this Course Is About. How do we build services like Google, Akamai, iTunes, Facebook, EBAY, …?
E N D
Welcome to CIS 455 / 555 –Internet and Web Systems Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems January 13, 2010
What this Course Is About • How do we build services like Google, Akamai, iTunes, Facebook, EBAY, …? • What are the principles behind them?(This is NOT a course on building Web sites!) • How do “cloud computing,” P2P, and Web services relate? • The main themes of the course: • Distributed systems concepts, with emphasis on data, scalability and interoperability (including “the cloud”) • Data representation fundamentals, with emphasis on XML • Information retrieval concepts, including ranking and indexing • It’s a course that involves building software using the principles learned, evaluating it, and programming in teams
How Does this Relate to Other CIS Courses? CIS 330/550 • Data representation and management • Relational querying with SQL; XML querying with XQuery • DBMS-backed web sites • 455/555 focuses on data with respect to interoperability CIS 350/573: software engineering and mashups CIS 505: focuses on distributed systems and algorithms • CIS 505 is less project-oriented than CIS 555 • CIS 555 covers Web services, cloud architectures in more detail
Some Things We’ll Look at • What are the principles behind building systems that work on the Internet? • How do these relate to many of today’s hot technologies? • Web servers, DHTML, Servlets, JSP, … • XML • Web services • Peer-to-peer • Application servers • Cloud computing environments • Content distribution networks • Web search • Mash-ups • The cloud • …
Staff • Instructor: Zack Ives, zives@cis • Office: 576 Levine North • Office hours Th 3:30-4:30 (and by arrangement) • TA: Katie Gibson, gibsonk@seas • Office hours TBA • Discussion group: • cis-455-555-spring10@googlegroups.com • http://groups.google.com/cis-455-555-spring10
Textbooks • Distributed Systems: Principles and Paradigms, 2nd ed, Tanenbaum and van Steen • We’ll read from the book ~50% of the time • Frequent supplementary handouts • Excerpts from several books • Many recent research papers • Your first one, which you should read by Wed:http://research.microsoft.com/en-us/um/people/blampson/33-Hints/Acrobat.pdf (linked off the CIS 555 “Schedule & Slides” page) • Send me mail if it’s difficult for you to find a way of printing the paper yourself
Prerequisites, Workload, etc. Necessary skills: • Ability to code in Java: there is a substantial implementation project • Good debugging skills – this will be the biggest time sink! • The ability to work as a team with classmates (towards the end) • A willingness to learn how to read API documentation • Some exposure to threads and concurrent programming • A willingness to “push the envelope” Workload: • Several programming/debugging-based homework assignments • A substantial term project with experimental evaluation and a report • Two midterms Payoff: • Lots of practical development and debugging experience • A good working knowledge of the fundamentals behind scalable systems • A working “academic clone of Google,” hosted on Amazon EC2! WARNING: this course should be considered 1.5 CU!
A Disclaimer… • This remains a “bleeding edge” course! • Goal 0: an understanding of scalable distributed data-centric systems • Goal 1: a look under the covers of today’s hottest topics – in lectures and in projects • Goal 2: a level of comfort in managing large, complex software development with others’ code • Part of this means doing a substantial implementation project • As in the real world: learning APIs, dealing with inadequate tools • Most of you will find this a struggle! You’ll spend many hours debugging! • We will be using some immature technology • Not everything has been tested and validated ahead of time • e.g., this will be the first year we are using Amazon Elastic Compute Cloud • We’ll do the best we can to smooth over the bugs • We hope it will be a fun course, though… … And an interesting one!
What Exactly Is the Web? • The Web consists of HTTP servers that publish HTML, XML, and a few other content types • These are hyperlinked via URLs (a subset of URIs) • Plus there are a huge number of web clients • The Web is built on a number of Internet protocols: • DNS, TCP, IP • Other Internet services use other protocols • SMTP, IMAP, POP, AIM, FTP, … • Streaming media, music swapping protocols, … • Web services, custom applications may actually also use HTTP in ways it wasn’t designed for
The Internet is Built in Layers Your Application … … Web Services, distrib transactions, … Middleware SSH, FTP,HTTP, IM, P2P, … Lightweight streaming, etc. Session TCP (session-based) UDP (sessionless) Transport IPv4, IPv6 Unicast, (multicast) IP WiFi, ZigBee, Ethernet, WiMax Link
What Is an Internet System? • Not just a web server or web application… • An application built over the Internet, whose functionality is distributed across more than one machine • Typically, at least in a client-server or server-to-server fashion, but may have many more participants • Typically, data and/or code must be exchanged in distributed fashion for the functioning of the application • Often, the data must be partitioned, replicated, translated, etc. (“shards” in Google-speak) • Often, the code is written in multiple different environments, languages, etc. • Often, there are concerns about handling failures, firewalls, attacks, …
Why Are Internet System Topics Interesting? • Understanding what’s underneath today’s Web • How does it work? • What are its shortcomings? • What are its strengths? • Understanding distributed algorithms • Using the right approach when designing new protocols and web systems • Being able to anticipate what’s actually possible in the future
Example: Web Search, a Cloud Service client client client HTML forms; results queries Web Pages Search Interface Servers Uses a model ofdocument/wordsimilarity to rankmatches pages Crawlers results query Index Servers keywords + locations
Example: Social Networking (Facebook / Twitter), a Cloud Service client client client pages & notifications clicks User PageServers updates, posts Users & entities suggestions Recommender common properties, usage logs, …
Example: Information Integration client client client results in “mediated schema” queries Maps all data into a single format and virtual schema Mediator System XQuery + XPath over XML ODBC results HTML SQL HTTP POST XML XML sources Relational sources HTML sources
Example: SETI@home Breaks computation intomany parts and distributes them tothe clients Problem Partitioning Data Aggregation New sub-problems Computedsubresults client client client
Example: P2P File Sharing Processes name-basedrequests for data; eachnode can make requests,forward requests,return data request client client data request data data request client client
What are the Hard Problems? • Disclaimer: most of the hard problems AREN’T solved (or solvable) – and there often isn’t any single BEST solution Much of systems design is about finding the right compromise for each specific problem • We can divide them into: • Scalability • Availability / reliability • Consistency • Interoperability • Location and resource discovery
Scalability • How do we support a large number of clients or requests? • Distribute work! • Challenges: • Coordination – takes significant overhead in the general case • Load balancing – avoid having bottlenecks • Parts of the solution: • Client-server, multi-tier, P2P architectures • Restricted programming models, e.g., MapReduce • Data partitioning, replication, remote procedure calls, …
Availability/Reliability • How do we ensure the system is “up” when we want it to be, and doing the “right” thing? • Replication and redundancy • Security measures against attacks • Ability to undo/redo • Challenges: • Keeping things consistent • Performance vs. security • Acknowledgments • Parts of the solution: • Data partitioning, replication, … • Logging, transactions, … • Redundant hardware, multiple sites, … • Quorum and consensus algorithms
Consistency / Consensus • Replication, distribution, and failures make it difficult to keep a unified, consistent view of the world – how do we combat this? • Locking, concurrency control, and invalidation schemes • Clock synchronization • Challenges: • Locking has huge performance overhead • Network partitions, disconnected operation • Parts of the solution: • Optimistic concurrency control, 2-phase locking • Distributed clock sync • Conflict resolvers
Interoperability • How do we coordinate the efforts of components that have different data formats and/or source languages, and are on different machines? • Standardization! • Challenges: • Everything has a different semantics! • Parts of the solution: • Standard data formats: XML, XML schemas • “Schema mediation” and data translation • Remote procedure calls: CORBA, XML-RPC, …
Location & Resource Discovery • How do you find what you’re looking for? • Naming • Declarative queries over standard schemas • Advertisements • Challenges: • Naming has implicit semantics • What do you do when you don’t know what to call something? • Parts of the solution: • Directory systems – DNS, LDAP, etc. • Resource discovery and advertising protocols • Overlay networks, sharding schemes • Standardized schemas
Our First Focus: Single Machines, aka Servers • How do you handle large numbers of concurrent users? • Processes • Threads • Events • Hybrids (e.g., thread pools) • Staged architectures
Next Time (Wed due to MLK Day)… • We’ll look under the covers of an HTTP server • Key ideas in building scalable systems • Principles of HTTP and web servers • Management of concurrent sessions • To read by next Wednesday: • Lampson and Saltzer paperhttp://research.microsoft.com/en-us/um/people/blampson/33-Hints/Acrobat.pdf • Tanenbaum Ch. 3.1 • If necessary: Review Tanenbaum “Modern OS,” Ch. 2.3 or a similar OS book on interprocess communication