Bringing Learnings from Googley Microservices with gRPC - Varun Talwar, Google

Varun Talwar, product manager on Google's gRPC project, discusses the fundamentals and specs of gRPC inside a Google-scale microservices architecture.

Presentation Transcript


  1. Bringing learnings from Googley microservices with gRPC Microservices Summit Varun Talwar Google confidential │ Do not distribute

  2. Contents 1. Context: Why are we here? 2. Learnings from Stubby experience a. HTTP/JSON doesn't cut it b. Establish a Lingua Franca c. Design for fault tolerance and control: Sync/Async, Deadlines, Cancellations, Flow control d. Flying blind without stats e. Diagnosing with tracing f. Load Balancing is critical 3. gRPC a. Cross platform matters! b. Performance and Standards matter: HTTP/2 c. Pluggability matters: Interceptors, Name Resolvers, Auth plugins d. Usability matters!

  3. CONTEXT WHY ARE WE HERE?

  4. Business Agility

  5. Developer Productivity

  6. Performance

  7. INTRODUCING STUBBY

  8. Microservices at Google ~O(10^10) RPCs per second. Images by Connie Zhou

  9. Stubby Magic @ Google

  10. Making Google magic available to all Borg Kubernetes Stubby

  11. LEARNINGS FROM STUBBY

  12. Key learnings 1. HTTP/JSON doesn't cut it! 2. Establish a lingua franca 3. Design for fault tolerance and provide control knobs 4. Don't fly blind: Service Analytics 5. Diagnosing problems: Tracing 6. Load Balancing is critical

  13. Learning 1: HTTP/JSON doesn't cut it! 1. WWW, browser growth - bled into services 2. Stateless 3. Text on the wire 4. Loose contracts 5. TCP connection per request 6. Nouns based 7. Harder API evolution 8. Think compute, network on cloud platforms

  14. Learning 2: Establish a lingua franca 1. Protocol Buffers - Since 2003. 2. Start with IDL 3. Have a language agnostic way of agreeing on data semantics 4. Code Gen in various languages 5. Forward and Backward compatibility 6. API Evolution

  15. How we roll at Google

  16. Service Definition (weather.proto) syntax = "proto3"; service Weather { rpc GetCurrent(WeatherRequest) returns (WeatherResponse); } message WeatherRequest { Coordinates coordinates = 1; message Coordinates { fixed64 latitude = 1; fixed64 longitude = 2; } } message WeatherResponse { Temperature temperature = 1; float humidity = 2; } message Temperature { float degrees = 1; Units units = 2; enum Units { FAHRENHEIT = 0; CELSIUS = 1; KELVIN = 2; } } Google Cloud Platform

  17. Learning 3: Design for fault tolerance and control ● Sync and Async APIs ● Need fault tolerance: Deadlines, Cancellations ● Control Knobs: Flow control, Service Config, Metadata

  18. gRPC Deadlines First-class feature in gRPC. Deadline is an absolute point in time. Deadline indicates to the server how long the client is willing to wait for an answer. RPC will fail with DEADLINE_EXCEEDED status code when deadline reached.

  19. Deadline Propagation [Diagram: the client calls withDeadlineAfter(200, MILLISECONDS) through a Gateway into a chain of services. Hop latencies shown: 60 ms, 40 ms, 20 ms, 20 ms, 90 ms. Every hop carries the same Deadline = 1476600000200. Local clock readings: Now = 1476600000000, 1476600000040, 1476600000150, 1476600000230. Once the deadline is reached, DEADLINE_EXCEEDED is returned along the chain.]
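
A minimal sketch of why the deadline is carried as an absolute timestamp: each hop compares the same value against its own clock, so time already spent upstream is accounted for automatically. The `Deadline` class and `status_at` helper are hypothetical names for illustration, not part of any gRPC library; the timestamps are the ones shown on the slide, and the sketch assumes synchronized clocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deadline:
    """An absolute point in time, in epoch milliseconds (hypothetical type)."""
    at_ms: int

    @staticmethod
    def after(now_ms: int, timeout_ms: int) -> "Deadline":
        # Convert a relative timeout into an absolute deadline once, at the caller.
        return Deadline(now_ms + timeout_ms)

    def remaining_ms(self, now_ms: int) -> int:
        return self.at_ms - now_ms

    def expired(self, now_ms: int) -> bool:
        return now_ms >= self.at_ms


def status_at(deadline: Deadline, now_ms: int) -> str:
    """Status a hop would report if it checked the deadline at now_ms."""
    return "DEADLINE_EXCEEDED" if deadline.expired(now_ms) else "OK"


START = 1476600000000
deadline = Deadline.after(START, 200)  # like withDeadlineAfter(200, MILLISECONDS)

# Clock readings taken from the slide; every hop sees the same deadline value.
clock_readings = [START, START + 40, START + 150, START + 230]
statuses = [status_at(deadline, now) for now in clock_readings]
```

Only the last reading falls past the deadline, so only that hop reports DEADLINE_EXCEEDED; the earlier hops can also use `remaining_ms` to decide whether starting more work is worthwhile.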

  20. Cancellation? Deadlines are expected. What about unpredictable cancellations? • User cancelled request. • Caller is not interested in the result any more. • etc.

  21. Cancellation? [Diagram: a gateway (GW) fans out to a tree of services; every link carries an Active RPC and every service is Busy.]

  22. Cancellation Propagation [Diagram: the same tree after the cancellation has propagated from the gateway (GW); every service is Idle.]

  23. Cancellation Automatically propagated. RPC fails with CANCELLED status code. Cancellation status can be accessed by the receiver. Server (receiver) always knows if RPC is valid!
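
The propagation described above can be sketched as a tree of call contexts in which cancelling a parent cancels every RPC started on its behalf. `RpcContext` is a hypothetical class for illustration, not the gRPC API; real implementations also notify in-flight work, which is omitted here.

```python
class RpcContext:
    """Hypothetical per-RPC context; children are RPCs made on its behalf."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.cancelled = False
        if parent is not None:
            parent.children.append(self)
            # An RPC started under an already-cancelled parent starts cancelled.
            self.cancelled = parent.cancelled

    def cancel(self):
        if self.cancelled:
            return
        self.cancelled = True
        for child in self.children:
            child.cancel()  # propagate downstream

    def status(self):
        return "CANCELLED" if self.cancelled else "ACTIVE"


# A gateway fans out to two services; one of them calls a third.
gw = RpcContext("GW")
a = RpcContext("A", parent=gw)
b = RpcContext("B", parent=a)
c = RpcContext("C", parent=gw)

a.cancel()                      # only A's subtree stops
after_partial = [x.status() for x in (gw, a, b, c)]

gw.cancel()                     # the user cancelled the whole request
after_full = [x.status() for x in (gw, a, b, c)]
```

Each receiver can check `status()` before doing expensive work, which is the sense in which the server always knows whether the RPC is still valid.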

  24. BiDi Streaming - Slow Client [Diagram: after a request, a fast server streams responses to a slow client. Status codes shown: CANCELLED, UNAVAILABLE, RESOURCE_EXHAUSTED.]

  25. BiDi Streaming - Slow Server [Diagram: a fast client streams requests to a slow server and waits for the response. Status codes shown: CANCELLED, UNAVAILABLE, RESOURCE_EXHAUSTED.]

  26. Flow-Control Flow-control helps to balance computing power and network capacity between client and server. gRPC supports both client- and server-side flow control. Photo taken by Andrey Borisenko.
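
One common way to get this balance is a credit window: the sender may only send while it holds credit, and the receiver returns credit as it consumes. The sketch below is an assumption-laden illustration with a hypothetical `FlowControlledStream` class; it counts whole messages, whereas HTTP/2 flow control counts bytes.

```python
from collections import deque


class FlowControlledStream:
    """Hypothetical one-directional stream with a credit-based window."""

    def __init__(self, window: int):
        self.window = window    # messages the sender may still send
        self.buffer = deque()   # sent but not yet consumed by the receiver

    def try_send(self, message) -> bool:
        # A fast sender is refused (in practice: blocked) when the window is empty.
        if self.window == 0:
            return False
        self.window -= 1
        self.buffer.append(message)
        return True

    def receive(self):
        # Consuming a message hands one unit of credit back to the sender.
        message = self.buffer.popleft()
        self.window += 1
        return message


stream = FlowControlledStream(window=2)
sent = [stream.try_send(i) for i in range(3)]  # third send exceeds the window
first = stream.receive()                       # slow receiver catches up
retry = stream.try_send(2)                     # credit is available again
```

The buffer never grows beyond the initial window, so a slow peer bounds the memory a fast peer can consume.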

  27. Service Config: Policies where the server tells clients what they should do. Can specify deadlines, LB policy, and payload size per method of a service. Loved by SREs: they have more control. Discovery via DNS.
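
A rough sketch of per-method policy lookup follows. The dictionary layout and field names (`timeout_ms`, `max_request_bytes`, `lb_policy`) are invented for illustration and are not the actual gRPC service config schema; the only idea taken from the slide is that the service owner publishes the policy and the client applies it per method.

```python
# Hypothetical config a service owner might publish for clients to discover.
SERVICE_CONFIG = {
    "lb_policy": "round_robin",
    "method_config": [
        # Service-wide default: no "method" key.
        {"name": [{"service": "Weather"}],
         "timeout_ms": 1000, "max_request_bytes": 65536},
        # Override for a single method.
        {"name": [{"service": "Weather", "method": "GetCurrent"}],
         "timeout_ms": 200, "max_request_bytes": 4096},
    ],
}


def method_policy(config, service, method):
    """Return the most specific policy entry for service/method, or None."""
    default = None
    for entry in config["method_config"]:
        for name in entry["name"]:
            if name["service"] != service:
                continue
            if name.get("method") == method:
                return entry          # exact match wins immediately
            if "method" not in name:
                default = entry       # remember the service-wide default
    return default


exact = method_policy(SERVICE_CONFIG, "Weather", "GetCurrent")
fallback = method_policy(SERVICE_CONFIG, "Weather", "GetForecast")
missing = method_policy(SERVICE_CONFIG, "Maps", "GetTile")
```

Because clients read the policy instead of hard-coding it, operators can tighten a deadline or change the balancing policy without redeploying callers.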

  28. Metadata helps in exchange of useful information Metadata Exchange - Common cross-cutting concerns like authentication or tracing rely on the exchange of data that is not part of the declared interface of a service. Deployments rely on their ability to evolve these features at a different rate to the individual APIs exposed by services.

  29. Learning 4: Don’t fly blind: Stats ● What is the mean latency time per RPC? ● How many RPCs per hour for a service? ● Errors in last minute/hour? ● How many bytes sent? ● How many connections to my server?

  30. Data collection by arbitrary metadata is useful. Any service’s resource usage and performance stats in real time by (almost) any arbitrary metadata: 1. Service X can monitor CPU usage in their jobs broken down by the name of the invoked RPC and the mdb user who sent it. 2. Social can monitor the RPC latency of shared bigtable jobs when responding to their requests, broken down by whether the request originated from a user on web/Android/iOS. 3. Gmail can collect usage on servers, broken down by POP/IMAP/web/Android/iOS. ● The layer propagates Gmail's metadata down to every service, even if the request was made by an intermediary job that Gmail doesn't own. ● The stats layer exports data to varz and streamz, and provides stats to many monitoring systems and dashboards.
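
As a small illustration of breaking stats down by arbitrary metadata, the sketch below records a latency sample together with whatever metadata accompanied the RPC and groups by any key afterwards. `StatsRecorder` and the metadata keys are hypothetical; this is not the Stubby or gRPC stats API.

```python
from collections import defaultdict
from statistics import mean


class StatsRecorder:
    """Hypothetical in-memory recorder of (metadata, latency) samples."""

    def __init__(self):
        self._samples = []

    def record(self, metadata, latency_ms):
        # Copy the metadata so later mutation by the caller cannot skew stats.
        self._samples.append((dict(metadata), latency_ms))

    def mean_latency_by(self, key):
        """Mean latency grouped by the value of one metadata key."""
        groups = defaultdict(list)
        for metadata, latency in self._samples:
            groups[metadata.get(key, "unknown")].append(latency)
        return {value: mean(latencies) for value, latencies in groups.items()}


stats = StatsRecorder()
stats.record({"method": "GetCurrent", "platform": "Android"}, 12)
stats.record({"method": "GetCurrent", "platform": "iOS"}, 20)
stats.record({"method": "GetCurrent", "platform": "Android"}, 18)

by_platform = stats.mean_latency_by("platform")
```

The grouping key is chosen at query time, so the same samples answer "by platform", "by method", or "by calling user" without changing the services that produced them.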

  31. Learning 5: Diagnosing problems: Tracing ● 1 in 10K requests takes very long. It's an ad query :-) I need to find out. ● Take a sample and store it in a database; help identify requests in the sample which took a similar amount of time. ● I didn't get a response from the service. What happened? Which link in the service dependency graph got stuck? Stitch a trace and figure it out. ● Where is it taking time for a trace? Hotspot analysis. ● What are all the dependencies for a service?
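
The questions above can be sketched over a set of sampled spans: stitch them into a tree by parent id, find the span with the most self time (the hotspot), and collect the transitive dependencies of a service. The `Span` type, the sample data, and the assumption that sibling calls do not overlap in time are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def children_of(spans):
    """Stitch the trace: map each span id to the spans it directly caused."""
    children = {s.span_id: [] for s in spans}
    for s in spans:
        if s.parent_id is not None:
            children[s.parent_id].append(s)
    return children


def hotspot(spans):
    """Span with the largest self time (own duration minus child calls)."""
    children = children_of(spans)

    def self_time(span):
        return span.duration_ms - sum(c.duration_ms for c in children[span.span_id])

    return max(spans, key=self_time)


def dependencies(spans, service):
    """All services reached, directly or indirectly, from the given service."""
    children = children_of(spans)
    found = set()
    stack = [s for s in spans if s.service == service]
    while stack:
        for child in children[stack.pop().span_id]:
            found.add(child.service)
            stack.append(child)
    return found


trace = [
    Span("1", None, "gateway", 0, 100),
    Span("2", "1", "weather", 5, 95),
    Span("3", "2", "geo", 10, 20),
    Span("4", "2", "storage", 20, 90),
]
slowest = hotspot(trace).service
gateway_deps = dependencies(trace, "gateway")
```

In this sample the gateway and weather spans spend almost all of their time waiting, so the hotspot analysis points at the storage call.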

  32. Learning 6: Load Balancing is important! Iteration 1: Stubby Balancer. Iteration 2: Client side load balancing. Iteration 3: Hybrid. Iteration 4: gRPC-lb.

  33. Next gen of load balancing ● Current client support intentionally dumb (simplicity). ○ Pick first available - Avoid connection establishment latency ○ Round-robin-over-list - Lists not sets → ability to represent weights ● For anything more advanced, move the burden to an external "LB Controller", a regular gRPC server, and rely on a client-side implementation of the so-called gRPC LB policy. [Diagram: 1) the gRPC client sends a control RPC to the LB Controller; 2) the controller returns an address-list; 3) the client round-robins over the addresses of the address-list to reach the backends.]
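
The two intentionally simple client policies can be sketched in a few lines. The function names and addresses are hypothetical; the point taken from the slide is that the address-list is a list rather than a set, so a backend can appear more than once to carry more weight.

```python
import itertools


def pick_first(addresses, is_available):
    """Return the first address that is available, or None if none is."""
    for address in addresses:
        if is_available(address):
            return address
    return None


def round_robin(addresses):
    """Cycle over the list in order; duplicates act as weights."""
    return itertools.cycle(list(addresses))


# Address-list as an LB controller might return it: 10.0.0.1 has weight 2.
address_list = ["10.0.0.1:443", "10.0.0.1:443", "10.0.0.2:443"]

picker = round_robin(address_list)
picks = [next(picker) for _ in range(6)]

# Pretend 10.0.0.1 is down: pick-first falls through to the next entry.
fallback = pick_first(address_list, lambda a: not a.startswith("10.0.0.1"))
```

Anything smarter, such as load-aware weights, is computed by the external controller and expressed simply by changing the list it hands back.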

  34. In summary, what did we learn ● Contracts should be strict ● Common language helps ● Common understanding for deadlines, cancellations, flow control ● Common stats/tracing framework is essential for monitoring, debugging ● Common framework allows uniform policy application for control and LB. A single point of integration for logging, monitoring, tracing, service discovery and load balancing makes lives much easier!

  35. INTRODUCING gRPC

  36. gRPC core, gRPC Java, gRPC Go. Open source on GitHub for C, C++, Java, Node.js, Python, Ruby, Go, C#, PHP, Objective-C

  37. Where is the project today? ● 1.0 with stable APIs ● Well documented with an active community ● Reliable with continuous running tests on GCE ○ Deployable in your environment ● Measured with an open performance dashboard ○ Deployable in your environment ● Well adopted inside and outside Google

  38. More lessons 1. Cross language & Cross platform matters ! 2. Performance and Standards matter: HTTP/2 3. Pluggability matters: Interceptors, Name Resolvers, Auth plugins 4. Usability matters !

  39. More lessons 1. Cross language & Cross platform matters ! 2. Performance and Standards matter: HTTP/2 3. Pluggability matters: Interceptors, Name Resolvers, Auth plugins 4. Usability matters !

  40. gRPC Principles & Requirements Coverage & Simplicity The stack should be available on every popular development platform and easy for someone to build for their platform of choice. It should be viable on CPU & memory limited devices. http://www.grpc.io/blog/principles Google Cloud Platform

  41. gRPC Speaks Your Language. Service definitions and client libraries: Java, Go, C/C++, C#, Node.js, PHP, Ruby, Python, Objective-C. Platforms supported: MacOS, Linux, Windows, Android, iOS.

  42. Interoperability [Diagram: a GoLang service, a Java service, a Python service, and a C++ service; each exposes a gRPC service and calls the others through gRPC stubs.]

  43. More lessons 1. Cross language & Cross platform matters ! 2. Performance and Standards matter: HTTP/2 3. Pluggability matters: Interceptors, Name Resolvers, Auth plugins 4. Usability matters !

  44. HTTP/2 in One Slide • Single TCP connection. • No Head-of-line blocking. • Binary framing layer. • Request -> Stream. • Header Compression. [Diagram: layer stack of Application (HTTP/2), Binary Framing, Session (TLS) [optional], Transport (TCP), Network (IP). An HTTP/1.x request (POST: /upload HTTP/1.1, Host: www.javaday.org.ua, Content-Type: application/json, Content-Length: 27, body {“msg”: “Welcome to 2016!”}) maps in HTTP/2 to a HEADERS frame followed by a DATA frame.]

  45. Binary Framing: HTTP/2 breaks down the HTTP protocol communication into an exchange of binary-encoded frames, which are then mapped to messages that belong to a stream, and all of which are multiplexed within a single TCP connection. [Diagram: one TCP connection carrying Stream 1, Stream 2 ... Stream N. Request on Stream 1: HEADERS frame (:method: GET, :path: /kyiv, :version: HTTP/2, :scheme: https). Response: HEADERS frame (:status: 200, :version: HTTP/2, :server: nginx/1.10.1, ...) followed by a DATA frame (<payload>).]
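
To make the framing concrete, here is a sketch of the 9-byte HTTP/2 frame header (24-bit payload length, 8-bit type, 8-bit flags, one reserved bit plus a 31-bit stream identifier). The helper names are hypothetical, and the HEADERS payload below is placeholder bytes rather than real HPACK-compressed headers.

```python
import struct

# Frame types and flags used in this sketch.
FRAME_DATA = 0x0
FRAME_HEADERS = 0x1
FLAG_END_STREAM = 0x1
FLAG_END_HEADERS = 0x4


def encode_frame(frame_type: int, flags: int, stream_id: int, payload: bytes) -> bytes:
    """Prefix payload with a 9-byte frame header."""
    length = len(payload)
    if length >= 2 ** 24:
        raise ValueError("payload too large for a 24-bit length field")
    header = struct.pack(">I", length)[1:]  # keep the low 3 bytes: 24-bit length
    # Mask the reserved high bit of the 32-bit stream identifier field.
    header += struct.pack(">BBI", frame_type, flags, stream_id & 0x7FFFFFFF)
    return header + payload


def decode_frame_header(data: bytes):
    """Return (length, type, flags, stream_id) from the first 9 bytes."""
    length = int.from_bytes(data[0:3], "big")
    frame_type, flags = data[3], data[4]
    stream_id = int.from_bytes(data[5:9], "big") & 0x7FFFFFFF
    return length, frame_type, flags, stream_id


# Frames from two streams interleaved on one connection.
wire = [
    encode_frame(FRAME_HEADERS, FLAG_END_HEADERS, 1, b"<headers stream 1>"),
    encode_frame(FRAME_HEADERS, FLAG_END_HEADERS, 3, b"<headers stream 3>"),
    encode_frame(FRAME_DATA, FLAG_END_STREAM, 1, b"hello"),
]
stream_ids = [decode_frame_header(frame)[3] for frame in wire]
```

Because every frame names its stream, the receiver can reassemble each message independently, which is what lets many requests share a single TCP connection.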

  46. HTTP/1.x vs HTTP/2 http://http2.golang.org/gophertiles http://www.http2demo.io/ Google Cloud Platform

  47. gRPC Service Definitions. Unary: the client sends a single request to the server and gets a single response back, just like a normal function call. Server streaming: the client sends a request to the server and gets a stream to read a sequence of messages back. The client reads from the returned stream until there are no more messages. Client streaming: the client sends a sequence of messages to the server using a provided stream. Once the client has finished writing the messages, it waits for the server to read them and return its response. BiDi streaming: both sides send a sequence of messages using a read-write stream. The two streams operate independently. The order of messages in each stream is preserved.
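
The four call shapes can be pictured as plain Python functions and generators, with no networking involved. All function names here are hypothetical, and the bidirectional example answers one message per request only for brevity; as noted above, the two streams are independent in a real BiDi call.

```python
from typing import Iterable, Iterator


def get_current(request: str) -> str:
    """Unary: one request in, one response out."""
    return f"weather for {request}"


def watch(request: str, updates: int) -> Iterator[str]:
    """Server streaming: one request in, a sequence of responses out."""
    for i in range(updates):
        yield f"{request} update {i}"


def upload(requests: Iterable[str]) -> str:
    """Client streaming: a sequence of requests in, one response out."""
    count = sum(1 for _ in requests)
    return f"received {count} readings"


def chat(requests: Iterable[str]) -> Iterator[str]:
    """BiDi streaming: sequences in both directions, order preserved."""
    for message in requests:
        yield message.upper()


unary = get_current("Kyiv")
server_stream = list(watch("Kyiv", 2))
client_stream = upload(iter(["21C", "22C", "20C"]))
bidi = list(chat(["hi", "bye"]))
```

Seen this way, the four kinds differ only in whether each side of the signature is a single value or an iterator.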

  48. BiDi Streaming Use-Cases Messaging applications. Games / multiplayer tournaments. Moving objects. Sport results. Stock market quotes. Smart home devices. You name it!

  49. Performance ● Open Performance Benchmark and Dashboard ● Benchmarks run in GCE VMs per Pull Request for regression testing. ● gRPC users can run these in their environments. ● Good performance across languages: ○ Java Throughput: 500 K RPCs/Sec and 1.3 M Streaming messages/Sec on 32 core VMs ○ Java Latency: ~320 us for unary ping-pong (netperf 120us) ○ C++ Throughput: ~1.3 M RPCs/Sec and 3 M Streaming Messages/Sec on 32 core VMs.

  50. More lessons 1. Cross language & Cross platform matters ! 2. Performance and Standards matter: HTTP/2 3. Pluggability matters: Interceptors, Auth 4. Usability matters !
