NDB: The new Python client library for the Google App Engine Datastore • Guido van Rossum, guido@google.com
Google App Engine in a nutshell • Run your web apps in Google’s cloud • Opinionated Platform-as-a-Service (PaaS) • Automatically scales your app • Python-only launch April 2008; Java in 2009 • NoSQL datastore • ORM is primary API • small subset of SQL (“GQL”) on top of ORM • Original Python ORM called “db”
Google App Engine in numbers • Attained 7.5 Billion daily hits • 1 Million active applications • 250,000 active developers (30-day actives) • Half of all internet IP addresses touch Google App Engine servers per week • 2 Trillion datastore operations per month
NDB in a nutshell • Fix design bugs in the old db API • Implement cool new API ideas • Asynchronous to the core • 100% compatible on-disk representation • Google App Engine Datastore only • Python 2.5 and 2.7 (single- and multi-threaded) • HRD and M/S datastore; US and EU datacenters
Development process • Notice widespread frustration with old db • Get management buy-in for a full rewrite • Sit in a corner coding for a year :-) • No, really: • Release open source version early and often • Beg users for feedback and contributions • Try to document, redesign what’s hard to explain • Rinse and repeat
What’s wrong with old db • Hard to modify • any time we try to change internals, some user code breaks that depends on those internals • Started out as a quick demo • “how to do Django-style models in App Engine” • made the official API only weeks before launch • Has too many layers • data is copied too many times between layers
Layer cake (old) db datastore.py protocol buffers
Layer cake (new) db ndb datastore.py datastore_{rpc,query}.py protocol buffers
Cool new API features • Async core • Auto-batching • Integrated caching • Pythonic query syntax • Give entities nestable structure • Make subclassing Property classes easy
Other nice things • Use repeated=True instead of ListProperty • Pre- and post-operation hooks • Key and Query types are truly immutable • All objects have useful repr()s • Unified terminology (id instead of key_name) • PickleProperty, JsonProperty • ProtoRPC support: MessageProperty
Model (schema) definitions
• Model class and Property classes
• similar to Django (or any Python ORM)
• uses a simple metaclass
• Example:
    class Employee(ndb.Model):
        name = ndb.StringProperty(required=True)
        rank = ndb.IntegerProperty(default=3)
        phone = ndb.StringProperty()
Basic CRUD • (Create, Read, Update, Delete) • emp = Employee(name='Guido') • key = emp.put() • emp = key.get() • emp.phone = '555-5555'; emp.put() • key.delete()
Queries • Query for all entities: • all_emps = Employee.query().fetch() • for emp in Employee.query(): … • Query for property values: • Employee.query(Employee.rank > 3) • Employee.query(Employee.phone == None) • Query for multiple conditions: • Employee.query(<cond1>, <cond2>, …)
Why repeat the class name?
• Limitations of Python as a DSL…
• Old db used string literals; error-prone:
    Employee.all().filter(' rank >', 3)  # extra space
• Protip: write queries as class methods:
    @classmethod
    def outranks(cls, rank):
        return cls.query(cls.rank > rank)
• Employee.outranks(3).fetch()
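To illustrate why the class name has to be repeated, here is a minimal pure-Python sketch of how an expression such as `Employee.rank > 3` can build a filter object through operator overloading. The `Property` and `FilterNode` classes below are toy stand-ins invented for this example; they are not NDB's implementation, and the "query" is just a list comprehension.

```python
import operator


class FilterNode:
    """A single comparison such as ``rank > 3`` (toy stand-in for a query filter)."""

    _OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

    def __init__(self, name, opsymbol, value):
        self.name = name
        self.opsymbol = opsymbol
        self.value = value

    def __repr__(self):
        return "FilterNode(%r, %r, %r)" % (self.name, self.opsymbol, self.value)

    def matches(self, entity):
        # Apply the stored comparison to the entity's attribute value.
        return self._OPS[self.opsymbol](getattr(entity, self.name), self.value)


class Property:
    """Toy property: comparing the class attribute builds a FilterNode."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return FilterNode(self.name, ">", value)

    def __lt__(self, value):
        return FilterNode(self.name, "<", value)

    def __eq__(self, value):
        return FilterNode(self.name, "==", value)

    __hash__ = object.__hash__  # defining __eq__ would otherwise disable hashing


class Employee:
    rank = Property("rank")

    def __init__(self, name, rank):
        self.name = name
        self.rank = rank  # the instance attribute shadows the class-level Property

    @classmethod
    def outranks(cls, rank):
        # cls.rank is the Property, so ">" returns a FilterNode, not a bool.
        return cls.rank > rank


staff = [Employee("a", 2), Employee("b", 5)]
node = Employee.outranks(3)
selected = [e.name for e in staff if node.matches(e)]
```

In this sketch a misspelled property name such as `Employee.rnak` fails immediately with `AttributeError`, whereas a typo inside a string literal would only surface when the query runs.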
Mapping a query over a callback
• # Pretend you don’t see the async bits
    @ndb.tasklet
    def callback(ent):
        if not ent.name:
            ent.name = ent.first_name + ent.last_name
            yield ent.put_async()
• Employee.query().map(callback)
• Concurrency controlled by query batch size
StructuredProperty
• Example: list of tagged phone numbers
• In old db:
    class Contact(db.Model):
        name = db.StringProperty()
        # following two are parallel arrays
        phones = db.StringListProperty()
        tags = db.StringListProperty()

    def add_phone(contact, number, tag):
        contact.phones.append(number)
        contact.tags.append(tag)
StructuredProperty (2)
    class Phone(ndb.Model):
        number = ndb.StringProperty()
        tag = ndb.StringProperty()

    class Contact(ndb.Model):
        name = ndb.StringProperty()
        phones = ndb.StructuredProperty(Phone, repeated=True)

    def add_phone(contact, number, tag):
        contact.phones.append(Phone(number=number, tag=tag))
• Contact.query(Contact.phones.number == '555-1212')
Transactions • Nothing really new or exciting • Well integrated with contexts and caching • Decorator @ndb.transactional • To specify options: • @ndb.transactional(retries=N, xg=True) • Join current transaction if one is in progress: • @ndb.transactional(propagation=ALLOWED)
Caching • CRUD automatically caches in two places: • in memory (per-context; write-through) • in memcache (shared) • one memcache server for all instances of your app • write locks and clears, but doesn’t update memcache • memcache algorithm ensures consistency • even when using transactions • except maybe under extreme failure conditions
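The two cache levels can be sketched with plain dictionaries. This is a toy model under stated assumptions: `ToyContext` and its attributes are invented for illustration, the locking step mentioned on the slide is left out, and none of this is NDB's actual caching code. It only shows the shape of the policy: the in-context cache is write-through, while a write clears the shared memcache entry instead of updating it.

```python
class ToyContext:
    """Toy per-request context with an in-memory cache in front of memcache."""

    def __init__(self, memcache, datastore):
        self.cache = {}             # per-context in-memory cache
        self.memcache = memcache    # shared between contexts (a dict here)
        self.datastore = datastore  # the source of truth (a dict here)
        self.datastore_reads = 0

    def put(self, key, value):
        self.memcache.pop(key, None)  # clear the shared entry, do not update it
        self.datastore[key] = value
        self.cache[key] = value       # write-through for this context only

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        if key in self.memcache:
            value = self.memcache[key]
        else:
            self.datastore_reads += 1
            value = self.datastore.get(key)
            self.memcache[key] = value  # populate memcache on read
        self.cache[key] = value
        return value


memcache, datastore = {}, {}

writer = ToyContext(memcache, datastore)
writer.put("emp1", "Guido")
cleared_after_write = "emp1" not in memcache

first_reader = ToyContext(memcache, datastore)
first_value = first_reader.get("emp1")    # misses both caches, reads the datastore

second_reader = ToyContext(memcache, datastore)
second_value = second_reader.get("emp1")  # served from the shared memcache
```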
Caching (2) • User can override caching policies • per call, per model class, per context • write your own policy function • can even turn off datastore writes completely! • Query results are not cached • consistency is too hard to guarantee • however, this works for high cache hit rates: ndb.get_multi(q.fetch(keys_only=True))
The async API (a fairly deep dive)
Async basics • Based on PEP 342: generators as coroutines • Has its own event loop and Future class • Constrained by App Engine async API • based on RPCs (“Futures” for server-side work) • only RPCs can be asynchronous (no select/poll) • can wait for multiple RPCs • in original (Python 2.5) runtime, no threads • greenlets/gevent/etc. useless in this environment
Synchronous example code
    def get_or_insert(id):
        ent = Employee.get_by_id(id)
        if ent is None:
            ent = Employee(…, id=id)
            ent.put()
        return ent
Converted to async style
    @ndb.tasklet
    def get_or_insert_async(id):
        ent = yield Employee.get_by_id_async(id)
        if ent is None:
            ent = Employee(…, id=id)
            yield ent.put_async()
        raise ndb.Return(ent)
“Look ma, no callbacks”
Writing async code • The decorated function (tasklet) is async itself • Really, async operations just return Futures • can separate call from yield: f = foo_async(); …; a = yield f • yield takes any Future, or a list of Futures • yield <list> returns a list of results: f = f_async(); …; g = g_async(); …; a, b = yield f, g • yielding multiple futures is key to running multiple tasklets concurrently
Futures • NDB Futures are explicit Futures • must use an explicit API to wait for the result • Three ways to wait: • call f.get_result() # in synchronous context • yield f # in a tasklet • f.add_callback(callback_function) # internal • Any number of waiters are supported • An exception is also a result (i.e. is re-raised)
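The behaviour listed above (explicit waiting, any number of callbacks, an exception as a result) can be sketched in a few lines. This `Future` class is a simplified stand-in written for this example, not NDB's class; in particular, its `get_result()` raises if the result is not ready instead of running an event loop until it is.

```python
class Future:
    """Toy explicit Future: holds a result or an exception, plus callbacks."""

    def __init__(self):
        self._done = False
        self._result = None
        self._exception = None
        self._callbacks = []

    def done(self):
        return self._done

    def add_callback(self, callback):
        # Any number of waiters may register; late ones are called at once.
        if self._done:
            callback()
        else:
            self._callbacks.append(callback)

    def set_result(self, result):
        self._result = result
        self._finish()

    def set_exception(self, exc):
        self._exception = exc
        self._finish()

    def _finish(self):
        self._done = True
        callbacks, self._callbacks = self._callbacks, []
        for callback in callbacks:
            callback()

    def get_result(self):
        if not self._done:
            # Assumption: a real Future would run the event loop here instead.
            raise RuntimeError("result not ready")
        if self._exception is not None:
            raise self._exception  # an exception is also a result
        return self._result


seen = []
f = Future()
f.add_callback(lambda: seen.append("first"))
f.add_callback(lambda: seen.append("second"))
f.set_result(42)
value = f.get_result()

failed = Future()
failed.set_exception(ValueError("boom"))
try:
    failed.get_result()
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```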
Event loop • Doesn’t know about Futures • Knows about App Engine RPCs though… • And knows about callback functions • When you’re calling an async API or tasklet • a helper to run the tasklet is queued • you’re given a Future right away • the helper will eventually set the Future’s result • use the Future to wait for the result
The magic yield
• How does yielding a Future wait for its result?
1. “Trampoline” code calls g.next() or g.send() on the underlying generator object
2. If this returns a Future, the trampoline adds a callback to the Future to restart the generator
3. It’s up to whatever created the Future to make sure that its result is eventually set
4. Go to #1, passing the result into g.send()
Edge cases • If g.next() or g.send() raises StopIteration, we’re done (ndb.Return is a subclass thereof) • If it raises another exception, we’re also done, and we pass the exception on • If it returns an RPC instead of a Future, use the event loop’s native understanding of RPCs • If it returns a non-Future, that’s an error
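The trampoline steps and edge cases above can be sketched as follows. This is a simplified model written for Python 3, not NDB's code: the `Future`, `queue`, `tasklet`, `step` and `fetch_async` names are all invented here, the event loop is a plain FIFO of callables, RPCs are left out, and a tasklet returns its value with `return` (which Python 3 delivers as `StopIteration.value`) rather than `raise ndb.Return(x)`.

```python
import collections


class Future:
    def __init__(self):
        self.done = False
        self.result = None
        self.exception = None
        self.callbacks = []

    def add_callback(self, callback):
        if self.done:
            callback()
        else:
            self.callbacks.append(callback)

    def _finish(self):
        self.done = True
        for callback in self.callbacks:
            callback()

    def set_result(self, result):
        self.result = result
        self._finish()

    def set_exception(self, exc):
        self.exception = exc
        self._finish()


queue = collections.deque()  # toy event loop: a FIFO of zero-argument callables


def run_loop():
    while queue:
        queue.popleft()()


def tasklet(func):
    """Calling the wrapped generator function returns a Future right away."""
    def wrapper(*args, **kwargs):
        outer = Future()
        gen = func(*args, **kwargs)
        queue.append(lambda: step(gen, outer, None, None))  # queue the helper
        return outer
    return wrapper


def step(gen, outer, value, exc):
    """One trampoline bounce: resume the generator, then wait on what it yields."""
    try:
        yielded = gen.throw(exc) if exc is not None else gen.send(value)
    except StopIteration as stop:
        outer.set_result(stop.value)  # generator finished: we are done
        return
    except Exception as err:
        outer.set_exception(err)      # any other exception is passed on
        return
    if not isinstance(yielded, Future):
        gen.close()
        outer.set_exception(TypeError("tasklet yielded a non-Future"))
        return
    # Restart the generator once the yielded Future has its result.
    yielded.add_callback(lambda: queue.append(
        lambda: step(gen, outer, yielded.result, yielded.exception)))


def fetch_async(value):
    """Hypothetical async operation; the event loop sets its result later."""
    f = Future()
    queue.append(lambda: f.set_result(value))
    return f


@tasklet
def add_async(a, b):
    x = yield fetch_async(a)
    y = yield fetch_async(b)
    return x + y


@tasklet
def failing_async():
    yield fetch_async(1)
    raise ValueError("boom")


total = add_async(1, 2)
failed = failing_async()
run_loop()
```

Both tasklets are interleaved by the same loop; the failing one ends with its exception stored on its Future, which is what lets a waiting caller see it re-raised.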
You don’t have to understand this • Just remember these rules: • use @ndb.tasklet on a generator function • yield *_async() operations • raise ndb.Return(x) instead of return x • Use yield <list> to increase concurrency • Don’t call synchronous APIs! • Helpful convention: name tasklets *_async • Exception passing is remarkably natural
Auto-batching • Automatically combine operations in one RPC • Only like operations can be combined • Must use async API to benefit • Example: • e1.put(); e2.put() # Two RPCs • yield e1.put_async(), e2.put_async() # One RPC! • Implemented for datastore get, put, delete; and memcache operations (via Context)
Auto-batching (2) • Biggest benefit is between multiple tasklets • Each tasklet does some single ops • example: get_or_insert() • Tasklets are run concurrently • Each tasklet in turn runs until first blocking op • Those ops are buffered, not sent out yet • When no tasklets left to run, buffered ops are combined into one batch RPC
Auto-batching (3) • Each original single op has its own Future • When the RPC completes, its result is distributed back over those Futures • And… the tasklets are back in the race! • But… why not just manually batch operations? • restructuring your code to do that is often hard!
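The buffering and result distribution described in the auto-batching slides can be sketched like this. `AutoBatcher`, its `add()`/`flush()` methods and the `get_multi` helper are hypothetical names for this example only; the "RPC" is an ordinary function call against a dict, and the flush is triggered by hand rather than by an event loop running out of runnable tasklets.

```python
class Future:
    def __init__(self):
        self.done = False
        self.result = None

    def set_result(self, result):
        self.result = result
        self.done = True


class AutoBatcher:
    """Buffers single operations and sends them as one batch on flush()."""

    def __init__(self, batch_func):
        self.batch_func = batch_func  # takes a list of args, returns a list of results
        self.pending = []             # (arg, Future) pairs not sent out yet

    def add(self, arg):
        # Each original single op gets its own Future.
        future = Future()
        self.pending.append((arg, future))
        return future

    def flush(self):
        if not self.pending:
            return
        pending, self.pending = self.pending, []
        results = self.batch_func([arg for arg, _ in pending])
        # Distribute the batch result back over the individual Futures.
        for (_, future), result in zip(pending, results):
            future.set_result(result)


datastore = {"a": 1, "b": 2}
rpc_log = []


def get_multi(keys):
    """Stand-in for one batch RPC; records each call so batches can be counted."""
    rpc_log.append(list(keys))
    return [datastore.get(key) for key in keys]


batcher = AutoBatcher(get_multi)
f1 = batcher.add("a")
f2 = batcher.add("b")
f3 = batcher.add("missing")
batcher.flush()
```

Three separate `add()` calls end up as a single entry in `rpc_log`, which is the saving the slides describe: the callers keep their one-op-at-a-time structure while the transport sees one batch.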
Conclusion: caveats • Async coding has lots of newbie traps • Careful when overlapping I/O and CPU work • auto-batch queues only flushed when blocking • Mixing async and synchronous ops can be bad • in extreme cases can cause stack overflow • Debugging async code is a challenge • too much state in suspended generators’ locals • can’t easily step over a yield in pdb