Design of a Masked Array facility for Python
Paul Dubois
Program for Climate Model Diagnosis and Intercomparison
Lawrence Livermore National Laboratory
This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.
A “masked array” is an array that may have invalid entries.
• Invalid entries may be:
  • Missing observations
  • Results of invalid computations
  • Observations known to be defective
  • Entries excluded for purpose of analysis
Numerical Python is what we want, but it doesn’t support invalid entries.
• If we didn’t care about speed or space we could use Python lists.
• For ease of learning, we would like a masked array to be as much like a Numerical Python array as possible.
• We need to be able to convert to and from Numerical Python arrays to get to algorithms, graphics, etc.
Module MA is written entirely in Python and emulates Numeric.
• Interoperates with Numeric arrays and scalars.
• Constructors accept anything Numeric does.
• Supports the same range of type codes.
• Supports almost all the same functions and methods.
The critical design choice: how to represent a “missing” value.
• “Traditional” approach in the sciences: a marker value, such as 1.e20, means “missing”.
• You have to avoid using such values during operations and put them in appropriate places in the result.
• Equality tests on the marker values will not work in mixed-precision environments.
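The mixed-precision problem can be shown without any array package. The sketch below is an illustration only: it uses the standard struct module to round a double-precision marker to single precision, and the names MISSING and to_single are made up for this example.

```python
import struct

MISSING = 1.0e20  # marker value meaning "missing", held as a double


def to_single(value):
    """Round a Python float (a double) to single precision and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Suppose the data passed through single-precision storage at some point.
stored = to_single(MISSING)

# 1e20 is not exactly representable in single precision, so the
# equality test against the double-precision marker no longer holds.
marker_survives = (stored == MISSING)
print(marker_survives)  # False
```

A separate mask avoids the comparison entirely, which is the motivation for the design that follows.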
MA represents the array as a data array, a mask array, and a fill value.
• The data is held in a Numeric array.
• The mask is either
  • a Numeric integer byte array of 0’s and 1’s with the same shape as the data, where a value of 1 denotes an invalid location; or
  • None, meaning no invalid values.
• The fill value is a default value used to replace the masked values when converting to a Numeric array. This value is type-dependent.
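As a rough picture of this three-part representation, here is a minimal sketch in plain Python. It is not the MA implementation: lists stand in for Numeric arrays, and the class name TinyMasked and its n_valid method are hypothetical names chosen for the example.

```python
class TinyMasked:
    """Hypothetical stand-in for a masked array: data, mask, fill value."""

    def __init__(self, data, mask=None, fill_value=0):
        self.data = list(data)
        if mask is not None:
            # Normalize to 0/1 flags; 1 marks an invalid location.
            mask = [1 if m else 0 for m in mask]
            if len(mask) != len(self.data):
                raise ValueError("mask must have the same shape as the data")
        self.mask = mask              # None means "no invalid values"
        self.fill_value = fill_value  # used when converting to plain data

    def n_valid(self):
        """Number of entries that are not masked."""
        if self.mask is None:
            return len(self.data)
        return len(self.data) - sum(self.mask)


temps = TinyMasked([21.5, 0.0, 19.0], mask=[0, 1, 0], fill_value=-999.0)
complete = TinyMasked([1, 2, 3])  # no mask at all
print(temps.n_valid(), complete.n_valid())  # 2 3
```

Keeping None as a distinct "no mask" state lets the common all-valid case skip mask handling altogether.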
Masked values are supported for most of Numeric’s operations.
• Basic operations +, -, *, /, **
• Standard functions such as sqrt, sin, log.
• Array functions such as transpose, arange, choose, take, where
• Some additional mask-related functions such as masked_value, masked_equal, putmask, mask_or.
The function / method filled always returns a Numeric array.
• filled (x, fill_value=None)
  • Masked values are replaced by fill_value.
  • Accepts x as an MA array, a Numeric array, or anything Numeric can make into one.
• x.filled (fill_value = None)
  • No copying if x does not have a mask.
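The behavior described for filled can be sketched in a few lines. This is an assumption-laden illustration, not MA's code: data and mask are plain lists rather than Numeric arrays, and the three-argument signature is invented for the example.

```python
def filled(data, mask=None, fill_value=0):
    """Return plain data with masked entries replaced by fill_value.

    Illustrative signature: data and mask are lists; mask may be None.
    """
    if mask is None:
        # Nothing is invalid, so hand back the same object without copying.
        return data
    return [fill_value if m else d for d, m in zip(data, mask)]


x = [1, 2, 3]
print(filled(x, [0, 1, 0], -9))  # [1, -9, 3]
print(filled(x) is x)            # True
```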
Illustration: a + b
The operation x = a + b is roughly:
    ma = getmask(a)                  # might be None
    mb = getmask(b)
    m = mask_or (ma, mb)             # "smart" or
    d = filled(a, 0) + filled(b, 0)  # d is Numeric
    x = masked_array (d, m)
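The same recipe can be run end to end on plain lists. In the sketch below the helper names mirror the slide (mask_or, filled), but their bodies and the masked_add wrapper are assumptions written for this example; lists stand in for Numeric arrays and each operand is passed as separate data and mask.

```python
def mask_or(ma, mb):
    """'Smart' or: skip the work when either mask is None."""
    if ma is None:
        return mb
    if mb is None:
        return ma
    return [1 if (x or y) else 0 for x, y in zip(ma, mb)]


def filled(data, mask, fill_value):
    """Replace masked entries by fill_value; no copy when mask is None."""
    if mask is None:
        return data
    return [fill_value if m else d for d, m in zip(data, mask)]


def masked_add(a, ma, b, mb):
    """Add two masked sequences; returns (data, mask)."""
    m = mask_or(ma, mb)
    # Fill with 0 so the elementwise addition cannot fail on bad entries.
    d = [x + y for x, y in zip(filled(a, ma, 0), filled(b, mb, 0))]
    return d, m


d, m = masked_add([1, 2, 3], [0, 1, 0], [10, 20, 30], None)
print(d, m)  # [11, 20, 33] [0, 1, 0]
```

The value at a masked position (20 here) is whatever the filled operands produce; it is never meant to be read, because the result mask marks it invalid.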
What about scalars?
• Reductions and indexing may lead to scalar results.
• Numeric always converts these to Python ints, floats, etc.
• So what is x[2] if that element is masked?
  • Returning the fill value is not a good idea.
  • We want to return something, but have it be useless in further computation.
A singleton class represents masked values.
• The one instance is an attribute of the MA module named masked.
• Prints as “--” (this can be changed so that arrays print the fill value instead).
• Most operations involving masked raise exceptions.
The value “masked” is used in ways that make sense.
    >>> x=arange(5)
    >>> x[2]=masked
    >>> print x
    [0 ,1 ,-- ,3 ,4 ,]
    >>> if x[2] is masked: print 'missing'
    missing
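One way to get a scalar that can be returned but not computed with is a singleton whose arithmetic hooks refuse to work. The sketch below is a guess at the general shape, not MA's source: the class name MaskedScalar, the helper get_item, and the choice of TypeError as the exception are all assumptions.

```python
class MaskedScalar:
    """Hypothetical singleton standing in for a masked element."""

    _instance = None

    def __new__(cls):
        # Always hand back the one shared instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "--"

    def _refuse(self, *args):
        raise TypeError("cannot compute with a masked value")

    # Arithmetic and numeric conversion all raise.
    __add__ = __radd__ = __mul__ = __rmul__ = __float__ = _refuse


masked = MaskedScalar()


def get_item(data, mask, i):
    """Index a (data, mask) pair, returning masked for invalid entries."""
    if mask is not None and mask[i]:
        return masked
    return data[i]


value = get_item([0, 1, 2, 3, 4], [0, 0, 1, 0, 0], 2)
if value is masked:
    print("missing")
```

Because there is exactly one instance, the identity test `is masked` is a reliable way to detect a missing element.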
Many unary / binary functions were similar but not identical.
• Note that for a + b we filled a and b with 0. That might not be the right choice for a different operation. We have to be sure that filled(a, v1) op filled(b, v2) always succeeds.
• The choice affects the result in reductions: multiply.reduce(x) needs x filled with 1’s.
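The effect of the fill value on reductions is easy to check by hand. The following sketch uses functools.reduce on plain lists as a stand-in for Numeric's add.reduce and multiply.reduce; the function masked_reduce is a hypothetical helper written for this example.

```python
from functools import reduce
import operator


def masked_reduce(op, identity, data, mask):
    """Reduce data with op, filling masked entries with op's identity."""
    if mask is not None:
        data = [identity if m else d for d, m in zip(data, mask)]
    return reduce(op, data, identity)


data, mask = [2, 3, 4], [0, 1, 0]

total = masked_reduce(operator.add, 0, data, mask)    # 2 + 0 + 4
product = masked_reduce(operator.mul, 1, data, mask)  # 2 * 1 * 4

# Filling with 0 before multiplying wipes out the valid entries.
wrong_product = reduce(operator.mul, [2, 0, 4])

print(total, product, wrong_product)  # 6 8 0
```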
Two class objects were created to abstract this idea.
• Class masked_unary_function
• Class masked_binary_function
These classes have a __call__ method.
    sqrt = masked_unary_function (Numeric.sqrt, 0)
Then we can use it:
    y = sqrt(x)
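A callable wrapper of this kind might look roughly like the sketch below. It is a simplified guess rather than the MA class: MaskedUnaryFunction is a hypothetical name, math.sqrt replaces Numeric.sqrt, and data and mask are passed as separate lists.

```python
import math


class MaskedUnaryFunction:
    """Wrap a scalar function so it can be applied to (data, mask) pairs."""

    def __init__(self, func, fill):
        self.func = func  # the underlying function
        self.fill = fill  # a safe argument to substitute at masked entries

    def __call__(self, data, mask=None):
        if mask is None:
            safe = data
        else:
            safe = [self.fill if m else d for d, m in zip(data, mask)]
        # The mask passes through unchanged; masked results are not meaningful.
        return [self.func(v) for v in safe], mask


sqrt = MaskedUnaryFunction(math.sqrt, 0)

# The masked entry is -1.0, but sqrt never sees it: 0 is substituted first.
values, mask = sqrt([4.0, -1.0, 9.0], [0, 1, 0])
print(values, mask)  # [2.0, 0.0, 3.0] [0, 1, 0]
```

Storing the fill value on the instance is what lets each wrapped function carry its own safe argument.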
A domain object is included, so arguments outside a function’s domain give masked results.
    >>> x = array([1.,2., -1.])
    >>> print sqrt(x)
    [1.0 ,1.41421356237 ,-- ,]
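The slide does not show what a domain object looks like, so the sketch below is only one plausible shape for the idea: a small callable that flags out-of-domain arguments, whose flags are merged into the mask before the function is applied. The names DomainGreaterEqual and masked_sqrt are hypothetical, and lists stand in for Numeric arrays.

```python
import math


class DomainGreaterEqual:
    """Hypothetical domain object: flags values below a bound as invalid."""

    def __init__(self, bound):
        self.bound = bound

    def __call__(self, values):
        return [1 if v < self.bound else 0 for v in values]


def masked_sqrt(data, mask=None, domain=DomainGreaterEqual(0.0)):
    """Square root that masks negative arguments instead of failing."""
    bad = domain(data)
    if mask is not None:
        # Entries already masked stay masked.
        bad = [1 if (b or m) else 0 for b, m in zip(bad, mask)]
    safe = [0.0 if b else d for d, b in zip(data, bad)]
    return [math.sqrt(v) for v in safe], bad


values, mask = masked_sqrt([1.0, 2.0, -1.0])
print(mask)  # [0, 0, 1]
```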
Some functions needed to be done by hand.
• Some arguments are not arrays: identity, zeros, ones, sum, product.
• New functions with no Numeric counterpart: count
• Semantics require detailed handling: choose, where, take, ravel
Many semantic issues had to be decided.
• A simple example: compress
  compress (c, x) means collect the elements of x corresponding to true elements of c. Clearly, a masked value of x that gets chosen has to stay masked. But what does a masked value in c mean?
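To make the question concrete, here is a sketch of one possible answer, in which a masked entry of the condition is treated as false. That choice is an assumption made for this example, not a statement of what MA does; the function masked_compress and its four-argument signature are likewise invented, with lists standing in for Numeric arrays.

```python
def masked_compress(cond, cond_mask, data, data_mask):
    """Collect entries of data where cond is true and not masked.

    Assumption for this sketch: a masked condition counts as false.
    A chosen entry of data that is masked stays masked in the result.
    """
    out_data, out_mask = [], []
    for i, c in enumerate(cond):
        cond_is_masked = cond_mask is not None and cond_mask[i]
        if c and not cond_is_masked:
            out_data.append(data[i])
            out_mask.append(1 if (data_mask is not None and data_mask[i]) else 0)
    return out_data, out_mask


# The last condition is true but masked, so element 40 is not selected.
picked = masked_compress([1, 1, 0, 1], [0, 0, 0, 1],
                         [10, 20, 30, 40], [0, 1, 0, 0])
print(picked)  # ([10, 20], [0, 1])
```

Other answers are defensible, for example raising an error on a masked condition; the point is that some rule has to be chosen and documented.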
Speed seems to be good enough for most uses.
• Speed relative to Numeric varies depending on what you do and how often you have a mask present.
• Simple arithmetic for arrays of size 50,000 is 1.3 – 2.3 times slower than Numeric.
• Implementation entirely in Python was a good choice, because many refinements proved to be necessary.
Space does not seem to be more of a problem than it needs to be.
• Masks are byte arrays; bits would be used if available.
• Masks are copy-on-write so they can be shared.
• When the mask is None, filled does not copy.
This project illustrates many of Python’s strengths.
• Object-oriented features: sqrt is actually a class instance that “knows” it must not take square roots of negative numbers.
• masked emulates a new kind of scalar.
• Adapted Numeric, a very complicated package, to a new use without a complete rewrite, in 1800 lines of Python.