Wednesday, October 20, 2010

careful when casting numpy arrays

It's been a while since I posted here, mostly because I've been posting my questions (and eventual answers like this one) to stackoverflow.com.  In any case, I ran into a problem that I thought I'd share with you all.

I have some code that was creating a numpy array from a list of lists. Depending on the user's interactions with our program, one of the columns in the list could contain strings, so I was casting the array to type np.object.  However, for very specific inputs, I was encountering a strange error:

ValueError: setting an array element with a sequence.

So I set out to take a closer look at what was happening and this is how I was able to reproduce the error:

>>> np.array(['asdf', 1.3456e-9]).astype('object')[1:].astype('float')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.


And more frighteningly:

>>> np.array(['asdf', 1.345678e-9]).astype('object')[1:].astype('float')
array([ 1.345678])


By this point, you probably see what's going on here.  Since I am _casting_ the array to type np.object only after it has been created, numpy has already fit all of the elements into equally sized blocks of memory sized for the strings in the array.  The problem is that it doesn't seem to take the floating point values into account, so they are apparently cast to strings and truncated when they are too long.  In the case above it appears that the underlying data representation would be numpy "|S8", since the float value coming back out is 8 characters wide.  That would also explain the error in the first example: cutting the text of 1.3456e-9 off at 8 characters presumably leaves something like '1.3456e-', which is not a valid number.
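Here is a minimal sketch for checking this yourself. The variable names are mine, and I'm assuming a reasonably recent numpy, where the inferred dtype may be a wider unicode type rather than "|S8", so the truncation itself may not reproduce. The point it shows is only that the float has already become a string before the cast to object ever happens.

```python
import numpy as np

# Without an explicit dtype, numpy picks one common type for the mixed list,
# and for a str/float mix that common type is a fixed-width string dtype.
mixed = np.array(['asdf', 1.345678e-9])
inferred_kind = mixed.dtype.kind  # 'S' (bytes) or 'U' (unicode)

# Casting afterwards only re-wraps the already-converted strings as objects.
as_objects = mixed.astype(object)
element_type = type(as_objects[1])  # a string type, not float

print(mixed.dtype, element_type)
```

If the printed dtype is a string type, the original float is gone by the time astype('object') runs.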

In another example, we see that when the largest string is long enough, the whole float representation gets stored and is cast properly.

>>> np.array(['asdfasdfasd', 1.345678e-9]).astype('object')[1:].astype('float')
array([ 1.34567800e-09])


The message here is never to cast a mixed-type array to np.object after the fact. Rather, create it with that type to begin with, so numpy stores your floating point values as they are instead of converting them to fixed-width strings first.

>>> np.array(['asdf', 1.3456789123456e-9], dtype='object')[1:].astype('float')
array([  1.34567891e-09])