Saturday, May 5, 2012

Playing with arrays: slicing, sorting, filtering, where function, etc.

You can also read the more recent post on Numpy here: http://python-astro.blogspot.mx/2014/08/introduction-to-numpy.html


Most of the data we have to manage are in form of arrays. That's why it's quite always necessary to import the numpy library to easily deal with tables, arrays, images, cubes, etc...
In this post I will discuss some of the methods and functions one may have to use in this context.

In Python, one can use lists, tuples and dictionaries to put different elements together. They even can contain elements of different types. But nothing better than numpy arrays to really "play" with the data, extract subsets, combine them using arithmetic operations.

We already discuss a little some array commands in a previous post: read-ascii-file-cont. Here we are more complete, but not exhaustive, as numpy is a whole world... Have a look at the reference guide: http://docs.scipy.org/doc/numpy/reference/ and other references at the end of this post.

Create arrays

a = np.array([1, 2, 3, 7, 5])
a
array([1, 2, 3, 7, 5])

numpy's arrays cannot contain values of different types, the values are transformed if necessary, from integer into real, and from real into string:
b = np.array([1, 2, 3, 4.])
b
array([ 1.,  2.,  3.,  4.])
c = np.array([1, 2, 3., '4'])
c
array(['1', '2', '3', '4'],
      dtype='|S1')

shapes and sizes of the arrays are obtained using method and functions:
print c.shape
(4,)
print len(c)
4

Different ways to create arrays:
a = np.arange(1, 10, 0.5)
print a
[ 1.   1.5  2.   2.5  3.   3.5  4.   4.5  5.   5.5  6.   6.5  7.   7.5  8.
  8.5  9.   9.5]

TAKE CARE, THE LATEST ELEMENT IS NOT WHAT YOU MAY EXPECT...
It's because the first element in Python is indexed by 0. The simple use of np.arange is :
a = np.arange(10)
print a
[0 1 2 3 4 5 6 7 8 9]
a starts at 0, and has 10 elements, that's ok.

Easy to create linearly spaced arrays:
c = np.linspace(0, 1, 4)
print c
[ 0.          0.33333333  0.66666667  1.        ]
or log spaced:

c = np.logspace(0, 1, 4)
c
array([  1.        ,   2.15443469,   4.64158883,  10.        ])
which is actually the same as :
c = 10**np.logspace(0, 1, 4)

To create arrays of 0. or 1.:
a = np.zeros(10)
b = np.ones((5, 4))
Here b is a 2D array.
b.shape
(5, 4)

You can use a list (or a tuple, or an array) and replicate it:
a = np.array([1, 2, 3])
b = np.tile(a, 5)
print b
[1 2 3 1 2 3 1 2 3 1 2 3 1 2 3]
c = np.tile(a, (5, 1))
print c
[[1 2 3]
 [1 2 3]
 [1 2 3]
 [1 2 3]
 [1 2 3]]

You can also create 2D arrays from 1D vectors:
x = np.linspace(-1, 1, 100)
y = x
X, Y = np.meshgrid(x, y) #this create 2D arrays containing x and y for each pixel
print X.shape
(100, 100)
Outer products are obtain using... np.outer:
a = np.array([1, 2, 3])
b = np.array([1, 10, 100])
np.outer(a, b)
array([[  1,  10, 100],
       [  2,  20, 200],
       [  3,  30, 300]])
np.outer(b, a)
array([[  1,   2,   3],
       [ 10,  20,  30],
       [100, 200, 300]])

Broadcasting

Python numpy is able to add dimensions to arrays to perform operations:
a = np.array([1, 2, 3, 4])
b = np.ones((6, 4))
a * b
array([[ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.],
       [ 1.,  2.,  3.,  4.]])
Nice page on this:
http://www.scipy.org/EricsBroadcastingDoc

Indexing and Slicing

The access to elements of arrays is done using []:
a = np.arange(10)
a[5]
0
a[-1]  # last elements
9

In case of N-dims arrays, one can extract slides using :
a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 10, 100, 1000])
c = np.outer(a, b)
c.shape
(5, 4)
print c 
[[   1   10  100 1000]
 [   2   20  200 2000]
 [   3   30  300 3000]
 [   4   40  400 4000]
 [   5   50  500 5000]]
print c[:,2]
[100 200 300 400 500]
print c[-1,:]
[   5   50  500 5000]
print c[-1, -1]
5000

There is a lot of methods in the array object, the best page to learn more is there: http://scipy-lectures.github.com/intro/numpy/numpy.html#the-numpy-array-object

Filtering

Sometimes we need to extract elements from arrays following some criteria. There is basically two ways to do this: defining a set of indices where the condition is completed (a la WHERE in IDL), or defining a boolean mask.

Let's first define a 2D array made of 10 times 1000 random values:
a = np.random.random((10, 1000))

We want to extract the values where the 2nd and the 4th 1000-elements vectors are greater than 0.5.

Using the numpy.where function:
w1 = np.where((a[1,:] > 0.5) & (a[3,:] > 0.5))
as the result is a tuple of indices, and in this case the dimension of the result is 1, better extract on the fly the array of indices from the tuple:
w2 = np.where((a[1,:] > 0.5) & (a[3,:] > 0.5))[0]
len(w2)
248  # your result may differ form this value, but must be close to 250.

We can now reduce the initial array to the desired values:
b = a[:, w2]
b.shape
(10, 248)

One can also build an array of boolean:
mask = (a[1,:] > 0.5) & (a[3,:] > 0.5)
which is actually similar to
mask2 =  np.where((a[1,:] > 0.5) & (a[3,:] > 0.5), True, False)

It can be used the same way as before:
b = a[:, mask]
b.shape
(10, 248)

These latter cases result in a 1000-element array, filled with True and False. To know the number of True values, just sum the array:
mask.sum()
248  # your result may differ form this value, but must be close to 250.

One big advantage of the mask technique is that you can combine different masks:
mask1 = a[1,:] > 0.5
mask2 = a[3,:] > 0.5
mask_total = mask1 & mask2
b = a[:, mask_total]

Sorting

a = np.array([1,2,45,2,1,46,7,-1])
b1 = np.sort(a)
ib = np.argsort(a)
b2 = a[ib]
print b1
[-1  1  1  2  2  7 45 46]
print b2 
[-1  1  1  2  2  7 45 46]

More on numpy arrays

http://scipy.org/Numpy_Example_List_With_Doc
http://scipy-lectures.github.com/intro/numpy/numpy.html#the-numpy-array-object

7 comments:

  1. Little mistake, you wanted to say lin instead of log :
    which is actually the same as :
    c = 10**np.linspace(0, 1, 4)

    ReplyDelete
  2. Why a mistake? logspace is a valide numpy command.

    ReplyDelete
  3. Anonymous10/6/15 14:50

    Because np.logspace(0, 1, 4) is not "actually the same as" 10**np.logspace(0, 1, 4).

    But it is the same as 10**np.linspace(0, 1, 4).

    ReplyDelete
  4. I'd spent so long trying to figure out how to apply 1d masks onto 2d numpy arrays, b = a[:, mask]! Thanks for a really helpful guide!

    ReplyDelete
  5. Anonymous12/4/16 05:42

    The access to elements of arrays is done using []:
    a = np.arange(10)
    a[5]
    0 <-- it should be 5

    ReplyDelete
  6. Hey! I have to convert some lists into arrays in order to use function np.where
    I have to plot indicators of different sources from a huge data file and I make the operations between those indicators using lists not arrays. So lists are already populated with data but the np.where function would be useful when I have to plot them.
    Can you help me?
    Thanks a lot

    ReplyDelete
  7. This comment has been removed by the author.

    ReplyDelete