Numpy in Python for High Performance Computing

One significant source of overhead in Python comes from the way it handles data: each object carries a considerable amount of metadata, including information about its type. This overhead becomes even more pronounced when managing large collections of data. For instance, representing a 1000×1000 grid using a list of lists would involve a million elements, each carrying more metadata than actual data. This inefficiency arises because Python lists are designed to hold objects of any type, not just uniform types. However, in many real-world scenarios—especially in scientific computing—the data within such structures is typically all of the same type. A more efficient approach would be to use a data structure that stores type information once for the entire collection, rather than individually for each element. Python does provide a built-in “array” type that addresses this to some extent by storing elements more compactly. However, these native arrays offer limited functionality, which led the community to develop more powerful alternatives.

Numpy

Numpy (short for Numerical Python) offers an efficient solution to this problem. It provides a powerful data structure called an ndarray (N-dimensional array), which can have multiple dimensions—unlike Python arrays that are restricted to one-dimensional structures like lists. This flexibility makes ndarrays highly suitable for a wide range of data types and use cases. In addition to the array itself, Numpy includes an extensive library of functions that perform various operations on arrays.

One of Python’s limitations in terms of performance is the overhead caused by the metadata associated with sequences of numbers. This metadata prevents the processor from efficiently vectorizing computations. Since ndarrays remove much of this metadata, operations on them can be vectorized, leading to faster execution and improved performance compared to standard Python lists.

Let’s compare the list comprehension from the previous episode to the equivalent Numpy functions.

pip install numpy # fist install numpy using pip
Defaulting to user installation because normal site-packages is not writeable
^C
Note: you may need to restart the kernel to use updated packages.
import numpy as np
#the numpy module has been imported as np. 
#This is a very common convention when using Numpy in Python
%timeit x = [i**2 for i in range(1000)]
%timeit x = np.arange(1000) ** 2
75.1 µs ± 300 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
4.07 µs ± 2.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

This is clearly not just vectorisation—the most speedup you’d expect there is around 18×. Instead, Numpy implements operations across the whole array with high-speed loops in a compiled programming language, rather than using Python’s slower loops.

Broadcasting and whole-array operations is one of Numpy’s most powerful features.

In Numpy, operations on arrays are applied to the whole array, element-wise. For example, in the example above we squared an array; this returned the array containing the squared numbers—the numbers being the same as those generated in the list comprehension. The single number to the right of the ** has been broadcast to act on all the elements in the ndarray

Remark

This also works with arithmetic operations between arrays. You can, for example, add two arrays together, or multiply their elements, and Numpy will perform the operations as efficiently as it knows how.

The key point is that the operations must be these whole-array or broadcast operations in order to gain this speed. If you try to use a normal Python for loop over elements of an ndarray, this will most likely perform even slower than the equivalent loop over a Python list, since Numpy needs to do some extra bookkeeping compared to Python.

Let’s look at the next script and see how we can use whole-array operations to improve its performance. The Euclidean distance between two vectors.

The Euclidean distance

If we hadn’t heard of Numpy, we might treat pi and qi as lists, and use the following function to calculate the distance:

%%writefile dist.py
def naive_dist(p, q):
    square_distance = 0
    for p_i, q_i in zip(p, q):
        square_distance += (p_i - q_i) ** 2
    return square_distance ** 0.5
Writing dist.py

Let’s save this function in a file dist.py, and time it for some example 1000-element vectors.

import dist
%timeit p = [i for i in range(1000)]; q = [i + 2 for i in range(1000)]; dist.naive_dist(p, q)
192 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

This method uses Python lists and a custom naive_dist function to calculate the distance. The time is around 193 microseconds per loop, which is quite slow due to the overhead associated with Python’s dynamic typing and the use of lists.

%%writefile -a dist.py
import numpy as np

def simple_numpy_dist(p, q):
    return (np.sum((p - q) ** 2)) ** 0.5
Appending to dist.py

Add this function definition to the dist.py file we just made.

To time this, we need to do the setup with Numpy rather than with list comprehensions, since otherwise Python will complain when it tries to subtract two lists. We’ll use arange here to generate a some test data, but the performance should be the same no matter how the input data are generated, provided the data type is the same.

import dist
import numpy as np
%timeit  p = np.arange(1000); q = np.arange(1000) + 2; dist.simple_numpy_dist(p, q)
20.5 µs ± 35.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In this case, you’re using NumPy arrays, which are much more efficient for numerical computations due to their optimized C-backed implementation. The time per loop is about 20.6 microseconds, significantly faster than the list-based approach.

%%bash
python -m timeit --setup='import dist; import numpy as np; \
    p = np.arange(1000); q = np.arange(1000) + 2' \
    'dist.simple_numpy_dist(p, q)'
20000 loops, best of 5: 13.1 usec per loop

This approach runs timeit from the command line, and shows a result of 12.8 microseconds per loop for 20,000 loops, indicating consistent performance.

This is a factor of thirty improvement, but can we do better? Since the length of a vector is a very common operation, Numpy in fact provides a built-in function for it: np.linalg.norm. Using that:

%%writefile -a dist.py
def numpy_norm_dist(p, q):
    return np.linalg.norm(p - q)
Appending to dist.py
import dist
import numpy as np
%timeit  p = np.arange(1000); q = np.arange(1000) + 2; dist.numpy_norm_dist(p, q)
14.9 µs ± 57.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Adding this to dist.py and timing it again reveals:

%%bash
python -m timeit --setup='import dist; import numpy as np; \
    p = np.arange(1000); q = np.arange(1000) + 2' \
    'dist.numpy_norm_dist(p, q)'
50000 loops, best of 5: 7.94 usec per loop

The reason the 7.74 µs per loop result is quite good is because np.linalg.norm is highly optimized for these types of calculations.

That’s a 50% improvement on the previous case, and a 27x improvement on the original naive code!

Multiple dimensions

For one-dimensional arrays, translating from naive to whole-array operations is normally quite direct. But when it comes to multi-dimensional arrays, some additional work may be needed to get everything into the right shape.

Let’s extend the previous example to work on multiple vectors at once. We would like to calculate the Euclidean distances between M pairs of vectors, each of length N . In plain Python we could take this as a list of lists, and re-use the previous function for each vector in turn.

%%writefile -a dist.py
def naive_dists(ps, qs):
    return [naive_dist(p, q) for p, q in zip(ps, qs)]
Appending to dist.py
import dist
import numpy as np
%timeit  ps = [[i + 1000 * j for i in range(1000)] for j in range(1000)]; qs = [[i + 1000 * j + 2 for i in range(1000)] for j in range(1000)]; dist.naive_dists(ps, qs)
308 ms ± 725 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

now timing this script.

%%bash
python -m timeit --setup='import dist; \
    ps = [[i + 1000 * j for i in range(1000)] for j in range(1000)]; \
    qs = [[i + 1000 * j + 2 for i in range(1000)] for j in range(1000)]' \
    'dist.naive_dists(ps, qs)'
2 loops, best of 5: 110 msec per loop

This is becoming a more significant workload, increasing

Moving this to Numpy, we can subtract as we did previously, but for the summation, we need to be able to tell Numpy to leave us with a one-dimensional array of distances, rather than a single number. To do this, we pass the axis keyword argument, which tells Numpy which axis to sum over. In this case, axis 0 controls which vector we are selecting, and axis 1 controls which element of the vector. Thus here we only want to sum over axis 1, leaving axis 0 still representing the vector of sums.

%%writefile -a dist.py
def simple_numpy_dists(ps, qs):
    return np.sum((ps - qs) ** 2, axis=1) ** 0.5
Appending to dist.py
import dist
import numpy as np
%timeit  ps = np.asarray([[i + 1000 * j for i in range(1000)] for j in range(1000)]); qs = np.asarray([[i + 1000 * j + 2 for i in range(1000)] for j in range(1000)]); dist.simple_numpy_dists(ps, qs)
350 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

To test the performance of this:

%%bash
python -m timeit --setup='import numpy as np; \
    import dist; \
    ps = np.asarray([[i + 1000 * j \
                      for i in range(1000)] \
                     for j in range(1000)]); \
    qs = np.asarray([[i + 1000 * j + 2 \
                      for i in range(1000)] \
                     for j in range(1000)])' \
    'dist.simple_numpy_dists(ps, qs)'
50 loops, best of 5: 4.57 msec per loop

we’ve achieved almost a 30x speedup by switching over to Numpy.

%%writefile -a dist.py
def numpy_norm_dists(ps, qs):
    return np.linalg.norm(ps - qs, axis=1)
Appending to dist.py
import dist
import numpy as np
%timeit  ps = np.asarray([[i + 1000 * j for i in range(1000)] for j in range(1000)]); qs = np.asarray([[i + 1000 * j + 2 for i in range(1000)] for j in range(1000)]); dist.simple_numpy_dists(ps, qs)
351 ms ± 4.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%bash
python -m timeit --setup='import numpy as np; \
    import dist; \
    ps = np.asarray([[i + 1000 * j \
                      for i in range(1000)] \
                     for j in range(1000)]); \
    qs = np.asarray([[i + 1000 * j + 2 \
                      for i in range(1000)] \
                     for j in range(1000)])' \
    'dist.simple_numpy_dists(ps, qs)'
50 loops, best of 5: 4.41 msec per loop

it’s always important to test your improvements on data that resemble those that you will be performing your computation on, as the performance characteristics will change from case to case.

Einsum

The challenge with these more complex scenarios is that we must rely on optional arguments in np.sum and np.linalg.norm. Because of this, Numpy can’t optimize all instances of these general functions as effectively as it could optimize a specific case that we need.

However, Numpy offers a method to express the exact reduction we want, and it can perform this operation in a highly optimized manner. The function for this is np.einsum, which stands for “Einstein summation convention.” If you’re unfamiliar with it, this notation is often used in physics to simplify expressions involving multiple summations.

np.einsum takes as arguments a string that specifies the operations to be performed, followed by the array(s) on which those operations will be carried out. The string follows a specific format: indices (starting from i, j, k, etc.) are assigned to each array, separated by commas. After that, -> is used to indicate the result’s indices. The indices on the right-hand side represent the desired output shape, and any indices not included on the right-hand side will be summed over.

For example, ‘ii->i’ gives a one-dimensional array (or a vector) containing the diagonal elements of a two-dimensional array (or matrix). This is because the i th element of this vector contains the element (i,i) of the matrix.

In this case, we want to calculate the array’s product with itself, summing along axis 1. This can be done via ‘ij,ij->i’, although the array will need to be given twice or this to work. Because of this, we’ll need to put the difference array into a variable before passing it into np.einsum. Putting this together:

%%writefile -a dist.py

def einsum_dists(ps, qs):
    difference = ps - qs
    return np.einsum('ij,ij->i', difference, difference) ** 0.5
Appending to dist.py
%%bash
python -m timeit --setup='import numpy as np; \
    import dist; \
    ps = np.asarray([[i + 1000 * j \
                      for i in range(1000)] \
                     for j in range(1000)]); \
    qs = np.asarray([[i + 1000 * j + 2 \
                      for i in range(1000)] \
                     for j in range(1000)])' \
    'dist.einsum_dists(ps, qs)'
200 loops, best of 5: 1.72 msec per loop

Many functions in Numpy attempt to leverage multi-core parallelism on your machine. While this is an improvement over not using parallelism at all, it’s not always optimal. In many cases, the parallel implementation doesn’t lead to any noticeable speedup, and some functions aren’t parallelized at all.