Cython

Although Python is a fantastic programming language, performance is its major drawback. The elegant language with its straightforward syntax was not designed for fast computation.

Cython has been bridging this gap for many years by converting Python code into compiled C programs. A range of scientific computing packages rely on Cython to speed up computation.

Cython is a programming language that makes writing C extensions for the Python language as easy as Python itself. It aims to be a superset of the Python language, which gives it high-level, object-oriented, functional, and dynamic programming.

Its main feature on top of these is support for optional static type declarations as part of the language. The source code gets translated into optimized C/C++ code and compiled as Python extension modules. This allows for both very fast program execution and tight integration with external C libraries, while keeping up the high programmer productivity for which the Python language is well known.

[Figure: diagram of the Cython workflow]

Following the diagram, we start by creating a .py file containing our Python code (in this case factorial_python.py, which computes the factorial of a given number), and then create a .pyx file containing our Cython code (in this case factorial_cython.pyx).

The %load_ext Cython magic command in Jupyter notebooks loads the Cython extension, allowing you to use Cython-specific syntax directly in cells. This enables you to write C-like code in Python for performance optimization and compile it within the notebook, helping with tasks like speeding up numerical computations.

%load_ext Cython 
import cython 

The function factorial_python(x) computes the factorial of a given number x using a for loop. It initializes fact to 1 and multiplies it by (x - i) for i = 0, ..., x - 1, that is, by each integer from x down to 1, and then returns the result.

%%writefile factorial_python.py

def factorial_python(x):
    fact = 1
    
    for i in range(x):
        fact *= (x-i)
    return fact
Writing factorial_python.py
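
As a quick sanity check (not part of the original notebook), the loop can be compared against math.factorial from the standard library. The function body below is copied from factorial_python.py so that the snippet runs on its own.

```python
import math


def factorial_python(x):
    # Same loop as in factorial_python.py: multiply by x, x-1, ..., 1.
    fact = 1
    for i in range(x):
        fact *= (x - i)
    return fact


# Compare against the standard-library implementation for a few inputs.
for n in (0, 1, 5, 20, 100):
    assert factorial_python(n) == math.factorial(n)

print(factorial_python(5))  # 120
```
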
%%writefile factorial_cython.pyx
# cython: language_level=3

def factorial_cython(x):
    fact = 1
    
    for i in range(x):
        fact *= (x-i)
    return fact
Writing factorial_cython.pyx

This code is identical to the Python version; it is simply saved in a .pyx file so that Cython can compile it. The cython: language_level=3 directive tells Cython to interpret the source with Python 3 semantics. To use this code, we need to compile it, either with a setup file or in a Jupyter notebook with %load_ext Cython and the %%cython cell magic.

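Cython is in charge of generating a .c file from the .pyx file. A C compiler then compiles the .c file into a .so file (or .pyd on Windows). The -a flag used below additionally writes an annotated HTML report that shows which lines interact with the Python interpreter.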
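
The exact file name of the compiled module depends on the interpreter and the platform. As a small illustration using only the standard library, the suffix that a build will use can be queried with sysconfig; the helper name expected_extension_filename is hypothetical and not part of Cython.

```python
import sysconfig


def expected_extension_filename(module_name):
    # EXT_SUFFIX is the platform-specific suffix for extension modules,
    # e.g. ".cpython-311-x86_64-linux-gnu.so" on Linux with CPython 3.11.
    suffix = sysconfig.get_config_var("EXT_SUFFIX")
    return module_name + suffix


print(expected_extension_filename("factorial_cython"))
```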

!cython -a factorial_cython.pyx 
%%writefile factorial_cython_var_typed.pyx

def factorial_cython_var_typed(int x):
    cdef int i
    fact = 1
    
    for i in range(x):
        fact *= (x-i)
    return fact
Writing factorial_cython_var_typed.pyx
%%writefile factorial_cython_call_typed.pyx

cpdef unsigned int factorial_cython_call_typed(unsigned int x):
    cdef unsigned int fact = 1
    cdef unsigned int i

    for i in range(x):
        fact *= (x-i)
    return fact
Writing factorial_cython_call_typed.pyx
!cython -a factorial_cython_call_typed.pyx
/apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages/Cython/Compiler/Main.py:381: FutureWarning: Cython directive 'language_level' not set, using '3str' for now (Py3). This has changed from earlier releases! File: /home/gth/it4i_course/D2_01_Cython/factorial_cython_call_typed.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
%%writefile setup_factorial_cython.py

from Cython.Build import cythonize
from setuptools import setup

setup(
    # name is omitted: setuptools expects a single string there, not a list,
    # and it is not needed for an in-place build.
    ext_modules=cythonize(["factorial_cython.pyx",
                           "factorial_cython_var_typed.pyx",
                           "factorial_cython_call_typed.pyx"],
                          compiler_directives={"language_level": "3"}))
Writing setup_factorial_cython.py

Use the cythonize function to tell Cython which files should be translated and compiled. In the code snippet above, we tell Cython to compile factorial_cython.pyx, factorial_cython_var_typed.pyx and factorial_cython_call_typed.pyx. Once setup_factorial_cython.py is configured, simply execute the following command.

!python3 setup_factorial_cython.py build_ext --inplace
Compiling factorial_cython_var_typed.pyx because it changed.
[1/1] Cythonizing factorial_cython_var_typed.pyx
running build_ext
building 'factorial_cython' extension
creating build/temp.linux-x86_64-cpython-311
gcc -O3 -fPIC -march=znver2 -fPIC -I/apps/all/Python/3.11.5-GCCcore-13.2.0/include/python3.11 -c factorial_cython.c -o build/temp.linux-x86_64-cpython-311/factorial_cython.o
creating build/lib.linux-x86_64-cpython-311
gcc -shared -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -O3 -fPIC -march=znver2 build/temp.linux-x86_64-cpython-311/factorial_cython.o -L/apps/all/Python/3.11.5-GCCcore-13.2.0/lib -o 
build/lib.linux-x86_64-cpython-311/factorial_cython.cpython-311-x86_64-linux-gnu.so
building 'factorial_cython_var_typed' extension
gcc -O3 -fPIC -march=znver2 -fPIC -I/apps/all/Python/3.11.5-GCCcore-13.2.0/include/python3.11 -c factorial_cython_var_typed.c -o build/temp.linux-x86_64-cpython-311/factorial_cython_var_typed.o
gcc -shared -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -O3 -fPIC -march=znver2 build/temp.linux-x86_64-cpython-311/factorial_cython_var_typed.o 
-L/apps/all/Python/3.11.5-GCCcore-13.2.0/lib -o build/lib.linux-x86_64-cpython-311/factorial_cython_var_typed.cpython-311-x86_64-linux-gnu.so
building 'factorial_cython_call_typed' extension
gcc -O3 -fPIC -march=znver2 -fPIC -I/apps/all/Python/3.11.5-GCCcore-13.2.0/include/python3.11 -c factorial_cython_call_typed.c -o build/temp.linux-x86_64-cpython-311/factorial_cython_call_typed.o
gcc -shared -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -O3 -fPIC -march=znver2 build/temp.linux-x86_64-cpython-311/factorial_cython_call_typed.o 
-L/apps/all/Python/3.11.5-GCCcore-13.2.0/lib -o build/lib.linux-x86_64-cpython-311/factorial_cython_call_typed.cpython-311-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-cpython-311/factorial_cython.cpython-311-x86_64-linux-gnu.so -> 
copying build/lib.linux-x86_64-cpython-311/factorial_cython_var_typed.cpython-311-x86_64-linux-gnu.so -> 
copying build/lib.linux-x86_64-cpython-311/factorial_cython_call_typed.cpython-311-x86_64-linux-gnu.so -> 
from timeit import timeit

t1 = timeit("factorial_python(100)",
            setup = "from factorial_python import factorial_python",
            number = 10_000,
           )
t2 = timeit("factorial_cython(100)",
            setup = "from factorial_cython import factorial_cython",
            number = 10_000,
           )
t3 = timeit("factorial_cython_var_typed(100)",
            setup = "from factorial_cython_var_typed import factorial_cython_var_typed",
            number = 10_000,
           )
t4 = timeit("factorial_cython_call_typed(100)",
            setup = "from factorial_cython_call_typed import factorial_cython_call_typed",
            number = 10_000,
           )
print(f"Pure Python: {t1:.4f}")
print(f"Cythonized Python: {t2:.4f}")
print(f"Cython var typed: {t3:.4f}")
print(f"Cython call typed: {t4:.4f}")
print(f"Cythonized Python is {(t1 / t2):.4f}x faster than pure Python")
print(f"Cython var typed is {(t1 / t3):.4f}x faster than pure Python")
print(f"Cython call typed is {(t1 / t4):.4f}x faster than pure Python")
Pure Python: 0.1033
Cythonized Python: 0.1030
Cython var typed: 0.0800
Cython call typed: 0.0009
Cythonized Python is 1.0028x faster than pure Python
Cython var typed is 1.2900x faster than pure Python
Cython call typed is 121.0200x faster than pure Python

The timings show the following speedups over pure Python:

  • Cythonized Python: essentially the same speed as pure Python (1.0028x). Without type declarations, the compiled code still works on Python objects through the C API, so very little overhead is removed.

  • Cython with variable typing: 1.29x faster than pure Python. Typing x and the loop index i as C integers turns the loop itself into a C loop, but fact is still a Python object, so every multiplication remains a Python integer operation.

  • Cython with function call typing: 121.02x faster than pure Python. Here the argument, the return value and all local variables are C unsigned integers, so the whole loop runs as native C arithmetic. Note, however, that the result is no longer correct for x = 100: an unsigned int is typically 32 bits wide, so the product wraps around modulo 2^32 instead of growing like a Python integer. Part of the speedup therefore comes from replacing arbitrary-precision arithmetic with fixed-width arithmetic.
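
To see why the fully typed version should be used with care, the sketch below emulates C unsigned int arithmetic in plain Python by masking every product to 32 bits. The 32-bit width is an assumption (it is typical, but the C standard only guarantees at least 16 bits), and factorial_u32 is a hypothetical helper, not part of the notebook.

```python
import math

MASK_32 = 0xFFFFFFFF  # assumption: unsigned int is 32 bits wide


def factorial_u32(x):
    # Same loop as factorial_cython_call_typed, with every product
    # reduced modulo 2**32 to mimic unsigned int wraparound in C.
    fact = 1
    for i in range(x):
        fact = (fact * (x - i)) & MASK_32
    return fact


print(factorial_u32(12) == math.factorial(12))  # True: 12! still fits in 32 bits
print(factorial_u32(13) == math.factorial(13))  # False: 13! exceeds 2**32 - 1
print(factorial_u32(100))                       # 0: 100! is a multiple of 2**32
```

For inputs where the exact value matters, the result type has to be wide enough, or fact can be left as a Python object as in factorial_cython_var_typed.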

%%writefile factorial_cython_openmp.pyx
from cython.parallel import prange

cpdef unsigned int factorial_cython_openmp(unsigned int x):
    cdef unsigned int fact = 1
    cdef unsigned int i

    with nogil:
        for i in prange(x, schedule="guided"):
            fact *= (x-i)
    return fact
Writing factorial_cython_openmp.pyx

This Cython code, written to factorial_cython_openmp.pyx, computes the factorial using OpenMP for parallel execution. It uses prange() inside a nogil block to parallelize the loop, so that multiple threads each compute part of the product. Cython treats the in-place update fact *= ... as a reduction and combines the per-thread partial products after the loop. In principle this can speed up loops with many iterations, but whether it pays off depends on how much work each thread receives, as the benchmark further down shows.

%%writefile setup_factorial_openmp.py

from Cython.Build import cythonize
from setuptools import Extension, setup

ext_modules = [
    Extension(
        "factorial_cython_openmp",
        ["factorial_cython_openmp.pyx"],
        extra_compile_args=['-fopenmp'],
        extra_link_args=['-fopenmp'],
    )
]

setup(
    ext_modules=cythonize(ext_modules,
                          compiler_directives={"language_level":"3"}),
)
Writing setup_factorial_openmp.py

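This %%bash cell is meant to configure the environment to use GCC 14 with OpenMP support for compiling Cython code: it selects the compilers (CC, CXX) and enables OpenMP in the compilation (CFLAGS) and linking (LDFLAGS) stages. The /opt/homebrew path targets a macOS Homebrew installation; on the Linux system used here gcc-14 is not found, as the output shows, and the build that follows uses the default gcc, which accepts -fopenmp. Also note that each %%bash cell runs in its own subprocess, so variables exported here do not carry over to later cells.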
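
One way to make such a configuration more portable is to check from Python which compiler is actually available before building. The sketch below uses only the standard library; pick_compiler and the candidate list are illustrative assumptions. Unlike an export inside a %%bash cell, a variable set through os.environ in the notebook process is inherited by commands started later from that process.

```python
import os
import shutil


def pick_compiler(candidates):
    # Return the full path of the first candidate found on PATH, else None.
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


# Prefer gcc-14 if it exists (e.g. a Homebrew install), otherwise fall back.
cc = pick_compiler(["gcc-14", "gcc", "cc"])
if cc is not None:
    # Environment variables set here are inherited by subprocesses
    # started later from the same Python process.
    os.environ["CC"] = cc
print(cc)
```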

%%bash

export PATH=/opt/homebrew/bin:$PATH
alias gcc=gcc-14
alias g++=g++-14

export CC=$(which gcc-14)
export CXX=$(which g++-14)
export CFLAGS="-fopenmp"
export LDFLAGS="-fopenmp"
/usr/bin/which: no gcc-14 in (/opt/homebrew/bin:/apps/all/JupyterNotebook/7.2.0-GCCcore-13.2.0/bin:/apps/all/IPython/8.17.2-GCCcore-13.2.0/bin:/apps/all/libxslt/1.1.38-GCCcore-13.2.0/bin:/apps/all/JupyterLab/4.2.0-GCCcore-13.2.0/bin:/apps/all/jupyter-server/2.14.0-GCCcore-13.2.0/bin:/apps/all/ZeroMQ/4.3.5-GCCcore-13.2.0/bin:/apps/all/util-linux/2.39-GCCcore-13.2.0/sbin:/apps/all/util-linux/2.39-GCCcore-13.2.0/bin:/apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/bin:/apps/all/virtualenv/20.24.6-GCCcore-13.2.0/bin:/apps/all/Python/3.11.5-GCCcore-13.2.0/bin:/apps/all/SQLite/3.43.1-GCCcore-13.2.0/bin:/apps/all/Tcl/8.6.13-GCCcore-13.2.0/bin:/apps/all/ncurses/6.4-GCCcore-13.2.0/bin:/apps/all/bzip2/1.0.8-GCCcore-13.2.0/bin:/apps/all/Julia/1.10.0-linux-x86_64/bin:/apps/all/OpenMPI/4.1.6-GCC-13.2.0/bin:/apps/all/UCC/1.2.0-GCCcore-13.2.0/bin:/apps/all/PMIx/4.2.6-GCCcore-13.2.0/bin:/apps/all/libfabric/1.19.0-GCCcore-13.2.0/bin:/apps/all/UCX/1.15.0-GCCcore-13.2.0/bin:/apps/all/libevent/2.1.12-GCCcore-13.2.0/bin:/apps/all/hwloc/2.9.2-GCCcore-13.2.0/sbin:/apps/all/hwloc/2.9.2-GCCcore-13.2.0/bin:/apps/all/libxml2/2.11.5-GCCcore-13.2.0/bin:/apps/all/XZ/5.4.4-GCCcore-13.2.0/bin:/apps/all/numactl/2.0.16-GCCcore-13.2.0/bin:/apps/all/binutils/2.40-GCCcore-13.2.0/bin:/apps/all/GCCcore/13.2.0/bin:/apps/all/XALT/3.0.2/bin:/home/gth/it4i_course:/home/gth/.local/bin:/mnt/proj3/fta-25-9/myenv/bin/vllm:/mnt/proj3/fta-25-9/myenv/bin:/apps/all/Anaconda3/2024.02-1/condabin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/c3/bin:/sbin:/bin:/opt/slurm/bin:/home/gth/bin)
/usr/bin/which: no g++-14 in (/opt/homebrew/bin:/apps/all/JupyterNotebook/7.2.0-GCCcore-13.2.0/bin:/apps/all/IPython/8.17.2-GCCcore-13.2.0/bin:/apps/all/libxslt/1.1.38-GCCcore-13.2.0/bin:/apps/all/JupyterLab/4.2.0-GCCcore-13.2.0/bin:/apps/all/jupyter-server/2.14.0-GCCcore-13.2.0/bin:/apps/all/ZeroMQ/4.3.5-GCCcore-13.2.0/bin:/apps/all/util-linux/2.39-GCCcore-13.2.0/sbin:/apps/all/util-linux/2.39-GCCcore-13.2.0/bin:/apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/bin:/apps/all/virtualenv/20.24.6-GCCcore-13.2.0/bin:/apps/all/Python/3.11.5-GCCcore-13.2.0/bin:/apps/all/SQLite/3.43.1-GCCcore-13.2.0/bin:/apps/all/Tcl/8.6.13-GCCcore-13.2.0/bin:/apps/all/ncurses/6.4-GCCcore-13.2.0/bin:/apps/all/bzip2/1.0.8-GCCcore-13.2.0/bin:/apps/all/Julia/1.10.0-linux-x86_64/bin:/apps/all/OpenMPI/4.1.6-GCC-13.2.0/bin:/apps/all/UCC/1.2.0-GCCcore-13.2.0/bin:/apps/all/PMIx/4.2.6-GCCcore-13.2.0/bin:/apps/all/libfabric/1.19.0-GCCcore-13.2.0/bin:/apps/all/UCX/1.15.0-GCCcore-13.2.0/bin:/apps/all/libevent/2.1.12-GCCcore-13.2.0/bin:/apps/all/hwloc/2.9.2-GCCcore-13.2.0/sbin:/apps/all/hwloc/2.9.2-GCCcore-13.2.0/bin:/apps/all/libxml2/2.11.5-GCCcore-13.2.0/bin:/apps/all/XZ/5.4.4-GCCcore-13.2.0/bin:/apps/all/numactl/2.0.16-GCCcore-13.2.0/bin:/apps/all/binutils/2.40-GCCcore-13.2.0/bin:/apps/all/GCCcore/13.2.0/bin:/apps/all/XALT/3.0.2/bin:/home/gth/it4i_course:/home/gth/.local/bin:/mnt/proj3/fta-25-9/myenv/bin/vllm:/mnt/proj3/fta-25-9/myenv/bin:/apps/all/Anaconda3/2024.02-1/condabin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/c3/bin:/sbin:/bin:/opt/slurm/bin:/home/gth/bin)
%%bash
python3 setup_factorial_openmp.py build_ext --inplace
Compiling factorial_cython_openmp.pyx because it changed.
[1/1] Cythonizing factorial_cython_openmp.pyx
running build_ext
building 'factorial_cython_openmp' extension
gcc -O3 -fPIC -march=znver2 -fPIC -I/apps/all/Python/3.11.5-GCCcore-13.2.0/include/python3.11 -c factorial_cython_openmp.c -o build/temp.linux-x86_64-cpython-311/factorial_cython_openmp.o -fopenmp
gcc -shared -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -O3 -fPIC -march=znver2 build/temp.linux-x86_64-cpython-311/factorial_cython_openmp.o 
-L/apps/all/Python/3.11.5-GCCcore-13.2.0/lib -o build/lib.linux-x86_64-cpython-311/factorial_cython_openmp.cpython-311-x86_64-linux-gnu.so -fopenmp
copying build/lib.linux-x86_64-cpython-311/factorial_cython_openmp.cpython-311-x86_64-linux-gnu.so -> 
from timeit import timeit

t1 = timeit("factorial_python(100)",
            setup = "from factorial_python import factorial_python",
            number = 10_000,
           )
t2 = timeit("factorial_cython_openmp(100)",
            setup = "from factorial_cython_openmp import factorial_cython_openmp",
            number = 10_000,
           )

print(f"Pure Python: {t1:.4f}")
print(f"Cython with OpenMP: {t2:.4f}")

print(f"\nCython with OpenMP is {(t1 / t2):.4f}x faster than pure Python")
Pure Python: 0.1048
Cython with OpenMP: 0.7541

Cython with OpenMP is 0.1390x faster than pure Python

The code runs about 7 times slower with OpenMP. With only 100 iterations, each doing a single multiplication, the overhead of starting and synchronizing threads far exceeds the work being parallelized.
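
Conceptually, a parallel product is a reduction: each worker multiplies its own chunk of the range, and the partial products are combined at the end. The sequential sketch below illustrates that decomposition in plain Python; it is not how OpenMP is implemented, and chunked_factorial and n_chunks are hypothetical names.

```python
import math


def chunked_factorial(x, n_chunks=4):
    # Split 1..x into n_chunks contiguous ranges, as a static schedule might.
    bounds = [1 + (x * k) // n_chunks for k in range(n_chunks + 1)]
    # Each "worker" computes the product of its own range ...
    partials = [math.prod(range(lo, hi)) for lo, hi in zip(bounds, bounds[1:])]
    # ... and the partial products are combined in a final reduction step.
    return math.prod(partials)


assert chunked_factorial(100) == math.factorial(100)
print(chunked_factorial(10))  # 3628800
```

With 100 factors and four chunks, each worker performs only 25 multiplications, which is very little work compared with the cost of coordinating threads.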

Cython is a superset of Python, with additional functionality for defining C types and calling C functions

Cython generates C wrapper code, which is compiled into a Python extension module

Major advantage: enables incremental code optimization

Type annotations are used to declare C variables:

import cython as C

i: C.int
j: C.int
f: C.float
float_array: C.float[42]
float_ptr: C.pointer(C.float)

Cython function definitions

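There are three kinds of Cython function definitions: def (a regular Python function, callable from Python), cdef (a C-level function that is not visible from Python) and cpdef (a hybrid that can be called from Python and, with low overhead, from Cython code).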

Cython optimises based on type definitions

If no type is specified for a variable, parameter or return type, it defaults to a Python object

The standard Python for-loop is used in Cython:

i: C.int
n: C.int
for i in range(n):
    ...

If i is declared as an integer (with i: C.int), this will be optimized into a standard C loop.

A Cython example

Approximate the integral of a general function f(x)

!pip install matplotlib
%matplotlib inline    
import matplotlib.pyplot as plt
import numpy as np
def f(x):
    return np.sin(x)
x = np.linspace(-2*np.pi, 2*np.pi, 100)
y = np.sin(x)
plt.plot(x,y);
plt.bar(x, y, width=0.1, align='center', alpha=0.3, color='orange', edgecolor='black');
[Figure: sin(x) on the interval from -2π to 2π, overlaid with orange bars of width 0.1 at the sample points]

Numerical integration: accuracy increases with number of intervals

Speed is not a problem in 1D, but may be critical in 3D
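
The effect of the number of intervals can be checked on an integral with a known value. The sketch below applies a left Riemann sum to sin(x) on [0, π], whose exact integral is 2; the choice of test integral and the name riemann_sum are assumptions made for illustration.

```python
from math import pi, sin


def riemann_sum(f, a, b, n):
    # Left Riemann sum with n rectangles of equal width.
    dx = (b - a) / n
    return sum(f(a + i * dx) for i in range(n)) * dx


exact = 2.0  # integral of sin(x) from 0 to pi
errors = [abs(riemann_sum(sin, 0.0, pi, n) - exact) for n in (10, 100, 1000)]

# The error shrinks as the number of intervals grows.
for n, err in zip((10, 100, 1000), errors):
    print(f"N = {n:5d}  error = {err:.2e}")
```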

%%writefile integral_cal.py
from math import sin


def f(x):
    return sin(x**2)


def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

N = 8_000_000
Overwriting integral_cal.py

This Python script numerically approximates the definite integral of sin(x²) between the limits a and b using the rectangle method (a Riemann sum). It defines f(x) = sin(x²) and implements the integration in integrate_f(a, b, N), which divides the interval into N rectangles of width dx = (b - a)/N, sums the heights f(a + i*dx), and multiplies the sum by dx to obtain the total area. N is set to 8,000,000 for high precision when integrating from 0 to 2; this integral is a Fresnel-type integral, which has no elementary closed form and is important in wave optics. The implementation uses an explicit Python loop without vectorization, which makes it computationally expensive and therefore a good candidate for optimization with Cython: adding type annotations and compiling to C can improve performance considerably while leaving the numerical result unchanged.

import integral_cal
N = 8_000_000
tr = %timeit -o integral_cal.integrate_f(0, 2, N)
2.88 s ± 36.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Integration takes around 3 seconds with N=8_000_000.

The pure Python implementation takes 2.88 s ± 36.8 ms to compute ∫₀² sin(x²) dx with 8 million rectangles, which reflects the interpreter overhead of executing a tight numerical loop. This baseline leaves considerable room for optimization: with static types, Cython can remove dynamic type checks, reduce function call overhead, and compile the arithmetic to native code. Speedups in the range of 10–100x are often reported for this kind of numerical workload, although the actual gain depends on how much of the code is typed. The mathematical result stays the same; only the way the loop is executed changes.
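
%timeit is an IPython magic and is not available in a plain Python script. A rough equivalent can be built with timeit.repeat from the standard library, as sketched below; the smaller N and the helper name time_call are assumptions chosen so that the example finishes quickly.

```python
import statistics
import timeit
from math import sin


def f(x):
    return sin(x**2)


def integrate_f(a, b, N):
    # Same rectangle-rule loop as in integral_cal.py.
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


def time_call(func, *args, repeat=5):
    # Run func(*args) once per repetition and collect wall-clock times.
    times = timeit.repeat(lambda: func(*args), number=1, repeat=repeat)
    return statistics.mean(times), statistics.stdev(times)


mean, std = time_call(integrate_f, 0, 2, 100_000)
print(f"{mean:.4f} s ± {std:.4f} s per loop (mean ± std. dev. of 5 runs)")
```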

Compiling with setuptools

Basic Cython Version (Compilation with setuptools)

The “Basic Cython Version (Compilation with setuptools)” approach involves writing Cython code in a .pyx file and compiling it using Python’s setuptools. This method allows developers to boost performance by converting Python-like syntax into optimized C code, making it especially useful for computationally intensive tasks.

%%writefile integrate_v1.pyx
from math import sin

def f(x):
    return sin(x**2)

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx
Writing integrate_v1.pyx

This script demonstrates a basic use of Cython to speed up numerical code by writing a .pyx file for compilation. It highlights how Cython can improve performance in loop-heavy functions, offering a faster alternative to pure Python for numerical tasks.

%%writefile setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("integrate_v1.pyx"),
)
Overwriting setup.py

This setup.py script uses setuptools and Cython to compile the integrate_v1.pyx file written above. It configures the build process so that the Cython code is converted into a C extension module for improved performance.

Now compile the module with:

!python setup.py build_ext --inplace
Compiling integrate_cython.pyx because it changed.
[1/1] Cythonizing integrate_cython.pyx
/apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages/Cython/Compiler/Main.py:381: FutureWarning: Cython directive 'language_level' not set, using '3str' for now (Py3). This has changed from earlier releases! File: /home/gth/it4i_course/D2_01_Cython/integrate_cython.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
running build_ext
building 'integrate_cython' extension
gcc -O3 -fPIC -march=znver2 -fPIC -I/apps/all/Python/3.11.5-GCCcore-13.2.0/include/python3.11 -c integrate_cython.c -o build/temp.linux-x86_64-cpython-311/integrate_cython.o
gcc -shared -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -O3 -fPIC -march=znver2 build/temp.linux-x86_64-cpython-311/integrate_cython.o -L/apps/all/Python/3.11.5-GCCcore-13.2.0/lib -o 
build/lib.linux-x86_64-cpython-311/integrate_cython.cpython-311-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-cpython-311/integrate_cython.cpython-311-x86_64-linux-gnu.so -> 

Running !python setup.py build_ext --inplace compiles the Cython .pyx file into a shared object (.so) file and copies it into the current directory. The extension module can then be imported like a regular Python module, which is convenient for testing and benchmarking Cython code.
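The long file name in the build log (for example integrate_v1.cpython-311-x86_64-linux-gnu.so) comes from the interpreter's extension suffix. As a small sketch, the standard library can report which suffix the current interpreter uses; the module name integrate_v1 is taken from the import below, and the exact suffix depends on your platform and Python version.

```python
import sysconfig
from importlib.machinery import EXTENSION_SUFFIXES

# Suffix the current interpreter gives to compiled extension modules,
# e.g. ".cpython-311-x86_64-linux-gnu.so" on the Linux build shown above.
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")

# File name an in-place build of integrate_v1.pyx would be expected to produce.
expected_file = "integrate_v1" + ext_suffix

print(expected_file)
print("All suffixes the import system accepts:", EXTENSION_SUFFIXES)
```

If the import fails after a build, checking that a file with this name exists in the working directory is a reasonable first step.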

import integrate_v1
N = 8_000_000
%timeit integrate_v1.integrate_f(0, 2, N)
2.27 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This benchmark shows a modest gain from compiling unchanged Python code with Cython. The compiled version of the integration function completes in approximately 2.27 seconds, compared to 2.88 seconds for the pure Python version, a reduction of about 21% in execution time. The gain is achieved without changing the core algorithm, simply by compiling the code through a setup.py script using setuptools and cythonize. Because no types are declared yet, the generated C code still operates on Python objects, which limits the speedup.
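Since every version keeps the same algorithm, it can help to check that a compiled variant still returns the same number as plain Python. The following is a minimal sketch; integrate_f_reference and same_result are hypothetical names introduced here for illustration, not part of the course material.

```python
from math import sin


def f(x):
    # Same integrand as in the .pyx files
    return sin(x**2)


def integrate_f_reference(a, b, N):
    # Left Riemann sum, mirroring the loop in integrate_f
    s = 0.0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


def same_result(candidate, a=0.0, b=2.0, N=10_000, tol=1e-9):
    # Hypothetical helper: compare any integrate_f variant with the reference
    return abs(candidate(a, b, N) - integrate_f_reference(a, b, N)) <= tol


# Trivial self-check; with a compiled module one would pass
# integrate_v1.integrate_f instead.
print(same_result(integrate_f_reference))
```

A small tolerance is used because the compiled versions may round intermediate results slightly differently.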

Adding Types Version

The “Adding Types Version” of the script introduces static type declarations to the Cython code to further improve performance. By explicitly specifying data types for function parameters and local variables, Cython can generate more optimized C code, because typed values are stored as plain C numbers instead of Python objects. Type declarations also document what each variable is expected to hold. In this version, the accumulator, the step size, and the loop counter are declared as C floating-point numbers and integers, which reduces overhead in the numerically intensive loop.

%%writefile integrate_v2.pyx
# Version with added type declarations
from math import sin

def f(double x):
    return sin(x**2)

def integrate_f(double a, double b, int N):
    cdef double s = 0
    cdef double dx = (b - a) / N
    cdef int i
    for i in range(N):
        s += f(a + i * dx)
    return s * dx
Writing integrate_v2.pyx

This script, integrate_v2.pyx, introduces type declarations to the Cython code for improved performance. It specifies the types of variables and function parameters using double for floating-point values and int for integers. The cdef keyword is used to declare variables with specific types, ensuring that Cython can optimize the code for faster execution. By explicitly defining types for s, dx, and the loop counter i, Cython can generate more efficient C code, reducing overhead and speeding up numerical operations. This version leverages Cython’s static typing to further enhance the performance of the integration function.
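One consequence of declaring a variable as a C integer is that it no longer has Python's unlimited range. The sketch below uses ctypes purely to illustrate a fixed 32-bit width, assuming a platform where the C int type is 32 bits wide (common, but not guaranteed); it is not how Cython itself stores values.

```python
import ctypes

INT32_MAX = 2**31 - 1

# A Python int simply grows as needed.
python_value = INT32_MAX + 1

# ctypes performs no overflow checking, so the same value stored in a
# 32-bit signed integer wraps around to the most negative number.
wrapped_value = ctypes.c_int32(INT32_MAX + 1).value

print(python_value)
print(wrapped_value)
```

With N = 8_000_000 the loop counter stays far below this limit, but for much larger iteration counts a wider type such as long or Py_ssize_t may be the safer choice.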

%%writefile setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("integrate_v2.pyx"),
)
Overwriting setup.py

This setup.py script compiles the integrate_v2.pyx file using setuptools and cythonize, generating a C extension for the type-annotated Cython code, enabling faster execution.

!python setup.py build_ext --inplace
Compiling integrate_v2.pyx because it changed.
[1/1] Cythonizing integrate_v2.pyx
/apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages/Cython/Compiler/Main.py:381: FutureWarning: Cython directive 'language_level' not set, using '3str' for now (Py3). This has changed from earlier releases! File: /home/gth/it4i_course/D2_01_Cython/integrate_v2.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
running build_ext
building 'integrate_v2' extension
gcc -O3 -fPIC -march=znver2 -fPIC -I/apps/all/Python/3.11.5-GCCcore-13.2.0/include/python3.11 -c integrate_v2.c -o build/temp.linux-x86_64-cpython-311/integrate_v2.o
gcc -shared -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -O3 -fPIC -march=znver2 build/temp.linux-x86_64-cpython-311/integrate_v2.o -L/apps/all/Python/3.11.5-GCCcore-13.2.0/lib -o 
build/lib.linux-x86_64-cpython-311/integrate_v2.cpython-311-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-cpython-311/integrate_v2.cpython-311-x86_64-linux-gnu.so -> 
import integrate_v2
N = 8_000_000
%timeit integrate_v2.integrate_f(0, 2, N)
1.12 s ± 6.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With the added type declarations in integrate_v2.pyx, the function runs in 1.12 seconds on average, significantly faster than both the pure Python version (2.88 seconds) and the basic Cython version (2.27 seconds). This demonstrates the performance benefits of using type annotations in Cython, optimizing the function for even quicker execution.

### Fully Typed Version with C Math Library

The Fully Typed Version with C Math Library takes the Cython script to the next level by incorporating explicit type declarations for all variables and function parameters while also using the C math library for faster mathematical operations. By specifying types such as double for floating-point numbers and int for integers, Cython can generate highly optimized C code, resulting in a significant performance boost. Additionally, leveraging the C math library allows direct access to optimized C functions like sin, further accelerating computations. This version maximizes the efficiency of the numerical integration process by minimizing Python overhead and fully exploiting C-level optimizations.

%%writefile integrate_v3.pyx

cimport cython
from libc.math cimport sin

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double f(double x) nogil:
    return sin(x**2)

def integrate_f(double a, double b, int N):
    cdef:
        double s = 0.0
        double dx = (b - a) / N
        int i
        double x
    
    for i in range(N):
        x = a + i * dx
        s += f(x)
    return s * dx
Overwriting integrate_v3.pyx

This integrate_v3.pyx script represents a fully optimized version of the integration function. It incorporates several key performance improvements:

  1. Cython Typing: All variables and function parameters are explicitly typed (e.g., double for floating-point numbers, int for integers), enabling Cython to generate highly efficient C code.

  2. C Math Library: The script uses the C math library (libc.math), importing sin directly for faster mathematical computation.

  3. Optimization Annotations: The @cython.boundscheck(False) and @cython.wraparound(False) decorators disable runtime bounds checking and negative-index handling. They only affect indexing operations, so in this script, which indexes no arrays, they serve mainly as a template for array-based code.

These changes combine to deliver a highly optimized and efficient version of the integration function, ideal for performance-critical numerical tasks.
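To make the two directives in item 3 more concrete, here is a plain-Python sketch of the behaviours they switch off when Cython indexes typed buffers. The list values is a hypothetical example; with the checks disabled, Cython would no longer provide these guarantees for the decorated function.

```python
values = [10.0, 20.0, 30.0]

# Wraparound: a negative index counts from the end of the sequence.
last = values[-1]

# Bounds checking: an out-of-range index raises an error
# instead of reading memory outside the sequence.
try:
    values[3]
    out_of_range_detected = False
except IndexError:
    out_of_range_detected = True

print(last, out_of_range_detected)
```

Disabling the checks is only safe when the loop indices are known to stay within range and to be non-negative.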

%%writefile setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize("integrate_v3.pyx"),
)
Overwriting setup.py

This setup.py script compiles the integrate_v3.pyx file using setuptools and cythonize. It generates a C extension for the fully optimized Cython code, ensuring all performance enhancements are applied.

!python setup.py build_ext --inplace
Compiling integrate_v1.pyx because it changed.
[1/1] Cythonizing integrate_v1.pyx
/apps/all/Python-bundle-PyPI/2023.10-GCCcore-13.2.0/lib/python3.11/site-packages/Cython/Compiler/Main.py:381: FutureWarning: Cython directive 'language_level' not set, using '3str' for now (Py3). This has changed from earlier releases! File: /home/gth/it4i_course/D2_01_Cython/integrate_v1.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
running build_ext
building 'integrate_v1' extension
gcc -O3 -fPIC -march=znver2 -fPIC -I/apps/all/Python/3.11.5-GCCcore-13.2.0/include/python3.11 -c integrate_v1.c -o build/temp.linux-x86_64-cpython-311/integrate_v1.o
gcc -shared -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -O3 -fPIC -march=znver2 build/temp.linux-x86_64-cpython-311/integrate_v1.o -L/apps/all/Python/3.11.5-GCCcore-13.2.0/lib -o 
build/lib.linux-x86_64-cpython-311/integrate_v1.cpython-311-x86_64-linux-gnu.so
copying build/lib.linux-x86_64-cpython-311/integrate_v1.cpython-311-x86_64-linux-gnu.so -> 
copying build/lib.linux-x86_64-cpython-311/integrate_v2.cpython-311-x86_64-linux-gnu.so -> 
copying build/lib.linux-x86_64-cpython-311/integrate_v3.cpython-311-x86_64-linux-gnu.so -> 
import integrate_v3
N = 8_000_000
%timeit integrate_v3.integrate_f(0, 2, N)
131 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

With the fully typed version in integrate_v3.pyx, the function runs in just 131 milliseconds on average, a dramatic improvement over the previous versions. This shows the effect of full static typing combined with the C math library: f is now a cdef function that calls the C sin directly, so the loop body runs without Python-level calls or object creation.
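The timings reported so far can be put side by side as speedup factors. This is only arithmetic on the numbers quoted above; the labels are informal names chosen for this sketch.

```python
# Timings reported above, in seconds per call (N = 8_000_000)
timings = {
    "pure Python": 2.88,
    "Cython, untyped (v1)": 2.27,
    "Cython, typed (v2)": 1.12,
    "Cython, typed + libc sin (v3)": 0.131,
}

baseline = timings["pure Python"]
for name, seconds in timings.items():
    speedup = baseline / seconds
    print(f"{name:32s} {seconds:6.3f} s  {speedup:5.1f}x")
```

The absolute numbers depend on the machine and compiler flags, so the ratios are more informative than the raw times.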

"""
%%writefile setup.py
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules = cythonize([
        "integrate_v1.pyx",
        "integrate_v2.pyx", 
        "integrate_v3.pyx"
    ]),
)
"""

%%cython

The %%cython magic command in Jupyter notebooks is used to compile Cython code directly within a notebook cell. When this command is placed at the beginning of a cell, it allows you to write Cython code, which is then automatically compiled into a C extension. This eliminates the need for creating separate .pyx files and running setup scripts manually. It’s a convenient way to prototype and test Cython code in an interactive environment like Jupyter.

#### Fully Typed Version with %%cython

%%cython
cimport cython
from libc.math cimport sin

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double f(double x) nogil:
    return sin(x**2)

def integrate_f(double a, double b, int N):
    cdef:
        double s = 0.0
        double dx = (b - a) / N
        int i
        double x
    
    for i in range(N):
        x = a + i * dx
        s += f(x)
    return s * dx
N = 8_000_000
%timeit integrate_f(0, 2, N)
131 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

When you use %%cython in a Jupyter notebook, no separate setup.py file or manual compilation step is required. The %%cython magic command compiles the Cython code directly within the notebook cell, so you can write and run Cython code immediately without external build scripts or additional commands.

Cython and NumPy

Cython works seamlessly with NumPy arrays, providing a powerful way to accelerate numerical operations. By explicitly typing NumPy arrays, Cython can leverage efficient C-level operations for element-wise manipulations, significantly speeding up performance compared to pure Python. With the ability to directly interact with NumPy’s C API, Cython enables low-level access to array data, allowing for faster loops, optimized memory access, and improved overall performance. This integration is particularly beneficial for computationally intensive tasks like matrix operations, numerical solvers, and data analysis, making Cython a valuable tool for accelerating scientific computing workflows that rely on NumPy.

Example: Apply sin to all numbers in an array:

from math import sin

import numpy as np


def apply_sin(a):
    out = np.empty_like(a)

    for i in range(len(a)):
        out[i] = sin(a[i])

    return out
a = np.linspace(0, 10, 1_000_000, dtype=np.double)
tr_sin = %timeit -o apply_sin(a)
322 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This script applies the sine function to a large NumPy array a using a Python loop. The apply_sin function iterates over each element of the array, applying the sin function from the math module and storing the result in a new array out. The performance of the function is benchmarked with %timeit, showing an execution time of approximately 322 milliseconds per loop for processing 1 million elements. While this approach works correctly, it is relatively slow because every element passes through the Python interpreter; Cython or vectorized NumPy operations can do the same work much faster.

Declaring numpy data types

Cython uses typed memoryviews to generate efficient C code for working with NumPy arrays. A memoryview gives direct, typed access to the underlying memory of the array, which avoids the overhead of Python's dynamic typing and lets element-wise loops run at near C speed. Below is a translation table that maps NumPy data types to their corresponding Cython types.

from cython import int_types

NumPy vs Cython Data Types

| NumPy Data Type | Cython Equivalent |
|---|---|
| numpy.uint8 | cython.cimports.libc.stdint.uint8_t |
| numpy.int16 | cython.cimports.libc.stdint.int16_t |
| numpy.single | cython.float |
| numpy.double | cython.double |
| numpy.complex128 | cython.complex |

  • Defining a new numpy array in Cython:

import numpy as np
from cython import double

# Declare out as a one-dimensional typed memoryview of doubles
out: double[:]
out = np.zeros(1000, dtype=np.double)

This code defines out as a Cython typed memoryview of double[:], which represents a one-dimensional array of double values. The array is then initialized with NumPy’s zeros function, creating a 1000-element array of type np.double. This approach combines Cython’s efficient memoryview access with NumPy’s array creation for faster computation.
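Typed memoryviews rely on Python's buffer protocol, the same mechanism behind the built-in memoryview object. As a rough standard-library analogy (not Cython code), the sketch below views a buffer of C doubles without copying it; array("d", ...) stands in for a NumPy float64 array here.

```python
from array import array

# A contiguous buffer of C doubles, standing in for a NumPy float64 array
data = array("d", [0.0] * 5)

# memoryview exposes the same memory without copying it
view = memoryview(data)
print(view.format, view.itemsize, view.shape)

# Writing through the view changes the underlying buffer
view[2] = 1.5
print(data[2])
```

A Cython typed memoryview behaves similarly with respect to sharing memory, with the added benefit that element access compiles to direct C indexing.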

  • Declaring numpy data types

Below is a fully typed version of the apply_sin function:

%%cython
import numpy as np
cimport cython
from libc.math cimport sin

@cython.boundscheck(False)  # Disable bounds checking
@cython.wraparound(False)   # Disable negative indexing
def apply_sin(double[:] a):
    cdef int i, n = a.shape[0]
    cdef double[:] out = np.empty(n, dtype=np.float64)
    
    for i in range(n):
        out[i] = sin(a[i])
    
    return np.asarray(out)
building '_cython_magic_bbeb82e0ead0c9a51c0b5059f92ed523161d7f8c' extension
gcc -O3 -fPIC -march=znver2 -fPIC -I/home/gth/.local/lib/python3.11/site-packages/numpy/_core/include -I/apps/all/Python/3.11.5-GCCcore-13.2.0/include/python3.11 -c /home/gth/.cache/ipython/cython/_cython_magic_bbeb82e0ead0c9a51c0b5059f92ed523161d7f8c.c -o /home/gth/.cache/ipython/cython/home/gth/.cache/ipython/cython/_cython_magic_bbeb82e0ead0c9a51c0b5059f92ed523161d7f8c.o
gcc -shared -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -L/apps/all/OpenSSL/1.1/lib64 -L/apps/all/OpenSSL/1.1/lib -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib64 -L/apps/all/libffi/3.4.4-GCCcore-13.2.0/lib -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib64 -L/apps/all/XZ/5.4.4-GCCcore-13.2.0/lib -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib64 -L/apps/all/SQLite/3.43.1-GCCcore-13.2.0/lib -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib64 -L/apps/all/ncurses/6.4-GCCcore-13.2.0/lib -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib64 -L/apps/all/libreadline/8.2-GCCcore-13.2.0/lib -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib64 -L/apps/all/zlib/1.2.13-GCCcore-13.2.0/lib -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib64 -L/apps/all/bzip2/1.0.8-GCCcore-13.2.0/lib -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib64 -L/apps/all/binutils/2.40-GCCcore-13.2.0/lib -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib64 -L/apps/all/pkgconf/2.0.3-GCCcore-13.2.0/lib -L/apps/all/GCCcore/13.2.0/lib64 -L/apps/all/GCCcore/13.2.0/lib -O3 -fPIC -march=znver2 
/home/gth/.cache/ipython/cython/home/gth/.cache/ipython/cython/_cython_magic_bbeb82e0ead0c9a51c0b5059f92ed523161d7f8c.o -L/apps/all/Python/3.11.5-GCCcore-13.2.0/lib -o /home/gth/.cache/ipython/cython/_cython_magic_bbeb82e0ead0c9a51c0b5059f92ed523161d7f8c.cpython-311-x86_64-linux-gnu.so

This %%cython cell defines a Cython function apply_sin that applies the sine function to a NumPy array a using typed memoryviews. The function accepts a one-dimensional buffer of doubles (double[:]), allocates an uninitialized output array out of the same length, and iterates over the elements, applying sin from the C math library to each one. The result is converted back to a NumPy array with np.asarray before it is returned. The use of Cython's type declarations and memoryviews ensures efficient handling of the array, providing a significant performance improvement over a pure Python implementation.

  • Now using the Cython memoryview API

import numpy as np
a = np.random.rand(1_000_000) 
%timeit apply_sin(a)
16.7 ms ± 4.75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The Cython version using typed memoryviews and np.float64 arrays runs in just 16.7 ms, compared to 322 ms for the pure Python version, which is about 19 times faster. This speedup comes from Cython's typed access to the NumPy data and from calling the C sin function directly inside the loop.

Cython summary

Cython pros and cons

[+] Allows incremental optimization, easy access to C libraries, generated C code more compact and readable than SWIG's, active developer community, advanced and flexible

[+] Pure Python syntax (requires Cython 3.0)

[-] Less suitable than e.g. pybind11 for wrapping large libraries to Python modules, fully optimized code not as readable as Python

Should be considered (maybe as a first choice?) for mixing Python with C

Exercise:

Time the following script, then use Cython to optimize it.

def calculate_z(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        # Iterate until the point escapes (|z| >= 2) or maxiter is reached
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
    return output

The calculate_z function helps generate a Julia set fractal, which is a complex and beautiful mathematical image formed by repeating a simple rule over complex numbers.
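As a possible starting point for the timing part of the exercise, the sketch below builds some input data and measures one call. The helper make_inputs, the grid bounds, the grid width, the constant c, and maxiter = 50 are all assumptions chosen for illustration; the exercise itself does not specify them.

```python
import time


def calculate_z(maxiter, zs, cs):
    """Calculate output list using Julia update rule"""
    output = [0] * len(zs)
    for i in range(len(zs)):
        n = 0
        z = zs[i]
        c = cs[i]
        while n < maxiter and abs(z) < 2:
            z = z * z + c
            n += 1
        output[i] = n
    return output


def make_inputs(width, c):
    # Hypothetical helper: width x width grid of points in [-1.8, 1.8]^2,
    # each paired with the same constant c
    step = 3.6 / (width - 1)
    zs = [complex(-1.8 + ix * step, -1.8 + iy * step)
          for iy in range(width) for ix in range(width)]
    cs = [c] * len(zs)
    return zs, cs


zs, cs = make_inputs(width=100, c=complex(-0.62772, -0.42193))
start = time.perf_counter()
output = calculate_z(50, zs, cs)
elapsed = time.perf_counter() - start
print(f"{len(output)} points in {elapsed:.3f} s")
```

In a notebook, %timeit calculate_z(50, zs, cs) gives a more reliable measurement than a single perf_counter reading, and the same inputs can then be reused to time a Cython version.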