import numba
# Cython doesn't have to be imported explicitly.
# the pyximport library (provided by Cython) registers an import hook,
# which enables us to import Cython files (.pyx) directly.
import pyximport
# still useful to explicitly import Cython:
import cython
import numpy as np
import matplotlib.pyplot as plt
Numba
Don't have to alter Python code to see a speedup.
Converts pure Python to optimized machine code via LLVM.
Use the @numba.jit decorator - enables JIT compilation into optimized code.
Very useful when Python code is dominated by loops that can't be vectorized.
# loops in Python are notoriously slow due to dynamic typing & interpreter overhead
# let's benchmark a simple algorithm.
def py_sum(data):
    s = 0
    for d in data:
        s += d
    return s
data=np.random.randn(50000)
%timeit py_sum(data)
4.36 ms ± 36.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# verify py_sum returns the same result as NumPy's sum function:
# if no error, we're good.
assert abs(py_sum(data) - np.sum(data)) < 1e-10
# benchmark NumPy's sum function the same way
%timeit np.sum(data)
22.2 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# another example - cumulative sums:
def py_cumsum(data):
    out = np.zeros_like(data)
    s = 0
    for n in range(len(data)):
        s += data[n]
        out[n] = s
    return out

%timeit py_cumsum(data)
10 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.cumsum(data)
139 µs ± 349 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
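As a side note (my own observation, not a claim from this notebook): np.cumsum produces the same running sum as the ufunc method np.add.accumulate, of which it is essentially a thin wrapper. A quick NumPy-only sanity check:

```python
import numpy as np

data = np.random.randn(1000)

# np.cumsum and np.add.accumulate both compute the running sum
assert np.allclose(np.cumsum(data), np.add.accumulate(data))
```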
# activate JIT compilation:
# 1) verify same results
# 2) run benchmark
@numba.jit
def jit_sum(data):
    s = 0
    for d in data:
        s += d
    return s

# jit_sum is ~100X faster than the pure Python version
# (and within ~2X of NumPy's builtin sum).
assert abs(jit_sum(data) - np.sum(data)) < 1e-10

%timeit jit_sum(data)
45.8 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# can also apply the JIT decorator after the fact:
jit_cumsum = numba.jit()(py_cumsum)
assertnp.allclose(np.cumsum(data), jit_cumsum(data))
%timeit jit_cumsum(data)
59.1 µs ± 163 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# example: Julia fractal
# three nested loops - in pure Python, VERY slow.
def py_julia_fractal(z_re, z_im, j):
    for m in range(len(z_re)):
        for n in range(len(z_im)):
            z = z_re[m] + 1j * z_im[n]
            for t in range(256):
                z = z ** 2 - 0.05 + 0.68j
                if np.abs(z) > 2.0:
                    j[m, n] = t
                    break

# JIT optimization options:
# default: graceful fallback to Python if it's not possible to optimize
# nopython=True: JIT compilation fails if Numba can't optimize.
# can use the 'locals' keyword argument to declare variable types used in the function body.
jit_julia_fractal = numba.jit(nopython=True)(py_julia_fractal)
# note how the NumPy arrays are defined outside the function.
# this helps Numba infer the datatypes involved - aids code efficiency for JIT compilation.
N = 1024
j = np.zeros((N, N), np.int64)
z_real = np.linspace(-1.5, 1.5, N)
z_imag = np.linspace(-1.5, 1.5, N)
jit_julia_fractal(z_real, z_imag, j)  # result stored in j
# %timeit py_julia_fractal(z_real, z_imag, j)
# interrupted via keyboard due to execution length
%timeit jit_julia_fractal(z_real, z_imag, j)
245 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
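Since matplotlib was imported at the top, the escape-time counts in j can be rendered with imshow. A minimal, self-contained sketch (my addition: it re-implements the kernel in pure Python on a small grid so it runs standalone; the Agg backend and output filename are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: running outside a notebook)
import matplotlib.pyplot as plt
import numpy as np

def julia_counts(z_re, z_im, j):
    # same escape-time iteration as py_julia_fractal above
    for m in range(len(z_re)):
        for n in range(len(z_im)):
            z = z_re[m] + 1j * z_im[n]
            for t in range(256):
                z = z ** 2 - 0.05 + 0.68j
                if abs(z) > 2.0:
                    j[m, n] = t
                    break

N = 128  # small grid: pure Python is slow
j = np.zeros((N, N), np.int64)
julia_counts(np.linspace(-1.5, 1.5, N), np.linspace(-1.5, 1.5, N), j)

plt.imshow(j, cmap="viridis", extent=[-1.5, 1.5, -1.5, 1.5])
plt.savefig("julia.png")
```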
Numba.vectorize
Generates a vectorized function from a kernel written for scalar inputs/outputs.
Example: Heaviside step function
def py_Heavyside(x):
    if x == 0.0:
        return 0.5
    if x < 0.0:
        return 0.0
    else:
        return 1.0

# only works with scalars. need to iterate to work with arrays:
x = np.linspace(-2, 2, 50001)
%timeit [py_Heavyside(xx) for xx in x]
13 ms ± 89.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# not great. let's use np.vectorize.
np_vec_Heavyside = np.vectorize(py_Heavyside)
np_vec_Heavyside(x)
array([0., 0., 0., ..., 1., 1., 1.])
%timeit np_vec_Heavyside(x)
7.47 ms ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# helps, but not much. next: a NumPy array expression.
def np_Heavyside(x):
    return (x > 0.0) + (x == 0.0) / 2.0

%timeit np_Heavyside(x)
186 µs ± 572 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
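An equivalent array expression can also be written with np.where (this variant is my addition, not part of the original benchmark):

```python
import numpy as np

def np_where_Heavyside(x):
    # 0.5 at x == 0, 0.0 for x < 0, 1.0 for x > 0
    return np.where(x == 0.0, 0.5, np.where(x < 0.0, 0.0, 1.0))

x = np.linspace(-2, 2, 50001)
y = np_where_Heavyside(x)
```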
# best: Numba's vectorize decorator, with two signatures.
@numba.vectorize([numba.float32(numba.float32),
                  numba.float64(numba.float64)])
def jit_Heavyside(x):
    if x == 0.0:
        return 0.5
    if x < 0:
        return 0.0
    else:
        return 1.0

%timeit jit_Heavyside(x)
27.8 µs ± 41.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Cython
Cython's methodology is completely different from Numba's.
Extends Python - provides explicit & static data type declarations.
Enables efficient conversion to C/C++ code for compilation into Python extension modules.
Use cases: code speedup, and building library wrappers.
Uses ahead-of-time (not JIT) compilation.
Requires a Cython-specific compilation pipeline (the cython command-line tool).
!head cy_sum.pyx
# cy_sum.pyx
def cy_sum(data):
s = 0.0
for d in data:
s += d
return s
!head setup.py
# usage: $python setup.py build_ext --inplace
# instructs distutils to build extension module in same directory as source.
from distutils.core import setup
from Cython.Build import cythonize
import numpy as np
setup(ext_modules=cythonize('cy_sum.pyx'),
include_dirs=[np.get_include()],
requires=['Cython','numpy'])
# import the compiled extension module & benchmark it:
from cy_sum import cy_sum

%timeit cy_sum(data)
3.73 ms ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit py_sum(data)
4.29 ms ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Speedup of only ~15%. Nice, but worth the trouble?
Let's try another approach.
!head cy_cumsum.pyx
# cy_cumsum.pyx
cimport numpy
import numpy
def cy_cumsum(data):
    out = numpy.zeros_like(data)
    s = 0
    for n in range(len(data)):
        s += data[n]
        out[n] = s
    return out
Explicit ahead-of-time compilation (as above) is useful for distributing prebuilt Cython modules: users don't need Cython installed to use the extension module.
Alternative: invoke Cython compilation at import time (via the pyximport library).
Add type declarations wherever possible, via Cython's cdef keyword.
%%cython
cimport numpy
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_sum(numpy.ndarray[numpy.float64_t, ndim=1] data):
    cdef numpy.float64_t s = 0.0
    cdef int n, N = len(data)
    for n in range(N):
        s += data[n]
    return s
%timeit cy_sum(data)
45.8 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit py_sum(data)
8.83 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%cython
cimport numpy
import numpy
cimport cython

ctypedef numpy.float64_t FTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_cumsum(numpy.ndarray[FTYPE_t, ndim=1] data):
    cdef int n, N = data.size
    cdef numpy.ndarray[FTYPE_t, ndim=1] out = numpy.zeros(N, dtype=data.dtype)
    cdef numpy.float64_t s = 0.0
    for n in range(N):
        s += data[n]
        out[n] = s
    return out
%timeit cy_cumsum(data)
84.7 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit py_cumsum(data)
18.9 ms ± 38.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now we've got a significant speedup over the previous pure Python cumsum iteration.
Also significantly faster than NumPy's builtin cumsum() function.
Adding explicit data type declarations improves performance, but loses flexibility: the pure Python implementation accepts lists & arrays, integers & floating-point values.
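To make the flexibility point concrete, the untyped py_sum from earlier accepts any iterable of numbers (repeated here so the snippet is self-contained):

```python
import numpy as np

def py_sum(data):
    s = 0
    for d in data:
        s += d
    return s

# works on lists of ints and arrays of floats alike
print(py_sum([1, 2, 3]))                  # → 6
print(py_sum(np.array([1.5, 2.5, 3.0])))  # → 7.0
```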
Cython has a workaround via its ctypedef fused keyword.
%%cython
cimport numpy
cimport cython

ctypedef fused I_OR_F_t:
    numpy.int64_t
    numpy.float64_t

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_fused_sum(numpy.ndarray[I_OR_F_t, ndim=1] data):
    cdef I_OR_F_t s = 0
    cdef int n, N = len(data)
    for n in range(N):
        s += data[n]
    return s
# now we should have code that runs on both floating-point & integer data:
cy_fused_sum(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
15.0
cy_fused_sum(np.array([1, 2, 3, 4, 5]))
15
Final example: the Julia fractal algorithm again.
As before, using data type declarations & the bounds/wraparound-check-disabling decorators.
Now also using Cython's inline keyword to define a small C helper function.
cdef: defines a C-level function, callable only from Cython/C code (no Python-call overhead).
cpdef: generates both a C-level function and a Python wrapper, callable from both.
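The Cython cell below defines an inline helper abs2 that computes |z|² directly from the real and imaginary parts, avoiding the square root implied by abs(z) (which is why it compares against 4.0 rather than 2.0). A pure-Python check of that identity (my own sanity check, not from the original):

```python
def abs2(z):
    # |z|^2 = re*re + im*im -- avoids the square root inside abs()
    return z.real * z.real + z.imag * z.imag

z = 0.3 + 0.4j
# abs2(z) agrees with abs(z)**2 up to floating-point rounding
assert abs(abs2(z) - abs(z) ** 2) < 1e-12
```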
%%cython
cimport numpy
cimport cython

cdef inline double abs2(double complex z):
    return z.real * z.real + z.imag * z.imag

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_julia_fractal(numpy.ndarray[numpy.float64_t, ndim=1] z_re,
                     numpy.ndarray[numpy.float64_t, ndim=1] z_im,
                     numpy.ndarray[numpy.int64_t, ndim=2] j):
    cdef int m, n, t, M = z_re.size, N = z_im.size
    cdef double complex z
    for m in range(M):
        for n in range(N):
            z = z_re[m] + 1.0j * z_im[n]
            for t in range(256):
                z = z ** 2 - 0.05 + 0.68j
                if abs2(z) > 4.0:
                    j[m, n] = t
                    break