The problem is illustrated by the following script, which works correctly if MKL is used for linear algebra operations:
from numba import njit, prange
from numpy import random, dot, empty
from threadpoolctl import ThreadpoolController

controller = ThreadpoolController()
numba_parallel = True  # or False

@njit(parallel=False)
def internal_dot(a, b):
    return dot(a, b)

@njit(parallel=numba_parallel)
def total_sum(b, c):
    npoints = c.shape[0]
    output = empty((npoints, c.shape[1], b.shape[1]))
    for i in prange(npoints):
        output[i] = internal_dot(c[i], b)
    return output

@controller.wrap(limits=1, user_api='blas')
def safe_total_sum(b, c):
    return total_sum(b, c)

nvecs = 256
dim1 = 256
dim2 = 256
vector = random.random((dim1, dim2))
matrix = random.random((nvecs, dim2, dim1))
_ = total_sum(vector, matrix)
_ = safe_total_sum(vector, matrix)
However, running it with OpenBLAS produces the warning OpenBLAS warning: precompiled NUM_THREADS exceeded, adding auxiliary array for thread metadata, which indicates an oversubscription problem that, in my (much lengthier) use case, crashes the code. I am aware that setting export OPENBLAS_NUM_THREADS=1 solves the issue for this script, but it is not applicable in my use case, since my code calls other NumPy functions elsewhere and needs them parallelized. Using ThreadpoolController does not seem to help. I am also aware of the option of packing everything into a .pkl file and unpacking it in a subprocess with OPENBLAS_NUM_THREADS=1 set, but I'd really prefer to avoid this dirty trick.
Is there a proper Python solution for this problem?
Comments:

OMP_MAX_ACTIVE_LEVELS or OMP_NESTED, and possibly others). That being said, Numba can use a different parallel backend, not just OpenMP. It might use Intel TBB on a machine with an Intel environment set up. IDK if TBB has a similar feature. It would be great to use OpenMP everywhere to avoid mixing parallel libraries/frameworks/APIs (IDK what MKL uses, possibly Intel TBB, or whether you can force Numba to use it either).

You can call omp_set_num_threads and reset it later. One possible downside is that the threads might be re-created for each parallel section, which can be quite expensive for small workloads, but this cannot be avoided with some OpenMP implementations if you change the number of threads dynamically. Note that the above function is a C one from the OpenMP runtime. It can be called from Python using ctypes, cffi, or Cython.

Set NUMBA_THREADING_LAYER='omp' to force OMP. numba.pydata.org/numba-doc/dev/user/…

OMP_MAX_ACTIVE_LEVELS and OMP_NESTED did not work, but I realized that setting NUMBA_NUM_THREADS to the number of CPUs does help, though I am not sure why (I didn't even notice previously that NUMBA_NUM_THREADS was undefined in the environment, because there was no evidence of CPU oversubscription). Once I can test the "problematic" machine again, I will additionally check setting NUMBA_THREADING_LAYER, and then post a summary of possible solutions, either as an answer to the question or as an edit.
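The comment about calling omp_set_num_threads from Python via ctypes can be sketched as a context manager. This is only a sketch under assumptions: it assumes GCC's libgomp runtime (the shared-library name differs per toolchain, e.g. libomp for LLVM or libiomp5 for Intel), and it only affects code that runs on that particular OpenMP runtime; a pthreads build of OpenBLAS would not be reached by it.

```python
import ctypes
import ctypes.util
from contextlib import contextmanager

@contextmanager
def omp_thread_limit(n):
    """Temporarily cap the OpenMP thread count, restoring the old limit after."""
    # Assumption: GCC's libgomp; adjust the name for other toolchains.
    libname = ctypes.util.find_library("gomp") or "libgomp.so.1"
    try:
        omp = ctypes.CDLL(libname)
    except OSError:
        # No OpenMP runtime found: run the body unchanged.
        yield
        return
    omp.omp_set_num_threads.argtypes = [ctypes.c_int]
    omp.omp_get_max_threads.restype = ctypes.c_int
    saved = omp.omp_get_max_threads()  # current nthreads limit
    omp.omp_set_num_threads(n)
    try:
        yield
    finally:
        omp.omp_set_num_threads(saved)  # restore the previous limit
```

Usage would be something like with omp_thread_limit(1): total_sum(vector, matrix), wrapping only the hot section so other NumPy calls elsewhere keep their full thread pool. As the comment notes, some OpenMP implementations re-create worker threads when the count changes, so this can be costly around small workloads.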
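Since NUMBA_NUM_THREADS and NUMBA_THREADING_LAYER came up in the comments, a minimal sketch of setting them from within Python: Numba reads these environment variables at import time, so they must be assigned before the first import numba anywhere in the process (the choice of os.cpu_count() as the cap simply mirrors the value that helped in my tests).

```python
import os

# Numba reads these once at import time, so they must be set before the
# first `import numba` anywhere in the process.
os.environ["NUMBA_NUM_THREADS"] = str(os.cpu_count() or 1)  # cap Numba's pool
os.environ["NUMBA_THREADING_LAYER"] = "omp"  # force the OpenMP backend

# import numba  # only after the variables above are in place
```

Setting the variables in the shell before launching Python achieves the same thing and avoids any import-order pitfalls.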