From 9bcbcb24cc3820e39d1fa46d119c9cad03426586 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jan=20Jan=C3=9Fen?= Date: Thu, 4 Jun 2026 08:55:15 +0200 Subject: [PATCH 1/2] Add FAQ section explaining cores, threads_per_core and max_workers Users frequently confuse the cores, threads_per_core and max_workers/max_cores parameters since they all control compute resources but on different levels. Add a dedicated FAQ section to the trouble shooting docs clarifying that max_workers/max_cores is the executor-wide core budget, while cores (per call, primarily for mpi4py) and threads_per_core (per call, OpenMP) describe how each function call spends a slice of that budget. Co-Authored-By: Claude Opus 4.8 --- docs/trouble_shooting.md | 41 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/docs/trouble_shooting.md b/docs/trouble_shooting.md index e71487c9f..15e10659f 100644 --- a/docs/trouble_shooting.md +++ b/docs/trouble_shooting.md @@ -55,6 +55,47 @@ Executorlib supports all current Python version ranging from 3.9 to 3.13. Still the [flux](http://flux-framework.org) job scheduler are currently limited to Python 3.12 and below. Consequently for high performance computing installations Python 3.12 is the recommended Python verion. +## Cores, Threads per Core and Maximum Workers +A common point of confusion is the difference between the `cores`, `threads_per_core` and `max_workers` (or `max_cores`) +parameters, as they all control how many compute resources executorlib uses, but on different levels: + +* `max_workers` / `max_cores` are arguments of the `Executor` itself. They define the *total* number of compute cores + the executor is allowed to use in parallel across all submitted function calls - essentially the size of the resource + pool that all tasks share. `max_workers` exists for backwards compatibility with the + [Executor interface](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor) of the + Python standard library, while `max_cores` is the recommended way to express the same limit, as it makes clear that the + limit refers to the number of compute cores. Setting either is optional - when neither is provided executorlib uses the + number of cores available on the machine. +* `cores` is an entry of the `resource_dict` and is defined *per function call*. It sets how many MPI ranks a single + function is executed with. This parameter should primarily be used together with + [mpi4py](https://mpi4py.readthedocs.io), because requesting more than one core only speeds up a function if the + function itself distributes its work over the requested ranks via MPI. Submitting a plain serial Python function with + `cores > 1` does not make it run faster - it simply reserves additional cores that stay idle. +* `threads_per_core` is also an entry of the `resource_dict` and defined *per function call*. It sets how many OpenMP + threads each core is allowed to use, which executorlib exposes through environment variables like `OMP_NUM_THREADS`. + Use this for functions that rely on thread based parallelism in linked libraries (e.g. NumPy/BLAS) rather than MPI. + +In other words, `max_workers` / `max_cores` is the budget for the whole executor, whereas `cores` and `threads_per_core` +describe how each individual function call spends a slice of that budget. The total resources used by a single task is +`cores * threads_per_core`, and executorlib schedules tasks so that the sum of their requested cores never exceeds +`max_cores`. As an example, an executor created with `max_cores=4` can run two MPI parallel functions submitted with +`resource_dict={"cores": 2}` at the same time: + +```python +from executorlib import SingleNodeExecutor + +def calc_mpi(i): + from mpi4py import MPI + + size = MPI.COMM_WORLD.Get_size() + rank = MPI.COMM_WORLD.Get_rank() + return i, size, rank + +with SingleNodeExecutor(max_cores=4) as exe: + futures = [exe.submit(calc_mpi, i, resource_dict={"cores": 2}) for i in range(4)] + print([f.result() for f in futures]) +``` + ## Resource Dictionary The resource dictionary parameter `resource_dict` can contain one or more of the following options: * `cores` (int): number of MPI cores to be used for each function call From 121d38ea8d67e7e7d30afe8901ede025a216cbf7 Mon Sep 17 00:00:00 2001 From: Jan Janssen Date: Thu, 4 Jun 2026 10:38:58 +0200 Subject: [PATCH 2/2] Update trouble_shooting.md --- docs/trouble_shooting.md | 60 +++++++++++++++++++--------------------- 1 file changed, 28 insertions(+), 32 deletions(-) diff --git a/docs/trouble_shooting.md b/docs/trouble_shooting.md index 15e10659f..dc321d671 100644 --- a/docs/trouble_shooting.md +++ b/docs/trouble_shooting.md @@ -52,8 +52,8 @@ The `coverage combine` command merges the data from the main process and subproc ## Python Version Executorlib supports all current Python version ranging from 3.9 to 3.13. Still some of the dependencies and especially -the [flux](http://flux-framework.org) job scheduler are currently limited to Python 3.12 and below. Consequently for high -performance computing installations Python 3.12 is the recommended Python verion. +the [flux](http://flux-framework.org) job scheduler are currently limited to Python 3.13 and below. Consequently for high +performance computing installations Python 3.13 is the recommended Python verion. ## Cores, Threads per Core and Maximum Workers A common point of confusion is the difference between the `cores`, `threads_per_core` and `max_workers` (or `max_cores`) @@ -61,40 +61,36 @@ parameters, as they all control how many compute resources executorlib uses, but * `max_workers` / `max_cores` are arguments of the `Executor` itself. They define the *total* number of compute cores the executor is allowed to use in parallel across all submitted function calls - essentially the size of the resource - pool that all tasks share. `max_workers` exists for backwards compatibility with the + pool or allocation that all tasks share. `max_workers` exists for backwards compatibility with the [Executor interface](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor) of the Python standard library, while `max_cores` is the recommended way to express the same limit, as it makes clear that the limit refers to the number of compute cores. Setting either is optional - when neither is provided executorlib uses the number of cores available on the machine. -* `cores` is an entry of the `resource_dict` and is defined *per function call*. It sets how many MPI ranks a single - function is executed with. This parameter should primarily be used together with - [mpi4py](https://mpi4py.readthedocs.io), because requesting more than one core only speeds up a function if the - function itself distributes its work over the requested ranks via MPI. Submitting a plain serial Python function with - `cores > 1` does not make it run faster - it simply reserves additional cores that stay idle. -* `threads_per_core` is also an entry of the `resource_dict` and defined *per function call*. It sets how many OpenMP - threads each core is allowed to use, which executorlib exposes through environment variables like `OMP_NUM_THREADS`. - Use this for functions that rely on thread based parallelism in linked libraries (e.g. NumPy/BLAS) rather than MPI. - -In other words, `max_workers` / `max_cores` is the budget for the whole executor, whereas `cores` and `threads_per_core` -describe how each individual function call spends a slice of that budget. The total resources used by a single task is -`cores * threads_per_core`, and executorlib schedules tasks so that the sum of their requested cores never exceeds -`max_cores`. As an example, an executor created with `max_cores=4` can run two MPI parallel functions submitted with -`resource_dict={"cores": 2}` at the same time: - -```python -from executorlib import SingleNodeExecutor - -def calc_mpi(i): - from mpi4py import MPI - - size = MPI.COMM_WORLD.Get_size() - rank = MPI.COMM_WORLD.Get_rank() - return i, size, rank - -with SingleNodeExecutor(max_cores=4) as exe: - futures = [exe.submit(calc_mpi, i, resource_dict={"cores": 2}) for i in range(4)] - print([f.result() for f in futures]) -``` +* `cores` is an entry of the `resource_dict` and is defined *per function call*. It specifies how many Python processes + executorlib starts for a single task. These processes are connected via + [mpi4py](https://mpi4py.readthedocs.io) and together form one MPI application. Consequently, `cores` is primarily + intended for functions implemented with [mpi4py](https://mpi4py.readthedocs.io), where the same Python function is + executed once per MPI rank. For a typical serial Python function, increasing `cores` does **not** provide additional + parallelism. Instead, executorlib launches multiple copies of the function, which usually wastes resources and can + lead to incorrect behavior. Unless you are using MPI through [mpi4py](https://mpi4py.readthedocs.io), `cores` + should generally be left at its default value of `1`. +* `threads_per_core` is also an entry of the `resource_dict` and defined *per function call*. In contrast to `cores`, + executorlib starts only a single Python process for the task and reserves the requested resources for that process. + The number of reserved cores is communicated through environment variables such as `OMP_NUM_THREADS`. This parameter + should be used whenever the Python function itself is executed only once, but internally uses multiple cores. Common + examples include thread-parallel libraries such as NumPy, BLAS, MKL or OpenMP-enabled code, as well as Python + functions which launch external applications. In the latter case, executorlib starts a single Python process, which + then launches the external application. Whether that external application internally uses OpenMP, MPI or a hybrid + MPI/OpenMP parallelization strategy is transparent to executorlib. This functionality is demonstrated in the Quantum + ESPRESSO application example. + +A useful rule of thumb is: + +* Use `cores` when executorlib should start multiple Python processes which together form an MPI application via + `mpi4py`. +* Use `threads_per_core` when executorlib should start the Python function only once and reserve multiple cores for it + or for an external application launched by it. +* Use `max_cores` to limit how many resources all submitted tasks may consume collectively. ## Resource Dictionary The resource dictionary parameter `resource_dict` can contain one or more of the following options: