From 9bcbcb24cc3820e39d1fa46d119c9cad03426586 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jan=20Jan=C3=9Fen?= <janssen@mpie.de>
Date: Thu, 4 Jun 2026 08:55:15 +0200
Subject: [PATCH 1/2] Add FAQ section explaining cores, threads_per_core and
 max_workers

Users frequently confuse the cores, threads_per_core and
max_workers/max_cores parameters since they all control compute
resources but on different levels. Add a dedicated FAQ section to the
trouble shooting docs clarifying that max_workers/max_cores is the
executor-wide core budget, while cores (per call, primarily for mpi4py)
and threads_per_core (per call, OpenMP) describe how each function call
spends a slice of that budget.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/trouble_shooting.md | 41 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/docs/trouble_shooting.md b/docs/trouble_shooting.md
index e71487c9f..15e10659f 100644
--- a/docs/trouble_shooting.md
+++ b/docs/trouble_shooting.md
@@ -55,6 +55,47 @@ Executorlib supports all current Python version ranging from 3.9 to 3.13. Still
 the [flux](http://flux-framework.org) job scheduler are currently limited to Python 3.12 and below. Consequently for high
 performance computing installations Python 3.12 is the recommended Python verion. 
 
+## Cores, Threads per Core and Maximum Workers
+A common point of confusion is the difference between the `cores`, `threads_per_core` and `max_workers` (or `max_cores`)
+parameters, as they all control how many compute resources executorlib uses, but on different levels:
+
+* `max_workers` / `max_cores` are arguments of the `Executor` itself. They define the *total* number of compute cores
+  the executor is allowed to use in parallel across all submitted function calls - essentially the size of the resource
+  pool that all tasks share. `max_workers` exists for backwards compatibility with the
+  [Executor interface](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor) of the
+  Python standard library, while `max_cores` is the recommended way to express the same limit, as it makes clear that the
+  limit refers to the number of compute cores. Setting either is optional - when neither is provided executorlib uses the
+  number of cores available on the machine.
+* `cores` is an entry of the `resource_dict` and is defined *per function call*. It sets how many MPI ranks a single
+  function is executed with. This parameter should primarily be used together with
+  [mpi4py](https://mpi4py.readthedocs.io), because requesting more than one core only speeds up a function if the
+  function itself distributes its work over the requested ranks via MPI. Submitting a plain serial Python function with
+  `cores > 1` does not make it run faster - it simply reserves additional cores that stay idle.
+* `threads_per_core` is also an entry of the `resource_dict` and defined *per function call*. It sets how many OpenMP
+  threads each core is allowed to use, which executorlib exposes through environment variables like `OMP_NUM_THREADS`.
+  Use this for functions that rely on thread based parallelism in linked libraries (e.g. NumPy/BLAS) rather than MPI.
+
+In other words, `max_workers` / `max_cores` is the budget for the whole executor, whereas `cores` and `threads_per_core`
+describe how each individual function call spends a slice of that budget. The total resources used by a single task is
+`cores * threads_per_core`, and executorlib schedules tasks so that the sum of their requested cores never exceeds
+`max_cores`. As an example, an executor created with `max_cores=4` can run two MPI parallel functions submitted with
+`resource_dict={"cores": 2}` at the same time:
+
+```python
+from executorlib import SingleNodeExecutor
+
+def calc_mpi(i):
+    from mpi4py import MPI
+
+    size = MPI.COMM_WORLD.Get_size()
+    rank = MPI.COMM_WORLD.Get_rank()
+    return i, size, rank
+
+with SingleNodeExecutor(max_cores=4) as exe:
+    futures = [exe.submit(calc_mpi, i, resource_dict={"cores": 2}) for i in range(4)]
+    print([f.result() for f in futures])
+```
+
 ## Resource Dictionary
 The resource dictionary parameter `resource_dict` can contain one or more of the following options: 
 * `cores` (int): number of MPI cores to be used for each function call

From 121d38ea8d67e7e7d30afe8901ede025a216cbf7 Mon Sep 17 00:00:00 2001
From: Jan Janssen <jan-janssen@users.noreply.github.com>
Date: Thu, 4 Jun 2026 10:38:58 +0200
Subject: [PATCH 2/2] Update trouble_shooting.md

---
 docs/trouble_shooting.md | 60 +++++++++++++++++++---------------------
 1 file changed, 28 insertions(+), 32 deletions(-)

diff --git a/docs/trouble_shooting.md b/docs/trouble_shooting.md
index 15e10659f..dc321d671 100644
--- a/docs/trouble_shooting.md
+++ b/docs/trouble_shooting.md
@@ -52,8 +52,8 @@ The `coverage combine` command merges the data from the main process and subproc
 
 ## Python Version 
 Executorlib supports all current Python version ranging from 3.9 to 3.13. Still some of the dependencies and especially 
-the [flux](http://flux-framework.org) job scheduler are currently limited to Python 3.12 and below. Consequently for high
-performance computing installations Python 3.12 is the recommended Python verion. 
+the [flux](http://flux-framework.org) job scheduler are currently limited to Python 3.13 and below. Consequently for high
+performance computing installations Python 3.13 is the recommended Python verion. 
 
 ## Cores, Threads per Core and Maximum Workers
 A common point of confusion is the difference between the `cores`, `threads_per_core` and `max_workers` (or `max_cores`)
@@ -61,40 +61,36 @@ parameters, as they all control how many compute resources executorlib uses, but
 
 * `max_workers` / `max_cores` are arguments of the `Executor` itself. They define the *total* number of compute cores
   the executor is allowed to use in parallel across all submitted function calls - essentially the size of the resource
-  pool that all tasks share. `max_workers` exists for backwards compatibility with the
+  pool or allocation that all tasks share. `max_workers` exists for backwards compatibility with the
   [Executor interface](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor) of the
   Python standard library, while `max_cores` is the recommended way to express the same limit, as it makes clear that the
   limit refers to the number of compute cores. Setting either is optional - when neither is provided executorlib uses the
   number of cores available on the machine.
-* `cores` is an entry of the `resource_dict` and is defined *per function call*. It sets how many MPI ranks a single
-  function is executed with. This parameter should primarily be used together with
-  [mpi4py](https://mpi4py.readthedocs.io), because requesting more than one core only speeds up a function if the
-  function itself distributes its work over the requested ranks via MPI. Submitting a plain serial Python function with
-  `cores > 1` does not make it run faster - it simply reserves additional cores that stay idle.
-* `threads_per_core` is also an entry of the `resource_dict` and defined *per function call*. It sets how many OpenMP
-  threads each core is allowed to use, which executorlib exposes through environment variables like `OMP_NUM_THREADS`.
-  Use this for functions that rely on thread based parallelism in linked libraries (e.g. NumPy/BLAS) rather than MPI.
-
-In other words, `max_workers` / `max_cores` is the budget for the whole executor, whereas `cores` and `threads_per_core`
-describe how each individual function call spends a slice of that budget. The total resources used by a single task is
-`cores * threads_per_core`, and executorlib schedules tasks so that the sum of their requested cores never exceeds
-`max_cores`. As an example, an executor created with `max_cores=4` can run two MPI parallel functions submitted with
-`resource_dict={"cores": 2}` at the same time:
-
-```python
-from executorlib import SingleNodeExecutor
-
-def calc_mpi(i):
-    from mpi4py import MPI
-
-    size = MPI.COMM_WORLD.Get_size()
-    rank = MPI.COMM_WORLD.Get_rank()
-    return i, size, rank
-
-with SingleNodeExecutor(max_cores=4) as exe:
-    futures = [exe.submit(calc_mpi, i, resource_dict={"cores": 2}) for i in range(4)]
-    print([f.result() for f in futures])
-```
+* `cores` is an entry of the `resource_dict` and is defined *per function call*. It specifies how many Python processes
+  executorlib starts for a single task. These processes are connected via
+  [mpi4py](https://mpi4py.readthedocs.io) and together form one MPI application. Consequently, `cores` is primarily
+  intended for functions implemented with [mpi4py](https://mpi4py.readthedocs.io), where the same Python function is
+  executed once per MPI rank. For a typical serial Python function, increasing `cores` does **not** provide additional
+  parallelism. Instead, executorlib launches multiple copies of the function, which usually wastes resources and can
+  lead to incorrect behavior. Unless you are using MPI through [mpi4py](https://mpi4py.readthedocs.io), `cores`
+  should generally be left at its default value of `1`.
+* `threads_per_core` is also an entry of the `resource_dict` and defined *per function call*. In contrast to `cores`,
+  executorlib starts only a single Python process for the task and reserves the requested resources for that process.
+  The number of reserved cores is communicated through environment variables such as `OMP_NUM_THREADS`. This parameter
+  should be used whenever the Python function itself is executed only once, but internally uses multiple cores. Common
+  examples include thread-parallel libraries such as NumPy, BLAS, MKL or OpenMP-enabled code, as well as Python
+  functions which launch external applications. In the latter case, executorlib starts a single Python process, which
+  then launches the external application. Whether that external application internally uses OpenMP, MPI or a hybrid
+  MPI/OpenMP parallelization strategy is transparent to executorlib. This functionality is demonstrated in the Quantum
+  ESPRESSO application example.
+
+A useful rule of thumb is:
+
+* Use `cores` when executorlib should start multiple Python processes which together form an MPI application via
+  `mpi4py`.
+* Use `threads_per_core` when executorlib should start the Python function only once and reserve multiple cores for it
+  or for an external application launched by it.
+* Use `max_cores` to limit how many resources all submitted tasks may consume collectively.
 
 ## Resource Dictionary
 The resource dictionary parameter `resource_dict` can contain one or more of the following options: