benoitc · benoitc · May 1, 2026 · Apr 8, 2026 · May 1, 2026 · May 1, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,23 +1,91 @@
 # Changelog
 
-## 2.4.0 (Unreleased)
+## 3.0.0 (Unreleased)
 
-### Added
+### Breaking Changes
+
+- **Simplified execution model** - Only two public execution modes: `worker` and `owngil`
+  - `worker`: Dedicated pthread per context with stable thread affinity (default)
+  - `owngil`: Dedicated pthread + subinterpreter with own GIL (Python 3.14+)
+  - Removed `multi_executor` and `free_threaded` from public API
+  - Internal capability detection still tracks Python features
+
+- **Removed `py:num_executors/0`** - Contexts now use per-context worker threads
+  instead of a shared executor pool. This function is no longer needed.
+
+- **`py:execution_mode/0` returns `worker | owngil`** - Based on the `context_mode`
+  application configuration. Previously returned internal capabilities like
+  `free_threaded`, `subinterp`, or `multi_executor`.
 
-- **Context thread affinity** - Contexts in MULTI_EXECUTOR mode are now assigned a
-  fixed executor thread at creation. All operations (call, eval, exec) from the same
-  context run on the same OS thread, preventing thread state corruption in libraries
-  like numpy and PyTorch that have thread-local state.
+- **Removed `py:async_stream/3,4`** - Streaming from async generators was never
+  implemented behind the API; calls always returned `{error, stream_not_implemented}`.
+  Use `py:stream_start/3,4` for sync generators; async-generator support may
+  return in a later release.
+
+- **Removed `num_executors` / `num_async_workers` configuration** - Both keys
+  were no-ops after the v3.0 worker rework. Configure context count via
+  `num_contexts` and the rate-limit ceiling via `max_concurrent`.
+
+- **Strict context-mode validation at the NIF boundary** - `py_nif:context_create/1`
+  now returns `{error, {invalid_mode, Atom}}` for anything other than `worker | owngil`.
+  Previously, callers that bypassed `py_context` (notably `py_reactor_context`)
+  silently mapped any unknown atom — including legacy `auto` and `subinterp` —
+  to worker mode. Code that relied on that loophole must pass `worker` (or
+  `owngil`) explicitly.
+
+### Fixed
+
+- **`py:async_call/3,4` + `py:async_await/1,2` round-trip** - Previously the
+  await receive matched `{py_response, _, _}` while the event loop sent
+  `{async_result, _, _}`, causing every async call to silently time out.
+  Async calls now go directly through `py_event_loop:create_task` and
+  `py_event_loop:await`.
+
+- **`py:async_gather/1,2` actually executes** - Reimplemented as concurrent
+  `async_call` submission with sequential `async_await`. Returns
+  `{ok, [Result1, ...]}` on success or `{error, {gather_failed, [{Idx, Reason}, ...]}}`
+  if any call fails. The previous implementation returned `gather_not_implemented`.
 
 ### Changed
 
-- **`py:execution_mode/0` now returns actual mode** - Returns `worker` (default),
-  `owngil`, `free_threaded`, or `multi_executor` based on actual configuration
-  instead of Python capability. Previously returned `subinterp` even when using
-  worker mode.
+- **Per-context worker threads** - Each context now gets its own dedicated pthread
+  that handles all Python operations. This provides stable thread affinity for
+  numpy/torch/tensorflow compatibility without needing a shared executor pool.
+
+- **Async NIF dispatch** - Context operations use async NIFs with message passing
+  instead of blocking dirty schedulers. This improves concurrency under load.
+
+- **Request queue per context** - Replaced single-slot request pattern with proper
+  request queues that support multiple concurrent callers.
+
+- **No global asyncio policy install on Python 3.14+.** `asyncio.set_event_loop_policy`
+  was deprecated in 3.14 and is scheduled for removal in 3.16. The Erlang integration's run path
+  already uses `loop_factory=` (`erlang.run/1`, `asyncio.Runner`) so the global
+  policy was only a convenience for bare `asyncio.run()` inside `py:exec`. We now
+  skip the install on 3.14+ to avoid the deprecation warning. On 3.14+ use
+  `erlang.run(main)` or `asyncio.Runner(loop_factory=erlang.new_event_loop)`
+  explicitly. Behavior on Python 3.9–3.13 is unchanged. `erlang.install()` raises
+  `RuntimeError` on 3.14+ (still emits a `DeprecationWarning` and works on 3.12–3.13).
+
+### Removed
 
-- **Removed obsolete subinterp test references** - Test suites updated to reflect
-  the removal of subinterpreter mode. Tests now use `worker` or `owngil` modes.
+- Multi-executor pool (`g_executors[]`, `multi_executor_start/stop`)
+- `context_dispatch_call/eval/exec` functions (dead code)
+- References to `PY_MODE_MULTI_EXECUTOR` in context operations
+- `py_async_pool` legacy gen_server (unused after async API rewire)
+- **Explicit `py:subinterp_*` handle API removed.** `py:subinterp_create/0`,
+  `subinterp_destroy/1`, `subinterp_call/4,5`, `subinterp_eval/2,3`,
+  `subinterp_exec/2`, `subinterp_cast/4`, `subinterp_async_call/4`,
+  `subinterp_await/1,2`, and `subinterp_pool_*` are all gone. Use
+  `py_context:new(#{mode => owngil})` instead — it gives the same
+  parallelism with OTP supervision and automatic cleanup.
+  `py:subinterp_supported/0` (capability probe) and `py:parallel/1`
+  (which routes through the context API) stay.
+- Internal `py_execution_mode_t` collapsed from 3 values to 2 (`free_threaded`
+  / `gil`); `py_nif:execution_mode/0` returns `free_threaded | gil` instead
+  of the old `free_threaded | subinterp | multi_executor`.
+- `examples/reactor_owngil_example.erl` deleted (called nonexistent
+  `py:subinterp_reactor_*` functions; pre-existing breakage).
 
 ## 2.3.1 (2026-04-01)
 

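The "Per-context worker threads" and "Request queue per context" entries above describe one dedicated pthread per context that drains a queue fed by multiple callers. The following is a minimal sketch of that pattern, not the project's implementation: every name (`context_t`, `request_t`, `context_call`, and so on) is hypothetical, the "work" is a placeholder multiplication where Python would run, and the caller blocks on a condition variable for brevity, whereas the changelog describes async NIFs with message passing.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; not taken from the project. */
typedef struct request {
    int input;               /* stand-in for a call/eval/exec payload */
    int result;
    bool done;
    struct request *next;
} request_t;

typedef struct {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t has_work;   /* signalled on enqueue and on shutdown */
    pthread_cond_t has_result; /* broadcast when a request completes */
    request_t *head, *tail;
    bool shutdown;
} context_t;

/* The single thread that runs every request of one context. */
static void *context_worker(void *arg) {
    context_t *ctx = arg;
    pthread_mutex_lock(&ctx->lock);
    for (;;) {
        while (ctx->head == NULL && !ctx->shutdown)
            pthread_cond_wait(&ctx->has_work, &ctx->lock);
        if (ctx->head == NULL)
            break;                       /* shutdown requested, queue drained */
        request_t *req = ctx->head;      /* pop in FIFO order */
        ctx->head = req->next;
        if (ctx->head == NULL)
            ctx->tail = NULL;
        pthread_mutex_unlock(&ctx->lock);

        int r = req->input * 2;          /* placeholder for the real work */

        pthread_mutex_lock(&ctx->lock);
        req->result = r;
        req->done = true;
        pthread_cond_broadcast(&ctx->has_result);
    }
    pthread_mutex_unlock(&ctx->lock);
    return NULL;
}

/* Returns 0 on success (the pthread_create result). */
static int context_start(context_t *ctx) {
    assert(ctx != NULL);
    ctx->head = ctx->tail = NULL;
    ctx->shutdown = false;
    pthread_mutex_init(&ctx->lock, NULL);
    pthread_cond_init(&ctx->has_work, NULL);
    pthread_cond_init(&ctx->has_result, NULL);
    return pthread_create(&ctx->thread, NULL, context_worker, ctx);
}

/* Enqueue a request and wait for the worker thread to complete it. */
static int context_call(context_t *ctx, int input) {
    request_t req = { .input = input, .result = 0, .done = false, .next = NULL };
    pthread_mutex_lock(&ctx->lock);
    if (ctx->tail != NULL)
        ctx->tail->next = &req;
    else
        ctx->head = &req;
    ctx->tail = &req;
    pthread_cond_signal(&ctx->has_work);
    while (!req.done)
        pthread_cond_wait(&ctx->has_result, &ctx->lock);
    pthread_mutex_unlock(&ctx->lock);
    return req.result;
}

/* Ask the worker to exit once the queue is empty, then release resources. */
static void context_stop(context_t *ctx) {
    pthread_mutex_lock(&ctx->lock);
    ctx->shutdown = true;
    pthread_cond_signal(&ctx->has_work);
    pthread_mutex_unlock(&ctx->lock);
    pthread_join(ctx->thread, NULL);
    pthread_mutex_destroy(&ctx->lock);
    pthread_cond_destroy(&ctx->has_work);
    pthread_cond_destroy(&ctx->has_result);
}
```

In this sketch, any number of threads may call `context_call` on the same `context_t`; all of their requests execute on that context's one thread, which is the affinity property the changelog attributes to worker mode.
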
diff --git a/README.md b/README.md
@@ -16,10 +16,9 @@ evaluate expressions, and stream from generators - all without blocking Erlang
 schedulers.
 
 **Parallelism options:**
-- **Worker mode** (default, recommended) - Works with any Python version. With free-threaded Python (3.13t+), provides true parallelism automatically
-- **SHARED_GIL sub-interpreters** (Python 3.12+) - Isolated namespaces, shared GIL (isolation improves in 3.14+)
-- **OWN_GIL sub-interpreters** (Python 3.14+) - Each interpreter has its own GIL, true parallelism
-- **BEAM processes** - Fan out work across lightweight Erlang processes
+- **Worker mode** (default, recommended) - Works with any Python version. With free-threaded Python (3.13t+), provides true parallelism automatically.
+- **OWN_GIL sub-interpreters** (Python 3.14+) - Each interpreter has its own GIL, true parallelism.
+- **BEAM processes** - Fan out work across lightweight Erlang processes.
 
 Key features:
 - **Process-bound environments** - Each Erlang process gets isolated Python state, enabling OTP-supervised Python actors
@@ -302,14 +301,11 @@ Ref = py:async_call(aiohttp, get, [<<"https://api.example.com/data">>]),
 {ok, Response} = py:async_await(Ref).
 
 %% Gather multiple async calls concurrently
-{ok, Results} = py:async_gather([
+{ok, [Users, Posts, Comments]} = py:async_gather([
     {aiohttp, get, [<<"https://api.example.com/users">>]},
     {aiohttp, get, [<<"https://api.example.com/posts">>]},
     {aiohttp, get, [<<"https://api.example.com/comments">>]}
 ]).
-
-%% Stream from async generators
-{ok, Chunks} = py:async_stream(mymodule, async_generator, [args]).
 ```
 
 ## Parallel Execution with Sub-interpreters
@@ -328,7 +324,7 @@ True parallelism without GIL contention using Python 3.14+ OWN_GIL sub-interpret
 %% Each call runs in its own interpreter with its own GIL
 ```
 
-For Python 3.12/3.13, use SHARED_GIL sub-interpreters (`mode => subinterp`) for namespace isolation, but note that parallelism is limited by the shared GIL.
+The public modes are `worker` (default) and `owngil` (Python 3.14+ only). On Python versions before 3.14, including 3.12/3.13, all contexts run under the shared main interpreter via dedicated worker threads; namespace isolation between contexts is local-dict based, not via subinterpreters.
 
 ## Parallel Processing with BEAM Processes
 
@@ -590,9 +586,9 @@ ok = py:clear_traces().
 %% sys.config
 [
   {erlang_python, [
-    {num_workers, 4},           %% Python worker pool size
-    {max_concurrent, 17},       %% Max concurrent operations (default: schedulers * 2 + 1)
-    {num_executors, 4}          %% Executor threads (multi-executor mode)
+    {num_contexts, 8},          %% Number of contexts (default: schedulers)
+    {context_mode, worker},     %% worker | owngil
+    {max_concurrent, 17}        %% Max concurrent operations (default: schedulers * 2 + 1)
   ]}
 ].
 ```
@@ -605,40 +601,34 @@ When creating Python contexts, you can choose the execution mode:
 
 | Mode | Python Version | Description |
 |------|----------------|-------------|
-| `worker` | Any | Main interpreter, shared namespace (default, recommended) |
-| `subinterp` | 3.12+ | SHARED_GIL sub-interpreter, isolated namespace |
-| `owngil` | 3.14+ | OWN_GIL sub-interpreter, true parallelism |
+| `worker` | Any | Dedicated pthread per context, main interpreter namespace (default) |
+| `owngil` | 3.14+ | Dedicated pthread + subinterpreter with its own GIL, true parallelism |
 
 ```erlang
 %% Default: worker mode (recommended)
 %% With free-threaded Python (3.13t+), provides true parallelism automatically
 {ok, Ctx} = py_context:new(#{}).
 
-%% Explicit subinterpreter with shared GIL (Python 3.12+)
-%% Provides namespace isolation but no parallelism
-{ok, Ctx} = py_context:new(#{mode => subinterp}).
-
 %% OWN_GIL mode for true parallelism (Python 3.14+ required)
 %% Each context runs in its own pthread with independent GIL
 {ok, Ctx} = py_context:new(#{mode => owngil}).
 ```
 
-**Worker mode is recommended** because it works with any Python version and automatically benefits from free-threaded Python (3.13t+) when available.
+**Worker mode is recommended** because it works with any Python version and automatically benefits from free-threaded Python (3.13t+) when available. Each context owns a dedicated pthread, providing stable thread affinity for libraries with thread-local state (numpy, torch, tensorflow).
 
-**Why OWN_GIL requires Python 3.14+**: Some C extensions (e.g., `_decimal`, `numpy`) have global state bugs in sub-interpreters on Python 3.12/3.13. These are fixed in Python 3.14. SHARED_GIL mode works on 3.12+ but with caveats for C extensions with global state.
+**Why OWN_GIL requires Python 3.14+**: Some C extensions (e.g., `_decimal`, `numpy`) have global state bugs in sub-interpreters on Python 3.12/3.13. These are fixed in Python 3.14.
 
 ### Runtime Detection
 
-Check the current execution mode:
+Check the current execution mode (mirrors the `context_mode` application env):
 ```erlang
-py:execution_mode().  %% => free_threaded | subinterp | multi_executor
+py:execution_mode().  %% => worker | owngil
 ```
 
 | Mode | Python Version | Parallelism |
 |------|----------------|-------------|
-| Free-threaded | 3.13+ (nogil) | True parallel, no GIL |
-| Sub-interpreter | 3.12+ | Per-interpreter GIL |
-| Multi-executor | Any | GIL contention |
+| `worker` (default) | Any | One pthread per context; true parallelism on free-threaded 3.13t+ |
+| `owngil` | 3.14+ | Per-interpreter GIL, true parallelism across contexts |
 
 ## Error Handling
 

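The README hunks above reduce the public modes to `worker` and `owngil`, and the changelog states that anything else is now rejected at the NIF boundary instead of being silently mapped to worker mode. A minimal sketch of that kind of strict check is below. It assumes the mode arrives as a C string, and the names `ctx_mode_t` and `parse_context_mode` are hypothetical; the real NIF works on Erlang atoms and returns `{error, {invalid_mode, Atom}}`.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical enum for illustration; not the project's type. */
typedef enum { CTX_MODE_WORKER, CTX_MODE_OWNGIL } ctx_mode_t;

/* Stores the parsed mode in *out and returns true for "worker" or "owngil".
 * Returns false for any other name, including legacy values such as
 * "auto" and "subinterp"; *out is left untouched in that case. */
static bool parse_context_mode(const char *name, ctx_mode_t *out) {
    assert(name != NULL && out != NULL);
    if (strcmp(name, "worker") == 0) {
        *out = CTX_MODE_WORKER;
        return true;
    }
    if (strcmp(name, "owngil") == 0) {
        *out = CTX_MODE_OWNGIL;
        return true;
    }
    return false; /* no silent fallback to worker mode */
}
```

The point of the design is the final `return false`: an unknown mode surfaces as an error at the call site rather than quietly producing a worker context.
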
diff --git a/c_src/py_convert.c b/c_src/py_convert.c
@@ -95,13 +95,19 @@ static void shared_dict_capsule_destructor(PyObject *capsule) {
  * @return true if obj is a numpy ndarray, false otherwise
  */
 static inline bool is_numpy_ndarray(PyObject *obj) {
-    /* Use cached type for fast isinstance check when available.
-     * The cache is only valid in the main interpreter - subinterpreters
-     * have their own object space, so we fall back to attribute detection. */
-    if (g_numpy_ndarray_type != NULL && g_execution_mode != PY_MODE_SUBINTERP) {
+    /* The cache is populated in the main interpreter. On builds that can
+     * create subinterpreters (and are not free-threaded, where subinterpreter
+     * use is bypassed), a context may be running inside a subinterpreter,
+     * where the cached type object is not valid. Fall back to duck typing
+     * in that case. */
+#if defined(HAVE_SUBINTERPRETERS) && !defined(HAVE_FREE_THREADED)
+    /* Build supports subinterpreters and isn't free-threaded:
+     * skip the cached fast path. */
+#else
+    if (g_numpy_ndarray_type != NULL) {
         return PyObject_IsInstance(obj, g_numpy_ndarray_type) == 1;
     }
-
+#endif
     /* Fallback: duck typing via attribute detection.
      * Check for both 'tolist' method and 'ndim' attribute. */
     return PyObject_HasAttrString(obj, "tolist") &&