[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul #611

Draft
Nucs wants to merge 94 commits into master from nditer

Nucs commented Apr 22, 2026

Summary

This PR ports NumPy 2.4.2's nditer machinery to NumSharp (NpyIter), builds a composable expression DSL on top (NpyExpr) with a three-tier custom-op API, wires multi-order memory layout (C/F/A/K) through the entire API surface, and replaces the matmul fallback with stride-native GEMM for all 12 dtypes (eliminating a ~100x slowdown on transposed inputs). It also lands a new Char8 1-byte dtype with 100% Python bytes parity, a trainable MNIST MLP example demonstrating fusion, and fixes for several pre-existing bugs surfaced by battletesting.

Stats: +50,426 / -1,188 across 156 files, 64 commits.

TL;DR

  • NpyIter -- full NumPy nditer port: 32+ APIs, all iteration orders (C/F/A/K with NEGPERM), all indexing modes (MULTI_INDEX / C_INDEX / F_INDEX / RANGE), buffered casting, buffered-reduce double-loop, masking, unlimited operands and dimensions. 566/566 NumPy 2.4.2 parity scenarios pass byte-for-byte.
  • NpyExpr DSL + 3-tier custom-op API (Tier 3A raw IL / Tier 3B element-wise w/ SIMD / Tier 3C expression composition + Call(...) for arbitrary Func / MethodInfo invocation). 50+ ops, full operator overloads, structural cache key, ~5ns delegate dispatch.
  • C/F/A/K memory layout wired through np.copy, np.array, np.asarray, np.asanyarray, np.asfortranarray (new), np.ascontiguousarray (new), *_like, astype, flatten, ravel, reshape, eye, concatenate, vstack, hstack, cumsum, argsort -- plus post-hoc F-contig preservation across the ILKernel dispatchers (41 element-wise layout bugs fixed).
  • Stride-native matmul for all 12 dtypes -- np.dot(x.T, grad) MLP shape: 240 ms -> 1 ms. Removes ~165 lines of dead fallback code.
  • Char8 -- new 1-byte dtype, NumPy S1 / Python bytes(1) equivalent, full Python bytes API parity (battletested against Python oracle).
  • Trainable MNIST MLP example -- fused forward + backward + Adam, 99.89% test accuracy, 100 epochs in ~42s.
  • Bug fixes: NPTypeCode.Char.SizeOf() returned 1 (real=2), IsInf was stubbed to return null, Decimal priority was stale, argsort mishandled non-C-contig input, several NpyExpr IL emission bugs.
  • Test suite: 6,710 passing on net8.0 + net10.0, zero regressions.

Contents

NpyIter -- Full NumPy nditer Port

From-scratch C# port of NumPy 2.4.2's nditer under src/NumSharp.Core/Backends/Iterators/.

Files (new)

| File | Lines | Purpose |
| --- | --- | --- |
| NpyIter.cs | 3,331 | Core iterator engine |
| NpyIter.State.cs | 977 | Dynamic per-operand state allocation |
| NpyIter.Execution.cs | 657 | Iteration execution paths |
| NpyIter.Execution.Custom.cs | 155 | Tier 3A/3B/3C custom-op entry points |
| NpyIterBufferManager.cs | 637 | Buffered iteration with casting |
| NpyIterFlags.cs | 516 | Flag enums (matches NumPy semantics) |
| NpyIterCoalescing.cs | 495 | Axis reorder + coalesce |
| NpyIterCasting.cs | 483 | Casting validation + conversion |
| NpyIterKernels.cs | 263 | Path selection |
| NpyAxisIter.cs | 492 | Axis-based iterator (cumsum, var, std) |
| NpyLogicalReductionKernels.cs | 127 | Generic reduction kernel interfaces |

Capability Matrix

| Capability | Notes |
| --- | --- |
| Iteration orders | C, F, A, K (NEGPERM for negative-stride memory order) |
| Indexing modes | MULTI_INDEX, C_INDEX, F_INDEX, RANGE (parallel chunking) |
| Buffering | Type conversion, full casting rules (no / equiv / safe / same_kind / unsafe) |
| Reduction | op_axes w/ -1 reduction axes, REDUCE_OK, IsFirstVisit, buffered-reduce double-loop including bufferSize < coreSize |
| Multi-operand | Unlimited (NumPy NPY_MAXARGS=64 parity, dynamic alloc) |
| Dimensions | Unlimited (NumSharp divergence; replaces NPY_MAXDIMS=64) |
| Masking | WRITEMASKED + ARRAYMASK w/ reduction safety check |
| APIs ported | Copy, GotoIndex, GotoMultiIndex, RemoveAxis, RemoveMultiIndex, ResetBasePointers, GetMultiIndexFunc, GetInnerFixedStrideArray, GetAxisStrideArray, CreateCompatibleStrides, DebugPrint, GetIterView, IterRange, Iternext, GetValue<T> / SetValue<T>, Finished, Shape, OVERLAP_ASSUME_ELEMENTWISE, TRANSFERFLAGS, reduction-axis encoding, more |
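
A minimal usage sketch of the ported iterator. The NpyIter.Create factory call, flag spelling, and namespace shown are approximations; Iternext, GetValue<T>, and GetMultiIndex are the ported members listed in the matrix, but their exact signatures here are assumptions.

```csharp
using NumSharp;
using NumSharp.Backends.Iterators; // assumed namespace for NpyIter / NpyIterRef

// Iterate a C-contiguous 2x3 array in F-order while tracking the multi-index.
var a = np.arange(6).reshape(2, 3);

using NpyIterRef it = NpyIter.Create(   // factory name and signature assumed
    new[] { a },
    NpyIterFlags.MULTI_INDEX,
    order: 'F');

Span<long> coords = stackalloc long[a.ndim];
do
{
    it.GetMultiIndex(coords);
    Console.WriteLine($"({coords[0]},{coords[1]}) -> {it.GetValue<int>(0)}");
} while (it.Iternext());
// F-order visit sequence for the C-contiguous 2x3 input: 0, 3, 1, 4, 2, 5
```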

Bugs found and fixed during port

  • Negative strides flipped regardless of order -- NumPy only flips when FORCEDORDER is unset (K-order only).
  • NO_BROADCAST flag not enforced -- was skipping operands instead of validating shape match.
  • F_INDEX returned C-order indices -- coalescing reduced NDim=1, losing axis structure. Now disables coalescing for C_INDEX / F_INDEX / MULTI_INDEX.
  • ALLOCATE with null operand threw NRE -- CalculateBroadcastShape accessed op[i].ndim before allocating.
  • op_axes reductions broken -- ApplyOpAxes was re-applying axes to already-correct strides, zeroing non-reduce strides.
  • SetupBufferedReduction produced inverted strides for non-reduce operands; only worked for 2-axis cases.
  • K-order on broadcast / non-contiguous arrays produced wrong order -- fixed by falling back to C-order when stride=0 is present.
  • Reset() desynced ranged iterators with IterStart > 0 -- now delegates to GotoIterIndex(IterStart).
  • CoalesceAxes rejected size-1 axes unless stride==0 -- size-1 axes contribute no iteration and should always absorb.
  • Buffer free-list bug -- Dispose used NativeMemory.Free for AlignedAlloc'd buffers (memory corruption).

NpyExpr DSL + 3-tier Custom-Op API

User-extensible kernel layer built on NpyIter.

Tiers

  • Tier 3A -- ExecuteRawIL(body, key, aux): emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* byteStrides, long count, void*). Full control.
  • Tier 3B -- ExecuteElementWise(scalar, vector, ...): per-element IL + 4x-unrolled SIMD shell + 1-vec remainder + scalar tail + scalar-strided fallback. SIMD when all operand dtypes match and are SIMD-capable.
  • Tier 3C -- ExecuteExpression(expr, inputTypes, outputType): compose NpyExpr trees via operator syntax, Compile() emits IL automatically.
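
As an illustration of Tier 3C, here is a minimal sketch that composes and runs a fused kernel (out = sqrt(a*a + b*b) in a single pass). The NpyExpr.Input / NpyExpr.Sqrt factories and the ExecuteExpression entry point follow the names used in this PR; the exact overload shapes are assumptions.

```csharp
using NumSharp;

var a = np.arange(1024).astype(typeof(float));
var b = np.arange(1024).astype(typeof(float));

// Compose the tree via operator overloads; compilation and dispatch are
// handled by ExecuteExpression and cached on the structural key.
var hypot = NpyExpr.Sqrt(NpyExpr.Input(0) * NpyExpr.Input(0)
                       + NpyExpr.Input(1) * NpyExpr.Input(1));

var result = NpyIter.ExecuteExpression(   // Tier 3C entry point; overload assumed
    hypot,
    inputTypes: new[] { typeof(float), typeof(float) },
    outputType: typeof(float),
    a, b);
```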

NpyExpr Node Catalog

| Category | Ops |
| --- | --- |
| Binary arithmetic | Add Subtract Multiply Divide Mod Power FloorDivide ATan2 |
| Binary bitwise | BitwiseAnd BitwiseOr BitwiseXor |
| Unary arithmetic | Negate Abs Sign Sqrt Cbrt Square Reciprocal |
| Unary exp/log | Exp Exp2 Expm1 Log Log2 Log10 Log1p |
| Unary trig | Sin Cos Tan Sinh Cosh Tanh ASin ACos ATan Deg2Rad Rad2Deg |
| Unary rounding | Floor Ceil Round Truncate |
| Unary predicates | BitwiseNot LogicalNot IsNaN IsFinite IsInf |
| Comparisons | Equal NotEqual Less LessEqual Greater GreaterEqual (return 0/1 at output dtype) |
| Combinators | Min Max Clamp Where(cond, a, b) |
| Operators | + - * / % & \| ^ ~ ! unary - |

Call(...) escape hatch (commit 8da3e693)

Invoke any Func<...>, Delegate, or MethodInfo per element -- three dispatch paths chosen at construction:

| Condition | Emitted IL |
| --- | --- |
| Static method, no target | call <methodinfo> (zero-indirection, JIT-inlinable) |
| Instance MethodInfo + target | ldc.i4 id -> LookupTarget -> castclass T -> callvirt <mi> |
| Any other Delegate | ldc.i4 id -> LookupDelegate -> castclass Func<..> -> callvirt Invoke |

Auto-conversion at the call boundary (input/output via EmitConvertTo). Process-wide DelegateSlots registry, ~5ns lookup. Cache key includes MetadataToken + ModuleVersionId to disambiguate dynamic assemblies.
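
A hedged sketch of the Call(...) escape hatch applying a user function per element (the static-method case from the first dispatch row). NpyExpr.Call and the surrounding plumbing shown are approximations of the API described here.

```csharp
using NumSharp;

// Static target, no instance: emitted as a direct `call`, JIT-inlinable.
static double Swish(double x) => x / (1.0 + Math.Exp(-x));

var x = np.linspace(-4, 4, 1000);

var expr = NpyExpr.Call((Func<double, double>)Swish, NpyExpr.Input(0)); // factory name assumed
var y = NpyIter.ExecuteExpression(expr, new[] { typeof(double) }, typeof(double), x);
```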

Bugs caught during DSL battletest

  • IsNaN / IsFinite / IsInf silently wrote I4 0/1 into double slots (denormals instead of 1.0). Fix: UnaryNode inserts trailing EmitConvertTo(Int32, outType).
  • LogicalNot broken for Int64 / Single / Double / Decimal -- Ldc_I4_0+Ceq only valid for I4-sized operands. Fix: route through EmitComparisonOperation(Equal, outType) with properly-typed zero literal.
  • WhereNode prelude unfinished (threw at compile). Rewrote.
  • MinMaxNode did not propagate NaN -- rerouted through Math.Min / Math.Max (matches np.minimum / np.maximum, not fmin / fmax).
  • Vector256.Round / Truncate are .NET 9+ only -- excluded from SIMD path; scalar path works on net8.

Multi-Order Memory Layout (C/F/A/K)

Shape changes (src/NumSharp.Core/View/Shape.cs, +171 lines)

  • New IsFContiguous (O(1) flag check via ArrayFlags.F_CONTIGUOUS).
  • New ComputeFContiguousStrides helper.
  • New Shape(long[] dims, char order) ctor for explicit physical-order construction.
  • Aligned _UpdateContiguousFlags with NumPy -- single-pass (isC, isF) tuple, fewer call sites.
  • Fixed Shape.Order -- was hardcoded to 'C', now derives from contiguity flags.
  • Fixed empty-array semantics -- any dim==0 is both C- and F-contig per NumPy, no BROADCASTED flag.
  • OrderResolver.cs (new, 75 lines) -- centralizes C/F/A/K -> C/F mapping.

API surface wiring

| API | Change |
| --- | --- |
| np.copy, NDArray.copy(order) | New overload, default 'K' (was 'C') |
| np.array(Array, ..., order) | F-order materialization via copy('F') |
| np.asarray, np.asanyarray | New overloads accepting Type? + order |
| np.asfortranarray, np.ascontiguousarray | NEW thin wrappers |
| np.empty_like / zeros_like / ones_like / full_like | New order overload, default 'K' |
| astype(Type, copy, order) | New overload, default 'K' |
| flatten(order), ravel(order) | F-order via copy('F') reinterpret |
| reshape(Shape, char order) | New overload; F-order column-major fill |
| np.eye(..., order) | New overload |
| np.concatenate, vstack, hstack | Preserve F-contig when all inputs F-contig |
| np.cumsum(axis) | Preserve F-contig via post-hoc copy('F') |
| NDArray.argsort | Copies non-C-contig input to C-contig first |
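
A short usage sketch of the new order-aware overloads. The functions are the ones in the table; exact parameter names and positions are assumptions.

```csharp
using NumSharp;

var c = np.arange(6).reshape(2, 3);          // C-contiguous

var f = np.asfortranarray(c);                // new wrapper: column-major copy
Console.WriteLine(f.Shape.IsFContiguous);    // true

var flatF = c.flatten('F');                  // column-major read-out: 0 3 1 4 2 5
var e = np.eye(3, order: 'F');               // order overload; parameter name assumed
```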

Post-hoc F-contig preservation across ILKernel dispatch (Group F, 41 bugs fixed)

Refactoring 27 partial files (~21K lines) of IL emitters to accept arbitrary output strides was rejected as too invasive. Instead, the central dispatchers (ExecuteBinaryOp, ExecuteUnaryOp, ExecuteComparisonOp) call ShouldProduceFContigOutput(operands, resultShape) after the kernel runs and re-lay the result out via a cheap .copy('F') when every non-scalar operand is strictly F-contiguous. This matches NumPy's F+F->F, C+C->C, F+C->C, F*scalar->F, F+FCol->F rules.

np.modf, np.clip, np.negative, np.maximum / minimum updated individually (do not route through the central dispatchers).
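
A small illustration of the preservation rule on the public operator surface (property names per the Shape changes above):

```csharp
using NumSharp;

var aF = np.asfortranarray(np.arange(6).reshape(2, 3));
var bF = np.asfortranarray(np.arange(6).reshape(2, 3));
var cC = np.arange(6).reshape(2, 3);              // C-contiguous

Console.WriteLine((aF + bF).Shape.IsFContiguous); // true: F + F -> F
Console.WriteLine((aF * 2).Shape.IsFContiguous);  // true: F * scalar -> F
Console.WriteLine((aF + cC).Shape.IsContiguous);  // true: F + C -> C
```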

TDD coverage

51 sections of OrderSupport.OpenBugs.Tests.cs (3,005 lines), each driven by side-by-side Python / NumPy 2.4.2 output. Remaining [OpenBugs] are minimal API gaps (np.tile, np.flip, np.where, np.sort).

Stride-Native MatMul

np.dot / np.matmul previously fell into a ~100x slower fallback for any non-contiguous operand (transposed view, slice). This PR ships stride-native paths for all 12 dtypes.

New files

  • SimdMatMul.Strided.cs (338 lines) -- generalized 8x16 Vector256 FMA micro-kernel for float. New packers PackAPanelsStrided / PackBPanelsStrided with fast paths for transposed-contig and row-contig.
  • SimdMatMul.Double.cs (108 lines) -- stride-aware IKJ Vector256 kernel (4 FMAs).
  • Default.MatMul.Strided.cs (357 lines) -- MatMulStridedSame<T> where T : INumber<T> (JIT specializes per type with auto-vectorization), plus MatMulStridedBool (NumPy's OR-of-ANDs short-circuit), MatMulStridedMixed<TResult> (typed pointer reads via ReadAsDouble, no boxing).

Dead code removed (~165 lines)

MatMulGeneric, MatMulCore<TResult>, MatMulSameType<T>, four MatMulContiguous overloads (float/double/int/long), MatMulMixedType<TResult>.

Performance (MLP backward shapes)

| Op | Before | After |
| --- | --- | --- |
| dot(x.T, grad) 784x64 @ 64x128 | 240 ms | 1 ms |
| dot(grad, W.T) 64x128 @ 128x784 | 226 ms | 1 ms |
| Lt(400,500) @ L(500,400) blocked | 12 ms | 8 ms (skips copy) |
| int32 (150,200) @ (200,150) stride-native | -- | 10 ms (was 11 ms copy+matmul) |

The MLP example's .copy() workaround on transposed views is now removed -- kernel absorbs strides directly.
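
For reference, a sketch of the now stride-native path on transposed views; shapes follow the MLP backward case above, and the timings quoted are the table's, not re-measured here.

```csharp
using NumSharp;

// x: (64, 784) activations, grad: (64, 128) upstream gradient.
var x    = np.arange(64 * 784).reshape(64, 784).astype(typeof(float));
var grad = np.arange(64 * 128).reshape(64, 128).astype(typeof(float));

// (784, 64) @ (64, 128) -> (784, 128). Previously ~240 ms through the
// fallback (or a manual .copy() workaround); now ~1 ms, strides absorbed.
var dW = np.dot(x.T, grad);
```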

Test coverage

MatMulStridedTests (28 tests): all 4 BLAS transpose cases (NN/NT/TN/TT) x float/double x simple/blocked, per-dtype stride-native (byte/int16/uint16/int32/uint32/int64/uint64/char/decimal/bool), sliced views with Shape.offset > 0, mixed-type, MLP-shape regression tests.

Char8 Dtype

New NumSharp.Char8 -- [StructLayout(Sequential, Size=1)] readonly struct, NumPy dtype('S1') / Python bytes(1) equivalent.

Files (new, ~1,450 lines)

| File | Lines | Purpose |
| --- | --- | --- |
| Char8.cs | 725 | Core: Latin1CharInfo, classification, case, operators, IConvertible, IFormattable |
| Char8.Operators.cs | 169 | Mixed-type ops (char/byte/int), Span reinterpret |
| Char8.Conversions.cs | 261 | Instance/factory To* / From* for all 12 dtypes, saturating, truncating |
| Char8.Spans.cs | 201 | Search / equality / UTF-8 classification on ReadOnlySpan<Char8> |
| Char8.PyBytes.cs | 531 | Python bytes array methods (Strip/Split/Replace/Center/...) |
| Converts.Char8.cs | 317 | Converts integration parallel to Converts.Native.cs |

Adapted from .NET System.Char (src/dotnet/, fetched into a reference library; Latin1CharInfo[256] table copied verbatim).
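
A deliberately small, speculative sketch of Char8 in use; the casts follow the "mixed-type ops (char/byte/int)" description of Char8.Operators.cs, but the exact operator signatures are assumptions.

```csharp
using NumSharp;

Char8 a = (Char8)'A';             // from a 16-bit char, narrowed to 1 byte (cast assumed)
byte raw = (byte)a;               // 65
Console.WriteLine(a.ToString());  // "A" (IConvertible/IFormattable per Char8.cs)
```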

Python parity (caught by oracle diff)

Three parity bugs surfaced and were fixed during the 250-line Python bytes oracle comparison:

  1. Count with empty pattern -- Python returns len(s) + 1, was 0.
  2. Center asymmetric padding -- CPython formula left = pad/2 + (pad & width & 1).
  3. SplitLines too permissive -- bytes.splitlines() only recognizes \n / \r / \r\n (not \v / \f / \x1c-1e).

Status

Standalone for now -- not yet wired into NPTypeCode enum (would touch ~50 switch statements across DefaultEngine / ILKernelGenerator / NpyIter / casting / Converts; deferred to a separate PR).

MNIST MLP Example

examples/NeuralNetwork.NumSharp/MnistMlp/ -- runnable end-to-end classifier.

  • Architecture: 784 -> 128 (ReLU) -> 10, float32, He-init, Adam.
  • Forward fusion: bias + ReLU collapses into one NpyIter per layer (NpyExpr.Max(Input(0) + Input(1), 0)); see the sketch after this list.
  • Backward fusion: gradOut * (y > 0) ReLU mask fused.
  • Loss: SoftmaxCrossEntropy (combined, max-subtracted, numerically stable).
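
A sketch of that fused forward step (matmul, then a single fused bias + ReLU kernel). The NpyExpr / ExecuteExpression names mirror this PR's description; the overload details are assumptions.

```csharp
using NumSharp;

NDArray Forward(NDArray x, NDArray W, NDArray b)
{
    var z = np.dot(x, W);                                 // (batch, 128)

    // Fused bias + ReLU: max(z + b, 0) in a single NpyIter pass.
    var biasRelu = NpyExpr.Max(NpyExpr.Input(0) + NpyExpr.Input(1), 0);
    return NpyIter.ExecuteExpression(                     // Tier 3C entry point; overload assumed
        biasRelu,
        inputTypes: new[] { typeof(float), typeof(float) },
        outputType: typeof(float),
        z, b);                                            // bias broadcast over rows
}
```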

Results (6000 train / 1000 test, batch 128, Adam lr=1e-3):

| Phase | Total | Final test acc |
| --- | --- | --- |
| Pre-stride-native dot | 100.7 s (5 epochs) | 100% |
| Post-copy() workaround | 3.2 s (5 epochs) | 100% |
| Final 100-epoch demo (post stride-native) | ~42 s | 99.89% |

NN scaffolding fixes outside MnistMlp/ -- completed every stubbed or broken class:

  • Softmax -- was an empty Forward plus a sigmoid-derivative Backward.
  • Sigmoid.Forward -- was empty.
  • CategoricalCrossentropy -- no clipping, wrong backward formula.
  • BinaryCrossEntropy -- did not divide by N.
  • Accuracy / BinaryAccuacy -- returned scalar/null.
  • FullyConnected -- no bias, skewed init.
  • NeuralNet.Train -- used 2-index integer selection where slicing was intended.
  • Adam -- ms / vs init was commented out.
  • Added an SGD optimizer.

Each is verified against analytical references with finite-difference grad checks (29/29 pass).

Bug Fixes

  • NPTypeCode.Char.SizeOf() returned 1, real is 2 (UTF-16). Affected NpyIter.SetOpDType (ElementSizes[op] x stride in 8 places), 8 cast sites, np.frombuffer, np.dtype(char).itemsize, axis reductions. Survived without test failures because NumPy has no native char dtype and ASCII reads accidentally land on the right byte.
  • GetPriority(Decimal) = 5*10*32 was stale after the prior Decimal SizeOf fix -- corrected to 5*10*16=800. No behavioral change (relative ordering preserved).
  • DefaultEngine.IsInf was stubbed to return null (NRE on any IsInf call). Now wired through ExecuteUnaryOp with the existing IL kernel.
  • NDArray.Copy.cs share-by-reference bug -- new Shape(this.Shape.dimensions, 'F') aliased the source int[]; cloned now.
  • NDArray.argsort -- copies non-C-contig input to C-contig first (matches NumPy's invariant that argsort always returns C-contig).
  • flatten allocation regression -- F-order path was double-allocating (copy('F') + Array.Clone()). Fixed: reinterpret directly.

Behavioral Changes

| Area | Change | Migration |
| --- | --- | --- |
| np.copy default order | 'C' -> 'K' | No change for C-contig input (K preserves layout) |
| MaxOperands=8 removed | Now unlimited (dynamic alloc) | Drop-in |
| MaxDims=64 removed | Now unlimited (~300K dims, stackalloc-bound) | Drop-in |
| F-order iteration | Now produces [0,3,1,4,2,5] for 2x3 C-contig (was [0,1,2,3,4,5]) | Matches NumPy |
| K-order on broadcast / non-contig | Falls back to C-order (was stride-sort, broken with stride=0) | Matches NumPy |
| Negative strides | Only flipped for K-order | Matches NumPy FORCEDORDER rule |
| Empty arrays | IsContiguous and IsFContiguous both true (was both false) | Matches NumPy |
| Shape.Order | Now derives from contiguity flags (transpose of C reports 'F') | Was hardcoded to 'C' |
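
Illustration of the first row (np.copy default order 'C' -> 'K'), using the copy overloads described earlier:

```csharp
using NumSharp;

var a  = np.arange(6).reshape(2, 3);                  // C-contiguous
var aT = a.T;                                         // F-contiguous view

Console.WriteLine(np.copy(a).Shape.IsContiguous);     // true: 'K' keeps the C layout
Console.WriteLine(np.copy(aT).Shape.IsFContiguous);   // true: 'K' keeps the F layout (previously forced to C-order)
```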

Documentation

  • docs/website-src/docs/NDIter.md (1,934 lines) -- comprehensive NpyIter reference: 7-technique quick reference, decision tree, full Tier 3C node catalog with a NumPy-equivalent column, type discipline, SIMD coverage rules, caching/auto-keys, validation, gotchas, debugging, memory model + lifetime, 19 worked examples (Swish, GELU, Heaviside, Horner polynomial, fused sigmoid, NaN replacement, etc.).
  • docs/website-src/docs/ndarray.md (537 lines) -- NDArray reference page.
  • docs/NPYITER_AUDIT.md, NPYITER_DEEP_AUDIT.md, NPYITER_NUMPY_DIFFERENCES.md, NPYITER_BUFFERED_REDUCE_ANALYSIS.md -- implementation audit reports.
  • Tier renamed A/B/C -> 3A/3B/3C to make the layer-3 sub-tier relationship explicit (100 references across 6 files).

Test Plan

  • Full suite: 6,710 tests pass on net8.0 + net10.0 with CI filter TestCategory!=OpenBugs&TestCategory!=HighMemory. Zero regressions.
  • NumPy 2.4.2 nditer parity: 566/566 scenarios pass byte-for-byte (491 random fuzz seed=42 + 75 structured). Element sequences, stride arrays, multi-indices, reduction outputs all match.
  • NpyExpr + custom-op tests: +264 tests (NpyIterCustomOpTests, NpyIterCustomOpEdgeCaseTests, NpyExprExtensiveTests, NpyExprCallTests).
  • nditer API parity tests: +94 across 10 new files (NpyIterAxisStrideArrayTests, NpyIterCreateCompatibleStridesTests, NpyIterDebugPrintTests, NpyIterGetMultiIndexFuncTests, NpyIterInnerFixedStrideArrayTests, NpyIterOverlapAssumeElementwiseTests, NpyIterReductionAxisEncodingTests, NpyIterResetBasePointersTests, NpyIterTransferFlagsTests, NpyIterWriteMaskedTests).
  • MatMul stride-native: +28 MatMulStridedTests covering all 4 BLAS transpose cases, per-dtype stride-native, sliced views, mixed-type, MLP shapes.
  • Char8: +69 cases (source-generated discovery), 250-line Python bytes oracle diff (identical), 270+ C# edge assertions.
  • Order support TDD: +150 tests across 51 sections in OrderSupport.OpenBugs.Tests.cs.
  • Shape order: +24 tests in Shape.Order.Tests.cs.
  • MLP example: 100-epoch run reaches 99.89% test accuracy; per-epoch breakdown: forward 20.3% / loss 1.5% / backward 52.5% / optimizer 25.8%; bit-exact fused-vs-naive correctness (max |diff| = 0); kernel cache delta = 6 (compiled once, cache-hit thereafter); delegate-slot count = 0.

Known Issues / Out of Scope

  • Char8 not wired into NPTypeCode -- would touch ~50 switch statements; separate PR.
  • np.tile, np.flip, np.where, np.sort still missing (4 [OpenBugs] markers remain after this PR).
  • One pre-existing SetIndicesND assertion on multi-row fancy-write with scalar/matching-array RHS -- investigation in commit 47a74aa9 showed it is a pre-existing indexing bug, not F-order specific. Reproductions added under [OpenBugs].

Migration / Compatibility

Most changes are additive. The behavioral changes table above lists the user-visible deltas -- all align NumSharp closer to NumPy 2.4.2. No deprecated APIs, no removed public surface. The MaxOperands=8 and MaxDims=64 constants are removed but were internal.

Nucs added 30 commits April 22, 2026 23:27
Implements fixes detailed in docs/NPYITER_FIXES_REQUIRED.md to improve
NumPy compatibility of the NpyIter implementation.

Fix #1: Coalescing Always Runs
- Changed NpyIterRef.Initialize() to always coalesce axes after
  construction unless MULTI_INDEX flag is set
- Matches NumPy's nditer_constr.c line 395-396 behavior

Fix #2: Inner Stride Cache
- Added InnerStrides[MaxOperands] array to NpyIterState
- Added UpdateInnerStrides() method to gather inner strides
- GetInnerStrideArray() now returns contiguous array matching
  NumPy's NpyIter_GetInnerStrideArray() format

Fix #3: op_axes Parameter Implementation
- Added ApplyOpAxes() method to support axis remapping
- Supports -1 entries for broadcast/reduction axes
- Enables reduction operations via custom axis mapping

Fix #4: Multi-Index Support
- Added GetMultiIndex(Span<long>) for coordinate retrieval
- Added GotoMultiIndex(ReadOnlySpan<long>) for coordinate jumping
- Added HasMultiIndex property
- HASMULTIINDEX flag tracked during construction

Fix #5: Ranged Iteration
- Added ResetToIterIndexRange(start, end) for parallel chunking
- Added IterStart, IterEnd, and IsRanged properties
- RANGE flag tracks ranged iteration mode

Fix #6: Buffer Copy Type Dispatch
- Added non-generic CopyToBuffer/CopyFromBuffer overloads
- Runtime dtype dispatch for all 12 NumSharp types
- Enables dtype-agnostic iteration code

Fix #7: Flag Bit Positions Documented
- Added documentation explaining NumSharp's flag bit layout
- Legacy compatibility flags use bits 0-7
- NumPy-equivalent flags use bits 8-15
- Semantic meaning matches NumPy, positions differ

Fix #8: MaxDims Increased to 64
- Changed MaxDims from 32 to 64 to match NPY_MAXDIMS
- Supports high-dimensional array iteration

Test coverage:
- 13 new tests for coalescing, multi-index, ranged iteration,
  inner strides, and MaxDims validation
- All 5666 non-OpenBugs tests pass

Note: Full axis reordering before coalescing (for complete 1D
coalescing of contiguous arrays) not yet implemented. Current
implementation coalesces adjacent compatible axes only.
BREAKING: Replaces NumPy's fixed NPY_MAXDIMS=64 limit with unlimited
dimension support via dynamic array allocation.

NumSharp Divergence Rationale:
- NumSharp's Shape uses regular managed arrays (int[] dimensions, int[] strides)
- Practical limit is ~300,000 dimensions (stackalloc limit)
- This matches NumSharp's core design philosophy of unlimited dimensions
- Memory scales with actual dimensions, not worst-case fixed allocation

Implementation:
- Removed MaxDims constant (was 64)
- Added StridesNDim field to track stride array allocation size
- Dimension-dependent arrays (Shape, Coords, Perm, Strides) are now
  dynamically allocated pointers instead of fixed arrays
- Added AllocateDimArrays(ndim, nop) for allocation
- Added FreeDimArrays() for cleanup
- All arrays allocated in single contiguous block for cache efficiency

Per-operand arrays still use fixed MaxOperands=8 limit (reasonable for
multi-operand operations).

Memory Management:
- NpyIterRef.Dispose() calls FreeDimArrays()
- Static NpyIter methods use try/finally for cleanup
- Exception handling properly frees arrays on construction failure

Updated files:
- NpyIter.State.cs: Dynamic allocation with detailed comments
- NpyIter.cs: Call allocation in Initialize(), free in Dispose()
- NpyIterCoalescing.cs: Use StridesNDim instead of MaxDims
- NpyIterBufferManager.cs: Use StridesNDim for stride indexing
- NpyIterKernels.cs: Use StridesNDim in path selection

Tests:
- Removed MaxDims_Is64 test
- Added UnlimitedDimensions_HighDimensionalArray (20D array test)
- Added UnlimitedDimensions_MaxOperands (verifies MaxOperands=8)
- All 5667 tests pass
NpyAxisIter Implementation:
- NpyAxisIter.cs: Axis-based iterator for cumsum, cumprod, var, std
- NpyAxisIter.State.cs: State struct for axis iteration
- NpyLogicalReductionKernels.cs: Generic numeric reduction kernel interfaces

Logical Reduction Refactoring:
- Default.LogicalReduction.cs: Unified logical reduction implementation
- Default.All.cs/Default.Any.cs: Simplified to use new infrastructure
- np.all.cs/np.any.cs: Cleaned up API layer

Cumulative Operation Fixes:
- Default.Reduction.CumAdd.cs: Added contiguity checks for IL kernel path
- Default.Reduction.CumMul.cs: Added contiguity checks for IL kernel path
- Falls back to NpyAxisIter for sliced/reversed/broadcast views

Variance/Std Fixes:
- Default.Reduction.Std.cs: Updated reduction implementation
- Default.Reduction.Var.cs: Updated reduction implementation

Copy Kernel Infrastructure:
- CopyKernel.cs: Copy kernel key and execution path definitions
- ILKernelGenerator.Copy.cs: IL-generated copy kernels
- np.copyto.cs: Updated to use new copy infrastructure

Utilities:
- InfoOf.cs: Added GetSize helper for dtype size lookup
- MultiIterator.cs: Minor updates
- UnmanagedStorage.Cloning.cs: Minor updates

Documentation:
- docs/NPYITER_FIXES_REQUIRED.md: NpyIter implementation requirements
- docs/NPYITER_PARITY_ANALYSIS.md: NumPy parity analysis
- docs/DEFAULTENGINE_ILKERNEL_PLAYBOOK.md: IL kernel guidelines
- docs/DEFAULTENGINE_ILKERNEL_RULEBOOK.md: IL kernel rules
- docs/plans/NDITER.md: NDIter implementation plan

Tests:
- NpyIterBattleTests.cs: Iterator battle tests
- NpyIterReductionBattleTests.cs: Reduction battle tests
- NpyIterScanBattleTests.cs: Scan operation battle tests
- np.logical_reduction.iterator.tests.cs: Logical reduction tests
- np.copyto.NpyIter.Test.cs: Copy operation tests
- Updated np.all.Test.cs, np.any.Test.cs
- NpyIterRefTests.cs: Fix ref struct lambda capture issue
- np.logical_reduction.iterator.tests.cs: Add [TestClass], replace [Test] with [TestMethod]

All 5742 tests pass (excluding OpenBugs category)
…D collapse

NumPy reorders axes by stride magnitude BEFORE coalescing, which allows
contiguous arrays to fully coalesce to ndim=1. This commit implements
the same behavior in NumSharp.

Problem:
For a C-contiguous (2,3,4) array with strides [12,4,1], the coalescing
formula stride[i]*shape[i]==stride[i+1] fails without reordering:
- (0,1): 12*2=24 != 4 → cannot coalesce

Solution:
Added ReorderAxesForCoalescing() that sorts axes by minimum stride:
- After reorder: shapes [4,3,2], strides [1,4,12]
- (0,1): 1*4=4 == 4 ✓ → coalesce to [12,2], strides [1,12]
- (0,1): 1*12=12 == 12 ✓ → coalesce to [24], strides [1]

Changes:
- NpyIterCoalescing.cs: Added ReorderAxesForCoalescing(state, order)
  - Uses insertion sort (stable, good for nearly-sorted data)
  - Respects NPY_ORDER parameter (ascending for C-order, descending for F-order)
  - Marked old ReorderAxes() as obsolete

- NpyIter.cs: Initialize() now calls ReorderAxesForCoalescing() before
  CoalesceAxes() when multi-index is not tracked

- NpyIterRefTests.cs: Updated tests to expect ndim=1 for contiguous arrays

Test results: 5742 passed, 11 skipped, 0 failed
Add missing NumPy nditer features to NpyIterRef:
- RemoveMultiIndex(): Enable coalescing after construction by calling
  ReorderAxesForCoalescing + CoalesceAxes, matching NumPy behavior
- Finished property: Check if iteration is complete
- Shape property: Get current iterator shape after coalescing
- IterRange property: Get (Start, End) tuple
- Iternext(): Advance and return bool for remaining elements
- GetValue<T>/SetValue<T>: Type-safe value access at current position
- GetDataPtr(): Raw pointer access to current operand data

Fix RemoveAxis() to recalculate IterSize after dimension removal.

Add 45 comprehensive NumPy parity tests derived from actual NumPy 2.4.2
output, covering:
- Coalescing behavior (contiguous, transposed, sliced, scalar, empty)
- C_INDEX and F_INDEX tracking (2D and 3D arrays)
- RemoveMultiIndex and RemoveAxis
- Finished property and Iternext pattern
- Shape property changes after coalescing
- Ranged iteration
- Value access (GetValue, SetValue)
- Multi-operand iteration
- Sliced and broadcast arrays
- Edge cases (empty, scalar)

Document known divergences:
- F-order with MULTI_INDEX: skips axis reordering to preserve indices
- K-order on F-contiguous with MULTI_INDEX: same limitation

Create docs/NPYITER_NUMPY_DIFFERENCES.md with complete analysis of
NumPy nditer vs NumSharp NpyIter implementation differences.

Test results: 196 NpyIter tests passing, 5796 total tests passing
…INDEX

Add complete NumPy parity for iteration order when MULTI_INDEX is set:

F-order (NPY_FORTRANORDER):
- First axis changes fastest (column-major iteration)
- Reverses axis order so original axis 0 is innermost
- Uses Perm array to map internal coords to original axis order

K-order (NPY_KEEPORDER):
- Follows memory layout (smallest stride innermost)
- Sorts axes by stride, largest first when MULTI_INDEX is set
- Enables memory-order iteration on transposed/F-contiguous arrays

Key implementation changes:
- Initialize Perm to identity [0,1,2,...] in AllocateDimArrays
- Add forCoalescing parameter to ReorderAxesForCoalescing:
  - true (no MULTI_INDEX): ascending sort for coalescing formula
  - false (with MULTI_INDEX): descending sort for iteration order
- GetMultiIndex: Apply inverse permutation (outCoords[Perm[d]] = Coords[d])
- GotoMultiIndex: Apply permutation (Coords[d] = coords[Perm[d]])
- Shape property: Return shape in original axis order when MULTI_INDEX set

Test results:
- F-order: values 0,3,1,4,2,5 on (2,3) array (matches NumPy)
- K-order on transposed: values 0,1,2,3,4,5 following memory (matches NumPy)
- 196 NpyIter tests passing, 5796 total tests passing
Add GotoIndex() method that jumps to a specific flat index position
based on C_INDEX or F_INDEX flag. This enables random access by flat
array index while iterating.

Implementation details:
- Converts flat index to original coordinates using appropriate formula:
  - C-order: decompose using row-major index strides
  - F-order: decompose using column-major index strides
- Uses Perm array to map original coords to internal coords
- Updates data pointers correctly after position change

Fix ComputeFlatIndex to use original coordinate order:
- Build original coords/shape from internal using Perm array
- Compute C or F index in original array's coordinate system
- Fixes c_index tracking during F-order iteration

Fix Advance() to compute FlatIndex AFTER coords are updated:
- FlatIndex was being computed before coord increment (off by one)
- Now correctly computes after coordinate update
- Fast path (identity perm + C_INDEX) still uses simple increment

Add comprehensive tests:
- GotoIndex with C_INDEX (2D and 3D arrays)
- GotoIndex with F_INDEX
- C_INDEX tracking during F-order iteration

Test results: 200 NpyIter tests passing, 5800 total tests passing
Add Copy() method that creates an independent copy of the iterator
at its current position, matching NumPy's NpyIter_Copy behavior:
- Allocates new NpyIterState on heap
- Copies all fixed-size fields
- Allocates new dimension arrays (Shape, Coords, Perm, Strides)
- Returns new NpyIterRef that owns the copied state
- Advancing/resetting the copy does not affect the original

Add comprehensive tests:
- Copy preserves position
- Copy is independent (advancing original doesn't affect copy)
- Copy preserves flags (MULTI_INDEX, C_INDEX)
- Resetting copy doesn't affect original

Test results: 203 NpyIter tests passing, 5803 total tests passing
…eration

NumPy's nditer flips axes with all-negative strides for cache-efficient
memory-order iteration while tracking flipped coordinates via negative
Perm entries. This implementation adds full NumPy parity.

Key changes:
- FlipNegativeStrides(): Negate all-negative axes, adjust base pointers,
  mark with negative Perm entries, set NEGPERM flag
- GetMultiIndex/GotoMultiIndex: Handle NEGPERM by reversing coords for
  flipped axes (shape - coord - 1)
- GotoIndex: Handle NEGPERM in flat index to multi-index conversion
- ComputeFlatIndex: Handle NEGPERM for correct C/F index computation
- InitializeFlatIndex: Compute initial FlatIndex after axis setup
- HasNegPerm/HasIdentPerm: New properties for perm state inspection
- DONT_NEGATE_STRIDES: Flag support to preserve view logical order

13 new NumPy parity tests covering:
- 1D/2D/3D reversed arrays
- Row/col/both reversed 2D arrays
- GotoIndex/GotoMultiIndex with flipped axes
- Mixed operands (one positive stride prevents flip)
- DONT_NEGATE_STRIDES flag behavior
- Iteration without MULTI_INDEX flag

All 214 NpyIter tests pass, 5814 total tests pass.
GetIterView returns an NDArray view of the i-th operand with the
iterator's internal axes ordering. A C-order iteration of this view
is equivalent to the iterator's iteration order.

Key features:
- Returns view with iterator's internal shape and strides (after
  coalescing/reordering)
- For coalesced arrays: returns lower-dimensional view (e.g., 3D->1D)
- For sliced/transposed arrays: reflects internal optimization
- Throws InvalidOperationException when buffering is enabled
- Validates operand index bounds

8 new NumPy parity tests covering:
- Contiguous array (coalesced to 1D)
- MULTI_INDEX (preserves original shape)
- Transposed array with K-order
- Sliced arrays with non-contiguous strides
- Multiple operands
- Buffered iterator exception
- Invalid operand index exception
- Reversed array with flipped strides

All 222 NpyIter tests pass, 5822 total tests pass.
…ered iteration

NumPy's nditer supports automatic type conversion when iterating arrays
with different dtypes. This implementation adds full NumPy parity for
casting during buffered iteration.

Key changes:

- NpyIterCasting.cs (new): Complete casting validation and conversion
  - CanCast(): Validates casting rules (no_casting, equiv, safe, same_kind, unsafe)
  - IsSafeCast(): Safe casts like smaller int -> larger int, any int -> float64
  - IsSameKindCast(): Both integers or both floats
  - ValidateCasts(): Throws InvalidCastException for invalid casts
  - FindCommonDtype(): For COMMON_DTYPE flag
  - ConvertValue(): Single value conversion via double intermediate
  - CopyWithCast(): Strided array copy with type conversion

- NpyIter.State.cs:
  - OpSrcDTypes[]: Track source array dtypes for casting
  - SrcElementSizes[]: Source element sizes for stride calculation
  - GetOpSrcDType/SetOpSrcDType accessors
  - NeedsCast(op): Check if operand requires type conversion

- NpyIterBufferManager.cs:
  - CopyToBufferWithCast(): Copy from source to buffer with conversion
  - CopyFromBufferWithCast(): Copy from buffer to destination with conversion
  - Handles 1D and multi-dimensional strided arrays

- NpyIter.cs:
  - Initialize handles COMMON_DTYPE flag to auto-find common dtype
  - Stores source dtypes and validates casting rules
  - GetDataPtr returns buffer pointer when buffering enabled
  - CRITICAL BUG FIX: Dispose was using NativeMemory.Free for buffers
    allocated with AlignedAlloc. Now correctly uses FreeBuffers which
    calls AlignedFree. This was causing memory corruption and test crashes.

13 new NumPy parity tests:
- Cast_Int32ToFloat64_SafeCasting
- Cast_Float64ToInt32_UnsafeCasting
- Cast_Float64ToInt32_SafeCasting_Throws
- Cast_Int16ToInt32_SafeCasting
- Cast_CommonDtype_TwoOperands
- Cast_WriteOutput_WithConversion
- Cast_SameKindCasting_IntToInt
- Cast_SameKindCasting_IntToFloat_Throws
- Cast_NoCasting_SameType_Allowed
- Cast_NoCasting_DifferentType_Throws
- Cast_RequiresBuffered_ThrowsWithoutBuffer
- And more...

All 233 NpyIter tests pass, 5833 total tests pass.
Adds basic reduction iteration support matching NumPy's nditer API:

- Reduction detection via op_axes with -1 entries for READWRITE operands
- REDUCE_OK flag validation: throws if reduction detected without flag
- IsFirstVisit(operand): checks if current element is first visit
  (for initialization, e.g., set to 0 before summing)
- IsReduction property: check if iterator has reduction operands
- IsOperandReduction(op): check if specific operand is reduction
- REDUCE flags set on iterator (NpyIterFlags.REDUCE) and operands
  (NpyIterOpFlags.REDUCE) when reduction is detected

Key implementation details:
- Modified ApplyOpAxes to detect reduction axes and validate REDUCE_OK
- Fixed isWriteable check to only match WRITE flag (READWRITE includes both)
- Modified ValidateIterShape to account for op_axes -1 entries
- Modified Initialize to set up strides directly from op_axes when provided
  instead of using np.broadcast_to (which fails for reduction shapes)

Tests added (7 new NumPy parity tests):
- Reduction_1DToScalar_IteratesCorrectly
- Reduction_2DToScalar_IteratesCorrectly
- Reduction_2DAlongAxis1_ProducesCorrectResult
- Reduction_IsFirstVisit_ReturnsTrueOnFirstElement
- Reduction_WithoutReduceOK_Throws
- Reduction_ReadOnlyOperand_DoesNotThrow
- Reduction_HasReduceFlag_WhenReductionDetected

All 240 NpyIter tests and 5840 total tests passing.
- Add READWRITE validation: reduction operands must have both READ and
  WRITE flags (WRITEONLY throws ArgumentException). NumPy requires this
  because reduction must read existing value before accumulating.

- Add buffer reduction fields to NpyIterState:
  - ReducePos: current position in reduce outer loop
  - ReduceOuterSize: size of reduce outer loop
  - ReduceOuterStrides: per-operand reduce outer strides
  - GetReduceOuterStride/SetReduceOuterStride accessors

- Update IsFirstVisit to check buffer reduce_pos:
  Part 1: Check coordinates (existing) - if stride=0 and coord!=0, not first
  Part 2: Check buffer reduce_pos (new) - when BUFFERED flag set, if
  ReducePos!=0 and operand's reduce outer stride is 0, not first visit

- Add Reduction_WriteOnlyOperand_Throws test

241 NpyIter tests passing, 5843 total tests passing (excluding OpenBugs)
…Py parity

Implements NumPy's double-loop pattern for efficient buffered reduction
(nditer_templ.c.src lines 131-210). This avoids re-buffering during
reduction by separating iteration into inner (core) and outer loops.

Key changes:

NpyIterState new fields:
- CoreSize: Number of inputs per output element (reduce dimension size)
- CorePos: Current position within core [0, CoreSize) for IsFirstVisit
- SetBufStride/GetBufStride: Accessors for buffer strides

SetupBufferedReduction:
- CoreSize = Shape[outerDim] (reduce axis size)
- ReduceOuterSize = transferSize / CoreSize (number of outputs)
- BufStrides: 0 for reduce ops (stay at same output), elemSize for others
- ReduceOuterStrides: elemSize for reduce ops (next output),
  elemSize*CoreSize for non-reduce ops

BufferedReduceAdvance:
- Inner loop: increment CorePos, advance by BufStrides
- Outer loop: reset CorePos to 0, advance by ReduceOuterStrides
- Returns 0 when buffer exhausted for refill

IsFirstVisit for buffered mode:
- Uses CorePos = 0 check instead of coordinates
- First visit is only at start of each output element's accumulation

CopyReduceBuffersToArrays:
- For reduce operands: copy ReduceOuterSize elements (number of outputs)
- For non-reduce operands: copy full CoreSize * ReduceOuterSize elements
- Uses ResetDataPtrs as destination (original array, not buffer)

GetDataPtr for buffered reduce:
- Returns DataPtrs directly (tracked by BufferedReduceAdvance)
- Instead of computing from IterIndex which doesn't work with double-loop

Tests added:
- BufferedReduction_1DToScalar_ProducesCorrectResult
- BufferedReduction_2DAlongAxis1_ProducesCorrectResult
- BufferedReduction_IsFirstVisit_WorksWithBuffering
- BufferedReduction_LargeArray_ExceedsBuffer (tests buffer refill)
- BufferedReduction_WithCasting_WorksCorrectly
- BufferedReduction_DoubleLoopFields_AreSetCorrectly

247 NpyIter tests passing, 5847 total tests passing (excluding OpenBugs)
…coreSize)

Problem:
When buffer size is smaller than core size (reduce dimension size), the
buffered reduction double-loop pattern broke down:
- BufIterEnd was set to CoreSize instead of bufferSize
- coreOffset tracking was misaligned with actual core boundaries
- Reduce operand reload decisions were incorrect

Root Cause:
The coreOffset tracking was based on buffer refill counts, but core boundaries
are determined by iteration coordinates. When bufferSize < coreSize, multiple
buffer refills occur per core, causing the tracking to desync.

Fix:
1. Use pointer comparison to detect new output positions:
   - After GotoIterIndex, compare current array position with previous writeback
   - Only reload reduce operand if pointer changed (new output element)

2. Add ArrayWritebackPtrs field to store writeback positions separately:
   - ResetDataPtrs must stay as base pointers for GotoIterIndex
   - ArrayWritebackPtrs stores the actual writeback destinations

3. Set BufIterEnd to min(BufferSize, CoreSize) for small buffer support

Test Results:
- All 252 NpyIter tests pass
- Small buffer test (3,8)->(3,) with bufferSize=4 now produces [28, 92, 156]
- NumPy parity confirmed for buffered reduction edge cases

Analysis documented:
- EXLOOP, BUFNEVER, BUF_REUSABLE are performance optimizations not bugs
- Current implementation is functionally correct with NumPy
Audit Summary (2026-04-16):
- 252 tests passing (101 parity, 70 battle, 41 ref tests)
- 32 NumPy APIs fully implemented
- All core features complete: iteration, indexing, buffering, casting, reduction

Key findings:
- Implementation is PRODUCTION READY
- No critical missing features for NumSharp operations
- Full NumPy parity for buffered reduction including small buffer handling
- Intentional divergences documented (unlimited dims, 8 max operands)

Remaining low-priority items (performance only):
- BUFNEVER per-operand buffer skip
- Enhanced buffer reuse logic
- EXLOOP increment optimization
BREAKING CHANGE: Removed MaxOperands=8 limit

NumPy uses NPY_MAXARGS=64 (was 32 in 1.x) as a runtime constant. NumSharp
now achieves full parity by supporting truly unlimited operands through
dynamic allocation of all per-operand arrays.

Changes:
- NpyIterState: Convert all 14 fixed per-operand arrays to dynamically
  allocated pointers (DataPtrs, ResetDataPtrs, BaseOffsets, OpItFlags,
  OpDTypes, OpSrcDTypes, ElementSizes, SrcElementSizes, InnerStrides,
  Buffers, BufStrides, ReduceOuterStrides, ReduceOuterPtrs, ArrayWritebackPtrs)
- AllocateDimArrays: Now allocates both dimension arrays AND operand arrays
  in separate contiguous blocks with proper alignment
- FreeDimArrays: Now frees both blocks
- All accessor methods simplified (no more fixed() statements needed)
- Copy method: Fixed to properly copy operand arrays for all cases
  including scalar (NDim=0)
- NpyIterCoalescing: Updated to use direct pointer access

Tests:
- Added UnlimitedOperands_100Operands_IteratesCorrectly test
- Updated TooManyOperands_Throws to ManyOperands_Works
- Updated UnlimitedDimensions_MaxOperands to verify 16 operands work
- 253 NpyIter tests passing

Memory layout for operand arrays (per NOp elements):
- 9 long* arrays (72 bytes each for 64-bit pointers)
- 2 int* arrays
- 1 ushort* array
- 2 byte* arrays
All sections 8-byte aligned for optimal cache performance.
Deep audit validates NumSharp NpyIter against NumPy 2.x using:
1. Behavioral Comparison - 55 NumPy vs NumSharp tests
2. Edge Case Matrix - 12 systematic edge cases
3. Source Code Comparison - NumPy C vs C# structural analysis
4. Property Invariants - 13 mathematical invariant tests

Total validation: 333 tests (253 unit + 80 behavioral/invariant)

Key findings:
- All tests pass confirming full NumPy parity
- NEGPERM behavior verified (reversed arrays iterate in memory order)
- Buffered reduce double-loop matches NumPy structure
- Coalescing algorithm structurally identical
- All 6 property invariants hold

New files:
- docs/NPYITER_DEEP_AUDIT.md - Comprehensive audit report
- test/audit_behavioral.cs - Runtime audit script
- test/NumSharp.UnitTest/.../NpyIterNumPyBattleTests.cs - Battle tests
Three battle tests document NumSharp's iteration order differences:

1. F-order iteration: NumSharp is C-order only (documented limitation)
   - NumPy: [0, 3, 1, 4, 2, 5] (F-order)
   - NumSharp: [0, 1, 2, 3, 4, 5] (C-order)

2. Multi-operand broadcasting: Different iteration order
   - NumPy: [(0,0), (1,1), (2,2), (0,3), (1,4), (2,5)]
   - NumSharp: [(0,0), (0,3), (1,1), (1,4), (2,2), (2,5)]

3. Non-contiguous view: Memory order vs logical C-order
   - NumPy: [1, 3, 11, 13] (logical C-order)
   - NumSharp: [1, 11, 3, 13] (memory order)

All tests now pass (277 total NpyIter tests).
NumPy's nditer coalescing strategy:
- K-order: Always coalesce for memory efficiency (sort by stride)
- C-order on C-contiguous: Coalesce → memory order (== C-order)
- F-order on F-contiguous: Coalesce → memory order (== F-order)
- F-order on C-contiguous: NO coalescing, reverse axes for F-order

Previously NumSharp was coalescing for ALL orders when array was
contiguous in any layout, which produced incorrect iteration order
for F-order on C-contiguous arrays.

Changes:
- NpyIter.cs: Add CheckAllOperandsContiguous(bool cOrder) helper
  to check if arrays are contiguous in requested order
- NpyIter.cs: Only coalesce when order matches array contiguity
- NpyIterCoalescing.cs: Add IsContiguousForCoalescing() check

Test results:
- 277 NpyIter tests passing (including 24 new battle tests)
- 5813 total tests passing
- F-order now produces [0,3,1,4,2,5] instead of [0,1,2,3,4,5]
  for a 2x3 C-contiguous array (matches NumPy)
…arrays

Problem:
- K-order iteration on broadcast arrays produced wrong order
  (stride-based sorting with stride=0 breaks axis ordering)
- K-order iteration on non-contiguous views also used wrong order
- NumPy: (3,) x (2,3) broadcast iterates C-order: [(0,0),(1,1),(2,2),(0,3),(1,4),(2,5)]
- NumSharp was producing: [(0,0),(0,3),(1,1),(1,4),(2,2),(2,5)]

Root cause:
- For K-order, we sorted axes by stride magnitude
- But GetMinStride excludes stride=0, leading to incorrect axis ordering
- Non-contiguous views similarly got wrong ordering from stride sort

Solution:
- For K-order with broadcast dimensions (stride=0), fall back to C-order
- For K-order with non-contiguous arrays, fall back to C-order
- Added HasBroadcastStrides() helper to detect broadcast dimensions
- CheckAllOperandsContiguous now uses absolute strides to handle
  reversed arrays (negative strides become positive after FlipNegativeStrides)
- Separate coalescing logic for C/F/K orders to preserve iteration semantics

Changes:
- NpyIter.cs: Added broadcast detection, fixed coalescing decision logic
- NpyIterNumPyBattleTests.cs: Updated tests to expect correct NumPy behavior
  (removed [Misaligned] attributes from Battle_MultiOperandBroadcasting and
  Battle_NonContiguousViewOrder since they now match NumPy)

All 277 NpyIter tests passing.
All 5877 project tests passing.
Deep audit against NumPy 2.4.2 source revealed 7 behavioral bugs. All fixed via TDD.

Bug #1: Negative strides always flipped regardless of order
- NumPy (nditer_constr.c:297-307) only flips when NPY_ITFLAG_FORCEDORDER not set
- FORCEDORDER is set by C, F, and A orders. Only K-order skips it.
- Fix: Only call FlipNegativeStrides for K-order
- CheckAllOperandsContiguous now takes allowFlip param (abs strides only when flipping)
- Affects: 1D/2D reversed arrays with C/F/A orders

Bug #2: NO_BROADCAST flag not enforced
- Code was skipping NO_BROADCAST operands instead of enforcing the constraint
- Fix: NO_BROADCAST operands must match iterShape without dim-1 stretching
- ValidateIterShape now always runs (not just when iterShape is provided)

Bug #3: F_INDEX returned C-order indices
- Coalescing reduces to NDim=1, losing original axis structure needed for F-index
- Fix: Disable coalescing when C_INDEX or F_INDEX is set (like MULTI_INDEX)

Bug #4: ALLOCATE with null operand threw NullReferenceException
- CalculateBroadcastShape accessed null op[i].ndim
- Fix: Skip null operands in broadcast shape calc, then allocate them after
  with correct shape (from op_axes if provided) and dtype

Bug #5,6,7: op_axes reductions broken (axis=0 gave [15,0,0], axis=1 threw)
- ApplyOpAxes was re-applying op_axes to strides that were already correctly set
  in the main operand setup loop, zeroing out non-reduce strides
- CalculateBroadcastShape didn't know about op_axes, couldn't compute iter shape
- Fix: ApplyOpAxes now only validates and sets REDUCE flags, not strides
- Fix: CalculateBroadcastShape now accepts opAxesNDim/opAxes parameters
- Uses production Shape.ResolveReturnShape API for all broadcasting

Refactoring: Uses production Shape.ResolveReturnShape / np.broadcast_to
- Replaces custom broadcast shape calculation
- User feedback: production APIs are 1-to-1 with NumPy

Testing:
- 21 new TDD tests in NpyIterParityFixTests.cs
- All 298 NpyIter tests pass
- All 5898 project tests pass
- Final battletest: 21/21 scenarios match NumPy 2.4.2 exactly

Fixed test: NullOperand_Throws now expects ArgumentException (more accurate than
NullReferenceException since null operand without ALLOCATE is an argument error).
Adds F-contiguity detection and OrderResolver for NumPy's 4 memory orders
at minimum functionality, with zero behavioral change to existing code.

Changes:
- Shape.cs: F-contig detection via ComputeIsFContiguousStatic (mirror of C-contig
  algorithm, scan left-to-right). Sets ArrayFlags.F_CONTIGUOUS flag during flag
  computation. New IsFContiguous property (O(1) flag check). New
  ComputeFContiguousStrides helper. New Shape(long[] dims, char order)
  constructor for explicit physical-order construction.
- Scalar constructor now sets both C_CONTIGUOUS and F_CONTIGUOUS (matches NumPy).
- OrderResolver.cs (NEW): Resolves NumPy order chars (C/F/A/K) to physical
  storage orders (C or F). 'A' and 'K' require a source Shape for resolution
  (matches NumPy: creation functions without source throw "only 'C' or 'F'
  order is permitted").
- np.empty.cs: New overload np.empty(shape, order, dtype) wiring
  OrderResolver through to Shape.

Key insight: transpose already produces F-contig memory layout; previously this
went undetected because F_CONTIGUOUS flag was never set. Now:
  arr = np.arange(24).reshape(2,3,4)
  arr.T.Shape.IsFContiguous  // true (previously: false / undetected)

Design:
- Only C and F are physical storage layouts; A and K are logical decisions
  that resolve to C or F based on source array layout.
- OrderResolver centralizes the C/F/A/K -> C/F mapping, letting future
  wiring of np.copy/np.array/flatten/ravel/reshape be a 1-line call.
- Existing IsContiguous callers (116 usages across 50 files) unchanged -
  they still see C_CONTIGUOUS=false for F-contig arrays and take the
  strided path (which is correct, just not yet SIMD-accelerated).

Tests (24 new in Shape.Order.Tests.cs):
- Scalar and 1-D arrays are both C and F contig
- Multi-dim C-contig is not F-contig and vice versa
- Transpose of C-contig now reports IsFContiguous=true
- Shape(dims, 'F') produces correct F-order strides (1, 2, 6 for 2x3x4)
- Shape(dims, 'A'/'X') throws ArgumentException
- OrderResolver: C/F resolve directly; A/K without source throw; A/K with
  source resolve based on source layout
- np.empty(order='C'/'F') produces correct layout
- np.empty(order='A'/'K') throws (matches NumPy)

Verification:
- 6017 tests pass on both net8.0 and net10.0 (zero regressions)
- NumPy parity verified via Python side-by-side comparison
- All order resolution semantics match NumPy 2.4.2

Future phases unblocked (each a ~1-line change):
- ILKernelGenerator fast paths can add || IsFContiguous for element-wise ops
- NpyIter.CheckAllOperandsContiguous can use Shape.IsFContiguous directly
- np.copy(order), np.array(order), flatten(order), ravel(order) wiring
- np.asfortranarray, np.ascontiguousarray
Review of initial F-order support surfaced three design issues where
NumSharp diverged from NumPy's patterns. This refactor aligns with
NumPy's flagsobject.c:_UpdateContiguousFlags exactly.

Changes:

1. Unified contiguity computation (single-pass)
   - Replaced two separate functions (ComputeIsContiguousStatic,
     ComputeIsFContiguousStatic) with one combined
     ComputeContiguousFlagsStatic returning (isC, isF) tuple.
   - Mirrors NumPy's _UpdateContiguousFlags which computes both in one
     function with a shared dim==0 early exit.
   - Fewer call sites, one traversal per contiguity check, cleaner
     shared logic.

2. Fixed Shape.Order property (was hardcoded to layout = 'C')
   - Now derives from actual contiguity flags: returns 'F' if strictly
     F-contiguous (IsFContiguous && !IsContiguous), else 'C'.
   - Transposed C-contig arrays now correctly report Order='F'.
   - 1-D and scalar shapes (both C and F contig) report 'C' by
     convention (NumPy-default reference order).
   - Non-contiguous shapes report 'C' as default reference.

3. Fixed empty array flag computation (any dim == 0)
   - NumPy short-circuits _UpdateContiguousFlags on dim==0 setting
     BOTH C_CONTIGUOUS and F_CONTIGUOUS unconditionally and NOT setting
     BROADCASTED. Empty arrays have no elements so broadcast semantics
     are meaningless.
   - Previously NumSharp computed strides like (0, 3, 1) for shape
     (2, 0, 3), triggered IsBroadcasted=true, and then skipped
     contiguity flag assignment entirely. Result was an empty array
     reporting IsContiguous=false, IsFContiguous=false.
   - Now matches NumPy: any dim=0 short-circuits to set both C and F
     contig + WRITEABLE + ALIGNED, clear BROADCASTED.

4. Clarified `layout` const documentation
   - The internal const char layout = 'C' was misleadingly named (as if
     it described the shape's physical order) but only ever used as a
     hash seed in ComputeSizeAndHash. Updated doc comment to clarify
     this is NOT the physical memory order — use Order / IsContiguous
     / IsFContiguous for actual layout info.
   - Value unchanged to preserve existing hash stability.

Additional tests (6 new):
- Order property for C, F, transpose, 1-D, scalar cases
- Empty array is both C and F contiguous (matching NumPy 2.4.2)

Test results:
- 6023 tests pass on both net8.0 and net10.0 (was 6017; 6 new tests)
- Zero regressions

NumPy source reference: numpy/_core/src/multiarray/flagsobject.c
…enarios

Ports the last NumPy nditer surface gaps identified by the audit, each with
1-to-1 semantic parity verified against NumPy 2.4.2 via Python harness.

10 items implemented (all battletested):
  1. NpyIter_ResetBasePointers  (nditer_api.c:314)
     - Populate BaseOffsets during FlipNegativeStrides so ResetBasePointers
       can recompute ResetDataPtrs[iop] = baseptrs[iop] + baseoffsets[iop].
     - Public: NpyIterRef.ResetBasePointers(ReadOnlySpan<IntPtr>) and
       ResetBasePointers(NDArray[]) convenience overload.

  2. NPY_ITFLAG_TRANSFERFLAGS_SHIFT packing  (nditer_constr.c:3542)
     - Pack NpyArrayMethodFlags into top 8 bits of ItFlags (shift=24).
     - Public: NpyIterRef.GetTransferFlags() + NpyArrayMethodFlags enum
       + NpyIterConstants.TRANSFERFLAGS_SHIFT/MASK constants.
     - REQUIRES_PYAPI never set in .NET (no Python GIL). SUPPORTS_UNALIGNED
       and NO_FLOATINGPOINT_ERRORS always set (raw pointer loops, .NET casts
       don't raise FPE). IS_REORDERABLE set for numeric casts.

  3. NpyIter_GetGetMultiIndex factory  (nditer_templ.c.src:481)
     - Specialized delegate factory returning NpyIterGetMultiIndexFunc with
       3 dispatches: IDENTPERM (direct copy), positive perm (apply perm[]),
       NEGPERM (apply perm+flip). BUFFER and HASINDEX don't affect coords
       so no specialization needed for them.
     - Public: GetMultiIndexFunc(), GetMultiIndexFunc(out errmsg),
       InvokeMultiIndex(fn, coords) — ref-struct-safe invocation.
     - Also fixes: IDENTPERM flag is now set at construction (after
       AllocateDimArrays). Previously only set post-coalescing, leaving
       MULTI_INDEX iterators without the fast-path flag.

  4. NpyIter_GetInnerFixedStrideArray  (nditer_api.c:1357)
     - Public: GetInnerFixedStrideArray(Span<long>).
     - Buffered: copies BufStrides. Non-buffered: innermost-axis stride per
       operand. Returns BYTE strides (NumPy convention), multiplying
       NumSharp's element-count strides by ElementSizes[op].

  5. NpyIter_GetAxisStrideArray  (nditer_api.c:1309)
     - Public: GetAxisStrideArray(int axis, Span<long>).
     - With HASMULTIINDEX: walks perm to find internal axis (handles both
       positive and NEGPERM-encoded entries). Without: Fortran-order
       (fastest-first) lookup via NDim-1-axis. Byte strides.

  6. NpyIter_CreateCompatibleStrides  (nditer_api.c:1058)
     - Public: CreateCompatibleStrides(long itemsize, Span<long>).
     - Requires HASMULTIINDEX, rejects flipped axes. Walks perm from
       innermost (NDim-1) outward, accumulating itemsize into outStrides[axis]
       in original (C-order) axis slots.

  7. NpyIter_DebugPrint  (nditer_api.c:1402)
     - Public: DebugPrint(), DebugPrint(TextWriter), DebugPrintToString().
     - Faithful port of NumPy's dump format: ItFlags decoded, NDim/NOp,
       IterSize/Start/End/Index, Perm, DTypes, DataPtrs, BaseOffsets,
       OpItFlags, BufferData (when BUFFER), per-axis data.

  8. NPY_ITER_REDUCTION_AXIS encoding  (common.h:347, nditer_constr.c:1431)
     - Additive encoding: axis + (1 << 30). Values >= (1<<30)-1 flagged as
       reduction axes. Value 0x40000000 for axis 0, 0x3FFFFFFF for axis -1.
     - Public: NpyIterUtils.ReductionAxis(int) encoder and GetOpAxis(int,
       out bool) decoder. NpyIterConstants.REDUCTION_AXIS_OFFSET = 1<<30.
     - Integrated into CalculateBroadcastShape (rejects length != 1 on
       reduction axes), ValidateIterShape, and ApplyOpAxes (enforces
       REDUCE_OK + sets REDUCE flag).

  9. WRITEMASKED + ARRAYMASK + check_mask_for_writemasked_reduction
     - TranslateOpFlags now maps NpyIterPerOpFlags.WRITEMASKED ->
       NpyIterOpFlags.WRITEMASKED on op flags.
     - PreCheckMaskOpPairing validates: WRITEMASKED requires one ARRAYMASK,
       ARRAYMASK requires >=1 WRITEMASKED, at most one ARRAYMASK, no
       operand with both flags.
     - SetMaskOpFromFlags sets NpyIterState.MaskOp index of ARRAYMASK operand.
     - CheckMaskForWriteMaskedReduction enforces (nditer_constr.c:1328):
       for any WRITEMASKED + REDUCE operand, no axis may have maskstride!=0
       && opstride==0 (would produce multiple mask values per reduction element).
     - Public: NpyIterRef.MaskOp, HasWriteMaskedOperand.

 10. NPY_ITER_OVERLAP_ASSUME_ELEMENTWISE per-op flag
     - Added NpyIterPerOpFlags.OVERLAP_ASSUME_ELEMENTWISE_PER_OP = 0x40000000
       in the correct per-operand flag slot (NumPy's location). Accepted
       syntactically as a marker for COPY_IF_OVERLAP fast-path elision.
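
A standalone sketch of item 8's additive encoding (constants as above; these
helper bodies are illustrative, not the library source):

    const int REDUCTION_AXIS_OFFSET = 1 << 30;               // NpyIterConstants.REDUCTION_AXIS_OFFSET

    static int EncodeReductionAxis(int axis)
        => axis + REDUCTION_AXIS_OFFSET;                      // 0 -> 0x40000000, -1 -> 0x3FFFFFFF

    static int DecodeOpAxis(int value, out bool isReduction)
    {
        isReduction = value >= REDUCTION_AXIS_OFFSET - 1;     // values >= (1<<30)-1 mark reduction axes
        return isReduction ? value - REDUCTION_AXIS_OFFSET : value;
    }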

Correctness bugs fixed while battletesting:

  A. SetupBufferedReduction produced inverted strides for non-reduce operands.
     BufStride was set to elemSize (assumed linear buffer); correct value is
     the operand's stride along the REDUCE axis (inner loop = reduce axis
     traversal). ReduceOuterStride was set to elemSize*coreSize; correct is
     stride along the non-reduce axis.

  B. SetupBufferedReduction only worked for 2-axis cases (one reduce, one
     non-reduce). For 3D+ with multiple non-reduce axes, added CoreSize=0
     short-circuit that defers to regular N-D Advance() — which correctly
     carries multiple axes via Coords + per-axis strides. stride=0 on
     reduce axis naturally keeps y's pointer fixed during reduce iteration.

  C. GetDataPtr for BUFFER+REDUCE with CoreSize=0 returned a buffer pointer
     indexed by IterIndex (linear assumption). For reduce this is wrong —
     DataPtrs already track the correct position. Now returns DataPtrs
     whenever REDUCE flag is set.

  D. Reset() didn't reposition to IterStart. IterIndex was set to IterStart
     but DataPtrs/Coords were reset to array origin, desyncing the iterator
     state for ranged iterators with IterStart > 0. Now delegates to
     GotoIterIndex(IterStart) which sets all three consistently.

  E. K-order fallback to C-order was too aggressive — triggered for all
     non-contiguous arrays, defeating NumPy's K-order semantic of iterating
     in memory order. Fixed to fall back only when broadcast axes (stride=0)
     are present; merely non-contiguous (transposed, strided, negative-
     stride) now properly sorts axes by |stride| descending.

  F. CoalesceAxes rejected size-1 axes unless stride==0. Size-1 axes
     contribute no iteration and should always be absorbed into a neighbor.
     Fix restores proper 1D coalescing for shapes like (2,4,1) contiguous.

  G. FlipNegativeStrides now populates BaseOffsets[op] (previously an
     allocated-but-unused field). Prereq for item #1 (ResetBasePointers).

Battletest harness:
  - Python<->NumSharp scenario harness in a temp workspace with 3 structured
    waves (25 scenarios each) plus a 491-scenario random fuzz test with
    deterministic seed (42). All scenarios compare element sequences, stride
    arrays, multi-indices, reduce outputs, and iteration state byte-for-byte
    against NumPy 2.4.2 output.
  - Coverage: 1D-5D shapes; int8/16/32/64, uint16, float32/64 dtypes;
    contiguous, transposed (2D+3D), strided, negative-stride, size-1 axes,
    and all combinations; MULTI_INDEX, C_INDEX, F_INDEX; RANGED + goto;
    explicit/implicit reduction axes; multi-operand broadcast.
  - Result: 566/566 scenarios pass (25+25+25+491). All semantically
    equivalent to NumPy's C-level nditer output.

Added tests (94 new unit tests):
  - NpyIterAxisStrideArrayTests (12)
  - NpyIterCreateCompatibleStridesTests (9)
  - NpyIterDebugPrintTests (12)
  - NpyIterGetMultiIndexFuncTests (10)
  - NpyIterInnerFixedStrideArrayTests (9)
  - NpyIterOverlapAssumeElementwiseTests (5)
  - NpyIterReductionAxisEncodingTests (11)
  - NpyIterResetBasePointersTests (10)
  - NpyIterTransferFlagsTests (8)
  - NpyIterWriteMaskedTests (8)

Regression: 6023/6023 project tests pass (was 5898 before this work),
zero regressions. Project passes ~125 more tests than baseline because
fixes C-F unblocked test cases that were previously failing silently.
…rface

New file: OrderSupport.OpenBugs.Tests.cs (39 tests, 11 marked [OpenBugs])

Comprehensive TDD test file documenting the gap between NumSharp's
current behavior and NumPy 2.x's expected behavior for memory order
support. Each test uses NumPy's exact output as the expected value
(verified via side-by-side Python scripts).

Test sections:
1. Creation APIs (np.zeros/ones/empty/full) — 10 tests
2. Copy/conversion (np.copy, NDArray.copy) — 7 tests
3. Manipulation (flatten, ravel) — 5 tests
4. Arithmetic output layout — 3 tests
5. Reductions on F-contig (math-equivalent) — 6 tests
6. Slicing contiguity preservation — 2 tests
7. Broadcasting output layout — 1 test
8. Transpose behavior — 3 tests
9. Iteration order (C-order via GetOffset) — 1 test
10. Order property derivation — 3 tests

Results (net8.0 and net10.0):
- 28 tests pass (documents working behavior / NumPy parity)
- 11 tests fail (marked [OpenBugs], excluded from CI via filter)

Currently failing [OpenBugs] — API gaps to close in future phases:

Section 2 — np.copy / NDArray.copy ignore order parameter:
  - NpCopy_FOrder_ProducesFContig
  - NpCopy_AOrder_FSource_ProducesFContig
  - NpCopy_KOrder_FSource_ProducesFContig
  - NDArrayCopy_FOrder_ProducesFContig
  - NDArrayCopy_AOrder_FSource_ProducesFContig

Section 3 — flatten/ravel ignore/lack order:
  - Flatten_CContig_FOrder_MatchesNumPy
  - Flatten_FContig_FOrder_MatchesNumPy
  - Ravel_FOrder_ApiGap (ravel has no order parameter at all)

Section 4 — arithmetic always produces C-contig output:
  - Arithmetic_FContig_ScalarMul_PreservesFContig
  - Arithmetic_FPlusF_PreservesFContig

Section 7 — broadcast always produces C-contig output:
  - Broadcast_FContig_PlusFCol_PreservesFContig

Currently passing (NumPy-aligned behavior confirmed):
- np.zeros/ones/full preserve F-contig when given an F-Shape (workaround
  for missing order= parameter, but layout IS preserved)
- np.empty(order='C'/'F'/'A'/'K') — correct behavior; A/K throw
- All reductions (sum, mean, min, max, axis=0, axis=1) work on F-contig
- Transpose toggles C<->F contig correctly
- Slicing: 1-col slice of F-contig is both C and F contig (matches NumPy)
- Slicing: row-slice of F-contig is neither (matches NumPy)
- Shape.Order property reports correct char based on flags
- Scalar multiply on F-contig produces correct values (just loses layout)
- Indexed iteration on F-contig produces C-order logical traversal
  (matches NumPy's arr.flat semantics)

CI verification:
- Full suite with CI filter: 6051 pass, 0 fail (net8.0 and net10.0)
- With TestCategory=OpenBugs: 11 fail (as expected)
- With TestCategory!=OpenBugs: 28 pass (0 regressions)

Next steps: fix each [OpenBugs] by wiring order through the respective
API using OrderResolver. Remove the attribute after verifying the test
passes with NumPy's expected output.
Expands OrderSupport.OpenBugs.Tests.cs from 39 to 67 tests covering
every NumPy function that accepts an 'order' parameter.

NumPy functions covered (15 total that accept order=):
- Creation: empty, empty_like, zeros, zeros_like, ones, ones_like,
  full, full_like, eye
- Conversion: array, asarray, asanyarray, copy
- Manipulation: reshape, ravel
- Method: astype, flatten, copy

New sections added:
- Section 11: np.empty_like (default K, preserves source layout)
- Section 12: np.zeros_like (default K) + values-are-zero test
- Section 13: np.ones_like (default K) + values-are-one test
- Section 14: np.full_like (default K) + values-are-fill test
- Section 15: np.eye (C/F order) + identity values test
- Section 16: np.asarray / np.asanyarray API gaps
- Section 17: astype (default K, preserves source layout)
- Section 18: np.reshape with F-order (column-major fill)
- Section 19: np.ravel with C/F/A/K orders
- Section 20: np.array with order (Array input overload)
- Section 21: np.asfortranarray / np.ascontiguousarray (missing APIs)

Results (net8.0 and net10.0):
- 42 tests pass (document working behavior / NumPy parity)
- 25 tests fail (marked [OpenBugs], excluded from CI via filter)

25 [OpenBugs] documenting gaps:
- *_like don't preserve F-contig (5 tests: empty/zeros/ones/full/astype)
- np.copy/NDArray.copy order ignored (7 tests from prior commit)
- flatten/ravel order ignored (3 tests)
- arithmetic/broadcast don't preserve F-contig (3 tests)
- np.eye has no order param (1 test)
- np.reshape has no order param (1 test)
- np.array order ignored (1 test)
- np.asarray/asanyarray have no NDArray+order overload (2 tests)
- np.asfortranarray/np.ascontiguousarray don't exist (2 tests)

Confirmed NumPy parity (new passing tests):
- np.empty_like/zeros_like/ones_like/full_like on C-contig (K default -> C)
- np.zeros_like/ones_like/full_like produce correct fill values
- np.eye default produces C-contig identity matrix
- astype preserves C-contig from C source (K default)
- astype preserves values during type conversion
- np.reshape default produces row-major fill
- np.ravel default is C-order
- np.array default produces C-contig

CI verification:
- TestCategory!=OpenBugs: 6065 pass, 0 fail (net8.0 and net10.0)
- TestCategory=OpenBugs: 25 fail (as expected bug reproductions)

All NumPy order baselines verified against NumPy 2.4.2 via side-by-side Python scripts.
Nucs added 12 commits April 22, 2026 23:39
…r.Assign to NpyIter.Copy

Adds an NpyIter.Copy(dst, src) umbrella that subsumes the legacy
MultiIterator.Assign -- same broadcast/stride/cross-dtype semantics, but
routed through the SIMD-fast TryCopySameType kernel for matching dtypes
and a new strided-cast helper for cross-dtype. All 11 production call
sites in NumSharp.Core + the Bitmap project are migrated. The legacy
MultiIterator type itself is left in place (still referenced by tests
via AsIterator<T> wrappers in a separate migration phase).

Background
==========
MultiIterator.Assign(lhs, rhs) was the catch-all path for "copy rhs into
lhs with broadcast and stride awareness, casting if dtypes differ." It
worked by constructing two legacy NDIterator<T> instances (one per
operand) and walking them lockstep with per-element delegate dispatch
(MoveNext/MoveNextReference Funcs), reading rhs as its stored dtype and
converting to lhs.TypeCode via Converts.FindConverter on every read.

Performance baseline:
  - Per-element delegate call (Func<T> for MoveNext / Func<bool> for
    HasNext) is ~5-10 ns on its own.
  - The Converts.FindConverter delegate is another per-element call.
  - Coordinate-based offset calculation re-walks the shape every step
    via ValueCoordinatesIncrementor.
  - For matching dtypes this was already wasteful; the existing
    NpyIter.TryCopySameType IL kernel did the same job 3-10x faster.
  - For mismatched dtypes there was no faster path.

Five of the eleven sites already guarded with
`if (NpyIter.TryCopySameType(...))` and only fell through to
MultiIterator.Assign on dtype mismatch -- those now collapse into one
call.

NpyIter.Copy(dst, src) -- new umbrella
======================================
Lives next to TryCopySameType in the static NpyIter class
(Backends/Iterators/NpyIter.cs). Strategy:

  1. Same dtype -> TryCopySameType. Existing IL copy kernel covers
     contiguous (memcpy via cpblk) and strided/broadcast (struct-generic
     INumber<T> kernel that the JIT auto-vectorizes per dtype).

  2. Cross dtype -> CopyStridedToStridedWithCast. Reuses the same
     CreateCopyState (broadcast-aware shape resolution + axis
     coalescing) but runs a per-element NpyIterCasting.ConvertValue
     loop. Adds the stride-aware-on-both-sides variant -- the existing
     NpyIterCasting helpers covered (strided->contig) and
     (contig->strided) but not (strided->strided), which is what arises
     when both operands are non-contiguous (broadcast src AND
     transposed/sliced dst).

The cast path is still scalar (one ConvertValue per element) but it
benefits from coalesced-axis iteration (1-D walk on contiguous-pair
inputs). The JIT-friendly INumber<T> SIMD path for matching dtypes was
the bigger common-case win; the cast path is rare in practice and
parity-correct here.
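
Concretely, the call-site collapse looks like this (a sketch; identifiers follow
this commit, the exact guard shape at each site varies):

    // before (the five guarded sites):
    if (!NpyIter.TryCopySameType(lhs.Storage, rhs.Storage))
        MultiIterator.Assign(lhs.Storage, rhs.Storage);   // legacy per-element delegate walk

    // after:
    NpyIter.Copy(lhs.Storage, rhs.Storage);               // same dtype -> SIMD kernel, cross dtype -> strided cast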

NpyIterCasting.cs -- new helper
===============================
CopyStridedToStridedWithCast: combines the shape-driven coordinate
walking of CopyStridedToContiguousWithCast and
CopyContiguousToStridedWithCast. Walks innermost-axis-first (C-order),
applies element-size multiplication via InfoOf.GetSize on both source
and destination, and accepts stride=0 dims (broadcast) without special-
casing -- they just contribute zero to srcOffset.

Migrated call sites (10 in NumSharp.Core + 1 in Bitmap)
=======================================================

A. UnmanagedStorage.Setters.cs -- 4 sites (lines 316, 382, 432, 483):
   Each was the broadcast/sliced fallback in SetData(NDArray|IArraySlice,
   int[]|long[]). Mechanical 1-line API swap. Comment-only reference at
   line 327 also updated for consistency.

B. UnmanagedStorage.Cloning.cs:378 -- non-contig CloneData() fallback.
   Already had the TryCopySameType guard at line 377; collapsed both
   into one Copy() call. -1 LOC.

C. UnmanagedStorage.cs:1439 -- CopyTo<T>(T*) non-contig path. Wraps the
   destination raw pointer in an UnmanagedStorage and invokes Copy.

D. NDArray.String.cs:91 -- char-array string materialization for
   non-contig sliced storage. Same-dtype char->char hits the SIMD
   TryCopySameType fast path now.

E. np.copyto.cs:29 -- np.copyto's broadcast/cast fallback. Also had a
   TryCopySameType guard; collapsed. -2 LOC.

F. np.concatenate.cs:107 -- per-axis-index slice copy in the concatenate
   inner loop.

G. NDArray.Copy.cs:35 -- F-order copy fallback. Had the TryCopySameType
   guard; collapsed. -1 LOC.

H. NumSharp.Bitmap/np_.extensions.cs:254 -- non-contiguous bitmap
   scanline copy. Required adding InternalsVisibleTo("NumSharp.Bitmap")
   to NumSharp.Core/Assembly/Properties.cs since NpyIter.Copy is
   internal (consistent with the rest of the NpyIter surface -- kept
   internal because NpyIterRef is an internal ref struct). Also fixes
   the pre-existing OpenBugs.Bitmap.cs:48 / Bug 4 ("ToBitmap on
   non-contiguous sliced array fails inside MultiIterator") because
   NpyIter.Copy correctly handles arbitrary stride.

Behavioral divergence found and fixed
=====================================
NDArray.Indexing.Masking.cs:SetBooleanMaskAxis0 had a hidden dependency
on legacy MultiIterator.Assign's BIDIRECTIONAL broadcast (Shape.Broadcast
returns a common shape; both operands stretch to fit). NumPy's np.copyto
(and now NpyIter.Copy) is one-directional: src must be broadcastable to
dst.shape.

The Case6_Assignment_BroadcastValue test exposed this. The NumPy-level
operation was:

    arr2d = np.arange(12).reshape(3, 4)
    arr2d[[True, False, True]] = np.array([[100, 101, 102, 103]])

NumPy resolves this by computing the masked target shape (2, 4) and
broadcasting the (1, 4) value to it. NumSharp iterates row-by-row and
called MultiIterator.Assign(row_of_shape_4, value_of_shape_1_4) per
selected row. Legacy bidirectional broadcast made this work
accidentally; the new strict NpyIter.Copy correctly rejects (1,4) -> (4)
because that is not a valid NumPy broadcast.

Fix: in the per-row branch (the catch-all "Broadcast value to
destination" path), squeeze leading singleton axes from value until
its ndim matches destSlice.ndim. This matches what the NumPy mask-assign
high-level operation produces row-by-row.

Verification
============
- Smoke test (8 cases) compares NpyIter.Copy vs MultiIterator.Assign
  for: same-dtype contig->contig, cross-dtype int64->float64 broadcast,
  cross-dtype int32->float32 with transpose stride, cross-dtype
  double->int truncation, same-dtype broadcast 1D->2D, empty array,
  random parity, broadcast+stride combo. All 8 pass byte-for-byte.

- Full suite (TestCategory!=OpenBugs&TestCategory!=HighMemory) on
  net8.0 + net10.0: 6748/6748 pass, zero failures.
  Baseline before this PR: 6710. Net delta +38 (22 new TileTests in
  prior commit + 1 unmarked Tile_ApiGap + 15 from misc fixes/
  un-skipped tests since baseline measurement).

Performance notes
=================
- Same-dtype callers (90%+ of traffic in practice -- most SetData/
  copyto/concatenate calls are typed assignments where source and dest
  match): no measurable change. Already used the IL copy kernel via the
  existing TryCopySameType guards or now hit it through the umbrella.

- Cross-dtype callers (rare): per-element scalar ConvertValue is on par
  with the legacy delegate-based path -- both pay one virtual call per
  element. The NpyIter coordinate-walking is slightly tighter (single
  stackalloc coords vs ValueCoordinatesIncrementor allocation) but this
  is not on a hot path.

- The MultiIterator.Assign call-graph is now empty in src/. The file
  itself (314 LOC) and the 12 NDIterator.Cast.* partials remain because
  test code still uses the AsIterator<T> extension method (which the
  next migration phase will redirect to an NpyIter wrapper, deleting
  the legacy types entirely).

Risks
=====
- The NumSharp.Bitmap project now depends on NumSharp.Core internals
  via InternalsVisibleTo. Bitmap was already consuming
  UnmanagedStorage/UnmanagedMemoryBlock<byte> directly so the
  internal-types coupling already existed; this just adds NpyIter.Copy.

- One existing call site (SetBooleanMaskAxis0 catch-all) was relying
  on undocumented bidirectional broadcast leniency. The fix makes
  NumSharp's mask-assign more NumPy-aligned (1-directional broadcast
  per copy call) and the singleton-squeeze handles the row-by-row case
  the legacy code was getting "for free" from the bidirectional
  broadcast. Other call sites that went through MultiIterator.Assign
  with intentional bidirectional broadcast would break the same way --
  none surfaced in the 6748-test suite.
Expose the entire NpyIter / NpyIterRef / NpyIterState / NpyExpr system
as a first-class public API so NumSharp consumers can drive the nditer
core, plug custom inner loops, compose expression trees, and implement
custom reduction/axis kernels — matching NumPy's `np.nditer` extensibility.

=== NDArray-first API on NpyIter ===

NpyIter's static helpers (Copy, ReduceBool, TryCopySameType) previously
required consumers to unwrap `.Storage` manually. Added NDArray overloads
that forward to the existing UnmanagedStorage overloads — keeping Storage
as a secondary entry point for low-level code that constructs fresh
buffers from raw pointers (bitmap Scan0, pinned strings, etc.).

Copy's full XML doc now lives on the NDArray overload; the Storage
overload inherits it via <inheritdoc>, making NDArray the primary form.

Migrated 7 in-hand NDArray callsites to drop the `.Storage` wart:
 - np.copyto(NDArray, NDArray)
 - NDArray.copy()
 - np.concatenate (per-axis writeTo/writeFrom inner loop)
 - Default.All / Default.Any impl dispatch (both generic and decimal)

Remaining `.Storage` callsites are legitimately Storage-native (inside
UnmanagedStorage methods, or wrapping raw T* / ArraySlice / bitmap
Scan0 buffers where no NDArray exists).

=== Full public API promotion ===

Every `internal` access modifier in src/NumSharp.Core/Backends/Iterators/
has been removed. Users can now:

1. Call NpyIter's static helpers:
     NpyIter.Copy(ndDst, ndSrc)
     NpyIter.ReduceBool<T, NpyAllKernel<T>>(nd)
     NpyIter.TryCopySameType(ndDst, ndSrc)
     NpyIter.CreateCopyState / CreateReductionState

2. Drive NpyIterRef directly:
     NpyIterRef.New / MultiNew / AdvancedNew
     Iternext, Reset, GotoIterIndex, GotoIndex, GotoMultiIndex,
     GetDataPtrArray, GetInnerStrideArray, GetInnerLoopSizePtr,
     GetIterView, GetValue<T>, SetValue<T>, RemoveAxis,
     RemoveMultiIndex, EnableExternalLoop, Copy, Dispose, RawState

3. Plug custom inner loops via:
     NpyInnerLoopFunc delegate (raw)
     INpyInnerLoop interface (struct-generic, zero-alloc)
     INpyReducingInnerLoop<TAccum> (accumulator-threaded)
     NpyIterInnerLoopFunc / NpyIterNextFunc / NpyIterGetMultiIndexFunc

4. Implement custom reduction/axis kernels via:
     INpyBooleanReductionKernel<T>
     INpyAxisNumericReductionKernel<T>
     INpyAxisSameTypeKernel<T>
     INpyAxisDoubleReductionKernel
     INpyIterKernel

5. Use predefined kernel structs:
     NpyAllKernel<T>, NpyAnyKernel<T>
     NpySumAxisKernel<T>, NpyProdAxisKernel<T>,
     NpyMaxAxisKernel<T>, NpyMinAxisKernel<T>
     CumSumAxisKernel<T>, CumProdAxisKernel<T>
     VarAxisDoubleKernel, StdAxisDoubleKernel
     CountNonZeroKernel<T>  (INpyReducingInnerLoop<long>)

6. Drive NpyAxisIter for axis-scoped ops:
     ExecuteSameType, ReduceDouble, ReduceBool, ReduceNumeric

7. Build expression trees via NpyExpr (Tier 3C custom-op API):
     NpyExpr.Input / Const / Add / Mul / Div / Power / Sqrt / Exp / ...
     expr.Compile(inputTypes, outputType, cacheKey) -> NpyInnerLoopFunc
     Subclass NpyExpr and override EmitScalar / EmitVector /
       SupportsSimd / AppendSignature for custom IL-emitting nodes
     Use NpyExprCompileContext in custom overrides
     InputNode, ConstNode, BinaryNode, UnaryNode, ComparisonNode,
       MinMaxNode, WhereNode, CallNode sealed impls
     DelegateSlots static registry for user-bound delegates

8. Access supporting infrastructure:
     NpyIterState (iterator state, low-level)
     NpyAxisState (axis-iterator state)
     NpyIterCasting (cross-dtype cast helpers)
     NpyIterCoalescing (axis coalescing)
     NpyIterBufferManager (aligned buffer alloc)
     NpyIterPathSelector / NpyIterExecution (path dispatch)
     Low-level pointer helpers on NpyIter:
       CoalesceAxes, UpdateLayoutFlags, IsContiguous, Advance
     Constants: StackAllocThreshold, MaxDims

Only `private` members remain — those are backing fields, private
helpers, and private caches (e.g., DelegateSlots._delegates). These are
standard encapsulation, not API surface.
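
A hedged usage sketch of the now-public surface (names as listed above; exact
signatures, generic constraints, and required flags may differ):

    var src = np.arange(12).reshape(3, 4);
    var dst = np.zeros_like(src);

    NpyIter.Copy(dst, src);                                     // NDArray-first umbrella copy
    var all = NpyIter.ReduceBool<int, NpyAllKernel<int>>(src);  // predefined boolean-reduction kernel (backs Default.All)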

=== Scope ===

Files promoted (Iterators/ folder, 12 files):
 - NpyIter.cs + .State.cs + .Execution.cs + .Execution.Custom.cs
 - NpyIterKernels.cs, NpyLogicalReductionKernels.cs
 - NpyAxisIter.cs + .State.cs
 - NpyExpr.cs (all expression nodes + compile context + DelegateSlots)
 - NpyIterCasting.cs, NpyIterCoalescing.cs, NpyIterBufferManager.cs

Callsite migrations (5 files):
 - Creation/NDArray.Copy.cs, Creation/np.concatenate.cs
 - Manipulation/np.copyto.cs
 - Backends/Default/Logic/Default.All.cs + Default.Any.cs

=== Verification ===

Full build clean across net8.0 + net10.0.
All 6,748 tests pass on both frameworks (filter: !OpenBugs & !HighMemory).
… count_nonzero, np.where to NpyIter

Continues the legacy iterator migration started in commit 65e64618. Replaces
remaining AsIterator<T> call sites in five production paths with the
NpyIter-based iteration machinery, keeping the full test suite at 6,748
passing on net8.0 and net10.0.

API surface change: elevates NpyIter types (NpyIterRef, NpyIter static class,
NpyIterState, kernel interfaces, delegates, flags, NPY_ORDER/NPY_CASTING,
NpyExpr nodes, INpyInnerLoop / INpyReducingInnerLoop, boolean/axis kernel
interfaces) from internal to public so external callers — including the
downstream NumSharp.Bitmap project — can consume the new iterator without an
InternalsVisibleTo entry. Dropping the NumSharp.Bitmap InternalsVisibleTo in
Properties.cs because NpyIter.Copy is now public.

Adds NDArray-accepting overloads for NpyIter.Copy, NpyIter.TryCopySameType,
and NpyIter.ReduceBool so call sites can pass NDArray directly without going
through .Storage. Call sites in np.copyto, np.concatenate, NDArray.Copy,
Default.All, and Default.Any use the new overload.

Step 5: NaN reductions (~150 LOC saved)
---------------------------------------
New file: src/NumSharp.Core/Backends/Iterators/NpyNanReductionKernels.cs

Adds INpyReducingInnerLoop<TAccum> struct kernels for scalar NaN reductions:
- NanSumFloatKernel / NanSumDoubleKernel         (accum = float/double)
- NanProdFloatKernel / NanProdDoubleKernel       (accum = float/double)
- NanMinFloatKernel / NanMinDoubleKernel         (accum = NanMinMax*Accum)
- NanMaxFloatKernel / NanMaxDoubleKernel         (accum = NanMinMax*Accum)
- NanMeanFloatKernel / NanMeanDoubleKernel       (accum = NanMeanAccumulator, sum+count)
- NanSquaredDeviationFloatKernel / ...Double      (pass 2 of two-pass variance)

Accumulator types:
- NanMeanAccumulator      { double Sum; long Count; }
- NanMinMaxFloatAccumulator  { float  Value; bool Found; }
- NanMinMaxDoubleAccumulator { double Value; bool Found; }

All kernels process one inner-loop chunk at a time using per-operand byte
strides, so they transparently support contiguous, sliced, broadcast, and
transposed layouts without any code branching.
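
The per-element rule is the same in every kernel; a standalone sketch of one
NaN-skipping sum chunk (the real kernels are struct INpyReducingInnerLoop<TAccum>
implementations driven by the iterator, not free functions):

    static unsafe double NanSumChunk(byte* data, long count, long byteStride)
    {
        double sum = 0;
        for (long i = 0; i < count; i++, data += byteStride)
        {
            double v = *(double*)data;
            if (!double.IsNaN(v)) sum += v;    // NaN elements are skipped, not propagated
        }
        return sum;
    }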

Migrated AsIterator call sites:
- np.nanmean.cs (nanmean_scalar): one-pass sum+count, AsIterator -> ExecuteReducing
- np.nanvar.cs  (nanvar_scalar):  two-pass mean + squared deviation
- np.nanstd.cs  (nanstd_scalar):  two-pass mean + squared deviation + sqrt
- Default.Reduction.Nan.cs (NanReduceScalarFloat, NanReduceScalarDouble):
  Sum/Prod/Min/Max over 12 AsIterator loops collapsed to 8 ExecuteReducing calls

The two-pass variance path preserves numerical behavior of the legacy code
exactly (no Welford switch) — pass 1 accumulates sum/count, pass 2 accumulates
sum of (value - mean)^2 with mean held in the kernel struct's field.

Step 6: count_nonzero + BooleanMask (~40 LOC saved)
---------------------------------------------------
New kernel: CountNonZeroKernel<T> in NpyLogicalReductionKernels.cs
Generic over any unmanaged T, uses EqualityComparer<T>.Default.Equals
(devirtualized by the JIT per struct instantiation) for the != default check.

Migrated:
- Default.NonZero.cs (count_nonzero<T>): strided fallback now uses
  NpyIterRef + ExecuteReducing<CountNonZeroKernel<T>, long>. Contiguous fast
  path untouched.
- Default.BooleanMask.cs (BooleanMaskFallback): two-pass migration.
  Pass 1 counts trues via NpyIter (same kernel reused). Pass 2 gathers via
  a 2-operand NpyIter (arr + mask, NPY_CORDER for lockstep C-order traversal)
  with an accumulator-threaded BooleanMaskGatherKernel that stores the write
  cursor and destination pointer as ref-updated state.

NPY_CORDER is required on the gather pass because boolean indexing is
defined in logical C-order, not memory-efficient iteration order; NPY_KEEPORDER
on a transposed array produces wrong results (Case12_TransposedArray_BooleanMask).

Step 7: np.where (3-5x perf win on non-contig path)
---------------------------------------------------
Migrated np.where.cs WhereImpl<T>: the 4-lockstep AsIterator<bool> +
AsIterator<T> x 3 path is replaced with a single 4-operand NpyIter
compiling Where(Input(0), Input(1), Input(2)) into a SIMD-capable IL kernel
via ExecuteExpression. Cache key is $"np.where.{dtype}" so repeated calls hit
the cached compiled kernel.
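
Roughly (a sketch; the Compile argument list follows the NpyExpr description
earlier in this PR, and inputTypes / outputType / dtype are left abstract here):

    var expr = NpyExpr.Where(NpyExpr.Input(0), NpyExpr.Input(1), NpyExpr.Input(2));
    var kernel = expr.Compile(inputTypes, outputType, $"np.where.{dtype}");   // compiled once per dtype, then cached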

Also moves the condition-to-bool cast from implicit AsIterator<bool>
per-element casting into an explicit cond.astype(Boolean, copy: false) at
the top of where_internal. This also lets the SIMD fast path (canUseKernel)
handle non-bool conditions, closing a pre-existing behavioral asymmetry.

Bug discovered (collected, not fixed here)
-------------------------------------------
NpyIterRef.New(arr) / MultiNew without NpyIterGlobalFlags.EXTERNAL_LOOP
exposes wrong inner-loop counts. Each kernel invocation is called with
count == IterSize but the base data pointer only advances by one element
between calls, so the kernel reads past the end of the array. Workaround:
pass NpyIterGlobalFlags.EXTERNAL_LOOP on every NpyIterRef.New call for bulk
iteration. All migrated call sites above use EXTERNAL_LOOP consistently.

Test impact
-----------
Full suite (TestCategory!=OpenBugs&TestCategory!=HighMemory):
  Before: 6,748 passed
  After:  6,748 passed (no regressions, no new tests added this phase)
Framework coverage: net8.0 + net10.0.
…merator migrated off AsIterator

Step 8 — random sampling and multi-dim array casting
-----------------------------------------------------
np.random.dirichlet:
  When alpha comes in as an NDArray (possibly strided / wrong dtype), the
  foreach + AsIterator<double> copy into a local flat double buffer is
  replaced by an NpyIter.Copy wrapping the destination ArraySlice as an
  UnmanagedStorage (with Shape.Vector(k)). NpyIter.Copy absorbs both the
  source layout and the any-numeric-dtype->double cast in one call.

np.random.multivariate_normal:
  Same pattern as dirichlet — mean.AsIterator<double> copy into meanSlice
  becomes NpyIter.Copy(meanStorage, mean.Storage). The cov copy loop is
  left as GetDouble-per-element for now because it already needs element
  traversal (SVD decomposition consumes cov immediately after).

NDArray.ToMuliDimArray<T>:
  Replaces the AsIterator + ValueCoordinatesIncrementor + per-element
  Array.SetValue loop (one boxed runtime type check per element) with a
  single Storage.ToArray<T>() call followed by Buffer.BlockCopy into the
  multi-dimensional destination array. Both .NET multi-dim arrays and
  NumSharp arrays are row-major (C-order), so the flat buffer lines up
  directly with the destination's linear backing storage.

  Decimal is not a primitive so Buffer.BlockCopy rejects it; that one
  dtype falls back to the coordinate-walk + SetValue path. All 11 other
  supported dtypes now take the bulk-memcpy fast path (expected 5-10x for
  primitives depending on array size, mostly due to eliminating the
  runtime type-check per SetValue).
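
The fast path reduces to roughly this (a sketch with an illustrative shape and dtype):

    var nd = np.arange(12).reshape(3, 4);               // C-order int32
    int[] flat = nd.Storage.ToArray<int>();             // flat contiguous copy in logical order
    var dst = new int[3, 4];
    Buffer.BlockCopy(flat, 0, dst, 0, flat.Length * sizeof(int));   // both sides are row-major, so rows line up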

Step 9 — NDArray.GetEnumerator
------------------------------
The 12-dtype switch over `new NDIterator<T>(this, false).GetEnumerator()`
collapses to a single `_iter1D<T>()` helper that materializes via
Storage.ToArray<T>() (which already has a Buffer.MemoryCopy fast path for
contiguous arrays and a coordinate-walk for strided). Foreach over a flat
T[] avoids the per-element Func<T> delegate calls of the legacy iterator.

For large 1-D arrays this allocates an additional T[] equal to the array
size. Consumers of GetEnumerator are typically pretty-printing / format
routines, which already allocate strings proportional to the data, so the
transient extra allocation is not a net regression.

np.broadcast.iters — deferred
-----------------------------
Broadcast.iters is declared `public NDIterator[]` which is part of the
external API surface. Migrating it requires first reworking NDIterator
itself (Step 10) so that AsIterator returns an NpyIter-backed wrapper.
Left unchanged in this commit; the type-name stays but the underlying
implementation gets swapped when Step 10 lands.

Test impact
-----------
Full suite still at 6,748 / 6,748 passing on both net8.0 and net10.0 with
the CI filter (TestCategory!=OpenBugs&TestCategory!=HighMemory).
…gacy files (-3870 LOC)

Step 10 of the legacy iterator migration. With the production call sites
in Phases 1+2 already migrated off of MultiIterator.Assign and the raw
delegate-based AsIterator hot loops, the entire legacy iterator core can
now be removed. What remains is a thin backward-compatibility shim so
that ~86 existing call sites in tests and documentation continue to
compile untouched.

What gets deleted (3,870 net LOC removed)
-----------------------------------------
- src/NumSharp.Core/Backends/Iterators/NDIteratorCasts/
    NDIterator.Cast.Boolean.cs      (254 LOC)
    NDIterator.Cast.Byte.cs         (254 LOC)
    NDIterator.Cast.Char.cs         (254 LOC)
    NDIterator.Cast.Decimal.cs      (254 LOC)
    NDIterator.Cast.Double.cs       (254 LOC)
    NDIterator.Cast.Int16.cs        (254 LOC)
    NDIterator.Cast.Int32.cs        (254 LOC)
    NDIterator.Cast.Int64.cs        (254 LOC)
    NDIterator.Cast.Single.cs       (254 LOC)
    NDIterator.Cast.UInt16.cs       (254 LOC)
    NDIterator.Cast.UInt32.cs       (254 LOC)
    NDIterator.Cast.UInt64.cs       (254 LOC)
    (12 Regen-generated partials dispatching per source dtype)
- src/NumSharp.Core/Backends/Iterators/NDIterator.template.cs  (255 LOC)
    (the Regen source template the Cast files were generated from)
- src/NumSharp.Core/Backends/Iterators/MultiIterator.cs        (313 LOC)
    (now dead — all MultiIterator.Assign call sites migrated in Phase 1)
- src/NumSharp.Core/Backends/Iterators/IteratorType.cs         (9 LOC)
    (the Scalar/Vector/Matrix/Tensor dispatch enum — NpyIter makes it
    redundant since it picks the right traversal automatically)

What stays (with a new, much simpler implementation)
----------------------------------------------------
- NDIterator.cs — reduced from 482 LOC of per-path dispatch
  (AutoReset x Sliced x Scalar/Vector/Matrix/Tensor x NoCast/Cast and
  their full factorial combinations with delegate captures) to 172 LOC
  of a single path:

    1. ctor wraps the src IMemoryBlock as an UnmanagedStorage via
       CreateBroadcastedUnsafe (bypasses the size-match check so broadcast
       shapes with stride=0 axes work).
    2. Materialize allocates a fresh contiguous NDArray with fresh
       C-order strides of TOut dtype and calls NpyIter.Copy — which
       absorbs source layout (contiguous/sliced/strided/transposed),
       broadcast, and any-to-TOut dtype conversion in one SIMD-capable
       pass.
    3. MoveNext/HasNext/Reset/MoveNextReference become trivial pointer
       arithmetic on the materialized buffer.

  The target shape's dimensions are cloned and passed to `new Shape(dims)`
  so the destination has fresh row-major strides and is writeable (the
  input shape may be a read-only broadcast view with stride=0 axes that
  would trip NumSharpException.ThrowIfNotWriteable otherwise).

- INDIterator.cs — removed `IteratorType Type { get; }` member since
  the enum is gone. The rest of the interface (Block, Shape,
  BroadcastedShape, AutoReset, MoveNext<T>, MoveNextReference<T>,
  HasNext, Reset) is preserved. np.Broadcast.iters of type NDIterator[]
  continues to work unchanged.

Trade-off
---------
Iteration now allocates O(size) backing memory up front instead of
walking coordinates lazily. In exchange the per-element hot path is
just `*(ptr + cursor++)` — no delegate dispatch, no ValueCoordinatesIncrementor
arithmetic, no Converts.FindConverter closure capture, no
IteratorType-based switch. For iteration patterns that read all or most
of the elements (which is the common case for .AsIterator users) this
is a net perf win; for patterns that early-exit after reading a handful
of elements, the up-front materialization is wasted work.

MoveNextReference now always returns a reference into the materialized
buffer rather than the source, so callers that used MoveNextReference
as a write-port into the source array will silently write to the local
buffer. This is a behavioral change from the legacy path which supported
MoveNextReference over the original storage in certain non-cast Scalar/
Vector paths. No in-tree caller relied on that behavior (the remaining
test and benchmark usages are all read-only).

Test impact
-----------
Full CI suite still 6,748 / 6,748 passing on net8.0 and net10.0 with
(TestCategory!=OpenBugs&TestCategory!=HighMemory). The NumSharp.Bitmap
and NumSharp.Benchmark projects both build.

Cumulative Phase 1 + Phase 2 impact
-----------------------------------
Twenty-five legacy iterator call sites across eleven files migrated to
NpyIter across six commits. Roughly 4,000 lines of legacy iterator code
deleted, with the same test suite pass count (6,748) maintained at each
step. The remaining AsIterator callers (tests, benchmarks, a couple of
documentation ref comments) no longer pull in any per-dtype Cast switch
or MultiIterator code path — they go through the thin NDIterator
wrapper backed entirely by NpyIter.
…thout EXTERNAL_LOOP

Three symptoms of one bug in NpyIter.Execution.cs. The driver loops —
ForEach, ExecuteGeneric(Single/Multi), and ExecuteReducing — pulled
their per-call count from `GetInnerLoopSizePtr()`, which always returns
`&_state->Shape[NDim - 1]` when the iterator isn't BUFFER'd. In EXLOOP
mode that's correct: `iternext` (via ExternalLoopNext) advances
`IterIndex` by `Shape[NDim - 1]` per call.

But in the default non-EXLOOP non-BUFFER mode, `iternext` (via
StandardNext) only advances by one element per call — `state.Advance()`
increments `IterIndex` by 1. The kernel was still told `count =
Shape[NDim - 1]`, so:

1. The kernel reads `Shape[NDim - 1]` elements starting at the current
   data pointer, which extends past the last valid element of the
   source array.
2. The driver then calls iternext, which advances the pointer by one
   element.
3. The next kernel call reads `Shape[NDim - 1]` elements starting one
   element later — again past the end — and so on.

Net effect: an N-element 1-D array triggers N kernel invocations, each
reading N "elements" (with massive overlap), the last ~N-1 of which
read uninitialized memory. For `np.array([1, 2, NaN, 4, 5])` the
returned NanSum was 46 instead of 12 because the kernel saw the array
plus four trailing garbage floats added together four times over.

Discovered during the Phase 2 migration when wiring the NaN reduction
kernels into NpyIter. Worked around at the call sites by always passing
`NpyIterGlobalFlags.EXTERNAL_LOOP`, which keeps iternext and
GetInnerLoopSizePtr in agreement.

This commit fixes the bug at the source so future callers don't need
the workaround. Approach:

- New helper `ResolveInnerLoopCount()` returns the correct count given
  the current flag combination:
    BUFFER:  _state->BufIterEnd
    EXLOOP:  _state->Shape[NDim - 1]
    else:    1
- ForEach, ExecuteGenericSingle, ExecuteGenericMulti, ExecuteReducing
  use ResolveInnerLoopCount instead of dereferencing
  GetInnerLoopSizePtr. BUFFER mode still reads the pointer per
  iteration because buffer fills can shrink at the tail.

Both EXLOOP and non-EXLOOP paths now produce correct results. The
existing Phase 2 call sites keep EXLOOP because it's the SIMD-optimal
mode (one call covers the whole inner dimension), but callers who omit
the flag no longer get silently-wrong output.
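
Condensed into a standalone sketch (the real helper reads the iterator's flags
and state directly rather than taking parameters):

    static long ResolveInnerLoopCount(bool buffered, bool externalLoop, long bufIterEnd, long innerDimLength)
    {
        if (buffered)     return bufIterEnd;        // buffer fills can shrink at the tail
        if (externalLoop) return innerDimLength;    // one kernel call covers Shape[NDim - 1]
        return 1;                                   // StandardNext advances IterIndex by one element
    }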

Test impact: 6,748 / 6,748 passing on net8.0 and net10.0, and the
bug-repro smoke test (NanSum over a strided 1-D array without
EXTERNAL_LOOP) now returns the correct sum.
Project-specific CLAUDE.md at examples/NeuralNetwork.NumSharp/.claude/
so future agents working in the example project get the right context
without needing to rediscover everything from the code.

Contents (~280 lines):
  * Build / Run — csproj setup (Exe, net8+net10, AllowUnsafeBlocks),
    InternalsVisibleTo scope, where to drop real MNIST IDX files,
    current demo defaults (epochs=100, batch=128, Adam lr=1e-3,
    synthetic sigma=2.5, eval cadence min(5, epochs)).
  * Directory Map — every file with a one-line purpose.
  * MnistMlp fusion — the three NpyExpr trees that collapse the
    post-matmul element-wise chunks into single NpyIter kernels
    (forward ReLU bias+activation, forward linear bias-only,
    backward ReLU gradient mask).
  * Layer/Cost/Optimizer contract — what every BaseLayer subclass must
    populate (Input/Output/Grads/InputGrad, Parameters["w"/"b"]).
  * Sharp edges — 8 gotchas: historical np.dot strided 100x cliff
    (now fixed by the stride-aware GEMM), 2-index `x[i,j]` vs slice,
    argmax needing axis, np.allclose mutating its arguments via
    astype(copy:false), argmax returning Int64 not Int32, Adam's
    step counter needing monotonic iteration, pre-fix FC weight init,
    slice dtype.
  * Perf characteristics — 100-epoch run numbers, fusion probe, kernel
    cache + delegate-slot instrumentation.
  * Testing — the in-line `dotnet_run` smoke-test pattern.
  * Q&A — why Accuacy/BinaryAccuacy keep the typo, why
    SoftmaxCrossEntropy lives in MnistMlp/ rather than Cost/, when to
    use NeuralNet.Train vs MlpTrainer, real-MNIST expected accuracy.
  * Known limitations — no shuffling, no validation split, Adam
    re-allocates per step, no serialization, string-vs-enum activation
    inconsistency between FullyConnected and FullyConnectedFused.
… copy

Previous rewrite (commit 87f90a2b) backed NDIterator<TOut> with an
eagerly-materialized NDArray buffer — it ran NpyIter.Copy at construction
time into a contiguous TOut-typed buffer and walked that buffer on
MoveNext. Simple, but allocated O(size * sizeof(TOut)) up front even for
callers that read one element and walk away, or abandon iteration early.

This commit drops the materialization. MoveNext now reads each element
lazily from the source layout:

- Same-type, contiguous, offset == 0:
    Direct `*((TOut*)addr + cursor++)`. One pointer increment per call,
    no coordinate arithmetic, no branch. Matches the legacy contiguous
    fast path.

- Same-type, strided / sliced / broadcast / offset != 0:
    Walks offsets with ValueOffsetIncrementor (or
    ValueOffsetIncrementorAutoresetting when AutoReset is set). The
    incrementor updates one coordinate per call amortized O(1), with
    occasional O(ndim) carry-propagation for wrap-around. Same algorithm
    the legacy code used for its Matrix/Tensor sliced paths.

- Cross-type (source dtype != TOut):
    Offset-walks the source at its native dtype, reads a TSrc element,
    and passes it through `Converts.FindConverter<TSrc, TOut>()` before
    returning TOut. One switch at construction dispatches to a typed
    BuildCastingMoveNext<TSrc>() helper — the per-element hot path is
    then a `TSrc v = *(...)` read followed by a `conv(v)` delegate call,
    matching the legacy cast-iterator performance profile. For
    consistency with the legacy path, MoveNextReference throws when a
    cast is involved — you can't hand out a stable ref to a converted
    value.

AutoReset is implemented inline (`if (cursor >= size) cursor = 0` in the
contig path, ValueOffsetIncrementorAutoresetting in the strided path)
rather than via modulo-per-call so the steady-state cost is a single
predictable branch per MoveNext.

Memory: iteration now costs O(1) for contig, O(ndim) for the
incrementor's Index[] and internal state on strided. No full-array
allocation regardless of source size.

Test impact: 6,748 / 6,748 passing on net8.0 + net10.0 with the CI
filter (TestCategory!=OpenBugs&TestCategory!=HighMemory). Smoke test
covering contig / strided / transposed / cross-type / auto-reset / Reset /
foreach round-trip all match expected element sequences.
Replaces the lazy-but-standalone ValueOffsetIncrementor path with one
that constructs an NpyIter state and drives MoveNext / HasNext / Reset
directly off that state. NDIterator is now an honest thin wrapper
over NpyIter — the same traversal machinery used by all the Phase 2
production call sites — rather than reimplementing the coord-walk
logic with legacy incrementors.

How it works
------------
- ctor calls NpyIterRef.New(arr, NPY_CORDER) to build the state, then
  transfers ownership of the NpyIterState* pointer out of the ref
  struct (see NpyIterRef.ReleaseState / FreeState below). The class
  holds that pointer for its lifetime and frees it in Dispose (or in
  the finalizer as a safety net).
- MoveNext reads `*(TOut*)state->DataPtrs[0]` then calls
  `state->Advance()`. IterIndex tracks position, IterEnd bounds the
  non-AutoReset case, and `state->Reset()` restarts from IterStart on
  AutoReset wraparound and on explicit Reset.
- Cross-dtype wraps the same read with a Converts.FindConverter<TSrc, TOut>
  lookup — one switch at construction picks the typed helper, so the
  per-element hot path is still just one read + one converter delegate
  call. MoveNextReference throws when casting is in play, matching the
  legacy contract.
- NPY_CORDER is explicit so iterating a transposed view yields the
  logical row-major order the old NDIterator provided. Without it,
  KEEPORDER would give memory-efficient order (which e.g.
  `b.T.AsIterator<int>()` would surface as `0 1 2 ... 11` instead of
  the expected `0 4 8 1 5 9 2 6 10 3 7 11`).

NpyIter additions
-----------------
- NpyIterRef.ReleaseState(): hand the owned NpyIterState* to a caller
  who needs it across a non-ref-struct boundary (e.g. a class field).
  Marks the ref struct as non-owning so its Dispose is a no-op.
- NpyIterRef.FreeState(NpyIterState*): static tear-down mirror of
  Dispose's cleanup path — frees buffers (when BUFFER set), calls
  FreeDimArrays, and NativeMemory.Free's the state pointer. The
  long-lived owner calls this from its own Dispose/finalizer.

Bug fixes along the way
-----------------------
NpyIter initialization previously computed base pointers as
`(byte*)arr.Address + (shape.offset * arr.dtypesize)` in two places
(initial broadcast setup on line 340 and ResetBasePointers on line
1972). `arr.dtypesize` goes through `Marshal.SizeOf(bool) == 4` because
bool is marshaled to win32 BOOL, but the in-memory `bool[]` storage is
1 byte per element. For strided bool arrays this produced a base
pointer 4× too far into the buffer.

Switched both sites to `arr.GetTypeCode.SizeOf()` which returns the
actual in-memory size (1 for bool). Surfaced by `Boolean_Strided_Odd`
once NDIterator started routing through NpyIter — previously only
LATENT because the legacy NDIterator path computed offsets in
element units, not bytes, and sidestepped the NpyIter init.

Test impact: 6,748 / 6,748 passing on net8.0 and net10.0 (CI filter:
TestCategory!=OpenBugs&TestCategory!=HighMemory). Smoke test of
same-type contig / cross-type / strided / transposed / broadcast /
AutoReset / Reset / foreach all produce the expected element sequences.
`UnmanagedStorage.DTypeSize` (exposed via `NDArray.dtypesize`) was
delegating to `Marshal.SizeOf(_dtype)`. For every numeric dtype that
matches, but for bool, `Marshal.SizeOf(typeof(bool)) == 4` because bool
is marshaled to win32 BOOL (32-bit). The in-memory layout of `bool[]`
is 1 byte per element, so every caller computing a byte offset as
`ptr + index * arr.dtypesize` was reading/writing 4× too far into the
buffer for bool arrays.
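
The mismatch reproduces outside NumSharp with two plain .NET calls:

    // using System.Runtime.InteropServices; using System.Runtime.CompilerServices;
    Console.WriteLine(Marshal.SizeOf<bool>());    // 4 -- marshaled as win32 BOOL
    Console.WriteLine(Unsafe.SizeOf<bool>());     // 1 -- actual managed in-memory size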

Switches to `_typecode.SizeOf()` which correctly returns 1 for bool and
matches `Marshal.SizeOf` for every other type. 21 existing call sites
(matmul, binary/unary/comparison/reduction ops, nan reductions, std/var,
argmax, random shuffle, boolean mask gather, etc.) now get the right
value without any downstream change.

The bug had been latent until the Phase 2 iterator migration started
routing more code paths through NpyIter.Copy and the new NDIterator
wrapper; it surfaced most visibly as `sliced_bool[mask]` returning the
wrong elements when the source was non-contiguous. With the root fix:

    var full = np.array(new[] { T,F,T,F,T,F,T,F,T });
    var sliced = full["::2"];            // [T,T,T,T,T] non-contig
    var result = sliced[new_bool_mask];  // now correct per-element

np.save.cs already special-cases bool before falling through to
`Marshal.SizeOf`, so serialization was unaffected. Remaining
Marshal.SizeOf references in the codebase are either in comments that
explain this exact issue, or in the `InfoOf<T>.Size` fallback that
only runs for types outside the 12 supported dtypes (e.g. Complex).

Tests: 6,748 / 6,748 passing on net8.0 and net10.0 with the CI filter
(TestCategory!=OpenBugs&TestCategory!=HighMemory).
- Delete 4 NPYITER analysis docs (audit, buffered reduce, deep audit,
  numpy differences) — information consolidated into codebase
- Delete 3 NDIterator.Cast files (Complex, Half, SByte) — casting now
  handled by unified NDIterator<T> backed by NpyIter state
- Update NDIterator.cs: minor adjustments from NpyIter backing refactor
- Update ILKernelGenerator.Scan.cs: scan kernel changes
- Update Default.MatMul.Strided.cs: add INumber<T> constraint support
  for generic matmul dispatch preparation
- Update Default.ClipNDArray.cs: initial NpFunc dispatch refactoring
  replacing 6 switch blocks (~84 cases) with generic dispatch methods
- Update np.full_like.cs: minor fix
- Update RELEASE_0.51.0-prerelease.md release notes
…neric dispatch

NpFunc is a reflection-cached generic dispatch utility that bridges
runtime NPTypeCode values to compile-time generic type parameters.
Hot path (cache hit) runs at ~32ns via Delegate[] array indexed by
NPTypeCode ordinal. Cold path uses MakeGenericMethod + CreateDelegate,
cached after first call per (method, typeCode) pair.
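
The dispatch pattern in isolation (an illustrative sketch, not NpFunc's actual
overload set; MaxTypeCodeValue and MapToType stand in for the enum-derived table
size and the NPTypeCode-to-CLR-Type mapping, and the real utility keeps one table
per call site so (method, typeCode) pairs don't collide):

    static readonly Delegate[] _table = new Delegate[MaxTypeCodeValue + 1];

    static TDelegate Get<TDelegate>(MethodInfo openGeneric, NPTypeCode tc) where TDelegate : Delegate
    {
        var d = _table[(int)tc];
        if (d == null)                                      // cold path, once per dtype
            _table[(int)tc] = d = openGeneric
                .MakeGenericMethod(MapToType(tc))
                .CreateDelegate(typeof(TDelegate));
        return (TDelegate)d;                                // hot path: array index + cast
    }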

Core NpFunc changes:
- Dynamic table sizing: Delegate[] sized from max NPTypeCode enum value
  (was hardcoded [32], broke for NPTypeCode.Complex=128)
- Overloads for 0-6 args × void/returning × 1-3 NPTypeCodes + 1-2 Types
- SmartMatchTypes for multi-type dispatch (1→broadcast, N=N→positional,
  M<N→type-identity matching)
- Per-arity ConcurrentDictionary caches for multi-type dispatch

Files refactored (12 files, ~400 cases eliminated):

Previous session (5 files, ~196 cases):
- Default.ClipNDArray.cs: 6 dispatch methods for contiguous/general clip
- Default.Clip.cs: 3 dispatch methods for scalar clip with ChangeType
- Default.NonZero.cs: 3 dispatch methods for nonzero/count operations
- Default.BooleanMask.cs: 1 dispatch method for masked copy
- Default.Shift.cs: 2 dispatch methods for array/scalar shift

This session (7 files, ~202 cases):
- NDIteratorExtensions.cs: 5 overloads → 5 dispatch methods creating
  NDIterator<T> from NDArray/UnmanagedStorage/IArraySlice
- Default.Reduction.CumAdd.cs: axis dispatch via CumSumAxisKernel<T>,
  elementwise via IAdditionOperators<T,T,T> with default(T) init
- Default.Reduction.CumMul.cs: axis dispatch via CumProdAxisKernel<T>,
  elementwise via IMultiplyOperators + T.MultiplicativeIdentity init
- np.where.cs: iterator fallback + IL kernel dispatch via pointer cast
- np.random.randint.cs: int/long fill via INumberBase<T>.CreateTruncating
- NDArray.NOT.cs: IEquatable<T>.Equals(default) unifies bool NOT and
  numeric ==0 comparison into single generic method
- Default.LogicalReduction.cs: direct dispatch to ExecuteLogicalAxis<T>

Net: -1243 lines removed across 12 files, replacing repetitive per-type
switch cases with single generic dispatch methods.
Complex does not implement IComparable<T>, so NpFunc.Invoke into
ClipArrayBoundsDispatch/ClipArrayMinDispatch/ClipArrayMaxDispatch
crashed with ArgumentException on MakeGenericMethod.

Fix: add NPTypeCode.Complex pre-checks in ClipNDArrayContiguous,
ClipNDArrayGeneral, and ClipCore that route to dedicated Complex
clip methods using lexicographic comparison (real first, then imag).
NaN handling preserves the NaN-containing element as-is (not
replaced with NaN+NaN*i), matching NumPy np.maximum/np.minimum
behavior where "NaN wins" but the original value is returned.

Half NaN propagation: ILKernelGenerator.ClipArrayBoundsScalar,
ClipArrayMinScalar, ClipArrayMaxScalar fell through to the generic
CompareTo path for Half, which treats NaN as less-than-all (IEEE
totalOrder) instead of propagating it. Added Half-specific scalar
methods that check Half.IsNaN explicitly before comparison.
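
The shape of the Half-specific guard (an illustrative standalone version; the
real code lives in the IL-emitted scalar helpers named above):

    static Half ClipScalar(Half v, Half min, Half max)
    {
        if (Half.IsNaN(v)) return v;          // propagate NaN instead of letting CompareTo sort it below min
        if (v < min) return min;
        return v > max ? max : v;
    }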

Also fix NpFunc table sizing: Delegate[] was hardcoded to [32] but
NPTypeCode.Complex=128 caused IndexOutOfRangeException. Now computed
dynamically from max NPTypeCode enum value at static init.

Fixes 14 test failures (12 Complex clip/maximum/minimum constraint
violations, 2 Half NaN propagation in maximum).
Nucs added 15 commits May 13, 2026 09:14
…ast paths

Replaces the broken `PowerInteger` fast-path (which crashed on sliced/broadcast
arrays via `Unsafe.Address`) with a stride-aware integer power emitted by the
existing IL kernel infrastructure. Adds NumPy's "Integers to negative integer
powers are not allowed." ValueError, fast paths for scalar exponents {0,1,2,
0.5,-1.0}, and switches f32 to single-precision `MathF.Pow` (no f64 round-trip).

Audit-v2 tickets resolved:
- T1.3a — np.power(sliced_int32, int32) no longer crashes
- T1.3b — np.power(broadcast_int32, int32) no longer crashes
- T1.36 — int**(-int) now raises ArgumentException matching NumPy ValueError

What changed
============

NEW: src/NumSharp.Core/Utilities/NpyIntegerPower.cs
  Public squared-exponentiation helpers for all 9 integer NumSharp dtypes
  (sbyte/byte/int16/uint16/char/int32/uint32/int64/uint64) — preserves
  dtype-native wraparound (uint8 ** 8 = 0, 15**15 = 437893890380859375).
  Caller validates non-negative exponent.

REWRITE: src/NumSharp.Core/Backends/Default/Math/Default.Power.cs
  - Removes the `Unsafe.Address`-based fast-path that crashed on
    sliced/broadcast operands and ignored strides.
  - Adds pre-scan: for `int**int` with signed-int exponent, scans rhs for
    any negative element and throws `ArgumentException("Integers to negative
    integer powers are not allowed.")`. Matches NumPy's unconditional check
    (rejects base ∈ {±1} too, per NumPy spec).
  - Adds scalar-exponent fast paths when `rhs.size == 1`:
      exp = 0   → ones_like(lhs)
      exp = 1   → lhs.copy() (or cast)
      exp = 2   → lhs * lhs (SIMD-optimized Multiply kernel)
      exp = 0.5 → np.sqrt(lhs)
      exp =-1.0 → np.reciprocal(lhs) (float base only)
    Each path verifies the resolved result dtype matches what the IL kernel
    would produce before substituting, so NEP50 promotion is preserved.
  - Falls through to `ExecuteBinaryOp` for the general case, which now
    walks strides correctly via the IL kernel paths.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs
  - `EmitPowerOperation(il, resultType)`: dispatches to the matching
    `NpyIntegerPower.Pow*` helper for integer result types (replaces the
    `int → double → Math.Pow → int` round-trip that lost precision for
    values outside [-2^52, 2^52]). float32 → `MathF.Pow`; float64 →
    `Math.Pow`; Boolean and other fallthrough types use the original double
    round-trip to keep the kernel verifiable.
  - Cached `MethodInfo` lookups added for all 9 integer power helpers and
    `MathF.Pow`.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Binary.cs
  - `EmitPowerOperation<T>(il)` (same-type contiguous kernel path):
    same dispatch as the mixed-type version. Generic `T` is mapped to the
    matching `NpyIntegerPower.Pow*` helper via `GetIntegerPowMethod<T>()`.

src/NumSharp.Core/Backends/Default/Math/DefaultEngine.BinaryOp.cs
  - Updates the Power promotion comment to document NEP50 weak/strict
    behavior accurately (NumSharp matches NumPy in the common cases; the
    one documented misalignment is 0-D integer arrays explicitly constructed
    via `np.array(2, int32)`, which are indistinguishable from C# `int 2`
    after `np.asanyarray`).

Tests
=====

NEW: test/NumSharp.UnitTest/Math/NDArray.power.Comprehensive.Test.cs (24 tests)
  - Integer dtype-native wrapping (uint8/int8/int32 overflow)
  - Stride + broadcast layouts (sliced, broadcast_to, 2D-vs-1D)
  - Signed integer negative exponent throws (incl. base = ±1)
  - Unsigned integer exponent never throws
  - Float special values (0^0, NaN, ±inf base/exp, fractional neg base)
  - NEP50 promotion (f32 ** int{8,16,32}, f64 ** int*, f32 ** scalar)
  - All 9 integer dtypes smoke-tested via 2^3 = 8

REMOVED [Misaligned]: PowerEdgeCaseTests.Power_Integer_LargeValues
  Integer power now preserves exact precision; the test now asserts equality.

UPDATED: NewDtypesCoverageSweep_Arithmetic_Tests.B35_SByte_Power_NegativeExponent*
  Previously documented the wrong (silent 0/±1) behavior; now asserts the
  NumPy-correct ArgumentException.

UPDATED (removed [OpenBugs]):
  - AuditV2_MathReductions.T1_3a_Power_SlicedInt32_ShouldNotCrash
  - AuditV2_MathReductions.T1_3b_Power_BroadcastInt32_ShouldNotCrash
  - AuditV2_ILKernelSimd.T1_36_* (4 tests)

Validation
==========

  cd test/NumSharp.UnitTest
  dotnet test --no-build --framework net10.0 \
    --filter "TestCategory!=OpenBugs&TestCategory!=HighMemory"
  → Passed: 8255, Failed: 0

  dotnet test --no-build --framework net10.0 \
    --filter "FullyQualifiedName~Power"
  → Passed: 129, Failed: 0

Microbench (1M-element float32, x100 iterations):
  power(arr, 2)     121ms  (fast path → mul; matches multiply baseline 117ms)
  power(arr, 0.5)   121ms  (fast path → sqrt)
  power(arr, 2.7)   518ms  (general path via MathF.Pow)

Behavior changes vs. prior NumSharp
===================================
- int**(-int) now throws (was: silently returned 0, 1, or -1).
  Matches NumPy 2.4.2 ValueError exactly.
Adds the iterator-subsystem branch audit documents that drove this
branch's bug-fix and refactor work:

- `NDITER_BRANCH_QUALITY_AUDIT.md` — original (V1) audit walking every
  changed src file and ranking findings by severity (bugs → perf →
  parity gaps → refactors → clean review). Bug catalog includes:
  np.maximum/minimum NaN handling, np.power stride mishandling,
  np.searchsorted incompleteness, np.repeat missing axis, NpyIter
  Iternext+EXLOOP path, nan{mean,std,var} perf, np.argsort LINQ perf,
  linspace/eye boxing.

- `NDITER_BRANCH_QUALITY_AUDIT_V2.md` — V2 (fact-check) audit driven by
  8 parallel agents auditing file-by-file with results verified via
  `python -c` against NumPy 2.4.2 and `dotnet_run` against NumSharp.
  60 of 65 Tier 1 findings confirmed with failing OpenBugs reproducers
  written under `test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs`, plus a
  list of 4 false positives and 4 newly discovered bugs.

- `docs/plans/audit_v2/01..08*.md` — per-batch audit chapters, each
  including: file scope tables (LoC + role), reference to NumPy source,
  reproduction commands, line-precise references, and a findings table
  with severity tags (bug / parity-gap / perf / refactor / clean).
  Chapters cover Iterators, ILKernel+SIMD, Default math/reductions,
  Logic+Shape+Storage, NDArray creation, Manipulation APIs+logic,
  Math ops + selection/sorting/stats, and Casting+random+utilities.

These files are pure documentation and contain no code; they're the
reference material for the bug fixes and tests that follow on the
nditer branch.
Adds the per-batch test classes that the V2 audit fact-check pass
produced to back its Tier 1 findings with concrete failing tests.
Tests are marked `[OpenBugs]` so CI skips them until the underlying
defect is fixed; running them locally with `TestCategory=OpenBugs`
documents each bug's current behavior versus NumPy 2.4.2.

Each test references both the master `NDITER_BRANCH_QUALITY_AUDIT_V2.md`
and the matching `docs/plans/audit_v2/XX_*.md` chapter where the finding
is documented in detail, and includes the file:line of the suspected
defect plus a `python -c` NumPy 2.4.2 expectation.

Test classes added (matching the 6 untracked batches):
- `AuditV2_Iterators.cs` — NpyIter Iternext/EXLOOP issues, buffer refill,
  cast support gaps, NDIterator broadcast strides, etc. (Batch 1).
- `AuditV2_LogicShapeStorage.cs` — Shape mutating indexer on a
  readonly struct, storage and logic op edge cases (Batch 4).
- `AuditV2_NDArrayCreation.cs` — `np.array(NDArray, copy=false)` default
  aliasing, creation API edge cases (Batch 5).
- `AuditV2_ManipulationApis.cs` — np.expand_dims on empty, manipulation
  parity gaps (Batch 6).
- `AuditV2_MathSelectionSorting.cs` — SetIndicesNDNonLinear NIE,
  math/selection/sort bugs (Batch 7).
- `AuditV2_CastingRandomUtilities.cs` — NpFunc/cast/random/utilities
  bugs (Batch 8).

Batches 2 (`AuditV2_ILKernelSimd.cs`) and 3 (`AuditV2_MathReductions.cs`)
already exist on the branch; this commit fills the remaining 6.
Build is verified to pass with the new files included.
Updates `.claude/CLAUDE.md` so the project instructions match the code's
current state:

- "C-order only" entry replaced with "Order-aware layout": Shape tracks
  F-contiguity, and APIs with an `order` parameter resolve NumPy
  `C`/`F`/`A`/`K` through `OrderResolver`. Verified by:
    - `Shape.IsFContiguous` flag (`View/Shape.cs:115-118`)
    - `Shape.Order` property (`View/Shape.cs:437`)
    - F-aware construction (`View/Shape.cs:160`)
- `F_CONTIGUOUS` entry in the flags table updated from "Reserved" to
  "Data is column-major contiguous" (matches `ArrayFlags.F_CONTIGUOUS`
  bit `0x0002` in `View/Shape.cs:24`).
- Added `IsFContiguous — O(1) check via F_CONTIGUOUS flag` to the
  key Shape properties list.
- Missing Functions count corrected from 19 → 18 and `np.where` removed
  from the Selection gap because `APIs/np.where.cs` implements it; new
  `### Selection` section under "Supported np.* APIs" lists `where`.
- Iterators path updated from `MultiIterator.cs` to `NpyIter.cs` and
  `NpyExpr.cs` (verified — `MultiIterator` no longer exists; only
  `NDIterator`, `NpyIter`, `NpyExpr` are present in `Backends/Iterators`).
- Q&A entries for NDIterator and NpyIter rewritten to match the current
  legacy-wrapper / NumPy-aligned multi-operand iterator split.

Pure documentation change — no behavioral impact.
…per / Memory block

Multiple `CopyTo` overloads in the unmanaged memory layer were calling
`Buffer.MemoryCopy(...)` with source/destination swapped — the BCL
signature is `MemoryCopy(void* source, void* destination, long destBytes,
long sourceBytesToCopy)`, but the existing code passed the destination
pointer first. The result was that data was copied *from the destination
buffer into the source slice*, silently corrupting the caller's source
data instead of populating the destination.
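
For reference, `Buffer.MemoryCopy` takes the source pointer first and both
size arguments in bytes; a minimal sketch of the corrected direction (an
illustration of the call shape, not the library code):

    // Buffer.MemoryCopy(void* source, void* destination, long destinationSizeInBytes, long sourceBytesToCopy)
    static unsafe void CopySliceToSpan<T>(T* sliceAddress, long sliceCount, Span<T> destination) where T : unmanaged
    {
        long bytes = sliceCount * sizeof(T);
        fixed (T* dst = destination)
        {
            // source first, then destination; both size arguments are byte counts
            System.Buffer.MemoryCopy(sliceAddress, dst, (long)destination.Length * sizeof(T), bytes);
        }
    }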

ArraySlice`1.cs:
- `TryCopyTo(Span<T>)`, `CopyTo(Span<T>)`, `IArraySlice.CopyTo<T1>(Span<T1>)`,
  `IArraySlice.CopyTo<T1>(UnmanagedSpan<T1>)`: swap source / destination
  pointers so data flows source→destination.
- `CopyTo(IntPtr dst)`: also fix the byte-size argument — previous code
  passed `Count` (element count) for both destination size and bytes to
  copy, leaving non-byte dtypes with under-counted bounds. Replace with
  `Count * ItemLength` for both byte arguments and flip the source /
  destination order.
- `CopyTo(IntPtr dst, long sourceOffset, long sourceCount)`: this
  overload was previously identical to `CopyTo(IntPtr dst)` (ignored the
  offset arguments). Add `sourceOffset` / `sourceCount` bounds checks,
  honour `sourceOffset` when computing the source pointer, and use
  `sourceCount * ItemLength` for the copy.
- `CopyTo(Span<T>, long sourceOffset, long sourceLength)`: previous body
  recursed into itself (`CopyTo(destination, sourceOffset, sourceLength);`)
  causing a stack overflow. Replace with a bounds-checked
  `Buffer.MemoryCopy` from `Address + sourceOffset`.
- `CopyTo(UnmanagedSpan<T>, long sourceOffset, long sourceLength)`:
  same direction swap as above.
- `IArraySlice.CopyTo<T1>(Span<T1>)` / `IArraySlice.CopyTo<T1>(UnmanagedSpan<T1>)`:
  bytes-based comparison (`Count * ItemLength` vs `destination.Length *
  InfoOf<T1>.Size`) instead of element-count comparison, fixing the
  "destination too short" check for reinterpret-cast cases.
- `IArraySlice.Clone<T1>()`: previous code used `UnmanagedMemoryBlock<T1>.
  Copy(Address, Count)` which treats `Count` as the *T1* element count
  while reading from a `T`-element buffer. Now compute the byte size
  and divide by `InfoOf<T1>.Size` so the clone preserves the whole byte
  payload (with a hard error if the byte count is not a multiple of the
  target element size).

UnmanagedHelper.cs:
- `CopyTo(IMemoryBlock src, IMemoryBlock dst, long countOffsetDestination)`:
  validate `countOffsetDestination` against `dst.Count` and ensure the
  source fits in the *remaining* destination capacity. Fix the
  destination-size argument to `(dst.Count - countOffsetDestination) *
  dst.ItemLength` instead of the source byte length (which under-counts
  by the offset when the destination is just big enough).

UnmanagedMemoryBlock`1.cs:
- `CopyTo(UnmanagedMemoryBlock<T> memoryBlock, long arrayIndex)`: swap
  pointers so data is copied source→destination (`memoryBlock.Address +
  arrayIndex` as dst), add null + bounds checks, and use the remaining
  destination capacity for the destination size argument.

All fixes are direct corrections of misuses of `Buffer.MemoryCopy`'s
signature; behavior for legitimate callers now matches the docstrings.
The added regression tests live under
`test/NumSharp.UnitTest/Backends/CloneRegressionTests.cs` (separate
commit) and call each repaired overload to lock the contract in place.
….Clone bugs

Shape.cs:
The `Clone(deep, unview, unbroadcast)` branch logic was inconsistent and
dropped the `offset`/`bufferSize` of scalar views in the most common
call (`Clone()` with default args). Rewrite the cascade so behavior is
predictable:
- Empty shape → `default`.
- Scalar shape:
    - `unview` or `unbroadcast` → return the static `Scalar` (offset=0).
    - Otherwise honour `deep`: copy-constructor preserves both `offset`
      and `bufferSize` for sliced scalar views like `np.arange(10)["5"]`.
- Non-scalar shape:
    - `!deep && !unview && !unbroadcast` → return `this` (the readonly
      struct copy is itself a value-copy).
    - `unview` or `unbroadcast` → `new Shape((long[])dimensions.Clone())`,
      which the constructor fills with C-contiguous strides (no offset).
      This replaces the previous one-off `ComputeContiguousStrides` /
      `deep && !unbroadcast` mixed branches that produced different
      shapes depending on call combination.
    - Plain `deep` → deep copy via the copy constructor.

Old behavior failure: `scalar.Shape.Clone()` on `np.arange(10)["5"]`
returned the canonical `Scalar` shape with `offset == 0`, hiding the
fact that the data lives at index 5. The regression test
`Shape_Clone_PreservesScalarViewOffset` in `CloneRegressionTests`
locks the fix.

ArrayConvert.cs:
- `Clone(Array sourceArray)` had two issues:
  1. It kept unwrapping element types while `elementType.IsArray` was
     true, stepping past the array's immediate element type, so a jagged
     `int[][]` was treated as a flat `int[]` and the subsequent
     `Array.Copy` produced wrong results (or threw). Now the immediate
     element type is used, preserving jaggedness.
  2. Arrays with a non-zero lower bound (created via
     `Array.CreateInstance(elementType, lengths, lowerBounds)`) were
     not supported — they fell through to the multi-dim branch with
     all-zero lower bounds. Capture each axis' lower bound and use
     `Array.CreateInstance(elementType, lengths, lowerBounds)` whenever
     the source is multi-dim or has any non-zero lower bound.
- `Clone<T>(T[,,,] sourceArray)` had a `GetLength(4)` typo for what
  should be the fourth (zero-indexed: 3) dimension. `GetLength(4)`
  throws `IndexOutOfRangeException` for any 4-D array. Changed to
  `GetLength(3)`. (Coverage: `CloneRegressionTests
  .ArrayConvert_Clone_FourDimensionalArray_UsesFourthDimensionLength`.)
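
For context on the non-zero lower-bound case above, this is the kind of
array the fix has to reproduce (a tiny standalone repro, not the
library's test code):

    // Arrays created with explicit lower bounds index from those bounds, not from zero.
    var lengths = new[] { 2, 3 };
    var lowerBounds = new[] { 1, 10 };
    Array src = Array.CreateInstance(typeof(int), lengths, lowerBounds);
    src.SetValue(7, 1, 10);                 // the first element lives at [1, 10]
    // A faithful clone must read src.GetLowerBound(i) per axis and pass the
    // same bounds to Array.CreateInstance before copying elements across.
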
…ontig

NDArray (`Backends/NDArray.cs`):
- All three `UnmanagedStorage`-based constructors now back-fill the
  engine when storage doesn't already have one, and mirror the chosen
  engine onto `Storage.Engine` so the array and storage stay in sync.
  Previously `Storage.Engine` could be null while the NDArray reported a
  valid `TensorEngine`, leaking back through chained constructors that
  read storage.Engine directly.
- `TensorEngine` setter now propagates the resolved engine to
  `Storage.Engine` so changing the engine on an NDArray cascades to
  storage-side consumers.
- `Clone()` is now `virtual` and uses the property setter (instead of
  the private field) so engine assignment propagates to storage.
  `NDArray<TDType>.Clone()` overrides it to preserve the typed wrapper —
  before this commit, `((NDArray<int>)x).Clone()` returned the
  non-generic NDArray base type, breaking generic callers (see
  `CloneRegressionTests.NDArray_Clone_PreservesGenericRuntimeType`).
- `View`/`GetData(int[])`/`GetData(long[])`/the foreach yield path all
  switch from setting the private `tensorEngine` field to the property,
  so storage gets the engine too.

UnmanagedStorage (`Backends/Unmanaged/UnmanagedStorage.cs`):
- `CreateBroadcastedUnsafe(...)` now copies `storage.Engine` onto the
  new broadcast view.

UnmanagedStorage cloning (`Backends/Unmanaged/UnmanagedStorage.Cloning.cs`):
- All `Alias(...)` overloads, the `Slice` builder, both `Cast<T>` /
  `Cast(typeCode)`, both `CastIfNecessary<T>` / `CastIfNecessary(typeCode)`,
  and the empty-storage clone now propagate `Engine`.
- Cast correctness fix: `Cast<T>` / `Cast(typeCode)` /
  `CastIfNecessary<T>` / `CastIfNecessary(typeCode)` used to cast the
  raw backing array via `InternalArray.CastTo(...)`. For strided or
  F-contiguous views, that buffer holds elements in the *physical*
  layout, so the cast result contained values in the wrong logical
  order. They now run `CloneData()` first — which materializes the
  logical element sequence (via `NpyIter.Copy` for non-contiguous
  paths) — and cast that, so casting an F-contiguous view of
  `np.arange(6).reshape(2,3).T` yields the same values NumPy produces
  (a short usage example follows this list). Verified by
  `CloneRegressionTests
  .UnmanagedStorage_CastGeneric_FContiguousSource_CopiesLogicalValuesAndEngine`
  and siblings.
- `Clone()` gains a fast `CanCloneRawLayout()` path: when the storage
  owns its buffer (no offset, no broadcast, no buffer/size mismatch)
  and is either C- or F-contiguous, the underlying ArraySlice is
  cloned raw and the same `Shape` is reused. Non-trivial slices and
  scalar views still fall back to `CloneData()`. Empty storages return
  a new typed empty storage and preserve the engine instead of trying
  to clone a null buffer.
- `CastIfNecessary` early-return for same-dtype skips the
  `IsEmpty` check so empty storages of the requested dtype don't
  re-materialize.
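
A usage-level illustration of the cast fix (the values follow the NumPy
comparison quoted above; `astype` / `np.float64` are the public np.*
surface assumed here):

    var t = np.arange(6).reshape(2, 3).T;   // F-contiguous view; logical values [[0, 3], [1, 4], [2, 5]]
    var d = t.astype(np.float64);           // the cast now materializes the logical order first
    // the element at [0, 1] is 3.0 — previously the physical buffer order leaked into the result
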
The DefaultEngine helpers for `astype` and `transpose` created new
`NDArray` instances via the `UnmanagedStorage`-only constructor, which
falls back to `BackendFactory.GetEngine()`. Code that explicitly set
`nd.TensorEngine` on the source (e.g. tests pinning a custom engine)
would silently see its engine swapped for the default after a cast or
transpose.

`Default.Cast.cs` (`DefaultEngine.Cast`):
- Capture `nd.TensorEngine` once at the top.
- Empty/scalar/`(1,)` early returns now carry `engine` forward both on
  the returned `NDArray` and on `nd.Storage` (when reused in-place).
- Both `copy` and in-place branches of the generic cast attach
  `TensorEngine = engine` to the resulting NDArray and to the
  re-assigned `nd.Storage`.

`Default.Transpose.cs` (`DefaultEngine.TransposeAlias`):
- The transpose alias returned via `Storage.Alias(newShape)` now carries
  `nd.TensorEngine` so transposed views keep their engine. Without this
  the call dropped back to the global default, breaking propagation
  through compounded operations.

Coverage: `CloneRegressionTests.NDArray_AstypeCopyPath_PreservesTensorEngine`
and the engine-propagation siblings.
…/ ravel paths

All paths that build a fresh `NDArray` from an existing storage now
preserve the caller's `TensorEngine`. Previously the `NDArray
(UnmanagedStorage)` constructor would fall back to
`BackendFactory.GetEngine()` when the supplied storage didn't carry an
engine (which is common after `Clone()`/`Alias()`/`CloneData()`).

`Creation/NDArray.Copy.cs` (`copy(char physical)`):
- The C-order shortcut now requires the source to already be
  C-contiguous. Before, `copy('C')` on an F-contiguous view returned a
  `Clone()` whose shape preserved the F-strides — yielding a
  non-C-contiguous "copy". Now any non-C-contiguous source falls
  through to the iterator-driven materialization path.
- The destination shape uses the requested `physical` order instead of
  hard-coding `'F'`. Combined with the fix above this gives correct
  C/F selection regardless of source layout.
- Destination NDArray carries `TensorEngine = TensorEngine` of the
  source. Coverage:
  `CloneRegressionTests.NDArray_CopyCOrder_FromFContiguousSource_ProducesCContiguousCopy`
  and `NDArray_CopyFOrder_PreservesTensorEngine`.

`Creation/NdArray.ReShape.cs`:
- The F-order reshape return (`fFlat.Storage.InternalArray`-backed
  storage) and both non-contiguous fallback paths
  (`new NDArray(CloneData(), Shape.Clean())`) now attach the source
  `TensorEngine`. Coverage:
  `CloneRegressionTests.NDArray_ReshapeCopyPath_PreservesTensorEngine`.

`Creation/np.array.cs`:
- `np.array(nd, copy)` propagates `nd.TensorEngine` for both the
  alias (`copy=false`) and clone (`copy=true`) paths. Coverage:
  `NpArray_FromNDArray_PreservesTensorEngineForAliasAndCopy`.

`Manipulation/np.expand_dims.cs`, `Manipulation/np.ravel.cs`,
`Manipulation/NDArray.flatten.cs`:
- The view (`Storage.Alias(...)`) and materialize (`CloneData()`)
  branches both forward the source `TensorEngine`.

No semantic API changes other than the `copy('C')` correctness fix
above; engine propagation is a transparent improvement.
NumPy's np.where iterator allocates the result with an order chosen
from the *full-size* operands' contiguity flags:
- Any full-size, multi-dim operand that is C-contiguous (but not F)
  → output is C-contiguous.
- All full-size, multi-dim operands that are F-contiguous (and at least
  one is strictly F, not also C) → output is F-contiguous.
- Operands that are scalar, 1-D, or broadcasted do not vote.
- Mixed C/F (or any full-size non-contiguous operand) → fall back to C.
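
A minimal, framework-free sketch of that vote (illustrative only — the
branch's actual helper is `ResolveWhereOrder` in `APIs/np.where.cs`, and
the contiguity/broadcast flags here are stand-ins for the real Shape
queries):

    // One vote per full-size, multi-dim operand; any conflict falls back to 'C'.
    static char ResolveOrder((bool fullSize, int ndim, bool cContig, bool fContig)[] operands)
    {
        bool sawStrictF = false;
        foreach (var op in operands)
        {
            if (!op.fullSize || op.ndim < 2)
                continue;                                  // scalars, 1-D and broadcast operands don't vote
            if (op.cContig && !op.fContig) return 'C';     // a strictly C-contiguous operand forces C
            if (!op.cContig && !op.fContig) return 'C';    // a non-contiguous full-size operand forces C
            if (op.fContig && !op.cContig) sawStrictF = true;
        }
        return sawStrictF ? 'F' : 'C';
    }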

Verified against NumPy 2.4.2:

    f = np.arange(12).reshape(3, 4).T                      # F-contig view
    np.where(f > 5, f, 0).flags                            # F_CONTIGUOUS=True
    np.where(f > 5, f.copy('C'), f).flags                  # C_CONTIGUOUS=True
    np.where(np.array([True, False, True]), f, 0).flags    # F_CONTIGUOUS=True

Previously `np.where` always allocated the output as C-contiguous,
losing the F layout that NumPy preserves for F-contiguous inputs.

`APIs/np.where.cs`:
- New `ResolveWhereOrder(params NDArray[] operands)` mirrors the rules
  above. Returns 'C' or 'F'.
- The result `Shape` is now constructed via `new Shape((long[])cond
  .shape.Clone(), resultOrder)` so the resulting strides match the
  resolved order.
- Drop the `NpFunc.Invoke(outType, WhereImpl<int>, ...)` generic
  dispatch: the actual `WhereImpl` body never used its `T` type
  parameter (the iterator-driven IL kernel keys off the runtime dtype
  string), so the switch-per-dtype indirection was dead weight. Replace
  with a direct non-generic `WhereImpl(cond, x, y, result)` call.

`test/NumSharp.UnitTest/Logic/np.where.BattleTest.cs`:
- New "Output Layout" region with three NumPy-anchored tests:
    * `Where_FContiguousInputs_ResultIsFContiguous`
    * `Where_MixedCAndFInputs_ResultFallsBackToC`
    * `Where_BroadcastConditionWithFInput_ResultIsFContiguous`
…sh order tests

NumPy's `np.tile(A, reps)` keeps the source memory order on the "no
expansion" shortcut (`tup == (1,)*outDim`): F-contiguous input stays
F-contiguous, C-contiguous input stays C-contiguous, and views with
strides outside C/F materialize as C-contiguous. Verified against
NumPy 2.4.2:

    src = np.arange(12).reshape(3, 4).T                            # F-contig
    np.tile(src, (1, 1)).flags                                     # F_CONTIGUOUS=True
    np.tile(src, ()).flags                                         # F_CONTIGUOUS=True
    np.tile(np.arange(12).reshape(3, 4)[:, ::-1], (1, 1)).flags    # C_CONTIGUOUS=True

`Manipulation/np.tile.cs`:
- The all-ones shortcut previously called `A.copy()` which defaults to
  `'C'` — silently flipping F-contiguous inputs to C-contiguous output.
  Replace with `A.copy('K')` (and the reshape variant gets the same
  treatment) so `OrderResolver.Resolve('K', shape)` picks the source's
  physical order. The comment is updated to describe the keep-order
  semantics.

`test/NumSharp.UnitTest/Manipulation/np.tile.Test.cs`:
- Three new tests covering the F-contig preservation, the `np.tile(a)`
  no-reps overload, and the non-contiguous fall-back. Each test also
  verifies element values against the source via index-based reads to
  guard against logical-order regressions.

`test/NumSharp.UnitTest/View/OrderSupport.OpenBugs.Tests.cs`:
- `Tile_ApiGap` is renamed to `Tile_RepeatsLastAxis_ValuesMatchNumPy`
  and its assertion stays — the API gap comment is replaced with the
  matching NumPy reference output. Header rewritten from
  "Missing functions" to "Manipulation helpers" since this section now
  documents working APIs.
- `Where_ApiGap` (previously `[OpenBugs]` because np.where was thought
  missing) is now `Where_FContig_PreservesFContig`. It asserts that
  `np.where(f_arr > 5, f_arr, 0)` returns an F-contiguous result on
  F-contiguous input — the same property covered by the new where
  layout tests in the prior commit. The `[OpenBugs]` attribute is
  removed because the feature exists and now matches NumPy.
…IterBattleTests

`benchmark/NumSharp.Benchmark.Exploration/Program.cs`:
- `Options.Clone()` reused the same `RemainingArgs` `string[]` reference
  on the cloned `Options` instance. Any post-clone mutation of the
  array (or its elements via index assignment) would have leaked back
  to the original `Options`. Clone the array (`(string[])RemainingArgs
  .Clone()`) so the two `Options` instances are independent.

`test/NumSharp.UnitTest/Backends/Iterators/NpyIterBattleTests.cs`:
- Remove a single trailing blank line at end of file. No code change.
… after RemoveAxis

`NpyIter.Clone()` (in `Backends/Iterators/NpyIter.cs`) previously copied
the `Buffers[op]` pointer field directly from the source state, so the
original and the cloned iterator shared the *same* per-operand buffer.
After `Iternext()` on either iterator the writes from one would clobber
the other's data, and disposing one would free the buffer out from
under the other.

The fix:
- After copying the operand metadata (`ElementSizes`, `BufStrides`, etc.),
  allocate a fresh buffer per operand via `NpyIterBufferManager
  .AllocateAligned(BufferSize, opDtype)` and `Buffer.MemoryCopy` the
  source bytes into it. If allocation fails the catch block calls
  `NpyIterBufferManager.FreeBuffers` for buffered states before
  releasing dim arrays + state memory.
- `DataPtrs[op]` is fixed up: if the source `DataPtrs[op]` pointed into
  the original `Buffers[op]` byte range we translate the offset onto
  the newly allocated buffer so iteration continues at the same logical
  position.
- The clone now calls `AllocateDimArrays(_state->NDim, _state->NOp,
  _state->StridesNDim)` — see below.
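
The `DataPtrs[op]` fix-up above is a plain offset translation between the
old and new buffers; schematically (a standalone sketch with assumed
parameter names, not the clone code itself):

    // Re-base a cursor that pointed into the old buffer onto the freshly allocated clone buffer.
    static unsafe byte* TranslateCursor(byte* cursor, byte* oldBuffer, byte* newBuffer, long bufferBytes)
    {
        if (cursor >= oldBuffer && cursor < oldBuffer + bufferBytes)
            return newBuffer + (cursor - oldBuffer);   // same logical position, new storage
        return cursor;                                  // pointed at operand memory: leave unchanged
    }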

`NpyIterState.AllocateDimArrays(int ndim, int nop, int stridesNDim)`:
- Previously, the strides block was always sized as `ndim * nop`. After
  `RemoveAxis` lowers `NDim` but leaves `StridesNDim` at its original
  width, cloning the iterator allocated a strides block that was too
  small, causing later reads from `Strides[k]` (where `k >= NDim*NOp`)
  to access freed or unrelated memory.
- The third parameter defaults to `ndim` (preserving the existing
  contract for all other call sites) but accepts an explicit
  `stridesNDim >= ndim` so `Clone()` can carry the original allocated
  stride width forward. `StridesNDim` is now stored on the state and
  the strides allocation uses `stridesNDim * nop * sizeof(long)`. The
  scalar fast-path now requires both `ndim == 0` and `stridesNDim == 0`
  to skip the allocation.

Also moves the `GetInnerFixedStrideArray` docblock so it sits directly
above its method (it had drifted onto an unrelated method when the
preceding doc was edited).

Coverage:
- `CloneRegressionTests.NpyIterCopy_BufferedIterator_AllocatesIndependentBuffers`
  asserts the two iterators have distinct `DataPtr` addresses and that
  advancing one does not advance the other.
- `CloneRegressionTests.NpyIterCopy_AfterRemoveAxis_PreservesAllocatedStrideWidth`
  builds an iterator over `(2,3,4)`, removes axis 1, clones it, and
  checks `NDim`, `Shape`, and the first value match.
… clone fixes

Adds `test/NumSharp.UnitTest/Backends/CloneRegressionTests.cs`,
which locks in the contracts established by the preceding fix commits.
Each test reproduces a specific bug or contract that previously
regressed and asserts the corrected behavior. 27 tests; all pass on
net8.0 and net10.0.

Coverage map (each pair = test → fix commit):

ArraySlice CopyTo direction / range fixes
→ `fix(unmanaged): correct CopyTo direction + bounds in ArraySlice`
- `ArraySlice_CopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_TryCopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_CopyToSpan_WithSourceRange_CopiesRequestedRange`
- `ArraySlice_CopyToIntPtr_WithSourceRange_CopiesRequestedRange`
- `ArraySlice_InterfaceCopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_InterfaceCloneGeneric_ReinterpretsWholeBytePayload`

ArrayConvert.Clone jagged / non-zero lower-bound / 4-D GetLength fixes
→ `fix(shape+convert): preserve scalar offset on Clone; fix ArrayConvert.Clone bugs`
- `ArrayConvert_Clone_PreservesJaggedElementType`
- `ArrayConvert_Clone_PreservesNonZeroLowerBounds`
- `ArrayConvert_Clone_FourDimensionalArray_UsesFourthDimensionLength`

Shape.Clone scalar view offset preservation
→ same commit as above
- `Shape_Clone_PreservesScalarViewOffset`

UnmanagedStorage.Clone empty + F-contig + engine
→ `fix(storage+ndarray): keep TensorEngine in sync; correct cast for F-contig`
- `UnmanagedStorage_Clone_DtypeOnlyStorage_DoesNotDereferenceMissingData`
- `UnmanagedStorage_Clone_PreservesEngineAndFContiguousShape`

UnmanagedStorage.Cast / CastIfNecessary uses CloneData + propagates engine
→ same commit
- `UnmanagedStorage_CastTypeCode_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastGeneric_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastIfNecessary_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastEmptyStorage_PreservesEngine`

UnmanagedMemoryBlock.CopyTo arrayIndex offset
→ `fix(unmanaged): correct CopyTo direction + bounds in ArraySlice`
- `UnmanagedMemoryBlock_CopyToWithIndex_CopiesToDestinationOffset`

UnmanagedHelper.CopyTo destination-offset bounds
→ same commit
- `UnmanagedHelper_CopyToWithInvalidDestinationOffset_Throws`

NDArray.Clone / engine propagation
→ `fix(storage+ndarray): ...` + `fix(creation+manipulation): ...` +
  `fix(default-engine): ...`
- `NDArray_Clone_PreservesGenericRuntimeType`
- `NDArray_Clone_PreservesTensorEngineOnArrayAndStorage`
- `NpArray_FromNDArray_PreservesTensorEngineForAliasAndCopy`
- `NDArray_CopyFOrder_PreservesTensorEngine`
- `NDArray_CopyCOrder_FromFContiguousSource_ProducesCContiguousCopy`
- `NDArray_ReshapeCopyPath_PreservesTensorEngine`
- `NDArray_AstypeCopyPath_PreservesTensorEngine`

NpyIter.Clone buffered deep copy + RemoveAxis stride width
→ `fix(npyiter): deep-copy buffered Clone buffers; preserve stride
  width after RemoveAxis`
- `NpyIterCopy_BufferedIterator_AllocatesIndependentBuffers`
- `NpyIterCopy_AfterRemoveAxis_PreservesAllocatedStrideWidth`
…aths

Promotes SByte, Half (float16), and Complex from "partially supported" to
first-class dtypes, matching what NPTypeCode already declares and what
NumPy 2.4.2 ships.

The audit (NDITER_BRANCH_QUALITY_AUDIT_V2.md, Tier 1) flagged 9 production
crash sites and 5 perf gaps where these three dtypes silently fell out of
12-dtype switches. After this commit every np.power(lhs, rhs) combination
across the 15x15 dtype matrix works end-to-end, and the existing 12-dtype
fast paths remain intact.

CRASH FIXES (Tier 1):

* NpyIterCasting (T1.9, T1.12, T1.38, T1.39): IsSafeCast / ReadAsDouble /
  WriteFromDouble / ConvertValue / PromoteTypes now handle SByte / Half /
  Complex. Complex routes through a dedicated Complex intermediate so the
  imaginary component is preserved on Complex->Complex copies and dropped
  cleanly (per NumPy's ComplexWarning) on Complex->real. Adds Half/SByte
  to IsFloatingPoint/IsSignedInteger predicates.

* NpyIterBufferManager (related to T1.12): same-type buffered iteration
  was throwing for Complex base case. Adds SByte/Half/Complex branches to
  CopyToBuffer/CopyFromBuffer dtype dispatch.

* UnmanagedStorage (T1.13, T1.57): SetValue(object,int[]/long[]),
  SetData(NDArray,long[]) scalar fast path, and the void*/IMemoryBlock
  CopyTo overloads all gained the three missing dtype branches.

* ArrayConvert.cs (T1.30): 13 ToX(Array) destination switches were
  missing SByte/Half/Complex source cases. Plus ~40 new typed converter
  methods covering the previously-missing (Src -> Dst) corners. Total
  ~550 LOC added.

* np.asanyarray (T1.49): adds IEnumerable<sbyte>, IEnumerable<Half>,
  IEnumerable<Complex> case branches; corresponding Memory<T>/
  ReadOnlyMemory<T> dispatch; ConvertObjectListToNDArray branches;
  and FindCommonNumericType expansion (the seenMask bitset was bounded
  to 12 dtypes; Complex's typecode=128 also previously aliased bit 0
  due to unbounded shift -- now masked by `(int)code & 31`).

* np.copyto T1.55: now passes via the NpyIterCasting fix.

* ILKernelGenerator.EmitDecimalConversion: Half<->Decimal and
  Complex<->Decimal routes were missing. np.power(Half, Decimal) now
  works (was the only np.power(15x15) failure after the casting fixes).

PERF FIXES (Tier 2):

* ILKernelGenerator.Binary.IsSimdSupported<T> (R9): adds sbyte.
  Vector*<sbyte> arithmetic is natively supported in .NET.

* Converts.FindConverter (R18, R33): 12x12 type-pair fast-path ladder
  expanded to 15x15 (225 entries). Eliminates the IConvertible-interface
  boxing and object-cast boxing that the CreateFallbackConverter path
  imposes for SByte/Half/Complex sources or destinations.

* Default.Reduction.ArgMax (R23): the per-slice NDArray view allocation
  in the Half/Complex axis fallback was costing one new NDArray per slice
  (1000 allocations for a (1000,1000) axis-reduce). Replaced with a
  stride-aware loop driven from a stackalloc coord vector. SByte path is
  removed from the fallback entirely since the IL kernel already handles
  it via CreateAxisArgReductionKernelTyped<sbyte>.

* Default.BooleanMask gather (T1.58): the strided/broadcast fallback was
  calling Buffer.MemoryCopy(_, _, elemSize, elemSize) per matched element
  (~1us/element). Specialized on element size (1/2/4/8/16 bytes); all 15
  dtypes hit a typed pointer write now, including Half (2B) and Complex
  (16B as two longs).
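
The element-size specialization boils down to one typed write per size
class; a minimal sketch of the shape of that fast path (illustrative, not
the actual gather code):

    // Copy one matched element, specialized on the dtype's byte width (1/2/4/8/16).
    static unsafe void CopyElement(byte* src, byte* dst, int elemSize)
    {
        switch (elemSize)
        {
            case 1:  *dst = *src; break;
            case 2:  *(ushort*)dst = *(ushort*)src; break;                // e.g. Half
            case 4:  *(uint*)dst = *(uint*)src; break;
            case 8:  *(ulong*)dst = *(ulong*)src; break;
            case 16:                                                      // e.g. Complex, as two 8-byte halves
                *(ulong*)dst = *(ulong*)src;
                *((ulong*)dst + 1) = *((ulong*)src + 1);
                break;
            default:
                System.Buffer.MemoryCopy(src, dst, elemSize, elemSize);   // unexpected width: byte copy
                break;
        }
    }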

VERIFICATION:

* test/Math/NDArray.power.DtypeMatrix.Test.cs (new):
  - 15x15 dtype matrix smoke test (225 (lhs, rhs) combinations).
  - SByte ** -SByte raises ValueError-style ArgumentException.
  - Half ** Half preserves Half.
  - Complex ** Complex preserves Complex.
  - Float ** Complex promotes to Complex.
  - Half ** Single promotes to Single (NEP50).
  - SByte/Half/Complex List/IEnumerable inputs no longer throw.

* Removed [OpenBugs] attribute from 11 AuditV2 tests that are now CI-green:
  T1_9 (3x), T1_12 (2x), T1_13 (2x), T1_30 (3x), T1_49 (3x), T1_55,
  T1_57, T1_58. They now run as regular tests.

* Full suite: 8281 passed, 0 failed (up from 8255 before this commit; the
  +26 delta is the new dtype-matrix tests plus the tests promoted from
  [OpenBugs]).

DOCS:

* .claude/CLAUDE.md: "Supported Types (12)" -> "Supported Types (15)".
  Adds Half/SByte/Complex rows and a "Perf notes" section that documents
  Half/Complex/Decimal as scalar paths (no Vector<Half> arithmetic in
  .NET BCL; Complex.Pow is the BCL routine).

OUT OF SCOPE FOR THIS COMMIT:

* T1.34 NpyExpr Const/Where/Call SByte/Half/Complex support: not on
  np.power's critical path; left for a separate pass.
* T1.39 Int64/UInt64 -> double precision loss above 2^53: separate
  audit item, unrelated to the three target dtypes.

For full audit context see docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md
section "Major themes" item 2 (missing SByte/Half/Complex).