NFS stability and performance enhancements #851

Open
filippomc wants to merge 6 commits into develop from feature/CH-261

Conversation

filippomc (Collaborator) commented Apr 27, 2026

Closes CH-261

Implemented solution

  • Updated to a recent nfs-volume provisioner
  • Scripting rewritten in Go for concurrent runs
  • Failover strategy implementation

How to test this PR

A test included in the deployment exercises some common stress cases.

Sanity checks:

  • The pull request is explicitly linked to the relevant issue(s)
  • The issue is well described: clearly states the problem and the general proposed solution(s)
  • In this PR it is explicitly stated how to test the current change
  • The labels in the issue set the scope and the type of issue (bug, feature, etc.)
  • The relevant components are indicated in the issue (if any)
  • All the automated test checks are passing
  • All the linked issues are included in one Sprint
  • All the linked issues are in the Review state
  • All the linked issues are assigned

Breaking changes (select one):

  • The present changes do not change the preexisting API in any way
  • This PR and the issue are tagged as a breaking-change and the migration procedure is well described above

Possible deployment updates issues (select one):

  • There is no reason why deployments based on CloudHarness may break after the current update
  • This PR and the issue are tagged as alert:deployment

Test coverage (select one):

  • Tests for the relevant cases are included in this PR
  • The changes included in this PR are out of the current test coverage scope

Documentation (select one):

  • The documentation has been updated to match the current changes
  • The changes included in this PR are out of the current documentation scope

Nice to have (if relevant):

  • Screenshots of the changes
  • Explanatory video/animated gif

```go
// Excerpts from the loop-device handling; the opening calls are reconstructed
// here for readability (names and flags are assumptions, not verbatim PR code),
// and the differing return types indicate the checks come from separate functions.
ctrl, err := os.OpenFile(loopControlPath, os.O_RDWR, 0)
if err != nil {
	return 0, fmt.Errorf("open %s: %w", loopControlPath, err)
}
defer ctrl.Close()

bf, err := os.OpenFile(backingFile, os.O_RDWR, 0)
if err != nil {
	return "", fmt.Errorf("open backing file %s: %w", backingFile, err)
}
defer bf.Close()

lf, err := os.OpenFile(loopPath, os.O_RDWR, 0)
if err != nil {
	return "", fmt.Errorf("open loop device %s: %w", loopPath, err)
}
defer lf.Close()
```
Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the CloudHarness NFS server stack to improve stability during restarts/reschedules, add a watchdog/health endpoint, and introduce CI coverage for failover/multi-node mount behavior.

Changes:

  • Upgraded the embedded nfs-subdir-external-provisioner dependencies and modified provision/create/delete paths to use a new Go-based nfsvol helper.
  • Added nfsvol (Go) to mount existing loopback volumes concurrently at startup, regenerate stable per-PV exports, and run a watchdog with /healthz.
  • Extended Codefresh pipelines and samples values to build/deploy nfsserver and run a failover & multi-node stress test.

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 7 comments.

Summary per file:

| File | Description |
| --- | --- |
| deployment/codefresh-test.yaml | Builds/publishes the nfsserver image and adds an NFS failover CI step; adjusts rollout waits and API test ordering. |
| deployment-configuration/codefresh-template-test.yaml | Mirrors the NFS failover CI step in the template and quotes cmd_ps. |
| applications/samples/deploy/values-test.yaml | Declares nfsserver as a hard dependency and enables NFS-backed volumes for tests. |
| applications/samples/deploy/values-prod.yaml | Fixes indentation for resource request memory. |
| applications/samples/deploy/values-minimal.yaml | Changes minimal dependency shape (now only soft with common). |
| applications/nfsserver/test/failover-test.sh | Adds an end-to-end failover/multi-node mount/ESTALE regression script. |
| applications/nfsserver/resources/run_nfs.sh | Switches bootstrap to nfsvol mount-all and starts the watchdog; narrows NFS daemon version flags; adds a ulimit clamp. |
| applications/nfsserver/resources/rmlimdir.sh | Converts to a thin wrapper around nfsvol delete. |
| applications/nfsserver/resources/remount.sh | Converts to a thin wrapper around nfsvol mount-all. |
| applications/nfsserver/resources/pre-startup.sh | Converts to a thin wrapper around nfsvol mount-all. |
| applications/nfsserver/resources/mklimdir.sh | Converts to a thin wrapper around nfsvol create (and mount-all for --mountonly). |
| applications/nfsserver/nfsvol/main.go | Introduces the nfsvol CLI entrypoint (mount-all/create/delete/watchdog). |
| applications/nfsserver/nfsvol/internal/watchdog/watchdog.go | Adds the mount watchdog loop and /healthz HTTP server. |
| applications/nfsserver/nfsvol/internal/mount/mount.go | Implements loop attach/mount lifecycle, concurrent bootstrap, and create/delete logic. |
| applications/nfsserver/nfsvol/internal/mount/exports.go | Implements deterministic fsid exports fragments under /etc/exports.d. |
| applications/nfsserver/nfsvol/internal/loop/loop.go | Implements loop device allocation/attach/detach and stale-loop cleanup. |
| applications/nfsserver/nfsvol/go.mod | Adds a new Go module for nfsvol. |
| applications/nfsserver/nfs-subdir-external-provisioner/go.mod | Updates Go/toolchain settings and bumps Kubernetes/provisioner dependencies. |
| applications/nfsserver/nfs-subdir-external-provisioner/cmd/nfs-subdir-external-provisioner/provisioner.go | Switches provision/delete to call nfsvol and adds parsing helpers/env defaults. |
| applications/nfsserver/deploy/values.yaml | Adds a configurable PDB toggle and adjusts comments. |
| applications/nfsserver/deploy/templates/pdb.yaml | Adds an optional PodDisruptionBudget template. |
| applications/nfsserver/deploy/templates/nfs-server.yaml | Forces Recreate, disables leader election, adds health probes/port, and sets the ServiceAccount name. |
| applications/nfsserver/README-PROD.md | Documents production caveats, outage behavior, and operational responsibilities. |
| applications/nfsserver/Dockerfile | Switches base image, installs e2fsprogs, builds/copies nfsvol, and runs go mod tidy during builds. |


```go
const (
	provisionerNameKey = "PROVISIONER_NAME"
	annotationPrefix   = "k8s-sigs.io"
```

Copilot AI Apr 27, 2026


annotationPrefix is declared but never used. An unused package-level constant does not actually break the Go build (only unused local variables and imports are compile errors), but it is dead code: remove the constant or wire it into the annotation parsing logic.

Suggested change

```diff
-	annotationPrefix = "k8s-sigs.io"
```

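If the constant is kept, one hypothetical way to wire it in (annotationKey and the "storage-path" key are illustrative, not code from the PR) is a small helper that builds fully-qualified annotation keys:

```go
package main

// annotationPrefix mirrors the constant flagged above.
const annotationPrefix = "k8s-sigs.io"

// annotationKey builds a fully-qualified annotation key such as
// "k8s-sigs.io/storage-path". Hypothetical helper for illustration only.
func annotationKey(name string) string {
	return annotationPrefix + "/" + name
}
```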
```sh
# run pre startup script
bash -c "/usr/local/bin/pre-startup.sh"
# Mount all existing PVC volumes in parallel, then start the watchdog.
/usr/local/bin/nfsvol mount-all
```
Copy link

Copilot AI Apr 27, 2026


nfsvol mount-all failures are currently ignored (the script doesn’t set -e and there’s no explicit exit-on-error). That can leave volumes unmounted while the container continues starting and the health endpoint reports OK. Please make startup fail fast if mount-all returns non-zero (e.g., enable set -e / check the exit code and exit 1).

Suggested change

```diff
-/usr/local/bin/nfsvol mount-all
+if ! /usr/local/bin/nfsvol mount-all; then
+    echo "ERROR: /usr/local/bin/nfsvol mount-all failed" >&2
+    exit 1
+fi
```

Comment on lines +43 to +48
```diff
 # -V 3: enable NFSv3 (matches client mount options elsewhere).
 /usr/sbin/rpc.mountd -V 3

 /usr/sbin/exportfs -r
 # -G 10 to reduce grace time to 10 seconds (the lowest allowed)
-/usr/sbin/rpc.nfsd -G 10 -N 2 -V 3 -V 4
+/usr/sbin/rpc.nfsd -G 10 -V 3
```
Copy link

Copilot AI Apr 27, 2026


This script now starts only NFSv3 (rpc.mountd -V 3 / rpc.nfsd -V 3), but the default Helm values don’t include an explicit nfsvers=3 mountOption (it’s only present in values-local.yaml). Please ensure the StorageClass mountOptions force nfsvers=3 (or re-enable v4 on the server) so client mounts don’t depend on version-negotiation/fallback behavior.
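An illustrative values fragment for the first option (field placement depends on the chart; the `storageClass.mountOptions` key follows the upstream nfs-subdir-external-provisioner chart convention and should be checked against this repo's templates):

```yaml
# Illustrative only: pin client mounts to NFSv3 so they don't depend on
# version negotiation against the v3-only server.
storageClass:
  mountOptions:
    - nfsvers=3
```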

```diff
@@ -1,8 +1,8 @@
 harness:
   secured: false
   dependencies:
```
Copy link

Copilot AI Apr 27, 2026


harness.dependencies.hard was removed from the minimal values. Helm templates in this repo assume both hard and soft are lists (e.g. they call concat .app.harness.dependencies.hard .app.harness.dependencies.soft), so omitting hard can cause template rendering failures. Add hard: [] (or keep the previous empty list) to make the values shape consistent.

Suggested change

```diff
   dependencies:
+    hard: []
```

Comment on lines +1 to +5
```
module metacell/nfsvol

go 1.24

require golang.org/x/sys v0.33.0
```
Copy link

Copilot AI Apr 27, 2026


New Go module nfsvol adds a go.mod but does not include a go.sum. The repo’s other Go module (nfs-subdir-external-provisioner) commits go.sum, and omitting it makes dependency resolution less reproducible (and can break builds that run with -mod=readonly). Please add the generated go.sum for this module.

Comment on lines +252 to +259
```sh
# Find the loop device backing the test PVC on the server.
# mountpoint format: /exports/<pvName> where pvName includes the PV_NAME.
MOUNTPOINT=$(k exec "$NEW_NFS_POD" -- sh -c "ls -d /exports/*${PV_NAME}* 2>/dev/null | grep -v '\.exports$' | grep -v 'archived-' | head -1" || true)
if [ -z "$MOUNTPOINT" ]; then
  log "skipping test 5: could not locate server-side mountpoint for $PV_NAME"
else
  LOOPDEV=$(k exec "$NEW_NFS_POD" -- sh -c "losetup -a | grep '$MOUNTPOINT.quota' | cut -d: -f1" || true)
  if [ -z "$LOOPDEV" ]; then
```
Copy link

Copilot AI Apr 27, 2026


The mountpoint discovery for test 5 can return either the mount directory (/exports/<pv>) or the backing quota file (/exports/<pv>.quota), because the ls glob matches both and the filters don’t exclude .quota. If it picks the .quota file, the subsequent losetup grep uses "$MOUNTPOINT.quota" (becoming ...quota.quota) and the test will be skipped incorrectly. Filter out \.quota$ and/or assert the result is a directory (e.g., test -d).

Comment on lines +146 to +160
```go
// Clean up any pre-existing state left from a failed previous attempt.
_ = Delete(mountpoint)

if err := createQuotaFile(quotaFile, sizeBytes); err != nil {
	return err
}

cmd := exec.Command("mkfs.ext4", "-F", quotaFile)
if out, err := cmd.CombinedOutput(); err != nil {
	_ = os.Remove(quotaFile)
	return fmt.Errorf("mkfs.ext4: %w\n%s", err, out)
}

if err := mountOne(mountpoint); err != nil {
	return err
}
```
Copy link

Copilot AI Apr 27, 2026


Create() calls Delete(mountpoint) as a cleanup step, but Delete() renames an existing ${mountpoint}.quota to ${mountpoint} (a regular file). If a previous failed attempt left a .quota behind, this makes mountpoint a file, and the subsequent mountOne() will fail at os.MkdirAll(mountpoint, ...) (“not a directory”). Consider either (a) making mountOne() remove/recreate mountpoint when it exists as a file, or (b) splitting Delete into “cleanup for recreate” vs “delete preserving data for archive” so Create doesn’t leave a file at mountpoint.
