
feat: enable site clusters to run Nautobot Celery workers on site clusters #1908

Open
haseebsyed12 wants to merge 8 commits into main from site-nautobot-worker

Conversation

Contributor

@haseebsyed12 haseebsyed12 commented Apr 2, 2026

Nautobot currently runs entirely on the global cluster, including its Celery workers. Sites that generate heavy background task load have no way to offload that processing closer to where the work originates, and a single global worker pool becomes a bottleneck as sites scale.

This adds a site-scoped ArgoCD Application that deploys only the Celery worker portion of the Nautobot helm chart. The web server, Redis, and PostgreSQL are all disabled because they remain on the global cluster — site workers connect back to those shared services.

This lets operators scale worker capacity per-site independently, run queue-specific workers closer to the hardware they manage, and reduce cross-cluster task latency for site-driven automation.

  • ArgoCD Application template (application-nautobot-worker.yaml) gated behind site.nautobot_worker.enabled
  • It deploys only the Celery worker portion of the Nautobot helm chart into the nautobot namespace on the site cluster.
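As a sketch, the gate described above would be toggled from a site's values file; the surrounding key layout here is illustrative — only the `site.nautobot_worker.enabled` flag is named in this PR:

```yaml
# Site-level values excerpt (layout assumed; only the
# site.nautobot_worker.enabled flag is named in this PR).
site:
  nautobot_worker:
    enabled: true
```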

@haseebsyed12 haseebsyed12 force-pushed the site-nautobot-worker branch 9 times, most recently from f49206f to f47fced Compare April 7, 2026 10:22
@haseebsyed12 haseebsyed12 requested a review from a team April 7, 2026 13:36
@haseebsyed12 haseebsyed12 marked this pull request as ready for review April 7, 2026 13:36
@haseebsyed12 haseebsyed12 changed the title feat: enable site clusters to run Nautobot Celery workers locally feat: enable site clusters to run Nautobot Celery workers on site clusters Apr 7, 2026
@haseebsyed12 haseebsyed12 force-pushed the site-nautobot-worker branch 7 times, most recently from b5e37b0 to 16e016a Compare April 14, 2026 09:36
Comment thread components/nautobot-worker/kustomization.yaml
Comment thread charts/argocd-understack/templates/application-nautobot-worker.yaml
@haseebsyed12 haseebsyed12 force-pushed the site-nautobot-worker branch 8 times, most recently from b3c0d4b to 67436e4 Compare April 15, 2026 12:52
@haseebsyed12 haseebsyed12 requested review from cardoe and skrobul April 15, 2026 14:50
@haseebsyed12 haseebsyed12 force-pushed the site-nautobot-worker branch 2 times, most recently from 441a65a to 231db0c Compare April 16, 2026 05:32
Comment thread charts/argocd-understack/templates/application-nautobot-worker.yaml
Comment thread components/nautobot-worker/values.yaml Outdated
```yaml
enabled: false

metrics:
  enabled: false
```
Collaborator


I think we still want metrics to be exposed for the worker to monitor it correctly.
We don't need the metrics.nginxExporter.enabled and metrics.uWSGI.enabled but they are false by default.

Contributor Author


Collaborator


ahh right, I completely forgot about those - my bad! This will be resolved after we upgrade to the newer chart that allows configuring scrapeProtocols.

Comment thread components/nautobot/nautobot_config.py
Comment thread components/nautobot/nautobot_config.py Outdated
Comment thread docs/deploy-guide/components/nautobot-worker.md
Comment thread docs/deploy-guide/components/nautobot-worker.md Outdated
Comment thread docs/deploy-guide/components/nautobot-worker.md Outdated
Comment on lines +190 to +266

```text
<site-name>/nautobot-worker/
```

### Step 2: Create ExternalSecrets for credentials

Create ExternalSecret resources that pull credentials from your secrets
provider into the `nautobot` namespace. You need four:

| ExternalSecret | Target Secret | Purpose |
|---|---|---|
| `externalsecret-nautobot-django.yaml` | `nautobot-django` | Django `SECRET_KEY` -- must match the global instance |
| `externalsecret-nautobot-db.yaml` | `nautobot-db` | CNPG app user password (satisfies Helm chart requirement) |
| `externalsecret-nautobot-worker-redis.yaml` | `nautobot-redis` | Redis password |
| `externalsecret-dockerconfigjson-github-com.yaml` | `dockerconfigjson-github-com` | Container registry credentials |

Each ExternalSecret should reference your `ClusterSecretStore` and map
the credential into the key format the Nautobot Helm chart expects.
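As a non-authoritative sketch, one of the four ExternalSecrets might look like the following; the ClusterSecretStore name, remote key path, and target key name are assumptions, not values from this PR:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: nautobot-django
  namespace: nautobot
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: secret-store               # assumed store name
  target:
    name: nautobot-django            # Secret the Helm chart reads
  data:
    - secretKey: NAUTOBOT_SECRET_KEY # assumed key name expected by the chart
      remoteRef:
        key: nautobot/django         # assumed path in the secrets provider
        property: secret_key         # assumed property name
```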

### Step 3: Create the mTLS CA key pair ExternalSecret

Create `externalsecret-mtls-ca-key-pair.yaml` to distribute the mTLS CA
certificate and private key to this site cluster. The resulting secret
must be a `kubernetes.io/tls` type with these keys:

| Key | Content |
|---|---|
| `tls.crt` | CA certificate (PEM) |
| `tls.key` | CA private key (PEM) |
| `ca.crt` | CA certificate (PEM, same as `tls.crt`) |

cert-manager's CA Issuer reads `tls.crt` and `tls.key` from this secret
to sign client certificates.

### Step 4: Create the cert-manager CA Issuer

Create `issuer-mtls-ca-issuer.yaml`:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: mtls-ca-issuer
  namespace: nautobot
spec:
  ca:
    secretName: mtls-ca-key-pair
```

### Step 5: Create the client certificate

Create `certificate-nautobot-mtls.yaml`. The `commonName` must match the
PostgreSQL database user (typically `app`) because `pg_hba cert` maps
the certificate CN to the DB user.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nautobot-mtls-client
  namespace: nautobot
spec:
  secretName: nautobot-mtls-client
  duration: 8760h # 1 year
  renewBefore: 720h # 30 days
  commonName: app
  usages:
    - client auth
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: mtls-ca-issuer
    kind: Issuer
```
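For context on the `commonName` requirement: with PostgreSQL's `cert` authentication the certificate CN is compared against the database user, so the server-side `pg_hba.conf` entry looks roughly like this (the database name and address range are illustrative):

```text
# hostssl <database> <user> <address> cert
hostssl  app  app  10.0.0.0/8  cert
```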

Collaborator

@skrobul skrobul Apr 16, 2026


I skipped reading those sections in detail because the process seems to be fundamentally wrong.
The CA should live only on the global cluster; there is absolutely no need to create a copy of the CA on the site clusters.
Having CA keys exposed at each site means that compromising just one site would give an attacker the ability to impersonate not only other sites but also the server/global cluster. It also makes MITM attacks very easy. It is like giving every branch office the key-making machine for the company's building badges, not just a badge for that office.

Normally the process can be:

  1. Provision the CA and ClusterIssuer on the global cluster.
  2. When you need to deploy a site, create the relevant Certificate object on the global cluster so that it can be issued.
  3. The issued certificate and key get transferred to the site level (possibly through a secrets provider).
  4. Workloads on the site use that certificate.
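Step 2 of this flow could be sketched as a cert-manager Certificate on the global cluster (all names here are hypothetical); the resulting secret is what gets transferred to the site in step 3:

```yaml
# Created on the GLOBAL cluster only; the CA private key never leaves it.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nautobot-mtls-client-site-a   # hypothetical site name
  namespace: nautobot
spec:
  secretName: nautobot-mtls-client-site-a
  commonName: app
  usages:
    - client auth
  issuerRef:
    name: mtls-ca-issuer              # hypothetical ClusterIssuer name
    kind: ClusterIssuer
```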

The alternative process could be similar and slightly more automated:

  1. Provision the CA and ClusterIssuer on the global cluster.
  2. When you need to deploy the site components, the site cluster creates a CertificateRequest.
  3. It kicks off periodic checks for whether the certificate has been issued (as indicated by the Ready condition).
  4. An operator approves the certificate issuance on the global cluster.
  5. The site-level automation downloads the certificate and places it in the appropriate paths, then sleeps until certificate expiration approaches and repeats everything from step 2.
  6. Workloads on the site use that certificate.

This would be very similar to how kube-apiserver CA and kubelet certificate management works today, just with cert-manager.

I understand this may be complicated, so the fully automated rotation can be implemented as a separate PR, but if you choose to go down this route, at minimum we would need to:

  • switch to a 3-year client certificate expiration time
  • explicitly document that the transfer of certificates happens through manual operator intervention and is required every [EXPIRATION_PERIOD] years for each cluster. This can quickly grow into a big maintenance burden if it's 2 certs per site.

Contributor Author


Implemented as suggested:

  • The CA hierarchy (selfsigned issuer, root CA, CA issuer) lives on the global cluster only.
  • Site clusters pull only the pre-issued client cert+key and the CA public cert via ExternalSecret -- no CA private key, no cert-manager Issuer on site clusters.
  • Per-site client certificates are issued on the global cluster by cert-manager (e.g. certificate-nautobot-mtls-client-rax-dev-iad3.yaml), and the issued cert is then uploaded to the secrets provider. (Automation of this manual copying is not implemented yet.)

Collaborator

@skrobul skrobul Apr 20, 2026


Thanks, that's looking much better now.
I am okay with not having the automation for this yet, but if we are leaving this as manual work, can we please:

  1. Get detailed documentation (in the Operator Guide) on what the process is for adding a new site (already there) and for renewing certificates for an existing site?
  2. The current client certificate lifetime is 1 year:
```text
nautobot@nautobot-default-bddd598b9-j82hf:~$ openssl x509 -in /etc/nautobot/mtls/tls.crt -noout -text | grep Not
            Not Before: Apr 14 12:58:22 2026 GMT
            Not After : Apr 14 12:58:22 2027 GMT
```

Can we bump this to 2 or 3 years to give us more breathing room, especially in a rapid-growth period where we expect many sites to be deployed in a short span of time?

Collaborator


As @cardoe found out, this introduces a third copy of the nautobot_config.py:

  • the default container already has /opt/nautobot/nautobot_config.py
  • our custom container build adds /opt/nautobot_config/nautobot_config.py
  • this PR adds another one that is provided through helm.file

Can we standardise on just one way of delivering that config?
Ideally I think that should just be a volume mounted at Nautobot's default path, which allows it to be changed on a per-env basis.

Contributor Author


The global application-nautobot.yaml and the site application-nautobot-worker.yaml reference the same single nautobot_config.py from $understack/components/nautobot/nautobot_config.py via the Helm chart's fileParameters mechanism.

It is hardcoded to /opt/nautobot/nautobot_config.py
https://github.com/nautobot/helm-charts/blob/develop/charts/nautobot/templates/celery-deployment.yaml#L160
https://github.com/nautobot/helm-charts/blob/develop/charts/nautobot/templates/nautobot-deployment.yaml#L144

Shall I address the issue of the default container path (/opt/nautobot/nautobot_config.py)
and the custom container build path (/opt/nautobot_config/nautobot_config.py) in another PR?
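For reference, the fileParameters mechanism works like `helm --set-file`; in the ArgoCD Application spec it would look roughly like this (the value name and repo-relative path are assumptions, not taken from this PR):

```yaml
spec:
  source:
    helm:
      fileParameters:
        - name: nautobot.config                          # assumed chart value name
          path: components/nautobot/nautobot_config.py   # assumed repo path
```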

Collaborator


Nope, it has to be fixed before merge - switching the global cluster to the site-nautobot-worker branch results in missing SSO and missing installed plugins:


Contributor Author


Fixed both issues and updated the docs with relevant details on the root cause and fix.

@haseebsyed12 haseebsyed12 force-pushed the site-nautobot-worker branch 5 times, most recently from c2e24b8 to d4915ed Compare April 17, 2026 03:55
@haseebsyed12 haseebsyed12 requested a review from skrobul April 17, 2026 07:15
@haseebsyed12 haseebsyed12 force-pushed the site-nautobot-worker branch from d4915ed to b8d2d0a Compare April 20, 2026 07:38
Collaborator

@skrobul skrobul left a comment


Left a few comments inline.

@haseebsyed12 haseebsyed12 force-pushed the site-nautobot-worker branch 9 times, most recently from 7828885 to 7d71b6e Compare April 20, 2026 18:02
Sites need to run background task processing locally to reduce
cross-cluster latency and scale worker capacity independently. Workers
connect back to the global PostgreSQL and Redis, so cross-cluster
connections require stronger auth than passwords alone.

Adds a site-scoped ArgoCD Application that deploys only the Celery
worker portion of the Nautobot Helm chart. The web server, Redis, and
PostgreSQL remain on the global cluster.

All cross-cluster connections use end-to-end mTLS:
- nautobot_config.py gains conditional SSL/mTLS logic for both
  PostgreSQL (NAUTOBOT_DB_SSLMODE) and Redis (auto-detected from
  mounted CA cert)
- nautobot-worker component values disable everything except celery
- envoy-configs gateway template supports gatewayPort on TLS
  passthrough listeners for non-443 ports (5432, 6379)
- envoy-configs schema adds gatewayPort to the tls route type
- Deploy guide documents the full architecture, step-by-step site
  onboarding, certificate infrastructure, and troubleshooting
… issued cert+key to sites via the external secrets provider.
@haseebsyed12 haseebsyed12 force-pushed the site-nautobot-worker branch from 8c34aca to 7423e9a Compare April 20, 2026 19:10
@haseebsyed12 haseebsyed12 requested a review from skrobul April 21, 2026 07:34
Allows for a different nautobot config file to be stored in the deploy
repo and supplied to Nautobot.


4 participants