feat: enable site clusters to run Nautobot Celery workers on site clusters #1908
haseebsyed12 wants to merge 8 commits into main from
Conversation
```yaml
enabled: false

metrics:
  enabled: false
```
I think we still want metrics to be exposed for the worker to monitor it correctly.
We don't need `metrics.nginxExporter.enabled` and `metrics.uWSGI.enabled`, but they are false by default.
Copied the same from the disabled defaults, in favor of explicit definitions through Kustomize.
ahh right, I completely forgot about those - my bad! This will be resolved after we upgrade to the newer chart that allows configuring scrapeProtocols.
```text
<site-name>/nautobot-worker/
```

### Step 2: Create ExternalSecrets for credentials

Create ExternalSecret resources that pull credentials from your secrets provider into the `nautobot` namespace. You need four:

| ExternalSecret | Target Secret | Purpose |
|---|---|---|
| `externalsecret-nautobot-django.yaml` | `nautobot-django` | Django `SECRET_KEY` -- must match the global instance |
| `externalsecret-nautobot-db.yaml` | `nautobot-db` | CNPG app user password (satisfies a Helm chart requirement) |
| `externalsecret-nautobot-worker-redis.yaml` | `nautobot-redis` | Redis password |
| `externalsecret-dockerconfigjson-github-com.yaml` | `dockerconfigjson-github-com` | Container registry credentials |

Each ExternalSecret should reference your `ClusterSecretStore` and map the credential into the key format the Nautobot Helm chart expects.
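One of the four might look like the following sketch; the `ClusterSecretStore` name, remote key path, and property names are assumptions to adapt to your secrets provider:

```yaml
# Hypothetical example -- adjust the store name and remote keys to your provider.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: nautobot-worker-redis
  namespace: nautobot
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: secret-store            # assumed ClusterSecretStore name
  target:
    name: nautobot-redis          # target secret name the Helm chart expects
  data:
    - secretKey: redis-password   # key format expected by the chart
      remoteRef:
        key: nautobot/redis       # assumed path in the secrets provider
        property: password
```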
### Step 3: Create the mTLS CA key pair ExternalSecret

Create `externalsecret-mtls-ca-key-pair.yaml` to distribute the mTLS CA certificate and private key to this site cluster. The resulting secret must be of type `kubernetes.io/tls` with these keys:

| Key | Content |
|---|---|
| `tls.crt` | CA certificate (PEM) |
| `tls.key` | CA private key (PEM) |
| `ca.crt` | CA certificate (PEM, same as `tls.crt`) |

cert-manager's CA Issuer reads `tls.crt` and `tls.key` from this secret to sign client certificates.
### Step 4: Create the cert-manager CA Issuer

Create `issuer-mtls-ca-issuer.yaml`:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: mtls-ca-issuer
  namespace: nautobot
spec:
  ca:
    secretName: mtls-ca-key-pair
```
### Step 5: Create the client certificate

Create `certificate-nautobot-mtls.yaml`. The `commonName` must match the PostgreSQL database user (typically `app`) because the `pg_hba.conf` `cert` method maps the certificate CN to the DB user.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nautobot-mtls-client
  namespace: nautobot
spec:
  secretName: nautobot-mtls-client
  duration: 8760h # 1 year
  renewBefore: 720h # 30 days
  commonName: app
  usages:
    - client auth
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: mtls-ca-issuer
    kind: Issuer
```
I skipped reading those sections in detail because the process seems to be fundamentally wrong.
The CA should live only on the global cluster; there is absolutely no need to create a copy of the CA on the site clusters.
Having CA keys exposed at each of the sites means that only one site would need to be compromised to give an attacker the ability to impersonate not only other sites, but also the server/global. It also makes MITM attacks very easy to perform. It is like giving every branch office the key-making machine for the company's building badges, not just a badge for that office.
Normally the process can be:
- Provision the CA and ClusterIssuer on the global cluster.
- When you need to deploy a site, you create a relevant Certificate object on the global cluster so that it can be issued.
- The issued certificate and key get transferred to the site level (possibly through a secrets provider).
- Workloads on the site use that certificate.
The alternative process could be similar and slightly more automated:
- Provision the CA and ClusterIssuer on the global cluster.
- When you need to deploy the site components, the site cluster creates a CertificateRequest.
- It kicks off periodic checks for whether the certificate has been issued (as indicated by the `Ready` condition).
- An operator approves the certificate issuance on the global cluster.
- The site-level automation downloads the certificate and places it in the appropriate paths, then sleeps until certificate expiration is coming up and repeats everything from step 2.
- Workloads on the site use that certificate.

This would be very similar to how the kube-apiserver CA and kubelet certificate management work today, just with cert-manager.
I understand this may be complicated, so the fully automated rotation can be implemented as a separate PR, but if you choose to go down this route, at minimum we would need to:
- switch to a 3-year client certificate expiration time
- explicitly document that the transfer of certificates happens through manual operator intervention and is required every [EXPIRATION_PERIOD] years for each cluster. This can quickly grow into a big maintenance burden if it's 2 certs per site.
Implemented as suggested
- CA hierarchy (selfsigned issuer, root CA, CA issuer) lives on the global cluster only.
- site clusters pull only the pre-issued client cert+key and the CA public cert via ExternalSecret -- no CA private key, no cert-manager Issuer on site clusters.
- per-site client certificates are issued on the global cluster by cert-manager (e.g. certificate-nautobot-mtls-client-rax-dev-iad3.yaml), then the issued cert is uploaded to the secrets provider. (Automation of this manual copying of the cert is not implemented yet.)
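A per-site client Certificate issued on the global cluster might look roughly like this sketch; the issuer name/kind and secret name are assumptions:

```yaml
# Issued on the global cluster; the resulting cert+key is then uploaded
# to the secrets provider so the site can pull it via ExternalSecret.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nautobot-mtls-client-rax-dev-iad3
  namespace: nautobot
spec:
  secretName: nautobot-mtls-client-rax-dev-iad3
  commonName: app               # must still match the PostgreSQL DB user
  usages:
    - client auth
  issuerRef:
    name: mtls-ca-issuer        # assumed CA issuer name on the global cluster
    kind: ClusterIssuer
```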
Thanks, that's looking much better now.
I am okay with not having the automation for this yet, but if we are leaving this as manual work, can we please:
- Get the detailed documentation (in Operator Guide) on what the process of adding new site (already there) and renewing certificates for existing site is?
- The current client certificate lifetime is 1 year:

```text
nautobot@nautobot-default-bddd598b9-j82hf:~$ openssl x509 -in /etc/nautobot/mtls/tls.crt -noout -text | grep Not
            Not Before: Apr 14 12:58:22 2026 GMT
            Not After : Apr 14 12:58:22 2027 GMT
```
Can we bump this to 2 or 3 years to give us more breathing room, especially in rapid growth period where we expect many sites to be deployed in a short span of time?
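Bumping the lifetime would be a small change in the Certificate spec; for example (the `renewBefore` window is an assumption):

```yaml
spec:
  duration: 26280h    # 3 years (3 x 8760h)
  renewBefore: 2160h  # assumed: begin renewal ~90 days before expiry
```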
As @cardoe found out, this introduces a third copy of the `nautobot_config.py`:
- the default container already has `/opt/nautobot/nautobot_config.py`
- our custom container build adds `/opt/nautobot_config/nautobot_config.py`
- this PR adds another one that is provided through a Helm file parameter

Can we standardise on using just one way of delivering that config?
Ideally I think that should just be a volume that mounts to Nautobot's default path and allows the config to be changed on a per-env basis.
The global `application-nautobot.yaml` and the site `application-nautobot-worker.yaml` reference the same single `nautobot_config.py` from `$understack/components/nautobot/nautobot_config.py` via the Helm chart's `fileParameters` mechanism.
It is hardcoded to /opt/nautobot/nautobot_config.py
https://github.com/nautobot/helm-charts/blob/develop/charts/nautobot/templates/celery-deployment.yaml#L160
https://github.com/nautobot/helm-charts/blob/develop/charts/nautobot/templates/nautobot-deployment.yaml#L144
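In ArgoCD terms, that `fileParameters` wiring would look roughly like this fragment; the value name `nautobot.config` and the repo path are assumptions to check against the chart's values:

```yaml
# Sketch of the ArgoCD Application source stanza (names/paths assumed).
spec:
  source:
    helm:
      fileParameters:
        - name: nautobot.config   # assumed chart value holding nautobot_config.py content
          path: components/nautobot/nautobot_config.py
```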
Shall I address the issue of the default container path (`/opt/nautobot/nautobot_config.py`) and the custom container build path (`/opt/nautobot_config/nautobot_config.py`) in another PR?
Fixed both issues and updated the docs with relevant details on the root cause and fix.
skrobul
left a comment
Left a few comments inline.
Sites need to run background task processing locally to reduce cross-cluster latency and scale worker capacity independently. Workers connect back to the global PostgreSQL and Redis, so cross-cluster connections require stronger auth than passwords alone.

Adds a site-scoped ArgoCD Application that deploys only the Celery worker portion of the Nautobot Helm chart. The web server, Redis, and PostgreSQL remain on the global cluster. All cross-cluster connections use end-to-end mTLS:

- `nautobot_config.py` gains conditional SSL/mTLS logic for both PostgreSQL (`NAUTOBOT_DB_SSLMODE`) and Redis (auto-detected from a mounted CA cert)
- nautobot-worker component values disable everything except celery
- envoy-configs gateway template supports `gatewayPort` on TLS passthrough listeners for non-443 ports (5432, 6379)
- envoy-configs schema adds `gatewayPort` to the tls route type
- Deploy guide documents the full architecture, step-by-step site onboarding, certificate infrastructure, and troubleshooting
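The conditional SSL/mTLS logic could look roughly like this sketch of a `nautobot_config.py` fragment; everything except `NAUTOBOT_DB_SSLMODE` (the env var names for cert paths, the mount locations, and the helper names) is an assumption for illustration:

```python
import os

# Assumed mount path for the Redis CA certificate; adjust to your deployment.
REDIS_CA_PATH = os.getenv("NAUTOBOT_REDIS_CA_PATH", "/etc/nautobot/mtls/ca.crt")


def database_ssl_options():
    """Build libpq-style SSL options from NAUTOBOT_DB_SSLMODE.

    Returns an empty dict when SSL is disabled, so the settings can be
    merged unconditionally into DATABASES["default"]["OPTIONS"].
    """
    sslmode = os.getenv("NAUTOBOT_DB_SSLMODE", "disable")
    if sslmode == "disable":
        return {}
    opts = {"sslmode": sslmode}
    # For verification modes the client cert/key and CA are also supplied,
    # enabling mTLS toward the global PostgreSQL.
    if sslmode in ("verify-ca", "verify-full"):
        opts.update(
            sslrootcert=os.getenv("NAUTOBOT_DB_SSLROOTCERT", "/etc/nautobot/mtls/ca.crt"),
            sslcert=os.getenv("NAUTOBOT_DB_SSLCERT", "/etc/nautobot/mtls/tls.crt"),
            sslkey=os.getenv("NAUTOBOT_DB_SSLKEY", "/etc/nautobot/mtls/tls.key"),
        )
    return opts


def redis_uses_tls():
    """Redis TLS is auto-detected from the presence of a mounted CA cert."""
    return os.path.isfile(REDIS_CA_PATH)
```

On the global cluster (no `NAUTOBOT_DB_SSLMODE`, no mounted CA) both helpers fall back to plain connections, so the same config file can serve both environments.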
… issued cert+key to sites via the external secrets provider.
Allows for a different nautobot config file to be stored in the deploy repo and supplied to Nautobot.
Nautobot currently runs entirely on the global cluster, including its Celery workers. Sites that generate heavy background task load have no way to offload that processing closer to where the work originates, and a single global worker pool becomes a bottleneck as sites scale.
This adds a site-scoped ArgoCD Application that deploys only the Celery worker portion of the Nautobot Helm chart. The web server, Redis, and PostgreSQL are all disabled because they remain on the global cluster; site workers connect back to those shared services.
This lets operators scale worker capacity per-site independently, run queue-specific workers closer to the hardware they manage, and reduce cross-cluster task latency for site-driven automation.
A new ArgoCD Application (`application-nautobot-worker.yaml`), gated behind `site.nautobot_worker.enabled`, deploys the worker into the `nautobot` namespace on the site cluster.