feat: enable site clusters to run Nautobot Celery workers on site clusters #1908
haseebsyed12 wants to merge 8 commits into main from
Conversation
```yaml
enabled: false

metrics:
  enabled: false
```
I think we still want metrics to be exposed for the worker to monitor it correctly.
We don't need `metrics.nginxExporter.enabled` and `metrics.uWSGI.enabled`, but they are false by default.
Copied the same from the disabled defaults, in favor of explicit definitions through Kustomize.
ahh right, I completely forgot about those - my bad! This will be resolved after we upgrade to the newer chart that allows configuring scrapeProtocols.
```text
<site-name>/nautobot-worker/
```

### Step 2: Create ExternalSecrets for credentials

Create ExternalSecret resources that pull credentials from your secrets provider into the `nautobot` namespace. You need four:

| ExternalSecret | Target Secret | Purpose |
|---|---|---|
| `externalsecret-nautobot-django.yaml` | `nautobot-django` | Django `SECRET_KEY` -- must match the global instance |
| `externalsecret-nautobot-db.yaml` | `nautobot-db` | CNPG app user password (satisfies a Helm chart requirement) |
| `externalsecret-nautobot-worker-redis.yaml` | `nautobot-redis` | Redis password |
| `externalsecret-dockerconfigjson-github-com.yaml` | `dockerconfigjson-github-com` | Container registry credentials |

Each ExternalSecret should reference your `ClusterSecretStore` and map the credential into the key format the Nautobot Helm chart expects.
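One of the four might look like the following sketch; the `ClusterSecretStore` name, remote key path, and property names are assumptions to adapt to your secrets provider:

```yaml
# Hypothetical example -- adjust the store name and remote keys to your provider.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: nautobot-worker-redis
  namespace: nautobot
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: secret-store            # assumed ClusterSecretStore name
  target:
    name: nautobot-redis          # target secret name the Helm chart expects
  data:
    - secretKey: redis-password   # key format expected by the chart
      remoteRef:
        key: nautobot/redis       # assumed path in the secrets provider
        property: password
```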
### Step 3: Create the mTLS CA key pair ExternalSecret

Create `externalsecret-mtls-ca-key-pair.yaml` to distribute the mTLS CA certificate and private key to this site cluster. The resulting secret must be of type `kubernetes.io/tls` with these keys:

| Key | Content |
|---|---|
| `tls.crt` | CA certificate (PEM) |
| `tls.key` | CA private key (PEM) |
| `ca.crt` | CA certificate (PEM, same as `tls.crt`) |

cert-manager's CA Issuer reads `tls.crt` and `tls.key` from this secret to sign client certificates.
### Step 4: Create the cert-manager CA Issuer

Create `issuer-mtls-ca-issuer.yaml`:

```yaml
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: mtls-ca-issuer
  namespace: nautobot
spec:
  ca:
    secretName: mtls-ca-key-pair
```
### Step 5: Create the client certificate

Create `certificate-nautobot-mtls.yaml`. The `commonName` must match the PostgreSQL database user (typically `app`) because the `pg_hba.conf` `cert` method maps the certificate CN to the DB user.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nautobot-mtls-client
  namespace: nautobot
spec:
  secretName: nautobot-mtls-client
  duration: 8760h # 1 year
  renewBefore: 720h # 30 days
  commonName: app
  usages:
    - client auth
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: mtls-ca-issuer
    kind: Issuer
```
I skipped reading those sections in detail because the process seems to be fundamentally wrong.
The CA should live only on the global cluster; there is absolutely no need to create a copy of the CA on the site clusters.
Having CA keys exposed at each of the sites means that only one site would need to be compromised to give an attacker the ability to impersonate not only other sites, but also the server/global. It also makes MITM attacks very easy to perform. It is like giving every branch office the key-making machine for the company's building badges, not just a badge for that office.
Normally the process can be:
- Provision the CA and ClusterIssuer on the global cluster.
- When you need to deploy a site, you create a relevant Certificate object on the global cluster so that it can be issued.
- The issued certificate and key get transferred to the site level (possibly through a secrets provider).
- Workloads on the site use that certificate.
The alternative process could be similar and slightly more automated:
- Provision the CA and ClusterIssuer on the global cluster.
- When you need to deploy the site components, the site cluster creates a CertificateRequest.
- It kicks off periodic checks for whether the certificate has been issued (as indicated by the `Ready` condition).
- An operator approves the certificate issuance on the global cluster.
- The site-level automation downloads the certificate and places it in the appropriate paths, then sleeps until certificate expiration is coming up and repeats everything from step 2.
- Workloads on the site use that certificate.

This would be very similar to how the kube-apiserver CA and kubelet certificate management work today, just with cert-manager.
I understand this may be complicated, so the fully automated rotation can be implemented as a separate PR, but if you choose to go down this route, at minimum we would need to:
- switch to a 3-year client certificate expiration time
- explicitly document that the transfer of certificates happens through manual operator intervention and is required every [EXPIRATION_PERIOD] years for each cluster. This can quickly grow into a big maintenance burden if it's 2 certs per site.
Implemented as suggested
- CA hierarchy (selfsigned issuer, root CA, CA issuer) lives on the global cluster only.
- site clusters pull only the pre-issued client cert+key and the CA public cert via ExternalSecret -- no CA private key, no cert-manager Issuer on site clusters.
- per-site client certificates are issued on the global cluster by cert-manager (e.g. certificate-nautobot-mtls-client-rax-dev-iad3.yaml), then the issued cert is uploaded to the secrets provider. (Automation of this manual copying of the cert is not implemented yet.)
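A per-site client Certificate issued on the global cluster might look roughly like this sketch; the issuer name/kind and secret name are assumptions:

```yaml
# Issued on the global cluster; the resulting cert+key is then uploaded
# to the secrets provider so the site can pull it via ExternalSecret.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nautobot-mtls-client-rax-dev-iad3
  namespace: nautobot
spec:
  secretName: nautobot-mtls-client-rax-dev-iad3
  commonName: app               # must still match the PostgreSQL DB user
  usages:
    - client auth
  issuerRef:
    name: mtls-ca-issuer        # assumed CA issuer name on the global cluster
    kind: ClusterIssuer
```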
Thanks, that's looking much better now.
I am okay with not having the automation for this yet, but if we are leaving this as manual work, can we please:
- Get the detailed documentation (in Operator Guide) on what the process of adding new site (already there) and renewing certificates for existing site is?
- The current client certificate lifetime is 1 year:

```text
nautobot@nautobot-default-bddd598b9-j82hf:~$ openssl x509 -in /etc/nautobot/mtls/tls.crt -noout -text | grep Not
            Not Before: Apr 14 12:58:22 2026 GMT
            Not After : Apr 14 12:58:22 2027 GMT
```
Can we bump this to 2 or 3 years to give us more breathing room, especially in rapid growth period where we expect many sites to be deployed in a short span of time?
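Bumping the lifetime would be a small change in the Certificate spec; for example (the `renewBefore` window is an assumption):

```yaml
spec:
  duration: 26280h    # 3 years (3 x 8760h)
  renewBefore: 2160h  # assumed: begin renewal ~90 days before expiry
```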
As @cardoe found out, this introduces a third copy of the `nautobot_config.py`:
- the default container already has `/opt/nautobot/nautobot_config.py`
- our custom container build adds `/opt/nautobot_config/nautobot_config.py`
- this PR adds another one that is provided through a Helm file parameter

Can we standardise on using just one way of delivering that config?
Ideally I think that should just be a volume that mounts to Nautobot's default path and allows the config to be changed on a per-env basis.
The global `application-nautobot.yaml` and the site `application-nautobot-worker.yaml` reference the same single `nautobot_config.py` from `$understack/components/nautobot/nautobot_config.py` via the Helm chart's `fileParameters` mechanism.
It is hardcoded to /opt/nautobot/nautobot_config.py
https://github.com/nautobot/helm-charts/blob/develop/charts/nautobot/templates/celery-deployment.yaml#L160
https://github.com/nautobot/helm-charts/blob/develop/charts/nautobot/templates/nautobot-deployment.yaml#L144
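In ArgoCD terms, that `fileParameters` wiring would look roughly like this fragment; the value name `nautobot.config` and the repo path are assumptions to check against the chart's values:

```yaml
# Sketch of the ArgoCD Application source stanza (names/paths assumed).
spec:
  source:
    helm:
      fileParameters:
        - name: nautobot.config   # assumed chart value holding nautobot_config.py content
          path: components/nautobot/nautobot_config.py
```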
Shall I address the issue of the default container path (`/opt/nautobot/nautobot_config.py`) and the custom container build path (`/opt/nautobot_config/nautobot_config.py`) in another PR?
Fixed both issues and updated the docs with relevant details on the root cause and fix.
skrobul
left a comment
Left a few comments inline.
Sites need to run background task processing locally to reduce cross-cluster latency and scale worker capacity independently. Workers connect back to the global PostgreSQL and Redis, so cross-cluster connections require stronger auth than passwords alone.

Adds a site-scoped ArgoCD Application that deploys only the Celery worker portion of the Nautobot Helm chart. The web server, Redis, and PostgreSQL remain on the global cluster. All cross-cluster connections use end-to-end mTLS:

- `nautobot_config.py` gains conditional SSL/mTLS logic for both PostgreSQL (`NAUTOBOT_DB_SSLMODE`) and Redis (auto-detected from a mounted CA cert)
- nautobot-worker component values disable everything except celery
- envoy-configs gateway template supports `gatewayPort` on TLS passthrough listeners for non-443 ports (5432, 6379)
- envoy-configs schema adds `gatewayPort` to the tls route type
- Deploy guide documents the full architecture, step-by-step site onboarding, certificate infrastructure, and troubleshooting
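The conditional SSL/mTLS logic could look roughly like this sketch of a `nautobot_config.py` fragment; everything except `NAUTOBOT_DB_SSLMODE` (the env var names for cert paths, the mount locations, and the helper names) is an assumption for illustration:

```python
import os

# Assumed mount path for the Redis CA certificate; adjust to your deployment.
REDIS_CA_PATH = os.getenv("NAUTOBOT_REDIS_CA_PATH", "/etc/nautobot/mtls/ca.crt")


def database_ssl_options():
    """Build libpq-style SSL options from NAUTOBOT_DB_SSLMODE.

    Returns an empty dict when SSL is disabled, so the settings can be
    merged unconditionally into DATABASES["default"]["OPTIONS"].
    """
    sslmode = os.getenv("NAUTOBOT_DB_SSLMODE", "disable")
    if sslmode == "disable":
        return {}
    opts = {"sslmode": sslmode}
    # For verification modes the client cert/key and CA are also supplied,
    # enabling mTLS toward the global PostgreSQL.
    if sslmode in ("verify-ca", "verify-full"):
        opts.update(
            sslrootcert=os.getenv("NAUTOBOT_DB_SSLROOTCERT", "/etc/nautobot/mtls/ca.crt"),
            sslcert=os.getenv("NAUTOBOT_DB_SSLCERT", "/etc/nautobot/mtls/tls.crt"),
            sslkey=os.getenv("NAUTOBOT_DB_SSLKEY", "/etc/nautobot/mtls/tls.key"),
        )
    return opts


def redis_uses_tls():
    """Redis TLS is auto-detected from the presence of a mounted CA cert."""
    return os.path.isfile(REDIS_CA_PATH)
```

On the global cluster (no `NAUTOBOT_DB_SSLMODE`, no mounted CA) both helpers fall back to plain connections, so the same config file can serve both environments.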
… issued cert+key to sites via the external secrets provider.
Allows for a different nautobot config file to be stored in the deploy repo and supplied to Nautobot.
Nautobot currently runs entirely on the global cluster, including its Celery workers. Sites that generate heavy background task load have no way to offload that processing closer to where the work originates, and a single global worker pool becomes a bottleneck as sites scale.
This adds a site-scoped ArgoCD Application that deploys only the Celery worker portion of the Nautobot Helm chart. The web server, Redis, and PostgreSQL are all disabled because they remain on the global cluster; site workers connect back to those shared services.
This lets operators scale worker capacity per-site independently, run queue-specific workers closer to the hardware they manage, and reduce cross-cluster task latency for site-driven automation.
A new ArgoCD Application (`application-nautobot-worker.yaml`), gated behind `site.nautobot_worker.enabled`, deploys the worker into the `nautobot` namespace on the site cluster.