feat(helm): Add monitoring.scrapeAnnotations to control whether or not to add prometheus.io/* annotations to Service#825
Conversation
…nnotations When monitoring.enabled is true (or auto with CRDs installed), the chart creates ServiceMonitor CRDs that instruct the Prometheus Operator to scrape CloudZero Agent metrics. In that configuration, the prometheus.io/* annotations on Services become redundant. In clusters where both annotation-based and CRD-based discovery are active simultaneously, metrics may be scraped twice, generating unnecessary load on the metrics pipeline. Adds components.monitoring.scrapeAnnotations (default: true) to address this. When set to false, prometheus.io/* annotations are omitted from the agent, aggregator, and webhook Services, eliminating the risk of double scraping while keeping ServiceMonitor-based discovery intact. Updates monitoring-infrastructure.md to document the new flag and the double scraping scenario, and adds helm unit tests covering both true and false states for all three affected Services. The default value of true preserves full backward compatibility for users relying on annotation-based Prometheus discovery without the Operator. This approach is purely additive, users without ServiceMonitors are unaffected, while Operator users gain a clean way to eliminate redundant scrape targets. A future consideration would be auto-disabling annotations when monitoring.enabled is true, though that would be a breaking change requiring a major version bump.
f570231 to
20a7bf7
Compare
A customer contributed PR #825 proposing a components.monitoring.scrapeAnnotations boolean to drop the prometheus.io/* annotations and avoid double-scraping when ServiceMonitor-based discovery is in use. Rather than add a single-purpose boolean, this generalizes the chart's monitoring configuration: it decides what to emit based on whether the Prometheus Operator is present, and exposes the discovery mechanism as one explicit choice. It supersedes the scrapeAnnotations flag from PR #825. Implementation Approach: The chart emits two kinds of monitoring resources -- Prometheus Operator CRDs (ServiceMonitor + PrometheusRule) and prometheus.io/* annotations on the agent Services. Both are now driven by two values resolved in helm/templates/_helpers.tpl and consumed by the Service, ServiceMonitor, and PrometheusRule templates. Functional Requirements: 1. The Prometheus Operator resources must be created when the Operator is present and omitted otherwise, without the user having to configure it. Added components.monitoring.enabled (null | true | false, default null), resolved by cloudzero-agent.monitoring.operatorEnabled: null auto-detects the Operator via the monitoring.coreos.com/v1 capability, true forces on, false forces off -- the same idiom as integrations.istio.enabled. The four servicemonitor-*.yaml templates and prometheusrule.yaml are gated on it. Previously null defaulted to disabled while the feature was being validated; it now auto-detects, so an Operator cluster that relied on the default begins emitting these resources on upgrade. 2. Operators must be able to choose the discovery mechanism, including an annotation-only mode for clusters without the Operator. Added components.monitoring.discovery.method (auto | serviceMonitors | annotations, default auto). auto resolves to serviceMonitors today and is left as an enum so a future auto-detectable backend can extend it; annotations is an explicit opt-in and is never auto-selected. The prometheus.io/* annotations on agent-service.yaml, aggregator-service.yaml, and webhook-service.yaml are gated on cloudzero-agent.monitoring.annotationsActive. 3. The ServiceMonitors and the PrometheusRule must form a single Operator bundle; annotation-based discovery is discovery-only. serviceMonitorsActive and rulesActive resolve together on the serviceMonitors path, so choosing annotations yields no ServiceMonitors and no alert rules. 4. The new values must validate and be documented for users. Updated helm/values.schema.yaml (enabled: null|boolean; discovery.method enum) and regenerated helm/values.schema.json; set defaults and comments in helm/values.yaml; rewrote helm/docs/monitoring-infrastructure.md as user-facing reference, removing the prior validation-log content. Validation: - helm lint passing - helm unit tests passing; helm/tests/monitoring_integration_test.yaml and defaults_service_test.yaml were rewritten for the new model - helm schema tests passing, including new fixtures under tests/helm/schema for the default, serviceMonitors, annotations, none, and invalid-value cases. kubeconform validates the rendered ServiceMonitor and PrometheusRule against the CRD catalog - golden manifests regenerated; a default `helm template` (no Operator detected) emits no ServiceMonitors, no PrometheusRule, and no prometheus.io/* annotations - helm subchart tests passing Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A customer contributed PR #825 proposing a components.monitoring.scrapeAnnotations boolean to drop the prometheus.io/* annotations and avoid double-scraping when ServiceMonitor-based discovery is in use. Rather than add a single-purpose boolean, this generalizes the chart's monitoring configuration into an explicit, opt-in model and supersedes the scrapeAnnotations flag from PR #825. Implementation Approach: The chart emits two kinds of monitoring resources -- Prometheus Operator CRDs (ServiceMonitor + PrometheusRule) and prometheus.io/* annotations on the agent Services. Both are driven by two values resolved in helm/templates/_helpers.tpl and consumed by the Service, ServiceMonitor, and PrometheusRule templates. There is no cluster auto-detection; the configuration is explicit. Functional Requirements: 1. Monitoring must be off by default and turned on explicitly. Added components.monitoring.enabled as a plain boolean defaulting to false. The four servicemonitor-*.yaml templates, prometheusrule.yaml, and the prometheus.io/* annotations on the three Service templates are all gated on it via helpers in _helpers.tpl. Previously the operator CRDs defaulted to disabled while the feature was validated and the prometheus.io/* annotations were always emitted; both are now opt-in, so an upgrade emits nothing until enabled, and annotation-based scrapers must set discovery.method: annotations. 2. When enabled, the operator must be able to choose the discovery mechanism. Added components.monitoring.discovery.method (auto | serviceMonitors | annotations, default auto). auto resolves to serviceMonitors today and is left as an enum so a future mechanism can extend it; annotations is an explicit opt-in. serviceMonitors emits monitoring.coreos.com/v1 CRDs, so the install fails if the Prometheus Operator is absent; annotations needs no CRDs. 3. The ServiceMonitors and the PrometheusRule must form a single bundle; annotation-based discovery is discovery-only. serviceMonitorsActive and rulesActive resolve together on the serviceMonitors path, so method=annotations yields no ServiceMonitors and no alert rules. 4. The new values must validate and be documented for users. Updated helm/values.schema.yaml (enabled: boolean; discovery.method enum) and regenerated helm/values.schema.json; set defaults and comments in helm/values.yaml; rewrote helm/docs/monitoring-infrastructure.md as user-facing reference. Validation: - helm lint passing - helm unit tests passing; helm/tests/monitoring_integration_test.yaml and defaults_service_test.yaml were rewritten for the new model - helm schema tests passing, including new fixtures under tests/helm/schema for the default, serviceMonitors, annotations, none, and invalid-value cases. kubeconform validates the rendered ServiceMonitor and PrometheusRule against the CRD catalog - golden manifests regenerated; a default `helm template` emits no ServiceMonitors, no PrometheusRule, and no prometheus.io/* annotations - helm subchart tests passing Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Hi @azagarelz. First off, thank you very much for the contribution! You're right, this is definitely an issue... sorry about that. I've been thinking about this a bit, trying to decide on the best API, and I have some reservations on monitoring:
enabled: false
method: autoThe method is one of I opened up a new new PR #840 to implement it, largely because we can't actually merge external PRs right now (it's an issue with accessing secrets from people outside the organization), LMK if that sounds good to you, and would resolve your problem. |
Why?
When
components.monitoring.enabledistrue(orautowith CRDs installed), the chart creates ServiceMonitor that instruct the Prometheus Operator on how to scrape CZ metrics. In that configuration, theprometheus.io/*annotations on Services become redundant.In clusters where both annotation-based discovery and CRD-based discovery are active simultaneously, CZ metrics may be scraped twice, generating unnecessary load or inconsistencies.
Setting
scrapeAnnotations: falseremoves the annotations from Services, eliminating the risk of double scraping while keeping the ServiceMonitor based discovery intact. The default remainstruefor full backward compatibility.What
components.monitoring.scrapeAnnotationsvalue (default:true)false, removeprometheus.io/*annotations from the agent, aggregator, and webhook Servicesmonitoring-infrastructure.mdto document the new flag and the double scraping scenarioHow Tested
Helm unit tests added in
charts/cloudzero-agent/tests/defaults_service_test.yamlcovering:prometheus.io/scrapeannotation present on all three ServicesscrapeAnnotations: false