Skip to content

feat(helm): Add monitoring.scrapeAnnotations to control whether or not to add prometheus.io/* annotations to Service#825

Open
azagarelz wants to merge 1 commit into
Cloudzero:developfrom
azagarelz:feat/monitoring-scrape-annotations-toggle
Open

feat(helm): Add monitoring.scrapeAnnotations to control whether or not to add prometheus.io/* annotations to Service#825
azagarelz wants to merge 1 commit into
Cloudzero:developfrom
azagarelz:feat/monitoring-scrape-annotations-toggle

Conversation

@azagarelz

@azagarelz azagarelz commented May 25, 2026

Copy link
Copy Markdown

Why?

When components.monitoring.enabled is true (or auto with CRDs installed), the chart creates ServiceMonitor that instruct the Prometheus Operator on how to scrape CZ metrics. In that configuration, the prometheus.io/* annotations on Services become redundant.
In clusters where both annotation-based discovery and CRD-based discovery are active simultaneously, CZ metrics may be scraped twice, generating unnecessary load or inconsistencies.

Setting scrapeAnnotations: false removes the annotations from Services, eliminating the risk of double scraping while keeping the ServiceMonitor based discovery intact. The default remains true for full backward compatibility.

What

  • Adds components.monitoring.scrapeAnnotations value (default: true)
  • When set to false, remove prometheus.io/* annotations from the agent, aggregator, and webhook Services
  • Updates monitoring-infrastructure.md to document the new flag and the double scraping scenario
  • Adds helm unit tests covering both true and false scenarios

How Tested

Helm unit tests added in charts/cloudzero-agent/tests/defaults_service_test.yaml covering:

  • Default behaviour: prometheus.io/scrape annotation present on all three Services
  • Opt-out: annotation absent on all three Services when scrapeAnnotations: false

@azagarelz azagarelz changed the title Add monitoring.scrapeAnnotations to control whether or not to add prometheus.io/* annotations to Service feat(helm): Add monitoring.scrapeAnnotations to control whether or not to add prometheus.io/* annotations to Service May 25, 2026
@azagarelz azagarelz marked this pull request as ready for review May 25, 2026 22:32
@azagarelz azagarelz requested a review from a team as a code owner May 25, 2026 22:32
…nnotations

When monitoring.enabled is true (or auto with CRDs installed), the chart
creates ServiceMonitor CRDs that instruct the Prometheus Operator to scrape
CloudZero Agent metrics. In that configuration, the prometheus.io/* annotations
on Services become redundant. In clusters where both annotation-based and
CRD-based discovery are active simultaneously, metrics may be scraped twice,
generating unnecessary load on the metrics pipeline.

Adds components.monitoring.scrapeAnnotations (default: true) to address this.
When set to false, prometheus.io/* annotations are omitted from the agent,
aggregator, and webhook Services, eliminating the risk of double scraping while
keeping ServiceMonitor-based discovery intact. Updates monitoring-infrastructure.md
to document the new flag and the double scraping scenario, and adds helm unit
tests covering both true and false states for all three affected Services.

The default value of true preserves full backward compatibility for users
relying on annotation-based Prometheus discovery without the Operator. This
approach is purely additive, users without ServiceMonitors are unaffected,
while Operator users gain a clean way to eliminate redundant scrape targets.
A future consideration would be auto-disabling annotations when monitoring.enabled
is true, though that would be a breaking change requiring a major version bump.
@azagarelz azagarelz force-pushed the feat/monitoring-scrape-annotations-toggle branch from f570231 to 20a7bf7 Compare May 25, 2026 22:37
evan-cz added a commit that referenced this pull request Jun 10, 2026
A customer contributed PR #825 proposing a components.monitoring.scrapeAnnotations
boolean to drop the prometheus.io/* annotations and avoid double-scraping when
ServiceMonitor-based discovery is in use. Rather than add a single-purpose
boolean, this generalizes the chart's monitoring configuration: it decides what
to emit based on whether the Prometheus Operator is present, and exposes the
discovery mechanism as one explicit choice. It supersedes the scrapeAnnotations
flag from PR #825.

Implementation Approach:

The chart emits two kinds of monitoring resources -- Prometheus Operator CRDs
(ServiceMonitor + PrometheusRule) and prometheus.io/* annotations on the agent
Services. Both are now driven by two values resolved in
helm/templates/_helpers.tpl and consumed by the Service, ServiceMonitor, and
PrometheusRule templates.

Functional Requirements:

1. The Prometheus Operator resources must be created when the Operator is present
   and omitted otherwise, without the user having to configure it.

   Added components.monitoring.enabled (null | true | false, default null),
   resolved by cloudzero-agent.monitoring.operatorEnabled: null auto-detects the
   Operator via the monitoring.coreos.com/v1 capability, true forces on, false
   forces off -- the same idiom as integrations.istio.enabled. The four
   servicemonitor-*.yaml templates and prometheusrule.yaml are gated on it.
   Previously null defaulted to disabled while the feature was being validated;
   it now auto-detects, so an Operator cluster that relied on the default begins
   emitting these resources on upgrade.

2. Operators must be able to choose the discovery mechanism, including an
   annotation-only mode for clusters without the Operator.

   Added components.monitoring.discovery.method (auto | serviceMonitors |
   annotations, default auto). auto resolves to serviceMonitors today and is left
   as an enum so a future auto-detectable backend can extend it; annotations is an
   explicit opt-in and is never auto-selected. The prometheus.io/* annotations on
   agent-service.yaml, aggregator-service.yaml, and webhook-service.yaml are gated
   on cloudzero-agent.monitoring.annotationsActive.

3. The ServiceMonitors and the PrometheusRule must form a single Operator bundle;
   annotation-based discovery is discovery-only.

   serviceMonitorsActive and rulesActive resolve together on the serviceMonitors
   path, so choosing annotations yields no ServiceMonitors and no alert rules.

4. The new values must validate and be documented for users.

   Updated helm/values.schema.yaml (enabled: null|boolean; discovery.method enum)
   and regenerated helm/values.schema.json; set defaults and comments in
   helm/values.yaml; rewrote helm/docs/monitoring-infrastructure.md as user-facing
   reference, removing the prior validation-log content.

Validation:

- helm lint passing
- helm unit tests passing; helm/tests/monitoring_integration_test.yaml and
  defaults_service_test.yaml were rewritten for the new model
- helm schema tests passing, including new fixtures under tests/helm/schema for
  the default, serviceMonitors, annotations, none, and invalid-value cases.
  kubeconform validates the rendered ServiceMonitor and PrometheusRule against the
  CRD catalog
- golden manifests regenerated; a default `helm template` (no Operator detected)
  emits no ServiceMonitors, no PrometheusRule, and no prometheus.io/* annotations
- helm subchart tests passing

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
evan-cz added a commit that referenced this pull request Jun 10, 2026
A customer contributed PR #825 proposing a components.monitoring.scrapeAnnotations
boolean to drop the prometheus.io/* annotations and avoid double-scraping when
ServiceMonitor-based discovery is in use. Rather than add a single-purpose
boolean, this generalizes the chart's monitoring configuration into an explicit,
opt-in model and supersedes the scrapeAnnotations flag from PR #825.

Implementation Approach:

The chart emits two kinds of monitoring resources -- Prometheus Operator CRDs
(ServiceMonitor + PrometheusRule) and prometheus.io/* annotations on the agent
Services. Both are driven by two values resolved in helm/templates/_helpers.tpl
and consumed by the Service, ServiceMonitor, and PrometheusRule templates. There
is no cluster auto-detection; the configuration is explicit.

Functional Requirements:

1. Monitoring must be off by default and turned on explicitly.

   Added components.monitoring.enabled as a plain boolean defaulting to false.
   The four servicemonitor-*.yaml templates, prometheusrule.yaml, and the
   prometheus.io/* annotations on the three Service templates are all gated on it
   via helpers in _helpers.tpl. Previously the operator CRDs defaulted to disabled
   while the feature was validated and the prometheus.io/* annotations were always
   emitted; both are now opt-in, so an upgrade emits nothing until enabled, and
   annotation-based scrapers must set discovery.method: annotations.

2. When enabled, the operator must be able to choose the discovery mechanism.

   Added components.monitoring.discovery.method (auto | serviceMonitors |
   annotations, default auto). auto resolves to serviceMonitors today and is left
   as an enum so a future mechanism can extend it; annotations is an explicit
   opt-in. serviceMonitors emits monitoring.coreos.com/v1 CRDs, so the install
   fails if the Prometheus Operator is absent; annotations needs no CRDs.

3. The ServiceMonitors and the PrometheusRule must form a single bundle;
   annotation-based discovery is discovery-only.

   serviceMonitorsActive and rulesActive resolve together on the serviceMonitors
   path, so method=annotations yields no ServiceMonitors and no alert rules.

4. The new values must validate and be documented for users.

   Updated helm/values.schema.yaml (enabled: boolean; discovery.method enum) and
   regenerated helm/values.schema.json; set defaults and comments in
   helm/values.yaml; rewrote helm/docs/monitoring-infrastructure.md as user-facing
   reference.

Validation:

- helm lint passing
- helm unit tests passing; helm/tests/monitoring_integration_test.yaml and
  defaults_service_test.yaml were rewritten for the new model
- helm schema tests passing, including new fixtures under tests/helm/schema for
  the default, serviceMonitors, annotations, none, and invalid-value cases.
  kubeconform validates the rendered ServiceMonitor and PrometheusRule against the
  CRD catalog
- golden manifests regenerated; a default `helm template` emits no ServiceMonitors,
  no PrometheusRule, and no prometheus.io/* annotations
- helm subchart tests passing

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@evan-cz

evan-cz commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Hi @azagarelz. First off, thank you very much for the contribution! You're right, this is definitely an issue... sorry about that.

I've been thinking about this a bit, trying to decide on the best API, and I have some reservations on scrapeAnnotations as a boolean. What do you think of this for the defaults:

monitoring:
  enabled: false
  method: auto

The method is one of auto, serviceMonitors, annotations, and auto would be serviceMonitors if the monitoring.coreos.com/v1 CRD is present, annotations if not. So in your case, just setting components.monitoring.enabled = true should give you service monitors without annotations.

I opened up a new new PR #840 to implement it, largely because we can't actually merge external PRs right now (it's an issue with accessing secrets from people outside the organization), LMK if that sounds good to you, and would resolve your problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants