perf(datadog): reuse one UDP socket per batch and coalesce metrics #13653

Open
shreemaan-abhishek wants to merge 2 commits into apache:master from shreemaan-abhishek:perf/datadog-batch-udp

Conversation

@shreemaan-abhishek

Contributor

Description

The datadog plugin opened a fresh UDP socket for every entry in a flushed batch and sent 6 separate datagrams per entry (request.counter, request.latency, upstream.latency, apisix.latency, ingress.size, egress.size). For a batch of N entries this cost N socket setups and 6N sendto syscalls in the batch-processor flush.

This PR:

  • reuses a single UDP socket across the whole batch instead of one per entry, and
  • coalesces each entry's metrics into one newline-delimited DogStatsD datagram.

Per-batch work drops from N sockets + 6N sends to 1 socket + N sends. Newline-delimited multi-metric packets are standard DogStatsD and are parsed by the Datadog agent, so this is wire-compatible with existing receivers. The plugin schema and the metrics emitted are unchanged.
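The coalesced framing described above can be sketched in plain Lua. This is a minimal illustration, not the plugin's actual code: `format_metric` and `build_payload` are hypothetical helpers, the metric types (`c`, `h`) and the `route_name:demo` tag are assumptions chosen for the example, and only three of the metrics are shown. It relies only on the DogStatsD line format `name:value|type|#tag1,tag2`.

```lua
-- Hypothetical helper: format one DogStatsD line, "name:value|type|#tags".
local function format_metric(name, value, metric_type, tags)
    local line = name .. ":" .. tostring(value) .. "|" .. metric_type
    if tags and #tags > 0 then
        line = line .. "|#" .. table.concat(tags, ",")
    end
    return line
end

-- Hypothetical helper: join one entry's metrics into a single
-- newline-delimited payload, so the entry costs one send instead of several.
local function build_payload(entry, tags)
    local lines = {}
    lines[#lines + 1] = format_metric("request.counter", 1, "c", tags)
    lines[#lines + 1] = format_metric("request.latency", entry.latency, "h", tags)
    -- assumed optional: skipped when the entry carries no upstream latency
    if entry.upstream_latency then
        lines[#lines + 1] = format_metric("upstream.latency",
                                          entry.upstream_latency, "h", tags)
    end
    return table.concat(lines, "\n")
end

local payload = build_payload({ latency = 12, upstream_latency = 9 },
                              { "route_name:demo" })
```

With this shape, a batch of N entries produces N payload strings, each handed to a single `send` on the shared socket.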

The test mock (t/lib/mock_layer4.lua dogstatsd()) now splits received datagrams on newlines, as a real DogStatsD server does, so the existing per-metric assertions in t/plugin/datadog.t continue to hold without changes.
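For illustration, splitting a received datagram into per-metric lines might look like the sketch below. `split_datagram` is a hypothetical name and this is not the mock's actual code; it only shows the newline-splitting idea in plain Lua.

```lua
-- Hypothetical helper: split one received datagram into metric lines.
local function split_datagram(data)
    local metrics = {}
    -- "[^\n]+" matches each run of non-newline bytes, so empty segments
    -- (for example from a trailing newline) are skipped.
    for line in data:gmatch("[^\n]+") do
        metrics[#metrics + 1] = line
    end
    return metrics
end

local metrics = split_datagram("request.counter:1|c\nrequest.latency:12|h\n")
```

Each element of `metrics` can then be logged or asserted on individually, which is why per-metric test expectations are unaffected by the framing change.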

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change (existing t/plugin/datadog.t assertions cover the emitted metrics; mock updated to reflect coalesced framing)
  • I have updated the documentation accordingly (no user-facing behavior/config change; docs already describe batch behavior)
  • I have verified linting passes

The datadog plugin opened a fresh UDP socket for every entry in a
flushed batch and sent 6 separate datagrams per entry (request.counter,
request.latency, upstream.latency, apisix.latency, ingress.size,
egress.size). For a batch of N entries this cost N socket setups and
6N sendto syscalls.

Reuse a single socket across the whole batch and coalesce each entry's
metrics into one newline-delimited DogStatsD datagram, cutting the
per-batch work to 1 socket setup and N sends. Newline-delimited
multi-metric packets are standard DogStatsD and parsed by the agent.

The test mock now splits received datagrams on newlines, as a real
DogStatsD server does, so existing metric assertions still hold.

Signed-off-by: Abhishek Choudhary <shreemaan.abhishek@gmail.com>
@dosubot dosubot Bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and performance (generate flamegraph for the current PR) labels Jul 3, 2026

Copilot AI left a comment

Pull request overview

This PR improves the performance characteristics of the datadog plugin’s batch flush path by reusing a single UDP socket for the whole batch and by coalescing each entry’s metrics into one newline-delimited DogStatsD datagram (with the test mock updated to parse newline-delimited datagrams like a real DogStatsD server).

Changes:

  • Reuse one UDP socket per batch flush rather than creating/closing a socket per entry.
  • Coalesce per-entry metrics into a single newline-delimited DogStatsD payload (1 datagram per entry).
  • Update the layer4 DogStatsD mock to split incoming datagrams on newlines for existing per-metric assertions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| t/lib/mock_layer4.lua | Logs each newline-delimited metric within a received DogStatsD datagram to preserve existing per-metric test assertions. |
| apisix/plugins/datadog.lua | Builds multi-metric DogStatsD payloads per entry and reuses a single UDP socket across the batch flush. |

Comment on lines 198 to 205

     for i = 1, #entries do
-        local ok, err = send_metric_over_udp(entries[i], metadata)
-        if not ok then
-            return false, err, i
+        -- coalesce the per-request metrics into one datagram
+        local send_ok, send_err = sock:send(build_metrics(entries[i], metadata))
+        if not send_ok then
+            sock:close()
+            return false, "failed to send metrics to dogstatsd server: host[" .. host
+                          .. "] port[" .. tostring(port) .. "] err: " .. send_err, i
         end

Comment on lines +191 to +196

+    local sock = udp()
+    local ok, err = sock:setpeername(host, port)
+    if not ok then
+        return false, "failed to connect to UDP server: host[" .. host
+                      .. "] port[" .. tostring(port) .. "] err: " .. err
+    end

Collapse the send loop to a single exit so there is one checked
sock:close() instead of an unchecked close on the failure path plus a
checked one at the end, matching the existing udp-logger idiom.

Signed-off-by: Abhishek Choudhary <shreemaan.abhishek@gmail.com>
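
One possible shape for a single-exit send loop with one checked close is sketched below. It is an assumption-laden illustration rather than the committed code: `send_all` is a hypothetical name, and `sock` is any object whose `send` and `close` methods return `ok, err`. A stub socket stands in for the real UDP cosocket so the sketch runs in plain Lua.

```lua
-- Hypothetical sketch: send every payload, stop at the first failure,
-- and close the socket exactly once on every path.
local function send_all(sock, payloads)
    local failed_err, failed_idx
    for i = 1, #payloads do
        local ok, err = sock:send(payloads[i])
        if not ok then
            failed_err, failed_idx = err, i
            break
        end
    end

    -- single close, with its result checked
    local closed_ok, close_err = sock:close()
    if failed_err then
        return false, "failed to send metrics: " .. failed_err, failed_idx
    end
    if not closed_ok then
        return false, "failed to close the UDP connection: " .. close_err
    end
    return true
end

-- Stub socket for illustration only: the second send fails.
local sent, closed = 0, 0
local stub = {
    send = function(self, payload)
        sent = sent + 1
        if sent == 2 then
            return nil, "boom"
        end
        return true
    end,
    close = function(self)
        closed = closed + 1
        return true
    end,
}

local ok, err, idx = send_all(stub, { "a", "b", "c" })
```

Returning the index of the failed entry mirrors the third return value in the original loop, which lets the caller know where the batch stopped.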

@membphis membphis left a comment

Member

[P2] Coalesced DogStatsD datagram has no size guard

This PR changes the Datadog plugin from sending each metric in its own UDP datagram to sending all metrics for one entry as a newline-delimited DogStatsD datagram. DogStatsD supports newline-delimited packets, but the implementation does not enforce any packet size limit or split oversized payloads.

APISIX allows long tag sources: constant_tags has no maxItems limit, route/service/consumer names can be long, and include_path can add a long route path. With those inputs, a single metric line may still be acceptable while the coalesced 5/6-line datagram exceeds the DogStatsD Agent buffer or practical UDP packet size limits, causing silent metric loss.

Please add packet-size-aware splitting or fall back to smaller datagrams when the coalesced payload is too large, and add a test covering long tags/path.
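
A minimal sketch of the packet-size-aware splitting requested above, in plain Lua. `pack_lines` and `MAX_DATAGRAM_BYTES` are hypothetical names, and 8192 bytes is an assumed limit used for illustration; the effective limit depends on the agent's configured buffer size and the network path.

```lua
-- Assumed limit for illustration; the real value is deployment-specific.
local MAX_DATAGRAM_BYTES = 8192

-- Hypothetical helper: greedily pack metric lines into datagrams that each
-- stay within max_bytes. A single line longer than max_bytes is still
-- emitted on its own, since a metric line cannot be split further.
local function pack_lines(lines, max_bytes)
    local datagrams = {}
    local current, current_len = {}, 0
    for _, line in ipairs(lines) do
        -- +1 accounts for the "\n" separator when the datagram is non-empty
        local extra = #line + (#current > 0 and 1 or 0)
        if #current > 0 and current_len + extra > max_bytes then
            -- flush the datagram built so far and start a new one
            datagrams[#datagrams + 1] = table.concat(current, "\n")
            current, current_len = { line }, #line
        else
            current[#current + 1] = line
            current_len = current_len + extra
        end
    end
    if #current > 0 then
        datagrams[#datagrams + 1] = table.concat(current, "\n")
    end
    return datagrams
end

-- Tiny limit to make the split visible: "aaaa\nbbbb" is exactly 9 bytes.
local packed = pack_lines({ "aaaa", "bbbb", "cc" }, 9)
```

In the common case all of an entry's lines fit in one datagram, so the coalescing benefit is kept; only entries with unusually long tags or paths fall back to several smaller sends.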

Labels

performance (generate flamegraph for the current PR)
size:L (This PR changes 100-499 lines, ignoring generated files.)

3 participants