
Add rate‑limited log warning when BatchSpanProcessor drops spans#8167

Open
VamshikrishnaMonagari wants to merge 2 commits into open-telemetry:main from VamshikrishnaMonagari:alert-through-logs-when-BatchSpanProcessor-drops-span

Conversation

@VamshikrishnaMonagari

Summary

When BatchSpanProcessor's internal queue fills and spans are dropped, the SDK previously only emitted internal metrics (which can be missed if no metrics pipeline is configured). This PR adds a rate-limited WARNING log so users can detect span drops via logs.

Key changes

  • Add ThrottlingLogger to BatchSpanProcessor
  • Log a warning when a span is dropped due to the queue being full
  • Add a unit test ensuring the warning is emitted under queue-full conditions

Fixes

Fixes #7103

@VamshikrishnaMonagari VamshikrishnaMonagari requested a review from a team as a code owner March 9, 2026 03:09
    if (!queue.offer(span)) {
      spanProcessorInstrumentation.dropSpans(1);
      throttledLogger.log(
          Level.WARNING,
          ...
Contributor

Isn't warning too high level? I think INFO would be enough.

Good point. I think WARNING is appropriate here because dropped spans represent actual data loss, which impacts observability. This warrants immediate attention from operators in production. Since we're using ThrottlingLogger, the output is already rate-limited to prevent log spam.

However, if the team prefers INFO, I'm happy to adjust. Please let me know.

        ComponentId.generateLazy("batching_span_processor");

    private static final Logger logger = Logger.getLogger(BatchSpanProcessor.class.getName());
    private static final ThrottlingLogger throttledLogger = new ThrottlingLogger(logger);
Contributor

throttlingLogger would be a better name.

Good catch! I've updated the variable name from throttledLogger to throttlingLogger to better reflect that this logger performs throttling (matching the active form of the class name ThrottlingLogger).

Member

@jack-berg jack-berg left a comment

Given that we already have internal metrics that would allow a user to detect / react to this condition, is it necessary to duplicate with a log signal?

More generally, do we want to duplicate signals? If so, when?

We already do some duplication of signals. For example, we log exception when an export fails and also record this to internal telemetry. What's the principle here?

cc @jkwatson

@jkwatson
Contributor

> Given that we already have internal metrics that would allow a user to detect / react to this condition, is it necessary to duplicate with a log signal?
>
> More generally, do we want to duplicate signals? If so, when?
>
> We already do some duplication of signals. For example, we log exception when an export fails and also record this to internal telemetry. What's the principle here?
>
> cc @jkwatson

I think we've been assuming that everyone will at least have metrics in their pipeline, but perhaps that's a bad assumption. A throttled log for this seems non-terrible to me, but I don't really have a strong opinion one way or another. Perhaps a simpler way would be to just log if we dropped spans at the end of the export (or do we already do that?)

@VamshikrishnaMonagari
Author

> Given that we already have internal metrics that would allow a user to detect / react to this condition, is it necessary to duplicate with a log signal?
> More generally, do we want to duplicate signals? If so, when?
> We already do some duplication of signals. For example, we log exception when an export fails and also record this to internal telemetry. What's the principle here?
> cc @jkwatson

> I think we've been assuming that everyone will at least have metrics in their pipeline, but perhaps that's a bad assumption. A throttled log for this seems non-terrible to me, but I don't really have a strong opinion one way or another. Perhaps a simpler way would be to just log if we dropped spans at the end of the export (or do we already do that?)

Thanks for the feedback, @jack-berg and @jkwatson!

I'd argue the log isn't really a duplication; it serves a different audience. The internal metrics via dropSpans() are valuable when a MeterProvider is configured, but many users (especially those only using the SDK for tracing, or those early in their OpenTelemetry adoption) won't have a metrics pipeline set up. For them, spans get silently dropped with no indication that anything is wrong. I believe a throttled log provides immediate feedback right in the application output.

This also follows established patterns within the SDK: ThrottlingLogger is already widely used (in SdkDoubleCounter, AsynchronousMetricStorage, DefaultSynchronousMetricStorage, etc.) for exactly this kind of diagnostic signal.

Regarding the suggestion to log at the end of the export: the drop happens at addSpan() time when the queue is at capacity, so that's the most accurate and actionable place to surface it. By the time the export runs, the dropped spans are already gone and the exporter has no visibility into them. The existing export-failure logs cover a different failure mode (exporter errors), while this covers queue overflow; having both gives complete coverage.

The use of ThrottlingLogger ensures this won't cause log spam even under sustained high-throughput dropping.
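To illustrate the rate-limiting behavior being discussed, here is a minimal, self-contained sketch of a throttled logger. This is not the SDK's actual ThrottlingLogger (which lives in io.opentelemetry.sdk.internal); the class name, the one-minute window, and the boolean return value are illustrative assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Level;
import java.util.logging.Logger;

// Simplified stand-in for the SDK-internal ThrottlingLogger: it suppresses
// repeated messages so a sustained drop condition cannot flood the logs.
class RateLimitedLogger {
  private static final long WINDOW_NANOS = 60_000_000_000L; // one minute (assumed window)

  private final Logger delegate;
  private final AtomicLong lastLogNanos = new AtomicLong(Long.MIN_VALUE);

  RateLimitedLogger(Logger delegate) {
    this.delegate = delegate;
  }

  /** Logs at most once per window; returns true if the message was actually emitted. */
  boolean log(Level level, String message) {
    long now = System.nanoTime();
    long last = lastLogNanos.get();
    boolean windowElapsed = last == Long.MIN_VALUE || now - last >= WINDOW_NANOS;
    if (windowElapsed && lastLogNanos.compareAndSet(last, now)) {
      delegate.log(level, message);
      return true;
    }
    return false; // suppressed: prevents log spam under sustained dropping
  }
}

public class Main {
  public static void main(String[] args) {
    RateLimitedLogger throttled =
        new RateLimitedLogger(Logger.getLogger("BatchSpanProcessor"));
    System.out.println(throttled.log(Level.WARNING, "Span dropped: queue full")); // true
    System.out.println(throttled.log(Level.WARNING, "Span dropped: queue full")); // false
  }
}
```

The key property for this PR is the second call: within the window, repeated drops produce no additional log lines, which is what makes a WARNING level tolerable even under high-throughput dropping.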

@jkwatson
Contributor

I think my (and perhaps Jack's) point was not that the ThrottlingLogger wasn't a good option, but that it might be more important to know how many spans got dropped, which this approach will not let you know, since the logs will only be emitted for the first few. It would let you know that spans were being dropped, but not how bad the problem is. If a solution could be put together to log the total number dropped per batch (which would certainly require more work!), that would probably be a better solution.

@VamshikrishnaMonagari
Author

> I think my (and perhaps Jack's) point was not that the ThrottlingLogger wasn't a good option, but that it might be more important to know how many spans got dropped, which this approach will not let you know, since the logs will only be emitted for the first few. It would let you know that spans were being dropped, but not how bad the problem is. If a solution could be put together to log the total number dropped per batch (which would certainly require more work!), that would probably be a better solution.

That's a great point, @jkwatson. I see what you mean now.

I think the per-batch drop count approach you're suggesting is a better solution. The implementation would be fairly straightforward: replace the ThrottlingLogger call with an AtomicInteger counter that gets incremented in addSpan() on each drop, and then read and reset it in exportCurrentBatch() to log the total. Something like:

"BatchSpanProcessor dropped 47 spans since the last export because the queue is full (maxQueueSize=2048)"

This is naturally throttled by the export cycle, gives visibility into how severe the backpressure is, and the threading model already supports it cleanly since addSpan() runs on caller threads and exportCurrentBatch() runs on the single worker thread.

I'll update the PR with this approach. Let me know if that aligns with what you had in mind.
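A hypothetical sketch of the per-batch counter approach described above. The class and method names (DropCountingProcessor, exportCurrentBatch) are illustrative stand-ins, not the actual SDK code; the point is the count-on-drop, read-and-reset-on-export shape.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Logger;

// Per-batch drop counter: addSpan() only counts drops, and the single
// worker thread logs one aggregate line per export cycle.
class DropCountingProcessor {
  private static final Logger logger = Logger.getLogger("BatchSpanProcessor");

  private final BlockingQueue<Object> queue;
  private final int maxQueueSize;
  private final AtomicInteger droppedSinceLastExport = new AtomicInteger();

  DropCountingProcessor(int maxQueueSize) {
    this.maxQueueSize = maxQueueSize;
    this.queue = new ArrayBlockingQueue<>(maxQueueSize);
  }

  // Runs on application threads: count the drop instead of logging each one.
  void addSpan(Object span) {
    if (!queue.offer(span)) {
      droppedSinceLastExport.incrementAndGet();
    }
  }

  // Runs on the single worker thread: read-and-reset, then log the total.
  int exportCurrentBatch() {
    int dropped = droppedSinceLastExport.getAndSet(0);
    if (dropped > 0) {
      logger.warning(
          "BatchSpanProcessor dropped " + dropped
              + " spans since the last export because the queue is full (maxQueueSize="
              + maxQueueSize + ")");
    }
    queue.clear(); // stand-in for handing the batch to the exporter
    return dropped;
  }
}

public class Main {
  public static void main(String[] args) {
    DropCountingProcessor processor = new DropCountingProcessor(2);
    processor.addSpan("a");
    processor.addSpan("b");
    processor.addSpan("c"); // queue full: counted, not logged
    processor.addSpan("d"); // counted
    System.out.println(processor.exportCurrentBatch()); // 2
  }
}
```

This is naturally throttled by the export interval and reports severity, addressing the concern that a per-drop throttled log only reveals that drops happened, not how many.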

Development

Successfully merging this pull request may close these issues.

Alert through logs when BatchSpanProcessor drops span

4 participants