KEP-5927: Distributed tracing in kube-scheduler #5928
artem-tkachuk wants to merge 8 commits into kubernetes:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: artem-tkachuk. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.

Hi @artem-tkachuk. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 2a5a144 to cbc1d9f
Thanks for the PR, but unfortunately we are not likely to get the bandwidth for this work in the near future.
@sanposhiho Thank you. I am willing to do all the work myself. I'm also working with @dashpole from SIG Instrumentation, who implemented similar tracing in the Kubelet and API Server. In the SIG Scheduling meeting on Dec 11, 2025, @macsko said that the SIG would appreciate it once I do the work. Please advise.
The enhancement freeze has already passed, so we cannot make it into 1.36. And I am not even sure we can prioritize this for 1.37 either, because our top priority is WAS and we have already asked several KEPs to postpone to the next release. But that is debatable.
Likely, we won't have any accurate planning for 1.37 soon. I would suggest just waiting for now and bringing this up again in the SIG meeting once the 1.36 release is complete. At that point, we should roughly know what we will do for 1.37 (based on what we actually accomplished in 1.36) and can properly discuss whether we can fit this into 1.37.
keps/sig-scheduling/5927-distributed-tracing-in-kube-scheduler/README.md
> **The Scheduler as the first adopter of KEP-5915:** The kube-scheduler is a natural first consumer of this pattern. It is the most prominent asynchronous component in the Pod lifecycle — it watches for unscheduled Pods and reconciles them independently of the original API request. Adopting KEP-5915's `ExtractContext` and Span Link pattern here serves as a concrete, high-value proof point for the async trace context propagation standard. Success in the Scheduler paves the way for adoption in other async consumers (Kubelet, kube-controller-manager, custom operators).
> When the API Server handles a Pod creation request, it injects the current trace context (from the HTTP request's `traceparent` header) into the Pod's annotations (e.g., `tracing.k8s.io/traceparent`) using KEP-5915's `InjectContext`.
Suggested change — replace:

> When the API Server handles a Pod creation request, it injects the current trace context (from the HTTP request's `traceparent` header) into the Pod's annotations (e.g., `tracing.k8s.io/traceparent`) using KEP-5915's `InjectContext`.

with:

> Trace context injection does not impact this proposal since the scheduler only extracts trace context. KEP-5915 proposes that when the API Server handles a Pod creation request, it injects the current trace context (from the HTTP request's `traceparent` header) into the Pod's annotations (e.g., `tracing.k8s.io/traceparent`) using KEP-5915's `InjectContext`.
Maybe make it clear that this doesn't impact the design of this KEP (since this is one of the most confusing/complex parts of the other KEP).
Thank you for the suggestion! I've updated the wording in 0905903 to clarify that the API Server injection design doesn't impact this proposal. One nuance: the scheduler doesn't only extract — it also re-injects its own trace context into Pod annotations after binding, so the Kubelet can create a Span Link back to the Scheduler's span (enabling the chain: Kubelet → Scheduler → API Server). The updated text now reflects both directions.
The design of trace context injection in the API Server does not impact this proposal. The scheduler extracts trace context from Pod annotations to create Span Links, and re-injects its own context after binding using KEP-5915's helpers.
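To make the extraction side concrete, here is a minimal, dependency-free sketch of the W3C `traceparent` format that such a Pod annotation would carry. The annotation key and the `ExtractContext`/`InjectContext` helpers belong to KEP-5915; in real scheduler code the OpenTelemetry propagation package would do this parsing, so treat this purely as an illustration of the wire format.

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation key as proposed by KEP-5915 (illustrative; the final key may differ).
const traceparentAnnotation = "tracing.k8s.io/traceparent"

// parseTraceparent splits a W3C traceparent value of the form
// "00-<32 hex trace-id>-<16 hex span-id>-<2 hex flags>" into its fields.
func parseTraceparent(v string) (traceID, spanID, flags string, ok bool) {
	parts := strings.Split(v, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return "", "", "", false
	}
	return parts[1], parts[2], parts[3], true
}

func main() {
	// Annotations as they might appear on a Pod created through the API Server.
	annotations := map[string]string{
		traceparentAnnotation: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
	}
	if v, found := annotations[traceparentAnnotation]; found {
		traceID, spanID, flags, ok := parseTraceparent(v)
		fmt.Println(traceID, spanID, flags, ok)
	}
}
```

The extracted trace ID and span ID are exactly what the scheduler would wrap into a Span Link on its own root span; the flags byte carries the sampled bit discussed later in the sampling section.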
keps/sig-scheduling/5927-distributed-tracing-in-kube-scheduler/README.md
keps/sig-scheduling/5927-distributed-tracing-in-kube-scheduler/README.md
dashpole left a comment:
This is mostly LGTM, but it is probably a bit longer than it needs to be. It's ultimately up to you and the reviewers from sig-scheduling, but I would consider removing a lot of the details about the other KEP. The only impact of the context propagation KEP is a one-line change to the root span (to add a link), so I would focus as much as possible on scheduler instrumentation as a mostly standalone feature.
Force-pushed from 41622c6 to c3072f4
@dashpole Thank you for the review! I've addressed all your feedback. Specifically, I clarified that the API Server's trace context injection design doesn't impact this KEP, but also noted that the scheduler both extracts and re-injects context (for the Kubelet) and added that to the goals and Alpha criteria. I also removed a lot of internal KEP-5915 details to shift the focus toward what the scheduler itself does. Let me know if there's anything else!
@sanposhiho Thank you for the context on timing. To clarify — I wasn't expecting this to land in 1.36. I opened the KEP early to keep the ball rolling and give reviewers time to provide feedback. The KEP is already getting reviews from @dashpole and I'm actively working on the implementation, so I'd like to have everything ready the moment 1.37 planning opens up. I'll bring this up again in the SIG meeting once 1.36 ships. This KEP brings distributed tracing to the last uninstrumented core scheduling component and establishes the first cross-component trace correlation standard via Span Links. I'm committed to driving it to completion — I just need the SIG's green light on scheduling it in.
Proposes adding OpenTelemetry distributed tracing to the kube-scheduler, instrumenting schedulingCycle and bindingCycle with per-phase and per-plugin spans. Reuses existing TracingConfiguration from component-base and creates Span Links to API Server traces via KEP-5915's async trace context propagation pattern. Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Rename all scheduler span attributes from `scheduler.*` to `k8s.scheduler.*` to follow OpenTelemetry semantic convention namespacing, per @dashpole's review. Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Drop "OpenTelemetry-Go SDK has reached GA (already satisfied)" line from Beta criteria, per @dashpole's review. Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Reference experimental OTel-Go SDK trace metrics (open-telemetry/opentelemetry-go#2547) in rollback metrics section. Once stabilized, these can surface span sampling, queue depth, drops, and export latency via Prometheus, per @dashpole's review. Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Clarify that API Server trace context injection design does not impact this KEP. The scheduler both extracts context (to create Span Links) and re-injects its own context after binding (for Kubelet), per @dashpole's review. Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Local relative paths to KEP-5915's README won't resolve on master until that KEP merges. Use the enhancement issue URL instead, which is stable across branches, rebases, and squashes. Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Add re-injection of the Scheduler's trace context into Pod annotations after binding as an explicit goal and Alpha graduation criterion, so downstream consumers (e.g., Kubelet) can link back to the Scheduler's trace via Span Links. Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Remove advocacy ("first adopter"), KEP-5915 internal details, and
duplicative code example from Context Propagation section. Move root
span code example to Implementation section as a new level alongside
phase-level and per-plugin spans, per @dashpole's review.
Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Force-pushed from ab48154 to 2792db1
Review comment on this hunk:

```go
	)
	defer span.End()

	span.SetAttributes(
```
We would ideally provide attributes on span creation, rather than at a later point. From https://pkg.go.dev/go.opentelemetry.io/otel/trace#Span:
// Note that adding attributes at span creation using [WithAttributes] is preferred
// to calling SetAttribute later, as samplers can only consider information
// already present during span creation.
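The reason this matters can be shown with a small self-contained mock (this is not the otel API — `start`, `setAttributes`, and the sampler signature here are simplified stand-ins): the sampler runs exactly once, inside span creation, so attributes added afterwards can never influence a decision that has already been made.

```go
package main

import "fmt"

type attrs map[string]string

type span struct {
	sampled bool
	attrs   attrs
}

// start mirrors tracer.Start: the sampling decision is taken here,
// using only the creation-time attributes.
func start(name string, creation attrs, sample func(string, attrs) bool) *span {
	return &span{sampled: sample(name, creation), attrs: creation}
}

// setAttributes mirrors Span.SetAttributes: it enriches the span but
// is invisible to the already-finished sampling decision.
func (s *span) setAttributes(a attrs) {
	if s.attrs == nil {
		s.attrs = attrs{}
	}
	for k, v := range a {
		s.attrs[k] = v
	}
}

func main() {
	// A sampler that keeps only spans carrying a hypothetical
	// k8s.scheduler.result attribute at creation time.
	byResult := func(_ string, at attrs) bool { _, ok := at["k8s.scheduler.result"]; return ok }

	early := start("SchedulePod", attrs{"k8s.scheduler.result": "Success"}, byResult)
	late := start("SchedulePod", nil, byResult)
	late.setAttributes(attrs{"k8s.scheduler.result": "Success"}) // too late to affect sampling

	fmt.Println(early.sampled, late.sampled)
}
```

In the real code this translates to passing `trace.WithAttributes(...)` to `tracer.Start` for every attribute that is known up front, and reserving `SetAttributes` for values that only become available mid-cycle.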
> * **Plugin Visibility:** Generate child spans for individual plugin execution (e.g., `RunFilterPlugins`, `RunScorePlugins`) to expose per-plugin latency. Ensure the trace context is available to custom plugins (out-of-tree), allowing platform engineers to trace their own proprietary scheduling logic.
> * **Scheduling Queue Observability:** Measure scheduling queue latency — the duration a Pod remains Pending in the active queue before the scheduling cycle begins.
> * **Consistency:** Reuse the existing `TracingConfiguration` API and `component-base` tracing libraries established by KEP-647 (API Server Tracing) and KEP-2831 (Kubelet Tracing).
> * **Context Propagation:** Link Scheduler traces back to the original Pod creation trace by extracting W3C Trace Context from Pod annotations (as detailed in [KEP-5915](https://github.com/kubernetes/enhancements/issues/5915)), and re-inject the Scheduler's own trace context after binding so downstream consumers (e.g., Kubelet) can link back to the Scheduler's trace, enabling full end-to-end lifecycle tracing.
How is re-injection done, specifically? Does this add another API call to update the annotation? That might be a blocker for this proposal if it doubles the QPS of the scheduler to the apiserver.
> - Propagate context in [passthrough](https://github.com/open-telemetry/opentelemetry-go-contrib/tree/main/examples/passthrough) mode
> - When the feature gate is **enabled**, and a `TracingConfiguration` with sampling rate 0 (the default) is provided, the scheduler will:
>   - Initiate an OTLP connection
>   - Not record or export spans for its own root spans (e.g., `SchedulePod`). Note: because the scheduler creates new root spans via Span Links (not child spans of the API Server trace), the `ParentBasedSampler` treats them as roots and applies the configured rate (0 = no sampling). The sampled flag from the linked API Server trace does not influence this decision, since Span Links do not establish a parent-child relationship.
> The sampled flag from the linked API Server trace does not influence this decision, since Span Links do not establish a parent-child relationship.
Do you think we should implement a custom sampler to make sampling decisions based on the span links? That would allow operators to get "complete" traces of linked spans, rather than independently-sampled spans. https://github.com/open-telemetry/opentelemetry-go/blob/f1f16bcb620285c31ad69a2c669c99ce85934797/sdk/trace/sampling.go#L39
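The decision logic such a link-aware sampler would apply can be sketched with simplified types (a real implementation would satisfy the `sdktrace.Sampler` interface and inspect the links carried in its `SamplingParameters`; `link` and `shouldSample` here are illustrative names):

```go
package main

import "fmt"

// link models the one piece of a Span Link this sampler cares about:
// whether the linked (e.g. API Server) trace was itself sampled.
type link struct{ sampled bool }

// shouldSample records the scheduler span whenever any linked span was
// sampled, so linked traces stay "complete"; otherwise it falls back to
// an independent head-based decision (e.g. the configured ratio).
func shouldSample(links []link, fallback bool) bool {
	for _, l := range links {
		if l.sampled {
			return true
		}
	}
	return fallback
}

func main() {
	fmt.Println(shouldSample([]link{{sampled: true}}, false))  // follow the linked trace
	fmt.Println(shouldSample([]link{{sampled: false}}, false)) // independent decision
}
```

One trade-off to weigh: honoring the linked trace's sampled flag effectively lets the API Server's sampling rate drive scheduler span volume, which operators would need to account for when tuning the two components' configurations.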
> ###### Will enabling / using this feature result in any new API calls?
>
> No. The instrumentation adds spans to existing scheduling operations. No new API calls are made.
IIUC we need new calls to the kube-apiserver to update the trace context annotation?
Proposes adding OpenTelemetry distributed tracing to kube-scheduler, instrumenting `schedulingCycle` and `bindingCycle` with per-phase and per-plugin spans. Reuses the existing `TracingConfiguration` from `component-base` and creates Span Links to API Server traces via KEP-5915's (#5915) async trace context propagation pattern.
One-line PR description: Initial KEP proposal for distributed tracing in kube-scheduler
Issue link: #5927
Other comments: Completes the tracing story across core control-plane components alongside KEP-647 (#647, API Server, Stable) and KEP-2831 (#2831, Kubelet, Stable). Depends on KEP-5915 (#5915) for async trace context propagation via annotations.
Related Kubernetes issue: kubernetes/kubernetes#133819
SIG Scheduling meeting notes item