
KEP-5927: Distributed tracing in kube-scheduler #5928

Open
artem-tkachuk wants to merge 8 commits into kubernetes:master from
artem-tkachuk:KEP-5927-distributed-tracing-in-kube-scheduler

Conversation


@artem-tkachuk artem-tkachuk commented Feb 20, 2026

Proposes adding OpenTelemetry distributed tracing to kube-scheduler, instrumenting schedulingCycle and bindingCycle with per-phase and per-plugin spans. Reuses the existing TracingConfiguration from component-base and creates Span Links to API Server traces via KEP-5915's (#5915) async trace context propagation pattern.

One-line PR description: Initial KEP proposal for distributed tracing in kube-scheduler

Issue link: #5927

Other comments: Completes the tracing story across core control-plane components alongside KEP-647 (#647) (API Server, Stable) and KEP-2831 (#2831) (Kubelet, Stable). Depends on KEP-5915 (#5915) for async trace context propagation via annotations.

Related Kubernetes issue: kubernetes/kubernetes#133819

SIG Scheduling meeting notes item

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: artem-tkachuk
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 20, 2026
@k8s-ci-robot
Contributor

Hi @artem-tkachuk. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Feb 20, 2026
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Feb 20, 2026
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 20, 2026
@artem-tkachuk
Author

artem-tkachuk commented Feb 20, 2026

cc @dashpole -- second one up! I'm conscious that changes to #5915 might change what's written in this KEP about context propagation. I will be making necessary changes in this KEP as I work through the feedback in the context propagation one.

@sanposhiho
Member

sanposhiho commented Feb 20, 2026

Thanks for the PR, but unfortunately we are not likely to get the bandwidth for this work in the near future.

@artem-tkachuk
Author

@sanposhiho Thank you. I am willing to do all the work myself. I'm also working with @dashpole from SIG instrumentation who implemented similar tracing in Kubelet and API server.

In the SIG scheduling meeting on Dec 11, 2025 @macsko said that as soon as I do the work, the SIG would appreciate this. Please advise.

@sanposhiho
Member

The enhancement freeze has already passed, so we cannot make it into 1.36. And I am not even sure we can prioritize this for 1.37 either, because our top priority is WAS and we have already asked several KEPs to postpone to the next release. But that is debatable.

@sanposhiho
Member

sanposhiho commented Feb 22, 2026

We likely won't have accurate planning for 1.37 for a while. I would suggest just waiting for now and bringing this up again in the SIG meeting once the 1.36 release is complete. At that point we should roughly know what we will do for 1.37 (based on what we actually accomplished in 1.36), and we can properly discuss whether we can fit this into 1.37 or not.


**The Scheduler as the first adopter of KEP-5915:** The kube-scheduler is a natural first consumer of this pattern. It is the most prominent asynchronous component in the Pod lifecycle — it watches for unscheduled Pods and reconciles them independently of the original API request. Adopting KEP-5915's `ExtractContext` and Span Link pattern here serves as a concrete, high-value proof point for the async trace context propagation standard. Success in the Scheduler paves the way for adoption in other async consumers (Kubelet, kube-controller-manager, custom operators).

When the API Server handles a Pod creation request, it injects the current trace context (from the HTTP request's `traceparent` header) into the Pod's annotations (e.g., `tracing.k8s.io/traceparent`) using KEP-5915's `InjectContext`.
Contributor

Suggested change
When the API Server handles a Pod creation request, it injects the current trace context (from the HTTP request's `traceparent` header) into the Pod's annotations (e.g., `tracing.k8s.io/traceparent`) using KEP-5915's `InjectContext`.
Trace context injection does not impact this proposal since the scheduler only extracts trace context. KEP-5915 proposes that when the API Server handles a Pod creation request, it injects the current trace context (from the HTTP request's `traceparent` header) into the Pod's annotations (e.g., `tracing.k8s.io/traceparent`) using KEP-5915's `InjectContext`.

Maybe make it clear that this doesn't impact the design of this KEP (since this is one of the most confusing/complex parts of the other KEP).

Author

@artem-tkachuk artem-tkachuk Mar 1, 2026

Maybe make it clear that this doesn't impact the design of this KEP (since this is one of the most confusing/complex parts of the other KEP).

Thank you for the suggestion! I've updated the wording in 0905903 to clarify that the API Server injection design doesn't impact this proposal. One nuance: the scheduler doesn't only extract — it also re-injects its own trace context into Pod annotations after binding, so the Kubelet can create a Span Link back to the Scheduler's span (enabling the chain: Kubelet → Scheduler → API Server). The updated text now reflects both directions.

The design of trace context injection in the API Server does not impact this proposal. The scheduler extracts trace context from Pod annotations to create Span Links, and re-injects its own context after binding using KEP-5915's helpers.
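Both directions described above ride on the W3C Trace Context `traceparent` format carried in a Pod annotation. As a rough illustration of what extraction involves — this is not KEP-5915's actual `ExtractContext` helper; the `TraceRef` type, function name, and example IDs below are hypothetical, and the real helper would return an OpenTelemetry `SpanContext`:

```go
package main

import (
	"fmt"
	"regexp"
)

// TraceRef holds the IDs recovered from a W3C traceparent value.
// The type is illustrative; KEP-5915's real helper works in terms of
// an OpenTelemetry SpanContext instead.
type TraceRef struct {
	TraceID, SpanID string
	Sampled         bool
}

// traceparentRe matches the W3C Trace Context "traceparent" format:
// version "00", a 32-hex trace-id, a 16-hex parent-id, 2-hex flags.
var traceparentRe = regexp.MustCompile(
	`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// parseTraceparent extracts a TraceRef from an annotation value such as
// the one stored under tracing.k8s.io/traceparent. It reports false
// when the value is malformed. Note: checking flags == "01" is a
// simplification; the flags field is a bitfield in the spec.
func parseTraceparent(v string) (TraceRef, bool) {
	m := traceparentRe.FindStringSubmatch(v)
	if m == nil {
		return TraceRef{}, false
	}
	return TraceRef{TraceID: m[1], SpanID: m[2], Sampled: m[3] == "01"}, true
}

func main() {
	ref, ok := parseTraceparent(
		"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(ok, ref.TraceID, ref.Sampled)
}
```

Re-injection after binding is the reverse operation: serializing the scheduler's own span context back into the same annotation format so the Kubelet can link to it.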

Contributor

@dashpole dashpole left a comment

This is mostly LGTM, but it is probably a bit longer than it needs to be. It's ultimately up to you and the reviewers from sig-scheduling, but I would consider removing a lot of the detail about the other KEP. The only impact of the context propagation KEP is a one-line change to the root span (to add a link), so I would focus as much as possible on scheduler instrumentation as a mostly standalone feature.

@artem-tkachuk artem-tkachuk force-pushed the KEP-5927-distributed-tracing-in-kube-scheduler branch 3 times, most recently from 41622c6 to c3072f4 Compare March 1, 2026 00:44
@artem-tkachuk
Author

This is mostly LGTM, but it is probably a bit longer than it needs to be. It's ultimately up to you and the reviewers from sig-scheduling, but I would consider removing a lot of the detail about the other KEP. The only impact of the context propagation KEP is a one-line change to the root span (to add a link), so I would focus as much as possible on scheduler instrumentation as a mostly standalone feature.

@dashpole Thank you for the review! I've addressed all your feedback. Specifically, I clarified that the API Server trace context injection design doesn't impact this KEP, but also noted that the scheduler both extracts and re-injects context (for the Kubelet) and added that to the goals and alpha criteria. I also removed a lot of internal KEP-5915 details to shift the focus toward what the scheduler itself does. Let me know if there's anything else!

@artem-tkachuk
Author

We likely won't have accurate planning for 1.37 for a while. I would suggest just waiting for now and bringing this up again in the SIG meeting once the 1.36 release is complete. At that point we should roughly know what we will do for 1.37 (based on what we actually accomplished in 1.36), and we can properly discuss whether we can fit this into 1.37 or not.

@sanposhiho Thank you for the context on timing. To clarify — I wasn't expecting this to land in 1.36. I opened the KEP early to keep the ball rolling and give reviewers time to provide feedback.

The KEP is already getting reviews from @dashpole and I'm actively working on the implementation, so I'd like to have everything ready to go the moment 1.37 planning opens up. I'll bring this up again in the SIG meeting once 1.36 ships.

This brings distributed tracing to the last uninstrumented core scheduling component and establishes the first cross-component trace correlation standard via Span Links. I'm committed to driving it to completion — I just need the SIG's green light on scheduling it in.

@artem-tkachuk artem-tkachuk requested a review from dashpole March 1, 2026 08:29
Proposes adding OpenTelemetry distributed tracing to the kube-scheduler,
instrumenting schedulingCycle and bindingCycle with per-phase and
per-plugin spans. Reuses existing TracingConfiguration from
component-base and creates Span Links to API Server traces via
KEP-5915's async trace context propagation pattern.

Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Rename all scheduler span attributes from `scheduler.*` to
`k8s.scheduler.*` to follow OpenTelemetry semantic convention
namespacing, per @dashpole's review.

Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Drop "OpenTelemetry-Go SDK has reached GA (already satisfied)" line
from Beta criteria, per @dashpole's review.

Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Reference experimental OTel-Go SDK trace metrics
(open-telemetry/opentelemetry-go#2547) in rollback metrics section.
Once stabilized, these can surface span sampling, queue depth, drops,
and export latency via Prometheus, per @dashpole's review.

Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Clarify that API Server trace context injection design does not impact
this KEP. The scheduler both extracts context (to create Span Links)
and re-injects its own context after binding (for Kubelet), per @dashpole's review.

Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Local relative paths to KEP-5915's README won't resolve on master
until that KEP merges. Use the enhancement issue URL instead, which
is stable across branches, rebases, and squashes.

Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Add re-injection of the Scheduler's trace context into Pod annotations
after binding as an explicit goal and Alpha graduation criterion, so
downstream consumers (e.g., Kubelet) can link back to the Scheduler's
trace via Span Links.

Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Remove advocacy ("first adopter"), KEP-5915 internal details, and
duplicative code example from Context Propagation section. Move root
span code example to Implementation section as a new level alongside
phase-level and per-plugin spans, per @dashpole's review.

Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
@artem-tkachuk artem-tkachuk force-pushed the KEP-5927-distributed-tracing-in-kube-scheduler branch from ab48154 to 2792db1 Compare March 1, 2026 08:40
```go
)
defer span.End()

span.SetAttributes(
```
Contributor

We would ideally provide attributes on span creation, rather than at a later point. From https://pkg.go.dev/go.opentelemetry.io/otel/trace#Span:

// Note that adding attributes at span creation using [WithAttributes] is preferred
// to calling SetAttribute later, as samplers can only consider information
// already present during span creation.

* **Plugin Visibility:** Generate child spans for individual plugin execution (e.g., `RunFilterPlugins`, `RunScorePlugins`) to expose per-plugin latency. Ensure the trace context is available to custom plugins (out-of-tree), allowing platform engineers to trace their own proprietary scheduling logic.
* **Scheduling Queue Observability:** Measure scheduling queue latency — the duration a Pod remains Pending in the active queue before the scheduling cycle begins.
* **Consistency:** Reuse the existing `TracingConfiguration` API and `component-base` tracing libraries established by KEP-647 (API Server Tracing) and KEP-2831 (Kubelet Tracing).
* **Context Propagation:** Link Scheduler traces back to the original Pod creation trace by extracting W3C Trace Context from Pod annotations (as detailed in [KEP-5915](https://github.com/kubernetes/enhancements/issues/5915)), and re-injecting the Scheduler's own trace context after binding so downstream consumers (e.g., Kubelet) can link back to the Scheduler's trace, enabling full end-to-end lifecycle tracing.
Contributor

How is re-injection done, specifically? Does this add another API call to update the annotation? That might be a blocker for this proposal if it doubles the QPS of the scheduler to the apiserver.

- Propagate context in [passthrough](https://github.com/open-telemetry/opentelemetry-go-contrib/tree/main/examples/passthrough) mode
- When the feature gate is **enabled**, and a `TracingConfiguration` with sampling rate 0 (the default) is provided, the scheduler will:
- Initiate an OTLP connection
- Not record or export spans for its own root spans (e.g., `SchedulePod`). Note: because the scheduler creates new root spans via Span Links (not child spans of the API Server trace), the `ParentBasedSampler` treats them as roots and applies the configured rate (0 = no sampling). The sampled flag from the linked API Server trace does not influence this decision, since Span Links do not establish a parent-child relationship.
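For illustration, the configuration described above might look like the following. The placement of a `tracing` field inside `KubeSchedulerConfiguration` is an assumption of this sketch (following how KEP-2831 added it to the kubelet configuration); `endpoint` and `samplingRatePerMillion` are the existing component-base `TracingConfiguration` fields:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
# Assumed field placement: the KEP proposes reusing component-base's
# TracingConfiguration, as the kubelet and API Server configs do.
tracing:
  # OTLP gRPC endpoint; a collector address is assumed here.
  endpoint: otel-collector.monitoring.svc:4317
  # 0 (the default): the OTLP connection is initiated, but the
  # scheduler's own root spans are not recorded or exported.
  samplingRatePerMillion: 0
```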
Contributor

The sampled flag from the linked API Server trace does not influence this decision, since Span Links do not establish a parent-child relationship.

Do you think we should implement a custom sampler to make sampling decisions based on the span links? That would allow operators to get "complete" traces of linked spans, rather than independently-sampled spans. https://github.com/open-telemetry/opentelemetry-go/blob/f1f16bcb620285c31ad69a2c669c99ce85934797/sdk/trace/sampling.go#L39
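The link-aware decision suggested here can be modeled without the SDK. The sketch below is a toy reduction of that logic — the `link` type, field, and function are hypothetical; a real implementation would wrap this in a type satisfying `sdktrace.Sampler`'s `ShouldSample`, reading `SpanContext.IsSampled()` from each `trace.Link` in the sampling parameters:

```go
package main

import "fmt"

// link models the single piece of an OpenTelemetry span link that the
// decision below inspects; the bool stands in for the linked
// SpanContext's IsSampled() flag.
type link struct{ sampled bool }

// shouldSample sketches the custom sampler idea from review: a root
// span that links to an already-sampled trace is always recorded, so
// linked traces come out "complete"; otherwise the configured
// ratio-based decision applies unchanged.
func shouldSample(links []link, ratioDecision bool) bool {
	for _, l := range links {
		if l.sampled {
			return true
		}
	}
	return ratioDecision
}

func main() {
	// A scheduler root span linking to a sampled API Server trace is
	// kept even when the local sampling rate is 0.
	fmt.Println(shouldSample([]link{{sampled: true}}, false)) // true
}
```

The trade-off is that sampling volume then depends on upstream (API Server) sampling decisions rather than only on the scheduler's own configured rate.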


###### Will enabling / using this feature result in any new API calls?

No. The instrumentation adds spans to existing scheduling operations. No new API calls are made.
Contributor

IIUC we need new calls to the kube-apiserver to update the trace context annotation?
