KEP-5915: Standardize async trace context propagation#5916

Open
artem-tkachuk wants to merge 4 commits into kubernetes:master from artem-tkachuk:KEP-5915-async-trace-context-propagation

@artem-tkachuk

Proposes a standardized mechanism for propagating W3C TraceContext across asynchronous boundaries in Kubernetes using object annotations and OpenTelemetry Span Links. Adds helper functions to component-base for controllers to inject/extract trace context and create linked spans.

One-line PR description: Initial KEP proposal for standardizing async trace context propagation

Issue link: #5915

Other comments: This KEP proposes a standard way to propagate trace context across async boundaries in Kubernetes by storing W3C TraceContext in object annotations (tracing.k8s.io/traceparent) and using OpenTelemetry Span Links. Includes library functions in component-base for easy adoption by controllers.
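
To make the annotation mechanism concrete, here is a minimal standard-library sketch of writing and reading a W3C `traceparent` value under the `tracing.k8s.io/traceparent` annotation key. The helper names `injectTraceparent` and `extractTraceparent` and the `spanRef` type are hypothetical and are not the KEP's component-base API; a real implementation would presumably work with OpenTelemetry span contexts rather than plain strings.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Annotation key taken from the PR description.
const traceparentAnnotation = "tracing.k8s.io/traceparent"

// W3C traceparent, version 00:
// 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex trace-flags>
var traceparentRe = regexp.MustCompile(`^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$`)

// spanRef is a hypothetical stand-in for an OpenTelemetry SpanContext.
type spanRef struct {
	TraceID string
	SpanID  string
	Flags   string
}

// injectTraceparent (hypothetical helper) stamps the annotations with the
// given span reference and returns the map, allocating it if it was nil.
func injectTraceparent(annotations map[string]string, sc spanRef) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[traceparentAnnotation] = fmt.Sprintf("00-%s-%s-%s", sc.TraceID, sc.SpanID, sc.Flags)
	return annotations
}

// extractTraceparent (hypothetical helper) reads the annotation back.
// ok is false when the annotation is absent or malformed, or when the
// trace-id or parent-id is all zeros (invalid per the W3C format).
func extractTraceparent(annotations map[string]string) (sc spanRef, ok bool) {
	m := traceparentRe.FindStringSubmatch(annotations[traceparentAnnotation])
	if m == nil || m[1] == strings.Repeat("0", 32) || m[2] == strings.Repeat("0", 16) {
		return spanRef{}, false
	}
	return spanRef{TraceID: m[1], SpanID: m[2], Flags: m[3]}, true
}

func main() {
	ann := injectTraceparent(nil, spanRef{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
		Flags:   "01",
	})
	fmt.Println(ann[traceparentAnnotation])

	sc, ok := extractTraceparent(ann)
	fmt.Println(sc.TraceID, ok)
}
```

Returning `ok == false` for a missing or malformed annotation lets a consumer fall back to starting an unlinked span instead of failing the reconcile.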

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: artem-tkachuk
Once this PR has been reviewed and has the lgtm label, please assign dgrisonnet for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @artem-tkachuk!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Feb 12, 2026
@k8s-ci-robot
Contributor

Hi @artem-tkachuk. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added labels on Feb 12, 2026:
- `kind/kep`: Categorizes KEP tracking issues and PRs modifying the KEP directory
- `sig/instrumentation`: Categorizes an issue or PR as relevant to SIG Instrumentation.
- `size/XL`: Denotes a PR that changes 500-999 lines, ignoring generated files.
- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.

@artem-tkachuk
Author

cc @dashpole - incorporated your feedback from the design doc!


**Phase 1: Injection (Producer)**

When a component creates or updates an object's **spec** based on an incoming request, it is responsible for "stamping" the object with the current trace context.
Contributor

Not sure how broadly "components" is meant here. I assume users calling the API are also responsible for stamping objects if they are making updates in the context of a span?

**When to inject:**
- Creating a new object (e.g., API Server creates a Pod from user request)
- Updating the **spec** (e.g., changing a Deployment's replica count, updating a label)
- Making semantic changes that represent user intent or control plane decisions
Contributor

Can you give an example of this kind of change that doesn't update the spec? I assume things like updating a label wouldn't apply, right?

**When NOT to inject:**
- Updating **status** fields (e.g., Pod phase transitions, condition updates)
- Heartbeats or leader election updates
- Internal housekeeping that doesn't represent a new "action"
Contributor

An example would be helpful here as well.

The guideline: **Inject when the update represents a new action or decision**, not for routine status reporting.
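
As a rough illustration of this guideline, the sketch below decides whether to inject by comparing the old and new spec. The `object` type and `shouldInject` function are hypothetical simplifications, and the rule covers only creates and spec changes; it does not capture "semantic changes" that leave the spec untouched, which the review comments above ask the KEP to clarify.

```go
package main

import "fmt"

// object is a hypothetical, heavily simplified API object.
type object struct {
	Spec   string
	Status string
}

// shouldInject reports whether a write represents a new action or decision.
// old is nil when the object is being created.
func shouldInject(old, updated *object) bool {
	if old == nil {
		return true // creating a new object
	}
	// A spec change is treated as a new action. Status-only writes,
	// heartbeats, and other housekeeping leave the spec untouched.
	return old.Spec != updated.Spec
}

func main() {
	current := &object{Spec: "replicas=3", Status: "Pending"}

	fmt.Println(shouldInject(nil, current))                                             // true
	fmt.Println(shouldInject(current, &object{Spec: "replicas=5", Status: "Pending"})) // true
	fmt.Println(shouldInject(current, &object{Spec: "replicas=3", Status: "Running"})) // false
}
```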

**Who is responsible for injection:**
- **API Server**: When handling user requests that create objects (e.g., `kubectl create -f pod.yaml`)
Contributor

This doesn't make sense to me. If a user creates a pod, then the API Server is responsible for adding the annotation. But if a user updates a pod's spec, then the user is responsible for adding the annotation? It would be nicer if one party, either the user (client) or the API Server, were consistently responsible for adding the annotation.


// Option 1: Don't inject (preserve creation context)
pod.Spec.NodeName = node
s.client.Update(ctx, pod)
Contributor

I believe the scheduler doesn't actually write to the pod object -- it uses a subresource called binding, IIRC. I wonder if we should exclude subresources generally (which also includes status)... But that would mean no context propagation on the resize subresource. Because subresources can only make specific updates (e.g. binding can only update the pod's nodeName, and resize can only update the number of replicas, etc.), you probably can't update the object annotation in the same request. Controllers that update via subresources may not even have RBAC permissions to update annotations!


#### Beta

- [ ] Adoption by at least one core component (e.g., Scheduler or Controller Manager).
Contributor

For beta, I would like to demonstrate at least two controllers working together, e.g. scheduler and kubelet, or controller manager and kubelet. Otherwise, I'm not sure we can trust that the propagation mechanisms work reliably and don't cause issues.


This feature is for observability and does not have direct SLOs. It should not impact existing API Server SLOs.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
Contributor

This might be a bit detailed for what we are doing, and how often it occurs. I think it is OK for this feature to not have self-obs metrics given the result can be seen from the resulting trace instrumentation.


- **Tracing enabled in Kubernetes components**: Any W3C TraceContext-compliant tracing implementation. Kubernetes currently uses OpenTelemetry (see KEP-647 for API Server, KEP-2831 for Kubelet).
- **Trace collection pipeline**: Any system that can collect and forward traces (e.g., OpenTelemetry Collector, Jaeger Agent, Zipkin Collector).
- **Trace storage backend**: Any backend that supports W3C TraceContext and Span Links (e.g., Jaeger, Tempo, Lightstep, Honeycomb).
Contributor

Suggested change
- **Trace storage backend**: Any backend that supports W3C TraceContext and Span Links (e.g., Jaeger, Tempo, Lightstep, Honeycomb).
- **Trace storage backend**: Any backend that supports OTLP traces (e.g., Jaeger, Tempo, Lightstep, Honeycomb).

The trace backends actually don't depend on W3C at all. Span links are part of the OTLP payload, so let's just require OTLP support.


### Automatic Instrumentation via Controller Runtime

Instead of requiring explicit calls to `InjectContext` and `ExtractContext`, the controller runtime (e.g., controller-runtime or client-go's `SharedInformer`) could automatically inject/extract trace context.
Contributor

I was thinking controller runtime would only automatically extract context, but injection would be up to the controller code. But I'm also OK to start with making controllers call extract in code. It isn't too bad...
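
For context on what an explicit extract call amounts to, here is a toy standard-library sketch of a consumer that starts a new root span for its reconcile and records the stored context as a link rather than as a parent. The `span`, `spanRef`, and `startLinkedSpan` names are hypothetical; with the OpenTelemetry Go API the equivalent step would be passing `trace.WithLinks(...)` when starting the span. Validation here is deliberately minimal (field count and lengths only).

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strings"
)

// Annotation key taken from the PR description.
const traceparentAnnotation = "tracing.k8s.io/traceparent"

// spanRef identifies a span in another trace (hypothetical stand-in for an
// OpenTelemetry SpanContext).
type spanRef struct{ TraceID, SpanID string }

// span is a toy span: it belongs to its own trace and may link to others.
type span struct {
	Name    string
	TraceID string
	SpanID  string
	Links   []spanRef
}

// randomHex returns nBytes of cryptographically random data, hex-encoded.
func randomHex(nBytes int) string {
	b := make([]byte, nBytes)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// startLinkedSpan (hypothetical) starts a new root span. If the object carries
// a traceparent annotation, the producer's context is attached as a link, so
// the two traces stay separate but can be correlated.
func startLinkedSpan(name string, annotations map[string]string) span {
	s := span{Name: name, TraceID: randomHex(16), SpanID: randomHex(8)}
	parts := strings.Split(annotations[traceparentAnnotation], "-")
	if len(parts) == 4 && len(parts[1]) == 32 && len(parts[2]) == 16 {
		s.Links = append(s.Links, spanRef{TraceID: parts[1], SpanID: parts[2]})
	}
	return s
}

func main() {
	annotations := map[string]string{
		traceparentAnnotation: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
	}
	s := startLinkedSpan("reconcile", annotations)
	fmt.Println("own trace:", s.TraceID)
	fmt.Println("linked to:", s.Links[0].TraceID, s.Links[0].SpanID)
}
```

Using a link instead of a parent keeps each reconcile in its own trace, which suits asynchronous work that may run long after, or many times for, the request that stamped the object.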

@artem-tkachuk force-pushed the KEP-5915-async-trace-context-propagation branch 2 times, most recently from a41ffb2 to a5fb584, on February 20, 2026 05:38
Proposes standardized mechanism for propagating W3C TraceContext across
asynchronous boundaries in Kubernetes using object annotations and
OpenTelemetry Span Links. Adds helper functions to component-base for
controllers to inject/extract trace context and create linked spans.

Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
@artem-tkachuk force-pushed the KEP-5915-async-trace-context-propagation branch from a5fb584 to c8f7757 on March 1, 2026 08:40
Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Per reviewer feedback: drop the webhook bullet since that approach is not planned; the Alternatives section still documents why it was not chosen.

Signed-off-by: Artem Tkachuk <artemtkachuk@yahoo.com>
Labels

- `cncf-cla: yes`: Indicates the PR's author has signed the CNCF CLA.
- `kind/kep`: Categorizes KEP tracking issues and PRs modifying the KEP directory
- `needs-ok-to-test`: Indicates a PR that requires an org member to verify it is safe to test.
- `sig/instrumentation`: Categorizes an issue or PR as relevant to SIG Instrumentation.
- `size/XL`: Denotes a PR that changes 500-999 lines, ignoring generated files.

3 participants