feat: add ExitReason tracking for actor lifecycle#163
Conversation
🤖 Kimi Code ReviewThe PR looks solid and correctly implements exit reason tracking for supervision. Here are my observations: ✅ Correctness & Safety
Panic handling - Both async and thread implementations correctly use Race safety - The
|
🤖 Codex Code ReviewFindings:
Open question:
Verification gap: I could not run Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
🤖 Claude Code ReviewPR #163 Review:
|
Greptile SummaryThis PR adds Key changes:
Issues found:
Confidence Score: 4/5Safe to merge with the RAII regression in threads mode addressed; all other findings are P2 style/hardening concerns One P1 issue exists: the threads
|
| Filename | Overview |
|---|---|
| concurrency/src/threads/actor.rs | Mutex changed to Option<ExitReason>; CompletionGuard refactored to carry reason, but now constructed after run_actor — breaking the RAII guarantee that prevented wait_exit() deadlocks on unexpected panics |
| concurrency/src/tasks/actor.rs | Watch channel changed from bool to Option<ExitReason>; wait_exit() added; run_actor returns ExitReason; panic_message duplicated from threads module; wait_exit maps dropped sender to Normal even for task aborts |
| concurrency/src/error.rs | Adds ExitReason enum with Normal, Shutdown, Panic(String), Kill variants, is_abnormal(), and Display impl; missing PartialEq derive limits supervision use |
| concurrency/src/lib.rs | Re-exports ExitReason alongside ActorError — straightforward one-line change |
| examples/exit_reason/src/main.rs | New runnable example demonstrating all four exit paths (clean stop, handler panic, started panic, polling); well-structured and accurate |
| examples/exit_reason/Cargo.toml | New example package manifest, correctly declares workspace dependencies |
Sequence Diagram
sequenceDiagram
participant Caller
participant ActorRef
participant run_actor
participant CompletionChannel
Caller->>ActorRef: .start()
ActorRef->>CompletionChannel: init(None)
ActorRef->>run_actor: spawn(actor, ctx, rx, token)
alt started() panics
run_actor-->>run_actor: catch_unwind → ExitReason::Panic
else started() calls ctx.stop()
run_actor-->>run_actor: ExitReason::Normal
else normal flow
loop message loop
run_actor->>run_actor: recv message
alt handler panics
run_actor-->>run_actor: exit_reason = Panic
run_actor-->>run_actor: break
else ctx.stop() called
run_actor-->>run_actor: break (Normal)
end
end
run_actor->>run_actor: stopped() [catch_unwind]
end
run_actor->>CompletionChannel: send(Some(ExitReason))
CompletionChannel-->>ActorRef: notify
Caller->>ActorRef: .wait_exit()
ActorRef->>CompletionChannel: wait for Some(reason)
CompletionChannel-->>ActorRef: Some(ExitReason)
ActorRef-->>Caller: ExitReason
Prompt To Fix All With AI
This is a comment left during a code review.
Path: concurrency/src/threads/actor.rs
Line: 412-420
Comment:
**RAII guarantee broken — `wait_exit()` can deadlock if `run_actor` panics unexpectedly**
The old code constructed `CompletionGuard` *before* calling `run_actor`, so the guard's `Drop` impl would always fire (writing `Some(reason)` and calling `cvar.notify_all()`) even if `run_actor` panicked at the thread level:
```rust
// OLD — always notifies, even on unexpected panic
let _guard = CompletionGuard(completion);
run_actor(actor, ctx, rx, cancellation_token);
```
The new code constructs the guard *after* `run_actor` returns. If `run_actor` ever panics without being caught (e.g., a future bug in the framework, a stack overflow, or a `catch_unwind` boundary issue), the guard is never created, `completion` is just dropped as an `Arc`, and any thread blocked in `wait_exit()` → `cvar.wait(…)` will deadlock forever.
Note that the `tasks` mode is unaffected because dropping `completion_tx` causes `rx.changed()` to return `Err`, so `wait_exit()` returns `ExitReason::Normal` instead of hanging.
A safe fix is to wrap `run_actor` at the thread closure level:
```rust
let _thread_handle = rt::spawn(move || {
let reason = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
run_actor(actor, ctx, rx, cancellation_token)
}))
.unwrap_or(ExitReason::Panic("unexpected framework panic".to_string()));
let guard = CompletionGuard { completion, reason };
drop(guard);
});
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: concurrency/src/tasks/actor.rs
Line: 383-393
Comment:
**`wait_exit()` silently maps task abortion/cancellation to `ExitReason::Normal`**
When `rx.changed().await` returns `Err`, it means every `watch::Sender` was dropped without ever sending `Some(reason)`. If the spawned task is **aborted** from outside (e.g., the tokio `JoinHandle` is used to abort, or the runtime shuts down mid-execution), the sender is dropped without a value and this branch returns `ExitReason::Normal` — even though the actor was forcibly killed.
For a supervision foundation this silent misclassification matters: a supervisor that only restarts on `is_abnormal()` would silently ignore an aborted actor. Consider returning `ExitReason::Kill` here instead:
```suggestion
if rx.changed().await.is_err() {
return ExitReason::Kill;
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: concurrency/src/tasks/actor.rs
Line: 505-510
Comment:
**Panic in `stopped()` is not reflected in the returned `ExitReason`**
When `stopped()` panics, the error is logged but `exit_reason` is unchanged. Two scenarios arise:
1. **Clean stop + `stopped()` panics** → returns `ExitReason::Normal`. A supervisor relying on `is_abnormal()` won't restart the actor despite cleanup code crashing.
2. **Handler panic + `stopped()` panics** → returns the original `Panic("panic in handler: …")`, which is correct but hides the secondary cleanup failure.
The same logic applies to the equivalent lines in `concurrency/src/threads/actor.rs` (lines 479–485).
If panics in `stopped()` should be surfaced, consider:
```rust
if let Err(panic) = stop_result {
let msg = panic_message(&panic);
tracing::error!("Panic in stopped() callback: {msg}");
if !exit_reason.is_abnormal() {
exit_reason = ExitReason::Panic(format!("panic in stopped(): {msg}"));
}
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: concurrency/src/error.rs
Line: 2-12
Comment:
**`ExitReason` is missing `PartialEq` — limits usability for supervision**
`ExitReason` derives `Debug` and `Clone` but not `PartialEq`. Supervision trees will almost certainly need to compare exit reasons directly. Without `PartialEq`, callers must use `matches!` patterns everywhere and cannot use `==` or `!=`.
```suggestion
#[derive(Debug, Clone, PartialEq)]
pub enum ExitReason {
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: concurrency/src/tasks/actor.rs
Line: 515-523
Comment:
**`panic_message` is duplicated across both actor modules**
This identical helper function is defined in both `concurrency/src/tasks/actor.rs` (lines 515–523) and `concurrency/src/threads/actor.rs` (lines 490–498). Consider moving it to `concurrency/src/error.rs` or a shared internal utilities module so it can be reused without duplication.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "feat: add ExitReason tracking for actor ..." | Re-trigger Greptile
| let _thread_handle = rt::spawn(move || { | ||
| let _guard = CompletionGuard(completion); | ||
| run_actor(actor, ctx, rx, cancellation_token); | ||
| let reason = run_actor(actor, ctx, rx, cancellation_token); | ||
| let guard = CompletionGuard { | ||
| completion, | ||
| reason, | ||
| }; | ||
| drop(guard); | ||
| }); | ||
|
|
There was a problem hiding this comment.
RAII guarantee broken —
wait_exit() can deadlock if run_actor panics unexpectedly
The old code constructed CompletionGuard before calling run_actor, so the guard's Drop impl would always fire (writing Some(reason) and calling cvar.notify_all()) even if run_actor panicked at the thread level:
// OLD — always notifies, even on unexpected panic
let _guard = CompletionGuard(completion);
run_actor(actor, ctx, rx, cancellation_token);The new code constructs the guard after run_actor returns. If run_actor ever panics without being caught (e.g., a future bug in the framework, a stack overflow, or a catch_unwind boundary issue), the guard is never created, completion is just dropped as an Arc, and any thread blocked in wait_exit() → cvar.wait(…) will deadlock forever.
Note that the tasks mode is unaffected because dropping completion_tx causes rx.changed() to return Err, so wait_exit() returns ExitReason::Normal instead of hanging.
A safe fix is to wrap run_actor at the thread closure level:
let _thread_handle = rt::spawn(move || {
let reason = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
run_actor(actor, ctx, rx, cancellation_token)
}))
.unwrap_or(ExitReason::Panic("unexpected framework panic".to_string()));
let guard = CompletionGuard { completion, reason };
drop(guard);
});Prompt To Fix With AI
This is a comment left during a code review.
Path: concurrency/src/threads/actor.rs
Line: 412-420
Comment:
**RAII guarantee broken — `wait_exit()` can deadlock if `run_actor` panics unexpectedly**
The old code constructed `CompletionGuard` *before* calling `run_actor`, so the guard's `Drop` impl would always fire (writing `Some(reason)` and calling `cvar.notify_all()`) even if `run_actor` panicked at the thread level:
```rust
// OLD — always notifies, even on unexpected panic
let _guard = CompletionGuard(completion);
run_actor(actor, ctx, rx, cancellation_token);
```
The new code constructs the guard *after* `run_actor` returns. If `run_actor` ever panics without being caught (e.g., a future bug in the framework, a stack overflow, or a `catch_unwind` boundary issue), the guard is never created, `completion` is just dropped as an `Arc`, and any thread blocked in `wait_exit()` → `cvar.wait(…)` will deadlock forever.
Note that the `tasks` mode is unaffected because dropping `completion_tx` causes `rx.changed()` to return `Err`, so `wait_exit()` returns `ExitReason::Normal` instead of hanging.
A safe fix is to wrap `run_actor` at the thread closure level:
```rust
let _thread_handle = rt::spawn(move || {
let reason = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
run_actor(actor, ctx, rx, cancellation_token)
}))
.unwrap_or(ExitReason::Panic("unexpected framework panic".to_string()));
let guard = CompletionGuard { completion, reason };
drop(guard);
});
```
How can I resolve this? If you propose a fix, please make it concise.| pub async fn wait_exit(&self) -> ExitReason { | ||
| let mut rx = self.completion_rx.clone(); | ||
| while !*rx.borrow_and_update() { | ||
| loop { | ||
| if let Some(reason) = rx.borrow_and_update().clone() { | ||
| return reason; | ||
| } | ||
| if rx.changed().await.is_err() { | ||
| break; | ||
| return ExitReason::Normal; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
wait_exit() silently maps task abortion/cancellation to ExitReason::Normal
When rx.changed().await returns Err, it means every watch::Sender was dropped without ever sending Some(reason). If the spawned task is aborted from outside (e.g., the tokio JoinHandle is used to abort, or the runtime shuts down mid-execution), the sender is dropped without a value and this branch returns ExitReason::Normal — even though the actor was forcibly killed.
For a supervision foundation this silent misclassification matters: a supervisor that only restarts on is_abnormal() would silently ignore an aborted actor. Consider returning ExitReason::Kill here instead:
| pub async fn wait_exit(&self) -> ExitReason { | |
| let mut rx = self.completion_rx.clone(); | |
| while !*rx.borrow_and_update() { | |
| loop { | |
| if let Some(reason) = rx.borrow_and_update().clone() { | |
| return reason; | |
| } | |
| if rx.changed().await.is_err() { | |
| break; | |
| return ExitReason::Normal; | |
| } | |
| } | |
| } | |
| if rx.changed().await.is_err() { | |
| return ExitReason::Kill; | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: concurrency/src/tasks/actor.rs
Line: 383-393
Comment:
**`wait_exit()` silently maps task abortion/cancellation to `ExitReason::Normal`**
When `rx.changed().await` returns `Err`, it means every `watch::Sender` was dropped without ever sending `Some(reason)`. If the spawned task is **aborted** from outside (e.g., the tokio `JoinHandle` is used to abort, or the runtime shuts down mid-execution), the sender is dropped without a value and this branch returns `ExitReason::Normal` — even though the actor was forcibly killed.
For a supervision foundation this silent misclassification matters: a supervisor that only restarts on `is_abnormal()` would silently ignore an aborted actor. Consider returning `ExitReason::Kill` here instead:
```suggestion
if rx.changed().await.is_err() {
return ExitReason::Kill;
}
```
How can I resolve this? If you propose a fix, please make it concise.| cancellation_token.cancel(); | ||
| let stop_result = AssertUnwindSafe(actor.stopped(&ctx)).catch_unwind().await; | ||
| if let Err(panic) = stop_result { | ||
| tracing::error!("Panic in stopped() callback: {panic:?}"); | ||
| let msg = panic_message(&panic); | ||
| tracing::error!("Panic in stopped() callback: {msg}"); | ||
| } |
There was a problem hiding this comment.
Panic in
stopped() is not reflected in the returned ExitReason
When stopped() panics, the error is logged but exit_reason is unchanged. Two scenarios arise:
- Clean stop +
stopped()panics → returnsExitReason::Normal. A supervisor relying onis_abnormal()won't restart the actor despite cleanup code crashing. - Handler panic +
stopped()panics → returns the originalPanic("panic in handler: …"), which is correct but hides the secondary cleanup failure.
The same logic applies to the equivalent lines in concurrency/src/threads/actor.rs (lines 479–485).
If panics in stopped() should be surfaced, consider:
if let Err(panic) = stop_result {
let msg = panic_message(&panic);
tracing::error!("Panic in stopped() callback: {msg}");
if !exit_reason.is_abnormal() {
exit_reason = ExitReason::Panic(format!("panic in stopped(): {msg}"));
}
}Prompt To Fix With AI
This is a comment left during a code review.
Path: concurrency/src/tasks/actor.rs
Line: 505-510
Comment:
**Panic in `stopped()` is not reflected in the returned `ExitReason`**
When `stopped()` panics, the error is logged but `exit_reason` is unchanged. Two scenarios arise:
1. **Clean stop + `stopped()` panics** → returns `ExitReason::Normal`. A supervisor relying on `is_abnormal()` won't restart the actor despite cleanup code crashing.
2. **Handler panic + `stopped()` panics** → returns the original `Panic("panic in handler: …")`, which is correct but hides the secondary cleanup failure.
The same logic applies to the equivalent lines in `concurrency/src/threads/actor.rs` (lines 479–485).
If panics in `stopped()` should be surfaced, consider:
```rust
if let Err(panic) = stop_result {
let msg = panic_message(&panic);
tracing::error!("Panic in stopped() callback: {msg}");
if !exit_reason.is_abnormal() {
exit_reason = ExitReason::Panic(format!("panic in stopped(): {msg}"));
}
}
```
How can I resolve this? If you propose a fix, please make it concise.| #[derive(Debug, Clone)] | ||
| pub enum ExitReason { | ||
| /// Clean stop via `ctx.stop()` or channel closure. | ||
| Normal, | ||
| /// Ordered shutdown from a supervisor or linked actor. | ||
| Shutdown, | ||
| /// Actor panicked in `started()`, a handler, or `stopped()`. | ||
| Panic(String), | ||
| /// Untrappable kill signal. | ||
| Kill, | ||
| } |
There was a problem hiding this comment.
ExitReason is missing PartialEq — limits usability for supervision
ExitReason derives Debug and Clone but not PartialEq. Supervision trees will almost certainly need to compare exit reasons directly. Without PartialEq, callers must use matches! patterns everywhere and cannot use == or !=.
| #[derive(Debug, Clone)] | |
| pub enum ExitReason { | |
| /// Clean stop via `ctx.stop()` or channel closure. | |
| Normal, | |
| /// Ordered shutdown from a supervisor or linked actor. | |
| Shutdown, | |
| /// Actor panicked in `started()`, a handler, or `stopped()`. | |
| Panic(String), | |
| /// Untrappable kill signal. | |
| Kill, | |
| } | |
| #[derive(Debug, Clone, PartialEq)] | |
| pub enum ExitReason { |
Prompt To Fix With AI
This is a comment left during a code review.
Path: concurrency/src/error.rs
Line: 2-12
Comment:
**`ExitReason` is missing `PartialEq` — limits usability for supervision**
`ExitReason` derives `Debug` and `Clone` but not `PartialEq`. Supervision trees will almost certainly need to compare exit reasons directly. Without `PartialEq`, callers must use `matches!` patterns everywhere and cannot use `==` or `!=`.
```suggestion
#[derive(Debug, Clone, PartialEq)]
pub enum ExitReason {
```
How can I resolve this? If you propose a fix, please make it concise.| fn panic_message(panic: &Box<dyn std::any::Any + Send>) -> String { | ||
| if let Some(s) = panic.downcast_ref::<&str>() { | ||
| s.to_string() | ||
| } else if let Some(s) = panic.downcast_ref::<String>() { | ||
| s.clone() | ||
| } else { | ||
| format!("{panic:?}") | ||
| } | ||
| } |
There was a problem hiding this comment.
panic_message is duplicated across both actor modules
This identical helper function is defined in both concurrency/src/tasks/actor.rs (lines 515–523) and concurrency/src/threads/actor.rs (lines 490–498). Consider moving it to concurrency/src/error.rs or a shared internal utilities module so it can be reused without duplication.
Prompt To Fix With AI
This is a comment left during a code review.
Path: concurrency/src/tasks/actor.rs
Line: 515-523
Comment:
**`panic_message` is duplicated across both actor modules**
This identical helper function is defined in both `concurrency/src/tasks/actor.rs` (lines 515–523) and `concurrency/src/threads/actor.rs` (lines 490–498). Consider moving it to `concurrency/src/error.rs` or a shared internal utilities module so it can be reused without duplication.
How can I resolve this? If you propose a fix, please make it concise.
Bot Review ResponsesAll bot findings have been addressed in 4ec9859 and f37e99b. Summary below. Kimi
Fixed. Changed to
Fixed. Restored RAII pattern with
Codex
Fixed. If
Fixed. Guard is now constructed BEFORE
Fixed. Returns
Added doc comments: "Not yet produced — reserved for supervision tree implementation." Claude
Fixed. Same as Codex finding above.
Fixed. Added
Fixed. Moved to
Fixed. Changed to
Fixed. See Kimi response above.
Fixed. Now uses
Disregarded. The GreptileAll findings overlap with Codex/Claude above — all addressed. The sequence diagram and summary are accurate. |
| /// Clean stop via `ctx.stop()` or channel closure. | ||
| Normal, | ||
| /// Ordered shutdown from a supervisor or linked actor. | ||
| /// Not yet produced — reserved for supervision tree implementation. | ||
| Shutdown, | ||
| /// Actor panicked in `started()`, a handler, or `stopped()`. | ||
| Panic(String), | ||
| /// Untrappable kill signal. | ||
| /// Not yet produced — reserved for supervision tree implementation. | ||
| Kill, |
There was a problem hiding this comment.
I'm not entirely sure about the variants. Specifically, behaviorally it seems like normal and shutdown are equivalent, and kill and panic differ only on the message.
Also, how will the untrappable kill signal be implemented? It sounds like its benefit would be that it works even if the actor is stuck in an endless loop, but that would require real preemption and the ability to direct signals to specific threads (which I'm not sure if it's doable) to work. Otherwise, you would just need to wait for the cancellation point and at that point you can probably just allow the shutdown process to run.
There was a problem hiding this comment.
Good questions. 🤡
On Normal vs Shutdown: They are behaviorally identical today — both are "clean" exits where is_abnormal() returns false. The distinction matters for supervision: a Transient child should not restart on either, but the cause is different. Normal means the actor decided to stop on its own (ctx.stop()). Shutdown means something external told it to stop (a supervisor shutting down children in order, or a linked actor propagating an orderly shutdown). In Erlang, shutdown and {shutdown, Term} are distinct from normal for exactly this reason — they're both "clean" but carry different semantic meaning for debugging and logging. You want to know whether an actor stopped because it was done or because its supervisor told it to. That said, if in practice the distinction proves unnecessary, we can collapse them later.
On Kill and preemption: You're right that we can't preempt a stuck actor — Rust has no mechanism for that (no Thread::stop(), and tokio::JoinHandle::abort() only works at .await points). In practice, Kill means:
- Cancel the token immediately (same as Shutdown)
- Don't run
stopped()callback — this is the key difference from Shutdown, which does run cleanup - Bypass trap_exit — a supervisor trapping exits receives
Exitmessages for Normal/Shutdown/Panic, butKillterminates the actor without delivering a message. This matters for supervision trees: when a supervisor hits meltdown and needs to tear everything down immediately, it kills children without giving them a chance to interfere.
So Kill is not about preempting a stuck loop (that's a harder problem), but about skipping cleanup and bypassing exit trapping. Same as Erlang's exit(Pid, kill) — it doesn't magically preempt either, it just skips terminate/2 and can't be trapped.
The "stuck in an endless loop" problem is real but orthogonal — that's more in the territory of the Thread backend (where you could pthread_cancel in theory) or timeouts at the supervisor level (ShutdownType::Timeout + force drop). Worth discussing separately.
| /// Untrappable kill signal. | ||
| /// Not yet produced — reserved for supervision tree implementation. | ||
| Kill, |
There was a problem hiding this comment.
The doc says "Not yet produced", but this PR produces Kill in two places: wait_exit() returns it when the watch channel sender drops, and CompletionGuard::drop() defaults to it when no reason was set.
There was a problem hiding this comment.
Good catch — the doc was wrong. Fixed in bf225b9:
Killdoc updated to reflect it's currently produced as a fallback when an async task is aborted externally- Removed the "Not yet produced" note from
Kill(kept it onShutdownwhich truly is never produced yet) CompletionGuard::dropfallback changed fromKilltoPanic("unexpected framework panic")— see MegaRedHand's other comment
| let (lock, cvar) = &*self.completion; | ||
| let mut completed = lock.lock().unwrap_or_else(|p| p.into_inner()); | ||
| *completed = true; | ||
| *completed = self.reason.take().or(Some(ExitReason::Kill)); |
There was a problem hiding this comment.
Should we differentiate here between true Kills and this kind of unexpected failure?
There was a problem hiding this comment.
Good point — we should differentiate. Fixed in bf225b9:
The CompletionGuard fallback now uses Panic("unexpected framework panic") instead of Kill. This is more accurate since this path is only hit if run_actor panics at the thread boundary (a framework bug), not an intentional kill.
Kill is now reserved for intentional untrappable termination from supervision (ShutdownType::BrutalKill).
Summary
ExitReasonenum (Normal,Shutdown,Panic(String),Kill) withis_abnormal()methodActorRef::wait_exit()returns the exit reason (async in tasks, blocking in threads)ActorRef::exit_reason()polls the reason without blockingjoin()unchanged for backwards compatibilityexit_reasonexample demonstrating all exit pathsFoundation for supervision trees (#131, #133). The watch channel changes from
booltoOption<ExitReason>in tasks mode, andMutex<bool>toMutex<Option<ExitReason>>in threads mode.run_actor()now returnsExitReasonon every exit path: clean stop, channel closure, panic in started(), and panic in handler.Test plan
cargo run -p exit_reason