Skip to content

docs: improve devnet-runner skill#226

Open
MegaRedHand wants to merge 3 commits intomainfrom
update-devnet-restart-skill
Open

docs: improve devnet-runner skill#226
MegaRedHand wants to merge 3 commits intomainfrom
update-devnet-restart-skill

Conversation

@MegaRedHand
Copy link
Collaborator

This PR cleans up and improves the devnet-runner skill

@github-actions
Copy link

🤖 Kimi Code Review

Review Summary

This PR reorganizes the devnet-runner documentation by splitting large files into focused references and adding long-lived devnet capabilities. The changes are primarily documentation reorganization with one minor script improvement.

Issues Found

  1. Missing cross-reference in SKILL.md (Line 97-98)

    • The text mentions "See references/validator-config.md" but this file was just created in this PR. The reference should be updated to use the correct relative path from the SKILL.md location: ../../references/validator-config.md
  2. Inconsistent path references

    • Throughout the documentation, paths like lean-quickstart/local-devnet/genesis/validator-config.yaml are used, but these should probably be relative to the repo root for consistency
  3. Potential security concern in long-lived-devnet.md

    • Line 45: The manual genesis time update uses $(date +%s) which could create race conditions if multiple people run this simultaneously. Consider adding a warning about this.
  4. Docker command in long-lived-devnet.md

    • Lines 52-67: The Docker run commands use --network host which may not work in all environments (especially cloud/CI). Consider adding a note about this limitation.

Positive Changes

  1. Good documentation split: The monolithic files were becoming unwieldy, and splitting into focused references improves maintainability
  2. Added cleanup in run-devnet-with-timeout.sh: The explicit container cleanup after timeout is a good addition for preventing stale containers
  3. Clear 60-second wait guidance: The long-lived devnet documentation correctly identifies the gossipsub backoff requirement

Suggestions

  1. Consider adding a validation script to check that validator-config.yaml only contains nodes that will actually run, as this is a critical configuration step
  2. The long-lived devnet procedure could benefit from a single script that automates the rolling restart with the 60-second wait

Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions
Copy link

🤖 Codex Code Review

  1. Potentially destructive container cleanup scope (Medium)
    The new forced cleanup removes any running container whose name matches ^(zeam|ream|qlean|lantern|lighthouse|grandine|ethlambda)_, not just containers started by this script. On shared hosts or parallel devnet runs, this can kill unrelated nodes and cause test interference/data loss.
    Reference: run-devnet-with-timeout.sh:38, run-devnet-with-timeout.sh:39
    Suggestion: track container IDs started in this invocation (or use labels/network scoping) and only remove those.

  2. Hardcoded client-name regex risks stale containers and maintenance drift (Low)
    Log dump and cleanup both rely on a hardcoded client list. If new client names appear, logs won’t be collected and containers won’t be cleaned, causing inconsistent devnet behavior and harder debugging.
    Reference: run-devnet-with-timeout.sh:28, run-devnet-with-timeout.sh:38
    Suggestion: derive names dynamically from validator config or use a common label assigned at container start.

No Rust consensus-path code (fork choice, attestation validation, process_slots/process_block, XMSS, SSZ) is changed in this PR, so there are no direct consensus-logic correctness findings in scope.


Automated review by OpenAI Codex · custom prompt

@github-actions
Copy link

🤖 Claude Code Review

I have all the information I need from the diff. Here is the review:


PR #226 Review: docs: improve devnet-runner skill

This PR restructures the devnet-runner skill by extracting large reference sections into dedicated files (validator-config.md, long-lived-devnet.md) and adding a concrete operational guide for persistent devnets with rolling restarts. The reorganization intent is good. A few issues below.


Broken Reference: Default Image Tags Are Now Missing

SKILL.md (step 2, Quick Start) now says:

See references/clients.md for current default images, tags, and known compatibility issues.

But the diff deletes the default tags table from both SKILL.md and clients.md:

-| zeam | blockblaz/zeam | devnet1 |
-| ream | ghcr.io/reamlabs/ream | latest |
-| ethlambda | ghcr.io/lambdaclass/ethlambda | local |
...

And clients.md's per-client sections (with their Image: fields) are also deleted. The promise "See references/clients.md for current default images, tags" is now a dead end — that file no longer contains them. Either restore the tags table to clients.md, or change the SKILL.md pointer to direct agents to look at client-cmds/*.sh directly.


Flag Inconsistency in long-lived-devnet.md

The docker run examples in long-lived-devnet.md use:

--metrics-address 0.0.0.0 --metrics-port 8081

But CLAUDE.md documents the bind address flag as --http-address (not --metrics-address), and notes it is shared by both API and metrics servers. This discrepancy should be verified against the actual CLI.

Additionally, --gossipsub-port 9001 is used where the validator-config uses enrFields.quic. Gossipsub (pub/sub layer) and QUIC (transport layer) are distinct — if the actual flag is something like --quic-port or --listen-port, the example will silently misconfigure nodes. This is worth double-checking against bin/ethlambda's CLI.


Missing --api-port in Long-Lived Devnet Examples

The docker run examples in long-lived-devnet.md include --metrics-port but omit --api-port. Since ethlambda runs two independent HTTP servers, if four ethlambda_* nodes are started without explicit --api-port, all four will attempt to bind 5052 and three will fail silently (CLAUDE.md: "Bind failures are logged via error!() but do not crash the node"). The rolling restart checkpoint sync step then picks CHECKPOINT_SOURCE_PORT=5052 — which may land on the wrong node or a failed port. The examples should include --api-port with per-node unique values.


checkpoint-sync.md Deleted Without Replacement for --restart-client

references/checkpoint-sync.md documented the spin-node.sh --restart-client <node> workflow (Step 5 covered this in detail). That section is now gone, and the new long-lived-devnet.md exclusively covers the docker run -d approach. The --restart-client flag still exists in spin-node.sh and is a valid workflow for short test devnets. If the intent is to deprecate it in favor of detached containers, a brief note in SKILL.md's "Manual Commands" section would prevent confusion.


Script Improvement is Sound

run-devnet-with-timeout.sh addition:

for node in $(docker ps --format '{{.Names}}' | grep -E '^(zeam|ream|qlean|lantern|lighthouse|grandine|ethlambda)_'); do
  docker rm -f "$node" 2>/dev/null
done

This correctly addresses the case where spin-node.sh doesn't clean up containers when killed. The grep pattern matches the established naming convention. docker rm -f is appropriate for cleanup (no need for graceful shutdown in a timed test). Good addition.


Minor: ×x in Timing Formulas

Changing the Unicode multiplication sign × to x removes a potential rendering issue in some terminals/editors. Fine as-is.


Summary

  • The information architecture (short main skill + detailed references) is an improvement.
  • The 60-second gossipsub backoff wait and restart order (non-aggregator first) in long-lived-devnet.md are valuable operational knowledge.
  • The broken default tags reference, flag inconsistencies in the new docker examples, and missing --api-port in multi-node setups are the most actionable issues to fix before merging.

Automated review by Claude (Anthropic) · custom prompt

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR reorganises the devnet-runner skill by splitting a large monolithic SKILL.md into three focused reference files (validator-config.md, clients.md, long-lived-devnet.md), and replaces the old checkpoint-sync.md (which described spin-node.sh --restart-client) with a new long-lived-devnet.md that covers persistent detached-container devnets and rolling restarts. A small script fix is also included that explicitly removes containers after the timeout in run-devnet-with-timeout.sh.

Key observations:

  • Documentation gap: checkpoint-sync.md is deleted and the spin-node.sh --restart-client procedure it described is no longer documented anywhere in the skill. If users need to restart a node in a locally-managed spin-node.sh devnet, there is no longer a documented procedure for this.
  • Port-conflict risk in long-lived-devnet.md: The docker run example for starting multiple ethlambda nodes omits --api-port, which means every node defaults to port 5052. On a single host this causes a bind conflict for all but one node — and the rolling restart procedure's CHECKPOINT_SOURCE_PORT=5052 inherits this ambiguity.
  • The script change (docker rm -f cleanup loop) is a net improvement; only a minor inaccuracy in the status message ("Stopping" vs "Stopping and removing").

Confidence Score: 4/5

  • Safe to merge with one documentation fix recommended before follow-on users run multi-node long-lived devnets.
  • All changes are documentation and a small shell-script cleanup. The structural reorganisation is clean and the new reference files are accurate. The main concern is a logic gap in the long-lived-devnet example that omits --api-port, which will cause port 5052 conflicts for anyone spinning up more than one ethlambda node on the same host following the guide literally.
  • references/long-lived-devnet.md — the multi-node docker run example needs explicit --api-port values per node.

Important Files Changed

Filename Overview
.claude/skills/devnet-runner/SKILL.md Primary skill file refactored to reduce verbosity by moving detailed content (validator schema, client tables, checkpoint sync procedure) into dedicated reference files; description updated to enumerate more trigger phrases for skill activation.
.claude/skills/devnet-runner/scripts/run-devnet-with-timeout.sh Added explicit container cleanup loop after killing spin-node.sh to handle cases where spin-node.sh doesn't remove containers on signal; minor inaccuracy in the "Stopping containers..." status message (docker rm -f stops and removes).
.claude/skills/devnet-runner/references/long-lived-devnet.md New reference file for persistent detached devnets and rolling restarts; the docker run example for multi-node ethlambda deployments omits --api-port, which causes port 5052 conflicts when running multiple nodes on the same host.
.claude/skills/devnet-runner/references/validator-config.md New reference file extracted from SKILL.md; contains full validator-config.yaml schema, field reference table, add/remove node procedures, port allocation guide, and local vs ansible deployment comparison — all accurate and consistent with the rest of the skill.
.claude/skills/devnet-runner/references/clients.md Per-client configuration notes removed and replaced with a pointer to client-cmds scripts; retained Docker image table, port defaults, known compatibility issues, and environment variable reference — well-organised.
.claude/skills/devnet-runner/references/checkpoint-sync.md Deleted; the spin-node.sh --restart-client procedure it documented is no longer covered anywhere in the skill — local devnet checkpoint-sync restarts are now undocumented.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([User Request]) --> B{Devnet type?}

    B -->|Short local test| C[Edit validator-config.yaml]
    B -->|Long-lived / remote| D[Generate genesis via spin-node.sh]

    C --> E[Update client image tags]
    E --> F[run-devnet-with-timeout.sh &lt;seconds&gt;]
    F --> G[spin-node.sh --generateGenesis runs in background]
    G --> H[sleep &lt;seconds&gt;]
    H --> I[Dump logs to *.log files]
    I --> J[kill spin-node.sh]
    J --> K[docker rm -f remaining containers]

    D --> L[docker run -d --restart unless-stopped\nper node with explicit ports]
    L --> M{Upgrade needed?}
    M -->|No| N([Devnet running persistently])
    M -->|Yes| O[docker pull new image]
    O --> P[docker rm -f node]
    P --> Q[sleep 60s — gossipsub backoff]
    Q --> R[docker run -d new image\n+ --checkpoint-sync-url]
    R --> S{Verify gossip\n+ finalization}
    S -->|Pass| T{More nodes\nto restart?}
    S -->|Fail| Q
    T -->|Yes| O
    T -->|No| N
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: .claude/skills/devnet-runner/scripts/run-devnet-with-timeout.sh
Line: 37

Comment:
**Misleading status message**

The message says "Stopping containers..." but `docker rm -f` both stops and removes containers. A more accurate message avoids confusion when reading script output.

```suggestion
echo "Stopping and removing containers..."
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: .claude/skills/devnet-runner/references/long-lived-devnet.md
Line: 50-58

Comment:
**Missing `--api-port` causes port conflicts for multi-node runs**

The example `docker run` command omits `--api-port`. When multiple ethlambda nodes run on the same host (as shown with `ethlambda_0` through `ethlambda_3`), they all default to `--api-port 5052`. Only one node can bind to that port; the rest fail silently or refuse connections.

This directly affects the rolling restart procedure, where `CHECKPOINT_SOURCE_PORT=5052` is used as the checkpoint URL port — if only one node is actually on 5052, the choice of source node is constrained but undocumented.

Consider adding `--api-port` explicitly to the example, matching the per-node port assignment table in `validator-config.md` (e.g., ethlambda_0 → 5052, ethlambda_1 → 5053, etc.):

```bash
docker run -d --restart unless-stopped --name ethlambda_0 --network host \
  -v $GENESIS:/config -v $DATA/ethlambda_0:/data \
  $IMAGE \
  --custom-network-config-dir /config \
  --gossipsub-port 9001 --node-id ethlambda_0 \
  --node-key /config/ethlambda_0.key \
  --api-port 5052 \
  --metrics-address 0.0.0.0 --metrics-port 8081

# ethlambda_1
docker run -d --restart unless-stopped --name ethlambda_1 --network host \
  -v $GENESIS:/config -v $DATA/ethlambda_1:/data \
  $IMAGE \
  ...
  --api-port 5053 \
  --metrics-port 8082
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 9eec519

wait $PID 2>/dev/null

# Explicitly stop and remove containers (spin-node.sh may not clean up on kill)
echo "Stopping containers..."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading status message

The message says "Stopping containers..." but docker rm -f both stops and removes containers. A more accurate message avoids confusion when reading script output.

Suggested change
echo "Stopping containers..."
echo "Stopping and removing containers..."
Prompt To Fix With AI
This is a comment left during a code review.
Path: .claude/skills/devnet-runner/scripts/run-devnet-with-timeout.sh
Line: 37

Comment:
**Misleading status message**

The message says "Stopping containers..." but `docker rm -f` both stops and removes containers. A more accurate message avoids confusion when reading script output.

```suggestion
echo "Stopping and removing containers..."
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +50 to +58
# Start each node (adjust ports, node-id, aggregator flag per validator-config.yaml)
docker run -d --restart unless-stopped --name ethlambda_0 --network host \
-v $GENESIS:/config -v $DATA/ethlambda_0:/data \
$IMAGE \
--custom-network-config-dir /config \
--gossipsub-port 9001 --node-id ethlambda_0 \
--node-key /config/ethlambda_0.key \
--metrics-address 0.0.0.0 --metrics-port 8081

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing --api-port causes port conflicts for multi-node runs

The example docker run command omits --api-port. When multiple ethlambda nodes run on the same host (as shown with ethlambda_0 through ethlambda_3), they all default to --api-port 5052. Only one node can bind to that port; the rest fail silently or refuse connections.

This directly affects the rolling restart procedure, where CHECKPOINT_SOURCE_PORT=5052 is used as the checkpoint URL port — if only one node is actually on 5052, the choice of source node is constrained but undocumented.

Consider adding --api-port explicitly to the example, matching the per-node port assignment table in validator-config.md (e.g., ethlambda_0 → 5052, ethlambda_1 → 5053, etc.):

docker run -d --restart unless-stopped --name ethlambda_0 --network host \
  -v $GENESIS:/config -v $DATA/ethlambda_0:/data \
  $IMAGE \
  --custom-network-config-dir /config \
  --gossipsub-port 9001 --node-id ethlambda_0 \
  --node-key /config/ethlambda_0.key \
  --api-port 5052 \
  --metrics-address 0.0.0.0 --metrics-port 8081

# ethlambda_1
docker run -d --restart unless-stopped --name ethlambda_1 --network host \
  -v $GENESIS:/config -v $DATA/ethlambda_1:/data \
  $IMAGE \
  ...
  --api-port 5053 \
  --metrics-port 8082
Prompt To Fix With AI
This is a comment left during a code review.
Path: .claude/skills/devnet-runner/references/long-lived-devnet.md
Line: 50-58

Comment:
**Missing `--api-port` causes port conflicts for multi-node runs**

The example `docker run` command omits `--api-port`. When multiple ethlambda nodes run on the same host (as shown with `ethlambda_0` through `ethlambda_3`), they all default to `--api-port 5052`. Only one node can bind to that port; the rest fail silently or refuse connections.

This directly affects the rolling restart procedure, where `CHECKPOINT_SOURCE_PORT=5052` is used as the checkpoint URL port — if only one node is actually on 5052, the choice of source node is constrained but undocumented.

Consider adding `--api-port` explicitly to the example, matching the per-node port assignment table in `validator-config.md` (e.g., ethlambda_0 → 5052, ethlambda_1 → 5053, etc.):

```bash
docker run -d --restart unless-stopped --name ethlambda_0 --network host \
  -v $GENESIS:/config -v $DATA/ethlambda_0:/data \
  $IMAGE \
  --custom-network-config-dir /config \
  --gossipsub-port 9001 --node-id ethlambda_0 \
  --node-key /config/ethlambda_0.key \
  --api-port 5052 \
  --metrics-address 0.0.0.0 --metrics-port 8081

# ethlambda_1
docker run -d --restart unless-stopped --name ethlambda_1 --network host \
  -v $GENESIS:/config -v $DATA/ethlambda_1:/data \
  $IMAGE \
  ...
  --api-port 5053 \
  --metrics-port 8082
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants