docs: improve devnet-runner skill by MegaRedHand · Pull Request #226 · lambdaclass/ethlambda

MegaRedHand · 2026-03-13T21:43:49Z

This PR cleans up and improves the devnet-runner skill

github-actions · 2026-03-13T21:44:34Z

🤖 Kimi Code Review

Review Summary

This PR reorganizes the devnet-runner documentation by splitting large files into focused references and adding long-lived devnet capabilities. The changes are primarily documentation reorganization with one minor script improvement.

Issues Found

Missing cross-reference in SKILL.md (Line 97-98)
- The text mentions "See references/validator-config.md" but this file was just created in this PR. The reference should be updated to use the correct relative path from the SKILL.md location: ../../references/validator-config.md
Inconsistent path references
- Throughout the documentation, paths like lean-quickstart/local-devnet/genesis/validator-config.yaml are used, but these should probably be relative to the repo root for consistency
Potential security concern in long-lived-devnet.md
- Line 45: The manual genesis time update uses $(date +%s) which could create race conditions if multiple people run this simultaneously. Consider adding a warning about this.
Docker command in long-lived-devnet.md
- Lines 52-67: The Docker run commands use --network host which may not work in all environments (especially cloud/CI). Consider adding a note about this limitation.

Positive Changes

Good documentation split: The monolithic files were becoming unwieldy, and splitting into focused references improves maintainability
Added cleanup in run-devnet-with-timeout.sh: The explicit container cleanup after timeout is a good addition for preventing stale containers
Clear 60-second wait guidance: The long-lived devnet documentation correctly identifies the gossipsub backoff requirement

Suggestions

Consider adding a validation script to check that validator-config.yaml only contains nodes that will actually run, as this is a critical configuration step
The long-lived devnet procedure could benefit from a single script that automates the rolling restart with the 60-second wait

Automated review by Kimi (Moonshot AI) · custom prompt

github-actions · 2026-03-13T21:45:10Z

🤖 Codex Code Review

Potentially destructive container cleanup scope (Medium)
The new forced cleanup removes any running container whose name matches ^(zeam|ream|qlean|lantern|lighthouse|grandine|ethlambda)_, not just containers started by this script. On shared hosts or parallel devnet runs, this can kill unrelated nodes and cause test interference/data loss.
Reference: run-devnet-with-timeout.sh:38, run-devnet-with-timeout.sh:39
Suggestion: track container IDs started in this invocation (or use labels/network scoping) and only remove those.
Hardcoded client-name regex risks stale containers and maintenance drift (Low)
Log dump and cleanup both rely on a hardcoded client list. If new client names appear, logs won’t be collected and containers won’t be cleaned, causing inconsistent devnet behavior and harder debugging.
Reference: run-devnet-with-timeout.sh:28, run-devnet-with-timeout.sh:38
Suggestion: derive names dynamically from validator config or use a common label assigned at container start.

No Rust consensus-path code (fork choice, attestation validation, process_slots/process_block, XMSS, SSZ) is changed in this PR, so there are no direct consensus-logic correctness findings in scope.

Automated review by OpenAI Codex · custom prompt

github-actions · 2026-03-13T21:46:02Z

🤖 Claude Code Review

I have all the information I need from the diff. Here is the review:

PR #226 Review: `docs: improve devnet-runner skill`

This PR restructures the devnet-runner skill by extracting large reference sections into dedicated files (validator-config.md, long-lived-devnet.md) and adding a concrete operational guide for persistent devnets with rolling restarts. The reorganization intent is good. A few issues below.

Broken Reference: Default Image Tags Are Now Missing

SKILL.md (step 2, Quick Start) now says:

See references/clients.md for current default images, tags, and known compatibility issues.

But the diff deletes the default tags table from both SKILL.md and clients.md:

-| zeam | blockblaz/zeam | devnet1 |
-| ream | ghcr.io/reamlabs/ream | latest |
-| ethlambda | ghcr.io/lambdaclass/ethlambda | local |
...

And clients.md's per-client sections (with their Image: fields) are also deleted. The promise "See references/clients.md for current default images, tags" is now a dead end — that file no longer contains them. Either restore the tags table to clients.md, or change the SKILL.md pointer to direct agents to look at client-cmds/*.sh directly.

Flag Inconsistency in `long-lived-devnet.md`

The docker run examples in long-lived-devnet.md use:

--metrics-address 0.0.0.0 --metrics-port 8081

But CLAUDE.md documents the bind address flag as --http-address (not --metrics-address), and notes it is shared by both API and metrics servers. This discrepancy should be verified against the actual CLI.

Additionally, --gossipsub-port 9001 is used where the validator-config uses enrFields.quic. Gossipsub (pub/sub layer) and QUIC (transport layer) are distinct — if the actual flag is something like --quic-port or --listen-port, the example will silently misconfigure nodes. This is worth double-checking against bin/ethlambda's CLI.

Missing `--api-port` in Long-Lived Devnet Examples

The docker run examples in long-lived-devnet.md include --metrics-port but omit --api-port. Since ethlambda runs two independent HTTP servers, if four ethlambda_* nodes are started without explicit --api-port, all four will attempt to bind 5052 and three will fail silently (CLAUDE.md: "Bind failures are logged via error!() but do not crash the node"). The rolling restart checkpoint sync step then picks CHECKPOINT_SOURCE_PORT=5052 — which may land on the wrong node or a failed port. The examples should include --api-port with per-node unique values.

`checkpoint-sync.md` Deleted Without Replacement for `--restart-client`

references/checkpoint-sync.md documented the spin-node.sh --restart-client <node> workflow (Step 5 covered this in detail). That section is now gone, and the new long-lived-devnet.md exclusively covers the docker run -d approach. The --restart-client flag still exists in spin-node.sh and is a valid workflow for short test devnets. If the intent is to deprecate it in favor of detached containers, a brief note in SKILL.md's "Manual Commands" section would prevent confusion.

Script Improvement is Sound

run-devnet-with-timeout.sh addition:

for node in $(docker ps --format '{{.Names}}' | grep -E '^(zeam|ream|qlean|lantern|lighthouse|grandine|ethlambda)_'); do
  docker rm -f "$node" 2>/dev/null
done

This correctly addresses the case where spin-node.sh doesn't clean up containers when killed. The grep pattern matches the established naming convention. docker rm -f is appropriate for cleanup (no need for graceful shutdown in a timed test). Good addition.

Minor: `×` → `x` in Timing Formulas

Changing the Unicode multiplication sign × to x removes a potential rendering issue in some terminals/editors. Fine as-is.

Summary

The information architecture (short main skill + detailed references) is an improvement.
The 60-second gossipsub backoff wait and restart order (non-aggregator first) in long-lived-devnet.md are valuable operational knowledge.
The broken default tags reference, flag inconsistencies in the new docker examples, and missing --api-port in multi-node setups are the most actionable issues to fix before merging.

Automated review by Claude (Anthropic) · custom prompt

greptile-apps · 2026-03-13T21:46:51Z

Greptile Summary

This PR reorganises the devnet-runner skill by splitting a large monolithic SKILL.md into three focused reference files (validator-config.md, clients.md, long-lived-devnet.md), and replaces the old checkpoint-sync.md (which described spin-node.sh --restart-client) with a new long-lived-devnet.md that covers persistent detached-container devnets and rolling restarts. A small script fix is also included that explicitly removes containers after the timeout in run-devnet-with-timeout.sh.

Key observations:

Documentation gap: checkpoint-sync.md is deleted and the spin-node.sh --restart-client procedure it described is no longer documented anywhere in the skill. If users need to restart a node in a locally-managed spin-node.sh devnet, there is no longer a documented procedure for this.
Port-conflict risk in long-lived-devnet.md: The docker run example for starting multiple ethlambda nodes omits --api-port, which means every node defaults to port 5052. On a single host this causes a bind conflict for all but one node — and the rolling restart procedure's CHECKPOINT_SOURCE_PORT=5052 inherits this ambiguity.
The script change (docker rm -f cleanup loop) is a net improvement; only a minor inaccuracy in the status message ("Stopping" vs "Stopping and removing").

Confidence Score: 4/5

Safe to merge with one documentation fix recommended before follow-on users run multi-node long-lived devnets.
All changes are documentation and a small shell-script cleanup. The structural reorganisation is clean and the new reference files are accurate. The main concern is a logic gap in the long-lived-devnet example that omits --api-port, which will cause port 5052 conflicts for anyone spinning up more than one ethlambda node on the same host following the guide literally.
references/long-lived-devnet.md — the multi-node docker run example needs explicit --api-port values per node.

Important Files Changed

Filename	Overview
.claude/skills/devnet-runner/SKILL.md	Primary skill file refactored to reduce verbosity by moving detailed content (validator schema, client tables, checkpoint sync procedure) into dedicated reference files; description updated to enumerate more trigger phrases for skill activation.
.claude/skills/devnet-runner/scripts/run-devnet-with-timeout.sh	Added explicit container cleanup loop after killing spin-node.sh to handle cases where spin-node.sh doesn't remove containers on signal; minor inaccuracy in the "Stopping containers..." status message (docker rm -f stops and removes).
.claude/skills/devnet-runner/references/long-lived-devnet.md	New reference file for persistent detached devnets and rolling restarts; the docker run example for multi-node ethlambda deployments omits --api-port, which causes port 5052 conflicts when running multiple nodes on the same host.
.claude/skills/devnet-runner/references/validator-config.md	New reference file extracted from SKILL.md; contains full validator-config.yaml schema, field reference table, add/remove node procedures, port allocation guide, and local vs ansible deployment comparison — all accurate and consistent with the rest of the skill.
.claude/skills/devnet-runner/references/clients.md	Per-client configuration notes removed and replaced with a pointer to client-cmds scripts; retained Docker image table, port defaults, known compatibility issues, and environment variable reference — well-organised.
.claude/skills/devnet-runner/references/checkpoint-sync.md	Deleted; the spin-node.sh --restart-client procedure it documented is no longer covered anywhere in the skill — local devnet checkpoint-sync restarts are now undocumented.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([User Request]) --> B{Devnet type?}

    B -->|Short local test| C[Edit validator-config.yaml]
    B -->|Long-lived / remote| D[Generate genesis via spin-node.sh]

    C --> E[Update client image tags]
    E --> F[run-devnet-with-timeout.sh &lt;seconds&gt;]
    F --> G[spin-node.sh --generateGenesis runs in background]
    G --> H[sleep &lt;seconds&gt;]
    H --> I[Dump logs to *.log files]
    I --> J[kill spin-node.sh]
    J --> K[docker rm -f remaining containers]

    D --> L[docker run -d --restart unless-stopped\nper node with explicit ports]
    L --> M{Upgrade needed?}
    M -->|No| N([Devnet running persistently])
    M -->|Yes| O[docker pull new image]
    O --> P[docker rm -f node]
    P --> Q[sleep 60s — gossipsub backoff]
    Q --> R[docker run -d new image\n+ --checkpoint-sync-url]
    R --> S{Verify gossip\n+ finalization}
    S -->|Pass| T{More nodes\nto restart?}
    S -->|Fail| Q
    T -->|Yes| O
    T -->|No| N

Prompt To Fix All With AI

This is a comment left during a code review.
Path: .claude/skills/devnet-runner/scripts/run-devnet-with-timeout.sh
Line: 37

Comment:
**Misleading status message**

The message says "Stopping containers..." but `docker rm -f` both stops and removes containers. A more accurate message avoids confusion when reading script output.

```suggestion
echo "Stopping and removing containers..."
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: .claude/skills/devnet-runner/references/long-lived-devnet.md
Line: 50-58

Comment:
**Missing `--api-port` causes port conflicts for multi-node runs**

The example `docker run` command omits `--api-port`. When multiple ethlambda nodes run on the same host (as shown with `ethlambda_0` through `ethlambda_3`), they all default to `--api-port 5052`. Only one node can bind to that port; the rest fail silently or refuse connections.

This directly affects the rolling restart procedure, where `CHECKPOINT_SOURCE_PORT=5052` is used as the checkpoint URL port — if only one node is actually on 5052, the choice of source node is constrained but undocumented.

Consider adding `--api-port` explicitly to the example, matching the per-node port assignment table in `validator-config.md` (e.g., ethlambda_0 → 5052, ethlambda_1 → 5053, etc.):

```bash
docker run -d --restart unless-stopped --name ethlambda_0 --network host \
  -v $GENESIS:/config -v $DATA/ethlambda_0:/data \
  $IMAGE \
  --custom-network-config-dir /config \
  --gossipsub-port 9001 --node-id ethlambda_0 \
  --node-key /config/ethlambda_0.key \
  --api-port 5052 \
  --metrics-address 0.0.0.0 --metrics-port 8081

# ethlambda_1
docker run -d --restart unless-stopped --name ethlambda_1 --network host \
  -v $GENESIS:/config -v $DATA/ethlambda_1:/data \
  $IMAGE \
  ...
  --api-port 5053 \
  --metrics-port 8082
```

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: 9eec519}

greptile-apps · 2026-03-13T21:46:55Z

.claude/skills/devnet-runner/scripts/run-devnet-with-timeout.sh

 wait $PID 2>/dev/null
+
+# Explicitly stop and remove containers (spin-node.sh may not clean up on kill)
+echo "Stopping containers..."


Misleading status message

The message says "Stopping containers..." but docker rm -f both stops and removes containers. A more accurate message avoids confusion when reading script output.

Suggested change

echo "Stopping containers..."

echo "Stopping and removing containers..."

Prompt To Fix With AI

This is a comment left during a code review. Path: .claude/skills/devnet-runner/scripts/run-devnet-with-timeout.sh Line: 37 Comment: **Misleading status message** The message says "Stopping containers..." but `docker rm -f` both stops and removes containers. A more accurate message avoids confusion when reading script output. ```suggestion echo "Stopping and removing containers..." ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-03-13T21:46:56Z

.claude/skills/devnet-runner/references/long-lived-devnet.md

+# Start each node (adjust ports, node-id, aggregator flag per validator-config.yaml)
+docker run -d --restart unless-stopped --name ethlambda_0 --network host \
+  -v $GENESIS:/config -v $DATA/ethlambda_0:/data \
+  $IMAGE \
+  --custom-network-config-dir /config \
+  --gossipsub-port 9001 --node-id ethlambda_0 \
+  --node-key /config/ethlambda_0.key \
+  --metrics-address 0.0.0.0 --metrics-port 8081
+


Missing --api-port causes port conflicts for multi-node runs

The example docker run command omits --api-port. When multiple ethlambda nodes run on the same host (as shown with ethlambda_0 through ethlambda_3), they all default to --api-port 5052. Only one node can bind to that port; the rest fail silently or refuse connections.

This directly affects the rolling restart procedure, where CHECKPOINT_SOURCE_PORT=5052 is used as the checkpoint URL port — if only one node is actually on 5052, the choice of source node is constrained but undocumented.

Consider adding --api-port explicitly to the example, matching the per-node port assignment table in validator-config.md (e.g., ethlambda_0 → 5052, ethlambda_1 → 5053, etc.):

docker run -d --restart unless-stopped --name ethlambda_0 --network host \ -v $GENESIS:/config -v $DATA/ethlambda_0:/data \ $IMAGE \ --custom-network-config-dir /config \ --gossipsub-port 9001 --node-id ethlambda_0 \ --node-key /config/ethlambda_0.key \ --api-port 5052 \ --metrics-address 0.0.0.0 --metrics-port 8081 # ethlambda_1 docker run -d --restart unless-stopped --name ethlambda_1 --network host \ -v $GENESIS:/config -v $DATA/ethlambda_1:/data \ $IMAGE \ ... --api-port 5053 \ --metrics-port 8082

Prompt To Fix With AI

This is a comment left during a code review. Path: .claude/skills/devnet-runner/references/long-lived-devnet.md Line: 50-58 Comment: **Missing `--api-port` causes port conflicts for multi-node runs** The example `docker run` command omits `--api-port`. When multiple ethlambda nodes run on the same host (as shown with `ethlambda_0` through `ethlambda_3`), they all default to `--api-port 5052`. Only one node can bind to that port; the rest fail silently or refuse connections. This directly affects the rolling restart procedure, where `CHECKPOINT_SOURCE_PORT=5052` is used as the checkpoint URL port — if only one node is actually on 5052, the choice of source node is constrained but undocumented. Consider adding `--api-port` explicitly to the example, matching the per-node port assignment table in `validator-config.md` (e.g., ethlambda_0 → 5052, ethlambda_1 → 5053, etc.): ```bash docker run -d --restart unless-stopped --name ethlambda_0 --network host \ -v $GENESIS:/config -v $DATA/ethlambda_0:/data \ $IMAGE \ --custom-network-config-dir /config \ --gossipsub-port 9001 --node-id ethlambda_0 \ --node-key /config/ethlambda_0.key \ --api-port 5052 \ --metrics-address 0.0.0.0 --metrics-port 8081 # ethlambda_1 docker run -d --restart unless-stopped --name ethlambda_1 --network host \ -v $GENESIS:/config -v $DATA/ethlambda_1:/data \ $IMAGE \ ... --api-port 5053 \ --metrics-port 8082 ``` How can I resolve this? If you propose a fix, please make it concise.

docs: improve devnet-runner skill

9eec519

greptile-apps bot reviewed Mar 13, 2026

View reviewed changes

pablodeymo and others added 2 commits March 13, 2026 18:59

Merge branch 'main' into update-devnet-restart-skill

302d60a

docs: fix wrong flags in skill

556b200

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: improve devnet-runner skill#226

docs: improve devnet-runner skill#226
MegaRedHand wants to merge 3 commits intomainfrom
update-devnet-restart-skill

MegaRedHand commented Mar 13, 2026

Uh oh!

github-actions bot commented Mar 13, 2026

Uh oh!

github-actions bot commented Mar 13, 2026

Uh oh!

github-actions bot commented Mar 13, 2026

Uh oh!

greptile-apps bot commented Mar 13, 2026

Important Files Changed

Uh oh!

greptile-apps bot Mar 13, 2026

Uh oh!

greptile-apps bot Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	echo "Stopping containers..."
	echo "Stopping and removing containers..."

Conversation

MegaRedHand commented Mar 13, 2026

Uh oh!

github-actions bot commented Mar 13, 2026

🤖 Kimi Code Review

Review Summary

Issues Found

Positive Changes

Suggestions

Uh oh!

github-actions bot commented Mar 13, 2026

🤖 Codex Code Review

Uh oh!

github-actions bot commented Mar 13, 2026

🤖 Claude Code Review

PR #226 Review: docs: improve devnet-runner skill

Broken Reference: Default Image Tags Are Now Missing

Flag Inconsistency in long-lived-devnet.md

Missing --api-port in Long-Lived Devnet Examples

checkpoint-sync.md Deleted Without Replacement for --restart-client

Script Improvement is Sound

Minor: × → x in Timing Formulas

Summary

Uh oh!

greptile-apps bot commented Mar 13, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PR #226 Review: `docs: improve devnet-runner skill`

Flag Inconsistency in `long-lived-devnet.md`

Missing `--api-port` in Long-Lived Devnet Examples

`checkpoint-sync.md` Deleted Without Replacement for `--restart-client`

Minor: `×` → `x` in Timing Formulas