Skip to content

perf(sandbox): reduce memory dirtying from journald and envd logging#2423

Draft
ValentaTomas wants to merge 1 commit intomainfrom
perf/reduce-sandbox-memory-dirtying
Draft

perf(sandbox): reduce memory dirtying from journald and envd logging#2423
ValentaTomas wants to merge 1 commit intomainfrom
perf/reduce-sandbox-memory-dirtying

Conversation

@ValentaTomas
Copy link
Copy Markdown
Member

Summary

  • Add journald Storage=none drop-in inside guest VMs to prevent journal writes from dirtying memory pages during pause-resume cycles
  • Remove stdout writer from envd in FC mode — debug logs still ship via HTTP exporter to the orchestrator, but never touch journald/disk inside the VM
  • Add measure-memory-dirtying.sh benchmark script for measuring diff sizes across pause-resume cycles

Motivation

During frequent pause-resume cycles, journald and envd stdout were constantly writing to memory, dirtying pages that then had to be captured in every snapshot diff layer. This increased layer sizes and slowed snapshot restore.

Changes

  • journald.conf.tpl (new): Storage=none, MaxLevelConsole=warning, MaxLevelKMsg=warning — system errors still reach the kernel console, but nothing is stored in journal files
  • logger.go: In FC mode, envd writes only to the HTTP exporter (debug level preserved). In non-FC mode, stdout + debug unchanged
  • envd version: Bumped to 0.5.15
  • rootfs_test.go: Updated file count and added assertions for journald conf
  • measure-memory-dirtying.sh: Benchmark script that chains build → pause-resume cycles and prints memfile/rootfs diff sizes

Test plan

  • Unit test passes (TestAdditionalOCILayers)
  • Run benchmark on staging machine with GCS access to measure before/after diff sizes
  • Verify envd debug logs still arrive at orchestrator via HTTP exporter

- Add journald.conf drop-in with Storage=none to prevent journal
  writes from dirtying guest memory pages during pause-resume cycles
- Remove stdout writer from envd in FC mode — debug logs ship via
  HTTP exporter only, never hitting journald/disk inside the VM
- Add measure-memory-dirtying.sh script to benchmark diff sizes
  across pause-resume cycles with the orchestrator CLI tools
- Bump envd version to 0.5.15
@cursor
Copy link
Copy Markdown

cursor bot commented Apr 17, 2026

PR Summary

Medium Risk
Changes guest logging behavior and journald configuration, which could reduce local observability or mask early-boot/logging failures if the HTTP exporter is unavailable. The rest is additive/testing-only, but validating log delivery in FC mode is important.

Overview
Reduces pause/resume snapshot diff sizes by disabling persistent journald storage in the guest (via a journald drop-in) and by stopping envd from writing logs to stdout in Firecracker mode so debug logs only go through the HTTP exporter; it also adds a helper script to benchmark dirtying across repeated pause/resume cycles, bumps the envd version, and updates rootfs tests to assert the new journald config is included.

Reviewed by Cursor Bugbot for commit 0576ecd. Bugbot is set up for automated code reviews on this repo. Configure here.

// HTTP exporter only — stdout goes to journald which dirties guest memory pages.
exporters = append(exporters, exporter.NewHTTPLogsExporter(ctx, isNotFC, mmdsChan))
}
level = zerolog.DebugLevel
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var level zerolog.Level initializes to 0, which equals zerolog.DebugLevel. The unconditional level = zerolog.DebugLevel outside the if-else is dead code — it reassigns the same zero value. If the intent is per-branch level selection in the future, the assignment belongs inside each branch; as written it just adds noise.

#
# Requires: root, KVM, Docker, NBD, hugepages
set -euo pipefail

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The paths ./packages/orchestrator/cmd/create-build and ./packages/orchestrator/cmd/resume-build are relative to the repo root, but the usage comment says sudo ./measure-memory-dirtying.sh — implying the caller is in the script's own directory, where those paths do not exist. The script should either document that it must be run from the repo root, or resolve it dynamically.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0576ecd549

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +15 to +16
CREATE_BUILD="go run ./packages/orchestrator/cmd/create-build"
RESUME_BUILD="go run ./packages/orchestrator/cmd/resume-build"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Resolve go-run paths from the script directory

The script’s documented usage (sudo ./measure-memory-dirtying.sh) implies running it from packages/orchestrator/cmd/resume-build, but these hard-coded commands use repo-root-relative paths, so invocation from the script directory resolves to non-existent paths and the benchmark fails before step 1 starts. This makes the new benchmark unusable unless callers happen to run it from the repository root.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logger.go change removes stdout from envd in FC mode — worth a human verifying that the HTTP exporter reliably captures all debug logs before the stdout path is dropped entirely.

Extended reasoning...

Overview

This PR touches five files: a new journald drop-in config template, a logger change removing stdout from envd in FC (Firecracker) mode, a version bump, a test update, and a new developer benchmark script.

Security risks

No security-sensitive code paths are touched. The changes are scoped to logging configuration inside guest VMs and a developer utility script.

Level of scrutiny

The journald config, version bump, test update, and benchmark script are straightforward. The production-relevant change is in logger.go: in FC mode, envd previously wrote to both the HTTP exporter and stdout; now it writes only to the HTTP exporter. stdout inside a guest VM goes through journald, so removing it is the correct mechanism to prevent memory dirtying — but it does reduce log visibility if the HTTP exporter path ever fails silently. The test plan item for verifying HTTP exporter delivery is still unchecked.

Other factors

The two bugs found are nits: incomplete test assertions for the journald config (missing [Journal] header check) and a misleading Usage comment in the benchmark script. Neither is blocking. The overall approach is sound and well-motivated.

Comment on lines 103 to 108
assert.Equal(t, disabledContent, actualFiles["etc/systemd/system/systemd-networkd.service.d/override.conf"])

journaldConf := strings.TrimSpace(actualFiles["etc/systemd/journald.conf.d/e2b.conf"])
assert.Contains(t, journaldConf, "Storage=none")
assert.Contains(t, journaldConf, "MaxLevelConsole=warning")
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The journald conf test only checks Storage=none and MaxLevelConsole=warning, omitting the [Journal] section header and ForwardToSyslog=no. Without the [Journal] header, systemd silently ignores the entire drop-in file — a regression that accidentally removes it would make Storage=none a complete no-op in production while the test still passes. Consider using assert.Equal to match the full config content, as done for disabledContent elsewhere in this test.

Extended reasoning...

The new test assertions cover only two of the five settings produced by journald.conf.tpl: Storage=none and MaxLevelConsole=warning. Three settings go unchecked: [Journal] section header, ForwardToSyslog=no, and MaxLevelKMsg=warning.

The [Journal] section header is the most critical omission. systemd's drop-in parser uses an INI-style format where every directive must appear under a named section header. A .conf file that contains only bare key=value lines with no [Journal] header will be accepted by the parser without error but all its directives will be silently ignored. This means that if a future refactor accidentally removed the [Journal] line from journald.conf.tpl, the entire optimization (Storage=none, ForwardToSyslog=no, etc.) would become a no-op in production — journald would continue writing to disk and dirtying guest memory pages during every pause-resume cycle.

The current template (journald.conf.tpl) does correctly include [Journal] on line 4 and ForwardToSyslog=no on line 8, so production behaviour is correct today. This is purely a test coverage weakness: the assertions are not strict enough to catch a regression that removes the section header.

Step-by-step proof of the regression scenario:

  1. A future change accidentally removes [Journal] from journald.conf.tpl, leaving only bare key=value lines.
  2. The template still renders a file containing Storage=none and MaxLevelConsole=warning as literal strings.
  3. The test calls assert.Contains(t, journaldConf, "Storage=none") — this passes, because the string is present.
  4. The test calls assert.Contains(t, journaldConf, "MaxLevelConsole=warning") — this also passes.
  5. The test suite is green. But the drop-in file has no [Journal] section header.
  6. Inside the guest VM, systemd reads the file and discards all directives because there is no enclosing section. journald falls back to its default Storage=auto behaviour and continues writing to the journal, dirtying memory pages on every pause-resume cycle.

The fix is straightforward: either add assert.Contains(t, journaldConf, "[Journal]") (and for ForwardToSyslog=no), or use assert.Equal to match the full expected content — which is the pattern already established by the disabledContent check immediately above these lines.

Comment on lines +14 to +16
STORAGE="${1:-.local-build}"
CREATE_BUILD="go run ./packages/orchestrator/cmd/create-build"
RESUME_BUILD="go run ./packages/orchestrator/cmd/resume-build"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The go run paths in measure-memory-dirtying.sh are relative to CWD and only work when invoked from the repo root, but the Usage comment (sudo ./measure-memory-dirtying.sh) implies running from the script's own directory where those paths don't exist. Either update the Usage comment to show repo-root invocation (e.g. sudo ./packages/orchestrator/cmd/resume-build/measure-memory-dirtying.sh) or add a cd $(dirname "$0")/../../.. at the top to make the script location-independent.

Extended reasoning...

The script at packages/orchestrator/cmd/resume-build/measure-memory-dirtying.sh defines:

CREATE_BUILD="go run ./packages/orchestrator/cmd/create-build"
RESUME_BUILD="go run ./packages/orchestrator/cmd/resume-build"

These paths are relative to the current working directory at the time the script is invoked — not relative to the script file itself. The Usage comment says:

sudo ./measure-memory-dirtying.sh [storage-path]

The ./ prefix in ./measure-memory-dirtying.sh conventionally implies the user is already in the script's directory (packages/orchestrator/cmd/resume-build/). If a developer follows that convention literally:

Step-by-step proof of failure:

  1. Developer runs: cd packages/orchestrator/cmd/resume-build && sudo ./measure-memory-dirtying.sh
  2. CWD is now packages/orchestrator/cmd/resume-build/
  3. CREATE_BUILD expands to: go run ./packages/orchestrator/cmd/create-build
  4. Go looks for packages/orchestrator/cmd/resume-build/packages/orchestrator/cmd/create-build — which does not exist
  5. go run fails with a "no Go files" or "cannot find package" error

The RESUME_BUILD variable is equally broken under this scenario: go run ./packages/orchestrator/cmd/resume-build would resolve to packages/orchestrator/cmd/resume-build/packages/orchestrator/cmd/resume-build, which also does not exist.

The script only works correctly when invoked from the repo root (e.g. sudo ./packages/orchestrator/cmd/resume-build/measure-memory-dirtying.sh), but the Usage comment does not document this requirement.

The fix is either: (a) update the Usage comment to require repo-root invocation, or (b) add cd "$(dirname "$0")/../../.." && ... at the top of the script to resolve all paths relative to the repo root regardless of where the caller is located. Option (b) is more robust for a developer utility.

This is a developer-facing benchmark script, not production code, so the severity is nit — it will cause an immediately obvious failure rather than silent data corruption.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants