feat(orch): support rootfs resizing when building from template#2406

Draft
tvi wants to merge 1 commit into main from t/build-template

@tvi tvi commented Apr 15, 2026

Previously, building a template from another template (FromTemplate) always reused the source template's build artifacts as-is, with no ability to change the disk size. The rootfs size was baked into the header at base layer creation time and copied identically through every subsequent diff generation. Since resize2fs requires a local ext4 file, there was no way to resize the diff-chain-based virtual block device.

This commit modifies the FromTemplate path in the base build phase to support disk resizing by materializing the source template's rootfs into a new ext4 file:

  1. The source template's rootfs is loaded from storage via NBD (same mechanism as cmd/mount-build-rootfs) and mounted read-only.
  2. A fresh ext4 filesystem is created at the target size.
  3. All files are copied from source to destination via rsync.
  4. Fresh provisioning files (envd, busybox, provision script, system configs) are written to overwrite any stale versions from the source template.
  5. The ext4 is shrunk, resized to the target DiskSizeMB, and integrity-checked.
  6. The sandbox is re-provisioned (systemd install via busybox init).
  7. A new snapshot layer is created (systemd boot, pause, upload) -- same flow as buildLayerFromOCI.
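For readers who want a concrete picture of steps 2, 3 and 5, here is a rough sketch of the kind of command sequence involved. It is an illustration under assumptions, not the code in this PR: `resizePlan` and its parameters are hypothetical names, the loop mount is an assumed way to expose the new image, and the sketch only builds the argument lists without running anything (the real commands need root and the e2fsprogs and rsync binaries).

```go
package main

import (
	"fmt"
	"strings"
)

// resizePlan is a hypothetical helper: it returns the commands one might run
// to copy a mounted source rootfs into a fresh ext4 image of diskSizeMB.
// It does not execute anything.
func resizePlan(srcMount, dstMount, image string, diskSizeMB int64) [][]string {
	size := fmt.Sprintf("%dM", diskSizeMB)
	return [][]string{
		{"truncate", "-s", size, image},           // allocate a sparse file at the target size
		{"mkfs.ext4", "-F", image},                // create a fresh ext4 filesystem in it
		{"mount", "-o", "loop", image, dstMount},  // assumed: expose the image via a loop mount
		{"rsync", "-aHAX", "--numeric-ids", srcMount + "/", dstMount + "/"}, // copy all files, keeping links, ACLs, xattrs
		{"umount", dstMount},                      // resize2fs shrinks only unmounted filesystems
		{"e2fsck", "-f", "-p", image},             // forced check, normally required before an offline resize
		{"resize2fs", "-M", image},                // shrink to the minimum size
		{"resize2fs", image, size},                // grow to the target DiskSizeMB
		{"e2fsck", "-f", "-n", image},             // final read-only integrity check
	}
}

func main() {
	for _, cmd := range resizePlan("/mnt/src", "/mnt/dst", "rootfs.ext4", 2048) {
		fmt.Println(strings.Join(cmd, " "))
	}
}
```

The paths and the 2048 MB size in `main` are placeholders chosen for the example.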

The FromTemplate Layer() method now checks the hash-based layer cache (which already includes DiskSizeMB in its hash key) instead of always returning Cached: true. Repeated builds with the same from-template + disk-size combination remain cached.
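To illustrate why including the disk size in the key keeps repeated builds cached while still separating different sizes, here is a minimal sketch. The `layerKey` function and its inputs are hypothetical; the real hash in this repository covers more fields and may be computed differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// layerKey is a hypothetical cache key derived from the source template and
// the requested disk size. A NUL byte separates the fields so that different
// (template, size) pairs cannot produce the same input string.
func layerKey(sourceTemplateID string, diskSizeMB int64) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s\x00%d", sourceTemplateID, diskSizeMB)))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := layerKey("base-template", 1024)
	b := layerKey("base-template", 1024)
	c := layerKey("base-template", 2048)

	fmt.Println("same template, same size:", a == b)
	fmt.Println("same template, new size: ", a == c)
}
```

With a key like this, a second build with the same from-template and disk size hits the cache, and changing only the disk size forces a new layer.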

Structural changes:

  • Extract NBD utilities (Cleaner, BuildDevice, TemplateRootfs, GetNBDDevice, MountNBDDevice) from pkg/sandbox/nbd/testutils into a new production package pkg/sandbox/nbd/nbdutil. The testutils package now re-exports from nbdutil for backward compatibility. Test-only types (ZeroDevice, LoggerOverlay) remain in testutils.

  • Export ProvisioningFiles() from pkg/template/build/core/rootfs to allow both the OCI path (as OCI tar layers) and the new template path (as direct disk writes) to share the same file set. The existing additionalOCILayers() now delegates to ProvisioningFiles() internally.

  • Add template_rootfs.go in the base build phase containing: buildLayerFromTemplate(), materializeTemplateRootfs(), copyFilesRsync(), and writeProvisioningFiles().

return os.RemoveAll(diffCacheDir)
})

flags, err := featureflags.NewClient()

featureflags.NewClient() creates a real LaunchDarkly LDClient (when LAUNCH_DARKLY_API_KEY is set) whose goroutines and connections are never released. featureflags.Client has a Close(ctx context.Context) error method; add a corresponding cleaner.Add step to avoid leaking one LD client per buildLayerFromTemplate invocation.
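A minimal sketch of the suggested cleanup step follows. The `cleaner` and `flagClient` types are hypothetical stand-ins for the PR's Cleaner and featureflags.Client so that the example compiles on its own; only the `Add(func(context.Context) error)` shape and the `Close(ctx)` method are taken from this thread, and the reverse-order execution in `Run` is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// cleaner is a hypothetical stand-in for the PR's Cleaner.
// It is assumed to run registered steps in reverse order.
type cleaner struct {
	steps []func(context.Context) error
}

func (c *cleaner) Add(step func(context.Context) error) {
	c.steps = append(c.steps, step)
}

func (c *cleaner) Run(ctx context.Context) error {
	var errs []error
	for i := len(c.steps) - 1; i >= 0; i-- {
		errs = append(errs, c.steps[i](ctx))
	}
	return errors.Join(errs...) // nil when every step succeeded
}

// flagClient is a hypothetical stand-in for featureflags.Client.
type flagClient struct{ closed bool }

func (f *flagClient) Close(ctx context.Context) error {
	f.closed = true
	return nil
}

// runDemo registers the client's Close right after construction, then runs cleanup.
func runDemo() (closed bool, err error) {
	c := &cleaner{}
	flags := &flagClient{}
	c.Add(func(cleanupCtx context.Context) error {
		return flags.Close(cleanupCtx)
	})
	err = c.Run(context.Background())
	return flags.closed, err
}

func main() {
	closed, err := runDemo()
	fmt.Println("closed:", closed, "err:", err)
}
```

Registering the step immediately after the client is created means the client is released even if a later setup step fails.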

cleaner.Add(func(cleanupCtx context.Context) error {
<-poolClosed


<-poolClosed blocks unconditionally without selecting on cleanupCtx.Done(). If devicePool.Populate is slow to respond to context cancellation, this step hangs indefinitely and the 30-second timeout in Cleaner.Run has no effect on it.
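One way to make the wait respect the cleanup deadline is to select on both channels. This is a sketch, not the PR's code: `waitClosed` is a hypothetical helper, and the 50 ms timeout in the demo stands in for the 30-second limit mentioned above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitClosed blocks until closed is closed or ctx is done, whichever comes first.
func waitClosed(ctx context.Context, closed <-chan struct{}) error {
	select {
	case <-closed:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("waiting for device pool to close: %w", ctx.Err())
	}
}

// demo exercises both outcomes: a channel that never closes and one already closed.
func demo() (stuck, done error) {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	never := make(chan struct{}) // simulates a pool that does not shut down
	stuck = waitClosed(ctx, never)

	closed := make(chan struct{})
	close(closed) // simulates a pool that has already shut down
	done = waitClosed(context.Background(), closed)
	return stuck, done
}

func main() {
	stuck, done := demo()
	fmt.Println("pool never closes:", stuck)
	fmt.Println("pool closed:      ", done)
}
```

Inside the cleanup step this would read `return waitClosed(cleanupCtx, poolClosed)`, so a slow devicePool.Populate surfaces as a timeout error instead of a hang.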

defer span.End()

// We use a separate context for NBD operations to avoid cleanup deadlocks on cancellation
nbdCtx := context.Background()

context.Background() means a cancelled build context (user abort, deadline) will not propagate to mnt.Open or devicePool.Populate, potentially leaving the build appearing stuck. context.WithoutCancel(ctx) would be a better choice here — it preserves trace/value propagation while still isolating from cancellation-induced cleanup deadlocks.
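The difference between the two options can be seen in a small standalone example. It assumes Go 1.21 or newer, where context.WithoutCancel was added; `traceKey` and the `"trace-123"` value are placeholders for whatever trace data the build context carries.

```go
package main

import (
	"context"
	"fmt"
)

// traceKey is a hypothetical context key standing in for trace/span data.
type traceKey struct{}

// demo cancels a parent context and reports what a detached child still sees.
func demo() (value any, parentErr, detachedErr error) {
	base := context.WithValue(context.Background(), traceKey{}, "trace-123")
	parent, cancel := context.WithCancel(base)

	// The detached context keeps the parent's values but not its cancellation.
	detached := context.WithoutCancel(parent)

	cancel()
	return detached.Value(traceKey{}), parent.Err(), detached.Err()
}

func main() {
	value, parentErr, detachedErr := demo()
	fmt.Println("value seen by detached ctx:", value)
	fmt.Println("parent err:  ", parentErr)
	fmt.Println("detached err:", detachedErr)
}
```

Note that a context from context.WithoutCancel also drops the parent's deadline and cancellation, so bounding mnt.Open or devicePool.Populate would still need an explicit timeout layered on top, for example with context.WithTimeout.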

}

// Remove existing file/symlink if present
os.Remove(fullLinkPath)

os.Remove(fullLinkPath) silently discards its error. If the path exists but cannot be removed (e.g. it is a non-empty directory or a permissions issue), the error is swallowed and the subsequent os.Symlink fails with EEXIST, obscuring the real cause. Only os.IsNotExist should be ignored; other errors should be returned.
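A sketch of the suggested handling is shown below. `replaceSymlink` is a hypothetical helper name, and the demo uses a temporary directory rather than the rootfs paths from the PR. It uses errors.Is with os.ErrNotExist, which covers the same case as os.IsNotExist here and also sees through wrapped errors.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// replaceSymlink removes whatever is at linkPath and creates a symlink to target.
// A missing path is fine; any other removal error is returned to the caller.
func replaceSymlink(target, linkPath string) error {
	if err := os.Remove(linkPath); err != nil && !errors.Is(err, os.ErrNotExist) {
		return fmt.Errorf("remove existing %s: %w", linkPath, err)
	}
	if err := os.Symlink(target, linkPath); err != nil {
		return fmt.Errorf("symlink %s -> %s: %w", linkPath, target, err)
	}
	return nil
}

// demo covers three cases: no existing path, an existing symlink,
// and a non-empty directory that cannot be removed.
func demo() (fresh, replaced, blocked error) {
	dir, err := os.MkdirTemp("", "symlink-demo")
	if err != nil {
		return err, err, err
	}
	defer os.RemoveAll(dir)

	link := filepath.Join(dir, "link")
	fresh = replaceSymlink("a", link)    // nothing to remove
	replaced = replaceSymlink("b", link) // existing symlink is replaced

	busy := filepath.Join(dir, "busy")
	if err := os.MkdirAll(filepath.Join(busy, "child"), 0o755); err != nil {
		return fresh, replaced, err
	}
	blocked = replaceSymlink("c", busy) // os.Remove fails on a non-empty directory
	return fresh, replaced, blocked
}

func main() {
	fresh, replaced, blocked := demo()
	fmt.Println("fresh:   ", fresh)
	fmt.Println("replaced:", replaced)
	fmt.Println("blocked: ", blocked)
}
```

In the blocked case the caller now sees the removal error itself rather than a later EEXIST from os.Symlink.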
