
Rewrite bk job log with Parquet-backed reads, follow mode, and URL input#720

Open
mekenthompson wants to merge 4 commits into buildkite:main from mekenthompson:ken/rewrite-job-log

Conversation

@mekenthompson

Summary

Rewrites bk job log (aliased as bk logs) on top of the buildkite-logs Parquet library, bringing the MCP server's log capabilities to the CLI. This matters now because LLM-based tools increasingly reach for CLI commands when an MCP server isn't explicitly configured -- and bk logs is about to become a dependency for official Buildkite agentic skills shipping shortly.

The command is modeled after kubectl logs, docker logs, fly logs, and railway logs while handling Buildkite-specific realities: step keys, parallel job matrices, grouped log sections, and the copy-paste-a-URL-from-Slack workflow that CI debugging actually starts with.

Smart defaults mean zero flags for the common case:

  • Pipeline and build resolve automatically from the git repo and current branch
  • Single-job builds auto-select the job; multi-job builds show a picker
  • Running jobs auto-follow in a TTY (with a stderr notice); finished jobs dump through a pager
  • Color and pager disabled automatically when piped
  • Spinner and interactive prompts suppressed in non-TTY

Flags are opt-in for power use cases: --tail N, --follow, --since/--until, --seek/--limit, --step, --group, --json, --timestamps. Designed to compose with standard Unix tools:

bk logs -f | grep -i "error\|panic"           # live search
bk logs --json -n 100 | jq '.content'          # structured extraction
bk logs --since 5m | tail -20                   # recent output
bk logs <slack-url> -n 50                       # paste and go

Job to be done

A developer's build just failed. They got a Slack notification with a Buildkite URL. They want to see what went wrong without leaving their terminal, without copy-pasting UUIDs, and without downloading a 10MB log just to look at the last 20 lines.

What changed

Buildkite URL as input -- Copy a URL from Slack or the web UI, paste it as the argument. bk logs https://buildkite.com/org/pipe/builds/123#job-id extracts everything. Build-only URLs open the job picker. Slack's <angle-bracket> wrapping is stripped automatically.

Follow mode -- bk logs -f polls every 2s, streams new lines as they appear, and exits when the job reaches a terminal state. When you run bk logs with no flags on a running job in a TTY, it auto-follows and tells you on stderr.

Tail -- bk logs -n 50 shows the last 50 lines without downloading the full log. Combines with --follow (show last N then stream) and --since (last N lines within a time window).

Time filtering -- --since 5m and --until <RFC3339> filter by timestamp. Works across all modes.

Parallel step disambiguation -- --step test on a build with parallelism: 5 now shows a picker with parallel indices instead of silently returning the first match.

JSON output -- --json emits JSONL. Old --yaml/--text/-o flags removed (they were inherited from OutputFlags and silently ignored).

Typed errors -- Flag conflicts exit 2 with "Validation Error:". Missing jobs/builds exit 4 with "Not Found:" and suggestions. API failures exit 3 with status-code-specific messages.

Bug fix -- --follow --tail N on a job with zero log output crashed on SeekToRow(0) against an empty Parquet file. Fixed with a row count guard.

Use cases tested against live Buildkite builds

  • Paste a full job URL from the web UI
  • Paste a build-only URL, pick job interactively
  • Paste a Slack-wrapped <URL>
  • --step build on a multi-step pipeline
  • --step nonexistent (exit 4, actionable error)
  • -n 5 on a finished job
  • -n 3 -f on a running job (tail then stream)
  • -f on a finished job (dump log, exit in <2s)
  • -f on a running job (stream lines every 2s, exit when done)
  • --json | jq '.content'
  • --json --since <timestamp> | jq -r '@tsv'
  • --since 1h on a build from days ago (empty, exit 0)
  • --since <mid-build-timestamp> -n 3
  • --seek 100 --limit 5
  • --timestamps (RFC3339 prefix)
  • Pipe to grep (no pager, no color)
  • -n 1000 when log has 37 lines (shows all 37)
  • Wrong pipeline (exit 3, "404 No pipeline found")
  • Wrong build number (exit 3, "404 Not Found")
  • Nonexistent job UUID in URL (exit 3, "job not found")
  • --yaml flag (rejected, suggests --tail)
  • URL + --pipeline (exit 2, "cannot use --pipeline with a URL")

Edge cases handled

  • Empty job (0 log rows): "No log output for this job." exit 0
  • Follow + tail on empty job: no crash, polls until output appears
  • Uppercase UUIDs in URL fragment
  • URLs with query params (rejected, won't false-match)
  • URLs with trailing slashes or extra path segments (rejected)
  • Empty URL fragments (rejected)
  • Double-pasted URLs (rejected)
  • Markdown-wrapped URLs (rejected)
  • --timestamps with --json (JSON always includes timestamps, flag is a no-op but doesn't error)
  • Multiple bk;t= markers in a single line (all stripped)
  • Follow mode: tolerates up to 10 consecutive API errors before giving up
  • Follow mode: Ctrl-C exits cleanly (exit 0, no error)
  • --no-input with multiple jobs and no job ID: clear error instead of hanging on a prompt

Test plan

  • go test ./cmd/job/ -- 99 tests, all passing
  • go test ./... -- full suite green
  • go build . -- compiles, --help output correct
  • mise run format -- clean
  • mise run lint -- 0 issues
  • Live tested against competitor-intelligence/starter-pipeline and competitor-intelligence/competitor-intelligence-report builds
  • Triggered builds with slow output (30 lines over 60s) to verify follow mode streams in real time
  • Verified pager skipped when piped, auto-follow skipped for finished jobs

🤖 Generated with Claude Code

mekenthompson and others added 2 commits March 25, 2026 23:27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… and typed errors

The old `bk job log` fetched the entire log via REST and dumped it through
a pager. Fine for small jobs, useless for a 50,000-line test suite failure
at 2am. This rewrites the command on top of the buildkite-logs library
(same backend as the MCP server), which downloads logs once, converts to
Parquet, and caches locally for fast columnar reads.

This brings feature parity between the CLI and the MCP server for log
access -- increasingly important as LLM-based tools bias toward CLI
commands when MCP isn't explicitly configured. This will also be a
dependency for official Buildkite agentic skills shipping shortly.

What changed:

- Read/tail/follow modes: full log with pager, --tail N for last N lines,
  --follow polls every 2s for running jobs and exits when the job finishes.
  Auto-follow when TTY + running job + no explicit flags.

- Buildkite URL input: paste a URL from the web UI or Slack and it extracts
  org/pipeline/build/job. Handles <angle-bracket> Slack wrapping.
  Build-only URLs (no #fragment) fall through to the job picker.

- Step key resolution with parallel matrix support: --step test picks
  the job by pipeline.yml key. When multiple parallel jobs match the same
  key, shows the interactive picker instead of silently returning the first.

- Time filtering: --since 5m, --until 2026-01-15T10:00:00Z, or both.
  Works with tail, read, and follow modes. Duration values pin to
  invocation time so filtering is deterministic across the log.

- JSON output: --json emits one JSON object per line (JSONL) with
  row_number, timestamp, content, and group. Replaces the old OutputFlags
  embed that exposed --yaml/--text/--output flags which silently did nothing.

- Typed errors: all user-facing errors now use the CLI's error type system.
  Flag conflicts exit 2 (validation), missing resources exit 4 (not found),
  API failures exit 3 with status-code-specific messages and suggestions.

- Group filtering: --group "Running tests" shows only log lines within
  a Buildkite --- group section.

- Pager integration: full-log reads go through less -R (respects PAGER env,
  --no-pager, and config). Tail, follow, and JSON skip the pager. Non-TTY
  disables pager, color, auto-follow, and the spinner.

Bug fix: follow mode with --tail on a job with 0 log rows crashed because
SeekToRow(0) failed on an empty Parquet file. Added a row count guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mekenthompson mekenthompson requested review from a team as code owners March 25, 2026 12:28
Comment on lines +69 to +72
$ bk logs https://buildkite.com/my-org/my-pipeline/builds/123#0190046e-e199-453b-a302-a21a4d649d31

# Build URL without job fragment (opens job picker)
$ bk logs https://buildkite.com/my-org/my-pipeline/builds/123
Contributor

I assume bk job log is preferred?

Author

Good catch. I want bk logs to be the first-class command, same as kubectl logs / docker logs / fly logs. bk job log stays for compatibility but bk logs is what we promote. Switching all help examples to use bk logs.

}
}

func (c *LogCmd) validateFlags() error {
Contributor

Should this function check that --group and --seek have not been used together?

if c.Seek >= 0 && c.Group != "" {...}

Author

You're right, --seek silently wins and --group gets dropped. Adding a validation error. We could compose them but it's not clear what "seek within a group" means, and nobody's asked for it.

Comment on lines +736 to +738
Content: "hello",
Timestamp: 1000,
RowNumber: 0,
Contributor

This is a test to strip out ANSI but the content contains no ANSI

Author

Yep, test passes trivially with no ANSI in the input. Updating to include actual escape codes so CleanContent(true) is exercised.

cmd/job/log.go Outdated
Content: strings.TrimRight(entry.CleanContent(true), "\n"),
Group: entry.Group,
}
data, _ := json.Marshal(obj)
Contributor

Should we do something with the error here? Maybe in debug mode at least?

Author

Can't actually fail with these types (string + int64), but swallowing the error reads wrong. Adding an early return with a stderr warning.

func TestBuildJobLabelsParallelIndex(t *testing.T) {
t.Parallel()

idx0, idx1, idx2 := 0, 1, 2
Contributor

What do these do as they're ignored later?

Author

Dead code, leftover from an earlier approach. Deleted.

mekenthompson and others added 2 commits March 28, 2026 09:36
…x tests

- Use `bk logs` consistently in help examples (first-class command,
  `bk job log` kept for compatibility)
- Add --seek/--group mutual exclusivity check to validateFlags()
- Fix ANSI strip test to include actual escape codes in input
- Handle json.Marshal error with stderr warning instead of swallowing
- Remove unused idx0/idx1/idx2 variables from parallel index test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
go.sum resolved via go mod tidy. Re-exported isTTY as IsTTY in
internal/io/pager.go since it was unexported by an upstream change
but is needed by cmd/job/log.go for auto-follow TTY detection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@mipearson mipearson left a comment


Comments courtesy the code-review skill in amp (except the one about the PRD), cross-referenced against opus 4.6 & gpt 5.4 to make sure, and de-duped against Ben's findings.

robots on robots on robots.


startRow := max(fileInfo.RowCount-int64(c.Tail), 0)

for entry, iterErr := range reader.SeekToRow(startRow) {

Bug: --tail without time filters ignores --group. This path uses SeekToRow(startRow) which reads raw rows with no group filtering. The time-filter branch above correctly uses FilterByGroupIter (line 562), but this branch doesn't.
bk logs --tail 20 --group "Running tests" will return the last 20 lines of the entire log, not the last 20 lines of the "Running tests" group.

lastSeenRow = fileInfo.RowCount
} else {
// Show everything from the beginning (respecting --since if set)
for entry, iterErr := range reader.ReadEntriesIter() {

Bug: --group filter is not applied in follow mode. Both the initial fetch (lines 613 and 625 use SeekToRow/ReadEntriesIter directly) and the polling loop (line 679 uses SeekToRow) emit all entries regardless of group.
bk logs -f --group "tests" will print all log output, not just entries from the "tests" group.

reqCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()

buildInfo, _, err := f.RestAPIClient.Builds.Get(reqCtx, org, pipeline, build, nil)

Nit: jobState calls Builds.Get which fetches the entire build including all jobs. In follow mode this runs every 2 seconds (line 693). For builds with high parallelism this is a lot of payload to fetch repeatedly just to check one job's state. Not blocking, but worth noting - if go-buildkite ever adds a single-job endpoint, this would be a good candidate.

@@ -0,0 +1,301 @@
# PRD: Enhanced `bk job log` Command

Should this remain in this repository, and if so, where should it live? Probably not the root directory - docs/prds maybe?

