Rewrite bk job log with Parquet-backed reads, follow mode, and URL input #720
mekenthompson wants to merge 4 commits into buildkite:main
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… and typed errors

The old `bk job log` fetched the entire log via REST and dumped it through a pager. Fine for small jobs, useless for a 50,000-line test suite failure at 2am. This rewrites the command on top of the buildkite-logs library (same backend as the MCP server), which downloads logs once, converts to Parquet, and caches locally for fast columnar reads.

This brings feature parity between the CLI and the MCP server for log access -- increasingly important as LLM-based tools bias toward CLI commands when MCP isn't explicitly configured. This will also be a dependency for official Buildkite agentic skills shipping shortly.

What changed:
- Read/tail/follow modes: full log with pager, --tail N for last N lines, --follow polls every 2s for running jobs and exits when the job finishes. Auto-follow when TTY + running job + no explicit flags.
- Buildkite URL input: paste a URL from the web UI or Slack and it extracts org/pipeline/build/job. Handles <angle-bracket> Slack wrapping. Build-only URLs (no #fragment) fall through to the job picker.
- Step key resolution with parallel matrix support: --step test picks the job by pipeline.yml key. When multiple parallel jobs match the same key, shows the interactive picker instead of silently returning the first.
- Time filtering: --since 5m, --until 2026-01-15T10:00:00Z, or both. Works with tail, read, and follow modes. Duration values pin to invocation time so filtering is deterministic across the log.
- JSON output: --json emits one JSON object per line (JSONL) with row_number, timestamp, content, and group. Replaces the old OutputFlags embed that exposed --yaml/--text/--output flags which silently did nothing.
- Typed errors: all user-facing errors now use the CLI's error type system. Flag conflicts exit 2 (validation), missing resources exit 4 (not found), API failures exit 3 with status-code-specific messages and suggestions.
- Group filtering: --group "Running tests" shows only log lines within a Buildkite --- group section.
- Pager integration: full-log reads go through less -R (respects PAGER env, --no-pager, and config). Tail, follow, and JSON skip the pager. Non-TTY disables pager, color, auto-follow, and the spinner.

Bug fix: follow mode with --tail on a job with 0 log rows crashed because SeekToRow(0) failed on an empty Parquet file. Added a row count guard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
$ bk logs https://buildkite.com/my-org/my-pipeline/builds/123#0190046e-e199-453b-a302-a21a4d649d31

# Build URL without job fragment (opens job picker)
$ bk logs https://buildkite.com/my-org/my-pipeline/builds/123
I assume `bk job log` is preferred?
Good catch. I want bk logs to be the first-class command, same as kubectl logs / docker logs / fly logs. bk job log stays for compatibility but bk logs is what we promote. Switching all help examples to use bk logs.
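For the two URL forms in the help examples above, the extraction step might look something like this. It is a sketch under assumptions: `jobRef` and `parseBuildURL` are hypothetical names, and the path layout `/<org>/<pipeline>/builds/<number>` is inferred from the example URLs rather than from the PR's code.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// jobRef holds the pieces recovered from a Buildkite build URL.
// The type and its field names are hypothetical.
type jobRef struct {
	Org, Pipeline, Build, JobID string
}

// parseBuildURL is a hypothetical helper, not the PR's implementation.
func parseBuildURL(raw string) (jobRef, error) {
	// Slack wraps pasted links in angle brackets: <https://...>
	raw = strings.TrimSpace(raw)
	raw = strings.TrimSuffix(strings.TrimPrefix(raw, "<"), ">")

	u, err := url.Parse(raw)
	if err != nil {
		return jobRef{}, err
	}
	// Assumed path layout: /<org>/<pipeline>/builds/<number>
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(parts) != 4 || parts[2] != "builds" {
		return jobRef{}, errors.New("not a Buildkite build URL")
	}
	// An empty JobID means there was no #fragment; a caller would then
	// fall through to the job picker.
	return jobRef{Org: parts[0], Pipeline: parts[1], Build: parts[3], JobID: u.Fragment}, nil
}

func main() {
	ref, err := parseBuildURL("<https://buildkite.com/my-org/my-pipeline/builds/123#0190046e-e199-453b-a302-a21a4d649d31>")
	fmt.Println(ref, err)
}
```

The sketch does not validate the host name or the job ID format; a real command would likely want both.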
}
}

func (c *LogCmd) validateFlags() error {
Should this function check that --group and --seek have not been used together?
if c.Seek >= 0 && c.Group != "" {...}
You're right, --seek silently wins and --group gets dropped. Adding a validation error. We could compose them but it's not clear what "seek within a group" means, and nobody's asked for it.
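The validation agreed on in this thread could be sketched as below. The `LogCmd` struct here carries only the two fields named in the review comment, treating a negative `Seek` as "not set" is an assumption drawn from the suggested condition, and a plain `errors.New` stands in for the CLI's typed validation error.

```go
package main

import (
	"errors"
	"fmt"
)

// LogCmd mirrors only the two fields used in the review comment.
// Seek < 0 meaning "flag not set" is an assumption based on the
// suggested condition `c.Seek >= 0`.
type LogCmd struct {
	Seek  int
	Group string
}

// validateFlags rejects flag combinations that cannot be honoured together.
// The real command would return its typed validation error (exit 2);
// errors.New is a stand-in here.
func (c *LogCmd) validateFlags() error {
	if c.Seek >= 0 && c.Group != "" {
		return errors.New("cannot use --seek with --group")
	}
	return nil
}

func main() {
	fmt.Println((&LogCmd{Seek: 100, Group: "Running tests"}).validateFlags())
	fmt.Println((&LogCmd{Seek: -1, Group: "Running tests"}).validateFlags())
}
```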
cmd/job/log_test.go
Outdated
Content: "hello",
Timestamp: 1000,
RowNumber: 0,
This is a test to strip out ANSI but the content contains no ANSI
Yep, test passes trivially with no ANSI in the input. Updating to include actual escape codes so CleanContent(true) is exercised.
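To show why the input matters, here is a rough illustration of a strip check that only means something when escape codes are present. `stripSGR` is a hypothetical regex-based stand-in covering colour (SGR) sequences only; it is not the library's `CleanContent`, whose exact behaviour is not shown in this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// sgr matches ANSI SGR (colour/style) sequences such as "\x1b[31m".
var sgr = regexp.MustCompile(`\x1b\[[0-9;]*m`)

// stripSGR is a hypothetical stand-in for an ANSI-stripping routine.
func stripSGR(s string) string {
	return sgr.ReplaceAllString(s, "")
}

func main() {
	plain := "hello"                    // passes trivially: nothing to strip
	coloured := "\x1b[31mhello\x1b[0m" // actually exercises the stripping path

	fmt.Printf("%q %q\n", stripSGR(plain), stripSGR(coloured))
}
```

With the plain input, a no-op implementation would pass the same assertion, which is the weakness the reviewer pointed out.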
cmd/job/log.go
Outdated
Content: strings.TrimRight(entry.CleanContent(true), "\n"),
Group: entry.Group,
}
data, _ := json.Marshal(obj)
Should we do something with the error here? Maybe in debug mode at least?
Can't actually fail with these types (string + int64), but swallowing the error reads wrong. Adding an early return with a stderr warning.
cmd/job/log_test.go
Outdated
func TestBuildJobLabelsParallelIndex(t *testing.T) {
	t.Parallel()

	idx0, idx1, idx2 := 0, 1, 2
What do these do as they're ignored later?
Dead code, leftover from an earlier approach. Deleted.
…x tests

- Use `bk logs` consistently in help examples (first-class command, `bk job log` kept for compatibility)
- Add --seek/--group mutual exclusivity check to validateFlags()
- Fix ANSI strip test to include actual escape codes in input
- Handle json.Marshal error with stderr warning instead of swallowing
- Remove unused idx0/idx1/idx2 variables from parallel index test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
go.sum resolved via go mod tidy. Re-exported isTTY as IsTTY in internal/io/pager.go since it was unexported by an upstream change but is needed by cmd/job/log.go for auto-follow TTY detection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mipearson left a comment
Comments courtesy the code-review skill in amp (except the one about the PRD), cross-referenced against opus 4.6 & gpt 5.4 to make sure, and de-duped against Ben's findings.
robots on robots on robots.
startRow := max(fileInfo.RowCount-int64(c.Tail), 0)

for entry, iterErr := range reader.SeekToRow(startRow) {
Bug: --tail without time filters ignores --group. This path uses SeekToRow(startRow) which reads raw rows with no group filtering. The time-filter branch above correctly uses FilterByGroupIter (line 562), but this branch doesn't.
bk logs --tail 20 --group "Running tests" will return the last 20 lines of the entire log, not the last 20 lines of the "Running tests" group.
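One way to get the behaviour the comment expects is to filter by group first and only then take the last N rows. The sketch below works on an in-memory slice with a hypothetical `entry` type and exact-match group comparison; the real reader is iterator-based, and its group matching rules may differ.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a parsed log row.
type entry struct {
	RowNumber int64
	Group     string
	Content   string
}

// tailInGroup returns the last n entries that belong to group.
// An empty group matches every entry. Exact string matching is an assumption.
func tailInGroup(entries []entry, group string, n int) []entry {
	var matched []entry
	for _, e := range entries {
		if group == "" || e.Group == group {
			matched = append(matched, e)
		}
	}
	// Clamp to zero when fewer than n rows match, as the original
	// startRow computation does for the whole file.
	start := max(len(matched)-n, 0)
	return matched[start:]
}

func main() {
	entries := []entry{
		{0, "Setup", "cloning"},
		{1, "Running tests", "test A ok"},
		{2, "Running tests", "test B failed"},
		{3, "Teardown", "cleanup"},
	}
	for _, e := range tailInGroup(entries, "Running tests", 1) {
		fmt.Println(e.RowNumber, e.Content)
	}
}
```

Computing the start row from the file's total row count, as in the reviewed branch, cannot give this result because the last N rows of the file may belong to a different group.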
lastSeenRow = fileInfo.RowCount
} else {
// Show everything from the beginning (respecting --since if set)
for entry, iterErr := range reader.ReadEntriesIter() {
Bug: --group filter is not applied in follow mode. Both the initial fetch (lines 613 and 625 use SeekToRow/ReadEntriesIter directly) and the polling loop (line 679 uses SeekToRow) emit all entries regardless of group.
bk logs -f --group "tests" will print all log output, not just entries from the "tests" group.
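A possible shape for one polling step that respects `--group` is sketched below. It assumes rows arrive in ascending row order and that `lastSeenRow` means "the next row number not yet processed", matching the `lastSeenRow = fileInfo.RowCount` assignment in the diff; the `entry` type and `newEntries` helper are hypothetical.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a parsed log row.
type entry struct {
	RowNumber int64
	Group     string
	Content   string
}

// newEntries returns the rows at or after lastSeenRow that belong to group,
// plus the updated lastSeenRow. Rows outside the group still advance the
// cursor so they are not rescanned on the next poll.
func newEntries(all []entry, lastSeenRow int64, group string) ([]entry, int64) {
	var out []entry
	next := lastSeenRow
	for _, e := range all {
		if e.RowNumber < lastSeenRow {
			continue // already handled in an earlier poll
		}
		next = e.RowNumber + 1
		if group == "" || e.Group == group {
			out = append(out, e)
		}
	}
	return out, next
}

func main() {
	all := []entry{
		{0, "setup", "cloning"},
		{1, "tests", "test A ok"},
		{2, "setup", "late setup line"},
	}
	out, next := newEntries(all, 1, "tests")
	fmt.Println(len(out), next)
}
```

The same filter would need to apply to the initial fetch as well as to each poll, since the comment identifies both paths as unfiltered.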
reqCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()

buildInfo, _, err := f.RestAPIClient.Builds.Get(reqCtx, org, pipeline, build, nil)
Nit: jobState calls Builds.Get which fetches the entire build including all jobs. In follow mode this runs every 2 seconds (line 693). For builds with high parallelism this is a lot of payload to fetch repeatedly just to check one job's state. Not blocking, but worth noting - if go-buildkite ever adds a single-job endpoint, this would be a good candidate.
@@ -0,0 +1,301 @@
# PRD: Enhanced `bk job log` Command
Should this remain in this repository, and if so, where should it live? Probably not the root directory - docs/prds maybe?
Summary
Rewrites `bk job log` (aliased as `bk logs`) on top of the `buildkite-logs` Parquet library, bringing the MCP server's log capabilities to the CLI. This matters now because LLM-based tools increasingly reach for CLI commands when an MCP server isn't explicitly configured -- and `bk logs` is about to become a dependency for official Buildkite agentic skills shipping shortly.

The command is modeled after `kubectl logs`, `docker logs`, `fly logs`, and `railway logs` while handling Buildkite-specific realities: step keys, parallel job matrices, grouped log sections, and the copy-paste-a-URL-from-Slack workflow that CI debugging actually starts with.

Smart defaults mean zero flags for the common case.

Flags are opt-in for power use cases: `--tail N`, `--follow`, `--since`/`--until`, `--seek`/`--limit`, `--step`, `--group`, `--json`, `--timestamps`. Designed to compose with standard Unix tools.

Job to be done
A developer's build just failed. They got a Slack notification with a Buildkite URL. They want to see what went wrong without leaving their terminal, without copy-pasting UUIDs, and without downloading a 10MB log just to look at the last 20 lines.
What changed
Buildkite URL as input -- Copy a URL from Slack or the web UI, paste it as the argument. `bk logs https://buildkite.com/org/pipe/builds/123#job-id` extracts everything. Build-only URLs open the job picker. Slack's `<angle-bracket>` wrapping is stripped automatically.

Follow mode -- `bk logs -f` polls every 2s, streams new lines as they appear, and exits when the job reaches a terminal state. When you run `bk logs` with no flags on a running job in a TTY, it auto-follows and tells you on stderr.

Tail -- `bk logs -n 50` shows the last 50 lines without downloading the full log. Combines with `--follow` (show last N then stream) and `--since` (last N lines within a time window).

Time filtering -- `--since 5m` and `--until <RFC3339>` filter by timestamp. Works across all modes.

Parallel step disambiguation -- `--step test` on a build with `parallelism: 5` now shows a picker with parallel indices instead of silently returning the first match.

JSON output -- `--json` emits JSONL. Old `--yaml`/`--text`/`-o` flags removed (they were inherited from OutputFlags and silently ignored).

Typed errors -- Flag conflicts exit 2 with "Validation Error:". Missing jobs/builds exit 4 with "Not Found:" and suggestions. API failures exit 3 with status-code-specific messages.

Bug fix -- `--follow --tail N` on a job with zero log output crashed on `SeekToRow(0)` against an empty Parquet file. Fixed with a row count guard.

Use cases tested against live Buildkite builds
- `<URL>`
- `--step build` on a multi-step pipeline
- `--step nonexistent` (exit 4, actionable error)
- `-n 5` on a finished job
- `-n 3 -f` on a running job (tail then stream)
- `-f` on a finished job (dump log, exit in <2s)
- `-f` on a running job (stream lines every 2s, exit when done)
- `--json | jq '.content'`
- `--json --since <timestamp> | jq -r '@tsv'`
- `--since 1h` on a build from days ago (empty, exit 0)
- `--since <mid-build-timestamp> -n 3`
- `--seek 100 --limit 5`
- `--timestamps` (RFC3339 prefix)
- `grep` (no pager, no color)
- `-n 1000` when log has 37 lines (shows all 37)
- `--yaml` flag (rejected, suggests `--tail`)
- `--pipeline` (exit 2, "cannot use --pipeline with a URL")

Edge cases handled
- `--timestamps` with `--json` (JSON always includes timestamps, flag is a no-op but doesn't error)
- `bk;t=` markers in a single line (all stripped)
- `--no-input` with multiple jobs and no job ID: clear error instead of hanging on a prompt

Test plan
- `go test ./cmd/job/` -- 99 tests, all passing
- `go test ./...` -- full suite green
- `go build .` -- compiles, `--help` output correct
- `mise run format` -- clean
- `mise run lint` -- 0 issues
- `competitor-intelligence/starter-pipeline` and `competitor-intelligence/competitor-intelligence-report` builds

🤖 Generated with Claude Code