Monitor datasource recency and alert on delays #670
📝 Walkthrough

This change introduces monitoring for Cortex Nova datasource reconciliation status by adding a new Prometheus gauge metric.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 1
🧹 Nitpick comments (1)

internal/knowledge/kpis/plugins/deployment/datasource_state.go (1)

Line 124: Reduce scrape-path log verbosity. This line logs at Info on every `Collect`, which can create avoidable log volume. Consider `V(1)` (or a debug-level equivalent) here.

Suggested patch:

```diff
- datasourceStateKPILogger.Info("Collected datasource state metrics", "count", len(datasources))
+ datasourceStateKPILogger.V(1).Info("Collected datasource state metrics", "count", len(datasources))
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@helm/bundles/cortex-nova/alerts/nova.alerts.yaml`:
- Around lines 612-623: The alert `CortexNovaExistingDatasourcesLackingBehind` has the wrong sign in its expression and an inaccurate name. Change the PromQL expression `cortex_datasource_seconds_until_reconcile{queued="true",domain="nova"} < 600` to use `< -600` so it only fires for overdue (negative) reconcile times, and rename the alert to `CortexNovaExistingDatasourcesLaggingBehind` (updating the summary annotation text from "Lacking behind" to "Lagging behind") to reflect the correct semantics. Update all occurrences of the alert name and the summary annotation.
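Assuming the metric reports a negative value once a datasource's scheduled reconcile time has passed, the corrected rule could look like the following sketch (the `for` duration and annotation wording are illustrative, not taken from the repository):

```yaml
- alert: CortexNovaExistingDatasourcesLaggingBehind
  # Fires only when the next reconcile time was missed by more than 10 minutes
  # (the metric goes negative once the scheduled reconcile time has passed).
  expr: cortex_datasource_seconds_until_reconcile{queued="true",domain="nova"} < -600
  for: 5m
  annotations:
    summary: Existing datasources are lagging behind their reconcile schedule.
```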
ℹ️ Review info
Configuration used: `.coderabbit.yaml` | Review profile: CHILL | Plan: Pro
Run ID: d302d00d-7caf-498c-be98-e4b4f82882e2
📒 Files selected for processing (3)
- helm/bundles/cortex-nova/alerts/nova.alerts.yaml
- internal/knowledge/kpis/plugins/deployment/datasource_state.go
- internal/knowledge/kpis/plugins/deployment/datasource_state_test.go
Force-pushed c3e78ac to 95d4bf4 (Compare)
Test Coverage Report 📊: 68.4%
🧹 Nitpick comments (1)

internal/knowledge/kpis/plugins/deployment/datasource_state.go (1)

Line 124: Avoid info-level logging on every scrape path execution. `Collect` is called per scrape; this `Info` log can become high-volume noise. Prefer `V(1).Info(...)` (or remove).

♻️ Proposed change:

```diff
- datasourceStateKPILogger.Info("Collected datasource state metrics", "count", len(datasources))
+ datasourceStateKPILogger.V(1).Info("Collected datasource state metrics", "count", len(datasources))
```
📒 Files selected for processing (3)
- helm/bundles/cortex-nova/alerts/nova.alerts.yaml
- internal/knowledge/kpis/plugins/deployment/datasource_state.go
- internal/knowledge/kpis/plugins/deployment/datasource_state_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/knowledge/kpis/plugins/deployment/datasource_state_test.go
- helm/bundles/cortex-nova/alerts/nova.alerts.yaml
If remote endpoints are slow, or the datasource controller's priority workqueue starves, datasources may be synced with significant delays, or never synced at all. For this case, we now provide a metric and two alerts. Alert `CortexNovaNewDatasourcesNotReconciling` will indicate if there are fresh datasources which have never synced. `CortexNovaExistingDatasourcesLackingBehind` will fire if an existing datasource has exceeded its next sync time by more than 10 minutes.