Add alerts for reconciliation and webhooks #671
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough: Five new Prometheus alerting rules were added.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@helm/bundles/cortex-nova/alerts/nova.alerts.yaml`:
- Around line 665-668: The alert CortexNovaWebhookErrorsHigh uses
controller_runtime_webhook_requests_total but divides error rates by only
code="200" (successes), producing errors/successes and possible divide-by-zero;
update the expr for CortexNovaWebhookErrorsHigh so the denominator uses the
total request rate (remove the code="200" label filter from the denominator)
i.e., use sum by (webhook) (rate(controller_runtime_webhook_requests_total[5m]))
as the denominator to compute errors/total correctly.
- Around line 630-632: The alert CortexNovaWorkqueueNotDrained incorrectly uses
rate() on the gauge workqueue_depth; change the expression to check the gauge
value directly by removing rate() and using sum by (name)
(workqueue_depth{service="cortex-nova-metrics"}) > 0 so the rule fires when
items are present in the queue (and keep the existing 15m for duration).
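
The two findings above could be addressed roughly as follows. This is only a sketch: the alert names, metric names, the `service` label, the total-request denominator, and the `15m` duration come from the review comments, while the group name, the `code!="200"` numerator filter, the `0.1` threshold, and the `for: 5m` on the webhook alert are assumptions, since the original expressions are not shown in this conversation.

```yaml
groups:
  - name: cortex-nova-alerts-sketch  # hypothetical group name, not taken from the PR
    rules:
      - alert: CortexNovaWorkqueueNotDrained
        # workqueue_depth is a gauge, so its value is compared directly;
        # rate() is only meaningful for counters.
        expr: sum by (name) (workqueue_depth{service="cortex-nova-metrics"}) > 0
        for: 15m
      - alert: CortexNovaWebhookErrorsHigh
        # Error ratio = non-200 requests / all requests.
        # The numerator filter and the 0.1 threshold are assumed values.
        expr: |
          sum by (webhook) (rate(controller_runtime_webhook_requests_total{code!="200"}[5m]))
          /
          sum by (webhook) (rate(controller_runtime_webhook_requests_total[5m]))
          > 0.1
        for: 5m  # assumed duration
```

With the total request rate as the denominator, the ratio stays between 0 and 1. If a webhook receives no traffic in the window, the division yields NaN, the comparison does not match, and the alert stays quiet.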
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 36a6483b-a948-4e0f-8c65-1f665132cd68
📒 Files selected for processing (1)
helm/bundles/cortex-nova/alerts/nova.alerts.yaml
e966ba4 to 3bebfbe
Test Coverage Report: Test Coverage 📊: 68.4%
With controller-runtime's exporter, we already expose a rich set of metrics on which we can declare alerts. This change adds five alerts: one each for reconciler errors, reconciler timeouts, a workqueue that is not being drained, webhook latency, and the webhook error rate. They closely follow the corresponding alerts in greenhouse.