Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request undertakes a comprehensive redesign of the Grafana monitoring dashboards for Elastickv. The goal is to provide operators with more focused and actionable insights into the system's health, performance, and consensus state. By introducing specialized dashboards for request health and Raft status, alongside a refined cluster overview, the changes aim to streamline monitoring workflows and facilitate quicker identification and diagnosis of issues. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request significantly refactors and updates the Grafana monitoring dashboards for Elastickv, enhancing visibility into both DynamoDB-compatible API request health and Raft control-plane status. The README.md file has been updated to reflect the new metrics and dashboard organization. The elastickv-cluster-overview.json dashboard now includes more detailed metrics like system/user error splits, in-flight requests, and comprehensive Raft state, along with templating for group and node IDs. The elastickv-cluster-summary.json dashboard has been transformed into Elastickv Request Health, focusing on DynamoDB request throughput, latency, errors, and per-table activity, introducing templating for node ID, operation, and table. The elastickv-dynamodb-summary.json dashboard has been removed. Finally, the elastickv-raft-status.json dashboard provides a more granular view of Raft control-plane status, including per-node index snapshots, backlog, and FSM pending, also with templating for group and node IDs. A suggestion was made to rename the elastickv-cluster-summary.json file to elastickv-request-health.json to align with its new title and description.
| - `monitoring/grafana/dashboards/elastickv-cluster-summary.json` | ||
| - `monitoring/grafana/dashboards/elastickv-raft-status.json` | ||
| - `monitoring/grafana/provisioning/` | ||
| - `monitoring/docker-compose.yml` | ||
|
|
||
| The provisioned dashboards are organized by operator task: | ||
|
|
||
| - `Elastickv Cluster Overview` is the landing page for leader identity, cluster-wide latency/error posture, and per-node Raft health | ||
| - `Elastickv Request Health` is the DynamoDB/API drilldown for slow operations, noisy nodes, and hot/erroring tables |
There was a problem hiding this comment.
For consistency and clarity, consider renaming the dashboard file elastickv-cluster-summary.json to elastickv-request-health.json. This would align the filename with its new title Elastickv Request Health and the description provided in this README. You would also need to update this file to reflect the new name.
There was a problem hiding this comment.
Pull request overview
This PR redesigns the Grafana monitoring dashboards under monitoring/grafana/dashboards/ to better support operator workflows, splitting “overview” vs. “request health” vs. “raft control-plane” drilldowns and adding richer per-node/per-table breakdowns.
Changes:
- Redesign Raft dashboard into a control-plane focused view with per-node index/contact snapshots and node filtering.
- Redesign cluster summary into a DynamoDB/API request health dashboard with node/operation/table templating and top offenders tables.
- Update cluster overview into an operator landing page and remove the legacy DynamoDB summary dashboard; update README dashboard documentation.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| monitoring/grafana/dashboards/elastickv-raft-status.json | Reworked Raft status dashboard: new layout, node filtering, leader/membership tables, per-node index snapshot, and additional time series. |
| monitoring/grafana/dashboards/elastickv-dynamodb-summary.json | Removed legacy DynamoDB summary dashboard JSON. |
| monitoring/grafana/dashboards/elastickv-cluster-summary.json | Repurposed into “Request Health” dashboard with node/operation/table variables and new panels for error/latency/table hot spots. |
| monitoring/grafana/dashboards/elastickv-cluster-overview.json | Updated “Cluster Overview” landing page with node/group templating, revised KPI stats, and a per-node Raft health snapshot table. |
| README.md | Updated monitoring section to reflect new dashboard set and intent. |
| "editorMode": "code", | ||
| "expr": "count by (group) (elastickv_raft_member_present{job=\"elastickv\",group=~\"$group\"} * on(group, node_id, node_address) group_left() (elastickv_raft_local_state{job=\"elastickv\",group=~\"$group\",state=\"leader\"} == 1))", | ||
| "instant": false, | ||
| "expr": "max by (group) (clamp_min(elastickv_raft_last_contact_seconds{job=\"elastickv\",group=~\"$group\",node_id=~\"$node_id\"}, 0))", | ||
| "legendFormat": "{{group}}", |
There was a problem hiding this comment.
elastickv_raft_last_contact_seconds uses -1 as a sentinel for unknown/invalid values (see monitoring/raft.go). Clamping to 0 here turns “unknown” into “0s” (healthy) and can mask missing telemetry. Consider keeping the raw value and using a value mapping for -1, or handling unknown explicitly in the PromQL.
| { | ||
| "datasource": null, | ||
| "editorMode": "code", | ||
| "expr": "max by (group, node_id, node_address) (clamp_min(elastickv_raft_last_contact_seconds{job=\"elastickv\",group=~\"$group\",node_id=~\"$node_id\"}, 0))", |
There was a problem hiding this comment.
This series clamps elastickv_raft_last_contact_seconds to 0, but the exporter uses -1 to mean “unknown”. That makes missing/invalid contact data look like a real 0s value. Prefer leaving the sentinel value and adding a value mapping, or converting unknowns to null/NaN so they’re visually distinct.
| "expr": "max by (group, node_id, node_address) (clamp_min(elastickv_raft_last_contact_seconds{job=\"elastickv\",group=~\"$group\",node_id=~\"$node_id\"}, 0))", | |
| "expr": "max by (group, node_id, node_address) (elastickv_raft_last_contact_seconds{job=\"elastickv\",group=~\"$group\",node_id=~\"$node_id\"})", |
| "expr": "max(clamp_min(elastickv_raft_last_contact_seconds{job=\"elastickv\",group=~\"$group\",node_id=~\"$node_id\"}, 0))", | ||
| "legendFormat": "max last contact", | ||
| "range": true, |
There was a problem hiding this comment.
elastickv_raft_last_contact_seconds uses -1 as “unknown”. Clamping to 0 here makes unknown contact look like 0s and can hide scrape/telemetry issues. Consider removing the clamp and adding a value mapping for -1 (or filtering unknowns explicitly) so operators can distinguish “unknown” from “0s”.
| { | ||
| "datasource": null, | ||
| "editorMode": "code", | ||
| "expr": "max by (group, node_id, node_address) (clamp_min(elastickv_raft_last_contact_seconds{job=\"elastickv\",group=~\"$group\",node_id=~\"$node_id\"}, 0))", |
There was a problem hiding this comment.
The table clamps elastickv_raft_last_contact_seconds to 0, but the exporter uses -1 as a sentinel for unknown/invalid values. This makes missing data indistinguishable from a real 0s value in the snapshot. Prefer keeping -1 with a value mapping, or translating unknowns to null so they render as “unknown”/blank.
| "expr": "max by (group, node_id, node_address) (clamp_min(elastickv_raft_last_contact_seconds{job=\"elastickv\",group=~\"$group\",node_id=~\"$node_id\"}, 0))", | |
| "expr": "max by (group, node_id, node_address) (elastickv_raft_last_contact_seconds{job=\"elastickv\",group=~\"$group\",node_id=~\"$node_id\"} unless elastickv_raft_last_contact_seconds{job=\"elastickv\",group=~\"$group\",node_id=~\"$node_id\"} == -1)", |
| "expr": "max by (group) (elastickv_raft_term{job=\"elastickv\",group=~\"$group\",node_id=~\"$node_id\"})", | ||
| "legendFormat": "{{group}}", |
There was a problem hiding this comment.
The panel title implies this is the group’s current raft term, but the query is max by (group)(elastickv_raft_term{...}), which can surface a higher term from a non-leader observer during elections/partitions. Either scope the term to the leader (as the old dashboard did) or rename the panel to something like “Max Observed Term” to match the query semantics.
No description provided.