SOLR-18147 Make a new Grafana dashboard for Solr 10.x #4210
janhoy wants to merge 14 commits into apache:main
Add traffic generator Run two solr's in example cluster
Fix several panels
support running your own solr
So the foundation is laid I believe. Technically it is working and I generally like the "rows" and panels chosen by AI. But there are probably useful changes to do. Here are some I can think of
Latency graphs should always show the max; p50 is basically useless... https://www.youtube.com/watch?v=lJ8ydIuPFeU Also, update latency is only rarely interesting... throughput is what most folks care about for indexing, that and stuck/failed documents.
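To illustrate the point about p50 versus max, here is a minimal sketch using made-up latency samples (not data from this PR). The `percentile` helper is a hypothetical nearest-rank implementation, not what Solr or Prometheus uses internally.

```python
import math

# Hypothetical request latencies in milliseconds: mostly fast, a few slow outliers.
latencies_ms = [12] * 95 + [15, 18, 900, 2500, 4000]

def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
worst = max(latencies_ms)

# The median looks healthy even though some requests took seconds.
print(f"p50={p50}ms p99={p99}ms max={worst}ms")
```

With these samples the median stays at the fast value while the tail and the max reveal the slow requests, which is why a max series is worth having on the panel.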
Thanks Jan, this looks like a great start. I'll find some time to take a look. I really love the docker compose setup making it easy to test. Something we should also add is a way to turn on the tracing module with this, so we can see the exemplars that Solr now supports with these dashboards. Maybe a second iteration, since that is definitely way out of scope.
Good feedback, adding in a max graph in the search latency panel. Let's do that.
Yea, cause /update is non-blocking, right, so it won't tell much other than how large the payload was and perhaps how busy the server was. Let's use that real estate for something better.
Thought of it but wanted to keep scope somewhat low, so I think this PR should focus on a GA dashboard. Then follow up work could add OTEL collector and Jaeger to the
Looking good! +1 on lacing up an OTEL collector next 👀
Are you ok with the location in the monorepo
I like
# ./stack.sh --help # All options
#
# Services (full stack):
# solr1 http://localhost:8983 (SolrCloud node 1, embedded ZooKeeper)
epugh left a comment
Good progress... There is a lot here that I don't quite grok... Is trafficgen coming out of other perf-related efforts, or just "hey, we need some load" ;-)
Trafficgen is just something I wrote earlier, not written for perf at all, just to have something happening in a cluster, as it is boring to view a dashboard or traces with nothing going on. This Do you feel it is too much to add? Should the entire
Good question. I think if "trafficgen" is used by other areas of Solr, beyond basically a tool for demoing/integration testing of alerting, then it should be elsewhere. One thought... I wonder if we need to reframe this as solr/monitoring into something like
The example folder is mainly used to demonstrate different config sets and data types, see https://github.com/apache/solr/tree/main/solr/example But I'm positive to a
- Rename "Node Overview" row to "Cluster Overview"
- "Distributed QPS": per-collection (external traffic only)
- "Search Latency p50/p95/p99": per-collection instead of per-instance
- "Total Update Rate": per-collection (replaces "Indexing Rate")
- "Update Latency p99": per-collection (external traffic only)
- "Document Count": moved to Cluster Overview, per-collection
- Add new "Solr Core" row: QPS, Update Rate, Update Latency, Commit Rate, Optimize Rate (all per core)
- "Shard Leaders": time series per node (was single stat)
- "Update Log Replay Remaining": per-collection time series (was single stat)
- "Segment Count per Collection": max per collection (was per core)
- "Total Index Size per Node": time series per node (was single stat)
- Add "Index Size per Collection": leader cores only via PromQL join
- "Pending Commit Docs": per-collection time series (was single stat)
- Rename "Cache Efficiency" row to "Solr Caches"
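Several of the changes above move a panel from per-core to per-collection. As a rough sketch of what a `max by (collection)` style roll-up does, assuming hypothetical per-core samples and made-up collection and core names (none of these values come from the PR):

```python
# Hypothetical per-core samples: (collection, core, segment_count).
samples = [
    ("films", "films_shard1_replica_n1", 7),
    ("films", "films_shard2_replica_n1", 12),
    ("books", "books_shard1_replica_n1", 4),
]

def max_by_collection(rows):
    """Collapse per-core values to one value per collection, keeping the largest."""
    result = {}
    for collection, _core, value in rows:
        result[collection] = max(value, result.get(collection, value))
    return result

per_collection = max_by_collection(samples)
print(per_collection)
```

The core dimension disappears from the result, so the panel shows one series per collection instead of one per core.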
Did several improvements:
- "Heap Used" and "Heap Committed": add heap max (-Xmx) reference line
- "Heap Max" renamed to "System and Heap Memory": area chart showing heap max, system total, and system used per instance
- "Update Latency p99": fix empty panel by removing internal="false" filter (UPDATE metrics carry no internal label)
- Rename metric jvm_system_memory_total_bytes → jvm_system_memory_bytes throughout dashboard and alerts
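The empty "Update Latency p99" panel follows from how Prometheus equality matchers treat absent labels: a missing label compares as the empty string, so `internal="false"` cannot match a series that has no `internal` label. A minimal sketch of that rule, with hypothetical series and a hypothetical `matches` helper:

```python
def matches(series_labels, matchers):
    """Equality matching in the PromQL style: an absent label counts as the empty string."""
    return all(series_labels.get(name, "") == value for name, value in matchers.items())

# Hypothetical series: QUERY metrics carry an `internal` label, UPDATE metrics do not.
series = [
    {"category": "QUERY", "internal": "false", "collection": "films"},
    {"category": "QUERY", "internal": "true", "collection": "films"},
    {"category": "UPDATE", "collection": "films"},
]

# With the extra filter nothing matches, so the panel stays empty.
with_filter = [s for s in series if matches(s, {"category": "UPDATE", "internal": "false"})]
# Dropping the filter selects the UPDATE series again.
without_filter = [s for s in series if matches(s, {"category": "UPDATE"})]

print(len(with_filter), len(without_filter))
```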
More improvements:
ts(
  'Total Update Rate',
  [prom(
    'sum by (collection)(rate(solr_core_requests_times_milliseconds_count{%s,%s,%s,category="UPDATE"}[$interval]))' % [envSel, clusterSel, instSel],
@mlbiscoc and @dsmiley In the "Cluster Overview" row I have made panels for "Distributed QPS" per collection, i.e. the number of user-generated requests (label internal=false).
I wanted to do the same for Update requests in this row, but it turns out we do not make that distinction in the metrics, although internal requests are clearly tagged with the URL parameters distrib.from and update.distrib=FROMLEADER. Do you remember why this was not added, or if the same information is captured in our metrics but in a different way?
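For illustration only, the distinction described here could be derived from the request URL roughly as follows. This is a hypothetical sketch, not Solr's code: the function name and the example URLs are made up, and only the two parameters named above are checked.

```python
from urllib.parse import parse_qs, urlsplit

def is_internal_update(url):
    """Treat a request as internal if it carries distrib.from or update.distrib=FROMLEADER."""
    params = parse_qs(urlsplit(url).query)
    return "distrib.from" in params or "FROMLEADER" in params.get("update.distrib", [])

# Hypothetical URLs: one forwarded by a shard leader, one sent by a client.
forwarded = (
    "http://solr1:8983/solr/films_shard1_replica_n1/update"
    "?update.distrib=FROMLEADER&distrib.from=http%3A%2F%2Fsolr2%3A8983%2Fsolr%2Ffilms_shard1_replica_n2%2F"
)
external = "http://solr1:8983/solr/films/update?commit=true"

print(is_internal_update(forwarded), is_internal_update(external))
```

A metric label such as `internal` on UPDATE requests would need an equivalent check on the server side; how that would actually be implemented is left open in this thread.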
It's absolutely an oversight that nobody stepped up to do this for /update. I've known about it for some time.
…showing active leaders per collection/shard
- Remove "Active Cores" stat panel
- Move "Disk Free" gauge from Cluster Overview to Index Health row
https://issues.apache.org/jira/browse/SOLR-18147
monitoring-with-prometheus-and-grafana refguide page, but written from scratch, with a new diagram scraping each Solr node.
solr/monitoring/dev folder with a docker-compose file that starts two Solr nodes, Prometheus, Grafana, Alertmanager and a traffic ingester container, to easily test metric/Grafana changes locally
Want to review?
This is a first draft, the things most ready for review are the mixin build logic and the dev/ compose setup for local testing.
I'd not recommend starting a details-focused review of each dashboard panel, presentation etc. The dashboard and panels themselves I'd categorize as first LLM draft. I have not done more than fixing them so they display data and react to variable dropdowns. Thus, everything related to choice of dashboard ROWs, selection and presentation of what metrics to make panels for, and the design of those panels are up for discussion, so the most useful review feedback on the dashboard at this stage is high-level on what rows and panels we need, and what style.
I give every committer permission to commit fixes and improvements to this branch, after first announcing what you intend to do in a review comment or ordinary comment. I am not strongly attached to the current row+panel selection.
Current dashboard layout (Draft)
The rows are:
Here are some screenshots:





Disclaimer: All of this is built by Claude Code.