SOLR-18147 Make a new Grafana dashboard for Solr 10.x by janhoy · Pull Request #4210 · apache/solr

janhoy · 2026-03-12T15:52:38Z

https://issues.apache.org/jira/browse/SOLR-18147

Brand-new dashboard, built from mixin source that can regenerate both the dashboard and the alerts.
Brings back the monitoring-with-prometheus-and-grafana refguide page, written from scratch, with a new diagram that shows each Solr node being scraped.
Adds a solr/monitoring/dev folder with a docker-compose file that starts two Solr nodes, Prometheus, Grafana, Alertmanager and a traffic ingester container, to easily test metric/Grafana changes locally.

Want to review?

This is a first draft. The things most ready for review are the mixin build logic and the dev/ compose setup for local testing.

I'd not recommend starting a details-focused review of each dashboard panel, presentation etc. The dashboard and panels themselves I'd categorize as a first LLM draft. I have not done more than fix them so they display data and react to variable dropdowns. Thus everything related to the choice of dashboard rows, the selection and presentation of metrics to make panels for, and the design of those panels is up for discussion. The most useful review feedback on the dashboard at this stage is high-level: what rows and panels we need, and in what style.

I give every committer permission to commit fixes and improvements to this branch, after first announcing what you intend to do in a review comment or ordinary comment. I am not strongly attached to the current row+panel selection.

Current dashboard layout (Draft)

The rows are:

Node Overview (open by default) — query/index request rates, latency, cores, disk
JVM (open by default) — heap, GC, threads, CPU
SolrCloud (collapsed) — Overseer queues, ZK ops, shard leaders
Index Health (collapsed) — segments, index size, merge rates, MMap efficiency
Cache Efficiency (collapsed) — filter/query/document cache hit rates and evictions

Here are some screenshots:

Disclaimer: All of this is built by Claude Code.

janhoy · 2026-03-13T00:09:10Z

So the foundation is laid, I believe. Technically it is working, and I generally like the "rows" and panels chosen by the AI.

But there are probably useful changes to make. Here are some I can think of:

Add a panel for system memory (dependent on SOLR-18159 Add metrics for system memory #4209), perhaps a stacked area with heap-max in it
Distinguish between "collection QPS" and "per-core" QPS. I think the metrics include a label for whether they are "local" or not?
Add panel for number of zookeepers "up"
Add panel for number of solr nodes "up"
Other panels for cluster-level things like number of collections, shard leadership over time
Gather more user feedback for what they lack
Add OTEL collector to the docker-compose and have it push metrics to the same prometheus, but with a different "cluster" or "environment" label, to test those dropdowns.

gus-asf · 2026-03-13T14:00:25Z

Latency graphs should always show the max, p50 is basically useless... https://www.youtube.com/watch?v=lJ8ydIuPFeU

Also update latency is only rarely interesting... throughput is what most folks care about for indexing, that and stuck/failed documents.
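To illustrate the point about p50 hiding tail latency, here is a minimal Python sketch on made-up latency samples (the numbers are invented for illustration and are not taken from Solr). It compares the median with the p99 and the maximum:

```python
import statistics


def summarize(latencies_ms):
    """Return p50, p99 and max for a list of request latencies (ms)."""
    ordered = sorted(latencies_ms)
    # statistics.quantiles with n=100 returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98], "max": ordered[-1]}


# Hypothetical sample: 98 fast requests and 2 very slow ones.
samples = [10.0] * 98 + [2000.0, 5000.0]
summary = summarize(samples)
print(summary)
```

In this sample the median stays at the fast-request value, so a p50-only graph would look healthy while the two slow requests only show up in the p99 and max.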

mlbiscoc · 2026-03-13T14:13:39Z

Thanks Jan, this looks like a great start. I'll find some time to take a look. I really love the docker compose setup making it easy to test. Something we should also add is a way to turn on the tracing module, so that these dashboards can show the exemplars that Solr now supports. Maybe in a second iteration, since that is definitely way out of scope.

janhoy · 2026-03-13T14:36:00Z

> Latency graphs should always show the max, p50 is basically useless... https://www.youtube.com/watch?v=lJ8ydIuPFeU

Good feedback, adding in a max graph in the search latency panel. Let's do that.

> Also update latency is only rarely interesting... throughput is what most folks care about for indexing, that and stuck/failed documents.

Yea, cause /update is non-blocking, right, so it won't tell much other than how large the payload was and perhaps how busy the server was. Let's use that real estate for something better.

janhoy · 2026-03-13T14:41:47Z

> Something we should add is also a way to turn on tracing module with this so we can also see exemplars that Solr supports now as well with these dashboards.

Thought of it but wanted to keep scope somewhat low, so I think this PR should focus on a GA dashboard. Then follow-up work could add OTEL collector and Jaeger to the dev/ setup. I also discovered Microsoft's Aspire Dashboard project, and I think I'll add it to compose. It shows you in real time what OTLP packets (metrics, traces, logs) are received, and you can inspect the content of each. It has a simple traces viewer.

Jesssullivan · 2026-03-13T16:22:15Z

Looking good! +1 on lacing up an OTEL collector next 👀

> Thought of it but wanted to keep scope somewhat low, so I think this PR should focus on a GA dashboard. Then follow up work could add OTEL collector and Jaeger to the dev/ setup. I

janhoy · 2026-03-19T14:18:04Z

Are you ok with the location in the monorepo, solr/monitoring? In some ways it belongs more at the top level, but I try to avoid adding stuff there. I considered a separate git repo, but that breaks with our monorepo style, and it is useful to keep the dashboard in sync with the evolution of the app.

mlbiscoc · 2026-03-19T21:05:14Z

I like the solr/monitoring location over it being at the root, and I'd rather not put it in a separate repo. In a separate repo, if we add or change metrics, it'd be hard to see that without switching between 2 repos. I'd vote for how it is.

epugh · 2026-03-21T11:35:19Z

+#   ./stack.sh --help              # All options
+#
+# Services (full stack):
+#   solr1        http://localhost:8983  (SolrCloud node 1, embedded ZooKeeper)


I ❤️ this!

epugh

Good progress. There is a lot here that I don't quite grok... Is trafficgen coming out of other perf-related efforts, or just "hey, we need some load" ;-)

janhoy · 2026-03-21T23:27:17Z

> Good progress. There is a lot here that I don't quite grok... Is trafficgen coming out of other perf-related efforts, or just "hey, we need some load" ;-)

Trafficgen is just something I wrote earlier, not written for perf at all, just to have something happening in a cluster, as it is boring to view a dashboard or traces with nothing going on. This dev/ hack is just convenience tooling to assist when developing or changing dashboards and metrics, modifying OTEL Collector configuration etc.

Do you feel it is too much to add? Should the entire dev/ folder move to /dev-tools/monitoring instead, and trafficgen to /dev-tools/trafficgen?

epugh · 2026-03-23T12:00:14Z

> Do you feel it is too much to add? Should the entire dev/ folder move to /dev-tools/monitoring instead, and trafficgen to /dev-tools/trafficgen?

Good question. I think if "trafficgen" is used by other areas of Solr, beyond being basically a tool for demoing/integration testing of alerting, then it should be elsewhere.

One thought: I wonder if we need to reframe solr/monitoring as something like solr/examples/monitoring or solr/integrations/monitoring, with an eye towards starting to move more community knowledge into more formal, documented artifacts. I could imagine a leader-follower-replication or a docker\docker-compose.yml example shipping.

janhoy · 2026-03-23T17:22:08Z

The example folder is mainly used to demonstrate different config sets and data types, see https://github.com/apache/solr/tree/main/solr/example

But I'm open to a solr/integrations/monitoring/ location if you're looking for a place to ship other cluster-level examples. Or we could repurpose the example/ folder to be both. But I guess the example/ folder is bundled up in a release tgz, and this does not need to be, so that's a major difference...

janhoy · 2026-04-08T20:28:01Z

Did several improvements:

Rename "Node Overview" row to "Cluster Overview"
"Distributed QPS": per-collection (external traffic only)
"Search Latency p50/p95/p99": per-collection instead of per-instance
"Total Update Rate": per-collection (replaces "Indexing Rate")
"Update Latency p99": per-collection (external traffic only)
"Document Count": moved to Cluster Overview, per-collection
Add new "Solr Core" row: QPS, Update Rate, Update Latency, Commit Rate, Optimize Rate — all per core
"Shard Leaders": time series per node (was single stat)
"Update Log Replay Remaining": per-collection time series (was single stat)
"Segment Count per Collection": max per collection (was per core)
"Total Index Size per Node": time series per node (was single stat)
Add "Index Size per Collection": leader cores only via PromQL join
"Pending Commit Docs": per-collection time series (was single stat)
Rename "Cache Efficiency" row to "Solr Caches"

- "Heap Used" and "Heap Committed": add heap max (-Xmx) reference line
- "Heap Max" renamed to "System and Heap Memory": area chart showing heap max, system total, and system used per instance
- "Update Latency p99": fix empty panel by removing internal="false" filter (UPDATE metrics carry no internal label)
- Rename metric jvm_system_memory_total_bytes → jvm_system_memory_bytes throughout dashboard and alerts

janhoy · 2026-04-08T21:24:53Z

More improvements:

"Heap Used" and "Heap Committed": add heap max (-Xmx) reference line
"Heap Max" changed to area chart and renamed to "System and Heap Memory", incorporating the new system max metric
"Update Latency p99": fix empty panel by removing internal="false" filter (UPDATE metrics carry no internal label)
Rename metric jvm_system_memory_total_bytes → jvm_system_memory_bytes throughout dashboard and alerts

janhoy · 2026-04-08T21:35:44Z

+  ts(
+    'Total Update Rate',
+    [prom(
+      'sum by (collection)(rate(solr_core_requests_times_milliseconds_count{%s,%s,%s,category="UPDATE"}[$interval]))' % [envSel, clusterSel, instSel],


@mlbiscoc and @dsmiley I have in the "Cluster Overview" row made panels for "Distributed QPS" per collection, i.e. the number of user-generated requests (label internal=false).

I wanted to do the same for update requests in this row, but it turns out we do not make that distinction in the metrics, although internal requests are clearly tagged with the URL parameters distrib.from and update.distrib=FROMLEADER. Do you remember why this was not added, or whether the same information is captured in our metrics in a different way?
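The quoted Jsonnet line builds its PromQL by filling three label selectors into a template with the % operator. A minimal Python sketch of the same substitution follows. The selector values (the environment, cluster and instance label names and their Grafana variables) are assumptions for illustration, since the actual definitions of envSel, clusterSel and instSel are not shown here.

```python
# Template copied from the quoted diff line.
TEMPLATE = (
    'sum by (collection)(rate(solr_core_requests_times_milliseconds_count'
    '{%s,%s,%s,category="UPDATE"}[$interval]))'
)

# Hypothetical selectors; the real envSel, clusterSel and instSel may differ.
env_sel = 'environment=~"$environment"'
cluster_sel = 'cluster=~"$cluster"'
inst_sel = 'instance=~"$instance"'


def total_update_rate_query(env, cluster, inst):
    """Fill the three selectors into the PromQL template."""
    return TEMPLATE % (env, cluster, inst)


query = total_update_rate_query(env_sel, cluster_sel, inst_sel)
print(query)
```

Note that the expanded query has no internal label in its selector, which matches the observation that update metrics cannot currently be filtered to external traffic only.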

dsmiley · 2026-04-09

It's absolutely an oversight that nobody stepped up to do this for /update. I've known about it for some time.
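As a rough sketch of the distinction being discussed, the two URL parameters mentioned in the question could be used to classify an update request as internal. This is an illustration in Python only, not how Solr records its metrics; the function name and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def is_internal_update(url):
    """Guess whether an /update request was forwarded by another Solr node.

    Assumption: a request is internal if it carries a distrib.from parameter
    or has update.distrib=FROMLEADER, as described in the question above.
    """
    params = parse_qs(urlsplit(url).query)
    if "distrib.from" in params:
        return True
    return "FROMLEADER" in params.get("update.distrib", [])


# Hypothetical example URLs.
external = "http://localhost:8983/solr/films/update?commit=true"
internal = (
    "http://localhost:8983/solr/films_shard1_replica_n1/update"
    "?update.distrib=FROMLEADER"
    "&distrib.from=http%3A%2F%2Fsolr1%3A8983%2Fsolr%2Ffilms_shard1_replica_n2%2F"
)

print(is_internal_update(external), is_internal_update(internal))
```

A metric label derived this way would let an update panel filter on external traffic in the same way the "Distributed QPS" panel does for queries.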

SOLR-18147 Make a new Grafana dashboard for Solr 10.x

a297ba2

janhoy marked this pull request as draft March 12, 2026 15:52

github-actions bot added the documentation and scripts labels Mar 12, 2026

janhoy added 4 commits March 12, 2026 16:54

Shorten changelog title

4f5a1a1

Tested makefile and fixed it

630f788

Docker based makefile build

1842193

Fix various dashboard issues

17cb754

Add traffic generator Run two solr's in example cluster

janhoy marked this pull request as ready for review March 12, 2026 19:51

janhoy requested a review from Copilot March 12, 2026 19:51

Copilot started reviewing on behalf of janhoy March 12, 2026 19:52

This comment was marked as outdated.

janhoy added 2 commits March 12, 2026 21:00

License headers

854cb1e

Add a better screenshot

27d875f

Fix several panels

janhoy requested a review from mlbiscoc March 12, 2026 22:38

stack.sh script for running monitoring stack

8634fe9

support running your own solr

Adjust panel to include state=total

579df1d

janhoy marked this pull request as draft March 13, 2026 08:58

epugh reviewed Mar 21, 2026

janhoy added 2 commits April 8, 2026 22:28

Merge branch 'main' into 001-grafana-dashboard-solr10

7f9ae8b

janhoy commented Apr 8, 2026

janhoy added 2 commits April 9, 2026 00:26

- Add "Shard Leaders" stacked area panel (solid fill, y-axis from 0) …

e404aff

…showing active leaders per collection/shard - Remove "Active Cores" stat panel - Move "Disk Free" gauge from Cluster Overview to Index Health row

Added a "Collections" panel

ab80184

dsmiley Apr 9, 2026

7 participants
