
Add generic module reload validation and harden fastrpc unload hang coverage #336

Open
smuppand wants to merge 2 commits into qualcomm-linux:main from smuppand:kernel

Conversation

@smuppand
Contributor

@smuppand smuppand commented Mar 9, 2026

Add a new generic Module_Reload_Validation suite as per the new requirement #326 and wire in an initial fastrpc profile for unload/reload regression coverage.

Key changes in this PR:

  • reusable module reload engine in lib_module_reload.sh
  • new Baseport Module_Reload_Validation testcase and YAML
  • profile-based execution so future modules can be covered without duplicating testcase logic
  • fastrpc-specific daemon lifecycle handling around adsprpcd/cdsprpcd
  • improved timeout handling so stuck rmmod/modprobe commands do not hang the harness
  • clearer timeout-path evidence, with relevant artifact paths printed to stdout for CI/LAVA debugging
  • module-state collection narrowed to the target module under test

The first target is fastrpc, where unload can hang after daemon activity. This suite is meant to make that regression reproducible in automation and to preserve useful evidence when it happens.
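As a rough sketch of the timeout-guarded unload step described above (function names here are illustrative, not the actual lib_module_reload.sh API), the harness can wrap rmmod/modprobe in coreutils `timeout` so a stuck command fails the iteration instead of wedging the whole run:

```shell
# Illustrative sketch only; the real lib_module_reload.sh helpers may differ.
# run_with_timeout: run a command under a deadline; coreutils timeout
# exits with status 124 when the deadline expires.
run_with_timeout() {
    deadline="$1"; shift
    timeout "$deadline" "$@"
}

# unload_module: attempt rmmod under a deadline and emit a FAIL line on
# timeout instead of hanging the harness (module name is a parameter).
unload_module() {
    module="$1"
    unload_timeout="${2:-30}"
    if run_with_timeout "$unload_timeout" rmmod "$module"; then
        echo "[INFO] unload ok: $module"
    else
        echo "[FAIL] unload timed out or failed after ${unload_timeout}s: $module"
        return 1
    fi
}
```

This matches the behavior visible in the log below: the iteration is marked FAIL after the deadline and evidence collection proceeds rather than blocking on the stuck rmmod.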

root@rb3gen2-core-kit:/tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation# ./run.sh --module fastrpc
[INFO] 1970-01-01 00:01:06 - ---------------- Starting Module_Reload_Validation ----------------
[INFO] 1970-01-01 00:01:06 - Platform Details: machine='Qualcomm Technologies, Inc. Robotics RB3gen2' target='Kodiak' kernel='6.18.7-g0e842c87bef3-dirty' arch='aarch64'
[INFO] 1970-01-01 00:01:06 - Args: module='fastrpc' iterations=3 unload_timeout=30 load_timeout=30 settle_timeout=20 mode='profile-default' sysrq_dump=1
[INFO] 1970-01-01 00:01:06 - Sysrq hang dump policy: enabled on timeout paths only
[INFO] 1970-01-01 00:01:06 - [fastrpc] module=fastrpc mode=daemon_lifecycle iterations=3
[INFO] 1970-01-01 00:01:06 - [fastrpc] warmup: unmasking and starting adsprpcd.service
[INFO] 1970-01-01 00:01:07 - [fastrpc] warmup: unmasking and starting cdsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: stopping adsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: masking adsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: killing remaining adsprpcd.service cgroup processes
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: stopping cdsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: masking cdsprpcd.service
[INFO] 1970-01-01 00:01:09 - [fastrpc] quiesce: killing remaining cdsprpcd.service cgroup processes
[INFO] 1970-01-01 00:01:12 - [fastrpc] sending SIGTERM to pid=587 pattern=/usr/bin/adsprpcd
[INFO] 1970-01-01 00:01:28 - [fastrpc] iter 1/3 exec: rmmod fastrpc
[FAIL] 1970-01-01 00:02:02 - [fastrpc] iter 1/3 unload timed out after 30s (pid=12476)
[INFO] 1970-01-01 00:02:02 - [fastrpc] unload timeout log: /tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01/unload.log
[INFO] 1970-01-01 00:02:02 - [fastrpc] collecting hang evidence in: /tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01/hang_evidence

The necessary logs remain available after the run:

root@rb3gen2-core-kit:/tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01# ls -l
total 4
drwxr-xr-x 2 root root 340 Jan  1 00:02 hang_evidence
drwxr-xr-x 2 root root 260 Jan  1 00:01 pre_state
-rw-r--r-- 1 root root 185 Jan  1 00:02 unload.log
drwxr-xr-x 2 root root 120 Jan  1 00:01 unload_timeout
root@rb3gen2-core-kit:/tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01#

@mwasilew
Contributor

Is there an example job using this code? It almost seems like there is too much code for one commit.

@basak-qcom

basak-qcom commented Mar 11, 2026

I'm also puzzled by the complexity of this. I was expecting <30 lines in total! For example, while I agree it makes sense to make the load/unload test generic of the specific module being tested, why is fastrpc.profile 438 lines, when I'd expect it to then declare the names of a kernel module, a couple of systemd services, and perhaps a few other parameters only? I imagine adding a second profile for a different kernel module and having to duplicate a huge bunch of code. So then what did the abstraction do for us?

@smuppand
Contributor Author

> I'm also puzzled by the complexity of this. I was expecting <30 lines in total! For example, while I agree it makes sense to make the load/unload test generic of the specific module being tested, why is fastrpc.profile 438 lines, when I'd expect it to then declare the names of a kernel module, a couple of systemd services, and perhaps a few other parameters only? I imagine adding a second profile for a different kernel module and having to duplicate a huge bunch of code. So then what did the abstraction do for us?

Thanks — this is a fair concern.
The intent of the abstraction was to keep the runner generic (run.sh + lib_module_reload.sh) and isolate only the module-specific lifecycle in the profile. In the current fastrpc case, that profile ended up carrying too much procedural logic because the unload hang is tightly coupled to adsprpcd / cdsprpcd stop-mask-kill-wait handling and the evidence collection around that path.

I agree the current split is not ideal if future profiles have to repeat this much code. The right end state is for a profile to be mostly declarative (module name, services, patterns, mode, a few knobs), with the common service/process wait/kill helpers moved into lib_module_reload.sh.

For this PR, I started with the fastrpc-specific flow to make the unload-hang regression reproducible first. I can follow up by shrinking fastrpc.profile and moving the reusable pieces into the shared library so that adding a second module profile is much smaller and cleaner.
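For illustration, a mostly-declarative profile under that end state might reduce to a handful of assignments; all names below are hypothetical, not the current fastrpc.profile contents:

```shell
# Hypothetical declarative profile shape; variable names are illustrative.
MODULE_NAME="fastrpc"                                 # kernel module under test
LIFECYCLE_MODE="daemon_lifecycle"                     # how the shared engine drives it
DAEMON_SERVICES="adsprpcd.service cdsprpcd.service"   # units to stop/mask/restore
DAEMON_PATTERNS="/usr/bin/adsprpcd /usr/bin/cdsprpcd" # leftover processes to kill
UNLOAD_TIMEOUT=30
LOAD_TIMEOUT=30
SETTLE_TIMEOUT=20
```

A second module's profile would then differ only in these values, with the wait/kill/evidence logic shared in lib_module_reload.sh.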

@smuppand
Contributor Author

> Is there an example job using this code? It almost seems like there is too much code for one commit.

https://lava.infra.foundries.io/scheduler/job/158642

…ence capture

Avoid wedging the test harness when unload or load commands time out
by returning immediately after timeout evidence collection.

Also narrow module state capture to the target module and improve
timeout-path stdout reporting for easier debugging.

Signed-off-by: Srikanth Muppandam <smuppand@qti.qualcomm.com>
@smuppand
Contributor Author

> I'm also puzzled by the complexity of this. I was expecting <30 lines in total! For example, while I agree it makes sense to make the load/unload test generic of the specific module being tested, why is fastrpc.profile 438 lines, when I'd expect it to then declare the names of a kernel module, a couple of systemd services, and perhaps a few other parameters only? I imagine adding a second profile for a different kernel module and having to duplicate a huge bunch of code. So then what did the abstraction do for us?

I’ve pushed changes to reduce the procedural logic in fastrpc.profile. The common service/process lifecycle handling now lives in the shared module-reload layer, so future profiles can stay smaller and mostly parameter-driven, with only the module-specific lifecycle remaining in the profile. Please take another look.
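As a sketch of what such a shared lifecycle helper could look like once it moves into lib_module_reload.sh (the function name is hypothetical; the real API may differ), following the stop → mask → kill sequence visible in the run log above:

```shell
# Hypothetical shared helper; mirrors the quiesce steps from the run log.
# Errors are tolerated so the harness keeps going on partially-up systems.
quiesce_service() {
    unit="$1"
    systemctl stop "$unit" 2>/dev/null || true    # stop the daemon
    systemctl mask "$unit" 2>/dev/null || true    # keep it from restarting
    # Kill anything still left in the unit's cgroup.
    systemctl kill --signal=SIGKILL "$unit" 2>/dev/null || true
}
```

A profile would then only list its units and let the shared layer iterate over them, rather than repeating the sequence per daemon.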

- cd Runner/suites/Kernel/Baseport/Module_Reload_Validation
- SYSRQ_ARG=""
- if [ "${ENABLE_SYSRQ_HANG_DUMP}" = "0" ]; then SYSRQ_ARG="--disable-sysrq-hang-dump"; fi
- ./run.sh --module "${PROFILE}" --iterations "${ITERATIONS}" --mode "${MODE}" --timeout-unload "${TIMEOUT_UNLOAD}" --timeout-load "${TIMEOUT_LOAD}" --timeout-settle "${TIMEOUT_SETTLE}" ${SYSRQ_ARG} || true
Contributor

@vnarapar vnarapar commented Mar 13, 2026

Any specific reason for using ${SYSRQ_ARG}? We could instead pass --enable-sysrq-hang-dump 0 directly, which would reduce the number of CLI parameters and also drop the condition on line #26.

I see you are using the same ENABLE_SYSRQ_HANG_DUMP to set the value as 1/0 anyway.

Contributor Author

> Any specific reason for using ${SYSRQ_ARG}? We could instead pass --enable-sysrq-hang-dump 0 directly, which would reduce the number of CLI parameters and also drop the condition on line #26.
>
> I see you are using the same ENABLE_SYSRQ_HANG_DUMP to set the value as 1/0 anyway.

The current runner CLI uses explicit boolean flags (--enable-sysrq-hang-dump / --disable-sysrq-hang-dump) and does not accept --enable-sysrq-hang-dump 0, so I’ve kept the YAML-side mapping to the optional disable flag unchanged.
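For context, an explicit-boolean-flag convention of this kind reads roughly like the following loop; this is a hedged sketch, and the real run.sh parser, option set, and defaults may differ:

```shell
# Sketch of explicit boolean flags, as opposed to a "--flag 0/1" value form.
# Default matches the log above (sysrq_dump=1); parser details are illustrative.
SYSRQ_DUMP=1
parse_args() {
    while [ $# -gt 0 ]; do
        case "$1" in
            --enable-sysrq-hang-dump)  SYSRQ_DUMP=1 ;;
            --disable-sysrq-hang-dump) SYSRQ_DUMP=0 ;;
            *) echo "unknown arg: $1" >&2; return 1 ;;
        esac
        shift
    done
}
```

Under this convention, the YAML side only decides whether to append the single optional --disable-sysrq-hang-dump token, which is exactly what the ${SYSRQ_ARG} mapping does.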


run:
steps:
- REPO_PATH=$PWD
Contributor

Append || true

Contributor Author

> Append || true

It appears that, aside from run.sh, this approach isn't being followed in this repository.

run:
steps:
- REPO_PATH=$PWD
- cd Runner/suites/Kernel/Baseport/Module_Reload_Validation
Contributor

Append || true

Contributor Author

> Append || true

It appears that, aside from run.sh, this approach isn't being followed in this repository.

- SYSRQ_ARG=""
- if [ "${ENABLE_SYSRQ_HANG_DUMP}" = "0" ]; then SYSRQ_ARG="--disable-sysrq-hang-dump"; fi
- ./run.sh --module "${PROFILE}" --iterations "${ITERATIONS}" --mode "${MODE}" --timeout-unload "${TIMEOUT_UNLOAD}" --timeout-load "${TIMEOUT_LOAD}" --timeout-settle "${TIMEOUT_SETTLE}" ${SYSRQ_ARG} || true
- $REPO_PATH/Runner/utils/send-to-lava.sh Module_Reload_Validation.res
Contributor

Append || true

Contributor Author

> Append || true

run.sh already handles the exit code; I don't believe the .res file step specifically requires this.

Update the fastrpc profile to stop and mask rpc daemons, verify
remaining daemon processes, and provide clearer stdout context when
quiesce or unload paths fail.

Signed-off-by: Srikanth Muppandam <smuppand@qti.qualcomm.com>
