
Add generic module reload validation and harden fastrpc unload hang coverage #336

Open
smuppand wants to merge 2 commits into qualcomm-linux:main from smuppand:kernel

Conversation

@smuppand
Contributor

@smuppand smuppand commented Mar 9, 2026

Add a new generic Module_Reload_Validation suite as per the new requirement #326 and wire in an initial fastrpc profile for unload/reload regression coverage.

Key changes in this PR:

  • reusable module reload engine in lib_module_reload.sh
  • new Baseport Module_Reload_Validation testcase and YAML
  • profile-based execution so future modules can be covered without duplicating testcase logic
  • fastrpc-specific daemon lifecycle handling around adsprpcd/cdsprpcd
  • improved timeout handling so stuck rmmod/modprobe commands do not hang the harness
  • clearer timeout-path evidence, with relevant artifact paths printed to stdout for CI/LAVA debugging
  • module-state collection narrowed to the target module under test

The first target is fastrpc, where unload can hang after daemon activity. This suite is meant to make that regression reproducible in automation and to preserve useful evidence when it happens.
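As a rough sketch of the timeout-guarded unload step described above (function names here are illustrative, not the actual lib_module_reload.sh API), the harness can wrap rmmod/modprobe in coreutils `timeout` so a stuck command fails the iteration instead of wedging the whole run:

```shell
# Illustrative sketch only; the real lib_module_reload.sh helpers may differ.
# run_with_timeout: run a command under a deadline; coreutils timeout
# exits with status 124 when the deadline expires.
run_with_timeout() {
    deadline="$1"; shift
    timeout "$deadline" "$@"
}

# unload_module: attempt rmmod under a deadline and emit a FAIL line on
# timeout instead of hanging the harness (module name is a parameter).
unload_module() {
    module="$1"
    unload_timeout="${2:-30}"
    if run_with_timeout "$unload_timeout" rmmod "$module"; then
        echo "[INFO] unload ok: $module"
    else
        echo "[FAIL] unload timed out or failed after ${unload_timeout}s: $module"
        return 1
    fi
}
```

This matches the behavior visible in the log below: the iteration is marked FAIL after the deadline and evidence collection proceeds rather than blocking on the stuck rmmod.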

root@rb3gen2-core-kit:/tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation# ./run.sh --module fastrpc
[INFO] 1970-01-01 00:01:06 - ---------------- Starting Module_Reload_Validation ----------------
[INFO] 1970-01-01 00:01:06 - Platform Details: machine='Qualcomm Technologies, Inc. Robotics RB3gen2' target='Kodiak' kernel='6.18.7-g0e842c87bef3-dirty' arch='aarch64'
[INFO] 1970-01-01 00:01:06 - Args: module='fastrpc' iterations=3 unload_timeout=30 load_timeout=30 settle_timeout=20 mode='profile-default' sysrq_dump=1
[INFO] 1970-01-01 00:01:06 - Sysrq hang dump policy: enabled on timeout paths only
[INFO] 1970-01-01 00:01:06 - [fastrpc] module=fastrpc mode=daemon_lifecycle iterations=3
[INFO] 1970-01-01 00:01:06 - [fastrpc] warmup: unmasking and starting adsprpcd.service
[INFO] 1970-01-01 00:01:07 - [fastrpc] warmup: unmasking and starting cdsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: stopping adsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: masking adsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: killing remaining adsprpcd.service cgroup processes
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: stopping cdsprpcd.service
[INFO] 1970-01-01 00:01:08 - [fastrpc] quiesce: masking cdsprpcd.service
[INFO] 1970-01-01 00:01:09 - [fastrpc] quiesce: killing remaining cdsprpcd.service cgroup processes
[INFO] 1970-01-01 00:01:12 - [fastrpc] sending SIGTERM to pid=587 pattern=/usr/bin/adsprpcd
[INFO] 1970-01-01 00:01:28 - [fastrpc] iter 1/3 exec: rmmod fastrpc
[FAIL] 1970-01-01 00:02:02 - [fastrpc] iter 1/3 unload timed out after 30s (pid=12476)
[INFO] 1970-01-01 00:02:02 - [fastrpc] unload timeout log: /tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01/unload.log
[INFO] 1970-01-01 00:02:02 - [fastrpc] collecting hang evidence in: /tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01/hang_evidence

The necessary logs remain available after the run:

root@rb3gen2-core-kit:/tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01# ls -l
total 4
drwxr-xr-x 2 root root 340 Jan  1 00:02 hang_evidence
drwxr-xr-x 2 root root 260 Jan  1 00:01 pre_state
-rw-r--r-- 1 root root 185 Jan  1 00:02 unload.log
drwxr-xr-x 2 root root 120 Jan  1 00:01 unload_timeout
root@rb3gen2-core-kit:/tmp/Runner/suites/Kernel/Baseport/Module_Reload_Validation/results/Module_Reload_Validation/fastrpc/iter_01#

@mwasilew
Contributor

Is there an example job using this code? It almost seems like there is too much code for one commit.

@basak-qcom

basak-qcom commented Mar 11, 2026

I'm also puzzled by the complexity of this. I was expecting <30 lines in total! For example, while I agree it makes sense to make the load/unload test generic of the specific module being tested, why is fastrpc.profile 438 lines, when I'd expect it to then declare the names of a kernel module, a couple of systemd services, and perhaps a few other parameters only? I imagine adding a second profile for a different kernel module and having to duplicate a huge bunch of code. So then what did the abstraction do for us?

@smuppand
Contributor Author

> I'm also puzzled by the complexity of this. I was expecting <30 lines in total! For example, while I agree it makes sense to make the load/unload test generic of the specific module being tested, why is fastrpc.profile 438 lines, when I'd expect it to then declare the names of a kernel module, a couple of systemd services, and perhaps a few other parameters only? I imagine adding a second profile for a different kernel module and having to duplicate a huge bunch of code. So then what did the abstraction do for us?

Thanks — this is a fair concern.
The intent of the abstraction was to keep the runner generic (run.sh + lib_module_reload.sh) and isolate only the module-specific lifecycle in the profile. In the current fastrpc case, that profile ended up carrying too much procedural logic because the unload hang is tightly coupled to adsprpcd / cdsprpcd stop-mask-kill-wait handling and the evidence collection around that path.

I agree the current split is not ideal if future profiles have to repeat this much code. The right end state is for a profile to be mostly declarative (module name, services, patterns, mode, a few knobs), with the common service/process wait/kill helpers moved into lib_module_reload.sh.

For this PR, I started with the fastrpc-specific flow to make the unload-hang regression reproducible first. I can follow up by shrinking fastrpc.profile and moving the reusable pieces into the shared library so that adding a second module profile is much smaller and cleaner.
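For illustration, a mostly-declarative profile under that end state might reduce to a handful of assignments; all names below are hypothetical, not the current fastrpc.profile contents:

```shell
# Hypothetical declarative profile shape; variable names are illustrative.
MODULE_NAME="fastrpc"                                 # kernel module under test
LIFECYCLE_MODE="daemon_lifecycle"                     # how the shared engine drives it
DAEMON_SERVICES="adsprpcd.service cdsprpcd.service"   # units to stop/mask/restore
DAEMON_PATTERNS="/usr/bin/adsprpcd /usr/bin/cdsprpcd" # leftover processes to kill
UNLOAD_TIMEOUT=30
LOAD_TIMEOUT=30
SETTLE_TIMEOUT=20
```

A second module's profile would then differ only in these values, with the wait/kill/evidence logic shared in lib_module_reload.sh.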

@smuppand
Contributor Author

> Is there an example job using this code? It almost seems like there is too much code for one commit.

https://lava.infra.foundries.io/scheduler/job/158642

…ence capture

Avoid wedging the test harness when unload or load commands time out
by returning immediately after timeout evidence collection.

Also narrow module state capture to the target module and improve
timeout-path stdout reporting for easier debugging.

Signed-off-by: Srikanth Muppandam <smuppand@qti.qualcomm.com>
@smuppand
Contributor Author

> I'm also puzzled by the complexity of this. I was expecting <30 lines in total! For example, while I agree it makes sense to make the load/unload test generic of the specific module being tested, why is fastrpc.profile 438 lines, when I'd expect it to then declare the names of a kernel module, a couple of systemd services, and perhaps a few other parameters only? I imagine adding a second profile for a different kernel module and having to duplicate a huge bunch of code. So then what did the abstraction do for us?

I’ve pushed changes to reduce the procedural logic in fastrpc.profile. The common service/process lifecycle handling now lives in the shared module-reload layer, so future profiles can stay smaller and mostly parameter-driven, with only the module-specific lifecycle remaining in the profile. Please take another look.
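As a sketch of what such a shared lifecycle helper could look like once it moves into lib_module_reload.sh (the function name is hypothetical; the real API may differ), following the stop → mask → kill sequence visible in the run log above:

```shell
# Hypothetical shared helper; mirrors the quiesce steps from the run log.
# Errors are tolerated so the harness keeps going on partially-up systems.
quiesce_service() {
    unit="$1"
    systemctl stop "$unit" 2>/dev/null || true    # stop the daemon
    systemctl mask "$unit" 2>/dev/null || true    # keep it from restarting
    # Kill anything still left in the unit's cgroup.
    systemctl kill --signal=SIGKILL "$unit" 2>/dev/null || true
}
```

A profile would then only list its units and let the shared layer iterate over them, rather than repeating the sequence per daemon.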

- cd Runner/suites/Kernel/Baseport/Module_Reload_Validation
- SYSRQ_ARG=""
- if [ "${ENABLE_SYSRQ_HANG_DUMP}" = "0" ]; then SYSRQ_ARG="--disable-sysrq-hang-dump"; fi
- ./run.sh --module "${PROFILE}" --iterations "${ITERATIONS}" --mode "${MODE}" --timeout-unload "${TIMEOUT_UNLOAD}" --timeout-load "${TIMEOUT_LOAD}" --timeout-settle "${TIMEOUT_SETTLE}" ${SYSRQ_ARG} || true
Contributor

@vnarapar vnarapar commented Mar 13, 2026

Any specific reason for using ${SYSRQ_ARG}? We could instead pass --enable-sysrq-hang-dump 0 directly, which would reduce the number of CLI parameters and also drop the condition on line #26.

I see you are using the same ENABLE_SYSRQ_HANG_DUMP to set the value as 1/0 anyway.

Contributor Author

> Any specific reason for using ${SYSRQ_ARG}? We could instead pass --enable-sysrq-hang-dump 0 directly, which would reduce the number of CLI parameters and also drop the condition on line #26.
>
> I see you are using the same ENABLE_SYSRQ_HANG_DUMP to set the value as 1/0 anyway.

The current runner CLI uses explicit boolean flags (--enable-sysrq-hang-dump / --disable-sysrq-hang-dump) and does not accept --enable-sysrq-hang-dump 0, so I’ve kept the YAML-side mapping to the optional disable flag unchanged.
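For context, an explicit-boolean-flag convention of this kind reads roughly like the following loop; this is a hedged sketch, and the real run.sh parser, option set, and defaults may differ:

```shell
# Sketch of explicit boolean flags, as opposed to a "--flag 0/1" value form.
# Default matches the log above (sysrq_dump=1); parser details are illustrative.
SYSRQ_DUMP=1
parse_args() {
    while [ $# -gt 0 ]; do
        case "$1" in
            --enable-sysrq-hang-dump)  SYSRQ_DUMP=1 ;;
            --disable-sysrq-hang-dump) SYSRQ_DUMP=0 ;;
            *) echo "unknown arg: $1" >&2; return 1 ;;
        esac
        shift
    done
}
```

Under this convention, the YAML side only decides whether to append the single optional --disable-sysrq-hang-dump token, which is exactly what the ${SYSRQ_ARG} mapping does.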


run:
steps:
- REPO_PATH=$PWD
Contributor

Append || true

Contributor Author

> Append || true

It appears that, aside from run.sh, this approach isn't being followed in this repository.

run:
steps:
- REPO_PATH=$PWD
- cd Runner/suites/Kernel/Baseport/Module_Reload_Validation
Contributor

Append || true

Contributor Author

> Append || true

It appears that, aside from run.sh, this approach isn't being followed in this repository.

- SYSRQ_ARG=""
- if [ "${ENABLE_SYSRQ_HANG_DUMP}" = "0" ]; then SYSRQ_ARG="--disable-sysrq-hang-dump"; fi
- ./run.sh --module "${PROFILE}" --iterations "${ITERATIONS}" --mode "${MODE}" --timeout-unload "${TIMEOUT_UNLOAD}" --timeout-load "${TIMEOUT_LOAD}" --timeout-settle "${TIMEOUT_SETTLE}" ${SYSRQ_ARG} || true
- $REPO_PATH/Runner/utils/send-to-lava.sh Module_Reload_Validation.res
Contributor

Append || true

Contributor Author

> Append || true

run.sh already handles the exit code; I don't believe the .res file step specifically requires this.

Update the fastrpc profile to stop and mask rpc daemons, verify
remaining daemon processes, and provide clearer stdout context when
quiesce or unload paths fail.

Signed-off-by: Srikanth Muppandam <smuppand@qti.qualcomm.com>
