fix(test): use sys_yield() instead of sys_sleep() in balance system test #199

Merged
MRNIU merged 3932 commits into Simple-XX:main from MRNIU:fix/balance-test-use-yield
Mar 20, 2026

MRNIU commented Mar 20, 2026

Summary

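  • Fix the balance system test failure: replace the workers' sys_sleep() with sys_yield() so that the tasks are visible to the load balancer

Root cause

Balance() judges load by calling GetQueueSize() on each core's scheduler ready queue. However, sys_sleep() moves a task out of the ready queue into a separate sleeping_tasks priority queue, which leads to the following:

  1. A worker calls sys_sleep(10) → it is removed from ready_queue_ and placed in sleeping_tasks
  2. Balance() runs every 64 ticks and checks GetQueueSize() → it returns 0 (sleeping tasks are not visible)
  3. Every core's ready queue appears empty → no migration is triggered
  4. All workers keep running on the same core → cores_used < 2 → the test fails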
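
The failure chain above can be illustrated with a small host-side model. This is only a sketch, not the kernel's code: `CoreModel`, `ModelSleep`, `ModelYield`, and `WouldMigrate` are hypothetical names, and the migration rule (steal whenever the busy core has more ready tasks than the idle one) is a simplifying assumption rather than the actual Balance() policy.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy model of one core's scheduler state (illustrative names only).
struct CoreModel {
  std::deque<int> ready_queue;      // runnable task ids
  std::vector<int> sleeping_tasks;  // task ids parked until a wake-up tick

  // Mirrors the idea behind GetQueueSize(): only ready tasks are counted.
  std::size_t QueueSize() const { return ready_queue.size(); }
};

// Model of sys_sleep(): the task is parked outside the ready queue.
inline void ModelSleep(CoreModel& core, int task) {
  core.sleeping_tasks.push_back(task);
}

// Model of sys_yield(): the task goes back to the tail of the ready queue.
inline void ModelYield(CoreModel& core, int task) {
  core.ready_queue.push_back(task);
}

// Assumed, simplified balancing rule: migrate only if the busy core shows
// more ready tasks than the idle core.
inline bool WouldMigrate(const CoreModel& busy, const CoreModel& idle) {
  return busy.QueueSize() > idle.QueueSize();
}
```

In this model, four sleeping workers leave `QueueSize()` at 0, so `WouldMigrate` is false; four yielding workers make it 4, and a migration becomes possible.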

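Fix

  • imbalance_worker: sys_sleep(10) → sys_yield(), iteration count 20 → 2000
  • affinity_pinned_worker: sys_sleep(10) → sys_yield(), iteration count 10 → 1000
  • After sys_yield() calls Schedule(), the task stays in the ready queue and is therefore visible to Balance()
  • The higher iteration counts ensure that the workers live long enough to span multiple Balance cycles (one every 64 ms)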
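
A rough sketch of the yield-based worker shape described above follows. The real workers live in tests/system_test/balance_test.cpp and are not reproduced here; `RunYieldWorker` and `YieldFn` are hypothetical names, and the yield call is injected so the loop can be exercised outside the kernel.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for the kernel's sys_yield().
using YieldFn = std::function<void()>;

// Runs `iterations` rounds; each round gives up the CPU once via `yield`.
// In the kernel the task would go back to the ready queue on each yield,
// which is what keeps it countable by a queue-size-based balancer.
inline int RunYieldWorker(int iterations, const YieldFn& yield) {
  int completed = 0;
  for (int i = 0; i < iterations; ++i) {
    // A small unit of per-iteration work would go here.
    yield();
    ++completed;
  }
  return completed;
}
```

With 2000 iterations, as used for imbalance_worker, the injected function is called 2000 times; how long that takes in wall-clock terms depends on the scheduler and is not modeled here.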

Testing

  • make SimpleKernel builds successfully (riscv64)

MRNIU added 30 commits February 28, 2026 13:47
…e interrupt controllers Interrupt members

Move arch-specific singleton type aliases from shared kernel.h into
per-arch directories, and convert interrupt controller singletons
(PlicSingleton, ApicSingleton) into private members of each arch's
Interrupt class, following the existing aarch64 pattern where Gic is
already an Interrupt member.

- Move Pl011Singleton to src/arch/aarch64/include/pl011_singleton.h
- Move SerialSingleton to file-local scope in x86_64/early_console.cpp
- Move Ns16550aSingleton to file-local scope in riscv64/interrupt_main.cpp
- Add Plic plic_ member to riscv64 Interrupt with InitPlic() deferred init
- Add Apic apic_ member to x86_64 Interrupt with InitApic() deferred init
- Move APIC creation from ArchInit() to InterruptInit() (boot order fix)
- Remove arch-specific #includes and #ifdefs from kernel.h

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
Since Interrupt is used as etl::singleton (only one instance), static
class members are semantically equivalent to non-static members. Remove
static to eliminate the need for out-of-class definitions in .cpp files.

- aarch64: interrupt_handlers -> interrupt_handlers_ (non-static member)
- riscv64: interrupt_handlers_, exception_handlers_ (drop static + defs)
- x86_64: interrupt_handlers_, idts_ (drop static + defs, keep alignas)

The alignas(4096) on x86_64 members propagates correctly through
etl::singleton via uninitialized_buffer_of<T> which uses
alignas(etl::alignment_of<T>::value) on its storage.

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
…irectories

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
- C1: Add ReapTask(current) for orphan tasks in Exit() to prevent TCB leak
- C2: Start FSM after default-constructing TCB in Clone() to avoid null deref
- I2: Use STATE_ID constant in StateExited::on_event(MsgReap)
- I3: Move GetStatus() implementation from header to .cpp file
- I4: Enqueue idle task in kReady state, then transition to kRunning
- M2: Restore dropped @todo SIGCHLD comment in exit.cpp
- M5: Add [[nodiscard]] attribute to GetStatus()

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
…_router

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
…ification

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
…ork injection

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
…config override

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
MRNIU and others added 24 commits March 20, 2026 13:16
U-Boot's image.h requires openssl/evp.h for FIT image signing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
…uild

OP-TEE's build system requires aarch64-linux-gnu-cpp which was not
symlinked via update-alternatives.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
Add kSyscallSchedGetaffinity and kSyscallSchedSetaffinity constants for
all architectures. Add dispatcher cases for sys_kill, sys_sigaction,
sys_sigprocmask, sys_sched_getaffinity, and sys_sched_setaffinity.
Implement sys_kill, sys_sigaction, and sys_sigprocmask function bodies
that delegate to TaskManager signal methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
- signal_test: SIGTERM/SIGKILL default, SIG_IGN, sigprocmask, error paths
- affinity_test: get/set affinity syscalls, cross-task, error paths
- tick_test: tick increment, sleep timing, runtime tracking
- zombie_reap_test: zombie reaping, orphan reparenting, multi-child Wait
- stress_test: 20 concurrent tasks, wait non-child, rapid create-exit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
…w clone

- Split single serial job into parallel build-riscv64 + build-aarch64 jobs
- Add dev-image.yml workflow to build/push dev container to GHCR
- Replace devcontainers/ci per-step with container: for shared container
- Use shallow clone (fetch-depth: 1) and shallow submodules (--depth 1)
- Add CMake build cache via actions/cache
- Reduce system test runs to 3 for PRs (10 for push/release)
- Add concurrency group to cancel superseded runs
- Upgrade codecov-action v3->v4, actions-gh-pages v3->v4

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
The riscv64 system test step only checked the cmake exit code, which
is always 0 even when individual tests fail inside QEMU. Align with
the aarch64 approach: capture output to file, grep for "Failed: 0"
to determine pass/fail.

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
Copilot AI review requested due to automatic review settings March 20, 2026 17:19
Copilot AI left a comment

Pull request overview

This PR fixes scheduler load-balancing visibility in the balance system test by replacing sleep-based worker loops with yield-based loops, and it also introduces broader kernel scheduling/tasking improvements (work stealing, signal plumbing, test-suite expansion) alongside removing x86_64-related build/tooling paths.

Changes:

  • Update balance system test workers to use sys_yield() and longer runtimes so tasks remain visible to Balance() and survive multiple balance intervals.
  • Implement/enable core work-stealing (TaskManager::Balance()), improve wait/block/wakeup semantics, and add basic signal support + new system tests.
  • Remove x86_64 toolchain/build paths and update docs/CI/devcontainer to reflect RISC-V + AArch64 focus.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tools/x86_64_qemu_virt.its.in | Removed x86_64 FIT template |
| tools/x86_64_boot_scr.txt | Removed x86_64 U-Boot boot script |
| tools/README.md | Removed x86_64 tooling references |
| tools/Dockerfile | Removed old tools Dockerfile |
| tools/.pre-commit-config.yaml.in | Adjusted commented clang-tidy filters |
| tests/unit_test/balance_test.cpp | Added unit tests for RR queue stealing primitives |
| tests/unit_test/README.md | Removed x86_64 unit test reference |
| tests/unit_test/CMakeLists.txt | Added balance_test.cpp to unit tests |
| tests/system_test/yield_test.cpp | Added sys_yield() system test |
| tests/system_test/wait_system_test.cpp | Adjusted child PID handling; renamed entry to wait_test() |
| tests/system_test/tick_test.cpp | Added tick/sleep/runtime tracking system tests |
| tests/system_test/thread_group_system_test.cpp | Refactored thread group tests; renamed entry to thread_group_test() |
| tests/system_test/system_test.h | Expanded test registry, updated QEMU exit, added includes |
| tests/system_test/stress_test.cpp | Added stress system tests (many tasks / wait errors / churn) |
| tests/system_test/spinlock_test.cpp | Tightened SMP barrier semantics and return value |
| tests/system_test/ramfs_system_test.cpp | Renamed to ramfs_test() and updated messages |
| tests/system_test/main.cpp | Expanded test list; improved runner PID capture; added primary-boot guard |
| tests/system_test/fork_test.cpp | Added fork system tests |
| tests/system_test/exit_system_test.cpp | Renamed to exit_test(); adjusted local PID allocation patterns |
| tests/system_test/ctor_dtor_test.cpp | Removed AArch64 FPU setup call |
| tests/system_test/clone_system_test.cpp | Renamed to clone_test(); reap children to avoid runner miscount |
| tests/system_test/balance_test.cpp | Added/updated balance system test using sys_yield() |
| tests/system_test/affinity_test.cpp | Added affinity system tests |
| tests/system_test/CMakeLists.txt | Added many new system tests; removed x86_64 QEMU flags branch |
| tests/integration_test/aarch64_minimal/main.cpp | Updated description; removed local FPU setup routine |
| tests/integration_test/CMakeLists.txt | Removed x86_64 QEMU boot flags branch |
| tests/AGENTS.md | Updated unit-test invocation docs |
| src/task/wakeup.cpp | Refactored wakeup to support per-core wake and added WakeupOne() |
| src/task/wait.cpp | Reworked wait locking/blocking to avoid lost wakeups; added ECHILD error |
| src/task/tick_update.cpp | Call Balance() every 64 ticks |
| src/task/task_manager.cpp | Implemented TaskManager::Balance() work stealing |
| src/task/sleep.cpp | Check pending signals after wake |
| src/task/signal.cpp | Added basic signal support in TaskManager |
| src/task/schedule.cpp | Added kernel_thread_bootstrap(); moved scheduler_started set under lock |
| src/task/mutex.cpp | Improved mutex lock path to avoid lost wakeups; use WakeupOne() |
| src/task/include/task_manager.hpp | Added signal/wakeup APIs and doc; moved GetCurrentCpuSched() |
| src/task/include/task_fsm.hpp | Added atomic cached state for cross-core safe reads |
| src/task/include/task_control_block.hpp | Added SignalState to task aux data |
| src/task/include/scheduler_base.hpp | Declared kernel_thread_bootstrap() for arch entry stubs |
| src/task/exit.cpp | Adjusted exit wake ordering and reparent timing |
| src/task/block.cpp | Added Block(CpuSchedData&, ...) overload to avoid lost wakeups |
| src/task/CMakeLists.txt | Added signal.cpp to build |
| src/task/AGENTS.md | Updated docs to reflect Balance implementation |
| src/syscall.cpp | Added signal + affinity syscalls to dispatcher and implementations |
| src/memory/memory.cpp | Added BmallocLock for allocator thread-safety |
| src/memory/include/virtual_memory.hpp | Updated supported arch list to remove x86_64 mention |
| src/main.cpp | Removed ad-hoc test tasks; added primary-boot guard |
| src/libc/sk_stdlib.c | Removed x86_64/SSE gating for strtod |
| src/libc/include/sk_stdlib.h | Removed x86_64/SSE gating for strtof docs block |
| src/include/syscall.hpp | Added signal/affinity syscall numbers and APIs; removed x86_64 numbering |
| src/include/signal.hpp | Added signal definitions and SignalState |
| src/include/kernel_config.hpp | Increased task/scheduler capacity constants |
| src/include/interrupt_base.h | Updated doc to remove x86_64/APIC mention |
| src/include/expected.hpp | Removed APIC error codes; added signal error codes |
| src/filesystem/vfs/open.cpp | Call FileOps::Open() to prepare FS-specific handle |
| src/filesystem/vfs/include/vfs_types.hpp | Added default FileOps::Open() hook and doc tweaks |
| src/filesystem/fatfs/include/fatfs.hpp | Added FatFsFileOps::Open() override declaration |
| src/filesystem/fatfs/fatfs.cpp | Treat FR_EXIST on mkdir as success; implement FatFS file open hook |
| src/arch/x86_64/timer.cpp | Removed x86_64 timer stub |
| src/arch/x86_64/syscall.cpp | Removed x86_64 syscall stub |
| src/arch/x86_64/switch.S | Removed x86_64 switch stub |
| src/arch/x86_64/macro.S | Removed x86_64 macro stub |
| src/arch/x86_64/interrupt_main.cpp | Removed x86_64 interrupt implementation |
| src/arch/x86_64/interrupt.cpp | Removed x86_64 interrupt implementation |
| src/arch/x86_64/interrupt.S | Removed x86_64 trap return stub |
| src/arch/x86_64/include/sipi.h | Removed x86_64 SIPI header |
| src/arch/x86_64/include/interrupt.h | Removed x86_64 interrupt header |
| src/arch/x86_64/early_console.cpp | Removed x86_64 early console |
| src/arch/x86_64/boot.S | Removed x86_64 boot code |
| src/arch/x86_64/backtrace.cpp | Removed x86_64 backtrace |
| src/arch/x86_64/arch_main.cpp | Removed x86_64 arch init |
| src/arch/x86_64/apic/io_apic.cpp | Removed x86_64 IO APIC driver |
| src/arch/x86_64/apic/include/io_apic.h | Removed x86_64 IO APIC header |
| src/arch/x86_64/apic/include/apic.h | Removed x86_64 APIC header |
| src/arch/x86_64/apic/apic.cpp | Removed x86_64 APIC implementation |
| src/arch/x86_64/apic/README.md | Removed x86_64 APIC docs |
| src/arch/x86_64/apic/CMakeLists.txt | Removed x86_64 APIC build target |
| src/arch/riscv64/timer.cpp | Call CheckPendingSignals() from timer tick |
| src/arch/riscv64/switch.S | Route new thread entry through kernel_thread_bootstrap() |
| src/arch/riscv64/macro.S | Reduced saved trap/callee registers (removed FP saves) |
| src/arch/riscv64/link.ld | Comment formatting update |
| src/arch/riscv64/interrupt.S | Updated trap context offset usage |
| src/arch/aarch64/timer.cpp | Call CheckPendingSignals() from timer tick |
| src/arch/aarch64/switch.S | Route new thread entry through kernel_thread_bootstrap() |
| src/arch/aarch64/macro.S | Reduced trap/callee context sizes (removed FP/SIMD saves) |
| src/arch/aarch64/link.ld | Comment formatting update |
| src/arch/aarch64/interrupt_main.cpp | Handle spurious IRQ IDs explicitly |
| src/arch/aarch64/interrupt.cpp | Changed EOIR write ordering |
| src/arch/aarch64/interrupt.S | Updated trap context offsets for return path |
| src/arch/README.md | Updated to reflect only riscv64/aarch64 support |
| src/arch/CMakeLists.txt | Removed x86_64 subdir linkage |
| src/arch/AGENTS.md | Updated arch structure/docs after x86_64 removal |
| src/CMakeLists.txt | Removed x86_64 QEMU boot flags branch |
| docs/docker.md | Updated commands to riscv64 defaults; removed x86_64 QEMU mention |
| cmake/x86_64-gcc.cmake | Removed x86_64 toolchain file |
| cmake/functions.cmake | Ensure /srv/tftp exists before linking files |
| cmake/compile_config.cmake | Removed x86_64-specific compile/link options |
| cmake/3rd.cmake | Removed x86_64 U-Boot defconfig |
| README_ENG.md | Updated docs to remove x86_64 support claims |
| README.md | Updated docs to remove x86_64 support claims |
| CMakePresets.json | Removed x86_64 preset; updated QEMU flags/device |
| AGENTS.md | Updated repo-level docs for two-arch support |
| 3rd/cpu_io | Updated cpu_io submodule revision |
| .gitignore | Added .claude ignore |
| .github/workflows/workflow.yml | Reworked CI to riscv64/aarch64 builds + repeated system tests + publish step |
| .github/workflows/dev-image.yml | Added workflow to build/push devcontainer image |
| .devcontainer/devcontainer.json | Removed x86_64 assembly extension recommendation |
| .devcontainer/Dockerfile | Updated devcontainer packages; removed x86_64 toolchain/QEMU; ensure /srv/tftp |
Comments suppressed due to low confidence (2)

tests/system_test/main.cpp:1

  • Using test_and_set(std::memory_order_acquire) does not publish the primary core's initialization to secondary cores (there is no matching release operation on the write). For a one-time init guard, use std::memory_order_acq_rel (or release on the writer and acquire on readers) so other cores reliably observe initialization effects.

tests/system_test/balance_test.cpp:1

  • The PR title/description focus on a narrow test fix (swap sys_sleep() → sys_yield() in balance system test), but the diff also includes substantial changes: implementing TaskManager::Balance(), adding a signal subsystem, resizing kernel limits, adding many new system tests, changing CI, and removing x86_64 support/tooling. Consider updating the PR description/title to reflect the full scope, or splitting into smaller PRs so the balance-test fix can be reviewed and landed independently.
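
The "release on the writer and acquire on readers" option from the first suppressed comment could look roughly like the sketch below. It is not the code in tests/system_test/main.cpp; every name in the `sketch` namespace is hypothetical. It assumes a separate `ready` flag published after initialization finishes, because the flag that elects the primary core is set before any initialization has run.

```cpp
#include <atomic>
#include <cassert>

namespace sketch {

std::atomic_flag claimed = ATOMIC_FLAG_INIT;  // elects the core that runs init
std::atomic<bool> ready{false};               // set once init has completed
int shared_state = 0;                         // stands in for initialized data

// Returns true for exactly one caller (the "primary" core in this model).
inline bool TryClaimInit() {
  return !claimed.test_and_set(std::memory_order_acq_rel);
}

// Writer side: finish initialization first, then publish with release.
inline void PublishInit(int value) {
  shared_state = value;
  ready.store(true, std::memory_order_release);
}

// Reader side: an acquire load that observes `true` also makes the
// writes done before the release store visible to this thread.
inline bool InitVisible() {
  return ready.load(std::memory_order_acquire);
}

}  // namespace sketch
```

A secondary core in this model would spin on `InitVisible()` before touching `shared_state`.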

MRNIU added 2 commits March 21, 2026 01:41
Workers in tight sys_yield() loops create extreme scheduling pressure,
triggering a recursive spinlock panic on sched_lock. The window between
Schedule()'s UnLock (interrupts restored) and switch_to allows a timer
interrupt to re-enter the scheduler path.

Fix: batch 10 yields (visible to Balance()) with 1ms sleeps (reduces
lock contention). Total lifetime still covers multiple Balance() cycles.

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
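
The batching described in this commit message (10 yields, then a 1 ms sleep) can be sketched as follows. This is an illustration only: `WorkerHooks` and `RunBatchedWorker` are hypothetical names, the number of batches is a free parameter because the commit message does not state it, and treating the sys_sleep() argument as milliseconds is an assumption.

```cpp
#include <cassert>
#include <functional>

// Injected stand-ins for the kernel calls, so the loop runs on a host.
struct WorkerHooks {
  std::function<void()> yield;         // plays the role of sys_yield()
  std::function<void(int)> sleep_ms;   // plays the role of sys_sleep(ms)
};

// Each batch issues 10 yields (task stays in the ready queue, so it is
// visible to a queue-size-based balancer) followed by one 1 ms sleep
// (task leaves the ready queue briefly, easing scheduler lock pressure).
inline void RunBatchedWorker(int batches, const WorkerHooks& hooks) {
  constexpr int kYieldsPerBatch = 10;
  constexpr int kSleepMs = 1;
  for (int b = 0; b < batches; ++b) {
    for (int i = 0; i < kYieldsPerBatch; ++i) {
      hooks.yield();
    }
    hooks.sleep_ms(kSleepMs);
  }
}
```

Five batches, for example, produce 50 yields and 5 sleeps totalling 5 ms of requested sleep time.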
Previously CI grepped for 'Failed: 0' which matches even when tests
time out (e.g. 'Failed: 0 | Timeout: 1'), silently passing a hung test.

Now the kernel test runner prints 'RESULT: ALL TESTS PASSED' only when
every test passes with no failures and no timeouts. CI greps for this
marker instead, correctly catching all failure modes:
- Test assertion failures (no marker printed)
- Kernel PANIC/deadlock (no output at all, timeout 300 kills QEMU)
- Individual test hangs (runner marks as Timeout, marker not printed)

Signed-off-by: Niu Zhihong <zhihong@nzhnb.com>
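
The difference between the two CI checks can be shown with a tiny string-matching sketch. The actual CI uses grep on the captured QEMU output; the helper names below are hypothetical, and only the two marker strings quoted in the commit message are taken from the source.

```cpp
#include <cassert>
#include <string>

// Old check: passes whenever the substring "Failed: 0" appears anywhere.
inline bool OldCheckPasses(const std::string& log) {
  return log.find("Failed: 0") != std::string::npos;
}

// New check: passes only if the runner printed the explicit success marker.
inline bool NewCheckPasses(const std::string& log) {
  return log.find("RESULT: ALL TESTS PASSED") != std::string::npos;
}
```

For the line `Failed: 0 | Timeout: 1`, the old check passes while the new one does not; for empty output (a panic or deadlock before any summary), both fail.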
MRNIU merged commit 381cbca into Simple-XX:main Mar 20, 2026
4 checks passed