Conversation
Greptile Summary

This PR introduces a NativeModule class that wraps native executables as managed subprocesses, refines its rebuild logic and logging, and improves keyboard-interrupt handling in the Python worker.
Confidence Score: 4/5

Safe to merge after addressing the unhandled TimeoutExpired after SIGKILL in stop(). One P1 defect: if proc.wait() after SIGKILL times out, stop() exits mid-cleanup, leaving _process/_watchdog set and super().stop() never called, permanently corrupting module state. The remaining findings are P2 and don't block correctness on the happy path.

dimos/core/native_module.py — specifically the stop() method's SIGKILL fallback path (lines 263–269).
| Filename | Overview |
|---|---|
| dimos/core/native_module.py | New NativeModule class wrapping native executables as managed subprocesses; P1 bug: unhandled TimeoutExpired after SIGKILL in stop() leaves module state inconsistent; also uses shell=True for build_command. |
| dimos/core/coordination/python_worker.py | Adds SIGINT ignore in worker entrypoint for cleaner coordinator-orchestrated shutdown; minor: lock held for up to 5s during shutdown poll, and _suppress_console_output lacks context manager for devnull. |
| dimos/core/test_native_module.py | New test file for NativeModule covering process crash/watchdog behavior and autoconnect blueprint wiring; tests look correct and cover the main code paths. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant C as Coordinator
    participant NM as NativeModule
    participant W as Watchdog Thread
    participant P as Native Process
    C->>NM: start()
    NM->>NM: _maybe_build()
    NM->>NM: _collect_topics()
    NM->>P: Popen(cmd, start_new_session=True)
    NM->>W: Thread(_watch_process).start()
    W->>P: proc.wait() [blocks]
    alt Process crashes unexpectedly
        P-->>W: exit(rc)
        W->>NM: stop() [watchdog-triggered]
    else Normal shutdown
        C->>NM: stop()
        NM->>NM: acquire _stop_lock, set _stopping=True
        NM->>P: send_signal(SIGTERM)
        NM->>P: proc.wait(timeout)
        alt SIGTERM timeout
            NM->>P: proc.kill()
            NM->>P: proc.wait(timeout) [unhandled TimeoutExpired bug]
        end
        NM->>W: watchdog.join(timeout)
        NM->>NM: _process=None, _watchdog=None
        NM->>C: super().stop()
    end
    note over NM,P: SIGINT ignored in python_worker entrypoint
    note over NM,P: so coordinator owns shutdown sequencing
```
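The SIGINT note in the diagram can be sketched as follows; `worker_entrypoint` is a hypothetical name standing in for the actual python_worker entrypoint, not the PR's real API:

```python
import signal


def worker_entrypoint() -> None:
    # Hypothetical sketch: ignore SIGINT in the worker process so a Ctrl-C
    # in the terminal reaches only the coordinator, which then drives the
    # orderly stop() sequence shown in the diagram above.
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    # ... the worker's RPC loop would run here ...


worker_entrypoint()
print(signal.getsignal(signal.SIGINT) is signal.SIG_IGN)  # True
```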
Comments Outside Diff (2)
- dimos/core/coordination/python_worker.py, lines 274–290 (link): Lock held for up to 5 s during `conn.poll()` in `shutdown()`. `self._lock` is acquired around `conn.poll(timeout=5)`, so any thread still attempting an RPC (e.g., a stale daemon thread that hasn't noticed shutdown) will block for up to 5 seconds waiting for the lock. In practice shutdown should be quiescent, but consider releasing the lock before the poll or using a shorter timeout with a retry.
- dimos/core/coordination/python_worker.py, lines 304–309 (link): File descriptor leak on exception in `_suppress_console_output()`. `devnull` is opened without a context manager; if either `os.dup2()` call raises, `devnull.close()` is never reached and the file descriptor leaks until GC. Use `with open(...)` to guarantee cleanup.
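One way to address the lock-held-during-poll finding is to poll in short slices, re-acquiring the lock each slice so no other thread waits long. This is a hedged sketch, not the module's actual code; `shutdown_poll` is a made-up name:

```python
import threading
import time
from multiprocessing import Pipe


def shutdown_poll(conn, lock, total_timeout=5.0, step=0.2):
    # Poll in short slices so no other thread waits on `lock` for more
    # than `step` seconds at a time.
    deadline = time.monotonic() + total_timeout
    while True:
        with lock:
            if conn.poll(timeout=step):
                return True
        if time.monotonic() >= deadline:
            return False


parent, child = Pipe()
child.send("worker exited")
print(shutdown_poll(parent, threading.Lock(), total_timeout=1.0))  # True
```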
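For the descriptor-leak finding, a context-manager version along these lines would guarantee the devnull descriptor is closed even if `os.dup2()` raises. This is a sketch under the assumption that `_suppress_console_output` redirects fds 1 and 2; the real helper may differ:

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def suppress_console_output():
    # Save the real stdout/stderr fds, point them at /dev/null for the
    # duration of the block, then restore. `with open(...)` closes the
    # devnull fd even if an os.dup2 call raises.
    saved_out, saved_err = os.dup(1), os.dup(2)
    try:
        with open(os.devnull, "wb") as devnull:
            sys.stdout.flush()
            sys.stderr.flush()
            os.dup2(devnull.fileno(), 1)
            os.dup2(devnull.fileno(), 2)
        yield
    finally:
        sys.stdout.flush()
        sys.stderr.flush()
        os.dup2(saved_out, 1)
        os.dup2(saved_err, 2)
        os.close(saved_out)
        os.close(saved_err)


with suppress_console_output():
    # Writes to fd 1 inside the block go to /dev/null.
    os.write(1, b"suppressed\n")
```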
Reviews (1): Last reviewed commit: "native module refinement"
```diff
 logger.warning(
-    "Native process did not exit, sending SIGKILL", pid=self._process.pid
+    "Native process did not exit, sending SIGKILL",
+    module=self._mod_label,
+    pid=proc.pid,
 )
-self._process.kill()
-self._process.wait(timeout=5)
+proc.kill()
+proc.wait(timeout=self.config.shutdown_timeout)
 if self._watchdog is not None and self._watchdog is not threading.current_thread():
     self._watchdog.join(timeout=DEFAULT_THREAD_JOIN_TIMEOUT)
 self._watchdog = None
 self._process = None
```
Unhandled TimeoutExpired after SIGKILL leaves module in inconsistent state
After proc.kill(), the second proc.wait(timeout=self.config.shutdown_timeout) call has no exception handler. If subprocess.TimeoutExpired is raised (e.g., on a heavily-loaded system or in containers where zombie reaping is slow), the exception propagates out of stop(), leaving self._process and self._watchdog non-null and super().stop() never called — breaking any future restart attempt and leaking the watchdog thread.
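A minimal sketch of a guarded fallback, assuming the stop() shape described above; `cleanup` is a hypothetical stand-in for the `_process = None` / `_watchdog = None` / `super().stop()` sequence:

```python
import subprocess
import sys


def stop(proc, shutdown_timeout, cleanup):
    # SIGTERM first, escalate to SIGKILL, and guard the second wait so the
    # cleanup callback runs even if the kernel is slow to reap the child.
    try:
        proc.terminate()
        proc.wait(timeout=shutdown_timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            proc.wait(timeout=shutdown_timeout)
        except subprocess.TimeoutExpired:
            pass  # log and continue; never abort mid-cleanup
    finally:
        cleanup()


state = {"cleaned": False}
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
stop(proc, 2.0, lambda: state.update(cleaned=True))
print(state["cleaned"])  # True
```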
```python
proc = subprocess.Popen(
    self.config.build_command,
    shell=True,
```
shell=True with a config-supplied string is a command-injection risk
build_command comes directly from NativeModuleConfig and is passed to Popen with shell=True. If the config is ever hydrated from an external source (environment variable, remote config store, user input), arbitrary shell commands could be injected. Prefer passing the command as a pre-split list and using shell=False, or at minimum document that build_command must only come from trusted sources.
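A sketch of the suggested `shell=False` variant; `build_command` below is an illustrative value, and `shlex.split` handles quoting when the configured command must remain a single string:

```python
import shlex
import subprocess
import sys

# Illustrative config value; quoting survives shlex.split intact.
build_command = f"{shlex.quote(sys.executable)} -c 'print(6 * 7)'"

# Split into an argv list so no shell ever interprets the string.
args = shlex.split(build_command)
result = subprocess.run(args, capture_output=True, text=True, check=True)
print(result.stdout.strip())  # 42
```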
```python
logger = setup_logger()


class LogFormat(enum.Enum):
```
I thought the comments on the PR were saying to simplify the stdout/err handling, and AFAIK it wasn't used.
If we're going to have JSON vs. normal logging, shouldn't that be for all modules? It doesn't make sense to me that it would be native-module specific.
It's not native-module specific: our Python logging system supports rich metadata (dictionaries) attached to each log line. For simplicity, native modules support just STDOUT/STDERR parsing into strings, but rich JSON logging is our dimos-wide default.
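To illustrate the distinction (a hedged sketch, not dimos's actual formatter): a JSON log line carries the per-record metadata dict, whereas a native module's STDOUT/STDERR lines would arrive as plain strings:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    # Hypothetical formatter: emit the message plus any dict attached to
    # the record as one JSON object per line.
    def format(self, record):
        payload = {"level": record.levelname, "msg": record.getMessage()}
        payload.update(getattr(record, "meta", {}))
        return json.dumps(payload)


record = logging.LogRecord("demo", logging.WARNING, __file__, 1,
                           "Native process did not exit", None, None)
record.meta = {"pid": 1234, "module": "native"}
print(JsonFormatter().format(record))
```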
| """ | ||
| exe = Path(self.config.executable) | ||
| if exe.exists(): | ||
| is_prod = os.environ.get("PROD") |
You are introducing a magical env var; please use the config system and global config.
```python
is_prod = os.environ.get("PROD")


if is_prod:
    if not exe.exists():
```
We do want autobuild to work in prod on the first run; in the case of Nix, this will let us distribute binaries but intelligently build locally when needed (just like pip does for Python libs).
```diff
 logger.info(
-    "Executable not found, running build",
+    "Building native module",
```
We shouldn't, by default, run build-system change detection (even just to validate file changes) for all native modules all of the time in dev; this is expensive. Toggle it per module if you are developing that module. If you have 5 native modules, we don't want to run 5 Nix instances every time someone runs your blueprint in dev.
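Combining the two review comments above, the intended policy might look like this sketch (`maybe_build` and `force_rebuild` are illustrative names, not the PR's API): build whenever the binary is missing, in prod or dev alike, but only run expensive change detection when the module opts in:

```python
from pathlib import Path


def maybe_build(executable: str, force_rebuild: bool, build) -> bool:
    # Always build a missing binary (pip-style first-run behavior, even
    # in prod); only re-check for source changes when the module opts in.
    exe = Path(executable)
    if not exe.exists():
        build()
        return True
    if force_rebuild:
        build()  # change detection / nix rebuild would happen here
        return True
    return False


calls = []
assert maybe_build("/nonexistent/bin/mod", False, lambda: calls.append("build"))
print(calls)  # ['build']
```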
```python
if sys.platform.startswith("linux"):
    import ctypes

    _LIBC = ctypes.CDLL("libc.so.6", use_errno=True)
```
Suggested change:

```diff
-_LIBC = ctypes.CDLL("libc.so.6", use_errno=True)
+from ctypes.util import find_library
+_LIBC = ctypes.CDLL(find_library("c"), use_errno=True)
```
```python
from dimos.core.module import Module, ModuleConfig
from dimos.utils.logging_config import setup_logger


# ctypes is only needed for the Linux child-preexec helper below. Hoisting
```
I assume this is LLM-generated code, because it adds a lot of unnecessary explanations which obscure what it's actually needed for. This could just as easily be:

```python
from ctypes.util import find_library
import ctypes


def set_process_to_die_when_parent_dies() -> None:
    if not sys.platform.startswith("linux"):
        return
    PR_SET_PDEATHSIG = 1
    libc = ctypes.CDLL(find_library("c"), use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM) != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"prctl(PR_SET_PDEATHSIG) failed: {os.strerror(err)}")

...
self._process = subprocess.Popen(
    ...,
    preexec_fn=set_process_to_die_when_parent_dies,
)
```
Native module refinement (rebuild logic and logging) and keyboard interrupt improvement