Conversation
Greptile Summary

This PR introduces a NativeModule class that wraps native executables as managed subprocesses, refines its rebuild logic and logging, and improves keyboard-interrupt handling in the Python worker.
Confidence Score: 4/5

Safe to merge after addressing the unhandled TimeoutExpired after SIGKILL in stop(). One P1 defect: if proc.wait() after SIGKILL times out, stop() exits mid-cleanup, leaving _process/_watchdog set and super().stop() never called, permanently corrupting module state. The remaining findings are P2 and don't block correctness on the happy path.

dimos/core/native_module.py — specifically the stop() method's SIGKILL fallback path (lines 263–269).
| Filename | Overview |
|---|---|
| dimos/core/native_module.py | New NativeModule class wrapping native executables as managed subprocesses; P1 bug: unhandled TimeoutExpired after SIGKILL in stop() leaves module state inconsistent; also uses shell=True for build_command. |
| dimos/core/coordination/python_worker.py | Adds SIGINT ignore in worker entrypoint for cleaner coordinator-orchestrated shutdown; minor: lock held for up to 5s during shutdown poll, and _suppress_console_output lacks context manager for devnull. |
| dimos/core/test_native_module.py | New test file for NativeModule covering process crash/watchdog behavior and autoconnect blueprint wiring; tests look correct and cover the main code paths. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant C as Coordinator
    participant NM as NativeModule
    participant W as Watchdog Thread
    participant P as Native Process
    C->>NM: start()
    NM->>NM: _maybe_build()
    NM->>NM: _collect_topics()
    NM->>P: Popen(cmd, start_new_session=True)
    NM->>W: Thread(_watch_process).start()
    W->>P: proc.wait() [blocks]
    alt Process crashes unexpectedly
        P-->>W: exit(rc)
        W->>NM: stop() [watchdog-triggered]
    else Normal shutdown
        C->>NM: stop()
        NM->>NM: acquire _stop_lock, set _stopping=True
        NM->>P: send_signal(SIGTERM)
        NM->>P: proc.wait(timeout)
        alt SIGTERM timeout
            NM->>P: proc.kill()
            NM->>P: proc.wait(timeout) [unhandled TimeoutExpired bug]
        end
        NM->>W: watchdog.join(timeout)
        NM->>NM: _process=None, _watchdog=None
        NM->>C: super().stop()
    end
    note over NM,P: SIGINT ignored in python_worker entrypoint
    note over NM,P: so coordinator owns shutdown sequencing
```
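The SIGINT note in the diagram can be sketched as follows; `worker_entrypoint` is a hypothetical name standing in for the actual python_worker entrypoint, not the PR's real API:

```python
import signal


def worker_entrypoint() -> None:
    # Hypothetical sketch: ignore SIGINT in the worker process so a Ctrl-C
    # in the terminal reaches only the coordinator, which then drives the
    # orderly stop() sequence shown in the diagram above.
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    # ... the worker's RPC loop would run here ...


worker_entrypoint()
print(signal.getsignal(signal.SIGINT) is signal.SIG_IGN)  # True
```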
Comments Outside Diff (2)
- dimos/core/coordination/python_worker.py, lines 274–290 (link): Lock held for up to 5 s during `conn.poll()` in `shutdown()`. `self._lock` is acquired around `conn.poll(timeout=5)`, so any thread still attempting an RPC (e.g., a stale daemon thread that hasn't noticed shutdown) will block for up to 5 seconds waiting for the lock. In practice shutdown should be quiescent, but consider releasing the lock before the poll or using a shorter timeout with a retry.
- dimos/core/coordination/python_worker.py, lines 304–309 (link): File descriptor leak on exception in `_suppress_console_output()`. `devnull` is opened without a context manager; if either `os.dup2()` call raises, `devnull.close()` is never reached and the file descriptor leaks until GC. Use `with open(...)` to guarantee cleanup.
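One way to address the lock-held-during-poll finding is to poll in short slices, re-acquiring the lock each slice so no other thread waits long. This is a hedged sketch, not the module's actual code; `shutdown_poll` is a made-up name:

```python
import threading
import time
from multiprocessing import Pipe


def shutdown_poll(conn, lock, total_timeout=5.0, step=0.2):
    # Poll in short slices so no other thread waits on `lock` for more
    # than `step` seconds at a time.
    deadline = time.monotonic() + total_timeout
    while True:
        with lock:
            if conn.poll(timeout=step):
                return True
        if time.monotonic() >= deadline:
            return False


parent, child = Pipe()
child.send("worker exited")
print(shutdown_poll(parent, threading.Lock(), total_timeout=1.0))  # True
```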
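For the descriptor-leak finding, a context-manager version along these lines would guarantee the devnull descriptor is closed even if `os.dup2()` raises. This is a sketch under the assumption that `_suppress_console_output` redirects fds 1 and 2; the real helper may differ:

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def suppress_console_output():
    # Save the real stdout/stderr fds, point them at /dev/null for the
    # duration of the block, then restore. `with open(...)` closes the
    # devnull fd even if an os.dup2 call raises.
    saved_out, saved_err = os.dup(1), os.dup(2)
    try:
        with open(os.devnull, "wb") as devnull:
            sys.stdout.flush()
            sys.stderr.flush()
            os.dup2(devnull.fileno(), 1)
            os.dup2(devnull.fileno(), 2)
        yield
    finally:
        sys.stdout.flush()
        sys.stderr.flush()
        os.dup2(saved_out, 1)
        os.dup2(saved_err, 2)
        os.close(saved_out)
        os.close(saved_err)


with suppress_console_output():
    # Writes to fd 1 inside the block go to /dev/null.
    os.write(1, b"suppressed\n")
```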
Reviews (1): Last reviewed commit: "native module refinement"
```diff
 logger.warning(
-    "Native process did not exit, sending SIGKILL", pid=self._process.pid
+    "Native process did not exit, sending SIGKILL",
+    module=self._mod_label,
+    pid=proc.pid,
 )
-self._process.kill()
-self._process.wait(timeout=5)
+proc.kill()
+proc.wait(timeout=self.config.shutdown_timeout)
 if self._watchdog is not None and self._watchdog is not threading.current_thread():
     self._watchdog.join(timeout=DEFAULT_THREAD_JOIN_TIMEOUT)
 self._watchdog = None
 self._process = None
```
Unhandled TimeoutExpired after SIGKILL leaves module in inconsistent state
After proc.kill(), the second proc.wait(timeout=self.config.shutdown_timeout) call has no exception handler. If subprocess.TimeoutExpired is raised (e.g., on a heavily-loaded system or in containers where zombie reaping is slow), the exception propagates out of stop(), leaving self._process and self._watchdog non-null and super().stop() never called — breaking any future restart attempt and leaking the watchdog thread.
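A minimal sketch of a guarded fallback, assuming the stop() shape described above; `cleanup` is a hypothetical stand-in for the `_process = None` / `_watchdog = None` / `super().stop()` sequence:

```python
import subprocess
import sys


def stop(proc, shutdown_timeout, cleanup):
    # SIGTERM first, escalate to SIGKILL, and guard the second wait so the
    # cleanup callback runs even if the kernel is slow to reap the child.
    try:
        proc.terminate()
        proc.wait(timeout=shutdown_timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            proc.wait(timeout=shutdown_timeout)
        except subprocess.TimeoutExpired:
            pass  # log and continue; never abort mid-cleanup
    finally:
        cleanup()


state = {"cleaned": False}
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
stop(proc, 2.0, lambda: state.update(cleaned=True))
print(state["cleaned"])  # True
```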
```python
proc = subprocess.Popen(
    self.config.build_command,
    shell=True,
```
shell=True with a config-supplied string is a command-injection risk
build_command comes directly from NativeModuleConfig and is passed to Popen with shell=True. If the config is ever hydrated from an external source (environment variable, remote config store, user input), arbitrary shell commands could be injected. Prefer passing the command as a pre-split list and using shell=False, or at minimum document that build_command must only come from trusted sources.
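A sketch of the suggested `shell=False` variant; `build_command` below is an illustrative value, and `shlex.split` handles quoting when the configured command must remain a single string:

```python
import shlex
import subprocess
import sys

# Illustrative config value; quoting survives shlex.split intact.
build_command = f"{shlex.quote(sys.executable)} -c 'print(6 * 7)'"

# Split into an argv list so no shell ever interprets the string.
args = shlex.split(build_command)
result = subprocess.run(args, capture_output=True, text=True, check=True)
print(result.stdout.strip())  # 42
```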
```python
logger = setup_logger()


class LogFormat(enum.Enum):
```
I thought the comments on the PR were saying to simplify the stdout/err handling, and AFAIK it wasn't used.
If we're going to have JSON vs. normal logging, shouldn't that be for all modules? It doesn't make sense to me that it would be native-module specific.
It's not native-module specific: our Python logging system supports rich metadata (dictionaries) attached to each log line. For simplicity, native modules support just STDOUT/STDERR parsing into strings, but rich JSON logging is our dimos-wide default.
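To illustrate the distinction (a hedged sketch, not dimos's actual formatter): a JSON log line carries the per-record metadata dict, whereas a native module's STDOUT/STDERR lines would arrive as plain strings:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    # Hypothetical formatter: emit the message plus any dict attached to
    # the record as one JSON object per line.
    def format(self, record):
        payload = {"level": record.levelname, "msg": record.getMessage()}
        payload.update(getattr(record, "meta", {}))
        return json.dumps(payload)


record = logging.LogRecord("demo", logging.WARNING, __file__, 1,
                           "Native process did not exit", None, None)
record.meta = {"pid": 1234, "module": "native"}
print(JsonFormatter().format(record))
```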
| """ | ||
| exe = Path(self.config.executable) | ||
| if exe.exists(): | ||
| is_prod = os.environ.get("PROD") |
You are introducing a magical env var; please use the config system and global config.
```python
is_prod = os.environ.get("PROD")


if is_prod:
    if not exe.exists():
```
We do want autobuild to work in prod on the first run; in the case of Nix, this will let us distribute binaries but intelligently build locally when needed (just like pip does for Python libs).
```diff
 logger.info(
-    "Executable not found, running build",
+    "Building native module",
```
We shouldn't, by default, run build-system change detection (even just to validate file changes) for all native modules all of the time in dev; this is expensive. Toggle it per module if you are developing that module. If you have 5 native modules, we don't want to run 5 Nix instances every time someone runs your blueprint in dev.
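Combining the two review comments above, the intended policy might look like this sketch (`maybe_build` and `force_rebuild` are illustrative names, not the PR's API): build whenever the binary is missing, in prod or dev alike, but only run expensive change detection when the module opts in:

```python
from pathlib import Path


def maybe_build(executable: str, force_rebuild: bool, build) -> bool:
    # Always build a missing binary (pip-style first-run behavior, even
    # in prod); only re-check for source changes when the module opts in.
    exe = Path(executable)
    if not exe.exists():
        build()
        return True
    if force_rebuild:
        build()  # change detection / nix rebuild would happen here
        return True
    return False


calls = []
assert maybe_build("/nonexistent/bin/mod", False, lambda: calls.append("build"))
print(calls)  # ['build']
```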
```python
if sys.platform.startswith("linux"):
    import ctypes

    _LIBC = ctypes.CDLL("libc.so.6", use_errno=True)
```
Suggested change:

```diff
-_LIBC = ctypes.CDLL("libc.so.6", use_errno=True)
+from ctypes.util import find_library
+_LIBC = ctypes.CDLL(find_library("c"), use_errno=True)
```
```python
from dimos.core.module import Module, ModuleConfig
from dimos.utils.logging_config import setup_logger


# ctypes is only needed for the Linux child-preexec helper below. Hoisting
```
I assume this is LLM-generated code, because it adds a lot of unnecessary explanations which obscure what it's actually needed for. This could just as easily be:

```python
from ctypes.util import find_library
import ctypes


def set_process_to_die_when_parent_dies() -> None:
    if not sys.platform.startswith("linux"):
        return
    PR_SET_PDEATHSIG = 1
    libc = ctypes.CDLL(find_library("c"), use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM) != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"prctl(PR_SET_PDEATHSIG) failed: {os.strerror(err)}")

...
self._process = subprocess.Popen(
    ...,
    preexec_fn=set_process_to_die_when_parent_dies,
)
```
Native module refinement (rebuild logic and logging) and keyboard interrupt improvement