HydraIssues

hydracluster: exec reliability improvements — server-side timeout enforcement + in-flight cleanup on disconnect
open improvement Project: hydracluster Reporter: 21 May 2026 19:49

Description

## Problem

The exec system has several gaps that make a stuck command unrecoverable without manual intervention:

1. **No server-side timeout enforcement.** The `timeout` field in ExecRequest is forwarded to the agent but never acted on server-side. If the agent crashes or the process ignores the context cancellation (common with PowerShell child processes on Windows), the in-flight exec stays in `inFlight[nodeID]` forever. The server will return `pending: true` indefinitely.

2. **No in-flight cleanup on disconnect.** When a node goes offline (heartbeat timeout fires, node marked offline), the exec queue is not touched. Queued and in-flight execs persist. When the node reconnects, the old in-flight exec ID is orphaned in the `inFlight` map but the drain goroutine starts fresh — the orphaned entry accumulates over restarts.

3. **No cancellation of in-flight commands.** `CancelOne` and `ClearQueue` only operate on the `queues` slice. Once an exec is dequeued it cannot be cancelled via any API. The comment in exec_store.go:112-115 acknowledges this explicitly.

## Solution

### 1. Server-side in-flight timeout watchdog
Add a background goroutine in execStore that sweeps `inFlight` every N seconds. For each in-flight exec that has a `timeout > 0` and has been in-flight longer than `timeout + grace_period`, synthesise a timeout result and call StoreResult. This unblocks the queue without requiring the agent to respond.

### 2. In-flight cleanup on node disconnect
In StartOfflineChecker (or wherever nodes are marked offline), call a new `execStore.EvictNode(nodeID)` method that moves any in-flight exec for that node back to the front of the queue (so it retries on reconnect) or marks it as failed, depending on a retry flag on the ExecRequest.

### 3. In-flight cancellation endpoint
Add `DELETE /api/v1/nodes/{id}/exec/queue/{execId}` behaviour for in-flight execs: mark them cancelled in the inFlight map, and when the agent eventually reports a result for a cancelled exec ID, discard it and synthesise a cancellation error to any polling caller.

## Context

Full code analysis done 2026-05-21. Key files: exec_store.go (inFlight map, ClearQueue, CancelOne), handlers_body.go (handleBodyExecResult), serve.go (StartOfflineChecker). The agent-side drain loop is in hydrabody pkg/body/exec.go.