#366 hydravoice sub-processes die silently mid-event with no auto-restart (Nerdland 2026-05-23)

## Summary

During the Nerdland event on 2026-05-23, the voice pipeline on body **cosmic-pretzel-98** (node-4c2be4b0, Windows, RTX 5090) failed silently mid-event. The hydravoice.exe parent process kept running and TCP 47994 was listening, but the ffmpeg and hydravoice-player sub-processes had both died. No audio was reaching the virtual cable. The head continued streaming with no indication anything was wrong — voice just stopped working for the audience without any alert.

A manual restart (`hydravoice.exe serve`) fixed it immediately; hydrabody reconnected within 5 seconds and the head resumed normal streaming.

---

## Architecture (as observed)

- **Binary:** `C:\hydravoice\hydravoice.exe`
- **Config:** `C:\WINDOWS\system32\config\systemprofile\.hydravoice\config.yaml`
- **TCP 47994** — WebSocket endpoint for browser mic input
- **UDP 47995** — audio data pipeline (spawned sub-processes bind this)
- hydravoice spawns **ffmpeg** and **hydravoice-player** to route audio to CABLE Input (VB-Audio Virtual Cable)
- hydrabody maintains a local loopback connection to hydravoice for coordination
- hydranode manages body roles but does **not** supervise hydravoice as a child process

---

## Observed failure state

| Component | State |
|-----------|-------|
| hydravoice.exe | Running |
| TCP 47994 | Listening |
| ffmpeg sub-process | Dead |
| hydravoice-player sub-process | Dead |
| UDP 47995 | Not bound |
| WebSocket from 10.10.0.1 (venue router via WireGuard) | Established — but no audio processed |
| Auto-restart by hydranode | Did not happen |

The parent process accepted WebSocket connections and appeared healthy from the outside, but the internal audio pipeline was completely dead. No error surfaced to hydrabody or the operator.
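
The port rows in the table above can be checked from the body itself. The following is a minimal Python sketch, not part of hydravoice: the helper names are hypothetical, and it assumes the UDP socket is bound on an address that conflicts with a loopback bind. The UDP probe works by briefly trying to bind the port, so it should only be used for diagnosis, not as a continuous monitor.

```python
import socket

def tcp_is_listening(port, host="127.0.0.1", timeout_s=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

def udp_is_bound(port, host="127.0.0.1"):
    """Return True if binding host:port fails, i.e. another socket holds it.

    Caveat: any bind error (including a permission error) is reported
    as "bound", and the probe occupies the port for a brief moment.
    """
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        probe.bind((host, port))
    except OSError:
        return True
    else:
        return False
    finally:
        probe.close()

# Example (port numbers from this issue):
#   tcp_is_listening(47994)  was True during the incident
#   udp_is_bound(47995)      was False during the incident
```

A healthy parent with a dead pipeline shows up as the first check passing and the second failing, which matches the state recorded during the incident.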

---

## Root cause

hydravoice runs as a plain Windows process — not a Windows service, and not supervised by hydranode. When the ffmpeg or hydravoice-player sub-processes crash (for any reason — OOM, codec error, VB-Audio glitch), the parent process does not exit, so there is nothing for an external supervisor to detect. The sub-process death is silent: no alert, no restart, no health degradation visible to hydrabody or the monitoring stack.

---

## Impact

- Silent mid-event voice failure at a live public event (Nerdland, 2026-05-23)
- No operator alert — problem was discovered manually
- Full voice restoration required manual SSH/exec intervention during the event
- Duration of outage unknown (could have been minutes to hours before discovery)

---

## Suggested solutions

### Option 1 — Install hydravoice as a Windows service (preferred quick win)

hydravoice already has `install` and `uninstall` sub-commands. Registering it as a Windows service lets the Service Control Manager (SCM) restart it automatically when the process crashes, with no additional code.

- Run `hydravoice.exe install` on body setup/deploy
- Configure the service recovery policy: restart on first, second, and subsequent failures
- Add to hydrabody or hydracluster deploy runbook so it is applied on every body provisioning
- Verify that the service account has access to VB-Audio CABLE Input
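
The recovery policy in the second bullet could be applied from a deploy script using the standard `sc.exe failure` command. This is a sketch under assumptions: the service name `hydravoice` is a guess at what `hydravoice.exe install` registers, and the 5-second delay and one-day reset window are illustrative values, not taken from this issue.

```python
import subprocess
import sys

# Assumption: the name under which `hydravoice.exe install` registers the service.
SERVICE_NAME = "hydravoice"

def recovery_command(service, delay_ms=5000, reset_s=86400):
    """Build the sc.exe argv that restarts the service on the first,
    second, and subsequent failures."""
    actions = "/".join(f"restart/{delay_ms}" for _ in range(3))
    # sc.exe expects "reset=" and "actions=" as separate tokens from their values.
    return ["sc.exe", "failure", service,
            "reset=", str(reset_s),
            "actions=", actions]

def apply_recovery_policy(service=SERVICE_NAME):
    """Apply the policy; needs an elevated prompt on the Windows body."""
    if sys.platform != "win32":
        raise RuntimeError("sc.exe is only available on Windows")
    subprocess.run(recovery_command(service), check=True)
```

Whether the `install` sub-command already configures recovery actions is not known from this issue, so the deploy step should check the existing policy before overwriting it.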

### Option 2 — hydranode supervised child process

Have hydranode launch hydravoice as a managed child process (similar to how it manages other body roles) and restart it automatically on exit.

- Requires hydranode to know about hydravoice as a supervised role
- Gives central visibility into process lifecycle across all body nodes
- More work than Option 1 but fits the existing hydranode supervision model
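
The restart-on-exit behavior described here can be sketched as a small supervision loop. This is illustrative Python only: how hydranode actually supervises roles is not described in this issue, and the `supervise` function, its parameters, and the backoff values are all hypothetical.

```python
import subprocess
import sys
import time

def supervise(argv, max_restarts=None, backoff_s=1.0, max_backoff_s=30.0,
              sleep=time.sleep):
    """Run argv and relaunch it every time it exits.

    Returns the list of exit codes once max_restarts relaunches have
    been used up; with max_restarts=None it loops forever.
    """
    exit_codes = []
    delay = backoff_s
    while True:
        proc = subprocess.Popen(argv)
        code = proc.wait()  # blocks until the child exits
        exit_codes.append(code)
        print(f"child exited with code {code}", file=sys.stderr)
        if max_restarts is not None and len(exit_codes) > max_restarts:
            return exit_codes
        sleep(delay)  # back off so a crash loop does not spin the CPU
        delay = min(delay * 2, max_backoff_s)

# Hypothetical usage on a body node:
#   supervise([r"C:\hydravoice\hydravoice.exe", "serve"])
```

A loop like this only reacts when the supervised process exits. In this incident hydravoice.exe stayed alive, so supervision would need to be paired with a pipeline health signal to catch the same failure.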

### Option 3 — hydrabody health check for local hydravoice connection

hydrabody already maintains a local loopback connection to hydravoice. Extend this to:

- Detect when the connection drops or goes silent
- Emit a health alert to hydracluster / the monitoring stack
- Optionally trigger a restart of hydravoice via exec

This catches the failure even if hydravoice.exe itself stays alive (as in this incident), because a dead audio pipeline should be detectable from the WebSocket/loopback state.
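
The "goes silent" detection could be as simple as a timestamp watchdog. The sketch below assumes hydrabody can observe some recurring signal from hydravoice on the loopback connection (audio frames or a heartbeat); that signal, the `SilenceWatchdog` class, and the 5-second threshold are assumptions, not existing hydrabody behavior.

```python
import time

class SilenceWatchdog:
    """Flags the pipeline as stale when no activity has been seen recently."""

    def __init__(self, max_silence_s=5.0, clock=time.monotonic):
        self._max_silence_s = max_silence_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_seen = clock()

    def activity_seen(self):
        """Call whenever a frame or heartbeat arrives from hydravoice."""
        self._last_seen = self._clock()

    def is_stale(self):
        """True once the silence exceeds the configured threshold."""
        return self._clock() - self._last_seen > self._max_silence_s
```

A periodic task would call `is_stale()` and, when it returns `True`, emit the alert and optionally trigger the restart. If a quiet microphone is a legitimate state, the signal needs to be a heartbeat from the pipeline rather than the audio itself, or the watchdog will raise false alarms.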

### Option 4 — /health endpoint on hydravoice

hydravoice currently returns 404 on health check requests. Add a `/health` HTTP endpoint that:

- Returns 200 only when ffmpeg and hydravoice-player sub-processes are alive and UDP 47995 is bound
- Returns 503 (with detail) when the pipeline is degraded

This enables hydrabody, hydranode, and external monitoring (hydrastreamingmonitor) to detect a dead pipeline without needing process-level visibility.
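
The 200/503 rule above can be expressed as a small pure function plus a thin HTTP handler. This Python sketch only illustrates the logic: hydravoice's implementation language and internals are not known from this issue, and `evaluate_health`, `make_handler`, the `probe` callable, and the JSON shape are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def evaluate_health(ffmpeg_alive, player_alive, udp_bound):
    """Return (http_status, body): 200 only if every pipeline check passes."""
    checks = {
        "ffmpeg": ffmpeg_alive,
        "hydravoice-player": player_alive,
        "udp_47995": udp_bound,
    }
    failed = sorted(name for name, ok in checks.items() if not ok)
    status = 200 if not failed else 503
    body = {"status": "ok" if not failed else "degraded", "failed": failed}
    return status, body

def make_handler(probe):
    """probe() must return (ffmpeg_alive, player_alive, udp_bound)."""
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            status, body = evaluate_health(*probe())
            payload = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)
    return HealthHandler

def serve(probe, host="127.0.0.1", port=8080):
    """Serve /health until interrupted; the port is a placeholder."""
    HTTPServer((host, port), make_handler(probe)).serve_forever()
```

Listing the failed checks in the 503 body gives the operator the "with detail" part without a second request. The important property is that the status reflects the sub-processes and the UDP socket, not just the fact that the parent can answer HTTP.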

---

## Recommended approach

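Option 1 (Windows service install) is the lowest-effort fix and should be done immediately. On its own, however, it only restarts hydravoice when hydravoice.exe itself exits. In this incident the parent stayed alive, so a service restart policy alone would not have recovered the pipeline. Options 3 and 4 are complementary and should follow. The goal is that a dead audio pipeline is never silent again: it should be detected within seconds and then either self-healed or alerted.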

---

## Reproduction scenario

The exact crash trigger is unknown. Suspected causes: VB-Audio CABLE driver glitch, ffmpeg codec error on unusual mic input, or OOM on a GPU-saturated body. To reproduce the failure state deliberately, kill the ffmpeg child of hydravoice.exe while a stream is active, then verify that (a) the parent stays alive, (b) no restart occurs, and (c) no alert fires.
