HydraIssues

Body selection Phase 1: cross-venue eligibility client failover loop (supersedes #113)
open feature Project: hydracluster Reporter: cederik 29 Apr 2026 18:32

Description

Supersedes #113. Re-scopes the work after a structural review: the client failover loop is now in scope (was a follow-up); Phase 1 needs both halves to be meaningful.

Plan file: /home/claude-user/.claude/plans/expressive-sniffing-pearl.md (Phase 1).

# Why the rescope

The bigger problem this work sits inside: an organization with bodies across multiple venues should run primarily on its own fleet (same-org = cheap), and when the local same-venue body is unavailable, selection should fall over to another same-owner body automatically. Latency is the primary ranking signal for cross-venue choices (real measurement, not proxy). Cost is the tiebreaker (owned < shared < leased < marketplace). Phase 1 is the substrate everything else builds on; Phases 2-5 (real latency measurement, sharing tier and cost, mid-stream re-discovery, marketplace) are downstream issues blocked on this one.

#113 as originally written treated the client failover loop as a follow-up. That is wrong: Feature A enriches the eligibility list, but discoverBody at hydraheadflatscreen/pkg/client/discovery.go:61-74 short-circuits on the first body's WG IP unconditionally if LAN fails, and never iterates. So the richer list goes unused. The failover loop is load-bearing for Phase 1.

# Scope

## Server changes (hydracluster)

1. pkg/api/handlers_api.go:710-774 handleEligibleBodies accepts optional ?head_id=<id>. With it, applies a quickroute then cross-venue partition policy. Without it, preserves current strict-venue behavior (backwards compat).

2. Quickroute (partition gate, NOT a continuous sort key):
- Filter by district == head.district AND venue == head.venue AND owner == head.owner AND status == online AND not drained.
- If non-empty: sort by idle-first, VRAM-desc, return. Done. No latency probing needed.
- Same-venue is by definition the lowest-latency option; no need to compare against cross-venue.

3. Cross-venue fallback (only when quickroute is empty):
- Filter by district == head.district AND status == online AND not drained AND (owner == head.owner OR head.owner in body.allowed_orgs OR marketplace agreement permits).
- For Phase 1: only owner==head.owner. Sharing tier and cost ranking come in Phase 3.
- Sort within partition: idle-first, VRAM-desc (real latency in Phase 2, cost in Phase 3).

4. Add JSON field same_venue: bool per body in the response.

5. Unit tests: quickroute hit, quickroute empty + cross-venue same-owner, no candidates at all, missing head_id (strict-venue fallback).

6. Runbook entry in hydracluster/docs/runbooks/body-selection.md updated to document the partition policy.

7. NO bearer-token-based head identity. ?head_id= query param explicit. (per feedback_no_bearer_token_head_identity.md)

## Client changes (hydraheadflatscreen)

1. pkg/client/discovery.go:24-78 discoverBody:
a. Send &head_id=<id> (already in Config).
b. Failover loop: iterate the eligible list. For each body, probe LAN (TCP 47990, 1s timeout via probeReachable on :9741 proxy) then WG (same probe on the WG IP). Skip to next body when both fail. Return first reachable body's IP.
c. Currently the loop returns the first body's WG IP unconditionally on LAN failure, never moves to the next body. Fix that.

2. pkg/client/localapi.go:23-24 cachedBody / cachedBodyIP: invalidate on stream-launch failure (Moonlight exits during pairing or shortly after launch). Without invalidation, a failed body sticks across attempts.

3. Testbook entry in hydraheadflatscreen/docs/testbooks/.

# Acceptance tests (both must pass)

## Test A: cross-venue selection (peppy -> cosmic)

Targets: peppy-dumpling-32 (node-74adf4f8, bxl1/rupelmonde, owner visit.flanders), cosmic-pretzel-98 (node-4c2be4b0, bxl1/cloud-seven, owner visit.flanders), boom-pickle-38 (node-74f9fbf2, bxl1/rupelmonde).

1. Move boom-pickle to bxl1-test district to vacate peppy's same-venue partition. POST /api/v1/nodes/node-74f9fbf2/district {"district":"bxl1-test","venue":"rupelmonde"}.
2. Verify GET /api/v1/bodies/eligible?head_id=node-74adf4f8 returns only cosmic.
3. Full kiosk reset on peppy.
4. Trigger stream. Expected log lines (literals from pkg/client/config.go:61,64,68): 'LAN host 11.0.11.24 not reachable, falling back to WireGuard' then 'routing via WireGuard: 10.10.100.12'.
5. Stream stable for 6 minutes. No 'stream target changed', no stopMoonlight, no Moonlight subprocess exit.
6. pairing.yaml stores cert against 10.10.100.12.
7. Restore boom-pickle to bxl1/rupelmonde. Verify peppy returns to LAN streaming on boom-pickle.

## Test B: failover loop

1. Leave boom-pickle in bxl1/rupelmonde, leave cosmic available cross-venue (relies on Phase 1 cross-venue eligibility).
2. Install temporary pfctl rule on peppy blocking outbound TCP to boom-pickle's LAN AND WG IPs on :47990.
3. Verify nc -zv to both fails from peppy; nc -zv to cosmic 10.10.100.12:47990 succeeds.
4. Full kiosk reset on peppy.
5. Trigger stream. Expected: agent log shows discoverBody iterating past boom-pickle (probe failures on both transports) to cosmic, then 'routing via WireGuard: 10.10.100.12'.
6. Stream stable 6 minutes.
7. Cleanup pfctl rule. Verify peppy returns to LAN streaming on boom-pickle.

# Risks

- District move on boom-pickle: confirm hydranode does not react to district change before running on production body. Provisioning is per-role so typically safe; grep hydranode for district-reactive paths first or test on a lower-stakes node.
- Failover-loop probe timing: total probe time per body is up to 2s (LAN 1s + WG 1s). With N eligible bodies, worst case discoverBody takes 2N seconds. Phase 2's measured-latency sort and parallel probing are improvements; Phase 1 sequential is acceptable.
- Cache invalidation: stream-launch failure detection has to be reliable enough not to thrash. Initial heuristic: Moonlight exit within 10s of pairing OR pairing failure. Safe default: invalidate on any stream-launch failure.

# Constraints (from existing memory)

- No HeadBodyID pinning. Selection through discovery only. (feedback_body_selection_procedure)
- No bearer-token head identity. Explicit ?head_id= query param. (feedback_no_bearer_token_head_identity)
- bxl1-test is the parking district for scoping a body out of selection. (reference_bxl1_test_district)
- No em dashes. (feedback_no_emdash)
- No manual deploys; tag and push, CI auto-deploys. (feedback_no_manual_deploy)
- Memory entries must be backed by runbook entries in the owning repo. (feedback_memory_needs_runbook)

# Action

Close #113 and proceed against this issue. Plan file holds the full Phase 1-5 vision; this issue scopes Phase 1 only.