Date: 2026-06-01
Status: Proposed. Design-review artifact — the architectural decision the owner blesses before any code lands. No code ships with this PR. The parachute-patterns/migrations/2026-06-01-hub-as-supervisor.md propagation checklist lands with Phase 1, not with this design PR.
Companions:
- 2026-05-18-v06-deploy-architecture.md — single-container, hub-as-supervisor deploy shape this generalizes off-container
- 2026-05-26-fly-migration-path.md — Fly as a peer self-host target alongside Render; both are "container runtime = the process manager"
- 2026-05-28-operator-mintable-vault-admin.md — operator-token mint path the CLI→supervisor auth reuses
- 2026-04-20-module-architecture.md — module protocol (services.json, startCmd) the supervisor boots from
- ../../parachute-patterns/patterns/canonical-ports.md — 1939 hub-pin (no fallback) that the port-race ordering in §7 must respect
- ../../parachute-patterns/patterns/governance.md — RC versioning + reviewer-gated PR discipline the phasing follows
- iss self-heal on start hub (the auth bootstrap in §3 must not race)

Grounding note. This doc was written against the real parachute-hub source at main. Every file:line citation was read, not assumed. The earlier thin draft made factual errors (it claimed the CLI mints a host-admin Bearer "the same way the SPA does," which is wrong — see §3); those are corrected here. Where the code does not yet offer a clean mechanism, this doc says so and proposes one rather than hand-waving.
Parachute runs two incompatible process models today, and which one you get depends entirely on how you deployed, not on any deliberate choice.
Model A — manager-less detached daemon (parachute start / expose / init). The hub is spawned detached and unref()'d, tracked only by a pidfile; each module is an independent detached+unref()'d daemon with its own pidfile. There is no supervisor. supervisor.ts:6-30 states the design intent in its own header: the on-box flow "spawns module daemons detached + unref'd, writes a pidfile, and walks away — process lifecycle becomes the operator's problem (launchd, systemd, or a follow-up parachute restart). … If vault crashes, nothing brings it back." Used on Mac laptops and every Linux VM (EC2 / Hetzner / any VPS).
- ensureHubRunning → defaultHubSpawner (hub-control.ts:74-86, 200) — spawns hub-server.ts directly, not serve, so the detached hub has no Supervisor at all.
- defaultSpawner (lifecycle.ts:74-95) — detached: true + proc.unref(), stdio → a per-service logfile.
- ~/.parachute/<svc>/run/<svc>.pid + processState() (process-state.ts:87), whose own docstring anchors unknown = "may be externally managed (launchd-era)."

Model B — hub-as-supervisor (parachute serve). The hub runs in the foreground with an in-process Supervisor (supervisor.ts) that spawns modules as attached children, multiplexes their logs into hub stdout, and crash-restarts them on a budget. Used only inside containers today — the Dockerfile CMD ["bun","src/cli.ts","serve"] on Render and Fly.
The split produces four concrete, recurring failure classes:
EC2 ≠ Render. The exact same hub package behaves differently depending on deploy substrate. A Render box gets a supervisor, reboot survival (persistent disk + container restart), and working UI module-management. An EC2 box gets none of it. "Self-hosted with good defaults" is undermined when the defaults depend on the substrate.
No reboot survival off-container. On a VM or Mac, nothing keeps the hub or modules alive across a reboot. The README itself admits an at-login auto-start (parachute start --boot) is only "on the post-launch roadmap" (README.md:211). After a reboot, the box is simply down until the operator SSHes in and re-runs commands.
UI module-management is broken off-Render. The admin SPA drives module install/restart/upgrade/uninstall through POST /api/modules/:short/*, which require a supervisor. Without one, those endpoints return 503 supervisor_unavailable (hub-server.ts:1797-1808 and :1844-1855): "module operations require parachute serve (supervisor mode); on-box CLI uses parachute install/upgrade/restart." The detached ensureHubRunning hub has no supervisor, so the SPA's module manager is dead on every VM/Mac install. proxy-state.ts:86-130 even carries a whole second classification branch ("on-box CLI mode: no supervisor — fall back to pidfile") to cope.
Stale-daemon-drift bugs. Independent detached daemons drift from the hub's current state. The origin-pinned-credential class (the recurring "not signed in to the hub" / Cloudflare 401, hub#481/#480) is one face of it: a daemon that captured the hub origin at init keeps a stale iss after expose until something restarts it. The detached model multiplies the number of independently-restartable things that can hold stale state. (This design does not fix the origin-pinning class — that is the self-heal-on-start work — but it collapses the number of drift surfaces and makes the self-heal reliably reachable, because there is one restart authority instead of N independent daemons.)
The decision: retire Model A. Run parachute serve (Model B) under a per-platform process manager everywhere. The target runtime already exists and ships in containers; the work is mostly deletion + repointing, plus making serve reboot-persistent off-container and closing a small set of capability gaps the supervisor has versus the detached path.
One runtime: parachute serve — the hub in the foreground with an in-process Supervisor, modules as attached children. One outer keeper per platform:
| Platform | Process manager | Hub unit | Notes |
|---|---|---|---|
| Linux VM (EC2 / Hetzner / VPS) | systemd | system unit (/etc/systemd/system/parachute-hub.service) when root; user unit (~/.config/systemd/user/parachute-hub.service) + loginctl enable-linger when non-root | linger is the hub#494 gotcha, carried over |
| Mac laptop | launchd | LaunchAgent ~/Library/LaunchAgents/computer.parachute.hub.plist, RunAtLoad + KeepAlive | installed by default at init (decision D2) |
| Render / Fly | container runtime | the image CMD already runs serve; the runtime is the manager | unchanged — this is the existing shape |
| Init-less host (minimal cloud image, Docker-without-tini, nspawn) | none | graceful fallback → print the foreground parachute serve invocation + exit non-zero | no background operation — see §9 risk and D1 |
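The platform table can be read as a single dispatch function. The sketch below is illustrative only: `UnitManager`, `HostFacts`, and `pickManager` are hypothetical names, and the probes that would fill `HostFacts` (a `which systemctl` check, a managed-container check) are assumptions, not existing hub code.

```typescript
// Hypothetical sketch of the platform -> process-manager choice in the table above.
type UnitManager = "systemd-system" | "systemd-user" | "launchd" | "container" | "none";

interface HostFacts {
  platform: "linux" | "darwin";
  managedContainer: boolean; // Render / Fly: the image CMD already runs `serve`
  hasSystemctl: boolean;     // assumed result of a `which systemctl` style probe
  uid: number;               // 0 means root
}

function pickManager(h: HostFacts): UnitManager {
  if (h.managedContainer) return "container"; // the runtime is the manager
  if (h.platform === "darwin") return "launchd";
  if (!h.hasSystemctl) return "none";         // init-less host: print the foreground command, exit non-zero
  return h.uid === 0 ? "systemd-system" : "systemd-user"; // user unit also needs enable-linger
}
```

Keeping the choice in one pure function means the unit installer, the ensure-hub step, and the status hub row can all dispatch on the same value.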
The hub unit is load-bearing. Under the unified model, modules are attached children: they die with the hub (serve's stop() SIGTERMs all children before server.stop(), serve.ts:342-351; the supervisor map is transient in-memory, re-derived from services.json on every boot, supervisor.ts:28-29, 151). This is a deliberate trade. The detached model survived hub death because children were unref()'d; we give that up because the process manager restarts the hub, and serve re-boots every module from services.json via bootSupervisedModules (serve-boot.ts:55). The whole reason the migration needs the connector-service machinery is to make "the process manager restarts the hub" reliably true off-container.
Reuse src/cloudflare/connector-service.ts. That file already implements the exact per-platform install/remove seam the hub unit needs — it just hard-codes the cloudflared command and naming. It has:
- ConnectorServiceDeps seam (platform / getuid / homeDir / userName / which / run / file ops, connector-service.ts:51-101) so tests inject fakes.
- installLaunchd (:269) — writes a LaunchAgent plist (RunAtLoad + KeepAlive, :163-166), launchctl bootout → bootstrap gui/<uid> → kickstart -k (:304-321), with a legacy load -w fallback for old macOS.
- installSystemd (:331) — system vs user unit by uid (:341), Restart=always RestartSec=5 (:204-205), best-effort loginctl enable-linger for non-root user units guarded by a which("loginctl") probe + try/catch (:372-387, the hub#494 fix).
- { outcome: "installed" | "fallback" } contract (:212-225) that never throws — a missing tool degrades to fallback with a warning, rather than hard-failing the calling command.
- removeConnectorService teardown (:438).

Generalizing it (factor the cloudflared-specifics into a ManagedUnit descriptor) is Phase 2.
The principle: the running hub's in-process Supervisor is the single lifecycle authority. The CLI is a client that drives it over the module-ops HTTP API (api-modules-ops.ts), the same API the admin SPA already uses. Per-module pidfile spawning (lifecycle.ts defaultSpawner) is retired.
The thin draft was wrong. It claimed the CLI "mints a loopback admin Bearer the same way the SPA does … via /admin/host-admin-token." It cannot. handleHostAdminToken (admin-host-admin-token.ts:66-91) requires (a) a valid parachute_hub_session browser cookie set by a password login, and (b) first-admin identity. A CLI process has no session cookie. That path is the SPA's, full stop.
The real on-box credential is the operator token (operator-token.ts): ~/.parachute/operator.token, mode 0600, a hub-issued JWT. The admin scope-set (OPERATOR_TOKEN_SCOPE_SETS.admin, operator-token.ts:92-103) carries parachute:host:admin — exactly the scope the module-ops API gates on (api-modules-ops.ts:67). So the mechanism is:
The CLI reads ~/.parachute/operator.token and presents it as the Authorization: Bearer to POST /api/modules/:short/<op> on the loopback hub. It does NOT mint its own token in parallel — it reads the existing on-disk operator token via useOperatorTokenWithAutoRotate (operator-token.ts:391), which validates against the hub DB + issuer and opportunistically re-mints a within-7d-of-expiry token in place.
This is correct because:
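- It carries parachute:host:admin under the default (admin) scope-set.
- Its iss is self-healed to the hub's current origin on start hub (selfHealOperatorTokenIssuer, hooked into start-hub, operator-token.ts:534), so a token minted pre-expose validates after expose.

The auth precondition this introduces (honest gap). Today parachute start vault needs no token at all — it touches pidfiles directly. Under the unified model, every per-module verb is an authenticated module-ops call, so it needs an operator token. A fresh box that never ran parachute auth set-password / rotate-operator has no operator.token and would 401. We must close the bootstrap:

- parachute init already establishes first-admin and (per the operator-mintable arc) should mint the operator token as part of setup. Phase 3 makes init guarantee an operator token exists after a successful init (mint-on-init if absent), so the steady-state operator never sees a 401.
- If the token is still missing, the CLI fails with an actionable message: "no operator token — run `parachute auth rotate-operator` to mint one" (the existing OperatorTokenExpiredError message shape, operator-token.ts:413). Not a raw 401.
- There is no loopback bypass: operator-token.ts:9-13 is explicit that loopback is not trusted (browser extensions and compromised postinstalls hit 127.0.0.1 too). The operator token is the loopback caller's proof of operator authority.

A module-ops call requires a running hub to answer it. If the operator runs parachute restart vault while the hub is down, there is nothing to call. Resolution — the CLI ensures the hub unit is up first, then drives the supervisor: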
parachute init already establishes first-admin and (per the operator-mintable arc) should mint the operator token as part of setup. Phase 3 makes init guarantee an operator token exists after a successful init (mint-on-init if absent), so the steady-state operator never sees a 401.no operator token — run \parachute auth rotate-operator` to mint one(the existingOperatorTokenExpiredErrormessage shape,operator-token.ts:413`). Not a raw 401.operator-token.ts:9-13 is explicit that loopback is not trusted (browser extensions and compromised postinstalls hit 127.0.0.1 too). The operator token is the loopback caller's proof of operator authority.A module-ops call requires a running hub to answer it. If the operator runs parachute restart vault while the hub is down, there is nothing to call. Resolution — the CLI ensures the hub unit is up first, then drives the supervisor:
1. Probe the hub (GET /health on the configured hub port). If it answers, go to step 4.
2. If it does not answer, start the hub unit: systemctl [--user] start parachute-hub.service / launchctl kickstart -k gui/<uid>/computer.parachute.hub. (If no unit is installed — e.g. a never-migrated box — fail with "run parachute migrate to install the hub unit," or in init install it.) This replaces the old ensureHubRunning detached spawn (hub-control.ts:200); the new "ensure hub" means "ensure the unit is started," never a detached bun hub-server.ts.
3. Wait for readiness (the defaultPortListening connect-probe, lifecycle.ts:121, applied to the hub port). Bounded; on timeout, surface the hub unit's recent log (journald/launchctl print or the unit's log file) so a wedged hub is diagnosable, not a silent hang.
4. Read the operator token (self-healed iss, operator-token.ts:534) and call the module-op.

This is the explicit resolution of the verdict's bootstrap finding: read the token after the hub is ready, never mint in parallel.
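The ensure-hub sequence above can be sketched as a pure planning function, which keeps the ordering (start the unit, wait, then read the token) testable without touching a real process manager. Everything named here (`UnitKind`, `HubProbe`, `EnsureStep`, `planEnsureHub`) is hypothetical; only the commands and the error message are taken from the text.

```typescript
// Hypothetical sketch: plan the ensure-hub steps without executing them.
type UnitKind = "systemd-system" | "systemd-user" | "launchd";

interface HubProbe {
  healthy: boolean;       // did GET /health answer on the hub port?
  unitInstalled: boolean; // is a hub unit present on this box?
}

type EnsureStep =
  | { kind: "start-unit"; argv: string[] }
  | { kind: "wait-ready" }
  | { kind: "read-token" }
  | { kind: "call-module-op" }
  | { kind: "fail"; message: string };

function startUnitArgv(kind: UnitKind, uid: number): string[] {
  switch (kind) {
    case "systemd-system": return ["systemctl", "start", "parachute-hub.service"];
    case "systemd-user":   return ["systemctl", "--user", "start", "parachute-hub.service"];
    case "launchd":        return ["launchctl", "kickstart", "-k", `gui/${uid}/computer.parachute.hub`];
  }
}

function planEnsureHub(probe: HubProbe, kind: UnitKind, uid: number): EnsureStep[] {
  // The token is always read after the hub is known to be ready, never in parallel.
  const tail: EnsureStep[] = [{ kind: "read-token" }, { kind: "call-module-op" }];
  if (probe.healthy) return tail;
  if (!probe.unitInstalled) {
    return [{ kind: "fail", message: "run parachute migrate to install the hub unit" }];
  }
  return [{ kind: "start-unit", argv: startUnitArgv(kind, uid) }, { kind: "wait-ready" }, ...tail];
}
```

A real client would execute each step asynchronously and bound the wait; the plan form only pins down the order.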
POST /api/modules/:short/* handlers live in api-modules-ops.ts; they require a non-optional supervisor (api-modules-ops.ts:184) — which is always present under serve, so the 503 supervisor_unavailable gate (hub-server.ts:1844-1855) becomes unreachable. That is the off-Render UI fix, for free.
| Command | Post-unification behavior |
|---|---|
| serve | The universal runtime, invoked by the platform unit's ExecStart. Foreground hub + Supervisor. Unchanged. |
| start <svc> | Needs a new POST /api/modules/:short/start (see below). Ensure-hub (3.2) → call start → supervisor.start(req) with the boot-derived SpawnRequest. |
| start (no svc) | Ensure-hub. The hub's boot already started every installed module (bootSupervisedModules); start with no svc becomes "ensure the hub unit is up" (which transitively boots all modules). |
| stop <svc> | Needs a new POST /api/modules/:short/stop → supervisor.stop(short). (Today only install/upgrade/restart/uninstall exist; uninstall stops-then-removes, which is the wrong verb for "just stop.") |
| stop (no svc) | Stop the hub unit via the platform manager: systemctl stop / launchctl bootout. Children die with it. Must go through the manager, never a PID signal — see the launchd KeepAlive note below. |
| restart <svc> | POST /api/modules/:short/restart → supervisor.restart(short). 404-fallthrough: if the module is not_supervised (404, api-modules-ops.ts:733-740), fall through to start <svc>. See §6. |
| restart (no svc) | Restart the hub unit: systemctl restart / launchctl kickstart -k. NOT "restart every module" — restarting the hub re-boots all modules anyway. Single-module discipline: restarting the hub ≠ a fan-out of per-module restarts. |
| upgrade <svc> | Module: bun add -g / git-pull then supervisor.restart. Hub: special — see §5. |
| logs <svc> | New GET /api/modules/:short/logs tap (see §6 — this is a logging-architecture change, not one endpoint). |
| status | Module rows: read supervisor.list(). Hub row: query the platform manager (systemctl is-active / launchctl print), since the supervisor does not supervise the hub. See §6. |
| init | Install + start the hub unit (launchd by default on Mac, D2), wait for hub readiness, guarantee an operator token exists, then run the install wizard / vault install against the loopback hub. Replaces the ensureHubRunning detached spawn (init.ts:374). |
| expose / expose off | See §4 — folds into the managed-unit story; expose off no longer stops the hub. |
| migrate | Idempotent detached→supervised cutover + unit install. See §7. |
Two new endpoints to add (Phase 1): POST /api/modules/:short/start (a pure supervisor.start(req) with the serve-boot-derived SpawnRequest — PORT/.env/HUB_ORIGIN injection per serve-boot.ts:95-119) and POST /api/modules/:short/stop (supervisor.stop). The verdict correctly flagged that start cannot be aliased to install: handleInstall→runInstall always runs the full install sequence (bun add -g or the isLinked probe, services.json seed, installDir stamp, well-known refresh, api-modules-ops.ts:579-714) — a heavy, network-touching path for what should be a pure spawn of an already-installed module.
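A minimal sketch of how the CLI could map the start / stop / restart verbs onto either a module-ops call or a hub-unit action. `LifecycleVerb`, `VerbRoute`, `routeFor`, and `moduleOpHeaders` are hypothetical helpers; the paths follow the POST /api/modules/:short/<op> shape described above, and the start and stop routes are the proposed endpoints, not existing ones.

```typescript
// Hypothetical sketch: a verb with a service goes to module-ops, a verb without one goes to the hub unit.
type LifecycleVerb = "start" | "stop" | "restart";

type VerbRoute =
  | { target: "module"; method: "POST"; path: string }
  | { target: "hub-unit"; action: LifecycleVerb };

function routeFor(verb: LifecycleVerb, short?: string): VerbRoute {
  if (short === undefined) {
    // No service named: act on the hub unit through the platform manager.
    return { target: "hub-unit", action: verb };
  }
  return {
    target: "module",
    method: "POST",
    path: `/api/modules/${encodeURIComponent(short)}/${verb}`,
  };
}

// The operator token read from disk is presented as a Bearer credential.
function moduleOpHeaders(operatorToken: string): Record<string, string> {
  return { Authorization: `Bearer ${operatorToken}` };
}
```

Note that `start` resolves to its own route here rather than to install, matching the point that start must be a pure spawn.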
expose folds in)

Generalize connector-service.ts into a ManagedUnit descriptor { label, execStart: string[], env: Record<string,string>, logPath } and emit two units from the same machinery: the existing cloudflared connector and the new hub unit.
systemd (Linux VM). parachute-hub.service. System unit when root, user unit + linger when not — exactly the connector's branch (connector-service.ts:341, installSystemd). The new requirement versus the connector: an Environment= block (the connector's renderer emits none).
# /etc/systemd/system/parachute-hub.service (system; or ~/.config/systemd/user/… for user)
# Generated by parachute — do not edit by hand.
[Unit]
Description=Parachute hub (serve + supervisor)
After=network-online.target
Wants=network-online.target
# Crash-loop ceiling — without this a wedged hub (corrupt DB, held port)
# respawns forever, and each respawn re-boots every module (§6 / §9).
# systemd documents these two keys as [Unit] settings, so they live here,
# not in [Service].
StartLimitIntervalSec=300
StartLimitBurst=5
[Service]
Type=simple
# User=<operator> (system unit only — drop hub privileges)
Environment=PARACHUTE_HOME=/home/<operator>/.parachute
Environment=PORT=1939
# PARACHUTE_HUB_ORIGIN intentionally omitted — resolveStartupIssuer derives it
# (RENDER_EXTERNAL_URL / FLY_APP_NAME / expose-state), and start-hub self-heals
# the operator token + vault .env to the current origin. Baking a stale origin
# here would re-create the iss-mismatch class. See §9.
# PATH + BUN_INSTALL are load-bearing for supervised children that resolve a
# bun-linked binary — see §9 "linger env" risk.
Environment=PATH=/home/<operator>/.bun/bin:/usr/local/bin:/usr/bin:/bin
Environment=BUN_INSTALL=/home/<operator>/.bun
ExecStart=/abs/path/to/bun /abs/path/to/parachute-hub/src/cli.ts serve
Restart=always
RestartSec=5
[Install]
# Use default.target for a user unit. Unit files have no trailing comments,
# so this note sits on its own line.
WantedBy=multi-user.target
launchd (Mac). computer.parachute.hub.plist, RunAtLoad + KeepAlive (mirroring renderLaunchdPlist, connector-service.ts:143-173), with a ThrottleInterval to bound the KeepAlive respawn rate (launchd's default throttle is 10s; we set it explicitly so the hub-crash-loop story is the same as systemd's StartLimit). An EnvironmentVariables dict carries PARACHUTE_HOME / PORT / PATH / BUN_INSTALL. ProgramArguments is [<abs bun>, <abs cli.ts>, "serve"] — launchd does not search $PATH, so absolute paths are resolved via which at install time (the connector already does this, :181).
Both renderers gain an Environment / EnvironmentVariables block (the connector emits none today). This is load-bearing: supervised children inherit process.env (supervisor.ts:409), which under a linger-started systemd user unit is the systemd user-manager's minimal env, not the operator's login shell. If BUN_INSTALL + a PATH that includes bun's global bin aren't in the unit env, a bun-linked vault/scribe spawn can fail to resolve its linked binary on cold boot while working fine on an SSH login (the lifecycle.ts:695 resolution failure resurfacing through the unit env — "works when I'm logged in, broken on reboot," the hardest class to debug). See §9.
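To make the Environment= requirement concrete, here is a minimal sketch of a systemd renderer driven by the ManagedUnit descriptor. The descriptor fields come from the text above; `renderSystemdUnit`, the quoting rule, and the omission of `logPath` (journald is assumed to capture output) are illustrative assumptions, not the connector's actual renderer.

```typescript
// Hypothetical sketch: render a systemd unit, including the Environment= block the connector lacks.
interface ManagedUnit {
  label: string;
  execStart: string[];         // absolute paths, resolved at install time
  env: Record<string, string>; // PARACHUTE_HOME, PORT, PATH, BUN_INSTALL, ...
  logPath: string;             // unused here; assumed to matter only for launchd
}

function renderSystemdUnit(u: ManagedUnit, scope: "system" | "user"): string {
  const envLines = Object.entries(u.env)
    .sort(([a], [b]) => a.localeCompare(b)) // stable output keeps re-installs diff-free
    // Quote each assignment and backslash-escape embedded quotes and backslashes.
    .map(([k, v]) => `Environment="${k}=${v.replace(/(["\\])/g, "\\$1")}"`);
  return [
    "[Unit]",
    `Description=${u.label}`,
    "After=network-online.target",
    "Wants=network-online.target",
    "StartLimitIntervalSec=300",
    "StartLimitBurst=5",
    "",
    "[Service]",
    "Type=simple",
    ...envLines,
    // Assumes no argument contains whitespace; a real renderer would quote.
    `ExecStart=${u.execStart.join(" ")}`,
    "Restart=always",
    "RestartSec=5",
    "",
    "[Install]",
    `WantedBy=${scope === "system" ? "multi-user.target" : "default.target"}`,
    "",
  ].join("\n");
}
```

Rendering from a descriptor is what lets the same code path emit both the cloudflared connector unit (empty `env`) and the hub unit.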
PARACHUTE_HOME capture at install time

CONFIG_DIR / SERVICES_MANIFEST_PATH are resolved at import time from process.env.PARACHUTE_HOME (serve.ts:30-33). The unit installer must capture the operator's current PARACHUTE_HOME at install time and bake it into the unit env — not hard-code the default — so an operator running a non-default home gets a unit pinned to their home. The launchd EnvironmentVariables and systemd Environment= both carry it.
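A small sketch of the capture rule: prefer the installing shell's PARACHUTE_HOME and fall back to the default only when it is unset. `captureUnitEnv` is a hypothetical helper, and the `<home>/.parachute` default is an assumption inferred from the paths used elsewhere in this doc.

```typescript
// Hypothetical sketch: bake the operator's current PARACHUTE_HOME into the unit env.
interface CapturedUnitEnv {
  PARACHUTE_HOME: string;
  PORT: string;
}

function captureUnitEnv(
  env: Record<string, string | undefined>, // typically process.env at install time
  homeDir: string,
): CapturedUnitEnv {
  const raw = env["PARACHUTE_HOME"];
  // An empty string is treated as unset so the unit never pins a blank home.
  const parachuteHome = raw !== undefined && raw.length > 0 ? raw : `${homeDir}/.parachute`;
  return { PARACHUTE_HOME: parachuteHome, PORT: "1939" };
}
```

Called as `captureUnitEnv(process.env, os.homedir())` at install time, the result would feed both the systemd and launchd renderers.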
expose and the connector fold in (BLOCKER 3 — resolved)

The thin draft only discussed the cloudflared connector. The real coupling is broader and inverts under a managed hub.
(a) expose off no longer stops the hub. Today exposeOff stops the hub when the last layer is torn down (expose.ts:458-464: "Hub lives only as long as some layer is exposed … stop the hub"). Under a load-bearing platform unit with Restart=always / KeepAlive, calling stopHub would be immediately undone by the manager. So the "hub exists only while exposed" invariant inverts: the hub is now a persistent unit that runs whether or not any layer is exposed. expose off tears down the exposure (tailscale serve/funnel config or the cloudflared connector unit) and leaves the hub running. The CLI output changes accordingly (no more "hub stopped" line on expose off). This is a deliberate, owner-visible behavior change (D3).
(b) expose (the default path) is Tailscale, not cloudflared. expose tailnet / expose public (the supported path, expose.ts:367) is Tailscale serve / funnel — Tailscale's own daemon persists the serve config; there is no Parachute-installed unit for it. The cloudflared path (--cloudflare) is the one with a connector unit. So "expose" is two unrelated persistence stories:
- Tailscale: expose only needs the hub reachable on loopback (which the hub unit now guarantees). The old ensureHubRunning call in expose.ts:323 becomes "ensure the hub unit is up" (§3.2).
- cloudflared: the connector unit moves onto the ManagedUnit machinery (it already installs one via installConnectorService, expose-cloudflare.ts:457). expose off --cloudflare removes the connector unit (removeConnectorService, :471) and leaves the hub unit running.

(c) The post-expose vault restart (expose.ts:395-407, restarting hub-dependent services so they pick up the new origin) goes through the supervisor (restart <svc> → module-ops), not lifecycle.restart. The origin self-heal (selfHealVaultHubOrigin / selfHealOperatorTokenIssuer) still fires on the supervised restart.
upgrade hub self-modification (BLOCKER 2 — resolved)

upgrade.ts special-cases the hub: hubTarget() fabricates a synthetic services.json row (upgrade.ts:369-378) because the hub isn't in services.json, dispatched at :387 (single-target) and :425 (sweep, hub-first). Both branches finish by calling restartFn (default lifecycleRestart, :255), which today goes stopHub/ensureHubRunning. Under the target model the hub IS the running unit, so "restart the hub" must become a manager restart of the very process driving the command.
upgrade hub works as:
Rewrite the binary. Same as today — upgradeNpm does bun add -g @openparachute/hub@<channel> (upgrade.ts:632); upgradeLinked does git pull --ff-only + bun install --frozen-lockfile in the bun-linked checkout (:543-573). The unit's ExecStart points at <abs bun> <abs cli.ts> serve (linked) or the installed bin path (npm), so rewriting the package on disk changes what the unit runs on its next start. Idempotent skip-restart heuristics (HEAD unchanged / version unchanged) are preserved (:556-559, :639-642).
Restart the unit, not the process. Replace the hub branch's restartFn with systemctl restart parachute-hub.service / launchctl kickstart -k. The manager tears down the old hub (children die with it), then starts the new binary, which re-boots every module from services.json. This is a clean swap: the old hub process exits, the manager brings up the new one. The CLI command (parachute upgrade hub) returns once the restart is dispatched; it does not need to outlive the old hub.
The SPA-initiated case (first-class — the no-CLI audience needs it). A Render/Fly operator has no shell; upgrading the hub from the admin SPA is exactly what the unification is for (the SPA as the management surface everywhere). So this is a core requirement, not a hazard to avoid. The hub is not a supervised module — there is no /api/modules/hub/* path (CURATED_MODULES rejects hub) — so it gets a dedicated endpoint, and the real constraint is that the hub can't restart itself synchronously (the request would die with the old process before reporting success). Resolution — POST /api/hub/upgrade (host-admin gated):
- Respond 202 upgrade started immediately with the target version.
- Spawn a detached helper (detached:true + unref() — the one legitimate place a detached process survives, because it must outlive the hub): the helper does the binary rewrite (bun add -g @openparachute/hub@<channel> / linked git-pull), then triggers the platform-appropriate restart — systemctl restart / launchctl kickstart -k on a VM/Mac, or on a container (where the runtime re-runs CMD) the rewrite + a graceful hub exit so the runtime brings it back on the new binary. The self-restart hazard is handled by the helper owning the restart, not the request handler.
- The SPA polls /health + the reported version until the new hub answers (bounded; on timeout, surface the unit/platform log + "the hub may still be coming up — refresh shortly").

Container nuance to resolve in implementation: on Render/Fly an in-place upgrade only persists if the hub is installed to the persistent disk (BUN_INSTALL on the mounted volume — the same place runtime module installs land); if the hub is baked into the image, an in-place bun add -g is lost on the next container restart and the real path is a platform redeploy. The SPA must reflect whichever is true (in-place vs "redeploy from your platform dashboard"). OPEN (implementation detail, §10/D4): confirm the Render/Fly image installs the hub to the persistent disk (in-place-upgradable) vs image-pinned (redeploy-only).

Ordering in the sweep. The existing hub-first sweep order (upgrade.ts:425, "a dispatcher upgrade can't be undermined mid-sweep by a service upgrade that restarts hub") is preserved — but note that restarting the hub unit now re-boots all modules, so a hub upgrade mid-sweep already restarts every module. The sweep should upgrade the hub binary, restart the unit (which re-boots all modules onto current code), then upgrade each module package and supervisor.restart it individually. The hub-first invariant still holds.
bun-linked dev path. Aaron's parachute is bun-linked to the checkout. upgrade hub on the linked path is git pull + bun install + unit restart. The unit's ExecStart already points at src/cli.ts in the checkout, so the new code is live on the unit's next start with no re-link. This also resolves the services.json stale-version-cache footgun (hub#243): the unit restart re-reads the package.
This is a correctness regression, not optional hardening, and it must land before the cutover.
The detached spawner sets detached: true (lifecycle.ts:83) specifically to put each module in its own process group, so kill(-pid) reaps wrapped grandchildren; the header comment is explicit: "wrapped startCmds like pnpm exec tsx server.ts leave the tsx grandchild bound to the port after stop → restart hits EADDRINUSE." defaultAlive/defaultKill are group-aware (:150-180). The supervisor's defaultSpawnFn (supervisor.ts:403-415) spawns attached with no process group, and stop() calls proc.kill("SIGTERM") on the leader only (supervisor.ts:219). For any module whose startCmd is a wrapper, cutting over to the supervisor re-opens the exact EADDRINUSE-on-restart bug the detached path was built to fix (hub#88) — and because Phase 3 makes serve+supervisor the only runtime, the regression ships to every box, not just containers.
Resolution — Phase 2, gated before Phase 3: add process-group spawn to the supervisor (detached: true in defaultSpawnFn) and group-signal in stop() (kill(-pgid, ...) with the same ESRCH/bare-pid fallback as defaultAlive/defaultKill). The supervisor stays attached for stdio (it must keep piping child stdout into the hub log — pipeOutput, supervisor.ts:357): detached: true does not detach the stdio pipes when stdio is explicitly ["ignore","pipe","pipe"], so this is "own process group for signalling, pipes still wired." Add a regression test that a wrapped startCmd's grandchild is reaped on stop/restart (the round-trip: spawn wrapper → grandchild binds port → restart → fresh spawn binds the same port without EADDRINUSE).
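The group-signal half of this resolution can be sketched as below. `GroupKillFn`, `isEsrch`, and `killGroup` are hypothetical names; the ESRCH then bare-pid fallback follows the behavior attributed to defaultAlive/defaultKill above, and the kill function is injected so the logic can be tested without real processes.

```typescript
// Hypothetical sketch: signal a module's whole process group, falling back to the leader pid.
type GroupKillFn = (pid: number, signal: string) => void;

function isEsrch(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { code?: unknown }).code === "ESRCH";
}

function killGroup(pgid: number, signal: string, kill: GroupKillFn): "group" | "leader" | "gone" {
  try {
    kill(-pgid, signal); // POSIX: a negative pid targets every process in the group
    return "group";
  } catch (err) {
    if (!isEsrch(err)) throw err; // EPERM and friends are real errors, not "already gone"
  }
  try {
    kill(pgid, signal); // no such group: the child may not have been spawned as a group leader
    return "leader";
  } catch (err) {
    if (!isEsrch(err)) throw err;
    return "gone"; // neither group nor leader exists any more
  }
}
```

In Node or Bun, `process.kill` has a compatible shape and can be passed as `kill`; this only reaps grandchildren if the child was spawned with `detached: true`, so that it leads its own group.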
The supervisor map is populated only by bootSupervisedModules at serve startup and by install. So a module that crashed-and-exhausted-its-budget-then-the-hub-restarted, or whose services.json row failed resolveSpec at boot (skipped, serve-boot.ts:69-93), or was installed out-of-band, is absent from the map. supervisor.restart(short) returns undefined → handleRestart returns 404 not_supervised (api-modules-ops.ts:733-740); runUpgrade fails with "upgraded but supervisor had no live entry — try install first" (:813-820). Under Model A, restart <svc> = stop+start and always works regardless of prior state.
Resolution: the CLI client treats 404 not_supervised from restart as "fall through to start" (the new POST /api/modules/:short/start from §3.3, which calls supervisor.start with the boot-derived SpawnRequest). For upgrade, the "no live entry" failure similarly falls through to start after the package rewrite. This makes restart <svc> / upgrade <svc> total over module state, matching Model A's stop+start semantics.
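A minimal sketch of the client-side fallthrough. `ModuleOpsClient` and `restartOrStart` are hypothetical, and the client is synchronous here purely to keep the example short (a real one would be async over HTTP); the 404 plus not_supervised pairing is the wire shape cited above.

```typescript
// Hypothetical sketch: treat 404 not_supervised from restart as "fall through to start".
interface ModuleOpResult {
  status: number;
  error?: string;
}

interface ModuleOpsClient {
  restart(short: string): ModuleOpResult;
  start(short: string): ModuleOpResult;
}

function restartOrStart(
  api: ModuleOpsClient,
  short: string,
): { via: "restart" | "start"; status: number } {
  const r = api.restart(short);
  if (r.status === 404 && r.error === "not_supervised") {
    // The supervisor has no live entry: spawn fresh instead of failing.
    return { via: "start", status: api.start(short).status };
  }
  // Any other outcome, success or failure, is reported as the restart's own result.
  return { via: "restart", status: r.status };
}
```

Matching on both the status and the error code keeps an unrelated 404 (for example an unknown module name) from silently turning into a start.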
The hub unit's Restart=always / KeepAlive is the outer keeper. The supervisor's per-module budget (maxRestarts=3 in a 60s window, then crashed-and-stays-crashed, supervisor.ts:318-331) is the inner keeper. They compose badly in two ways:
A wedged hub respawns forever with no cap. Without a ceiling, a hub that crash-loops on boot (corrupt hub.db, held port) respawns every RestartSec=5 indefinitely, and each respawn re-runs bootSupervisedModules → re-spawns all modules → nested crash storms. Under Model A a single wedged module never took down its siblings. Resolution: the hub unit carries StartLimitIntervalSec=300 + StartLimitBurst=5 (systemd) and a ThrottleInterval (launchd) — §4.1. After the burst, systemd holds the unit in failed and parachute status / systemctl status shows why. A wedged hub becomes a visible failed unit, not an infinite tight loop.
A hub bounce masks a persistently-broken module. The supervisor's crash budget is transient (supervisor.ts:28-29) — it resets on every hub boot. So a module that should stay crashed gets re-booted fresh on every hub restart, masking a persistently-broken module as a flapping one. Resolution (accepted limitation + mitigation): the budget is intentionally per-hub-lifetime; the mitigation is the hub's own StartLimit (above) — a healthy hub doesn't bounce, so a healthy-hub + broken-module scenario does hit the module budget and lands crashed. Only a flapping hub resets module budgets, and the StartLimit caps that. We accept the blast-radius increase ("children die with hub") as the explicit trade for one runtime; the StartLimit is what makes it bounded rather than catastrophic. OPEN (minor): whether to persist module crash-counts across hub boots is a future refinement, not Phase-1 scope.
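The interaction between the two keepers is easier to see with the inner budget written out. This is a sketch of a sliding-window budget with the stated parameters (3 restarts per 60s), not the supervisor's actual code; `CrashBudget` and `recordCrash` are hypothetical names.

```typescript
// Hypothetical sketch: an in-memory, per-hub-lifetime crash budget for one module.
class CrashBudget {
  private readonly stamps: number[] = [];
  private readonly maxRestarts: number;
  private readonly windowMs: number;

  constructor(maxRestarts: number = 3, windowMs: number = 60000) {
    this.maxRestarts = maxRestarts;
    this.windowMs = windowMs;
  }

  // Record a crash at `nowMs`; returns true if a restart is still allowed.
  recordCrash(nowMs: number): boolean {
    // Drop restarts that have aged out of the window.
    for (;;) {
      const oldest = this.stamps[0];
      if (oldest === undefined || nowMs - oldest < this.windowMs) break;
      this.stamps.shift();
    }
    if (this.stamps.length >= this.maxRestarts) return false; // budget spent: stays crashed
    this.stamps.push(nowMs);
    return true;
  }
}
```

Because the state lives only in memory, a hub restart constructs a fresh `CrashBudget`, which is exactly the reset described above; the unit's StartLimit is what bounds how often that can happen.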
status hub row. hubRow reads processState(HUB_SVC) (pidfile) + readHubPort (status.ts:173-205) — both retired. The supervisor does not supervise the hub (supervisor.ts:25-29), so there is no supervisor entry to read for the hub. Resolution: the hub row queries the platform manager: systemctl [--user] is-active parachute-hub.service (active/failed/inactive) / launchctl print gui/<uid>/computer.parachute.hub (state + last exit). On Render/Fly the hub row reports "container runtime (managed)" — there's no on-box manager to query, and the hub answering /health is the liveness signal. The hub-row branch is platform-dispatched, the same shape as the unit installer.
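The platform-dispatched hub row might look like the sketch below. `HubRowManager`, `HubRowProbe`, and `hubRowProbe` are hypothetical; the commands and the container label are the ones named in the paragraph above.

```typescript
// Hypothetical sketch: choose how the status hub row learns the hub's state.
type HubRowManager = "systemd-system" | "systemd-user" | "launchd" | "container";

type HubRowProbe =
  | { kind: "command"; argv: string[] } // run it and parse the manager's answer
  | { kind: "label"; text: string };    // nothing to query on-box

function hubRowProbe(manager: HubRowManager, uid: number): HubRowProbe {
  switch (manager) {
    case "systemd-system":
      return { kind: "command", argv: ["systemctl", "is-active", "parachute-hub.service"] };
    case "systemd-user":
      return { kind: "command", argv: ["systemctl", "--user", "is-active", "parachute-hub.service"] };
    case "launchd":
      return { kind: "command", argv: ["launchctl", "print", `gui/${uid}/computer.parachute.hub`] };
    case "container":
      // On Render/Fly the hub answering /health is the liveness signal.
      return { kind: "label", text: "container runtime (managed)" };
  }
}
```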
proxy-state classification. classifyUpstream's Mode-2 pidfile fallback (proxy-state.ts:114-130) becomes dead for modules once everything is supervised (Mode-1 supervisor classification is authoritative). That is correct and intended. The retained signal for "is this module actually serving?" is the 30s boot-window in Mode-1 (proxy-state.ts:97-104) — which only papers over the alive-but-unbound case for 30s. See §6.5.
This is broader than "no per-service file." Three detached-model capabilities have multiple consumers that the supervisor doesn't yet replace:
Per-service logfile. Detached writes ~/.parachute/<short>/logs/<short>.log (lifecycle.ts openSync→stdio). Consumers: parachute logs <svc> tails it (lifecycle.ts:998-1051); readLogTail surfaces the boot error inline on start failure (:191-200, 706-725); the "running but no log file" diagnostic (:1000-1013). The supervisor multiplexes child stdout into hub stdout with a [short] prefix and writes no per-service file (supervisor.ts:357-401); it streams and discards (no ring buffer). A naive /api/modules/:short/logs SSE tap can only show output from connect time forward — losing the boot-time crash lines, which pumpLines itself notes are "likely the most important one — the exit cause" (supervisor.ts:394-397). And off-box, journald/launchd capture only the merged hub stream, so parachute logs vault off-box would return the whole hub log, not vault's.
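The replay requirement described above (recent output, including boot-time crash lines, must survive until someone asks for it) is the classic use for a bounded ring buffer. A minimal sketch follows, with `LineRing` as a hypothetical name and the capacity left to the caller.

```typescript
// Hypothetical sketch: a fixed-capacity ring of recent log lines for one module.
class LineRing {
  private readonly buf: string[];
  private next = 0;  // index the next push will write to
  private count = 0; // number of valid lines currently held

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.buf = new Array<string>(capacity).fill("");
  }

  push(line: string): void {
    this.buf[this.next] = line; // overwrites the oldest line once full
    this.next = (this.next + 1) % this.buf.length;
    if (this.count < this.buf.length) this.count++;
  }

  // Oldest-to-newest copy, suitable for replay before attaching a live follow stream.
  snapshot(): string[] {
    const start = (this.next - this.count + this.buf.length) % this.buf.length;
    const out: string[] = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.buf[(start + i) % this.buf.length] as string);
    }
    return out;
  }
}
```

Memory stays bounded per module regardless of how chatty a child is, which matters because the buffer lives inside the hub process.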
pumpLines, so logs <svc> and start-failure tails can replay recent output including the boot/crash lines. The new GET /api/modules/:short/logs serves the ring buffer + an optional follow stream. This is a logging-architecture change touching the supervisor (buffer), the API (endpoint), and the CLI (logs client) — scoped explicitly into Phase 2, not hand-waved as "one endpoint."Port-readiness verification. lifecycle.start polls the actual port post-spawn to catch alive-but-never-bound (hub#487, lifecycle.ts:738-781). The supervisor marks a module running the instant Bun.spawn returns a pid (supervisor.ts:290-295) — no port check. So a module that spawns-but-never-binds (the classic bun-linked notes-serve resolution failure, lifecycle.ts:695) shows running while unreachable; classifyUpstream papers over it for only 30s then flips to persistent-error with no structured cause. Since the whole point of Phase 3 is "UI module-management works everywhere," shipping a supervisor that can't distinguish alive-but-unbound from healthy degrades the surface being unified.
defaultPortListening), promoting the module to running only after the port binds (or marking a started-but-unbound substate). Intersects the still-open #188 (systematic missing-dependency UX).Structured preflight / start-error. lifecycle.start runs ensureExecutable preflight and recordStartError/clearStartError onto the services.json row, feeding the missing-dependency UX (lifecycle.ts:638-687, the MissingDependencyError wire shape). The supervisor does neither.
start records a structured start-error onto the module state (and optionally the services.json row) on spawn-preflight failure, so the SPA + status keep the friendly missing-dependency surface. Intersects #188.parachute migrate is extended into the idempotent detached→supervised cutover + unit installer. Today it sweeps the legacy ~/.parachute layout and reads pidfiles (migrate.ts).
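Returning to the per-service log gap in the supervisor: a bounded per-module buffer with replay-then-follow semantics might look like the sketch below. It is illustrative only. `ModuleLogBuffer`, `push`, `snapshot`, and `follow` are hypothetical names, not existing parachute-hub code, and the capacity is an arbitrary assumption.

```typescript
// Hypothetical sketch; none of these names exist in parachute-hub.
type LineListener = (line: string) => void;

class ModuleLogBuffer {
  private readonly capacity: number;
  private lines: string[] = [];
  private listeners: LineListener[] = [];

  constructor(capacity: number) {
    if (capacity < 1 || Math.floor(capacity) !== capacity) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  // Record one line of child output; the oldest line is dropped past the bound.
  push(line: string): void {
    this.lines.push(line);
    if (this.lines.length > this.capacity) this.lines.shift();
    // Iterate over a copy so an unsubscribe during dispatch cannot skip a sibling.
    this.listeners.slice().forEach((listener) => listener(line));
  }

  // Retained lines, oldest first (what a start-failure tail would read).
  snapshot(): string[] {
    return this.lines.slice();
  }

  // Replay retained lines, then stream new ones. Returns an unsubscribe function.
  follow(listener: LineListener): () => void {
    this.lines.forEach((line) => listener(line));
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }
}
```

Because the buffer outlives the child process, a client that connects after a crash still receives the retained boot and exit lines before any live output, which is the property a connect-time-forward tap lacks.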
The verdict caught a concrete race: the connector installer starts the unit as part of install (enable --now, connector-service.ts:399; launchd bootstrap + kickstart -k, :305/:320). If the hub-unit installer reuses that machinery verbatim, "install the unit" would start a second hub on 1939 while the detached hub is still bound → EADDRINUSE → crash-loop under Restart=always. The hub pins 1939 with no fallback (canonical-ports pattern), so this is a hard collision.
Resolution — generalize the installer with an explicit start: boolean, and order the cutover stop-before-start:
1. Detect the detached install: the hub's pidfile and liveness (kill(0)), and each module's pidfile/liveness.
2. Write the unit without starting it (the installer's { start: false } mode that does daemon-reload but not enable --now / bootstrap).
3. Stop the detached processes: stopHub (SIGTERM→SIGKILL + lsof orphan-adoption on 1939, hub-control.ts:329-338) for the hub; lifecycle.stop per module. Plus a per-module lsof sweep (see 7.2).
4. Verify that port 1939 is free.
5. Start the unit (systemctl enable --now / launchctl bootstrap). The hub comes up on a free 1939 and re-boots every module from services.json.
6. Verify: the hub answers /health and supervisor.list() shows the expected modules running.
7. Connector: the cloudflare connector unit adopts ManagedUnit naming; tailscale needs nothing.

This is the explicit stop-detached-FIRST-then-start-unit ordering. It costs a brief downtime window (between step 3 and step 5), accepted as correct over the racy "install-while-running" the reused installer would otherwise do.
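The stop-before-start ordering can be sketched as a small driver over injected operations. This is a minimal illustration, not the migrate implementation: `CutoverOps`, `runCutover`, and every member name are hypothetical, the operations are synchronous only to keep the sketch short, and the detection step is left out.

```typescript
// Hypothetical seam: each member stands in for real machinery named in the text.
interface CutoverOps {
  writeUnitWithoutStart(): void; // installer in { start: false } mode
  stopDetached(): void; // stopHub + lifecycle.stop per module + port sweep
  isPortFree(port: number): boolean;
  startUnit(): void; // systemctl enable --now / launchctl bootstrap
  isHealthy(): boolean; // /health + supervisor.list() check
}

type CutoverResult = "ok" | "port-busy" | "unhealthy";

const HUB_PORT = 1939; // the hub pins this port with no fallback

function runCutover(ops: CutoverOps): CutoverResult {
  ops.writeUnitWithoutStart();
  ops.stopDetached();
  // Never start the unit while something still holds the hub port:
  // under Restart=always that would become a crash loop.
  if (!ops.isPortFree(HUB_PORT)) return "port-busy";
  ops.startUnit();
  return ops.isHealthy() ? "ok" : "unhealthy";
}
```

Returning `"port-busy"` before `startUnit` is the point of the ordering: the unit file is already on disk, so a rerun after the stray process is gone can pick up where this one stopped.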
stopHub's lsof orphan-adoption probes only the canonical port 1939 (hub-control.ts:309). A module whose pidfile is stale-but-process-alive (the unknown/externally-managed case) won't be found by readPid and won't be stopped → it stays bound to its port → the supervised re-spawn hits EADDRINUSE and burns its 3-restart budget → lands crashed. Resolution: the cutover does an lsof sweep per services.json port (mirroring the hub orphan-adoption), adopting and killing any process bound to a module's declared port before starting the unit. Step 6's verify catches anything missed.
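As a rough sketch of the per-port sweep, the selection step can be a pure function that matches declared module ports against whatever currently listens on them. The row shape (`name`, `port`), the `PortListener` type, and `moduleSweepTargets` are assumptions for illustration; the real services.json schema and the lsof parsing are not shown here.

```typescript
// Hypothetical subset of a services.json row; the real schema may differ.
interface ServiceRow {
  name: string;
  port?: number;
}

// One listening socket as an lsof-style probe might report it (assumed shape).
interface PortListener {
  port: number;
  pid: number;
}

// Pick the processes the cutover would adopt and kill: anything bound to a
// module's declared port. The hub port is excluded because stopHub's own
// orphan-adoption already covers it.
function moduleSweepTargets(
  rows: ServiceRow[],
  listeners: PortListener[],
  hubPort: number = 1939,
): PortListener[] {
  const declared: number[] = [];
  rows.forEach((row) => {
    if (row.port !== undefined && row.port !== hubPort && declared.indexOf(row.port) === -1) {
      declared.push(row.port);
    }
  });
  return listeners.filter((l) => declared.indexOf(l.port) !== -1);
}
```

Keying the sweep on the declared port rather than the pidfile is what catches the stale-pidfile-but-alive case, since that process is invisible to readPid but still visible as a listener.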
migrate refuses to sweep while services run, via listRunningServices (migrate.ts:347-368), which checks processState(HUB_SVC) pidfile (:353). Once the hub runs under a unit with no pidfile, processState(HUB_SVC) reports the hub as NOT running → the refuse-while-running guard (migrate.ts:406-419) silently fails open → migrate could archive ~/.parachute state out from under a live unit-managed hub. Resolution: listRunningServices (and any guard reading processState(HUB_SVC)) gains a platform-manager check for the hub — systemctl is-active / launchctl print — so a unit-managed hub is correctly detected as running and the archive guard holds.
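One way to picture the repaired guard is shown below. It is a sketch under assumptions: `RunCommand`, `hubUnitActive`, and `hubIsRunning` are hypothetical names, the command runner is injected so nothing is executed here, the unit target string is supplied by the caller, and the check for a `state = running` line in `launchctl print` output is an assumption about that tool's output format.

```typescript
interface CommandResult {
  exitCode: number;
  stdout: string;
}

// Injected runner so the check can be exercised without touching the host.
type RunCommand = (cmd: string, args: string[]) => CommandResult;

function hubUnitActive(
  platform: "linux" | "darwin",
  unitTarget: string,
  run: RunCommand,
): boolean {
  if (platform === "linux") {
    // `systemctl is-active` exits 0 and prints "active" for a running unit.
    // A user-scoped unit would also need `--user` (assumption, not shown).
    const res = run("systemctl", ["is-active", unitTarget]);
    return res.exitCode === 0 && res.stdout.trim() === "active";
  }
  // `launchctl print <target>` fails when the service is not loaded.
  const printed = run("launchctl", ["print", unitTarget]);
  return printed.exitCode === 0 && /state = running/.test(printed.stdout);
}

// The guard holds if EITHER signal says the hub is up, so a unit-managed hub
// with no pidfile is still reported as running.
function hubIsRunning(pidfileAlive: boolean, unitActive: boolean): boolean {
  return pidfileAlive || unitActive;
}
```

Treating the two signals as an OR keeps the guard correct during the release where both detached and unit-managed hubs can exist in the field.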
parachute migrate --teardown (or a dedicated verb) removes the hub unit via the generalized removeConnectorService-shaped teardown (connector-service.ts:438: bootout + rm plist / disable --now + rm unit + daemon-reload), idempotent + best-effort. This is also the rollback path if the cutover misbehaves: tear down the unit, and the operator can fall back to a foreground serve.

If the hub package is removed (bun remove -g @openparachute/hub), the unit would persist pointing at a deleted ExecStart and crash-loop under Restart=always. Resolution: ship + document a teardown hook (the uninstall path / a postuninstall, or an explicit "run parachute migrate --teardown before removing the hub package" instruction), so removing the package removes the unit. Phase 5 owns this.

An operator on the detached model who simply runs bun add -g @openparachute/hub@<new> (or auto-upgrades) lands the cutover code without running migrate: new code expects a unit, no unit is installed, and the detached spawners are gone (Phase 5). After the next reboot they have a dead hub with no prompt. Resolution (migrate-on-first-start / auto-detect-and-offer): the first time post-cutover code runs a lifecycle verb and finds (a) no hub unit installed and (b) evidence of a prior detached install (pidfiles / services.json), it offers to run the cutover (parachute migrate), or in a non-interactive context prints the exact command. We keep pidfile readers for one release so the detector can see the old state; the detached spawners are removed in Phase 5. We do not silently auto-migrate (archiving is destructive-adjacent); we detect and offer.
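The detect-and-offer rule reduces to a small decision function. The sketch below is illustrative: `FirstStartFacts`, `MigrateOffer`, and `decideMigrateOffer` are hypothetical names, and how each fact is gathered (unit lookup, pidfile scan, TTY check) is assumed to happen elsewhere.

```typescript
// Facts gathered before a lifecycle verb runs (all hypothetical field names).
interface FirstStartFacts {
  hubUnitInstalled: boolean;
  pidfilesPresent: boolean;
  servicesJsonPresent: boolean;
  interactive: boolean;
}

type MigrateOffer =
  | { kind: "none" }
  | { kind: "prompt" }
  | { kind: "print-command"; command: string };

function decideMigrateOffer(facts: FirstStartFacts): MigrateOffer {
  // (b) evidence of a prior detached install: pidfiles or services.json.
  const priorDetachedInstall = facts.pidfilesPresent || facts.servicesJsonPresent;
  // (a) must also hold: no hub unit installed yet.
  if (facts.hubUnitInstalled || !priorDetachedInstall) return { kind: "none" };
  // Never auto-migrate: either ask, or print the exact command to run.
  return facts.interactive
    ? { kind: "prompt" }
    : { kind: "print-command", command: "parachute migrate" };
}
```

There is deliberately no variant that runs the migration itself, which mirrors the detect-and-offer stance above.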
The unit's ExecStart points at the bun-linked checkout's src/cli.ts serve. A git pull in the checkout + unit restart picks up new code with no re-link. This is the same path upgrade hub (linked) uses (§5).
Six independently-shippable, reviewer-gated PRs (governance rule 1 + the mandatory reviewer dispatch). Each code-touching PR bumps rc.N per governance rule 2. The parachute-patterns/migrations/2026-06-01-hub-as-supervisor.md propagation checklist lands with Phase 1 and is updated by every subsequent phase (it tracks which PR landed each propagation item — README rewrite, hub CLAUDE.md Architecture block, help text, the FIRST_PARTY_FALLBACKS note, etc.).
| Phase | Scope | Independently shippable? | Gates |
|---|---|---|---|
| 1. Module-ops client + new endpoints + migration file | Add POST /api/modules/:short/{start,stop}; add a CLI module-ops client that reads operator.token and drives the running hub (§3.1); ship the migration checklist file. No behavior cutover; additive. | Yes (additive; behind the existing serve path) | 3 |
| 2. Generalize connector-service + supervisor hardening | Factor connector-service.ts into a ManagedUnit (env block, install-without-start mode, hub naming). Supervisor: process-group spawn + kill(-pgid) (§6.1), per-module log ring buffer (§6.5), post-spawn port-readiness + structured start-error (§6.5). | Yes (supervisor changes are container-safe improvements; the generalized installer is unused until 3) | Blocks 3 |
| 3. init + start/stop/restart <svc> cutover | init installs + starts the hub unit (launchd default on Mac, D2), guarantees an operator token, runs the wizard against the loopback hub. start/stop/restart <svc> drive the supervisor with 404-fallthrough (§6.2). Status hub row reads the platform manager (§6.4). | Yes (after 2) | 4, 5 |
| 4. expose + upgrade hub cutover | expose/expose off decoupled from hub lifecycle (§4.3); connector folds into ManagedUnit. upgrade hub restarts the unit (§5). + the SPA POST /api/hub/upgrade endpoint + detached one-shot helper (§5.3) so the no-CLI (Render/Fly) audience can upgrade the hub from the admin SPA. | Yes (after 3) | 5 |
| 5. migrate cutover + retire detached spawners | Extend migrate (ordering §7.1, orphan sweep §7.2, archive-guard fix §7.3, teardown §7.4, auto-offer §7.5). Remove defaultSpawner/ensureHubRunning/defaultHubSpawner detached paths; thin process-state.ts to readers-only. | Yes (after 4) | 6 |
| 6. Docs + test sweep | Rewrite README "Service lifecycle" (§9 D2/R14), hub CLAUDE.md Architecture block, help text; run audit-canonical-refs.sh; finalize the migration checklist. | Yes (docs-only; skips rc per the doc-only exemption) | none |
Why process-group reaping is Phase 2, before the Phase 3 cutover: §6.1. Retiring the detached spawner (Phase 5) without group-spawn in the supervisor re-opens the EADDRINUSE-on-restart bug on every box. Phase 2 lands it as a hard dependency, verified by a regression test, before any cutover.
Every verdict finding (blocker / major / minor), folded in with its resolution or an explicit OPEN — owner decision.
| # | Severity | Finding | Resolution |
|---|---|---|---|
| R1 | blocker | CLI→module-ops auth claim was factually wrong (not /admin/host-admin-token) | Resolved (§3.1): CLI reads ~/.parachute/operator.token (carries parachute:host:admin), presents it as Bearer to loopback module-ops; never mints in parallel. |
| R2 | blocker | Bootstrap chicken-and-egg: minting a token needs a running hub; module-op when hub is down | Resolved (§3.2): ensure-hub-unit-first (probe → systemctl/launchctl start → wait readiness → read token). Read token after hub ready to avoid racing the start-hub iss self-heal (hub#481). |
| R3 | blocker | New auth precondition: fresh box has no operator.token → 401 on every per-module verb | Resolved (§3.1): init guarantees an operator token (mint-on-init if absent); missing-token failure is an actionable "run parachute auth rotate-operator," not a raw 401. |
| R4 | blocker | upgrade hub self-modification absent; hub isn't a supervised module (no /api/modules/hub/*) | Resolved (§5): rewrite the binary → restart the unit (systemctl restart / launchctl kickstart), not the process. SPA-initiated hub-upgrade is first-class (owner 2026-06-01, D4) via a dedicated POST /api/hub/upgrade + detached one-shot helper; container in-place-vs-redeploy is an open impl detail (§5.3). |
| R5 | blocker | expose/hub-lifecycle coupling missed; expose off stops the hub; expose is Tailscale not cloudflared | Resolved (§4.3): expose off no longer stops the hub (the manager would undo it); the invariant inverts. Tailscale = no Parachute unit; cloudflare = connector ManagedUnit. |
| R6 | blocker | start cannot be aliased to install (full network-touching install path) | Resolved (§3.3): add a real POST /api/modules/:short/start = pure supervisor.start(req) with boot-derived SpawnRequest. |
| R7 | major | Process-group reaping regression: supervisor spawns attached, kills leader only → EADDRINUSE-on-restart for wrapper startCmds | Resolved (§6.1): group-spawn + kill(-pgid) in the supervisor, landed in Phase 2 before the Phase 3 cutover, with a regression test. Not optional. |
| R8 | major | restart/upgrade of a not-supervised module → 404 not_supervised | Resolved (§6.2): CLI treats 404 as fall-through to start (and upgrade falls through after the package rewrite). Total over module state, matching Model A. |
| R9 | major | Crashed hub takes all modules down; no backoff cap; hub bounce resets module crash budget | Resolved (§6.3): StartLimitIntervalSec/StartLimitBurst (systemd) + ThrottleInterval (launchd) cap hub respawn; a wedged hub becomes a visible failed unit. Cross-boot crash-count persistence is OPEN (minor refinement). |
| R10 | major | Migration double-spawn / port-1939 race (reused installer couples install+start) | Resolved (§7.1): generalized installer gets { start: false }; cutover order = write-unit → stop-detached → verify-port-free → start-unit. |
| R11 | major | status/proxy-state for the hub itself under-specified (supervisor doesn't supervise hub) | Resolved (§6.4): hub row queries the platform manager (is-active / launchctl print); Render/Fly report "container runtime (managed)." |
| R12 | major | logs <svc> loses per-service file; supervisor streams+discards (no boot-crash replay); off-box returns merged hub log | Resolved (§6.5, Phase 2): supervisor keeps a bounded per-module ring buffer; GET /api/modules/:short/logs serves buffer + follow. Logging-architecture change, scoped into Phase 2. |
| R13 | major | Concurrency: in-process budget vs unit Restart=always compose badly | Resolved (§6.3): same StartLimit/Throttle fix; the budget is per-hub-lifetime by design, bounded by the hub's own crash ceiling. |
| R14 | major | README "Service lifecycle" rewrite is far larger than a one-line "no launchd" nod | Resolved (D2/R23 + Phase 6): README:181-211 fully rewritten: run/<svc>.pid + logs/<svc>.log state model retired, unknown=externally-managed semantics removed, "Migrating from launchd" subsection reversed, parachute start --boot roadmap line resolved (this design is it). Tracked as discrete migration-checklist lines. |
| R15 | major | Uninstall/teardown of the hub unit + migrate-archive safety guard failing open | Resolved (§7.3, §7.4): listRunningServices gains a platform-manager hub check (guard holds); explicit --teardown + package-uninstall hook removes the unit. |
| R16 | minor | Orphan adoption single-port only; stale-pidfile-but-alive module stays bound | Resolved (§7.2): lsof sweep per services.json port during cutover. |
| R17 | minor | launchd KeepAlive fights an intentional hub stop (SIGTERM resurrects) | Resolved (§3.3): hub stop/restart MUST go through launchctl bootout/kickstart, never a PID signal; the table is explicit. |
| R18 | minor | Port-readiness + structured-preflight parity loss undercuts the headline UX win | Resolved (§6.5, Phase 2): post-spawn port-readiness gate + structured start-error in the supervisor; intersects #188. |
| R19 | minor | No-init-system hosts lose background operation entirely (worse than detached) | Acknowledged caveat (§2 table, D1): init-less hosts get foreground-serve-only (no background), which is worse than detached's "survives until reboot." Explicitly documented; OPEN/D1 whether to keep a transient-unref fallback for this narrow population. |
| R20 | minor | systemd user-unit linger env: supervised children inherit minimal env, can't resolve bun-linked binary on cold boot | Resolved (§4.1): the unit Environment/EnvironmentVariables block carries PATH + BUN_INSTALL (not just PARACHUTE_HOME/PORT), so cold-boot linger spawns resolve linked modules. |
| R21 | minor | PARACHUTE_HOME import-time resolution; unit must capture the operator's current home | Resolved (§4.2): installer captures the current PARACHUTE_HOME at install time, bakes it into unit env. |
| R22 | minor | Migration checklist file not created; propagation surface uncatalogued | Resolved (phasing): ships with Phase 1 (the originating implementation PR, per workspace policy), not this design PR (noted in the PR body). |
| R23 | minor | D2 reverses a shipped, documented selling point ("no launchd") | Resolved (R14 + D2): not a copy nod: discrete migration-checklist lines rewrite README:183 AND the "Migrating from launchd" subsection (which currently tells operators to remove the mechanism we now install). |
| R24 | minor | init vault-install + CLI-wizard sub-steps not enumerated as touch points | Resolved (§3.3 init row + §7): init order = install-unit → start-unit → wait hub readiness → guarantee operator token → wizard/vault-install against loopback; hub-port readiness reuses defaultPortListening; a unit-installs-but-hub-never-binds case surfaces the unit log (no silent wizard hang). |
The owner's fork leans, written as explicit, flippable decisions, plus what the verdicts surface as needing the owner.
D1 — Full retirement of the detached model. Lean: yes, full-retire (Phase 5 deletes defaultSpawner/ensureHubRunning detached spawners). Alternative: keep a transient-unref fallback for the narrow init-less-host population (R19) so they retain "survives until reboot." Rationale for full-retire: two models is the root problem; keeping a fallback re-introduces the split. Flip cost: low — the fallback is the existing code; not deleting it in Phase 5 preserves it. Owner call: accept the init-less-host regression (R19) as the price of one runtime, or keep the narrow fallback?
D2 — Mac = launchd by default at init. Lean: yes. Alternative: opt-in (init prints the launchd install command but doesn't run it). Rationale: reboot survival is the headline win; making it opt-in on Mac leaves laptops in the same down-after-reboot state. This reverses a shipped, documented selling point — README:183 sells "no launchd, no manual bun serve, no hunting for PIDs," and README:201-207 tells operators to remove a launchd agent. The reversal is defensible (the old plist was vault-specific and manual; the new one is the hub, installed and managed by Parachute), but it needs the full README rewrite (R14/R23), not a nod. Owner call: launchd-by-default, or opt-in on Mac?
D3 — parachute start/stop/restart <svc> preserved by driving the running supervisor. Lean: yes — preserve the verbs, repoint them at module-ops. Alternative: deprecate per-module CLI verbs entirely in favor of the SPA. Rationale: the CLI verbs are muscle-memory and scriptable; preserving them (now as supervisor clients) is strictly better than the detached pidfile path. This also makes expose off stop-the-hub behavior invert (§4.3) — an owner-visible change. Owner call: confirm the verbs stay; confirm expose off leaving the hub running is acceptable.
D4 — Idempotent cutover migration; SPA hub-upgrade affordance. Decided (owner, 2026-06-01): SPA-driven upgrade hub is first-class, not CLI-only. Render/Fly operators have no shell, so upgrading the hub from the SPA is a core requirement — the unification exists to make the SPA the management surface everywhere. The self-restart hazard is handled by a dedicated POST /api/hub/upgrade → detached one-shot helper that survives the hub exit + drives the platform restart, with the SPA polling for the new version (§5.3). Idempotent migrate cutover: yes. Remaining open implementation detail: the Render/Fly in-place-vs-image-redeploy question (§5.3) — confirm the hub is installed to the persistent disk (in-place-upgradable) or surface "redeploy from your dashboard" if image-pinned.
Owner-surfaced (from verdicts):
- logs off-box semantics (R12): the ring-buffer tap works on-box; off-box (Render/Fly) the merged hub stream is what journald/the container log holds. Lean: logs <svc> on-box reads the ring buffer via the endpoint; off-box documents that the platform's log viewer shows the merged hub stream (with [short] prefixes for grep). Owner to confirm this is acceptable for the container deploys.

The target runtime (serve + Supervisor + api-modules-ops) already ships in containers. The retirement is mostly deletion + repointing, not new architecture:

- Make serve reboot-persistent by generalizing connector-service.ts to install a launchd/systemd unit running <bun> <cli> serve (Phase 2/3).
- Repoint lifecycle.ts start/stop/restart <svc> to drive the running hub's module-ops API (reading operator.token) instead of defaultSpawner/pidfiles (Phase 3).
- Replace the ensureHubRunning detached-bringup sites (init.ts:374, expose.ts:323, expose-cloudflare.ts:659, lifecycle start-hub) with "ensure the unit is installed + started" (Phase 3/4).

The single biggest correctness payoff, UI module-management working everywhere, falls out automatically: once everything runs under serve, the supervisor_unavailable 503 in hub-server.ts:1797-1808/1844-1855 is unreachable.