Supervision Subsystem
The Supervision Subsystem is a platform-level, cross-cutting subsystem that standardizes process orchestration for the SmrtHub desktop stack: startup ordering, health, restart/quarantine behavior, and operator control.
Primary implementation component: SmrtApps/CSApps/SmrtHubSupervisor/README.md
Overview & responsibilities
- Start SmrtHub components deterministically based on a declarative manifest.
- Enforce containment and clean teardown of process trees.
- Provide self-healing with backoff, storm guard, and quarantine.
- Expose operator control surfaces (local control channel) and health/status surfaces.
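
The self-healing behavior listed above (backoff, storm guard, quarantine) can be illustrated with a small restart policy. This is a minimal sketch in Python, not the Supervisor's actual implementation: the class name `RestartPolicy`, its fields, and every threshold value are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RestartPolicy:
    """Hypothetical restart policy: exponential backoff plus a storm guard."""
    base_delay_s: float = 1.0      # assumed first backoff delay
    max_delay_s: float = 30.0      # assumed backoff ceiling
    storm_window_s: float = 60.0   # assumed sliding window for counting failures
    storm_threshold: int = 5       # assumed failure count that triggers quarantine
    failures: List[float] = field(default_factory=list)
    quarantined: bool = False

    def on_failure(self, now: float) -> Optional[float]:
        """Record a failure at time `now` (in seconds).

        Returns the delay before the next restart attempt, or None when
        the component is quarantined and must not be restarted automatically.
        """
        if self.quarantined:
            return None
        # Storm guard: only failures inside the sliding window count.
        self.failures = [t for t in self.failures if now - t < self.storm_window_s]
        self.failures.append(now)
        if len(self.failures) >= self.storm_threshold:
            self.quarantined = True
            return None
        # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay_s.
        return min(self.base_delay_s * 2 ** (len(self.failures) - 1), self.max_delay_s)

    def release(self) -> None:
        """Operator action: clear the quarantine flag and the failure history."""
        self.quarantined = False
        self.failures.clear()
```

Keeping the policy free of clocks and process handles (the caller passes `now`) makes the thresholds easy to unit test without starting real processes.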
Contracts and invariants
All supervised components must follow these contracts:
- Manifest-driven identity
  - Components are addressed by stable IDs declared in the Supervisor manifest.
  - IDs must be unique and stable across releases.
- Deterministic, policy-aligned paths
  - Executable paths are tokenized and resolved into the canonical staging layout (Apps/<Configuration>/<RID>/...).
  - Components must not rely on ad-hoc relative working directories.
- Graceful shutdown
  - Components must honor a shutdown request and exit within the configured timeout.
  - Long-running operations must be interruptible.
- Readiness semantics (when enabled)
  - If a component declares a readiness probe, it must expose the probe target reliably and quickly.
  - Dependents must not start until prerequisites are ready.
- Observability is mandatory
  - Components must emit structured logs via the unified logging subsystem.
  - Start/stop, readiness, and fatal failures must be visible in logs.
- No restart storms
  - Components should fail fast when misconfigured and provide clear diagnostics.
  - Supervisor may quarantine components after repeated failures; components must tolerate being unavailable.
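
The graceful shutdown contract above can be sketched from the component's side. The example below is a Python illustration under stated assumptions: the shutdown request is modeled as a `threading.Event`, and the names `run_worker` and `shutdown` are hypothetical. How the real Supervisor delivers the request is not specified here.

```python
import threading


def run_worker(stop: threading.Event, total_steps: int = 1000, step_s: float = 0.01) -> int:
    """Hypothetical long-running job split into short, interruptible steps.

    Returns the number of steps completed before finishing or being stopped.
    """
    done = 0
    for _ in range(total_steps):
        # Event.wait doubles as an interruptible sleep: it returns True
        # as soon as the stop event is set, so the loop exits promptly.
        if stop.wait(step_s):
            break
        done += 1
    return done


def shutdown(stop: threading.Event, worker: threading.Thread, timeout_s: float) -> bool:
    """Request shutdown and report whether the worker exited within the timeout."""
    stop.set()
    worker.join(timeout_s)
    return not worker.is_alive()
```

A supervisor-side caller would treat a `False` result as a missed shutdown deadline and fall back to forced teardown.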
Integration points
- HubWindow / UI control surfaces: issue operator commands (status, restart, shutdown) through the approved control channel.
- Support and diagnostics: support bundle and evidence capture workflows rely on Supervisor state (quarantine, restart counters).
- Health monitoring: a health/status surface exposes Supervisor and component states for tools and operators.
Configuration, paths, and operational data
- Manifest and staging layout are owned by the Supervisor component and the repo build/staging scripts.
- Logging and config/state artifacts must still follow the Operational Data Policy.
Observability & diagnostics
- Supervisor emits unified logs (JSON + text) and should include:
  - manifest identity/hash
  - component IDs and transitions (starting, ready, exited, restarting, quarantined)
  - restart counters and backoff windows
  - operator commands (who/what, without secrets)
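
As an illustration of the fields listed above, one JSON log line for a state transition might be built as follows. This is a sketch only: the field names, the `transition_event` and `redact` helpers, and the list of sensitive key markers are assumptions, not the unified logging subsystem's actual schema.

```python
import json
from typing import Any, Dict

# Assumed key fragments that mark a value as sensitive (illustrative only).
SENSITIVE_MARKERS = ("SECRET", "TOKEN", "PASSWORD", "KEY")


def redact(values: Dict[str, str]) -> Dict[str, str]:
    """Replace values whose key looks sensitive with a fixed placeholder."""
    return {
        k: "***" if any(m in k.upper() for m in SENSITIVE_MARKERS) else v
        for k, v in values.items()
    }


def transition_event(manifest_hash: str, component_id: str, old: str, new: str,
                     restart_count: int, backoff_s: float,
                     env: Dict[str, str]) -> str:
    """Build one JSON log line for a component state transition."""
    event: Dict[str, Any] = {
        "event": "component.transition",   # hypothetical event name
        "manifestHash": manifest_hash,
        "componentId": component_id,
        "from": old,
        "to": new,
        "restartCount": restart_count,
        "backoffSeconds": backoff_s,
        "env": redact(env),                # never log raw environment values
    }
    return json.dumps(event, sort_keys=True)
```

Redacting by key name errs on the side of over-redaction, which is usually the safer default for logs that end up in support bundles.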
Security/privacy notes
- Control surfaces are local-only and must be ACL’d to the correct user context.
- Do not log secrets from environment variables or config payloads.
Testing & validation expectations
- Verify manifest parsing and the manifest validation rules (unique IDs, no dependency cycles, probe constraints).
- Validate dependency ordering and readiness behavior.
- Validate restart policies and quarantine thresholds.
- Validate shutdown semantics and Job Object teardown.
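
The manifest checks named above (unique IDs, no dependency cycles) and the resulting start order can be sketched with Python's standard `graphlib` module (Python 3.9+). The manifest shape used here, a list of dicts with an `id` and an optional `dependsOn` list, is an assumption for illustration and may not match the real Supervisor manifest. Probe constraints are not covered.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Tuple


def validate_manifest(components: List[dict]) -> Tuple[List[str], List[str]]:
    """Return (errors, start_order) for a hypothetical manifest.

    Each component is assumed to be a dict with an "id" and an optional
    "dependsOn" list of component IDs. start_order is empty on any error.
    """
    errors: List[str] = []
    seen = set()
    for c in components:
        if c["id"] in seen:
            errors.append(f"duplicate id: {c['id']}")
        seen.add(c["id"])

    # Map each component to its prerequisites, flagging unknown references.
    graph: Dict[str, List[str]] = {}
    for c in components:
        deps = c.get("dependsOn", [])
        for d in deps:
            if d not in seen:
                errors.append(f"{c['id']}: unknown dependency {d}")
        graph[c["id"]] = [d for d in deps if d in seen]

    order: List[str] = []
    try:
        # static_order yields prerequisites before their dependents.
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        errors.append(f"dependency cycle: {' -> '.join(exc.args[1])}")
    return errors, ([] if errors else order)
```

For example, a chain where `b` depends on `a` and `c` depends on `b` validates cleanly and yields the start order `a`, `b`, `c`, while two components that depend on each other produce a cycle error and no start order.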