# Supervision Capability (Flow)
This document describes the end-to-end technical flow for supervising the SmrtHub desktop stack: manifest evaluation, startup ordering, health, operator control, self-healing, and shutdown.
Subsystem contract reference: README.Files/Subsystems/Supervision/README.md
## Purpose and scope
- Explain how Supervisor turns a declarative manifest into running processes.
- Document the key operator/control flows and expected system behavior.
- Provide troubleshooting entry points for common failure cases.
## Entry points
- User session start (or explicit launch) starts the Supervisor.
- Operator actions via local control surfaces (status, restart, shutdown, pause/resume restarts).
- Automation/health tooling runs CLI commands (dry-run, status) to validate deployment.
## Participating Components/Subsystems
- Supervision Subsystem: README.Files/Subsystems/Supervision/README.md
- Supervisor component: SmrtApps/CSApps/SmrtHubSupervisor/README.md
- Runtime apps: HubWindow, ClipboardMonitor, TriggerManager, MouseHookPipe, PythonApp, and other staged executables.
- Logging Subsystem (observability): README.Files/Subsystems/Logging/README.md
- Configuration Subsystem (where control/config artifacts live): README.Files/Subsystems/Configuration/README.md
## Sequence / flow
- Load configuration and manifest
  - Supervisor loads its configuration overlays and reads the component manifest.
  - Manifest validation fails fast with aggregated errors.
- Compute startup plan
  - Supervisor topologically sorts the dependency graph.
  - Readiness probes gate dependent startups.
- Launch processes under containment
  - Each component process is launched and tracked.
  - Process trees are contained so shutdown can be reliable.
- Readiness and steady state
  - Supervisor waits for readiness probes (when configured).
  - Components transition to the “running/ready” state.
- Failure detection and self-healing
  - On process exit or failure, Supervisor evaluates the restart policy.
  - Applies exponential backoff with jitter.
  - Quarantines components after restart thresholds are exceeded.
  - Storm guard can pause global restarts when the system is unstable.
- Operator control
  - The operator uses the control channel to request STATUS, RESTART, SHUTDOWN, and PAUSE/RESUME.
  - Responses are structured and logged.
- Shutdown and teardown
  - Supervisor suppresses restarts and coordinates graceful shutdown.
  - Enforces timeouts and performs forced teardown only when required.
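The startup-plan step above (topological sort of the dependency graph) can be sketched as follows. This is an illustrative sketch, not the Supervisor's actual implementation; the function name and the example dependencies are assumptions.

```python
from collections import deque

def startup_order(components, deps):
    """Return a launch order where every component starts after its dependencies.

    components: iterable of component names.
    deps: mapping of component -> set of components it depends on.
    Raises ValueError on a dependency cycle (an invalid manifest).
    """
    # Kahn's algorithm: repeatedly launch components with no unmet dependencies.
    pending = {c: set(deps.get(c, ())) for c in components}
    ready = deque(c for c, d in pending.items() if not d)
    order = []
    while ready:
        c = ready.popleft()
        order.append(c)
        for other, d in pending.items():
            if c in d:
                d.remove(c)
                if not d:
                    ready.append(other)
    if len(order) != len(pending):
        raise ValueError("dependency cycle detected; manifest is invalid")
    return order

# Hypothetical dependencies: independents launch first, then their dependents.
print(startup_order(
    ["HubWindow", "MouseHookPipe", "ClipboardMonitor", "TriggerManager"],
    {"HubWindow": {"MouseHookPipe"}, "TriggerManager": {"ClipboardMonitor"}},
))
```

Because a cycle is rejected here, the "manifest invalid: fail fast" behavior falls out of the same pass that computes the plan.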
## Contracts and data shapes
- Not documented yet: canonical wire contracts for the Supervisor control channel and status responses.
- See the Supervision Subsystem contract doc for invariants and required behaviors: README.Files/Subsystems/Supervision/README.md.
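As one illustration of the fail-fast invariant, manifest validation that aggregates every error before raising might look like the sketch below. All manifest field names here (`name`, `command`, `depends_on`) are assumptions; the actual schema is defined by the Supervisor, not this document.

```python
def validate_manifest(manifest):
    """Fail fast with *aggregated* errors: collect every problem, then raise
    once, so the operator sees all manifest issues in a single report."""
    errors = []
    components = manifest.get("components", [])
    names = {c.get("name") for c in components}
    for i, c in enumerate(components):
        if not c.get("name"):
            errors.append(f"component #{i}: missing 'name'")
        if not c.get("command"):
            errors.append(f"component #{i}: missing 'command'")
        for dep in c.get("depends_on", []):
            if dep not in names:
                errors.append(f"component #{i}: unknown dependency '{dep}'")
    if errors:
        raise ValueError("manifest invalid:\n" + "\n".join(errors))
```

The point of the single raised error is operator experience: one run surfaces every problem instead of one problem per run.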
## Configuration, operational data, and paths
- Supervisor configuration overlays and operational artifacts must follow the Operational Data Policy.
- Not documented yet: canonical file names/paths for supervision-specific state beyond what is described in the Supervisor component README.
## Failure modes and expected behaviors
- Manifest invalid: fail fast, do not start any components; produce an aggregated operator-facing error.
- Dependency not ready: dependent components do not start; failure is observable in logs/STATUS.
- Crash loop: backoff + quarantine prevent endless churn; storm guard prevents system-wide restart storms.
- Shutdown timeout: forced teardown is applied to the process tree; log the decision and impacted components.
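A minimal sketch of the backoff-and-quarantine behavior described above. The base delay, cap, and restart threshold are illustrative defaults, not the Supervisor's configured values.

```python
import random

def restart_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: the ceiling doubles per attempt
    up to `cap`, and the random draw keeps many failing components from
    restarting in lockstep (avoiding a restart storm)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class RestartPolicy:
    """Quarantine a component once it exceeds max_restarts failures."""
    def __init__(self, max_restarts=5):
        self.max_restarts = max_restarts
        self.attempts = 0
        self.quarantined = False

    def on_failure(self):
        """Return the delay before the next restart, or None once quarantined."""
        self.attempts += 1
        if self.attempts > self.max_restarts:
            self.quarantined = True
            return None
        return restart_delay(self.attempts)
```

A storm guard would sit one level above this: when too many components are in backoff at once, it pauses all restarts globally rather than per component.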
## Observability & diagnostics
- Primary evidence is per-component unified logs and Supervisor logs.
- STATUS output should allow quick triage: which component is quarantined, restart counters, last exit codes.
- Support bundles should include Supervisor state plus component logs for the relevant time window.
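Because the canonical STATUS wire contract is not yet documented, the shape below is purely hypothetical; it only illustrates the triage fields named above (state, restart counters, last exit codes) and should not be read as the real payload.

```python
# Hypothetical STATUS payload; every field name here is an assumption.
status = {
    "components": [
        {
            "name": "ClipboardMonitor",
            "state": "quarantined",   # e.g. starting/running/quarantined
            "restart_count": 6,
            "last_exit_code": -1,
        },
    ],
}

# Triage: surface quarantined components with their restart counters.
quarantined = [
    (c["name"], c["restart_count"])
    for c in status["components"]
    if c["state"] == "quarantined"
]
print(quarantined)  # [('ClipboardMonitor', 6)]
```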
## Testing & validation expectations
- Validate `--dry-run` against staged apps.
- Validate startup ordering and readiness gating.
- Validate restart/quarantine logic with a controlled crash harness.
- Validate control channel authorization/ACL behavior.
- Validate clean shutdown of process trees.
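The graceful-then-forced shutdown expectation can be exercised with a small harness like this sketch. It handles a single child process only; real containment of full process trees (e.g. Windows job objects or POSIX process groups) is out of scope here, and the grace period is an assumed value.

```python
import subprocess
import sys

def shutdown(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Graceful shutdown with a timeout, escalating to forced teardown.

    Returns "graceful" or "forced" so the caller can log the decision.
    """
    proc.terminate()                      # polite request first
    try:
        proc.wait(timeout=grace_seconds)  # give the process time to exit
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()                       # forced teardown only when required
        proc.wait()
        return "forced"

# Harness: a child that exits promptly when asked to terminate.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print(shutdown(child))
```

A crash harness for the restart/quarantine tests can reuse the same pattern, spawning a child that exits nonzero on a schedule and asserting on the supervisor's observed restart counters.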