Supervision Capability (Flow)

This document describes the end-to-end technical flow for supervising the SmrtHub desktop stack: manifest evaluation, startup ordering, health, operator control, self-healing, and shutdown.

Subsystem contract reference: README.Files/Subsystems/Supervision/README.md

Purpose and scope

  • Explain how Supervisor turns a declarative manifest into running processes.
  • Document the key operator/control flows and expected system behavior.
  • Provide troubleshooting entry points for common failure cases.

Entry points

  • User session start (or explicit launch) starts the Supervisor.
  • Operator actions via local control surfaces (status, restart, shutdown, pause/resume restarts).
  • Automation/health tooling runs CLI commands (dry-run, status) to validate deployment.

Participating Components/Subsystems

  • Supervision Subsystem: README.Files/Subsystems/Supervision/README.md
  • Supervisor component: SmrtApps/CSApps/SmrtHubSupervisor/README.md
  • Runtime apps: HubWindow, ClipboardMonitor, TriggerManager, MouseHookPipe, PythonApp, and other staged executables.
  • Logging Subsystem (observability): README.Files/Subsystems/Logging/README.md
  • Configuration Subsystem (where control/config artifacts live): README.Files/Subsystems/Configuration/README.md

Sequence / flow

  1. Load configuration and manifest
       • Supervisor loads its configuration overlays and reads the component manifest.
       • Manifest validation fails fast with aggregated errors.

  2. Compute startup plan
       • Supervisor topologically sorts the dependency graph.
       • Readiness probes gate dependent startups.

  3. Launch processes under containment
       • Each component process is launched and tracked.
       • Process trees are contained so shutdown can be reliable.

  4. Readiness and steady state
       • Supervisor waits for readiness probes (when configured).
       • Components transition to “running/ready” state.

  5. Failure detection and self-healing
       • On process exit/failure, Supervisor evaluates restart policy.
       • Applies exponential backoff with jitter.
       • Quarantines components after thresholds are exceeded.
       • Storm guard can pause global restarts when the system is unstable.

  6. Operator control
       • Operator uses the control channel to request STATUS, RESTART, SHUTDOWN, PAUSE/RESUME.
       • Responses are structured and logged.

  7. Shutdown and teardown
       • Supervisor suppresses restarts and coordinates graceful shutdown.
       • Enforces timeouts and performs forced teardown only when required.
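
The manifest-validation and startup-plan steps above can be illustrated with a minimal Python sketch. It is not the Supervisor's implementation. The manifest shape (a mapping from component name to a depends_on list) and the function names validate and startup_plan are assumptions made for illustration, since the real manifest schema is not described in this document.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical manifest shape: component name -> settings.
# The real SmrtHub manifest schema is not described in this document.
MANIFEST = {
    "HubWindow": {"depends_on": []},
    "ClipboardMonitor": {"depends_on": ["HubWindow"]},
    "TriggerManager": {"depends_on": ["HubWindow", "ClipboardMonitor"]},
}


def validate(manifest):
    """Collect every problem instead of stopping at the first one."""
    errors = []
    for name, spec in manifest.items():
        for dep in spec.get("depends_on", []):
            if dep == name:
                errors.append(f"{name}: depends on itself")
            elif dep not in manifest:
                errors.append(f"{name}: unknown dependency {dep!r}")
    return errors


def startup_plan(manifest):
    """Return component names ordered so that dependencies start first."""
    errors = validate(manifest)
    if errors:
        # Fail fast: report all errors at once and start nothing.
        raise ValueError("invalid manifest:\n" + "\n".join(errors))
    # TopologicalSorter expects node -> set of predecessors.
    graph = {name: set(spec.get("depends_on", [])) for name, spec in manifest.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the cycle.
        raise ValueError(f"invalid manifest: dependency cycle {exc.args[1]}") from exc


if __name__ == "__main__":
    print(startup_plan(MANIFEST))
```

Collecting errors into a list before raising is what makes the report aggregated: an operator sees every broken dependency in one pass rather than fixing them one launch at a time. Readiness gating is not modeled here; the sketch only fixes the order in which launches would be attempted.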

Contracts and data shapes

  • Not documented yet: canonical wire contracts for the Supervisor control channel and status responses.
  • See the Supervision Subsystem contract doc for invariants and required behaviors: README.Files/Subsystems/Supervision/README.md.

Configuration, operational data, and paths

  • Supervisor configuration overlays and operational artifacts must follow the Operational Data Policy.
  • Not documented yet: canonical file names/paths for supervision-specific state beyond what is described in the Supervisor component README.

Failure modes and expected behaviors

  • Manifest invalid: Supervisor fails fast, starts no components, and produces an aggregated operator-facing error.
  • Dependency not ready: dependent components do not start; the failure is observable in logs and STATUS.
  • Crash loop: backoff and quarantine prevent endless churn; storm guard prevents system-wide restart storms.
  • Shutdown timeout: forced teardown is applied to the process tree, and the decision and impacted components are logged.
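
The crash-loop behavior above can be sketched as a small restart-decision function. This is an illustrative Python sketch under stated assumptions, not the Supervisor's actual policy: the names RestartPolicy, ComponentState, and on_exit, the default thresholds, and the choice of the "full jitter" variant (a uniform delay between zero and the exponential ceiling) are all hypothetical. Storm guard is not modeled.

```python
import random
from dataclasses import dataclass, field


@dataclass
class RestartPolicy:
    # All values are illustrative defaults, not SmrtHub settings.
    base_delay: float = 1.0        # ceiling for the first restart, in seconds
    max_delay: float = 60.0        # cap on the exponential ceiling
    max_restarts: int = 5          # failures tolerated inside the window
    window_seconds: float = 300.0  # sliding window for counting failures


@dataclass
class ComponentState:
    failures: list = field(default_factory=list)  # timestamps of recent exits
    quarantined: bool = False


def on_exit(state, policy, now, rng=random):
    """Return the delay before the next restart, or None if quarantined."""
    # Forget failures that fell out of the sliding window, then record this one.
    state.failures = [t for t in state.failures if now - t < policy.window_seconds]
    state.failures.append(now)
    if len(state.failures) > policy.max_restarts:
        state.quarantined = True
        return None
    # Exponential ceiling: base, 2*base, 4*base, ... capped at max_delay.
    ceiling = min(policy.max_delay, policy.base_delay * 2 ** (len(state.failures) - 1))
    # Full jitter: pick uniformly below the ceiling so restarts do not align.
    return rng.uniform(0, ceiling)
```

Passing the clock value and the random source as arguments keeps the function deterministic under test, which suits the controlled crash harness mentioned in the testing expectations.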

Observability & diagnostics

  • Primary evidence is per-component unified logs and Supervisor logs.
  • STATUS output should allow quick triage by showing which components are quarantined, their restart counters, and their last exit codes.
  • Support bundles should include Supervisor state plus component logs for the relevant time window.
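
As a rough illustration of triage from STATUS output, the Python sketch below filters a status payload down to the components that need attention. The payload shape and every field name in it (components, name, state, restarts, last_exit_code) are hypothetical: the control-channel wire contract is not documented, so this shows the idea only and should not be read as the real response format.

```python
# Hypothetical STATUS payload; the real control-channel contract is not documented.
STATUS = {
    "components": [
        {"name": "HubWindow", "state": "running", "restarts": 0, "last_exit_code": None},
        {"name": "ClipboardMonitor", "state": "quarantined", "restarts": 6, "last_exit_code": 1},
        {"name": "TriggerManager", "state": "running", "restarts": 2, "last_exit_code": 0},
    ]
}


def triage(status):
    """Return one summary line per component needing attention, worst first."""
    flagged = [
        c for c in status["components"]
        if c["state"] == "quarantined" or c["restarts"] > 0
    ]
    # Quarantined components sort first, then by descending restart count.
    flagged.sort(key=lambda c: (c["state"] != "quarantined", -c["restarts"]))
    return [
        f'{c["name"]}: {c["state"]}, restarts={c["restarts"]}, last_exit={c["last_exit_code"]}'
        for c in flagged
    ]


if __name__ == "__main__":
    for line in triage(STATUS):
        print(line)
```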

Testing & validation expectations

  • Validate --dry-run against staged apps.
  • Validate startup ordering and readiness gating.
  • Validate restart/quarantine logic with a controlled crash harness.
  • Validate control channel authorization/ACL behavior.
  • Validate clean shutdown of process trees.
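
The shutdown check in the list above can be exercised with a small escalation helper: request termination, wait for a grace period, and force the process only if the wait times out. The Python sketch below is a stand-in under stated assumptions, not the Supervisor's teardown code. The name stop_process and the grace period are hypothetical, and the sketch handles a single child rather than a contained process tree. On Windows, Popen.terminate() already ends the process forcibly, so a real graceful request there would need an application-level shutdown signal.

```python
import subprocess
import sys


def stop_process(proc, grace_seconds):
    """Ask a child to exit, then force it only if the grace period expires.

    Returns "graceful" or "forced" so the caller can log the decision.
    """
    proc.terminate()  # SIGTERM on POSIX; TerminateProcess on Windows
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()   # forced teardown after the timeout
        proc.wait()   # reap the process so no zombie is left behind
        return "forced"


if __name__ == "__main__":
    # Stand-in child process that would otherwise sleep for a minute.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    outcome = stop_process(child, grace_seconds=5.0)
    print(outcome, child.returncode)
```

Returning the outcome rather than logging inside the helper lets a test assert on which path was taken, and lets the caller record the decision together with the impacted component.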