
Component: SmrtHubSupervisor

Canonical source: SmrtApps/CSApps/SmrtHubSupervisor/README.md (mirrored below)


SmrtHub Supervisor

Enterprise-grade process supervisor for SmrtHub desktop components. Manages startup ordering, restarts with backoff, quarantine, storm guard, health reporting, structured logging, and secure manifest-driven configuration.

  • Supervision Subsystem (contracts and invariants): README.Files/Subsystems/Supervision/README.md
  • Supervision Capability (end-to-end flow): README.Files/Capabilities/Supervision/README.md
  • Retention Subsystem (contracts and invariants): README.Files/Subsystems/Retention/README.md
  • Retention Capability (end-to-end flow): README.Files/Capabilities/Retention/README.md
  • Retention & Legal Hold Plan (Phase 3): README.Files/System/Plans/SmrtHub-Retention-and-LegalHold.README.md

Dependencies / integrations

  • Integrates with staged executables under Apps/<Configuration>/<RuntimeIdentifier>/ (launch + supervision).
  • Uses Windows process containment primitives (for example Job Objects) to manage process trees.
  • Not documented yet: a complete list of all external integrations (IPC transports, health endpoint security, and evidence exporters).

Support bundle

  • Supervisor-initiated diagnostics and compliance exports should be packaged via Smrt.SupportBundle.
  • Not documented yet: the exact bundle presets and which Supervisor state artifacts must always be included.

Highlights

  • Manifest-driven orchestration with per-component policies and dependencies
  • Windows Job Objects containment for clean teardown of child processes
  • Exponential backoff with jitter; storm guard and quarantine (restart storm protection)
  • Health endpoint (Phase 1: unauthenticated, localhost only; Phase 2: authentication + TLS)
  • Signed Storage Guard telemetry ingestion with mutual TLS + shared-secret auth (fetches from the dedicated Storage Guard service host)
  • Background Storage Guard telemetry worker surfaces ACL/quota forecasting data to logs + health endpoint consumers (emits storage_guard_quota + acl_drift_detected events when detectors fire)
  • Supervisor-hosted automation hooks capture retention evidence automatically whenever quota or ACL drift detectors trip, so auditors have fresh exports without manual steps
  • One-click compliance evidence export that hashes/signs the resulting bundle for auditors
  • Metrics collector publishes live Supervisor + component restart metrics for dashboards or health probes
  • Structured logging (console dev, rolling file, EventLog in production when permitted)
  • System event monitor hosted service logs Windows power/session transitions so operators can correlate system state with component restarts
  • Tokenized paths with automatic discovery / fallback search
  • Central “Apps” staging layout for deterministic executable paths
  • Operator control channel (named pipe) with JSON commands (pause/resume restarts, status, restart component, shutdown)
  • Hardened manifest validation (aggregated errors + dependency cycle detection)
  • Planned security: Authenticode + hash catalog validation (Phase 2) & SBOM generation (CI)

Project layout

SmrtApps/CSApps/SmrtHubSupervisor/
  ├── CLI/                     # Non-run commands (status, dry-run, runbook)
  ├── Config/                  # Typed configuration (SupervisorConfig, manifest models)
  ├── Core/                    # Process supervision engine, JobObject, probes
  ├── Diagnostics/             # Diagnostics bundle stubs + system event monitor
  ├── Health/                  # Health monitor and endpoint
  ├── Security/                # Signature/hash validation stubs
  ├── ComponentManifest.json   # Component list and policies
  ├── ComponentManifest.schema.json
  ├── appsettings.json         # Defaults
  ├── appsettings.Development.json
  ├── appsettings.Production.json
  └── SmrtHubSupervisor.csproj

Build and stage apps

The repo uses a central staging layout so the Supervisor can launch apps from a single, predictable location.

  • Staging layout: Apps/<Configuration>/<RuntimeIdentifier>/<AppName>/
  • Example: Apps/Debug/win-x64/TriggerManager/TriggerManager.exe

Build + stage CSApps only:

Tools/Clean-Build/BuildApps.ps1 -Scope CSApps -Configuration Debug -RuntimeIdentifier win-x64

Build + stage platform services (src) too:

Tools/Clean-Build/BuildApps.ps1 -Scope SRC -Configuration Debug -RuntimeIdentifier win-x64

Full solution build + stage:

Tools/Clean-Build/BuildApps.ps1 -Scope "Full Solution" -Configuration Debug -RuntimeIdentifier win-x64

Clean staging (optional):

# Remove staged apps safely (skips running apps)
Tools/Clean-Build/CleanApps.ps1 -CleanAppsStaging -StagingConfiguration Debug -StagingRuntimeIdentifier win-x64

Visual Studio Build vs Central Staging

  • Visual Studio Build/Rebuild compiles to each project’s local bin folder only; it does NOT copy to the central Apps/<Configuration>/<RID> staging.
  • The Supervisor resolves ${AppsRoot} to the central staging tree at runtime. If you run Supervisor from its bin folder without staging, it will start components from whatever was last staged (potentially stale).
  • To ensure all components are fresh, run the staging task/script above after a Visual Studio build.

Run

From the Supervisor output folder (or from staging):

# Validate everything without launching processes
SmrtHubSupervisor.exe --dry-run

# Print component manifest + config overview
SmrtHubSupervisor.exe --status

# Normal run (starts all components)
SmrtHubSupervisor.exe

Notes:

  • DOTNET_ENVIRONMENT controls config overlays (e.g., Development vs Production).
  • SMRTHUB_ prefix environment variables can override settings (see appsettings.json).
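As a rough illustration of the override note above, here is a minimal Python sketch of how prefixed variables could map onto configuration keys. It assumes the Supervisor follows the standard .NET environment-variable provider convention (prefix stripped, double underscore treated as the section separator); this README does not confirm that, and the variable name in the example is hypothetical.

```python
def env_overrides(environ, prefix="SMRTHUB_"):
    """Map prefixed environment variables to colon-delimited config keys.

    Assumption: .NET-style convention, where the prefix is removed and
    "__" separates configuration sections.
    """
    overrides = {}
    for name, value in environ.items():
        if name.upper().startswith(prefix):
            key = name[len(prefix):].replace("__", ":")
            overrides[key] = value
    return overrides


# Hypothetical variable name, shown only to illustrate the mapping.
example = {
    "SMRTHUB_Supervisor__HealthEndpoint__Url": "http://localhost:5051",
    "PATH": "/usr/bin",
}
print(env_overrides(example))
```

Check appsettings.json for the keys that actually exist before relying on any particular variable name.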

CLI helpers

All CLI verbs run from the Supervisor output folder (or staged Apps/.../SmrtHubSupervisor/). Commands prefixed with -- exit after completion without starting supervision:

Command | Status
--- | ---
--status | Loads config + manifest, prints component summary
--dry-run | Validates manifest, resolves executable paths
--print-runbook | Emits troubleshooting guide
--restart <id> | (Stub) requires live supervisor IPC; prints placeholder today
--stop-all | (Stub) placeholder until control-channel binding lands
--dump-diagnostics | (Stub) reserved for future diagnostics bundle
--retention-status | Shows currently persisted retention policies, active holds, and last update metadata
--retention-apply --configuration <path> [--updated-by <id>] | Validates and atomically writes a full retention configuration JSON file
--legal-hold-add --name <str> --scope <str> [options] | Creates a legal hold entry (reason/tickets/expiry optional) and persists it via Smrt.Retention
--legal-hold-clear --id <guid> | Clears a previously created legal hold (blocked for system-generated holds)
--retention-export [--output <dir>] | Copies the current policies, holds, and verification artifacts into a timestamped export folder
--compliance-report [--output <dir>] [--relative-window-hours <int>] | Generates a signed compliance bundle + .sha256 hash + JSON summary
--retention-verify [--evidence <file>] [--signature <file>] [--trust-root <file> ...] [--json] | Validates the retention verification payload + signature against trusted secrets (exit code 0 = valid)
--storage-guard-verify [--snapshot <file>] [--signature <file>] [--trust-root <file> ...] [--json] | Validates the latest Storage Guard snapshot/signature pair against configured trust roots (exit code 0 = valid)

Retention verbs honor Supervisor:Retention settings (paths + admin gating). Production defaults require elevation; Development defaults disable the admin check so automated tests can run.

Compliance reports default to %ProgramData%/SmrtHub/Compliance/Reports (or %LocalAppData% in Development) and reuse the Support Bundle compliance preset so auditors get immutable evidence without touching ProgramData manually.
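An auditor receiving a compliance bundle may want to re-check it against its .sha256 companion file. The Python sketch below shows one way to do that. It assumes the .sha256 file starts with the hex digest (optionally followed by a file name); the actual file layout is not documented here. It covers only the hash, not the bundle signature.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file so large bundles are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bundle_matches_hash(bundle_path, hash_path):
    """Compare a bundle with the digest recorded in its .sha256 file."""
    # Assumption: the first whitespace-delimited token is the hex digest.
    tokens = Path(hash_path).read_text(encoding="utf-8").split()
    if not tokens:
        return False
    return sha256_of(bundle_path) == tokens[0].lower()
```

Call bundle_matches_hash with the bundle path and the matching .sha256 path; a False result means the bundle changed after the report was generated or the hash file is not in the assumed layout.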

Use SmrtHubSupervisor.exe with no arguments for the long-running host after staging the apps.

Component manifest

File: ComponentManifest.json (validated by ComponentManifest.schema.json).

Key fields per component:

  • id, displayName, type (exe)
  • path: supports tokens and resolves to an absolute executable path
  • args: optional argument list
  • env: optional environment variables
  • dependencies: optional array of component ids
  • restartPolicy: Never | Always | OnFailure
  • maxRestarts: number of restart attempts before quarantine
  • backoffSeconds: sequence used with jitter for restarts
  • readinessProbe: type = none | http | tcp; per-type settings
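To make the interplay of restartPolicy, maxRestarts, and backoffSeconds concrete, here is a minimal Python sketch of a restart decision. It is not the Supervisor's C# implementation. The jitter range (plus or minus 20%), the reuse of the last backoff entry once the sequence is exhausted, and the treatment of exit code 0 as success are all assumptions.

```python
import random


def restart_delay(attempt, backoff_seconds, jitter_ratio=0.2, rng=random):
    """Delay in seconds before restart number `attempt` (1-based)."""
    # Assumption: attempts beyond the sequence reuse its last entry.
    base = backoff_seconds[min(attempt, len(backoff_seconds)) - 1]
    # Assumption: jitter spreads the delay by +/- jitter_ratio of the base.
    return base * (1 + rng.uniform(-jitter_ratio, jitter_ratio))


def next_action(restart_policy, exit_code, restarts_so_far, max_restarts):
    """Return "stop", "restart", or "quarantine" for an exited component."""
    if restart_policy == "Never":
        return "stop"
    if restart_policy == "OnFailure" and exit_code == 0:
        return "stop"  # assumption: exit code 0 counts as a clean exit
    if restarts_so_far >= max_restarts:
        return "quarantine"
    return "restart"
```

With backoffSeconds [2,4,8,16,32] and maxRestarts 5, a component that keeps failing would wait roughly 2, 4, 8, 16, and 32 seconds between attempts and then be quarantined.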

Path tokens

Supported tokens are expanded at runtime and by CLI:

  • ${RepoRoot} -> repository root (auto-detected)
  • ${CSAppsRoot} -> ${RepoRoot}/SmrtApps/CSApps
  • ${SrcRoot} -> ${RepoRoot}/SmrtApps/src
  • ${Configuration} -> inferred from Supervisor base dir (e.g., Debug)
  • ${RuntimeIdentifier} -> inferred RID (e.g., win-x64)
  • ${TargetFramework} -> inferred TFM (e.g., net8.0-windows)
  • ${AppsRoot} -> ${RepoRoot}/Apps/${Configuration}/${RuntimeIdentifier}
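Because some tokens are defined in terms of others (for example ${AppsRoot}), expansion has to repeat until nothing changes. The Python sketch below illustrates that idea. The repository root value is hypothetical, and leaving unknown tokens untouched is an assumption, not documented Supervisor behavior.

```python
import re

TOKEN = re.compile(r"\$\{(\w+)\}")


def expand_tokens(text, values, max_depth=8):
    """Expand ${Name} tokens, including tokens that refer to other tokens."""
    for _ in range(max_depth):
        # Unknown tokens are left as-is (assumption).
        expanded = TOKEN.sub(lambda m: values.get(m.group(1), m.group(0)), text)
        if expanded == text:
            return expanded
        text = expanded
    raise ValueError(f"token expansion did not settle: {text}")


values = {
    "RepoRoot": "C:/src/SmrtHub",  # hypothetical checkout location
    "Configuration": "Debug",
    "RuntimeIdentifier": "win-x64",
    "AppsRoot": "${RepoRoot}/Apps/${Configuration}/${RuntimeIdentifier}",
}
print(expand_tokens("${AppsRoot}/TriggerManager/TriggerManager.exe", values))
# prints C:/src/SmrtHub/Apps/Debug/win-x64/TriggerManager/TriggerManager.exe
```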

Recommended pattern (uses staging):

{
  "id": "TriggerManager",
  "type": "exe",
  "path": "${AppsRoot}/TriggerManager/TriggerManager.exe",
  "restartPolicy": "OnFailure",
  "maxRestarts": 5,
  "backoffSeconds": [2,4,8,16,32],
  "dependencies": ["MouseHookPipe"],
  "readinessProbe": { "type": "none" }
}
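The dependencies field in the entry above implies that MouseHookPipe should be started before TriggerManager. One common way to derive such an order is a topological sort. The Python sketch below uses the standard library for that. Whether the Supervisor orders startup this way internally is an assumption.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def startup_order(components):
    """Return component ids so that every dependency precedes its dependents."""
    graph = {c["id"]: set(c.get("dependencies", [])) for c in components}
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


components = [
    {"id": "TriggerManager", "dependencies": ["MouseHookPipe"]},
    {"id": "MouseHookPipe"},
]
print(startup_order(components))  # prints ['MouseHookPipe', 'TriggerManager']
```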

Resolution logic:

  1. Expand tokens; if the result is an absolute path that exists, use it.
  2. If the path is relative, resolve it relative to the Supervisor base directory.
  3. If the path contains a RID folder, also try the same path without it.
  4. Fallback search: scan the nearest bin/** for <Name>.exe and take the most recent match.
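The resolution steps can be sketched in Python as follows, operating on a path whose tokens are already expanded. This is an illustration, not the Supervisor's code. The search_root parameter stands in for "nearest bin/**", whose exact meaning is not documented here, and "most recent" is interpreted as latest modification time.

```python
from pathlib import Path


def resolve_executable(expanded, base_dir, rid, search_root=None):
    """Resolve an already token-expanded path to an existing executable."""
    candidate = Path(expanded)
    if not candidate.is_absolute():
        # Step 2: relative paths are anchored at the Supervisor base directory.
        candidate = Path(base_dir) / candidate
    candidates = [candidate]
    if rid in candidate.parts:
        # Step 3: also try the same path with the RID folder removed.
        candidates.append(Path(*[p for p in candidate.parts if p != rid]))
    for path in candidates:
        if path.is_file():
            return path
    if search_root is not None:
        # Step 4: fallback search under bin/, newest match first (assumption).
        matches = sorted(
            Path(search_root).glob(f"bin/**/{candidate.name}"),
            key=lambda p: p.stat().st_mtime,
            reverse=True,
        )
        if matches:
            return matches[0]
    return None  # corresponds to NOT FOUND in --dry-run output
```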

Validation hardening

On startup the manifest is validated with comprehensive checks and aggregated error reporting:

  • Version format (^\d+\.\d+$)
  • Component IDs: regex, uniqueness
  • Display names and paths: required, path sanity (allows ${Tokens})
  • Args/env: no nulls, no empty env keys
  • Restart policy vs. maxRestarts/backoff coherence
  • Dependencies: existence, no self/duplicates, cycle detection (DFS)
  • Readiness probes: type-specific fields (http/tcp/none) and min constraints

If any issues are found, Supervisor fails fast with a single message listing all problems.
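The aggregated-error approach and DFS cycle detection described above can be illustrated with a short Python sketch. It covers only a few of the listed checks. The top-level keys version and components, and the component ID regex, are assumptions, since the manifest envelope and ID pattern are not shown in this README.

```python
import re

ID_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_.-]*$")  # assumed pattern


def validate_manifest(manifest):
    """Collect every problem instead of stopping at the first one."""
    errors = []
    if not re.fullmatch(r"\d+\.\d+", str(manifest.get("version", ""))):
        errors.append("version must look like <major>.<minor>")

    components = manifest.get("components", [])
    ids = [c.get("id", "") for c in components]
    for cid in sorted(set(ids)):
        if not ID_PATTERN.match(cid):
            errors.append(f"invalid component id {cid!r}")
        if ids.count(cid) > 1:
            errors.append(f"duplicate component id {cid!r}")

    graph = {}
    for c in components:
        cid, deps = c.get("id", ""), c.get("dependencies", [])
        if cid in deps:
            errors.append(f"{cid}: depends on itself")
        if len(deps) != len(set(deps)):
            errors.append(f"{cid}: duplicate dependencies")
        for dep in deps:
            if dep not in ids:
                errors.append(f"{cid}: unknown dependency {dep!r}")
        # Only well-formed edges take part in cycle detection.
        graph[cid] = [d for d in dict.fromkeys(deps) if d in ids and d != cid]

    state = {}  # node -> "visiting" while on the DFS stack, then "done"

    def visit(node, trail):
        state[node] = "visiting"
        for dep in graph[node]:
            if state.get(dep) == "visiting":
                cycle = trail[trail.index(dep):] + [dep]
                errors.append("dependency cycle: " + " -> ".join(cycle))
            elif dep not in state:
                visit(dep, trail + [dep])
        state[node] = "done"

    for node in graph:
        if node not in state:
            visit(node, [node])
    return errors
```

A caller would fail fast when the returned list is non-empty and print all entries in a single message.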

Configuration (appsettings)

Section: Supervisor

  • HealthEndpoint
    • Enabled: true|false
    • Url: e.g., http://localhost:5050
    • RequireAuth: true|false
    • AllowedGroups: ["SmrtHub Supervisors","Administrators"]
  • Security
    • ValidateSignatures: enable Authenticode validation (planned)
    • ValidateHashCatalog: verify hash catalog (planned)
    • TrustedPublisher: cert subject for publisher validation
  • Supervision
    • GlobalRestartLimit: max restarts in window (storm guard)
    • GlobalRestartWindowSeconds: window seconds for the limit
    • StormGuardCooldownSeconds: cooldown after trip
    • ShutdownTimeoutSeconds: graceful shutdown budget
    • StableUptimeSeconds: reset backoff after a stable run
  • StorageGuardClient
    • Enabled: turn on the HTTPS client that talks to the new Storage Guard service host (defaults to false until certificates/secrets are provisioned)
    • BaseUrl: HTTPS listener for the service host (e.g., https://localhost:5065)
    • SnapshotEndpoint / SignatureEndpoint: relative APIs exposed by the service host (default /v1/storage-guard/*)
    • SharedSecretHeader: header name carrying the shared secret loaded from StorageGuardSecretStore
    • TimeoutSeconds: HTTP timeout budget per request
    • ClientCertificate: certificate store location/name/thumbprint presented for mutual TLS
    • AllowInvalidServerCertificate: dev-only bypass for self-signed server certs
    • AllowDevelopmentCertificatelessAuth: when true (and DOTNET_ENVIRONMENT=Development) Supervisor skips mutual TLS certs and relies solely on the shared secret. Pair this with the service host’s AllowDevelopmentClientBypass for local testing.
  • StorageGuardTelemetry
    • Enabled: toggles the background polling worker (still requires StorageGuardClient.Enabled)
    • RefreshIntervalSeconds: cadence for refreshing telemetry; a minimum 30-second floor is enforced to prevent floods
    • EmitStructuredLogs: when true, Storage Guard detectors publish structured events (storage_guard_quota, acl_drift_detected, plus the baseline storage_guard_quota_assessment) so SIEM/web dashboards can alert on them before UI wiring lands
  • StorageGuardAutomation
    • Enabled: master kill-switch for Supervisor-triggered automation while keeping telemetry online
    • ExportOnQuotaWarning / ExportOnQuotaCritical: automatically run retention-export when quota risk escalates into those detector states
    • ExportOnAclDrift: same automation path when Storage Guard reports ACL drift counts at or above MinimumAclDriftCount
    • CooldownMinutes: throttle so repeated detector reports do not spam retention exports; one configured value, tracked independently for quota and ACL triggers
    • OperatorIdentity / ExportDestinationOverride: customize the manifest identity stamped on exports and the folder where automation should drop bundles (defaults track the CLI root)
  • StorageGuardTrust
    • TrustRootPaths: ordered list of storage-guard-secret.json files (or directories containing them) that the --storage-guard-verify and --retention-verify CLI commands should trust. Defaults point at %ProgramData%/SmrtHub/Config/storage-guard/ with Development overlays also checking %LocalAppData% for per-developer secrets. When omitted, the helper falls back to StorageGuardSecretStore.GetSecretPath().
  • SystemSpecs
    • Enabled: when true, Supervisor writes a best-effort snapshot for other components to consume.
    • TimeoutSeconds: total startup budget for capture + write.
    • EnableWmi / WmiTimeoutMs: toggles WMI enrichment (CPU name, total memory).
    • WinRtTypeProbes: list of WinRT type names to probe via late-bound reflection (advanced; may be empty in production).
    • AI OCR note: Windows AI Text Recognition uses Microsoft.Windows.AI.Imaging.TextRecognizer.
      • windowsAiTextRecognizerRuntimeAvailable: whether the Supervisor process can locate and call TextRecognizer.GetReadyState().
      • windowsAiTextRecognizerReadyState: string form of the readiness state (for example Ready, NotReady). The snapshot does not call EnsureReadyAsync() at startup.
    • Evaluations: UI-friendly capability results (status + reasons). Current ID: ocr.windowsAi.
    • Output: %ProgramData%/SmrtHub/Logs/system-info/system-specs.json (falls back to %LocalAppData% if ProgramData is unavailable).
  • Retention
    • PolicyPath / LegalHoldPath: absolute paths to the JSON artifacts managed by Smrt.Retention. Defaults land under %ProgramData%/SmrtHub/Config/retention/ with a Development override to %LocalAppData% for individual dev profiles.
    • Cli.RequireAdmin: when true (default) retention verbs enforce Administrator privileges before mutating files. Development overlay disables this guard so automated tests and non-admin dev sessions can experiment safely.
    • Cli.ExportRoot: base directory used by --retention-export when --output is omitted. Defaults to %ProgramData%/SmrtHub/Logs/system-info/retention/exports (falling back to %LocalAppData% in dev).
  • ComplianceReport
    • OutputRoot: target directory for the compliance bundle + .sha256 hash + JSON summary when --output is omitted. Defaults to %ProgramData%/SmrtHub/Compliance/Reports (Development overrides to %LocalAppData%).
    • FileNamePrefix: bundle prefix handed to ComplianceReportGenerator (default smrthub-compliance).
    • RelativeWindowHours: default log window applied when --relative-window-hours isn’t specified.
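The storm guard settings under Supervision (GlobalRestartLimit, GlobalRestartWindowSeconds, StormGuardCooldownSeconds) can be read as a sliding-window counter with a cooldown. The Python sketch below shows that reading. The exact semantics (what counts toward the limit, and whether the window resets after the cooldown) are assumptions, and the class name is hypothetical.

```python
import time
from collections import deque


class StormGuard:
    """Blocks restarts once too many happen inside a sliding time window."""

    def __init__(self, limit, window_seconds, cooldown_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.restarts = deque()   # timestamps of recently allowed restarts
        self.tripped_until = None

    def allow_restart(self):
        now = self.clock()
        if self.tripped_until is not None:
            if now < self.tripped_until:
                return False      # still cooling down
            # Assumption: the window starts fresh once the cooldown ends.
            self.tripped_until = None
            self.restarts.clear()
        while self.restarts and now - self.restarts[0] >= self.window:
            self.restarts.popleft()
        if len(self.restarts) >= self.limit:
            self.tripped_until = now + self.cooldown
            return False
        self.restarts.append(now)
        return True
```

Passing a custom clock keeps the class easy to test without sleeping.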

When StorageGuardClient.Enabled is true the Supervisor fails fast if the shared secret or client certificate is missing, ensuring we never run without cryptographic coverage. The Storage Guard service host remains responsible for persisting storage-guard.json and storage-guard.sig; the Supervisor only consumes the signed snapshot for telemetry/logging.
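For the StorageGuardAutomation settings listed earlier in this section, the following Python sketch shows how the export triggers and the per-trigger cooldown could combine. It is an illustration under assumptions: the quota risk values "Warning" and "Critical" are inferred from the setting names, and the function and its arguments are hypothetical.

```python
from datetime import datetime, timedelta


def export_triggers(config, quota_risk, acl_drift_count, last_export, now):
    """Return which triggers ("quota", "acl") should run a retention export.

    `last_export` maps trigger kind to the time of its last export and is
    updated in place, so each kind is throttled on its own.
    """
    if not config["Enabled"]:
        return []
    cooldown = timedelta(minutes=config["CooldownMinutes"])
    wanted = []
    if (quota_risk == "Warning" and config["ExportOnQuotaWarning"]) or (
        quota_risk == "Critical" and config["ExportOnQuotaCritical"]
    ):
        wanted.append("quota")
    if config["ExportOnAclDrift"] and acl_drift_count >= config["MinimumAclDriftCount"]:
        wanted.append("acl")
    fired = []
    for kind in wanted:
        previous = last_export.get(kind)
        if previous is None or now - previous >= cooldown:
            last_export[kind] = now
            fired.append(kind)
    return fired
```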

Health endpoint

  • Enabled by config; default URL comes from appsettings.
  • Returns Supervisor + component status JSON produced by MetricsCollector (uptime, restarts, quarantine flags).
  • Includes Storage Guard telemetry summaries (quota risk, ACL drift count, retention config hash, signed snapshot timestamps) whenever the background polling worker has data.
  • Plan: enable auth/TLS in production; Windows group-based access.

Example payload (Phase 1):

{
  "status": "Healthy",
  "uptime": "0h 12m 4s",
  "sessionId": "2b4d...",
  "timestamp": "2025-11-08T03:12:04.782Z",
  "components": [
    { "id": "MouseHookPipe", "state": "Ready", "restarts": 1, "uptime": "0h 11m 58s", "quarantined": false }
  ],
  "storageGuard": {
    "dataAvailable": true,
    "quotaRisk": "Warning",
    "availableBytes": 5368709120,
    "freePercent": 0.08,
    "aclDriftCount": 0,
    "retentionConfigHash": "8F0B4B1CF8E64D0E4BE5B226F0F1E6F4C5C2A6B4F7C0E9D58E3B4301C8459D9A",
    "issues": [],
    "aclInsights": []
  }
}
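A dashboard or probe consuming this payload might reduce it to a few signals. The Python sketch below does that using only the field names visible in the example payload. The URL passed to fetch_health is whatever HealthEndpoint.Url is configured to; any path beneath it is not documented here.

```python
import json
from urllib.request import urlopen


def fetch_health(url, timeout=5):
    """GET the health endpoint and decode its JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def summarize(payload):
    """Reduce a health payload to the signals a probe usually cares about."""
    components = payload.get("components", [])
    storage = payload.get("storageGuard") or {}
    return {
        "healthy": payload.get("status") == "Healthy",
        "quarantined": [c["id"] for c in components if c.get("quarantined")],
        "restarts": sum(c.get("restarts", 0) for c in components),
        # Only trust quota data when the telemetry worker has reported.
        "quotaRisk": storage.get("quotaRisk") if storage.get("dataAvailable") else None,
    }
```

For the example payload above, summarize reports a healthy Supervisor, no quarantined components, one restart in total, and a quota risk of Warning.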

Logging

  • Console during development plus Smrt.Logging rolling files under %AppData%/SmrtHub/Logs/smrt-hub-supervisor/ (JSON + text, created automatically on startup)
  • Windows Event Log sink is enabled only in Production AND when the process is elevated; guarded to avoid crashes when not permitted.

System event monitor

  • Diagnostics/SystemEventMonitor.cs runs as an IHostedService when Supervisor starts and subscribes to Microsoft.Win32.SystemEvents.
  • Each power-mode or session-switch event is logged with the same Smrt logging stack (text + JSON + HTML export) so operator timelines show both Supervisor restarts and OS state transitions.
  • The service auto-unsubscribes on shutdown and disables itself on non-Windows hosts, keeping behavior predictable across environments.

Troubleshooting

  • Dry-run says NOT FOUND
    • Ensure you’ve built + staged apps (see Build and stage apps)
    • Confirm tokens resolve as expected; ${AppsRoot} is recommended
  • Starts then exits quickly, no logs
    • Check logs directory exists; file sink should create it automatically
    • EventLog sink is gated; non-elevated Production runs won’t use it
  • Components restart repeatedly
    • Inspect component-specific logs under staging folder
    • Increase backoffSeconds or fix dependency/readiness issues
    • Quarantine occurs after maxRestarts. Clear manually by restarting Supervisor.

Development tips

  • One-off: --print-runbook for built-in operational guidance
  • Keep manifests close to reality; prefer ${AppsRoot} to avoid TFM/RID drift
  • Use CleanApps.ps1 -CleanAppsStaging to reset the staging area (skips running apps)
  • To exercise Storage Guard end-to-end locally, set DOTNET_ENVIRONMENT=Development so appsettings.Development.json enables the client, allows self-signed server certs, and (optionally) bypasses mutual TLS while you bootstrap secrets.

Roadmap (security / ops)

Phase 1 (Implemented):

  • Console supervisor, manifest validation, restart policies (backoff + quarantine), storm guard
  • Health endpoint (localhost, no auth), named pipe operator channel
  • Logging alignment via SmrtHub.Logging

Phase 2 (Hardening / Production):

  • Windows Service host
  • Health endpoint auth (Windows group / bearer) + TLS
  • Authenticode signature + hash catalog verification prior to launch
  • Diagnostics bundle export on failure or manual trigger
  • SBOM generation in CI

Phase 3 (Operational polish / future):

  • CLI auth integration & richer RBAC
  • OpenTelemetry exporters (metrics/traces)
  • GUI dashboard
  • Alert routing (email / chat)
  • Cross-platform service abstraction (Linux systemd)

Operator control channel (named pipe)

  • Local-only, per-user named pipe secured by ACLs (current user + Administrators). Pipe name pattern: SmrtHubSupervisor_<UserSid>, so each Windows user gets an isolated channel.
  • Line-based JSON request/response protocol; one command per connection.
  • Supported commands:
    • SHUTDOWN → { ok: true, message: "Shutting down" }
    • STATUS → { ok: true, data: { restartsPaused, stormGuardActive, components:[...] } }
    • RESTART <id> → { ok: true|false }
    • PAUSE_RESTARTS / RESUME_RESTARTS → { ok: true }
    • CLEAR_QUARANTINE <id> (planned) → clears quarantine without restart

HubWindow tray “Exit SmrtHub” triggers SHUTDOWN for graceful teardown.
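For scripting against the control channel, a client can be as small as the Python sketch below. The wire format is an assumption: the command is sent as one plain-text line exactly as listed above, and the reply is read as one line of standard JSON. If the Supervisor expects a JSON request envelope instead, only encode_request would change. The helper names are hypothetical.

```python
import json


def pipe_path(user_sid):
    """Client-side path for the per-user pipe (name pattern from this README)."""
    return rf"\\.\pipe\SmrtHubSupervisor_{user_sid}"


def encode_request(command, argument=None):
    # Assumption: plain-text command line, e.g. "RESTART TriggerManager".
    line = command if argument is None else f"{command} {argument}"
    return (line + "\n").encode("utf-8")


def parse_reply(raw):
    # Assumption: the reply is a single line of standard JSON.
    return json.loads(raw.decode("utf-8").strip())


def send_command(user_sid, command, argument=None):
    """Windows only: open the pipe, send one command, read one reply line."""
    with open(pipe_path(user_sid), "r+b", buffering=0) as pipe:
        pipe.write(encode_request(command, argument))
        reply = pipe.readline()
    return parse_reply(reply)
```

The current user's SID can be looked up with whoami /user. Because the protocol handles one command per connection, each call to send_command opens a fresh connection.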


If you run into anything odd, run --dry-run first, then check the Supervisor log under %AppData%/SmrtHub/Logs/smrt-hub-supervisor/ (see Logging).


Security & Validation (Phase 2 preview)

  • Signature validation rejects unsigned/tampered binaries (Authenticode subject match)
  • Hash catalog verification blocks altered executables
  • Audit log records operator actions (restart, shutdown, clear quarantine)

Diagnostics Bundle (planned)

  • Manifest, sanitized config, last N log lines, metrics snapshot, environment summary

SBOM (planned)

  • Generated CycloneDX or SPDX stored with build artifacts for supply-chain audits


Document coverage: This README now incorporates the implementation plan; the separate plan document has been retired.

Generated output

  • bin/ and obj/ are build artifacts produced by the .NET SDK and remain excluded from recursive README coverage.