SLA-Backed Escalation Playbook

Applies to: SmrtHub infrastructure/security incidents spanning HubWindow, Supervisor, Storage Guard, Support Bundle pipelines, and retention tooling. This document complements the Operational Data, Privacy/Security, and Logging READMEs under README.Files/ and inherits all canonical path + telemetry rules from those sources.

1. Goals & Scope

  • Provide a deterministic response sequence for any infra or security disruption before UI/portal surfaces exist.
  • Define severity tiers, acknowledgements, mitigation budgets, and evidence expectations that align with contractual SLAs.
  • Enumerate the telemetry that must be captured for every escalation so future auditors can correlate Storage Guard, Support Bundle, and retention artifacts without extra digging.
  • Supply a text-first decision tree so operators can act immediately via CLI/automation while dashboards are still under construction.

2. Severity Matrix & SLA Timelines

| Severity | Definition | Initial Acknowledgement | Containment/Mitigation | Evidence Package | Stakeholder Comms |
| --- | --- | --- | --- | --- | --- |
| Sev 0 (Critical) | Data access loss, signed evidence corruption, or confirmed compromise of retention/legal artifacts. | 5 minutes via Supervisor CLI or PagerDuty equivalent. | 30 minutes to isolate affected components (Supervisor shutdown, Storage Guard snapshot freeze, retention lock). | 60 minutes: Support Bundle (smrthub-support), retention export, Storage Guard signed snapshot + verification JSON. | 45 minutes: Security + Infra leads, customer success distribution. |
| Sev 1 (High) | Imminent quota exhaustion, ACL drift breaking automated exports, or Storage Guard signature mismatch without data loss. | 15 minutes. | 2 hours to remediate (e.g., expand volume, fix ACLs, regenerate signatures). | 4 hours: Support Bundle + Storage Guard telemetry + supervisor metrics snapshot. | 2 hours: Infra + Compliance owners. |
| Sev 2 (Medium) | Non-blocking telemetry gaps, delayed snapshots, or failed scheduled compliance reports. | 1 hour. | 1 business day or before next scheduled capture, whichever is sooner. | Attach most recent Storage Guard snapshot + new CLI run results. | Daily digest to Ops backlog. |

Reference: SLA definitions extend the Operational Data Policy (§4) and Privacy/Security Policy (§6). If this playbook conflicts with those policies, its timelines take precedence; the underlying policies still govern artifact locations and redaction rules.
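The clock-based budgets in the matrix can be turned into concrete deadlines for an incident. The following is a minimal sketch, not part of any SmrtHub tooling: the names `SLA_MINUTES` and `sla_deadlines` are hypothetical, and it assumes every budget is measured from the detection timestamp, which this playbook does not state explicitly. Sev 2 budgets that are not expressed in clock time ("1 business day", "daily digest") are left as `None` rather than guessed.

```python
from datetime import datetime, timedelta, timezone

# Minutes copied from the severity matrix. None marks a budget the matrix
# expresses in non-clock terms, which this sketch does not model.
SLA_MINUTES = {
    "Sev0": {"acknowledge": 5, "mitigate": 30, "evidence": 60, "comms": 45},
    "Sev1": {"acknowledge": 15, "mitigate": 120, "evidence": 240, "comms": 120},
    "Sev2": {"acknowledge": 60, "mitigate": None, "evidence": None, "comms": None},
}


def sla_deadlines(severity, detected_at):
    """Return a phase -> deadline mapping for the given severity.

    Assumption: every budget starts counting at detection time.
    """
    budgets = SLA_MINUTES[severity]
    return {
        phase: detected_at + timedelta(minutes=minutes) if minutes is not None else None
        for phase, minutes in budgets.items()
    }


if __name__ == "__main__":
    detected = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    for phase, deadline in sla_deadlines("Sev0", detected).items():
        print(phase, deadline.isoformat() if deadline else "see matrix")
```

An operator script could compare these deadlines against the timestamps recorded in the incident timeline to flag a missed acknowledgement or evidence window.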

3. Mandatory Telemetry Checklist

For every escalation (regardless of severity) capture the following before closing the incident:

  1. Storage Guard Snapshot + Signature (%ProgramData%/SmrtHub/Logs/system-info/storage-guard.{json,sig}) fetched through the signed HTTP client. Verify with StorageGuardSignatureVerifier and attach the verification JSON.
  2. Retention Export via SmrtHubSupervisor.exe --retention-export or HubWindow tray (when available). Store manifest + operator identity alongside the incident ticket.
  3. Support Bundle generated with the compliance preset (DiagnosticsActions.GenerateSupportBundleAsync default categories) to capture logs, retention data, and Storage Guard verification.
  4. Unified HTML Logs from SmrtHub.Logging.ExportUnifiedHtmlLogs scoped to the affected components (Supervisor, Storage Guard host, HubWindow).
  5. Quota Forecast Snapshot (new telemetry described in §4) to illustrate the ACL/quota state at detection + post-mitigation. Capture the supervisor health payload or compliance summary fields storageGuard.aclDriftCount and storageGuard.retentionConfigHash alongside the quota risk so auditors see both the drift volume and the exact retention payload hash.
  6. Incident Timeline JSON (appendix) recording: detection timestamp, acknowledgement timestamp, mitigation start/stop, and validation notes.

Embed these outputs in the same evidence folder referenced by the Compliance Report README so auditors can cross-link between exports.
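Before an incident is closed, the checklist can be verified mechanically. The sketch below is illustrative only: the keys in `REQUIRED_EVIDENCE` are hypothetical labels for the six checklist items (with the Storage Guard verification JSON tracked separately), not file names or identifiers defined by SmrtHub.

```python
# Hypothetical labels for the mandatory telemetry items; they are not
# SmrtHub identifiers and would need mapping to real artifact paths.
REQUIRED_EVIDENCE = (
    "storage_guard_snapshot",
    "storage_guard_verification_json",
    "retention_export_manifest",
    "support_bundle",
    "unified_html_logs",
    "quota_forecast_snapshot",
    "incident_timeline_json",
)


def missing_evidence(collected):
    """Return the required items that have no path recorded yet.

    `collected` maps an evidence label to the path where the artifact
    was stored; an empty or missing path counts as not collected.
    """
    present = {label for label, path in collected.items() if path}
    return [label for label in REQUIRED_EVIDENCE if label not in present]


if __name__ == "__main__":
    collected = {
        "storage_guard_snapshot": "evidence/storage-guard.json",
        "support_bundle": "evidence/bundle.zip",
        "unified_html_logs": "",  # export attempted but nothing stored
    }
    print(missing_evidence(collected))
```

An empty result is the signal that the evidence folder is complete enough to close the incident.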

4. Decision Trees & Telemetry Hooks

4.1 Infra vs Security Branching

  1. Is Storage Guard signature invalid or missing?
     • Yes: Treat as Security (Sev0/1). Immediately pause Supervisor restarts, lock retention directories (set ACL read-only via StorageGuardAclDetector guidance), and regenerate the signature via the Storage Guard service host. Continue at step 3.
     • No: Continue to step 2.
  2. Is quota forecast risk >= Warning?
     • Yes: Follow Infra branch: trigger quota remediation (expand SmrtSpace volume or free space) and configure Supervisor telemetry emitters (see §4.2) to keep Prometheus/Grafana aware until cleared.
     • No: Continue to step 3.
  3. Are ACL insights degraded (MissingPrincipal, InheritanceBroken, AccessDenied)?
     • Yes: Security branch with retention admin involvement; capture StorageGuardAclInsight list and run Supervisor CLI --retention-status for a before/after diff.
     • No: Continue to step 4.
  4. Does compliance reporting fail (CLI or tray) for more than two consecutive runs?
     • Yes: Treat as Infra (Sev2) unless combined with security signals above.
     • No: Document as telemetry gap and backlog the fix.
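The branching above can be expressed as a small function for CLI automation. This is a sketch under stated assumptions rather than SmrtHub code: `Signals`, `triage`, and the ordering in `QUOTA_RISK_ORDER` are hypothetical (the playbook names only the `Warning` level), a "Yes" at the quota or ACL question is treated as terminal, and the telemetry-gap outcome is reported only when no earlier signal fired.

```python
from dataclasses import dataclass

# Assumed ordering of quota risk levels; only "Warning" appears in this playbook.
QUOTA_RISK_ORDER = ("None", "Warning", "Critical")
DEGRADED_ACL = {"MissingPrincipal", "InheritanceBroken", "AccessDenied"}


@dataclass
class Signals:
    signature_valid: bool
    quota_risk: str = "None"
    acl_statuses: tuple = ()
    consecutive_report_failures: int = 0


def triage(s):
    """Walk the four questions in order and return the resulting findings."""
    findings = []
    # Question 1: signature invalid or missing -> Security, then jump to question 3.
    if not s.signature_valid:
        findings.append("security: signature invalid or missing (Sev0/1)")
    # Question 2: only reached when the signature is fine.
    elif QUOTA_RISK_ORDER.index(s.quota_risk) >= QUOTA_RISK_ORDER.index("Warning"):
        return ["infra: quota remediation"]
    # Question 3: degraded ACL insights -> Security with retention admin.
    if DEGRADED_ACL.intersection(s.acl_statuses):
        findings.append("security: ACL degraded, involve retention admin")
        return findings
    # Question 4: more than two consecutive compliance report failures.
    if s.consecutive_report_failures > 2:
        if findings:
            findings.append("compliance reporting failing alongside security signal")
        else:
            findings.append("infra: compliance reporting failure (Sev2)")
    elif not findings:
        findings.append("telemetry gap: backlog the fix")
    return findings


if __name__ == "__main__":
    print(triage(Signals(signature_valid=False, acl_statuses=("AccessDenied",))))
```

Returning a list rather than a single label keeps combined cases visible, such as an invalid signature that coincides with ACL drift.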

4.2 Telemetry Hook Actions (pre-UI)

  • Supervisor Metrics Collector Extension: add Storage Guard aggregate fields (latest quota risk, ACL drift count, retention config hash) so the health endpoint exposes them for Grafana even before HubWindow renders dashboards.
  • Log Routing: ensure every escalation automatically tags logs with incidentId and severity (same value recorded in the incident timeline JSON). This is implemented via the logging enrichers already documented in README.Files/SmrtHub.Logging.README.md.
  • Alert Hooks: until UI toggles exist, wire the new detector outputs (quota warning, ACL drift) into the same structured log stream consumed by the monitoring pipeline. Appendix B in this file lists the log templates.
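To illustrate the tagging requirement, the sketch below builds a structured log line that always carries incidentId and severity. It is a stand-in written in Python for illustration, not the SmrtHub.Logging enricher itself; the function name `escalation_event` is hypothetical, and the field names follow the Appendix B templates.

```python
import json
import uuid


def escalation_event(event, incident_id, severity, **fields):
    """Serialize one detector event as a compact JSON line.

    The incident tags are written last so that caller-supplied fields
    cannot overwrite them.
    """
    record = {"event": event, **fields, "incidentId": incident_id, "severity": severity}
    return json.dumps(record, separators=(",", ":"))


if __name__ == "__main__":
    incident_id = str(uuid.uuid4())  # same value goes into the incident timeline JSON
    print(escalation_event("storage_guard_quota", incident_id, "Sev1", risk="Warning"))
    print(escalation_event(
        "acl_drift_detected", incident_id, "Sev1",
        target="storage-guard-secret", status="MissingPrincipal",
    ))
```

Emitting both detector outputs through one helper keeps the tags consistent, so the monitoring pipeline can group every line of an escalation by incidentId.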

5. Communication Runbook

  1. Acknowledgement template (<=5 min)
     • Subject: [SmrtHub][SevX] <short description>
     • Include detection source (Supervisor health, Storage Guard HTTP, manual) and whether customer impact is confirmed or suspected.
  2. Send mitigation updates at the intervals defined in §2. Summaries must reference the evidence folder path plus current quota risk + ACL status values.
  3. Close out once verification passes: attach the compliance report summary + retention export manifest, and log the final Storage Guard quota risk.

6. Hand-Off Targets

  • Infra On-Call: Owns volume expansion, Supervisor restarts, alert tuning.
  • Security On-Call: Owns ACL remediation, signature validation, retention integrity.
  • Compliance Lead: Owns retention exports + reporting back to auditors; ensures Support Bundle + compliance summaries are archived per README.Files/Smrt.SupportBundle.README.md.

7. Appendices

  • Appendix A: Incident timeline JSON schema (mirrors Support Bundle timeline.json).
  • Appendix B: Log templates
     • "event":"storage_guard_quota","risk":"Warning","incidentId":"<guid>"
     • "event":"acl_drift_detected","target":"storage-guard-secret","status":"MissingPrincipal"
     • "event":"retention_export_complete","manifest":".../manifest.json","operator":"DOMAIN\\user"
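The Appendix A schema is not reproduced in this file, so the following is only a sketch of what an incident timeline record could look like. Every key name here is an assumption chosen to match the items listed in §3 (detection, acknowledgement, mitigation start/stop, validation notes); the authoritative layout is the Support Bundle timeline.json.

```python
import json
from datetime import datetime, timezone


def build_timeline(incident_id, severity, detected, acknowledged,
                   mitigation_start=None, mitigation_stop=None, validation_notes=""):
    """Assemble a timeline record with UTC ISO 8601 timestamps.

    Timestamps must be timezone-aware; unset phases are stored as None.
    Key names are hypothetical, not the Appendix A schema.
    """
    def iso(ts):
        return ts.astimezone(timezone.utc).isoformat() if ts else None

    return {
        "incidentId": incident_id,
        "severity": severity,
        "detectedAt": iso(detected),
        "acknowledgedAt": iso(acknowledged),
        "mitigationStartedAt": iso(mitigation_start),
        "mitigationStoppedAt": iso(mitigation_stop),
        "validationNotes": validation_notes,
    }


if __name__ == "__main__":
    record = build_timeline(
        "<guid>", "Sev1",
        detected=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
        acknowledged=datetime(2024, 1, 1, 12, 9, tzinfo=timezone.utc),
        validation_notes="Signature re-verified after regeneration.",
    )
    print(json.dumps(record, indent=2))
```

Using the same incidentId and severity values here and in the structured log stream is what lets auditors join the timeline to the logs.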

Keep this file in lockstep with:

  • README.Files/SmrtHub-Operational-Data-Policy-v1.0.README.md
  • README.Files/SmrtHub-Privacy-and-Security-Policy.README.md
  • README.Files/SmrtHub.Logging.README.md

Any change to telemetry locations or retention flows must update those documents and this playbook simultaneously.