Component: Smrt.ExtractStructuredText

Canonical source: SmrtApps/src/Smrt.ExtractStructuredText/README.md (mirrored below)


Smrt.ExtractStructuredText

Planning and orchestration for structured document extraction (tables, forms, key-value pairs, and document structure).

Overview and responsibilities

  • Plans structured extraction execution (provider selection and ordered candidates).
  • Defines the vendor-agnostic contracts used by host executors.

Public surface / entry points

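  • Planning/orchestration APIs and execution contracts, including the executor interfaces IDocumentStructuredExtractionExecutor and IDocumentStructuredExtractionAsyncExecutor (see source for the full set of public types).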

Dependencies and integrations

  • Requests CapabilityId.DocumentStructuredExtraction via Smrt.CloudProviders.
  • Delegates execution to host-supplied executors (mux implementations live in Smrt.ExtractStructuredText.Host).

Configuration and operational data

  • No persistent config/state is owned by this library.
  • Hosts may enforce a local PDF page cap via SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT when implementing “render pages → OCR”.

Observability and diagnostics

  • Log metadata only (planned candidates + used provider).
  • Never log document bytes or extracted content.
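
The metadata-only rule can be illustrated with a short, language-neutral sketch in Python (the library itself is .NET). The function names and field names below are hypothetical and are not part of the library's API; the point is only that the logging helper has no parameter through which document bytes or extracted content could reach a log line.

```python
import json
import logging

logger = logging.getLogger("structured_extraction")


def build_extraction_log_record(planned_candidates, used_provider):
    """Return a log-safe dict that holds metadata only.

    Hypothetical helper: it deliberately accepts neither document bytes
    nor extracted content, so neither can leak into the log output.
    """
    return {
        "planned_candidates": list(planned_candidates),
        "used_provider": used_provider,
    }


def log_extraction(planned_candidates, used_provider):
    """Emit one metadata-only log line and return the record that was logged."""
    record = build_extraction_log_record(planned_candidates, used_provider)
    logger.info("structured extraction: %s", json.dumps(record, sort_keys=True))
    return record
```

A host would call log_extraction once per request, after the executor has returned, passing only the provider identifiers.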

Testing and validation

  • Build (Debug, win-x64):
    • dotnet build SmrtApps/src/Smrt.ExtractStructuredText/Smrt.ExtractStructuredText.csproj -c Debug -r win-x64
    • (End-to-end wiring) dotnet build SmrtApps/src/Smrt.ExtractStructuredText.Host/Smrt.ExtractStructuredText.Host.csproj -c Debug -r win-x64
  • Unit tests:
    • dotnet test SmrtApps/src/Smrt.ExtractStructuredText.Tests/Smrt.ExtractStructuredText.Tests.csproj -c Debug -r win-x64 --no-build
    • (--no-build assumes the test project was already built with the same -c and -r values.)

Support Bundle

  • Not directly applicable (this is a library); collect the host application's logs via Support Bundle.

Design

  • Vendor-agnostic core library.
  • Requests CapabilityId.DocumentStructuredExtraction from Smrt.CloudProviders and produces an execution contract.
  • Delegates execution to a host-supplied executor interface (IDocumentStructuredExtractionExecutor).
  • Host mux is provided by Smrt.ExtractStructuredText.Host (StructuredExtractionExecutorMux).
  • Async provider support is available via IDocumentStructuredExtractionAsyncExecutor and Smrt.ExtractStructuredText.Host.StructuredExtractionAsyncExecutorMux.
  • Local structured extraction (SmrtHub) uses Windows OCR directly (AI OCR when available, else legacy OCR) and normalizes layout geometry (lines/words/bounding boxes) into the structured schema.
  • Logs metadata only (planned candidates + used provider); never logs document bytes or extracted content.
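
The plan-then-delegate flow described in this list can be sketched as follows. This is a language-neutral illustration in Python, not the library's .NET API: ExecutionPlan, ExtractionOutcome, and ExecutorMux are hypothetical stand-ins for the execution contract and the host mux, and the fall-through-on-failure behaviour is an assumption rather than documented behaviour of StructuredExtractionExecutorMux.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A host-supplied executor: takes (content_type, payload), returns a structured result.
Executor = Callable[[str, bytes], dict]


@dataclass(frozen=True)
class ExecutionPlan:
    """Ordered provider candidates produced by planning (hypothetical shape)."""
    candidates: List[str]


@dataclass(frozen=True)
class ExtractionOutcome:
    """The provider that was actually used, plus its structured result."""
    used_provider: str
    result: dict


class ExecutorMux:
    """Routes a plan to host-supplied executors, trying candidates in order."""

    def __init__(self, executors: Dict[str, Executor]):
        self._executors = dict(executors)

    def execute(self, plan: ExecutionPlan, content_type: str, payload: bytes) -> ExtractionOutcome:
        last_error = None
        for provider in plan.candidates:
            executor = self._executors.get(provider)
            if executor is None:
                continue  # the host supplied no executor for this candidate
            try:
                return ExtractionOutcome(provider, executor(content_type, payload))
            except Exception as exc:  # assumption: fall through to the next candidate
                last_error = exc
        raise RuntimeError("no planned candidate produced a result") from last_error
```

Keeping the plan as plain data (an ordered list of provider identifiers) is what lets the core stay vendor-agnostic: only the executors registered by the host know how to talk to a specific provider.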

Inputs

  • Document payload bytes (content type + bytes), or
  • Host-resolved file path references (file selection is owned by the host).
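
The two input shapes can be modelled as a small tagged union. The sketch below is in Python for illustration only; DocumentPayload, DocumentFileReference, and input_kind are hypothetical names, not the library's .NET types.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class DocumentPayload:
    """In-memory document: content type plus raw bytes."""
    content_type: str
    data: bytes


@dataclass(frozen=True)
class DocumentFileReference:
    """A file path the host has already resolved; the library does not pick files."""
    path: str


ExtractionInput = Union[DocumentPayload, DocumentFileReference]


def input_kind(item: ExtractionInput) -> str:
    """Return which input shape was supplied, without exposing its content."""
    if isinstance(item, DocumentPayload):
        return "payload"
    if isinstance(item, DocumentFileReference):
        return "file_reference"
    raise TypeError(f"unsupported input type: {type(item).__name__}")
```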

Common Content Types

  • Local Windows OCR hosts typically operate on raster images (e.g., image/png, image/jpeg, image/tiff).
  • Some hosts may also support PDF (application/pdf) by rendering pages to images before OCR.
  • Cloud providers may support additional formats depending on vendor/service (PDF, Office docs, HTML, etc.).

Local vs Cloud Expectations

  • Local (Windows OCR) is optimized for screenshots and scan-like inputs. It is generally image-first, and any non-image support (like PDF) is typically implemented as “render to images → OCR”.
  • Cloud providers frequently support richer document formats, but constraints are service-specific (file size limits, page limits, supported MIME types, encryption rules, etc.).

Input Limitations (Local)

Hosts using Windows OCR should assume at least these limitations unless a provider explicitly documents otherwise:

  • Password-protected/encrypted PDFs are not supported.
  • Office document formats (DOCX/PPTX/XLSX), HTML, and other non-image formats are not supported without a dedicated converter/rasterizer.
  • Very large documents may be slow to render and OCR locally.
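
A host could apply these limitations as a pre-check before invoking local OCR. The Python sketch below is illustrative only: the names are hypothetical, the set of raster types contains just the examples listed in this document rather than an authoritative list, and the host_renders_pdf and pdf_is_encrypted flags are assumed to be determined by the host elsewhere.

```python
from enum import Enum


class LocalSupport(Enum):
    DIRECT_OCR = "direct_ocr"            # raster image, OCR it as-is
    RENDER_THEN_OCR = "render_then_ocr"  # PDF, render pages to images first
    UNSUPPORTED = "unsupported"          # needs a dedicated converter or a cloud provider


# Illustrative set taken from the examples in this document.
RASTER_IMAGE_TYPES = frozenset({"image/png", "image/jpeg", "image/tiff"})
PDF_TYPE = "application/pdf"


def classify_for_local_ocr(content_type, *, host_renders_pdf=False, pdf_is_encrypted=False):
    """Classify a MIME type against the local Windows OCR limitations."""
    # Drop parameters such as "; charset=..." and normalize case.
    normalized = content_type.split(";", 1)[0].strip().lower()
    if normalized in RASTER_IMAGE_TYPES:
        return LocalSupport.DIRECT_OCR
    if normalized == PDF_TYPE and host_renders_pdf and not pdf_is_encrypted:
        return LocalSupport.RENDER_THEN_OCR
    # Office formats, HTML, encrypted PDFs, and anything else fall through here.
    return LocalSupport.UNSUPPORTED
```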

Optional Safety Limit (PDF Page Cap)

Local Windows OCR hosts that implement PDF as “render pages → OCR” may optionally enforce a per-document page cap via SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT.
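
One way a host might read and apply the cap is sketched below in Python (for illustration; real hosts are .NET). Only the environment variable name comes from this document. The helper names are hypothetical, and two behaviours are assumptions: an unset, non-numeric, or non-positive value is treated as "no cap", and documents over the cap are truncated to the first pages rather than rejected.

```python
import os

PAGE_CAP_ENV_VAR = "SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT"


def read_pdf_page_cap(environ=None):
    """Return the cap as a positive int, or None when no usable cap is set.

    Assumption: missing, non-numeric, or non-positive values mean "no cap".
    """
    env = os.environ if environ is None else environ
    raw = env.get(PAGE_CAP_ENV_VAR, "").strip()
    try:
        cap = int(raw)
    except ValueError:
        return None
    return cap if cap > 0 else None


def pages_to_render(total_pages, cap):
    """Return how many pages to render before OCR.

    Assumption: over-cap documents are truncated; a host could reject them instead.
    """
    if cap is None:
        return total_pages
    return min(total_pages, cap)
```

Passing the environment as a parameter keeps the helper easy to test without mutating process-wide state.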

Status

  • Library skeleton + contracts implemented.
  • Local Windows OCR executors are provided by Smrt.ExtractStructuredText.Host; HubWindow wiring is planned, not implemented here.