
Component: Smrt.ExtractStructuredText.Host

Canonical source: SmrtApps/src/Smrt.ExtractStructuredText.Host/README.md (mirrored below)


Smrt.ExtractStructuredText.Host

Host-side structured extraction execution helpers.

Overview and responsibilities

  • Provides host-owned execution for Smrt.ExtractStructuredText (provider candidate mux + optional local executors).
  • Keeps the core library vendor-agnostic by isolating provider execution details in the host layer.

Public surface / entry points

  • StructuredExtractionExecutorMux
  • StructuredExtractionAsyncExecutorMux
  • Optional local provider executors (Windows OCR)

Dependencies and integrations

  • Consumes contracts/planning from Smrt.ExtractStructuredText.
  • Local executors integrate with Windows OCR APIs via late binding.

Configuration and operational data

  • Hosts may set the environment variable SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT to cap how many PDF pages the local executors process per document.
  • No canonical config/state files are owned by this library.

Observability and diagnostics

  • Logs must not include document bytes or extracted content.
  • If attempt outcomes are logged, log metadata only (provider id, elapsed time, status).
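
As an illustration of the metadata-only rule above, here is a minimal sketch of how a host might record an attempt. SketchAttemptLog and SketchAttemptLogger are hypothetical names invented for this example and are not part of this library; the status strings are also assumptions.

```csharp
using System;
using System.Diagnostics;

// Hypothetical log record: provider id, elapsed time and status only.
public sealed record SketchAttemptLog(string ProviderId, TimeSpan Elapsed, string Status);

public static class SketchAttemptLogger
{
    // Times one attempt and reports only metadata about how it ended.
    public static SketchAttemptLog Measure(string providerId, Action attempt)
    {
        var stopwatch = Stopwatch.StartNew();
        string status;
        try
        {
            attempt();
            status = "succeeded";
        }
        catch (Exception ex)
        {
            // Assumption: only the exception type goes into the log status,
            // so nothing derived from the document can leak into it.
            status = "failed:" + ex.GetType().Name;
        }
        stopwatch.Stop();
        return new SketchAttemptLog(providerId, stopwatch.Elapsed, status);
    }

    // Renders a single log line that contains no document bytes or extracted text.
    public static string Format(SketchAttemptLog entry) =>
        FormattableString.Invariant(
            $"provider={entry.ProviderId} elapsed_ms={entry.Elapsed.TotalMilliseconds:F0} status={entry.Status}");
}
```

A host would wrap each provider call in Measure and pass the Format output to its own logger.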

Testing and validation

  • Build (Debug, win-x64):
    • dotnet build SmrtApps/src/Smrt.ExtractStructuredText.Host/Smrt.ExtractStructuredText.Host.csproj -c Debug -r win-x64
    • dotnet build SmrtApps/src/Smrt.ExtractStructuredText.Tests/Smrt.ExtractStructuredText.Tests.csproj -c Debug -r win-x64
  • Unit tests:
    • dotnet test SmrtApps/src/Smrt.ExtractStructuredText.Tests/Smrt.ExtractStructuredText.Tests.csproj -c Debug -r win-x64 --no-build

Support Bundle

  • Not applicable directly (library); collect logs from the hosting application via Support Bundle.

Purpose

This project is the host-owned execution layer for Smrt.ExtractStructuredText.

  • Provides a simple mux (StructuredExtractionExecutorMux) that tries a planned provider candidate list in order.
  • Provides an async mux (StructuredExtractionAsyncExecutorMux) for upload + poll providers.
  • Keeps the core library vendor-agnostic (no vendor SDK types in Smrt.ExtractStructuredText).

This package also includes optional local Windows OCR executors that emit layout geometry into StructuredDocument:

  • WindowsAiOcrStructuredExtractionProviderExecutor (Windows AI OCR, when available/ready)
  • WindowsLegacyOcrStructuredExtractionProviderExecutor (Windows.Media.Ocr fallback)

Notes

  • The mux does not log document bytes or extracted content.
  • Exceptions are captured into failure reasons (type + message) to help callers diagnose issues without dumping sensitive payloads.
  • Hosts are responsible for supplying concrete IStructuredExtractionProviderExecutor implementations (including registering the local executors above if desired).
  • Async hosts supply IStructuredExtractionAsyncProviderExecutor implementations.
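
The ordered-candidate behavior and the failure-reason capture described above can be sketched as follows. This is not the actual StructuredExtractionExecutorMux implementation: ISketchExecutor, SketchDelegateExecutor, SketchAttemptFailure, SketchMuxResult and SketchMux are hypothetical stand-ins, because the real IStructuredExtractionProviderExecutor signature is not shown in this README.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a provider executor; not the library's contract.
public interface ISketchExecutor
{
    string ProviderId { get; }
    string Extract(byte[] document); // placeholder result type
}

// Convenience executor backed by a delegate (hypothetical, for illustration).
public sealed class SketchDelegateExecutor : ISketchExecutor
{
    private readonly Func<byte[], string> _extract;

    public SketchDelegateExecutor(string providerId, Func<byte[], string> extract)
    {
        ProviderId = providerId;
        _extract = extract;
    }

    public string ProviderId { get; }

    public string Extract(byte[] document) => _extract(document);
}

public sealed record SketchAttemptFailure(string ProviderId, string Reason);

public sealed record SketchMuxResult(
    string? ProviderId,
    string? Output,
    IReadOnlyList<SketchAttemptFailure> Failures)
{
    public bool Succeeded => ProviderId is not null;
}

public static class SketchMux
{
    // Tries each candidate in the planned order and stops at the first success.
    public static SketchMuxResult Run(IEnumerable<ISketchExecutor> candidates, byte[] document)
    {
        var failures = new List<SketchAttemptFailure>();
        foreach (var candidate in candidates)
        {
            try
            {
                var output = candidate.Extract(document);
                return new SketchMuxResult(candidate.ProviderId, output, failures);
            }
            catch (Exception ex)
            {
                // Capture exception type + message as the failure reason;
                // the document bytes and extracted content are never recorded.
                failures.Add(new SketchAttemptFailure(
                    candidate.ProviderId, $"{ex.GetType().Name}: {ex.Message}"));
            }
        }
        // Every candidate failed (or the list was empty).
        return new SketchMuxResult(null, null, failures);
    }
}
```

In this sketch a caller inspects Failures even on success, which shows which earlier candidates were skipped and why.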

Local Input Formats

The built-in local OCR executors accept inputs that can be decoded/rendered into Windows.Graphics.Imaging.SoftwareBitmap:

  • Raster images: anything Windows.Graphics.Imaging.BitmapDecoder can decode (commonly: PNG, JPEG/JPG, BMP, GIF, TIFF; and in many environments also ICO and Windows Photo/HD Photo/JXR).
  • PDF: application/pdf is supported by rendering each page via Windows.Data.Pdf and running OCR per page (results are merged into a single StructuredDocument).

Current Limitations

  • Password-protected/encrypted PDFs are not supported.
  • Document formats other than raster images and PDF (DOCX, PPTX, HTML, etc.) are not supported by the local executors today.
  • Very large PDFs may be slow because every page is rendered and OCR'd locally; consider cloud providers for heavy document workloads.

Optional Safety Limit (PDF Page Cap)

Hosts may set SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT to a positive integer to cap how many pages are rendered and OCR'd per PDF input. When the variable is unset or invalid (not a positive integer), PDFs are processed without a page cap.
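
A minimal sketch of how a host could read this variable is shown below. SketchPdfPageCap is a hypothetical helper, not part of this library. Two things in it are assumptions rather than documented behavior: surrounding whitespace is tolerated when parsing, and the cap is applied by processing only the first N pages.

```csharp
#nullable enable
using System;
using System.Globalization;

// Hypothetical helper; the library's own parsing code is not shown in this README.
public static class SketchPdfPageCap
{
    public const string VariableName = "SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT";

    // Returns the cap, or null (no cap) when the value is missing or not a positive integer.
    public static int? Parse(string? raw) =>
        int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out var value) && value > 0
            ? value
            : (int?)null;

    // Reads the cap from the current process environment.
    public static int? FromEnvironment() =>
        Parse(Environment.GetEnvironmentVariable(VariableName));

    // Assumption: the cap limits processing to the first N pages of the document.
    public static int PagesToProcess(int pageCount, int? cap) =>
        cap.HasValue ? Math.Min(pageCount, cap.Value) : pageCount;
}
```

For example, with the variable set to 25, a 100-page PDF would have 25 pages processed under these assumptions, while a 10-page PDF would be processed in full.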