Component: Smrt.ExtractStructuredText.Host¶
Canonical source: SmrtApps/src/Smrt.ExtractStructuredText.Host/README.md (mirrored below)
Smrt.ExtractStructuredText.Host¶
Host-side structured extraction execution helpers.
Overview and responsibilities¶
- Provides host-owned execution for Smrt.ExtractStructuredText (provider candidate mux + optional local executors).
- Keeps the core library vendor-agnostic by isolating provider execution details in the host layer.
Public surface / entry points¶
- StructuredExtractionExecutorMux
- StructuredExtractionAsyncExecutorMux
- Optional local provider executors (Windows OCR)
Dependencies and integrations¶
- Consumes contracts/planning from Smrt.ExtractStructuredText.
- Local executors integrate with Windows OCR APIs via late-binding.
Configuration and operational data¶
- Environment variable SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT may be used by hosts to cap local PDF page processing.
- No canonical config/state files are owned by this library.
Observability and diagnostics¶
- Logs must not include document bytes or extracted content.
- If attempt outcomes are logged, log metadata only (provider id, elapsed, status).
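A minimal sketch of what a metadata-only attempt log entry could look like. The `AttemptLogEntry` type and its line format are hypothetical and not part of this library; the point is only that the shape has no field that could carry document bytes or extracted content.

```csharp
using System;
using System.Globalization;

// Hypothetical metadata-only record for one provider attempt (not a library type).
// It deliberately has no field for document bytes or extracted text.
public sealed record AttemptLogEntry(string ProviderId, TimeSpan Elapsed, string Status)
{
    // Formats a single log line from metadata only.
    public string ToLogLine() =>
        string.Format(
            CultureInfo.InvariantCulture,
            "provider={0} elapsedMs={1:F0} status={2}",
            ProviderId, Elapsed.TotalMilliseconds, Status);
}
```

A host could build one entry per attempt and pass `ToLogLine()` to whatever logger it already uses.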
Testing and validation¶
- Build (Debug, win-x64):
dotnet build SmrtApps/src/Smrt.ExtractStructuredText.Host/Smrt.ExtractStructuredText.Host.csproj -c Debug -r win-x64
dotnet build SmrtApps/src/Smrt.ExtractStructuredText.Tests/Smrt.ExtractStructuredText.Tests.csproj -c Debug -r win-x64
- Unit tests:
dotnet test SmrtApps/src/Smrt.ExtractStructuredText.Tests/Smrt.ExtractStructuredText.Tests.csproj -c Debug -r win-x64 --no-build
Support Bundle¶
- Not applicable directly (library); collect logs from the hosting application via Support Bundle.
Related docs¶
- Smrt.ExtractStructuredText: SmrtApps/src/Smrt.ExtractStructuredText/README.md
- Smrt.ExtractText (unstructured OCR): SmrtApps/src/Smrt.ExtractText/README.md
Purpose¶
This project is the host-owned execution layer for Smrt.ExtractStructuredText.
- Provides a simple mux (StructuredExtractionExecutorMux) that tries a planned provider candidate list in order.
- Provides an async mux (StructuredExtractionAsyncExecutorMux) for upload + poll providers.
- Keeps the core library vendor-agnostic (no vendor SDK types in Smrt.ExtractStructuredText).
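The "try candidates in order" idea can be illustrated with a small sketch. This is not the actual StructuredExtractionExecutorMux: the real executor interface and result types are not shown in this README, so the `SketchMux`, `SketchFailure`, and `SketchOutcome` names, the delegate-based candidate shape, and the string result are all assumptions made for illustration.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only; not the library's API.
public sealed record SketchFailure(string ProviderId, string Reason);

public sealed record SketchOutcome(string? Result, string? ProviderId, IReadOnlyList<SketchFailure> Failures)
{
    public bool Succeeded => Result is not null;
}

public static class SketchMux
{
    // Tries each candidate in the planned order; the first success wins.
    public static SketchOutcome Run(
        IEnumerable<(string ProviderId, Func<byte[], string> Execute)> candidates,
        byte[] document)
    {
        var failures = new List<SketchFailure>();
        foreach (var (providerId, execute) in candidates)
        {
            try
            {
                return new SketchOutcome(execute(document), providerId, failures);
            }
            catch (Exception ex)
            {
                // Record exception type + message only; never the document or extracted content.
                failures.Add(new SketchFailure(providerId, $"{ex.GetType().Name}: {ex.Message}"));
            }
        }
        // Every candidate failed: return the collected failure reasons.
        return new SketchOutcome(null, null, failures);
    }
}
```

Returning the failures alongside a successful result lets a caller see which earlier candidates were skipped and why, without inspecting any payload.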
This package also includes optional local Windows OCR executors that emit layout geometry into StructuredDocument:
- WindowsAiOcrStructuredExtractionProviderExecutor (Windows AI OCR, when available/ready)
- WindowsLegacyOcrStructuredExtractionProviderExecutor (Windows.Media.Ocr fallback)
Notes¶
- The mux does not log document bytes or extracted content.
- Exceptions are captured into failure reasons (type + message) to help callers diagnose issues without dumping sensitive payloads.
- Hosts are responsible for supplying concrete IStructuredExtractionProviderExecutor implementations (including registering the local executors above if desired).
- Async hosts supply IStructuredExtractionAsyncProviderExecutor implementations.
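For async hosts, the upload + poll pattern might look roughly like the sketch below. The real IStructuredExtractionAsyncProviderExecutor signature is not shown here, so the delegates, the `SketchUploadPoll` name, and the "null means not ready" convention are assumptions for illustration only.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not the library's API.
public static class SketchUploadPoll
{
    // Uploads once, then polls until the provider reports a result or the poll budget runs out.
    public static async Task<string?> RunAsync(
        Func<CancellationToken, Task<string>> uploadAsync,         // returns a job id
        Func<string, CancellationToken, Task<string?>> pollAsync,  // null means "not ready yet"
        TimeSpan pollInterval,
        int maxPolls,
        CancellationToken cancellationToken = default)
    {
        var jobId = await uploadAsync(cancellationToken);
        for (var attempt = 0; attempt < maxPolls; attempt++)
        {
            var result = await pollAsync(jobId, cancellationToken);
            if (result is not null)
            {
                return result;
            }
            await Task.Delay(pollInterval, cancellationToken);
        }
        return null; // the caller can treat this as a timed-out attempt
    }
}
```

Bounding the number of polls keeps a stalled provider from blocking the remaining candidates indefinitely; a real host would likely also honor provider-specific retry hints.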
Local Input Formats¶
The built-in local OCR executors accept inputs that can be decoded/rendered into Windows.Graphics.Imaging.SoftwareBitmap:
- Raster images: anything Windows.Graphics.Imaging.BitmapDecoder can decode (commonly: PNG, JPEG/JPG, BMP, GIF, TIFF; and in many environments also ICO and Windows Photo/HD Photo/JXR).
- PDF: application/pdf is supported by rendering each page via Windows.Data.Pdf and running OCR per page (results are merged into a single StructuredDocument).
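A host that wants to decide up front whether to route an input to the local executors could use a pre-check along these lines. This is a hypothetical host-side helper, not something the library provides: the `LocalInputSketch` name is invented, and the MIME list covers only the commonly supported raster formats named above, while actual decoder availability depends on the environment.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public enum LocalInputKind { Unsupported, RasterImage, Pdf }

// Hypothetical host-side pre-check; the library itself may decide differently.
public static class LocalInputSketch
{
    // Assumption: only the commonly available raster formats are listed here.
    private static readonly HashSet<string> CommonRasterTypes = new(StringComparer.OrdinalIgnoreCase)
    {
        "image/png", "image/jpeg", "image/bmp", "image/gif", "image/tiff",
    };

    public static LocalInputKind Classify(string? contentType)
    {
        if (string.IsNullOrWhiteSpace(contentType))
        {
            return LocalInputKind.Unsupported;
        }

        var normalized = contentType.Trim();
        if (string.Equals(normalized, "application/pdf", StringComparison.OrdinalIgnoreCase))
        {
            return LocalInputKind.Pdf;
        }

        return CommonRasterTypes.Contains(normalized)
            ? LocalInputKind.RasterImage
            : LocalInputKind.Unsupported;
    }
}
```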
Current Limitations¶
- Password-protected/encrypted PDFs are not supported.
- Non-image document formats (DOCX, PPTX, HTML, etc.) are not supported by the local executors today.
- Very large PDFs may be slow; consider cloud providers for heavy document workloads.
Optional Safety Limit (PDF Page Cap)¶
Hosts may set SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT to a positive integer to cap how many pages are rendered/OCR’d per PDF input. When unset/invalid, PDFs are processed without a page cap.
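A minimal sketch of how a host might read this variable, assuming the semantics described above (a positive integer is a cap, anything else means no cap). The `PdfPageCap` helper is hypothetical; the library's own parsing may differ in details such as whitespace or sign handling.

```csharp
#nullable enable
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not the library's API.
public static class PdfPageCap
{
    public const string VariableName = "SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT";

    // Returns the cap when the raw value is a positive integer; otherwise null (no cap).
    public static int? Parse(string? raw) =>
        int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out var pages) && pages > 0
            ? pages
            : (int?)null;

    // Reads the cap from the current process environment.
    public static int? FromEnvironment() =>
        Parse(Environment.GetEnvironmentVariable(VariableName));
}
```

With a cap in hand, a host would stop rendering after that many pages, for example by iterating up to `Math.Min(pageCount, cap ?? pageCount)`.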