Component: Smrt.ExtractStructuredText¶
Canonical source: SmrtApps/src/Smrt.ExtractStructuredText/README.md (mirrored below)
Smrt.ExtractStructuredText¶
Structured document extraction planning + orchestration (tables/forms/key-value + document structure).
Overview and responsibilities¶
- Plans structured extraction execution (provider selection and ordered candidates).
- Defines the vendor-agnostic contracts used by host executors.
Public surface / entry points¶
- Planning/orchestration APIs and execution contracts (see source for the public types).
Dependencies and integrations¶
- Requests CapabilityId.DocumentStructuredExtraction via Smrt.CloudProviders.
- Delegates execution to host-supplied executors (mux implementations live in Smrt.ExtractStructuredText.Host).
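The plan-then-delegate flow described above can be sketched as follows. This is a minimal illustration, not the library's API: ExtractionPlan, ExtractionOutcome, IExecutor, FakeExecutor, Mux, and the provider ids are all hypothetical names, and the real contracts should be taken from the source. It assumes a plan is an ordered list of provider ids and that the mux tries each candidate in turn until one returns a result.

```csharp
using System;
using System.Collections.Generic;

// Usage: the first candidate cannot handle PDF, so the mux falls through to the second.
var plan = new ExtractionPlan(new[] { "local-windows-ocr", "cloud-vendor-a" });
var executors = new Dictionary<string, IExecutor>
{
    ["local-windows-ocr"] = new FakeExecutor(supportsPdf: false),
    ["cloud-vendor-a"] = new FakeExecutor(supportsPdf: true),
};
var outcome = Mux.Run(plan, executors, "application/pdf", new byte[] { 0x25, 0x50, 0x44, 0x46 });
Console.WriteLine($"used provider: {outcome.UsedProvider ?? "(none)"}");

// Hypothetical plan shape: provider ids in the order they should be tried.
public sealed record ExtractionPlan(IReadOnlyList<string> OrderedCandidates);

// Hypothetical outcome: which provider ran (null if none) and its raw result.
public sealed record ExtractionOutcome(string? UsedProvider, string? ResultJson);

// Hypothetical stand-in for a host-supplied executor contract.
public interface IExecutor
{
    // Returns null when this executor cannot handle the input.
    string? TryExtract(string contentType, byte[] bytes);
}

// Test double that only distinguishes PDF from everything else.
public sealed class FakeExecutor : IExecutor
{
    private readonly bool _supportsPdf;
    public FakeExecutor(bool supportsPdf) => _supportsPdf = supportsPdf;

    public string? TryExtract(string contentType, byte[] bytes) =>
        contentType == "application/pdf" && !_supportsPdf ? null : "{\"tables\":[]}";
}

public static class Mux
{
    // Walks the planned candidates in order and returns the first successful result.
    public static ExtractionOutcome Run(
        ExtractionPlan plan,
        IReadOnlyDictionary<string, IExecutor> executors,
        string contentType,
        byte[] bytes)
    {
        foreach (var providerId in plan.OrderedCandidates)
        {
            // Skip candidates the host has not registered an executor for.
            if (!executors.TryGetValue(providerId, out var executor)) continue;

            var result = executor.TryExtract(contentType, bytes);
            if (result is not null) return new ExtractionOutcome(providerId, result);
        }
        return new ExtractionOutcome(null, null);
    }
}
```

Keeping the plan as plain data (an ordered list of ids) is what lets the core library stay vendor-agnostic in this sketch: only the host knows which executor sits behind each id.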
Configuration and operational data¶
- No persistent config/state is owned by this library.
- Hosts may enforce a local PDF page cap via SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT when implementing “render pages → OCR”.
Observability and diagnostics¶
- Log metadata only (planned candidates + used provider).
- Never log document bytes or extracted content.
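One way to make the metadata-only rule hard to violate is to give the logging helper a signature that never receives the document at all. The sketch below assumes exactly that; ExtractionLog, Describe, and the log line format are hypothetical and not part of this library.

```csharp
using System;
using System.Collections.Generic;

// Usage: only provider ids are passed in, so no document content can reach the log.
var line = ExtractionLog.Describe(
    plannedCandidates: new[] { "local-windows-ocr", "cloud-vendor-a" },
    usedProvider: "local-windows-ocr");
Console.WriteLine(line);

public static class ExtractionLog
{
    // Builds a log line from the planned candidates and the provider that ran.
    // The method deliberately has no parameter for document bytes or extracted text.
    public static string Describe(IReadOnlyList<string> plannedCandidates, string? usedProvider)
    {
        var planned = string.Join(",", plannedCandidates);
        return $"structured-extraction planned=[{planned}] used={usedProvider ?? "none"}";
    }
}
```

With the inputs shown, the helper produces `structured-extraction planned=[local-windows-ocr,cloud-vendor-a] used=local-windows-ocr`.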
Testing and validation¶
- Build (Debug, win-x64):
  dotnet build SmrtApps/src/Smrt.ExtractStructuredText/Smrt.ExtractStructuredText.csproj -c Debug -r win-x64
- (End-to-end wiring)
  dotnet build SmrtApps/src/Smrt.ExtractStructuredText.Host/Smrt.ExtractStructuredText.Host.csproj -c Debug -r win-x64
- Unit tests:
  dotnet test SmrtApps/src/Smrt.ExtractStructuredText.Tests/Smrt.ExtractStructuredText.Tests.csproj -c Debug -r win-x64 --no-build
Support Bundle¶
- Not applicable directly (library); collect host application logs via Support Bundle.
Related docs¶
- Host executors: SmrtApps/src/Smrt.ExtractStructuredText.Host/README.md
- Unstructured OCR: SmrtApps/src/Smrt.ExtractText/README.md
Design¶
- Vendor-agnostic core library.
- Requests CapabilityId.DocumentStructuredExtraction from Smrt.CloudProviders and produces an execution contract.
- Delegates execution to a host-supplied executor interface (IDocumentStructuredExtractionExecutor).
- Host mux is provided by Smrt.ExtractStructuredText.Host (StructuredExtractionExecutorMux).
- Async provider support is available via IDocumentStructuredExtractionAsyncExecutor and Smrt.ExtractStructuredText.Host.StructuredExtractionAsyncExecutorMux.
- Local structured extraction (SmrtHub) uses Windows OCR directly (AI OCR when available, else legacy OCR) and normalizes layout geometry (lines/words/bounding boxes) into the structured schema.
- Logs metadata only (planned candidates + used provider). Never log document bytes or extracted content.
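The geometry normalization step is not specified in detail here. One plausible reading, shown purely as an assumption, is converting pixel bounding boxes from OCR into page-relative fractions so that results are independent of render resolution. PixelRect, RelativeRect, and Geometry are hypothetical types; the actual structured schema may use different units or field names.

```csharp
using System;

// Usage: a 200x25 px word box on a 1000x500 px page.
var box = Geometry.ToRelative(new PixelRect(100, 50, 200, 25), pageWidth: 1000, pageHeight: 500);
Console.WriteLine(box);

// Hypothetical OCR output rectangle in pixels (top-left origin).
public readonly record struct PixelRect(double X, double Y, double Width, double Height);

// Hypothetical schema rectangle expressed as fractions of the page size.
public readonly record struct RelativeRect(double Left, double Top, double Width, double Height);

public static class Geometry
{
    // Divides each coordinate by the matching page dimension.
    public static RelativeRect ToRelative(PixelRect rect, double pageWidth, double pageHeight)
    {
        if (pageWidth <= 0 || pageHeight <= 0)
            throw new ArgumentOutOfRangeException(nameof(pageWidth), "Page dimensions must be positive.");

        return new RelativeRect(
            rect.X / pageWidth,
            rect.Y / pageHeight,
            rect.Width / pageWidth,
            rect.Height / pageHeight);
    }
}
```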
Inputs¶
- Document payload bytes (content type + bytes), or
- Host-resolved file path references (file selection is owned by the host).
Common Content Types¶
- Local Windows OCR hosts typically operate on raster images (e.g., image/png, image/jpeg, image/tiff).
- Some hosts may also support PDF (application/pdf) by rendering pages to images before OCR.
- Cloud providers may support additional formats depending on vendor/service (PDF, Office docs, HTML, etc.).
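A host could turn the content-type guidance above into an explicit routing decision before it picks a local executor. The following is a sketch under assumptions: LocalSupport and ContentTypeRouting are hypothetical names, and the set of raster types is limited to the three examples listed above.

```csharp
using System;

// Usage: classify a few representative content types.
foreach (var type in new[] { "image/png", "application/pdf", "text/html" })
    Console.WriteLine($"{type} -> {ContentTypeRouting.Classify(type)}");

public enum LocalSupport
{
    DirectOcr,      // raster image, can go straight to OCR
    RenderThenOcr,  // PDF, pages must be rendered to images first
    NotSupported    // needs a converter/rasterizer or a cloud provider
}

public static class ContentTypeRouting
{
    public static LocalSupport Classify(string contentType)
    {
        // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
        var mediaType = contentType.Split(';')[0].Trim().ToLowerInvariant();

        return mediaType switch
        {
            "image/png" or "image/jpeg" or "image/tiff" => LocalSupport.DirectOcr,
            "application/pdf" => LocalSupport.RenderThenOcr,
            _ => LocalSupport.NotSupported,
        };
    }
}
```

Returning an enum rather than a boolean keeps the "render to images → OCR" path distinct from direct OCR, which matters for hosts that only implement one of the two.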
Local vs Cloud Expectations¶
- Local (Windows OCR) is optimized for screenshots and scan-like inputs. It is generally image-first, and any non-image support (like PDF) is typically implemented as “render to images → OCR”.
- Cloud providers frequently support richer document formats, but constraints are service-specific (file size limits, page limits, supported MIME types, encryption rules, etc.).
Input Limitations (Local)¶
Hosts using Windows OCR should assume at least these limitations unless a provider explicitly documents otherwise:
- Password-protected/encrypted PDFs are not supported.
- Office document formats (DOCX/PPTX/XLSX), HTML, and other non-image formats are not supported without a dedicated converter/rasterizer.
- Very large documents may be slow when rendered + OCR’d locally.
Optional Safety Limit (PDF Page Cap)¶
Local Windows OCR hosts that implement PDF as “render pages → OCR” may optionally enforce a per-document page cap via SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT.
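A host that opts into the cap might read and apply it as sketched below. Only the variable name comes from this document. PdfPageCap is a hypothetical helper, and two behaviors are assumptions of the sketch: unset, non-numeric, or non-positive values mean "no cap", and an over-limit document is truncated rather than rejected.

```csharp
using System;
using System.Globalization;

// Usage: apply whatever cap is configured to a 120-page document.
int? cap = PdfPageCap.Read();
int pagesToProcess = PdfPageCap.Apply(totalPages: 120, cap);
Console.WriteLine($"cap={(cap?.ToString() ?? "none")} pages={pagesToProcess}");

public static class PdfPageCap
{
    public const string VariableName = "SMRTHUB_LOCAL_OCR_MAX_PDF_PAGES_PER_DOCUMENT";

    // Returns null when the variable is unset, empty, non-numeric, or not positive.
    // Treating those cases as "no cap" is an assumption of this sketch.
    public static int? Read()
    {
        var raw = Environment.GetEnvironmentVariable(VariableName);
        return int.TryParse(raw, NumberStyles.Integer, CultureInfo.InvariantCulture, out var value) && value > 0
            ? value
            : (int?)null;
    }

    // Truncating is one policy; a host could instead reject over-limit documents.
    public static int Apply(int totalPages, int? cap) =>
        cap is int limit ? Math.Min(totalPages, limit) : totalPages;
}
```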
Status¶
- Library skeleton + contracts implemented.
- Local Windows OCR executors are provided by Smrt.ExtractStructuredText.Host; HubWindow wiring is planned, not implemented here.