Decode & Analyse PDF

This is an advanced API and is only available via a WebSocket connection. When called, the API makes callbacks to request and validate data within the caller's database.

This API takes a raw PDF document and attempts to decode its contents. It goes further than simple PDF-to-text extraction and tries to validate and confirm the contents against the caller's database. It checks things such as: are the part codes valid, is this a valid purchase order, does the document match that purchase order, is this a valid supplier. This API is a primary part of the Fieldpine Document Ingest Service.

If you only require the text within a PDF, use the PDF to Text/JSON API.

The output from this API can typically be applied directly to your systems without additional cleanup.

Functionality & Logic:

Open Websocket to your Ingest host
Send your PDF document over the websocket
The Server:
- Extracts text from the PDF
- Uses OCR if the PDF is image-based
- Extracts tables and content
- Calls online services to assist if needed
- Validates the information discovered to ensure it is correct
- If an invoice references a purchase order, checks the invoice against that purchase order
You respond to queries sent via the websocket
Final result delivered
WebSocket closed

Open Websocket
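
As a minimal sketch of this step, the helper below opens a connection with the standard WebSocket API and resolves once the socket is ready to send. The function name openIngestSocket and the URL in the usage comment are hypothetical; use the address of your own Ingest host.

```javascript
// Open a WebSocket to the ingest host and resolve once it is ready to send.
// The URL is supplied by the caller. WebSocketImpl defaults to the standard
// global WebSocket and can be swapped for another implementation.
function openIngestSocket(url, WebSocketImpl = WebSocket) {
  return new Promise((resolve, reject) => {
    const socket = new WebSocketImpl(url);
    socket.binaryType = "arraybuffer"; // binary frames arrive as ArrayBuffer
    socket.onopen = () => resolve(socket);
    socket.onerror = (err) => reject(err);
  });
}

// Usage (hypothetical URL, shown as a comment only):
//   const MyWebSocket = await openIngestSocket("wss://ingest.example.invalid/");
```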

Send Initial Request

// pdfText and pdfBase64 are values you supply before sending
MyWebSocket.send(
  JSON.stringify(
    {
        a: "wc1.emailhttp.decodepdf",
        v: {
            pdf_text: pdfText,       // PDF text in JSON array - not currently used by server
            pdf_base64: pdfBase64,   // RAW binary PDF document, encoded as base64
            options: [ "keyword1", "keyword2" ]  // zero or more keywords, see Options
        }
    }
));
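
The request above can be assembled from raw PDF bytes along the following lines. This is a sketch only: the helper names bytesToBase64 and buildDecodeRequest are hypothetical, and pdf_text is left out on the assumption that the server tolerates its absence, since it is described as not currently used.

```javascript
// Encode raw bytes (Uint8Array) as base64 using only standard globals.
function bytesToBase64(bytes) {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Build the JSON text of the initial request.
// pdfBytes: Uint8Array holding the raw PDF; options: array of keyword strings.
function buildDecodeRequest(pdfBytes, options = []) {
  return JSON.stringify({
    a: "wc1.emailhttp.decodepdf",
    v: {
      pdf_base64: bytesToBase64(pdfBytes),
      options: options
    }
  });
}

// Usage (shown as a comment only):
//   MyWebSocket.send(buildDecodeRequest(pdfBytes, ["query-json1"]));
```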

Options

An array of keywords or parameters to influence how the decode process works

Keyword	Description
query-json1	Instructs the server to send data queries using the Json1 format defined below.
gdsproxy	Informs the server that it can send binary, internal-format queries over the websocket. Use this keyword only if you are passing all queries to a Fieldpine Retail installation.

Process Data Queries

The server will send either text or binary packets over the websocket for you to process. You must respond to every request, even if it is simply a "none" type response.
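
One way to make sure every packet gets an answer is to route all incoming messages through a single handler. The sketch below assumes the standard WebSocket message event, where text frames arrive as strings and binary frames as ArrayBuffer. The names attachQueryHandler, answerText and answerBinary are hypothetical, and the reply payloads (including the form of a "none" response) are left to the caller because they depend on the query protocol in use. Returning null to skip a reply, for example when a message turns out to be the final result, is likewise an assumption of this sketch.

```javascript
// Attach a handler that answers each packet the server sends.
// answerText(string) and answerBinary(ArrayBuffer) are caller-supplied
// callbacks. Each returns the payload to send back, or null when the
// message is not a query and needs no reply.
function attachQueryHandler(socket, answerText, answerBinary) {
  socket.binaryType = "arraybuffer"; // receive binary frames as ArrayBuffer
  socket.onmessage = function (event) {
    const reply = (typeof event.data === "string")
      ? answerText(event.data)
      : answerBinary(event.data);
    if (reply !== null && reply !== undefined) {
      socket.send(reply);
    }
  };
}
```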

The server can use various protocols, depending on which you selected in the "options" of the initial request.

Data sent to the server in a data query response is not logged or recorded, apart from highly transient use such as temporary files.

Some of the queries the server sends may appear overly broad. This is because the server is attempting to resolve OCR-type issues (is the character an i, I, L, l, or 1?) or user typing issues (the PDF has "po-123445"; did they possibly mistype "po-12345"?). Some invoice layouts are also unclear about what goes where: they carry several labels such as "your reference", "order number", "sales order", "reference" and "id". Which one contains the purchase order number? (Hint: it varies by sender.)
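
To illustrate why near matches are worth returning, the sketch below shows two simple comparisons a caller might apply to its own data: folding the look-alike characters listed above into one symbol, and measuring edit distance to catch a single mistyped digit. It is an illustration only, not the server's method; the function names ocrFold and editDistance are hypothetical.

```javascript
// Fold the easily confused characters (i, I, L, l, 1) into one symbol
// so that look-alike strings compare equal.
function ocrFold(s) {
  return s.toLowerCase().replace(/[il1]/g, "1");
}

// Levenshtein distance: the minimum number of single-character
// insertions, deletions or substitutions that turn a into b.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Examples:
//   ocrFold("PO-I23") === ocrFold("po-123")   (OCR read "1" as "I")
//   editDistance("po-123445", "po-12345")     (one extra digit typed)
```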

Data Queries - query-json1

TBS.