Invoice Validation with SAP Document Information Extraction API & Python


Invoices are essential, but validating them manually can be tedious and error-prone. In this blog post, we’ll walk through a Python-based workflow that leverages SAP’s Document Information Extraction (DOX) API to extract invoice data and intelligently compare it against contract records. The goal? Validate invoices and highlight discrepancies—all in one go.

Supported document types for the SAP Document Information Extraction service include supplier invoices, purchase orders, payment advices, and business cards, with more planned on the roadmap.

This blog post builds on an excellent introductory guide by Joni, which covers getting started with the SAP DOX API and testing it via Swagger. Assuming that background, this post adds the Python logic that does the heavy lifting of:

- OCR-based document post and extraction (via SAP DOX API)
- Confidence scoring and validation of key fields
- Automatic detection and reporting of mismatches

TL;DR: Automate invoice data extraction and validation using SAP DOX API and Python. OCR your PDFs, validate against expected fields with confidence scoring, and flag mismatches—all in one streamlined Python pipeline.

This guide is for: SAP BTP developers, automation engineers, finance teams exploring DOX integration, or anyone dealing with high-volume invoice processing.

Step 1: Authentication with SAP BTP

First, we authenticate against SAP BTP using OAuth 2.0 to retrieve an access token.
Replace <TENANT> with your SAP BTP tenant name.

The Bearer Token is required for all subsequent API calls to the DOX endpoint.

import requests

client_id = ""
client_secret = ""
oauth_url = "https://<TENANT>.authentication.ap10.hana.ondemand.com/oauth/token?grant_type=client_credentials"

# Request an access token using the client credentials grant
def get_access_token():
    token_response = requests.post(
        oauth_url,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        headers={
            "Content-Type": "application/x-www-form-urlencoded"
        }
    )

    if token_response.status_code == 200:
        return token_response.json().get("access_token")
    else:
        raise Exception(f"Failed to get token: {token_response.status_code} - {token_response.text}")
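Because every DOX call needs a bearer token, it can help to reuse one token until it is close to expiring instead of requesting a new one for each call. The sketch below assumes the token response carries the standard OAuth 2.0 `expires_in` field (in seconds). `TokenCache` and its `fetch` callable are hypothetical names used for illustration, not part of any SAP SDK.

```python
import time


class TokenCache:
    """Cache an OAuth token and refresh it shortly before it expires.

    `fetch` is any callable returning the parsed token response, i.e. a
    dict with "access_token" and (assumed, per OAuth 2.0) "expires_in".
    """

    def __init__(self, fetch, safety_margin=60, clock=time.monotonic):
        self._fetch = fetch
        self._margin = safety_margin
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            payload = self._fetch()
            self._token = payload["access_token"]
            # Fall back to a short lifetime if expires_in is absent.
            lifetime = float(payload.get("expires_in", 300))
            self._expires_at = now + max(lifetime - self._margin, 0)
        return self._token
```

In the script above, `fetch` could wrap the same `requests.post` call and return `token_response.json()`.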

Step 2: Submitting an Invoice PDF for Document Information Extraction

Next, we upload a sample invoice PDF (in this case, Thai_Invoice_with_TH_date.pdf) to the DOX API. The POST /document/jobs endpoint submits a document (e.g. a PDF, PNG, or JPEG file) for asynchronous processing. If the submission succeeds, the API returns a job ID, which we use to track the job and fetch the results. This ID can also be used with other endpoints, including GET /document/jobs/{id} and DELETE /document/jobs/{id}.

Sample Invoice:

import requests

url = "https://<TENANT>.ap10.doc.cloud.sap/document-information-extraction/v1/document/jobs"
access_token = get_access_token()
file_path = "Thai_Invoice_with_TH_date.pdf"

headers = {
    "Authorization": f"Bearer {access_token}",
    "accept": "application/json"
}

# Extraction options, sent as a JSON string in the multipart form
options = '''{
    "schemaName": "SAP_invoice_schema",
    "clientId": "c_00",
    "documentType": "invoice",
    "receivedDate": "2020-02-17",
    "enrichment": {
        "sender": {
            "top": 5,
            "type": "businessEntity",
            "subtype": "supplier"
        }
    }
}'''

# Simulate the -F flags in curl; the with-block closes the file after the upload
with open(file_path, 'rb') as pdf_file:
    files = {
        'file': (file_path, pdf_file, 'application/pdf'),
        'options': (None, options, 'application/json')
    }
    response = requests.post(url, headers=headers, files=files)

if response.status_code in [200, 201]:
    print("Document submitted successfully!")
    print("ID:", response.json().get("id"))
else:
    print("Submission failed")
    print("Status Code:", response.status_code)
    print("Response:", response.text)
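Hand-writing the `options` JSON inside a string literal is easy to get wrong. As a minimal sketch, the same payload can be built from a Python dict and serialized with `json.dumps`. The helper names `build_options` and `build_files` are assumptions for illustration; the field values mirror the ones used above.

```python
import json


def build_options(schema_name="SAP_invoice_schema", client_id="c_00",
                  document_type="invoice", received_date=None):
    """Return the JSON string for the multipart 'options' part."""
    options = {
        "schemaName": schema_name,
        "clientId": client_id,
        "documentType": document_type,
        "enrichment": {
            "sender": {"top": 5, "type": "businessEntity", "subtype": "supplier"}
        },
    }
    # Only include receivedDate when the caller supplies one.
    if received_date is not None:
        options["receivedDate"] = received_date
    return json.dumps(options)


def build_files(file_path, file_handle, options_json):
    """Assemble the multipart fields in the tuple shape `requests` accepts."""
    return {
        "file": (file_path, file_handle, "application/pdf"),
        "options": (None, options_json, "application/json"),
    }
```

The result of `build_files(...)` can be passed as the `files=` argument of `requests.post`, with `file_handle` being the opened PDF.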

Step 3: Retrieving Extracted Data

Using the job ID, we retrieve the results. Replace the ID placeholder in the URL with the job ID printed in the previous step. You can also view the extracted fields, already formatted, in the SAP Document Information Extraction UI.

The extracted fields (like receiver address, receiver name, document date, and net amount) are returned with confidence scores, allowing us to evaluate the reliability of each extracted field.

The DOX API returns the actual confidence score for each field, whereas the UI only shows an extraction confidence range.

job_id = "<ID>"  # replace with the ID printed in Step 2
url = f"https://<TENANT>.ap10.doc.cloud.sap/document-information-extraction/v1/document/jobs/{job_id}"
response = requests.get(url, headers=headers)

if response.status_code == 200:
    data = response.json()
    fields = data.get("extraction", {}).get("headerFields", [])

    if not fields:
        print("No header fields extracted.")
    else:
        print("Extracted Fields:")
        for field in fields:
            print(f"- {field['name']}: {field['value']} (confidence: {round(field['confidence'], 2)})")
else:
    print("Failed to retrieve job data")
    print("Status:", response.status_code)
    print("Response:", response.text)
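Since extraction runs asynchronously, a GET issued right after submission may return a job that has not finished yet. The sketch below polls until the job reaches a terminal state. It assumes the job JSON carries a `status` field with values such as `PENDING`, `DONE`, and `FAILED`; verify these against the API reference for your service version. `wait_for_job` and `fetch_job` are hypothetical names.

```python
import time


def wait_for_job(fetch_job, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll `fetch_job()` until the job finishes or attempts run out.

    `fetch_job` is a callable returning the parsed job JSON (a dict).
    The status values checked below are assumptions.
    """
    for attempt in range(max_attempts):
        job = fetch_job()
        status = job.get("status")
        if status == "DONE":
            return job
        if status == "FAILED":
            raise RuntimeError(f"Extraction failed: {job}")
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"Job not finished after {max_attempts} attempts")
```

With the variables from this step, `fetch_job` could be `lambda: requests.get(url, headers=headers).json()`.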

Step 4: Document Confidence Scoring and Mandatory Field Check

We define a list of mandatory fields and set a minimum confidence threshold (e.g., 70%).

The Document Confidence Score, printed at the end of the output, is the percentage of mandatory fields that were found in the extraction.

min_confidence = 0.7  # threshold to flag low-match fields

mandatory_fields = {
    "receiverName": "Buyer Name",
    "receiverAddress": "Buyer Address",
    "documentDate": "Date Issued",
    "documentNumber": "Invoice Number",
    "currencyCode": "Currency",
    "taxAmount": "VAT Value",
    "senderAddress": "Company Address",
    "senderName": "Company Name",
    "taxId": "Tax ID"
}

if response.status_code == 200:
    data = response.json()
    fields = data.get("extraction", {}).get("headerFields", [])

    # Convert extracted fields into a dictionary: name => field
    extracted = {f["name"]: f for f in fields}

    total_score = 0
    max_score = len(mandatory_fields)
    low_match_flags = []

    print("Mandatory Field Check:\n")

    for key, label in mandatory_fields.items():
        field = extracted.get(key)
        if field:
            total_score += 1  # Presence of field
            print(f"{label} ({key}): {field['value']}")
            # Flag fields that were found but extracted with low confidence
            confidence = field.get("confidence", 0)
            if confidence < min_confidence:
                low_match_flags.append(f"{label} ({key}): confidence {round(confidence, 2)}")
        else:
            print(f"{label} ({key}): MISSING")
            low_match_flags.append(f"{label} ({key}): missing")

    print("\nResult Summary:")
    print(f"Fields Found: {total_score}/{max_score}")

    # List every mandatory field that is missing or below the threshold
    if low_match_flags:
        print("\nLow Match Fields:")
        for flag in low_match_flags:
            print(f" - {flag}")

    document_confidence_score = round((total_score / max_score) * 100, 2)
    print(f"\nDocument Confidence Score: {document_confidence_score}%")

else:
    print("Failed to fetch job results.")
    print(f"Status Code: {response.status_code}")
    print(f"Response: {response.text}")
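The score above counts a field as soon as it is present. One possible refinement is to weight each mandatory field by its extraction confidence, so that a field extracted with low confidence contributes less to the total. This is a sketch of one scoring choice, not a DOX feature; `score_document` is a hypothetical helper and the sample data is invented for illustration.

```python
def score_document(header_fields, mandatory_keys, min_confidence=0.7):
    """Return (score_percent, missing, low_confidence) for a document.

    `header_fields` is the list under extraction.headerFields, where each
    item has "name", "value", and "confidence".
    """
    extracted = {f["name"]: f for f in header_fields}
    missing, low_confidence, total = [], [], 0.0
    for key in mandatory_keys:
        field = extracted.get(key)
        if field is None:
            missing.append(key)
            continue
        confidence = float(field.get("confidence", 0.0))
        if confidence < min_confidence:
            low_confidence.append(key)
        # Each present field contributes its confidence (0..1) to the total.
        total += confidence
    if not mandatory_keys:
        return 0.0, missing, low_confidence
    score = round(100 * total / len(mandatory_keys), 2)
    return score, missing, low_confidence


# Invented sample data for illustration only.
sample = [
    {"name": "documentNumber", "value": "INV-001", "confidence": 0.9},
    {"name": "currencyCode", "value": "THB", "confidence": 0.5},
]
score, missing, low = score_document(
    sample, ["documentNumber", "currencyCode", "taxId"]
)
print(score, missing, low)
```

Missing fields contribute zero, so the result is never higher than the presence-only score computed above.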

Future Expansion

For integration and tracking, you will want to connect your enterprise system to the Document Information Extraction process. While keeping all validation logic in Python, you can extend this example with fuzzy matching, error handling, and real contract data from systems such as SAP S/4HANA or SAP Build Process Automation:

- Pull real data dynamically
- Push results (e.g. approved/rejected invoices)
- Trigger workflows (emails, dashboards, auto-approval/review flagging, etc.)
- Store validated invoice results in your backend
- Log confidence scores, mismatches, and timestamps
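The fuzzy matching mentioned above could be sketched with `difflib.SequenceMatcher` from the Python standard library. The contract record, the field names, and the 0.85 threshold are assumptions; in practice the expected values would be pulled from a real system, and `compare_to_contract` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so formatting noise is ignored."""
    return " ".join(str(text).lower().split())


def compare_to_contract(extracted, contract, threshold=0.85):
    """Compare extracted values with expected contract values.

    Both arguments map field name -> value. Returns a list of
    (field, extracted_value, expected_value, similarity) for mismatches.
    """
    mismatches = []
    for field, expected in contract.items():
        actual = extracted.get(field)
        if actual is None:
            # A field absent from the extraction is always a mismatch.
            mismatches.append((field, None, expected, 0.0))
            continue
        ratio = SequenceMatcher(
            None, normalize(actual), normalize(expected)
        ).ratio()
        if ratio < threshold:
            mismatches.append((field, actual, expected, round(ratio, 2)))
    return mismatches
```

Here `extracted` could be built from the header fields as `{f["name"]: f["value"] for f in fields}`. Exact-match fields such as currency codes or amounts may deserve a threshold of 1.0 rather than a fuzzy comparison.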

All in all, this blog post provides a preliminary look into:

- Invoice document posting and extraction with the SAP DOX API
- Invoice validation
- Confidence scoring

If you process high volumes of invoices, SAP Document Information Extraction can help you save countless hours and reduce errors.

Leave your thoughts or questions in the comment section below!

 
