Set up and Configure custom documents on Document Information Extraction service on SAP BTP


Welcome folks!

This is the second blog in the series End-To-End : Consume SAP BTP AI Service (Document Information Extraction) from ABAP.

Before you start, you must have completed Part 1 : Setup BTP Trial Account and subscribe to Document Information Extraction service.

As of August 2024, SAP provides three standard document types that are already set up on the Document Information Extraction service: Purchase Order, Payment Advice, and Invoice. These document types are preconfigured with the typical fields for the respective document type.

Any other (custom) document type needs to be set up and configured as required, as we will see in this blog.

Example:
We will use the Marksheet PDF document below as the sample custom document throughout this blog.

 

Before we begin the actions on the system, let us first understand the three main features of the Document Information Extraction service.

Schema Configuration

“Use this Document Information Extraction UI feature to create schemas containing data fields found in standard or custom document types. As an administrator, you can use these schemas as a basis for creating templates. End users must select a schema and can also select a corresponding template when adding documents.”

A schema defines the fields to be extracted from a document and maps them to the corresponding placeholders at header and item level. Each field has a Field Name, Label, Description, Data Type, and Setup Type.

Template

“Use this Document Information Extraction UI feature to create, reuse, edit, and delete templates based on schemas and document types. End users can select templates together with a corresponding schema to extract information from business documents of the appropriate type and structure.”

Templates are based on schemas and enable you to show the position of extraction fields in a particular document layout. Once a template has been created from a schema and the precise locations of the fields have been marked on a sample document, subsequent extraction runs know where to find the values on documents with that layout.

Document

“Use this Document Information Extraction UI feature to upload documents to the service and get machine learning predictions for the extracted header fields and line items.”

Documents are the actual files uploaded against a schema and a template. They also hold the extraction results, which you can review, edit, and confirm.

Now that we have a basic understanding of what each UI feature does, we will set up and configure a custom document type on the Document Information Extraction service in the SAP BTP Trial Account that we created in the previous blog. We will use the Document Information Extraction UI on the BTP platform to achieve this.

 

Log in to the SAP BTP Cockpit and navigate to your trial account.

 

 

Go to Trial Home.

Navigate to the subaccount.

Click Instances and Subscriptions.

 

Open the Document Information Extraction application from the Subscriptions tab.

 

Click Schema Configuration in the left pane, if it is not already selected.

Here you can already see the standard schemas provided by SAP. We will create a new schema for our custom document type.

Click Create to create a new custom document schema.

 

Provide Schema Name (Custom_Document_Schema), Description, Document Type (Custom), and OCR Engine Type (Document).

 

The custom schema is created. Open it by clicking the schema name in the list of available schemas.

Now we will add the fields that we want to extract from the PDF document when it is uploaded. For this use case (Marksheet), we want to extract the following information at header and item level.

Header Fields: Student Name, Student ID

Item Fields: Subject, Marks, Grade, Remarks

Let us configure these fields one after the other.

Click the Add button beside the Header Fields section. The New Field information pane opens on the right side.

Add the details for the Student Name field at header level as shown below.

Important: The Setup Type (auto or manual) determines the extraction method of the OCR and the use of machine learning and generative AI capabilities. Details can be found here.

We will use Setup Type auto for all the fields at header and item level.

 

Similarly, add the other fields required for extraction at header and item level one by one.

Once all the fields are added, the schema should look like below.

Note that the schema status is Draft at this stage.
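As a rough mental model of the schema configured above, the fields can be written down as plain data. This is only a sketch for illustration, not the format the service stores or exports: the class name, the camelCase field names, the descriptions, and the data types ("string", "number") are assumptions. Only the attribute list (Field Name, Label, Description, Data Type, Setup Type) and the field labels come from the steps above.

```python
from dataclasses import dataclass

# The UI offers two setup types; we use "auto" for every field in this blog.
ALLOWED_SETUP_TYPES = {"auto", "manual"}


@dataclass(frozen=True)
class SchemaField:
    """One schema field, mirroring the attributes shown in the UI."""
    name: str                  # Field Name (naming style is an assumption)
    label: str                 # Label
    description: str           # Description (wording is an assumption)
    data_type: str             # Data Type (values used here are assumptions)
    setup_type: str = "auto"   # Setup Type: auto or manual

    def __post_init__(self):
        if self.setup_type not in ALLOWED_SETUP_TYPES:
            raise ValueError(f"unknown setup type: {self.setup_type!r}")


# Header and item fields of the Marksheet use case.
HEADER_FIELDS = [
    SchemaField("studentName", "Student Name", "Name of the student", "string"),
    SchemaField("studentId", "Student ID", "ID of the student", "string"),
]
ITEM_FIELDS = [
    SchemaField("subject", "Subject", "Subject name", "string"),
    SchemaField("marks", "Marks", "Marks obtained", "number"),
    SchemaField("grade", "Grade", "Grade awarded", "string"),
    SchemaField("remarks", "Remarks", "Remarks for the subject", "string"),
]

print([f.label for f in HEADER_FIELDS])
print([f.label for f in ITEM_FIELDS])
```

Writing the fields down like this before clicking through the UI makes it easier to check that nothing is missing at header or item level.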

 

Now we will create a Template based on the schema that we just created. Templates are extremely important to improve accuracy and efficiency of extraction results, especially for custom document types.

Important:
We can use templates together with a corresponding schema to extract information from business documents of the appropriate type and structure, either through the UI on BTP Cockpit or through an ABAP program, which we will cover in the next blog. Templates are based on schemas and enable you to show the position of extraction fields in a particular document layout. After creating a template, we use the Document feature to associate one or more documents with it. We then edit the extraction results for these documents, indicating the location of fields and their values. This improves the quality and success rate of extraction results.

Click Templates in the left pane.

Click the Create button.

Enter the Template Name (Template_Marksheet), Description, Document Type (Custom), and Schema (Custom_Document_Schema).

 

Activate the Template once created.

 

Navigate to the schema (Custom_Document_Schema) and activate it as well.

Now we will upload a document (the Marksheet PDF) to get extraction results and map them to the corresponding fields in case the default OCR engine is unable to recognize them.

Click Documents in the left pane.

Click the Upload Documents button.

Provide the details: Document Type (Custom), Schema (Custom_Document_Schema), and Template (Template_Marksheet).

Browse for the Marksheet PDF (as shown at the beginning of this blog).

Click Confirm.

 

You may find the uploaded document in status PENDING. Document Information Extraction typically provides extraction results for an average document in about 30 seconds. However, processing can take longer if the task is more complex, for example if the documents are large.

Once processing is completed, the status changes to READY and the Extraction Results link is enabled.
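If you later automate this wait instead of watching the UI, the pattern is a simple polling loop. The sketch below assumes a caller-supplied get_status function (hypothetical, it stands in for whatever call returns the document status) and uses the PENDING and READY names shown in the UI. The status names returned programmatically may differ, so treat them as assumptions.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it stops returning PENDING or the timeout expires."""
    waited = 0.0
    while True:
        status = get_status()
        if status != "PENDING":
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"still PENDING after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s


# Usage with a stand-in status source: PENDING twice, then READY.
# The no-op sleep keeps the demonstration instant.
fake_statuses = iter(["PENDING", "PENDING", "READY"])
result = wait_until_ready(lambda: next(fake_statuses), sleep=lambda s: None)
print(result)
```

The default timeout is deliberately well above the typical 30 seconds, because large documents can take longer.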

 

Click Extraction Results. The next action is important for the future detection of the custom fields in this custom document.

Here you can see the machine learning model's Extraction Confidence Range, classified by color: red (confidence between 0% and 50%), orange (confidence between 51% and 79%), and green (confidence between 80% and 100%).

Notice that, in our case, the standard OCR engine has automatically detected only two fields, Student ID and Student Name, and both only in the red confidence range (0% to 50% confidence in the accuracy of the detection). No item-level field is detected at all. This is expected for a custom document type.

The extracted (detected) field values are also highlighted automatically on the PDF document.
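The color ranges above boil down to two thresholds. Here is a minimal sketch of that classification, assuming the confidence is given as a percentage. The function name is made up for illustration, and the handling of fractional values between the stated ranges (for example 50.5) is an assumption, since the UI only documents whole-number ranges.

```python
def confidence_color(confidence_pct: float) -> str:
    """Map a confidence percentage to the color range shown in the UI."""
    if not 0 <= confidence_pct <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if confidence_pct <= 50:
        return "red"      # 0% to 50%
    if confidence_pct < 80:
        return "orange"   # 51% to 79%
    return "green"        # 80% to 100%


# Boundary values of the three ranges.
for pct in (0, 50, 51, 79, 80, 100):
    print(pct, confidence_color(pct))
```

A check like this is handy when deciding which extracted fields need a manual review before they are trusted.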

 

Now we will help the system with manual input by mapping the fields in the PDF to the configured schema fields at header and item level.

Click the Edit button.

First, let us add the item fields. Click the + icon beside Line Items. One line item is added with blank values for all item fields from the configured custom schema (i.e. Subject, Marks, Grade, Remarks).

 

Click the Marks value (an item field) in the first row of the uploaded document.

A popup appears for mapping the selected value to a schema field.

Here, map the selected value to the Line Items > Marks field at Row Index 1.

Similarly, map the other item fields (Subject, Grade, and Remarks) to the respective line item fields at Row Index 1, as we are choosing values from the first row of the data table.

This action identifies the location (coordinates) of each field and maps it to the corresponding schema field.

 

Now that we have manually mapped each field of one row of the item-level data, we can see the same mapping in the Line Items section in the right pane of the extraction results.

Similarly, we can confirm the extraction results of the low-confidence header fields (Student ID and Student Name) by manually linking each document field to the corresponding schema field.

Notice that these fields now show a confidence level in green (confidence between 80% and 100%), as we have manually verified and mapped them. The system will now rely on this mapping.

Click Save.

 

Click Add to Template at the top.

Select Template_Marksheet from the dropdown list.

Click Add.

 

Now, we can see the uploaded document linked to the template in the Associated Documents tab.

 

Now, let us upload the document again to check whether the extraction results have improved compared with the previous attempt.

Navigate to Documents in the left pane and upload the document as before.

Wait until the document status changes from PENDING to READY.

 

Click Extraction Results.

Notice that 2 header fields and 8 line items are extracted as part of the results.

 

Expand the Line Items to see individual line item values.
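When these results are consumed from a program rather than viewed in the UI, it helps to flatten them into one record for the header and one row per line item. The sketch below assumes a result shape with a headerFields list and a lineItems list of lists, where each entry carries a name and a value. That shape, the key names, and the field names are assumptions for illustration, and the sample values are placeholders rather than real extraction output.

```python
from typing import Any


def fields_to_dict(fields: list[dict[str, Any]]) -> dict[str, Any]:
    """Collapse a list of {"name": ..., "value": ...} entries into one dict."""
    return {field["name"]: field["value"] for field in fields}


def summarize(extraction: dict[str, Any]) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Return (header record, list of line item rows) from an extraction result."""
    header = fields_to_dict(extraction.get("headerFields", []))
    rows = [fields_to_dict(item) for item in extraction.get("lineItems", [])]
    return header, rows


# Placeholder data in the assumed shape (two line items shown for brevity).
sample = {
    "headerFields": [
        {"name": "studentName", "value": "<student name>"},
        {"name": "studentId", "value": "<student id>"},
    ],
    "lineItems": [
        [
            {"name": "subject", "value": "<subject 1>"},
            {"name": "marks", "value": "<marks 1>"},
            {"name": "grade", "value": "<grade 1>"},
            {"name": "remarks", "value": "<remarks 1>"},
        ],
        [
            {"name": "subject", "value": "<subject 2>"},
            {"name": "marks", "value": "<marks 2>"},
            {"name": "grade", "value": "<grade 2>"},
            {"name": "remarks", "value": "<remarks 2>"},
        ],
    ],
}

header, rows = summarize(sample)
print(len(header), "header fields,", len(rows), "line items")
```

For the Marksheet in this blog, the same summary would report 2 header fields and 8 line items.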

 

Congratulations! You have successfully created, configured, mapped, and tested a custom document type on the Document Information Extraction service on the SAP BTP Trial platform, using the Document Information Extraction UI on BTP Cockpit.

We will now move on to Part 3 : Consume SAP BTP Document Information Extraction service for custom documents in ABAP of the blog series End-To-End : Consume SAP BTP AI Service (Document Information Extraction) from ABAP.

See you at the next steps!

Cheers!

Tejas Jani

 

