Troubleshooting Document Grounding Pipelines

Estimated read time: 25 min

This article takes an in-depth look at the Document Grounding Service, including a deep dive into its Microsoft SharePoint integration and a collection of issues we commonly encounter when setting up pipelines for Document Grounding in Joule.

The scenario: We have implemented Joule and want to enable it to consume company-specific knowledge documents. To do so, we follow the official documentation to set up Document Grounding and, along the way, run into difficulties or unclear steps. This blog post is meant as a starting point for understanding the different troubleshooting angles, so that you can successfully register a SharePoint site with Document Grounding.

 

 

Step 1: Validate access to the Document Grounding API:

Once the service instance is successfully set up following the official documentation, the next step is to use the Document Grounding API to create a pipeline—typically targeting a source like SharePoint. This setup process remains largely manual and can be cumbersome.

The service keys provided by BTP include an X.509 certificate and key, embedded within a JSON structure. These must be manually extracted and saved as separate .crt and .key files.
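The extraction can be scripted. Below is a minimal sketch, assuming the service key has been loaded from JSON into a dictionary and that the certificate and private key sit in top-level fields named `certificate` and `key`. Those field names, the output file names and the helper names are assumptions for illustration; in your service key the fields may be named differently or nested, so check before using it.

```python
from pathlib import Path


def normalize_pem(value: str) -> str:
    """Replace literal backslash-n sequences with real line breaks."""
    return value.replace("\\n", "\n").strip() + "\n"


def write_credentials(
    service_key: dict,
    out_dir: Path,
    cert_field: str = "certificate",  # assumed field name
    key_field: str = "key",  # assumed field name
) -> tuple[Path, Path]:
    """Write the PEM certificate and private key to separate files."""
    cert_path = out_dir / "client.crt"  # hypothetical file name
    key_path = out_dir / "client.key"  # hypothetical file name
    cert_path.write_text(normalize_pem(service_key[cert_field]))
    key_path.write_text(normalize_pem(service_key[key_field]))
    return cert_path, key_path
```

You would call it with something like `write_credentials(json.loads(Path("service-key.json").read_text()), Path("."))`. The normalization step matters because PEM content pasted straight out of raw JSON often still contains literal `\n` sequences.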

Here’s how to troubleshoot this step-by-step:

Check 1: Can you obtain a token for the Document Grounding API:

Before interacting with the Document Grounding API, verify that you have the correct url, client_id, and the required certificates. These credentials are essential for authenticating against the API.

Start by performing a POST request to the authorization endpoint to obtain a bearer token. This step confirms that your setup is valid and that the API is reachable.
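If you prefer a script over an API client, the token request can be sketched roughly as follows. This is an illustration under assumptions, not the official client: the `client_credentials` grant type is assumed, the parameters are passed as URL parameters as described for Symptom 1 below, and the function names are made up for this example.

```python
import json
import ssl
import urllib.parse
import urllib.request


def build_token_request(token_url: str, client_id: str) -> urllib.request.Request:
    """Build the POST request for the token endpoint (nothing is sent yet)."""
    if not token_url.endswith("/oauth2/token"):
        raise ValueError("token URL should end with /oauth2/token")
    # grant_type=client_credentials is an assumption about the flow in use.
    query = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "client_id": client_id}
    )
    return urllib.request.Request(f"{token_url}?{query}", data=b"", method="POST")


def fetch_token(request: urllib.request.Request, cert_file: str, key_file: str) -> dict:
    """Send the request with the client certificate and return the JSON body."""
    context = ssl.create_default_context()
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    with urllib.request.urlopen(request, context=context, timeout=30) as response:
        return json.load(response)
```

Splitting the build step from the send step lets you inspect the final URL before anything goes over the wire, which is usually where the mistakes are.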

A successful response contains the bearer token. You then use this token in the next request, which goes to the URL of the Document Grounding Service.

Typical symptoms are 4xx errors on the very first “Get Token” request. If you cannot retrieve the bearer token, the most likely cause is the way the request was formulated. Make sure your API client (Bruno is my recommendation) passes the required details correctly.

Points to watch out for:

URL
Client ID
Client certificate, including the proper wildcard, which needs to match the URL

If these three points are good, you will be able to obtain the bearer token. Let’s go through some possible issues:

Symptom 1: “401 invalid_client – Client not found”:

Typical issue: The client_id is missing from the URL parameters.

Symptom 2: “401 invalid_client – Bad credentials”:

Typical issue: The client certificate is either missing or is not the right certificate for this client ID.

Symptom 3: “404 Not found”:

Typical issue: The URL is incorrect. Ensure that it ends with /oauth2/token.

Symptom 4: “400 Error”:

Typical issue: The URL is incorrect. Ensure that it ends with /oauth2/token. In this example it was /oauth2/authorize, copied directly from the service key without correcting it.

Symptom 5: “OPENSSL BAD_BASE64_DECODE”:

Typical issue: The certificate or the key file is incorrectly formatted.
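For Symptom 5 it helps to check the files before sending any request. The following is a rough sketch of such a check, not an exhaustive PEM validator; the function name and the problem messages are my own. It only looks for the two formatting mistakes I see most often: literal `\n` sequences left over from the JSON, and a body that is not valid base64.

```python
import base64
import re

# One PEM block: BEGIN line, base64 body, matching END line.
PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----\n(.*?)\n-----END \1-----", re.DOTALL
)


def check_pem(text: str) -> list:
    """Return a list of formatting problems found in PEM text (empty if none)."""
    problems = []
    if "\\n" in text:
        problems.append("contains literal \\n sequences instead of line breaks")
    blocks = PEM_BLOCK.findall(text.replace("\r\n", "\n"))
    if not blocks:
        problems.append("no complete BEGIN/END block found")
    for label, body in blocks:
        try:
            base64.b64decode("".join(body.split()), validate=True)
        except ValueError:  # binascii.Error is a subclass of ValueError
            problems.append(f"{label} body is not valid base64")
    return problems
```

Run it on the contents of both the .crt and the .key file; an empty list means neither of these two mistakes is present.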

 

Check 2: Are you able to access the Document Grounding Pipeline API:

With the bearer token in hand, it is time to access the Document Grounding pipeline API. Typically the next request would already be the pipeline creation, but a failure there can belong to two error categories that are hard to tell apart: the API access itself or the pipeline configuration. I therefore recommend trying the simplest GET request the Grounding API offers first.

Execute a GET on /pipelines to check that you get a valid response from the API. You should receive a 200 OK response code and a body that most likely contains an empty list of pipelines.
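As a sketch, the same check in Python could look like this. The base URL is whatever your service key gives you for the Document Grounding Service, and the function names are hypothetical. The request carries the bearer token in the Authorization header and is sent with the client certificate, and an HTTP error is returned rather than raised, so you can read the symptom.

```python
import ssl
import urllib.error
import urllib.request


def build_pipelines_request(base_url: str, bearer_token: str) -> urllib.request.Request:
    """Build a GET request for /pipelines with the bearer token attached."""
    url = base_url.rstrip("/") + "/pipelines"
    headers = {"Authorization": f"Bearer {bearer_token}"}
    return urllib.request.Request(url, method="GET", headers=headers)


def list_pipelines(
    base_url: str, bearer_token: str, cert_file: str, key_file: str
) -> tuple[int, str]:
    """Return (status code, response body), also for 4xx/5xx responses."""
    context = ssl.create_default_context()
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    request = build_pipelines_request(base_url, bearer_token)
    try:
        with urllib.request.urlopen(request, context=context, timeout=30) as response:
            return response.status, response.read().decode()
    except urllib.error.HTTPError as error:
        return error.code, error.read().decode()
```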

 

This way you can be sure that you are talking to the API correctly in general. If you get errors in this step, double-check the following points:

URL
Client ID
Bearer token in the “Authorization” header: it needs to be present and valid (it expires after x hours)
Client certificate in place, with the proper wildcard for this URL as well

If you do not get a 200 response, the request to the API is most likely malformed again:

Symptom 1: “401 Unauthorized. Ensure you have proper token and subscription”

Typical Issue: The request is missing a proper bearer token, or the token has expired by now.

Symptom 2: “404 Not found”

Typical Issue: The URL is wrong.
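To rule out an expired token for Symptom 1, you can look at the token's expiry time. The sketch below assumes the bearer token is a JWT with a standard `exp` claim, which is an assumption about the token format; it only decodes the payload and does not verify the signature.

```python
import base64
import json
import time
from typing import Optional


def token_seconds_left(token: str, now: Optional[float] = None) -> float:
    """Return the seconds until the JWT's exp claim; negative means expired."""
    payload_b64 = token.split(".")[1]
    # base64url payloads come without padding, so add it back before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    current = time.time() if now is None else now
    return payload["exp"] - current
```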

 

Now that you can be sure that you are able to talk to the Document Grounding API, we can move on to the next challenge:

 

Step 2: Validate SharePoint Credentials:

The most frequent issue we encounter when customers set up Document Grounding for SharePoint is misconfigured API credentials for the Microsoft Graph SharePoint API. While the process may seem intuitive, it often leads to errors.

In many organizations, Microsoft administration is handled by dedicated departments. As a result, teams rarely receive the correct credentials with the necessary privileges on the first attempt. Adjusting these credentials typically involves lengthy and bureaucratic procedures.

Let’s quickly review what can go wrong:

Symptom 1: “401 External system returned an unexpected response”

Typical Issue: The MS Graph API credentials you maintained in the Destination do not have access to the Microsoft-side API needed to read the SharePoint site's documents.

Deep Dive: Investigating Access Issues on the Microsoft Entra Side:

This specific symptom deserves a deep dive, because the Document Grounding service only returns this generic 401 error message, which is usually of little help in figuring out what exactly is wrong with the credentials. There are several possible reasons:

Invalid Client ID / Client Secret values, which could be a simple typo
Wrong URLs: the SharePoint site URL as well as the token service URL could be wrong
Missing permissions in general
Missing permission on that specific SharePoint site

So let’s dive into it step by step.

The first thing I highly recommend is to use an app registration in combination with the Sites.Selected permission. The official Document Grounding API also offers an “Oauth2Password” flow, but after talking to many customers and Microsoft admins, the clear preference from a data governance and technical perspective is the app registration. There are several reasons for this. One is the ability to restrict access to only the SharePoint sites that are needed. Another is organizational policy: some organizations require multi-factor authentication for the user flow, which is not possible for a technical service. You can work around this with IP allowlisting, but that is cumbersome and error-prone.

That being said, we will concentrate on the recommended way: “Oauth2ClientCredentials”.

How to check the SharePoint credentials and get detailed error messages:

SAP provides a specific note on “How to test the SharePoint API accessibility” (https://me.sap.com/notes/3623622/E). This note contains a number of cURL requests that can be used to validate what exactly comes back from the MS Graph API. My clear recommendation: use the requests from the note (import them into Bruno) and get clarity on the exact error message. That way you can narrow down much better what needs to be configured on the Entra side.

The connection check to SharePoint consists of two steps: obtaining a bearer token and then accessing the site's contents.

Check 1: Obtain the Microsoft Login Token:
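For orientation, here is a minimal sketch of this token request using the standard client credentials flow of the Microsoft identity platform. The requests in the SAP note may differ in detail; the function name is made up, and the tenant ID, client ID and secret are the values from your own app registration.

```python
import urllib.parse
import urllib.request


def build_entra_token_request(
    tenant_id: str, client_id: str, client_secret: str
) -> urllib.request.Request:
    """Build the client credentials token request for Microsoft Graph."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            # The scope is a form parameter, not a header.
            "scope": "https://graph.microsoft.com/.default",
        }
    ).encode()
    return urllib.request.Request(url, data=form, method="POST")
```

Passing the result to `urllib.request.urlopen` sends the request. If it fails, the body of the raised `HTTPError` carries the detailed Microsoft error message.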

 

Possible symptoms when executing this request:

Symptom 1: Invalid Request scope is missing

Typical Issue: The scope parameter is missing from the request.

Symptom 2: Tenant Identifier is invalid

Typical Issue: You are not using the right tenant ID for your Microsoft Entra tenant.

Symptom 3: App Registration / Client ID not found

Typical Issue: You are using the wrong identifier for your app registration. Make sure to take the Application (client) ID from your app registration.

Symptom 4: Invalid Secret provided

Typical Issue: The secret value you provided is not valid for this app registration.

 

Check 2: Access the MS Graph SharePoint API with the token:

Now that we can be sure that the Microsoft app registration credentials are valid, we can check whether they also have the permissions for that particular SharePoint site:

Use the “Connect to Sharepoint” request from the note. The expected response returns 200 OK and gives you some details about your site.
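If you want to build the request URL yourself, Microsoft Graph lets you address a site by hostname and server-relative path. The helper below is a small sketch with a made-up name that derives this URL from the site URL you open in the browser; whether the request in the note uses exactly this form is an assumption.

```python
from urllib.parse import urlparse


def graph_site_url(site_url: str) -> str:
    """Turn a SharePoint site URL into the Graph 'get site by path' URL."""
    parsed = urlparse(site_url)
    path = parsed.path.rstrip("/")
    return f"https://graph.microsoft.com/v1.0/sites/{parsed.hostname}:{path}"
```

For https://contoso.sharepoint.com/sites/ProjectX this yields https://graph.microsoft.com/v1.0/sites/contoso.sharepoint.com:/sites/ProjectX, which you then call with the bearer token from Check 1 in the Authorization header.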

Now let's have a look at possible symptoms of permission errors:

Symptom 1: “401 InvalidAuthenticationToken”:

Typical Issue: The token is invalid: either it was copied incorrectly from the first request or it has expired in the meantime.

 

Symptom 2: “404 Not found”:

Typical Issue: This refers to the part of the URL where you enter the hostname and the site name. Make sure you use the actual site name and not some sort of description. It is best to take it directly from the site URL you open in the browser.

 

Symptom 3: “401 General Exception”:

Typical Issue: This indicates a general issue with the permission for the SharePoint site API. Your app registration is most likely missing the Sites.Selected permission, or the permission has not been fully granted.

In Microsoft Entra, on the app registration's API permissions page, make sure that the Microsoft Graph “Sites.Selected” permission has been added and that its status is “Granted for xxx”.

Symptom 4: “403 Forbidden Access Denied”:

Typical Issue: A 403 Access Denied, on the other hand, indicates that permissions are again the issue, but in this case specifically that the link between your app registration and the SharePoint site is missing. You might have the Sites.Selected application permission granted, but as the name suggests, this permission only works in combination with an additional step that sets up the link between the app registration and the SharePoint site.
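The four symptoms above can be condensed into a small triage helper. This is only a sketch of the mapping described in this section; the function name and the wording of the messages are my own, and real responses may contain further cases.

```python
def diagnose_site_check(status: int, body: str = "") -> str:
    """Map the response of the site request to the likely cause."""
    if status == 200:
        return "OK: the credentials can read the site."
    if status == 401 and "InvalidAuthenticationToken" in body:
        return "Token is invalid or expired: request a new one."
    if status == 401:
        return "Sites.Selected is probably missing or not granted for the app registration."
    if status == 403:
        return "The app registration is not linked to this SharePoint site."
    if status == 404:
        return "Hostname or site name in the URL is wrong."
    return "Not covered here: inspect the full response body."
```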

How to validate SharePoint Permissions in MS Entra:

This one is a bit harder to validate because there is no UI for it in Entra. Instead, the admin can use a PowerShell command to grant the permission:

Grant-PnPAzureADAppSitePermission -AppId <appId> -DisplayName <name> -Permissions Write

This responds with a confirmation of the granted permission if done correctly.

To validate, use the following command:

Get-PnPAzureADAppSitePermission -Site https://contoso.sharepoint.com/sites/ProjectX

Again, the output should indicate whether your app registration has permission to access that specific SharePoint site.

Alternatively, the admin can send a POST request via Microsoft Graph Explorer:

POST https://graph.microsoft.com/v1.0/sites/{siteId}/permissions

The response should include your app registration:

 

{
  "roles": [ "write" ],
  "grantedToIdentities": [
    {
      "application": {
        "id": "00000000-1111-2222-3333-444444444444",
        "displayName": "MY Azure AD App"
      }
    }
  ]
}

 

Similarly, a GET on the same endpoint shows which applications have access to the site. Both approaches lead to the same result.

Note: You need to create this link between the app registration and the SharePoint site for each new site you want to replicate.
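Both Graph calls can be prepared and evaluated in a few lines. The sketch below reuses the JSON shape shown above as the POST body and scans a GET response for the application ID. The function names are made up, and looking in both `grantedToIdentities` and `grantedToIdentitiesV2` is a precaution, since I am not certain which of the two your tenant returns.

```python
def build_site_permission_body(app_id: str, display_name: str, role: str = "write") -> dict:
    """Build the JSON body for POST /sites/{siteId}/permissions."""
    return {
        "roles": [role],
        "grantedToIdentities": [
            {"application": {"id": app_id, "displayName": display_name}}
        ],
    }


def app_has_site_permission(permissions: dict, app_id: str) -> bool:
    """Look for the app's ID in a GET /sites/{siteId}/permissions response."""
    for permission in permissions.get("value", []):
        identities = (permission.get("grantedToIdentities") or []) + (
            permission.get("grantedToIdentitiesV2") or []
        )
        for identity in identities:
            if (identity.get("application") or {}).get("id") == app_id:
                return True
    return False
```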

With all these steps completed, we can be sure that the Microsoft credentials have the proper permissions for our job.

Step 3: Validate Pipeline Configuration:

Last but not least: with a working set of credentials and the ability to retrieve content from the Microsoft side, we should now be able to create a pipeline. Below are some additional checks to ensure the pipeline is configured properly:

Check 1: SharePoint Site Identifier 

When creating the pipeline, you are asked to provide a site name. This one is a bit tricky, because SharePoint sites have a title, a nice readable one, but this is not necessarily the site identifier. Instead, the name parameter needs to be filled with the ID of the SharePoint site, which is best taken from the URL. For example, if your SharePoint site is reachable under

https://contoso.sharepoint.com/sites/HR%20Team

then the site identifier would be HR%20Team.

Note: Be aware of the URL encoding.
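Taking the identifier from the URL can be sketched as follows. The helper name is hypothetical, and it assumes the site lives under the /sites/ path, as in the example; the URL encoding is deliberately left untouched.

```python
from urllib.parse import urlparse


def site_identifier(site_url: str) -> str:
    """Return the segment after /sites/ exactly as it appears in the URL."""
    segments = [segment for segment in urlparse(site_url).path.split("/") if segment]
    if len(segments) < 2 or segments[0] != "sites":
        raise ValueError("expected a URL of the form https://<host>/sites/<site>")
    return segments[1]
```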

A rare case that might occur here is a sub-site structure. To the best of my knowledge, this is currently not supported by Document Grounding, and SharePoint itself has deprecated the feature.

Check 2: SharePoint include paths

The optional include path parameter, which limits which content is replicated into the Document Grounding service, is also something to watch out for. A typical symptom is that, after configuring an include path, no documents appear in the replication results, which indicates that the path definition is likely too restrictive or incorrectly specified.

The easiest way to validate this is by checking directly in SharePoint. In this example, I’m using the default document library, “Shared Documents”, and have created a folder named “teched” containing a few sample files. To determine the correct value for the include path, simply examine the document’s URL: it reveals the folder structure that can be referenced in your pipeline configuration.

https://sap.sharepoint.com/sites/210637/Shared%20Documents/teched/Board%20Onsite%20San%20Marino.docx

Given a link roughly like this, I would use the following path to limit the replication to the teched folder only: “/teched”

Note: We do not prepend the path with /Shared%20Documents.
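The derivation of the include path from a document URL can be written down as a small helper. This is a sketch under assumptions: the function name is made up, the URL is expected to follow the layout /sites/<site>/<library>/<folders>/<file>, and folder names are returned URL-decoded, which may or may not be what the pipeline expects for folders containing spaces.

```python
from urllib.parse import unquote, urlparse


def include_path(document_url: str, library: str = "Shared Documents") -> str:
    """Return the folder path below the document library, e.g. '/teched'."""
    segments = [unquote(s) for s in urlparse(document_url).path.split("/") if s]
    # Expected layout: sites / <site> / <library> / <folders...> / <file>
    if len(segments) < 4 or segments[0] != "sites" or segments[2] != library:
        raise ValueError("URL does not match /sites/<site>/<library>/.../<file>")
    folders = segments[3:-1]  # drop site prefix, library and file name
    return "/" + "/".join(folders)
```

For the example URL above, the helper drops the site, the “Shared Documents” library and the file name, which leaves “/teched”.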

Check 3: Pipeline created but no documents

How to check: see this documentation page. It’s pretty straightforward. Whenever you create a pipeline for a site, it triggers a replication job within a few minutes. You can then see the status of that execution and determine what is going on. Sometimes this results in an empty list of documents, which usually points to an issue.

If you’re not seeing any documents, you may have defined the include path too narrowly. In that case, I recommend first testing with the root path, that is, omitting the include path entirely. Once replication works, you can refine the configuration in subsequent iterations to identify which parts of your SharePoint content are being synchronized.

Keep in mind that, due to the historical evolution of SharePoint, not all structures are recognized as valid document libraries. For example, subsites are no longer supported by Document Grounding, as this feature has been deprecated in SharePoint itself. If your content resides in a subsite, the only reliable solution is to migrate it to a standard site and its associated document library.

Conclusion:

If you can answer the following questions with yes, your troubleshooting was successful:

Did the pipeline get created with a 201 status code and a pipeline_id? 
Are any documents showing up for the pipeline execution? 
Does Joule answer my questions about the documents?

I hope you enjoyed this blog post. If you run into any other symptoms, feel free to comment below.

 
