Monitor Token Usage with SAP Generative AI Hub

Keeping a close eye on costs is crucial for maintaining a viable business case when using Generative AI. This blog post explores techniques for tracking the exact token usage in your Generative AI Hub-based developments. Specifically, we will examine how to break down token consumption on a use-case-by-use-case basis.

 

The Generative AI Token Metering System:

To quickly recap how the usage of Generative AI models is billed, there are five key metrics to consider:

Model Input Tokens: These are the units of text (words or subwords) from the input prompt that the model processes to understand and generate a response.

Model Output Tokens: These are the units of text generated by the model as a response, forming the final output based on the input prompt.

Generative AI Input Tokens: These represent standardized tokens used as a unified metric across various AI models. They harmonize different tokenization systems (e.g., GPT-4, Claude, or Mistral) into a single, comparable token count for input text.

Generative AI Output Tokens: Like input tokens, these are standardized across models and represent the number of tokens generated as output. This ensures uniform cost and usage comparisons, regardless of the underlying model’s specific tokenization methods.

Capacity Units: This is the virtual currency used on BTP. For companies working with SAP, a contingent of capacity units is typically purchased. Ultimately, Generative AI Input and Output Tokens are converted into capacity units.

For this exercise, the primary goal is to capture the usage of actual model tokens. These can then be aggregated into standardized metrics and, finally, translated into real monetary costs.

Now, let’s explore the different levels at which token usage can be monitored.

Tracking on Subaccount Level:

The central governance entity in BTP is the Subaccount. This is the hierarchical level within BTP where user access control, cost management, and other administrative tasks are handled. BTP meters all the services used by customers and provides usage reports aggregated at the Subaccount level.

For example, on our BTP Cockpit Usage Dashboard, we can view the total number of Generative AI Input and Output Tokens consumed.

However, we cannot see the specific usage per model or the breakdown of tokens across the different skills our AI agent might possess. Additionally, if we operate a single central Generative AI Hub instance within one Subaccount for multiple apps and services used by different parts of the organization, it becomes difficult to determine how to distribute costs fairly.

Fine-Grained Tracking Approach

Many use cases require more granular tracking of usage, especially when optimizing LLM-based applications for large-scale deployments. Experiments with different prompting strategies, retrieval techniques, and related methods need to be evaluated based on their price-performance ratio. To achieve this, it is essential to know precisely how many tokens from which model are used in each user interaction.

Currently, the only viable way to achieve such detailed tracking is by implementing additional code. LLM providers return usage figures with every response, and these can be used for exactly this purpose:

 

from gen_ai_hub.proxy.native.openai.clients import OpenAI
from gen_ai_hub.proxy.core.proxy_clients import get_proxy_client

# initialize the Generative AI Hub proxy and the OpenAI-compatible client
proxy_client = get_proxy_client('gen-ai-hub')
client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hi there!"}],
)

# print the input and output tokens reported by the provider
print(completion.usage.prompt_tokens)
print(completion.usage.completion_tokens)

 

Now it’s up to us to collect these metrics for each interaction with our LLM-based application and store them effectively. One key challenge is that different model providers use different API formats to supply usage information. This means we either need to customize our tracking code for each provider’s API or use an intermediate layer to simplify the process.
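Before looking at intermediate layers, here is a rough sketch of the first option: a small, hypothetical helper that normalizes usage data from different response shapes into a single pair of numbers. The field names are assumptions based on commonly seen provider formats (OpenAI-style prompt_tokens/completion_tokens, Anthropic-style input_tokens/output_tokens) and would need to be verified against each provider's actual API:

def extract_usage(response: dict) -> tuple[int, int]:
    """Normalize provider-specific usage fields into (input_tokens, output_tokens).

    The field names below reflect typical response shapes and should be
    checked against each provider's documentation before use.
    """
    usage = response.get("usage", {})

    # OpenAI-style responses report usage.prompt_tokens / usage.completion_tokens
    if "prompt_tokens" in usage:
        return usage["prompt_tokens"], usage["completion_tokens"]

    # Anthropic-style responses report usage.input_tokens / usage.output_tokens
    if "input_tokens" in usage:
        return usage["input_tokens"], usage["output_tokens"]

    raise ValueError("Unknown usage format")

Each new provider adds another branch to maintain, which is exactly why an intermediate layer is attractive.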

An example of such an intermediate layer is LangChain, an open-source framework that harmonizes the APIs of various model providers.

In our example, however, we use the built-in Orchestration Service, which provides a unified API. This service enables us to make requests to various models using a consistent API scheme.

Let’s take a look at a simplified example of how an LLM-App server could be implemented in Python:

 

from flask import Flask, request, jsonify
from token_usage_tracking.genaihub import generate_summary
from token_usage_tracking.hana_store_tokens import init_db, log_tokens

app = Flask(__name__)

@app.route('/generate', methods=['POST'])
def generate():
    """Flask endpoint that generates a summary of the submitted prompt"""
    body = request.get_json()
    prompt = body.get("prompt", "")

    model_name = "gpt-4o"

    summary, prompt_tokens, completion_tokens = generate_summary(prompt, model_name)

    # log the token counts in the background so the response is not delayed
    log_tokens('generate_summarization_1', '/generate', model_name, prompt_tokens, completion_tokens)

    return jsonify({"generated_text": summary})

if __name__ == '__main__':
    init_db()
    app.run(port=5000)

 

I implemented a simple /generate endpoint that takes in a prompt and generates a summary. To create the summary, I use the function generate_summary. In a real-world scenario, we could expose various AI-based functionalities through the server.

The generate_summary function returns the number of prompt and completion tokens used during the interaction. We can then log these token counts for tracking purposes.

 

 

from gen_ai_hub.orchestration.models.message import SystemMessage, UserMessage
from gen_ai_hub.orchestration.models.template import Template, TemplateValue
from gen_ai_hub.orchestration.models.llm import LLM
from gen_ai_hub.orchestration.models.config import OrchestrationConfig
from gen_ai_hub.orchestration.service import OrchestrationService

def generate_summary(text, model_name):
    """Use the Orchestration Service to generate a summary and return usage data"""

    template = Template(
        messages=[
            SystemMessage("You are a helpful summarization assistant."),
            UserMessage(
                "Summarize the following text: {{?text}}"
            ),
        ]
    )

    llm = LLM(name=model_name, version="latest", parameters={"max_tokens": 256, "temperature": 0.2})

    config = OrchestrationConfig(
        template=template,
        llm=llm,
    )

    orchestration_service = OrchestrationService(
        api_url="https://api.ai.internalprod.eu-central-1.aws.ml.hana.ondemand.com/v2/inference/deployments/d9bd1bd1414ecbf5",
        config=config,
    )

    result = orchestration_service.run(template_values=[
        TemplateValue(name="text", value=text)
    ])

    # return the generated text together with the reported token usage
    usage = result.orchestration_result.usage
    return (
        result.orchestration_result.choices[0].message.content,
        usage.prompt_tokens,
        usage.completion_tokens,
    )

 

The actual generation process in the example is quite minimalistic and would require much more code in a real-world scenario.

Now, an important question arises: Where do we store the token usage metrics? For this example, I've based the solution on SAP HANA Cloud, as many of our customers use the HANA Cloud Vector Engine in conjunction with the Generative AI Hub. This has the advantage of leveraging an existing database in our landscape, allowing us to reuse it cost-effectively.

Of course, this can be adapted to any other data storage solution. Initially, I considered using tools like Dynatrace or SAP Cloud Logging (OpenSearch), both of which support metric visualization. However, for simplicity and ease of replication, I believe the example below is the most straightforward option for now:

 

from concurrent.futures import ThreadPoolExecutor
from hdbcli import dbapi

host = "<hana_cloud_host>"
port = 443
user = "<user>"
password = "<password>"

# Shared database connection and background executor for logging
hana_connection = None
executor = ThreadPoolExecutor(max_workers=2)

def init_db():
    """Initialize the database connection and create the log table if needed"""
    global hana_connection
    hana_connection = dbapi.connect(
        address=host,
        port=port,
        user=user,
        password=password
    )
    cursor = hana_connection.cursor()
    try:  # create the metric table if it does not exist yet
        cursor.execute("""
            CREATE COLUMN TABLE TokenLogs (
                Endpoint VARCHAR(255),
                UseCase VARCHAR(255),
                ModelName VARCHAR(255),
                InputTokens INT,
                OutputTokens INT,
                Timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )""")
        hana_connection.commit()
    except dbapi.Error:
        print("Table already exists")
    finally:
        cursor.close()

def log_tokens_to_hana(endpoint, use_case, model_name, input_tokens, output_tokens):
    """Log token usage to the HANA database"""
    cursor = hana_connection.cursor()
    try:
        # SQL to insert the log data into the table
        insert_sql = """
            INSERT INTO TokenLogs (Endpoint, UseCase, ModelName, InputTokens, OutputTokens)
            VALUES (?, ?, ?, ?, ?)
        """
        # Execute the insert statement with the provided values
        cursor.execute(insert_sql, (endpoint, use_case, model_name, input_tokens, output_tokens))

        # Commit the transaction
        hana_connection.commit()
        print("Log successfully inserted into the table.")
    except Exception as e:
        print(f"Error: {e}")
    finally:
        # Close the cursor (the shared connection stays open)
        cursor.close()

def log_tokens(endpoint, use_case, model_name, input_tokens, output_tokens):
    """Log tokens via a background thread so the request is not blocked"""
    executor.submit(log_tokens_to_hana, endpoint, use_case, model_name, input_tokens, output_tokens)

 

To log the token usage, I use a ThreadPoolExecutor. This runs the logging tasks in the background without blocking the LLM app's response to the caller. A single shared database connection is reused across the worker threads, and the number of workers can be controlled with the max_workers parameter.

First, we create the database connection and ensure that the table for storing the TokenLogs exists. If you’re using design-time artifacts based on HDI, you can skip this step.

In the log_tokens_to_hana function, I insert records into the database table. For this example, I decided to log the following details:
– Server endpoint
– Name of the AI use case
– Model name
– Input and output tokens

Additionally, I include a current timestamp to facilitate time-based aggregations of the token usage data.

Analyzing Usage Data in HANA:

When running the server and processing a series of sample requests, the data is inserted into the HANA database in the background, without disrupting the execution flow of the endpoint.

Now, let’s take a look at the data that’s being created:

 

Endpoint                  | Use Case  | Model Name  | Input Tokens | Output Tokens | Timestamp
--------------------------|-----------|-------------|--------------|---------------|------------------------------
generate_summarization_1  | /generate | gpt-4o-mini | 121          | 69            | 2025-01-09 20:39:59.063000000
generate_summarization_1  | /generate | gpt-4o-mini | 121          | 72            | 2025-01-09 20:41:13.533000000
generate_summarization_1  | /generate | gpt-4o-mini | 121          | 68            | 2025-01-09 20:42:29.748000000
generate_summarization_1  | /generate | gpt-4o-mini | 121          | 76            | 2025-01-09 20:43:51.390000000
generate_summarization_1  | /generate | gpt-4o-mini | 121          | 72            | 2025-01-09 21:16:35.425000000
generate_summarization_2  | /generate | gpt-4o      | 121          | 92            | 2025-01-09 23:28:37.588000000
generate_summarization_2  | /generate | gpt-4o      | 121          | 73            | 2025-01-09 23:30:07.564000000
generate_summarization_2  | /generate | gpt-4o      | 121          | 72            | 2025-01-09 23:30:09.301000000
generate_summarization_2  | /generate | gpt-4o      | 121          | 86            | 2025-01-09 23:30:09.705000000
generate_summarization_2  | /generate | gpt-4o      | 121          | 74            | 2025-01-09 23:30:09.885000000
generate_summarization_2  | /generate | gpt-4o      | 121          | 77            | 2025-01-09 23:30:10.132000000

 

Looks very nice.

Now, with some SQL queries, we can perform a deeper analysis on the number of tokens used per use case. For instance, we can aggregate the token usage by different use cases, models, or time periods, helping us understand the cost distribution and performance more clearly.

 

SELECT
    ModelName, Endpoint, UseCase,
    TO_VARCHAR(EXTRACT(YEAR FROM Timestamp)) || '-' || LPAD(TO_VARCHAR(EXTRACT(MONTH FROM Timestamp)), 2, '0') AS Month,
    SUM(InputTokens) AS TotalInputTokens,
    SUM(OutputTokens) AS TotalOutputTokens,
    AVG(InputTokens) AS AvgInputTokens,
    AVG(OutputTokens) AS AvgOutputTokens
FROM TokenLogs
GROUP BY ModelName, Endpoint, UseCase, EXTRACT(YEAR FROM Timestamp), EXTRACT(MONTH FROM Timestamp)
ORDER BY ModelName, Endpoint, UseCase, Month;

 

Using this SQL statement, we can see the total number of tokens used by each use case and model per month, as well as the average values. This helps in analyzing trends and comparing the token consumption across different use cases and models over time:

 

Model Name  | Endpoint                 | Use Case  | Month   | Total Input Tokens | Total Output Tokens | Avg Input Tokens | Avg Output Tokens
------------|--------------------------|-----------|---------|--------------------|---------------------|------------------|------------------
gpt-4o      | generate_summarization_2 | /generate | 2025-01 | 1210               | 771                 | 121.000000       | 77
gpt-4o-mini | generate_summarization_1 | /generate | 2025-01 | 605                | 357                 | 121.000000       | 71

 

In my case, I sent the same input prompt repeatedly, so it's expected that the average input tokens remain constant. Interestingly, GPT-4o seems to use 6 more output tokens on average to summarize my text (admittedly with a tiny sample size 🙂).

Finally, if you’re interested, you can convert those model input and output tokens into SAP’s Generative AI input and output tokens and then further into capacity units. Alternatively, you can use the cost estimator here: SAP AI Core Cost Estimator.
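As a rough illustration of that final step, the sketch below converts raw model tokens into capacity units using placeholder conversion rates. The actual rates per model are defined by SAP and should be taken from the AI Core pricing documentation or the cost estimator; the numbers used here are assumptions for demonstration only.

# Hypothetical conversion rates for illustration only; the real factors per model
# come from SAP's published AI Core pricing / the SAP AI Core Cost Estimator.
CONVERSION_RATES = {
    # model name: (capacity units per 1,000 input tokens, per 1,000 output tokens)
    "gpt-4o": (0.2, 0.6),        # placeholder values
    "gpt-4o-mini": (0.01, 0.04), # placeholder values
}

def to_capacity_units(model_name, input_tokens, output_tokens):
    """Convert raw model tokens into capacity units using the placeholder rates above."""
    in_rate, out_rate = CONVERSION_RATES[model_name]
    return (input_tokens / 1000) * in_rate + (output_tokens / 1000) * out_rate

# Example: the monthly totals for gpt-4o from the aggregation above
print(to_capacity_units("gpt-4o", 1210, 771))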

Have fun tracking tokens! Feel free to leave a comment if you have any further questions!
