Introduction
Today, enterprises are building agents, either to automate manual processes or to optimize existing ones. Whatever the use case, we have already entered the Economy of Agents. This mirrors the last several decades, in which software engineers, developers, and consultants built production-grade applications.
Software development evolved over decades, which gave us time to adapt to it and to accept new solutions and frameworks. The pace at which we developed applications was also slow compared to the pace at which agentic solutions are being built today.
The critical difference between traditional software applications and agentic solutions lies in their very nature: traditional software is deterministic, whereas agentic solutions are, to an extent, non-deterministic.
Quality gates for traditional software development have evolved over years of iteration and have given us robust test automation tools and frameworks. The harsh truth is that we still find production bugs.
Now imagine a non-deterministic agentic application for which testing is hard to define and behaviour hard to predict. This can lead to completely unpredictable production bugs or unexpected behaviour, and the cost in time and money can grow exponentially.
That is why evaluation of agentic solutions is so important, and why enterprises should treat it as a top priority to ensure good-quality agentic applications.
This blog is not meant to scare you, but to give you a spark to think and act.
So gear up, and let's start!
I will break the article down into the following parts:

- What is agent evaluation?
- Types of agent evaluation
- Evaluation tools
- Evaluation framework (block diagram)
- Why OpenTelemetry?
- Code walkthrough
- Available agentic evaluation frameworks
What is Agent Evaluation?
Agent evaluation means validating the output of an agent. This differs from typical software testing: in software testing, we know what the input is and what the expected output should be.
But in agentic solutions, because the response comes from the underlying LLM, the output varies due to many factors, such as the prompts, the LLM used, and so on.
So agentic AI testing is done differently. It is based on pre-defined metrics such as response time, the tools used by the agent, token consumption, and the total cost of a response, to name a few.
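To make the metric-based idea concrete, here is a minimal sketch of a quality gate over such metrics. The threshold names and values are my own illustration, not part of the project described in this blog.

```python
# Hypothetical thresholds for a metric-based quality gate; the names
# and values are illustrative, not from any specific framework.
THRESHOLDS = {"max_latency_s": 2.0, "max_tokens": 500}

def evaluate_run(latency_s, tokens_used, tools_called):
    """Check one agent run against pre-defined quantitative metrics."""
    failures = []
    if latency_s > THRESHOLDS["max_latency_s"]:
        failures.append("latency")
    if tokens_used > THRESHOLDS["max_tokens"]:
        failures.append("tokens")
    if not tools_called:
        failures.append("no_tool")
    return {"passed": not failures, "failures": failures}

# A run that is fast enough but consumes too many tokens fails the gate.
result = evaluate_run(latency_s=1.4, tokens_used=620, tools_called=["get_weather"])
print(result)
```

In a real pipeline the inputs would come from captured traces rather than hard-coded values, and the thresholds would be tuned per use case.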
Based on the evaluation results, the developer can tweak the solution: enhance the prompts, change the underlying LLM, or introduce an additional tool that reduces overall response time and token consumption while still ensuring the expected output.
Agent evaluation is an evolving space, and a lot of work is being done to streamline it.
Types of Agent Evaluation
- Code-based evaluation: This is a pro-code style of evaluation, and you will get a glimpse of it in this tutorial/POC. In code-based evaluation, code is injected into the agentic AI application to capture metric data while the agent is executing a request. It is well suited to validating the objective, quantitative attributes of the solution, for example latency, response time, and token consumption.
- LLM as a judge: This is a low-code style of evaluation and is not covered in this tutorial/POC. As the name suggests, an LLM acts as the reviewer: it validates the correctness of the output produced by the agentic application. It is well suited to validating the subjective, qualitative attributes of the solution, for example correctness of the response, reasoning, and planning.
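The LLM-as-a-judge flow is not part of this POC, but it can be sketched as follows. Here `judge_llm` is a stub standing in for a real model call, and the rubric wording and passing threshold are my own assumptions.

```python
def judge_llm(prompt: str) -> str:
    # Stub: a real implementation would send `prompt` to an LLM
    # and return its verdict text. Hard-coded here so the sketch runs.
    return "score: 4"

def evaluate_with_judge(question, agent_answer, reference_answer):
    """Ask a judge model to grade an agent's answer against a reference."""
    prompt = (
        "Rate the agent answer from 1 to 5 against the reference.\n"
        f"Question: {question}\n"
        f"Agent answer: {agent_answer}\n"
        f"Reference answer: {reference_answer}\n"
        "Reply as 'score: N'."
    )
    verdict = judge_llm(prompt)
    score = int(verdict.split(":")[1])
    # Passing threshold of 3 is an arbitrary illustrative choice.
    return {"score": score, "passed": score >= 3}

print(evaluate_with_judge("Weather in Pune?", "Warm and sunny.", "Sunny, around 30°C."))
```

The essential point is that the judge evaluates qualitative attributes (correctness, reasoning) that simple metric thresholds cannot capture.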
Evaluation Tools
Pre-requisite: to run this observability stack on your local machine, you should have Docker Desktop installed. Once done, install the Docker images below.

- Prometheus: A time-series data store for collecting metric data and alerting.
- Grafana Tempo: A distributed tracing backend that simplifies storing and visualizing trace data.
- OpenTelemetry Collector: Acts as a central hub for receiving traces and exporting telemetry data to observability backends.
- Grafana: An interactive data visualization and monitoring platform that lets users query, visualize, and understand data from various sources.
Evaluation Framework (Block Diagram)
Figure-1
Why OpenTelemetry?
A very obvious question after looking at the above block diagram is “Why OpenTelemetry?”
OpenTelemetry is an open standard for collecting and exporting telemetry data like metrics and traces.
I would summarize it in five important points:

- Vendor neutral; it can integrate with any backend for visualization
- Diverse language support (Python, Node.js, etc.)
- Batch processing, which speeds up trace collection
- A standard format for collecting observability data
- Powerful tracing with parent-child relationships and context propagation
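The last point, parent-child tracing with context propagation, can be sketched conceptually in a few lines. This is a minimal illustration of the idea using `contextvars`, not the real OpenTelemetry SDK; the class and variable names are mine.

```python
import contextvars
import itertools

# The "current span" travels with the execution context, so nested
# spans can discover their parent automatically.
_current_span = contextvars.ContextVar("current_span", default=None)
_ids = itertools.count(1)
finished = []  # stand-in for an exporter backend

class Span:
    def __init__(self, name):
        self.name = name
        self.span_id = next(_ids)
        self.parent_id = None

    def __enter__(self):
        parent = _current_span.get()
        self.parent_id = parent.span_id if parent else None
        self._token = _current_span.set(self)
        return self

    def __exit__(self, *exc):
        _current_span.reset(self._token)  # restore the parent as current
        finished.append(self)

with Span("handle_chat_request") as root:
    with Span("call_weather_tool") as child:
        pass

print(child.parent_id == root.span_id)  # the child remembers its parent
```

The real SDK adds trace IDs, timestamps, attributes, and exporters on top of this core idea, and propagates the context across service boundaries as well.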
If you want to dive deeper into OpenTelemetry, I have added a link in the References section.
Code Walkthrough
So far we have discussed the theory and the block diagram; now it's time to dive into the code. The Agent Evaluation section of the README details every step to help you get started. In this section I would like to call out some important files that deserve extra attention while you browse the code:
- observability folder: holds all the config files for the OpenTelemetry Collector, Tempo, and Prometheus.
- agent_observability.py: this singleton class is the core of observability. It holds all the methods for capturing metrics and traces. I have created decorators that are applied to the actual functions to capture traces. Custom attributes like request_size_in_bytes, agent_name, and response_time are captured; you can add more attributes here, such as token_consumption.
- agent_service.py: in this file you will see the decorator applied to the "/chat" endpoint method.
- weather_functions.py: in this file you will see the decorator applied to all the weather functions.
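To give a feel for the singleton-plus-decorator pattern described above, here is a hedged, self-contained sketch. The class and attribute names mirror the blog's description, but this is my simplified illustration, not the actual agent_observability.py, which exports real spans via OpenTelemetry.

```python
import functools
import time

class AgentObservability:
    """Simplified singleton: one shared collector of trace records."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.traces = []
        return cls._instance

    def trace(self, agent_name):
        """Decorator factory that records custom attributes per call."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                result = fn(*args, **kwargs)
                self.traces.append({
                    "agent_name": agent_name,
                    "operation": fn.__name__,
                    "request_size_in_bytes": len(repr(args).encode()),
                    "response_time": time.perf_counter() - start,
                })
                return result
            return wrapper
        return decorator

obs = AgentObservability()

@obs.trace("weather_agent")
def get_weather(city: str) -> str:
    # Stand-in for a real weather function from weather_functions.py.
    return f"Sunny in {city}"

print(get_weather("Pune"))
print(AgentObservability() is obs)  # True: every caller sees the same instance
```

Because the class is a singleton, the service endpoint and every tool function write into the same collector, which is what lets a single backend correlate all the spans.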
I have applied tracing only to the weather agent, but it can be extended to the finance agent.
There are also off-the-shelf evaluation frameworks that can be leveraged and that offer a certain level of customization. The next section covers them.
Available Agentic Evaluation Frameworks
I have listed a few of the available agentic AI evaluation frameworks below; they are a good starting point for learning about and understanding the importance of agent evaluation.
- LangSmith – https://www.langchain.com/langsmith
- Phoenix – https://docs.arize.com/phoenix
- DeepEval – https://www.deepeval.com
Conclusion
Drawing from my experience, these insights can serve as a starting point, though their applicability may vary with the specific scenario. This project should give you a starting point for the design and development of agentic AI evaluation.
Along with the theory, my intention is to showcase a fully functional prototype, which is publicly available on GitHub.
Disclaimer: This is not an official reference application or documentation. The thoughts outlined in this blog are based on my experience and learnings about Agentic AI and agent evaluation.
Feel free to like, share, and add a comment, and follow me for updates about my next blogs!
References
- https://opentelemetry.io/docs/languages/python/
- https://grafana.com/blog/2021/05/04/get-started-with-distributed-tracing-and-grafana-tempo-using-foobar-a-demo-written-in-python/
- https://prometheus.io/
- https://opentelemetry.io/docs/what-is-opentelemetry/
#SAP
#SAPTechnologyblog