Using Ragas with SAP AI Core and other metrics to evaluate LLMs

Challenges: LLMs must be evaluated for accuracy before they are deployed. It is difficult to assess accuracy simply by observation, especially for complicated tasks involving Retrieval-Augmented Generation or translation/summarization. The metrics discussed in this blog post provide quantitative scores for various types of tasks involving large language models.

In this post, we will take a look at:

- Ragas
- Bleu & Rouge
- Deepeval benchmarks

Ragas

Ragas is a tool for evaluating the accuracy of LLMs in RAG (Retrieval-Augmented Generation) applications.

RAG applications must be thoroughly tested before they can be deployed. Ragas simplifies that testing process by providing overall scores for various metrics which measure your LLM's ability to perform retrieval tasks, fetch the right context, and provide the right answers.

Ragas requires you to create your own dataset with your own questions and answers. This means you must know, or be able to determine, the correct answer to each question so that Ragas can make the comparison. The dataset must have the following 4 columns:

question | answer | contexts | ground_truth

where answer is the LLM's response to a given question, contexts is the context/source in which the answer was found (usually returned by a similarity search), and ground_truth is the actual/expected answer.

Ragas uses several metrics to evaluate the accuracy and reliability of LLM responses. 

faithfulness – evaluated by determining if the answer provided by the LLM can be inferred from the given context.

answer relevance – determines if the original question can be reconstructed from the given answer.

context precision – for each chunk in the context, checks whether it is relevant for arriving at the ground truth, and whether the chunks most relevant to the ground_truth appear at the top (as they should).

context recall – checks each sentence in the ground truth and determines if it is attributable to the context.

There are also five additional metrics that can be used, which are documented in the Ragas documentation.

Integrate and run Ragas tests with SAP AI Core

In this example app, we have prepared 15 questions, run each one against Gemini, Claude, and GPT LLMs, and will now run Ragas tests on all three to compare the results.

The data has columns question, contexts, ground_truth, and an answer column for each LLM.

Here is an outline of what the data should look like:
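
As a rough sketch (placeholder values only, not the actual rows used in the app), the CSV has one row per question and one answer column per model:

question | contexts | ground_truth | gemini_answer | claude_answer | gpt_answer
<question text> | <retrieved context> | <expected answer> | <Gemini response> | <Claude response> | <GPT response>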

*Note*: normally you would have only one "answer" column, but in this particular case we want to compare the performance of three different large language models.

Ragas needs an LLM of its own in order to perform the evaluation; you can use any LLM of your choice. In this example we are using a GPT-4 model deployed in SAP AI Core.

A text embedding model is also needed for the test; we will use an SAP AI Core deployment of text-embedding-ada-002.

Required Libraries:

- SAP generative AI hub SDK (pip install generative-ai-hub-sdk)
- SAP AI Core SDK (pip install ai-core-sdk)
- Ragas (pip install ragas)

Env file

A .env file must be created in the same directory as the app file with the following values:

 

 

AICORE_CLIENT_ID=
AICORE_AUTH_URL=
AICORE_CLIENT_SECRET=
AICORE_RESOURCE_GROUP=
AICORE_BASE_URL=
OPENAI_API_KEY=

 

Imports

*Note*: you can list any metrics you want to use in the Ragas test under the ragas.metrics import block shown below. The ones currently listed are the default metrics; if you omit this import and do not provide a metrics list, these are the ones that will be used anyway.

 

 

from ai_core_sdk.ai_core_v2_client import AICoreV2Client
from gen_ai_hub.proxy.langchain.init_models import init_llm
from gen_ai_hub.proxy.langchain.init_models import init_embedding_model

from datasets import Dataset
import pandas as pd
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision
)
from dotenv import load_dotenv
load_dotenv()

 

Loading AI Core credentials and initializing the chat completion and text embedding models.

These credentials come from the service key of your AI Core instance in SAP BTP.

 

 

client = AICoreV2Client(base_url="BASE_URL",
                        auth_url="AUTH_URL",
                        client_id="CLIENT_ID",
                        client_secret="CLIENT_SECRET",
                        resource_group="default")

langchain_llm = init_llm('gpt-4', max_tokens=100)
embeddings = init_embedding_model('text-embedding-ada-002')
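
As an optional sanity check (not part of the original app, just a minimal sketch assuming the LangChain-style objects returned by init_llm and init_embedding_model), you can call both models once to confirm the AI Core deployments are reachable before running the full evaluation:

# Optional smoke test, assuming LangChain-compatible interfaces
print(langchain_llm.invoke("Reply with OK if you can read this."))
print(len(embeddings.embed_query("hello world")))  # prints the embedding dimension, 1536 for text-embedding-ada-002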

 

Creating datasets for Ragas

Our data is currently in a CSV file; we must format it into a Dataset before we run the Ragas test.

The first step is to create a dict for each of our three LLMs.

We then use the pandas library to read each column of the CSV file and populate our dictionaries.

Finally, we create a Dataset from those dicts.

 

df = pd.read_csv('Q&A_DATA.csv')
df['contexts'] = df['contexts'].apply(lambda x: [x])  # wrap each context string in a list, as Ragas expects

# one dict per LLM; questions, contexts and ground truths are shared
gemini_data, claude_data, gpt_data = {}, {}, {}

gemini_data["question"] = df['question'].tolist()
gemini_data["answer"] = df['gemini_answer'].tolist()
gemini_data["contexts"] = df['contexts'].tolist()
gemini_data["ground_truth"] = df['ground_truth'].tolist()

claude_data["question"] = df['question'].tolist()
claude_data["answer"] = df['claude_answer'].tolist()
claude_data["contexts"] = df['contexts'].tolist()
claude_data["ground_truth"] = df['ground_truth'].tolist()

gpt_data["question"] = df['question'].tolist()
gpt_data["answer"] = df['gpt_answer'].tolist()
gpt_data["contexts"] = df['contexts'].tolist()
gpt_data["ground_truth"] = df['ground_truth'].tolist()

gemini_dataset = Dataset.from_dict(gemini_data)
claude_dataset = Dataset.from_dict(claude_data)
gpt_dataset = Dataset.from_dict(gpt_data)

 

*Note*: the line that wraps each context in a list (df['contexts'] = df['contexts'].apply(lambda x: [x])) is mandatory, as it ensures the contexts field in each dict is an array.

Finally, we run the Ragas tests using the imported evaluate function:

 

 

gemini_result = evaluate(
    dataset=gemini_dataset,
    llm=langchain_llm,
    embeddings=embeddings
)

claude_result = evaluate(
    dataset=claude_dataset,
    llm=langchain_llm,
    embeddings=embeddings
)

gpt_result = evaluate(
    dataset=gpt_dataset,
    llm=langchain_llm,
    embeddings=embeddings
)

print("Gemini result " + str(gemini_result) + '\n\n')
print("Claude result " + str(claude_result) + '\n\n')
print("GPT result " + str(gpt_result) + '\n\n')

 

And here are the printed results:

 

Gemini result {'answer_relevancy': 0.9378, 'context_precision': 1.0000, 'faithfulness': 1.0000, 'context_recall': 1.0000}

Claude result {'answer_relevancy': 0.9462, 'context_precision': 1.0000, 'faithfulness': 0.5833, 'context_recall': 1.0000}

GPT result {'answer_relevancy': 0.8625, 'context_precision': 1.0000, 'faithfulness': 1.0000, 'context_recall': 1.0000}
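
If you only want to score a subset of metrics instead of the defaults, evaluate also accepts a metrics list; here is a minimal sketch using the metrics imported earlier:

# Sketch: restrict the evaluation to two metrics for the Gemini dataset
subset_result = evaluate(
    dataset=gemini_dataset,
    metrics=[faithfulness, answer_relevancy],
    llm=langchain_llm,
    embeddings=embeddings
)
print(subset_result)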

 

Bleu & Rouge (non-RAG metrics)

Bleu is a metric which is normally used to evaluate a model’s ability to translate text, while Rouge is used to evaluate its ability to summarize or capture the meaning of text.

Rouge

When using the Rouge metric to measure the quality of an LLM-generated summary, you provide a reference summary (created by you or another human) as an example of a good summary. The Rouge score is then calculated by comparing the LLM-generated summary with the human-provided one.

Different variants of Rouge:

- Rouge-1 counts how many matching words (unigrams) there are between the human and machine-generated summaries.
- Rouge-2 counts how many matching word pairs (bigrams) there are between the human and machine-generated summaries.
- Rouge-L finds the longest common subsequence between the human and machine-generated summaries. The sequence does not have to be contiguous, just in the same order.

Code example:

 

from datasets import load_metric

rouge = load_metric("rouge")
llmSummaries = ["I was in walmart yesterday"]
humanSummary = ["I went to walmart yesterday"]
print(rouge.compute(predictions=llmSummaries, references=humanSummary))

 

Console output:

 

{'rouge1': AggregateScore(low=Score(precision=0.6, recall=0.6, fmeasure=0.6), mid=Score(precision=0.6, recall=0.6, fmeasure=0.6), high=Score(precision=0.6, recall=0.6, fmeasure=0.6)),

'rouge2': AggregateScore(low=Score(precision=0.25, recall=0.25, fmeasure=0.25), mid=Score(precision=0.25, recall=0.25, fmeasure=0.25), high=Score(precision=0.25, recall=0.25, fmeasure=0.25)),

'rougeL': AggregateScore(low=Score(precision=0.6, recall=0.6, fmeasure=0.6), mid=Score(precision=0.6, recall=0.6, fmeasure=0.6), high=Score(precision=0.6, recall=0.6, fmeasure=0.6)),

'rougeLsum': AggregateScore(low=Score(precision=0.6, recall=0.6, fmeasure=0.6), mid=Score(precision=0.6, recall=0.6, fmeasure=0.6), high=Score(precision=0.6, recall=0.6, fmeasure=0.6))}
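
To see where these numbers come from: the two summaries share three unigrams (I, walmart, yesterday) out of five words each, so Rouge-1 precision and recall are both 3/5 = 0.6. They share only one of their four bigrams ("walmart yesterday"), giving the Rouge-2 score of 1/4 = 0.25, and the longest common subsequence (I ... walmart yesterday) is three words long, which again yields 0.6 for Rouge-L.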

 

We are provided with rouge1, rouge2, rougeL, and rougeLsum scores.
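
Each entry is an AggregateScore holding low/mid/high Score tuples, so to pull out a single number (for example the mid F-measure of Rouge-1) you can do something like this sketch:

# Sketch: extract one scalar from the aggregate result
scores = rouge.compute(predictions=llmSummaries, references=humanSummary)
print(scores["rouge1"].mid.fmeasure)  # 0.6 for the example above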

Bleu

Similar to Rouge, you must provide a human reference translation alongside the machine-generated one you wish to get a Bleu score for.

Bleu works by computing n-gram precision: it compares the machine-generated translation with the human reference(s), counts the number of matching words, and divides by the total number of words in the generation. Matched words are clipped, so a sentence that simply repeats the same correct word cannot earn a perfect score. Bleu also looks at longer n-grams (pairs, triples, and so on, up to 4-grams by default) rather than single words alone, and combines these precisions with a brevity penalty; this rewards correct word order as well as correct word choice, since word order can differ considerably between languages.

Example:

 

from datasets import load_metric

bleu = load_metric("bleu")
llmTranslation = [["I", "am", "of", "Spain"]]
humanTranslations = [
    [["I", "am", "from", "Spain"], ["I'm", "from", "Spain"]]
]
print(bleu.compute(predictions=llmTranslation, references=humanTranslations))

 

Output:

 

{'bleu': 0.0, 'precisions': [0.75, 0.3333333333333333, 0.0, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 1.3333333333333333, 'translation_length': 4, 'reference_length': 3}
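
Walking through the precisions list: three of the four unigrams in "I am of Spain" (I, am, Spain) appear in a reference, giving 3/4 = 0.75; only one of the three bigrams ("I am") matches, giving 1/3 ≈ 0.33; and no 3-grams or 4-grams match at all. Because Bleu geometrically averages all four precisions, those zeros pull the overall score down to 0.0, even though most of the individual words were correct.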

 

 

Deepeval Benchmarks

Deepeval is an LLM evaluation framework developed by Confident AI which provides access to various benchmark tests that can be used to evaluate a model's overall performance.

Unlike the above metrics, these benchmarks evaluate the performance of a model by using entire datasets.  Each benchmark represents a different dataset containing a list of relevant questions and expected answers (ground truths) for the task.

- HellaSwag
- TruthfulQA
- MMLU
- DROP

HellaSwag – Provides 10,000 challenges revolving around sentence completion.

DROP – 9,500 challenges which measure a model’s reasoning abilities. 

TruthfulQA – 817 challenges (questions) which determine whether a model is able to answer questions truthfully. Common misconceptions (things many people believe that are not actually true) are a key part of these challenges.

MMLU – 15,000 multiple-choice challenges spanning 57 different subjects, e.g. math, history, law, and ethics.

All these benchmarks are available in the deepeval library (pip install deepeval). Each benchmark has a set of tasks associated with it, and you can choose which tasks you want to include.

An example with MMLUTask

 

 

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

benchmark = MMLU(
    tasks=[MMLUTask.FORMAL_LOGIC, MMLUTask.GLOBAL_FACTS],
    n_shots=3
)

model = AutoModelForCausalLM.from_pretrained(
    "Jimmyhd/testRepo3",
    device_map='auto',
    torch_dtype=torch.float16,
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    trust_remote_code=True
)
benchmark.evaluate(model=model)
print(benchmark.overall_score)

 

**Note**: the trust_remote_code=True parameter in the AutoModelForCausalLM call is mandatory in order to use any of the benchmarks.
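
If you also want a per-task breakdown rather than just the aggregate, the deepeval benchmarks expose additional result attributes; the task_scores attribute below is taken from the deepeval documentation and assumed to be available in your installed version:

# Assumed per the deepeval docs: per-task results alongside the overall score
print(benchmark.overall_score)   # aggregate accuracy across the selected tasks
print(benchmark.task_scores)     # breakdown for FORMAL_LOGIC and GLOBAL_FACTS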

For more detail about how the benchmarks are evaluated, and for further examples of using them, see the deepeval documentation.

Conclusion

Whether you want to evaluate a model's ability to perform RAG tasks, translate, summarize, reason, or simply answer questions based on common mathematical and scientific knowledge, there are many metric and benchmark tools that will give you a quantitative score, letting you see how well the model performs before your application is deployed.

This concludes our look at these three types of evaluations!

 
