SAP AI Core is All You Need | 3. Workflow, Configuration, and Shakespeare Language Model Training

Estimated read time: 50 minutes

Introduction

Hey there, AI enthusiasts! Welcome back to our series “SAP AI Core is All You Need”.

In this blog, “Workflow, Configuration, and Shakespeare Language Model Training”, we’re rolling up our sleeves to take our Shakespeare Language Model to the next level. We’ll be diving into the nitty-gritty of containerizing and orchestrating our AI text generation pipelines using Docker and Argo Workflows. Let’s make sure our model training process is scalable, reproducible, and ready for action.

What to Expect

In this blog, you will gain hands-on experience with the following key concepts:

Creating Docker Images: Learn how to set up and build Docker images for our training pipeline.
Designing Workflow Templates: Understand how to create workflow templates to automate and manage your training process.
Deploying the Training Workflow: See how to deploy your configured workflow template on SAP AI Core.
Evaluating Model Results: Learn how to track and evaluate the performance of your model.

 

Building Containerized Pipelines for AI Text Generation

Let’s dive into the training process step-by-step. We’ll explore how to containerize and orchestrate AI text generation pipelines using Docker and Argo Workflows. Our focus will be on training the Shakespeare Language Model in a scalable and reproducible manner with SAP AI Core and SAP AI Launchpad.

Generating Docker Image

First things first, we need to generate the Docker images that will handle our training pipeline. We’ll break this down into two steps: ai-core-training-setup and ai-core-training (as you may remember, we described these steps in blog 2).

Let’s begin with ai-core-training-setup. Take a look inside the folder:

So, you already know what’s inside ai-core-training (all the files), because that’s where we learned about Transformers, right? But let’s take a closer look at ai-core-training-setup (as you might not be familiar with this part yet). Even though all the code for this blog is available here https://github.com/carlosbasto/shakespeare-language-model, it’s worth showing you a little bit of what’s in there, don’t you think? Well, at least the part that’s important for understanding how we’ll design the workflow template to make our model “tunable”.

 

Understanding main.py

Alright, let’s dive into the main.py script here:

 

import torch
from ShakespeareanGenerator.logger import Logger
from ShakespeareanGenerator.artifact_manager import ObjectStoreArtifactManager

class Run:

    def __init__(self):
        self.logging = Logger()
        self.obj = ObjectStoreArtifactManager()
        self.prepare_data()

    def prepare_data(self):
        self.logging.info('START: PREPARATION STEP')
        self.obj.upload_file_to_object_store()
        self.logging.info('Training Data was uploaded to Object Store')
        self.logging.info('END: PREPARATION STEP')

if __name__ == '__main__':
    Run()

 

Take a look at this snippet! It simply instantiates the ObjectStoreArtifactManager class and calls upload_file_to_object_store(). Pretty straightforward, right?

Exploring ObjectStoreArtifactManager

This class is designed to upload files to the object store, which (I hope) should be self-explanatory.

 

import boto3
import requests
from ShakespeareanGenerator.logger import Logger
from ShakespeareanGenerator.parameters import ObjectStoreParameters

class ObjectStoreArtifactManager:

    def __init__(self):
        self.logging = Logger()
        self.obj_parameters = ObjectStoreParameters()
        self.s3 = boto3.client(
            's3',
            aws_access_key_id=self.obj_parameters.access_key_id,
            aws_secret_access_key=self.obj_parameters.secret_access_key
        )

    def upload_file_to_object_store(self):
        url = "<link_to_github_repository>/tinyshakespeare.txt"
        file_key = f"{self.obj_parameters.prefix}{self.obj_parameters.DATA_PATH + self.obj_parameters.DATA_NAME}"
        try:
            response = requests.get(url)
            response.raise_for_status()  # Raise an exception for HTTP errors
            corpus = response.text
            corpus = "\n".join(corpus.split("\n"))
            self.s3.put_object(
                Bucket=self.obj_parameters.bucket_name,
                Key=file_key,
                Body=corpus.encode('utf-8')
            )
            self.logging.info(f"Uploaded tinyshakespeare.txt to S3 path: {file_key}")
        except requests.RequestException as e:
            error_msg = f"Error fetching data from URL: {e}"
            print(error_msg)
            self.logging.error(error_msg)
        except Exception as e:
            error_msg = f"An unexpected error occurred: {e}"
            print(error_msg)
            self.logging.error(error_msg)

 

In __init__, the class connects to S3 (Amazon Simple Storage Service) and makes this connection available through self.s3.

Now, for upload_file_to_object_store, here’s a breakdown:

It starts by setting the url where the file is available and the file_key, which determines the path in S3 where we want to save the file.
Next, it attempts to fetch the content (text) from the URL and assigns it to the corpus variable.
Then, it calls the put_object method to insert the corpus into the specified bucket path in S3.
Finally, it handles any exceptions that may occur during this process.

Now, where do the aws_access_key_id and aws_secret_access_key variables come from?

 

aws_access_key_id=self.obj_parameters.access_key_id,
aws_secret_access_key=self.obj_parameters.secret_access_key

 

Exploring ObjectStoreParameters

Now, let’s take a look at ObjectStoreParameters to understand the obj_parameters instance.

Navigate to the parameters.py file to examine ObjectStoreParameters in detail. This class defines and manages the parameters needed for interacting with the object store, such as AWS credentials (aws_access_key_id and aws_secret_access_key).

 

import os

class ObjectStoreParameters:

    def __init__(self):
        self.bucket_name = os.environ.get('BUCKET_NAME')
        self.prefix = os.environ.get('PREFIX_NAME')
        self.access_key_id = os.environ.get('ACCESS_KEY_ID')
        self.secret_access_key = os.environ.get('SECRET_ACCESS_KEY')
        self.DATA_PATH = 'data/'
        self.DATA_NAME = 'tinyshakespeare.txt'
        self.LOG_PATH = '/app/logs/'
        self.LOG_NAME = 'setup_logs.log'

 

As you can see, these variables are either constants or environment-based.

The environment-based variables will need to be defined in the workflow template so that SAP AI Core can make them available within the container when needed. We’ll handle this setup later on, but it’s good to have this understanding now.

Let’s break down the constant variables:

DATA_PATH: This refers to the input artifact path in the S3 bucket (ai://shakespeare/data). In our code, we save tinyshakespeare.txt there, which is where SAP AI Core will look for the input specified in the artifact.
DATA_NAME: This is simply the name of the file (tinyshakespeare.txt).
LOG_PATH: This variable is used to save logs during the step execution. Even though we didn’t create any specific artifact for it beforehand, the workflow template declares it as an output, so SAP AI Core automatically treats it as one of the outcomes of the step execution and creates this folder in the S3 bucket.
LOG_NAME: Similarly, this is just the name of the log file.
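To make this concrete, here’s a tiny, hypothetical sketch of how the environment-based values and these constants combine into the object key the setup step writes to. The prefix value and the fail-fast check are illustrative assumptions, not code from the repository:

import os

# The workflow template must inject these; fail fast if any are missing (illustrative check)
required = ['BUCKET_NAME', 'PREFIX_NAME', 'ACCESS_KEY_ID', 'SECRET_ACCESS_KEY']
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f'Missing environment variables from the workflow template: {missing}')

# Constants mirroring ObjectStoreParameters
DATA_PATH = 'data/'
DATA_NAME = 'tinyshakespeare.txt'

# With PREFIX_NAME set to, say, 'shakespeare/', the resulting key is
# 'shakespeare/data/tinyshakespeare.txt', which is where the input artifact
# ai://shakespeare/data points SAP AI Core.
file_key = f"{os.environ['PREFIX_NAME']}{DATA_PATH}{DATA_NAME}"
print(file_key)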

I hope this explanation gives you a general idea of how these variables work. Now that we understand this, let’s move on to generating two important files: requirements.txt and Dockerfile. Cool, let’s get started!

 

Generating requirements.txt and Dockerfile

Forget manually listing every Python library you need. There’s a much simpler way! A handy tool called pipreqs can scan your project and automatically generate a requirements.txt file.

Here’s how it works:

Open your terminal (command prompt) and navigate to your project folder.
Type this command: pipreqs .
That dot (.) tells pipreqs to look at the files in your current location.

That’s it! Pipreqs will analyze your code and create a requirements.txt file listing all the necessary libraries.
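If pipreqs isn’t available in your environment and you’d rather pin exact versions by hand, a quick sketch like the one below can help. The package list here is just an assumption based on the imports shown earlier, so adjust it to what your code actually uses:

from importlib.metadata import version, PackageNotFoundError

# Assumed packages for the setup step (based on the imports above); adjust as needed
packages = ['boto3', 'requests']

for pkg in packages:
    try:
        print(f'{pkg}=={version(pkg)}')  # paste the output into requirements.txt
    except PackageNotFoundError:
        print(f'# {pkg} is not installed in this environment')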

Next, create a file called Dockerfile (no extension) with the following code.

 

# Use a slim Python 3.9 image as the base layer to minimize the image size
FROM python:3.9-slim

# Create necessary directories within the Docker image
# /app/src: Directory for source code
# /app/logs: Directory for log files
RUN mkdir -p /app/src /app/logs

# Copy the main Python script and requirements file from the local system to the Docker image
COPY main.py /app/src/
COPY requirements.txt /app/src/

# Copy the ShakespeareanGenerator module from the local system to the Docker image
COPY /ShakespeareanGenerator/*.py /app/src/ShakespeareanGenerator/

# Install the required Python packages specified in requirements.txt
# --no-cache-dir: Do not cache the packages, reducing the image size
RUN pip3 install --no-cache-dir -r /app/src/requirements.txt

# Change the group ownership of all files and directories under /app to group ID 65534 (typically 'nogroup')
# Set read, write, and execute permissions for all users on all files and directories under /app
RUN chgrp -R 65534 /app && \
    chmod -R 777 /app

 

Building the Base: We’ll start by using the python:3.9-slim image as the foundation for our custom image. Think of it like a pre-built house with Python 3.9 already installed.
Creating Folders: Next, we’ll create two directories inside the image: /app/src/ and /app/logs/. These will be like rooms in our house – one for our Python code (src/) and another for any logs generated (logs/).
Moving in the Furniture: Now, it’s time to bring in the essential stuff:
  main.py: This is the main file for our Python application, like the living room of the house.
  requirements.txt: This file lists all the additional Python libraries we need, similar to the kitchen appliances.
  ShakespeareanGenerator folder: This contains all the Python files related to generating Shakespearean text, like the bedroom with all the comfy pillows.
Installing the Kitchen Appliances: We’ll use pip3 to install all the libraries listed in requirements.txt, just like setting up the kitchen with the necessary tools.
Setting Permissions: Finally, we’ll adjust some permissions within the image using commands like chgrp and chmod. This is like assigning access rights to different parts of the house (who can read, write, or execute what).

That’s what the Dockerfile does! It creates a custom image with all the necessary components for running our Python application.

 

Building Docker Image

You can run the command for building the Docker image in your repository (local machine) as follows:

 

# Build the Docker image and tag it
docker build -t carlosbasto/shakespeare-setup:0.0.1 .

 

Your folder would be like this now:

The -t option is used to tag the Docker image with a name and version. Here, carlosbasto/shakespeare-setup is the image name in the format <username_or_organization>/<image_name>, and 0.0.1 is the version (tag) number. You should expect to see something similar to the following when you run the command:

Now you’ve got it! The Docker image has been generated and saved on your workstation (machine). Alternatively, if you prefer, Docker Desktop might display it more visually.

To use it with SAP AI Core, the Docker image needs to be in the Docker Registry we set up in the previous blog. To push the image to Docker Hub, simply use this command:

 

docker push <username_or_organization>/<image_name>:0.0.1

 

This command pushes the Docker image to the specified repository on Docker Hub, making it accessible for use with SAP AI Core.
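If you’d rather script these steps than type the CLI commands, the Docker SDK for Python offers an equivalent. This is purely optional and assumes you’re already logged in to Docker Hub (docker login) and have the docker Python package installed:

import docker

client = docker.from_env()  # connects to the local Docker daemon

# Equivalent of `docker build -t carlosbasto/shakespeare-setup:0.0.1 .`
image, build_logs = client.images.build(
    path='.',
    tag='carlosbasto/shakespeare-setup:0.0.1'
)

# Equivalent of `docker push carlosbasto/shakespeare-setup:0.0.1`
for line in client.images.push(
    'carlosbasto/shakespeare-setup', tag='0.0.1', stream=True, decode=True
):
    print(line)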

You might notice something familiar when viewing my image. In my case, you’ll see that some layers are mounted from another image, indicating that my image is leveraging layer reuse and sharing (and that I built several images before). This approach optimizes storage and transmission of Docker images, which is pretty neat!

To confirm that your image is now on Docker Hub, simply log in to the Docker Hub website (or check Docker Desktop, or run the “docker images” command locally, etc.).

Now, your next task is to apply the same steps to the content in the ai-core-training folder. You’ve got this! 

First, generate the image by firing up the build process with this command:

 

docker build -t carlosbasto/shakespeare-training:0.0.1 .

 

And then push it to Docker Hub:

 

docker push carlosbasto/shakespeare-training:0.0.1

 

Then, you should have it on the Docker Hub:

This step might take a bit longer because we’re installing some hefty requirements this time around. But no worries, it’s all part of the process, and it’ll come together just fine! By the way, just to make sure you got this: if you run the Docker command with a “.” at the end, you need to run it from the directory that contains all the training files (*.py, Dockerfile, .txt, etc.).

 

Designing and Understanding the Workflow Template

Think of workflow templates as blueprints for your AI training pipelines. They live in your code repository, like a version-controlled recipe book. SAP AI Core uses Argo Workflows, an open-source project, to actually run these pipelines. It’s like having a handy robot chef that follows your recipes precisely. Mapping these templates is like labeling them correctly. The AI API uses these labels to find the right “recipes” and ingredients (data, models) for your training job.

Basically, workflow templates are the foundation for running your training pipelines within SAP AI Core. They’re like pre-defined workflows that you can easily manage and adapt to your needs. 

Let’s break it down:

apiVersion and kind

These fields specify the version of the API group (argoproj.io/v1alpha1) and the kind of resource (WorkflowTemplate). It means that the template defines a reusable workflow configuration.

 

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate

 

metadata

The metadata section provides information about the workflow template:

Name: The name of the workflow template, which is “shakespeare-model”.
Annotations: Additional metadata annotations describing the scenario, executables, and artifacts associated with the workflow.
Artifacts: These are the important things created by the workflow:
  “Tiny Shakespeare Dataset”: The training data your model learns from.
  “Trained Language Model”: The final Shakespearean text generator you get.
  “Byte-Pair Encoding Tokenizer”: A special tool that helps the model understand words.
  “Setup Logs” and “Model Training Logs”: Records of how the training went.
Labels: These are like keywords that help SAP AI Core understand your workflow better.
Some of these annotations and labels are required by SAP AI Core so it can process them; see the Workflow Templates documentation for more information.

 

metadata:
  name: "shakespeare-model"
  annotations:
    scenarios.ai.sap.com/name: "shakespeare-language-model"
    scenarios.ai.sap.com/description: "Shakespeare Language Model"
    executables.ai.sap.com/name: "Shakespeare-language-model-trainer"
    executables.ai.sap.com/description: "Shakespeare Language Model Trainer Executable"
    artifacts.ai.sap.com/data.kind: "dataset"
    artifacts.ai.sap.com/data.description: "Tiny Shakespeare Dataset"
    artifacts.ai.sap.com/model.kind: "model"
    artifacts.ai.sap.com/model.description: "Trained Language Model"
    artifacts.ai.sap.com/model.labels: |
      {"ext.ai.sap.com/step":"train", "ext.ai.sap.com/version":"0.0.1"}
    artifacts.ai.sap.com/bpe_model.kind: "model"
    artifacts.ai.sap.com/bpe_model.description: "Byte-Pair Encoding Tokenizer"
    artifacts.ai.sap.com/bpe_model.labels: |
      {"ext.ai.sap.com/step":"train", "ext.ai.sap.com/version":"0.0.1"}
    artifacts.ai.sap.com/setuplogs.kind: "other"
    artifacts.ai.sap.com/setuplogs.description: "Setup Logs"
    artifacts.ai.sap.com/setuplogs.labels: |
      {"ext.ai.sap.com/step":"setup", "ext.ai.sap.com/version":"0.0.1"}
    artifacts.ai.sap.com/logs.kind: "other"
    artifacts.ai.sap.com/logs.description: "Model Training Logs"
    artifacts.ai.sap.com/logs.labels: |
      {"ext.ai.sap.com/step":"train", "ext.ai.sap.com/version":"0.0.1"}
  labels:
    scenarios.ai.sap.com/id: "shakespeare-language-model"
    executables.ai.sap.com/id: "shakespeare-trainer"
    ai.sap.com/version: "0.0.1"

 

spec

The spec section defines the specific steps and parameters needed to train your Shakespeare language model using your chosen workflow template.

imagePullSecrets: Defines the secret (shakespeare-docker) used to pull the Docker images we created.
entrypoint: This is the starting point, specifying the template within the workflow to begin with.
arguments: Defines the input parameters (arguments) for the workflow template. They are like dials and knobs you can turn to fine-tune how the training happens. For more information, check the “description” of each parameter.

 

imagePullSecrets:
  - name: shakespeare-docker
entrypoint: core
arguments:
  parameters:
    - name: BATCH_SIZE
      description: The number of training examples processed in one iteration during training. It determines the size of each batch in the training dataset.
    - name: CONTEXT_LENGTH
      description: Defines the maximum length of input sequences, typically representing the number of tokens in each sequence or block of text.
    - name: ITERATION_LIMIT
      description: Specifies the maximum number of iterations or training steps to be performed during the training process. It controls the duration of the training loop.
    - name: EVAL_FREQUENCY
      description: Indicates how often model evaluation occurs during training, measured in the number of iterations or epochs between evaluations.
    - name: EVAL_STEPS
      description: Represents the number of evaluation steps to perform during each evaluation period. It determines the granularity of evaluation within each evaluation cycle.
    - name: LEARNING_RATE
      description: The rate at which the model parameters are updated during training, influencing the size of the steps taken in the parameter space to minimize the loss function.
    - name: EMBEDDING_DIM
      description: Determines the dimensionality of the embedding vectors used to represent tokens in the model. It impacts the expressive power of the model's embedding layer.
    - name: ATTENTION_HEADS
      description: Specifies the number of parallel attention heads in the multi-head attention mechanism of the model. Each head learns different aspects of the input data.
    - name: NUM_LAYERS
      description: Represents the total number of transformer layers in the model architecture. It controls the depth and complexity of the model.
    - name: DROPOUT
      description: The probability of dropping out neurons or connections between layers during training, helping prevent overfitting by randomly deactivating some units.
    - name: DICTIONARY_SIZE
      description: Indicates the size of the vocabulary or dictionary used by the model, representing the total number of unique tokens or words in the dataset vocabulary.

 

templates

The templates section defines the specific actions your workflow will take to train your Shakespeare language model. As we have two steps, it becomes a multi-step workflow. Amazing, don’t you think?

core: This is the main template that controls the overall flow of the training process. It’s like the central hub that directs everything.
steps: These are the individual tasks that happen one after the other. Here are the two main ones:
  Setup: This step prepares everything for training, like uploading the data.
  Train: This step is where the actual model training happens.

 

templates:
  - name: core
    steps:
      - - name: setup
          template: setup-pipeline
      - - name: train
          template: train-pipeline

 

Setup Pipeline (setup-pipeline)

This part defines the “setup” step that happens before training your Shakespeare language model. Here’s where we grab the tinyshakespeare.txt dataset and copy it over to the input path in S3, which we’ve set up in the SAP AI Core configuration as the input artifact location.

 

- name: setup-pipeline
  metadata:
    labels:
      ai.sap.com/resourcePlan: basic
  outputs:
    artifacts:
      - name: setuplogs
        globalName: setuplogs
        path: /app/logs/
        archive:
          none:
            {}
  container:
    image: docker.io/carlosbasto/shakespeare-setup:0.0.1
    imagePullPolicy: Always
    command: ["/bin/sh", "-c"]
    args:
      - python /app/src/main.py
    env:
      - name: BUCKET_NAME
        valueFrom:
          secretKeyRef:
            name: object-store-credentials
            key: bucket
      - name: PREFIX_NAME
        valueFrom:
          secretKeyRef:
            name: object-store-credentials
            key: path_prefix
      - name: ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: object-store-credentials
            key: access_key_id
      - name: SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: object-store-credentials
            key: secret_access_key

 

Name: “setup-pipeline” – This is the official name for this specific step.
Labels: These are like tags that tell SAP AI Core what kind of resources this step needs (e.g., basic resources).
Outputs: This is what the setup step produces:
  Log Files: These files record any important events that happen during the setup process.
Container: This defines the software environment where the setup happens. It’s like a dedicated computer specifically for this task:
  Image: This tells the system where to find the software needed for the setup process.
  Command: This is the specific program that runs within the container to do the setup work. The command field ["/bin/sh", "-c"] indicates that the container will start the /bin/sh shell (sh) with the -c flag, which tells sh to interpret the subsequent string as a command. In this case, the actual command being executed inside the container will be:

 

/bin/sh -c python /app/src/main.py

 

Arguments: These are additional instructions passed to the program to tell it what to do. In our case it is “python /app/src/main.py”, or “run main.py”.
Environment Variables: These act like settings that control how the program runs. They specify details such as where to store data and how to access it. We’re using environment variables to configure the necessary settings within the Docker image for running the setup. These variables will be populated with values from the generic secret we previously created (named object-store-credentials), which holds the S3 connection details. Remember that?

Train Pipeline (train-pipeline)

The train pipeline defines the specific instructions, environment, and adjustable settings needed to train your Shakespeare language model using the provided training data. It also specifies where to store the generated models and training logs.

 

- name: train-pipeline
  metadata:
    labels:
      ai.sap.com/resourcePlan: train.l
  inputs:
    artifacts:
      - name: data
        path: /app/data/
  outputs:
    artifacts:
      - name: model
        globalName: model
        path: /app/model/
        archive:
          none:
            {}
      - name: bpe_model
        path: /app/tokenizer/
        archive:
          none:
            {}
      - name: logs
        path: /app/logs/
        archive:
          none:
            {}
  container:
    image: docker.io/carlosbasto/shakespeare-training:0.0.1
    imagePullPolicy: Always
    command: ["/bin/sh", "-c"]
    args:
      - python /app/src/main.py
    env:
      - name: BATCH_SIZE
        value: "{{workflow.parameters.BATCH_SIZE}}"
      - name: CONTEXT_LENGTH
        value: "{{workflow.parameters.CONTEXT_LENGTH}}"
      - name: ITERATION_LIMIT
        value: "{{workflow.parameters.ITERATION_LIMIT}}"
      - name: EVAL_FREQUENCY
        value: "{{workflow.parameters.EVAL_FREQUENCY}}"
      - name: EVAL_STEPS
        value: "{{workflow.parameters.EVAL_STEPS}}"
      - name: LEARNING_RATE
        value: "{{workflow.parameters.LEARNING_RATE}}"
      - name: EMBEDDING_DIM
        value: "{{workflow.parameters.EMBEDDING_DIM}}"
      - name: ATTENTION_HEADS
        value: "{{workflow.parameters.ATTENTION_HEADS}}"
      - name: NUM_LAYERS
        value: "{{workflow.parameters.NUM_LAYERS}}"
      - name: DROPOUT
        value: "{{workflow.parameters.DROPOUT}}"
      - name: DICTIONARY_SIZE
        value: "{{workflow.parameters.DICTIONARY_SIZE}}"

 

Name: “train-pipeline” – This is the official name for this specific step.
Labels: Similar to setup-pipeline, this metadata tells SAP AI Core that this step requires additional, more powerful resources for training.
Inputs: This is what the training step needs to work with:
  Training Data: This is the data the model learns from, containing Shakespearean text.
Outputs: This is what the training step produces:
  Trained Model: This is the final Shakespearean text generator you get.
  Tokenizer: This is the BPE tokenizer model that we trained on the tiny Shakespeare dataset.
  Training Logs: These files record any important events or progress during the training process.
Container: This defines the software environment where the training happens. It’s like a dedicated computer specifically for this task:
  Image: This tells the system where to find the software needed for training. Here is where you declare the image we built in the previous steps (docker.io/carlosbasto/shakespeare-training:0.0.1); the format is <docker registry>/<repository>/<image name>:<tag/version>.
  Command: When we mention python /app/src/main.py in the command, it’s about running the code for training.
  Arguments: This main.py belongs to the training container we’ve defined, so it’s all about training here, not setup anymore.
  Environment Variables: These are adjustable settings such as batch size, learning rate, and more. They’re what make our model “tunable” in SAP AI Core because you can easily choose values for them during the configuration step. These values come from the configuration setup and are made available as environment variables within the container (the sketch right after this list shows how the training code can read them).
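To see how those {{workflow.parameters.*}} values actually reach the training code, here’s a minimal sketch of a parameters class that reads and casts them. The class name, attribute names, and defaults are illustrative assumptions; the repository’s parameters.py may look different:

import os

class TrainingParameters:
    def __init__(self):
        # Values arrive as strings from the workflow template, so cast them explicitly
        self.batch_size = int(os.environ.get('BATCH_SIZE', '64'))
        self.context_length = int(os.environ.get('CONTEXT_LENGTH', '256'))
        self.iteration_limit = int(os.environ.get('ITERATION_LIMIT', '5000'))
        self.eval_frequency = int(os.environ.get('EVAL_FREQUENCY', '1000'))
        self.eval_steps = int(os.environ.get('EVAL_STEPS', '200'))
        self.learning_rate = float(os.environ.get('LEARNING_RATE', '3e-4'))
        self.embedding_dim = int(os.environ.get('EMBEDDING_DIM', '384'))
        self.attention_heads = int(os.environ.get('ATTENTION_HEADS', '6'))
        self.num_layers = int(os.environ.get('NUM_LAYERS', '6'))
        self.dropout = float(os.environ.get('DROPOUT', '0.2'))
        self.dictionary_size = int(os.environ.get('DICTIONARY_SIZE', '10000'))

params = TrainingParameters()
print(f'Training with batch_size={params.batch_size} and learning_rate={params.learning_rate}')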

 

Deploying Your Shakespeare Language Model Training Workflow

Now that you’ve configured your workflow template, let’s proceed to its deployment on SAP AI Core. We’ll revisit the SAP AI Launchpad and delve deeper into the scenario we created.

Click on it, and you’ll discover the various components we’ve set up in the workflow template – pretty satisfying, right? First, let’s look at the parameters:

Next, take a peek at the inputs and outputs:

These are quite self-explanatory, aren’t they? Great! Now, we’re almost ready to train the model. But first, we need to create a configuration based on this scenario. To do that, navigate to “Configuration” and click on “Create”:

 Here, we just need to provide four pieces of information:

Configuration Name: Choose a fitting name.
Scenario: Select the scenario created from your workflow template (e.g., “shakespeare-language-model” in my case).
Version: Currently, we have only one version (“0.0.1”), but you can have more. Choose the appropriate one here.
Executable: The executable here is the trainer (e.g., “shakespeare-language-model-trainer”). Later, we’ll create others like transfer and tuner.

Now, it’s time to fill in the parameters. Quick tip: if you “Enable Description,” all the descriptions we set in the workflow template will help guide you through filling in the parameters.

Next, you can start experimenting with different parameter values. Don’t hesitate to explore and choose the best options.

As a reminder, this training pipeline only requires one artifact: the training dataset. Now, let’s select the corresponding artifact and map it to the configuration.

If everything looks good, you should see something like this:

Hit the “Create” button. Once you’ve created the configuration, take advantage of the conveniently placed button in the upper right corner of the screen to create an execution:

The “Process Overview” will give you a visual grasp of what will happen:

The “executable” will utilize the “input artifact” mapped in the “configuration” to execute our training pipeline. Awesome, isn’t it?

Metrics Resources and Evaluation

Another cool feature I want to highlight is the “Metric Resources” aspect of our ModelTrainer class. This feature lets us track and fetch metrics for executions and models; in our case, how our training and validation losses evolve over time, giving us good insights into our model’s performance.

Remember when we set up our ModelTrainer to log these metrics? Here’s a snippet of the code we used:
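For context, the snippet assumes the SAP AI Core SDK tracking objects are already set up inside ModelTrainer, roughly like this:

# Assumed setup, not shown in the snippet below:
# from datetime import datetime, timezone
# from ai_core_sdk.models import Metric, MetricCustomInfo
# from ai_core_sdk.tracking import Tracking
# tracking = Tracking()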

 

# Metric Logging: Step Information
training_loss_msg = '{:.4f}'.format(losses['train'])
validation_loss_msg = '{:.4f}'.format(losses['val'])
tracking.log_metrics(
    metrics=[
        Metric(
            name="Training Loss",
            value=float(training_loss_msg),
            timestamp=datetime.now(timezone.utc),
            step=iteration
        ),
        Metric(
            name="Validation Loss",
            value=float(validation_loss_msg),
            timestamp=datetime.now(timezone.utc),
            step=iteration
        ),
    ]
)

 

This code snippet allows us to monitor our training and validation losses at regular intervals (every 1000 iterations in this case) during model training.
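Where do losses['train'] and losses['val'] come from? Typically from a small evaluation helper that averages the loss over a few batches with gradients disabled. Here’s a rough, illustrative sketch; the function and argument names are assumptions, and it assumes the model’s forward pass returns a (logits, loss) pair, so the repository’s actual helper may differ:

import torch

@torch.no_grad()
def estimate_loss(model, get_batch, eval_steps):
    # Average the loss over eval_steps batches for each split, without updating weights
    model.eval()
    losses = {}
    for split in ('train', 'val'):
        batch_losses = torch.zeros(eval_steps)
        for k in range(eval_steps):
            inputs, targets = get_batch(split)
            _, loss = model(inputs, targets)
            batch_losses[k] = loss.item()
        losses[split] = batch_losses.mean().item()
    model.train()
    return losses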

We log these metrics along with additional custom information, such as the number of parameters our model is learning:

 

learning_parameters = sum(p.numel() for p in model.parameters()) / 1e6
msg_to_log = 'The model is learning {} million parameters.'.format(learning_parameters)
logging.info(msg_to_log)
msg_to_metrics = '{} million parameters.'.format(learning_parameters)
tracking.set_custom_info(
    custom_info=[
        MetricCustomInfo(name="Number of Parameters", value=str(msg_to_metrics))
    ]
)

 

And “Epoch Status”:

 

evaluation_msg = 'EPOCH {} | LOSS: Train {:.4f} Valid {:.4f}'.format(
    str(iteration).ljust(5), losses['train'], losses['val'])

logging.info(evaluation_msg)
tracking.set_custom_info(
    custom_info=[
        MetricCustomInfo(name="Epoch Status", value=str(evaluation_msg))
    ]
)

 

Curious to learn more about this approach? Check out the “Generate Metrics and Compare Models in SAP AI Core” tutorial for a deeper dive into tracking and optimizing model performance.

Understanding and Evaluating Model Results

Once the first step is done running, you’ll notice a file has been written to the S3 path (the same one used as an input artifact for our Configuration). 

Additionally, when you check the executions, you’ll find a folder named after your execution ID, like eb3b9d321c22539a. If the first step has already run, you’ll have the logs ready. Feel free to copy them to your local machine for closer inspection:

 

aws s3 cp s3://hcp-f4249aeb-db74-47b2-b5f0-41a00f48224b/shakespeare/executions/<your_execution_id>/setuplogs/setup_logs.log .
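If you’d rather stay in Python than use the AWS CLI, a boto3 equivalent could look like the snippet below; the bucket name, prefix, and execution ID are placeholders you’d replace with your own values:

import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id='<your_access_key_id>',
    aws_secret_access_key='<your_secret_access_key>'
)

# Placeholders: use your own bucket, path prefix, and execution ID
bucket = '<your_bucket_name>'
key = 'shakespeare/executions/<your_execution_id>/setuplogs/setup_logs.log'
s3.download_file(bucket, key, 'setup_logs.log')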

 

Once the second step wraps up, you’ll have everything.

And just as anticipated, the training logs are all there.

 In the execution overview, you’ll get a complete view of the flow.

Pretty much what we were hoping for, huh? Great! Don’t forget to take a look at the artifacts generated from this execution.

The four artifacts we spotted on the execution overview page were successfully generated (and of course, we checked them out directly on S3 too, right?).

 

Wrapping Up and Next Steps

Congratulations on diving into the containerized pipelines for your Shakespearean Language Model! In this blog, we’ve covered the final essential steps to deploy our training pipeline and evaluate the model’s performance.

Let’s recap what we’ve covered:

Generating Docker Images: We created Docker images to handle our training pipeline setup and execution.
Designing Workflow Templates: We learned how to design workflow templates for the training pipeline using SAP AI Core and Argo Workflows.
Deploying the Training Workflow: We deployed the configured workflow template on SAP AI Core.
Evaluating Model Results: We tracked and fetched metrics for executions and models, and analyzed the generated artifacts to evaluate the training results.

Next Steps

Now that we’ve deployed the training pipeline, stay with us for the upcoming blogs in this series, where we’ll explore further functionalities:

Improving Model Training Efficiency: Understand how to use checkpointing and resuming to make model training more efficient.
[SAP AI Core is All You Need | 4. Improving Model Training Efficiency with Checkpointing/Resuming]
Fine-Tuning with Low-Rank Adaptation (LoRA): Learn how to use LoRA to fine-tune models with fewer parameters, making the process more efficient and effective.
[SAP AI Core is All You Need | 5. Fine Tuning with Low-Rank Adaptation (LoRA)]
Fine-Tuning Pipeline: Dive into fine-tuning techniques to enhance model performance on specific datasets or tasks. We’ll explore the deployment of fine-tuning pipelines using SAP AI Core, as well as model deployment and serving using KServe with SAP AI Core. Learn how to efficiently serve fine-tuned models for real-world applications.
[SAP AI Core is All You Need | 6. Serving Shakespeare Model using SAP AI Core and KServe]
Sampling and Consuming Language Models: Discover methods for sampling from trained language models and integrating them into applications.
[SAP AI Core is All You Need | 7. Deploying Language Models for Text Generation]
Developing a Language-Model-Based App: Gain insights into building an application powered by your trained language model.
[SAP AI Core is All You Need | 8. Consuming and Sampling from Shakespeare Language Models]

Further References

Source Code: GitHub repository
SAP AI Core Help
SAP AI Launchpad Help
Argo Workflow Templates
Docker Image Builder

from ShakespeareanGenerator.logger import Logger
from ShakespeareanGenerator.artifact_manager import ObjectStoreArtifactManager

class Run:
def __init__(self):
self.logging = Logger()
self.obj = ObjectStoreArtifactManager()
self.prepare_data()

def prepare_data(self):
self.logging.info(‘START: PREPARATION STEP’)
self.obj.upload_file_to_object_store()
self.logging.info(‘Training Data was uploaded to Object Store’)
self.logging.info(‘END: PREPARATION STEP’)

if __name__ == ‘__main__’:
Run() Take a look at this snippet! It simply instantiates the ObjectStoreArtifactManager class and calls upload_file_to_object_store(). Pretty straightforward, right?Exploring ObjectStoreArtifactManagerThis method is designed to upload files to the object store, which (I hope) should be self-explanatory. import boto3
import requests
from ShakespeareanGenerator.logger import Logger
from ShakespeareanGenerator.parameters import ObjectStoreParameters

class ObjectStoreArtifactManager:

def __init__(self):
self.logging = Logger()
self.obj_parameters = ObjectStoreParameters()
self.s3 = boto3.client(
‘s3’,
aws_access_key_id=self.obj_parameters.access_key_id,
aws_secret_access_key=self.obj_parameters.secret_access_key
)

def upload_file_to_object_store(self):
url = “<link_to_github_repository>.tinyshakespeare.txt”

file_key = f”{self.obj_parameters.prefix}{self.obj_parameters.DATA_PATH + self.obj_parameters.DATA_NAME}”
try:
response = requests.get(url)
response.raise_for_status() # Raise an exception for HTTP errors
corpus = response.text
corpus = “<b>”.join(corpus.split(‘n’))
self.s3.put_object(
Bucket=self.obj_parameters.bucket_name,
Key=file_key,
Body=corpus.encode(‘utf-8’)
)
self.logging.info(f”Uploaded tinyshakespeare.txt to S3 path: {file_key}”)
except requests.RequestException as e:
error_msg = f”Error fetching data from URL: {e}”
print(error_msg)
self.logging.error(error_msg)
except Exception as e:
error_msg = f”An unexpected error occurred: {e}”
print(error_msg)
self.logging.error(error_msg) In __init__, the class connects to S3 (Amazon Simple Storage Service) and makes this connection available through self.S3.Now, for upload_file_to_object_store, here’s a breakdown:It starts by setting the url where the file is available and the file_key, which determines the path in S3 where we want to save the file.Next, it attempts to fetch the content (text) from the URL and sets it as the corpus variable.Then, it calls the put_object method to insert the corpus into the specified bucket path in S3.Finally, it handles any exceptions that may occur during this process.Now, where do the aws_access_key_id and aws_secret_access_key variables come from? aws_access_key_id=self.obj_parameters.access_key_id,
aws_secret_access_key=self.obj_parameters.secret_access_key Exploring ObjectStoreParametersNow, let’s take a look at ObjectStoreParameters to understand the obj_parameters instance.Navigate to the parameters.py file to examine ObjectStoreParameters in detail. This class defines and manages the parameters needed for interacting with the object store, such as AWS credentials (aws_access_key_id and aws_secret_access_key). import os

class ObjectStoreParameters:
def __init__(self):
self.bucket_name = os.environ.get(‘BUCKET_NAME’)
self.prefix = os.environ.get(‘PREFIX_NAME’)
self.access_key_id = os.environ.get(‘ACCESS_KEY_ID’)
self.secret_access_key = os.environ.get(‘SECRET_ACCESS_KEY’)
self.DATA_PATH = ‘data/’
self.DATA_NAME = ‘tinyshakespeare.txt’
self.LOG_PATH = ‘/app/logs/’
self.LOG_NAME = ‘setup_logs.log’ As you can see, these variables are either constants or environment-based.The environment-based variables will need to be defined in the workflow template so that SAP AI Core can make them available within the container when needed. We’ll handle this setup later on, but it’s good to have this understanding now.Let’s break down the constant variables:DATA_PATH: This refers to the input artifact path in the S3 Bucket (ai://shakespeare/data). In our code, we save tinyshakespeare.txt there, which is where SAP AI Core will look for the input specified in the artifact.DATA_NAME: This is simply the name of the file (tinyshakespeare.txt).LOG_PATH: This variable is used to save logs during the step execution. Even though we didn’t create any specific artifact for it beforehand, in the workflow template, we’ll refer to it as “output,” and SAP AI Core will automatically understand it as one of the outcomes from the step execution. SAP AI Core will create this folder in the S3 Bucket as it’s designated as an output.LOG_NAME: Similarly, this is just the name of the log file.I hope this explanation gives you a general idea of how these variables work. Now that we understand this, let’s move on to generating two important files: requirements.txt and Dockerfile. Cool, let’s get started! Generating requirements.txt and DockerfileForget manually listing every Python library you need. There’s a much simpler way! A handy tool called pipreqs can scan your project and automatically generate a requirements.txt file.Here’s how it works:Open your terminal (command prompt) and navigate to your project folder.Type this command: pipreqs .That dot (.) tells pipreqs to look at the files in your current location.That’s it! Pipreqs will analyze your code and create a requirements.txt file listing all the necessary libraries.Next, create a file called Dockerfile (no extension) with the following code. # Use a slim Python 3.9 image as the base layer to minimize the image size
FROM python:3.9-slim

# Create necessary directories within the Docker image
# /app/src: Directory for source code
# /app/logs: Directory for log files
RUN mkdir -p /app/src /app/logs

# Copy the main Python script and requirements file from the local system to the Docker image
COPY main.py /app/src/
COPY requirements.txt /app/src/

# Copy the ShakespeareanGenerator module from the local system to the Docker image
COPY /ShakespeareanGenerator/*.py /app/src/ShakespeareanGenerator/

# Install the required Python packages specified in requirements.txt
# –no-cache-dir: Do not cache the packages, reducing the image size
RUN pip3 install –no-cache-dir -r /app/src/requirements.txt

# Change the group ownership of all files and directories under /app to the group ID 65534 (typically ‘nogroup’)
# Set read, write, and execute permissions for all users on all files and directories under /app
RUN chgrp -R 65534 /app &&
chmod -R 777 /app Building the Base: We’ll start by using the python:3.9-slim image as the foundation for our custom image. Think of it like a pre-built house with Python 3.9 already installed.Creating Folders: Next, we’ll create two directories inside the image: /app/src/ and /app/logs/. These will be like rooms in our house – one for our Python code (src/) and another for any logs generated (logs/).Moving in the Furniture: Now, it’s time to bring in the essential stuff:main.py: This is the main file for our Python application, like the living room of the house.requirements.txt: This file lists all the additional Python libraries we need, similar to the kitchen appliances.ShakespeareanGenerator folder: This contains all the Python files related to generating Shakespearean text, like the bedroom with all the comfy pillows.Installing the Kitchen Appliances: We’ll use pip3 to install all the libraries listed in requirements.txt, just like setting up the kitchen with the necessary tools.Setting Permissions: Finally, we’ll adjust some permissions within the image using commands like chgrp and chmod. This is like assigning access rights to different parts of the house (who can read, write, or execute what).That’s what the Dockerfile does! It creates a custom image with all the necessary components for running our Python application. Building Docker ImageYou can run the command for building the Docker image in your repository (local machine) as follows: # Build the Docker image and tag it
docker build -t carlosbasto/shakespeare-setup:0.0.1 . Your folder would be like this now:The -t option is used to tag the Docker image with a name and version. Here, carlosbasto/shakespeare-setup is the image name in the format <username_or_organization>/<image_name>:0.0.1, where 0.0.1 is the version number. You should expect to see something similar to the following when you run the command:Now you’ve got it! The Docker image has been generated and saved on your workstation (machine). Alternatively, if you prefer, Docker Desktop might display it more visually.To use it with SAP AI Core, the Docker image needs to be in the Docker Registry we set up in the previous blog. To push the image to Docker Hub, simply use this command: docker push <username_or_organization>/<image_name>:0.0.1 This command pushes the Docker image to the specified repository on Docker Hub, making it accessible for use with SAP AI Core.You might notice something familiar when viewing my image. In my case, you’ll see that some layers are mounted from another image, indicating that my image is leveraging layer reuse and sharing (and that I built several images before ?). This approach optimizes storage and transmission of Docker images, which is pretty neat!To confirm that your image is now on Docker Hub, simply log in to their website (or docker desktop, or shell command “docker images” etc.).Now, your next task is to apply the same steps to the content in the ai-core-training folder. You’ve got this! First generate the image, by firing up the build process with this command: docker build -t carlosbasto/shakespeare-training:0.0.1 . And the push it to the docker hub: docker push carlosbasto/shakespeare-training:0.0.1 Then, you should have it on the Docker Hub:This step might take a bit longer because we’re installing some hefty requirements this time around. But no worries, it’s all part of the process, and it’ll come together just fine! By the way, just to make sure you got this: if you run the Docker command with a “.” at the end, you’ll need to be in the repository that contains all the training files (*.py, Dockerfile, .txt, etc.). Designing and Understanding the Workflow TemplateThink of workflow templates as blueprints for your AI training pipelines. They live in your code repository, like a version-controlled recipe book. SAP AI Core uses Argo Workflows, an open-source project, to actually run these pipelines. It’s like having a handy robot chef that follows your recipes precisely. Mapping these templates is like labeling them correctly. The AI API uses these labels to find the right “recipes” and ingredients (data, models) for your training job.Basically, workflow templates are the foundation for running your training pipelines within SAP AI Core. They’re like pre-defined workflows that you can easily manage and adapt to your needs. Let’s break it down:apiVersion and kindThese fields specify the version of the API group (argoproj.io/v1alpha1) and the kind of resource (WorkflowTemplate). It means that the template defines a reusable workflow configuration. apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate metadataThe metadata section provides information about the workflow template:Name: The name of the workflow template, which is “shakespeare-model”.Annotations: Additional metadata annotations describing the scenario, executables, and artifacts associated with the workflow. Artifacts: These are the important things created by the workflow:”Tiny Shakespeare Dataset”: The training data your model learns from.”Trained Language Model”: The final Shakespearean text generator you get.”Byte-Pair Encoding Tokenizer”: A special tool that helps the model understand words.”Setup Logs” and “Model Training Logs”: Records of how the training went.Labels: These are like keywords that help SAP AI Core understand your workflow better.Some of these annotations and labels are required by SAP AI Core so it can process them, get more information on Workflow Templates. metadata:
name: “shakespeare-model”
annotations:
scenarios.ai.sap.com/name: “shakespeare-language-model”
scenarios.ai.sap.com/description: “Shakespeare Language Model”
executables.ai.sap.com/name: “Shakespeare-language-model-trainer”
executables.ai.sap.com/description: “Shakespeare Language Model Trainer Executable”
artifacts.ai.sap.com/data.kind: “dataset”
artifacts.ai.sap.com/data.description: “Tiny Shakespeare Dataset”
artifacts.ai.sap.com/model.kind: “model”
artifacts.ai.sap.com/model.description: “Trained Language Model”
artifacts.ai.sap.com/model.labels: |
{“ext.ai.sap.com/step”:”train”, “ext.ai.sap.com/version”:”0.0.1″}
artifacts.ai.sap.com/bpe_model.kind: “model”
artifacts.ai.sap.com/bpe_model.description: “Byte-Pair Encoding Tokenizer”
artifacts.ai.sap.com/bpe_model.labels: |
{“ext.ai.sap.com/step”:”train”, “ext.ai.sap.com/version”:”0.0.1″}
artifacts.ai.sap.com/setuplogs.kind: “other”
artifacts.ai.sap.com/setuplogs.description: “Setup Logs”
artifacts.ai.sap.com/setuplogs.labels: |
{“ext.ai.sap.com/step”:”setup”, “ext.ai.sap.com/version”:”0.0.1″}
artifacts.ai.sap.com/logs.kind: “other”
artifacts.ai.sap.com/logs.description: “Model Training Logs”
artifacts.ai.sap.com/logs.labels: |
{“ext.ai.sap.com/step”:”train”, “ext.ai.sap.com/version”:”0.0.1″}
labels:
scenarios.ai.sap.com/id: “shakespeare-language-model”
executables.ai.sap.com/id: “shakespeare-trainer”
ai.sap.com/version: “0.0.1” specThe spec section defines the specific steps and parameters needed to train your Shakespeare language model using your chosen workflow template.imagePullSecrets: Defines the secret (shakespeare-docker) used to pull the Docker images we created.entrypoint: This is the starting point which specifies the template within the workflow to begin with.arguments: Defines the input parameters (arguments) for the workflow template. They are like dials and knobs you can turn to fine-tune how the training happens. For more information, check the “description” of each “parameters”.  imagePullSecrets:
– name: shakespeare-docker
entrypoint: core
arguments:
parameters:
– name: BATCH_SIZE
description: The number of training examples processed in one iteration during training. It determines the size of each batch in the training dataset.
– name: CONTEXT_LENGTH
description: Defines the maximum length of input sequences, typically representing the number of tokens in each sequence or block of text.
– name: ITERATION_LIMIT
description: Specifies the maximum number of iterations or training steps to be performed during the training process. It controls the duration of the training loop.
– name: EVAL_FREQUENCY
description: Indicates how often model evaluation occurs during training, measured in the number of iterations or epochs between evaluations.
– name: EVAL_STEPS
description: Represents the number of evaluation steps to perform during each evaluation period. It determines the granularity of evaluation within each evaluation cycle.
– name: LEARNING_RATE
description: The rate at which the model parameters are updated during training, influencing the size of the steps taken in the parameter space to minimize the loss function.
– name: EMBEDDING_DIM
description: Determines the dimensionality of the embedding vectors used to represent tokens in the model. It impacts the expressive power of the model’s embedding layer.
– name: ATTENTION_HEADS
description: Specifies the number of parallel attention heads in the multi-head attention mechanism of the model. Each head learns different aspects of the input data.
– name: NUM_LAYERS
description: Represents the total number of transformer layers in the model architecture. It controls the depth and complexity of the model.
– name: DROPOUT
description: The probability of dropping out neurons or connections between layers during training, helping prevent overfitting by randomly deactivating some units.
– name: DICTIONARY_SIZE
description: Indicates the size of the vocabulary or dictionary used by the model, representing the total number of unique tokens or words in the dataset vocabulary. templatesThe templates section defines the specific actions your workflow will take to train your Shakespeare language model. As we have 2 steps, it becomes multi-step workflow. Amazing, don’t you think? core: This is the main template that controls the overall flow of the training process. It’s like the central hub that directs everything.steps: These are the individual tasks that happen one after the other. Here are the two main ones:Setup: This step prepares everything for training, like uploading the data.Train: This step is where the actual model training happens.  templates:
– name: core
steps:
– – name: setup
template: setup-pipeline
– – name: train
template: train-pipeline Setup Pipeline (setup-pipeline)This part defines the “setup” step that happens before training your Shakespeare language model.Here’s where we grab the tinyshakespeare.txt dataset and copy it over to the input path in S3, which we’ve set up in the SAP AI Core config as the input artifact location.  – name: setup-pipeline
metadata:
labels:
ai.sap.com/resourcePlan: basic
outputs:
artifacts:
– name: setuplogs
globalName: setuplogs
path: /app/logs/
archive:
none:
{}
container:
image: docker.io/carlosbasto/shakespeare-setup:0.0.1
imagePullPolicy: Always
command: [“/bin/sh”, “-c”]
args:
– python /app/src/main.py
env:
– name: BUCKET_NAME
valueFrom:
secretKeyRef:
name: object-store-credentials
key: bucket
– name: PREFIX_NAME
valueFrom:
secretKeyRef:
name: object-store-credentials
key: path_prefix
– name: ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: object-store-credentials
key: access_key_id
– name: SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: object-store-credentials
key: secret_access_key  Name: “setup-pipeline” – This is the official name for this specific step.Labels: These are like tags that tell SAP AI Core what kind of resources this step needs (e.g., basic resources).Outputs: This is what the setup step produces:Log Files: These files record any important events that happen during the setup process.Container: This defines the software environment where the setup happens. It’s like a dedicated computer specifically for this task:Image: This tells the system where to find the software needed for the setup process.Command: This is the specific program that runs within the container to do the setup work. For the command field [“/bin/sh”, “-c”], it indicates that the container will start with the /bin/sh shell (sh) and run the -c flag, which tells sh to interpret the subsequent string as a command. In this case, the actual command being executed inside the container will be: /bin/sh -c python /app/src/main.py Arguments: These are additional instructions passed to the program to tell it what to do. In our case it is “- python /app/src/main.py” or “run the main.py”.Environment Variables: These act like settings that control how the program runs. They specify details such as where to store data and how to access it. We’re using environment variables to configure the necessary settings within the Docker image for running the setup. These variables will be populated with values from the generic secret we previously created (named object-store-credentials) which holds the S3 connection details. Remember that?Train Pipeline (train-pipeline)the Train Pipeline defines the specific instructions, environment, and adjustable settings needed to train your Shakespeare language model using the provided training data. It also specifies where to store the generated models and training logs.  – name: train-pipeline
metadata:
labels:
ai.sap.com/resourcePlan: train.l
inputs:
artifacts:
– name: data
path: /app/data/
outputs:
artifacts:
– name: model
globalName: model
path: /app/model/
archive:
none:
{}
– name: bpe_model
path: /app/tokenizer/
archive:
none:
{}
– name: logs
path: /app/logs/
archive:
none:
{}
container:
image: docker.io/carlosbasto/shakespeare-training:0.0.1
imagePullPolicy: Always
command: [“/bin/sh”, “-c”]
args:
– python /app/src/main.py
env:
– name: BATCH_SIZE
value: “{{workflow.parameters.BATCH_SIZE}}”
– name: CONTEXT_LENGTH
value: “{{workflow.parameters.CONTEXT_LENGTH}}”
– name: ITERATION_LIMIT
value: “{{workflow.parameters.ITERATION_LIMIT}}”
– name: EVAL_FREQUENCY
value: “{{workflow.parameters.EVAL_FREQUENCY}}”
– name: EVAL_STEPS
value: “{{workflow.parameters.EVAL_STEPS}}”
– name: LEARNING_RATE
value: “{{workflow.parameters.LEARNING_RATE}}”
– name: EMBEDDING_DIM
value: “{{workflow.parameters.EMBEDDING_DIM}}”
– name: ATTENTION_HEADS
value: “{{workflow.parameters.ATTENTION_HEADS}}”
– name: NUM_LAYERS
value: “{{workflow.parameters.NUM_LAYERS}}”
– name: DROPOUT
value: “{{workflow.parameters.DROPOUT}}”
– name: DICTIONARY_SIZE
value: “{{workflow.parameters.DICTIONARY_SIZE}}” Name: “train-pipeline” – This is the official name for this specific step.Labels: Similar to setup-pipeline, this metadata tells SAP AI Core that it requires additional, more powerful resources for training the machine.Inputs: This is what the training step needs to work with:Training Data: This is the data the model learns from, containing Shakespearean text.Outputs: This is what the training step produces:Trained Model: This is the final Shakespearean text generator you get.Tokenizer: This is the BPE Tokenizer model that we trained on tiny Shakespeare dataset.Training Logs: These files record any important events or progress during the training process.Container: This defines the software environment where the training happens. It’s like a dedicated computer specifically for this task:Image: This tells the system where to find the software needed for training. Here is where you declare the image we built in previous steps (docker.io/carlosbasto/shakespeare-setup:0.0.1) the format is like <docker registry>/<repository>/<image name>:<tag/version>.Command: When we mention – python /app/src/main.py in the command, it’s about running the code for training. Arguments: This main.py belongs to another container we’ve defined, so it’s all about training here, not setup anymore.Environment Variables: These are like adjustable settings such as batch size, learning rate, and more. They’re what make our model “tunable” in SAP AI Core because you can easily choose values for them during the configuration step. Right now, these values are coming from the configuration setup and are made available as environment variables within the container. Deploying Your Shakespeare Language Model Training WorkflowNow that you’ve configured your workflow template, let’s proceed to its deployment on SAP AI Core. We’ll revisit the SAP AI Launchpad and delve deeper into the scenario we created.Click on it, and you’ll discover the various components we’ve set up in the workflow template – pretty satisfying, right? ? First, let’s look at the parameters:Next, take a peek at the inputs and outputs:These are quite self-explanatory, aren’t they? Great! Now, we’re almost ready to train the model. But first, we need to create a configuration based on this scenario. To do that, navigate to “Configuration” and click on “Create”: Here, we just need to provide four pieces of information:Configuration Name: Choose a fitting name.Scenario: Select the scenario created from your workflow template (e.g., “shakespeare-language-model” in my case).Version: Currently, we have only one version (“0.0.1”), but you can have more. Choose the appropriate one here.Executable: The executable here is the trainer (e.g., “shakespeare-language-model-trainer”). Later, we’ll create others like transfer and tuner.Now, it’s time to fill in the parameters. Quick tip: if you “Enable Description,” all the descriptions we set in the workflow template will help guide you through filling in the parameters.Next, you can start experimenting with different parameter values. Don’t hesitate to explore and choose the best options.As a reminder, this training pipeline only requires one artifact: the training dataset. Now, let’s select the corresponding artifact and map it to the configuration.If everything looks good, you should see something like this:Hit the “Create” button. 
Once you've created the configuration, use the conveniently placed button in the upper right corner of the screen to create an execution. The "Process Overview" gives you a visual grasp of what will happen: the executable uses the input artifact mapped in the configuration to run our training pipeline. Awesome, isn't it?

Metric Resources and Evaluation

Another cool feature I want to highlight is the "Metric Resources" aspect of our ModelTrainer class. It lets us track and fetch metrics for executions and models – in our case, how the training and validation losses evolve over time, which gives us good insight into the model's performance.

Remember when we set up our ModelTrainer to log these metrics? Here's a snippet of the code we used:
# Metric Logging: Step Information
# Format the losses to four decimal places before logging them as step metrics.
training_loss_msg = '{:.4f}'.format(losses['train'])
validation_loss_msg = '{:.4f}'.format(losses['val'])
tracking.log_metrics(
    metrics=[
        Metric(
            name="Training Loss",
            value=float(training_loss_msg),
            timestamp=datetime.now(timezone.utc),
            step=iteration
        ),
        Metric(
            name="Validation Loss",
            value=float(validation_loss_msg),
            timestamp=datetime.now(timezone.utc),
            step=iteration
        ),
    ]
)

This snippet lets us monitor the training and validation losses at regular intervals (every 1,000 iterations in this case) during model training.
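One note before the next snippets: these tracking calls assume the SAP AI Core SDK's tracking client has been set up once in the module. Here is a minimal sketch of that setup – the import paths follow the "Generate Metrics and Compare Models in SAP AI Core" tutorial mentioned later, so double-check them against the ai-core-sdk version you have installed:

from datetime import datetime, timezone

# Metric models and tracking client from the SAP AI Core SDK (ai-core-sdk).
# Import paths as used in the SAP tutorial; verify against your installed SDK version.
from ai_core_sdk.models import Metric, MetricCustomInfo
from ai_core_sdk.tracking import Tracking

# A single client instance is enough; it is reused for every
# log_metrics() and set_custom_info() call inside ModelTrainer.
tracking = Tracking()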
We log these metrics along with additional custom information, such as the number of parameters our model is learning:

# Metric Logging: Number of Parameters
learning_parameters = sum(p.numel() for p in model.parameters()) / 1e6
msg_to_log = 'The model is learning {} million parameters.'.format(learning_parameters)
logging.info(msg_to_log)
msg_to_metrics = '{} million parameters.'.format(learning_parameters)
tracking.set_custom_info(
    custom_info=[
        MetricCustomInfo(name="Number of Parameters", value=str(msg_to_metrics))
    ]
)

And the "Epoch Status":
# Metric Logging: Epoch Status
evaluation_msg = 'EPOCH {} | LOSS: Train {:.4f} Valid {:.4f}'.format(
    str(iteration).ljust(5), losses['train'], losses['val'])

logging.info(evaluation_msg)
tracking.set_custom_info(
    custom_info=[
        MetricCustomInfo(name="Epoch Status", value=str(evaluation_msg))
    ]
)
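Putting the three snippets together: inside the training loop, all of this logging only fires at the evaluation interval. Here is a condensed sketch of how that hook might look – eval_interval, max_iterations, and the stubbed estimate_loss() are assumptions standing in for the actual ModelTrainer internals, not the repository's exact code:

import random
from datetime import datetime, timezone

from ai_core_sdk.models import Metric, MetricCustomInfo
from ai_core_sdk.tracking import Tracking

tracking = Tracking()
eval_interval = 1000   # log every 1,000 iterations, as in the blog
max_iterations = 5000  # placeholder value

def estimate_loss():
    # Stand-in for the real evaluation pass over training and validation batches.
    return {'train': random.random(), 'val': random.random()}

for iteration in range(max_iterations):
    if iteration % eval_interval == 0:
        losses = estimate_loss()

        # Step metrics: training and validation loss tied to the current step.
        tracking.log_metrics(metrics=[
            Metric(name="Training Loss", value=float('{:.4f}'.format(losses['train'])),
                   timestamp=datetime.now(timezone.utc), step=iteration),
            Metric(name="Validation Loss", value=float('{:.4f}'.format(losses['val'])),
                   timestamp=datetime.now(timezone.utc), step=iteration),
        ])

        # Custom info: a human-readable status line surfaced in SAP AI Launchpad.
        evaluation_msg = 'EPOCH {} | LOSS: Train {:.4f} Valid {:.4f}'.format(
            str(iteration).ljust(5), losses['train'], losses['val'])
        tracking.set_custom_info(custom_info=[
            MetricCustomInfo(name="Epoch Status", value=evaluation_msg)
        ])

    # ... the forward pass, backward pass, and optimizer step happen here ...

The losses then show up as metric resources for the execution, while "Number of Parameters" and "Epoch Status" appear as custom information alongside it.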
Curious to learn more about this approach? Check out the "Generate Metrics and Compare Models in SAP AI Core" tutorial for a deeper dive into tracking and optimizing model performance.

Understanding and Evaluating Model Results

Once the first step finishes running, you'll notice a file has been written to the S3 path (the same one used as the input artifact for our configuration). Additionally, when you check the executions, you'll find a folder named after your execution ID, such as eb3b9d321c22539a. If the first step has already run, the logs are ready; feel free to copy them to your local machine for closer inspection:

aws s3 cp s3://hcp-f4249aeb-db74-47b2-b5f0-41a00f48224b/shakespeare/executions/<your_execution_id>/setuplogs/setup_logs.log .

Once the second step wraps up, you'll have everything, and, just as anticipated, the training logs are all there. The execution overview gives you a complete view of the flow. Pretty much what we were hoping for, huh? Great! Don't forget to take a look at the artifacts generated from this execution: the four artifacts we spotted on the execution overview page were successfully generated (and of course, we checked them out directly on S3 too, right?).

Wrapping Up and Next Steps

Congratulations on diving into containerized pipelines for your Shakespearean Language Model! In this blog, we covered the final steps needed to deploy our training pipeline and evaluated the model's performance.

Let's recap what we've covered:
Generating Docker Images: We created Docker images to handle our training pipeline setup and execution.
Designing Workflow Templates: We learned how to design workflow templates for the training pipeline using SAP AI Core and Argo Workflows.
Deploying the Training Workflow: We deployed the configured workflow template on SAP AI Core.
Evaluating Model Results: We tracked and fetched metrics for executions and models, and analyzed the generated artifacts to evaluate the training results.

Next Steps

Now that we've deployed the training pipeline, stay with us for the upcoming blogs in this series, where we'll explore further functionalities:
Improving Model Training Efficiency: Understand how to use checkpointing and resuming to make model training more efficient. [SAP AI Core is All You Need | 4. Improving Model Training Efficiency with Checkpointing/Resuming]
Fine-Tuning with Low-Rank Adaptation (LoRA): Learn how to use LoRA to fine-tune models with fewer parameters, making the process more efficient and effective. [SAP AI Core is All You Need | 5. Fine Tuning with Low-Rank Adaptation (LoRA)]
Fine-Tuning Pipeline: Dive into fine-tuning techniques to enhance model performance on specific datasets or tasks, explore the deployment of fine-tuning pipelines with SAP AI Core, and learn how to efficiently serve fine-tuned models for real-world applications using KServe. [SAP AI Core is All You Need | 6. Serving Shakespeare Model using SAP AI Core and KServe]
Sampling and Consuming Language Models: Discover methods for sampling from trained language models and integrating them into applications. [SAP AI Core is All You Need | 7. Deploying Language Models for Text Generation]
Developing a Language-Model-Based App: Gain insights into building an application powered by your trained language model. [SAP AI Core is All You Need | 8. Consuming and Sampling from Shakespeare Language Models]

Further References

Source Code: GitHub repository
SAP AI Core Help
SAP AI Launchpad Help
Argo Workflow Templates
Docker Image Builder