SAP AI Core is All You Need | 4. Improving Model Training Efficiency with Checkpointing/Resuming


Introduction

Welcome back to our series “SAP AI Core is All You Need”.

Hey there! In this blog, let’s dive into AI checkpointing – a common technique that’s like a safety net for your model training adventures. Imagine you’re training a machine learning model, pouring in hours of computing power and heaps of data. Suddenly, bam! The GPUs hit their limit on the cluster, or the boss told us to stop because we’re hogging all the resources, or sometimes there’s just a disaster in the data center. All that hard work – gone in an instant? Not with checkpointing!

What to Expect

In this blog, you will gain hands-on experience with the following key concepts:

Understanding Checkpointing: Learn the importance of checkpointing in machine learning and how it can save time and resources by allowing you to resume training from the last successful point.
Creating Separate Docker Images: Explore the benefits of modular design and scalability by using separate Docker images for different stages of the training process.
Adapting Code for Checkpointing: Understand how to modify your code to support checkpointing, ensuring that your model can efficiently resume training.
Configuring Docker Images and Workflow Templates: Learn to set up Docker images and workflow templates to manage your checkpointing process effectively.
Deploying the Checkpointer Workflow: See how to deploy your configured workflow template on SAP AI Core and evaluate the results of your checkpointing strategy.

Understanding Checkpointing/Resuming Training

What’s Checkpointing, Anyway? Checkpointing is like hitting “save” during a video game. It’s a smart strategy that periodically saves the state of your model during training. This means capturing critical info like the model’s weights, biases, and other parameters at different stages of the training journey.

Checkpointing isn’t just a safety net; it’s a game-changer. It saves valuable time and resources by allowing you to resume training from the last successful point. No need to redo everything from scratch. This can be a lifesaver for big, complex models that take days or weeks to train. It’s also a powerful tool for keeping an eye on your model’s progress. By saving checkpoints regularly, you can monitor how your model evolves over time. Spotting trends or issues early on can help you fine-tune your approach and achieve better results.
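To make the “save game” idea concrete, here is a minimal sketch of a toy training loop that saves its state every few steps and resumes from the last save after an interruption. Everything in it is an assumption for illustration: the function names, the `checkpoint.pkl` file name, and the running sum standing in for real model weights are not part of the Shakespeare project.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write to a temp file first, then rename, so a crash mid-write
    never leaves a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

def train(path, total_steps, save_every, stop_after=None):
    """Toy loop: 'weights' is a running sum standing in for a real model."""
    state = load_checkpoint(path) or {"step": 0, "weights": 0.0}
    while state["step"] < total_steps:
        state["step"] += 1
        state["weights"] += 0.5  # stand-in for one optimizer update
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] == stop_after:
            return state  # simulate an interruption before the next save
    save_checkpoint(path, state)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
interrupted = train(ckpt, total_steps=10, save_every=2, stop_after=5)
resumed = train(ckpt, total_steps=10, save_every=2)  # restarts from step 4
```

The interrupted run stops at step 5, but the last save happened at step 4, so the second call only repeats one step instead of all five. That gap between “last save” and “crash” is the trade-off you tune with the save frequency.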

So, whether you’re training a language model or teaching your AI to recognize cats from dogs, remember the magic of checkpointing. It’s your safety net, your progress tracker, and your ticket to more efficient AI adventures. Happy checkpointing!

 

Leveraging Separate Docker Images for Enhanced Checkpointing and Scalability

Alright, so we’ve established that checkpointing is a lifesaver for your AI projects. But here’s the twist: we’re going to take things a step further by implementing it in a separate Docker image.

Now, you might be wondering, “Why not just stick everything in one image?” That’s a fair question! Honestly, the main reason is “to learn more”. A single image works, but keeping things separate offers some sweet advantages for your learning journey:

Modular Design: Think of it like building with Legos. A separate checkpointing image acts like a dedicated “checkpointing brick.” This keeps your training script nice and clean, focusing solely on the training logic.
Scalability Potential: Imagine training multiple models at once. With separate images, you can easily manage checkpoints for each model independently, making things more scalable in the long run.
Learning Opportunity: Let’s be honest, we’re all on a journey of AI mastery. Deploying things on SAP BTP with SAP AI Core is a fantastic way to gain experience in a powerful platform specifically designed for AI workloads. It’s like adding a new tool to your AI toolbox!
Warm-up: This is a good chance to delve deeper into artifacts, workflows, and SAP AI Core, gearing up for fine-tuning our Shakespeare Language Model with some exciting tasks. It serves as a useful warm-up for the adjustments we’ll cover in the next blog post about Fine-Tuning.

So, while keeping everything in one image might seem simpler at first, taking the “separate image” route offers a cleaner architecture, potential scalability, and a valuable learning experience on a robust platform like SAP BTP. Remember, the goal is to build robust and efficient AI models, and sometimes, a little extra planning goes a long way. So, embrace the power of separate images, and watch your machine learning projects thrive!

However, in a production environment and real-world scenarios, it’s important to evaluate the most suitable approach based on your specific needs and resource constraints. Another consideration is whether this step can be integrated directly within the training workflow (using the same image, which is a more common approach) or omitted entirely (though not recommended for Language Models and Large Language Models).

If you feel confident with these steps already, feel free to skip ahead to the next blog, okay? I can still see you around! Alright, let’s move forward then!

 

Adapting/Changing code for Checkpointing

Let’s dive into another interesting part of our machine learning project – defining what needs to be changed or adapted for our checkpointing setup. 

 

main.py for the ai-core-checkpointer-setup

As you may expect, the setup code needs to be adapted. Previously it only provided the input data for our model; this time it also needs to provide the language model in its last saved state and the BPE model (tokenizer), right? Let’s see how main.py looks now:

 

from ShakespeareanGenerator.logger import Logger
from ShakespeareanGenerator.artifact_manager import ObjectStoreArtifactManager

class Run:
    def __init__(self):
        self.logging = Logger()
        self.obj = ObjectStoreArtifactManager()
        self.prepare_data()

    def prepare_data(self):
        self.logging.info('START: PREPARATION STEP')
        # Upload the training corpus
        self.obj.upload_file_to_object_store()
        self.logging.info('Training Data was uploaded to Object Store')
        # Copy the latest trained model and tokenizer into the input folders
        self.obj.copy_object(model_type='model')
        self.logging.info('The Language Model was successfully uploaded to the object store')
        self.obj.copy_object(model_type='bpe_model')
        self.logging.info('The trained tokenizer (BPE) was successfully uploaded to the object store')
        self.logging.info('END: PREPARATION STEP')

if __name__ == '__main__':
    Run()

 

Yes, it calls the method copy_object twice, once for each input: model and bpe_model.

 

self.obj.copy_object(model_type='model')
self.logging.info('The Language Model was successfully uploaded to the object store')
self.obj.copy_object(model_type='bpe_model')
self.logging.info('The trained tokenizer (BPE) was successfully uploaded to the object store')

 

This is implemented by the class ObjectStoreArtifactManager from artifact_manager, so let’s see it:

 

import boto3
import requests
from ShakespeareanGenerator.logger import Logger
from ShakespeareanGenerator.parameters import ObjectStoreParameters

class ObjectStoreArtifactManager:

    def __init__(self):
        self.logging = Logger()
        self.obj_parameters = ObjectStoreParameters()
        self.s3 = self.__get_s3_connection()
        self.latest_execution_id = None

    def __get_s3_connection(self):
        return boto3.client(
            's3',
            aws_access_key_id=self.obj_parameters.access_key_id,
            aws_secret_access_key=self.obj_parameters.secret_access_key
        )

    def __get_executions(self):
        response = self.s3.list_objects_v2(Bucket=self.obj_parameters.bucket_name, Prefix=self.obj_parameters.prefix_m)
        unique_prefixes = set()
        for obj in response['Contents']:
            # The third path segment of each key is the execution ID
            prefix_part = obj['Key'].split('/')[2]
            unique_prefixes.add(prefix_part)
        # Order objects from oldest to newest
        sorted_objects = sorted(response['Contents'], key=lambda x: x['LastModified'])
        latest_keys = {}
        for obj in sorted_objects:
            prefix_part = obj['Key'].split('/')[2]
            if prefix_part not in latest_keys:
                latest_keys[prefix_part] = obj['Key'].split('/')[2]

        self.sorted_keys = list(latest_keys.values())

    def __check_model_files_exist(self):
        model_files_exist = False
        for model_type in ['model', 'bpe_model']:
            source_key = f"{self.obj_parameters.prefix_m}{self.latest_execution_id}/{model_type}/"
            response = self.s3.list_objects_v2(Bucket=self.obj_parameters.bucket_name, Prefix=source_key)
            if 'Contents' in response:
                model_files_exist = True
            else:
                model_files_exist = False
                self.logging.warning('Exit the loop if any model file is missing')
                break
        return model_files_exist

    def __get_latest_valid_execution_id(self):
        if not hasattr(self, 'sorted_keys'):
            self.__get_executions()
            self.logging.info('Reading all the models in object store from all executions')
        if not hasattr(self, 'current_index'):
            self.current_index = 0
            self.logging.info(f'Initial Index: {self.current_index}')
        # Try the newest execution first, then fall back to older ones
        reversed_prefixes = list(reversed(self.sorted_keys))
        for index in range(0, len(self.sorted_keys)):
            self.latest_execution_id = reversed_prefixes[index]
            if self.__check_model_files_exist():
                return self.latest_execution_id
            else:
                msg = 'Files for execution ID not found. {}'.format(self.latest_execution_id)
                self.logging.warning(msg)

    def copy_object(self, model_type):
        self.__get_latest_valid_execution_id()
        # key -> (source file name, destination key in the input folders)
        model_mappings = {
            'model': ('model.pkl', '{}{}model.pkl'.format(
                self.obj_parameters.prefix,
                self.obj_parameters.INPUT_MODEL_PATH
            )),
            'bpe_model_vocab': ('vocab.json', '{}{}vocab.json'.format(
                self.obj_parameters.prefix,
                self.obj_parameters.INPUT_BPE_MODEL_PATH
            )),
            'bpe_model_merges': ('merges.txt', '{}{}merges.txt'.format(
                self.obj_parameters.prefix,
                self.obj_parameters.INPUT_BPE_MODEL_PATH
            ))
        }
        if not any(key.startswith(model_type) for key in model_mappings):
            raise ValueError(f"Invalid model_type: {model_type}")
        for key, (model_file_name, destination_key) in model_mappings.items():
            if key.startswith(model_type):
                source_key = f"{self.obj_parameters.prefix_m}{self.latest_execution_id}/{model_type}/{model_file_name}"
                self.logging.info(f'FROM: {source_key} TO: {destination_key}')
                self.logging.info(f'Starting copy process for {model_type}')
                self.s3.copy_object(
                    Bucket=self.obj_parameters.bucket_name,
                    CopySource={'Bucket': self.obj_parameters.bucket_name, 'Key': source_key},
                    Key=destination_key
                )
        self.logging.info(f'{model_type} artifacts were updated from {self.latest_execution_id} folder to the input folders for further processing')
        return self.latest_execution_id

    def upload_file_to_object_store(self):
        url = "<link_to_github_repository>tinyshakespeare.txt"

        file_key = f"{self.obj_parameters.prefix}{self.obj_parameters.DATA_PATH + self.obj_parameters.DATA_NAME}"
        try:
            response = requests.get(url)
            response.raise_for_status()  # Raise an exception for HTTP errors
            corpus = response.text
            # Replace line breaks with the <b> marker
            corpus = "<b>".join(corpus.split('\n'))
            self.s3.put_object(
                Bucket=self.obj_parameters.bucket_name,
                Key=file_key,
                Body=corpus.encode('utf-8')
            )
            self.logging.info(f"Uploaded tinyshakespeare.txt to S3 path: {file_key}")
            self.logging.info(f"{self.obj_parameters.prefix_m}")
        except requests.RequestException as e:
            error_msg = f"Error fetching data from URL: {e}"
            print(error_msg)
            self.logging.error(error_msg)
        except Exception as e:
            error_msg = f"An unexpected error occurred: {e}"
            print(error_msg)
            self.logging.error(error_msg)

 

Alright, so we’ve got a class here called ObjectStoreArtifactManager. This class helps manage and interact with object storage, specifically Amazon S3, within our machine learning workflow. Let’s go through what each part of this class is doing:

Importing Libraries: We start by importing the necessary libraries like boto3 for AWS interactions and requests for making HTTP requests.
Initialization: In the __init__ method, we set up some initial configurations:
    We create a logger instance (self.logging) to capture and manage logs.
    We initialize object store parameters (self.obj_parameters) to handle configurations related to our object storage setup.
    We establish an S3 connection (self.s3) using AWS credentials provided in ObjectStoreParameters.
Private Methods (__get_s3_connection, __get_executions, __check_model_files_exist, __get_latest_valid_execution_id): These methods are prefixed with double underscores (__) to indicate that they are intended for internal use within the class and not to be accessed directly from outside.
    __get_s3_connection: Establishes and returns an S3 client connection using AWS credentials that came from the Generic Secret we created (object-store-credentials).
    __get_executions: Retrieves and sorts training execution information from the specified S3 bucket and prefix.
    __check_model_files_exist: Checks whether model files exist for the execution ID being examined. An execution may have ended with errors, in which case its ID is present in the bucket but no files (models) are available.
    __get_latest_valid_execution_id: Retrieves the latest valid execution ID based on existing model files.
Public Methods (copy_object, upload_file_to_object_store): These methods are accessible from outside the class and perform specific tasks related to managing object storage.
    copy_object: Copies model artifacts (e.g., model.pkl, vocab.json) based on the latest valid execution ID to designated input folders for further processing.
    upload_file_to_object_store: Downloads a corpus file from a specified URL and uploads it to the object store (S3) using the provided object store parameters.

Here’s an example of how the code deals with the S3 bucket.

Step 1: It locates the latest valid execution folder within the S3 paths (ai://default/<execution_id>/model/ and ai://default/<execution_id>/bpe_model/) where SAP AI Core stores models from that execution.
Step 2: The code checks each execution folder to see if it contains models. If it doesn’t find any models in the latest execution folder, it searches the previous execution folders until it finds valid models. If none of the folders contains models, the copy step cannot succeed and the setup ends with an error, so you know no pre-trained models were located.
Step 3: Once it identifies the valid execution folder containing models, it writes these models to the paths ai://shakespeare/input_model/ and ai://shakespeare/input_tokenizer/. This process ensures that the necessary models are retrieved and stored for further processing in the workflow.
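The selection logic in Steps 1 and 2 can be sketched without touching S3 at all. The function below is an illustration, not the class’s actual implementation: it works on a plain list shaped like the 'Contents' of an S3 listing, and the function name, the `ai/default/` prefix, and the sample execution IDs are all made up for the example.

```python
from datetime import datetime

REQUIRED_FOLDERS = ("model", "bpe_model")

def latest_valid_execution(objects, prefix):
    """Return the newest execution ID that has files under every required
    folder, or None when no execution qualifies.

    `objects` mimics the 'Contents' list of an S3 listing: dicts with
    'Key' and 'LastModified'."""
    newest_seen = {}  # execution_id -> most recent LastModified
    folders = {}      # execution_id -> folder names that hold files
    for obj in objects:
        if not obj["Key"].startswith(prefix):
            continue
        parts = obj["Key"][len(prefix):].split("/")
        if len(parts) < 3 or not parts[-1]:
            continue  # not of the form <execution_id>/<folder>/<file>
        execution_id, folder = parts[0], parts[1]
        folders.setdefault(execution_id, set()).add(folder)
        stamp = obj["LastModified"]
        if execution_id not in newest_seen or stamp > newest_seen[execution_id]:
            newest_seen[execution_id] = stamp
    # Walk executions from newest to oldest; stop at the first complete one.
    for execution_id in sorted(newest_seen, key=newest_seen.get, reverse=True):
        if all(name in folders[execution_id] for name in REQUIRED_FOLDERS):
            return execution_id
    return None

# Hypothetical listing: e3 is newest but has no tokenizer, e2 failed early.
listing = [
    {"Key": "ai/default/e1/model/model.pkl", "LastModified": datetime(2024, 1, 1)},
    {"Key": "ai/default/e1/bpe_model/vocab.json", "LastModified": datetime(2024, 1, 1)},
    {"Key": "ai/default/e2/logs/train.log", "LastModified": datetime(2024, 2, 1)},
    {"Key": "ai/default/e3/model/model.pkl", "LastModified": datetime(2024, 3, 1)},
]
chosen = latest_valid_execution(listing, "ai/default/")
```

In this listing the two newest executions are skipped because each is missing at least one required folder, so the oldest one wins. Returning None for “nothing usable” leaves it to the caller to decide whether that should stop the workflow.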

This ObjectStoreArtifactManager class is designed to provide all required interactions with object storage for our machine learning workflow, handling tasks such as copying model artifacts and uploading data files. It’s all about managing our resources and keeping artifacts available for resuming training.

 

main.py for the ai-core-checkpointer

You might already have a good idea of what to expect from the files we’ll be using, right? So, we’ll focus on the specific ones that have been adapted or changed for our checkpointing feature. Let’s start with main.py.

 

import pickle
import torch
from ShakespeareanGenerator.model.language_models import ShakespeareanLanguagelModel, ModelTrainer
from ShakespeareanGenerator.parameters import TrainingParameters
from ShakespeareanGenerator.data_handler import DataHandler
from ShakespeareanGenerator.logger import Logger

class Run:
    def __init__(self):
        self.logging = Logger()
        self.training_params = TrainingParameters()
        self.check_gpu_usage()
        self.prepare_data()
        self.train_model()

    def check_gpu_usage(self):
        if torch.cuda.is_available():
            self.logging.info(f"GPU is available, using GPU: {torch.cuda.get_device_name(0)}")
            self.logging.info(f"Using CUDA version {torch.version.cuda}")
        else:
            self.logging.warning("GPU is not available, using CPU.")

    def prepare_data(self):
        self.logging.info('START OF EXECUTION')
        self.logging.info('Get DataHandler and Model Instances')
        self.data_handler = DataHandler(self.training_params.DATA_PATH)
        # Try to resume from the model provided as an input artifact
        try:
            with open(self.training_params.INPUT_MODEL + 'model.pkl', 'rb') as f:
                loaded_model = pickle.load(f)
            self.logging.info('Loaded model for continuing training')
        except FileNotFoundError:
            loaded_model = None
            self.logging.error('Transfer learning not possible; no model found')
            self.logging.warning('Model will start from scratch')

        self.model_object = ShakespeareanLanguagelModel()
        model = self.model_object if loaded_model is None else loaded_model
        self.model = model.to(self.training_params.device)
        self.logging.info('DataHandler and Model Instantiated')

    def train_model(self):
        self.trainer = ModelTrainer(self.data_handler, self.model)
        self.trainer.train()
        self.logging.info('Model was trained successfully')
        # Save the updated model so the next run can pick it up
        with open(self.training_params.MODEL_PATH + 'model.pkl', 'wb') as f:
            pickle.dump(self.model, f)
        self.logging.info('END OF EXECUTION')

if __name__ == '__main__':
    Run()

 

As you can see, the changes are mainly related to this part, specifically:

 

try:
    with open(self.training_params.INPUT_MODEL + 'model.pkl', 'rb') as f:
        loaded_model = pickle.load(f)
    self.logging.info('Loaded model for continuing training')
except FileNotFoundError:
    loaded_model = None
    self.logging.error('Transfer learning not possible; no model found')
    self.logging.warning('Model will start from scratch')

self.model_object = ShakespeareanLanguagelModel()
model = self.model_object if loaded_model is None else loaded_model

 

Breaking it down, we have:

try and except Block: This part of the code is like a safety net. It tries to open a file (model.pkl) containing a previously trained model. If the file exists, so no FileNotFoundError is raised, it loads the model using pickle.load(f).
Loading the Model: If the file is found and the model is loaded successfully, it logs a message saying “Loaded model for continuing training”. This means you can pick up where you left off and continue training your awesome Shakespearean Language Model.
Handling File Not Found: If the file (model.pkl) is not found (i.e., FileNotFoundError is raised), it sets loaded_model to None and logs an error message saying “Transfer learning not possible; no model found”. It also logs a warning message indicating that the model will start training from scratch.
Initializing the Model: After handling the file loading, it initializes an instance of ShakespeareanLanguagelModel() and assigns it to self.model_object.
Setting the Training Model: Finally, the model variable is assigned based on whether loaded_model is None or not. If loaded_model is None, there was no existing model to load, so model is set to self.model_object (a new instance of the model). If loaded_model is not None, a previously trained model was successfully loaded, so model is set to the loaded model.
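The same “load it if it’s there, otherwise start fresh” pattern can be tried out in isolation. The sketch below is only an illustration under a few assumptions: a plain dictionary stands in for the language model, and `load_or_create` and `new_model` are hypothetical helper names, not functions from the project.

```python
import os
import pickle
import tempfile

def new_model():
    """Stand-in for building a fresh model; a dict keeps the example light."""
    return {"steps_trained": 0}

def load_or_create(model_dir, factory, filename="model.pkl"):
    """Return (model, resumed); resumed is True when a saved model was found."""
    path = os.path.join(model_dir, filename)
    try:
        with open(path, "rb") as f:
            return pickle.load(f), True
    except FileNotFoundError:
        # Nothing to resume from, so build a new model on demand
        return factory(), False

workdir = tempfile.mkdtemp()

# First run: the folder is empty, so we start from scratch.
model, resumed = load_or_create(workdir, new_model)

# Pretend we trained for a while and saved the result.
model["steps_trained"] = 5000
with open(os.path.join(workdir, "model.pkl"), "wb") as f:
    pickle.dump(model, f)

# Second run: the saved model is picked up and training can continue.
model_again, resumed_again = load_or_create(workdir, new_model)
```

One small difference from the listing above: this sketch only builds the fresh model when no saved one is found, which avoids allocating a model that is thrown away immediately. Returning the `resumed` flag also makes it easy to log which path was taken.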

Not that difficult, right? However, you might have noticed that training_params.INPUT_MODEL is a new parameter. If you did, you’re correct! This is one of the new parameters that will be introduced soon. Hang tight, and we’ll jump into it very quickly.

 

tokenizer.py

Since we’ve already trained the tokenizer previously, there’s no need to train it again, especially if we’re using the same dataset and not fine-tuning anything. So let’s break down another piece of code related to our machine learning project: the load_tokenizer function. This function is all about setting up and loading a tokenizer using the SentencePieceBPETokenizer.

 

def load_tokenizer(self):
    self.tokenizer = SentencePieceBPETokenizer.from_file(
        self.training_params.INPUT_TOKENIZER_MODEL + 'vocab.json',
        merges_filename=self.training_params.INPUT_TOKENIZER_MODEL + 'merges.txt')

 

Initializing the Tokenizer: In this function, we’re initializing our tokenizer attribute using the SentencePieceBPETokenizer.from_file method.
File Paths: The tokenizer is loaded from two specific files:
    vocab.json: This file contains the vocabulary used by the tokenizer.
    merges.txt: This file contains information about token merges based on the tiny Shakespeare corpus.
File Paths Explanation: The file paths (self.training_params.INPUT_TOKENIZER_MODEL + 'vocab.json' and self.training_params.INPUT_TOKENIZER_MODEL + 'merges.txt') are constructed based on the input tokenizer model directory specified in self.training_params (which we’ll get into in a minute).

Additionally, we have added self.load_tokenizer() in the encode and decode methods as well.
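Calling the loader from both encode and decode is simple and works. If you’d rather read the files from disk only once, a small caching wrapper is one option. The sketch below is an assumption about how that could look, not the project’s code: `LazyTokenizer` and the character-level `FakeTokenizer` are hypothetical stand-ins so the example runs without the tokenizer files.

```python
class FakeTokenizer:
    """Character-level stand-in for a real tokenizer, used only for the demo."""
    def encode(self, text):
        return [ord(ch) for ch in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)

class LazyTokenizer:
    """Loads the tokenizer on first use and reuses it afterwards."""
    def __init__(self, loader):
        self._loader = loader      # callable that builds the tokenizer
        self._tokenizer = None
        self.load_count = 0        # only here to show how often we load

    def _ensure_loaded(self):
        if self._tokenizer is None:
            self._tokenizer = self._loader()
            self.load_count += 1
        return self._tokenizer

    def encode(self, text):
        return self._ensure_loaded().encode(text)

    def decode(self, ids):
        return self._ensure_loaded().decode(ids)

tok = LazyTokenizer(FakeTokenizer)
ids = tok.encode("To be")
text = tok.decode(ids)
```

In a real setup the loader would be a function that calls SentencePieceBPETokenizer.from_file with the two input paths; after the first encode or decode, later calls reuse the same object.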

And that’s it! Can you believe it? Everything else remains unchanged. Well, we still need to review parameters.py, but there’s not much to add to it. It’s been pretty straightforward so far, right? 

 

parameters.py

Moving forward, a few parameters were added to parameters.py, specifically:

INPUT_MODEL = "/app/input_model/": This is the container path where the training code expects the input model. It corresponds to an input artifact of kind model in the workflow template, so SAP AI Core makes the model from the object store available at this path during execution.
INPUT_TOKENIZER_MODEL = "/app/input_tokenizer/": Same as the previous one, but this time it’s about the artifacts generated from the tokenizer training (vocabulary and merges).
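Because the code builds file paths by concatenating these directory strings with a file name, the trailing slash matters. Here is a minimal sketch of a join helper that tolerates either form. The two directory values come from the parameters above; the helper name `artifact_file` is an assumption for illustration.

```python
import posixpath

# Container-side directories from parameters.py
INPUT_MODEL = "/app/input_model/"
INPUT_TOKENIZER_MODEL = "/app/input_tokenizer/"

def artifact_file(directory, filename):
    """Join directory and file name; works with or without a trailing slash."""
    return posixpath.join(directory, filename)

model_file = artifact_file(INPUT_MODEL, "model.pkl")
vocab_file = artifact_file(INPUT_TOKENIZER_MODEL, "vocab.json")
# Same result even if someone drops the trailing slash from the setting
merges_file = artifact_file("/app/input_tokenizer", "merges.txt")
```

With plain string concatenation, "/app/input_tokenizer" + "merges.txt" would silently point at a file that doesn’t exist, which then looks like a missing artifact.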

 

Adapting/Changing Docker Image and Workflow Template for Checkpointing

Let’s start by checking the Dockerfile for ai-core-checkpointer (we don’t need to revisit the ai-core-checkpointer-setup because we’ve been through the same Dockerfile for ai-core-training-setup, right?):

 

# Use the PyTorch image with CUDA 12.1 and cuDNN 8 runtime
FROM pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime

# Install necessary system dependencies
RUN apt-get update && apt-get install -y \
    python3-pip \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Create necessary directories within the Docker image
RUN mkdir -p /app/src /app/data /app/input_model /app/input_tokenizer /app/model /app/logs

# Copy files from local system to path in Docker image
COPY main.py /app/src/
COPY requirements.txt /app/src/
COPY /ShakespeareanGenerator/*.py /app/src/ShakespeareanGenerator/
COPY /ShakespeareanGenerator/model/*.py /app/src/ShakespeareanGenerator/model/

# Install Python dependencies within the Docker image
RUN pip3 install --no-cache-dir -r /app/src/requirements.txt

# Set permissions to execute anything inside the /app folder
RUN chgrp -R 65534 /app && \
    chmod -R 777 /app

 

If you’ve executed our training workflows from previous blogs, you might have noticed that the training Dockerfile uses a different base image from the setup one. We use pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime, a PyTorch image that ships with the CUDA 12.1 and cuDNN 8 runtime. This keeps our training environment ready for GPU-accelerated tasks and lets us take full advantage of the GPUs in the SAP AI Core Kubernetes cluster.

Now, let’s focus on a couple of new directories in our setup. We’ve got some input directories that SAP AI Core will use during the workflow execution. These directories are declared as “input artifacts” in the workflow template, meaning SAP AI Core makes the mapped artifacts from the object store available there at execution time. It does not put the files into the object store in the first place, though. That’s the job of the “setup” step, and the changes to the “setup” code covered above show what’s happening behind the scenes.

 

RUN mkdir -p /app/data/
RUN mkdir -p /app/input_model/
RUN mkdir -p /app/input_tokenizer/

 

Next up, we’ve got a couple of new folders for our output. These directories are where SAP AI Core will automatically copy our outputs (thanks to the workflow template), and it will even create them if they don’t exist. The copied files will be sent to the S3 “default” path.

 

RUN mkdir -p /app/model/
RUN mkdir -p /app/logs/

 

These simple mkdir commands are setting up our project structure to handle input data, models, and logs effectively within our Docker container. It’s all about keeping things organized and ready for the workflow ahead.

You can download the checkpointer_template.yml file from the GitHub repository:

 

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: "shakespeare-model-chkp"
  annotations:
    scenarios.ai.sap.com/name: "shakespeare-language-model"
    scenarios.ai.sap.com/description: "Shakespeare Language Model"
    executables.ai.sap.com/name: "Shakespeare-language-model-trainer-checkpointer"
    executables.ai.sap.com/description: "Shakespeare Language Model Trainer Checkpointer Executable"
    artifacts.ai.sap.com/data.kind: "dataset"
    artifacts.ai.sap.com/data.description: "Tiny Shakespeare Dataset"
    artifacts.ai.sap.com/model.kind: "model"
    artifacts.ai.sap.com/model.description: "Trained Language Model"
    artifacts.ai.sap.com/model.labels: |
      {"ext.ai.sap.com/step":"train", "ext.ai.sap.com/version":"0.0.1"}
    artifacts.ai.sap.com/bpe_model.kind: "model"
    artifacts.ai.sap.com/bpe_model.description: "Byte-Pair Encoding Tokenizer"
    artifacts.ai.sap.com/bpe_model.labels: |
      {"ext.ai.sap.com/step":"train", "ext.ai.sap.com/version":"0.0.1"}
    artifacts.ai.sap.com/setuplogs.kind: "other"
    artifacts.ai.sap.com/setuplogs.description: "Setup Logs"
    artifacts.ai.sap.com/setuplogs.labels: |
      {"ext.ai.sap.com/step":"setup", "ext.ai.sap.com/version":"0.0.1"}
    artifacts.ai.sap.com/logs.kind: "other"
    artifacts.ai.sap.com/logs.description: "Model Training Logs"
    artifacts.ai.sap.com/logs.labels: |
      {"ext.ai.sap.com/step":"train", "ext.ai.sap.com/version":"0.0.1"}
  labels:
    scenarios.ai.sap.com/id: "shakespeare-language-model"
    executables.ai.sap.com/id: "shakespeare-checkpointer"
    ai.sap.com/version: "0.0.1"
spec:
  imagePullSecrets:
    - name: shakespeare-docker-repo
  entrypoint: core
  arguments:
    parameters:
      - name: BATCH_SIZE
        description: The number of training examples processed in one iteration during training. It determines the size of each batch in the training dataset.
      - name: CONTEXT_LENGTH
        description: Defines the maximum length of input sequences, typically representing the number of tokens in each sequence or block of text.
      - name: ITERATION_LIMIT
        description: Specifies the maximum number of iterations or training steps to be performed during the training process. It controls the duration of the training loop.
      - name: EVAL_FREQUENCY
        description: Indicates how often model evaluation occurs during training, measured in the number of iterations or epochs between evaluations.
      - name: EVAL_STEPS
        description: Represents the number of evaluation steps to perform during each evaluation period. It determines the granularity of evaluation within each evaluation cycle.
      - name: LEARNING_RATE
        description: The rate at which the model parameters are updated during training, influencing the size of the steps taken in the parameter space to minimize the loss function.
      - name: EMBEDDING_DIM
        description: Determines the dimensionality of the embedding vectors used to represent tokens in the model. It impacts the expressive power of the model's embedding layer.
      - name: ATTENTION_HEADS
        description: Specifies the number of parallel attention heads in the multi-head attention mechanism of the model. Each head learns different aspects of the input data.
      - name: NUM_LAYERS
        description: Represents the total number of transformer layers in the model architecture. It controls the depth and complexity of the model.
      - name: DROPOUT
        description: The probability of dropping out neurons or connections between layers during training, helping prevent overfitting by randomly deactivating some units.
      - name: DICTIONARY_SIZE
        description: Indicates the size of the vocabulary or dictionary used by the model, representing the total number of unique tokens or words in the dataset vocabulary.
  templates:
    - name: core
      steps:
        - - name: setup
            template: setup-pipeline
        - - name: train
            template: train-pipeline
    - name: setup-pipeline
      metadata:
        labels:
          ai.sap.com/resourcePlan: basic
      outputs:
        artifacts:
          - name: setup_logs
            globalName: setup_logs
            path: /app/logs/
            archive:
              none: {}
      container:
        image: docker.io/carlosbasto/shakespeare-checkpointer-setup:0.0.1
        imagePullPolicy: Always
        command: ["/bin/sh", "-c"]
        args:
          - python /app/src/main.py
        env:
          - name: BUCKET_NAME
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: bucket
          - name: PREFIX_NAME
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: path_prefix
          - name: ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: access_key_id
          - name: SECRET_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: secret_access_key
    - name: train-pipeline
      metadata:
        labels:
          ai.sap.com/resourcePlan: train.l
      inputs:
        artifacts:
          - name: data
            path: /app/data/
          - name: input_model
            path: /app/input_model/
          - name: input_tokenizer
            path: /app/input_tokenizer/
      outputs:
        artifacts:
          - name: model
            path: /app/model/
            globalName: model
            archive:
              none: {}
          - name: logs
            path: /app/logs/
            archive:
              none: {}
      container:
        image: docker.io/carlosbasto/shakespeare-checkpointer:0.0.1
        imagePullPolicy: Always
        command: ["/bin/sh", "-c"]
        args:
          - python /app/src/main.py
        env:
          - name: BATCH_SIZE
            value: "{{workflow.parameters.BATCH_SIZE}}"
          - name: CONTEXT_LENGTH
            value: "{{workflow.parameters.CONTEXT_LENGTH}}"
          - name: ITERATION_LIMIT
            value: "{{workflow.parameters.ITERATION_LIMIT}}"
          - name: EVAL_FREQUENCY
            value: "{{workflow.parameters.EVAL_FREQUENCY}}"
          - name: EVAL_STEPS
            value: "{{workflow.parameters.EVAL_STEPS}}"
          - name: LEARNING_RATE
            value: "{{workflow.parameters.LEARNING_RATE}}"
          - name: EMBEDDING_DIM
            value: "{{workflow.parameters.EMBEDDING_DIM}}"
          - name: ATTENTION_HEADS
            value: "{{workflow.parameters.ATTENTION_HEADS}}"
          - name: NUM_LAYERS
            value: "{{workflow.parameters.NUM_LAYERS}}"
          - name: DROPOUT
            value: "{{workflow.parameters.DROPOUT}}"
          - name: DICTIONARY_SIZE
            value: "{{workflow.parameters.DICTIONARY_SIZE}}"

 

Place it into your own github repository and then, if you sync your application, you’ll notice another file added.
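The template hands every workflow parameter to the training container as an environment variable, and environment variables are always strings. As a rough sketch of how the Python side could turn them into typed values, here is one way to do it. The parameter names come from the template; the int/float type choices, the function name `read_hyperparameters`, and the sample values are assumptions for illustration only.

```python
import os

# Parameter name -> type to convert to (assumed types, see the lead-in)
HYPERPARAMETERS = {
    "BATCH_SIZE": int,
    "CONTEXT_LENGTH": int,
    "ITERATION_LIMIT": int,
    "EVAL_FREQUENCY": int,
    "EVAL_STEPS": int,
    "LEARNING_RATE": float,
    "EMBEDDING_DIM": int,
    "ATTENTION_HEADS": int,
    "NUM_LAYERS": int,
    "DROPOUT": float,
    "DICTIONARY_SIZE": int,
}

def read_hyperparameters(environ=os.environ):
    """Convert the string values from the environment into typed values.
    Raises KeyError listing every parameter that is missing or empty."""
    values, missing = {}, []
    for name, cast in HYPERPARAMETERS.items():
        raw = environ.get(name)
        if raw is None or raw.strip() == "":
            missing.append(name)
            continue
        values[name] = cast(raw)
    if missing:
        raise KeyError("Missing workflow parameters: " + ", ".join(missing))
    return values

# Made-up values, passed as a dict so the demo does not depend on the real environment
demo_env = {
    "BATCH_SIZE": "32", "CONTEXT_LENGTH": "256", "ITERATION_LIMIT": "5000",
    "EVAL_FREQUENCY": "500", "EVAL_STEPS": "200", "LEARNING_RATE": "3e-4",
    "EMBEDDING_DIM": "384", "ATTENTION_HEADS": "6", "NUM_LAYERS": "6",
    "DROPOUT": "0.2", "DICTIONARY_SIZE": "8000",
}
params = read_hyperparameters(demo_env)
```

Collecting all missing names before raising means a misconfigured execution reports everything that needs fixing in one go, rather than one parameter per failed run.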

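Let’s jump into the scenario with this new executable. Here, you’ll find two executables: the trainer from the previous blogs and the new checkpointer.

In addition to the input parameters (which are the same as those for the trainer), the checkpointer has input and output artifacts to consider: data, input_model, and input_tokenizer as inputs, and model and logs as outputs of the training step.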

Creating Configuration and Deploy Checkpointer Workflow

Now that we have synced the scenario in SAP AI Core, let’s walk through the steps to create the necessary artifacts and configure them for our execution scenario. If you need a refresher, you can check out the previous blog post for more clarity.

Are you back? Good, let’s create an artifact of type “model” for the scenario we had (shakespeare-language-model).

Now, we’re going to create the BPE (Byte-Pair Encoding) model artifact. Give it a meaningful name that reflects its purpose.

Next, we’ll need to specify the URL or path in S3 where the BPE model will be stored. This path should be set up during the initial workflow setup.

Since this BPE model is an input artifact for the checkpointer workflow, make sure that the corresponding folder exists in the object store (S3). The same goes for the input_model folder.
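If you’d like to verify from code that those folders actually contain something before starting an execution, a quick check could look like the sketch below. It is written against any object that offers boto3’s list_objects_v2 call; the helper name, the bucket name, and the key prefixes are assumptions, and the small fake client exists only so the example runs without AWS credentials.

```python
def prefix_has_objects(s3_client, bucket, prefix):
    """Return True when at least one object key starts with the prefix."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0

class FakeS3Client:
    """Test double that mimics the shape of a list_objects_v2 response."""
    def __init__(self, keys):
        self._keys = keys

    def list_objects_v2(self, Bucket, Prefix, MaxKeys=1000):
        matches = [k for k in self._keys if k.startswith(Prefix)][:MaxKeys]
        return {"KeyCount": len(matches), "Contents": [{"Key": k} for k in matches]}

# Hypothetical bucket content after the setup step has run
client = FakeS3Client([
    "shakespeare/input_model/model.pkl",
    "shakespeare/input_tokenizer/vocab.json",
    "shakespeare/input_tokenizer/merges.txt",
])
model_ready = prefix_has_objects(client, "my-bucket", "shakespeare/input_model/")
tokenizer_ready = prefix_has_objects(client, "my-bucket", "shakespeare/input_tokenizer/")
missing = prefix_has_objects(client, "my-bucket", "shakespeare/not_there/")
```

With a real connection you would pass the boto3 S3 client instead of the fake one. Asking for a single key (MaxKeys=1) keeps the check cheap even when the folder holds many files.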

Feel free to add a label if you want – it can help organize and identify artifacts more easily. That’s all for the tokenizer model setup!

Now let’s repeat the process for the input_model:

Create the input_model artifact following similar steps. Once we’ve created these artifacts, we need to set up a configuration to map them to our specific scenario.

 Next, map the inputs within the configuration to the corresponding artifacts we’ve just created.

After following these steps, you’ll end up with a result similar to the one below, with extra inputs incorporated into the setup.

Now, you can clearly see that we’ve mapped 3 inputs to our configuration and obtained 2 outputs as results. Pretty neat, right?

As expected, the checkpointer should resume from where the trainer left off: from 3.0334 to 3.035 for Training Loss and 3.3903 to 3.3866 for Validation Loss. It’s nothing fancy, but it gives us a little more satisfaction knowing it worked. Of course, feel free to run it as many times as you want to try and achieve better results.

Anyway, let’s check out what SAP AI Core has delivered to us in the S3 folders. This time, the execution ID is e6d11701c54a2597 in my case, so we should see a subfolder of ai://default/ with that ID after the execution is complete.

 There we have it! And inside it, we have the saved outputs.
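If you list the bucket from code instead of the console, the keys come back as one flat list. The sketch below groups them by output artifact folder for a given execution. The "shakespeare/default/" prefix and the file names in the example are assumptions for illustration; only the execution ID is the one from my run.

```python
from collections import defaultdict


def group_outputs(keys, execution_prefix):
    """Group object keys under `execution_prefix` by their artifact folder."""
    grouped = defaultdict(list)
    for key in keys:
        if not key.startswith(execution_prefix):
            continue  # belongs to another execution
        relative = key[len(execution_prefix):]
        artifact, _, filename = relative.partition("/")
        if filename:  # skip folder placeholder keys
            grouped[artifact].append(filename)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical keys, shaped like a list_objects_v2 result.
    example_keys = [
        "shakespeare/default/e6d11701c54a2597/model/model.pkl",
        "shakespeare/default/e6d11701c54a2597/logs/train.log",
        "shakespeare/default/another-execution/model/model.pkl",
    ]
    print(group_outputs(example_keys, "shakespeare/default/e6d11701c54a2597/"))
```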

Alright, great! I think we’ve come a long way. Now it’s time to switch gears and start thinking about fine-tuning our model, don’t you think?

See you in the next blog.

Wrapping Up and Next Steps

Congratulations on mastering the essentials of checkpointing and resuming training for your Shakespearean Language Model! In this blog, we’ve explored critical aspects of setting up our training workflow using Docker and SAP AI Core.

Let’s recap what we’ve covered:

Understanding Checkpointing: We discussed the concept of checkpointing, its importance, and how it can save time and resources in training large models.
Leveraging Separate Docker Images: We explored the benefits of using separate Docker images for modular design, scalability, and gaining hands-on experience with SAP AI Core.
Adapting Code for Checkpointing: We delved into modifying the code to support checkpointing, ensuring our model can resume training efficiently.
Configuring Docker Images and Workflow Templates: We set up Docker images and workflow templates to manage our checkpointing process effectively.
Deploying and Evaluating Checkpointer Workflow: We deployed the checkpointer workflow and evaluated the results, ensuring our model training process is robust and resilient.

Next Steps

Now that you’ve built and trained our Shakespearean Language Model, it’s time to dive deeper into the following advanced topics:

Fine-Tuning with Low-Rank Adaptation (LoRA): Learn how to use LoRA to fine-tune models while training far fewer parameters, making the process more efficient.
[SAP AI Core is All You Need | 5. Fine Tuning with Low-Rank Adaptation (LoRA)]
Fine-Tuning Pipeline: Dive into fine-tuning techniques that enhance model performance on specific datasets or tasks. We’ll deploy fine-tuning pipelines on SAP AI Core, then cover model deployment and serving with KServe, so you can efficiently serve fine-tuned models for real-world applications.
[SAP AI Core is All You Need | 6. Serving Shakespeare Model using SAP AI Core and KServe]
Sampling and Consuming Language Models: Discover methods for sampling from trained language models and integrating them into applications.
[SAP AI Core is All You Need | 7. Deploying Language Models for Text Generation]
Developing a Language-Model-Based App: Gain insights into building an application powered by your trained language model.
[SAP AI Core is All You Need | 8. Consuming and Sampling from Shakespeare Language Models]

 

Further References

Source Code: GitHub repository
SAP AI Core Help
General Checkpoint in PyTorch
PyTorch: CUDA Semantics
Checkpoint Google Glossary

 

 

 

No need to redo everything from scratch. This can be a lifesaver for big, complex models that take days or weeks to train. It’s also a powerful tool for keeping an eye on your model’s progress. By saving checkpoints regularly, you can monitor how your model evolves over time. Spotting trends or issues early on can help you fine-tune your approach and achieve better results.So, whether you’re training a language model or teaching your AI to recognize cats from dogs, remember the magic of checkpointing. It’s your safety net, your progress tracker, and your ticket to more efficient AI adventures. Happy checkpointing! ? Leveraging Separate Docker Images for Enhanced Checkpointing and ScalabilityAlright, so we’ve established that checkpointing is a lifesaver for your AI projects. But here’s the twist: we’re going to take things a step further by implementing it in a separate Docker image.Now, you might be wondering, “Why not just stick everything in one image?”. That’s a fair question! Here’s the thing: the only answer would be “to learn more”; however, while it works, keeping things separate will offer some sweet advantages to your learning journey:Modular Design: Think of it like building with Legos. A separate checkpointing image acts like a dedicated “checkpointing brick.” This keeps your training script nice and clean, focusing solely on the training logic.Scalability Potential: Imagine training multiple models at once. With separate images, you can easily manage checkpoints for each model independently, making things more scalable in the long run.Learning Opportunity: Let’s be honest, we’re all on a journey of AI mastery. Deploying things on SAP BTP with SAP AI Core is a fantastic way to gain experience in a powerful platform specifically designed for AI workloads. 
It’s like adding a new tool to your AI toolbox!Warm-up: This is a good chance to delve deeper into artifacts, workflows, and SAP AI Core, gearing up for fine-tuning our Shakespeare Language Model with some exciting tasks. It serves as a useful warm-up for the adjustments we’ll cover in the next blog post about Fine-Tuning.So, while keeping everything in one image might seem simpler at first, taking the “separate image” route offers a cleaner architecture, potential scalability, and a valuable learning experience on a robust platform like SAP BTP. Remember, the goal is to build robust and efficient AI models, and sometimes, a little extra planning goes a long way. So, embrace the power of separate images, and watch your machine learning projects thrive!However, in a production environment and real-world scenarios, it’s important to evaluate the most suitable approach based on your specific needs and resource constraints. Another consideration is whether this step can be integrated directly within the training workflow (using the same image, which is a more common approach) or omitted entirely (though not recommended for Language Models and Large Language Models).If you feel confident with these steps already, feel free to skip ahead to the next blog, okay? I can still see you around! ? Alright, let’s move forward then! Adapting/Changing code for CheckpointingLet’s dive into another interesting part of our machine learning project – defining what needs to be changed or adapted for our checkpointing setup.  main.py for the ai-core-checkpointer-setupAs you may expect, the setup code should be adapted because it was just providing the data input for our model, but this time, it needs to provide also the language model in its last state and the BPE model (tokenizer), right? Let’s see how the main.py methods gets now: from ShakespeareanGenerator.logger import Logger
from ShakespeareanGenerator.artifact_manager import ObjectStoreArtifactManager

class Run:
def __init__(self):
self.logging = Logger()
self.obj = ObjectStoreArtifactManager()
self.prepare_data()

def prepare_data(self):
self.logging.info(‘START: PREPARATION STEP’)
self.obj.upload_file_to_object_store()
self.logging.info(‘Training Data was uploaded to Object Store’)
self.obj.copy_object(model_type=’model’)
self.logging.info(‘The Language Model was successfully uploaded to the object store’)
self.obj.copy_object(model_type=’bpe_model’)
self.logging.info(‘The trained tokenizer (BPE) was successfully uploaded to the object store’)
self.logging.info(‘END: PREPARATION STEP’)

if __name__ == ‘__main__’:
Run() Yes, it has two instances of the method copy_object for each input: model and bpe_model.  self.obj.copy_object(model_type=’model’)
self.logging.info(‘The Language Model was successfully uploaded to the object store’)
self.obj.copy_object(model_type=’bpe_model’)
self.logging.info(‘The trained tokenizer (BPE) was successfully uploaded to the object store’) This is implemented by the class ObjectStoreArtifactManager from artifact_manager, so let’s see it: import boto3
import requests
from ShakespeareanGenerator.logger import Logger
from ShakespeareanGenerator.parameters import ObjectStoreParameters

class ObjectStoreArtifactManager:

def __init__(self):

self.logging = Logger()
self.obj_parameters = ObjectStoreParameters()
self.s3 = self.__get_s3_connection()
self.latest_execution_id = None

def __get_s3_connection(self):
return boto3.client(
‘s3’,
aws_access_key_id = self.obj_parameters.access_key_id,
aws_secret_access_key = self.obj_parameters.secret_access_key
)

def __get_executions(self):
response = self.s3.list_objects_v2(Bucket=self.obj_parameters.bucket_name, Prefix=self.obj_parameters.prefix_m)
unique_prefixes = set()
for obj in response[‘Contents’]:
prefix_part = obj[‘Key’].split(‘/’)[2]
unique_prefixes.add(prefix_part)
sorted_objects = sorted(response[‘Contents’], key=lambda x: x[‘LastModified’])
latest_keys = {}
for obj in sorted_objects:
prefix_part = obj[‘Key’].split(‘/’)[2]
if prefix_part not in latest_keys:
latest_keys[prefix_part] = obj[‘Key’].split(‘/’)[2]

self.sorted_keys = list(latest_keys.values())

def __check_model_files_exist(self):
model_files_exist = False
for model_type in [‘model’, ‘bpe_model’]:
source_key = f”{self.obj_parameters.prefix_m}{self.latest_execution_id}/{model_type}/”
response = self.s3.list_objects_v2(Bucket=self.obj_parameters.bucket_name, Prefix=source_key)
if ‘Contents’ in response:
model_files_exist = True
else:
model_files_exist = False
self.logging.warning(‘Exit the loop if any model file is missing’)
break
return model_files_exist

def __get_latest_valid_execution_id(self):

if not hasattr(self, ‘sorted_keys’):
self.__get_executions()
self.logging.info(‘Reading all the models in object store from all executions’)
if not hasattr(self, ‘current_index’):
self.current_index = 0
self.logging.info(f’Initial Index: {self.current_index}’)
reversed_prefixes = list(map(lambda x: x, reversed(self.sorted_keys)))
for index in range(0, len(self.sorted_keys)):
self.latest_execution_id = reversed_prefixes[index]
if self.__check_model_files_exist():
return self.latest_execution_id
else:
msg = ‘Files for execution ID not found. {}’.format(self.latest_execution_id)
self.logging.warning(msg)

def copy_object(self, model_type):

self.__get_latest_valid_execution_id()
model_mappings = {
‘model’: (‘model.pkl’,'{}{}model.pkl’.format(
self.obj_parameters.prefix,
self.obj_parameters.INPUT_MODEL_PATH
)),
‘bpe_model_vocab’: (‘vocab.json’, ‘{}{}vocab.json’.format(
self.obj_parameters.prefix,
self.obj_parameters.INPUT_BPE_MODEL_PATH
)),
‘bpe_model_merges’: (‘merges.txt’, ‘{}{}merges.txt’.format(
self.obj_parameters.prefix,
self.obj_parameters.INPUT_BPE_MODEL_PATH
))
}
if not any(key.startswith(model_type) for key in model_mappings):
raise ValueError(f”Invalid model_type: {model_type}”)
for key, (model_file_name, destination_key) in model_mappings.items():
if key.startswith(model_type):
source_key = f”{self.obj_parameters.prefix_m}{self.latest_execution_id}/{model_type}/{model_file_name}”
self.logging.info(f’FROM: {source_key} TO: {destination_key}’)
self.logging.info(f’Starting copy process for {model_type}’)
self.s3.copy_object(
Bucket=self.obj_parameters.bucket_name,
CopySource={‘Bucket’: self.obj_parameters.bucket_name, ‘Key’: source_key},
Key=destination_key
)
self.logging.info(f'{model_type} artifacts were updated from {self.latest_execution_id} folder to the input folders for further processing’)
return self.latest_execution_id

def upload_file_to_object_store(self):
url = “<link_to_github_repository>tinyshakespeare.txt”

file_key = f”{self.obj_parameters.prefix}{self.obj_parameters.DATA_PATH + self.obj_parameters.DATA_NAME}”
try:
response = requests.get(url)
response.raise_for_status() # Raise an exception for HTTP errors
corpus = response.text
corpus = “<b>”.join(corpus.split(‘n’))
self.s3.put_object(
Bucket=self.obj_parameters.bucket_name,
Key=file_key,
Body=corpus.encode(‘utf-8’)
)
self.logging.info(f”Uploaded tinyshakespeare.txt to S3 path: {file_key}”)
self.logging.info(f”{self.obj_parameters.prefix_m}”)
except requests.RequestException as e:
error_msg = f”Error fetching data from URL: {e}”
print(error_msg)
self.logging.error(error_msg)
except Exception as e:
error_msg = f”An unexpected error occurred: {e}”
print(error_msg)
self.logging.error(error_msg) Alright, so we’ve got a class here called ObjectStoreArtifactManager. This class helps manage and interact with object storage, specifically Amazon S3, within our machine learning workflow. Let’s go through what each part of this class is doing:Importing Libraries: We start by importing the necessary libraries like boto3 for AWS interactions and requests for making HTTP requests.Initialization: In the __init__ method, we set up some initial configurations:We create a logger instance (self.logging) to capture and manage logs.We initialize object store parameters (self.obj_parameters) to handle configurations related to our object storage setup.We establish an S3 connection (self.s3) using AWS credentials provided in ObjectStoreParameters.Private Methods (__get_s3_connection, __get_executions, __check_model_files_exist, __get_latest_valid_execution_id). These methods are prefixed with double underscores (__) to indicate that they are intended for internal use within the class and not to be accessed directly from outside.__get_s3_connection: Establishes and returns an S3 client connection using AWS credentials that came from the Generic Secret we created (object-store-credentials).__get_executions: Retrieves and sorts training execution information from the specified S3 bucket and prefix.__check_model_files_exist: Checks if model files exist for the latest execution ID retrieved because it might be an execution that ended with errors and if this is the case the last execution ID would be there, but no files (models) will be available.__get_latest_valid_execution_id: Retrieves the latest valid execution ID based on existing model files.Public Methods (copy_object, upload_file_to_object_store). 
These methods are accessible from outside the class and perform specific tasks related to managing object storage.copy_object: Copies model artifacts (e.g., model.pkl, vocab.json) based on the latest valid execution ID to designated input folders for further processing.upload_file_to_object_store: Downloads a corpus file from a specified URL and uploads it to the object store (S3) using the provided object store parameters.Here’s an example of how the code will deal with S3 bucket.Step 1: It will locate the latest valid execution folder within the S3 paths (ai://default/<execution_id>/model/ and ai://default/<execution_id>/bpe_model/) where SAP AI Core stores models from that execution.Step 2: The code checks each execution folder to see if it contains models. If it doesn’t find any models in the latest execution folder, it will search in the previous execution folders until it finds valid models. If no pre-trained models are found in any of the folders, the code will raise an error indicating that no pre-trained models were located.Step 3: Once it identifies the valid execution folder containing models, it will write these models to the paths ai://shakespeare/input_model/ and ai://shakespeare/input_tokenizer/. This process ensures that the necessary models are retrieved and stored for further processing in the workflow.This ObjectStoreArtifactManager class is designed to provide all required interactions with object storage for our machine learning workflow, handling tasks such as copying model artifacts and uploading data files. It’s all about managing our resources and keeping artifacts available for resuming training. main.py for the ai-core-checkpointerYou might already have a good idea of what to expect from the files we’ll be using, right? So, we’ll focus on the specific ones that have been adapted or changed for our checkpointing feature. Let’s start with main.py. import pickle
import torch
from ShakespeareanGenerator.model.language_models import ShakespeareanLanguagelModel, ModelTrainer
from ShakespeareanGenerator.parameters import TrainingParameters
from ShakespeareanGenerator.data_handler import DataHandler
from ShakespeareanGenerator.logger import Logger

class Run:
def __init__(self):
self.logging = Logger()
self.training_params = TrainingParameters()
self.check_gpu_usage()
self.prepare_data()
self.train_model()

def check_gpu_usage(self):
if torch.cuda.is_available():
self.logging.info(f”GPU is available, using GPU: {torch.cuda.get_device_name(0)}”)
self.logging.info(f”Using CUDA version {torch.version.cuda}”)
else:
self.logging.warning(“GPU is not available, using CPU.”)

def prepare_data(self):
self.logging.info(‘START OF EXECUTION’)
self.logging.info(‘Get DataHandler and Model Instances’)
self.data_handler = DataHandler(self.training_params.DATA_PATH)
try:
with open(self.training_params.INPUT_MODEL + ‘model.pkl’, ‘rb’) as f:
loaded_model = pickle.load(f)
self.logging.info(‘Loaded model for continuing training’)
except FileNotFoundError:
loaded_model = None
self.logging.error(‘Transfer learning not possible; no model found’)
self.logging.warning(‘Model will start from scratch’)

self.model_object = ShakespeareanLanguagelModel()
model = self.model_object if loaded_model is None else loaded_model
self.model = model.to(self.training_params.device)
self.logging.info(‘DataHandler and Model Instantiated’)

def train_model(self):
self.trainer = ModelTrainer(self.data_handler, self.model)
self.trainer.train()
self.logging.info(‘Model was trained successfully’)
with open(self.training_params.MODEL_PATH + ‘model.pkl’, ‘wb’) as f:
pickle.dump(self.model, f)
self.logging.info(‘END OF EXECUTION’)

if __name__ == ‘__main__’:
Run() As you can see, the changes are mainly related to this part, specifically:  try:
with open(self.training_params.INPUT_MODEL + ‘model.pkl’, ‘rb’) as f:
loaded_model = pickle.load(f)
self.logging.info(‘Loaded model for continuing training’)
except FileNotFoundError:
loaded_model = None
self.logging.error(‘Transfer learning not possible; no model found’)
self.logging.warning(‘Model will start from scratch’)

self.model_object = ShakespeareanLanguagelModel()
model = self.model_object if loaded_model is None else loaded_model Breaking it down, we have:try and except Block: This part of the code is like a safety net. It tries to open a file (model.pkl) containing a pre-trained model. If the file is found (FileNotFoundError is not raised), it loads the model using pickle.load(f).Loading the Model: If the file is found and the model is loaded successfully, it logs a message saying “Loaded model for continuing training”. This means you can pick up where you left off and continue training your awesome Shakespearean Language Model.Handling File Not Found: If the file (model.pkl) is not found (i.e., FileNotFoundError is raised), it sets loaded_model to None and logs an error message saying “Transfer learning not possible; no model found”. It also logs a warning message indicating that the model will start training from scratch.Initializing the Model: After handling the file loading, it initializes an instance of ShakespeareanLanguageModel() and assigns it to self.model_object.Setting the Training Model: Finally, the model variable is assigned based on whether loaded_model is None or not. If loaded_model is None, it means there was no existing model to load, so it sets model to self.model_object (a new instance of the model). If loaded_model is not None, it means a pre-trained model was successfully loaded, so it sets model to the loaded model.Not that difficult, right? However, you might have noticed that training_params.INPUT_MODEL is a new parameter. If you did, you’re correct! This is one of the new parameters that will be introduced soon. Hang tight, and we’ll jump into it very quickly. tokenizer.pySince we’ve already trained the tokenizer previously, there’s no need to train them again, especially if we’re using the same dataset and not fine-tuning anything. Alright, let’s break down another piece of code related to our machine learning project – the load_tokenizer function. 
This function is all about setting up and loading a tokenizer using the SentencePieceBPETokenizer.  def load_tokenizer(self):
self.tokenizer = SentencePieceBPETokenizer.from_file(
self.training_params.INPUT_TOKENIZER_MODEL +’vocab.json’,
merges_filename = self.training_params.INPUT_TOKENIZER_MODEL +’merges.txt’) Initializing the Tokenizer: In this function, we’re initializing our tokenizer attribute using the SentencePieceBPETokenizer.from_file method.File Paths: The tokenizer is loaded from two specific files:vocab.json: This file contains the vocabulary used by the tokenizer.merges.txt: This file contains information about token merges based on the tiny Shakespeare corpus.File Paths Explanation: The file paths (self.training_params.INPUT_TOKENIZER_MODEL + ‘vocab.json’ and self.training_params.INPUT_TOKENIZER_MODEL + ‘merges.txt’) are constructed based on the input tokenizer model directory specified in self.training_params (which will get into it in a minute).Additionally, we have added self.load_tokenizer() in the encode and decode methods as well.And that’s it! Can you believe it? Everything else remains unchanged. Well, we still need to review parameters.py, but there’s not much to add to it. It’s been pretty straightforward so far, right?  parameters.pyMoving forward, we’ve seen that some parameters were added in the code below, specifically: INPUT_MODEL = “/app/input_model/”. This is the one responsible for tell SAP AI Core that there’s an input artifact which is a model and will be somehow placed on that path within the S3 bucket.INPUT_TOKENIZER_MODEL = “/app/input_tokenizer/”. Same as previous, but this time it’s about the  artifacts generated from the tokenizer training (vocabulary and merges). Adapting/Changing Docker Image and Workflow Template for CheckpointingLet’s start by checking the Dockerfile for ai-core-checkpointer (we don’t need to revisit the ai-core-checkpointer-setup because we’ve been through the same Dockerfile for ai-core-training-setup, right?): # Use the PyTorch image with CUDA 12.1 and cuDNN 8 runtime
FROM pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime

# Install necessary system dependencies
RUN apt-get update && apt-get install -y
python3-pip
&& apt-get clean
&& rm -rf /var/lib/apt/lists/*

# Create necessary directories within the Docker image
RUN mkdir -p /app/src /app/data /app/input_model /app/input_tokenizer /app/model /app/logs

# Copy files from local system to path in Docker image
COPY main.py /app/src/
COPY requirements.txt /app/src/
COPY /ShakespeareanGenerator/*.py /app/src/ShakespeareanGenerator/
COPY /ShakespeareanGenerator/model/*.py /app/src/ShakespeareanGenerator/model/

# Install Python dependencies within the Docker image
RUN pip3 install –no-cache-dir -r /app/src/requirements.txt

# Set permissions to execute anything inside the /app folder
RUN chgrp -R 65534 /app &&
chmod -R 777 /app If you’ve executed our training workflows from previous blogs, you might have noticed the use of a distinct base image in the Dockerfile for training purposes. This choice stems from the utilization of pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime, a PyTorch image tailored with CUDA 12.1 and cuDNN 8 runtime. This specialization underscores its optimization for leveraging GPU capabilities. In this manner, we ensure our training environment is finely tuned for GPU-accelerated tasks, aligning with the performance demands of modern machine learning workflows and allowing us to take full advantage of SAP AI Core Kubernetes cluster GPUs.Now, let’s focus on a couple of new directories in our setup. We’ve got some input directories that SAP AI Core will use during the workflow execution. These directories are set up as “input artifacts,” meaning SAP AI Core will look for files there, but it won’t copy anything into them – that’s the job of the “setup” step we covered in the previous blog. We’ll also walk you through the changes we made to the “setup” code so you can see what’s happening behind the scenes. RUN mkdir -p /app/data/
RUN mkdir -p /app/input_model/
RUN mkdir -p /app/input_tokenizer/ Next up, we’ve got a couple of new folders for our output. These directories are where SAP AI Core will automatically copy our outputs (thanks to the workflow template), and it will even create them if they don’t exist. The copied files will be sent to the S3 “default” path. RUN mkdir -p /app/model/
RUN mkdir -p /app/logs/ These simple mkdir commands are setting up our project structure to handle input data, models, and logs effectively within our Docker container. It’s all about keeping things organized and ready for the workflow ahead.You can download the checkpointer_template.yml file in the github repository: apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
name: “shakespeare-model-chkp”
annotations:
scenarios.ai.sap.com/name: “shakespeare-language-model”
scenarios.ai.sap.com/description: “Shakespeare Language Model”
executables.ai.sap.com/name: “Shakespeare-language-model-trainer-checkpointer”
executables.ai.sap.com/description: “Shakespeare Language Model Trainer Checkpointer Executable”
artifacts.ai.sap.com/data.kind: “dataset”
artifacts.ai.sap.com/data.description: “Tiny Shakespeare Dataset”
artifacts.ai.sap.com/model.kind: “model”
artifacts.ai.sap.com/model.description: “Trained Language Model”
artifacts.ai.sap.com/model.labels: |
{“ext.ai.sap.com/step”:”train”, “ext.ai.sap.com/version”:”0.0.1″}
artifacts.ai.sap.com/bpe_model.kind: “model”
artifacts.ai.sap.com/bpe_model.description: “Byte-Pair Encoding Tokenizer”
artifacts.ai.sap.com/bpe_model.labels: |
{“ext.ai.sap.com/step”:”train”, “ext.ai.sap.com/version”:”0.0.1″}
artifacts.ai.sap.com/setuplogs.kind: “other”
artifacts.ai.sap.com/setuplogs.description: “Setup Logs”
artifacts.ai.sap.com/setuplogs.labels: |
{“ext.ai.sap.com/step”:”setup”, “ext.ai.sap.com/version”:”0.0.1″}
artifacts.ai.sap.com/logs.kind: “other”
artifacts.ai.sap.com/logs.description: “Model Training Logs”
artifacts.ai.sap.com/logs.labels: |
{“ext.ai.sap.com/step”:”train”, “ext.ai.sap.com/version”:”0.0.1″}
labels:
scenarios.ai.sap.com/id: “shakespeare-language-model”
executables.ai.sap.com/id: “shakespeare-checkpointer”
ai.sap.com/version: “0.0.1”
spec:
imagePullSecrets:
– name: shakespeare-docker-repo
entrypoint: core
arguments:
parameters:
– name: BATCH_SIZE
description: The number of training examples processed in one iteration during training. It determines the size of each batch in the training dataset.
– name: CONTEXT_LENGTH
description: Defines the maximum length of input sequences, typically representing the number of tokens in each sequence or block of text.
– name: ITERATION_LIMIT
description: Specifies the maximum number of iterations or training steps to be performed during the training process. It controls the duration of the training loop.
– name: EVAL_FREQUENCY
description: Indicates how often model evaluation occurs during training, measured in the number of iterations or epochs between evaluations.
– name: EVAL_STEPS
description: Represents the number of evaluation steps to perform during each evaluation period. It determines the granularity of evaluation within each evaluation cycle.
– name: LEARNING_RATE
description: The rate at which the model parameters are updated during training, influencing the size of the steps taken in the parameter space to minimize the loss function.
– name: EMBEDDING_DIM
description: Determines the dimensionality of the embedding vectors used to represent tokens in the model. It impacts the expressive power of the model’s embedding layer.
– name: ATTENTION_HEADS
description: Specifies the number of parallel attention heads in the multi-head attention mechanism of the model. Each head learns different aspects of the input data.
– name: NUM_LAYERS
description: Represents the total number of transformer layers in the model architecture. It controls the depth and complexity of the model.
– name: DROPOUT
description: The probability of dropping out neurons or connections between layers during training, helping prevent overfitting by randomly deactivating some units.
– name: DICTIONARY_SIZE
description: Indicates the size of the vocabulary or dictionary used by the model, representing the total number of unique tokens or words in the dataset vocabulary.
templates:
    - name: core
      steps:
        - - name: setup
            template: setup-pipeline
        - - name: train
            template: train-pipeline
    - name: setup-pipeline
      metadata:
        labels:
          ai.sap.com/resourcePlan: basic
      outputs:
        artifacts:
          - name: setup_logs
            globalName: setup_logs
            path: /app/logs/
            archive:
              none: {}
      container:
        image: docker.io/carlosbasto/shakespeare-checkpointer-setup:0.0.1
        imagePullPolicy: Always
        command: ["/bin/sh", "-c"]
        args:
          - python /app/src/main.py
        env:
          - name: BUCKET_NAME
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: bucket
          - name: PREFIX_NAME
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: path_prefix
          - name: ACCESS_KEY_ID
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: access_key_id
          - name: SECRET_ACCESS_KEY
            valueFrom:
              secretKeyRef:
                name: object-store-credentials
                key: secret_access_key
    - name: train-pipeline
      metadata:
        labels:
          ai.sap.com/resourcePlan: train.l
      inputs:
        artifacts:
          - name: data
            path: /app/data/
          - name: input_model
            path: /app/input_model/
          - name: input_tokenizer
            path: /app/input_tokenizer/
      outputs:
        artifacts:
          - name: model
            path: /app/model/
            globalName: model
            archive:
              none: {}
          - name: logs
            path: /app/logs/
            archive:
              none: {}
      container:
        image: docker.io/carlosbasto/shakespeare-checkpointer:0.0.1
        imagePullPolicy: Always
        command: ["/bin/sh", "-c"]
        args:
          - python /app/src/main.py
        env:
          - name: BATCH_SIZE
            value: "{{workflow.parameters.BATCH_SIZE}}"
          - name: CONTEXT_LENGTH
            value: "{{workflow.parameters.CONTEXT_LENGTH}}"
          - name: ITERATION_LIMIT
            value: "{{workflow.parameters.ITERATION_LIMIT}}"
          - name: EVAL_FREQUENCY
            value: "{{workflow.parameters.EVAL_FREQUENCY}}"
          - name: EVAL_STEPS
            value: "{{workflow.parameters.EVAL_STEPS}}"
          - name: LEARNING_RATE
            value: "{{workflow.parameters.LEARNING_RATE}}"
          - name: EMBEDDING_DIM
            value: "{{workflow.parameters.EMBEDDING_DIM}}"
          - name: ATTENTION_HEADS
            value: "{{workflow.parameters.ATTENTION_HEADS}}"
          - name: NUM_LAYERS
            value: "{{workflow.parameters.NUM_LAYERS}}"
          - name: DROPOUT
            value: "{{workflow.parameters.DROPOUT}}"
          - name: DICTIONARY_SIZE
            value: "{{workflow.parameters.DICTIONARY_SIZE}}"

Place it into your own GitHub repository and then, if you sync your application, you’ll notice another file added. Let’s jump into the scenario with this new executable. Here, you’ll find two executables: the trainer from before and the new checkpointer.

In addition to the input parameters (which are the same as those for the trainer), we have both input and output artifacts to consider: three inputs (data, input_model and input_tokenizer) and two outputs (model and logs).

Creating the Configuration and Deploying the Checkpointer Workflow

Now that we have synced the scenario in SAP AI Core, let’s walk through the steps to create the necessary artifacts and configure them for our execution scenario. If you need a refresher, you can check out the previous blog post for more clarity.

Are you back? Good, let’s create an artifact of type “model” for the scenario we had (shakespeare-language-model).

Now, we’re going to create the BPE (Byte-Pair Encoding) model artifact. Give it a meaningful name that reflects its purpose.

Next, we’ll need to specify the URL or path in S3 where the BPE model will be stored. This path should be set up during the initial workflow setup.

Since this BPE model is an input artifact for the checkpointer workflow, make sure that the corresponding folder exists in the object store (S3). The same goes for the input_model folder.

Feel free to add a label if you want; it can help organize and identify artifacts more easily. That’s all for the tokenizer model setup!

Now let’s repeat the process for the input_model: create the input_model artifact following similar steps.

Once we’ve created these artifacts, we need to set up a configuration to map them to our specific scenario. Next, map the inputs within the configuration to the corresponding artifacts we’ve just created.

After following these steps, you’ll end up with a configuration that has the extra inputs incorporated into the setup.

Now, you can clearly see that we’ve mapped 3 inputs to our configuration and obtained 2 outputs as results. Pretty neat, right?
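If you prefer scripting over clicking, the artifact and configuration steps above can be sketched as plain request bodies. This is only a minimal sketch: it builds the JSON payloads and sends nothing. The field names follow my understanding of the AI API artifact and configuration endpoints, while the executable ID, artifact names, object-store paths, artifact IDs and parameter values are all hypothetical placeholders you would swap for your own.

```python
import json

# Scenario name taken from this post.
SCENARIO_ID = "shakespeare-language-model"
# Hypothetical executable ID; use the one shown for your synced template.
EXECUTABLE_ID = "shakespeare-checkpointer"


def artifact_payload(name: str, url: str, description: str = "") -> dict:
    """Build the body for registering an artifact of kind 'model'."""
    return {
        "name": name,
        "kind": "model",
        "url": url,
        "scenarioId": SCENARIO_ID,
        "description": description,
    }


def configuration_payload(name: str, parameters: dict, artifact_ids: dict) -> dict:
    """Build the body for a configuration that binds parameters and input artifacts."""
    return {
        "name": name,
        "scenarioId": SCENARIO_ID,
        "executableId": EXECUTABLE_ID,
        # Parameter values are passed as strings.
        "parameterBindings": [
            {"key": key, "value": str(value)} for key, value in parameters.items()
        ],
        # Keys must match the input artifact names in the workflow template.
        "inputArtifactBindings": [
            {"key": key, "artifactId": artifact_id}
            for key, artifact_id in artifact_ids.items()
        ],
    }


# Hypothetical names and object-store paths for the two new input artifacts.
tokenizer = artifact_payload("shakespeare-bpe-tokenizer", "ai://default/input_tokenizer")
input_model = artifact_payload("shakespeare-input-model", "ai://default/input_model")

config = configuration_payload(
    name="shakespeare-checkpointer-config",
    # Illustrative values only; a real configuration binds every template parameter.
    parameters={"BATCH_SIZE": 32, "ITERATION_LIMIT": 100},
    # Placeholder IDs; the real ones are returned when each artifact is registered.
    artifact_ids={
        "data": "<data-artifact-id>",
        "input_model": "<input-model-artifact-id>",
        "input_tokenizer": "<tokenizer-artifact-id>",
    },
)

print(json.dumps(config, indent=2))
```

The three keys under inputArtifactBindings mirror the data, input_model and input_tokenizer inputs declared in the train-pipeline template, which is why the configuration ends up with 3 mapped inputs.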
As expected, the checkpointer resumes from where the trainer left off: Training Loss goes from 3.0334 to 3.035 and Validation Loss from 3.3903 to 3.3866. It’s nothing fancy, but it gives us a little more satisfaction knowing it worked. Of course, feel free to run it as many times as you want to try and achieve better results.

Anyway, let’s check out what SAP AI Core has delivered to us in the S3 folders. This time, the execution ID is e6d11701c54a2597 in my case, so we should see a subfolder of ai://default/ with that ID after the execution is complete. There we have it! And inside it, we have the saved outputs.

Alright, great! I think we’ve come a long way. Now it’s time to switch gears and start thinking about fine-tuning our model, don’t you think?

See you in the next blog.

Wrapping Up and Next Steps

Congratulations on mastering the essentials of checkpointing and resuming training for your Shakespearean Language Model! In this blog, we’ve explored critical aspects of setting up our training workflow using Docker and SAP AI Core.

Let’s recap what we’ve covered:

Understanding Checkpointing: We discussed the concept of checkpointing, its importance, and how it can save time and resources in training large models.

Leveraging Separate Docker Images: We explored the benefits of using separate Docker images for modular design, scalability, and gaining hands-on experience with SAP AI Core.

Adapting Code for Checkpointing: We delved into modifying the code to support checkpointing, ensuring our model can resume training efficiently.

Configuring Docker Images and Workflow Templates: We set up Docker images and workflow templates to manage our checkpointing process effectively.

Deploying and Evaluating Checkpointer Workflow: We deployed the checkpointer workflow and evaluated the results, ensuring our model training process is robust and resilient.

Next Steps

Now that you’ve built and trained our Shakespearean Language Model, it’s time to dive deeper into the following advanced topics:

Fine-Tuning with Low-Rank Adaptation (LoRA): Learn how to use LoRA to fine-tune models with fewer parameters, making the process more efficient and effective. [SAP AI Core is All You Need | 5. Fine Tuning with Low-Rank Adaptation (LoRA)]

Fine-Tuning Pipeline: Dive into fine-tuning techniques to enhance model performance on specific datasets or tasks. We’ll deploy fine-tuning pipelines using SAP AI Core and look at model deployment and serving using KServe with SAP AI Core. Learn how to efficiently serve fine-tuned models for real-world applications. [SAP AI Core is All You Need | 6. Serving Shakespeare Model using SAP AI Core and KServe]

Sampling and Consuming Language Models: Discover methods for sampling from trained language models and integrating them into applications. [SAP AI Core is All You Need | 7. Deploying Language Models for Text Generation]

Developing a Language-Model-Based App: Gain insights into building an application powered by your trained language model. [SAP AI Core is All You Need | 8. Consuming and Sampling from Shakespeare Language Models]

Further References

Source Code: GitHub repository
SAP AI Core Help
General Checkpoint in PyTorch
PyTorch: Cuda Semantics
Checkpoint Google Glossary
