SAP AI Core is All You Need | 1. Building your Own Language Model with Transformers


Introduction

Welcome to the first installment of our series “SAP AI Core is All You Need“! 

In this blog, titled “Building your Own Language Model with Transformers“, we’ll dive into the amazing capabilities of Transformers – the architecture that powers models like GPT (you can also check the first paper on that: Improving Language Understanding by Generative Pre-Training). We’ll guide you through the process of building your own language model from scratch using this cutting-edge technology. Without further ado, let’s get started!

What to Expect

In this blog, you will gain hands-on experience with the following key concepts:

Understanding Transformers: Learn about the architecture that has revolutionized natural language processing (NLP) by effectively handling sequential data through attention mechanisms.
Implementing Transformers from Scratch: Discover the process of creating your own language model tailored to specific needs, using the Shakespearean language as our case study.

Get ready to dive into a series of blog posts where we’ll unlock the amazing potential of SAP AI Core and SAP AI Launchpad. We’ll explore a rich collection of components and ideas to fuel your AI adventures. And here’s a fun fact: our series title is a playful nod to the groundbreaking paper “Attention is All You Need“. Why? Because we’ll be using those same principles to build our very own language model. Ready?

 

Getting Hands-On with Transformers

First, let’s understand what “Transformers” are. The best place to start is the paper “Attention is All You Need” as we mentioned before, authored by Google in 2017. This paper presented a novel approach to handling sequential data by replacing traditional recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) with a fully attention-based model. 

Now, we’ll dive into this architecture and show you how SAP AI Core can simplify building, training, and deploying these models on a Kubernetes cluster running on GPU. Exciting, right?

Defining and Understanding Transformers

So, what are Transformers? They’re a type of deep learning model designed to process sequential data, like text, more effectively than previous models such as RNNs and LSTMs. Transformers are exceptional at capturing long-range dependencies and context, making them well suited for tasks like translation, text generation, and language understanding. The magic lies in the attention mechanism, which lets the model dynamically focus on different parts of the input sequence. But, before we move forward, it is important to note that the transformer architecture can be used in different ways:

There are three main types of transformer architectures: Encoder-Only models like BERT, which are used for tasks such as sentiment analysis; Encoder-Decoder models like BART and T5, which are used for sequence-to-sequence tasks like translation; and Decoder-Only models like GPT, which are used for generative tasks and have an autoregressive property – that last one is what we’ll be building from now on.

Why Use Transformers with SAP AI Core?

SAP AI Core is a handy service in the SAP Business Technology Platform that manages and runs your AI projects in a standardized, scalable way, without being tied to any specific cloud provider. It smoothly integrates with your SAP solutions, and you can easily use any AI function with open-source frameworks. SAP AI Core takes care of the full lifecycle management of AI scenarios. Plus, you can tap into generative AI capabilities and manage prompts via the generative AI hub. By leveraging SAP AI Core, you can speed up the entire process of building and managing transformers. Here’s how SAP AI Core can enhance your workflow:

Scalability and Efficiency: SAP AI Core allows you to train and fine-tune transformers on powerful GPU clusters, ensuring efficient handling of large datasets and complex models.
Integration with Kubernetes: Deploy your language models seamlessly on a Kubernetes cluster, benefiting from robust orchestration and resource management.
Simplified Implementation: SAP AI Core provides a suite of tools and pre-built components that simplify the implementation of advanced AI workflows, from data preprocessing to model deployment.

Implementing Transformers from Scratch

Have you ever wondered why building AI models from scratch might not always be the best route, especially when there are fantastic pre-trained models readily available, like those you’ll find on the SAP GenAI Hub? Let’s discuss why creating your own language model can be an exciting and fulfilling experience. Here are some of the main reasons to consider building your own:

Tailored to Your Specific Needs: Developing your own language model allows you to customize it precisely to fit your unique business requirements and industry nuances. This level of customization can lead to more accurate and relevant outputs for your specific use cases.
Complete Control Over Data and Privacy: By building your own model, you have full control over the data used for training, ensuring privacy and compliance with your organization’s policies and regulations. This control is important, especially for sensitive or proprietary information.
Opportunity for Innovation and Learning: Building a language model from scratch presents a valuable learning opportunity for your team. It encourages innovation, problem-solving, and deep understanding of AI technologies, fostering a culture of continuous improvement within your organization.
Domain-Specific Insights and Knowledge: In certain industries or niche markets, pre-trained models might not capture specialized vocabulary or context effectively. Creating your own model allows you to incorporate domain-specific knowledge, resulting in more relevant and actionable insights.
Long-Term Flexibility and Ownership: Building your own model means you own the intellectual property and have the flexibility to evolve and adapt the model over time as your business needs change. This long-term ownership can be a strategic advantage in a rapidly evolving AI landscape.
Potential for Competitive Advantage: A custom language model can differentiate your products or services in the market, providing a unique selling point that sets you apart from competitors relying solely on off-the-shelf solutions.

As we dive into building our Shakespearean language model, we’ve got some key classes in our toolkit that handle the most important tasks like attention, training, and logging metrics. We’ll explore each class throughout this blog to understand how they compose a decoder-only Transformer (similar to the one used in GPT-2, as described in the paper “Language Models are Unsupervised Multitask Learners“). However, we need to start somewhere, right? And that starting point is the attention mechanism.

Understanding and Implementing the Attention Mechanism

“Scaled Dot-Product Attention” is the core of the attention mechanism: it computes attention scores between query (Q) and key (K) vectors. The process can be broken down into the following steps:

Query, Key, and Value Matrices

Each input token is transformed into three different vectors: a query vector, a key vector, and a value vector. These vectors are obtained through learned linear projections.
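In matrix form, these projections are:

Q = X · W_Q,   K = X · W_K,   V = X · W_V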

Where X is the input matrix, and W_Q, W_K, W_V are the weight matrices for queries, keys, and values, respectively (the W matrices are trainable, so they are learned during training).

Calculating Attention Scores

The attention scores are computed by taking the dot product of the query vectors with the key vectors. This is where MatMul (matrix multiplication) comes into play:
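Written out (including the scaling discussed in the next step), the scores are:

scores = (Q · K^T) / sqrt(d_k)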

QK^T represents the matrix multiplication of the query matrix Q with the transpose of the key matrix K. This operation results in a matrix of attention scores, which indicates how much focus each token of the sequence should give to every other token.

Softmax

The scaled scores are then passed through a softmax function to obtain normalized attention weights, which sum to one. These weights determine the importance of each key-value pair in the context of the current query.

Weighted Sum of Values

Finally, the normalized attention weights are used to compute a weighted sum of the value vectors. This results in the output of the attention mechanism:
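Putting it all together, this is the familiar scaled dot-product attention formula:

Attention(Q, K, V) = softmax((Q · K^T) / sqrt(d_k)) · V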

Here again, MatMul is used to multiply the attention weights with the value matrix V.

Coding Attention Mechanism

Initialization Method:

 

class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(training_params.embedding_dim, head_size, bias=False)
        self.query = nn.Linear(training_params.embedding_dim, head_size, bias=False)
        self.value = nn.Linear(training_params.embedding_dim, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(training_params.context_length, training_params.context_length)))
        self.dropout = nn.Dropout(training_params.dropout)

 

self.key, self.query, self.value: These are linear layers that project the input embeddings into key, query, and value vectors, respectively. Each linear layer maps the input of size training_params.embedding_dim to a smaller size defined by head_size.
self.register_buffer('tril', ...): This creates a lower triangular matrix buffer (tril), used to mask out the upper triangular part of the attention scores matrix, ensuring that the model does not attend to future positions in the sequence. This mask is mandatory for decoder modules in the transformer architecture, as they must not have knowledge of the subsequent tokens in the sequence, which aligns with their primary objective (a short mask sketch follows this list).
self.dropout: This is a dropout layer that helps regularize the model by randomly setting some elements to zero during training; it is not part of the attention expression itself.
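To make the causal mask concrete, here is a minimal standalone sketch (not part of the project code) showing how a lower triangular matrix combined with masked_fill blocks attention to future positions:

import torch
import torch.nn.functional as F

T = 4  # toy sequence length
scores = torch.randn(T, T)                        # unscaled attention scores for one head
tril = torch.tril(torch.ones(T, T))               # lower triangular causal mask
masked = scores.masked_fill(tril == 0, float('-inf'))
weights = F.softmax(masked, dim=-1)               # each row sums to 1; future positions get weight 0
print(weights)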

Compute Weights Method:

 

def __compute_weights(self, x):
    B, T, C = x.shape
    k = self.key(x)
    q = self.query(x)

    weights = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)
    weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
    weights = F.softmax(weights, dim=-1)
    weights = self.dropout(weights)
    return weights

 

B, T, C = x.shape: Extracts the batch size (B), sequence length (T) – or context_length – and embedding dimension (C) from the input tensor x.
k = self.key(x): Projects the input tensor x to the key vector k.
q = self.query(x): Projects the input tensor x to the query vector q.
weights = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5): Computes the scaled dot-product attention scores. q @ k.transpose(-2, -1) is the matrix multiplication between the query and the transposed key matrices. The result is then scaled by the inverse square root of the key dimension (* (k.shape[-1] ** -0.5)), which is the "scale" part of the attention mechanism described above.
weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf')): Applies the mask to ensure the model only attends to current and previous positions (not future ones).
weights = F.softmax(weights, dim=-1): Applies the softmax function to the attention scores to obtain the attention weights, which sum to 1.
weights = self.dropout(weights): Applies dropout to the attention weights for regularization.

Forward Method:

 

def forward(self, x):
    weights = self.__compute_weights(x)
    v = self.value(x)
    out = weights @ v
    return out

 

weights = self.__compute_weights(x): Calls the __compute_weights method to get the attention weights.
v = self.value(x): Projects the input tensor x to the value vector v.
out = weights @ v: Computes the weighted sum of the value vectors using the attention weights. This is the core operation of the attention mechanism, where each value vector is weighted by the attention scores.

Well, that’s it for the attention mechanism. If you need a more visual explanation, the video Attention in transformers, visually explained | Chapter 6, Deep Learning by 3Blue1Brown can help; it’s a great source. Anyway, what’s next? You might think that running several heads in parallel would give the model the ability to “see from many perspectives”, right? This is exactly what Multi-Head Attention does.

Multi-Head Attention

Multi-head attention involves running multiple attention mechanisms, or “heads,” in parallel. Each head operates on a different linear projection of the input, and the results are concatenated and transformed to produce the final output. This allows the model to jointly attend to information from different representation subspaces at different positions.

Linear Projections

For a given input sequence, the model first linearly projects the input into queries (Q), keys (K), and values (V) using learned weight matrices. Each head has its own set of projection matrices.
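For the i-th head, the projections are:

Q_i = X · W_i^Q,   K_i = X · W_i^K,   V_i = X · W_i^V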

Where i indicates the i-th head, and W_i^Q, W_i^K, W_i^V are the learned projection matrices for the queries, keys, and values of the i-th head, respectively.

Scaled Dot-Product Attention

Each head computes the attention scores using scaled dot-product attention (as we saw). The attention scores for the i-th head are computed as:
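In formula form:

head_i = Attention(Q_i, K_i, V_i) = softmax((Q_i · K_i^T) / sqrt(d_k)) · V_i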

Here, d_k is the dimension of the key vectors. The softmax function ensures that the attention scores sum to one, forming a probability distribution over the inputs.

Concatenation and Linear Transformation

The outputs of all attention heads are concatenated and linearly transformed to produce the final output:
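In formula form:

MultiHead(X) = Concat(head_1, ..., head_h) · W^O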

Where head_i = Attention(Q_i, K_i, V_i) and W^O is a learned weight matrix that projects the concatenated outputs back to the original dimension.

Coding Multi-Head Attention

Initialization Method:

 

class MultiHeadAttention(nn.Module):

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.projection = nn.Linear(head_size * num_heads, training_params.embedding_dim)
        self.dropout = nn.Dropout(training_params.dropout)

 

self.heads: This line initializes multiple attention heads. Each head is an instance of the Head class, which implements the single-head attention mechanism. The number of heads is specified by num_heads, and each head operates on a subspace of the input with dimensionality head_size.
self.projection: This is a linear layer that projects the concatenated outputs of all attention heads back to the original embedding dimension. The input dimension to this layer is head_size * num_heads because the outputs from all heads are concatenated.
self.dropout: This applies dropout regularization to the output of the projection layer, helping to prevent overfitting during training.

Forward Method:

 

def forward(self, x):
    head_outputs = [head(x) for head in self.heads]
    out = torch.cat(head_outputs, dim=-1)
    out = self.dropout(self.projection(out))
    return out

 

head_outputs = [head(x) for head in self.heads]: This line iterates over each attention head and applies it to the input x. Each head processes the input independently, capturing different aspects of the input sequence. The result is a list of outputs, one for each head.
out = torch.cat(head_outputs, dim=-1): The outputs from all heads are concatenated along the last dimension (i.e., the feature dimension). This combines the information from all heads into a single tensor.
out = self.dropout(self.projection(out)): The concatenated output is passed through the projection layer to transform it back to the original embedding dimension. Dropout is then applied to the projected output for regularization.

Multi-head attention provides rich representations by letting each head focus on different parts of the sequence, enhances computational efficiency through parallelization, and improves generalization by offering diverse perspectives that help performance across different tasks and datasets.
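As a quick sanity check, here is a standalone sketch (toy dimensions chosen purely for illustration, independent of the project classes) tracing how the shapes evolve through concatenation and projection:

import torch
import torch.nn as nn

batch, seq_len, embedding_dim, num_heads = 2, 8, 64, 4
head_size = embedding_dim // num_heads            # 16 feature dimensions per head

# pretend each head already produced its (batch, seq_len, head_size) output
head_outputs = [torch.randn(batch, seq_len, head_size) for _ in range(num_heads)]
concatenated = torch.cat(head_outputs, dim=-1)    # (2, 8, 64): heads stacked on the feature axis
projection = nn.Linear(head_size * num_heads, embedding_dim)
out = projection(concatenated)                    # (2, 8, 64): back to the embedding dimension
print(concatenated.shape, out.shape)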

Feed-Forward Network (FFN)

In the transformer architecture, each layer of the encoder and decoder contains a position-wise feed-forward network, applied independently to each position in the sequence.

 

class FeedForward(nn.Module):

    def __init__(self, embedding_dim):
        super().__init__()
        self.ffnet = nn.Sequential(
            nn.Linear(embedding_dim, 4 * embedding_dim),
            nn.ReLU(inplace=True),
            nn.Linear(4 * embedding_dim, embedding_dim),
            nn.Dropout(training_params.dropout)
        )

    def forward(self, x):
        return self.ffnet(x)

 

The feed-forward network (FFN) is also an important component of the transformer architecture that provides additional transformation and learning capacity (“time to think”) to each layer. Unlike recurrent networks, which process the sequence step by step, the FFN is applied to each position independently (the same two-layer network, applied position-wise), adding transformation capacity on top of what attention mixes across positions. This enhances the model’s ability to capture complex patterns in the data.

Dimensionality Expansion: The FFN first projects the input into a higher-dimensional space (4 times the embedding dimension). This allows the model to learn more complex features by providing a greater capacity for transformation.
Non-Linearity: The ReLU activation function introduces non-linearity into the model, enabling it to learn and represent more complex patterns and relationships in the data.
Dimensionality Reduction: The second linear layer projects the higher-dimensional representation back to the original embedding dimension, ensuring that the output has the same size as the input for consistency across layers.
Regularization: Dropout is applied to prevent overfitting, improving the model’s generalization to unseen data (the whole block is summarized by the formula below).
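In equation form (with biases, since nn.Linear includes them by default), the block above computes:

FFN(x) = W_2 · ReLU(W_1 · x + b_1) + b_2

where W_1 expands the representation from embedding_dim to 4 · embedding_dim and W_2 projects it back.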

Transformer Block

Each transformer block consists of a multi-head self-attention mechanism followed by a feed-forward network, with layer normalization and residual connections around each sub-layer.

 

class TransformerBlock(nn.Module):

    def __init__(self, embedding_dim, num_heads):
        super().__init__()
        head_size = embedding_dim // num_heads
        self.self_attn = MultiHeadAttention(num_heads, head_size)
        self.feed_forward = FeedForward(embedding_dim)
        self.layer_norm1 = nn.LayerNorm(embedding_dim)
        self.layer_norm2 = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        attention_output = x + self.self_attn(self.layer_norm1(x))
        output = attention_output + self.feed_forward(self.layer_norm2(attention_output))
        return output

 

The transformer block integrates the key mechanisms that enable transformers to effectively process sequential data:

Multi-Head Self-Attention: This mechanism allows the model to attend to different parts of the input sequence simultaneously. By using multiple attention heads, the model can capture a variety of dependencies and relationships within the sequence. Each head operates on a different projection of the input, enabling the model to learn from multiple representation subspaces.
Layer Normalization: Before applying the attention and feed-forward networks, layer normalization is used to stabilize and speed up the training process by normalizing the input across the features. This helps in maintaining consistent activations and gradients.
Residual Connections: Adding the original input to the output of the sub-layers (attention and feed-forward) helps in mitigating the vanishing gradient problem and enables better gradient flow during backpropagation. This also facilitates learning identity mappings, which are critical for training deep networks.
Feed-Forward Network: After the attention mechanism, the FFN provides additional processing capacity, allowing the model to transform the attended features further.

Full Transformer Model: Shakespeare Language Model

This class represents the full transformer model, which we are calling the ShakespeareanLanguageModel; it combines embeddings, multiple transformer blocks, and the output layer.

 

class ShakespeareanLanguagelModel(nn.Module):

    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(training_params.dictionary_size, training_params.embedding_dim)
        self.position_embeddings = nn.Embedding(training_params.context_length, training_params.embedding_dim)
        self.transformer_blocks = nn.Sequential(
            *[TransformerBlock(training_params.embedding_dim, num_heads=training_params.attention_heads) for _ in range(training_params.num_layers)]
        )
        self.layer_norm = nn.LayerNorm(training_params.embedding_dim)
        self.output = nn.Linear(training_params.embedding_dim, training_params.dictionary_size)

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear) or isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, std=0.02)
            if hasattr(module, 'bias') and module.bias is not None:
                nn.init.zeros_(module.bias)

    def forward(self, index, targets=None):
        B, T = index.shape
        token_embeddings = self.embeddings(index)
        position_embeddings = self.position_embeddings(torch.arange(T, device=index.device))
        x = token_embeddings + position_embeddings
        x = self.transformer_blocks(x)
        x = self.layer_norm(x)
        logits = self.output(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

 

The full transformer model combines several key components to process and generate sequences effectively:

Token Embeddings: Converts the input tokens into dense vector representations. These embeddings capture semantic information about the tokens. At this point you may think: can I use my own embedding model? The answer is “yes”. Maybe it is a good moment to check what SAP HANA Vector Engine and SAP Gen AI Hub can do for you.
Position Embeddings: Since transformers do not inherently capture the order of the sequence, positional encodings are added to the token embeddings to provide the model with information about the position of each token in the sequence. Besides the learned position embeddings used here, transformers may use other positional encoding methods such as the original sinusoidal encodings, RoPE (rotary positional embedding), or ALiBi (Attention with Linear Biases).
Stack of Transformer Blocks: The core processing happens in a stack of transformer blocks, each containing multi-head self-attention and feed-forward networks. This stack enables the model to build deep hierarchical representations of the input data, capturing complex patterns and dependencies.
Layer Normalization: Applied after the transformer blocks to ensure stable and consistent activations.
Output Layer: A linear layer that projects the final hidden states to the vocabulary size, producing logits for each token. These logits can be used to generate probabilities for each token in the vocabulary (a minimal sampling sketch follows this list).
Weight Initialization: Ensures that the model weights are initialized properly, promoting faster convergence during training.
Loss Calculation: If target sequences are provided, the model calculates the cross-entropy loss between the predicted logits and the target tokens, facilitating supervised learning.
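To see how those logits can be turned into text, here is a minimal, illustrative sampling loop. This is a sketch of our own, not the project’s generation code (that is covered in a later blog in this series); it assumes a trained model, the project’s Tokenizer, and a TrainingParameters instance named training_params are available:

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, tokenizer, prompt, max_new_tokens=50):
    model.eval()
    # encode the prompt into token ids (tokenizer and training_params are assumed to exist)
    index = torch.tensor([tokenizer.encode(prompt).ids], dtype=torch.long, device=training_params.device)
    for _ in range(max_new_tokens):
        context = index[:, -training_params.context_length:]   # crop to the model's context window
        logits, _ = model(context)
        probs = F.softmax(logits[:, -1, :], dim=-1)             # distribution over the next token
        next_token = torch.multinomial(probs, num_samples=1)    # sample instead of argmax for variety
        index = torch.cat((index, next_token), dim=1)
    return tokenizer.decode(index[0].tolist())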

By integrating these components, the model can effectively understand and generate text, making it suitable for a variety of natural language processing tasks, including language modeling, translation, and text generation (our approach). The use of SAP AI Core further enhances the model’s training and deployment capabilities, allowing for efficient handling of large-scale data and computational resources, as you’ll see in the next blogs.

Now that you’ve explored the decoder-only transformer model, named the “Shakespearean Language Model”, you’re equipped with everything needed to generate Shakespearean text. Exciting, isn’t it? All that’s left is to walk through the implementation steps. Let’s get started!

Implementing Shakespeare Language Model (Step-by-Step):

Now that we’ve explored the potential of building our own language model, let’s delve into the code! Python is our trusty companion in this journey, and its libraries offer powerful tools for working with AI and natural language processing, so you may need to install some of them, especially PyTorch.

Another thing to mention here is that we’re creating transformers from scratch purely for educational purposes. In real-world scenarios, you can rely on libraries that offer many resources to speed up the development of your architectures such as Hugging Face, PyTorch or TensorFlow. However, we’ll start from scratch so you can gain a better understanding of how everything works behind the scenes.

1. Organizing Language Model Lifecycle with main.py

In our journey to explore the world of language modeling, we’re excited to introduce the implementation of a Shakespearean language model using Python. This script, main.py, encapsulates the key steps involved in training and deploying our custom model. Now, we’ll be discussing the code itself and the Transformers architecture. Some portions of this code will be revisited in upcoming blogs to clarify their usage and demonstrate their relevance to SAP AI Core. Let’s go!

The main.py script serves as the backbone of our language modeling project, integrating various components to build and train the Shakespearean language model.

 

import pickle
import torch
from ShakespeareanGenerator.model.language_models import ShakespeareanLanguagelModel, ModelTrainer
from ShakespeareanGenerator.parameters import TrainingParameters
from ShakespeareanGenerator.data_handler import DataHandler
from ShakespeareanGenerator.logger import Logger

class Run:
    def __init__(self):
        self.logging = Logger()
        self.training_params = TrainingParameters()
        self.check_gpu_usage()
        self.prepare_data()
        self.train_model()

    def check_gpu_usage(self):
        if torch.cuda.is_available():
            self.logging.info(f"GPU is available, using GPU: {torch.cuda.get_device_name(0)}")
            self.logging.info(f"Using CUDA version {torch.version.cuda}")
        else:
            self.logging.warning("GPU is not available, using CPU.")

    def prepare_data(self):
        self.logging.info('START OF EXECUTION')
        self.logging.info('Get DataHandler and Model Instances')
        self.data_handler = DataHandler(self.training_params.DATA_PATH)
        self.model_object = ShakespeareanLanguagelModel()
        self.model = self.model_object.to(self.training_params.device)
        self.logging.info('DataHandler and Model Instantiated')

    def train_model(self):
        self.trainer = ModelTrainer(self.data_handler, self.model)
        self.trainer.train()
        self.logging.info('Model was trained successfully')
        with open(self.training_params.MODEL_PATH + 'model.pkl', 'wb') as f:
            pickle.dump(self.model, f)
        self.logging.info('END OF EXECUTION')

if __name__ == '__main__':
    Run()

 

Imports: Imports necessary libraries and modules, including pickle, torch, and custom modules for the model, parameters, data handling, and logging.

Class Initialization:
Logger Initialization: Initializes the logging mechanism.
Training Parameters: Sets up training parameters.
Check GPU Usage: Determines if GPU is available and logs the information.

Check GPU Usage: Checks if a GPU is available and logs the GPU name and CUDA version. If not available, logs a warning to indicate CPU usage.

Prepare Data:
Log Start of Execution: Logs the start of execution.
Initialize Data Handler and Model: Logs the creation of data handler and model instances.
Set Model Device: Moves the model to the specified device (CPU or GPU).
Log Data Handler and Model Initialization: Confirms that the data handler and model have been instantiated.

Train Model:
Model Trainer Initialization: Initializes the model trainer with data handler and model.
Model Training: Trains the model.
Log Training Success: Logs that the model was trained successfully.
Save Trained Model: Saves the trained model to a file using pickle.
Log End of Execution: Logs the end of execution.

Run Class: Instantiates and runs the Run class when the script is executed. This script sets up the entire workflow for training a Shakespearean language model, including data preparation, model training, logging, and saving the trained model.

2. Configuring Training Parameters for the Shakespearean Language Model

Essentially, the model will require parameters for three main purposes: to make the model training pipeline “tunable” so that anyone can easily adjust its parameters and conduct further experiments, to establish the paths for input and output in the code, and to manage credentials for services such as SAP AI Object Store and SAP HANA. This class, defined in parameters.py, encapsulates the foundational parameters that drive the model training process.

 

# Import necessary libraries
import os
import torch

class TrainingParameters:
    def __init__(self):
        self.batch_size = int(os.environ.get('BATCH_SIZE'))
        self.context_length = int(os.environ.get('CONTEXT_LENGTH'))
        self.iteration_limit = int(os.environ.get('ITERATION_LIMIT'))
        self.eval_frequency = int(os.environ.get('EVAL_FREQUENCY'))
        self.eval_steps = int(os.environ.get('EVAL_STEPS'))
        self.learning_rate = float(os.environ.get('LEARNING_RATE'))
        self.embedding_dim = int(os.environ.get('EMBEDDING_DIM'))
        self.attention_heads = int(os.environ.get('ATTENTION_HEADS'))
        self.num_layers = int(os.environ.get('NUM_LAYERS'))
        self.dropout = float(os.environ.get('DROPOUT'))
        self.dictionary_size = int(os.environ.get('DICTIONARY_SIZE'))
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

        self.DATA_PATH = '/app/data/tinyshakespeare.txt'
        self.MODEL_PATH = '/app/model/'
        self.TOKENIZER_MODEL_PATH = '/app/tokenizer/'
        self.LOG_PATH = '/app/logs/'
        self.LOG_NAME = 'train_logs.log'

 

Environment Variable Initialization: The __init__ method retrieves key training parameters from environment variables which are defined in the YAML templates, and SAP AI Core takes care of the rest by making them available within the container. This allows the code to easily access and utilize these variables as needed.
Device Configuration (self.device): The device attribute determines whether to utilize CPU or GPU for model computation based on the availability of CUDA-enabled GPUs.
File Paths Definition: Within the TrainingParameters class, file paths are defined for key workflow components. These include the dataset path (DATA_PATH) for input, the trained model path (MODEL_PATH) for output, the tokenizer model path (TOKENIZER_MODEL_PATH), and the log file path (LOG_PATH).

With this class, our language modeling project gains enhanced configurability and efficiency, putting us one step closer to our goal.
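When running the code outside SAP AI Core (for a quick local experiment, for example), these environment variables are not injected for you. A minimal sketch of setting them manually before instantiating TrainingParameters could look like the following; the values are purely illustrative and assume the ShakespeareanGenerator package from this project is importable:

import os

local_defaults = {
    'BATCH_SIZE': '64', 'CONTEXT_LENGTH': '256', 'ITERATION_LIMIT': '5000',
    'EVAL_FREQUENCY': '500', 'EVAL_STEPS': '200', 'LEARNING_RATE': '3e-4',
    'EMBEDDING_DIM': '384', 'ATTENTION_HEADS': '6', 'NUM_LAYERS': '6',
    'DROPOUT': '0.2', 'DICTIONARY_SIZE': '8000'
}
for key, value in local_defaults.items():
    os.environ.setdefault(key, value)   # keep any value already set in the shell

from ShakespeareanGenerator.parameters import TrainingParameters
training_params = TrainingParameters()
print(training_params.device, training_params.batch_size)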

3. Tracking Progress with the Logger Class

This utility class in logger.py simplifies the process of capturing and organizing important execution details, ensuring clarity and transparency throughout our development process. You can use standard, pre-built resources for logging, but we wanted a slightly more “understandable” approach here.

 

import logging
from ShakespeareanGenerator.parameters import TrainingParameters

class Logger:
    def __init__(self):
        self.training_params = TrainingParameters()
        self.log_file = self.training_params.LOG_PATH + self.training_params.LOG_NAME

        logging.basicConfig(
            filename=self.log_file,
            filemode='w',
            format='%(asctime)s | %(name)s → %(levelname)s: %(message)s',
            level=logging.INFO
        )
        self.logger = logging.getLogger(__name__)

    def log(self, level, message):
        getattr(self.logger, level)(message)

    def info(self, message):
        self.log('info', message)

    def warning(self, message):
        self.log('warning', message)

    def error(self, message):
        self.log('error', message)

    def critical(self, message):
        self.log('critical', message)

 

Initialization: The __init__ method initializes the logger by configuring logging settings such as the log file path (log_file), format, and logging level (INFO). The TrainingParameters instance is used to retrieve the log file path defined in our project settings.
Logging Methods (info, warning, error, critical): The Logger class provides convenience methods for logging messages with different severity levels. Each method calls the log function with the corresponding logging level to streamline the logging process.

Our language modeling project now gains visibility into execution progress and potential issues. The structured logging format ensures clarity and facilitates effective debugging during model development and training. Believe me, you’ll need some good logs along the way.

4. Managing Data for the Shakespearean Language Model

To create a strong Shakespearean language model (or any other language model, for that matter), handling data efficiently is key to getting text ready to work with. Our DataHandler class, found in data_handler.py, is at the heart of managing these important data tasks for training and evaluating our language model, and for that we also need to know our data, right?

Meet the Bard by Exploring the Tiny Shakespeare Dataset

Let’s take a closer look at the Tiny Shakespeare dataset, a valuable resource for building language models within the SAP AI Core framework. The Tiny Shakespeare dataset comprises 40,000 lines from a variety of Shakespeare’s plays and was featured in Andrej Karpathy’s blog post “The Unreasonable Effectiveness of Recurrent Neural Networks”. It offers a manageable yet diverse collection of Shakespearean language, making it a practical choice for our case study.


What Makes the Tiny Shakespeare Dataset Stand Out:

Size and Manageability: With 40,000 lines, it strikes a balance between comprehensiveness and efficiency, enabling effective model training.
Linguistic Diversity: It encompasses a broad range of Shakespeare’s works, providing a comprehensive sampling of vocabulary, sentence structures, and writing styles.
Practical Exploration: This dataset invites exploration and experimentation in language modeling, offering insights into AI and natural language processing in a hands-on manner.

For further exploration, Andrej Karpathy offers an excellent breakdown of the Transformers architecture in his YouTube video “Let’s build GPT: from scratch, in code, spelled out“. Some aspects of the code we’re discussing here resemble or are identical to what he demonstrates in the video, which can be immensely helpful for better comprehension of this whole blog.

 

import torch
from ShakespeareanGenerator.model.tokenizer import Tokenizer
from ShakespeareanGenerator.parameters import TrainingParameters
from ShakespeareanGenerator.logger import Logger

class DataHandler:

    def __init__(self, path):
        self.logging = Logger()
        self.training_params = TrainingParameters()
        self.path = path
        self.data = None

    def get_data(self):
        try:
            with open(self.path, 'r', encoding='utf-8') as file:
                self.data = file.read()
        except FileNotFoundError:
            msg = 'File {} not found.'.format(self.path)
            self.logging.error(msg)
            raise FileNotFoundError(msg)

    def get_batch(self, split):
        if self.data is None:
            self.get_data()

        tokenizer = Tokenizer(
            corpus=self.data,
            vocab_size=self.training_params.dictionary_size
        )

        encoded_corpus = tokenizer.encode(self.data)
        data = torch.tensor(encoded_corpus.ids, dtype=torch.long)

        split_point = int(0.9 * len(data))
        training_set, validation_set = data[:split_point], data[split_point:]
        selected_data = training_set if split == 'train' else validation_set
        indices = torch.randint(len(selected_data) - self.training_params.context_length, (self.training_params.batch_size,))

        batches_x = []
        batches_y = []
        for index in indices:
            batch_x = selected_data[index:index + self.training_params.context_length]
            batch_y = selected_data[index + 1:index + self.training_params.context_length + 1]
            batches_x.append(batch_x)
            batches_y.append(batch_y)

        x = torch.stack(batches_x)
        y = torch.stack(batches_y)
        x, y = x.to(self.training_params.device), y.to(self.training_params.device)
        return x, y

    @torch.no_grad()
    def get_estimated_loss(self, model):
        out = {}
        model.eval()

        for split in ['train', 'val']:
            losses = torch.zeros(self.training_params.eval_steps)
            for k in range(self.training_params.eval_steps):
                X, Y = self.get_batch(split)
                logits, loss = model(X, Y)
                losses[k] = loss.item()
            out[split] = losses.mean()
            self.logging.info('Estimated losses: {}'.format(losses.mean()))
        model.train()
        return out

 

Initialization: The __init__ method initializes the DataHandler instance by setting up the logger (Logger) and retrieving training parameters (TrainingParameters). It also stores the path to the data file (path) for subsequent data loading.
Data Loading: The get_data method reads textual data from the specified file path (path) and stores it in the self.data attribute.
Batch Generation: The get_batch method processes the loaded data to generate input-output pairs (X, Y) for model training. It tokenizes the text data using a custom BPE Tokenizer initialized with the specified vocabulary size (dictionary_size). Random indices are generated to extract batches of data with a defined context length (context_length). The resulting batches (x, y) are converted to tensors and moved to the appropriate computing device (device) for accelerated training (a short usage sketch follows this list).
Loss Estimation: The get_estimated_loss utility method evaluates the estimated loss on training and validation data batches (X, Y). It iteratively computes the loss over multiple evaluation steps (eval_steps) and aggregates the results for each split (train or val). The computed losses are logged using the Logger instance.
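Assuming the environment variables and file paths above are in place (adjust them for your own environment), an illustrative way to exercise the DataHandler looks like this:

from ShakespeareanGenerator.parameters import TrainingParameters
from ShakespeareanGenerator.data_handler import DataHandler

training_params = TrainingParameters()
data_handler = DataHandler(training_params.DATA_PATH)

x, y = data_handler.get_batch('train')
# x and y have shape (batch_size, context_length); y is x shifted one token to the right
print(x.shape, y.shape)
print(x[0, :5], y[0, :5])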

5. Understanding Tokenization and Byte Pair Encoding (BPE)

Before diving into the tokenizer code, let’s talk about what tokenization is and how Byte Pair Encoding (BPE) works first.

What is Tokenization?

Tokenization is the process of breaking down a text into smaller units called tokens. These tokens can be words, characters, or subwords. In natural language processing (NLP), tokenization is a foundational step because it converts raw text into a format that a machine learning model can understand.

What is Byte Pair Encoding (BPE)?

There are a bunch of algorithms to tokenize text, ranging from simple ones like splitting by spaces or punctuation to more complex methods like WordPiece and SentencePiece. For example, basic tokenization might just split on whitespace, while WordPiece is used by models like BERT, and SentencePiece can create subword units even for languages with complex morphology. We’ll be using Byte Pair Encoding (BPE) because it strikes a good balance between simplicity and effectiveness, handling rare words well by breaking them down into more frequent subwords. This makes it particularly useful for languages with rich vocabularies and for tasks where handling out-of-vocabulary words is important.

Byte Pair Encoding (BPE) is a tokenization technique that starts with the basic characters and iteratively merges the most frequent pairs of tokens. This way, it builds a vocabulary of subword units. BPE is particularly effective for handling rare words and out-of-vocabulary terms by breaking them into more frequent subwords.
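To make the idea concrete, here is a tiny, self-contained sketch (independent of the tokenizers library, with a made-up toy corpus) that finds the most frequent adjacent pair of symbols, which is exactly the step BPE repeats, merging pair after pair, until the target vocabulary size is reached:

from collections import Counter

words = ["low", "lower", "lowest", "slow", "slowly"]
# start from characters, with a word-end marker so merges do not cross word boundaries
symbols = [list(word) + ['</w>'] for word in words]

pair_counts = Counter()
for word in symbols:
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += 1

best_pair = max(pair_counts, key=pair_counts.get)
# BPE would now merge this pair everywhere (creating a new subword) and repeat
print('most frequent pair:', best_pair, '-> new subword:', ''.join(best_pair))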

Tokenization with SentencePieceBPETokenizer

Now, let’s look at the code that implements tokenization using the SentencePieceBPETokenizer from the tokenizers library.

 

from tokenizers import SentencePieceBPETokenizer
from ShakespeareanGenerator.parameters import TrainingParameters

class Tokenizer:

    def __init__(self, corpus, vocab_size):
        training_params = TrainingParameters()
        self.TOKENIZER_MODEL_PATH = training_params.TOKENIZER_MODEL_PATH
        self.sentences = corpus.split('\n')
        self.vocab_size = vocab_size
        self.tokenizer = None

    def train_tokenizer(self):
        special_tokens = ["<pad>", "<unk>", "<s>", "</s>", "<b>"]
        self.tokenizer = SentencePieceBPETokenizer()
        self.tokenizer.train_from_iterator(
            self.sentences,
            vocab_size=self.vocab_size,
            min_frequency=2,
            special_tokens=special_tokens,
            show_progress=False
        )
        self.tokenizer.save_model(self.TOKENIZER_MODEL_PATH)

    def encode(self, text):
        if not isinstance(text, str):
            raise TypeError('Input text must be a string.')
        try:
            if self.tokenizer is None:
                self.train_tokenizer()
            return self.tokenizer.encode(text)
        except Exception as e:
            print('Error occurred during encoding: {}'.format(e))
            raise

    def decode(self, text):
        if not isinstance(text, list):
            raise TypeError('Input tokens must be a list.')
        try:
            if self.tokenizer is None:
                self.train_tokenizer()
            return self.tokenizer.decode(text)
        except Exception as e:
            print('Error occurred during decoding: {}'.format(e))
            raise

 

Initialization: In the __init__ method, we initialize our Tokenizer class with the corpus of text and the desired vocabulary size. We also set the path where our tokenizer model will be saved.
Training the Tokenizer: The train_tokenizer method is where the magic happens. It trains the SentencePieceBPETokenizer on our corpus with the specified vocabulary size and special tokens.
Encoder: The encode method takes a string of text and converts it into tokens. If the tokenizer hasn’t been trained yet, it calls train_tokenizer first.
Decoder: The decode method converts a list of tokens back into the original text. Like encode, it ensures the tokenizer is trained before decoding.

With this Tokenizer class, you can easily manage tokenization using Byte Pair Encoding, making sure your text data is all set for further processing and model training. By getting a grasp on how tokenization and BPE work, you’ll see why this preprocessing step is so foundational to Natural Language Processing (NLP) tasks.
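For illustration, a round trip through this Tokenizer class could look like the sketch below. It assumes the environment variables read by parameters.py are set, the dataset and tokenizer output paths exist (they are the container paths from TrainingParameters, so adjust them locally), and the vocab_size shown is just an example:

from ShakespeareanGenerator.model.tokenizer import Tokenizer

with open('/app/data/tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()

tokenizer = Tokenizer(corpus=corpus, vocab_size=8000)   # vocab_size is illustrative
encoded = tokenizer.encode('To be, or not to be')
print(encoded.ids)                                      # token ids
print(tokenizer.decode(encoded.ids))                    # back to (approximately) the original text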

6. Exploring Model Training with ModelTrainer

Now, let’s take a closer look at the ModelTrainer class in our language modeling project. This class handles the training process, logs important info, and optimizes the model’s parameters. We’ll go over its key functions and see how they help make our project successful.

 

class ModelTrainer:

    def __init__(self, data_handler, model):
        self.data_handler = data_handler
        self.model = model

        learning_parameters = sum(p.numel() for p in model.parameters()) / 1e6
        msg_to_log = 'The model is learning {} million parameters.'.format(learning_parameters)
        logging.info(msg_to_log)
        msg_to_metrics = '{} million parameters.'.format(learning_parameters)
        tracking.set_custom_info(
            custom_info=[
                MetricCustomInfo(name="Number of Parameters", value=str(msg_to_metrics))
            ]
        )
        self.optimizer = torch.optim.AdamW(
            self.model.parameters(), lr=training_params.learning_rate
        )

    def train(self):
        try:
            for iteration in range(training_params.iteration_limit):
                if iteration % training_params.eval_frequency == 0 or iteration == training_params.iteration_limit - 1:
                    logging.info('Epoch {} started'.format(iteration))

                    losses = self.data_handler.get_estimated_loss(self.model)

                    evaluation_msg = 'EPOCH {} | LOSS: Train {:.4f} Valid {:.4f}'.format(
                        str(iteration).ljust(5), losses['train'], losses['val']
                    )
                    logging.info(evaluation_msg)
                    tracking.set_custom_info(
                        custom_info=[
                            MetricCustomInfo(name="Epoch Status", value=str(evaluation_msg))
                        ]
                    )
                    # Metric Logging: Step Information
                    training_loss_msg = '{:.4f}'.format(losses['train'])
                    validation_loss_msg = '{:.4f}'.format(losses['val'])
                    tracking.log_metrics(
                        metrics=[
                            Metric(
                                name="Training Loss",
                                value=float(training_loss_msg),
                                timestamp=datetime.now(timezone.utc),
                                step=iteration
                            ),
                            Metric(
                                name="Validation Loss",
                                value=float(validation_loss_msg),
                                timestamp=datetime.now(timezone.utc),
                                step=iteration
                            ),
                        ]
                    )
                batches_x, batches_y = self.data_handler.get_batch('train')
                logging.info(f'Sent to Data Handler for Tokenization and Generating Batches for iteration {iteration}')
                logits, loss = self.model(batches_x, batches_y)
                logging.info(f'Forward Pass for iteration {iteration}')
                self.optimizer.zero_grad(set_to_none=True)
                loss.backward()
                logging.info(f'Backward Pass for iteration {iteration}')
                self.optimizer.step()
                logging.info(f'Optimization Step for iteration {iteration}')
        except Exception as e:
            logging.error(f'Training failed at iteration {iteration} with error: {e}')
            raise

 

Class Initialization (ModelTrainer):
Data Handler and Model: Initializes with data_handler for managing data and model for the language model to be trained.
Learning Parameters Calculation: Calculates the total number of parameters in the model and logs the information.
SAP Metrics Logging: Sets custom information in SAP AI Launchpad with the number of model parameters using MetricCustomInfo.

Training Method (train):
Log Evaluation Start: At specified intervals (training_params.eval_frequency) or at the last iteration, logs the start of a new epoch. Estimates and logs training and validation losses using self.data_handler.get_estimated_loss(self.model).
Calculate and Log Losses: Estimates the training and validation losses using data_handler and logs them.
Update SAP Metrics: Updates custom information and logs the training and validation losses in SAP AI Launchpad using tracking.log_metrics.
Data Handler Invocation: Retrieves a batch of training data using self.data_handler.get_batch('train'). Logs the step indicating that the data handler is generating batches for the current iteration.
Training Loop: Iterates through the training process for a specified number of iterations (training_params.iteration_limit).
Forward Pass: Computes the model’s predictions (logits) and loss for the current batch.
Backward Pass: Computes gradients for the model’s parameters.
Optimization Step: Updates the model’s parameters using the computed gradients.
Exception Handling: Logs any errors encountered during training and raises the exception for further handling.

Understanding and utilizing the ModelTrainer class is fundamental for effective model training and optimization. In our language modeling project, it drives the training iterations, manages data, and monitors model performance. Feel free to adapt and explore further to suit your specific machine learning initiatives!

Well, I think we’ve covered enough for now. You’ve come a long way, and it’s time to wrap up this blog and talk about the next steps. So, let’s get to it.

Wrapping Up and Next Steps

Congratulations on making it this far into transformer-based language modeling with the Tiny Shakespeare dataset! In this blog, we’ve explored the implementation of a language model using Transformers from scratch. Amazing work!

Let’s recap what we’ve covered:

Introduction to Transformers: We discussed the foundational concepts behind the transformer architecture and its revolutionary attention mechanism.
Implementing Attention Mechanism: We broke down the key components of the attention mechanism and implemented it step-by-step in code.
Multi-Head Attention: We explained how multi-head attention allows the model to capture diverse aspects of the input data by attending to different parts of the sequence simultaneously.
Feed-Forward Network: We covered the role of the position-wise feed-forward network in transforming the attended features.
Building the Full Transformer Model: We assembled all the components into a complete transformer model, specifically tailored for generating Shakespearean text.
Training the Model: We detailed the process of training the model, including data handling, logging, and optimization steps.

Next Steps

Now that we’ve laid the foundation for language modeling, stay tuned for the upcoming blogs in this series, where we’ll explore how to deploy and enhance our model using SAP AI Core:

Deploying the Training Pipeline: Learn how to deploy the training pipeline using Argo multi-step workflows with SAP AI Core. We’ll cover setting up and orchestrating training jobs efficiently. [SAP AI Core is All You Need | 2. Setting the Stage for a Shakespeare-Language Model, SAP AI Core is All You Need | 3. Workflow, Configuration, and Shakespeare Language Model Training]
Improving Model Training Efficiency: Understand how to use checkpointing and resuming to make model training more efficient. [SAP AI Core is All You Need | 4. Improving Model Training Efficiency with Checkpointing/Resuming]
Fine-Tuning with Low-Rank Adaptation (LoRA): Learn how to use LoRA to fine-tune models with fewer parameters, making the process more efficient and effective. [SAP AI Core is All You Need | 5. Fine Tuning with Low-Rank Adaptation (LoRA)]
Fine-Tuning Pipeline: Dive into fine-tuning techniques to enhance model performance on specific datasets or tasks. We’ll explore the deployment of fine-tuning pipelines using SAP AI Core, as well as model deployment and serving using KServe with SAP AI Core. Learn how to efficiently serve fine-tuned models for real-world applications. [SAP AI Core is All You Need | 6. Serving Shakespeare Model using SAP AI Core and KServe]
Sampling and Consuming Language Models: Discover methods for sampling from trained language models and integrating them into applications. [SAP AI Core is All You Need | 7. Deploying Language Models for Text Generation]
Developing a Language-Model-Based App: Gain insights into building an application powered by your trained language model. [SAP AI Core is All You Need | 8. Consuming and Sampling from Shakespeare Language Models]

Further References

Source Code: GitHub repository
SAP AI Core Help
Attention Is All You Need
A New Algorithm for Data Compression Optimization
Transformers Wikipedia
Tiktokenizer

​ IntroductionWelcome to the first installment of our series “SAP AI Core is All You Need”! In this blog, titled “Building your Own Language Model with Transformers”, we’ll dive into the amazing capabilities of Transformers – the architecture that powers models like GPT (you can also check the first paper on that: Improving Language Understanding by Generative Pre-Training). We’ll guide you through the process of building a (your) language model from scratch using this cutting-edge technology. Without further ado, let’s get started!What to ExpectIn this blog, you will gain hands-on experience with the following key concepts:Understanding Transformers: Learn about the architecture that has revolutionized natural language processing (NLP) by effectively handling sequential data through attention mechanisms.Implementing Transformers from Scratch: Discover the process of creating your own language model tailored to specific needs, using the Shakespearean language as our case study.Get ready to dive into a series of blog posts where we’ll unlock the amazing potential of SAP AI Core and SAP AI Launchpad. We’ll explore a rich collection of components and ideas to fuel your AI adventures. And here’s a fun fact: our series title is a playful nod to the groundbreaking paper “Attention is All You Need”. Why? Because we’ll be using those same principles to build our very own language model. Ready? Getting Hands-On with TransformersFirst, let’s understand what “Transformers” are. The best place to start is the paper “Attention is All You Need” as we mentioned before, authored by Google in 2017. This paper presented a novel approach to handling sequential data by replacing traditional recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) with a fully attention-based model. Now, we’ll dive into this architecture and show you how SAP AI Core can simplify building, training, and deploying these models on a Kubernetes cluster running on GPU. Exciting, right?Defining and Understanding TransformersSo, what are Transformers? They’re a type of deep learning model designed to process sequential data, like text, more effectively than previous models such as RNNs and LSTMs. Transformers are exceptional at capturing long-range dependencies and context, making them promising for tasks like translation, text generation, and language understanding. The magic lies in the attention mechanism, which lets the model dynamically focus on different parts of the input sequence. But, before we move forward, it is important to notice that we can use transformer architecture in different ways:The image provides a visual comparison of three types of transformer architectures: Encoder-Only models like BERT, which are used for tasks such as sentiment analysis; Encoder-Decoder models like BART and T5, which are used for sequence-to-sequence tasks like translation; and Decoder-Only models like GPT, which are used for generative tasks and feature an autoregressive property – that is the one we’ll be building from now on.Why Use Transformers with SAP AI Core?SAP AI Core is a handy service in the SAP Business Technology Platform that manages and runs your AI projects in a standardized, scalable way, without being tied to any specific cloud provider. It smoothly integrates with your SAP solutions, and you can easily use any AI function with open-source frameworks. SAP AI Core takes care of the full lifecycle management of AI scenarios. 
Plus, you can tap into generative AI capabilities and manage prompts via the generative AI hub. By leveraging SAP AI Core, you can speed up the entire process of building and managing transformers. Here’s how SAP AI Core can enhance your workflow:Scalability and Efficiency: SAP AI Core allows you to train and fine-tune transformers on powerful GPU clusters, ensuring efficient handling of large datasets and complex models.Integration with Kubernetes: Deploy your language models seamlessly on a Kubernetes cluster, benefiting from robust orchestration and resource management.Simplified Implementation: SAP AI Core provides a suite of tools and pre-built components that simplify the implementation of advanced AI workflows, from data preprocessing to model deployment.Implementing Transformers from ScratchHave you ever wondered why building AI models from scratch might not always be the best route, especially when there are fantastic pre-trained models readily available, like those you’ll find on the SAP GenAI Hub? Let’s discuss why creating your own language model can be an exciting and fulfilling experience. Here are some of the main reasons to consider building your own:Tailored to Your Specific Needs: Developing your own language model allows you to customize it precisely to fit your unique business requirements and industry nuances. This level of customization can lead to more accurate and relevant outputs for your specific use cases.Complete Control Over Data and Privacy: By building your own model, you have full control over the data used for training, ensuring privacy and compliance with your organization’s policies and regulations. This control is important, especially for sensitive or proprietary information.Opportunity for Innovation and Learning: Building a language model from scratch presents a valuable learning opportunity for your team. It encourages innovation, problem-solving, and deep understanding of AI technologies, fostering a culture of continuous improvement within your organization.Domain-Specific Insights and Knowledge: In certain industries or niche markets, pre-trained models might not capture specialized vocabulary or context effectively. Creating your own model allows you to incorporate domain-specific knowledge, resulting in more relevant and actionable insights.Long-Term Flexibility and Ownership: Building your own model means you own the intellectual property and have the flexibility to evolve and adapt the model over time as your business needs change. This long-term ownership can be a strategic advantage in a rapidly evolving AI landscape.Potential for Competitive Advantage: A custom language model can differentiate your products or services in the market, providing a unique selling point that sets you apart from competitors relying solely on off-the-shelf solutions.As we dive into building our Shakespearean language model, we’ve got some key classes in our toolkit that handle the most important tasks like attention, training, and logging metrics. We’ll explore each class throughout this blog to understand how they compose a decoder-only Transformer (similar to the used in GPT2 as described on the paper “Language Models are Unsupervised Multitask Learners”). However, we need to start from somewhere, right? 
And, this is going to be the attention mechanism.Understanding and Implementing the Attention MechanismThe “Scaled Dot-Product Attention” is the core of the attention mechanism that involves the computation of attention scores between query (Q) and key (K) vectors. The process can be broken down into the following steps:Query, Key, and Value MatricesEach input token is transformed into three different vectors: a query vector, a key vector, and a value vector. These vectors are obtained through learned linear projections.Where X is the input matrix, and WQ, WK, WV are the weight matrices for queries, keys, and values, respectively (the W matrices are trainable – so it learns).Calculating Attention ScoresThe attention scores are computed by taking the dot product of the query vectors with the key vectors. This is where MatMul (matrix multiplication) comes into play:The QKT represents the matrix multiplication of the query matrix Q with the transpose of the key matrix K. This operation results in a matrix of attention scores, which indicates how much focus each token of the sequence should give to every other token.SoftmaxThe scaled scores are then passed through a softmax function to obtain normalized attention weights, which sum to one. These weights determine the importance of each key-value pair in the context of the current query.Weighted Sum of ValuesFinally, the normalized attention weights are used to compute a weighted sum of the value vectors. This results in the output of the attention mechanism:Here again, MatMul is used to multiply the attention weights with the value matrix V.Coding Attention MechanismInitialization Method: class Head(nn.Module):
Coding the Attention Mechanism

Initialization Method:

class Head(nn.Module):

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(training_params.embedding_dim, head_size, bias=False)
        self.query = nn.Linear(training_params.embedding_dim, head_size, bias=False)
        self.value = nn.Linear(training_params.embedding_dim, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(training_params.context_length, training_params.context_length)))
        self.dropout = nn.Dropout(training_params.dropout)

self.key, self.query, self.value: These are linear layers that project the input embeddings into key, query, and value vectors, respectively. Each linear layer maps the input of size training_params.embedding_dim to a smaller size defined by head_size.
self.register_buffer('tril', ...): This creates a lower triangular matrix buffer (tril), used to mask out the upper triangular part of the attention scores matrix, ensuring that the model does not attend to future positions in the sequence. This mask is mandatory for decoder modules in the transformer architecture, as they must not have knowledge of the subsequent tokens in the sequence, which aligns with their primary objective.
self.dropout: This is a dropout layer that helps regularize the model by randomly setting some elements to zero during training; it is not part of the attention expression itself.

Compute Weights Method:

    def __compute_weights(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)

        weights = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5)
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        weights = F.softmax(weights, dim=-1)
        weights = self.dropout(weights)
        return weights

B, T, C = x.shape: Extracts the batch size (B), the sequence length (T), also called the context_length, and the embedding dimension (C) from the input tensor x.
k = self.key(x): Projects the input tensor x to the key vectors k.
q = self.query(x): Projects the input tensor x to the query vectors q.
weights = q @ k.transpose(-2, -1) * (k.shape[-1] ** -0.5): Computes the scaled dot-product attention scores. q @ k.transpose(-2, -1) is the matrix multiplication between the query and the transposed key matrices. The result is then scaled by the inverse square root of the key dimension (* (k.shape[-1] ** -0.5)), which is the "scale" step of the attention formula above.
weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf')): Applies the mask to ensure the model only attends to current and previous positions (not future ones).
weights = F.softmax(weights, dim=-1): Applies the softmax function to the attention scores to obtain the attention weights, which sum to 1.
weights = self.dropout(weights): Applies dropout to the attention weights for regularization.
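To make the effect of the tril mask concrete, here is a small standalone sketch (illustrative only, not part of the project code) showing how masked_fill and softmax turn a 4x4 score matrix into causal attention weights, where each position can only attend to itself and earlier positions:

# Illustrative causal masking on a tiny 4x4 score matrix.
import torch
import torch.nn.functional as F

T = 4
scores = torch.zeros(T, T)                           # pretend all raw scores are equal
tril = torch.tril(torch.ones(T, T))                  # lower triangular mask
masked = scores.masked_fill(tril == 0, float('-inf'))
weights = F.softmax(masked, dim=-1)
print(weights)
# Row i spreads its attention uniformly over positions 0..i and gives zero weight
# to future positions: the first row is [1, 0, 0, 0], the last is [0.25, 0.25, 0.25, 0.25].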

Forward Method:

    def forward(self, x):
        weights = self.__compute_weights(x)
        v = self.value(x)
        out = weights @ v
        return out

weights = self.__compute_weights(x): Calls the __compute_weights method to get the attention weights.
v = self.value(x): Projects the input tensor x to the value vectors v.
out = weights @ v: Computes the weighted sum of the value vectors using the attention weights. This is the core operation of the attention mechanism, where each value vector is weighted by the attention scores.

Well, that's it for the attention mechanism. If you need a more visual explanation, "Attention in transformers, visually explained | Chapter 6, Deep Learning" by 3Blue1Brown is a great source to go to. Anyway, what's next? You might think that putting several heads to run in parallel would give the model the ability to "see from many perspectives", right? This is exactly what Multi-Head Attention does.

Multi-Head Attention

Multi-head attention involves running multiple attention mechanisms, or "heads," in parallel. Each head operates on a different linear projection of the input, and the results are concatenated and transformed to produce the final output. This allows the model to jointly attend to information from different representation subspaces at different positions.

Linear Projections
For a given input sequence, the model first linearly projects the input into queries (Q), keys (K), and values (V) using learned weight matrices. Each head has its own set of projection matrices:

Q_i = X W_Q_i,  K_i = X W_K_i,  V_i = X W_V_i

where i indicates the i-th head, and W_Q_i, W_K_i, W_V_i are the learned projection matrices for the queries, keys, and values of the i-th head, respectively.

Scaled Dot-Product Attention
Each head computes its attention scores using scaled dot-product attention (as we saw):

head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / sqrt(d_k)) V_i

Here, d_k is the dimension of the key vectors. The softmax function ensures that the attention scores sum to one, forming a probability distribution over the inputs.

Concatenation and Linear Transformation
The outputs of all attention heads are concatenated and linearly transformed to produce the final output:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O

where head_i = Attention(Q_i, K_i, V_i) and W_O is a learned weight matrix that projects the concatenated outputs back to the original dimension.
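Before looking at the class itself, here is a short, illustrative sketch showing that concatenating several small heads produces a tensor whose last dimension can be projected back to embedding_dim, which is exactly what the MultiHeadAttention module below does. It assumes the Head class above is defined and that training_params has been populated (for example with embedding_dim=64):

# Illustrative only: run several heads in parallel and concatenate their outputs.
import torch
import torch.nn as nn

num_heads = 4
head_size = training_params.embedding_dim // num_heads   # e.g. 64 // 4 = 16

heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
x = torch.randn(2, training_params.context_length, training_params.embedding_dim)

out = torch.cat([head(x) for head in heads], dim=-1)
print(out.shape)   # (2, context_length, embedding_dim): ready for the final projection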
Coding Multi-Head Attention

Initialization Method:

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.projection = nn.Linear(head_size * num_heads, training_params.embedding_dim)
        self.dropout = nn.Dropout(training_params.dropout)

self.heads: This line initializes multiple attention heads. Each head is an instance of the Head class, which implements the single-head attention mechanism. The number of heads is specified by num_heads, and each head operates on a subspace of the input with dimensionality head_size.
self.projection: This is a linear layer that projects the concatenated outputs of all attention heads back to the original embedding dimension. The input dimension to this layer is head_size * num_heads because the outputs from all heads are concatenated.
self.dropout: This applies dropout regularization to the output of the projection layer, helping to prevent overfitting during training.

Forward Method:
    def forward(self, x):
        head_outputs = [head(x) for head in self.heads]
        out = torch.cat(head_outputs, dim=-1)
        out = self.dropout(self.projection(out))
        return out

head_outputs = [head(x) for head in self.heads]: This line iterates over each attention head and applies it to the input x. Each head processes the input independently, capturing different aspects of the input sequence. The result is a list of outputs, one for each head.
out = torch.cat(head_outputs, dim=-1): The outputs from all heads are concatenated along the last dimension (i.e., the feature dimension). This combines the information from all heads into a single tensor.
out = self.dropout(self.projection(out)): The concatenated output is passed through the projection layer to transform it back to the original embedding dimension. Dropout is then applied to the projected output for regularization.

Multi-head attention provides rich representations by capturing different aspects of the input data, with each head focusing on different parts of the sequence. It also enhances computational efficiency through parallelization and improves generalization by offering diverse perspectives, leading to better performance across different tasks and datasets.

Feed-Forward Network (FFN)

In the transformer architecture, each layer of the encoder and decoder contains a position-wise feed-forward network, applied independently to each position in the sequence.
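For reference, the original "Attention is All You Need" paper defines this position-wise network as two linear transformations with a ReLU in between:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

In the implementation below, the inner dimension is 4 * embedding_dim, mirroring the paper's choice of an inner layer four times wider than the model dimension.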
class FeedForward(nn.Module):
    def __init__(self, embedding_dim):
        super().__init__()
        self.ffnet = nn.Sequential(
            nn.Linear(embedding_dim, 4 * embedding_dim),
            nn.ReLU(inplace=True),
            nn.Linear(4 * embedding_dim, embedding_dim),
            nn.Dropout(training_params.dropout)
        )

    def forward(self, x):
        return self.ffnet(x)

The feed-forward network (FFN) is also an important component of the transformer architecture, providing additional transformation and learning capacity ("time to think") to each layer. Unlike recurrent networks, which process the sequence step by step, the FFN is applied to each position separately and identically, transforming every token's representation independently of the others. This enhances the model's ability to capture complex patterns in the data.

Dimensionality Expansion: The FFN first projects the input into a higher-dimensional space (4 times the embedding dimension). This allows the model to learn more complex features by providing a greater capacity for transformation.
Non-Linearity: The ReLU activation function introduces non-linearity into the model, enabling it to learn and represent more complex patterns and relationships in the data.
Dimensionality Reduction: The second linear layer projects the higher-dimensional representation back to the original embedding dimension, ensuring that the output has the same size as the input for consistency across layers.
Regularization: Dropout is applied to prevent overfitting, improving the model's generalization to unseen data.

Transformer Block

Each transformer block consists of a multi-head self-attention mechanism followed by a feed-forward network, with layer normalization and residual connections around each sub-layer.
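The implementation below uses the pre-norm variant of this arrangement, where layer normalization is applied before each sub-layer rather than after it (a common choice in GPT-style decoders). Written out, the block computes:

x' = x + MultiHeadAttention(LayerNorm(x))
y  = x' + FFN(LayerNorm(x'))

Both residual additions keep the tensor shape (batch, context_length, embedding_dim) unchanged, which is what allows the blocks to be stacked.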
class TransformerBlock(nn.Module):
    def __init__(self, embedding_dim, num_heads):
        super().__init__()
        head_size = embedding_dim // num_heads
        self.self_attn = MultiHeadAttention(num_heads, head_size)
        self.feed_forward = FeedForward(embedding_dim)
        self.layer_norm1 = nn.LayerNorm(embedding_dim)
        self.layer_norm2 = nn.LayerNorm(embedding_dim)

    def forward(self, x):
        attention_output = x + self.self_attn(self.layer_norm1(x))
        output = attention_output + self.feed_forward(self.layer_norm2(attention_output))
        return output

The transformer block integrates the key mechanisms that enable transformers to effectively process sequential data:

Multi-Head Self-Attention: This mechanism allows the model to attend to different parts of the input sequence simultaneously. By using multiple attention heads, the model can capture a variety of dependencies and relationships within the sequence. Each head operates on a different projection of the input, enabling the model to learn from multiple representation subspaces.
Layer Normalization: Before applying the attention and feed-forward networks, layer normalization is used to stabilize and speed up the training process by normalizing the input across the features. This helps in maintaining consistent activations and gradients.
Residual Connections: Adding the original input to the output of the sub-layers (attention and feed-forward) helps mitigate the vanishing gradient problem and enables better gradient flow during backpropagation. This also facilitates learning identity mappings, which are critical for training deep networks.
Feed-Forward Network: After the attention mechanism, the FFN provides additional processing capacity, allowing the model to transform the attended features further.
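As a quick sanity check (illustrative only; it assumes the classes above are defined and training_params has been populated from the environment), you can verify that a block preserves the input shape, which is what lets several blocks be stacked with nn.Sequential:

# Illustrative shape check for a single transformer block.
import torch

block = TransformerBlock(training_params.embedding_dim, num_heads=training_params.attention_heads)
x = torch.randn(8, training_params.context_length, training_params.embedding_dim)
y = block(x)
print(y.shape)   # same shape as x, so blocks compose cleanly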
Full Transformer Model: Shakespeare Language Model

This class represents the full transformer model, which we are calling the ShakespeareanLanguageModel, combining embeddings, multiple transformer blocks, and the output layer.

class ShakespeareanLanguagelModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(training_params.dictionary_size, training_params.embedding_dim)
        self.position_embeddings = nn.Embedding(training_params.context_length, training_params.embedding_dim)
        self.transformer_blocks = nn.Sequential(
            *[TransformerBlock(training_params.embedding_dim, num_heads=training_params.attention_heads) for _ in range(training_params.num_layers)]
        )
        self.layer_norm = nn.LayerNorm(training_params.embedding_dim)
        self.output = nn.Linear(training_params.embedding_dim, training_params.dictionary_size)

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear) or isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, std=0.02)
            if hasattr(module, 'bias') and module.bias is not None:
                nn.init.zeros_(module.bias)

    def forward(self, index, targets=None):
        B, T = index.shape
        token_embeddings = self.embeddings(index)
        position_embeddings = self.position_embeddings(torch.arange(T, device=index.device))
        x = token_embeddings + position_embeddings
        x = self.transformer_blocks(x)
        x = self.layer_norm(x)
        logits = self.output(x)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

The full transformer model combines several key components to process and generate sequences effectively:

Token Embeddings: Converts the input tokens into dense vector representations. These embeddings capture semantic information about the tokens. At this point you may wonder whether you can use your own embedding model, and the answer is "yes". Maybe it is a good moment to check what the SAP HANA Vector Engine and SAP Gen AI Hub can do for you.
Position Embeddings: Since transformers do not inherently capture the order of the sequence, positional encodings are added to the token embeddings to provide the model with information about the position of each token in the sequence. Transformers may use positional encoding methods other than sinusoidal ones, such as RoPE (rotary positional embedding) or ALiBi (Attention with Linear Biases); here we use learned position embeddings.
Stack of Transformer Blocks: The core processing happens in a stack of transformer blocks, each containing multi-head self-attention and feed-forward networks. This stack enables the model to build deep hierarchical representations of the input data, capturing complex patterns and dependencies.
Layer Normalization: Applied after the transformer blocks to ensure stable and consistent activations.
Output Layer: A linear layer that projects the final hidden states to the vocabulary size, producing logits for each token. These logits can be used to generate probabilities for each token in the vocabulary.
Weight Initialization: Ensures that the model weights are initialized properly, promoting faster convergence during training.
Loss Calculation: If target sequences are provided, the model calculates the cross-entropy loss between the predicted logits and the target tokens, facilitating supervised learning.

By integrating these components, the model can effectively understand and generate text, making it suitable for a variety of natural language processing tasks, including language modeling, translation, and text generation (our approach). The use of SAP AI Core further enhances the model's training and deployment capabilities, allowing for efficient handling of large-scale data and computational resources, as you'll see in the next blogs.
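The class above only defines the forward pass; generation itself is covered later in this series. As a hedged illustration of how a decoder-only model like this is typically sampled, an autoregressive loop might look like the sketch below, where model, tokenizer, and training_params are assumed to exist already:

# Illustrative autoregressive sampling loop (not the project's serving code).
# Assumes a trained `model`, a fitted `tokenizer`, and `training_params` exist.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, start_ids, max_new_tokens=100):
    model.eval()
    index = start_ids  # shape (1, T): token ids of the prompt
    for _ in range(max_new_tokens):
        # Keep only the last context_length tokens, matching the position table.
        context = index[:, -training_params.context_length:]
        logits, _ = model(context)
        probs = F.softmax(logits[:, -1, :], dim=-1)    # distribution over the next token
        next_id = torch.multinomial(probs, num_samples=1)
        index = torch.cat([index, next_id], dim=1)      # append and continue
    return index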
Now that you've explored the decoder-only transformer model, named the "Shakespearean Language Model", you're equipped with everything needed to generate Shakespearean text. Exciting, isn't it? All that's left is to walk through the implementation steps. Let's get started!

Implementing the Shakespeare Language Model (Step-by-Step)

Now that we've explored the potential of building our own language model, let's delve into the code! Python is our trusty companion in this journey, and its libraries offer powerful tools for working with AI and natural language processing, so you may need to install some of them, especially PyTorch. Another thing to mention here is that we're creating transformers from scratch purely for educational purposes. In real-world scenarios, you can rely on libraries that offer many resources to speed up the development of your architectures, such as Hugging Face, PyTorch, or TensorFlow. However, we'll start from scratch so you can gain a better understanding of how everything works behind the scenes.

1. Organizing the Language Model Lifecycle with main.py

In our journey to explore the world of language modeling, we're excited to introduce the implementation of a Shakespearean language model using Python. The main.py script encapsulates the key steps involved in training and deploying our custom model. Now, we'll be discussing the code itself and the Transformers architecture. Some portions of this code will be revisited in upcoming blogs to clarify their usage and demonstrate their relevance to SAP AI Core. Let's go!

The main.py script serves as the backbone of our language modeling project, integrating various components to build and train the Shakespearean language model.

import pickle
import torch
from ShakespeareanGenerator.model.language_models import ShakespeareanLanguagelModel, ModelTrainer
from ShakespeareanGenerator.parameters import TrainingParameters
from ShakespeareanGenerator.data_handler import DataHandler
from ShakespeareanGenerator.logger import Logger

class Run:
    def __init__(self):
        self.logging = Logger()
        self.training_params = TrainingParameters()
        self.check_gpu_usage()
        self.prepare_data()
        self.train_model()

    def check_gpu_usage(self):
        if torch.cuda.is_available():
            self.logging.info(f"GPU is available, using GPU: {torch.cuda.get_device_name(0)}")
            self.logging.info(f"Using CUDA version {torch.version.cuda}")
        else:
            self.logging.warning("GPU is not available, using CPU.")

    def prepare_data(self):
        self.logging.info('START OF EXECUTION')
        self.logging.info('Get DataHandler and Model Instances')
        self.data_handler = DataHandler(self.training_params.DATA_PATH)
        self.model_object = ShakespeareanLanguagelModel()
        self.model = self.model_object.to(self.training_params.device)
        self.logging.info('DataHandler and Model Instantiated')

    def train_model(self):
        self.trainer = ModelTrainer(self.data_handler, self.model)
        self.trainer.train()
        self.logging.info('Model was trained successfully')
        with open(self.training_params.MODEL_PATH + 'model.pkl', 'wb') as f:
            pickle.dump(self.model, f)
        self.logging.info('END OF EXECUTION')

if __name__ == '__main__':
    Run()

Imports: Imports the necessary libraries and modules, including pickle, torch, and custom modules for the model, parameters, data handling, and logging.
Class Initialization: Initializes the logging mechanism, sets up the training parameters, and then triggers the GPU check, data preparation, and model training.
Check GPU Usage: Checks if a GPU is available and logs the GPU name and CUDA version. If not available, logs a warning to indicate CPU usage.
Prepare Data: Logs the start of execution, instantiates the data handler and the model, moves the model to the configured device (CPU or GPU), and confirms that both have been instantiated.
Train Model: Initializes the model trainer with the data handler and model, trains the model, logs the training success, saves the trained model to a file using pickle, and logs the end of execution.
Run Class: Instantiates and runs the Run class when the script is executed.

This script sets up the entire workflow for training a Shakespearean language model, including data preparation, model training, logging, and saving the trained model.

2. Configuring Training Parameters for the Shakespearean Language Model

Essentially, the model will require parameters for three main purposes: to make the model training pipeline "tunable" so that anyone can easily adjust its parameters and conduct further experiments, to establish the paths for input and output in the code, and to manage credentials for services such as SAP AI Object Store and SAP HANA. This class, defined in parameters.py, encapsulates the foundational parameters that drive the model training process.

# Import necessary libraries
import os
import torch

class TrainingParameters:
    def __init__(self):
        self.batch_size = int(os.environ.get('BATCH_SIZE'))
        self.context_length = int(os.environ.get('CONTEXT_LENGTH'))
        self.iteration_limit = int(os.environ.get('ITERATION_LIMIT'))
        self.eval_frequency = int(os.environ.get('EVAL_FREQUENCY'))
        self.eval_steps = int(os.environ.get('EVAL_STEPS'))
        self.learning_rate = float(os.environ.get('LEARNING_RATE'))
        self.embedding_dim = int(os.environ.get('EMBEDDING_DIM'))
        self.attention_heads = int(os.environ.get('ATTENTION_HEADS'))
        self.num_layers = int(os.environ.get('NUM_LAYERS'))
        self.dropout = float(os.environ.get('DROPOUT'))
        self.dictionary_size = int(os.environ.get('DICTIONARY_SIZE'))
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

        self.DATA_PATH = '/app/data/tinyshakespeare.txt'
        self.MODEL_PATH = '/app/model/'
        self.TOKENIZER_MODEL_PATH = '/app/tokenizer/'
        self.LOG_PATH = '/app/logs/'
        self.LOG_NAME = 'train_logs.log'

Environment Variable Initialization: The __init__ method retrieves the key training parameters from environment variables, which are defined in the YAML templates; SAP AI Core takes care of the rest by making them available within the container. This allows the code to easily access and utilize these variables as needed.
Device Configuration (self.device): The device attribute determines whether to use the CPU or GPU for model computation, based on the availability of CUDA-enabled GPUs.
File Paths Definition: Within the TrainingParameters class, file paths are defined for key workflow components. These include the dataset path (DATA_PATH) for input, the trained model path (MODEL_PATH) for output, the tokenizer model path (TOKENIZER_MODEL_PATH), and the log file path (LOG_PATH).

With this class, our language modeling project gains enhanced configurability and efficiency, putting us one step closer to our goal.
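If you want to experiment with these classes locally before wiring everything into SAP AI Core, one hedged approach is to populate the same environment variables yourself before instantiating TrainingParameters. The values below are purely illustrative, not the ones used later in this series:

# Illustrative local setup only; in SAP AI Core these values come from the
# workflow's YAML template and are injected into the container automatically.
import os

local_defaults = {
    'BATCH_SIZE': '32',
    'CONTEXT_LENGTH': '256',
    'ITERATION_LIMIT': '5000',
    'EVAL_FREQUENCY': '500',
    'EVAL_STEPS': '100',
    'LEARNING_RATE': '0.0003',
    'EMBEDDING_DIM': '384',
    'ATTENTION_HEADS': '6',
    'NUM_LAYERS': '6',
    'DROPOUT': '0.2',
    'DICTIONARY_SIZE': '5000',
}
for key, value in local_defaults.items():
    os.environ.setdefault(key, value)   # keep any values already exported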
3. Tracking Progress with the Logger Class

This utility class, logger.py, simplifies the process of capturing and organizing important execution details, ensuring clarity and transparency throughout our development process. You could use standard, pre-built resources for logging, but we wanted a slightly more "understandable" approach here.

import logging
from ShakespeareanGenerator.parameters import TrainingParameters

class Logger:
    def __init__(self):
        self.training_params = TrainingParameters()
        self.log_file = self.training_params.LOG_PATH + self.training_params.LOG_NAME

        logging.basicConfig(
            filename=self.log_file,
            filemode='w',
            format='%(asctime)s | %(name)s → %(levelname)s: %(message)s',
            level=logging.INFO
        )
        self.logger = logging.getLogger(__name__)

    def log(self, level, message):
        getattr(self.logger, level)(message)

    def info(self, message):
        self.log('info', message)

    def warning(self, message):
        self.log('warning', message)

    def error(self, message):
        self.log('error', message)

    def critical(self, message):
        self.log('critical', message)

Initialization: The __init__ method configures the logging settings, such as the log file path (log_file), the message format, and the logging level (INFO), so that informational messages and anything more severe are captured. The TrainingParameters instance is used to retrieve the log file path defined in our project settings.
Logging Methods (info, warning, error, critical): The Logger class provides convenience methods for logging messages with different severity levels. Each method calls the log function with the corresponding logging level to streamline the logging process.

Our language modeling project now gains visibility into execution progress and potential issues. The structured logging format ensures clarity and facilitates effective debugging during model development and training. Believe me, you'll need some good logs along the way.

4. Managing Data for the Shakespearean Language Model

To create a strong Shakespearean language model (like any other language model), handling data efficiently is key to getting the text ready to work with. Our DataHandler class, found in data_handler.py, is at the heart of managing these important data tasks for training and evaluating our language model. And for this, we also need to know our data, right?

Meet the Bard by Exploring the Tiny Shakespeare Dataset

Let's take a closer look at the Tiny Shakespeare dataset, a valuable resource for building language models within the SAP AI Core framework. The Tiny Shakespeare dataset comprises 40,000 lines drawn from a variety of Shakespeare's plays and is featured in Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks". It offers a manageable yet diverse collection of Shakespearean language, making it a practical choice for our case study.

What Makes the Tiny Shakespeare Dataset Stand Out:

Size and Manageability: With 40,000 lines, it strikes a balance between comprehensiveness and efficiency, enabling effective model training.
Linguistic Diversity: It encompasses a broad range of Shakespeare's works, providing a comprehensive sampling of vocabulary, sentence structures, and writing styles.
Practical Exploration: This dataset invites exploration and experimentation in language modeling, offering insights into AI and natural language processing in a hands-on manner.

For further exploration, Andrej Karpathy offers an excellent breakdown of the Transformers architecture in his YouTube video "Let's build GPT: from scratch, in code, spelled out". Some aspects of the code we're discussing here resemble or are identical to what he demonstrates in the video, which can be immensely helpful for a better comprehension of this whole blog.
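If you would like a local copy of the dataset to experiment with, a small helper like the one below can fetch it. The URL points to the copy commonly referenced from Karpathy's char-rnn repository (an assumption worth verifying); adjust the target path to match DATA_PATH in your own setup:

# Optional helper to fetch a local copy of the Tiny Shakespeare dataset.
import urllib.request

URL = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
urllib.request.urlretrieve(URL, 'tinyshakespeare.txt')

with open('tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(f'{len(text):,} characters, starting with: {text[:60]!r}')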
import torch
from ShakespeareanGenerator.model.tokenizer import Tokenizer
from ShakespeareanGenerator.parameters import TrainingParameters
from ShakespeareanGenerator.logger import Logger

class DataHandler:

    def __init__(self, path):
        self.logging = Logger()
        self.training_params = TrainingParameters()
        self.path = path
        self.data = None

    def get_data(self):
        try:
            with open(self.path, 'r', encoding='utf-8') as file:
                self.data = file.read()
        except FileNotFoundError:
            msg = 'File {} not found.'.format(self.path)
            self.logging.error(msg)
            raise FileNotFoundError(msg)

    def get_batch(self, split):
        if self.data is None:
            self.get_data()

        tokenizer = Tokenizer(
            corpus=self.data,
            vocab_size=self.training_params.dictionary_size
        )

        encoded_corpus = tokenizer.encode(self.data)
        data = torch.tensor(encoded_corpus.ids, dtype=torch.long)

        split_point = int(0.9 * len(data))
        training_set, validation_set = data[:split_point], data[split_point:]
        selected_data = training_set if split == 'train' else validation_set
        indices = torch.randint(len(selected_data) - self.training_params.context_length, (self.training_params.batch_size,))

        batches_x = []
        batches_y = []
        for index in indices:
            batch_x = selected_data[index:index + self.training_params.context_length]
            batch_y = selected_data[index + 1:index + self.training_params.context_length + 1]
            batches_x.append(batch_x)
            batches_y.append(batch_y)

        x = torch.stack(batches_x)
        y = torch.stack(batches_y)
        x, y = x.to(self.training_params.device), y.to(self.training_params.device)
        return x, y

    @torch.no_grad()
    def get_estimated_loss(self, model):
        out = {}
        model.eval()

        for split in ['train', 'val']:
            losses = torch.zeros(self.training_params.eval_steps)
            for k in range(self.training_params.eval_steps):
                X, Y = self.get_batch(split)
                logits, loss = model(X, Y)
                losses[k] = loss.item()
            out[split] = losses.mean()
            self.logging.info('Estimated losses: {}'.format(losses.mean()))
        model.train()
        return out

Initialization: The __init__ method initializes the DataHandler instance by setting up the logger (Logger) and retrieving the training parameters (TrainingParameters). It also stores the path to the data file (path) for subsequent data loading.
Data Loading: The get_data method reads textual data from the specified file path (path) and stores it in the self.data attribute.
Batch Generation: The get_batch method processes the loaded data to generate input-output pairs (X, Y) for model training. It tokenizes the text data using a custom BPE Tokenizer initialized with the specified vocabulary size (dictionary_size). Random indices are generated to extract batches of data with a defined context length (context_length). The resulting batches (x, y) are converted to tensors and moved to the appropriate computing device (device) for accelerated training.
Loss Estimation: The get_estimated_loss utility method evaluates the estimated loss on training and validation data batches (X, Y). It iteratively computes the loss over multiple evaluation steps (eval_steps) and aggregates the results for each split (train or val). The computed losses are logged using the Logger instance.

5. Understanding Tokenization and Byte Pair Encoding (BPE)

Before diving into the tokenizer code, let's first talk about what tokenization is and how Byte Pair Encoding (BPE) works.

What is Tokenization?

Tokenization is the process of breaking down a text into smaller units called tokens. These tokens can be words, characters, or subwords. In natural language processing (NLP), tokenization is a foundational step because it converts the raw text into a format that a machine learning model can understand.

What is Byte Pair Encoding (BPE)?

There are a bunch of algorithms to tokenize text, ranging from simple ones like splitting by spaces or punctuation to more complex methods like WordPiece and SentencePiece. For example, basic tokenization might just split on whitespace, while WordPiece is used by models like BERT, and SentencePiece can create subword units even for languages with complex morphology. We'll be using Byte Pair Encoding (BPE) because it strikes a good balance between simplicity and effectiveness, handling rare words well by breaking them down into more frequent subwords. This makes it particularly useful for languages with rich vocabularies and for tasks where handling out-of-vocabulary words is important.

Byte Pair Encoding (BPE) is a tokenization technique that starts with the basic characters and iteratively merges the most frequent pairs of tokens. This way, it builds a vocabulary of subword units. BPE is particularly effective for handling rare words and out-of-vocabulary terms by breaking them into more frequent subwords.
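To see what this training step actually produces, here is a small, self-contained sketch (illustrative corpus and vocabulary size only) that trains a tiny SentencePieceBPETokenizer on a few lines of text and round-trips a sentence through encode and decode:

# Tiny BPE demonstration; sample sentences and vocabulary size are illustrative.
from tokenizers import SentencePieceBPETokenizer

sentences = [
    "To be, or not to be, that is the question:",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune,",
]

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(sentences, vocab_size=300, min_frequency=1, show_progress=False)

encoding = tokenizer.encode("to suffer the slings of fortune")
print(encoding.tokens)                 # subword pieces; rare words split into fragments
print(encoding.ids)                    # the integer ids fed to the model
print(tokenizer.decode(encoding.ids))  # reconstructs the (normalized) text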
Tokenization with SentencePieceBPETokenizer

Now, let's look at the code that implements tokenization using the SentencePieceBPETokenizer from the tokenizers library.

from tokenizers import SentencePieceBPETokenizer
from ShakespeareanGenerator.parameters import TrainingParameters

class Tokenizer:

    def __init__(self, corpus, vocab_size):
        training_params = TrainingParameters()
        self.TOKENIZER_MODEL_PATH = training_params.TOKENIZER_MODEL_PATH
        self.sentences = corpus.split('\n')
        self.vocab_size = vocab_size
        self.tokenizer = None

    def train_tokenizer(self):
        special_tokens = ["<pad>", "<unk>", "<s>", "</s>", "<b>"]
        self.tokenizer = SentencePieceBPETokenizer()
        self.tokenizer.train_from_iterator(
            self.sentences,
            vocab_size=self.vocab_size,
            min_frequency=2,
            special_tokens=special_tokens,
            show_progress=False
        )
        self.tokenizer.save_model(self.TOKENIZER_MODEL_PATH)

    def encode(self, text):
        if not isinstance(text, str):
            raise TypeError('Input text must be a string.')
        try:
            if self.tokenizer is None:
                self.train_tokenizer()
            return self.tokenizer.encode(text)
        except Exception as e:
            print('Error occurred during encoding: {}'.format(e))
            raise

    def decode(self, text):
        if not isinstance(text, list):
            raise TypeError('Input tokens must be a list.')
        try:
            if self.tokenizer is None:
                self.train_tokenizer()
            return self.tokenizer.decode(text)
        except Exception as e:
            print('Error occurred during decoding: {}'.format(e))
            raise

Initialization: In the __init__ method, we initialize our Tokenizer class with the corpus of text and the desired vocabulary size. We also set the path where our tokenizer model will be saved.
Training the Tokenizer: The train_tokenizer method is where the magic happens. It trains the SentencePieceBPETokenizer on our corpus with the specified vocabulary size and special tokens.
Encoder: The encode method takes a string of text and converts it into tokens. If the tokenizer hasn't been trained yet, it calls train_tokenizer first.
Decoder: The decode method converts a list of token ids back into text. Like encode, it ensures the tokenizer is trained before decoding.

With this Tokenizer class, you can easily manage tokenization using Byte Pair Encoding, making sure your text data is all set for further processing and model training. By getting a grasp on how tokenization and BPE work, you'll see why this preprocessing step is so foundational to Natural Language Processing (NLP) tasks.

6. Exploring Model Training with ModelTrainer

Now, let's take a closer look at the ModelTrainer class in our language modeling project. This class handles the training process, logs important information, and optimizes the model's parameters. We'll go over its key functions and see how they help make our project successful.

class ModelTrainer:

    def __init__(self, data_handler, model):
        self.data_handler = data_handler
        self.model = model

        learning_parameters = sum(p.numel() for p in model.parameters()) / 1e6
        msg_to_log = 'The model is learning {} million parameters.'.format(learning_parameters)
        logging.info(msg_to_log)
        msg_to_metrics = '{} million parameters.'.format(learning_parameters)
        tracking.set_custom_info(
            custom_info=[
                MetricCustomInfo(name="Number of Parameters", value=str(msg_to_metrics))
            ]
        )
        self.optimizer = torch.optim.AdamW(
            self.model.parameters(), lr=training_params.learning_rate
        )

    def train(self):
        try:
            for iteration in range(training_params.iteration_limit):
                if iteration % training_params.eval_frequency == 0 or iteration == training_params.iteration_limit - 1:
                    logging.info('Epoch {} started'.format(iteration))

                    losses = self.data_handler.get_estimated_loss(self.model)

                    evaluation_msg = 'EPOCH {} | LOSS: Train {:.4f} Valid {:.4f}'.format(
                        str(iteration).ljust(5), losses['train'], losses['val']
                    )
                    logging.info(evaluation_msg)
                    tracking.set_custom_info(
                        custom_info=[
                            MetricCustomInfo(name="Epoch Status", value=str(evaluation_msg))
                        ]
                    )
                    # Metric Logging: Step Information
                    training_loss_msg = '{:.4f}'.format(losses['train'])
                    validation_loss_msg = '{:.4f}'.format(losses['val'])
                    tracking.log_metrics(
                        metrics=[
                            Metric(
                                name="Training Loss",
                                value=float(training_loss_msg),
                                timestamp=datetime.now(timezone.utc),
                                step=iteration
                            ),
                            Metric(
                                name="Validation Loss",
                                value=float(validation_loss_msg),
                                timestamp=datetime.now(timezone.utc),
                                step=iteration
                            ),
                        ]
                    )
                batches_x, batches_y = self.data_handler.get_batch('train')
                logging.info(f'Sent to Data Handler for Tokenization and Generating Batches for iteration {iteration}')
                logits, loss = self.model(batches_x, batches_y)
                logging.info(f'Forward Pass for iteration {iteration}')
                self.optimizer.zero_grad(set_to_none=True)
                loss.backward()
                logging.info(f'Backward Pass for iteration {iteration}')
                self.optimizer.step()
                logging.info(f'Optimization Step for iteration {iteration}')
        except Exception as e:
            logging.error(f'Training failed at iteration {iteration} with error: {e}')
            raise

Class Initialization (ModelTrainer):
Data Handler and Model: Initializes with data_handler for managing data and model for the language model to be trained.
Learning Parameters Calculation: Calculates the total number of parameters in the model and logs the information.
SAP Metrics Logging: Sets custom information in SAP AI Launchpad with the number of model parameters using MetricCustomInfo. (The logging, tracking, Metric, MetricCustomInfo, and training_params objects used here are set up at module level in language_models.py; the SAP AI Core metric tracking itself is covered in detail later in this series.)

Training Method (train):
Training Loop: Iterates through the training process for a specified number of iterations (training_params.iteration_limit).
Log Evaluation Start: At specified intervals (training_params.eval_frequency), or at the last iteration, logs the start of a new evaluation epoch.
Calculate and Log Losses: Estimates the training and validation losses using self.data_handler.get_estimated_loss(self.model) and logs them.
Update SAP Metrics: Updates custom information and logs the training and validation losses in SAP AI Launchpad using tracking.log_metrics.
Data Handler Invocation: Retrieves a batch of training data using self.data_handler.get_batch('train') and logs that the data handler is generating batches for the current iteration.
Forward Pass: Computes the model's predictions (logits) and loss for the current batch.
Backward Pass: Computes gradients for the model's parameters.
Optimization Step: Updates the model's parameters using the computed gradients.
Exception Handling: Logs any errors encountered during training and raises the exception for further handling.

Understanding and utilizing the ModelTrainer class is fundamental for effective model training and optimization. In our language modeling project, it drives the training iterations, manages data, and monitors model performance. Feel free to adapt and explore further to suit your specific machine learning initiatives!

Well, I think we've covered enough for now. You've come a long way, and it's time to wrap up this blog and talk about the next steps. So, let's get to it.

Wrapping Up and Next Steps

Congratulations on making it this far into transformer-based language modeling with the Tiny Shakespeare dataset! In this blog, we've explored the implementation of a language model using Transformers from scratch. Amazing work!

Let's recap what we've covered:

Introduction to Transformers: We discussed the foundational concepts behind the transformer architecture and its revolutionary attention mechanism.
Implementing the Attention Mechanism: We broke down the key components of the attention mechanism and implemented it step by step in code.
Multi-Head Attention: We explained how multi-head attention allows the model to capture diverse aspects of the input data by attending to different parts of the sequence simultaneously.
Feed-Forward Network: We covered the role of the position-wise feed-forward network in transforming the attended features.
Building the Full Transformer Model: We assembled all the components into a complete transformer model, specifically tailored for generating Shakespearean text.
Training the Model: We detailed the process of training the model, including data handling, logging, and optimization steps.

Next Steps

Now that we've laid the foundation for language modeling, stay tuned for the upcoming blogs in this series, where we'll explore how to deploy and enhance our model using SAP AI Core:

Deploying the Training Pipeline: Learn how to deploy the training pipeline using Argo multi-step workflows with SAP AI Core.
We'll cover setting up and orchestrating training jobs efficiently. [SAP AI Core is All You Need | 2. Setting the Stage for a Shakespeare-Language Model, SAP AI Core is All You Need | 3. Workflow, Configuration, and Shakespeare Language Model Training]
Improving Model Training Efficiency: Understand how to use checkpointing and resuming to make model training more efficient. [SAP AI Core is All You Need | 4. Improving Model Training Efficiency with Checkpointing/Resuming]
Fine-Tuning with Low-Rank Adaptation (LoRA): Learn how to use LoRA to fine-tune models with fewer parameters, making the process more efficient and effective. [SAP AI Core is All You Need | 5. Fine Tuning with Low-Rank Adaptation (LoRA)]
Fine-Tuning Pipeline: Dive into fine-tuning techniques to enhance model performance on specific datasets or tasks. We'll explore the deployment of fine-tuning pipelines using SAP AI Core, as well as model deployment and serving using KServe with SAP AI Core, and learn how to efficiently serve fine-tuned models for real-world applications. [SAP AI Core is All You Need | 6. Serving Shakespeare Model using SAP AI Core and KServe]
Sampling and Consuming Language Models: Discover methods for sampling from trained language models and integrating them into applications. [SAP AI Core is All You Need | 7. Deploying Language Models for Text Generation]
Developing a Language-Model-Based App: Gain insights into building an application powered by your trained language model. [SAP AI Core is All You Need | 8. Consuming and Sampling from Shakespeare Language Models]

Further References

Source Code: GitHub repository
SAP AI Core Help
Attention Is All You Need
A New Algorithm for Data Compression Optimization
Transformers Wikipedia
Tiktokenizer