SAP AI Core is All You Need | 6. Serving Shakespeare Model using SAP AI Core and KServe

Estimated read time: 40 minutes

Introduction

Welcome back to our series “SAP AI Core is All You Need“!

In this blog, we’re diving into the exciting world of deploying and serving AI models using SAP AI Core and KServe. Our focus? The legendary Shakespeare Language Model. If you’re keen to explore how to bring advanced AI capabilities to life, join us as we build the infrastructure and the code necessary to deploy the Shakespeare Language Model for inference. To achieve this, we’ll leverage the capabilities of SAP AI Core and the versatile Serving Template.

Let’s make Shakespeare come alive in the world of artificial intelligence!

What to Expect

In this blog, you will gain practical insights into the following:

Deploying AI Models: Learn the importance of integrating custom classes and modules, focusing on the Shakespeare Language Model’s unique architecture.
Code Breakdown: Explore the critical files and their roles in making the Shakespeare Language Model work, including detailed explanations of key components like the generator and main files.
Building a Text Generation API: Set up and run a Flask app to generate Shakespearean text, with step-by-step instructions and practical examples.
Logging in MLOps: Understand the crucial role of logging for monitoring and troubleshooting in machine learning operations.

By the end of this blog, you’ll understand the tools and resources used for serving models with SAP AI Core, and you’ll be ready for the next blog, where we’ll actually deploy the models.

Why All the Classes Come Together

This model is not your typical out-of-the-box solution; it relies on a complex architecture that includes custom classes and modules like TransformerBlock and FeedForward within PyTorch.

When we pickle and load our PyTorch model (let’s call it model.pkl), it’s not just the model’s weights that get serialized. The entire structure, including these custom classes, is bundled together. This means that when you load the pickled model, your environment needs to have access to these original class definitions.

Why does it matter? Well, unlike scikit-learn models, which are largely self-contained during inference (they do have dependencies, just not in the same way), our PyTorch model relies on these custom components. Think of it like needing the original blueprint to rebuild a sophisticated machine. In PyTorch, you can save either the entire model or just the model’s parameter dictionary (see Saving and Loading Models). The recommended approach is to save only the model’s state dictionary; however, trying to save the whole model makes for a good homework exercise.

So, as you prepare to deploy your AI model or system, ensure that all the necessary custom classes and modules are available in your environment. That way, when you load and use your pickled PyTorch model, everything reconstructs nicely – just like the Bard’s intricate prose.

While it’s not a recommended practice for larger codebases, you can indeed serialize entire class definitions alongside your model object using libraries like cloudpickle. This approach can simplify deployment but requires careful management of dependencies.
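To make that concrete, here is a minimal sketch of both saving styles, using a stand-in nn.Linear in place of the real Shakespeare model (file names are placeholders):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the trained Shakespeare model

# Option 1 (recommended): save only the state dict; reloading requires
# constructing the model yourself before calling load_state_dict().
torch.save(model.state_dict(), "model_state.pt")
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load("model_state.pt"))

# Option 2: pickle the whole object; loading it later requires the original
# class definitions (TransformerBlock, FeedForward, ...) to be importable.
torch.save(model, "model.pkl")
restored_full = torch.load("model.pkl")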

We’ve finished discussing saving and loading models, so let’s move on to the code breakdown for deploying our models. Yes! We now have two models, remember? The Shakespeare Language Model for text generation and the PeFT Model for Shakespeare style transfer.

 

Understanding the Code

By now, you’re familiar with the setup – our custom classes and modules play a structural role in bringing our Shakespeare Language Model to life. Let’s take a closer look at some key files that make this magic happen.

language_models.py & tokenizer.py: These files are essential for the unpickling process, ensuring that our model and tokenizer are reconstructed correctly. They hold the definitions of our custom classes and functions, ensuring everything aligns perfectly during model loading.

logger.py: This file may seem minor, but it plays a critical role. We’ve made some tweaks here to enhance logging functionality, ensuring smooth operation and easier troubleshooting.

parameters.py: Here’s where things get interesting. We’ve tweaked parameters to optimize our model’s performance. Stay tuned for a deeper dive into these optimizations.

generator.py: A new addition to our toolkit! This file houses the code responsible for generating text samples from our language model. We’ll explore how this generator interacts with our trained model to produce those elegant Shakespearean phrases.

main.py: The heart of our inference process. Let’s dissect this file to understand how sampling from our language model is orchestrated. From loading the model to generating text, this is where the magic unfolds.

Together, these files form the backbone of our AI deployment. They encapsulate everything we need from our Shakespeare Language Model and pave the way for model inference. By the end, you’ll have a clear picture of how our AI is deployed using SAP AI Core (and KServe behind the scenes).

 

Generator

Let’s break down the essential components and files needed to create our API. Our journey begins with exploring the generator.py file.

 

import io
import torch
import pickle
from torch.nn import functional as F
from ShakespeareanGenerator.model.tokenizer import Tokenizer
from ShakespeareanGenerator.parameters import ServingParameters
from ShakespeareanGenerator.logger import Logger

 

Here, we start by importing necessary dependencies for our text generation API deployment:

io: Used to handle in-memory binary streams, specifically to facilitate loading a PyTorch model onto the CPU.
pickle: Used for serializing and deserializing Python objects, crucial for loading our pre-trained language model.
torch: The PyTorch library for deep learning tasks.
torch.nn.functional as F: PyTorch’s functional API for neural network operations.
Tokenizer: Custom class from ShakespeareanGenerator.model.tokenizer responsible for tokenizing text inputs.
ServingParameters: Custom parameters for serving the model, defined in ShakespeareanGenerator.parameters.
Logger: Custom logging utilities from ShakespeareanGenerator.logger for monitoring and debugging.

These imports set the stage for our text generation pipeline, enabling us to load the model, preprocess inputs, and manage serving configurations. Now, let’s dive into the ModelManager class, a core component of our text generation API deployment.

 

class ModelManager:

    def __init__(self):
        self.model = None
        self.model_loaded = False
        self.serving_params = ServingParameters()
        self.logging = Logger()
        self.check_gpu_usage()

    def check_gpu_usage(self):
        if torch.cuda.is_available():
            self.logging.info(f"GPU is available, using GPU: {torch.cuda.get_device_name(0)}")
            self.logging.info(f"Using CUDA version {torch.version.cuda}")
        else:
            self.logging.warning("GPU is not available, using CPU.")

    def load_model(self):
        with open(self.serving_params.INPUT_MODEL, 'rb') as f:
            self.model = CPU_Unpickler(f).load()

        self.model.eval()
        self.model = self.model.to(self.serving_params.device)
        self.logging.info(f"Model loaded and sent to {self.serving_params.device}")
        self.model_loaded = True

    def is_model_loaded(self):
        return self.model_loaded

 

Initialization: This initializes the ModelManager class. It sets up attributes including model (to hold our loaded model), model_loaded (a flag indicating whether the model has been loaded), serving_params (an instance of ServingParameters for managing serving configurations), and logging (an instance of Logger for handling logs).
Loading the Model: This method loads the pre-trained model into memory. It opens the serialized model file (INPUT_MODEL, specified in ServingParameters), deserializes it with CPU_Unpickler (a pickle-based loader that maps tensors onto the CPU), sets it to evaluation mode (self.model.eval()), moves it to the device specified by serving_params.device (GPU or CPU), and logs the successful load.
is_model_loaded Method: This method simply returns the status (True or False) of model_loaded, indicating whether the model has been loaded into memory.

The ModelManager takes care of loading and managing the model, making sure our text generation API has everything it needs to create Shakespearean text with SAP AI Core.
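One detail worth calling out: load_model relies on a CPU_Unpickler helper that isn’t reproduced in this excerpt. A common implementation (and a reasonable assumption for what is used here) subclasses pickle.Unpickler so that tensors saved on a GPU can be deserialized on a CPU-only machine, which is also why the io module is imported:

import io
import pickle
import torch

class CPU_Unpickler(pickle.Unpickler):
    # Remap torch storages to the CPU while unpickling, so a model saved on a
    # GPU machine can still be loaded where only a CPU is available.
    def find_class(self, module, name):
        if module == 'torch.storage' and name == '_load_from_bytes':
            return lambda b: torch.load(io.BytesIO(b), map_location='cpu')
        return super().find_class(module, name)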

Next, let’s dive into the Generator class, which handles the text generation process using our loaded language model. Let’s break down how it works, including some key concepts like temperature, top_k, and top_p.

Temperature: This controls the randomness of predictions by scaling the logits before applying softmax. A higher temperature value makes the model output more random, while a lower value makes it more deterministic.
Top-K Sampling: This limits the sampling pool to the top K highest-probability tokens. It helps in generating more coherent text by focusing on the most likely options.
Top-P (Nucleus) Sampling: This method selects tokens from the smallest set whose cumulative probability is at least P. It dynamically chooses the number of top tokens based on their cumulative probability, allowing for more flexible sampling than top-K.
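To make these knobs concrete before looking at the class, here is a small standalone sketch (with made-up logits for a six-token vocabulary) showing how temperature, top-k, and top-p reshape the distribution before sampling:

import torch
from torch.nn import functional as F

logits = torch.tensor([[2.0, 1.0, 0.5, 0.2, -1.0, -2.0]])  # toy values
temperature, top_k, top_p = 0.8, 3, 0.9

# Temperature scaling: lower values sharpen the distribution
probs = F.softmax(logits / temperature, dim=-1)

# Top-k: keep only the k most probable tokens, then renormalize
values, indices = torch.topk(probs, top_k, dim=-1)
probs = torch.zeros_like(probs).scatter_(-1, indices, values)
probs = probs / probs.sum(dim=-1, keepdim=True)

# Top-p: zero out everything after the point where cumulative probability exceeds p
sorted_probs, sorted_idx = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)
cutoff = torch.where(cumulative > top_p)[1][0]
probs.scatter_(-1, sorted_idx[:, cutoff + 1:], 0.0)
probs = probs / probs.sum(dim=-1, keepdim=True)

# Sample the next token id from the filtered distribution
next_token = torch.multinomial(probs, num_samples=1)
print(probs, next_token)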

 

class Generator:

    def __init__(self, model_manager, max_tokens, temperature, top_k, top_p):
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.top_k = top_k
        self.top_p = top_p
        self.tokenizer = Tokenizer()
        self.model_manager = model_manager

    def __sample_from_model(self, index):
        self.model = self.model_manager.model
        for _ in range(self.max_tokens):
            try:
                # Crop the context to the model's maximum position embedding length
                current_index = index[:, -self.model.position_embeddings.weight.shape[0]:]
                logits, _ = self.model(current_index)
                # Temperature scaling of the last-token logits
                scaled_logits = (lambda l, t: l / t if t > 0.0 else l)(logits[:, -1, :], self.temperature)
                probs = F.softmax(scaled_logits, dim=-1)

                # Top-k filtering: keep only the k most probable tokens
                if self.top_k > 0:
                    probs_value, probs_indices = torch.topk(probs, self.top_k, dim=-1)
                    filtered_probs = probs.clone().fill_(0.0)
                    filtered_probs.scatter_(dim=-1, index=probs_indices, src=probs_value)
                    probs = filtered_probs / torch.sum(filtered_probs, dim=-1, keepdim=True)

                # Top-p (nucleus) filtering: drop tokens beyond cumulative probability top_p
                sorted_probs, sorted_indices = torch.sort(probs, descending=True)
                cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

                sorted_indices_to_remove = cumulative_probs > self.top_p
                if torch.any(sorted_indices_to_remove):
                    cutoff_idx = torch.where(sorted_indices_to_remove)[1][0]
                    indices_to_remove = sorted_indices[:, cutoff_idx + 1:]
                    probs.scatter_(dim=-1, index=indices_to_remove, value=0.0)
                    probs = probs / torch.sum(probs, dim=-1, keepdim=True)

                # Sample the next token and append it to the sequence
                next_index = torch.multinomial(probs, num_samples=1)
                index = torch.cat((index, next_index), dim=1)
            except Exception as e:
                self.model_manager.logging.error(f"Error during text generation: {str(e)}")
                raise
        return index

    def post_process_text(self, generated_text):
        cleaned_text = generated_text.replace("<s>", "").replace("</s>", "").replace("<b>", "").strip()
        return cleaned_text

    @torch.inference_mode()
    def generate(self):
        if not self.model_manager.is_model_loaded():
            self.model_manager.load_model()
        try:
            idx = torch.full((1, 1), 4, dtype=torch.long, device=self.model_manager.serving_params.device)
            completion = self.tokenizer.decode(self.__sample_from_model(idx)[0].tolist())
            self.length = len(self.tokenizer.encode(completion).ids)
            self.model_manager.logging.info(f"Text generated successfully with length: {self.length}")
            self.model_manager.logging.info(f"With max tokens set to: {self.max_tokens}")
            self.model_manager.logging.info(f"With temperature set to: {self.temperature}")
            self.model_manager.logging.info(f"With top k set to: {self.top_k}")
            self.model_manager.logging.info(f"With top p set to: {self.top_p}")
            return completion
        except Exception as e:
            self.model_manager.logging.error(f"Error during text generation: {str(e)}")
            raise

 

Parameters:

model_manager: Manages the language model.
max_tokens: The maximum number of tokens to generate.
temperature: Controls the randomness of predictions by scaling the logits.
top_k: Limits sampling to the top k probable tokens.
top_p: Limits sampling to the smallest number of tokens whose cumulative probability is above a threshold p.
Tokenizer: Converts text to tokens and vice versa.

Sampling:

In the __sample_from_model method, the main idea is to apply various sampling techniques to generate text from the model.

Steps:

Loop for max_tokens: Generates tokens one by one up to the maximum limit.
Current context: Gets the most recent part of the sequence to use as context.
Model prediction: Gets logits (predictions) from the model.
Temperature scaling: Adjusts the logits to control randomness.
Probability distribution: Converts logits to probabilities.
Top-k filtering: Keeps only the top k probable tokens.
Top-p filtering: Keeps the smallest number of tokens whose cumulative probability is above p.
Sampling: Selects the next token based on the adjusted probabilities.
Append token: Adds the selected token to the sequence.

Post Processing:

Replace special tokens: Removes <s>, </s>, and <b> from the text.
Strip: Removes any leading or trailing whitespace.

Generate:

Check if model is loaded: Loads the model if not already loaded.
Initialize sequence: Starts with a specific token (e.g., <s>).
Generate sequence: Uses the sampling method to create a sequence of tokens.
Decode tokens: Converts the sequence of tokens back into text.
Log details: Records the generation details such as length, temperature, top_k, and top_p.
Return text: Provides the generated text as output.

The Generator class is designed to generate text using a pre-trained language model. It handles everything from setting up parameters and sampling tokens to cleaning up the text and logging the process. Cool, huh?
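Before we wire these classes into an API, here is a rough sketch of how they could be exercised directly in a Python session (the parameter values are arbitrary):

from ShakespeareanGenerator.generator import Generator, ModelManager

model_manager = ModelManager()   # reads serving parameters and checks for a GPU
generator = Generator(model_manager, max_tokens=50, temperature=0.8, top_k=40, top_p=0.9)
text = generator.generate()      # loads the model on first use, then samples
print(generator.post_process_text(text))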

 

Building a Text Generation API with Flask and Shakespeare

Now it’s time to define how the model will be consumed. One common approach is through API generation. APIs provide a flexible and standardized way to interact with the model, allowing various applications and users to access its functionality without needing to understand the underlying code. This approach is particularly useful for integrating the model into web services, mobile apps, or other systems that require real-time or on-demand text generation. And that’s exactly what we’ll do!

Meet the Key Players

Before we jump into the code, let’s introduce the main components that will be used:

Flask: Our reliable web framework for building the API.
request: Used to capture data from incoming HTTP requests.
jsonify: Transforms our Python dictionaries into neat JSON responses.
Generator: The core engine responsible for generating the Shakespearean text.
ModelManager: Manages the loading and handling of our language models.
ServingParameters: Contains essential configuration settings such as device and context length.
Logger: Ensures we keep track of everything happening behind the scenes.

Code Explanation and Main Concepts

This Python script sets up a simple web server using Flask to generate text in the style of Shakespeare (you don’t have to use Flask; feel free to use another Python library if you prefer). It makes use of some custom classes and modules, and here’s how it works:

Imports and Setup

 

from flask import Flask, request, jsonify
from ShakespeareanGenerator.generator import Generator, ModelManager
from ShakespeareanGenerator.parameters import ServingParameters
from ShakespeareanGenerator.logger import Logger

 

Flask: A lightweight web framework for Python used to create web applications.
Generator, ModelManager, ServingParameters, Logger: Custom classes from the ShakespeareanGenerator module that handle text generation, model management, serving parameters, and logging, respectively (as you are already familiar with them).

Initialize Flask App

 

app = Flask(__name__)
app.json.sort_keys = False

 

Flask App: Creates an instance of the Flask app.
JSON Configuration: Ensures the JSON responses are not sorted by keys.

Initialize Custom Classes

 

model_manager = ModelManager()
logging = Logger()

 

ModelManager: Manages loading and handling the language model.
Logger: Handles logging of information and errors.

Load Model Before Handling Requests

 

def load_model():
    try:
        if not model_manager.is_model_loaded():
            model_manager.load_model()
        else:
            logging.info("Model already loaded")
    except Exception as e:
        logging.error(f"Error loading model: {str(e)}")
        raise

@app.before_request
def initialize():
    load_model()

 

app.before_request: A Flask decorator that runs initialize (which calls load_model) before each request.
Model Loading: Checks whether the model is loaded, loads it if not, and logs the outcome.

Text Generation Endpoint

 

@app.route('/v2/generate', methods=["POST"])
def generate_text():
    data = request.get_json()
    max_tokens = int(data.get('max_tokens', 300))
    temperature = float(data.get('temperature', 1.0))
    top_k = int(data.get('top_k', 0))
    top_p = float(data.get('top_p', 0.9))

    generator = Generator(model_manager, max_tokens, temperature, top_k=top_k, top_p=top_p)
    generated_text = generator.generate()
    processed_text = generator.post_process_text(generated_text)
    lines = [line.strip() for line in processed_text.split('.') if line.strip()]

    response = {
        'generated_text': lines,
        'model_details': {
            'model_name': 'shakespeare-language-model',
            'temperature': generator.temperature,
            'length': generator.length,
            'top_k': generator.top_k,
            'top_p': generator.top_p,
        }
    }
    return jsonify(response)

 

/v2/generate: Defines a POST endpoint for generating text. You may modify the endpoint name and format, but each endpoint must have the prefix /v<NUMBER>.
Request Data: Extracts parameters like max_tokens, temperature, top_k, and top_p from the JSON request.
Generator Initialization: Uses the Generator class to create text based on the given parameters.
Text Processing: Processes the generated text to split it into lines and remove unnecessary spaces.
Response: Constructs a JSON response with the generated text and model details.

In this part, we define an API endpoint (‘/v2/generate’) where you can send POST requests to trigger the text generation process. Simply include JSON body parameters like max_tokens, temperature, top_k, and top_p to customize the generated text.
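The app.run() call itself is not shown in the snippets above; a typical entry point, assuming port 9001 as in the Docker example later on, would look like this:

if __name__ == '__main__':
    # Bind to all interfaces so the app is reachable from outside the container
    app.run(host='0.0.0.0', port=9001)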

Once the local server is up and running (thanks to app.run()), you can access your text generation API using the tool of your choice, such as curl:

 

curl -X POST http://localhost:9001/v2/generate -H "Content-Type: application/json" -d '{"max_tokens": 30, "temperature": 0.5, "top_k": 0, "top_p": 0.9}'

 

Easy peasy! But let’s talk a little more about testing it locally, which will be a very common practice in your experiment cycles.

 

Testing Your Application Locally

So, you’ve built your text generation app with Flask and you’re eager to test it out before deploying it with SAP AI Core. No worries! Let’s walk through how you can easily test your app locally using Docker and make quick fixes if needed.

Running the Local Image

To start testing locally, follow these steps:

STEP 1: Run the Local Image (Container)

Assuming you’ve built a Docker image for your Flask app, you can run it locally using Docker. Open your terminal and run:

 

docker run -p 9001:9001 -d your-image-name

 

This command starts a Docker container based on your image, mapping port 9001 of the container to 9001 on your localhost (-p 9001:9001). The -d flag runs the container in detached mode (in the background).
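If you haven’t built the image yet, a minimal Dockerfile might look like the sketch below. The file names, requirements file, and package layout are assumptions; adapt them to your project, then build with docker build -t your-image-name . before running the command above.

FROM python:3.10-slim

WORKDIR /app

# Install dependencies first to benefit from Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the Flask app and the ShakespeareanGenerator package
COPY main.py .
COPY ShakespeareanGenerator/ ./ShakespeareanGenerator/

EXPOSE 9001
CMD ["python", "main.py"]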

STEP 2: Test your app

Open your web browser or use a tool like curl or Postman to send requests to your app running locally, just as we did with the curl example above.

As a reminder, GET requests are used to retrieve data, while POST requests are used to send data to the server. For the text generation to work as intended, input parameters need to be sent, which is typically done via a POST request.

However, if you want to test the endpoint in the browser with a simple GET request (for testing purposes, or to return some default generated text), you can add a GET method or a small extra route to the app, as sketched below.
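For instance, a small convenience route like the one below (purely an illustration for local testing, not part of the deployed code) would return text with default parameters when opened in a browser:

@app.route('/v2/generate/default', methods=["GET"])
def generate_text_default():
    # Browser-friendly route: generate with default parameters
    generator = Generator(model_manager, max_tokens=50, temperature=1.0, top_k=0, top_p=0.9)
    text = generator.post_process_text(generator.generate())
    return jsonify({'generated_text': text})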

STEP 3: Keep testing and iterating

Now you’re all set to test and iterate on your text generation app locally. Feel free to experiment with different parameters, make changes to your code, and see the results in real-time.

 

Why Logging is Essential in Machine Learning Operations (MLOps)

In MLOps, maintaining observability and understanding your deployed models’ behavior is key to success. Logging, as demonstrated in the Logger class below, plays a very important role in achieving these goals.

 

import logging
import boto3
import threading
import tempfile
from ShakespeareanGenerator.parameters import LogParameters

class Logger:

    def __init__(self):
        self.log_params = LogParameters()
        self.logger = logging.getLogger(__name__)
        self.logger.setLevel(logging.INFO)
        self.temp_file = tempfile.NamedTemporaryFile(mode='a', delete=False)
        self.file_handler = logging.FileHandler(self.temp_file.name)
        self.file_handler.setFormatter(logging.Formatter('%(asctime)s | %(name)s → %(levelname)s: %(message)s'))
        self.logger.addHandler(self.file_handler)
        self.s3 = self.__get_s3_connection()
        self.upload_logs_to_s3()

    def __get_s3_connection(self):
        return boto3.client(
            's3',
            aws_access_key_id=self.log_params.access_key_id,
            aws_secret_access_key=self.log_params.secret_access_key
        )

    def upload_logs_to_s3(self):
        try:
            # Read logs from the temporary file
            with open(self.temp_file.name, 'r') as f:
                log_body = f.read().strip()

            if log_body:
                file_key = self.log_params.log_prefix + self.log_params.LOG_NAME
                self.s3.put_object(
                    Bucket=self.log_params.bucket_name,
                    Key=file_key,
                    Body=log_body.encode('utf-8')
                )
            else:
                self.logger.info("No logs to upload.")
        except Exception as e:
            self.logger.error(f"Error uploading log to S3: {e}")

        # Reschedule the timer for the next upload
        self.schedule_next_upload()

    def schedule_next_upload(self):
        # Create a new timer for the next upload after the specified interval
        self.upload_timer = threading.Timer(self.log_params.upload_interval, self.upload_logs_to_s3)
        self.upload_timer.start()

    def log(self, level, message):
        getattr(self.logger, level)(message)

    def info(self, message):
        self.log('info', message)

    def warning(self, message):
        self.log('warning', message)

    def error(self, message):
        self.log('error', message)

    def critical(self, message):
        self.log('critical', message)
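The Logger depends on a LogParameters class that isn’t reproduced here. A plausible shape for it, assuming credentials and bucket details are read from environment variables (the variable names below are illustrative), is:

import os

class LogParameters:
    def __init__(self):
        # Object store credentials and location
        self.access_key_id = os.environ['S3_ACCESS_KEY_ID']
        self.secret_access_key = os.environ['S3_SECRET_ACCESS_KEY']
        self.bucket_name = os.environ['S3_BUCKET_NAME']
        self.log_prefix = os.environ.get('LOG_PREFIX', 'logs/')
        self.LOG_NAME = 'shakespearean_model_serving.log'
        # Seconds between scheduled uploads to S3
        self.upload_interval = 60.0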

 

 

Understanding Model Behavior

The Logger class captures informative messages about your model’s behavior. By setting the logging level to INFO, it records the timestamp of each event (%(asctime)s), which module generated the log (%(name)s), the severity level (%(levelname)s), and the message itself.

Debugging and Troubleshooting

When errors occur during log upload to Amazon S3 (upload_logs_to_s3()), the Logger class captures the details (self.logger.error(f"Error uploading log to S3: {e}")). These logs are invaluable for troubleshooting issues efficiently.

Monitoring Model Performance

The Logger class schedules regular uploads of logs to Amazon S3 (schedule_next_upload()), allowing you to monitor your model’s performance over time. You can track metrics like the frequency of log uploads and identify patterns or anomalies in model behavior.

Alerting and Notification

When there are no logs to upload (the else branch of if log_body:), the Logger class records an informative message (self.logger.info("No logs to upload.")). This kind of notification within the logging system helps you stay informed about what the uploader is doing.

Connection to Observability

The Logger class uses Python’s logging library to centralize and format log messages, providing a clear picture of your model’s activities. Logs are fundamental to achieving observability, capturing real-time data about your model’s interactions and performance.

By leveraging logging effectively, as demonstrated by the Logger class, you can enhance the observability of your AI models in production. Remember, good logging practices are essential for maintaining reliable and performant MLOps workflows.

See you in the next blog!

Wrapping Up and Next Steps

Congratulations on taking the first step into deploying AI models with SAP AI Core! In this blog, we explored how to bring the Shakespeare Language Model to life using SAP AI Core and KServe.

Let’s recap what we’ve covered:

Introduction to SAP AI Core and KServe: We introduced the foundational concepts behind deploying and serving AI models using SAP AI Core and KServe.

Deploying AI Models: We learned the importance of integrating custom classes and modules, focusing on the unique architecture of the Shakespeare Language Model.

Code Breakdown: We explored critical files and their roles in making the Shakespeare Language Model work, including detailed explanations of key components like the generator and main files.

Building a Text Generation API: We set up and ran a Flask app to generate Shakespearean text, providing step-by-step instructions and practical examples.

Logging in MLOps: We understood the crucial role of logging for monitoring and troubleshooting in machine learning operations.

Next Steps

Now that we’ve laid the foundation for serving the AI models, stay tuned for the upcoming blogs in this series, where we’ll explore how to deploy and enhance our model using SAP AI Core:

Sampling and Consuming Language Models: Discover methods for sampling from trained language models and integrating them into applications. [SAP AI Core is All You Need | 7. Deploying Language Models for Text Generation]
Developing a Language-Model-Based App: Gain insights into building an application powered by your trained language model. [SAP AI Core is All You Need | 8. Consuming and Sampling from Shakespeare Language Models]

Further References

Source Code: GitHub repository
SAP AI Core Help
SAP AI Launchpad
Kubernetes
KServe
