Text Chunking – An Exciting New NLP Function in SAP HANA Cloud

In this blog post, we are pleased to introduce Text Chunking, a new NLP function in the SAP HANA Cloud QRC04 2024 Predictive Analysis Library (PAL). After reading, you will understand:

- The concept of text chunking and how it operates in SAP HANA Cloud
- An example of text chunking using Python and SQL
- An illustration of the effectiveness of text chunking in improving text search

1. Introduction

Text chunking is the process of breaking down large texts into smaller, more manageable segments for analysis. It is one of the essential preprocessing techniques in Natural Language Processing (NLP), alongside text cleaning, tokenization, and stemming. The practice is necessary because embedding models and Large Language Models (LLMs) often have token length limits, and exceeding these limits can result in information loss or inaccuracies. Text chunking therefore keeps text processing efficient and mitigates these issues.

In relation to text chunking, the terms Text Segmentation and Text Splitting frequently arise. Text Segmentation refers to dividing continuous text into smaller, meaningful units, such as sentences or paragraphs, and can extend to finer-grained semantic units; this facilitates subsequent natural language analysis. Segmentation can be based on whitespace, sentence boundaries, or semantic content. Text Splitting is a more general term that encompasses various methods of dividing text, including chunking. In certain scenarios, chunking refers more narrowly to identifying phrases or meaningful chunks within a sequence of words, often based on part-of-speech tagging. In practice, however, “chunking” and “text splitting” are often used interchangeably, as both involve dividing text into manageable units.

Text Chunking in SAP HANA Cloud offers several approaches to divide the text into chunks.

- Fixed-size chunking divides text into chunks of a predefined number of characters.
- Recursive chunking uses a set of separators to divide text hierarchically and iteratively, with overlapping chunks to preserve semantic context.
- A Document Splitter employs different methodologies tailored to various document types (PlainText, HTML) and languages to chunk the data.

This implementation does not generally require an understanding of the text’s semantics. The selection of which method to use depends on the specific requirements of the NLP task and the characteristics of the text being processed.
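
To make fixed-size chunking with overlap concrete, here is a minimal pure-Python sketch. It only illustrates the general idea and is not the PAL implementation:

>>> def fixed_size_chunks(text, chunk_size, overlap):
...     # advance by chunk_size - overlap so consecutive chunks share characters
...     step = chunk_size - overlap
...     return [text[i:i + chunk_size] for i in range(0, len(text), step)]
>>> fixed_size_chunks('SAP HANA Cloud supports text chunking in PAL.', 20, 5)
['SAP HANA Cloud suppo', 'supports text chunki', 'hunking in PAL.']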

2. HANA Text Chunking

Text Chunking in SAP HANA Cloud is supported by both the SQL procedure PAL_TEXTSPLIT and the Python API TextSplitter, found in the Python machine learning client for SAP HANA (hana-ml). The required input data consists of two columns: the first holds the text ID and the second the text content.

Highlighted below are some crucial parameters in PAL_TEXTSPLIT / TextSplitter:

- CHUNK_SIZE / chunk_size: Specifies the maximum size of the chunks to return.
- OVERLAP / overlap: Defines the number of overlapping characters between chunks.
- GLOBAL_SPLIT_TYPE / split_type: Determines the method used to split the text.
  - Character splitter: splits the text by a predefined number of characters, without considering whether it divides a complete word.
  - Recursive splitter (the default): operates on a list of separators.
  - Document splitter: uses methodologies corresponding to different document types (PlainText, HTML) and languages to chunk the data.
- GLOBAL_LANGUAGE_TYPE / language: Options include 'auto' (auto detection), 'en' (English), 'zh' (Chinese), 'ja' (Japanese), 'de' (German), 'fr' (French), 'es' (Spanish), 'pt' (Portuguese).
- GLOBAL_SEPARATOR / separator: Determines the separator configuration.
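
As a minimal sketch of how these parameters map onto the Python API (the literal value for split_type below is an assumption; please check the accepted values in the hana-ml documentation):

>>> from hana_ml.text.text_splitter import TextSplitter
>>> splitter = TextSplitter(chunk_size=500,        # maximum characters per chunk
...                         overlap=50,            # characters shared between chunks
...                         split_type='document', # assumed literal for the Document splitter
...                         language='en')         # treat the text as English instead of 'auto'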

The PAL_TEXTSPLIT SQL procedure returns two tables as output: one is the subdivided text list with the original ID plus an additional SUB_ID column, and the other is a table of statistics. In the Python API, TextSplitter is a class whose split_text method partitions the text; the subdivided text is returned in a HANA DataFrame, while the statistics table is retained in an attribute named statistics_. The specific code is shown in the next section.

Note that the chunk size refers to the number of characters, not the number of tokens. The relationship between chunk size and token count depends largely on the language and the tokenization method applied. For many Western languages, a common estimate is to divide the total number of characters by 4, which gives an approximate token count under the assumption of an average word length of 4-5 characters.

This estimation varies for other languages, however. In languages like Chinese or Japanese, each character usually represents a token, so the token count is close to the character count. Keeping the token limit of the embedding model in mind (for instance, 256 tokens in the case of the embedding model behind PAL_TEXTEMBEDDING), you could set a chunk size of around 1000 in PAL_TEXTSPLIT. This adjustment reduces information loss due to token limits.
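
As a quick sanity check of this rule of thumb (a back-of-the-envelope calculation; the four-characters-per-token ratio is only an approximation for Western languages):

>>> chunk_size = 1000                    # characters per chunk in PAL_TEXTSPLIT
>>> chars_per_token = 4                  # rough average for many Western languages
>>> chunk_size / chars_per_token         # approximate tokens per chunk
250.0
>>> chunk_size / chars_per_token <= 256  # fits the PAL_TEXTEMBEDDING token limit
True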

3. Text Chunking in Practice: An Illustrative Example

Using Text Chunking is straightforward. In the following example, we demonstrate how to use the SQL and Python APIs to split lengthy text. The dataset we use is a public dataset named MLDR (Multilingual Long-Document Retrieval). It covers 13 languages, with the average text length ranging from 3,300 (English) to 9,000 (Russian/Arabic) and the number of rows per language ranging from 6,569 to 200,000. Additionally, the dataset provides a test set of queries and their respective document IDs.

Figure 1 shows the first ten lines of the English corpus from MLDR. The data consists of two columns, 'docid' and 'text'. df is a HANA DataFrame over the English corpus, which has been imported into a HANA table named EN_DATA_TBL; it contains 200,000 rows in total.
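
For completeness, here is a minimal sketch of how such a table could be created from a local pandas DataFrame (the connection details and the variable pdf are illustrative assumptions):

>>> from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas
>>> conn = ConnectionContext(address='<host>', port=443, user='<user>', password='<password>')
>>> # pdf is a pandas DataFrame holding the MLDR English corpus, with columns 'docid' and 'text'
>>> df = create_dataframe_from_pandas(conn, pandas_df=pdf, table_name='EN_DATA_TBL', force=True)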

>>> df.head(10).collect()

Figure 1. The first ten lines of HANA DataFrame df

Next, we show the operation in Python. First, initialize a TextSplitter instance 'tsplitter', then invoke its split_text method. A HANA DataFrame named 'result' is returned with three columns: 'docid', 'SUB_ID' (the sequence number of the divided subtext), and 'CONTENT' (the divided subtext itself). The first 10 lines of 'result' are shown in Figure 2. Meanwhile, the statistical data is stored in the 'statistics_' attribute. The results show the three default separators ('\n\n', '\n', and a blank space) in action.

>>> from hana_ml.text.text_splitter import TextSplitter
>>> tsplitter = TextSplitter(chunk_size=1000, overlap=20)  # sizes are in characters
>>> result = tsplitter.split_text(data=df)                 # returns a HANA DataFrame of chunks
>>> print(result.head(10).collect())
>>> print(tsplitter.statistics_.collect())                 # statistics on the split

Figure 2. The first 10 lines of the HANA DataFrame 'result'

Figure 3. The HANA DataFrame 'tsplitter.statistics_'

Next, we present an example that uses SQL to run the PAL_TEXTSPLIT procedure.

DROP TABLE #PAL_PARAMETER_TBL;

CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_PARAMETER_TBL (
    "PARAM_NAME" VARCHAR(100),
    "INT_VALUE" INTEGER,
    "DOUBLE_VALUE" DOUBLE,
    "STRING_VALUE" VARCHAR(100)
);
-- maximum chunk size in characters
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CHUNK_SIZE', 1000, NULL, NULL);
-- number of overlapping characters between chunks
INSERT INTO #PAL_PARAMETER_TBL VALUES ('OVERLAP', 20, NULL, NULL);

CALL _SYS_AFL.PAL_TEXTSPLIT(EN_DATA_TBL, "#PAL_PARAMETER_TBL", ?, ?);

4. Summary

In this blog post, we introduced Text Chunking in SAP HANA Cloud, an essential preprocessing method in NLP. This feature allows us to split long texts into shorter, manageable subtexts, thereby enhancing the effectiveness of downstream tasks. We demonstrated how to invoke Text Chunking using both SQL and Python.

Furthermore, text chunking can boost the performance of information retrieval. Although there is a trade-off in resource allocation and in the time required to generate the necessary embeddings, the improved text search capabilities make text chunking a valuable feature for search-related tasks.

Other Useful Links:

Install the Python machine learning client from the PyPI public repository: hana-ml

We also provide an R API for SAP HANA PAL called hana.ml.r; for more information, please refer to the documentation.

For other blog posts on hana-ml:

- A Multivariate Time Series Modeling and Forecasting Guide with Python Machine Learning Client for SAP HANA
- Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA
- Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA
- Anomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for SAP HANA
- Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA
- Learning from Labeled Anomalies for Efficient Anomaly Detection using Python Machine Learning Client for SAP HANA
- Python Machine Learning Client for SAP HANA
- Import multiple excel files into a single SAP HANA table
- COPD study, explanation and interpretability with Python machine learning client for SAP HANA
- Model Storage with Python Machine Learning Client for SAP HANA
- Identification of Seasonality in Time Series with Python Machine Learning Client for SAP HANA