This blog post delves into two important information retrieval techniques: BM25 and Approximate Nearest Neighbor (ANN) Search as an advanced text mining technique, which have been introduced in the SAP HANA Cloud QRC04 2024 Predictive Analysis Library (PAL). The following sections will cover the topics:
The working principles of ANN search and BM25 in information retrieval and text mining applicationsThe implementation in the SAP HANA Cloud Predictive Analysis Library (PAL)Use cases
1. Introduction
Information Retrieval (IR) Systems are designed to help users locate and extract information stored within databases or collections of documents. BM25 and ANN search are two important technologies in IR that have been widely applied across various domains such as web search engines, question answering systems and chatbots. When delving into the specifics of these technologies, we first examine two main categories of IR:
Dense Retrieval: Use dense vectors for text matching. The relevant answers to the query can be retrieved based on the embedding similarity. Sparse Retrieval: Use sparse vectors for text matching. In these methods, the majority dimension of the vector is zero, with only a few dimensions having values, and the overall vector dimensionality can be very high.
Dense retrieval is suitable for scenarios where capturing rich semantic information is necessary, while sparse retrieval is ideal for situations with limited computational resources or when rapid retrieval is required. Dense retrieval generally outperforms sparse retrieval in terms of retrieval effectiveness, but at the cost of higher computational expenses. Sparse retrieval, on the other hand, has a distinct advantage in efficiency, especially when dealing with large datasets.
This distinction introduces us to two retrieval technologies: ANN search and BM25. ANN search, as a method of dense retrieval, focuses on swiftly identifying vectors in high-dimensional spaces that are similar to a given vector in an approximate manner. ANN search is especially adept at handling high-dimensional, voluminous dense vector data, which is crucial in modern recommendation systems or large scale vectorized text context stores.
Simultaneously, BM25, as a representative of sparse information retrieval, optimizes the relevance scoring between documents and queries through its probabilistic framework. BM25 introduces saturation functions and empirically observed parameter adjustments, providing a more nuanced relevance score, which has led to its widespread adoption in search engines for ranking search results.
BM25 is renowned for its simplicity and speed in handling large datasets but may struggle with semantic gaps between queries and documents. In contrast, ANN search offers better recall but at the cost of higher computational resources. The choice between BM25 and ANN search hinges on the application’s requirements for efficiency, precision, and semantic depth.
While BM25 and ANN search each offer distinct advantages, there are scenarios where the limitations of each method become apparent. To bridge these gaps, the concept of hybrid search has emerged. Hybrid search integrates different search technologies – ranging from traditional keyword-based methods to modern vector-based techniques – to enhance the retrieval process. For example, this approach can leverage the precision of BM25 for text matching and the semantic richness of ANN search for high-dimensional data, resulting in a better understanding of user queries.
2. ANN Search and BM25 in SAP HANA Cloud PAL
2.1 ANN Search
The ANN search implementation in PAL leverages the Flat IVF algorithm, which clusters the original vector set and divides the vector space into smaller groups referred to as “cells”. It then lists the reference of vectors belonging to each cluster for retrieval. KMeans clustering is used to build the vector-cluster references on a given dataset, dividing it into numerous clusters controlled by the parameter “GROUP_NUMBER”. These cluster centers are stored as the Vector-cluster References along with the original datasets into stateful in-memory ANN search model, thereby enabling much faster responses for search predictions.
More specifically, when “GROUP_NUMBER” is set to 1, the K-Nearest Neighbor approach is utilized; otherwise, the Flat IVF strategy is employed, leveraging the cloud system’s AVX instruction set along with multi-threading and matrix computing to improve the search efficiency.
PAL’s ANN search accepts both documents (raw text) and vectors for model building through two SQL procedures.
The SQL procedure PAL_ANNS_MODEL_CREATE_BY_DOC internally invokes the PAL Text Embedding procedure, PAL_TEXTEMBEDDING, converting documents into vectors and then build the ANN search model.Users can employ their own embedding model to generate vectors and subsequently call the SQL procedure PAL_ANNS_MODEL_CREATE_BY_VECTOR to build the ANN search model.
In the Python machine learning client for SAP HANA (hana-ml 2.23) class ANNSmodel, a parameter ‘by_doc’ allows the selection between the two input formats. For example, if ‘by_doc’ is set to True, PAL_ANNS_MODEL_CREATE_BY_DOC is automatically invoked.
When searching against the stateful Approximate Nearest Neighbor (ANN) search model with new prediction data, such as a single query or multiple queries, the appropriate SQL procedure is applied based on the format of the queries. This could be either PAL_ANN_SEARCH_BY_VECTOR or PAL_ANN_SEARCH_BY_DOC. For instance, if you have used an embedding model to generate vectors, you should use the same model to generate the embedding of the query. Then, you can call PAL_ANN_SEARCH_BY_VECTOR to perform the search. Such search procedure identifies the nearest clusters by calculating the distance to their centers and selects the top k clusters.
The stateful in-memory ANN search models, ensure quickest responses during search predictions while optimizing resource utilization. For governance and management of these stateful in-memory ANN search models, we also provide two dedicated SQL procedures, PAL_ANNS_MODEL_LIST and PAL_ANNS_MODEL_DELETE. Since these models are stored in memory, it is recommended they are removed to free up resources when they are no longer required.
2.2 BM25
The BM25 (Best Matching 25) is an extension of the Binary Independence Model (BIM) and you could invoke a SQL procedure named PAL_SEARCH_DOCS_BY_KEYWORDS to use this algorithm. This procedure draws on the TFIDF (Term Frequency-Inverse Document Frequency) values calculated using the PAL_TF_ANALYSIS. These TFIDF values are used to search for documents that closely align with user-provided keywords. Although there are a few variations of BM25 in existence, the one currently implemented is the original version.
The algorithm’s parameters are easy to adjust and include ‘NUM_BEST_MATCHES’, ‘BM25_K1’, and ‘BM25_B’. The ‘BM25_K1’ parameter controls the saturation level of term frequency, while ‘BM25_B’ determines the extent of document length normalization.
3. Use Cases of ANN Search and BM25
In the subsequent sections, we will illustrate the use of SQL and Python code to invoke ANN search and BM25 in SAP HANA Cloud. Note that the scripts provided are purely for illustrative purposes and are not intended for production use.
The dataset utilized is a public dataset known as MLDR (Multilingual Long-Document Retrieval). This dataset comprises documents in 13 languages, with the text’s average length ranging from 3300 words (English text) to 9000 words (Russian/Arabic text). The document’s number of lines varies between 6569 to 200,000. Moreover, the dataset provides a test set of queries and their associated document IDs.
To better demonstrate and compare the performance of ANN Search and BM25, we selected the French corpus from MLDR, containing 10,000 long texts and we used the first query from the test set to search for its associated document. This query is entitled ‘q-fr-5’ and the document most closely associated with this query is ‘doc-fr-40’. Hence, we expect these two algorithms to identify this document from the French corpus and rank it first.
Figure 1 displays the first 5 lines of the French corpus. The data consists of two columns – ‘docid’ and ‘text’. ‘df’ represents a HANA DataFrame of the French corpora, which has been imported into a HANA table named ‘FR_DATA_TBL’. The total number of rows in this dataset is 10,000.
>>> df.head(5).collect()
Figure 1. The first 5 lines of HANA DataFrame df
In addition, ‘df_query’ represents a HANA DataFrame of the query, which has been stored into a HANA table named ‘QUERY_TBL’.
>>> df_query.collect()
Figure 2. The HANA DataFrame df_query
3.1 ANN Search
Next, we will display ANN search usage via hana-ml Python script. Since the input text is raw, we set the parameter by_doc=True. We also record the execution time for subsequent comparison between the two algorithms. From the information in the attribute anns.state_, we can observe that the ANN search model was successfully built and stored in the AFL state. The state ID is ‘256E38733E407A4D9F3D6B116AA7348A’, and the information in its COMMENT section corresponds to the comment=’MLDR_FR’ set in the parameters.
>>> from hana_ml.text.anns_model import ANNSModel
>>> anns = ANNSModel(by_doc=True)
>>> anns.fit(data=df, key=’docid’, target=’text’, comment=’MLDR_FR’)
>>> print(f”Running time :{anns.runtime} s”)
>>> print(anns.state_.collect())
>>> print(anns. embedding_result_.head(5).collect())
Figure 3. The attribute anns.state_
The embeddings of documents are shown in Figure 4. In the column “VECTOR_COL”, each row contains a 768-dimensional vector.
Figure 4. The fist 5 lines of attribute anns.embedding_result_
Next, we will use an SQL script to illustrate this example by invoking PAL_ANNS_MODEL_CREATE_BY_DOC.
DROP TABLE PAL_ANN_PARAMETER_TBL;
CREATE COLUMN TABLE PAL_ANN_PARAMETER_TBL(
“PARAM_NAME” NVARCHAR(256),
“INT_VALUE” INTEGER,
“DOUBLE_VALUE” DOUBLE,
“STRING_VALUE” NVARCHAR(1000)
);
INSERT INTO PAL_ANN_PARAMETER_TBL VALUES (‘COMMENT’, NULL, NULL, ‘MLDR_FR’);
DO BEGIN
tv_data = SELECT * FROM FR_DATA_TBL;
tv_param = SELECT * FROM PAL_ANN_PARAMETER_TBL;
call _SYS_AFL.PAL_ANNS_MODEL_CREATE_BY_DOC(:tv_data, :tv_param, tv_out1, tv_out2);
SELECT * FROM :tv_out1;
SELECT * FROM :tv_out2;
END;
Next, we will search for the document most relevant to the query within the HANA DataFrame df_query and the HANA table name is “MLDK_QUERY_FR_DATA_1”. Since we have not set the parameter GROUP_NUMBER, its default value is 1, which means the KNN algorithm will be used and will return the top 1 result. As depicted in Figure 5, the correct document ‘doc-fr-40’ has been successfully retrieved as the top result, with the search runtime being approximately 5.8 seconds.
>>> res = anns.predict(data=df_query, key=’ID’, target=’QUERY’, is_query=True)
>>> print(f”Running time :{anns.runtime} s”)
>>> print(res.collect())
Figure 5. The runtime and search result of ANN Search
The SQL script to call PAL_ANN_SEARCH_BY_DOC for searching is as follows. It’s important to note that the parameter “STATE_ID” needs to be explicitly stated in the parameter table.
DROP TABLE PAL_ANN_PRED_PARAMETER_TBL;
CREATE COLUMN TABLE PAL_ANN_PRED_PARAMETER_TBL(
“PARAM_NAME” NVARCHAR(256),
“INT_VALUE” INTEGER,
“DOUBLE_VALUE” DOUBLE,
“STRING_VALUE” NVARCHAR(1000)
);
INSERT INTO PAL_ANN_PRED_PARAMETER_TBL VALUES (‘K_NEAREST_NEIGHBOURS’, 1, NULL, NULL);
INSERT INTO PAL_ANN_PRED_PARAMETER_TBL VALUES (‘IS_QUERY’, 1, NULL, NULL);
INSERT INTO PAL_ANN_PRED_PARAMETER_TBL VALUES (‘STATE_ID’, NULL, NULL, ‘256E38733E407A4D9F3D6B116AA7348A’);
CREATE OR REPLACE FUNCTION PALDEVUSER.HANA_PAL_ANNSEARCH (IN QUERY_SET TABLE(“ID” INTEGER, “QUERY” NVARCHAR(5000)))
RETURNS TABLE (“ID” INTEGER, “K” INTEGER, “TRAIN_ID” NVARCHAR(100), “DISTANCE” DOUBLE)
LANGUAGE SQLSCRIPT
SQL SECURITY INVOKER AS
BEGIN
DECLARE tv_final TABLE(“ID” INTEGER, “K” INTEGER, “TRAIN_ID” NVARCHAR(100), “DISTANCE” DOUBLE);
tv_data = SELECT * FROM :QUERY_SET;
tv_param = SELECT * FROM PAL_ANN_PRED_PARAMETER_TBL;
CALL _SYS_AFL.PAL_ANN_SEARCH_BY_DOC(:tv_data, :tv_param, tv_out, tv_out2);
tv_final =SELECT “TEST_ID” AS “ID”, K, TRAIN_ID, DISTANCE FROM :tv_out;
RETURN SELECT * FROM :tv_final;
END;
SELECT * FROM PALDEVUSER.HANA_PAL_ANNSEARCH(MLDK_QUERY_FR_DATA_1);
3.2 BM25
If you wish to use BM25, in the hana-ml Python client , you should call the function search_docs_by_keywords(). For SQL, the corresponding procedure is named PAL_SEARCH_DOCS_BY_KEYWORDS.
The search_docs_by_keywords() function requires two inputs: ref_data, which is the corpus to be searched, and pred_data, which is the prediction data. ref_data can be either a single HANA DataFrame or one of the three HANA DataFrames (TF-IDF Result, Doc Term Frequency, Doc Category) obtained after executing PAL text analysis (PAL_TF_ANALYSIS). If the input detected is a single DataFrame, the function will implicitly perform PAL_TF_ANALYSIS and then call PAL_SEARCH_DOCS_BY_KEYWORDS. Therefore, the input ref_data must conform to the data format expected by PAL text analysis. In our example, we added a third column with each entry being 1 during preprocessing. This HANA DataFrame, named df_ref, is depicted in Figure 6. The content of this column will not influence the subsequent BM25 calculations.
>>> df_ref.collect()
Figure 6. The first 5 lines of HANA DataFrame df_ref
As for the keywords in the query of pred_data, we preprocessed the df_query by calling PAL_TF_ANALYSIS and then get a list of keywords separated by commas as shown in Figure 7.
>>> df_keywords.collect()
Figure 7. The HANA DataFrame df_keywords
The following Python code demonstrates how to invoke the search_docs_by_keywords() function. As shown in Figure 8, the results indicate that BM25 is also capable of finding the correct document, which is referred to as ‘doc-fr-40’.
>>> from hana_ml.text.tm import search_docs_by_keywords
>>> import time
>>> start_time = time.time()
>>> res, _ = search_docs_by_keywords(pred_data=df_keywords, ref_data=df_ref, lang=’fr’, thread_ratio=1.0)
>>> end_time = time.time()
>>> print(f”Running time:{end_time-start_time} s”) >>> print(res.collect())
Figure 8. The search result of BM25
The SQL code is shown as follows.
DROP TABLE PAL_PARAMETER_TAB;
CREATE COLUMN TABLE PAL_PARAMETER_TAB(
“PARAM_NAME” NVARCHAR(256),
“INT_VALUE” INTEGER,
“DOUBLE_VALUE” DOUBLE,
“STRING_VALUE” NVARCHAR(1000)
);
CREATE OR REPLACE FUNCTION PALDEVUSER.HANA_PAL_BM25 (IN DATA_SET TABLE(“docid” INTEGER, “text” NCLOB, “CATEGORY” INTEGER), IN KEYWORDS_SET TABLE(“ID” INTEGER, “keywords” NVARCHAR(5000)))
RETURNS TABLE (“PREDICT_ID” INTEGER, “INDEX” INTEGER, “BEST_MATCHED” NVARCHAR(1000), “SCORE” DOUBLE)
LANGUAGE SQLSCRIPT
SQL SECURITY INVOKER AS
BEGIN
tv_data = SELECT * FROM :DATA_SET;
tv_keywords = SELECT * FROM :KEYWORDS_SET;
tv_param = SELECT * FROM PAL_PARAMETER_TAB;
CALL _SYS_AFL.PAL_TF_ANALYSIS(:tv_data, :tv_param, lt_tm_term, lt_tm_doc_term_freq, lt_cate);
CALL _SYS_AFL.PAL_SEARCH_DOCS_BY_KEYWORDS(:lt_tm_term, :lt_tm_doc_term_freq, :lt_cate, :tv_keywords, :tv_param, lt_result, lt_extra);
RETURN SELECT * FROM :lt_result;
END;
select * from PALDEVUSER.HANA_PAL_BM25(FR_DATA_TBL_CATEGORY, MLDK_QUERY_BM25_DATA_1);
3.3 Discussions
From the examples demonstrated in Sessions 3.1 and 3.2, we noted the time taken to retrieve a document corresponding to a query and the search results. These are key metrics when evaluating the performance of information retrieval (IR) methods like ANN search and BM25, focusing on their efficiency and accuracy. For ANN search, the majority of time is spent on generating corpus embeddings/model. Once the model is built, ANN search outperforms BM25 in terms of search speed. However, as the scale and dimensionality of the dataset increase, algorithm performance can suffer. To address this issue, PAL supports Vector PCA for dimensionality reduction which can accelerates the speed of the search algorithms and reduces memory usage. The steps are shown below with name of SQL procedures and Python functions. Due to space constraints, only Python code is shown. For a comprehensive guide of PAL Vector PCA and related code, please refer to another the blog post [titled “Dim_Reduction_Hybrid_Prediction.”].
1. Generate corpus and query embeddings using PAL_TEXTEMBEDDING/ PALEmbeddings.fit_transform().
>>> embed = PALEmbeddings()
>>> corpus_vector = embed.fit_transform(data=df, key=’docid’, target=’text’)
>>> query_vector = embed.fit_transform(data=query_df, key=’ID’, target=’QUERY’,is_query=True)
2. Apply PAL_VECPCA/ VectorPCA to reduce the dimensionality of the corpus and query embeddings from 768 (the default dimension of vector in PAL) to 256..
>>> vecpca = VectorPCA(n_components=256)
>>> corpus_vector.index = ‘docid’ >>> corpus_vpca = vecpca.fit_transform(data=corpus_vector.select([“docid”, “VECTOR_COL”]), key=’docid’)
>>> query_vpca = vecpca.transform(data=query_vector.select([“ID”, “VECTOR_COL”]), key=’ID’)
3. Build the ANN search model using PAL_ANNS_MODEL_CREATE_BY_VECTOR/ANNSModel.
>>> anns = ANNSModel(by_doc=False)
>>> anns.fit(corpus_vpca[[‘docid’, ‘SCORE_VECTOR’]], key=’docid’, target=’SCORE_VECTOR’)
4. Perform the search using PAL_ANN_SEARCH_BY_VECTOR/ ANNSModel.predict().
>>> search_res = anns.predict(query_vpca, key=’ID’, target=’SCORE_VECTOR’, is_query=True, k_nearest_neighbours=5)
>>> print(search_res.collect())
>>> anns.runtime
0.5038766860961914
Figure 9. The search result of ANNS
This result indicates that the ANN search time has decreased from 5.8 seconds to 0.5 seconds, and the correct document has been found. In addition, depending on the memory difference in HANA Cloud, PAL ANN search is capable of handling datasets from 1 million to x*100 million in size. If the dataset is 1 million, we recommend setting the number of groups to be 1000, and if it is 10 million, we suggest setting it to 10,000. This allows for a speed boost whilst ensuring precision.
4. Summary
This blog post provides an overview of the working principles, implementation, and use cases of ANN search and BM25 within information retrieval systems, specifically within the context of the SAP HANA Cloud PAL implementations. It demonstrates the application of these technologies using Python and SQL code, leveraging a French language corpus from Hugging Face dataset name MLDR to identify documents most relevant to a given query. The experimental results highlight the differences in runtime efficiency between ANN search and BM25, with BM25 outperforming ANN search in speed, while ANN search offers potential advantages in semantic accuracy for complex or larger datasets.
For those exploring advanced applications such as RAG (Retrieval-Augmented Generation) and beyond, we encourage you to leverage these powerful tools to enhance your projects and unlock new possibilities in data-driven insights.
Other Useful Links:
Install the Python Machine Learning client from the pypi public repository: hana-ml
We also provide a R API for SAP HANA PAL called hana.ml.r, please refer to more information on the documentation.
For other blog posts on hana-ml:
A Multivariate Time Series Modeling and Forecasting Guide with Python Machine Learning Client for SAP HANAOutlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANAOutlier Detection by Clustering using Python Machine Learning Client for SAP HANAAnomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for SAP HANAOutlier Detection with One-class Classification using Python Machine Learning Client for SAP HANALearning from Labeled Anomalies for Efficient Anomaly Detection using Python Machine Learning Client for SAP HANAPython Machine Learning Client for SAP HANAImport multiple excel files into a single SAP HANA tableCOPD study, explanation and interpretability with Python machine learning client for SAP HANAModel Storage with Python Machine Learning Client for SAP HANAIdentification of Seasonality in Time Series with Python Machine Learning Client for SAP HANA
This blog post delves into two important information retrieval techniques: BM25 and Approximate Nearest Neighbor (ANN) Search as an advanced text mining technique, which have been introduced in the SAP HANA Cloud QRC04 2024 Predictive Analysis Library (PAL). The following sections will cover the topics:The working principles of ANN search and BM25 in information retrieval and text mining applicationsThe implementation in the SAP HANA Cloud Predictive Analysis Library (PAL)Use cases 1. IntroductionInformation Retrieval (IR) Systems are designed to help users locate and extract information stored within databases or collections of documents. BM25 and ANN search are two important technologies in IR that have been widely applied across various domains such as web search engines, question answering systems and chatbots. When delving into the specifics of these technologies, we first examine two main categories of IR:Dense Retrieval: Use dense vectors for text matching. The relevant answers to the query can be retrieved based on the embedding similarity. Sparse Retrieval: Use sparse vectors for text matching. In these methods, the majority dimension of the vector is zero, with only a few dimensions having values, and the overall vector dimensionality can be very high.Dense retrieval is suitable for scenarios where capturing rich semantic information is necessary, while sparse retrieval is ideal for situations with limited computational resources or when rapid retrieval is required. Dense retrieval generally outperforms sparse retrieval in terms of retrieval effectiveness, but at the cost of higher computational expenses. Sparse retrieval, on the other hand, has a distinct advantage in efficiency, especially when dealing with large datasets.This distinction introduces us to two retrieval technologies: ANN search and BM25. ANN search, as a method of dense retrieval, focuses on swiftly identifying vectors in high-dimensional spaces that are similar to a given vector in an approximate manner. ANN search is especially adept at handling high-dimensional, voluminous dense vector data, which is crucial in modern recommendation systems or large scale vectorized text context stores.Simultaneously, BM25, as a representative of sparse information retrieval, optimizes the relevance scoring between documents and queries through its probabilistic framework. BM25 introduces saturation functions and empirically observed parameter adjustments, providing a more nuanced relevance score, which has led to its widespread adoption in search engines for ranking search results.BM25 is renowned for its simplicity and speed in handling large datasets but may struggle with semantic gaps between queries and documents. In contrast, ANN search offers better recall but at the cost of higher computational resources. The choice between BM25 and ANN search hinges on the application’s requirements for efficiency, precision, and semantic depth.While BM25 and ANN search each offer distinct advantages, there are scenarios where the limitations of each method become apparent. To bridge these gaps, the concept of hybrid search has emerged. Hybrid search integrates different search technologies – ranging from traditional keyword-based methods to modern vector-based techniques – to enhance the retrieval process. For example, this approach can leverage the precision of BM25 for text matching and the semantic richness of ANN search for high-dimensional data, resulting in a better understanding of user queries. 2. ANN Search and BM25 in SAP HANA Cloud PAL2.1 ANN SearchThe ANN search implementation in PAL leverages the Flat IVF algorithm, which clusters the original vector set and divides the vector space into smaller groups referred to as “cells”. It then lists the reference of vectors belonging to each cluster for retrieval. KMeans clustering is used to build the vector-cluster references on a given dataset, dividing it into numerous clusters controlled by the parameter “GROUP_NUMBER”. These cluster centers are stored as the Vector-cluster References along with the original datasets into stateful in-memory ANN search model, thereby enabling much faster responses for search predictions.More specifically, when “GROUP_NUMBER” is set to 1, the K-Nearest Neighbor approach is utilized; otherwise, the Flat IVF strategy is employed, leveraging the cloud system’s AVX instruction set along with multi-threading and matrix computing to improve the search efficiency.PAL’s ANN search accepts both documents (raw text) and vectors for model building through two SQL procedures.The SQL procedure PAL_ANNS_MODEL_CREATE_BY_DOC internally invokes the PAL Text Embedding procedure, PAL_TEXTEMBEDDING, converting documents into vectors and then build the ANN search model.Users can employ their own embedding model to generate vectors and subsequently call the SQL procedure PAL_ANNS_MODEL_CREATE_BY_VECTOR to build the ANN search model.In the Python machine learning client for SAP HANA (hana-ml 2.23) class ANNSmodel, a parameter ‘by_doc’ allows the selection between the two input formats. For example, if ‘by_doc’ is set to True, PAL_ANNS_MODEL_CREATE_BY_DOC is automatically invoked.When searching against the stateful Approximate Nearest Neighbor (ANN) search model with new prediction data, such as a single query or multiple queries, the appropriate SQL procedure is applied based on the format of the queries. This could be either PAL_ANN_SEARCH_BY_VECTOR or PAL_ANN_SEARCH_BY_DOC. For instance, if you have used an embedding model to generate vectors, you should use the same model to generate the embedding of the query. Then, you can call PAL_ANN_SEARCH_BY_VECTOR to perform the search. Such search procedure identifies the nearest clusters by calculating the distance to their centers and selects the top k clusters.The stateful in-memory ANN search models, ensure quickest responses during search predictions while optimizing resource utilization. For governance and management of these stateful in-memory ANN search models, we also provide two dedicated SQL procedures, PAL_ANNS_MODEL_LIST and PAL_ANNS_MODEL_DELETE. Since these models are stored in memory, it is recommended they are removed to free up resources when they are no longer required.2.2 BM25The BM25 (Best Matching 25) is an extension of the Binary Independence Model (BIM) and you could invoke a SQL procedure named PAL_SEARCH_DOCS_BY_KEYWORDS to use this algorithm. This procedure draws on the TFIDF (Term Frequency-Inverse Document Frequency) values calculated using the PAL_TF_ANALYSIS. These TFIDF values are used to search for documents that closely align with user-provided keywords. Although there are a few variations of BM25 in existence, the one currently implemented is the original version.The algorithm’s parameters are easy to adjust and include ‘NUM_BEST_MATCHES’, ‘BM25_K1’, and ‘BM25_B’. The ‘BM25_K1’ parameter controls the saturation level of term frequency, while ‘BM25_B’ determines the extent of document length normalization. 3. Use Cases of ANN Search and BM25In the subsequent sections, we will illustrate the use of SQL and Python code to invoke ANN search and BM25 in SAP HANA Cloud. Note that the scripts provided are purely for illustrative purposes and are not intended for production use.The dataset utilized is a public dataset known as MLDR (Multilingual Long-Document Retrieval). This dataset comprises documents in 13 languages, with the text’s average length ranging from 3300 words (English text) to 9000 words (Russian/Arabic text). The document’s number of lines varies between 6569 to 200,000. Moreover, the dataset provides a test set of queries and their associated document IDs.To better demonstrate and compare the performance of ANN Search and BM25, we selected the French corpus from MLDR, containing 10,000 long texts and we used the first query from the test set to search for its associated document. This query is entitled ‘q-fr-5’ and the document most closely associated with this query is ‘doc-fr-40’. Hence, we expect these two algorithms to identify this document from the French corpus and rank it first.Figure 1 displays the first 5 lines of the French corpus. The data consists of two columns – ‘docid’ and ‘text’. ‘df’ represents a HANA DataFrame of the French corpora, which has been imported into a HANA table named ‘FR_DATA_TBL’. The total number of rows in this dataset is 10,000.>>> df.head(5).collect() Figure 1. The first 5 lines of HANA DataFrame dfIn addition, ‘df_query’ represents a HANA DataFrame of the query, which has been stored into a HANA table named ‘QUERY_TBL’.>>> df_query.collect() Figure 2. The HANA DataFrame df_query3.1 ANN SearchNext, we will display ANN search usage via hana-ml Python script. Since the input text is raw, we set the parameter by_doc=True. We also record the execution time for subsequent comparison between the two algorithms. From the information in the attribute anns.state_, we can observe that the ANN search model was successfully built and stored in the AFL state. The state ID is ‘256E38733E407A4D9F3D6B116AA7348A’, and the information in its COMMENT section corresponds to the comment=’MLDR_FR’ set in the parameters.>>> from hana_ml.text.anns_model import ANNSModel>>> anns = ANNSModel(by_doc=True)>>> anns.fit(data=df, key=’docid’, target=’text’, comment=’MLDR_FR’)>>> print(f”Running time :{anns.runtime} s”)>>> print(anns.state_.collect())>>> print(anns. embedding_result_.head(5).collect()) Figure 3. The attribute anns.state_The embeddings of documents are shown in Figure 4. In the column “VECTOR_COL”, each row contains a 768-dimensional vector.Figure 4. The fist 5 lines of attribute anns.embedding_result_Next, we will use an SQL script to illustrate this example by invoking PAL_ANNS_MODEL_CREATE_BY_DOC. DROP TABLE PAL_ANN_PARAMETER_TBL;
CREATE COLUMN TABLE PAL_ANN_PARAMETER_TBL(
“PARAM_NAME” NVARCHAR(256),
“INT_VALUE” INTEGER,
“DOUBLE_VALUE” DOUBLE,
“STRING_VALUE” NVARCHAR(1000)
);
INSERT INTO PAL_ANN_PARAMETER_TBL VALUES (‘COMMENT’, NULL, NULL, ‘MLDR_FR’);
DO BEGIN
tv_data = SELECT * FROM FR_DATA_TBL;
tv_param = SELECT * FROM PAL_ANN_PARAMETER_TBL;
call _SYS_AFL.PAL_ANNS_MODEL_CREATE_BY_DOC(:tv_data, :tv_param, tv_out1, tv_out2);
SELECT * FROM :tv_out1;
SELECT * FROM :tv_out2;
END; Next, we will search for the document most relevant to the query within the HANA DataFrame df_query and the HANA table name is “MLDK_QUERY_FR_DATA_1″. Since we have not set the parameter GROUP_NUMBER, its default value is 1, which means the KNN algorithm will be used and will return the top 1 result. As depicted in Figure 5, the correct document ‘doc-fr-40′ has been successfully retrieved as the top result, with the search runtime being approximately 5.8 seconds.>>> res = anns.predict(data=df_query, key=’ID’, target=’QUERY’, is_query=True)>>> print(f”Running time :{anns.runtime} s”)>>> print(res.collect()) Figure 5. The runtime and search result of ANN SearchThe SQL script to call PAL_ANN_SEARCH_BY_DOC for searching is as follows. It’s important to note that the parameter “STATE_ID” needs to be explicitly stated in the parameter table. DROP TABLE PAL_ANN_PRED_PARAMETER_TBL;
CREATE COLUMN TABLE PAL_ANN_PRED_PARAMETER_TBL(
“PARAM_NAME” NVARCHAR(256),
“INT_VALUE” INTEGER,
“DOUBLE_VALUE” DOUBLE,
“STRING_VALUE” NVARCHAR(1000)
);
INSERT INTO PAL_ANN_PRED_PARAMETER_TBL VALUES (‘K_NEAREST_NEIGHBOURS’, 1, NULL, NULL);
INSERT INTO PAL_ANN_PRED_PARAMETER_TBL VALUES (‘IS_QUERY’, 1, NULL, NULL);
INSERT INTO PAL_ANN_PRED_PARAMETER_TBL VALUES (‘STATE_ID’, NULL, NULL, ‘256E38733E407A4D9F3D6B116AA7348A’);
CREATE OR REPLACE FUNCTION PALDEVUSER.HANA_PAL_ANNSEARCH (IN QUERY_SET TABLE(“ID” INTEGER, “QUERY” NVARCHAR(5000)))
RETURNS TABLE (“ID” INTEGER, “K” INTEGER, “TRAIN_ID” NVARCHAR(100), “DISTANCE” DOUBLE)
LANGUAGE SQLSCRIPT
SQL SECURITY INVOKER AS
BEGIN
DECLARE tv_final TABLE(“ID” INTEGER, “K” INTEGER, “TRAIN_ID” NVARCHAR(100), “DISTANCE” DOUBLE);
tv_data = SELECT * FROM :QUERY_SET;
tv_param = SELECT * FROM PAL_ANN_PRED_PARAMETER_TBL;
CALL _SYS_AFL.PAL_ANN_SEARCH_BY_DOC(:tv_data, :tv_param, tv_out, tv_out2);
tv_final =SELECT “TEST_ID” AS “ID”, K, TRAIN_ID, DISTANCE FROM :tv_out;
RETURN SELECT * FROM :tv_final;
END;
SELECT * FROM PALDEVUSER.HANA_PAL_ANNSEARCH(MLDK_QUERY_FR_DATA_1); 3.2 BM25If you wish to use BM25, in the hana-ml Python client , you should call the function search_docs_by_keywords(). For SQL, the corresponding procedure is named PAL_SEARCH_DOCS_BY_KEYWORDS.The search_docs_by_keywords() function requires two inputs: ref_data, which is the corpus to be searched, and pred_data, which is the prediction data. ref_data can be either a single HANA DataFrame or one of the three HANA DataFrames (TF-IDF Result, Doc Term Frequency, Doc Category) obtained after executing PAL text analysis (PAL_TF_ANALYSIS). If the input detected is a single DataFrame, the function will implicitly perform PAL_TF_ANALYSIS and then call PAL_SEARCH_DOCS_BY_KEYWORDS. Therefore, the input ref_data must conform to the data format expected by PAL text analysis. In our example, we added a third column with each entry being 1 during preprocessing. This HANA DataFrame, named df_ref, is depicted in Figure 6. The content of this column will not influence the subsequent BM25 calculations.>>> df_ref.collect()Figure 6. The first 5 lines of HANA DataFrame df_refAs for the keywords in the query of pred_data, we preprocessed the df_query by calling PAL_TF_ANALYSIS and then get a list of keywords separated by commas as shown in Figure 7.>>> df_keywords.collect()Figure 7. The HANA DataFrame df_keywordsThe following Python code demonstrates how to invoke the search_docs_by_keywords() function. As shown in Figure 8, the results indicate that BM25 is also capable of finding the correct document, which is referred to as ‘doc-fr-40′.>>> from hana_ml.text.tm import search_docs_by_keywords>>> import time>>> start_time = time.time()>>> res, _ = search_docs_by_keywords(pred_data=df_keywords, ref_data=df_ref, lang=’fr’, thread_ratio=1.0) >>> end_time = time.time()>>> print(f”Running time:{end_time-start_time} s”) >>> print(res.collect()) Figure 8. The search result of BM25The SQL code is shown as follows. DROP TABLE PAL_PARAMETER_TAB;
CREATE COLUMN TABLE PAL_PARAMETER_TAB(
“PARAM_NAME” NVARCHAR(256),
“INT_VALUE” INTEGER,
“DOUBLE_VALUE” DOUBLE,
“STRING_VALUE” NVARCHAR(1000)
);
CREATE OR REPLACE FUNCTION PALDEVUSER.HANA_PAL_BM25 (IN DATA_SET TABLE(“docid” INTEGER, “text” NCLOB, “CATEGORY” INTEGER), IN KEYWORDS_SET TABLE(“ID” INTEGER, “keywords” NVARCHAR(5000)))
RETURNS TABLE (“PREDICT_ID” INTEGER, “INDEX” INTEGER, “BEST_MATCHED” NVARCHAR(1000), “SCORE” DOUBLE)
LANGUAGE SQLSCRIPT
SQL SECURITY INVOKER AS
BEGIN
tv_data = SELECT * FROM :DATA_SET;
tv_keywords = SELECT * FROM :KEYWORDS_SET;
tv_param = SELECT * FROM PAL_PARAMETER_TAB;
CALL _SYS_AFL.PAL_TF_ANALYSIS(:tv_data, :tv_param, lt_tm_term, lt_tm_doc_term_freq, lt_cate);
CALL _SYS_AFL.PAL_SEARCH_DOCS_BY_KEYWORDS(:lt_tm_term, :lt_tm_doc_term_freq, :lt_cate, :tv_keywords, :tv_param, lt_result, lt_extra);
RETURN SELECT * FROM :lt_result;
END;
select * from PALDEVUSER.HANA_PAL_BM25(FR_DATA_TBL_CATEGORY, MLDK_QUERY_BM25_DATA_1); 3.3 DiscussionsFrom the examples demonstrated in Sessions 3.1 and 3.2, we noted the time taken to retrieve a document corresponding to a query and the search results. These are key metrics when evaluating the performance of information retrieval (IR) methods like ANN search and BM25, focusing on their efficiency and accuracy. For ANN search, the majority of time is spent on generating corpus embeddings/model. Once the model is built, ANN search outperforms BM25 in terms of search speed. However, as the scale and dimensionality of the dataset increase, algorithm performance can suffer. To address this issue, PAL supports Vector PCA for dimensionality reduction which can accelerates the speed of the search algorithms and reduces memory usage. The steps are shown below with name of SQL procedures and Python functions. Due to space constraints, only Python code is shown. For a comprehensive guide of PAL Vector PCA and related code, please refer to another the blog post [titled “Dim_Reduction_Hybrid_Prediction.”].1. Generate corpus and query embeddings using PAL_TEXTEMBEDDING/ PALEmbeddings.fit_transform().>>> embed = PALEmbeddings()>>> corpus_vector = embed.fit_transform(data=df, key=’docid’, target=’text’) >>> query_vector = embed.fit_transform(data=query_df, key=’ID’, target=’QUERY’,is_query=True) 2. Apply PAL_VECPCA/ VectorPCA to reduce the dimensionality of the corpus and query embeddings from 768 (the default dimension of vector in PAL) to 256..>>> vecpca = VectorPCA(n_components=256) >>> corpus_vector.index = ‘docid’ >>> corpus_vpca = vecpca.fit_transform(data=corpus_vector.select([“docid”, “VECTOR_COL”]), key=’docid’) >>> query_vpca = vecpca.transform(data=query_vector.select([“ID”, “VECTOR_COL”]), key=’ID’) 3. Build the ANN search model using PAL_ANNS_MODEL_CREATE_BY_VECTOR/ANNSModel. >>> anns = ANNSModel(by_doc=False) >>> anns.fit(corpus_vpca[[‘docid’, ‘SCORE_VECTOR’]], key=’docid’, target=’SCORE_VECTOR’) 4. Perform the search using PAL_ANN_SEARCH_BY_VECTOR/ ANNSModel.predict(). >>> search_res = anns.predict(query_vpca, key=’ID’, target=’SCORE_VECTOR’, is_query=True, k_nearest_neighbours=5) >>> print(search_res.collect()) >>> anns.runtime 0.5038766860961914 Figure 9. The search result of ANNSThis result indicates that the ANN search time has decreased from 5.8 seconds to 0.5 seconds, and the correct document has been found. In addition, depending on the memory difference in HANA Cloud, PAL ANN search is capable of handling datasets from 1 million to x*100 million in size. If the dataset is 1 million, we recommend setting the number of groups to be 1000, and if it is 10 million, we suggest setting it to 10,000. This allows for a speed boost whilst ensuring precision. 4. SummaryThis blog post provides an overview of the working principles, implementation, and use cases of ANN search and BM25 within information retrieval systems, specifically within the context of the SAP HANA Cloud PAL implementations. It demonstrates the application of these technologies using Python and SQL code, leveraging a French language corpus from Hugging Face dataset name MLDR to identify documents most relevant to a given query. The experimental results highlight the differences in runtime efficiency between ANN search and BM25, with BM25 outperforming ANN search in speed, while ANN search offers potential advantages in semantic accuracy for complex or larger datasets.For those exploring advanced applications such as RAG (Retrieval-Augmented Generation) and beyond, we encourage you to leverage these powerful tools to enhance your projects and unlock new possibilities in data-driven insights.Other Useful Links:Install the Python Machine Learning client from the pypi public repository: hana-ml We also provide a R API for SAP HANA PAL called hana.ml.r, please refer to more information on the documentation.For other blog posts on hana-ml:A Multivariate Time Series Modeling and Forecasting Guide with Python Machine Learning Client for SAP HANAOutlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANAOutlier Detection by Clustering using Python Machine Learning Client for SAP HANAAnomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for SAP HANAOutlier Detection with One-class Classification using Python Machine Learning Client for SAP HANALearning from Labeled Anomalies for Efficient Anomaly Detection using Python Machine Learning Client for SAP HANAPython Machine Learning Client for SAP HANAImport multiple excel files into a single SAP HANA tableCOPD study, explanation and interpretability with Python machine learning client for SAP HANAModel Storage with Python Machine Learning Client for SAP HANAIdentification of Seasonality in Time Series with Python Machine Learning Client for SAP HANA Read More Technology Blogs by SAP articles
#SAP
#SAPTechnologyblog