Clustering is a fundamental technique in data analysis and machine learning, widely used for identifying and grouping similar objects within a dataset based on inherent characteristics or patterns. It involves partitioning data points into distinct subsets known as clusters, where data points within the same cluster are more similar to each other than to those in other clusters. This unsupervised learning approach facilitates data exploration, pattern recognition, and information retrieval across various fields, from market segmentation to image analysis. However, traditional clustering methods operate without prior knowledge about the data, often leading to challenges in achieving meaningful or contextually relevant groupings. This limitation paves the way for constrained clustering, an advanced form of clustering that incorporates domain-specific constraints or prior information to guide the clustering process, ensuring more accurate and meaningful results tailored to specific analytical needs.
We are excited to announce the integration of a newly developed constrained clustering algorithm into the SAP HANA Predictive Analysis Library (PAL). In this blog post, we will explore the fundamentals of constrained clustering and demonstrate its application in document clustering.
Constrained clustering
As data complexity increases, traditional clustering algorithms such as k-means and hierarchical clustering often fall short in capturing the intrinsic structure of the data. Deep clustering addresses these limitations by combining deep learning and clustering techniques to provide powerful representations and facilitate more accurate clustering. This approach serves as an excellent foundation for developing constrained clustering algorithms, which incorporate domain-specific constraints into the clustering process.
Deep Embedded Clustering (DEC) is a framework that leverages deep neural networks to learn low-dimensional representations of data, which are subsequently used for clustering. The key idea behind DEC is to use a deep autoencoder to obtain meaningful embeddings and then refine these embeddings through clustering-oriented learning. The main components of DEC are:
- Autoencoder Architecture: An autoencoder is a type of neural network designed to learn efficient codings of input data. It consists of two parts: the encoder, which compresses the data, and the decoder, which reconstructs the data from the compressed representation. In DEC, the encoder is crucial for learning compact, informative representations that can be clustered effectively.
- Clustering Layer: Once a suitable representation is obtained from the encoder, DEC introduces a clustering layer that refines these representations to better fit a clustering model.
- Optimization Objective: The DEC framework focuses on minimizing the reconstruction loss of the autoencoder along with a clustering loss that aims to improve clustering accuracy (see the formulation sketched below).
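For reference, the clustering loss in the original DEC formulation (Xie et al., 2016) has a simple closed form; PAL's internal implementation may differ in detail, but the following captures the idea. Each embedded point $z_i$ is softly assigned to cluster centroid $\mu_j$ with a Student's t kernel, and training minimizes the KL divergence between the soft assignments $Q$ and a sharpened target distribution $P$:

$$q_{ij} = \frac{\left(1 + \lVert z_i - \mu_j \rVert^2\right)^{-1}}{\sum_{j'} \left(1 + \lVert z_i - \mu_{j'} \rVert^2\right)^{-1}}, \qquad p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \quad f_j = \sum_i q_{ij},$$

$$L_{\text{cluster}} = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$

In IDEC-style variants, the total objective combines this with the autoencoder's reconstruction loss, $L = L_{\text{rec}} + \gamma \, L_{\text{cluster}}$; the GAMMA parameter in the example call below plausibly plays the role of this weight.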
Constrained clustering is an extension of traditional clustering paradigms that incorporates domain-specific knowledge into the clustering process through constraints. This approach is particularly beneficial in scenarios where certain data points are known to belong together or certain separations should be maintained. By building on the DEC framework, constrained clustering can effectively leverage deep representations while respecting additional constraints. To integrate constraints into DEC, important modifications include:
- Defining Constraints: Constraints can be pairwise, such as must-link constraints (indicating that two data points should be clustered together) and cannot-link constraints (indicating that two data points should belong to different clusters). Constraints can also be triplets, which specify relative similarity relationships among data points: a given anchor point should be more similar to a designated positive point than to a negative point.
- Modified Clustering Layer: In a DEC algorithm that accounts for constraints, the clustering layer is extended to handle them. This involves modifying the clustering loss to not only minimize the disparity within clusters but also satisfy the specified constraints; one common formulation is sketched below.
- Optimization with Constraints: The optimization objective of constrained clustering balances the fidelity of the data representation, clustering quality, and adherence to constraints. This calls for optimization techniques such as alternating optimization to balance these competing aspects effectively.
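As an illustration (a common way to encode such constraints in the embedding space, not necessarily the exact losses used inside PAL), must-link pairs can be pulled together, cannot-link pairs pushed apart up to a separation margin $m$ (illustrative), and triplets enforced with a margin $\theta$, which corresponds to the THETA parameter in the example call below; the pairwise terms would be weighted by ML_PENALTY and CL_PENALTY:

$$L_{\text{ML}} = \sum_{(i,j) \in \mathcal{M}} \lVert z_i - z_j \rVert^2, \qquad L_{\text{CL}} = \sum_{(i,j) \in \mathcal{C}} \max\left(0,\, m - \lVert z_i - z_j \rVert\right)^2,$$

$$L_{\text{triplet}} = \sum_{(a,p,n) \in \mathcal{T}} \max\left(0,\, \theta + \lVert z_a - z_p \rVert^2 - \lVert z_a - z_n \rVert^2\right).$$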
In the following section, we will explore the application of constrained clustering in the context of document clustering.
Example
Document clustering is a crucial process in the realm of natural language processing (NLP) and information retrieval. It involves grouping text documents into clusters, where documents in the same cluster share similar topics or contents. This technique has vast applications in organizing, filtering, and searching information, thus enhancing user experience in navigating vast text archives. In this example, we apply document clustering to the 20 Newsgroups dataset.
Dataset
The 20 Newsgroups dataset is a benchmark dataset used extensively for text classification and clustering tasks. It comprises approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups, each corresponding to a different topic. This dataset serves as a useful test bed for document clustering due to its diversity in topics and ease of accessibility. These topics range broadly from computer science themes such as “comp.graphics” to recreational or opinion-based themes like “rec.sport.baseball.”
Before applying clustering algorithms, documents must be preprocessed to transform the text into a suitable format for analysis. This includes the following steps (a minimal code sketch follows the list):
- Tokenization: Breaking down text into individual words or tokens.
- Stopwords Removal: Excluding common words that do not contribute to topic differentiation (e.g., "the," "is," "and").
- Stemming/Lemmatization: Reducing words to their base or root form (e.g., "running" to "run").
- Vectorization: Converting text into numerical vectors using techniques like TF-IDF or embeddings. Interested readers can find more information on text embeddings in the topics linked at the bottom of the page.
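As a minimal illustration of these steps, here is a sketch using scikit-learn; the library choice and parameter values are assumptions for illustration, not the exact pipeline behind this post. It covers tokenization, stopword removal, and TF-IDF vectorization; stemming or lemmatization could be added via a custom tokenizer.

# Minimal preprocessing sketch (tooling and parameters are illustrative)
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the 20 Newsgroups corpus; stripping headers/footers/quotes reduces noise
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Tokenization, stopword removal, and TF-IDF vectorization in one step
vectorizer = TfidfVectorizer(stop_words='english', max_features=2000)
X = vectorizer.fit_transform(newsgroups.data)  # sparse TF-IDF matrix, shape (n_docs, 2000)
y = newsgroups.target  # ground-truth topics, used below only to derive constraints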
Once the document data has been vectorized, we can proceed to prepare the constraint information. Since the 20 Newsgroups dataset includes target (label) information, we employ a straightforward strategy to generate constraints automatically.
Pairwise constraints
We randomly select pairs of instances to generate pairwise constraints: a pair sharing the same target label becomes a must-link, while a pair with different labels becomes a cannot-link. To keep the constraint set consistent, we compute the transitive closure of the must-link pairs (if a is must-linked to b and b to c, then a is must-linked to c) and then derive the additional cannot-link constraints that this closure entails. The resulting constraints are entered into the constraints table; a code sketch of this generation procedure follows the example table below. In the table, the first column specifies the type of constraint, with '1' representing a must-link and '-1' representing a cannot-link. The second and third columns contain the IDs of the instance pairs. Here is an example of a pairwise constraint table for illustration:
CONSTRAINT_TYPE | ID1 | ID2
1               | 0   | 1
-1              | 1   | 2
-1              | 2   | 3
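The following sketch makes this generation strategy concrete, assuming ground-truth labels y (as produced in the preprocessing sketch above); the function name and structure are illustrative. The returned rows follow the column layout of the constraint table (CONSTRAINT_TYPE, ID1, ID2) and would be inserted into PAL_DCC_CONSTRAINTS_TBL.

import random

def generate_pairwise_constraints(y, n_pairs, seed=1):
    rng = random.Random(seed)
    n = len(y)
    parent = list(range(n))  # union-find structure over must-link components

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    must, cannot = set(), set()
    for _ in range(n_pairs):
        i, j = rng.sample(range(n), 2)
        if y[i] == y[j]:
            must.add((min(i, j), max(i, j)))
            parent[find(i)] = find(j)  # merge components of must-linked pairs
        else:
            cannot.add((min(i, j), max(i, j)))

    # Transitive closure of must-links: every pair inside a component is must-linked
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    for g in groups.values():
        for a in range(len(g)):
            for b in range(a + 1, len(g)):
                must.add((g[a], g[b]))

    # Entailed cannot-links: if two components contain a cannot-linked pair,
    # every cross-component pair is cannot-linked as well
    expanded = set()
    for (i, j) in cannot:
        for a in groups[find(i)]:
            for b in groups[find(j)]:
                expanded.add((min(a, b), max(a, b)))

    return ([(1, i, j) for (i, j) in sorted(must)]
            + [(-1, i, j) for (i, j) in sorted(expanded)])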
Triplet constraints
Triplet constraints indicate that instance i is more similar to instance j than to instance k. To emulate human input on these triplet constraints, we randomly choose instances to serve as anchors (i). For each anchor, we then randomly select two instances (j and k) and order them by their similarity to the anchor, determined by the Euclidean distance between the preprocessed embeddings of the instances. The following is an example of generated triplet constraints specifying the anchor, positive, and negative instances (again followed by a code sketch):
ANCHOR | POSITIVE | NEGATIVE
1      | 2        | 3
2      | 4        | 5
3      | 6        | 7
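A sketch of this triplet generation, assuming X is a dense array of preprocessed document embeddings (e.g., the TF-IDF matrix above converted via X.toarray(), or lower-dimensional text embeddings); the function name is illustrative, and rows follow the ANCHOR, POSITIVE, NEGATIVE layout.

import numpy as np

def generate_triplets(X, n_triplets, seed=1):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    triplets = []
    for _ in range(n_triplets):
        a, j, k = rng.choice(n, size=3, replace=False)  # anchor and two candidates
        # The candidate closer to the anchor (Euclidean distance) becomes the positive
        if np.linalg.norm(X[a] - X[j]) < np.linalg.norm(X[a] - X[k]):
            triplets.append((a, j, k))
        else:
            triplets.append((a, k, j))
    return triplets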
Once the input data and constraints are prepared, using constrained clustering is as straightforward as employing any other algorithm in PAL. Below is an annotated code example for your reference.
--########## COLUMN TABLE CREATION ##########
CREATE LOCAL TEMPORARY COLUMN TABLE #PAL_PARAMETER_TBL ("PARAM_NAME" NVARCHAR (256), "INT_VALUE" INTEGER, "DOUBLE_VALUE" DOUBLE, "STRING_VALUE" NVARCHAR (1000));
CREATE COLUMN TABLE PAL_DCC_PREMODEL_TBL ("ROW_INDEX" INTEGER, "MODEL_CONTENT" NVARCHAR (5000));
CREATE COLUMN TABLE PAL_DCC_RESULT_TBL ("ID" INTEGER, "CLUSTER_ID" INTEGER, "CONFIDENCE" DOUBLE);
CREATE COLUMN TABLE PAL_DCC_MODEL_TBL ("ROW_INDEX" INTEGER, "MODEL_CONTENT" NVARCHAR (5000));
CREATE COLUMN TABLE PAL_DCC_LOG_TBL ("EPOCH" INTEGER, "BATCH" NVARCHAR (256), "VALUE" DOUBLE);
CREATE COLUMN TABLE PAL_DCC_STATISTICS_TBL ("STAT_NAME" NVARCHAR (256), "STAT_VALUE" NVARCHAR (1000), "STAT_VECTOR" REAL_VECTOR);
--########## TABLE INSERTS ##########
-- Assume the data is inserted in PAL_DCC_DATA_TBL and the constraints are in the PAL_DCC_CONSTRAINTS_TBL table.
--########## PAL_PARAMETER_TBL DATA INSERTION ##########
-- mandatory, number of groups
INSERT INTO #PAL_PARAMETER_TBL VALUES ('GROUP_NUMBER', 4, NULL, NULL);
-- mandatory, type of constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CONSTRAINT_TYPE', 0, NULL, NULL);
-- hidden layer sizes of the encoder
INSERT INTO #PAL_PARAMETER_TBL VALUES ('ENCODER_HIDDEN_DIMS', NULL, NULL, '50, 200');
-- dimension of the latent space
INSERT INTO #PAL_PARAMETER_TBL VALUES ('EMBEDDING_DIM', 3, NULL, NULL);
-- whether to use normalization
INSERT INTO #PAL_PARAMETER_TBL VALUES ('NORMALIZATION', 1, NULL, NULL);
-- seed for the random number generator
INSERT INTO #PAL_PARAMETER_TBL VALUES ('SEED', 1, NULL, NULL);
-- learning rate for pretraining
INSERT INTO #PAL_PARAMETER_TBL VALUES ('PRETRAIN_LEARNING_RATE', NULL, 0.01, NULL);
-- number of pretraining epochs
INSERT INTO #PAL_PARAMETER_TBL VALUES ('PRETRAIN_EPOCHS', 200, NULL, NULL);
-- number of training samples in a batch during pretraining
INSERT INTO #PAL_PARAMETER_TBL VALUES ('PRETRAIN_BATCH_SIZE', 16, NULL, NULL);
-- ratio of the total number of threads that can be used
INSERT INTO #PAL_PARAMETER_TBL VALUES ('THREAD_RATIO', NULL, 1.0, NULL);
-- degree of distorting the latent space
INSERT INTO #PAL_PARAMETER_TBL VALUES ('GAMMA', NULL, 0.1, NULL);
-- pairwise, penalty for must-link constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('ML_PENALTY', NULL, 0.1, NULL);
-- pairwise, penalty for cannot-link constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CL_PENALTY', NULL, 1.0, NULL);
-- triplet, margin in the triplet loss
INSERT INTO #PAL_PARAMETER_TBL VALUES ('THETA', NULL, 0.1, NULL);
-- learning rate
INSERT INTO #PAL_PARAMETER_TBL VALUES ('LEARNING_RATE', NULL, 0.01, NULL);
-- maximum number of training epochs
INSERT INTO #PAL_PARAMETER_TBL VALUES ('MAX_EPOCHS', 100, NULL, NULL);
-- number of training samples in a batch
INSERT INTO #PAL_PARAMETER_TBL VALUES ('BATCH_SIZE', 16, NULL, NULL);
-- frequency of updating the target distribution
INSERT INTO #PAL_PARAMETER_TBL VALUES ('UPDATE_INTERVAL', 1, NULL, NULL);
-- pairwise, number of must-link constraints in a batch
INSERT INTO #PAL_PARAMETER_TBL VALUES ('ML_BATCH_SIZE', 16, NULL, NULL);
-- pairwise, number of cannot-link constraints in a batch
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CL_BATCH_SIZE', 16, NULL, NULL);
-- triplet, number of triplet constraints in a batch
INSERT INTO #PAL_PARAMETER_TBL VALUES ('TRIPLET_BATCH_SIZE', 16, NULL, NULL);
-- pairwise, frequency of training with must-link constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('ML_UPDATE_INTERVAL', 1, NULL, NULL);
-- pairwise, frequency of training with cannot-link constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('CL_UPDATE_INTERVAL', 1, NULL, NULL);
-- triplet, frequency of training with triplet constraints
INSERT INTO #PAL_PARAMETER_TBL VALUES ('TRIPLET_UPDATE_INTERVAL', 1, NULL, NULL);
-- stopping threshold
INSERT INTO #PAL_PARAMETER_TBL VALUES ('TOLERANCE', NULL, 0.01, NULL);
-- verbosity of the log
INSERT INTO #PAL_PARAMETER_TBL VALUES ('VERBOSE', 0, NULL, NULL);
--########## CONSTRAINED CLUSTERING PROCEDURE CALL ##########
DO BEGIN
lt_data = SELECT * FROM PAL_DCC_DATA_TBL;
lt_constraints = SELECT * FROM PAL_DCC_CONSTRAINTS_TBL;
lt_param = SELECT * FROM #PAL_PARAMETER_TBL;
lt_premodel = SELECT * FROM PAL_DCC_PREMODEL_TBL;
CALL _SYS_AFL.PAL_CONSTRAINED_CLUSTERING(:lt_data, :lt_constraints, :lt_param, :lt_premodel, lt_result, lt_model, lt_log, lt_statistics);
INSERT INTO PAL_DCC_RESULT_TBL SELECT * FROM :lt_result;
INSERT INTO PAL_DCC_MODEL_TBL SELECT * FROM :lt_model;
INSERT INTO PAL_DCC_LOG_TBL SELECT * FROM :lt_log;
INSERT INTO PAL_DCC_STATISTICS_TBL SELECT * FROM :lt_statistics;
END;
--########## SELECT * TABLES ##########
SELECT * FROM PAL_DCC_RESULT_TBL;
SELECT * FROM PAL_DCC_MODEL_TBL;
SELECT * FROM PAL_DCC_LOG_TBL;
You can view the clustering results, model, and training log in the output tables, described below.
The result table presents the assigned cluster ID for each instance; its third column (CONFIDENCE) gives the soft label, i.e., the probability of the cluster assignment.
The model table documents crucial training information and the weights for the trained neural networks.
The entire training process can be monitored through the log table, which contains loss values for each training iteration. Additionally, the verbosity of the log can be adjusted using the VERBOSE parameter.
Summary
Deep clustering provides a robust framework that pushes the boundaries of unsupervised learning through the integration of deep learning. By extending DEC to incorporate constraints, we obtain a variant that combines the strengths of deep representations with domain-specific knowledge, enhancing the applicability of clustering to complex real-world problems.
Recent topics on HANA machine learning:
Hybrid Prediction with Tabular and Text Inputs using Hybrid Gradient Boosting Trees
Dimensionality Reduction of Text Embeddings for Hybrid Prediction Data
Text Embedding Service in SAP HANA Cloud Predictive Analysis Library (PAL)
Document Clustering using KMeans and Text Embeddings
New Text Analysis in SAP HANA Cloud Predictive Analysis Library (PAL)
New Information Retrieval Techniques in SAP HANA Cloud using BM25 and ANNS for Advanced Text Mining
Text Chunking – An Exciting New NLP Function in SAP HANA Cloud
Exploring ML Explainability in SAP HANA PAL – AutoML