Document Clustering using KMeans and Text Embeddings

Traditionally, document clustering has relied on term-based vectorization such as TF-IDF, combined with standard clustering algorithms for tabular data. However, term-based vectorizations (and variants such as latent semantic analysis) often fail to capture the semantic information in document texts, and so give unsatisfying clustering results. In contrast, text embeddings produced by pre-trained large language models capture more of the semantic meaning of the corresponding texts, and give more semantically consistent results when applied to document clustering.
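To make the idea concrete, here is a minimal sketch of k-means clustering over document vectors, written in plain Python. The `toy_embeddings` values are hand-made placeholders standing in for real embedding-model output, and the helper names (`normalize`, `sq_dist`, `kmeans`) are hypothetical. In practice the vectors would come from an embedding model, and a library implementation such as scikit-learn's `KMeans` would normally be used instead.

```python
import math

def normalize(v):
    """Scale a vector to unit length so Euclidean distance tracks cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, n_iter=50):
    """Cluster vectors into k groups; returns one cluster label per vector."""
    points = [normalize(v) for v in vectors]
    # Deterministic farthest-first initialization (an assumption made here
    # for reproducibility; library implementations usually use k-means++).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Placeholder data: these vectors are NOT real model output.
docs = ["invoice overdue", "payment reminder", "server outage", "database down"]
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]

labels = kmeans(toy_embeddings, k=2)
for doc, label in zip(docs, labels):
    print(label, doc)
```

The vectors are normalized first because embedding similarity is usually measured by cosine similarity, and on unit-length vectors Euclidean distance ranks neighbors in the same order. Note that none of the four toy documents share a word, so a purely term-based representation would have no overlap to group them by.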

 
