Documentation

Clusterium is a Bayesian nonparametric toolkit for text clustering, analysis, and benchmarking that leverages state-of-the-art embedding models and statistical validation techniques.

Overview

Features

  • Dirichlet Process Clustering: Implements the Dirichlet Process for text clustering

  • Pitman-Yor Process Clustering: Implements the Pitman-Yor Process for text clustering with improved performance

  • Evaluation: Evaluates clustering results using a variety of metrics, including Silhouette Score, Similarity Analysis (intra/inter-cluster), Power-law Analysis, Outlier Detection, and Cluster Size Distribution

  • Visualization: Generates plots of cluster size distributions
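
To make the difference between the two processes concrete, here is a minimal sketch of the Chinese restaurant process view of the Dirichlet and Pitman-Yor priors. It is not Clusterium's implementation and uses no text or embeddings: it only samples cluster labels from the prior. The function name `sample_partition` and the parameter names `concentration` and `discount` are hypothetical and are not meant to correspond to the `alpha`, `kappa`, or `sigma` arguments of the package. For simplicity the sketch assumes `concentration > 0`.

```python
import random
from collections import Counter


def sample_partition(n_items, concentration, discount=0.0, seed=0):
    """Sample cluster labels from a Pitman-Yor Chinese restaurant process.

    With discount=0.0 this reduces to the Dirichlet process case.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    if concentration <= 0.0:
        raise ValueError("this sketch assumes concentration > 0")

    rng = random.Random(seed)
    sizes = []   # sizes[k] is the number of items currently in cluster k
    labels = []
    for _ in range(n_items):
        # An existing cluster k is chosen with weight (sizes[k] - discount);
        # a new cluster is opened with weight (concentration + discount * K),
        # where K is the current number of clusters.
        weights = [size - discount for size in sizes]
        weights.append(concentration + discount * len(sizes))
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)      # open a new cluster
        else:
            sizes[k] += 1        # join an existing cluster
        labels.append(k)
    return labels


if __name__ == "__main__":
    dp_labels = sample_partition(1000, concentration=1.0, discount=0.0)
    pyp_labels = sample_partition(1000, concentration=1.0, discount=0.5)
    print("DP prior clusters: ", len(Counter(dp_labels)))
    print("PYP prior clusters:", len(Counter(pyp_labels)))
```

A positive discount shifts weight away from existing clusters and toward opening new ones, which is why the Pitman-Yor prior tends to yield more, smaller clusters than the Dirichlet prior for the same number of items.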

Quick Start

# Install the package
pip install clusx

# Basic clustering with default parameters
clusx cluster --input your_data.txt

# Evaluate clustering results
clusx evaluate \
  --input your_data.txt \
  --dp-clusters output/clusters_output_dp.csv \
  --pyp-clusters output/clusters_output_pyp.csv

That’s it! The tool uses optimized default parameters and saves all outputs to the output directory.

For interactive visualization during evaluation, add the --show-plot option:

clusx evaluate \
  --input your_data.txt \
  --dp-clusters output/clusters_output_dp.csv \
  --pyp-clusters output/clusters_output_pyp.csv \
  --show-plot

Python API Example

from clusx.clustering import DirichletProcess, PitmanYorProcess
from clusx.clustering.utils import load_data

# Load data
texts = load_data("your_data.txt")

# Perform clustering (parameters are passed explicitly here)
dp = DirichletProcess(alpha=1.0, kappa=0.8)
clusters_dp = dp.fit_predict(texts)

pyp = PitmanYorProcess(alpha=0.8, kappa=0.6, sigma=0.3)
clusters_pyp = pyp.fit_predict(texts)

# Print number of clusters found
print(f"DP found {len(set(clusters_dp))} clusters")
print(f"PYP found {len(set(clusters_pyp))} clusters")

For more advanced usage, including saving results and evaluation, see the Usage Guide.
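
As a small follow-up to the example above, the sketch below summarizes cluster sizes from a list of labels. It assumes only that the labels are hashable values, one per input text, as the `len(set(...))` calls above suggest. The helper `cluster_size_summary` is hypothetical and is not part of the `clusx` package.

```python
from collections import Counter


def cluster_size_summary(labels):
    """Summarize a list of cluster labels (one label per input text)."""
    counts = Counter(labels)
    sizes = sorted(counts.values(), reverse=True)
    return {
        "n_clusters": len(sizes),
        "largest": sizes[0] if sizes else 0,
        "singletons": sum(1 for size in sizes if size == 1),
        "sizes": sizes,
    }


if __name__ == "__main__":
    # Stand-in labels; in practice, pass clusters_dp or clusters_pyp.
    example_labels = [0, 0, 1, 2, 0, 1, 3]
    print(cluster_size_summary(example_labels))
```

The sorted size list is a convenient starting point for inspecting how skewed the cluster size distribution is before running the full evaluation command.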

Note

For detailed installation instructions, see the Installation Guide. For usage instructions, use cases, examples, and advanced configuration options, see the Usage Guide.


Full Table of Contents

The User Guide

This part of the documentation, which is mostly prose, begins with some background information about Clusterium, then focuses on step-by-step instructions for getting the most out of Clusterium.

The Community Guide

This part of the documentation, which is mostly prose, details the Clusterium ecosystem and community.

The API Documentation / Guide

If you are looking for information on a specific function, class, method, or algorithm, this part of the documentation is for you.

The Contributor Guide

If you want to contribute to the project, this part of the documentation is for you.

Support

If you have a question or a remark, find a bug, or run into something you can’t do with Clusterium, please open an issue.

Project Information

Clusterium is released under the MIT License. Its documentation lives at Read the Docs, the code on GitHub, and the latest release on PyPI. It’s rigorously tested on Python 3.11+.

If you’d like to contribute to Clusterium, you’re most welcome!
