Documentation
Clusterium is a Bayesian nonparametric toolkit for text clustering, analysis, and benchmarking that leverages state-of-the-art embedding models and statistical validation techniques.
Overview
Features
Dirichlet Process Clustering: Implements the Dirichlet Process for text clustering
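Pitman-Yor Process Clustering: Implements the Pitman-Yor Process for text clustering, a generalization of the Dirichlet Process whose additional discount parameter makes it better suited to power-law cluster size distributions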
Evaluation: Evaluates clustering results using a variety of metrics, including Silhouette Score, Similarity Analysis (intra/inter-cluster), Power-law Analysis, Outlier Detection, and Cluster Size Distribution
Visualization: Generates plots of cluster size distributions
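To make the difference between the two processes concrete, here is a minimal sketch of the textbook sequential ("Chinese restaurant") prior behind both models. It is not Clusterium's implementation: the names `seating_weights` and `simulate` are hypothetical, the sketch ignores text similarity entirely, and it assumes `sigma` is the Pitman-Yor discount parameter (with `sigma=0` reducing to the Dirichlet Process).

```python
import random


def seating_weights(cluster_sizes, alpha, sigma=0.0):
    """Unnormalized weights for joining each existing cluster, then a new one.

    Hypothetical helper for illustration only.
    sigma = 0      -> Dirichlet Process prior
    0 < sigma < 1  -> Pitman-Yor prior (each existing cluster is discounted)
    """
    existing = [size - sigma for size in cluster_sizes]
    new_cluster = alpha + sigma * len(cluster_sizes)
    return existing + [new_cluster]


def simulate(n_items, alpha, sigma=0.0, seed=0):
    """Assign items one at a time using the prior alone; return cluster sizes."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_items):
        weights = seating_weights(sizes, alpha, sigma)
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(sizes):
            sizes.append(1)      # open a new cluster
        else:
            sizes[choice] += 1   # join an existing cluster
    return sizes


dp_sizes = simulate(500, alpha=1.0)
pyp_sizes = simulate(500, alpha=1.0, sigma=0.5)
print(f"DP prior:  {len(dp_sizes)} clusters")
print(f"PYP prior: {len(pyp_sizes)} clusters")
```

With a positive discount, the weight of opening a new cluster grows with the number of clusters already present, which is why the Pitman-Yor prior tends to produce many small clusters alongside a few large ones.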
Quick Start
# Install the package
pip install clusx
# Basic clustering with default parameters
clusx cluster --input your_data.txt
# Evaluate clustering results
clusx evaluate \
  --input your_data.txt \
  --dp-clusters output/clusters_output_dp.csv \
  --pyp-clusters output/clusters_output_pyp.csv
That’s it! The tool uses optimized default parameters and saves all outputs to the output directory.
For interactive visualization during evaluation, add the --show-plot option:
clusx evaluate \
  --input your_data.txt \
  --dp-clusters output/clusters_output_dp.csv \
  --pyp-clusters output/clusters_output_pyp.csv \
  --show-plot
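As a rough illustration of the intra/inter-cluster similarity analysis listed under Features, the sketch below compares the mean cosine similarity of pairs inside the same cluster with that of pairs in different clusters. It is a generic, standard-library example under stated assumptions, not the code that `clusx evaluate` runs: the function names and the toy 2-D vectors are hypothetical stand-ins for real text embeddings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intra_inter_similarity(vectors, labels):
    """Return (mean same-cluster similarity, mean cross-cluster similarity)."""
    intra, inter = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        if labels[i] == labels[j]:
            intra.append(sim)
        else:
            inter.append(sim)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(intra), mean(inter)


# Toy data: hypothetical 2-D "embeddings" with two obvious groups.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]

intra, inter = intra_inter_similarity(vectors, labels)
print(f"intra-cluster: {intra:.3f}, inter-cluster: {inter:.3f}")
```

A clustering that separates the data well should show a clearly higher intra-cluster value than inter-cluster value.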
Python API Example
from clusx.clustering import DirichletProcess, PitmanYorProcess
from clusx.clustering.utils import load_data
# Load data
texts = load_data("your_data.txt")
# Perform clustering (parameters are passed explicitly here)
dp = DirichletProcess(alpha=1.0, kappa=0.8)
clusters_dp = dp.fit_predict(texts)
pyp = PitmanYorProcess(alpha=0.8, kappa=0.6, sigma=0.3)
clusters_pyp = pyp.fit_predict(texts)
# Print number of clusters found
print(f"DP found {len(set(clusters_dp))} clusters")
print(f"PYP found {len(set(clusters_pyp))} clusters")
For more advanced usage, including saving results and evaluation, see the Usage Guide.
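Beyond counting clusters, it is often useful to look at how sizes are distributed. The sketch below summarizes a list of cluster labels using only the standard library. It assumes that `fit_predict` returns one hashable label per input text (as the `len(set(...))` calls above suggest); `cluster_size_summary` is a hypothetical helper, not part of the `clusx` API, and the `labels` list is stand-in data.

```python
from collections import Counter


def cluster_size_summary(labels):
    """Summarize cluster sizes from one label per item (hypothetical helper)."""
    ranked = Counter(labels).most_common()  # [(label, size), ...], largest first
    return {
        "n_clusters": len(ranked),
        "largest": ranked[0] if ranked else None,
        "singletons": sum(1 for _, size in ranked if size == 1),
        "sizes": [size for _, size in ranked],
    }


# Stand-in labels; in practice pass clusters_dp or clusters_pyp instead.
labels = [0, 0, 1, 2, 0, 1, 3]
summary = cluster_size_summary(labels)
print(summary)
```

A large share of singletons or one dominant cluster is usually a sign that the concentration or discount parameters deserve another look.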
Note
For detailed installation instructions, see the Installation Guide. For usage instructions, use cases, examples, and advanced configuration options, see the Usage Guide.
Full Table of Contents
The User Guide
This part of the documentation, which is mostly prose, begins with some background information about Clusterium, then focuses on step-by-step instructions for getting the most out of Clusterium.
The Community Guide
This part of the documentation, which is mostly prose, details the Clusterium ecosystem and community.
The API Documentation / Guide
If you are looking for information on a specific function, class, method, or algorithm, this part of the documentation is for you.
The Contributor Guide
If you want to contribute to the project, this part of the documentation is for you.
Support
If you have a question or a remark, find a bug, or run into something you can't do with Clusterium, please open an issue.
Project Information
Clusterium is released under the MIT License. Its documentation lives at Read the Docs, the code is on GitHub, and the latest release is on PyPI. It is tested on Python 3.11+.
If you'd like to contribute to Clusterium, you're most welcome!