API Reference

The `clusx` module

The top-level module for clusx.

This module tracks the package version and the base package information used by various functions within the package.

Refer to the documentation for details on the use of this package.

The `main` module

Entry point for direct module execution.

This module is the entry point when the package is executed directly with python -m clusx. It initializes the command-line interface and hands the command-line arguments to the main function in the cli module.

See also

clusx.cli: Contains the main CLI implementation

clusx.__main__.init() → None

Run clusx.cli.main() when the current file is executed by an interpreter.

This function ensures that the CLI main function is only executed when this file is run directly, not when imported as a module.

The sys.exit() function is called with the return value of clusx.cli.main(), following standard UNIX program conventions for exit codes.
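The entry-point behavior described above can be sketched as follows. This is an illustration only: the `main` function is a hypothetical stand-in for clusx.cli.main(), and the `name` parameter is an assumption added so the guard can be exercised without running a file as a script.

```python
import sys


def main(args=None):
    """Hypothetical stand-in for clusx.cli.main(); returns an exit code."""
    return 0


def init(name=__name__):
    """Exit with main()'s return value, but only when run as a script."""
    if name == "__main__":
        # sys.exit() raises SystemExit carrying the exit code.
        sys.exit(main())
```

Passing the return value of the CLI function to sys.exit() is what lets a shell read 0 as success and any other value as failure.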

The `cli` module

Command-line interface for Clusterium.

This module provides a command-line interface for clustering text data, benchmarking clustering results, and generating reports. It handles command-line arguments, environment configuration, and execution of the appropriate toolkit functionality based on user commands.

class clusx.cli.RichGroup(*args, **kwargs)

Custom Click group that displays a banner before the help text.

format_help(ctx, formatter)

Writes the help into the formatter if it exists.

This method is called by Click when the help text is requested.

clusx.cli.common_options(func: Callable) → Callable: Common options for all clusx CLI commands.

clusx.cli.main(args: list[str] | None = None) → int

Main entry point for the clusx CLI.

Parameters:: args – Command line arguments (uses sys.argv if None)
Returns:: Exit code (0 for success, non-zero for errors)
Return type:: int

The `errors` module

Errors for the clusx package.

exception clusx.errors.ClusterIntegrityError

Error raised when a cluster assignments file has integrity issues.

This error indicates that the cluster assignments file is corrupted, was created with errors, or is missing critical information needed for further processing.

exception clusx.errors.ClusxError: Base class for all Clusx errors.

exception clusx.errors.EvaluationError: Error raised when evaluation fails.

exception clusx.errors.MissingClusterColumnError(file_path: str)

Error raised when a cluster assignments file is missing the cluster column.

This error indicates that the file does not contain a column that starts with Cluster_ (such as Cluster_PYP or Cluster_DP), which is required for identifying cluster assignments.

See also

ClusterIntegrityError: Parent class for integrity errors
MissingParametersError: Related error for missing parameters
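A minimal sketch of how this hierarchy can be used by calling code. The three classes below are local stand-ins that mirror the documented parent/child relationships. The error message text and the helper functions `find_cluster_column` and `describe` are hypothetical and not part of clusx.

```python
class ClusxError(Exception):
    """Stand-in for clusx.errors.ClusxError."""


class ClusterIntegrityError(ClusxError):
    """Stand-in for clusx.errors.ClusterIntegrityError."""


class MissingClusterColumnError(ClusterIntegrityError):
    """Stand-in for clusx.errors.MissingClusterColumnError."""

    def __init__(self, file_path):
        # The message wording is an assumption made for this sketch.
        super().__init__(f"No column starting with 'Cluster_' in {file_path}")
        self.file_path = file_path


def find_cluster_column(header, file_path):
    """Return the first column whose name starts with 'Cluster_'."""
    for column in header:
        if column.startswith("Cluster_"):
            return column
    raise MissingClusterColumnError(file_path)


def describe(header, file_path):
    """Catch the parent integrity error instead of each subclass separately."""
    try:
        return find_cluster_column(header, file_path)
    except ClusterIntegrityError as exc:
        return f"integrity problem: {exc}"
```

Catching the parent class keeps calling code short while still letting more specific handlers be added later.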

exception clusx.errors.MissingParametersError(file_path: str, missing_params: list[str])

Error raised when a cluster assignments file is missing required parameters.

This error indicates that the file is missing one or more of the required parameters (alpha, sigma, variance) needed for further processing.

Parameters:

file_path (str) – Path to the file with missing parameters
missing_params (list[str]) – List of parameter names that are missing

See also

ClusterIntegrityError: Parent class for integrity errors
MissingClusterColumnError: Related error for missing cluster columns

exception clusx.errors.VisualizationError: Error raised when a visualization fails.

The `evaluation` module

Evaluation module for clustering quality assessment.

This module provides tools for evaluating the quality and characteristics of clusters generated by Bayesian nonparametric clustering algorithms. It implements established metrics for cluster validation in the context of text data clustering, with a focus on power-law analysis and similarity-based metrics.

Key components:

ClusterEvaluator: Main class for evaluating clustering results
NumpyEncoder: Custom JSON encoder for handling NumPy data types
save_evaluation_report(): Function to save evaluation results to JSON

The evaluation process assesses:

Cluster cohesion and separation (silhouette score)
Intra-cluster vs. inter-cluster similarity
Power-law characteristics of cluster size distributions
Potential outliers in the clustering results
Cluster size distribution

This module is typically used after running clustering with the Dirichlet Process and Pitman-Yor Process models to compare their performance and understand the statistical properties of the generated clusters.

class clusx.evaluation.ClusterEvaluator(texts: list[str], embeddings: numpy.ndarray, cluster_assignments: list[int], model_name: str, alpha: float, sigma: float, kappa: float, random_state: int | None = None)

Evaluates the quality and characteristics of text clusters using metrics.

This class provides methods to assess clustering results through various metrics:

Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters
Similarity Metrics: Evaluates intra-cluster vs inter-cluster similarity
Power-law Analysis: Determines if cluster sizes follow a power-law distribution
Outlier Detection: Identifies potential outliers in the clustering results
Cluster Size Distribution: Calculates the distribution of cluster sizes

Used for post-processing analysis of Bayesian nonparametric clustering results.

Note

Parameters like alpha and sigma in clustering algorithms significantly impact the resulting cluster distributions.

calculate_cluster_size_distribution() → dict[str, int]

Calculate the distribution of cluster sizes across all clusters.

This method counts the number of texts assigned to each cluster and returns a mapping of cluster IDs to their respective sizes. The distribution is useful for:

Analyzing the balance of cluster assignments
Identifying dominant vs. minor clusters
Providing input for power-law distribution analysis
Visualizing the cluster size distribution

The cluster IDs are converted to strings in the returned dictionary to ensure compatibility with JSON serialization.

Returns:: Dictionary mapping cluster IDs (as strings) to their sizes, where size represents the number of texts in each cluster
Return type:: dict[str, int]
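The counting step described above can be sketched with the standard library. The function name is hypothetical, and the sketch takes the assignments as a plain list rather than reading them from an evaluator instance.

```python
from collections import Counter


def cluster_size_distribution(cluster_assignments):
    """Map each cluster ID (as a string) to the number of texts assigned to it."""
    counts = Counter(cluster_assignments)
    # String keys keep the result directly serializable with json.dumps().
    return {str(cluster_id): size for cluster_id, size in counts.items()}


sizes = cluster_size_distribution([0, 0, 1, 0, 2, 1])
```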

calculate_silhouette_score() → float

Calculate the silhouette score for the clustering data.

This method calculates the silhouette score only for valid clusters (those with ≥2 samples). Invalid clusters are excluded from the calculation.

Cosine distance is used because the data is represented by text embeddings.

The silhouette score measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to 1, where:

A high value (close to 1) indicates the object is well-matched to its cluster
A value near 0 indicates the object is on or very close to the decision boundary
A negative value indicates the object might be assigned to the wrong cluster

This method returns 0.0 in the following edge cases:

There are fewer than 2 valid clusters
An error occurs during calculation

Returns:: Silhouette score as a float between -1 and 1, or 0.0 if calculation is not possible
Return type:: float
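To make the definition concrete, here is a small pure-Python sketch of a silhouette score with cosine distance that drops clusters with fewer than 2 samples and returns 0.0 when fewer than 2 valid clusters remain. It is not the library's implementation. The function names are hypothetical, and the sketch assumes non-zero embedding vectors.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def silhouette_cosine(embeddings, labels):
    """Mean silhouette over samples that belong to clusters with >= 2 members."""
    sizes = Counter(labels)
    keep = [i for i, label in enumerate(labels) if sizes[label] >= 2]
    valid = sorted({labels[i] for i in keep})
    if len(valid) < 2:
        return 0.0
    scores = []
    for i in keep:
        mean_dist = {}
        for cluster in valid:
            members = [j for j in keep if labels[j] == cluster and j != i]
            total = sum(cosine_distance(embeddings[i], embeddings[j]) for j in members)
            mean_dist[cluster] = total / len(members)
        a = mean_dist[labels[i]]  # mean distance to the sample's own cluster
        b = min(d for c, d in mean_dist.items() if c != labels[i])  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

With two tight, well-separated groups the score approaches 1, while a dataset made only of singleton clusters yields 0.0.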

calculate_similarity_metrics() → dict[str, float | numpy.floating | dict[str, int]]

Calculate cluster-aware similarity metrics.

This method computes three key metrics using cosine similarity:

Intra-cluster similarity: Average similarity between texts in the same cluster (higher values indicate more cohesive clusters)
Inter-cluster similarity: Average similarity between texts in different clusters (lower values indicate better separation between clusters)
Silhouette-like score: Difference between intra-cluster and inter-cluster similarity (similar to silhouette score but calculated differently)

The method handles edge cases:

Only considers clusters with ≥2 members for intra-similarity
Uses matrix operations for O(n) complexity
Handles edge cases with proper numerical stability

Returns:

Dictionary with the following keys:

intra_cluster_similarity: Average similarity within clusters
inter_cluster_similarity: Average similarity between clusters
silhouette_like_score: Difference between intra and inter similarity
valid_cluster_ratio: Fraction of valid clusters
analyzed_pairs: Number of analyzed intra and inter cluster pairs (intra: intra-cluster pairs, inter: inter-cluster pairs)

Return type:

dict[str, Union[float, numpy.floating, dict[str, int]]]
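The intra-cluster and inter-cluster averages can be illustrated with a simple pairwise loop. This sketch is deliberately naive and makes no claim to match the matrix-based computation mentioned above. The function names are hypothetical, only a subset of the documented keys is produced, and non-zero vectors are assumed.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_metrics(embeddings, labels):
    """Average cosine similarity within clusters and across clusters."""
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        # Singleton clusters contribute no intra-cluster pairs.
        (intra if labels[i] == labels[j] else inter).append(sim)
    intra_mean = sum(intra) / len(intra) if intra else 0.0
    inter_mean = sum(inter) / len(inter) if inter else 0.0
    return {
        "intra_cluster_similarity": intra_mean,
        "inter_cluster_similarity": inter_mean,
        "silhouette_like_score": intra_mean - inter_mean,
        "analyzed_pairs": {"intra": len(intra), "inter": len(inter)},
    }
```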

detect_powerlaw_distribution() → dict[str, Any]

Detect if the cluster size distribution follows a power-law.

This method analyzes the distribution of cluster sizes to determine if it follows a power-law distribution, which is common in many natural language datasets and indicates scale-free properties. The analysis includes:

Collecting the size of each cluster
Validating if there are enough clusters (at least 5) for meaningful analysis
Fitting a power-law distribution using the powerlaw package
Comparing the power-law fit to an exponential distribution

The method handles edge cases:

Returns null values if there are fewer than 5 clusters
Handles errors in the powerlaw fitting process
Validates the fitted parameters to avoid NaN values

Returns:

A dictionary with power-law parameters:

alpha: Power-law exponent (higher values indicate steeper distribution)
xmin: Minimum value for which power-law holds
is_powerlaw: Boolean indicating if distribution follows power-law
sigma_error: Standard error of the alpha estimate
p_value: P-value from comparison with exponential distribution

Return type:

dict[str, Any]

find_outliers(n_neighbors: int = 5) → dict[str, float]

Find potential outliers in each cluster using nearest neighbors.

Parameters:: n_neighbors – Number of neighbors to consider (default: 5)
Returns:: Dictionary with outlier metrics
Return type:: dict[str, float]

generate_report() → dict[str, Any]

Generate a comprehensive evaluation report.

Returns:: Dictionary containing all evaluation metrics and metadata
Return type:: dict[str, Any]

class clusx.evaluation.NumpyEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Custom JSON encoder that handles NumPy data types.

This encoder converts NumPy types to their Python equivalents for proper JSON serialization. It’s used when saving evaluation reports to ensure all NumPy values are properly converted to standard Python types.

Conversions:

numpy.ndarray → list
numpy.single → float
numpy.double → float
numpy.intc → int
numpy.int_ → int
numpy.bool_ → bool
Other NumPy types → Python equivalents via the item() method when available
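The general technique, overriding json.JSONEncoder.default() to convert otherwise unserializable objects, can be sketched without NumPy. The class below is a hypothetical, duck-typed variant that converts any object exposing a tolist() method (as NumPy arrays do). It is not the NumpyEncoder implementation, and the demo uses the standard library's array type as a stand-in.

```python
import json
from array import array


class ToListEncoder(json.JSONEncoder):
    """Hypothetical encoder: converts objects exposing tolist() to plain Python."""

    def default(self, o):
        tolist = getattr(o, "tolist", None)
        if callable(tolist):
            return tolist()
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(o)


payload = {"sizes": array("i", [3, 2, 1])}
encoded = json.dumps(payload, cls=ToListEncoder)
```

Passing the encoder through the cls argument keeps the conversion logic in one place instead of converting values before every json.dumps() call.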

default(o): Convert NumPy types to their Python equivalents for JSON serialization.

clusx.evaluation.save_evaluation_report(report: dict[str, Any], output_dir: str, filename: str = 'evaluation_report.json') → str

Save the evaluation report to a JSON file.

This function serializes the evaluation report to a JSON file, handling NumPy data types through the NumpyEncoder. The report contains comprehensive metrics about the clustering quality, including silhouette scores, similarity metrics, power-law analysis, and outlier detection.

If serialization issues occur, the function attempts to save a simplified version of the report with only basic metrics.

Parameters:

report – Dictionary containing the evaluation report for different clustering models
output_dir – Directory to save the report
filename – Name of the output file (default: “evaluation_report.json”)

Returns:

Path to the saved report file

Return type:

str

Raises:

TypeError – If JSON serialization fails even after simplification attempts

The `logging` module

Logging configuration for Clusterium.

This module provides standardized logging functionality for the Clusterium package, including configuration setup and logger retrieval. It ensures consistent log formatting across all components of the package and simplifies the process of obtaining properly configured logger instances.

The module offers two main functions:

setup_logging: Configures the root logger with appropriate formatting and level
get_logger: Returns a logger instance with the specified name

Typical usage:

>>> from clusx.logging import get_logger
>>> logger = get_logger(__name__)
>>> logger.info("Processing started")

clusx.logging.get_logger(name: str) → Logger

Get a logger with the specified name.

Parameters:: name (str) – The name for the logger (typically __name__).
Returns:: A configured logger instance ready for use.
Return type:: logging.Logger

clusx.logging.setup_logging(level: int | None = None) → None

Set up logging configuration for the application.

This function configures the root logger with a standardized format that includes timestamp, log level, and message. It’s typically called once at the start of the application to ensure consistent logging behavior across all modules.

The timestamp format is ISO-like (YYYY-MM-DD HH:MM:SS) for better readability and sorting in log files.

Parameters:: level – The logging level (defaults to logging.INFO if None).
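A configuration along these lines can be sketched with the standard logging module. The exact format string is an assumption (the description only says that timestamp, level, and message are included), and the function names carry a _sketch suffix to mark them as illustrations rather than the clusx functions.

```python
import logging

# Assumed layout: only the presence of timestamp, level, and message is documented.
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def setup_logging_sketch(level=None):
    """Configure the root logger; fall back to INFO when no level is given."""
    logging.basicConfig(
        level=logging.INFO if level is None else level,
        format=LOG_FORMAT,
        datefmt=DATE_FORMAT,
        force=True,  # replace handlers installed earlier (Python 3.8+)
    )


def get_logger_sketch(name):
    """Return a named logger that inherits the root configuration."""
    return logging.getLogger(name)
```

Configuring only the root logger means every logger obtained by name picks up the same format without further setup.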

The `utils` module

Utility functions for the clusx package.

clusx.utils.to_numpy(embedding: EmbeddingTensor) → NDArray[np.float32]

Convert a tensor to a numpy array.

If embedding is already a numpy array (or compatible), it is returned as is. Otherwise, it is converted to a numpy array.

Parameters:: embedding (EmbeddingTensor) – The tensor to convert. Can be a PyTorch tensor or a numpy array.
Returns:: The input converted to a numpy array. If the input is already a numpy array (or compatible), it is returned as is.
Return type:: numpy.ndarray

The `version` module

Version information.

This module provides package metadata through a cascading resolution strategy.

The metadata is resolved in the following order:

1. Installed package metadata (via importlib.metadata)
2. pyproject.toml (for development environments)
3. Fallback defaults

clusx.version.get_metadata() → dict[str, str]: Retrieve package metadata using a cascading resolution strategy.
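A cascading lookup of this kind might look like the sketch below, reduced to the version field. The function name, its parameters, the "0.0.0" default, and the assumption that the version lives under the [project] table of pyproject.toml are all hypothetical. The actual get_metadata() may read different fields or tables.

```python
from importlib import metadata
from pathlib import Path

try:
    import tomllib  # standard library from Python 3.11
except ImportError:  # older interpreters skip the pyproject.toml step
    tomllib = None


def resolve_version(dist_name, pyproject_path=None, default="0.0.0"):
    """Try installed metadata, then pyproject.toml, then a fallback default."""
    # 1. Installed package metadata.
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pass
    # 2. pyproject.toml, useful in a development checkout.
    if tomllib is not None and pyproject_path is not None:
        path = Path(pyproject_path)
        if path.is_file():
            with path.open("rb") as handle:
                project = tomllib.load(handle).get("project", {})
            if "version" in project:
                return str(project["version"])
    # 3. Fallback default.
    return default
```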

The `visualization` module

Visualization module for Clusterium.

This module provides functions for visualizing clustering results and evaluation metrics.

clusx.visualization.MIN_DATASET_SIZE = 10: Minimum dataset size for which visualizations are considered safe.

Note

Visualizations may not be meaningful or could be misleading when applied to datasets smaller than this threshold.

clusx.visualization.get_model_colors(model_names: list[str]) → dict[str, Any]

Generate consistent colors for models using academically popular colormaps.

Selects appropriate colormaps based on visualization best practices for clustering:

For the typical case (≤10 models): Uses ‘Set1’ which provides distinct, balanced hues that ensure clear differentiation among groups.
For more models: Uses ‘tab20’ which provides up to 20 distinct colors, with alpha variation for cases beyond 20 models to maintain visual distinction.

This approach follows standard practices in clustering visualization where colormap selection is based on the number of clusters to ensure optimal visual clarity and accessibility.

Parameters:: model_names (list[str]) – List of model names to generate colors for
Returns:: Dictionary mapping model names to their assigned colors
Return type:: dict

clusx.visualization.is_small_dataset(reports: dict[str, dict[str, Any]], min_size: int) → bool

Check if the dataset is considered small based on the number of texts.

Parameters:

reports (dict) – Dictionary mapping model names to their evaluation reports.
min_size (int) – Minimum number of texts threshold.

Returns:

True if the dataset is considered small, False otherwise.

Return type:

bool

Notes

A dataset is considered small if:

It’s empty (no reports) or not a dictionary
No reports have ‘cluster_stats’
No reports have ‘num_texts’ in their ‘cluster_stats’
Any report has fewer than min_size texts (assuming we have the same dataset for all reports)
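The four conditions listed above can be expressed as a short function over plain dictionaries. This is a sketch of the documented rules, not the library code, and the function name is hypothetical.

```python
def is_small_dataset_sketch(reports, min_size):
    """Return True when any of the documented 'small dataset' conditions holds."""
    # Empty input or a non-dictionary counts as small.
    if not isinstance(reports, dict) or not reports:
        return True
    counts = []
    for report in reports.values():
        stats = report.get("cluster_stats") if isinstance(report, dict) else None
        if isinstance(stats, dict) and "num_texts" in stats:
            counts.append(stats["num_texts"])
    # No report carries cluster_stats with num_texts.
    if not counts:
        return True
    # Any report below the threshold marks the dataset as small.
    return any(count < min_size for count in counts)
```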

clusx.visualization.plot_cluster_counts(reports, ax: Axes)

Plot the number of clusters for each model.

Parameters:

reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.

Return type:

None

clusx.visualization.plot_cluster_size_distribution(reports, ax: Axes)

Plot cluster size distributions for each model.

Parameters:

reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.

Returns:

The function modifies the provided axes in-place.

Return type:

None

clusx.visualization.plot_outliers(reports, ax: Axes)

Plot outlier scores distribution.

Parameters:

reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.

Return type:

None

clusx.visualization.plot_powerlaw_fit(reports, ax: Axes)

Plot power-law fit for cluster size distributions.

Parameters:

reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.

Returns:

The function plots directly on the provided axes.

Return type:

None

clusx.visualization.plot_silhouette_scores(reports, ax: Axes)

Plot silhouette scores for each model.

Parameters:

reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.

Return type:

None

clusx.visualization.plot_similarity_metrics(reports, ax: Axes)

Plot similarity metrics for each model.

Parameters:

reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.

Return type:

None

clusx.visualization.render_error_message(ax: Axes, plot_title: str, error, small_dataset: bool, min_size: int)

Display appropriate error message on the plot.

Parameters:

ax (Axes) – Matplotlib axes to display the error message on.
plot_title (str) – Title of the plot.
error (Exception) – The exception that was raised.
small_dataset (bool) – Whether the dataset is considered small.
min_size (int) – Minimum dataset size threshold.

Returns:

This function modifies the provided axes in-place.

Return type:

None

clusx.visualization.safe_plot(title: str | None = None, min_dataset_size: int = 10)

Decorator for safely executing plotting functions with error handling.

Parameters:

title (str or None) – Title for the plot. If None, the function name will be used.
min_dataset_size (int) – Minimum dataset size threshold for small dataset detection. Default is MIN_DATASET_SIZE.

Returns:

Decorated function that handles errors and provides visual feedback.

Return type:

collections.abc.Callable

Examples

>>> from clusx.visualization import safe_plot
>>>
>>> @safe_plot(title="My Custom Plot")
... def plot_my_visualization(reports, ax):
...     # Your plotting code here
...     # No need for try/except blocks
...     ax.plot(data)
...     ax.set_title("My Plot")
...
>>> # Usage remains the same as the original function
>>> plot_my_visualization(reports, ax)

Notes

The decorated function must accept ‘reports’ and ‘ax’ as its first two arguments
The decorator automatically sets the plot title
For small datasets, a specific message is displayed
All exceptions are logged with detailed error messages
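The general shape of such a decorator can be sketched as follows. This is a simplified illustration, not the safe_plot source: it omits the small-dataset message, the on-plot error text is an assumption, and RecordingAxes is a hypothetical stand-in for a matplotlib Axes so the example runs without matplotlib.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def safe_plot_sketch(title=None):
    """Decorator factory: log plotting errors and show them on the axes."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(reports, ax, *args, **kwargs):
            plot_title = title or func.__name__
            try:
                func(reports, ax, *args, **kwargs)
            except Exception as exc:
                # Log the full traceback, then leave a visible note on the plot.
                logger.exception("Plot %r failed", plot_title)
                ax.text(0.5, 0.5, f"Could not render: {exc}", ha="center", va="center")
            ax.set_title(plot_title)

        return wrapper

    return decorator


class RecordingAxes:
    """Hypothetical stand-in for matplotlib Axes that records calls."""

    def __init__(self):
        self.title = None
        self.messages = []

    def set_title(self, label):
        self.title = label

    def text(self, x, y, s, **kwargs):
        self.messages.append(s)


@safe_plot_sketch(title="Broken plot")
def plot_broken(reports, ax):
    raise ValueError("no data")


axes = RecordingAxes()
plot_broken({}, axes)  # does not raise; the error ends up on the axes
```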

clusx.visualization.visualize_evaluation_dashboard(reports: dict[str, dict[str, Any]], output_dir: str, filename: str = 'evaluation_dashboard.png', show_plot: bool = False) → str

Generate a comprehensive dashboard visualization of evaluation metrics.

This creates a 3x2 grid of plots showing:

Cluster size distribution (log-log scale)
Silhouette score comparison
Similarity metrics comparison
Power-law fit visualization
Outlier distribution
Number of clusters comparison

Parameters:

reports (dict[str, dict[str, Any]]) – Dictionary mapping model names to their evaluation reports.
output_dir (str) – Directory to save the visualization.
filename (str) – Name of the output file. Default is evaluation_dashboard.png
show_plot (bool) – Whether to display the plot interactively. Default is False.

Returns:

Path to the saved visualization file.

Return type:

str

The `clustering` module

The `models` module

Clustering models for text data using Dirichlet Process and Pitman-Yor Process.

This module implements nonparametric Bayesian clustering algorithms for text data, specifically the Dirichlet Process and Pitman-Yor Process. These methods automatically determine the appropriate number of clusters based on the data.

The implementation uses the Chinese Restaurant Process formulation with von Mises-Fisher distribution for modeling document embeddings on the unit hypersphere.

Classes

DirichletProcess: Implements clustering using the Dirichlet Process with concentration parameter alpha and precision parameter kappa.
PitmanYorProcess: Extends DirichletProcess with an additional discount parameter for more flexible power-law behavior in cluster size distributions.

Notes

Both implementations follow a scikit-learn compatible API with fit(), predict(), and fit_predict() methods. The Pitman-Yor Process is generally better suited for text data as it can model the power-law distributions common in natural language.

class clusx.clustering.models.DirichletProcess(alpha: float, kappa: float, model_name: str | None = 'all-MiniLM-L6-v2', random_state: int | None = None)

DP clustering implementation for text data using von Mises-Fisher distribution.

This implementation uses a Chinese Restaurant Process (CRP) formulation with Bayesian inference to cluster text data. It combines the CRP prior with a likelihood model based on von Mises-Fisher distributions in the embedding space, which is particularly suitable for directional data like normalized text embeddings.

The model uses a concentration parameter alpha to control the propensity to create new clusters, and a precision parameter kappa to control the concentration of points around cluster means in the von Mises-Fisher distribution.

alpha

Concentration parameter for new cluster creation.

Type:: float

kappa

Precision parameter for the von Mises-Fisher distribution.

Type:: float

model

Sentence transformer model used for text embeddings.

Type:: SentenceTransformer

random_state

Random state for reproducibility.

Type:: numpy.random.Generator

clusters

List of cluster assignments for each processed text.

Type:: list of int

cluster_params

Dictionary of cluster parameters for each cluster. Contains ‘mean’ (centroid) and ‘count’ (number of points).

Type:: dict

global_mean

Global mean of all document embeddings.

Type:: Optional[EmbeddingTensor]

next_id

Next available cluster ID.

Type:: int

embeddings_

Document embeddings after fitting.

Type:: Optional[EmbeddingTensor]

labels_

Cluster assignments after fitting.

Type:: Optional[NDArray[np.int64]]

text_embeddings

Cache of text to embedding mappings.

Type:: dict[str, EmbeddingTensor]

embedding_dim

Dimension of the embedding vectors.

Type:: Optional[int]

__init__(alpha: float, kappa: float, model_name: str | None = 'all-MiniLM-L6-v2', random_state: int | None = None)

Initialize a Dirichlet Process model with von Mises-Fisher likelihood.

Parameters:

alpha (float) – Concentration parameter for new cluster creation. Higher values lead to more clusters.
kappa (float) – Precision parameter for the von Mises-Fisher distribution. Higher values lead to tighter, more concentrated clusters.
model_name (Optional[str]) – Name of the sentence transformer model to use. Default is “all-MiniLM-L6-v2”.
random_state (Optional[int]) – Random seed for reproducibility. If None, fresh, unpredictable entropy will be pulled from the OS.

__weakref__: list of weak references to the object

assign_cluster(embedding: EmbeddingTensor) → tuple[int, np.ndarray]

Assign a document embedding to a cluster using Bayesian inference.

This method computes probabilities for assigning the document to each existing cluster or creating a new one, then samples a cluster assignment from this probability distribution. The probabilities combine the CRP prior and the von Mises-Fisher likelihood.

Parameters:

embedding (EmbeddingTensor) – Document embedding vector.

Returns:

cluster_id (int) – The assigned cluster ID.
probs (np.ndarray) – Probability distribution over clusters used for assignment.
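The step of turning combined log prior and log-likelihood scores into a probability distribution and sampling from it can be sketched as below. The function name, the use of a "new" key for the new-cluster option, and the random.Random generator are assumptions made for illustration. The library itself works with NumPy arrays and its own random generator.

```python
import math
import random


def sample_cluster(log_scores, rng):
    """Normalize log scores into probabilities and sample one cluster ID."""
    ids = list(log_scores)
    peak = max(log_scores.values())
    # Subtracting the maximum before exponentiating avoids numerical overflow.
    weights = [math.exp(log_scores[cluster_id] - peak) for cluster_id in ids]
    total = sum(weights)
    probs = {cluster_id: w / total for cluster_id, w in zip(ids, weights)}
    choice = rng.choices(ids, weights=[probs[c] for c in ids], k=1)[0]
    return choice, probs


rng = random.Random(0)
choice, probs = sample_cluster({0: math.log(3.0), "new": math.log(1.0)}, rng)
```

Sampling, rather than always taking the most probable cluster, is what lets the process occasionally open a new cluster even when an existing one fits reasonably well.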

fit(documents, _y=None)

Train the clustering model on the given text data.

This method processes each document in the input, computing its embedding and assigning it to a cluster using Bayesian inference with the Chinese Restaurant Process. It supports both text inputs and pre-computed embeddings.

Parameters:

documents (Union[list[str], list[EmbeddingTensor]]) – The text documents or embeddings to cluster.
_y – Ignored. Added for compatibility with scikit-learn API.

Returns:

self – The fitted estimator.

Return type:

object

Note

After fitting, the following attributes are set:

embeddings_ (Optional[EmbeddingTensor]) – The document embeddings.
labels_ (NDArray[np.int64]) – The cluster assignments for each document.
clusters (list) – List of cluster IDs for each document.
cluster_params (dict) – Dictionary containing parameters for each cluster.

fit_predict(documents, _y=None)

Fit the model and predict cluster labels for documents.

Parameters:

documents (Union[list[str], list[EmbeddingTensor]]) – The text documents or embeddings to cluster.
_y – This parameter exists only for compatibility with scikit-learn API.

Returns:

labels – Cluster labels for each document.

Return type:

NDArray[np.int64]

Notes

This convenience method calls fit() and then returns the cluster labels produced during fitting.

get_embedding(text: str | list[str]) → EmbeddingTensor

Get the embedding for a text or list of texts with caching.

This method computes embeddings for text inputs using the sentence transformer model. It implements caching to avoid recomputing embeddings for previously seen texts. The method can handle both single text strings and lists of texts.

Parameters:: text (str or list of str) – Text or list of texts to embed.
Returns:: The normalized embedding vector(s) for the text. If input is a single string, returns a single embedding vector. If input is a list, returns an array of embedding vectors.
Return type:: EmbeddingTensor

log_crp_prior(cluster_id: int | None = None) → float

Calculate the Chinese Restaurant Process prior probability.

The Chinese Restaurant Process (CRP) is a stochastic process that defines a probability distribution over partitions of items. In the context of clustering, it provides a prior probability for assigning a document to an existing cluster or creating a new one.

The CRP is a key component of Bayesian nonparametric models, particularly the Dirichlet Process. For more details, see [1], [2].

Parameters:: cluster_id (Optional[int]) – The cluster ID. If provided, calculate prior for an existing cluster. If None, calculate prior for a new cluster.
Returns:: Log probability of the cluster under the CRP prior.
Return type:: float
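For reference, the textbook CRP prior assigns probability n_k / (n + α) to an existing cluster k and α / (n + α) to a new cluster. The sketch below computes the logarithm of these quantities from a dictionary of cluster counts. The function name and the counts argument are hypothetical, and the library method may normalize or store its state differently.

```python
import math


def crp_log_prior(counts, alpha, cluster_id=None):
    """Log CRP prior for joining cluster_id, or for a new cluster when None."""
    n = sum(counts.values())  # customers seated so far
    if cluster_id is None:
        return math.log(alpha / (n + alpha))
    return math.log(counts[cluster_id] / (n + alpha))


counts = {0: 3, 1: 1}
total_probability = (
    sum(math.exp(crp_log_prior(counts, 1.0, k)) for k in counts)
    + math.exp(crp_log_prior(counts, 1.0))
)
```

Larger clusters receive proportionally more prior mass, which is the "rich get richer" behavior of the process.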

References

log_likelihood(embedding: EmbeddingTensor) → tuple[dict[int, float], float]

Calculate log-likelihoods for an embedding across all clusters.

This method computes the log-likelihood of a document embedding under each existing cluster’s von Mises-Fisher distribution, as well as under a potential new cluster.

Parameters:

embedding (EmbeddingTensor) – Document embedding vector.

Returns:

existing_likelihoods (dict[int, float]) – Dictionary mapping cluster IDs to their log-likelihoods.
new_cluster_likelihood (float) – Log-likelihood for a new cluster.

predict(documents)

Predict the closest cluster for each sample in documents.

Parameters:: documents (Union[list[str], list[EmbeddingTensor]]) – The text documents or embeddings to predict clusters for.
Returns:: labels – Cluster labels for each document. Returns -1 if no clusters exist yet.
Return type:: NDArray[np.int64]

Note

This method computes the most likely cluster assignment for each document based on the von Mises-Fisher likelihood, without updating the cluster parameters. It supports both text inputs and pre-computed embeddings.
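When every cluster shares the same precision kappa and the cluster means are unit vectors, the von Mises-Fisher likelihood is largest for the cluster whose mean has the highest dot product with the embedding. Under those assumptions, prediction reduces to the sketch below. The function name and the plain-dictionary representation of cluster means are hypothetical.

```python
def predict_sketch(embeddings, cluster_means):
    """Assign each embedding to the cluster with the best-aligned mean direction."""
    if not cluster_means:
        # Mirrors the documented behavior when no clusters exist yet.
        return [-1] * len(embeddings)
    labels = []
    for vector in embeddings:
        best = max(
            cluster_means,
            key=lambda cid: sum(a * b for a, b in zip(vector, cluster_means[cid])),
        )
        labels.append(best)
    return labels


means = {0: (1.0, 0.0), 1: (0.0, 1.0)}
predicted = predict_sketch([(0.9, 0.1), (0.2, 0.8)], means)
```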

class clusx.clustering.models.PitmanYorProcess(alpha: float, kappa: float, sigma: float, model_name: str | None = 'all-MiniLM-L6-v2', random_state: int | None = None)

Pitman-Yor Process clustering for text data using von Mises-Fisher distribution.

alpha

Concentration parameter for new cluster creation.

Type:: float

kappa

Precision parameter for the von Mises-Fisher distribution.

Type:: float

sigma

Discount parameter controlling power-law behavior (0 ≤ σ < 1).

Type:: float

model

Sentence transformer model used for text embeddings.

Type:: SentenceTransformer

random_state

Random state for reproducibility.

Type:: numpy.random.Generator

clusters

List of cluster assignments for each processed text.

Type:: list[int]

cluster_params

Dictionary of cluster parameters for each cluster. Contains ‘mean’ (centroid) and ‘count’ (number of points).

Type:: dict

global_mean

Global mean of all document embeddings.

Type:: Optional[EmbeddingTensor]

next_id

Next available cluster ID.

Type:: int

embeddings_

Document embeddings after fitting.

Type:: Optional[EmbeddingTensor]

labels_

Cluster assignments after fitting.

Type:: Optional[NDArray[np.int64]]

text_embeddings

Cache of text to embedding mappings.

Type:: dict[str, EmbeddingTensor]

embedding_dim

Dimension of the embedding vectors.

Type:: Optional[int]

Notes

The Pitman-Yor Process is a generalization of the Dirichlet Process that introduces a discount parameter (sigma) to control the power-law behavior of the cluster size distribution. It is particularly effective for modeling natural language phenomena that exhibit power-law distributions, such as word frequencies or topic distributions.

This implementation extends the DirichletProcess class, adding the sigma parameter and modifying the cluster assignment probabilities according to the Pitman-Yor Process while maintaining the von Mises-Fisher likelihood model for directional text embeddings.

The mathematical foundation of the Pitman-Yor Process involves two key parameters:

The concentration parameter alpha (α > -σ), controlling the overall tendency to create new clusters
The discount parameter sigma (0 ≤ σ < 1), controlling the power-law behavior

As σ approaches 1, the distribution exhibits heavier tails (more small clusters), while σ = 0 reduces to the standard Dirichlet Process.

__init__(alpha: float, kappa: float, sigma: float, model_name: str | None = 'all-MiniLM-L6-v2', random_state: int | None = None)

Initialize a PYP clustering model with von Mises-Fisher likelihood.

Parameters:

alpha (float) – Concentration parameter for the Pitman-Yor Process. Higher values encourage formation of more clusters. Must satisfy: α > -σ.
kappa (float) – Precision parameter for the von Mises-Fisher distribution. Higher values lead to tighter, more concentrated clusters.
sigma (float) – Discount parameter for the Pitman-Yor Process (0 ≤ σ < 1). Controls the power-law behavior. Higher values create more power-law-like cluster size distributions. When σ=0, the model reduces to a Dirichlet Process.
model_name (Optional[str]) – Name of the sentence transformer model to use. Default is “all-MiniLM-L6-v2”.
random_state (Optional[int]) – Random seed for reproducibility. If None, fresh, unpredictable entropy will be pulled from the OS.

Raises:

ValueError – If sigma ∉ [0.0, 1.0) or if alpha ≤ -sigma.

Notes

The mathematical requirements for the Pitman-Yor Process are:

The discount parameter σ must be in [0,1)
The concentration parameter α must satisfy α > -σ

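The constraint α > -σ ensures that the numerator in the new table probability calculation (α + K*σ) remains positive once at least one table exists (K ≥ 1), since α + K*σ ≥ α + σ > 0. Together with σ < 1, it also keeps the denominator n + α positive for n ≥ 1. This is essential for proper probabilistic behavior of the model.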
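The two checks listed under Raises can be sketched as follows. `validate_pyp_params` is a hypothetical standalone function written for illustration; the actual validation inside `__init__` may be structured differently and use different messages.

```python
def validate_pyp_params(alpha: float, sigma: float) -> None:
    """Raise ValueError unless 0 <= sigma < 1 and alpha > -sigma.

    Hypothetical helper; it mirrors the documented constraints only.
    """
    if not 0.0 <= sigma < 1.0:
        raise ValueError(f"sigma must be in [0.0, 1.0), got {sigma}")
    if alpha <= -sigma:
        raise ValueError(f"alpha must be greater than -sigma ({-sigma}), got {alpha}")
```

Note that a slightly negative alpha is accepted when sigma is large enough: `validate_pyp_params(-0.2, 0.5)` passes, while `validate_pyp_params(-0.5, 0.5)` raises.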

log_pyp_prior(cluster_id: int | None = None) → float[source]

Calculate the log prior probability of a cluster under the Pitman-Yor Process.

Parameters:: cluster_id (int or None) – The cluster ID. If provided, calculate prior for an existing cluster. If None, calculate prior for a new cluster.
Returns:: Log probability of the cluster under the PYP prior.
Return type:: float

Notes

The Pitman-Yor Process generalizes the Chinese Restaurant Process with the introduction of a discount parameter σ. The probability of a new customer (document) joining an existing table (cluster) k or starting a new table is:

P(existing cluster k) = (n_k - σ) / (n + α)
P(new cluster) = (α + K*σ) / (n + α)

where:

n_k is the number of customers at table k
n is the total number of customers
K is the current number of tables
σ is the discount parameter
α is the concentration parameter
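The two formulas translate directly into log space. The sketch below is a standalone illustration, not the method's actual body: `pyp_log_prior` and its `cluster_sizes` argument (a mapping from cluster ID to size) are assumptions, since the real method reads this state from the model instance. Returning 0.0 for an empty model is also an assumption, reflecting that the first customer always opens a table.

```python
from __future__ import annotations

import math


def pyp_log_prior(
    cluster_sizes: dict[int, int],
    alpha: float,
    sigma: float,
    cluster_id: int | None = None,
) -> float:
    """Log prior of joining cluster_id, or of opening a new cluster if None."""
    n = sum(cluster_sizes.values())  # total customers seated so far
    if n == 0:
        return 0.0  # assumption: the first customer opens a table with probability 1
    if cluster_id is None:
        numerator = alpha + len(cluster_sizes) * sigma  # alpha + K*sigma
    else:
        numerator = cluster_sizes[cluster_id] - sigma  # n_k - sigma
    return math.log(numerator) - math.log(n + alpha)


# Two clusters of sizes 3 and 1, with alpha = 1.0 and sigma = 0.5.
sizes = {0: 3, 1: 1}
probabilities = [math.exp(pyp_log_prior(sizes, 1.0, 0.5, k)) for k in (0, 1, None)]
print(probabilities)
```

The probabilities over all existing clusters plus the new cluster sum to 1, because the numerators add up to n + α.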

The `utils` module

Utility functions for data loading, saving, and visualization.

clusx.clustering.utils.get_embeddings(texts: list[str]) → ndarray[source]

Get embeddings for a list of texts.

Parameters:: texts – List of text strings
Returns:: Numpy array of embeddings

clusx.clustering.utils.is_csv_file(input_file: str) → bool[source]

Determine if a file is a CSV file based on extension and content.

Parameters:: input_file – Path to the input file
Returns:: True if the file is likely a CSV, False otherwise
Return type:: bool

clusx.clustering.utils.load_cluster_assignments(csv_path: str) → tuple[list[int], dict[str, float]][source]

Load cluster assignments and parameters from a CSV file.

Parameters:

csv_path – Path to the CSV file containing cluster assignments

Returns:

A tuple containing:

List of cluster assignments (clustered texts)
Dictionary of parameters (alpha, sigma, kappa)

Return type:

tuple[list[int], dict[str, float]]

Raises:

MissingClusterColumnError – If no cluster column is found in the file
MissingParametersError – If required parameters are missing in the file

clusx.clustering.utils.load_data(input_file: str, column: str | None = None) → list[str][source]

Load text data from a file. Supports text files and CSV files.

Parameters:

input_file – Path to the input file (text or CSV)
column – Column name containing the text data (required for CSV files)

Returns:

A list of texts

Return type:

list[str]

Raises:

ValueError – If a CSV file is provided without specifying a column
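The documented behavior can be approximated with the standard library. This is a rough sketch under stated assumptions, not the clusx implementation: `load_texts` is a hypothetical name, CSV files are detected by extension only, text files are assumed to hold one text per line, and blank entries are skipped.

```python
from __future__ import annotations

import csv
from pathlib import Path


def load_texts(input_file: str, column: str | None = None) -> list[str]:
    """Load texts from a plain text file or from one column of a CSV file."""
    path = Path(input_file)
    if path.suffix.lower() == ".csv":
        if column is None:
            raise ValueError("column is required for CSV input")
        with path.open(newline="", encoding="utf-8") as handle:
            # Keep only rows that have a non-empty value in the requested column.
            return [row[column] for row in csv.DictReader(handle) if row.get(column)]
    with path.open(encoding="utf-8") as handle:
        # Assumption: one text per line; blank lines are ignored.
        return [line.strip() for line in handle if line.strip()]
```

Under these assumptions, a text file needs only the path, while a CSV file needs both the path and the name of the column that holds the texts.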

clusx.clustering.utils.load_parameters_from_json(json_path: str) → dict[str, float][source]

Load clustering parameters from a JSON file.

Parameters:: json_path – Path to the JSON file containing clustering results
Returns:: A dictionary of parameters (alpha, sigma, kappa)
Return type:: dict[str, float]

clusx.clustering.utils.save_clusters_to_csv(output_file: str, texts: list[str], clusters: list[int], model_name: str, alpha: float, sigma: float, kappa: float) → None[source]

Save clustering results to a CSV file.

Parameters:

output_file – Path to the output CSV file
texts – List of text strings
clusters – List of cluster assignments
model_name – Name of the clustering model
alpha – Concentration parameter
sigma – Discount parameter
kappa – Kappa parameter for likelihood model

clusx.clustering.utils.save_clusters_to_json(output_file: str, texts: list[str], clusters: list[int], model_name: str, alpha: float, sigma: float, kappa: float) → None[source]

Save clustering results to a JSON file.

Parameters:

output_file – Path to the output JSON file
texts – List of text strings
clusters – List of cluster assignments
model_name – Name of the clustering model
alpha – Concentration parameter
sigma – Discount parameter
kappa – Kappa parameter for likelihood model
