API Reference
The clusx module
The top-level module for clusx.
This module tracks the package version and the base package information used by various functions within the package.
Refer to the documentation for details on the use of this package.
The __main__ module
Entry point for direct module execution.
This module serves as the main entry point when the package is executed directly using python -m clusx.
It initializes the command-line interface and passes control to the main function in the cli module, which handles the command-line arguments.
See also
clusx.cli: Contains the main CLI implementation
- clusx.__main__.init() None[source]
Run clusx.cli.main() when current file is executed by an interpreter.
This function ensures that the CLI main function is only executed when this file is run directly, not when imported as a module.
The sys.exit() function is called with the return value of clusx.cli.main(), following standard UNIX program conventions for exit codes.
The cli module
Command-line interface for Clusterium.
This module provides a command-line interface for clustering text data, benchmarking clustering results, and generating reports. It handles command-line arguments, environment configuration, and execution of the appropriate toolkit functionality based on user commands.
- class clusx.cli.RichGroup(*args, **kwargs)[source]
Custom Click group that displays a banner before the help text.
The errors module
Errors for the clusx package.
- exception clusx.errors.ClusterIntegrityError[source]
Error raised when a cluster assignments file has integrity issues.
This error indicates that the cluster assignments file is corrupted, was created with errors, or is missing critical information needed for further processing.
- exception clusx.errors.MissingClusterColumnError(file_path: str)[source]
Error raised when a cluster assignments file is missing the cluster column.
This error indicates that the file does not contain a column that starts with Cluster_ (such as Cluster_PYP or Cluster_DP), which is required for identifying cluster assignments.
See also
ClusterIntegrityError: Parent class for integrity errors
MissingParametersError: Related error for missing parameters
- exception clusx.errors.MissingParametersError(file_path: str, missing_params: list[str])[source]
Error raised when a cluster assignments file is missing required parameters.
This error indicates that the file is missing one or more of the required parameters (alpha, sigma, kappa) needed for further processing.
- Parameters:
See also
ClusterIntegrityError: Parent class for integrity errors
MissingClusterColumnError: Related error for missing cluster columns
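A minimal sketch of how this hierarchy might be laid out and used, assuming both specific errors subclass ClusterIntegrityError as the "See also" notes indicate. The message wording and the find_cluster_column helper are hypothetical and are not taken from the package.

```python
class ClusterIntegrityError(Exception):
    """Base error for a damaged or incomplete cluster assignments file."""


class MissingClusterColumnError(ClusterIntegrityError):
    def __init__(self, file_path: str):
        self.file_path = file_path
        # Message text is an assumption made for this sketch.
        super().__init__(f"No column starting with 'Cluster_' found in {file_path}")


class MissingParametersError(ClusterIntegrityError):
    def __init__(self, file_path: str, missing_params: list):
        self.file_path = file_path
        self.missing_params = missing_params
        super().__init__(
            f"Missing parameters in {file_path}: {', '.join(missing_params)}"
        )


def find_cluster_column(columns: list, file_path: str) -> str:
    """Hypothetical helper: return the first column named Cluster_*."""
    for name in columns:
        if name.startswith("Cluster_"):
            return name
    raise MissingClusterColumnError(file_path)
```

Because both specific errors share a parent, a caller can catch ClusterIntegrityError once and still inspect file_path or missing_params on the instance.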
The evaluation module
Evaluation module for clustering quality assessment.
This module provides tools for evaluating the quality and characteristics of clusters generated by Bayesian nonparametric clustering algorithms. It implements established metrics for cluster validation in the context of text data clustering, with a focus on power-law analysis and similarity-based metrics.
Key components:
ClusterEvaluator: Main class for evaluating clustering results
NumpyEncoder: Custom JSON encoder for handling NumPy data types
save_evaluation_report(): Function to save evaluation results to JSON
The evaluation process assesses:
Cluster cohesion and separation (silhouette score)
Intra-cluster vs. inter-cluster similarity
Power-law characteristics of cluster size distributions
Potential outliers in the clustering results
Cluster size distribution
This module is typically used after running clustering with the Dirichlet Process and Pitman-Yor Process models to compare their performance and understand the statistical properties of the generated clusters.
- class clusx.evaluation.ClusterEvaluator(texts: list[str], embeddings: numpy.ndarray, cluster_assignments: list[int], model_name: str, alpha: float, sigma: float, kappa: float, random_state: int | None = None)[source]
Evaluates the quality and characteristics of text clusters using metrics.
This class provides methods to assess clustering results through various metrics:
Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters
Similarity Metrics: Evaluates intra-cluster vs inter-cluster similarity
Power-law Analysis: Determines if cluster sizes follow a power-law distribution
Outlier Detection: Identifies potential outliers in the clustering results
Cluster Size Distribution: Calculates the distribution of cluster sizes
Used for post-processing analysis of Bayesian nonparametric clustering results.
Note
Parameters like alpha and sigma in clustering algorithms significantly impact the resulting cluster distributions.
- calculate_cluster_size_distribution() dict[str, int][source]
Calculate the distribution of cluster sizes across all clusters.
This method counts the number of texts assigned to each cluster and returns a mapping of cluster IDs to their respective sizes. The distribution is useful for:
Analyzing the balance of cluster assignments
Identifying dominant vs. minor clusters
Providing input for power-law distribution analysis
Visualizing the cluster size distribution
The cluster IDs are converted to strings in the returned dictionary to ensure compatibility with JSON serialization.
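The counting step can be sketched with the standard library alone. This is an illustration of the idea, not the method's actual body; the standalone function name and its argument are assumptions.

```python
from collections import Counter


def cluster_size_distribution(cluster_assignments: list) -> dict:
    """Map each cluster ID (as a string) to the number of texts assigned to it."""
    counts = Counter(cluster_assignments)
    # String keys keep the mapping straightforward to serialize as JSON.
    return {str(cluster_id): size for cluster_id, size in counts.items()}


sizes = cluster_size_distribution([0, 0, 1, 2, 2, 2])
```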
- calculate_silhouette_score() float[source]
Calculate the silhouette score for the clustering data.
This method calculates the silhouette score only for valid clusters (those with ≥2 samples). Invalid clusters are excluded from the calculation.
Cosine distance is used because the data is represented by text embeddings.
The silhouette score measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to 1, where:
A high value (close to 1) indicates the object is well-matched to its cluster
A value near 0 indicates the object is on or very close to the decision boundary
A negative value indicates the object might be assigned to the wrong cluster
This method handles edge cases by returning 0.0 when:
There are fewer than 2 valid clusters
An error occurs during calculation
- Returns:
Silhouette score as a float between -1 and 1, or 0.0 if calculation is not possible
- Return type:
float
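The filtering and scoring described above can be illustrated in pure Python. This is a sketch under stated assumptions, not the method's implementation: the function names are hypothetical, embeddings are plain lists of floats with non-zero norm, and the quadratic loop favors readability over speed.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def silhouette_valid_clusters(embeddings, labels):
    sizes = Counter(labels)
    # Keep only points whose cluster has at least two members.
    kept = [(e, l) for e, l in zip(embeddings, labels) if sizes[l] >= 2]
    clusters = sorted({l for _, l in kept})
    if len(clusters) < 2:
        return 0.0
    scores = []
    for i, (point, label) in enumerate(kept):
        # Accumulate distances from this point to every other kept point, per cluster.
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j, (other, other_label) in enumerate(kept):
            if i != j:
                sums[other_label] += cosine_distance(point, other)
                counts[other_label] += 1
        a = sums[label] / counts[label]  # mean distance within own cluster
        b = min(sums[c] / counts[c] for c in clusters if c != label)  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Singleton clusters are dropped before scoring, so they neither contribute a score of their own nor act as a "nearest other cluster" for the remaining points.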
- calculate_similarity_metrics() dict[str, float | numpy.floating | dict[str, int]][source]
Calculate cluster-aware similarity metrics.
This method computes three key metrics using cosine similarity:
Intra-cluster similarity: Average similarity between texts in the same cluster (higher values indicate more cohesive clusters)
Inter-cluster similarity: Average similarity between texts in different clusters (lower values indicate better separation between clusters)
Silhouette-like score: Difference between intra-cluster and inter-cluster similarity (similar to silhouette score but calculated differently)
The method handles edge cases:
Only considers clusters with ≥2 members for intra-similarity
Uses matrix operations for O(n) complexity
Handles edge cases with proper numerical stability
- Returns:
Dictionary with the following keys:
intra_cluster_similarity: Average similarity within clusters
inter_cluster_similarity: Average similarity between clusters
silhouette_like_score: Difference between intra and inter similarity
valid_cluster_ratio: Fraction of valid clusters
analyzed_pairs: Number of analyzed intra- and inter-cluster pairs (intra: intra-cluster pairs, inter: inter-cluster pairs)
- Return type:
dict[str, Union[float, numpy.floating]]
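To make the three metrics concrete, here is a small pure-Python sketch that enumerates pairs explicitly. It is an assumption-laden illustration: the documented method is described as using matrix operations, whereas this loop is quadratic, the function names are hypothetical, and valid_cluster_ratio is omitted.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_metrics(embeddings, labels):
    intra, inter = [], []
    # Every unordered pair is either within one cluster or across two clusters.
    for (u, lu), (v, lv) in combinations(zip(embeddings, labels), 2):
        (intra if lu == lv else inter).append(cosine_similarity(u, v))
    intra_mean = sum(intra) / len(intra) if intra else 0.0
    inter_mean = sum(inter) / len(inter) if inter else 0.0
    return {
        "intra_cluster_similarity": intra_mean,
        "inter_cluster_similarity": inter_mean,
        "silhouette_like_score": intra_mean - inter_mean,
        "analyzed_pairs": {"intra": len(intra), "inter": len(inter)},
    }


metrics = similarity_metrics(
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]], [0, 0, 1, 1]
)
```

Clusters with a single member produce no intra-cluster pairs, which matches the rule that only clusters with at least 2 members count toward intra-cluster similarity.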
- detect_powerlaw_distribution() dict[str, Any][source]
Detect if the cluster size distribution follows a power-law.
This method analyzes the distribution of cluster sizes to determine if it follows a power-law distribution, which is common in many natural language datasets and indicates scale-free properties. The analysis includes:
Collecting the size of each cluster
Validating if there are enough clusters (at least 5) for meaningful analysis
Fitting a power-law distribution using the powerlaw package
Comparing the power-law fit to an exponential distribution
The method handles edge cases:
Returns null values if there are fewer than 5 clusters
Handles errors in the powerlaw fitting process
Validates the fitted parameters to avoid NaN values
- Returns:
A dictionary with power-law parameters:
alpha: Power-law exponent (higher values indicate steeper distribution)
xmin: Minimum value for which power-law holds
is_powerlaw: Boolean indicating if distribution follows power-law
sigma_error: Standard error of the alpha estimate
p_value: P-value from comparison with exponential distribution
- Return type:
dict[str, Any]
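For intuition about the exponent and its standard error, the sketch below uses the closed-form maximum-likelihood estimate for a continuous power law, alpha = 1 + n / sum(ln(x_i / xmin)), with sigma = (alpha - 1) / sqrt(n). This is a swapped-in simplification, not what the method does: the documented method relies on the powerlaw package, which also estimates xmin and compares against an exponential fit. Here xmin is simply assumed to be the smallest cluster size, and the function name is hypothetical.

```python
import math
from collections import Counter

MIN_CLUSTERS = 5  # threshold taken from the description above


def estimate_powerlaw(cluster_assignments):
    sizes = list(Counter(cluster_assignments).values())
    empty = {"alpha": None, "xmin": None, "sigma_error": None}
    if len(sizes) < MIN_CLUSTERS:
        return empty
    xmin = min(sizes)  # assumption: no search over candidate xmin values
    log_sum = sum(math.log(size / xmin) for size in sizes)
    if log_sum == 0:
        # All clusters have the same size, so the exponent is undefined.
        return empty
    n = len(sizes)
    alpha = 1.0 + n / log_sum
    return {"alpha": alpha, "xmin": xmin, "sigma_error": (alpha - 1.0) / math.sqrt(n)}


labels = [0] * 8 + [1] * 4 + [2] * 2 + [3] + [4]
fit = estimate_powerlaw(labels)
```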
- class clusx.evaluation.NumpyEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]
Custom JSON encoder that handles NumPy data types.
This encoder converts NumPy types to their Python equivalents for proper JSON serialization. It’s used when saving evaluation reports to ensure all NumPy values are properly converted to standard Python types.
Conversions:
numpy.bool_ → bool
Other NumPy types → Python equivalents via the item() method when available
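The item()-based conversion can be sketched by overriding json.JSONEncoder.default. The class below is a hypothetical stand-in rather than the package's encoder, and FakeScalar only mimics a NumPy scalar so that the example runs without NumPy installed.

```python
import json


class ItemAwareEncoder(json.JSONEncoder):
    """Hypothetical stand-in: unwrap scalars that expose an item() method."""

    def default(self, o):
        item = getattr(o, "item", None)
        if callable(item):
            return item()
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(o)


class FakeScalar:
    """Mimics a NumPy scalar for the purpose of this sketch."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


encoded = json.dumps(
    {"score": FakeScalar(0.5), "ok": FakeScalar(True)}, cls=ItemAwareEncoder
)
```

The encoder is passed through the cls argument of json.dumps, so default() is consulted only for objects the standard encoder cannot handle.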
- clusx.evaluation.save_evaluation_report(report: dict[str, Any], output_dir: str, filename: str = 'evaluation_report.json') str[source]
Save the evaluation report to a JSON file.
This function serializes the evaluation report to a JSON file, handling NumPy data types through the NumpyEncoder. The report contains comprehensive metrics about the clustering quality, including silhouette scores, similarity metrics, power-law analysis, and outlier detection.
If serialization issues occur, the function attempts to save a simplified version of the report with only basic metrics.
- Parameters:
report – Dictionary containing the evaluation report for different clustering models
output_dir – Directory to save the report
filename – Name of the output file (default: “evaluation_report.json”)
- Returns:
Path to the saved report file
- Return type:
str
- Raises:
TypeError – If JSON serialization fails even after simplification attempts
The logging module
Logging configuration for Clusterium.
This module provides standardized logging functionality for the Clusterium package, including configuration setup and logger retrieval. It ensures consistent log formatting across all components of the package and simplifies the process of obtaining properly configured logger instances.
The module offers two main functions:
setup_logging: Configures the root logger with appropriate formatting and level
get_logger: Returns a logger instance with the specified name
Typical usage:
>>> from clusx.logging import get_logger
>>> logger = get_logger(__name__)
>>> logger.info("Processing started")
- clusx.logging.get_logger(name: str) Logger[source]
Get a logger with the specified name.
- Parameters:
name (str) – The name for the logger (typically __name__).
- Returns:
A configured logger instance ready for use.
- Return type:
Logger
- clusx.logging.setup_logging(level: int | None = None) None[source]
Set up logging configuration for the application.
This function configures the root logger with a standardized format that includes timestamp, log level, and message. It’s typically called once at the start of the application to ensure consistent logging behavior across all modules.
The timestamp format is ISO-like (YYYY-MM-DD HH:MM:SS) for better readability and sorting in log files.
- Parameters:
level – The logging level (defaults to logging.INFO if None).
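A minimal sketch of the two functions using the standard logging module. The exact format string is an assumption (the description only states that it includes timestamp, level, and message), and force=True is used here so that repeated calls reconfigure the root logger; the package may behave differently.

```python
import logging
from typing import Optional

LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"  # assumed layout
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # ISO-like timestamp, as described above


def setup_logging(level: Optional[int] = None) -> None:
    """Configure the root logger; fall back to INFO when no level is given."""
    logging.basicConfig(
        level=logging.INFO if level is None else level,
        format=LOG_FORMAT,
        datefmt=DATE_FORMAT,
        force=True,  # replace any handlers installed by an earlier call
    )


def get_logger(name: str) -> logging.Logger:
    return logging.getLogger(name)


setup_logging()
logger = get_logger("clusx.demo")
logger.info("Processing started")
```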
The utils module
Utility functions for the clusx package.
- clusx.utils.to_numpy(embedding: EmbeddingTensor) NDArray[np.float32][source]
Convert a tensor to a numpy array.
If embedding is already a numpy array (or compatible), it is returned as is. Otherwise, it is converted to a numpy array.
- Parameters:
embedding (EmbeddingTensor) – The tensor to convert. Can be a PyTorch tensor or a numpy array.
- Returns:
The input converted to a numpy array. If the input is already a numpy array (or compatible), it is returned as is.
- Return type:
NDArray[np.float32]
The version module
Version information.
This module provides package metadata through a cascading resolution strategy.
The metadata is resolved in the following order:
1. Installed package metadata (via importlib.metadata)
2. pyproject.toml (for development environments)
3. Fallback defaults
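The cascade could look roughly like the sketch below. The function name, the default value, and the choice to pass an already-parsed pyproject mapping (with the version under the project table) are all assumptions made for illustration.

```python
from importlib import metadata
from typing import Optional


def resolve_version(
    package: str, pyproject: Optional[dict] = None, default: str = "0.0.0"
) -> str:
    # 1. Prefer the metadata of the installed distribution.
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        pass
    # 2. Fall back to a parsed pyproject.toml, if the caller supplied one.
    if pyproject is not None:
        version = pyproject.get("project", {}).get("version")
        if version:
            return version
    # 3. Last resort: a fixed default.
    return default
```

Each step runs only when the previous one yields nothing, so an installed package always wins over a local pyproject.toml.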
The visualization module
Visualization module for Clusterium.
This module provides functions for visualizing clustering results and evaluation metrics.
- clusx.visualization.MIN_DATASET_SIZE = 10
Minimum dataset size for which visualizations are considered safe.
Note
Visualizations may not be meaningful or could be misleading when applied to datasets smaller than this threshold.
- clusx.visualization.get_model_colors(model_names: list[str]) dict[str, Any][source]
Generate consistent colors for models using academically popular colormaps.
Selects appropriate colormaps based on visualization best practices for clustering:
For typical case (≤10 models): Uses ‘Set1’ which provides distinct, balanced hues that ensure clear differentiation among groups.
For more models: Uses ‘tab20’ which provides up to 20 distinct colors, with alpha variation for cases beyond 20 models to maintain visual distinction.
This approach follows standard practices in clustering visualization where colormap selection is based on the number of clusters to ensure optimal visual clarity and accessibility.
- clusx.visualization.is_small_dataset(reports: dict[str, dict[str, Any]], min_size: int) bool[source]
Check if the dataset is considered small based on the number of texts.
- Parameters:
- Returns:
True if the dataset is considered small, False otherwise.
- Return type:
bool
Notes
A dataset is considered small if:
It’s empty (no reports) or not a dictionary
No reports have ‘cluster_stats’
No reports have ‘num_texts’ in their ‘cluster_stats’
Any report has fewer than min_size texts (assuming we have the same dataset for all reports)
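The rules in the notes above can be sketched as follows. This is an illustration that follows the listed conditions, not the function's source.

```python
def is_small_dataset(reports, min_size: int) -> bool:
    # Empty input, or input that is not a dictionary, counts as small.
    if not isinstance(reports, dict) or not reports:
        return True
    counts = []
    for report in reports.values():
        stats = report.get("cluster_stats") if isinstance(report, dict) else None
        if isinstance(stats, dict) and "num_texts" in stats:
            counts.append(stats["num_texts"])
    # No report carries a usable 'num_texts' value.
    if not counts:
        return True
    return any(count < min_size for count in counts)
```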
- clusx.visualization.plot_cluster_counts(reports, ax: Axes)[source]
Plot the number of clusters for each model.
- Parameters:
reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.
- Return type:
None
- clusx.visualization.plot_cluster_size_distribution(reports, ax: Axes)[source]
Plot cluster size distributions for each model.
- Parameters:
reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.
- Returns:
The function modifies the provided axes in-place.
- Return type:
None
- clusx.visualization.plot_outliers(reports, ax: Axes)[source]
Plot outlier scores distribution.
- Parameters:
reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.
- Return type:
None
- clusx.visualization.plot_powerlaw_fit(reports, ax: Axes)[source]
Plot power-law fit for cluster size distributions.
- Parameters:
reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.
- Returns:
The function plots directly on the provided axes.
- Return type:
None
- clusx.visualization.plot_silhouette_scores(reports, ax: Axes)[source]
Plot silhouette scores for each model.
- Parameters:
reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.
- Return type:
None
- clusx.visualization.plot_similarity_metrics(reports, ax: Axes)[source]
Plot similarity metrics for each model.
- Parameters:
reports (dict) – Dictionary mapping model names to their evaluation reports.
ax (Axes) – Matplotlib axes to plot on.
- Return type:
None
- clusx.visualization.render_error_message(ax: Axes, plot_title: str, error, small_dataset: bool, min_size: int)[source]
Display appropriate error message on the plot.
- Parameters:
- Returns:
This function modifies the provided axes in-place.
- Return type:
None
- clusx.visualization.safe_plot(title: str | None = None, min_dataset_size: int = 10)[source]
Decorator for safely executing plotting functions with error handling.
- Parameters:
title (str or None) – Title for the plot. If None, the function name will be used.
min_dataset_size (int) – Minimum dataset size threshold for small dataset detection. Default is MIN_DATASET_SIZE.
- Returns:
Decorated function that handles errors and provides visual feedback.
- Return type:
Examples
>>> @safe_plot(title="My Custom Plot")
... def plot_my_visualization(reports, ax):
...     # Your plotting code here
...     # No need for try/except blocks
...     ax.plot(data)
...     ax.set_title("My Plot")
...
>>> # Usage remains the same as the original function
>>> plot_my_visualization(reports, ax)
Notes
The decorated function must accept ‘reports’ and ‘ax’ as its first two arguments
The decorator automatically sets the plot title
For small datasets, a specific message is displayed
All exceptions are logged with detailed error messages
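A rough sketch of how such a decorator might be structured, assuming it sets the title, runs the wrapped function, and on failure logs the exception and writes a message onto the axes. Small-dataset detection is left out, and RecordingAxes is a hypothetical stand-in for a Matplotlib Axes object so that the example runs without Matplotlib.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def safe_plot(title=None):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(reports, ax, *args, **kwargs):
            ax.set_title(title or func.__name__)
            try:
                return func(reports, ax, *args, **kwargs)
            except Exception as exc:
                logger.exception("Plot %r failed", func.__name__)
                # Show the failure on the plot instead of propagating it.
                ax.text(0.5, 0.5, f"Could not render plot: {exc}",
                        ha="center", va="center")
                return None
        return wrapper
    return decorator


class RecordingAxes:
    """Hypothetical stand-in that records what would be drawn."""

    def __init__(self):
        self.title = None
        self.messages = []

    def set_title(self, label):
        self.title = label

    def text(self, x, y, s, **kwargs):
        self.messages.append(s)


@safe_plot(title="Broken Plot")
def plot_broken(reports, ax):
    raise KeyError("cluster_stats")


ax = RecordingAxes()
plot_broken({}, ax)
```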
- clusx.visualization.visualize_evaluation_dashboard(reports: dict[str, dict[str, Any]], output_dir: str, filename: str = 'evaluation_dashboard.png', show_plot: bool = False) str[source]
Generate a comprehensive dashboard visualization of evaluation metrics.
This creates a 3x2 grid of plots showing:
Cluster size distribution (log-log scale)
Silhouette score comparison
Similarity metrics comparison
Power-law fit visualization
Outlier distribution
Number of clusters comparison
- Parameters:
reports (dict[str, dict[str, Any]]) – Dictionary mapping model names to their evaluation reports.
output_dir (str) – Directory to save the visualization.
filename (str) – Name of the output file. Default is evaluation_dashboard.png.
show_plot (bool) – Whether to display the plot interactively. Default is False.
- Returns:
Path to the saved visualization file.
- Return type:
str
The clustering module
The models module
Clustering models for text data using Dirichlet Process and Pitman-Yor Process.
This module implements nonparametric Bayesian clustering algorithms for text data, specifically the Dirichlet Process and Pitman-Yor Process. These methods automatically determine the appropriate number of clusters based on the data.
The implementation uses the Chinese Restaurant Process formulation with von Mises-Fisher distribution for modeling document embeddings on the unit hypersphere.
Classes
DirichletProcess: Implements clustering using the Dirichlet Process with concentration parameter alpha and precision parameter kappa.
PitmanYorProcess: Extends DirichletProcess with an additional discount parameter for more flexible power-law behavior in cluster size distributions.
Notes
Both implementations follow a scikit-learn compatible API with fit(),
predict(), and fit_predict() methods. The Pitman-Yor Process is generally
better suited for text data as it can model the power-law distributions common in
natural language.
- class clusx.clustering.models.DirichletProcess(alpha: float, kappa: float, model_name: str | None = 'all-MiniLM-L6-v2', random_state: int | None = None)[source]
DP clustering implementation for text data using von Mises-Fisher distribution.
This implementation uses a Chinese Restaurant Process (CRP) formulation with Bayesian inference to cluster text data. It combines the CRP prior with a likelihood model based on von Mises-Fisher distributions in the embedding space, which is particularly suitable for directional data like normalized text embeddings.
The model uses a concentration parameter alpha to control the propensity to create new clusters, and a precision parameter kappa to control the concentration of points around cluster means in the von Mises-Fisher distribution.
- model
Sentence transformer model used for text embeddings.
- Type:
SentenceTransformer
- random_state
Random state for reproducibility.
- Type:
- cluster_params
Dictionary of cluster parameters for each cluster. Contains ‘mean’ (centroid) and ‘count’ (number of points).
- Type:
- global_mean
Global mean of all document embeddings.
- Type:
Optional[EmbeddingTensor]
- embeddings_
Document embeddings after fitting.
- Type:
Optional[EmbeddingTensor]
- labels_
Cluster assignments after fitting.
- Type:
Optional[NDArray[np.int64]]
- __init__(alpha: float, kappa: float, model_name: str | None = 'all-MiniLM-L6-v2', random_state: int | None = None)[source]
Initialize a Dirichlet Process model with von Mises-Fisher likelihood.
- Parameters:
alpha (float) – Concentration parameter for new cluster creation. Higher values lead to more clusters.
kappa (float) – Precision parameter for the von Mises-Fisher distribution. Higher values lead to tighter, more concentrated clusters.
model_name (Optional[str]) – Name of the sentence transformer model to use. Default is “all-MiniLM-L6-v2”.
random_state (Optional[int]) – Random seed for reproducibility. If None, fresh, unpredictable entropy will be pulled from the OS.
- __weakref__
list of weak references to the object
- assign_cluster(embedding: EmbeddingTensor) tuple[int, np.ndarray][source]
Assign a document embedding to a cluster using Bayesian inference.
This method computes probabilities for assigning the document to each existing cluster or creating a new one, then samples a cluster assignment from this probability distribution. The probabilities combine the CRP prior and the von Mises-Fisher likelihood.
- Parameters:
embedding (EmbeddingTensor) – Document embedding vector.
- Returns:
cluster_id (int) – The assigned cluster ID.
probs (np.ndarray) – Probability distribution over clusters used for assignment.
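The sampling step can be illustrated independently of the model. In the sketch below, each candidate's score is assumed to be its log prior plus its log-likelihood, computed elsewhere; the function name and the "new" key for a fresh cluster are hypothetical.

```python
import math
import random


def sample_cluster(log_scores: dict, rng: random.Random):
    """Turn unnormalized log scores into probabilities and sample one key."""
    keys = list(log_scores)
    top = max(log_scores.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = [math.exp(log_scores[key] - top) for key in keys]
    total = sum(weights)
    probs = [weight / total for weight in weights]
    choice = rng.choices(keys, weights=probs, k=1)[0]
    return choice, probs


choice, probs = sample_cluster({0: -1.0, 1: -2.5, "new": -4.0}, random.Random(42))
```

Because the assignment is sampled rather than taken as the argmax, a less likely cluster (or a new one) can still be chosen, which is what lets the number of clusters grow with the data.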
- fit(documents, _y=None)[source]
Train the clustering model on the given text data.
This method processes each document in the input, computing its embedding and assigning it to a cluster using Bayesian inference with the Chinese Restaurant Process. It supports both text inputs and pre-computed embeddings.
- Parameters:
- Returns:
self – The fitted estimator.
- Return type:
Note
After fitting, the following attributes are set:
embeddings_ (Optional[EmbeddingTensor]) – The document embeddings.
labels_ (NDArray[np.int64]) – The cluster assignments for each document.
clusters (list) – List of cluster IDs for each document.
cluster_params (dict) – Dictionary containing parameters for each cluster.
- fit_predict(documents, _y=None)[source]
Fit the model and predict cluster labels for documents.
- Parameters:
- Returns:
labels – Cluster labels for each document.
- Return type:
NDArray[np.int64]
Notes
This method is a convenience function that calls fit() and then returns the cluster labels from the fitting process.
- get_embedding(text: str | list[str]) EmbeddingTensor[source]
Get the embedding for a text or list of texts with caching.
This method computes embeddings for text inputs using the sentence transformer model. It implements caching to avoid recomputing embeddings for previously seen texts. The method can handle both single text strings and lists of texts.
- log_crp_prior(cluster_id: int | None = None) float[source]
Calculate the Chinese Restaurant Process prior probability.
The Chinese Restaurant Process (CRP) is a stochastic process that defines a probability distribution over partitions of items. In the context of clustering, it provides a prior probability for assigning a document to an existing cluster or creating a new one.
The CRP is a key component of Bayesian nonparametric models, particularly the Dirichlet Process. For more details, see [1], [2].
- Parameters:
cluster_id (Optional[int]) – The cluster ID. If provided, calculate prior for an existing cluster. If None, calculate prior for a new cluster.
- Returns:
Log probability of the cluster under the CRP prior.
- Return type:
float
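For reference, the standard CRP prior assigns probability n_k / (n + α) to an existing cluster k and α / (n + α) to a new cluster. The sketch below computes the log of these quantities; unlike the documented method, it takes the cluster counts and alpha as explicit arguments, which is an assumption made so the example is self-contained.

```python
import math
from typing import Optional


def log_crp_prior(counts: dict, alpha: float, cluster_id: Optional[int] = None) -> float:
    n = sum(counts.values())
    # New cluster: weight alpha. Existing cluster: weight equal to its size.
    numerator = alpha if cluster_id is None else counts[cluster_id]
    return math.log(numerator / (alpha + n))


counts = {0: 3, 1: 1}
total = sum(math.exp(log_crp_prior(counts, 1.0, k)) for k in counts)
total += math.exp(log_crp_prior(counts, 1.0))
```

The probabilities over all existing clusters plus the new-cluster option sum to one, which is what the total above checks.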
References
- log_likelihood(embedding: EmbeddingTensor) tuple[dict[int, float], float][source]
Calculate log-likelihoods for an embedding across all clusters.
This method computes the log-likelihood of a document embedding under each existing cluster’s von Mises-Fisher distribution, as well as under a potential new cluster.
- Parameters:
embedding (EmbeddingTensor) – Document embedding vector.
- Returns:
existing_likelihoods (dict[int, float]) – Dictionary mapping cluster IDs to their log-likelihoods.
new_cluster_likelihood (float) – Log-likelihood for a new cluster.
- predict(documents)[source]
Predict the closest cluster for each sample in documents.
- Parameters:
documents (Union[list[str], list[EmbeddingTensor]]) – The text documents or embeddings to predict clusters for.
- Returns:
labels – Cluster labels for each document. Returns -1 if no clusters exist yet.
- Return type:
NDArray[np.int64]
Note
This method computes the most likely cluster assignment for each document based on the von Mises-Fisher likelihood, without updating the cluster parameters. It supports both text inputs and pre-computed embeddings.
- class clusx.clustering.models.PitmanYorProcess(alpha: float, kappa: float, sigma: float, model_name: str | None = 'all-MiniLM-L6-v2', random_state: int | None = None)[source]
Pitman-Yor Process clustering for text data using von Mises-Fisher distribution.
- model
Sentence transformer model used for text embeddings.
- Type:
SentenceTransformer
- random_state
Random state for reproducibility.
- Type:
- cluster_params
Dictionary of cluster parameters for each cluster. Contains ‘mean’ (centroid) and ‘count’ (number of points).
- Type:
- global_mean
Global mean of all document embeddings.
- Type:
Optional[EmbeddingTensor]
- embeddings_
Document embeddings after fitting.
- Type:
Optional[EmbeddingTensor]
- labels_
Cluster assignments after fitting.
- Type:
Optional[NDArray[np.int64]]
Notes
The Pitman-Yor Process is a generalization of the Dirichlet Process that introduces a discount parameter (sigma) to control the power-law behavior of the cluster size distribution. It is particularly effective for modeling natural language phenomena that exhibit power-law distributions, such as word frequencies or topic distributions.
This implementation extends the DirichletProcess class, adding the sigma parameter and modifying the cluster assignment probabilities according to the Pitman-Yor Process while maintaining the von Mises-Fisher likelihood model for directional text embeddings.
The mathematical foundation of the Pitman-Yor Process involves two key parameters:
The concentration parameter alpha (α > -σ), controlling the overall tendency to create new clusters
The discount parameter sigma (0 ≤ σ < 1), controlling the power-law behavior
As σ approaches 1, the distribution exhibits heavier tails (more small clusters), while σ = 0 reduces to the standard Dirichlet Process.
- __init__(alpha: float, kappa: float, sigma: float, model_name: str | None = 'all-MiniLM-L6-v2', random_state: int | None = None)[source]
Initialize a PYP clustering model with von Mises-Fisher likelihood.
- Parameters:
alpha (float) – Concentration parameter for the Pitman-Yor Process. Higher values encourage formation of more clusters. Must satisfy: α > -σ.
kappa (float) – Precision parameter for the von Mises-Fisher distribution. Higher values lead to tighter, more concentrated clusters.
sigma (float) – Discount parameter for the Pitman-Yor Process (0 ≤ σ < 1). Controls the power-law behavior. Higher values create more power-law-like cluster size distributions. When σ=0, the model reduces to a Dirichlet Process.
model_name (Optional[str]) – Name of the sentence transformer model to use. Default is “all-MiniLM-L6-v2”.
random_state (Optional[int]) – Random seed for reproducibility. If None, fresh, unpredictable entropy will be pulled from the OS.
- Raises:
ValueError – If sigma ∉ [0.0, 1.0) or if alpha ≤ -sigma.
Notes
The mathematical requirement for the Pitman-Yor Process is:
The discount parameter σ must be in [0,1)
The concentration parameter α must satisfy α > -σ
The constraint α > -σ ensures that the numerator in the new table probability calculation (α + K*σ) remains positive whenever at least one table exists (K ≥ 1), since α + K*σ ≥ α + σ > 0. This is essential for proper probabilistic behavior of the model.
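The two constraints translate directly into a validation step. The helper below is a sketch with a hypothetical name and message wording; it only mirrors the documented ValueError conditions.

```python
def validate_pyp_parameters(alpha: float, sigma: float) -> None:
    """Raise ValueError unless 0 <= sigma < 1 and alpha > -sigma."""
    if not 0.0 <= sigma < 1.0:
        raise ValueError(f"sigma must be in [0.0, 1.0), got {sigma}")
    if alpha <= -sigma:
        raise ValueError(f"alpha must be greater than -sigma ({-sigma}), got {alpha}")
```

Note that a negative alpha is acceptable as long as it stays above -sigma; for example, alpha = -0.2 with sigma = 0.5 passes both checks.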
- log_pyp_prior(cluster_id: int | None = None) float[source]
Calculate the Pitman-Yor Process prior probability.
- Parameters:
cluster_id (int or None) – The cluster ID. If provided, calculate prior for an existing cluster. If None, calculate prior for a new cluster.
- Returns:
Log probability of the cluster under the PYP prior.
- Return type:
float
Notes
The Pitman-Yor Process generalizes the Chinese Restaurant Process with the introduction of a discount parameter σ. The probability of a new customer (document) joining an existing table (cluster) k or starting a new table is:
P(existing cluster k) = (n_k - σ) / (n + α)
P(new cluster) = (α + K*σ) / (n + α)
where:
- n_k is the number of customers at table k
- n is the total number of customers
- K is the current number of tables
- σ is the discount parameter
- α is the concentration parameter
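The two probabilities above can be computed as in the following sketch. As an assumption for self-containment, the cluster counts, alpha, and sigma are passed explicitly rather than read from a fitted model.

```python
import math
from typing import Optional


def log_pyp_prior(
    counts: dict, alpha: float, sigma: float, cluster_id: Optional[int] = None
) -> float:
    n = sum(counts.values())  # total number of customers
    k = len(counts)           # current number of tables
    if cluster_id is None:
        numerator = alpha + k * sigma
    else:
        numerator = counts[cluster_id] - sigma
    return math.log(numerator / (n + alpha))


counts = {0: 3, 1: 1}
p_existing = [math.exp(log_pyp_prior(counts, 1.0, 0.5, c)) for c in counts]
p_new = math.exp(log_pyp_prior(counts, 1.0, 0.5))
```

With counts of 3 and 1, α = 1, and σ = 0.5, the discount moves mass from existing tables to the new-table option; setting σ = 0 recovers the CRP values n_k / (n + α) and α / (n + α).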
The utils module
Utility functions for data loading, saving, and visualization.
- clusx.clustering.utils.get_embeddings(texts: list[str]) ndarray[source]
Get embeddings for a list of texts.
- Parameters:
texts – List of text strings
- Returns:
Numpy array of embeddings
- clusx.clustering.utils.is_csv_file(input_file: str) bool[source]
Determine if a file is a CSV file based on extension and content.
- Parameters:
input_file – Path to the input file
- Returns:
True if the file is likely a CSV, False otherwise
- Return type:
bool
- clusx.clustering.utils.load_cluster_assignments(csv_path: str) tuple[list[int], dict[str, float]][source]
Load cluster assignments and parameters from a CSV file.
- Parameters:
csv_path – Path to the CSV file containing cluster assignments
- Returns:
- A tuple containing:
List of cluster assignments (clustered texts)
Dictionary of parameters (alpha, sigma, kappa)
- Return type:
tuple[list[int], dict[str, float]]
- Raises:
MissingClusterColumnError – If no cluster column is found in the file
MissingParametersError – If required parameters are missing in the file
- clusx.clustering.utils.load_data(input_file: str, column: str | None = None) list[str][source]
Load text data from a file. Supports text files and CSV files.
- Parameters:
input_file – Path to the input file (text or CSV)
column – Column name containing the text data (required for CSV files)
- Returns:
A list of texts
- Return type:
list[str]
- Raises:
ValueError – If a CSV file is provided without specifying a column
- clusx.clustering.utils.load_parameters_from_json(json_path: str) dict[str, float][source]
Load clustering parameters from a JSON file.
- clusx.clustering.utils.save_clusters_to_csv(output_file: str, texts: list[str], clusters: list[int], model_name: str, alpha: float, sigma: float, kappa: float) None[source]
Save clustering results to a CSV file.
- Parameters:
output_file – Path to the output CSV file
texts – List of text strings
clusters – List of cluster assignments
model_name – Name of the clustering model
alpha – Concentration parameter
sigma – Discount parameter
kappa – Kappa parameter for likelihood model
- clusx.clustering.utils.save_clusters_to_json(output_file: str, texts: list[str], clusters: list[int], model_name: str, alpha: float, sigma: float, kappa: float) None[source]
Save clustering results to a JSON file.
- Parameters:
output_file – Path to the output JSON file
texts – List of text strings
clusters – List of cluster assignments
model_name – Name of the clustering model
alpha – Concentration parameter
sigma – Discount parameter
kappa – Kappa parameter for likelihood model