Changelog

This file contains a brief summary of new features and dependency changes or releases, in reverse chronological order.

0.6.0 - 2025-03-16

Features

  • Implemented von Mises-Fisher distribution for text embeddings, replacing Gaussian likelihood for better directional similarity

  • Improved cluster initialization and updates with proper normalization techniques

  • Added kappa parameter for explicit control over cluster cohesion

  • Implemented global mean embedding as base measure for new clusters in CRP models
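
As an illustration of how kappa controls cluster cohesion, the following is a minimal sketch of an unnormalized von Mises-Fisher log-likelihood on unit vectors. The names `normalize` and `vmf_log_likelihood` are hypothetical and are not the library's API. The normalizing constant, which depends only on kappa and the embedding dimension, is omitted here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def vmf_log_likelihood(embedding, cluster_mean, kappa):
    """Unnormalized vMF log-likelihood: kappa times the cosine similarity."""
    x = normalize(embedding)
    mu = normalize(cluster_mean)
    cosine = sum(a * b for a, b in zip(x, mu))
    return kappa * cosine


# Same direction (different lengths) versus orthogonal directions.
aligned = vmf_log_likelihood([1.0, 0.0], [2.0, 0.0], kappa=10.0)
orthogonal = vmf_log_likelihood([1.0, 0.0], [0.0, 3.0], kappa=10.0)
```

A larger kappa widens the gap between well-aligned and poorly aligned embeddings, which is why it acts as a cohesion control: only embeddings close in direction to the cluster mean score highly.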

Breaking Changes

  • Completely redesigned Dirichlet Process and Pitman-Yor Process implementations with incompatible APIs

  • Removed variance parameter, replaced with more theoretically sound kappa parameter

  • Changed cluster assignment methodology to use normalized embeddings and proper directional statistics

  • Modified method signatures across clustering models for better scikit-learn compatibility

Bug Fixes

  • Fixed incorrect Gaussian likelihood calculation that caused bias against new clusters

  • Resolved numerical stability issues by implementing consistent log-space calculations

  • Fixed “singleton cluster dominance” issue with proper cluster mean initialization

  • Corrected PYP prior calculation to handle edge cases with small clusters appropriately
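
To show why log-space calculations help with numerical stability, here is a minimal sketch of normalizing unnormalized log scores into assignment probabilities. The helper names `log_sum_exp` and `normalize_log_scores` are assumptions made for this example and do not come from the library.

```python
import math


def log_sum_exp(log_values):
    """Compute log(sum(exp(v))) stably by factoring out the maximum."""
    peak = max(log_values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in log_values))


def normalize_log_scores(log_scores):
    """Turn unnormalized log scores into probabilities that sum to 1."""
    total = log_sum_exp(log_scores)
    if total == -math.inf:
        raise ValueError("all scores are -inf; nothing to normalize")
    return [math.exp(v - total) for v in log_scores]


# Each of these scores underflows to 0.0 if exponentiated directly,
# yet their relative weights are still recoverable in log space.
probs = normalize_log_scores([-1000.0, -1001.0, -1002.0])
```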

Improvements

  • Rewrote the core clustering algorithm to properly handle directional text embeddings on the unit hypersphere

  • Optimized embedding processing with efficient normalization and similarity calculations

  • Enhanced API with scikit-learn compatible fit(), predict(), and fit_predict() methods

  • Improved theoretical soundness with proper Bayesian inference for cluster assignments

Improved Documentation

  • Enhanced methodological framework with academic foundations and mathematical rigor

  • Added detailed parameter tuning guidelines with practical ranges

0.5.0 - 2025-03-15

Features

  • Implement to_numpy helper function to convert PyTorch tensors to NumPy arrays.

  • Add ClusterIntegrityError, MissingClusterColumnError, and MissingParametersError for better error handling.

  • Enhance plotting functions with error handling:

    • Handle visualization-specific errors and properly report them.

    • Implemented safe_plot decorator for error handling in plotting functions.

    • Updated plotting functions to raise VisualizationError for missing or invalid data.

    • Improved documentation for new functionalities and added examples.

    • Removed deprecated plotting functions and streamlined visualization dashboard code.
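
The decorator pattern described above can be sketched as follows. This is an assumption about the general shape only, not the actual safe_plot implementation: the `VisualizationError` class below is a local stand-in, and `plot_cluster_sizes` is a hypothetical function that performs only the validation step.

```python
import functools


class VisualizationError(Exception):
    """Raised when a plot cannot be produced (stand-in for the library's class)."""


def safe_plot(func):
    """Wrap a plotting function so unexpected failures surface as VisualizationError."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except VisualizationError:
            raise  # already the right type; do not double-wrap
        except Exception as exc:
            raise VisualizationError(f"{func.__name__} failed: {exc}") from exc

    return wrapper


@safe_plot
def plot_cluster_sizes(sizes):
    """Hypothetical plotting function; only the validation step is shown."""
    if not sizes:
        raise ValueError("no cluster sizes to plot")
    return max(sizes)
```

Chaining with `from exc` keeps the original exception available as `__cause__`, so callers can catch one exception type without losing the underlying diagnostic.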

Breaking Changes

  • Refactored load_cluster_assignments function:

    • Now raises specific custom exceptions (MissingClusterColumnError and MissingParametersError) instead of generic ValueError

    • Requires all parameters (alpha, sigma, variance) to be present in the CSV file

    • Removed fallback mechanism to extract parameters from filename

    • Cluster column detection is now more specific (it looks for the cluster_ prefix)

    • Improved docstring with better description of function behavior and exceptions
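
The stricter checks listed above might look roughly like the sketch below. It assumes that the parameters are stored as CSV columns named alpha, sigma, and variance, and that cluster columns are named like cluster_pyp. Both are guesses made for illustration. The exception classes are local stand-ins, and `check_header` is a hypothetical helper, not part of load_cluster_assignments.

```python
import csv
import io


class MissingClusterColumnError(Exception):
    """Stand-in for the library's exception of the same name."""


class MissingParametersError(Exception):
    """Stand-in for the library's exception of the same name."""


REQUIRED_PARAMETERS = ("alpha", "sigma", "variance")


def check_header(fieldnames):
    """Return the cluster columns, or raise if the header is incomplete."""
    cluster_columns = [name for name in fieldnames if name.startswith("cluster_")]
    if not cluster_columns:
        raise MissingClusterColumnError("no column with a 'cluster_' prefix")
    missing = [name for name in REQUIRED_PARAMETERS if name not in fieldnames]
    if missing:
        raise MissingParametersError(f"missing parameters: {', '.join(missing)}")
    return cluster_columns


# This sample header has a cluster column but lacks the variance parameter.
sample = io.StringIO("text,cluster_pyp,alpha,sigma\nhello,0,0.3,0.3\n")
reader = csv.DictReader(sample)
try:
    check_header(reader.fieldnames)
    outcome = "ok"
except MissingParametersError as err:
    outcome = str(err)
```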

Bug Fixes

  • Fix critical issues in similarity metrics calculation:

    • Properly handle singleton clusters instead of reporting misleading 0.0 values

    • Optimize computation for large datasets with sparse clusters

    • Add robust handling for edge cases with no valid cluster pairs

    • Implement correct averaging when mixing singleton and non-singleton clusters

    • Fix silent failures on datasets with predominantly singleton clusters

Improvements

  • Select appropriate colormaps based on visualization best practices for clustering.

  • Redesign the clustering progress bar to be more informative and less noisy.

  • Enhance silhouette score calculation to handle singleton clusters properly:

    • Now calculates scores using only valid clusters (≥2 samples) rather than returning 0.0 when any singleton exists

    • Preserves valuable evaluation data that would otherwise be discarded

    • Provides detailed logging about what proportion of data contributed to the score

    • Aligns with academic best practices in cluster validation literature
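
The filtering step behind this change can be sketched as follows. The helper `valid_sample_mask` is hypothetical and covers only the selection of samples. The silhouette score itself would then be computed on the retained samples, and it is only defined when at least two distinct clusters remain.

```python
from collections import Counter


def valid_sample_mask(labels, min_size=2):
    """Flag samples whose cluster has at least `min_size` members."""
    sizes = Counter(labels)
    return [sizes[label] >= min_size for label in labels]


labels = [0, 0, 1, 2, 2, 2, 3]  # clusters 1 and 3 are singletons
mask = valid_sample_mask(labels)
kept = [label for label, keep in zip(labels, mask) if keep]

# Proportion of the data that contributes to the score; worth logging.
coverage = len(kept) / len(labels)
```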

Improved Documentation

  • Fix code smells and style issues.

  • Introduced pylint to the CI workflow.

  • Added new “Methodological Framework” documentation explaining theoretical decisions behind implementation choices.

Trivial/Internal Changes

  • Amend and improve installation documentation.

0.4.0 - 2025-03-13

Features

  • Updated the application interface to support both text files (each line treated as a clustering candidate) and CSV files.

  • Added --show-plot/--no-show-plot option to the evaluate command to control whether plots are displayed interactively. Default is --no-show-plot to better support automation and headless environments.
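
For readers who want the same paired-flag behavior in their own scripts, here is a minimal sketch using the standard library's argparse (Python 3.9 or later). The project's actual CLI framework is not stated here, so this shows the pattern rather than the tool's implementation. The program name `evaluate-sketch` is made up.

```python
import argparse

parser = argparse.ArgumentParser(prog="evaluate-sketch")
parser.add_argument(
    "--show-plot",
    action=argparse.BooleanOptionalAction,  # also registers --no-show-plot
    default=False,  # non-interactive by default, suitable for headless runs
    help="display plots interactively",
)

default_args = parser.parse_args([])
interactive_args = parser.parse_args(["--show-plot"])
headless_args = parser.parse_args(["--no-show-plot"])
```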

Breaking Changes

  • Removed the “answer” field from *_dp.json and *_pyp.json outputs, with corresponding updates to code, documentation, and tests.

  • CSV inputs now require an explicit column name; otherwise, the program will exit with an error.

  • Changed default parameter values to optimal settings:

    • Dirichlet Process: α=0.5 (was 5.0)

    • Pitman-Yor Process: α=0.3 (was 5.0), σ=0.3 (was 0.5)

    • Variance: 0.3 (was 0.1)

Bug Fixes

  • Fixed critical parameter handling in CLI interface for Dirichlet Process and Pitman-Yor Process:

    • Separated --dp-alpha and --pyp-alpha parameters with appropriate help text

    • Added proper validation for parameter ranges (DP: α > 0, PYP: α > -σ, 0 ≤ σ < 1)

    • Updated documentation to clarify that using the same α value for both models leads to dramatically different clustering behaviors

    • Added recommended parameter ranges in help text (DP: α ∈ [0.1, 5.0], PYP: α ∈ [0.1, 2.0], σ ∈ [0.1, 0.7])
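
The validation rules above (DP: α > 0; PYP: 0 ≤ σ < 1 and α > -σ) can be expressed as a short sketch. The function names `validate_dp` and `validate_pyp` are hypothetical and are not the CLI's internals.

```python
def validate_dp(alpha):
    """Dirichlet Process: the concentration must be strictly positive."""
    if not alpha > 0:
        raise ValueError(f"DP requires alpha > 0, got {alpha}")


def validate_pyp(alpha, sigma):
    """Pitman-Yor Process: 0 <= sigma < 1 and alpha > -sigma."""
    if not 0 <= sigma < 1:
        raise ValueError(f"PYP requires 0 <= sigma < 1, got {sigma}")
    if not alpha > -sigma:
        raise ValueError(
            f"PYP requires alpha > -sigma, got alpha={alpha}, sigma={sigma}"
        )


validate_dp(0.5)
validate_pyp(0.3, 0.3)
validate_pyp(-0.2, 0.3)  # valid: a negative alpha is allowed while alpha > -sigma
```

The last call shows why the two models need separate α options: a value that is legal for the Pitman-Yor Process can be illegal for the Dirichlet Process.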

Improvements

  • The separate result JSON output file is no longer created, as it was identical to the Dirichlet Process JSON output file.

  • Default parameter values now set to optimal values based on extensive testing, providing better out-of-the-box clustering performance.

  • Improved visualization handling with non-interactive plot generation by default, making the tool more suitable for automated pipelines and CI/CD environments.

Improved Documentation

  • Amend and improve usage documentation.

  • Amend and improve API documentation.

  • Updated documentation to reflect new default parameter values and their effects on clustering.

  • Enhanced documentation with clear examples of interactive vs. non-interactive visualization options in both CLI and Python API.

Trivial/Internal Changes

  • Improve cascading metadata resolution in clusx.version module.

  • Refactor type hints to use built-in types.

  • Remove embedding cache functionality as it is not helpful for the current implementation. It will be re-implemented in the future.

0.3.3 - 2025-03-12

Trivial/Internal Changes

  • Fix CD workflow with release artifact upload.

0.3.2 - 2025-03-12

Improved Documentation

  • Amend project documentation.

Trivial/Internal Changes

  • Add checksum generation and verification to CD workflow.

0.3.1 - 2025-03-12

Trivial/Internal Changes

  • Fix publishing to PyPI.

0.3.0 - 2025-03-12

Bug Fixes

  • Implement Proper Bayesian Inference: Implements log CRP/PYP priors and Gaussian likelihoods instead of heuristic similarity scoring. Fixes incorrect probabilistic model through valid posterior sampling.

  • PYP Initialization: Properly initializes cluster parameters via parent class. Fixes PYP initialization bug.
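
For reference, the standard log priors for the two processes can be sketched as below. This follows the textbook seating rule: an existing cluster of size n_k has weight (n_k - σ)/(n + α), and a new cluster has weight (α + σK)/(n + α). It is not taken from the project's code, and the function name `log_pyp_prior` is hypothetical.

```python
import math


def log_pyp_prior(cluster_sizes, alpha, sigma=0.0):
    """Return the log prior for joining each existing cluster, then a new one.

    With sigma=0 this reduces to the Chinese Restaurant Process.
    """
    if not cluster_sizes:
        return [0.0]  # the first item always opens a new cluster
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    denom = math.log(n + alpha)
    existing = [math.log(size - sigma) - denom for size in cluster_sizes]
    new = math.log(alpha + sigma * k) - denom
    return existing + [new]


crp = log_pyp_prior([3, 1], alpha=0.5)
pyp = log_pyp_prior([3, 1], alpha=0.3, sigma=0.3)
```

In a sampler, these log priors would be added to the per-cluster log-likelihoods before normalizing, which gives the posterior over cluster assignments.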

Improvements

  • Embedding Efficiency: Precomputes and caches all embeddings upfront (text_embeddings dict). Fixes O(N²) embedding calls.

  • Reproducibility: Add random_state for controlled sampling via np.random.RandomState. Addresses non-determinism.
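
Both improvements can be illustrated with a small sketch. It uses a stand-in `fake_embed` function instead of a real embedding model, and the standard library's `random.Random` instead of np.random.RandomState, so that the example has no dependencies. All names other than text_embeddings are hypothetical.

```python
import random


def fake_embed(text, calls):
    """Stand-in for an expensive embedding model; records each call."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


def precompute_embeddings(texts, calls):
    """Embed each distinct text exactly once and store the result by text."""
    return {text: fake_embed(text, calls) for text in dict.fromkeys(texts)}


calls = []
texts = ["a b", "cd", "a b"]
text_embeddings = precompute_embeddings(texts, calls)

# Generators seeded with the same value produce the same sampling order.
order_one = random.Random(42).sample(texts, k=len(texts))
order_two = random.Random(42).sample(texts, k=len(texts))
```

Looking embeddings up in the dictionary during sampling avoids re-embedding a text every time it is compared against a cluster.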

Trivial/Internal Changes

  • Change project name.

Improved Documentation

  • Add initial project documentation.

0.2.0 - 2025-03-11

Features

  • Migrate to Dirichlet & Pitman-Yor Process.

  • Add comprehensive evaluation dashboard and power-law analysis.

  • Add integration and unit tests for clustering models.

Breaking Changes

  • Drop support for DBSCAN clustering.

  • Drop support for custom embedding model.

0.1.0 - 2025-03-10

  • Initial release.