
Clustering Module

This module defines the abstract clustering interface (BaseClusterer) and a K-Means implementation (KMeansClusterer) used by some of the strategies.

BaseClusterer

Bases: ABC

Abstract base class for clustering algorithms.

Source code in cogitator/clustering.py
from abc import ABC, abstractmethod
from typing import Any, Tuple

import numpy as np


class BaseClusterer(ABC):
    """Abstract base class for clustering algorithms."""

    @abstractmethod
    def cluster(
        self, embeddings: np.ndarray, n_clusters: int, **kwargs: Any
    ) -> Tuple[np.ndarray, np.ndarray]:
        """Clusters the given embeddings into a specified number of clusters.

        Args:
            embeddings: A NumPy array where each row is an embedding vector.
            n_clusters: The desired number of clusters.
            **kwargs: Additional keyword arguments specific to the clustering implementation.

        Returns:
            A tuple containing:
                - A NumPy array of cluster labels assigned to each embedding.
                - A NumPy array of cluster centers.
        """
        ...

cluster(embeddings, n_clusters, **kwargs) abstractmethod

Clusters the given embeddings into a specified number of clusters.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| embeddings | ndarray | A NumPy array where each row is an embedding vector. | required |
| n_clusters | int | The desired number of clusters. | required |
| **kwargs | Any | Additional keyword arguments specific to the clustering implementation. | {} |

Returns:

| Type | Description |
| --- | --- |
| Tuple[ndarray, ndarray] | A tuple (labels, centers): a NumPy array of cluster labels, one per embedding, and a NumPy array of cluster centers. |
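As an illustration of the contract above, here is a minimal sketch of a custom clusterer. To keep it runnable on its own, it redefines a local stand-in for BaseClusterer that mirrors the documented interface; with the package installed, you would presumably import the real class from cogitator.clustering instead (an assumption based on the source path shown here). RoundRobinClusterer is a hypothetical toy class, not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Any, Tuple

import numpy as np


class BaseClusterer(ABC):
    """Local stand-in mirroring the documented abstract interface."""

    @abstractmethod
    def cluster(
        self, embeddings: np.ndarray, n_clusters: int, **kwargs: Any
    ) -> Tuple[np.ndarray, np.ndarray]:
        ...


class RoundRobinClusterer(BaseClusterer):
    """Hypothetical toy clusterer: row i is assigned to cluster i % n_clusters."""

    def cluster(
        self, embeddings: np.ndarray, n_clusters: int, **kwargs: Any
    ) -> Tuple[np.ndarray, np.ndarray]:
        if not 1 <= n_clusters <= len(embeddings):
            raise ValueError("n_clusters must be between 1 and n_samples")
        labels = np.arange(len(embeddings)) % n_clusters
        # Each center is the mean of the rows assigned to that cluster.
        centers = np.stack(
            [embeddings[labels == k].mean(axis=0) for k in range(n_clusters)]
        )
        return labels, centers


embeddings = np.arange(12, dtype=float).reshape(6, 2)
labels, centers = RoundRobinClusterer().cluster(embeddings, n_clusters=3)
print(labels)         # one label per input row
print(centers.shape)  # (3, 2)
```

Any subclass that returns a (labels, centers) pair in this shape should be usable wherever a BaseClusterer is expected. Instantiating BaseClusterer directly raises TypeError because cluster is abstract.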


KMeansClusterer

Bases: BaseClusterer

A clustering implementation using the K-Means algorithm from scikit-learn.

Source code in cogitator/clustering.py
from typing import Any, Tuple

import numpy as np
from sklearn.cluster import KMeans


class KMeansClusterer(BaseClusterer):
    """A clustering implementation using the K-Means algorithm from scikit-learn."""

    def cluster(
        self, embeddings: np.ndarray, n_clusters: int, **kwargs: Any
    ) -> Tuple[np.ndarray, np.ndarray]:
        """Clusters embeddings using K-Means.

        Args:
            embeddings: The embeddings to cluster (shape: [n_samples, n_features]).
            n_clusters: The number of clusters to form.
            **kwargs: Optional settings. Only `random_seed` (or its alias `seed`)
                and `n_init` are read and passed to `sklearn.cluster.KMeans`;
                any other keyword arguments are ignored.

        Returns:
            A tuple containing:
                - labels (np.ndarray): Integer labels array (shape: [n_samples,]).
                - centers (np.ndarray): Coordinates of cluster centers (shape: [n_clusters, n_features]).

        Raises:
            ValueError: If `n_clusters` is invalid or embeddings are incompatible.
        """
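        # Use an explicit None check so that a seed of 0 is honored;
        # `a or b` would treat 0 as missing and fall through to `seed`.
        random_seed = kwargs.get("random_seed")
        if random_seed is None:
            random_seed = kwargs.get("seed")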
        n_init = kwargs.get("n_init", "auto")
        kmeans = KMeans(
            n_clusters=n_clusters,
            random_state=random_seed,
            n_init=n_init,
            init="k-means++",
        )
        labels = kmeans.fit_predict(embeddings)
        return labels, kmeans.cluster_centers_

cluster(embeddings, n_clusters, **kwargs)

Clusters embeddings using K-Means.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| embeddings | ndarray | The embeddings to cluster (shape: [n_samples, n_features]). | required |
| n_clusters | int | The number of clusters to form. | required |
| **kwargs | Any | Optional settings. Only random_seed (or its alias seed) and n_init are read and passed to sklearn.cluster.KMeans; any other keyword arguments are ignored. | {} |
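The two seed keywords are aliases, with random_seed taking precedence when both are given. Because 0 is a valid seed for scikit-learn's random_state, resolving the aliases is safest with an explicit None check rather than a truthiness test. A minimal sketch, using a hypothetical helper named resolve_seed that is not part of the library:

```python
from typing import Any, Optional


def resolve_seed(**kwargs: Any) -> Optional[int]:
    """Hypothetical helper: prefer `random_seed`, fall back to `seed`."""
    seed = kwargs.get("random_seed")
    if seed is None:  # explicit check keeps 0 as a valid seed
        seed = kwargs.get("seed")
    return seed


print(resolve_seed(random_seed=0))          # 0
print(resolve_seed(seed=7))                 # 7
print(resolve_seed(random_seed=3, seed=7))  # 3
print(resolve_seed())                       # None

# For comparison, a truthiness test silently drops a seed of 0:
print({"random_seed": 0}.get("random_seed") or None)  # None
```

When no seed is supplied, random_state is None and K-Means initialization is not reproducible between runs.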

Returns:

| Type | Description |
| --- | --- |
| Tuple[ndarray, ndarray] | A tuple (labels, centers): labels is an integer array of shape [n_samples,]; centers holds the cluster center coordinates, shape [n_clusters, n_features]. |

Raises:

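| Type | Description |
| --- | --- |
| ValueError | If n_clusters is invalid (for example, larger than the number of samples) or the embeddings are incompatible (for example, not a 2-D array). The error is raised by scikit-learn. |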
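The call pattern and return shapes described above can be sketched as follows. To run without cogitator installed, the sketch calls sklearn.cluster.KMeans directly through a hypothetical helper, run_kmeans, that passes the same arguments as the documented method. With the package installed, KMeansClusterer().cluster(embeddings, 2, random_seed=42) should behave equivalently (an assumption based on the source shown here). The data is made up, and n_init=10 is used instead of "auto" so the example also works on older scikit-learn releases.

```python
from typing import Tuple

import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D points (made-up data for illustration).
embeddings = np.array(
    [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [10.0, 10.1], [10.2, 10.0], [10.1, 10.2]]
)


def run_kmeans(
    data: np.ndarray, n_clusters: int, seed: int
) -> Tuple[np.ndarray, np.ndarray]:
    """Hypothetical helper mirroring the documented KMeans call."""
    kmeans = KMeans(
        n_clusters=n_clusters, random_state=seed, n_init=10, init="k-means++"
    )
    labels = kmeans.fit_predict(data)
    return labels, kmeans.cluster_centers_


labels, centers = run_kmeans(embeddings, n_clusters=2, seed=42)
print(labels.shape, centers.shape)  # (6,) (2, 2)

# Asking for more clusters than samples is rejected by scikit-learn.
try:
    run_kmeans(embeddings, n_clusters=7, seed=42)
except ValueError as exc:
    print("ValueError:", exc)
```

Which group receives label 0 is arbitrary, so downstream code should rely on label equality rather than on specific label values.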
