How does PHATE handle new data? #147

@Mr-Byun

I am trying to run k-means classification on the potential distances of a query dataset.
To obtain them, I called the extend_to_data function on the query dataset.
However, I don't think this function returns the potential distance:

    def extend_to_data(self, data, **kwargs):
        """Build transition matrix from new data to the graph

        Creates a transition matrix such that `Y` can be approximated by
        a linear combination of landmarks. Any
        transformation of the landmarks can be trivially applied to `Y` by
        performing

        `transform_Y = transitions.dot(transform)`

        Parameters
        ----------

        Y: array-like, [n_samples_y, n_features]
            new data for which an affinity matrix is calculated
            to the existing data. `n_features` must match
            either the ambient or PCA dimensions

        Returns
        -------

        transitions : array-like, [n_samples_y, self.data.shape[0]]
            Transition matrix from `Y` to `self.data`
        """
        kernel = self.build_kernel_to_data(data, **kwargs)
        if sparse.issparse(kernel):
            pnm = sparse.hstack(
                [
                    sparse.csr_matrix(kernel[:, self.clusters == i].sum(axis=1))
                    for i in np.unique(self.clusters)
                ]
            )
        else:
            pnm = np.array(
                [
                    np.sum(kernel[:, self.clusters == i], axis=1).T
                    for i in np.unique(self.clusters)
                ]
            ).transpose()
        pnm = normalize(pnm, norm="l1", axis=1)
        return pnm

Instead, it returns the transition matrix, which I believe is the diffusion probability matrix (diffused for optimal_t steps).
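
To make the quoted function concrete, here is a minimal NumPy sketch of its dense branch on toy data. The kernel values, the cluster assignment, and the helper name `transitions_to_landmarks` are all made up for illustration; this is not the library's code. It shows that each row of the result is an L1-normalized distribution over landmarks, and that the docstring's `transitions.dot(transform)` is a weighted average of landmark values. Note that the quoted function only builds a kernel, aggregates columns, and normalizes; it contains no step that raises the matrix to a power.

```python
import numpy as np


def transitions_to_landmarks(kernel, clusters):
    """Sum kernel columns per landmark cluster, then L1-normalize each row."""
    pnm = np.array(
        [kernel[:, clusters == i].sum(axis=1) for i in np.unique(clusters)]
    ).T
    return pnm / pnm.sum(axis=1, keepdims=True)


# Hypothetical kernel from 2 new points to 4 reference points.
kernel = np.array([[1.0, 1.0, 2.0, 0.0],
                   [0.0, 1.0, 3.0, 1.0]])
# Hypothetical assignment of the 4 reference points to 2 landmarks.
clusters = np.array([0, 0, 1, 1])

transitions = transitions_to_landmarks(kernel, clusters)

# Any landmark-level quantity (here a made-up 2-D embedding) is carried
# over to the new points as a weighted average of landmark values.
landmark_transform = np.array([[0.0, 0.0],
                               [10.0, 20.0]])
transform_new = transitions.dot(landmark_transform)
```

With these toy inputs, every row of `transitions` sums to 1 and `transform_new` has one row per new point.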

So, to convert the transition matrix to the informational (potential) distance, I copied the transform from the _calculate_potential function:

        c = (1 - self.gamma) / 2
        self._diff_potential = ((diff_op_t) ** c) / c
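
As a standalone illustration of the two lines above, the sketch below applies the same formula to a small made-up probability matrix. The function name `potential_from_diffusion` is hypothetical, and the guard for gamma = 1 is my own addition: the quoted formula would divide by zero there, and how the library handles that case is not shown in this issue.

```python
import numpy as np


def potential_from_diffusion(diff_op_t, gamma):
    """Apply the quoted transform (P ** c) / c with c = (1 - gamma) / 2.

    Only defined here for gamma != 1, where c would be zero.
    """
    c = (1 - gamma) / 2
    if c == 0:
        raise ValueError("gamma = 1 gives c = 0; this formula does not apply")
    return (np.asarray(diff_op_t, dtype=float) ** c) / c


# Made-up diffusion probabilities; each row sums to 1.
probs = np.array([[0.25, 0.75],
                  [0.64, 0.36]])

# gamma = 0 gives c = 0.5, so the potential is 2 * sqrt(P).
potential = potential_from_diffusion(probs, gamma=0)
```

For gamma = 0 this matches the `^(0.5) / 0.5` expression used in the R code further down.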

Here is my attempt at mapping the query data onto the reference dataset:

phate.data <- Embeddings(reference.seurat, 'symphony')

phate.ref <- phate(
    phate.data,
    gamma = 0, knn = 10,
    ndim = 3, mds.solver = 'smacof', npca = NULL,
    knn.dist.method = 'euclidean', mds.dist.method = 'euclidean', seed = 333
)

reference.seurat[['phate']] <- CreateDimReducObject(embeddings=phate.ref$embedding, key='phate_', assay='RNA')
km <- kmeans(phate.ref$operator$diff_potential, centers = 7)
reference.seurat$phate.k <- as.character(km$cluster)

query_phate <- phate.ref$operator$transform(Embeddings(query.seurat, 'symphony'))
query.seurat[['phate']] <- CreateDimReducObject(embeddings=query_phate, key='phate_', assay='RNA')

query_diff_transform <- phate.ref$operator$graph$extend_to_data(Embeddings(query.seurat, 'symphony'))
query_diff_potential <- query_diff_transform^(0.5) / 0.5 # c = (1 - gamma) / 2 = 0.5 because gamma = 0
query.seurat$phate.k <- clue::cl_predict(km, newdata=as.matrix(query_diff_potential), type='class_ids')
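
The final step assigns each query row to a k-means cluster. The sketch below shows that step in Python, assuming the prediction is a Euclidean nearest-center rule; the function name `predict_nearest_center` and the toy values are hypothetical. It also makes one requirement explicit: the query features must have the same number of columns as the matrix the centers were fitted on.

```python
import numpy as np


def predict_nearest_center(features, centers):
    """Return the index of the closest center (Euclidean) for each row."""
    features = np.asarray(features, dtype=float)
    centers = np.asarray(centers, dtype=float)
    if features.shape[1] != centers.shape[1]:
        raise ValueError(
            "features and centers must have the same number of columns"
        )
    # Squared Euclidean distance from every row to every center.
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


# Made-up centers and query rows in a 2-column feature space.
centers = np.array([[0.0, 0.0],
                    [2.0, 2.0]])
query = np.array([[0.1, 0.2],
                  [1.9, 2.5],
                  [1.2, 0.9]])

labels = predict_nearest_center(query, centers)
```

The returned labels are 0-based, unlike the 1-based cluster ids that R's kmeans reports.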

After merging reference.seurat and query.seurat, I visualized the PHATE dimensions and the phate.k clusters.
The query.seurat points overlapped the reference.seurat points; however, the positions of the phate.k clusters were a little off.

Reference: [image]

Query: [image]

  1. Did I make a mistake?
  2. Is there a direct way to obtain the potential distance matrix for new (query) data?
  3. Or is reference-based mapping with PHATE simply not feasible?

Thank you.
