Description
Describe the bug
Setting both knn_dist='precomputed_affinity' and random_landmarking=True is not currently supported, and the resulting error (a ValueError from scipy's cdist, 'XA must be a 2-dimensional array.') does not point to the cause. Adding support for this combination would be valuable, as it would allow non-Euclidean methods like RF-PHATE and RF-AE (which rely on their own affinity matrices) to scale more easily to millions of data points.
To Reproduce
import scipy.sparse as sp
import numpy as np
from phate import PHATE
seed = 42
np.random.seed(seed)
# Generate a random 10k x 10k sparse affinity matrix
n = 10_000
density = 0.0005
# random non-negative affinities
A = sp.random(n, n, density=density, data_rvs=lambda k: np.random.rand(k))
# ensure strictly positive diagonal
A.setdiag(np.random.rand(n) + 1e-3)
print("Random affinity matrix:", A.shape, "nnz:", A.nnz)
# PHATE with precomputed affinity and random landmarking
phate_operator = PHATE(
    n_jobs=-1,
    random_state=seed,
    random_landmarking=True,
    knn_dist="precomputed_affinity",
)
emb_train = phate_operator.fit_transform(A)
Expected behavior
Ideally, using precomputed affinities with random landmarking should behave the same as in the default spectral-clustering mode. I haven’t looked deeply into the PHATE/graphtools internals, but I suspect the issue lies in how points are assigned to random landmarks. If the assignment step relies solely on Euclidean distances, that would explain why precomputed affinity matrices fail.
The central question is: once we randomly select N_land landmarks, can we build a transition matrix from points to landmarks using the input affinity matrix? This is not straightforward, but it could be useful in large-scale applications.
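One possible way to frame that question, as a minimal sketch and not a description of how PHATE or graphtools works today: pick landmarks at random, assign every point to the landmark it has the highest affinity with, then aggregate the affinity matrix by cluster and row-normalize in both directions. The helper names (`row_normalize`, `assign_to_random_landmarks`, `landmark_operator`) are hypothetical, and the sketch densifies the point-to-landmark block for simplicity. Points with zero affinity to every landmark are marked `-1`, which is the case that makes this non-trivial for very sparse matrices.

```python
import numpy as np
import scipy.sparse as sp


def row_normalize(M):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    M = sp.csr_matrix(M, dtype=float)
    sums = np.asarray(M.sum(axis=1)).ravel()
    sums[sums == 0] = 1.0
    return sp.diags(1.0 / sums) @ M


def assign_to_random_landmarks(affinity, n_landmarks, seed=None):
    """Hypothetical helper: assign points to randomly chosen landmarks
    by highest affinity instead of by distance."""
    A = sp.csr_matrix(affinity)
    rng = np.random.default_rng(seed)
    landmarks = np.sort(rng.choice(A.shape[0], size=n_landmarks, replace=False))
    # Dense n x n_landmarks block; acceptable for a sketch, not for millions of points.
    to_landmarks = A[:, landmarks].toarray()
    clusters = to_landmarks.argmax(axis=1)
    # Points with no affinity to any landmark stay unassigned.
    clusters[to_landmarks.max(axis=1) == 0] = -1
    # Each landmark always belongs to its own cluster.
    clusters[landmarks] = np.arange(n_landmarks)
    return landmarks, clusters


def landmark_operator(affinity, clusters, n_landmarks):
    """Hypothetical helper: landmark-to-landmark transition matrix obtained
    by summing affinities per cluster and normalizing both directions."""
    A = sp.csr_matrix(affinity)
    assigned = np.flatnonzero(clusters >= 0)
    membership = sp.csr_matrix(
        (np.ones(assigned.size), (clusters[assigned], assigned)),
        shape=(n_landmarks, A.shape[0]),
    )
    landmarks_to_points = membership @ A          # n_landmarks x n
    points_to_landmarks = landmarks_to_points.T   # n x n_landmarks
    pmn = row_normalize(landmarks_to_points)
    pnm = row_normalize(points_to_landmarks)
    return (pmn @ pnm).toarray()


# Toy usage on a small dense, symmetric affinity matrix.
rng = np.random.default_rng(0)
W = rng.random((200, 200))
W = (W + W.T) / 2
landmarks, clusters = assign_to_random_landmarks(W, n_landmarks=20, seed=0)
L_op = landmark_operator(W, clusters, n_landmarks=20)
```

On a dense toy matrix every point gets a landmark and each row of `L_op` sums to 1. On a matrix as sparse as the one in the reproduction script, many points would be left unassigned after a single step, so a real implementation would likely need a multi-step (diffused) affinity or some other fallback.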
For now, we should at least have a more informative error/warning.
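As a rough illustration of what such a check could look like, here is a minimal sketch. `check_landmarking_options` is a hypothetical function, not part of PHATE or graphtools, and where the check would actually live is left open.

```python
def check_landmarking_options(knn_dist, random_landmarking):
    """Hypothetical pre-flight check (not an existing PHATE/graphtools function).

    Fails early with an explicit message instead of letting the distance
    computation in the landmark step fail later.
    """
    if random_landmarking and knn_dist == "precomputed_affinity":
        raise ValueError(
            "random_landmarking=True is not supported with "
            "knn_dist='precomputed_affinity': random landmark assignment "
            "computes distances from the input data, and a precomputed "
            "affinity matrix does not provide them."
        )


# The combination from this report is rejected up front.
try:
    check_landmarking_options("precomputed_affinity", random_landmarking=True)
    message = None
except ValueError as exc:
    message = str(exc)
```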
Actual behavior
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
File ~/Projects/RF-GAP-Python/.venv/lib/python3.12/site-packages/graphtools/graphs.py:1129, in LandmarkGraph.landmark_op(self)
1128 try:
-> 1129 return self._landmark_op
1130 except AttributeError:
AttributeError: 'TraditionalLandmarkGraph' object has no attribute '_landmark_op'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[4], line 26
19 # PHATE with precomputed affinity and random landmarking
20 phate_operator = PHATE(
21 n_jobs=-1,
22 random_state=seed,
23 random_landmarking=True,
24 knn_dist="precomputed_affinity"
25 )
---> 26 emb_train = phate_operator.fit_transform(A)
File ~/Projects/RF-GAP-Python/.venv/lib/python3.12/site-packages/phate/phate.py:1033, in PHATE.fit_transform(self, X, **kwargs)
1012 """Computes the diffusion operator and the position of the cells in the
1013 embedding space
1014
(...) 1030 The cells embedded in a lower dimensional space using PHATE
1031 """
1032 with _logger.log_task("PHATE"):
-> 1033 self.fit(X)
1034 embedding = self.transform(**kwargs)
1035 return embedding
File ~/Projects/RF-GAP-Python/.venv/lib/python3.12/site-packages/phate/phate.py:929, in PHATE.fit(self, X)
919 warnings.warn(
920 f"Graph is disconnected with {self.graph.n_connected_components} "
921 f"connected components. This may indicate that your knn parameter "
(...) 925 RuntimeWarning,
926 )
928 # landmark op doesn't build unless forced
--> 929 self.diff_op
930 return self
File ~/Projects/RF-GAP-Python/.venv/lib/python3.12/site-packages/phate/phate.py:318, in PHATE.diff_op(self)
316 if self.graph is not None:
317 if isinstance(self.graph, graphtools.graphs.LandmarkGraph):
--> 318 diff_op = self.graph.landmark_op
319 else:
320 diff_op = self.graph.diff_op
File ~/Projects/RF-GAP-Python/.venv/lib/python3.12/site-packages/graphtools/graphs.py:1131, in LandmarkGraph.landmark_op(self)
1129 return self._landmark_op
1130 except AttributeError:
-> 1131 self.build_landmark_op()
1132 return self._landmark_op
File ~/Projects/RF-GAP-Python/.venv/lib/python3.12/site-packages/graphtools/graphs.py:1212, in LandmarkGraph.build_landmark_op(self)
1210 distances = euclidean_distances(data, data[landmark_indices])
1211 else:
-> 1212 distances = cdist(data, data[landmark_indices], metric=self.distance)
1213 self._clusters = np.argmin(distances, axis=1)
1215 else:
File ~/Projects/RF-GAP-Python/.venv/lib/python3.12/site-packages/scipy/spatial/distance.py:3111, in cdist(XA, XB, metric, out, **kwargs)
3108 sB = XB.shape
3110 if len(s) != 2:
-> 3111 raise ValueError('XA must be a 2-dimensional array.')
3112 if len(sB) != 2:
3113 raise ValueError('XB must be a 2-dimensional array.')
ValueError: XA must be a 2-dimensional array.
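The final ValueError is consistent with cdist receiving a SciPy sparse matrix: np.asarray on a sparse matrix yields a 0-dimensional object array, not a 2-D numeric array. The small check below is independent of PHATE and assumes that the data reaching cdist inside graphtools is still the sparse affinity matrix, which I have not verified.

```python
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import cdist

# Small random sparse matrix standing in for the affinity matrix.
rng = np.random.default_rng(0)
dense_values = rng.random((50, 50)) * (rng.random((50, 50)) < 0.1)
A = sp.csr_matrix(dense_values)

# np.asarray wraps the sparse matrix in a 0-d object array.
as_array = np.asarray(A)
print(as_array.ndim, as_array.dtype)

# cdist therefore rejects sparse input.
try:
    cdist(A, A[:5])
    error = None
except (ValueError, TypeError) as exc:
    error = exc
print(repr(error))

# The same call works once the input is densified.
dense = cdist(A.toarray(), A[:5].toarray())
print(dense.shape)
```

Densifying only avoids the exception: the rows of an affinity matrix are not coordinates, so distances computed this way are not a meaningful basis for landmark assignment.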
System information:
Output of phate.__version__:
'2.0.0'
Output of pd.show_versions():
INSTALLED VERSIONS
------------------
commit : 9c8bc3e55188c8aff37207a74f1dd144980b8874
python : 3.12.3
python-bits : 64
OS : Darwin
OS-release : 25.1.0
Version : Darwin Kernel Version 25.1.0: Mon Oct 20 19:34:05 PDT 2025; root:xnu-12377.41.6~2/RELEASE_ARM64_T6041
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : C.UTF-8
pandas : 2.3.3
numpy : 2.3.4
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 25.3
Cython : None
sphinx : None
IPython : 9.7.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.10.0
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.6
lxml.etree : None
matplotlib : 3.10.7
numba : 0.62.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.16.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.2
qtpy : None
pyqt5 : None