One question about the reproducibility #2428
Replies: 3 comments
-
HDBSCAN is not deterministic, which is very frustrating. If determinism is what you need, try replacing it with k-means.
-
Hello @powerhorse1986, what has worked best for me so far is always setting the same seed in the UMAP step, so the reduced embeddings are computed the same way every time. After that, you can even save the embeddings as a .npy file to make sure you always feed the same input to HDBSCAN. Keeping the same parameters and using the exact same input across runs should give the exact same HDBSCAN results.
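A minimal sketch of that workflow, assuming `umap-learn` and NumPy are installed. The helper names `load_or_compute` and `reduce_with_umap`, the seed, and the UMAP parameter values are illustrative assumptions, not something taken from this thread:

```python
import os

import numpy as np


def load_or_compute(path, compute):
    """Return the array cached at `path`; compute and save it on first use."""
    if os.path.exists(path):
        return np.load(path)
    arr = np.asarray(compute())
    np.save(path, arr)  # `path` should end in .npy
    return arr


def reduce_with_umap(embeddings, seed=42):
    """Reduce embeddings with a fixed random_state (hypothetical helper)."""
    # Imported lazily so the caching helper works without umap-learn.
    from umap import UMAP

    reducer = UMAP(
        n_neighbors=15,      # illustrative values, adjust to your data
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,   # the fixed seed is what makes runs repeatable
    )
    return reducer.fit_transform(embeddings)
```

With document embeddings in an array `X`, something like `load_or_compute("reduced.npy", lambda: reduce_with_umap(X))` would run UMAP once and reuse the saved result on every later run, so HDBSCAN always sees identical input.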
-
As @carobs9 already mentioned, HDBSCAN is reproducible as long as you also fix the UMAP random_state. You can find more information about that here: https://maartengr.github.io/BERTopic/faq.html#why-are-the-results-not-consistent-between-runs
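For reference, fixing the UMAP `random_state` in BERTopic could look roughly like this. It is a sketch, assuming `bertopic` and `umap-learn` are installed; the function name `build_reproducible_model`, the seed, and the UMAP parameter values are illustrative assumptions:

```python
def build_reproducible_model(seed=42):
    """Build a BERTopic model whose UMAP step uses a fixed seed."""
    # Imported inside the function so the module loads without these packages.
    from bertopic import BERTopic
    from umap import UMAP

    umap_model = UMAP(
        n_neighbors=15,      # illustrative values, adjust to your data
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,   # fixed seed for repeatable dimensionality reduction
    )
    # Pass the seeded UMAP instance in place of the default one.
    return BERTopic(umap_model=umap_model)
```

Calling `build_reproducible_model().fit_transform(docs)` on the same documents with the same parameters should then give the same topics from run to run. Note that a fixed seed can make UMAP slower, since it limits parallelism.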
-
Hi Maarten,
We are doing a new project using BERTopic, which is really an awesome tool!
But I noticed that the reproducibility of BERTopic might be a problem.
For this new project, we ran topic modeling with BERTopic multiple times on more than 3000 abstracts. The first ten runs each produced 4 topics, including one outlier topic. The 11th run, however, produced 24 topics, even though all UMAP and HDBSCAN parameters were the same. I then adjusted the HDBSCAN parameter "min_cluster_size" and got 4 topics again.
I have no idea why this happened. Would you mind giving some hints? Thank you :)
Best,
Li