Replies: 1 comment
It's been a while since I touched that particular code, but I believe it was supposed to then create a new topic rather than "disappear" into noise. In other words, this indeed seems like a bug that should be fixed!
Hi!
While experimenting with BERTopic.merge_models, I noticed something that might not align conceptually with how the noise cluster (-1) is usually interpreted.
When comparing topic embeddings across models, the noise embedding from the baseline model is treated just like any other topic embedding. This means that, if a topic from model 2 is most similar to the noise embedding in model 1 (and similarity ≥ min_similarity), then that whole cluster from model 2 is re-assigned to noise in the merged model.
As a result, real topics from model 2 can “disappear” into noise.
Noise (-1) in HDBSCAN clustering is usually interpreted as “documents that do not belong to any cluster”. It’s not really a topic. So allowing an entire valid cluster from another model to be merged into noise seems counterintuitive.
Was the current behavior (merging into noise) intentional, or would it make more sense to exclude noise from the similarity matching?
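To make the question concrete, here is a minimal NumPy sketch of the matching step as I understand it. It is not the library's actual implementation: the function name match_topics, the exclude_noise flag, the 0.7 threshold, and the toy 2-D embeddings are all assumptions made up for illustration, and it only takes the non-noise topics of the second model as input.

```python
import numpy as np


def match_topics(base_ids, base_emb, new_ids, new_emb,
                 min_similarity=0.7, exclude_noise=False):
    """Map each topic of the second model to a baseline topic or a new topic ID.

    Hypothetical helper for illustration only, not part of BERTopic.
    """
    base_ids = np.asarray(base_ids)
    base_emb = np.asarray(base_emb, dtype=float)
    new_emb = np.asarray(new_emb, dtype=float)

    # Cosine similarity between every new topic and every baseline topic.
    a = new_emb / np.linalg.norm(new_emb, axis=1, keepdims=True)
    b = base_emb / np.linalg.norm(base_emb, axis=1, keepdims=True)
    sim = a @ b.T

    if exclude_noise:
        # Proposed change: the noise "topic" (-1) can never be a match target.
        sim[:, base_ids == -1] = -np.inf

    mapping = {}
    next_id = int(base_ids.max()) + 1
    for row, topic in enumerate(new_ids):
        best = int(np.argmax(sim[row]))
        if sim[row, best] >= min_similarity:
            # Similar enough: fold this topic into the baseline topic.
            mapping[topic] = int(base_ids[best])
        else:
            # Not similar to anything: keep it as a brand-new topic.
            mapping[topic] = next_id
            next_id += 1
    return mapping


# Toy data: baseline model has noise (-1) plus topics 0 and 1.
base_ids = [-1, 0, 1]
base_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
# Second model: topic 0 happens to sit close to the baseline noise embedding.
new_ids = [0, 1]
new_emb = [[0.9, 0.1], [0.1, 0.9]]

current = match_topics(base_ids, base_emb, new_ids, new_emb)
proposed = match_topics(base_ids, base_emb, new_ids, new_emb, exclude_noise=True)
print(current)   # topic 0 of model 2 is absorbed into noise
print(proposed)  # topic 0 of model 2 survives as a new topic
```

In this toy case, masking the noise column is enough for the second model's topic to be kept as a new topic instead of being absorbed into -1, while the topic that genuinely matches baseline topic 0 is still merged as before.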
Thanks a lot for the amazing library! Just wanted to clarify this detail because it can have a big effect when merging multiple models.