I am trying to use BERTopic in parallel with KeyBERT to analyze academic articles that focus on a single concept across 3 disciplinary domains. I am having trouble judging whether my BERTopic results are adequate, given extremely low "coherence_c_npmi" scores (both negative) and a high number of outliers (nearly one-third of documents for a single academic paper, and nearly one-half for a test corpus of 20+ articles from a single domain). I have read a handful of Q&A entries that deal with both coherence scores and outliers, but I haven't seen anyone ask about these issues in relation to small datasets.
Overall I am wondering:
(a) Whether I am tuning the hyperparameters correctly for a small dataset focused on a single concept
(b) Whether I am using the BERTopic functions best suited to my project's requirements, or whether the project is even feasible with BERTopic given its small data size
My next step will be to use additional functions like reduce_outliers(), merge_topics()/reduce_topics(), KeyBERTInspired, MaximalMarginalRelevance, etc. as 'refinement' steps (sketched below), but I am wondering whether I need to establish a better baseline first, or whether these are recommended during the initial modeling stage. I dabbled with them before settling on the baseline hyperparameters listed below and didn't see a major difference on the initial pass; I also felt a bit lost about which would be best when applied as secondary, refinement passes, which likewise didn't produce a major shift.
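(Concretely, the refinement pass I have in mind looks something like this. It is a sketch, not code I have committed to, and it assumes a fitted model from the baseline setup further down.)

```python
# Assumes topic_model, docs, and topics from the fitted baseline below.
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

# Reassign outlier documents (topic -1) to their nearest topic via c-TF-IDF
# similarity, then rebuild the topic representations with the new assignments.
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
topic_model.update_topics(docs, topics=new_topics)

# Collapse similar topics, either automatically or down to a fixed number.
topic_model.reduce_topics(docs, nr_topics="auto")

# Re-rank topic keywords with a different representation model...
topic_model.update_topics(docs, representation_model=KeyBERTInspired())
# ...or diversify them with MMR instead (diversity ranges from 0 to 1).
topic_model.update_topics(
    docs, representation_model=MaximalMarginalRelevance(diversity=0.3)
)
```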
Are there essential aspects of BERTopic that I am missing that could help with the extremely low "coherence_c_npmi" scores and the high number of outliers? Or is it simply not possible to use BERTopic on a single academic paper or a small corpus?
Here are the hyperparameters, OCTIS scores, and outlier counts:
Single Paper:
Hyperparameters:
embedder = SentenceTransformer("all-mpnet-base-v2")
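(Only the embedder is pinned above; for context, the model is constructed roughly as follows, with everything else left at BERTopic defaults. The settings shown are illustrative, not tuned values.)

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# docs: the paper split into sentence/paragraph chunks. BERTopic clusters
# documents, so a single paper has to be chunked to give it anything to group.
embedder = SentenceTransformer("all-mpnet-base-v2")
topic_model = BERTopic(embedding_model=embedder, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)
```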
OCTIS Scores:
"coherence_c_npmi": -0.15810996491444776,
"coherence_c_v": 0.5184086664210065,
"topic_diversity": 0.8857142857142857
Outliers:
Topic Count
-1 122
0 95
1 45
2 32
3 22
4 21
5 16
6 14
7 14
8 13
9 12
10 10
11 9
12 7
13 5
14 5
Corpus of 20+ academic articles:
Hyperparameters:
embedder = SentenceTransformer("all-mpnet-base-v2")
OCTIS Scores:
"coherence_c_npmi": -0.07885831072341212,
"coherence_c_v": 0.5023343410726403,
"topic_diversity": 0.6077694235588973
Outliers:
*For brevity I have cut out the middle; the counts decline fairly smoothly from Topic 15 to Topic 100 without big jumps.
Topic Count
-1 5877
0 570
1 535
2 358
3 238
4 224
5 202
6 182
7 149
8 147
9 130
10 111
11 108
12 106
13 100
14 97
15 96
...
100 22
101 22
102 22
103 22
104 22
105 22
106 22
107 21
108 21
109 21
110 21
111 20
112 20
113 20
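(For reference, the scores and counts above are computed along these lines; this is a sketch with illustrative tokenization, not the exact evaluation code.)

```python
# Assumes a fitted topic_model and the same docs used to fit it.
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

# OCTIS expects a dict with a "topics" key: one list of top words per topic,
# excluding the outlier topic -1.
model_output = {
    "topics": [
        [word for word, _ in topic_model.get_topic(t)]
        for t in topic_model.get_topics()
        if t != -1
    ]
}
tokenized = [doc.split() for doc in docs]  # illustrative tokenization

npmi = Coherence(texts=tokenized, topk=10, measure="c_npmi").score(model_output)
c_v = Coherence(texts=tokenized, topk=10, measure="c_v").score(model_output)
diversity = TopicDiversity(topk=10).score(model_output)

# Outlier share: fraction of documents assigned to topic -1.
info = topic_model.get_topic_info()
outlier_share = info.loc[info.Topic == -1, "Count"].sum() / info["Count"].sum()
```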