PersEval is a library for the evaluation of perspectivist classification models, developed to streamline the current fragmented perspectivist evaluation practices with a new unified framework.
The framework is presented in the paper PersEval: A Framework for Perspectivist Classification Evaluation, accepted at the main track of EMNLP 2025.
Data perspectivism goes beyond majority vote label aggregation by recognizing various perspectives as legitimate ground truths. However, current evaluation practices remain fragmented, making it difficult to compare perspectivist approaches and analyze their impact on different users and demographic subgroups. To address this gap, we introduce PersEval, the first unified framework for evaluating perspectivist models in NLP. A key innovation is its evaluation at the individual annotator level and its treatment of annotators and users as distinct entities, consistently with real-world scenarios. We demonstrate PersEval's capabilities through experiments with both Encoder-based and Decoder-based approaches, as well as an analysis of the effect of sociodemographic prompting. By considering global, text-, trait- and user-level evaluation metrics, we show that PersEval is a powerful tool for examining how models are influenced by user-specific information and identifying the biases this information may introduce.
Users can be represented in two ways:
- Unnamed classification: by their identifier only.
- Named classification: by a set of explicit metadata (e.g., traits).
To enable named classification, use the --named flag when running the script:
python main.py --named
If the flag is omitted, the script defaults to unnamed classification.
Mirroring real-world scenarios, we assume that the annotators who provided the bulk of the annotations to train the system, and the users for which the system is tested, are disjoint.
When explicit knowledge about the users is available (e.g., metadata), a model can attempt to learn biases toward such characteristics without knowing any preferences of the unknown users.
python main.py --named --adaptation false
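As a rough illustration of the disjointness assumption, the sketch below separates a toy annotation list into training annotators and held-out test users. The function name `split_annotators_and_users`, the tuple layout `(annotator_id, text_id, label)`, and the `test_fraction` parameter are assumptions made for this example and are not taken from the library.

```python
import random

def split_annotators_and_users(annotations, test_fraction=0.2, seed=0):
    """Split rows so that training annotators and test users never overlap.

    `annotations` is a list of (annotator_id, text_id, label) tuples
    (a hypothetical layout chosen for this sketch).
    """
    ids = sorted({annotator for annotator, _, _ in annotations})
    rng = random.Random(seed)
    rng.shuffle(ids)
    # Hold out at least one annotator to play the role of a test user.
    n_test = max(1, round(len(ids) * test_fraction))
    held_out = set(ids[:n_test])
    train = [row for row in annotations if row[0] not in held_out]
    test = [row for row in annotations if row[0] in held_out]
    return train, test

annotations = [
    ("a1", "t1", 1), ("a1", "t2", 0),
    ("a2", "t1", 0), ("a2", "t3", 1),
    ("a3", "t2", 1), ("a3", "t3", 0),
    ("a4", "t1", 1), ("a4", "t4", 0),
    ("a5", "t4", 1), ("a5", "t2", 0),
]
train, test = split_annotators_and_users(annotations)
train_annotators = {row[0] for row in train}
test_users = {row[0] for row in test}
```

This sketch only enforces disjoint users; it does not handle the separation of texts between splits.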
However, we also consider two adaptation scenarios in which a few labels from test users are available:
Adaptation at training time: we assume minimal annotation from users has been obtained before training the system. A few annotations from test users are thus included in the training split.
python main.py --adaptation train
python main.py --named --adaptation train
Adaptation at inference time: we assume an already trained system has to be adapted to new users. A few instances from test users can thus be used to adapt an existing model.
python main.py --adaptation test
python main.py --named --adaptation test
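The adaptation scenarios can be pictured as different ways of routing a handful of test-user annotations. The following is a minimal sketch under stated assumptions: `apply_adaptation`, the `user_pool` argument (annotations from test users that are not part of the test set), and the per-user budget `k` are hypothetical names; only the mode values `false`, `train` and `test` mirror the command-line options shown in this README.

```python
from collections import defaultdict

def apply_adaptation(train, user_pool, mode="false", k=2):
    """Return (training_split, adaptation_set) for one adaptation mode.

    Rows are (user_id, text_id, label) tuples; `user_pool` holds annotations
    from test users that are NOT in the test set (an assumption of this sketch).
    """
    if mode not in {"false", "train", "test"}:
        raise ValueError(f"unknown adaptation mode: {mode}")
    if mode == "false":
        return list(train), []
    # Keep at most k annotations per test user.
    taken = defaultdict(int)
    few_shots = []
    for row in user_pool:
        user = row[0]
        if taken[user] < k:
            few_shots.append(row)
            taken[user] += 1
    if mode == "train":
        # Training-time adaptation: the few labels join the training split.
        return list(train) + few_shots, []
    # Inference-time adaptation: the few labels stay in a separate set.
    return list(train), few_shots

train = [("a1", "t1", 1), ("a2", "t1", 0)]
user_pool = [("u1", "t5", 1), ("u1", "t6", 0), ("u1", "t7", 1), ("u2", "t5", 0)]
train_split, adaptation_set = apply_adaptation(train, user_pool, mode="train")
```

In both modes the test set itself is left untouched; only the training split or the separate adaptation set changes.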
In the default setting, textual examples in the test split are disjoint from those in the training split.
We also provide a variant in which texts that also appear in test instances, but annotated by different users, can be included in the training split.
python main.py --extended
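The difference between the default and the extended variant can be sketched as a filter over candidate training rows. The function `build_training_split` and its arguments are hypothetical and only illustrate the idea; the library's own splitting code may be organised differently.

```python
def build_training_split(annotations, test_users, test_texts, extended=False):
    """Select training rows from (annotator_id, text_id, label) tuples."""
    train = []
    for annotator, text, label in annotations:
        if annotator in test_users:
            continue  # annotations by test users are never used here
        if text in test_texts and not extended:
            continue  # default: training texts are disjoint from test texts
        train.append((annotator, text, label))
    return train

annotations = [
    ("a1", "t1", 1), ("a1", "t2", 0),
    ("a2", "t2", 1),
    ("u1", "t2", 1), ("u1", "t3", 0),
]
test_users = {"u1"}
test_texts = {"t2", "t3"}
default_train = build_training_split(annotations, test_users, test_texts)
extended_train = build_training_split(annotations, test_users, test_texts, extended=True)
```

With `extended=True`, the rows for text `t2` written by `a1` and `a2` are kept, because they come from annotators other than the test user.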
The task variants differ in their training splits (and, when adapting at inference time, in an additional adaptation set), while the test set depends only on the chosen dataset and remains the same across all task variants.
The library comes with five datasets.
| Dataset | Reference | Task | #Annotators | Metadata |
|---|---|---|---|---|
| Brexit | Akhtar et al. 2021 | Abusive language | 6 | Target and control group |
| EPIC | Frenda et al. 2023 | Irony | 74 | Gender, Nationality, Age/Generation |
| MHS | Sachdeva et al. 2022 | Hate Speech | 7,912 | Gender, Age/Generation, Education, Income, Ideology |
| MD-Agreement | Leonardelli et al. 2022 | Offensiveness | 819 | --- |
| DICES | Aroyo et al. 2024 | AI Safety | 123 | Gender, Age/Generation, Education, Ethnicity |
BREXIT, EPIC and MHS are loaded directly from Hugging Face, while DICES and MD-Agreement are loaded from the "./data" folder.
Other than the dataset name, it is important to also specify the corresponding label.
python main.py --dataset-name DICES --label Q2_harmful_content_overall
Options for labels are listed below, as well as in the main.py file:
# EPIC -> ["irony"]
# BREXIT -> ["hs", "offensiveness", "aggressiveness", "stereotype"]
# DICES -> ["Q2_harmful_content_overall"]
# MHS -> ["hateful"]
# MD -> ["offensiveness"]
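Since each dataset only accepts certain labels, a small check before launching a run can catch mismatches early. The mapping below copies the options listed above; the helper `check_label` is a hypothetical convenience for this sketch and is not part of main.py.

```python
# Label options per dataset, copied from the list above.
LABELS = {
    "EPIC": ["irony"],
    "BREXIT": ["hs", "offensiveness", "aggressiveness", "stereotype"],
    "DICES": ["Q2_harmful_content_overall"],
    "MHS": ["hateful"],
    "MD": ["offensiveness"],
}

def check_label(dataset_name, label):
    """Raise ValueError if `label` is not listed for `dataset_name`."""
    if dataset_name not in LABELS:
        raise ValueError(f"unknown dataset: {dataset_name}")
    if label not in LABELS[dataset_name]:
        raise ValueError(
            f"label {label!r} is not available for {dataset_name}; "
            f"choose one of {LABELS[dataset_name]}"
        )
    return True

ok = check_label("DICES", "Q2_harmful_content_overall")
```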
We require models to output a label for each <user, text> tuple. Starting from these predictions, we compute standard classification metrics: precision, recall and F1 scores for each class, as well as their macro- and micro-averages. The same metrics are also computed at more fine-grained levels:
- annotator-level: computed individually for each annotator and then averaged;
- text-level: computed individually for each text in the test set and then averaged;
- trait-level: computed for each trait and then averaged for each dimension.
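To make the levels above concrete, the sketch below computes a macro-averaged F1 globally, then per annotator and per text before averaging. It is a plain-Python illustration rather than the library's implementation: the row layout `(user_id, text_id, gold, pred)`, the function names, and the convention that a class with no gold instances and no predictions scores 0 are all assumptions.

```python
from collections import defaultdict

def macro_f1(gold, pred, labels=(0, 1)):
    """Macro-averaged F1; a class with no support and no predictions scores 0."""
    scores = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def grouped_macro_f1(rows, key_index):
    """Compute macro F1 inside each group, then average over groups.

    key_index=0 groups by user (annotator-level), key_index=1 by text.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    per_group = [
        macro_f1([r[2] for r in group], [r[3] for r in group])
        for group in groups.values()
    ]
    return sum(per_group) / len(per_group)

# Toy predictions: (user_id, text_id, gold, pred)
rows = [
    ("u1", "t1", 1, 1),
    ("u1", "t2", 0, 0),
    ("u2", "t1", 0, 1),
    ("u2", "t2", 0, 0),
]
global_f1 = macro_f1([r[2] for r in rows], [r[3] for r in rows])
annotator_f1 = grouped_macro_f1(rows, key_index=0)
text_f1 = grouped_macro_f1(rows, key_index=1)
```

On this toy input the three scores differ (about 0.73 globally, 0.67 at annotator level and 0.42 at text level), which shows why a single global number can hide per-user or per-text behaviour.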
We fine-tuned RoBERTa with a custom implementation of Focal Loss.
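For readers unfamiliar with it, focal loss (Lin et al., 2017) scales the cross-entropy of each example by a factor that shrinks as the model becomes more confident in the correct class. The sketch below shows the binary form for a single example in plain Python; the default values `gamma=2.0` and `alpha=0.25` are illustrative assumptions and are not necessarily the ones used in this repository.

```python
import math

def focal_loss(prob_positive, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    prob_positive: predicted probability of the positive class.
    label: gold label, 0 or 1.
    """
    # Probability assigned to the gold class, and its class weight.
    p_t = prob_positive if label == 1 else 1.0 - prob_positive
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)   # confident and correct: strongly down-weighted
hard = focal_loss(0.6, 1)   # less confident: contributes more to the loss
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the usual cross-entropy, which is a convenient sanity check.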
We added annotators' identifiers and their traits to the text embedding as special tokens. The model input thus concatenates the annotator id, a special token for each of the annotator's traits, and the input text to classify. The model is then trained with a classification head to predict the binary label.
python main.py --model-name roberta-base --type encoder
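The input construction described above can be sketched as simple string concatenation. The token formats `<ann_...>` and `<trait_value>` and the function `build_encoder_input` are hypothetical; the actual special-token vocabulary used by the library may look different.

```python
def build_encoder_input(text, annotator_id=None, traits=None):
    """Prefix the text with an annotator token and one token per trait."""
    parts = []
    if annotator_id is not None:
        parts.append(f"<ann_{annotator_id}>")
    # Sort traits by name so the token order is deterministic.
    for name, value in sorted((traits or {}).items()):
        parts.append(f"<{name}_{value}>")
    parts.append(text)
    return " ".join(parts)

named_input = build_encoder_input(
    "What a great day", annotator_id="17",
    traits={"gender": "female", "age": "genz"},
)
unnamed_input = build_encoder_input("What a great day", annotator_id="17")
```

With a Hugging Face tokenizer, such tokens would typically be registered with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, so that each one maps to its own trainable embedding.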
We chose open-source models of medium size: Mixtral-8 7B and Llama-3.1 8B, both instruction tuned. We considered three possible settings:
Base zero: we prompt the models to classify the test set examples, with no additional information.
python main.py --model-name meta-llama/Meta-Llama-3.1-8B-Instruct --type llm
Perspective: we ask the models to impersonate each user’s trait. We use this variant to test models without adaptation with a named user representation. We prompt the model for each available user trait.
python main.py --named --model-name meta-llama/Meta-Llama-3.1-8B-Instruct --type llm
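Prompting once per trait yields one prediction per trait, which then has to be combined into a single label for the user. The sketch below builds one prompt per trait and combines binary answers by majority vote. The prompt wording, the function names, and the choice of majority voting with a fixed tie-break are assumptions for illustration; the README does not specify how the library ensembles per-trait outputs.

```python
def build_trait_prompts(text, traits, question="Is this text ironic?"):
    """Return one prompt per trait, keyed by trait name."""
    prompts = {}
    for name, value in traits.items():
        prompts[name] = (
            f"Answer as a person whose {name} is {value}.\n"
            f"{question} Reply with 'yes' or 'no'.\n"
            f"Text: {text}"
        )
    return prompts

def majority_vote(per_trait_labels, tie_break=0):
    """Combine binary labels (0/1) predicted for each trait."""
    votes = list(per_trait_labels.values())
    positives = sum(votes)
    if 2 * positives > len(votes):
        return 1
    if 2 * positives < len(votes):
        return 0
    return tie_break

prompts = build_trait_prompts(
    "Lovely weather again.", {"gender": "female", "nationality": "Irish"}
)
label = majority_vote({"gender": 1, "age": 1, "nationality": 0})
```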
In-Prompt Augmentation: we reproduced Salemi et al. (2024)’s approach, prompting the model with user-specific input selected via retrieval augmentation. We used this approach both with information about the user’s trait value (Named with Adaptation-T) and without demographic information.
python main.py --model-name meta-llama/Meta-Llama-3.1-8B-Instruct --type LaMP --context
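The general idea of retrieval-based in-prompt augmentation can be sketched as follows: pick the user's past annotations that are most similar to the text to classify and place them in the prompt. This example uses simple word-overlap (Jaccard) similarity as a stand-in retriever; it is not the retriever used by Salemi et al. (2024) or by this library, and the function names and prompt layout are hypothetical.

```python
import re

def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve_user_examples(query, user_history, k=2):
    """Return the k (text, label) pairs most similar to `query`."""
    q = tokens(query)

    def score(item):
        t = tokens(item[0])
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(user_history, key=score, reverse=True)[:k]

def build_augmented_prompt(query, user_history, k=2, trait=None):
    """Assemble a prompt from retrieved user examples and the query."""
    lines = []
    if trait is not None:
        lines.append(f"The user's {trait[0]} is {trait[1]}.")
    for text, label in retrieve_user_examples(query, user_history, k):
        lines.append(f"Text: {text}\nLabel: {label}")
    lines.append(f"Text: {query}\nLabel:")
    return "\n\n".join(lines)

history = [
    ("I love rainy Mondays", 1),
    ("The train was on time", 0),
    ("Great, another rainy Monday meeting", 1),
]
retrieved = retrieve_user_examples("Another rainy Monday, great", history)
prompt = build_augmented_prompt(
    "Another rainy Monday, great", history, trait=("age", "genz")
)
```

Passing `trait=None` gives the variant without demographic information, while supplying a trait adds it as the first line of the prompt.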
The required dependencies were verified on the following system:
- Operating System: Ubuntu 22.04.3 LTS (Jammy)
- Kernel: Linux 5.15.0-113-generic x86_64
- CPU: AMD EPYC-Rome Processor — 24 cores, 1 thread per core
- RAM: 503 GiB
- GPU: 4× NVIDIA A40 (46 GiB each)
- Python: 3.10.12
Performance and compatibility may vary on different hardware or OS versions.
Representing users through their sociodemographic traits risks propagating stereotypes. We therefore evaluated the effect of sociodemographic prompting through a qualitative analysis based on the following questions:
- Q1: What is the contribution of each trait when ensembling the model’s outputs?
- Q2: Which demographic trait most significantly impacts the model’s label predictions in the presence of varying annotator characteristics?
- Q3: How similar is the distribution of models’ predictions to that of the annotators’ chosen labels?
Results are stored in the "./qualitative_analysis" folder.
Note that the current version of the library does not support multi-GPU settings.
The In-Prompt Augmentation strategy is part of a comprehensive evaluation framework for personalization with LLMs (LaMP). The full framework is available at this repository.
If you use the library, please cite:
@inproceedings{lo-etal-2025-perseval,
title = "{PERSEVAL}: A Framework for Perspectivist Classification Evaluation",
author = "Lo, Soda Marem and
Casola, Silvia and
Sezerer, Erhan and
Basile, Valerio and
Sansonetti, Franco and
Uva, Antonio and
Bernardi, Davide",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1137/",
pages = "22345--22370",
ISBN = "979-8-89176-332-6",
abstract = "Data perspectivism goes beyond majority vote label aggregation by recognizing various perspectives as legitimate ground truths. However, current evaluation practices remain fragmented, making it difficult to compare perspectivist approaches and analyze their impact on different users and demographic subgroups. To address this gap, we introduce PersEval, the first unified framework for evaluating perspectivist models in NLP. A key innovation is its evaluation at the individual annotator level and its treatment of annotators and users as distinct entities, consistently with real-world scenarios. We demonstrate PersEval{'}s capabilities through experiments with both Encoder-based and Decoder-based approaches, as well as an analysis of the effect of sociodemographic prompting. By considering global, text-, trait- and user-level evaluation metrics, we show that PersEval is a powerful tool for examining how models are influenced by user-specific information and identifying the biases this information may introduce."
}
