This repository contains the code for our work "Sequential Data Augmentation for Generative Recommendation" (arXiv), to be presented at WSDM 2026. It provides training scripts for generative recommendation models (e.g., SASRec [1], TIGER [2]) using GenPAS, along with data-analysis code for selecting appropriate hyperparameters.
We recommend using Python ≥ 3.8 with a virtual environment.
```shell
conda create -n genpas python=3.9
conda activate genpas
pip install -r requirements.txt
```

For the Amazon datasets (Beauty, Toys, and Sports), please use this Google Drive link.
For the MovieLens datasets (ML1M and ML20M), we provide preprocessing code.
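The core of that preprocessing is turning the raw MovieLens ratings file (lines of `UserID::MovieID::Rating::Timestamp` in ML1M) into per-user interaction sequences ordered by timestamp. A minimal sketch of that step, assuming the standard `::`-delimited format (the function name and the absence of any filtering are illustrative, not the repository's exact code):

```python
from collections import defaultdict

def build_sequences(lines):
    """Group ML1M rating lines into per-user item sequences, ordered by timestamp."""
    events = defaultdict(list)  # user id -> [(timestamp, item id), ...]
    for line in lines:
        user, item, _rating, ts = line.strip().split("::")
        events[user].append((int(ts), item))
    # Sort each user's events chronologically and keep only the item ids.
    return {u: [item for _, item in sorted(evs)] for u, evs in events.items()}

# Usage sketch:
# with open("data/ml1m/ratings.dat") as f:
#     sequences = build_sequences(f)
```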
All datasets should be stored under the data folder. Please prepare the data in the following format:
```
data/
├── beauty/
│   ├── training/    # training set
│   ├── evaluation/  # validation set
│   ├── testing/     # test set
│   └── sids         # semantic IDs for TIGER
├── toys/
├── sports/
├── ml1m/
└── ml20m/
```

To load sequences for a given split (training / evaluation / testing), refer to the Jupyter notebook read_data.ipynb.
```python
# Dataset and split
dataset = 'beauty'
data_type = 'training'  # 'evaluation' or 'testing'

# You can load sequences as follows:
sequences = read_data(dataset, data_type)

# Now `sequences` contains the list of user interaction sequences for the split.
print(len(sequences), 'sequences loaded')
```

To prepare the MovieLens-20M dataset for the TIGER experiments, please run data/ml20m/text_data_preparation.ipynb.
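If you prefer a plain-Python helper over the notebook, a `read_data` function might be sketched as follows. The one-sequence-per-line text format assumed here is illustrative; see read_data.ipynb for the repository's actual loader.

```python
from pathlib import Path

def read_data(dataset, data_type, root="data"):
    """Load interaction sequences for one split.

    Assumes one whitespace-separated item sequence per line in each file
    under <root>/<dataset>/<data_type>/ (an illustrative format).
    """
    split_dir = Path(root) / dataset / data_type
    sequences = []
    for path in sorted(p for p in split_dir.glob("*") if p.is_file()):
        with open(path) as f:
            sequences.extend(line.split() for line in f if line.strip())
    return sequences
```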
We provide training scripts for common generative recommendation baselines; see run.sh for the full set of commands.
```shell
python src/train.py trainer=ddp experiment=sasrec_train_10_00_00_beauty.yaml logger=csv
python src/train.py trainer=ddp experiment=tiger_train_10_00_10_beauty.yaml logger=csv
```

We provide scripts in data_analysis that measure how augmentation impacts the target and input-target distributions.
get_kl.py computes KL(p_valid || q_train) between the validation/test target distribution and the training target distribution.
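Conceptually, both are empirical distributions over target items, and the divergence is KL(p || q) = Σ_i p(i) log(p(i)/q(i)). A minimal sketch of that computation (the epsilon handling of items unseen in training is an illustrative choice, not necessarily the repository's):

```python
import math
from collections import Counter

def kl_divergence(valid_targets, train_targets, eps=1e-12):
    """KL(p || q) between empirical target-item distributions."""
    p_counts, q_counts = Counter(valid_targets), Counter(train_targets)
    p_total, q_total = sum(p_counts.values()), sum(q_counts.values())
    kl = 0.0
    for item, count in p_counts.items():
        p = count / p_total
        q = q_counts.get(item, 0) / q_total
        # Clamp q to eps so items unseen in training don't divide by zero.
        kl += p * math.log(p / max(q, eps))
    return kl
```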
```shell
python get_kl.py --dataset beauty --alpha 1.0 --beta 0.0
```

get_align_disc.py computes alignment and discrimination between the validation/test input-target distribution and the training input-target distribution.
```shell
python get_align_disc.py --dataset beauty --alpha 1.0 --beta 0.0 --gamma 0.0
```

For large datasets with longer sequences (e.g., ML1M and ML20M), please use the sampling-based variant:
```shell
python get_align_disc_sample.py --dataset ml1m --alpha 1.0 --beta 0.0 --gamma 0.0
```

- Built with PyTorch and PyTorch Lightning
- Configuration management by Hydra
- Part of this repo is built on top of https://github.com/ashleve/lightning-hydra-template and https://github.com/snap-research/GRID
[1] Kang, Wang-Cheng, and Julian McAuley. "Self-Attentive Sequential Recommendation." IEEE International Conference on Data Mining (ICDM), 2018.
[2] Rajput, Shashank, et al. "Recommender Systems with Generative Retrieval." Advances in Neural Information Processing Systems 36 (2023): 10299-10315.