snap-research/GenPAS


Sequential Data Augmentation for Generative Recommendation

This repository contains the code for our work "Sequential Data Augmentation for Generative Recommendation" (arXiv), to be presented at WSDM 2026. It provides training scripts for generative recommendation models (e.g., SASRec [1], TIGER [2]) with GenPAS, along with data-analysis code for selecting appropriate hyperparameters.


📦 Environment Setup

We recommend using Python ≥ 3.8 with a virtual environment.

conda create -n genpas python=3.9
conda activate genpas
pip install -r requirements.txt

📁 Datasets

For the Amazon datasets (Beauty, Toys, and Sports), please use this Google Drive link.

For the MovieLens datasets (ML1M and ML20M), we provide preprocessing code.

All datasets should be stored under the data folder. Please prepare the data in the following format:

data/
├── beauty/
│   ├── training/              # training set
│   ├── evaluation/            # validation set
│   ├── testing/               # test set
│   └── sids                   # semantic IDs for TIGER
├── toys/
├── sports/
├── ml1m/
└── ml20m/

To load sequences for a given split (training / evaluation / testing), refer to the Jupyter notebook read_data.ipynb.

# Dataset and split
dataset = 'beauty'
data_type = 'training'  # or 'evaluation' / 'testing'

# Load the interaction sequences for the chosen split
sequences = read_data(dataset, data_type)

# `sequences` is a list of user interaction sequences
print(len(sequences), 'sequences loaded')
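The read_data helper is defined in the notebook; as a rough sketch of what such a loader might look like, the function below reads one space-separated sequence of item IDs per line from each split directory. The file name sequences.txt and the per-line format are assumptions for illustration, not the repository's exact on-disk layout:

```python
import os

def read_data(dataset, data_type, root='data'):
    """Load user interaction sequences for a dataset split.

    Assumes each split directory (training / evaluation / testing)
    contains a plain-text file with one user's space-separated item
    IDs per line; the actual repository format may differ.
    """
    path = os.path.join(root, dataset, data_type, 'sequences.txt')
    sequences = []
    with open(path) as f:
        for line in f:
            items = [int(tok) for tok in line.split()]
            if items:  # skip blank lines
                sequences.append(items)
    return sequences
```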

To prepare the MovieLens-20M dataset for the TIGER experiments, please run data/ml20m/text_data_preparation.ipynb.


🚀 Training Models

We provide training scripts for common generative recommendation baselines. Please find the full script in run.sh.

Run SASRec

python src/train.py trainer=ddp experiment=sasrec_train_10_00_00_beauty.yaml logger=csv

Run TIGER

python src/train.py trainer=ddp experiment=tiger_train_10_00_10_beauty.yaml logger=csv

📊 Analysis Tools

We provide scripts that measure how augmentation impacts the target and input–target distributions in data_analysis.

KL Divergence of Target Distribution

Computes KL(p_valid || q_train) between the validation/test target distribution and the training target distribution.

python get_kl.py --dataset beauty --alpha 1.0 --beta 0.0
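The core of this measurement is a KL divergence between two empirical target-item distributions. A minimal sketch of that computation is below; the epsilon smoothing (so that items unseen in one split do not produce infinities) and the function name are assumptions, and get_kl.py may handle unseen items differently:

```python
import numpy as np

def kl_divergence(valid_targets, train_targets, num_items, eps=1e-12):
    """KL(p_valid || q_train) between target-item distributions.

    Each argument is a list of target item IDs, one per example.
    A small epsilon is added to every item's count before normalizing,
    so the divergence stays finite when supports differ.
    """
    p = np.bincount(valid_targets, minlength=num_items).astype(float)
    q = np.bincount(train_targets, minlength=num_items).astype(float)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```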

Alignment & Discrimination of Input–Target Distribution

Computes alignment and discrimination between the validation/test input–target distribution and the training input–target distribution.

python get_align_disc.py --dataset beauty --alpha 1.0 --beta 0.0 --gamma 0.0

For large datasets with longer sequences (e.g., ML1M and ML20M), please use the sampling-based variant:

python get_align_disc_sample.py --dataset ml1m --alpha 1.0 --beta 0.0 --gamma 0.0
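The precise alignment and discrimination metrics follow the paper's definitions. As a rough illustration of the kind of quantity involved (an illustrative proxy only, not what get_align_disc.py implements), the sketch below builds empirical distributions over (last input item, target item) pairs for two splits and measures their shared probability mass:

```python
from collections import Counter

def pair_distribution(sequences):
    """Empirical distribution over (last input item, target item) pairs,
    taking the final item of each sequence as the target."""
    counts = Counter((seq[-2], seq[-1]) for seq in sequences if len(seq) >= 2)
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

def overlap(p, q):
    """Shared probability mass of two distributions (1.0 = identical,
    0.0 = disjoint supports)."""
    return sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in set(p) | set(q))
```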


Bibliography

[1] Kang, Wang-Cheng, and Julian McAuley. "Self-Attentive Sequential Recommendation." IEEE International Conference on Data Mining (ICDM), 2018.

[2] Rajput, Shashank, et al. "Recommender Systems with Generative Retrieval." Advances in Neural Information Processing Systems 36 (2023): 10299–10315.
