A web-app for the exploration of the embedding space of reagents used in reaction data. Described in our paper Curating Reagents in Chemical Reaction Data with an Interactive Reagent Space Map.
The app is a visual way of exploring the co-occurrence statistics of reagents in reactions. The app displays UMAP projections of reagent embeddings derived by decomposing the PMI matrix of reagents with singular value decomposition.
A PMI matrix contains pointwise mutual information scores. For two reagents a and b, the PMI score is derived from reagent occurrence counts: it compares how often a and b occur in the same reaction with how often they would co-occur if they were used independently. Factorising this matrix using SVD yields dense embeddings for reagents. Two reagents tend to get similar embeddings if they are encountered in similar contexts, i.e. together with the same other reagents. For example, two different palladium catalysts for Suzuki coupling will not be used together in a reaction, but they may be used with the same bases and solvents. Therefore, those two catalysts will get similar embeddings and will lie close together. The embeddings are then projected onto the 2D plane and onto the surface of the unit sphere by UMAP, a dimensionality reduction algorithm that tries to preserve distance relations between the original points when projecting them to a lower-dimensional space. The map of UMAP projections of reagent embeddings is displayed in the app.
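To make the PMI step concrete, here is a minimal sketch that computes PMI(a, b) = log(p(a, b) / (p(a) p(b))) from per-reaction reagent lists, with probabilities estimated as counts divided by the number of reactions. This is the textbook definition; the exact variant used by build_embeddings.py (for example clipping or smoothing) is not described here, so treat the function name `pmi_scores` and the toy reagent labels as assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(reactions):
    """Return {(a, b): PMI} for every reagent pair that co-occurs at least once.

    Hypothetical helper, not part of the repository. Pair keys are sorted tuples.
    """
    n = len(reactions)
    single = Counter()  # number of reactions containing each reagent
    pair = Counter()    # number of reactions containing each reagent pair
    for reagents in reactions:
        unique = sorted(set(reagents))
        single.update(unique)
        pair.update(combinations(unique, 2))
    scores = {}
    for (a, b), n_ab in pair.items():
        # p(a, b) / (p(a) * p(b)) with each probability estimated as count / n
        scores[(a, b)] = math.log(n_ab * n / (single[a] * single[b]))
    return scores


# Toy data: the labels are illustrative names, not entries from USPTO.
reactions = [
    ["Pd(PPh3)4", "K2CO3", "dioxane"],
    ["Pd(dppf)Cl2", "K2CO3", "dioxane"],
    ["NaBH4", "MeOH"],
    ["NaBH4", "EtOH"],
]
scores = pmi_scores(reactions)
```

In this toy data the two palladium catalysts never co-occur, so they have no pair score of their own, yet both have positive PMI with the same base and solvent. That shared context is what the SVD embeddings are expected to pick up.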
This codebase uses uv, a faster alternative to pip and poetry, for dependency management. Install uv if it is not already installed:
curl -LsSf https://astral.sh/uv/install.sh | sh
Visit the uv installation guide for more information.
Run the following commands to install the environment for the app:
For development:
uv sync
pre-commit install
For production:
uv sync --no-dev
Activate the virtual environment:
source .venv/bin/activate
Run the app with the following command:
uv run gunicorn src.main:server -b 127.0.0.1:8050
The app will be running on http://localhost:8050. By default, it shows the map of embeddings of USPTO reagents determined by atom mapping (AAM), reading the information from data/default/uspto_aam_rgs_min_count_100_d_50.csv.
Users can also upload their own reagent data, prepared with the appropriate scripts in the way described below.
You can run the app in Docker using the provided Dockerfile:
docker build -t reagent-emb-vis .
docker run -p 8050:8050 reagent-emb-vis
The app will be available at http://localhost:8050.
Note: The Dockerfile includes system dependencies (X11 libraries) required for RDKit's molecular rendering functionality. These dependencies resolve import errors related to libXrender.so.1 or similar libraries.
The file data/standard_reagents.csv contains information about ~600 reagents that occur in USPTO, with their roles and names.
The entries in the file are ordered by occurrence frequency in descending order.
We download the USPTO dataset using rxnutils.
Warning: rxnutils may have to be installed in a separate virtual environment because it is incompatible with uv.
To install it in a separate environment, run the following commands:
python3 -m venv .venv
source .venv/bin/activate
pip install reaction-utils
Using the environment with rxnutils, execute the following commands from the data directory:
python -m rxnutils.data.uspto.download
python -m rxnutils.data.uspto.combine
This produces the file data/uspto_data.csv. Then, we do the initial filtering of this dataset with the following command executed from the project directory:
python3 -m rxnutils.pipeline.runner --pipeline uspto/pipeline.yml --data data/uspto_data.csv --output data/uspto_filtered.csv
Finally, we extract the reagents from the filtered dataset. Run the following command using the project's uv environment:
python3 scripts/prepare_reagents.py -i data/uspto_filtered.csv --output_dir uspto_aam_reagents -c ReactionSmiles --reagents aam --fragment_grouping cxsmiles --canonicalization remove_aam --n_jobs 9 --min_reagent_occurrences 1 --verbose
The script prepare_reagents.py has various options. For example, it can determine reagents either by atom mapping or by fingerprints.
The embeddings for reagents are calculated using the script build_embeddings.py based on a file with reagents that are used in their respective reactions.
Every row of the input file must contain the set of reagent SMILES for one reaction, with the SMILES separated by a separator, e.g. ;.
Example:
CCO;c1ccccc1
[H-].[Na+];C1CCOC1
NN
Every row in this file contains reagents for some reaction in the dataset of interest. The reactions themselves are not relevant.
The script prepare_reagents.py prepares a suitable input for build_embeddings.py.
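As an illustration of this input format, the sketch below splits each row on the separator and counts in how many rows each reagent appears, which is the kind of count a minimal-occurrence threshold can be applied to. It is not the parsing code of build_embeddings.py; the helper names `read_reagent_rows` and `frequent_reagents` are hypothetical, and the fourth example row is added here only so that some reagents occur twice.

```python
from collections import Counter


def read_reagent_rows(lines, sep=";"):
    """Split each non-empty line into a list of reagent SMILES (hypothetical helper)."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rows.append([smiles for smiles in line.split(sep) if smiles])
    return rows


def frequent_reagents(rows, min_count):
    """Keep reagents that appear in at least min_count rows (hypothetical helper)."""
    counts = Counter(smiles for row in rows for smiles in set(row))
    return {smiles: count for smiles, count in counts.items() if count >= min_count}


# The first three rows come from the example above; the last one is made up.
example = ["CCO;c1ccccc1", "[H-].[Na+];C1CCOC1", "NN", "CCO;NN"]
rows = read_reagent_rows(example)
kept = frequent_reagents(rows, min_count=2)
```

Note that the dot inside an entry such as [H-].[Na+] joins the fragments of a single reagent, so only the separator between entries is used for splitting.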
The app uses coordinates in a CSV file, which is prepared using the build_embeddings.py script.
Run the following command:
python3 scripts/build_embeddings.py -i <PATH TO THE TEXT FILE WITH REAGENT SMILES> --standard data/standard_reagents.csv --min_count <MINIMAL OCCURRENCE COUNT FOR REAGENTS TO BE CONSIDERED> -o <PATH TO THE OUTPUT CSV FILE> -d <DIMENSIONALITY OF REAGENT EMBEDDINGS>
For more information, run python3 scripts/build_embeddings.py --help.
The default reagent embeddings were built with the following command:
python3 scripts/build_embeddings.py -i data/uspto_aam_reagents/reagents-1128297.txt --standard data/standard_reagents.csv -d 50 -o data/uspto_aam_rgs_min_count_100_d_50.csv --min_count 100
To display your own data in the app, upload a CSV file built by the build_embeddings.py script.
For insights about reagents in USPTO and to reproduce the figures in the paper, follow the notebook notebooks/results.ipynb.
@inproceedings{andronov2024,
title={Curating Reagents in Chemical Reaction Data with an Interactive Reagent Space Map},
author={Andronov, Mikhail and Andronova, Natalia and Wand, Michael and Schmidhuber, J{\"u}rgen and Clevert, Djork-Arn{\'e}},
booktitle={International Workshop on AI in Drug Discovery},
pages={21--35},
year={2024},
publisher={Springer Nature Switzerland},
address={Cham},
doi={10.1007/978-3-031-72381-0_3}
}