Visualization of the reagent space

A web app for exploring the embedding space of reagents used in reaction data. It is described in our paper "Curating Reagents in Chemical Reaction Data with an Interactive Reagent Space Map".

The app is a visual way of exploring the co-occurrence statistics of reagents in reactions. The app displays UMAP projections of reagent embeddings derived by decomposing the PMI matrix of reagents with singular value decomposition.

A PMI matrix contains pointwise mutual information scores. For two reagents a and b, the PMI score is derived from how often each reagent occurs and how often the two occur together. Factorising this matrix with SVD yields dense embeddings for reagents. Two reagents tend to get similar embeddings if they are encountered in similar contexts, i.e. together with the same other reagents. For example, two different palladium catalysts for Suzuki coupling will not be used together in a reaction, but they may be used with the same bases and solvents. Therefore, those two catalysts will get similar embeddings and lie close together.

The embeddings are then projected onto the 2D plane and onto the surface of the unit sphere by UMAP, a dimensionality reduction algorithm that tries to preserve distance relations between the original points when projecting them to a lower-dimensional space. The app displays the resulting map of UMAP projections.
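The catalyst example can be checked on a toy dataset. The sketch below is not the code used by this repository: it assumes the textbook definition PMI(a, b) = log(p(a, b) / (p(a) p(b))), clips negative scores to zero (positive PMI), and compares raw PMI rows by cosine similarity instead of SVD embeddings. The reaction list and the function names `ppmi_rows` and `cosine` are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Toy reagent sets, one per reaction (made-up data, names instead of SMILES).
reactions = [
    {"Pd(PPh3)4", "K2CO3", "dioxane"},
    {"Pd(dppf)Cl2", "K2CO3", "dioxane"},
    {"Pd(PPh3)4", "Na2CO3", "toluene"},
    {"Pd(dppf)Cl2", "Na2CO3", "toluene"},
    {"NaBH4", "MeOH"},
    {"NaBH4", "EtOH"},
]


def ppmi_rows(reactions):
    """Return {reagent: row of positive PMI scores against every reagent}."""
    n = len(reactions)
    single = Counter(r for rxn in reactions for r in rxn)
    pair = Counter()
    for rxn in reactions:
        for a, b in combinations(sorted(rxn), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    vocab = sorted(single)
    rows = {}
    for a in vocab:
        row = []
        for b in vocab:
            c = pair[(a, b)]
            # PMI(a, b) = log(p(a, b) / (p(a) * p(b))), with probabilities
            # estimated as counts divided by the number of reactions n.
            pmi = math.log(c * n / (single[a] * single[b])) if c else 0.0
            row.append(max(pmi, 0.0))  # assumption: keep only positive PMI
        rows[a] = row
    return rows


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0


rows = ppmi_rows(reactions)
sim_catalysts = cosine(rows["Pd(PPh3)4"], rows["Pd(dppf)Cl2"])
sim_unrelated = cosine(rows["Pd(PPh3)4"], rows["NaBH4"])
print(f"catalyst vs catalyst: {sim_catalysts:.2f}")
print(f"catalyst vs NaBH4:    {sim_unrelated:.2f}")
```

The two catalysts never share a reaction here, yet their rows coincide because they co-occur with the same bases and solvents. The catalyst and NaBH4 share no context, so their similarity is zero.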

Prerequisites

This codebase uses uv, a faster alternative to pip and poetry, for dependency management. Install uv if it is not already installed:

curl -LsSf https://astral.sh/uv/install.sh | sh

Visit the uv installation guide for more information.

Environment installation

Run the following commands to install the environment for the app:

For development:

uv sync
pre-commit install

For production:

uv sync --no-dev

Activate virtual environment:

source .venv/bin/activate

App usage

Run the app with the following command:

uv run gunicorn src.main:server -b 127.0.0.1:8050

The app will be running on http://localhost:8050. By default, it shows the map of USPTO reagent embeddings, with reagents determined by atom-to-atom mapping (AAM), reading the information from data/default/uspto_aam_rgs_min_count_100_d_50.csv. Users can also upload their own reagent data, prepared with the scripts described below.

Running in docker

You can run the app in docker using the provided Dockerfile.

Build the Docker image

docker build -t reagent-emb-vis .

Run the Docker container

docker run -p 8050:8050 reagent-emb-vis

The app will be available at http://localhost:8050.

Note: The Dockerfile includes system dependencies (X11 libraries) required for RDKit's molecular rendering functionality. Without them, you may encounter import errors related to libXrender.so.1 or similar libraries.

Standard USPTO reagents

The file data/standard_reagents.csv contains information about ~600 reagents that occur in USPTO, with their roles and names. The entries in the file are ordered by occurrence frequency, in descending order.

Dataset

We download the USPTO dataset using rxnutils.
Warning: rxnutils may have to be installed in a separate virtual environment because it is incompatible with uv. To install it in a separate environment, run the following commands:

python3 -m venv .venv
source .venv/bin/activate
pip install reaction-utils

Using the environment with rxnutils, execute the following commands from the data directory:

python -m rxnutils.data.uspto.download

python -m rxnutils.data.uspto.combine

These commands produce the file data/uspto_data.csv. Then, we do the initial filtering of this dataset with the following command, executed from the project directory:

python3 -m rxnutils.pipeline.runner --pipeline uspto/pipeline.yml --data data/uspto_data.csv --output data/uspto_filtered.csv

Finally, we extract the reagents from the filtered dataset. Run the following command using the project's uv environment:

python3 scripts/prepare_reagents.py -i data/uspto_filtered.csv --output_dir uspto_aam_reagents -c ReactionSmiles --reagents aam --fragment_grouping cxsmiles --canonicalization remove_aam --n_jobs 9 --min_reagent_occurrences 1 --verbose

The script prepare_reagents.py has various options. For example, it can determine reagents either by atom mapping or by fingerprints.

Reagent embeddings preparation

The embeddings for reagents are calculated by the script build_embeddings.py from a file listing the reagents used in each reaction. Every row of the input file must contain the set of reagent SMILES for one reaction, with the SMILES separated by a separator, e.g. ;. Example:

CCO;c1ccccc1
[H-].[Na+];C1CCOC1
NN

Every row in this file contains reagents for some reaction in the dataset of interest. The reactions themselves are not relevant. The script prepare_reagents.py prepares a suitable input for build_embeddings.py.
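To make the input format concrete, here is a minimal sketch of how such a file could be read and filtered by occurrence count. It is not the implementation of build_embeddings.py. The helper names `read_reagent_sets` and `frequent_reagents` are hypothetical, and the filter only mimics what the --min_count option is described to do.

```python
from collections import Counter


def read_reagent_sets(lines, sep=";"):
    """Turn rows of separator-joined SMILES into lists of reagent SMILES."""
    sets = []
    for line in lines:
        line = line.strip()
        if line:  # skip empty rows
            sets.append([s for s in line.split(sep) if s])
    return sets


def frequent_reagents(reagent_sets, min_count):
    """Keep reagents that occur in at least `min_count` rows."""
    counts = Counter(s for row in reagent_sets for s in set(row))
    return {s: c for s, c in counts.items() if c >= min_count}


# The three example rows from above; a real file would be read with open().
example = ["CCO;c1ccccc1", "[H-].[Na+];C1CCOC1", "NN"]
reagent_sets = read_reagent_sets(example)
kept = frequent_reagents(reagent_sets, min_count=1)
print(reagent_sets)
print(sorted(kept))
```

Only the separator splits a row, so a multi-fragment SMILES such as [H-].[Na+] (sodium hydride) stays a single reagent.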

The app uses coordinates in a CSV file, which is prepared using the build_embeddings.py script.

Run the following command:

python3 scripts/build_embeddings.py -i <PATH TO THE TEXT FILE WITH REAGENT SMILES> --standard data/standard_reagents.csv --min_count <MINIMAL OCCURRENCE COUNT FOR REAGENTS TO BE CONSIDERED> -o <PATH TO THE OUTPUT CSV FILE> -d <DIMENSIONALITY OF REAGENT EMBEDDINGS>

For more information, run python3 scripts/build_embeddings.py --help.

The default reagent embeddings were built with the following command:

python3 scripts/build_embeddings.py -i data/uspto_aam_reagents/reagents-1128297.txt --standard data/standard_reagents.csv -d 50 -o data/uspto_aam_rgs_min_count_100_d_50.csv --min_count 100

To display your own reagent map in the app, upload a CSV file built by the build_embeddings.py script.

Reports

For insights about reagents in USPTO, and to reproduce the figures in the paper, please follow the notebook notebooks/results.ipynb.

Citation

@inproceedings{andronov2024,
  title={Curating Reagents in Chemical Reaction Data with an Interactive Reagent Space Map},
  author={Andronov, Mikhail and Andronova, Natalia and Wand, Michael and Schmidhuber, J{\"u}rgen and Clevert, Djork-Arn{\'e}},
  booktitle={International Workshop on AI in Drug Discovery},
  pages={21--35},
  year={2024},
  publisher={Springer Nature Switzerland},
  address={Cham},
  doi={10.1007/978-3-031-72381-0_3}
}
