PanCORE — Pan-Chemical Omniscale Representation Engine

A Transformer-based seq2seq framework that maps chemical structures (SMILES) into a continuous, meaningful latent space.
The name reflects the project's goal of covering a broad chemical space across multiple molecular scales.

Key capabilities:

Broad Chemical Space — Trained on large-scale datasets covering diverse low-to-medium-sized molecules.
Latent Representation — Encodes SMILES strings into fixed-length latent vectors for downstream tasks (property prediction, similarity search, conditional generation).
Generative Capability — Robust decoder for SMILES reconstruction/canonicalization and novel molecule generation from the latent manifold.
Curriculum Training — Supports normal, phased, and dynamic bucket-based curriculum strategies with distributed training (DDP).

Note

This repository is under construction and will be officially released by the Mizuno group.
Please contact tadahaya[at]gmail.com before publishing any results using this repository.

Authors

Zehao Li — main contributor
Tadahaya Mizuno — corresponding author

Repository Structure

pan-core/
├── sample/
│   └── config.yml            # Example configuration file
├── src/pan_core/
│   ├── main.py               # CLI entry point and library functions
│   ├── train.py              # HF Trainer subclass (HF_Trainer)
│   ├── config.py             # ModelConfig + TrainArgs (YAML → dataclasses)
│   ├── common/
│   │   ├── tokenizer.py      # SMILESTokenizer (HuggingFace-compatible)
│   │   ├── data_handler.py   # Datasets, samplers, and preprocessing utilities
│   │   └── utils.py          # Shared utilities (KLLoss, move_to_cpu, etc.)
│   └── models/
│       └── transformer/
│           ├── model.py      # PanCoreTransformerModel
│           └── layer.py      # Attention, TransformerBlock, SwiGLU, etc.

Installation

pip install -e .

Data Preprocessing

PanCORE expects SMILES data in one of two formats at training time:

Mode	Format	Description
`on_the_fly`	CSV or Parquet	Raw SMILES tokenized at access time; data held as Arrow buffers (CoW-safe across DataLoader workers)
`memmap`	`.npy` memmap	Pre-tokenized int16 arrays, fastest for large datasets

on_the_fly memory note: SMILES strings are stored as Polars Series (Apache Arrow buffers) rather than Python lists. Arrow buffers are read-only and shared across all fork-based DataLoader workers without Copy-on-Write page duplication — keeping worker memory overhead near zero regardless of dataset size.

If preprocessing_fn is provided to OnTheFlyDataset, it is applied per item at access time (not at load time). For expensive transforms such as RDKit canonicalization, prefer pre-processing the file once with raw2randomORcanonical and passing the result directly.

Step 1 — Prepare a SMILES file

The source file must contain at least one SMILES column.
Accepted formats: .csv, .csv.gz, and .parquet.
For seq2seq training (random → canonical), two columns are expected:

random,canonical
C1=CC=CC=C1,c1ccccc1
...

The column names are configurable via csv_column_name in config.yml.
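
As a quick sanity check before training, the header of a prepared file can be inspected for the two expected columns. The following is a minimal sketch using only the standard library; `check_smiles_columns` is a hypothetical helper and not part of pan_core, and the column names are the defaults shown above.

```python
import csv
import io

def check_smiles_columns(text, required=("random", "canonical")):
    """Return the required column names that are missing from a CSV header."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    return [name for name in required if name not in header]

# A two-line stand-in for a prepared SMILES file.
sample = "random,canonical\nC1=CC=CC=C1,c1ccccc1\n"
missing = check_smiles_columns(sample)
print(missing)  # an empty list means both columns are present
```

If csv_column_name is changed in config.yml, the `required` tuple would need to be changed to match.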

Step 2 — Generate Random/Canonical SMILES Pairs

Use raw2randomORcanonical from data_handler.py.
The output format is determined by output_path: use .parquet for Parquet, .csv.gz for gzip-compressed CSV, or .csv for plain CSV.

from pan_core.common.data_handler import raw2randomORcanonical
import pandas as pd

smiles = pd.read_csv("molecules.csv")["smiles"]

# Parquet output (recommended for large datasets)
raw2randomORcanonical(
    allsmiles=smiles,
    randomize=True,
    canonicalize=True,
    output_path="data/processed.parquet",
    max_workers=8
)

# Gzip CSV output
raw2randomORcanonical(
    allsmiles=smiles,
    randomize=True,
    canonicalize=True,
    output_path="data/processed.csv.gz",
    max_workers=8
)

For data augmentation (multiple random SMILES per molecule), the output format is inferred from the input extension (.parquet → _augmented.parquet, .csv.gz → _augmented.csv.gz, etc.):

from pan_core.common.data_handler import RandomAugmentation

RandomAugmentation(
    csvpath="data/processed.parquet",
    num_random=10,
    max_workers=8
)
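
The extension-based naming described above can be sketched as follows. This is an illustration only: `augmented_path` is a hypothetical helper, and the assumption that the `_augmented` suffix is inserted directly before the extension in the same directory is inferred from the examples in the text, not taken from the pan_core source.

```python
from pathlib import PurePosixPath

# Extensions listed in the text; ".csv.gz" is a double extension and is matched as a whole.
KNOWN_SUFFIXES = (".csv.gz", ".parquet", ".csv")

def augmented_path(input_path):
    """Derive an output path by inserting "_augmented" before the extension."""
    path = PurePosixPath(input_path)
    for suffix in KNOWN_SUFFIXES:
        if path.name.endswith(suffix):
            stem = path.name[: -len(suffix)]
            return str(path.with_name(f"{stem}_augmented{suffix}"))
    raise ValueError(f"unsupported extension: {input_path}")

print(augmented_path("data/processed.parquet"))
print(augmented_path("data/processed.csv.gz"))
```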

Step 3 — Tokenize and Convert to Memmap (recommended for large datasets)

Accepts .csv, .csv.gz, or .parquet as input.

from pan_core.common.data_handler import split_and_tokenize

split_and_tokenize(
    csv_path="data/processed.parquet",
    token_path="vocab/normal_tokens.txt",
    thresholds=[0, 64, 128, 256, 512],  # pure SMILES token counts (BOS/EOS excluded)
    max_length=512,                      # used only when do_split=False
    randomize=True,
    canonicalize=True,
    do_split=True,          # split into length-bucketed files (for dynamic curriculum)
    smiles_format="parquet", # sidecar SMILES output format: "csv" (default) or "parquet"
    analyze_stride=100,      # record 1-in-N rows for distribution analysis (default 100)
    max_workers=8
)

thresholds: Values represent pure SMILES token counts excluding BOS/EOS.
For example, thresholds=[0, 64, 128, 256] creates three buckets; each bucket's NPY stores sequences padded to threshold_upper + 2 tokens (adding BOS and EOS).
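
The relationship between a token count, its bucket, and the padded length can be sketched as below. `bucket_for` is a hypothetical helper; in particular, the boundary convention (a bucket covers counts greater than its lower threshold and up to and including its upper threshold) is an assumption and may differ from what split_and_tokenize actually does.

```python
from bisect import bisect_left

def bucket_for(n_tokens, thresholds):
    """Return (bucket_index, padded_length) for a pure SMILES token count.

    Assumes bucket i covers thresholds[i] < n_tokens <= thresholds[i + 1].
    Returns None when the count falls outside every bucket.
    """
    if not thresholds[0] < n_tokens <= thresholds[-1]:
        return None
    index = bisect_left(thresholds, n_tokens) - 1
    # The stored length adds two positions for BOS and EOS.
    return index, thresholds[index + 1] + 2

thresholds = [0, 64, 128, 256]
for count in (10, 64, 65, 256, 300):
    print(count, bucket_for(count, thresholds))
```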

Output (do_split=True): One .npy memmap file per bucket, each with shape (N, 2, threshold_upper+2) where [:, 0, :] is source and [:, 1, :] is target, stored as int16.
Shorter buckets use smaller arrays — the shape varies per bucket.

Output (do_split=False): A single .npy with shape (N, 2, max_length).
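
To illustrate the array layout, the sketch below writes a tiny stand-in file and reads it back lazily with NumPy. It assumes the outputs are standard .npy files (with a header) that `np.load(..., mmap_mode="r")` can open; the file name, token ids, and the use of 0 for padding are made up for the example.

```python
import os
import tempfile

import numpy as np

# Stand-in for one bucket file: 3 pairs, upper threshold 6, so length 6 + 2 = 8.
demo = np.zeros((3, 2, 8), dtype=np.int16)  # 0 as padding id is an assumption
demo[0, 0, :3] = [1, 5, 2]  # hypothetical token ids for a source sequence
demo[0, 1, :3] = [1, 7, 2]  # hypothetical token ids for its target
path = os.path.join(tempfile.mkdtemp(), "demo_0to6.npy")
np.save(path, demo)

# Memory-map the file instead of loading it fully into RAM.
pairs = np.load(path, mmap_mode="r")
src = pairs[:, 0, :]  # source sequences
tgt = pairs[:, 1, :]  # target sequences
print(pairs.shape, pairs.dtype)
```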

A SMILES sidecar file (.csv.gz or .parquet depending on smiles_format) is also written per bucket when do_split=True.

Memory: Processing is bounded to O(max_workers) RAM regardless of dataset size — suitable for 100M+ row inputs.

Step 4 — Per-bucket Augmentation (optional)

After split_and_tokenize, use augment_bins to generate augmented NPY files for selected buckets.
The original NPY and sidecar files are not modified.

from pan_core.common.data_handler import augment_bins

augment_bins(
    base_dir="data/",
    stem="processed",           # stem used by split_and_tokenize
    token_path="vocab/normal_tokens.txt",
    thresholds=[0, 64, 128, 256, 512],
    bin_num_random={0: 10, 2: 5},   # bin index → random SMILES per molecule
    sidecar_format="parquet",        # must match smiles_format used in split_and_tokenize
    max_workers=8,
)

Argument	Description
`base_dir`	Directory where `split_and_tokenize` wrote its outputs
`stem`	File stem (the same stem that `split_and_tokenize` derives from `csv_path`)
`bin_num_random`	Dict mapping bucket index → number of random SMILES to generate per molecule
`sidecar_format`	`"parquet"` or `"csv"` — must match `smiles_format` used in `split_and_tokenize`

Output per bucket i with n = bin_num_random[i]:

File	Shape / Format	Description
`{stem}_{lo}to{hi}_aug{n}.npy`	`(M, 2, hi+2)` int16	Augmented memmap; `[:, 0, :]` random, `[:, 1, :]` canonical
`{stem}_{lo}to{hi}_aug{n}.parquet`	two columns	`random` + `canonical` sidecar for the augmented pairs

M may be less than N × n if some molecules fail RDKit augmentation.
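
The naming pattern above can be expanded for a given call as in the following sketch. `augmented_outputs` is a hypothetical helper, and it assumes that bucket index i spans thresholds[i] to thresholds[i + 1]; that mapping is inferred from the argument descriptions rather than confirmed against the code.

```python
def augmented_outputs(stem, thresholds, bin_num_random, sidecar_ext="parquet"):
    """Map each selected bucket index to (npy name, sidecar name, padded length)."""
    outputs = {}
    for index, n in sorted(bin_num_random.items()):
        lo, hi = thresholds[index], thresholds[index + 1]
        base = f"{stem}_{lo}to{hi}_aug{n}"
        # hi + 2 accounts for BOS and EOS in the stored sequence length.
        outputs[index] = (f"{base}.npy", f"{base}.{sidecar_ext}", hi + 2)
    return outputs

expected = augmented_outputs(
    stem="processed",
    thresholds=[0, 64, 128, 256, 512],
    bin_num_random={0: 10, 2: 5},
)
for index, names in expected.items():
    print(index, names)
```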

Expected Input Summary

Field	Type	Description
Source SMILES	`str`	Random (non-canonical) SMILES string
Target SMILES	`str`	Canonical SMILES string (RDKit output)
SMILES file	`.csv` / `.csv.gz` / `.parquet`	Two-column file with `random` and `canonical` columns
Vocabulary file	`.txt`	One token per line; default vocab size 185
Memmap file	`.npy`	`int16`, shape `(N, 2, L)` — src at `[:, 0, :]`, tgt at `[:, 1, :]`; `L = threshold_upper + 2` per bucket when `do_split=True`
Latent array	`.npy`	`float32`, shape `(N, E)` — for `--latent_decode` mode
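
As one example of a downstream use of a latent array of shape (N, E), the sketch below ranks rows by cosine similarity to a query row. It is illustrative only: `top_k_cosine` is a hypothetical helper, the demo vectors are made up, and cosine similarity is just one possible choice of metric.

```python
import numpy as np

def top_k_cosine(latents, query_index, k=3):
    """Return indices of the k rows most similar to latents[query_index]."""
    x = np.asarray(latents, dtype=np.float32)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)  # guard against zero vectors
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # exclude the query itself
    return np.argsort(-scores)[:k].tolist()

# Made-up 2-dimensional latents; real arrays would come from --encode_only.
demo_latents = np.array([[1, 0], [0.9, 0.1], [0, 1], [-1, 0]], dtype=np.float32)
neighbours = top_k_cosine(demo_latents, query_index=0, k=2)
print(neighbours)
```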

Configuration

All runs are controlled by a single YAML file.
See sample/config.yml for a fully annotated example.

Top-level sections

Section	Description
`system`	Environment (`hpc`/`normal`), seed, world size, workers, bf16
`logging`	WandB settings, log interval
`model`	Architecture: VAE flag, embedding dim, Transformer layers/heads, RoPE, latent pooling
`training`	Strategy, optimizer, ZClip, checkpointing, validation, evaluation, data

Training strategies

`strategy.type`	Description
`normal`	Standard single-phase training
`phased`	Multi-phase curriculum — each phase adds a new data split; earlier data is replayed at `replay_ratio`
`dynamic`	Bucket-based curriculum — sampling probabilities per bucket interpolate from `start_sampling_probabilities` to `end_sampling_probabilities` over training
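
The dynamic schedule can be pictured with the sketch below, which assumes a linear interpolation over training steps followed by renormalization. The actual interpolation used by PanCORE is not specified here, and `interpolate_probabilities` is a hypothetical name.

```python
def interpolate_probabilities(start, end, step, total_steps):
    """Linearly blend per-bucket sampling probabilities at a given step."""
    if len(start) != len(end):
        raise ValueError("start and end must list one probability per bucket")
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    t = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    mixed = [(1.0 - t) * s + t * e for s, e in zip(start, end)]
    total = sum(mixed)
    return [p / total for p in mixed]

# Example: shift sampling from the short bucket to the long bucket.
start, end = [1.0, 0.0], [0.0, 1.0]
for step in (0, 50, 100):
    print(step, interpolate_probabilities(start, end, step, total_steps=100))
```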

Key model parameters (`model` section)

Parameter	Default	Description
`use_vae`	`false`	Enable VAE mode (adds KL loss with `beta` weighting)
`common.embedding_dim`	`512`	Hidden/latent dimension
`common.dropout`	`0.2`	Dropout probability
`Transformer.n_layer`	`8`	Number of encoder/decoder layers
`Transformer.n_head`	`8`	Number of attention heads
`Transformer.n_positions`	`2048`	Maximum sequence length
`Transformer.latent.pooling_mode`	`MHA`	Latent pooling: `MHA` or `concat`
`Transformer.latent.cross_attn`	`L-T`	Cross-attention mode: `L-T`, `S-T`, or blank
`Transformer.latent.latent_conditioning`	`adaln-zero`	Decoder conditioning: `add_once`, `adaln`, `adaln-zero`, or blank
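
The dotted parameter names in the table suggest a nested layout under the model section. The sketch below mirrors the listed defaults as a nested dictionary and resolves a dotted name against it; the nesting is an assumption based on the names alone (sample/config.yml is the authoritative layout), and `get_param` is a hypothetical helper.

```python
# Defaults copied from the table above, arranged by their dotted prefixes.
MODEL_DEFAULTS = {
    "use_vae": False,
    "common": {"embedding_dim": 512, "dropout": 0.2},
    "Transformer": {
        "n_layer": 8,
        "n_head": 8,
        "n_positions": 2048,
        "latent": {
            "pooling_mode": "MHA",
            "cross_attn": "L-T",
            "latent_conditioning": "adaln-zero",
        },
    },
}

def get_param(config, dotted_key):
    """Walk a nested dict following a dotted key such as 'Transformer.n_layer'."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node

print(get_param(MODEL_DEFAULTS, "Transformer.latent.pooling_mode"))
```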

Running

The main entry point is the pan_core.main module; once the package is installed, run it with python -m pan_core.main.

python -m pan_core.main --config path/to/config.yml [MODE FLAGS] [OPTIONS]

Global arguments

Argument	Required	Description
`--config`	Yes	Path to the YAML configuration file
`--output_dir`	No	Output directory for checkpoints/logs/results. Defaults to `<config_dir>/output/`

Mode: Training (`--do_train`)

Runs the full training pipeline: config → tokenizer → model → data → train → save.

python -m pan_core.main \
    --config sample/config.yml \
    --output_dir runs/exp01 \
    --do_train

Training resumes automatically from the latest checkpoint if one exists in --output_dir.
The final model is saved to <output_dir>/final_model/.
The tokenizer vocabulary is saved alongside the model.
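
To see which checkpoint a resumed run would most likely pick up, the output directory can be scanned for the highest-numbered checkpoint-N subdirectory. The sketch below is a standalone illustration, not the lookup PanCORE itself performs; `latest_checkpoint` is a hypothetical helper and the demo directory is created only for the example.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")

def latest_checkpoint(output_dir):
    """Return the checkpoint-N subdirectory with the largest N, or None."""
    best = None
    for child in Path(output_dir).iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))  # compare numerically, not as strings
            if best is None or step > best[0]:
                best = (step, child)
    return None if best is None else best[1]

# Demo on a throwaway directory.
demo_dir = Path(tempfile.mkdtemp())
for name in ("checkpoint-100", "checkpoint-900", "checkpoint-1000", "final_model"):
    (demo_dir / name).mkdir()
latest = latest_checkpoint(demo_dir)
print(latest.name)  # checkpoint-1000
```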

Mode: Generation / Evaluation (`--do_eval_gen`)

Runs autoregressive greedy decoding (or a teacher-forced forward pass) over the evaluation dataset and saves the results.

python -m pan_core.main \
    --config sample/config.yml \
    --output_dir runs/exp01 \
    --do_eval_gen \
    --checkpoint runs/exp01/final_model \
    --csv_output runs/exp01/gen_results.csv

Argument	Default	Description
`--checkpoint`	`<output_dir>/final_model`	Path to the model checkpoint to load
`--encode_only`	`false`	Only run the encoder; saves latent vectors, skips decoding
`--latent_decode`	`false`	Decode from pre-computed latent arrays (skips encoding; expects `.npy` at `eval_data.path`)
`--teacher_forcing`	`false`	Use teacher-forced forward pass instead of autoregressive decoding
`--csv_output`	`<output_dir>/gen_results.csv`	Output CSV path. Columns: `judge`, `input`, `generated`, `target`
`--encode_output`	`<output_dir>/encode_latents.npy`	Output `.npy` path for latent vectors (used when `--encode_only`)
`--need_logits`	`false`	Also collect and save full output logits
`--logits_output`	`<output_dir>/gen_output_logits.npy`	Output `.npy` path for logits

Output CSV columns:

Column	Description
`judge`	`True` if `generated == target` (perfect reconstruction)
`input`	Decoded source SMILES
`generated`	Model-generated SMILES
`target`	Ground-truth canonical SMILES
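
A results file with these columns can be summarized with the standard library alone. The sketch below recomputes the perfect-reconstruction rate from the `generated` and `target` columns rather than parsing `judge`, so it does not depend on how the boolean is serialized; `perfect_accuracy` is a hypothetical helper and the rows are made-up examples.

```python
import csv
import io

def perfect_accuracy(rows):
    """Fraction of rows whose generated SMILES equals the target exactly."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row["generated"] == row["target"])
    return hits / len(rows)

# Made-up stand-in for gen_results.csv.
results_text = (
    "judge,input,generated,target\n"
    "True,C1=CC=CC=C1,c1ccccc1,c1ccccc1\n"
    "False,C(C)O,CCC,CCO\n"
)
accuracy = perfect_accuracy(csv.DictReader(io.StringIO(results_text)))
print(accuracy)  # 0.5
```

For a real run, `io.StringIO(results_text)` would be replaced by an open file handle on the path given to --csv_output.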

Mode: Accuracy Progress (`--get_acc_progress`)

Evaluates all checkpoint-N subdirectories in a directory and produces a training-step accuracy curve.

python -m pan_core.main \
    --config sample/config.yml \
    --output_dir runs/exp01 \
    --get_acc_progress \
    --checkpoint_dir runs/exp01

Argument	Required	Description
`--checkpoint_dir`	Yes	Directory containing `checkpoint-N` subdirectories

Output: <output_dir>/acc_progress.csv with columns perfect_accuracy, partial_accuracy, and (for phased/dynamic strategies) phase{i}_data_perfect_accuracy per phase. Per-checkpoint CSVs are also saved.

Mode: Hidden State Extraction (`--get_hidden`)

Runs a full forward pass over the evaluation dataset and saves all intermediate hidden states and attention scores.

python -m pan_core.main \
    --config sample/config.yml \
    --output_dir runs/exp01 \
    --get_hidden \
    --checkpoint runs/exp01/final_model \
    --hidden_output runs/exp01/hidden_output.pkl

Argument	Default	Description
`--checkpoint`	`<output_dir>/final_model`	Path to the model checkpoint
`--hidden_output`	`<output_dir>/hidden_output.pkl`	Output `.pkl` path for hidden states

Output: A pickle file containing a nested dict:

{
    "outputs": {
        "encoder": [array_layer0, array_layer1, ...],  # each: [N, L, E]
        "decoder": [array_layer0, array_layer1, ...],
    },
    "attention_scores": {
        "encoder": [array_layer0, array_layer1, ...],  # each: [N, H, L, L]
        "decoder": [array_layer0, array_layer1, ...],
    }
}

Note: torch.compile is intentionally disabled in this mode because fullgraph=True is incompatible with intermediate tensor returns (need_hidden=True).
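
The nested structure above can be inspected with a few lines of Python. The sketch below builds a small stand-in dict with the same keys, round-trips it through pickle, and lists the per-layer array shapes; the sizes are made up, `summarize_hidden` is a hypothetical helper, and the arrays are assumed to be NumPy arrays. As with any pickle file, only files from a trusted source should be loaded.

```python
import pickle

import numpy as np

def summarize_hidden(hidden):
    """Collect per-layer array shapes for each (kind, part) combination."""
    summary = {}
    for kind in ("outputs", "attention_scores"):
        for part in ("encoder", "decoder"):
            summary[(kind, part)] = [np.shape(layer) for layer in hidden[kind][part]]
    return summary

# Stand-in with 2 layers: N=2 samples, L=5 positions, E=4 dims, H=2 heads.
N, L, E, H = 2, 5, 4, 2
stand_in = {
    "outputs": {
        "encoder": [np.zeros((N, L, E)) for _ in range(2)],
        "decoder": [np.zeros((N, L, E)) for _ in range(2)],
    },
    "attention_scores": {
        "encoder": [np.zeros((N, H, L, L)) for _ in range(2)],
        "decoder": [np.zeros((N, H, L, L)) for _ in range(2)],
    },
}
restored = pickle.loads(pickle.dumps(stand_in))
summary = summarize_hidden(restored)
for key, shapes in summary.items():
    print(key, shapes)
```

For a real output file, `pickle.load(open_file)` on the path given to --hidden_output would take the place of the round trip.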

Using as a library

All four pipeline functions can be imported and called directly without the CLI:

from pan_core.main import (
    run_training,
    run_generation_evaluation,
    run_accuracy_progress,
    get_hidden_outputs,
)

# Example: run the encoder only and save the latent vectors to a file
results, training_args = run_generation_evaluation(
    config="path/to/config.yml",
    output_dir="runs/exp01",
    checkpoint_path="runs/exp01/final_model",
    encode_only=True,
    latent_decode=False,
    encode_output_file="latents.npy",
)

run_generation_evaluation and get_hidden_outputs both accept a pre-loaded (ModelConfig, TrainArgs) tuple as the config argument, which avoids re-parsing the YAML when called in a loop.

References

Inspired by clmpy (Shumpei Nemoto, Mizuno group)

Contact

For questions or comments, open a GitHub issue or contact:

takuho2002[at]outlook.jp
tadahaya[at]gmail.com (lead contact)
