A Transformer-based seq2seq framework that maps chemical structures (SMILES) into a continuous, meaningful latent space.
The name PanCORE stands for Pan-Chemical Omniscale Representation Engine — reflecting its goal of covering a broad chemical space across multiple molecular scales.
Key capabilities:
- Broad Chemical Space — Trained on large-scale datasets covering diverse low-to-medium-sized molecules.
- Latent Representation — Encodes SMILES strings into fixed-length latent vectors for downstream tasks (property prediction, similarity search, conditional generation).
- Generative Capability — Robust decoder for SMILES reconstruction/canonicalization and novel molecule generation from the latent manifold.
- Curriculum Training — Supports `normal`, `phased`, and `dynamic` bucket-based curriculum strategies with distributed training (DDP).
This repository is under construction and will be officially released by the Mizuno group.
Please contact tadahaya[at]gmail.com before publishing any results using this repository.
- Zehao Li — main contributor
- Tadahaya Mizuno — correspondence
pan-core/
├── sample/
│   └── config.yml            # Example configuration file
├── src/pan_core/
│   ├── main.py               # CLI entry point and library functions
│   ├── train.py              # HF Trainer subclass (HF_Trainer)
│   ├── config.py             # ModelConfig + TrainArgs (YAML → dataclasses)
│   ├── common/
│   │   ├── tokenizer.py      # SMILESTokenizer (HuggingFace-compatible)
│   │   ├── data_handler.py   # Datasets, samplers, and preprocessing utilities
│   │   └── utils.py          # Shared utilities (KLLoss, move_to_cpu, etc.)
│   └── models/
│       └── transformer/
│           ├── model.py      # PanCoreTransformerModel
│           └── layer.py      # Attention, TransformerBlock, SwiGLU, etc.
pip install -e .

PanCORE expects SMILES data in one of two formats at training time:
| Mode | Format | Description |
|---|---|---|
| `on_the_fly` | CSV or Parquet | Raw SMILES tokenized at access time; data held as Arrow buffers (CoW-safe across DataLoader workers) |
| `memmap` | `.npy` memmap | Pre-tokenized int16 arrays, fastest for large datasets |
`on_the_fly` memory note: SMILES strings are stored as Polars Series (Apache Arrow buffers) rather than Python lists. Arrow buffers are read-only and shared across all fork-based DataLoader workers without Copy-on-Write page duplication, which keeps worker memory overhead near zero regardless of dataset size.

If `preprocessing_fn` is provided to `OnTheFlyDataset`, it is applied per item at access time (not at load time). For expensive transforms such as RDKit canonicalization, prefer pre-processing the file once with `raw2randomORcanonical` and passing the result directly.
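To make the cost of per-item preprocessing concrete, here is a minimal sketch. `LazySmilesDataset` and `simulate_epochs` are hypothetical names for illustration, not PanCORE's `OnTheFlyDataset`, and `str.upper` merely stands in for an expensive transform such as canonicalization.

```python
from typing import Callable, Optional, Sequence


class LazySmilesDataset:
    """Hypothetical stand-in for an on-the-fly dataset (not PanCORE's class)."""

    def __init__(self, smiles: Sequence[str],
                 preprocessing_fn: Optional[Callable[[str], str]] = None):
        self.smiles = smiles
        self.preprocessing_fn = preprocessing_fn
        self.calls = 0  # counts how often the transform runs

    def __len__(self) -> int:
        return len(self.smiles)

    def __getitem__(self, idx: int) -> str:
        item = self.smiles[idx]
        if self.preprocessing_fn is not None:
            self.calls += 1  # the transform runs again on every access
            item = self.preprocessing_fn(item)
        return item


def simulate_epochs(dataset: LazySmilesDataset, n_epochs: int) -> int:
    """Read every item n_epochs times and return the transform call count."""
    for _ in range(n_epochs):
        for i in range(len(dataset)):
            dataset[i]
    return dataset.calls


if __name__ == "__main__":
    ds = LazySmilesDataset(["c1ccccc1", "cco"], preprocessing_fn=str.upper)
    print(simulate_epochs(ds, n_epochs=3))  # 2 items x 3 epochs
```

In this sketch the transform cost scales with the number of accesses rather than the number of molecules, which is why a one-time pass over the file tends to be cheaper for multi-epoch training.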
The source file must contain at least one SMILES column.
Accepted formats: .csv, .csv.gz, and .parquet.
For seq2seq training (random → canonical), two columns are expected:
random,canonical
C1=CC=CC=C1,c1ccccc1
...
The column names are configurable via csv_column_name in config.yml.
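Before training, it can help to confirm that a file really has the two expected columns. The sketch below does this for plain CSV text using only the standard library. The function name `read_smiles_pairs` and its parameters are hypothetical and not part of PanCORE; the default column names follow the example above.

```python
import csv
import io
from typing import Iterable, List, Tuple


def read_smiles_pairs(handle: Iterable[str],
                      source_col: str = "random",
                      target_col: str = "canonical") -> List[Tuple[str, str]]:
    """Read (source, target) SMILES pairs from a CSV text stream."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in (source_col, target_col) if c not in header]
    if missing:
        raise ValueError(f"missing column(s): {missing}; found {header}")
    return [(row[source_col], row[target_col]) for row in reader]


if __name__ == "__main__":
    sample = io.StringIO("random,canonical\nC1=CC=CC=C1,c1ccccc1\n")
    print(read_smiles_pairs(sample))
```

If the columns are renamed through `csv_column_name`, the same check applies with the configured names passed as `source_col` and `target_col`.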
Use raw2randomORcanonical from data_handler.py.
The output format is determined by output_path: use .parquet for Parquet, .csv.gz for gzip-compressed CSV, or .csv for plain CSV.
from pan_core.common.data_handler import raw2randomORcanonical
import pandas as pd

smiles = pd.read_csv("molecules.csv")["smiles"]

# Parquet output (recommended for large datasets)
raw2randomORcanonical(
    allsmiles=smiles,
    randomize=True,
    canonicalize=True,
    output_path="data/processed.parquet",
    max_workers=8
)

# Gzip CSV output
raw2randomORcanonical(
    allsmiles=smiles,
    randomize=True,
    canonicalize=True,
    output_path="data/processed.csv.gz",
    max_workers=8
)

For data augmentation (multiple random SMILES per molecule), the output format is inferred from the input extension (.parquet → _augmented.parquet, .csv.gz → _augmented.csv.gz, etc.):
from pan_core.common.data_handler import RandomAugmentation

RandomAugmentation(
    csvpath="data/processed.parquet",
    num_random=10,
    max_workers=8
)

Accepts .csv, .csv.gz, or .parquet as input.
from pan_core.common.data_handler import split_and_tokenize

split_and_tokenize(
    csv_path="data/processed.parquet",
    token_path="vocab/normal_tokens.txt",
    thresholds=[0, 64, 128, 256, 512],  # pure SMILES token counts (BOS/EOS excluded)
    max_length=512,                     # used only when do_split=False
    randomize=True,
    canonicalize=True,
    do_split=True,                      # split into length-bucketed files (for dynamic curriculum)
    smiles_format="parquet",            # sidecar SMILES output format: "csv" (default) or "parquet"
    analyze_stride=100,                 # record 1-in-N rows for distribution analysis (default 100)
    max_workers=8
)

thresholds: Values represent pure SMILES token counts excluding BOS/EOS.
For example, thresholds=[0, 64, 128, 256] creates three buckets; each bucket's NPY stores sequences padded to threshold_upper + 2 tokens (adding BOS and EOS).
Output (do_split=True): One .npy memmap file per bucket, each with shape (N, 2, threshold_upper+2) where [:, 0, :] is source and [:, 1, :] is target, stored as int16.
Shorter buckets use smaller arrays — the shape varies per bucket.
Output (do_split=False): A single .npy with shape (N, 2, max_length).
A SMILES sidecar file (.csv.gz or .parquet depending on smiles_format) is also written per bucket when do_split=True.
Memory: Processing is bounded to O(max_workers) RAM regardless of dataset size — suitable for 100M+ row inputs.
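The bucket arithmetic described above can be sketched as follows. `bucket_layout` and `bucket_index` are hypothetical helpers, not part of `data_handler.py`. The boundary rule (a sequence of exactly `upper` tokens belongs to the bucket ending at `upper`) is an assumption, chosen because such a sequence fits in `upper + 2` slots.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def bucket_layout(thresholds: List[int]) -> List[Tuple[int, int, int]]:
    """Return (lower, upper, padded_length) per bucket.

    padded_length = upper + 2 leaves room for BOS and EOS.
    """
    return [(lo, hi, hi + 2) for lo, hi in zip(thresholds, thresholds[1:])]


def bucket_index(n_tokens: int, thresholds: List[int]) -> Optional[int]:
    """Index of the bucket for a sequence of n_tokens pure SMILES tokens.

    Assumption: bucket i covers lower < n_tokens <= upper.
    Returns None when the count falls outside every bucket.
    """
    if n_tokens <= thresholds[0] or n_tokens > thresholds[-1]:
        return None
    return bisect_left(thresholds, n_tokens) - 1


if __name__ == "__main__":
    for lo, hi, padded in bucket_layout([0, 64, 128, 256]):
        print(f"bucket {lo}to{hi}: array shape (N, 2, {padded})")
```

With `thresholds=[0, 64, 128, 256]` this yields three buckets with padded lengths 66, 130, and 258, matching the `threshold_upper + 2` rule stated above.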
After split_and_tokenize, use augment_bins to generate augmented NPY files for selected buckets.
The original NPY and sidecar files are not modified.
from pan_core.common.data_handler import augment_bins

augment_bins(
    base_dir="data/",
    stem="processed",                      # stem used by split_and_tokenize
    token_path="vocab/normal_tokens.txt",
    thresholds=[0, 64, 128, 256, 512],
    bin_num_random={0: 10, 2: 5},          # bin index → random SMILES per molecule
    sidecar_format="parquet",              # must match smiles_format used in split_and_tokenize
    max_workers=8,
)

| Argument | Description |
|---|---|
| `base_dir` | Directory where `split_and_tokenize` wrote its outputs |
| `stem` | File stem (same as derived from `csv_path` in `split_and_tokenize`) |
| `bin_num_random` | Dict mapping bucket index → number of random SMILES to generate per molecule |
| `sidecar_format` | `"parquet"` or `"csv"`; must match `smiles_format` used in `split_and_tokenize` |
Output per bucket i with n = bin_num_random[i]:
| File | Shape / Format | Description |
|---|---|---|
| `{stem}_{lo}to{hi}_aug{n}.npy` | `(M, 2, hi+2)` int16 | Augmented memmap; `[:, 0, :]` random, `[:, 1, :]` canonical |
| `{stem}_{lo}to{hi}_aug{n}.parquet` | two columns | random + canonical sidecar for the augmented pairs |
M may be less than N × n if some molecules fail RDKit augmentation.
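As a rough illustration of the naming pattern, the sketch below derives the expected augmented `.npy` file name for each selected bucket. `augmented_npy_names` and `max_augmented_rows` are hypothetical helpers. The sketch assumes that `lo` and `hi` for bucket `i` are `thresholds[i]` and `thresholds[i + 1]`.

```python
from typing import Dict, List


def augmented_npy_names(stem: str, thresholds: List[int],
                        bin_num_random: Dict[int, int]) -> Dict[int, str]:
    """Map bucket index -> expected augmented .npy file name."""
    names = {}
    for i, n in sorted(bin_num_random.items()):
        if not 0 <= i < len(thresholds) - 1:
            raise IndexError(f"bucket index {i} out of range")
        lo, hi = thresholds[i], thresholds[i + 1]  # assumed bucket bounds
        names[i] = f"{stem}_{lo}to{hi}_aug{n}.npy"
    return names


def max_augmented_rows(n_molecules: int, n_random: int) -> int:
    """Upper bound on M; failed augmentations can make the real M smaller."""
    return n_molecules * n_random


if __name__ == "__main__":
    names = augmented_npy_names("processed", [0, 64, 128, 256, 512], {0: 10, 2: 5})
    for index, name in names.items():
        print(index, name)
```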
| Field | Type | Description |
|---|---|---|
| Source SMILES | `str` | Random (non-canonical) SMILES string |
| Target SMILES | `str` | Canonical SMILES string (RDKit output) |
| SMILES file | `.csv` / `.csv.gz` / `.parquet` | Two-column file with random and canonical columns |
| Vocabulary file | `.txt` | One token per line; default vocab size 185 |
| Memmap file | `.npy` | int16, shape `(N, 2, L)`; src at `[:, 0, :]`, tgt at `[:, 1, :]`; `L = threshold_upper + 2` per bucket when `do_split=True` |
| Latent array | `.npy` | float32, shape `(N, E)`; for `--latent_decode` mode |
All runs are controlled by a single YAML file.
See sample/config.yml for a fully annotated example.
| Section | Description |
|---|---|
| `system` | Environment (hpc/normal), seed, world size, workers, bf16 |
| `logging` | WandB settings, log interval |
| `model` | Architecture: VAE flag, embedding dim, Transformer layers/heads, RoPE, latent pooling |
| `training` | Strategy, optimizer, ZClip, checkpointing, validation, evaluation, data |
| `strategy.type` | Description |
|---|---|
| `normal` | Standard single-phase training |
| `phased` | Multi-phase curriculum: each phase adds a new data split; earlier data is replayed at `replay_ratio` |
| `dynamic` | Bucket-based curriculum: sampling probabilities per bucket interpolate from `start_sampling_probabilities` to `end_sampling_probabilities` over training |
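For the `dynamic` strategy, the sketch below shows one way per-bucket sampling probabilities could move from the start values to the end values. Linear interpolation over a `progress` fraction is an assumption here, since the schedule actually used is not specified above. `interpolate_probabilities` is a hypothetical helper.

```python
from typing import List, Sequence


def interpolate_probabilities(start: Sequence[float], end: Sequence[float],
                              progress: float) -> List[float]:
    """Linearly blend two probability vectors and renormalize.

    progress is the fraction of training completed, clipped to [0, 1].
    """
    if len(start) != len(end):
        raise ValueError("start and end must have one entry per bucket")
    t = min(max(progress, 0.0), 1.0)
    mixed = [(1.0 - t) * s + t * e for s, e in zip(start, end)]
    total = sum(mixed)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return [p / total for p in mixed]


if __name__ == "__main__":
    start = [0.7, 0.2, 0.1]  # favor short sequences early (illustrative values)
    end = [0.2, 0.3, 0.5]    # shift toward long sequences later
    for progress in (0.0, 0.5, 1.0):
        print(progress, interpolate_probabilities(start, end, progress))
```

Renormalizing after blending keeps the result a valid distribution even if the configured start or end vectors do not sum exactly to one.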
| Parameter | Default | Description |
|---|---|---|
| `use_vae` | `false` | Enable VAE mode (adds KL loss with beta weighting) |
| `common.embedding_dim` | `512` | Hidden/latent dimension |
| `common.dropout` | `0.2` | Dropout probability |
| `Transformer.n_layer` | `8` | Number of encoder/decoder layers |
| `Transformer.n_head` | `8` | Number of attention heads |
| `Transformer.n_positions` | `2048` | Maximum sequence length |
| `Transformer.latent.pooling_mode` | `MHA` | Latent pooling: `MHA` or `concat` |
| `Transformer.latent.cross_attn` | `L-T` | Cross-attention mode: `L-T`, `S-T`, or blank |
| `Transformer.latent.latent_conditioning` | `adaln-zero` | Decoder conditioning: `add_once`, `adaln`, `adaln-zero`, or blank |
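The documented defaults can be mirrored in a small dataclass to sanity-check a planned configuration. `TransformerSketch` is a hypothetical illustration, not PanCORE's `ModelConfig`. The divisibility check assumes the usual multi-head attention convention that the per-head dimension is `embedding_dim / n_head`.

```python
from dataclasses import dataclass


@dataclass
class TransformerSketch:
    """Hypothetical mirror of a few documented defaults (not ModelConfig)."""
    embedding_dim: int = 512
    dropout: float = 0.2
    n_layer: int = 8
    n_head: int = 8
    n_positions: int = 2048

    def head_dim(self) -> int:
        """Per-head width, assuming heads split embedding_dim evenly."""
        if self.embedding_dim % self.n_head != 0:
            raise ValueError("embedding_dim must be divisible by n_head")
        return self.embedding_dim // self.n_head


if __name__ == "__main__":
    print(TransformerSketch().head_dim())
```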
The main entry point is pan_core.main (or python -m pan_core.main if installed).
python -m pan_core.main --config path/to/config.yml [MODE FLAGS] [OPTIONS]

| Argument | Required | Description |
|---|---|---|
| `--config` | Yes | Path to the YAML configuration file |
| `--output_dir` | No | Output directory for checkpoints/logs/results. Defaults to `<config_dir>/output/` |
Runs the full training pipeline: config → tokenizer → model → data → train → save.
python -m pan_core.main \
--config sample/config.yml \
--output_dir runs/exp01 \
--do_train

- Resumes automatically from the latest checkpoint if one exists in `--output_dir`.
- Final model is saved to `<output_dir>/final_model/`.
- Tokenizer vocabulary is saved alongside the model.
Runs autoregressive greedy decoding (or teacher-forcing) over the evaluation dataset and saves results.
python -m pan_core.main \
--config sample/config.yml \
--output_dir runs/exp01 \
--do_eval_gen \
--checkpoint runs/exp01/final_model \
--csv_output runs/exp01/gen_results.csv

| Argument | Default | Description |
|---|---|---|
| `--checkpoint` | `<output_dir>/final_model` | Path to the model checkpoint to load |
| `--encode_only` | `false` | Only run the encoder; saves latent vectors, skips decoding |
| `--latent_decode` | `false` | Decode from pre-computed latent arrays (skips encoding; expects `.npy` at `eval_data.path`) |
| `--teacher_forcing` | `false` | Use teacher-forced forward pass instead of autoregressive decoding |
| `--csv_output` | `<output_dir>/gen_results.csv` | Output CSV path. Columns: `judge`, `input`, `generated`, `target` |
| `--encode_output` | `<output_dir>/encode_latents.npy` | Output `.npy` path for latent vectors (used when `--encode_only`) |
| `--need_logits` | `false` | Also collect and save full output logits |
| `--logits_output` | `<output_dir>/gen_output_logits.npy` | Output `.npy` path for logits |
Output CSV columns:
| Column | Description |
|---|---|
| `judge` | `True` if `generated == target` (perfect reconstruction) |
| `input` | Decoded source SMILES |
| `generated` | Model-generated SMILES |
| `target` | Ground-truth canonical SMILES |
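Given the columns above, a results file can be summarized with a few lines of standard-library Python. `perfect_accuracy` is a hypothetical helper. It recomputes the match from `generated` and `target` instead of parsing `judge`, so it makes no assumption about how the boolean is serialized.

```python
import csv
import io
from typing import Iterable


def perfect_accuracy(handle: Iterable[str]) -> float:
    """Fraction of rows whose generated SMILES equals the target."""
    rows = list(csv.DictReader(handle))
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row["generated"] == row["target"])
    return hits / len(rows)


if __name__ == "__main__":
    # Illustrative rows only; not real model output.
    sample = io.StringIO(
        "judge,input,generated,target\n"
        "True,C1=CC=CC=C1,c1ccccc1,c1ccccc1\n"
        "False,OCC,CCC,CCO\n"
    )
    print(perfect_accuracy(sample))
```

For a real run, pass an open file handle for the path given to `--csv_output` in place of the in-memory sample.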
Evaluates all checkpoint-N subdirectories in a directory and produces a training-step accuracy curve.
python -m pan_core.main \
--config sample/config.yml \
--output_dir runs/exp01 \
--get_acc_progress \
--checkpoint_dir runs/exp01

| Argument | Required | Description |
|---|---|---|
| `--checkpoint_dir` | Yes | Directory containing `checkpoint-N` subdirectories |
Output: <output_dir>/acc_progress.csv with columns perfect_accuracy, partial_accuracy, and (for phased/dynamic strategies) phase{i}_data_perfect_accuracy per phase. Per-checkpoint CSVs are also saved.
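When inspecting a run directory by hand, note that a plain lexicographic sort puts `checkpoint-1000` before `checkpoint-500`. The sketch below lists `checkpoint-N` subdirectories in numeric step order. `list_checkpoints` is a hypothetical helper and makes no claim about how PanCORE discovers checkpoints internally.

```python
import os
import re
import tempfile
from typing import List, Tuple

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def list_checkpoints(checkpoint_dir: str) -> List[Tuple[int, str]]:
    """Return (step, path) pairs for checkpoint-N subdirectories, by step."""
    found = []
    for name in os.listdir(checkpoint_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(checkpoint_dir, name)
        if match and os.path.isdir(path):
            found.append((int(match.group(1)), path))
    return sorted(found)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for step in (500, 1000, 90):
            os.mkdir(os.path.join(tmp, f"checkpoint-{step}"))
        os.mkdir(os.path.join(tmp, "final_model"))  # ignored: not checkpoint-N
        print([step for step, _ in list_checkpoints(tmp)])
```

The last element of the returned list is the most recent checkpoint by step count.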
Mode: Hidden State Extraction (--get_hidden)
Runs a full forward pass over the evaluation dataset and saves all intermediate hidden states and attention scores.
python -m pan_core.main \
--config sample/config.yml \
--output_dir runs/exp01 \
--get_hidden \
--checkpoint runs/exp01/final_model \
--hidden_output runs/exp01/hidden_output.pkl

| Argument | Default | Description |
|---|---|---|
| `--checkpoint` | `<output_dir>/final_model` | Path to the model checkpoint |
| `--hidden_output` | `<output_dir>/hidden_output.pkl` | Output `.pkl` path for hidden states |
Output: A pickle file containing a nested dict:
{
    "outputs": {
        "encoder": [array_layer0, array_layer1, ...],  # each: [N, L, E]
        "decoder": [array_layer0, array_layer1, ...],
    },
    "attention_scores": {
        "encoder": [array_layer0, array_layer1, ...],  # each: [N, H, L, L]
        "decoder": [array_layer0, array_layer1, ...],
    }
}

Note: `torch.compile` is intentionally disabled in this mode because `fullgraph=True` is incompatible with intermediate tensor returns (`need_hidden=True`).
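A quick way to inspect the pickle is to list the per-layer shapes. The sketch below assumes the nested layout shown above and that each layer entry exposes a `.shape` attribute, as array and tensor types typically do. `load_hidden` and `summarize_shapes` are hypothetical helpers.

```python
import pickle
from typing import Any, Dict, List, Tuple


def load_hidden(path: str) -> Dict[str, Any]:
    """Load the pickle. Only unpickle files from a source you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def summarize_shapes(hidden: Dict[str, Any]) -> Dict[Tuple[str, str], List[Any]]:
    """Map (section, stack) -> list of per-layer shapes (None if no .shape)."""
    summary = {}
    for section in ("outputs", "attention_scores"):
        for stack in ("encoder", "decoder"):
            layers = hidden.get(section, {}).get(stack, [])
            summary[(section, stack)] = [getattr(a, "shape", None) for a in layers]
    return summary
```

For example, `summarize_shapes(load_hidden("runs/exp01/hidden_output.pkl"))` would report one shape per layer for each of the four stacks, which makes it easy to confirm the expected `[N, L, E]` and `[N, H, L, L]` layouts.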
All four pipeline functions can be imported and called directly without the CLI:
from pan_core.main import (
    run_training,
    run_generation_evaluation,
    run_accuracy_progress,
    get_hidden_outputs,
)

# Example: encode a custom DataFrame and retrieve latents
results, training_args = run_generation_evaluation(
    config="path/to/config.yml",
    output_dir="runs/exp01",
    checkpoint_path="runs/exp01/final_model",
    encode_only=True,
    latent_decode=False,
    encode_output_file="latents.npy",
)

run_generation_evaluation and get_hidden_outputs both accept a pre-loaded (ModelConfig, TrainArgs) tuple as the config argument, which avoids re-parsing the YAML when called in a loop.
- Inspired by clmpy (Shumpei Nemoto, Mizuno group)
For questions or comments, open a GitHub issue or contact:
- takuho2002[at]outlook.jp
- tadahaya[at]gmail.com (lead contact)