Lzh-Function/pan-core

PanCORE — Pan-Chemical Omniscale Representation Engine

A Transformer-based seq2seq framework that maps chemical structures (SMILES) into a continuous, meaningful latent space.
The name reflects the goal of covering a broad chemical space across multiple molecular scales.

Key capabilities:

  • Broad Chemical Space — Trained on large-scale datasets covering diverse low-to-medium-sized molecules.
  • Latent Representation — Encodes SMILES strings into fixed-length latent vectors for downstream tasks (property prediction, similarity search, conditional generation).
  • Generative Capability — Robust decoder for SMILES reconstruction/canonicalization and novel molecule generation from the latent manifold.
  • Curriculum Training — Supports normal, phased, and dynamic bucket-based curriculum strategies with distributed training (DDP).
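As a rough illustration of the similarity-search use case listed above, fixed-length latent vectors can be compared with cosine similarity. The sketch below uses made-up three-dimensional vectors in place of real PanCORE latents; `cosine_similarity` and `rank_by_similarity` are illustrative helpers and are not part of pan_core.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, library):
    """Return library keys sorted from most to least similar to the query."""
    return sorted(
        library,
        key=lambda name: cosine_similarity(query, library[name]),
        reverse=True,
    )

# Toy latents (hypothetical values, not model output).
library = {
    "mol_a": [1.0, 0.0, 0.0],
    "mol_b": [0.9, 0.1, 0.0],
    "mol_c": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity([1.0, 0.05, 0.0], library)
```

With real latents of shape (N, E), the same ranking would normally be done with a vectorized library rather than pure Python.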

Note

This repository is under construction and will be officially released by the Mizuno group.
Please contact tadahaya[at]gmail.com before publishing any results using this repository.


Authors


Repository Structure

pan-core/
├── sample/
│   └── config.yml            # Example configuration file
├── src/pan_core/
│   ├── main.py               # CLI entry point and library functions
│   ├── train.py              # HF Trainer subclass (HF_Trainer)
│   ├── config.py             # ModelConfig + TrainArgs (YAML → dataclasses)
│   ├── common/
│   │   ├── tokenizer.py      # SMILESTokenizer (HuggingFace-compatible)
│   │   ├── data_handler.py   # Datasets, samplers, and preprocessing utilities
│   │   └── utils.py          # Shared utilities (KLLoss, move_to_cpu, etc.)
│   └── models/
│       └── transformer/
│           ├── model.py      # PanCoreTransformerModel
│           └── layer.py      # Attention, TransformerBlock, SwiGLU, etc.

Installation

pip install -e .

Data Preprocessing

PanCORE expects SMILES data in one of two formats at training time:

| Mode | Format | Description |
| --- | --- | --- |
| on_the_fly | CSV or Parquet | Raw SMILES tokenized at access time; data held as Arrow buffers (CoW-safe across DataLoader workers) |
| memmap | .npy memmap | Pre-tokenized int16 arrays, fastest for large datasets |

on_the_fly memory note: SMILES strings are stored as Polars Series (Apache Arrow buffers) rather than Python lists. Arrow buffers are read-only and shared across all fork-based DataLoader workers without Copy-on-Write page duplication — keeping worker memory overhead near zero regardless of dataset size.

If preprocessing_fn is provided to OnTheFlyDataset, it is applied per item at access time (not at load time). For expensive transforms such as RDKit canonicalization, prefer pre-processing the file once with raw2randomORcanonical and passing the result directly.
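The cost of per-item preprocessing can be seen in a toy dataset. `LazySmilesDataset` below is a hypothetical stand-in written for this illustration, not pan_core's OnTheFlyDataset; it only mimics the behaviour described above, where the transform runs on every access rather than once at load time.

```python
class LazySmilesDataset:
    """Toy stand-in for an on-the-fly dataset (not pan_core's OnTheFlyDataset)."""

    def __init__(self, smiles, preprocessing_fn=None):
        self.smiles = list(smiles)
        self.preprocessing_fn = preprocessing_fn

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        item = self.smiles[idx]
        # The transform runs here, on every access, not once in __init__.
        if self.preprocessing_fn is not None:
            item = self.preprocessing_fn(item)
        return item

calls = []

def expensive_transform(s):
    calls.append(s)          # record each invocation
    return s.strip()

ds = LazySmilesDataset([" CCO ", " c1ccccc1 "], preprocessing_fn=expensive_transform)
calls_after_init = len(calls)   # nothing has been transformed yet
first = ds[0]
again = ds[0]                   # same item, so the transform runs a second time
```

Because every epoch revisits every item, an expensive transform is repeated once per item per epoch, which is the reason for preferring one offline pass.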

Step 1 — Prepare a SMILES file

The source file must contain at least one SMILES column.
Accepted formats: .csv, .csv.gz, and .parquet.
For seq2seq training (random → canonical), two columns are expected:

random,canonical
C1=CC=CC=C1,c1ccccc1
...

The column names are configurable via csv_column_name in config.yml.
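A minimal sketch of producing such a file with the standard library, assuming the default column names random and canonical shown above. The second row (ethanol) and the temporary file location are illustrative additions.

```python
import csv
import gzip
import os
import tempfile

pairs = [
    ("C1=CC=CC=C1", "c1ccccc1"),   # the example row from above
    ("OCC", "CCO"),                # ethanol written two ways (illustrative)
]

path = os.path.join(tempfile.mkdtemp(), "pairs.csv.gz")

# Write a gzip-compressed CSV with the two expected columns.
with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["random", "canonical"])
    writer.writerows(pairs)

# Read it back to confirm the header and the row count.
with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
    rows = list(csv.DictReader(fh))
```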

Step 2 — Generate Random/Canonical SMILES Pairs

Use raw2randomORcanonical from data_handler.py.
The output format is determined by output_path: use .parquet for Parquet, .csv.gz for gzip-compressed CSV, or .csv for plain CSV.

from pan_core.common.data_handler import raw2randomORcanonical
import pandas as pd

smiles = pd.read_csv("molecules.csv")["smiles"]

# Parquet output (recommended for large datasets)
raw2randomORcanonical(
    allsmiles=smiles,
    randomize=True,
    canonicalize=True,
    output_path="data/processed.parquet",
    max_workers=8
)

# Gzip CSV output
raw2randomORcanonical(
    allsmiles=smiles,
    randomize=True,
    canonicalize=True,
    output_path="data/processed.csv.gz",
    max_workers=8
)

For data augmentation (multiple random SMILES per molecule), the output format is inferred from the input extension (.parquet → _augmented.parquet, .csv.gz → _augmented.csv.gz, etc.):

from pan_core.common.data_handler import RandomAugmentation

RandomAugmentation(
    csvpath="data/processed.parquet",
    num_random=10,
    max_workers=8
)
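Reading the extension mapping above as "insert _augmented before the extension" (an interpretation that has not been checked against the RandomAugmentation source), the expected output path can be predicted with a small helper. `augmented_path` is a hypothetical name used only for this sketch.

```python
def augmented_path(input_path):
    """Insert '_augmented' before the extension, treating '.csv.gz' as one suffix."""
    for ext in (".csv.gz", ".parquet", ".csv"):
        if input_path.endswith(ext):
            return input_path[: -len(ext)] + "_augmented" + ext
    raise ValueError(f"unsupported extension: {input_path}")

predicted = augmented_path("data/processed.parquet")
```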

Step 3 — Tokenize and Convert to Memmap (recommended for large datasets)

Accepts .csv, .csv.gz, or .parquet as input.

from pan_core.common.data_handler import split_and_tokenize

split_and_tokenize(
    csv_path="data/processed.parquet",
    token_path="vocab/normal_tokens.txt",
    thresholds=[0, 64, 128, 256, 512],  # pure SMILES token counts (BOS/EOS excluded)
    max_length=512,                      # used only when do_split=False
    randomize=True,
    canonicalize=True,
    do_split=True,          # split into length-bucketed files (for dynamic curriculum)
    smiles_format="parquet", # sidecar SMILES output format: "csv" (default) or "parquet"
    analyze_stride=100,      # record 1-in-N rows for distribution analysis (default 100)
    max_workers=8
)

thresholds: Values represent pure SMILES token counts excluding BOS/EOS.
For example, thresholds=[0, 64, 128, 256] creates three buckets; each bucket's NPY stores sequences padded to threshold_upper + 2 tokens (adding BOS and EOS).
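The bucket arithmetic can be sketched as follows. The boundary convention (a bucket covers lo < n_tokens <= hi) is an assumption of this sketch; the README does not state which side of each threshold is inclusive, and `assign_bucket` is a hypothetical helper.

```python
from bisect import bisect_left

def assign_bucket(n_tokens, thresholds):
    """Return (bucket_index, padded_length), or None if n_tokens fits no bucket.

    Assumes bucket i covers thresholds[i] < n_tokens <= thresholds[i + 1].
    """
    idx = bisect_left(thresholds, n_tokens) - 1
    if idx < 0 or idx >= len(thresholds) - 1:
        return None
    upper = thresholds[idx + 1]
    return idx, upper + 2   # +2 accounts for BOS and EOS

thresholds = [0, 64, 128, 256]
n_buckets = len(thresholds) - 1   # four thresholds give three buckets
```

Under this convention a 64-token SMILES lands in bucket 0 with padded length 66, and a 65-token SMILES lands in bucket 1 with padded length 130.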

Output (do_split=True): One .npy memmap file per bucket, each with shape (N, 2, threshold_upper+2) where [:, 0, :] is source and [:, 1, :] is target, stored as int16.
Shorter buckets use smaller arrays — the shape varies per bucket.

Output (do_split=False): A single .npy with shape (N, 2, max_length).
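The shapes above make it easy to estimate disk usage. The sketch below computes only the array payload for int16 data (2 bytes per element); the small .npy header is ignored, and the row count of one million is an arbitrary example.

```python
INT16_BYTES = 2

def bucket_nbytes(n_rows, seq_len):
    """Payload size in bytes of an int16 array of shape (n_rows, 2, seq_len)."""
    return n_rows * 2 * seq_len * INT16_BYTES

# One million pairs in the 0-64 bucket (padded to 64 + 2) versus the same
# pairs in a single unsplit file padded to max_length=512.
split_bytes = bucket_nbytes(1_000_000, 64 + 2)
unsplit_bytes = bucket_nbytes(1_000_000, 512)
```

For short molecules the bucketed layout is several times smaller, which is one practical reason to prefer do_split=True on large datasets.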

A SMILES sidecar file (.csv.gz or .parquet depending on smiles_format) is also written per bucket when do_split=True.

Memory: Processing is bounded to O(max_workers) RAM regardless of dataset size — suitable for 100M+ row inputs.
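One common way to get memory that scales with the worker count rather than the dataset size is to cap the number of in-flight tasks while consuming the input lazily. The sketch below shows that general pattern with threads; it is not taken from split_and_tokenize, whose actual mechanism may differ, and `bounded_map` is a hypothetical name.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def bounded_map(fn, items, max_workers=4):
    """Yield fn(item) for each item, keeping at most max_workers tasks in flight.

    Results arrive in completion order, not input order.
    """
    pending = set()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for item in items:
            if len(pending) >= max_workers:
                # Block until at least one task finishes before submitting more.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                for fut in done:
                    yield fut.result()
            pending.add(pool.submit(fn, item))
        # Drain whatever is still running.
        for fut in pending:
            yield fut.result()

# The input is a generator, so items are pulled only as capacity frees up.
lengths = sorted(bounded_map(len, (s for s in ["CCO", "c1ccccc1", "C"]), max_workers=2))
```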

Step 4 — Per-bucket Augmentation (optional)

After split_and_tokenize, use augment_bins to generate augmented NPY files for selected buckets.
The original NPY and sidecar files are not modified.

from pan_core.common.data_handler import augment_bins

augment_bins(
    base_dir="data/",
    stem="processed",           # stem used by split_and_tokenize
    token_path="vocab/normal_tokens.txt",
    thresholds=[0, 64, 128, 256, 512],
    bin_num_random={0: 10, 2: 5},   # bin index → random SMILES per molecule
    sidecar_format="parquet",        # must match smiles_format used in split_and_tokenize
    max_workers=8,
)

| Argument | Description |
| --- | --- |
| base_dir | Directory where split_and_tokenize wrote its outputs |
| stem | File stem (same as derived from csv_path in split_and_tokenize) |
| bin_num_random | Dict mapping bucket index → number of random SMILES to generate per molecule |
| sidecar_format | "parquet" or "csv"; must match smiles_format used in split_and_tokenize |

Output per bucket i with n = bin_num_random[i]:

| File | Shape / Format | Description |
| --- | --- | --- |
| {stem}_{lo}to{hi}_aug{n}.npy | (M, 2, hi+2) int16 | Augmented memmap; [:, 0, :] random, [:, 1, :] canonical |
| {stem}_{lo}to{hi}_aug{n}.parquet | two columns | random + canonical sidecar for the augmented pairs |

M may be less than N × n, where N is the number of molecules in the bucket, if some molecules fail RDKit augmentation.
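The file-name pattern can be expanded for a given call as a quick check of what to expect on disk. `augmented_npy_names` is a hypothetical helper that only applies the {stem}_{lo}to{hi}_aug{n}.npy pattern from the table; it does not call pan_core.

```python
def augmented_npy_names(stem, thresholds, bin_num_random):
    """Map each selected bucket index to its augmented .npy file name."""
    names = {}
    for idx, n in sorted(bin_num_random.items()):
        lo, hi = thresholds[idx], thresholds[idx + 1]
        names[idx] = f"{stem}_{lo}to{hi}_aug{n}.npy"
    return names

# Same arguments as the augment_bins call above.
names = augmented_npy_names("processed", [0, 64, 128, 256, 512], {0: 10, 2: 5})
```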

Expected Input Summary

| Field | Type | Description |
| --- | --- | --- |
| Source SMILES | str | Random (non-canonical) SMILES string |
| Target SMILES | str | Canonical SMILES string (RDKit output) |
| SMILES file | .csv / .csv.gz / .parquet | Two-column file with random and canonical columns |
| Vocabulary file | .txt | One token per line; default vocab size 185 |
| Memmap file | .npy | int16, shape (N, 2, L); src at [:, 0, :], tgt at [:, 1, :]; L = threshold_upper + 2 per bucket when do_split=True |
| Latent array | .npy | float32, shape (N, E); for --latent_decode mode |
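For the vocabulary file, a one-token-per-line layout can be read into a token-to-id mapping as sketched below. Two things here are assumptions: that a token's id equals its line index, and the token names in the toy file (the real vocabulary is described only as having 185 entries by default). `load_vocab` is a hypothetical helper, not the SMILESTokenizer loader.

```python
import os
import tempfile

def load_vocab(path):
    """Read one token per line and map each token to its line index."""
    with open(path, encoding="utf-8") as fh:
        tokens = [line.rstrip("\n") for line in fh if line.strip()]
    return {tok: i for i, tok in enumerate(tokens)}

# Tiny stand-in vocabulary with made-up special-token names.
vocab_path = os.path.join(tempfile.mkdtemp(), "toy_tokens.txt")
with open(vocab_path, "w", encoding="utf-8") as fh:
    fh.write("\n".join(["<pad>", "<s>", "</s>", "C", "c", "O", "1", "="]) + "\n")

vocab = load_vocab(vocab_path)
```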

Configuration

All runs are controlled by a single YAML file.
See sample/config.yml for a fully annotated example.

Top-level sections

| Section | Description |
| --- | --- |
| system | Environment (hpc/normal), seed, world size, workers, bf16 |
| logging | WandB settings, log interval |
| model | Architecture: VAE flag, embedding dim, Transformer layers/heads, RoPE, latent pooling |
| training | Strategy, optimizer, ZClip, checkpointing, validation, evaluation, data |

Training strategies

| strategy.type | Description |
| --- | --- |
| normal | Standard single-phase training |
| phased | Multi-phase curriculum: each phase adds a new data split; earlier data is replayed at replay_ratio |
| dynamic | Bucket-based curriculum: sampling probabilities per bucket interpolate from start_sampling_probabilities to end_sampling_probabilities over training |
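The dynamic strategy can be pictured with a small interpolation helper. The sketch assumes a linear schedule driven by a progress value between 0 and 1; the README says only that the probabilities interpolate, so the actual schedule shape may differ. The probability values and the name `interpolate_probabilities` are illustrative.

```python
def interpolate_probabilities(start, end, progress):
    """Linearly blend two probability vectors; progress runs from 0.0 to 1.0."""
    if len(start) != len(end):
        raise ValueError("start and end must have one entry per bucket")
    t = min(max(progress, 0.0), 1.0)          # clamp to the valid range
    mixed = [(1.0 - t) * s + t * e for s, e in zip(start, end)]
    total = sum(mixed)
    return [p / total for p in mixed]         # renormalise against rounding drift

start = [0.7, 0.2, 0.1]   # favour short-molecule buckets early (illustrative)
end = [0.2, 0.3, 0.5]     # shift toward long-molecule buckets late
halfway = interpolate_probabilities(start, end, 0.5)
```

At progress 0.5 each bucket's probability sits midway between its start and end value, and the vector still sums to one.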

Key model parameters (model section)

| Parameter | Default | Description |
| --- | --- | --- |
| use_vae | false | Enable VAE mode (adds KL loss with beta weighting) |
| common.embedding_dim | 512 | Hidden/latent dimension |
| common.dropout | 0.2 | Dropout probability |
| Transformer.n_layer | 8 | Number of encoder/decoder layers |
| Transformer.n_head | 8 | Number of attention heads |
| Transformer.n_positions | 2048 | Maximum sequence length |
| Transformer.latent.pooling_mode | MHA | Latent pooling: MHA or concat |
| Transformer.latent.cross_attn | L-T | Cross-attention mode: L-T, S-T, or blank |
| Transformer.latent.latent_conditioning | adaln-zero | Decoder conditioning: add_once, adaln, adaln-zero, or blank |
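Assuming the dotted parameter names correspond to nested keys under the model section (an assumption; sample/config.yml is the authoritative layout), the defaults above can be written as a nested mapping and read back with a dotted-key lookup. `MODEL_DEFAULTS` and `get_dotted` are names invented for this sketch.

```python
# Defaults copied from the table above, arranged as nested keys (assumed layout).
MODEL_DEFAULTS = {
    "use_vae": False,
    "common": {"embedding_dim": 512, "dropout": 0.2},
    "Transformer": {
        "n_layer": 8,
        "n_head": 8,
        "n_positions": 2048,
        "latent": {
            "pooling_mode": "MHA",
            "cross_attn": "L-T",
            "latent_conditioning": "adaln-zero",
        },
    },
}

def get_dotted(cfg, dotted_key):
    """Walk a nested dict using a key such as 'Transformer.latent.pooling_mode'."""
    node = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node

# Per-head width under the usual multi-head convention (embedding_dim / n_head),
# which is assumed here rather than confirmed from layer.py.
head_dim = get_dotted(MODEL_DEFAULTS, "common.embedding_dim") // get_dotted(
    MODEL_DEFAULTS, "Transformer.n_head"
)
```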

Running

The main entry point is the pan_core.main module. Once the package is installed, run it with python -m pan_core.main:

python -m pan_core.main --config path/to/config.yml [MODE FLAGS] [OPTIONS]

Global arguments

| Argument | Required | Description |
| --- | --- | --- |
| --config | Yes | Path to the YAML configuration file |
| --output_dir | No | Output directory for checkpoints/logs/results. Defaults to <config_dir>/output/ |
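The default for --output_dir can be reproduced with pathlib, assuming <config_dir> means the directory that contains the config file. `resolve_output_dir` is a hypothetical helper for illustration and is not part of the CLI.

```python
from pathlib import Path

def resolve_output_dir(config_path, output_dir=None):
    """Return the explicit output_dir, or <config_dir>/output when none is given."""
    if output_dir is not None:
        return Path(output_dir)
    return Path(config_path).parent / "output"

default_dir = resolve_output_dir("sample/config.yml")
explicit_dir = resolve_output_dir("sample/config.yml", "runs/exp01")
```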

Mode: Training (--do_train)

Runs the full training pipeline: config → tokenizer → model → data → train → save.

python -m pan_core.main \
    --config sample/config.yml \
    --output_dir runs/exp01 \
    --do_train

  • Resumes automatically from the latest checkpoint if one exists in --output_dir.
  • Final model is saved to <output_dir>/final_model/.
  • Tokenizer vocabulary is saved alongside the model.
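To see which checkpoint a resumed run would likely pick up, the checkpoint-N subdirectories can be scanned for the largest N. The sketch assumes "latest" means the highest step number; `latest_checkpoint` is a hypothetical helper and the directory names are created in a temporary folder purely for demonstration.

```python
import re
import tempfile
from pathlib import Path

_CKPT = re.compile(r"^checkpoint-(\d+)$")

def latest_checkpoint(output_dir):
    """Return the checkpoint-N subdirectory with the largest N, or None."""
    best = None
    for child in Path(output_dir).iterdir():
        match = _CKPT.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))   # compare numerically, not as text
            if best is None or step > best[0]:
                best = (step, child)
    return None if best is None else best[1]

run_dir = Path(tempfile.mkdtemp())
for step in (500, 1000, 10000):
    (run_dir / f"checkpoint-{step}").mkdir()
(run_dir / "final_model").mkdir()        # ignored: does not match the pattern
found = latest_checkpoint(run_dir)
```

Comparing step numbers as integers matters because a plain string sort would place checkpoint-500 after checkpoint-10000.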

Mode: Generation / Evaluation (--do_eval_gen)

Runs autoregressive greedy decoding (or a teacher-forced forward pass when --teacher_forcing is set) over the evaluation dataset and saves the results.

python -m pan_core.main \
    --config sample/config.yml \
    --output_dir runs/exp01 \
    --do_eval_gen \
    --checkpoint runs/exp01/final_model \
    --csv_output runs/exp01/gen_results.csv

| Argument | Default | Description |
| --- | --- | --- |
| --checkpoint | <output_dir>/final_model | Path to the model checkpoint to load |
| --encode_only | false | Only run the encoder; saves latent vectors, skips decoding |
| --latent_decode | false | Decode from pre-computed latent arrays (skips encoding; expects .npy at eval_data.path) |
| --teacher_forcing | false | Use teacher-forced forward pass instead of autoregressive decoding |
| --csv_output | <output_dir>/gen_results.csv | Output CSV path. Columns: judge, input, generated, target |
| --encode_output | <output_dir>/encode_latents.npy | Output .npy path for latent vectors (used when --encode_only) |
| --need_logits | false | Also collect and save full output logits |
| --logits_output | <output_dir>/gen_output_logits.npy | Output .npy path for logits |

Output CSV columns:

| Column | Description |
| --- | --- |
| judge | True if generated == target (perfect reconstruction) |
| input | Decoded source SMILES |
| generated | Model-generated SMILES |
| target | Ground-truth canonical SMILES |
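Given those columns, the perfect-reconstruction rate can be recomputed from a results file with the standard library. The sketch compares generated and target directly instead of parsing judge, since the README does not say how the boolean is serialised. The sample rows are made up and `perfect_accuracy` is a hypothetical helper.

```python
import csv
import io

def perfect_accuracy(csv_text):
    """Fraction of rows whose generated SMILES exactly equals the target."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row["generated"] == row["target"])
    return hits / len(rows)

# Made-up rows in the documented column order.
sample = """judge,input,generated,target
True,C1=CC=CC=C1,c1ccccc1,c1ccccc1
False,OCC,CCC,CCO
True,C(C)O,CCO,CCO
"""
acc = perfect_accuracy(sample)
```

For a file on disk, the same function can be fed the contents of gen_results.csv read as text.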

Mode: Accuracy Progress (--get_acc_progress)

Evaluates all checkpoint-N subdirectories in a directory and produces a training-step accuracy curve.

python -m pan_core.main \
    --config sample/config.yml \
    --output_dir runs/exp01 \
    --get_acc_progress \
    --checkpoint_dir runs/exp01

| Argument | Required | Description |
| --- | --- | --- |
| --checkpoint_dir | Yes | Directory containing checkpoint-N subdirectories |

Output: <output_dir>/acc_progress.csv with columns perfect_accuracy, partial_accuracy, and (for phased/dynamic strategies) phase{i}_data_perfect_accuracy per phase. Per-checkpoint CSVs are also saved.


Mode: Hidden State Extraction (--get_hidden)

Runs a full forward pass over the evaluation dataset and saves all intermediate hidden states and attention scores.

python -m pan_core.main \
    --config sample/config.yml \
    --output_dir runs/exp01 \
    --get_hidden \
    --checkpoint runs/exp01/final_model \
    --hidden_output runs/exp01/hidden_output.pkl

| Argument | Default | Description |
| --- | --- | --- |
| --checkpoint | <output_dir>/final_model | Path to the model checkpoint |
| --hidden_output | <output_dir>/hidden_output.pkl | Output .pkl path for hidden states |

Output: A pickle file containing a nested dict:

{
    "outputs": {
        "encoder": [array_layer0, array_layer1, ...],  # each: [N, L, E]
        "decoder": [array_layer0, array_layer1, ...],
    },
    "attention_scores": {
        "encoder": [array_layer0, array_layer1, ...],  # each: [N, H, L, L]
        "decoder": [array_layer0, array_layer1, ...],
    }
}

Note: torch.compile is intentionally disabled in this mode because fullgraph=True is incompatible with intermediate tensor returns (need_hidden=True).
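A quick way to inspect such a pickle is to count the layers stored under each key. The sketch below round-trips a placeholder dict with the same keys through pickle; the string entries stand in for the real arrays, and `summarize` is a hypothetical helper.

```python
import pickle

def summarize(hidden):
    """Count stored layers for each section ('outputs', ...) and side ('encoder', ...)."""
    return {
        section: {side: len(layers) for side, layers in sides.items()}
        for section, sides in hidden.items()
    }

# Placeholder payload with the documented keys; real entries are arrays, not strings.
fake = {
    "outputs": {"encoder": ["L0", "L1"], "decoder": ["L0", "L1"]},
    "attention_scores": {"encoder": ["L0", "L1"], "decoder": ["L0", "L1"]},
}
blob = pickle.dumps(fake)
summary = summarize(pickle.loads(blob))
```

With a real file, replace the round trip with pickle.load on the opened .pkl, and only unpickle files from a trusted source.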


Using as a library

All four pipeline functions can be imported and called directly without the CLI:

from pan_core.main import (
    run_training,
    run_generation_evaluation,
    run_accuracy_progress,
    get_hidden_outputs,
)

# Example: run encoder-only evaluation on the configured dataset and save the latents
results, training_args = run_generation_evaluation(
    config="path/to/config.yml",
    output_dir="runs/exp01",
    checkpoint_path="runs/exp01/final_model",
    encode_only=True,
    latent_decode=False,
    encode_output_file="latents.npy",
)

run_generation_evaluation and get_hidden_outputs both accept a pre-loaded (ModelConfig, TrainArgs) tuple as the config argument, which avoids re-parsing the YAML when called in a loop.


References

  • Inspired by clmpy (Shumpei Nemoto, Mizuno group)

Contact

For questions or comments, open a GitHub issue or contact:

  • takuho2002[at]outlook.jp
  • tadahaya[at]gmail.com (lead contact)

About

A repository for Wide Chemical Language Model
