SDF-based dataset support#32

Draft
sfluegel05 wants to merge 21 commits into dev from feature/sdf-support

@sfluegel05
Contributor

Add support for the SDF-based dataset (ChEB-AI/python-chebai#147). This mainly includes:

  • allowing Chem.Mol objects as input to _read_data functions
  • allowing Chem.Mol as input to read_property (also: the _read_data function now returns the augmented molecule dictionary which gets passed to read_property, avoiding a complete recalculation for each property)
  • using a standard sanitize_molecule function to ensure consistent SMILES parsing
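
The second point (compute the augmented molecule dictionary once, then pass it to every property) can be sketched roughly as follows. This is only an illustration under assumptions: `augment_molecule`, `read_num_atoms`, `read_atom_set` and `read_all_properties` are hypothetical stand-ins, and a "molecule" here is just a list of atom symbols rather than a `Chem.Mol`. It is not the code in this PR.

```python
from typing import Any, Callable, Dict, List

# Counts how often the (hypothetical) expensive augmentation step runs.
augmentation_calls = 0


def augment_molecule(mol: List[str]) -> Dict[str, Any]:
    """Stand-in for the expensive step that builds the augmented molecule dictionary."""
    global augmentation_calls
    augmentation_calls += 1
    atoms = list(mol)  # assumption: a "molecule" is a plain list of atom symbols
    return {"atoms": atoms, "num_atoms": len(atoms)}


# Hypothetical property readers: each one takes the precomputed dictionary
# instead of re-deriving it from the raw molecule.
PropertyReader = Callable[[Dict[str, Any]], Any]


def read_num_atoms(augmented: Dict[str, Any]) -> int:
    return augmented["num_atoms"]


def read_atom_set(augmented: Dict[str, Any]) -> List[str]:
    return sorted(set(augmented["atoms"]))


def read_all_properties(
    mol: List[str], readers: Dict[str, PropertyReader]
) -> Dict[str, Any]:
    augmented = augment_molecule(mol)  # computed once per molecule
    return {name: reader(augmented) for name, reader in readers.items()}


readers = {"num_atoms": read_num_atoms, "atom_set": read_atom_set}
result = read_all_properties(["C", "C", "O"], readers)
```

With two readers registered, the augmentation step still runs only once per molecule, which is the saving described above.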

One issue I came across while testing this: the AugAtomNumHs property (or any property inheriting from FrozenPropertyAlias) doesn't allow new tokens to be created. @aditya0by0 How should I add new tokens here? Should I first build a dataset with the non-augmented version of each property and then use those to create an augmented dataset?

The above exception was the direct cause of the following exception:
    return self.compute_fn(*args)
  File "/home/staff/s/sifluegel/python-chebai/chebai/cli.py", line 46, in call_data_methods
    data.setup()
  File "/home/staff/s/sifluegel/python-chebai/chebai/preprocessing/datasets/base.py", line 524, in setup
    self._after_setup(**kwargs)
  File "/home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/datasets/chebi.py", line 199, in _after_setup
    self._setup_properties()
  File "/home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/datasets/chebi.py", line 168, in _setup_properties
    property.on_finish()
  File "/home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/properties/base.py", line 206, in on_finish
    raise ValueError(
ValueError: AugAtomNumHs attempted to add new tokens to a frozen encoder at /home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/bin/AtomNumHs/indices_one_hot.txt

@aditya0by0
Member

aditya0by0 commented Feb 26, 2026

One issue I came across while testing this: the AugAtomNumHs property (or any property inheriting from FrozenPropertyAlias) doesn't allow new tokens to be created. @aditya0by0 How should I add new tokens here?

Yes, the approach quoted below is the intended one. FrozenPropertyAlias was designed to reuse the non-augmented versions of properties when working with augmented data. Because of this design, alias properties are not allowed to create new tokens; they can only use tokens that already exist. This restriction is enforced intentionally in the logic of FrozenPropertyAlias.

Should I first build a dataset with the non-augmented version of each property and then use those to create an augmented dataset?

class FrozenPropertyAlias(MolecularProperty, ABC):
    """
    Wrapper base class for augmented graph properties that reuse existing molecular properties.

    This allows an augmented property class (with an 'Aug' prefix in its name) to:
    - Reuse the encoder and index files of the base property by removing the 'Aug' prefix from its name.
    - Prevent adding new tokens to the encoder cache by freezing it (using MappingProxyType).

    Usage:
        Inherit from FrozenPropertyAlias and the desired base molecular property class,
        and name the class with an 'Aug' prefix (e.g., 'AugAtomType').

    Example:
        ```python
        class AugAtomType(FrozenPropertyAlias, AtomType): ...
        ```
    """
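
To make the two-step workflow concrete, here is a minimal sketch of the freezing behaviour described in the docstring. It is a simplification under assumptions: `TokenEncoder` and `base_property_name` are hypothetical names, the tokens are invented, and the error is raised at encode time here, whereas the traceback above shows the real check running in `on_finish`. Only `types.MappingProxyType` is taken from the docstring.

```python
from types import MappingProxyType
from typing import Mapping


class TokenEncoder:
    """Hypothetical token-to-index encoder; not the chebai-graph implementation."""

    def __init__(self, cache: Mapping[str, int], frozen: bool = False) -> None:
        # A frozen encoder only holds a read-only view of a copy of the cache.
        self.cache = MappingProxyType(dict(cache)) if frozen else dict(cache)
        self.frozen = frozen

    def encode(self, token: str) -> int:
        if token not in self.cache:
            if self.frozen:
                raise ValueError(
                    f"attempted to add new token {token!r} to a frozen encoder"
                )
            self.cache[token] = len(self.cache)  # only reached when not frozen
        return self.cache[token]


def base_property_name(class_name: str) -> str:
    """Strip the 'Aug' prefix so the alias resolves to the base property's files."""
    return class_name[len("Aug"):] if class_name.startswith("Aug") else class_name


# Step 1: build the non-augmented property first so that all tokens exist.
base = TokenEncoder({})
for num_hs in ["0", "1", "2", "3"]:
    base.encode(num_hs)

# Step 2: the augmented alias reuses the base tokens through a frozen view.
aug = TokenEncoder(base.cache, frozen=True)
known_index = aug.encode("2")  # known token: looked up, not created

try:
    aug.encode("4")  # unknown token: the frozen encoder refuses to add it
    rejected = False
except ValueError:
    rejected = True
```

In this sketch the fix for a rejected token is the same as in the answer above: extend the base (non-augmented) encoder first, then rebuild the frozen alias from it.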
