Warning
**The Zenodo record(s) for the database will be taken down pending confirmation of, or amendments to, the licensing agreement pertaining to the enclosed crystallographic data.**
MOSAEC-DB is a database of metal-organic framework (MOF) and coordination polymer crystallographic information files (.cif) processed for atomistic simulations.
This repository collects the software applied during database construction, validation, and analysis.
It also documents how to use the database and records changes to it:
The MOSAEC-DB files, including all publicly available crystal structures, scripts, and supplemental data, can be downloaded from the Zenodo repository.
Initial crystal structures may be retrieved directly from the Cambridge Structural Database (CSD) through the ConQuest program or the CSD Python API.
Retrieved CSD crystal structures were converted to P1 symmetry using our CSD-Cleaner code.
The SAMOSA solvent removal method was utilized in database construction. A publication outlining the details of this method is available. All *_full MOSAEC-DB structures were generated with default settings, while *_partial structures were generated with the --keep_bound option.
The results of the MOSAEC error checking algorithm were used to determine whether to include a structure in MOSAEC-DB. A preprint article outlining the details of this method is available, and the GitHub repository will be available to the public shortly (following its publication). Only structures which passed the MOSAEC error flagging routine were included in the final database.
Additional tools that check for problematic structures not caught by the structure error analyses are provided in structure_validation. These include codes that search for overlapping and hypervalent atom sites.
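As an illustration of what an overlapping-site check involves, the sketch below flags pairs of sites that lie closer together than a distance tolerance, taking periodic images into account. It is not the code shipped in structure_validation: the function names, the 0.5 Å default tolerance, and the choice to search only the 27 neighbouring cells are all assumptions made for this example.

```python
import itertools
import math


def frac_to_cart(frac, lattice):
    """Convert fractional coordinates to Cartesian.

    `lattice` holds the cell vectors a, b, c as rows (in angstroms).
    """
    return tuple(sum(frac[i] * lattice[i][k] for i in range(3)) for k in range(3))


def find_overlapping_sites(lattice, frac_coords, tol=0.5):
    """Return (i, j, distance) for every site pair closer than `tol`.

    Hypothetical helper: `tol` = 0.5 angstrom is an illustrative default.
    Only the 27 neighbouring cells are searched, which is adequate for
    cells that are not strongly skewed.
    """
    shifts = itertools.product((-1, 0, 1), repeat=3)
    shift_vecs = [frac_to_cart(s, lattice) for s in shifts]
    # Wrap every site into the unit cell before comparing.
    cart = [frac_to_cart([f % 1.0 for f in fc], lattice) for fc in frac_coords]

    overlaps = []
    for i, j in itertools.combinations(range(len(cart)), 2):
        # Shortest distance between site i and any periodic image of site j.
        d = min(
            math.dist(cart[i], tuple(cart[j][k] + sv[k] for k in range(3)))
            for sv in shift_vecs
        )
        if d < tol:
            overlaps.append((i, j, d))
    return overlaps


if __name__ == "__main__":
    cubic = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
    sites = [(0.0, 0.0, 0.0), (0.99, 0.0, 0.0), (0.5, 0.5, 0.5)]
    print(find_overlapping_sites(cubic, sites))
```

In the usage example the first two sites sit on opposite faces of the cell, so they are only detected as overlapping because periodic images are considered.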
Criteria based on pointwise distance distribution (PDD) scores were applied to identify duplicated and/or highly similar crystal structures with shared empirical formulas. The codes used to complete this analysis are provided in duplicates.
Descriptors characterizing the database's geometric and chemical environments were generated using a number of standard libraries.
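The overall shape of such a duplicate search can be sketched as follows: structures are grouped by empirical formula, and pairs within a group are flagged when a distance between their descriptors falls below a threshold. This is a simplified stand-in rather than the code in duplicates. The function names and threshold are hypothetical, and the Chebyshev distance on fixed-length vectors replaces the actual PDD comparison, which this sketch does not implement.

```python
from collections import defaultdict
from itertools import combinations


def chebyshev(u, v):
    """Largest absolute component-wise difference (stand-in metric, an assumption)."""
    return max(abs(a - b) for a, b in zip(u, v))


def find_duplicate_candidates(entries, threshold, distance=chebyshev):
    """Flag structure pairs that share a formula and have similar descriptors.

    `entries` is an iterable of (name, empirical_formula, descriptor) tuples.
    Only structures with the same empirical formula are compared.
    """
    by_formula = defaultdict(list)
    for name, formula, descriptor in entries:
        by_formula[formula].append((name, descriptor))

    pairs = []
    for members in by_formula.values():
        for (name_a, desc_a), (name_b, desc_b) in combinations(members, 2):
            if distance(desc_a, desc_b) <= threshold:
                pairs.append((name_a, name_b))
    return pairs


if __name__ == "__main__":
    # Made-up names, formulas, and descriptor values for illustration only.
    entries = [
        ("A", "C8H4O4Zn", [1.0, 2.0]),
        ("B", "C8H4O4Zn", [1.0, 2.05]),
        ("C", "C8H4O4Zn", [3.0, 4.0]),
        ("D", "C8H4O4Cu", [1.0, 2.0]),
    ]
    print(find_duplicate_candidates(entries, threshold=0.1))
```

Restricting comparisons to structures with the same formula keeps the number of pairwise distance evaluations far below an all-against-all search.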
The codes generating the atomic property-weighted radial distribution function (AP-RDF) and revised autocorrelation function (RAC) are identical to those used in the ARC-MOF database.
Geometric properties were generated using the Zeo++ v0.3.0 software with default settings.
Any descriptor-generation code that was not previously made available is provided in descriptors, including the code for the atom-specific persistent homology descriptors included in the Zenodo record.
Electrostatic potential-derived partial atomic charges were computed for as many MOSAEC-DB structures as possible using the previously reported REPEAT method. The most recent version of this code is available at the following repository.
Additionally, a broader set of MOSAEC-DB structures was assigned ML-predicted partial atomic charges using the MEPO-ML models that we reported in a recent publication.
Each crystal structure's dimensionality was calculated using a previously reported algorithm.
The CrystalNets package was applied to compute the net topology. A simple Julia script was used to characterize MOSAEC-DB crystal structures.
Additional utilities unrelated to the MOSAEC-DB construction and characterization processes are also provided in the Zenodo record to facilitate simple file manipulations. These tools are described below.
Crystal structures that were unchanged by the database construction protocol due to their lack of solvent are listed in corresponding text files (.gcd). Access to these structures is subject to the user's CSD license status; however, the computation-ready structures can be regenerated by applying the provided structure processing codes to the relevant CSD refcodes. This script makes use of a CSD-Cleaner code that depends on the CSD Python API and pymatgen packages.
python get_unchanged_mofs.py --remove_disorder
Additionally, these scripts provide experimental functionality to regenerate the structure files containing partial atomic charges (REPEAT/MEPO-ML). The behavior of these functions may vary between versions of pymatgen and the CSD Python API, so we cannot guarantee that the regenerated files reproduce the originally computed partial atomic charges. Testing was performed using Python 3.9.x, csd-python-api 3.1.0, and pymatgen 2024.5.31.
python get_unchanged_mofs.py --remove_disorder --write_repeat --write_mepoml
Subsets of MOSAEC-DB are arranged according to common conventions of porosity in prior databases, as well as a diverse sampling of several chemical and geometric descriptors. Sampling was achieved using farthest point sampling of the desired descriptor vector.
python get_subset.py ../subsets/______.txt
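For readers unfamiliar with the farthest point sampling mentioned above, a minimal pure-Python sketch follows. It is not the code used to build the MOSAEC-DB subsets: the function name, the Euclidean metric, and the choice of starting index are assumptions made for illustration.

```python
import math


def farthest_point_sampling(vectors, n_samples, start=0):
    """Greedily pick `n_samples` indices that are mutually far apart.

    Each step selects the point whose distance to the nearest
    already-selected point is largest. Assumes the descriptor vectors are
    distinct; `start` (an arbitrary choice here) seeds the selection.
    """
    if not 0 < n_samples <= len(vectors):
        raise ValueError("n_samples must be between 1 and len(vectors)")

    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(v, vectors[start]) for v in vectors]

    while len(selected) < n_samples:
        nxt = max(range(len(vectors)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        for i, v in enumerate(vectors):
            d = math.dist(v, vectors[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


if __name__ == "__main__":
    # Made-up two-dimensional descriptor vectors for illustration only.
    descriptors = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0), (0.5, 0.0)]
    print(farthest_point_sampling(descriptors, n_samples=3))
```

Because each new point maximizes its distance to the current selection, the sample spreads across the descriptor space instead of clustering in densely populated regions. In practice, descriptors on different scales would typically be normalized before computing distances.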
Information regarding future updates and additions to the database will be outlined in the GitHub repository established at the time of publication.
The CC BY 4.0 license applies to the use of MOSAEC-DB. Follow the license guidelines regarding the use, sharing, adaptation, and attribution of this data.
Please cite the following article when using MOSAEC-DB.
Reach out to any of the following authors with any questions:
Marco Gibaldi: marco.gibaldi@uottawa.ca
Tom Woo: tom.woo@uottawa.ca
