Merged

49 commits
019f485
Outlined tutorials
cedriclim1 Jan 31, 2026
c4a5019
Added a README file for tomography/hpc
cedriclim1 Jan 31, 2026
617c687
Renamed notebooks to tutorials
cedriclim1 Jan 31, 2026
6d3c798
Finished tomography lite tutorial v1
cedriclim1 Feb 2, 2026
97a6699
Example tilt series and tilt angles .npy files (what's the best way t…
cedriclim1 Feb 2, 2026
eab93ee
Tomography scripts and notebook tutorials implemented
cedriclim1 Feb 2, 2026
a4616c0
Added HPC tutorials
cedriclim1 Feb 2, 2026
9133716
.gitignore update
cedriclim1 Feb 2, 2026
2227eb6
Added bash script for queuing a job
cedriclim1 Feb 3, 2026
8edbf70
Oops, added the actual run_job here
cedriclim1 Feb 3, 2026
cb9b851
Script check
cedriclim1 Feb 3, 2026
d294124
Added phantom back in
cedriclim1 Feb 19, 2026
f1ebced
Update directories, and updated tomography_02_full.ipynb to due to ch…
cedriclim1 Mar 3, 2026
770444f
Moved some stuff around - updated tomography_01_lite.ipynb. Hyperpara…
cedriclim1 Mar 3, 2026
bf13263
Added tensorboard logging into tomography_01_lite.ipynb; A little jan…
cedriclim1 Mar 3, 2026
1842620
Also added tensorboard cell in tomography_02
cedriclim1 Mar 3, 2026
bbc6712
Tomography_02_full.ipynb complete with loading and saving; need to ad…
cedriclim1 Mar 3, 2026
0fcad81
Saving and loading added to TomographyLite tutorial
cedriclim1 Mar 3, 2026
afc3593
Tomography_02_full.ipynb everything runs
cedriclim1 Mar 3, 2026
a329ccc
HPC scripts updated module loads due to cudatoolkit 13.0 upgrade errors
cedriclim1 Mar 3, 2026
0a23e3e
Updates to tomography_02_full using new ObjectConstraints, and update…
cedriclim1 Mar 5, 2026
2724178
DatasetConstraints implemented in tomgraphy_02_full.ipynb
cedriclim1 Mar 5, 2026
db4dc42
DatasetConstraints implemented in tomgraphy_02_full.ipynb
cedriclim1 Mar 5, 2026
09edbfd
Fix tomography_recon.py in hpc tutorial
cedriclim1 Mar 10, 2026
b1e26b5
New tilt series and tilt angles in tomo_recon.py
cedriclim1 Mar 10, 2026
22eb134
Multiple fixes to file
cedriclim1 Mar 10, 2026
c69eb92
FIx
cedriclim1 Mar 11, 2026
fc5507a
NCCL version update
cedriclim1 Mar 20, 2026
37eeffa
destroy param groups hpc
cedriclim1 Mar 21, 2026
61088e8
Took out environment
cedriclim1 Mar 21, 2026
e6e0975
Ignore cuBLAS warnings temporarily
cedriclim1 Mar 21, 2026
8073786
Added to README.md on HPC tutorials. Also some small edits to with D…
cedriclim1 Mar 23, 2026
357814f
Update README.md
cedriclim1 Mar 23, 2026
5c1cd6c
TomoLite and TomoFull notebooks updated
cedriclim1 Mar 23, 2026
8c9ba61
Updated README to be more thorough and more explanations for HPC stuf…
cedriclim1 Mar 24, 2026
db75b4a
Setting up DDP for obj_model print statement in TomographyBase now on…
cedriclim1 Mar 24, 2026
a96650c
Ignore previous commit; Updated tomography_01_lite.ipynb and tomograp…
cedriclim1 Mar 24, 2026
b0171fd
Created a plot_losses() function shown in tomography_02_full.ipynb
cedriclim1 Mar 24, 2026
7f1059b
Added tensorboard_nersc.ipynb and updated README.md
cedriclim1 Mar 24, 2026
f4835b7
Fixed Logs and Outputs to be only printed on global_rank 0
cedriclim1 Mar 24, 2026
fc4727d
run_job.sh and run.sh, more explicit which ones to change
cedriclim1 Mar 24, 2026
1f13ac1
More updates to the bash scripts
cedriclim1 Mar 24, 2026
05bc8a2
Small comment on destroying all process groups for in
cedriclim1 Mar 24, 2026
8b2a26f
Added for the HPC README.md that the first iteration will take a whil…
cedriclim1 Mar 24, 2026
ba28893
Added for the HPC README.md that the first iteration will take a whil…
cedriclim1 Mar 24, 2026
4d56e2f
Merge branch 'tomography_tutorials' of https://github.com/electronmic…
cedriclim1 Mar 24, 2026
8d742d0
workspace gitignore
cedriclim1 Mar 24, 2026
405cdd6
Stop tracking .code-workspace files
cedriclim1 Mar 24, 2026
e8744bd
Merge main into tomography_tutorials
arthurmccray Mar 25, 2026
7 changes: 5 additions & 2 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,6 +1,9 @@
.ipynb_checkpoints
*.zip
*.npy
# *.npy
.DS_Store
dvlp_notebooks
notebooks/runs/vols/
outputs/

# GIT ignore .code-workspace
*.code-workspace
Binary file added data/phantom.npy
Binary file not shown.
Binary file added data/tilt_angles_1_deg_tilt_axis.npy
Binary file not shown.
Binary file added data/tilt_series_1_deg_tilt_axis.npy
Binary file not shown.
Binary file removed notebooks/tomography/runs/logs/auxiliary_params.pt
Binary file not shown.
Binary file not shown.
829 changes: 0 additions & 829 deletions notebooks/tomography/tomo_01.ipynb

This file was deleted.

File renamed without changes.
File renamed without changes.
File renamed without changes.
84 changes: 84 additions & 0 deletions tutorials/tomography/hpc/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
# HPC (NERSC) Tomography Reconstructions

This folder contains four files:

- `run.sh`: Sets the environment variables needed for `torchrun` to see all GPUs across nodes.
- `run_job.sh`: A sample job script for submitting to the Slurm scheduler.
- `tomography_recon.py`: The full reconstruction script; similar to `Scripts/tomography_02_full.py`.
- `tensorboard_nersc.ipynb`: A Jupyter notebook for visualizing TensorBoard logs on NERSC.

**Note: The first iteration of the reconstruction takes some time to start because the data loaders use `multiprocessing_context="spawn"`, which is slower to initialize than `fork` but more stable across different systems.**
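The trade-off behind that note can be seen with Python's standard `multiprocessing` module alone (a minimal sketch, not the tutorial's actual loader code):

```python
# Minimal sketch of the "spawn" vs. "fork" trade-off using only the
# standard library (hypothetical example, not the tutorial's loader code).
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # "spawn" starts each worker as a fresh interpreter: slower to launch,
    # but it avoids inheriting CUDA/threading state from the parent process,
    # which makes it safer on HPC systems than the Linux-default "fork".
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```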

# Compatibility Table

| System | Python Ver. | PyTorch Ver. | Status |
|--------|-------------|--------------|--------|
| NERSC Perlmutter | >=3.10 | >=2.10.0 | ✅ |

# Installing Conda Environments

Please refer to: https://docs.nersc.gov/development/languages/python/nersc-python/. It is advised to put your environments in `/global/common/software` for the fastest performance.

You will need to clone the quantem repository: `https://github.com/electronmicroscopy/quantem/` and install it in your conda environment using `pip install -e .`.

## Installation Steps

We recommend installing in the following order:

1. Create a conda environment in `/global/common/software` using the `--prefix` flag:
```bash
conda create --prefix /global/common/software/your_env_name python=3.xx.xx
```
2. Install PyTorch with CUDA support following [PyTorch's official installation page](https://pytorch.org/get-started/locally/). **Note: CUDA 13.0 is currently the default build; if you need a different CUDA version, specify the matching index URL as shown on that page**:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cuxxx
# replace cuxxx with your CUDA version; omit --index-url for the default (CUDA 13.0) build
```
3. Clone the `quantem` repository and install it in your conda environment:
```bash
git clone https://github.com/electronmicroscopy/quantem.git
cd quantem
pip install -e .
```
4. Edit `run.sh` (for interactive nodes) and `run_job.sh` (for batch jobs):
- `run.sh`:
- Replace `/global/common/software/mxxxx/user/conda/quantem` with the path to your conda environment
- `run_job.sh`:
- Replace `mxxxx` with your NERSC project ID
- Replace `'INSERT USER HERE'` with your NERSC username
- Replace `/global/common/software/mxxxx/user/conda/quantem` with the path to your conda environment
- Adjust the number of nodes and GPUs per node as needed
- Adjust the walltime as needed (default is 4 hours)
- *If* using this script for longer jobs, change the queue from `debug` to `regular` and adjust the walltime accordingly.
5. (Optional) Here is an example of how to allocate an interactive job on Perlmutter which will instantiate all the required variables for `torchrun` and `DDP`:

```bash
salloc -q interactive -C gpu -A mxxxx --nodes=4 --ntasks-per-node=1 --gpus-per-node=4 --cpus-per-task=128 --time=04:00:00
```

# Shifter Docker Environment

**3/22/2026 In Progress**

# FAQ

## Compatibility with HPC Systems

These tutorials have been heavily tested using NERSC's Perlmutter system. They should work on other HPC systems with similar configurations, but may require adjustments to the module loading and conda environment paths.

## How to run?

There are two ways to run the script across multiple nodes. On Perlmutter, you can launch an interactive job using `salloc`; interactive jobs are capped at 4 hours, and one can be allocated with:

`salloc -q interactive -C gpu -A m5241 --nodes=4 --ntasks-per-node=1 --gpus-per-node=4 --cpus-per-task=32 --time=04:00:00`

Once the job is allocated, run `bash run.sh` (or `sh run.sh`).

Alternatively, you can submit a job using `sbatch run_job.sh`. Usual parameters can be adjusted within the file.

These scripts work with the tutorial data as-is, but may require adjustments when using your own data.

## Tensorboard Visualization

See `tensorboard_nersc.ipynb` for an example of how to visualize tensorboard logs on NERSC. For different HPC systems, you may need to use a different method to visualize tensorboard logs via port-forwarding.
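On systems without such a helper package, TensorBoard is usually viewed through an SSH tunnel; here is a sketch of constructing the forwarding command (the username and hostname are placeholders):

```python
# Hypothetical sketch: build the ssh port-forwarding command used to view
# a remote TensorBoard instance locally (user/host below are placeholders).
def ssh_forward_cmd(user: str, host: str, port: int = 6006) -> str:
    # Forward the remote TensorBoard port (6006 by default) to the same
    # local port, so http://localhost:6006 shows the remote dashboard.
    return f"ssh -L {port}:localhost:{port} {user}@{host}"

print(ssh_forward_cmd("alice", "login.hpc.example.org"))
# ssh -L 6006:localhost:6006 alice@login.hpc.example.org
```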
31 changes: 31 additions & 0 deletions tutorials/tomography/hpc/run.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
#!/bin/bash

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export OMP_NUM_THREADS=8

echo "=== Loading modules ==="
ml nccl/2.29.2-cu13
ml cudatoolkit/13.0
ml conda
echo "=== Loaded modules ==="
module list

echo "=== Activating conda env ==="
conda activate /global/common/software/mxxxx/user/conda/quantem # CHANGE: Replace with the path to your conda environment
echo "=== Active conda env ==="
echo "CONDA_DEFAULT_ENV: $CONDA_DEFAULT_ENV"
echo "CONDA_PREFIX: $CONDA_PREFIX"

echo "=== Python being used ==="
which python
python --version

echo "=== Key packages ==="
python -c "import torch; print('torch:', torch.__version__, '| CUDA:', torch.cuda.is_available())"

echo "=== NCCL ==="
echo "NCCL_HOME: $NCCL_HOME"

echo "=== Starting srun ==="
srun -l torchrun --nnodes=$SLURM_JOB_NUM_NODES --nproc-per-node=$SLURM_GPUS_PER_NODE --rdzv-backend=c10d --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT tomography_recon.py
21 changes: 21 additions & 0 deletions tutorials/tomography/hpc/run_job.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
#!/bin/bash
#SBATCH -A mxxxx # CHANGE: Replace with your NERSC project ID
#SBATCH -C gpu
#SBATCH -q debug # CHANGE: Replace with 'regular' for longer jobs
#SBATCH -t 00:30:00 # CHANGE: Adjust walltime as needed
#SBATCH -N 4 # CHANGE: Adjust number of nodes as needed
#SBATCH --ntasks-per-node=1
#SBATCH -c 128
#SBATCH --gpus-per-node=4
#SBATCH --gpu-bind=none
#SBATCH --mail-type=begin,end,fail
#SBATCH --mail-user=xxx@xxx.xxx # CHANGE: Replace with your email address

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
export OMP_NUM_THREADS=8
ml nccl/2.29.2-cu13
ml cudatoolkit/13.0
ml conda
conda activate /global/common/software/mxxxx/'INSERT USER HERE'/conda/quantem # CHANGE: Replace 'INSERT USER HERE' with your NERSC username and update path to your conda environment
srun -l torchrun --nnodes=$SLURM_JOB_NUM_NODES --nproc-per-node=$SLURM_GPUS_PER_NODE --rdzv-backend=c10d --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT tomography_recon.py
58 changes: 58 additions & 0 deletions tutorials/tomography/hpc/tensorboard_nersc.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "8e4ba3b3",
"metadata": {},
"source": [
"# Tensorboard Visualizer on NERSC\n",
"\n",
"NERSC provides a helper package for displaying TensorBoard logs. If you are working on NERSC, go to `jupyter.nersc.gov` and select the `pytorch-2.6.0` environment. Below is an example of how to visualize TensorBoard logs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4ed42ca",
"metadata": {},
"outputs": [],
"source": [
"import nersc_tensorboard_helper\n",
"%load_ext tensorboard"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5aa635d1",
"metadata": {},
"outputs": [],
"source": [
"%tensorboard --logdir /global/homes/x/user/quantem-tutorials/outputs"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9965fcd8",
"metadata": {},
"outputs": [],
"source": [
"nersc_tensorboard_helper.tb_address()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
173 changes: 173 additions & 0 deletions tutorials/tomography/hpc/tomography_recon.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,173 @@
from quantem.tomography.tomography import Tomography
from quantem.tomography.dataset_models import TomographyINRDataset, DatasetConstraintParams
from quantem.tomography.object_models import ObjectINR, ObjConstraintParams
from quantem.tomography.logger_tomography import LoggerTomography
from quantem.core.ml.inr import HSiren
from quantem.core.ml.optimizer_mixin import SchedulerParams, OptimizerParams
import numpy as np

from quantem.core.utils.tomography_utils import fourier_binning
import torch

"""
Example script to run the tomography reconstruction on HPC (NERSC).

- The main difference from the notebook version is that the device does not need to be specified; it is set automatically by the DDP framework.

*Cedric Lim, 3/22/26*
"""

if __name__ == "__main__":

# Load Phantom Dataset
tilt_series = np.load('../../../data/tilt_series_1_deg_tilt_axis.npy')
tilt_angles = np.load('../../../data/tilt_angles_1_deg_tilt_axis.npy')

tilt_series = np.array([fourier_binning(img, (100, 100)) for img in tilt_series]) # Cropped down to 100x100 for speed

dset = TomographyINRDataset.from_data(
tilt_stack = tilt_series,
tilt_angles = tilt_angles,
)

# Initialize INR Model
model = HSiren(alpha = 1, winner_initialization = True)

# Initialize INR Object
obj_inr = ObjectINR.from_model(
shape = (100, 100, 100),
model = model,
)

# Define a logger
logger = LoggerTomography(
log_dir = "../../../outputs/tomography/tutorial_02_scripts/",
run_prefix = "inr_tomography_warmup_cosineanneal_hpc",
run_suffix = "",
log_images_every = 2,
)


# Initialize INR-Based Tomography Object
tomo_inr = Tomography.from_models(
dset = dset,
obj_model = obj_inr,
logger = logger,
verbose = False,
)

if tomo_inr.global_rank == 0:
print(f"Logs and Outputs will be saved at {logger.log_dir}")

# Define optimizer and scheduler parameters - only optimizing the object.

optimizer_params = {
"object": OptimizerParams.Adam(
lr = 1e-4,
),
"pose": OptimizerParams.Adam(
lr = 1e-2,
)
}
"""
All available scheduler params are in `core/ml/optimizer_mixin.py`

Scheduler types: 'cyclic', 'plateau', 'exp', 'gamma', 'linear', 'cosine_annealing'
Keyword arguments follow PyTorch scheduler documentation.
"""

scheduler_params = {
"object": SchedulerParams.Plateau(
mode = "min",
factor = 0.5,
patience = 10,
threshold = 1e-3,
min_lr = 1e-7,
),
"pose": SchedulerParams.Plateau(
mode = "min",
factor = 0.5,
patience = 10,
threshold = 1e-3,
min_lr = 1e-7,
)
}

"""
Defining the constraints that we want to apply to the object and dataset. In this case
adding a total-variational loss, enforcing positivity, and a shrinkage constraint.

For the dataset we can add a 1-D total-variational loss to the shifts and z-shifts.
However this may not be necessary depending on the dataset.
"""

obj_constraints = ObjConstraintParams.ObjINRConstraints(
positivity = True,
sparsity = 1e-6,
tv_vol = 1e-4,
)

## Dataset constraints not necessarily needed.

dataset_constraints = DatasetConstraintParams.BaseTomographyDatasetConstraints(
tv_shifts = 1e-6, # 1-D regularizer for the shift optimization
tv_zs = 1e-6, # 1-D regularizer for the z-shift optimization.
)


# Warmup Schedule for 10 epochs

num_samples_per_ray = [
(0, 20),
(1, 20),
(2, 40),
(3, 40),
(4, 60),
(5, 60),
(6, 80),
(7, 80),
(8, 100),
(9, 100),
]

tomo_inr.reconstruct(
num_iter = 10,
batch_size = 256,
optimizer_params = optimizer_params,
scheduler_params = scheduler_params,
obj_constraints = obj_constraints,
dset_constraints = dataset_constraints,
num_samples_per_ray = num_samples_per_ray,
num_workers = 32,
)

# Initialize pose optimizer

optimizer_params = {
"pose": {
"type": "adam",
"lr": 1e-2,
}
}

# Define new schedulers for both optimizers using CosineAnnealing
scheduler_params = {
"object": {
"type": "cosine_annealing",
},
"pose": {
"type": "cosine_annealing",
}
}

# Reconstruct

tomo_inr.reconstruct(
num_iter = 100,
optimizer_params = optimizer_params,
scheduler_params = scheduler_params,
num_samples_per_ray = 100,
)

# Clean up distributed training, good practice to put this at the end of the script.
torch.distributed.destroy_process_group()