This can already be done with …
I closed it too soon.

```shell
SLURM_SC_OFFLINE_ARGS="update State=DRAIN"
exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS NodeName=$HOSTNAME Reason="$LEADER $NOTE"
```

Versus:

```shell
SLURM_SC_OFFLINE_ARGS="reboot ASAP"
exec $SLURM_SCONTROL $SLURM_SC_OFFLINE_ARGS $HOSTNAME
```
I'm working on an improved version with Slurm 18.08 support.
I've got an internal version that we use. I'm going to push it to this branch.
*(Branch force-pushed from 57329fa to 7486112.)*
Ok, so the difference between node-mark-offline and node-mark-reboot is only a few lines, so they can easily be merged into one helper. The main issue is the online side: the mark-node-online helper can cancel pending reboots (if the node is healthy again) or mark nodes online after a reboot. We've opted to reboot them with `scontrol reboot ASAP`.
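Since the two helpers differ only in the `scontrol` arguments they pass, a merged version could dispatch on an action argument. A minimal sketch (hypothetical, not the code from this PR); `SLURM_SCONTROL` defaults to `echo` here so the commands are printed rather than executed — point it at `scontrol` for real use:

```shell
# Hypothetical merged helper: the offline and reboot paths differ only in
# the scontrol arguments, so one function can cover both.
SLURM_SCONTROL="${SLURM_SCONTROL:-echo}"

mark_node() {
    action="$1"; node="$2"; shift 2
    note="$*"
    case "$action" in
        offline) $SLURM_SCONTROL update State=DRAIN NodeName="$node" Reason="$note" ;;
        reboot)  $SLURM_SCONTROL reboot ASAP "$node" ;;
        *)       echo "usage: mark_node {offline|reboot} <node> [reason]" >&2; return 1 ;;
    esac
}

mark_node offline node01 "nhc: check failed"   # prints: update State=DRAIN NodeName=node01 Reason=nhc: check failed
mark_node reboot  node01                       # prints: reboot ASAP node01
```

The matching online path would cancel a pending reboot or resume the drained node, which is where most of the extra logic lives.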
*(Branch force-pushed from 3df3e9c to eebeed7.)*
For anyone looking to use this helper: it would work perfectly with something like a service file that runs NHC during boot. That's because we've opted to let the node return in a drained state, so it will only be resumed if NHC is run during the boot process (or manually). That's by design: we don't want to trigger a boot loop during the prologue, and we also want to avoid scheduling a job on a node before we know for sure that it's in a good state after the reboot. This is how we use it ourselves. Alternatively, there is a pull request that tries to handle it differently.
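The boot-time run could look like a one-shot systemd unit that executes NHC once the node is up; if all checks pass, the online helper resumes the drained node. A hypothetical unit file — the unit name, paths, and dependencies are assumptions, not taken from this PR:

```ini
# /etc/systemd/system/nhc-boot.service (hypothetical example)
[Unit]
Description=Run NHC once at boot so a healthy node is resumed
After=network-online.target slurmd.service
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/nhc

[Install]
WantedBy=multi-user.target
```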
I added a helper script to mark nodes for reboot. It's based on
`node-mark-offline`, but executes `scontrol reboot ASAP <node>` instead. This helper script can be used by setting `OFFLINE_NODE` to `$HELPERDIR/node-mark-reboot`, which is useful for checks that require a reboot when they fail. It only supports Slurm.
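Wiring this up is just a matter of pointing `OFFLINE_NODE` at the new helper in the NHC configuration, for example (the `HELPERDIR` path is illustrative and depends on your install):

```shell
# /etc/sysconfig/nhc -- illustrative example; adjust HELPERDIR to your install
HELPERDIR=/usr/libexec/nhc
OFFLINE_NODE=$HELPERDIR/node-mark-reboot    # schedule a reboot instead of draining on failure
ONLINE_NODE=$HELPERDIR/node-mark-online     # resume the node once it is healthy again
```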