Add ECC support for Nvidia professional GPUs #486

Open
LH-and-FPGA wants to merge 1 commit into Syllo:master from LH-and-FPGA:master

@LH-and-FPGA

Summary

Adds an ECC error counter to the device header, shown right after the power field. NVIDIA professional and datacenter GPUs (e.g. RTX PRO, Quadro, Tesla, A100/H100) expose volatile ECC error counts through NVML; this surfaces them in nvtop.

Display format is ECC <corrected>/<uncorrected>:

  • corrected — single-bit errors automatically fixed by ECC (normal)
  • uncorrected — double-bit errors; a non-zero value is highlighted in red since it indicates a hardware fault
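
The display rule above can be sketched in a few lines of C. This is only an illustration, not the code from this PR: `format_ecc_field` is a hypothetical helper, and the real drawing code would presumably apply the red attribute through the curses interface rather than return a flag.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper (not taken from the PR): writes the field as
 * "ECC <corrected>/<uncorrected>" into buf and returns true when the
 * caller should highlight it, i.e. when the uncorrected count is non-zero. */
static bool format_ecc_field(char *buf, size_t len,
                             unsigned long long corrected,
                             unsigned long long uncorrected) {
    snprintf(buf, len, "ECC %llu/%llu", corrected, uncorrected);
    return uncorrected != 0;
}
```

A caller would draw `buf` with the normal attribute when the function returns false and with the red attribute otherwise.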

The field follows the existing "valid bit" pattern: on GPUs that do not support ECC (consumer cards, older drivers), NVML returns NVML_ERROR_NOT_SUPPORTED, the valid bit stays unset, and the field is simply left blank, so no special-casing is required.
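
A simplified sketch of that valid-bit idea follows. The struct, bit names, and functions are stand-ins invented for illustration; nvtop's actual macros and field layout in extract_gpuinfo_common.h differ. A status of 0 is assumed to mean success and any other value a failure such as "not supported".

```c
#include <stdbool.h>

/* Hypothetical valid bits, one per ECC counter. */
#define ECC_CORRECTED_VALID   0x1u
#define ECC_UNCORRECTED_VALID 0x2u

/* Hypothetical, cut-down version of a dynamic-info record. */
struct dyn_info {
    unsigned long long ecc_corrected;
    unsigned long long ecc_uncorrected;
    unsigned valid; /* bitmask of the *_VALID flags above */
};

/* Store each counter only if its query succeeded (status == 0);
 * otherwise leave the matching valid bit unset. */
static void record_ecc(struct dyn_info *info,
                       int corrected_status, unsigned long long corrected,
                       int uncorrected_status, unsigned long long uncorrected) {
    info->valid &= ~(ECC_CORRECTED_VALID | ECC_UNCORRECTED_VALID);
    if (corrected_status == 0) {
        info->ecc_corrected = corrected;
        info->valid |= ECC_CORRECTED_VALID;
    }
    if (uncorrected_status == 0) {
        info->ecc_uncorrected = uncorrected;
        info->valid |= ECC_UNCORRECTED_VALID;
    }
}

/* The field is drawn only when both counters are valid. */
static bool ecc_visible(const struct dyn_info *info) {
    unsigned both = ECC_CORRECTED_VALID | ECC_UNCORRECTED_VALID;
    return (info->valid & both) == both;
}
```

With this shape the drawing code never needs to know why a query failed; it only checks the bits.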

Changes

  • extract_gpuinfo_common.h: add ecc_corrected / ecc_uncorrected dynamic fields and their valid bits
  • extract_gpuinfo_nvidia.c: load nvmlDeviceGetTotalEccErrors (optional symbol, so older NVML libs still work) and query volatile ECC counts
  • interface.c / interface_internal_common.h: add the ecc_info window and draw it after the power field
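
As a rough sketch of the optional-symbol query described in the second item: the function pointer is assumed to have been resolved at load time (for example with `dlsym`) and to be NULL when the installed NVML library does not export nvmlDeviceGetTotalEccErrors. The type stand-ins, `struct ecc_counts`, `query_volatile_ecc`, and `fake_get_total_ecc` are all hypothetical and exist only so the example compiles without nvml.h; the enumerator values are assumed to match that header.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for NVML types and constants (assumed to mirror nvml.h). */
typedef struct nvmlDevice_st *nvmlDevice_t;
typedef int nvmlReturn_t;
enum { NVML_SUCCESS = 0 };
enum { NVML_MEMORY_ERROR_TYPE_CORRECTED = 0,
       NVML_MEMORY_ERROR_TYPE_UNCORRECTED = 1 };
enum { NVML_VOLATILE_ECC = 0 };

/* Pointer type matching the shape of nvmlDeviceGetTotalEccErrors. */
typedef nvmlReturn_t (*get_total_ecc_fn)(nvmlDevice_t dev, int error_type,
                                         int counter_type,
                                         unsigned long long *count);

struct ecc_counts {
    unsigned long long corrected, uncorrected;
    bool corrected_valid, uncorrected_valid;
};

/* Query both volatile counters; a NULL fn (symbol missing from an older
 * library) leaves both valid flags false. */
static struct ecc_counts query_volatile_ecc(get_total_ecc_fn fn,
                                            nvmlDevice_t dev) {
    struct ecc_counts out = {0};
    if (fn == NULL)
        return out;
    out.corrected_valid =
        fn(dev, NVML_MEMORY_ERROR_TYPE_CORRECTED, NVML_VOLATILE_ECC,
           &out.corrected) == NVML_SUCCESS;
    out.uncorrected_valid =
        fn(dev, NVML_MEMORY_ERROR_TYPE_UNCORRECTED, NVML_VOLATILE_ECC,
           &out.uncorrected) == NVML_SUCCESS;
    return out;
}

/* Demonstration stub standing in for the real NVML entry point. */
static nvmlReturn_t fake_get_total_ecc(nvmlDevice_t dev, int error_type,
                                       int counter_type,
                                       unsigned long long *count) {
    (void)dev;
    (void)counter_type;
    *count = (error_type == NVML_MEMORY_ERROR_TYPE_CORRECTED) ? 2 : 0;
    return NVML_SUCCESS;
}
```

Passing `fake_get_total_ecc` exercises the success path, while passing NULL models an older library without the symbol.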

Notes

  • Volatile counters are used, so the values count errors since the last driver reload or reboot.
  • Tested on a machine with an RTX PRO 4000 Blackwell (ECC enabled, shows ECC 0/0) and an RTX 5090 (no ECC, field correctly hidden).
