
DDP for FineTuning #812

Merged: psinger-prior merged 11 commits into main from psi/finetuning_v1 on Mar 26, 2026
Conversation

@psinger-prior (Contributor) commented Mar 9, 2026

Issue

Closes #809

Motivation and Context


Public API Changes

  • No Public API changes
  • Yes, Public API changes (Details below)

How Has This Been Tested?

Ran example scripts on single- and multi-GPU nodes.


Checklist

  • The changes have been tested locally.
  • Documentation has been updated (if the public API or usage changes).
  • A changelog entry has been added (see changelog/README.md), or "no changelog needed" label requested.
  • The code follows the project's style guidelines.
  • I have considered the impact of these changes on the public API.

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces Distributed Data Parallel (DDP) support for fine-tuning, which is a great enhancement for multi-GPU training. The implementation is thorough, covering distributed sampling, metric synchronization, and efficient model state saving. I've found one area for improvement regarding optimizer initialization in the DDP setup for better consistency and adherence to best practices.
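The review's point about optimizer initialization in a DDP setup usually refers to construction order: wrap the model in `DistributedDataParallel` first, then build the optimizer from the wrapped module's parameters. A minimal sketch of that pattern follows; the helper name `setup_ddp_training` and the `AdamW` choice are illustrative assumptions, not code from this PR.

```python
# Hypothetical sketch (not the PR's actual code): wrap the model in DDP
# first, then create the optimizer from the wrapped module's parameters,
# so the optimizer state on every rank tracks the tensors DDP keeps in sync.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp_training(model: torch.nn.Module, lr: float = 1e-4):
    """Wrap `model` for DDP, then build the optimizer.

    Assumes the process group has already been initialized (e.g. by
    torchrun setting RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT).
    """
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if torch.cuda.is_available():
        device = torch.device("cuda", local_rank)
        ddp_model = DDP(model.to(device), device_ids=[local_rank])
    else:
        # CPU / gloo fallback: no device_ids.
        ddp_model = DDP(model)
    # Created *after* wrapping: ddp_model.parameters() yields the same
    # tensors whose gradients DDP all-reduces, keeping optimizer state
    # consistent across ranks.
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=lr)
    return ddp_model, optimizer
```

Creating the optimizer before moving or wrapping the model can leave it holding references to stale (pre-`.to(device)`) parameter tensors, which is the kind of inconsistency the review flags.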

Comment thread src/tabpfn/finetuning/finetuned_base.py Outdated
@psinger-prior psinger-prior marked this pull request as ready for review March 9, 2026 11:42
@psinger-prior psinger-prior requested a review from a team as a code owner March 9, 2026 11:42
@psinger-prior psinger-prior requested review from alanprior and removed request for a team March 9, 2026 11:42
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. Credits must be used to enable repository-wide code reviews.

@alanprior (Contributor)

@psinger-prior sorry, I've completely missed this PR, probably due to the war. @anuragg1209 would you feel OK to review it?

@anuragg1209 (Contributor)

> @psinger-prior sorry, I've completely missed this PR, probably due to the war. @anuragg1209 would you feel OK to review it?

Hi @alanprior, yes, this PR review is under my To-Dos. Please feel free to unsubscribe.

@anuragg1209 (Contributor)
Hi @psinger-prior,
Thanks for the PR. I tested the DDP on 4 H100s and everything worked correctly. I wrote test scripts for classifier DDP, regressor DDP, and gradient synchronization verification (parameter difference across ranks: 0.00e+00). All the tests passed, and the core implementation looks solid.

I just have a few comments to add.
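The gradient-synchronization check described above (parameter difference across ranks: 0.00e+00) can be sketched roughly as follows. This is an assumed reconstruction of such a test, not the reviewer's actual script; the function name `max_param_divergence` is hypothetical.

```python
# Hypothetical sketch of a DDP synchronization check: after an optimizer
# step, reduce each rank's flattened parameters element-wise with MAX and
# MIN across all ranks. The two reductions agree everywhere iff every
# rank holds identical weights, so the result is 0.0 for a synced model.
import torch
import torch.distributed as dist


def max_param_divergence(module: torch.nn.Module) -> float:
    """Max absolute parameter difference across ranks (0.0 means in sync)."""
    # Assumes all parameters live on a device the backend can reduce over
    # (GPU tensors for NCCL, CPU tensors for gloo).
    flat = torch.cat([p.detach().flatten() for p in module.parameters()])
    hi, lo = flat.clone(), flat.clone()
    dist.all_reduce(hi, op=dist.ReduceOp.MAX)
    dist.all_reduce(lo, op=dist.ReduceOp.MIN)
    return (hi - lo).abs().max().item()
```

Printed with `f"{max_param_divergence(model):.2e}"` after a training step under torchrun, a fully synchronized run would report 0.00e+00, matching the figure quoted above.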

Comment thread src/tabpfn/finetuning/finetuned_base.py
Comment thread src/tabpfn/finetuning/finetuned_base.py Outdated
Comment thread examples/finetune_classifier.py
@anuragg1209 (Contributor) left a comment


All comments addressed! LGTM! Thanks @psinger-prior for the PR!

@anuragg1209 anuragg1209 removed the request for review from alanprior March 26, 2026 09:22
@psinger-prior psinger-prior added this pull request to the merge queue Mar 26, 2026
Merged via the queue into main with commit f22aaf7 Mar 26, 2026
12 checks passed
@psinger-prior psinger-prior mentioned this pull request Apr 2, 2026

Labels

None yet

Projects

None yet

3 participants