-
Notifications
You must be signed in to change notification settings - Fork 635
Implement ciflow/rocm on Torchtitan #2114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Can this label be automatically added? If not, what happens if it is not manually added -- is rocm test skipped? |
If the label is not added manually then the ROCm tests will not run. CUDA tests can still run normally. |
For the context,
So, let's be clear that adding the label automatically is doable, but only in the first case for privilege users. The mechanism to automatically add the label is independent from the change in this PR, so you can do it in a separate PR later. We need:
One note is that when auto label rule is in place, having both the |
.github/pytorch-probot.yml
Outdated
| @@ -0,0 +1,2 @@ | |||
| ciflow_push_tags: | |||
| - ciflow/8gpu | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does ciflow_push_tags do?
why it's called 8gpu?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ciflow_push_tags tells the bot which labels should map to git tags that trigger CI workflows.
It's called 8gpu following Huy's suggestion on slack https://pytorch.slack.com/archives/C0A0E8ARXPZ/p1764843858514179
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems adding ciflow/8gpu label in a PR will automatically create a tag ciflow/8gpu/[PR_number]. I say so because I observed https://github.com/pytorch/torchtitan/releases/tag/ciflow%2F8gpu%2F2125
What is this ciflow_push_tags for? Is it for letting the bot know that the above happened and which tag to look at?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. As per my understanding, when we add PR label, it'll automatically create & push tag. Because the ciflow_push_tags config tells the bot to create & push the matching tags so that tag-based CI workflows can run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shat is the bot doing? Is it doing the auto labeling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The .github/pytorch-probot.yml connects labels to tags. When it sees ciflow/8gpu label in PR, it creates the tag. The automatic labeling is handled by .github/labeler.yml. However, currently I don't see the automatic label getting added when I make changes to PR. @wwwjn Do you know if the totrchtitan repo is setup correctly to put this label automatically or do I need any additional changes?
tianyu-l
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it true that
- for each PR a codeowner needs to explicitly add the
ciflow/8gpulabel, so that rocm test could be run? - if a codeowner doesn't do this, the PR can be merged without rocm test passing?
- after CUDA test is run / finished, if we add the label, we can run rocm test without rerun the CUDA test?
.github/pytorch-probot.yml
Outdated
| @@ -0,0 +1,2 @@ | |||
| ciflow_push_tags: | |||
| - ciflow/8gpu | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems adding ciflow/8gpu label in a PR will automatically create a tag ciflow/8gpu/[PR_number]. I say so because I observed https://github.com/pytorch/torchtitan/releases/tag/ciflow%2F8gpu%2F2125
What is this ciflow_push_tags for? Is it for letting the bot know that the above happened and which tag to look at?
| tags: | ||
| - ciflow/8gpu/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for -- why do we need both pull_request and tags?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per my understanding the PR workflows and tag workflows are totally independent. Tags provides CI flow, meaning tags can be pushed to trigger CI runs on specific commits even after the PR is closed. They can also be used for versioning releases.
pull_request are used to run workflows when the PR is open.
…feature to add label automatically.
@tianyu-l: Here are my responses-
|
wwwjn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @akashveramd , trying to catch up what will happen after this PR:
As a person with write access to torchtitan:
- My PR should be automatically taged with
ciflow/8gpu(I saw we have labeler.yml and a bot, is it helping our doing the auto labeling?) - My PR needs to pass both rocm tests and CUDA tests to get merged, otherwise it will be blocked
As an unauthorized user:
- If my PR didn't get taged with
ciflow/8gpu, I just need to pass CUDA tests to get my PR merged - If my PR get labeled with
ciflow/8gpu, I have to pass both rocm and CUDA tests before get merged
As a repo maintainer:
- We should review and add
ciflow/8gputag for unauthorized user's PR that we think it's safe to run CI tests, and approve CI workflow to run
Is this a correct understanding / summary?
.github/pytorch-probot.yml
Outdated
| @@ -0,0 +1,2 @@ | |||
| ciflow_push_tags: | |||
| - ciflow/8gpu | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shat is the bot doing? Is it doing the auto labeling?
tianyu-l
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work.
I didn't understanding every mechanism here, but I'm OK with landing. @wwwjn please stamp when you are comfortable.
tianyu-l
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's give it a try
In this PR, I implemented ciflow/rocm on Torchtitan. The changes are part of integration_test_8gpu_features.yaml. The workflow still supports running on pull_request (without any PR label) for CUDA. However, along with push to main and cron schedule, with the ciflow/8gpu label added to PR, the workflow runs for both CUDA & ROCm.