Skip to content

Conversation

@akashveramd
Copy link
Collaborator

@akashveramd akashveramd commented Dec 5, 2025

In this PR, I implemented ciflow/rocm on Torchtitan. The changes are part of integration_test_8gpu_features.yaml. The workflow still supports running on pull_request (without any PR label) for CUDA. However, along with push to main and cron schedule, with the ciflow/8gpu label added to PR, the workflow runs for both CUDA & ROCm.

@akashveramd akashveramd self-assigned this Dec 5, 2025
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 5, 2025
@tianyu-l
Copy link
Contributor

tianyu-l commented Dec 5, 2025

with the ciflow/8gpu label added to PR, the workflow runs for both CUDA & ROCm.

Can this label be automatically added? If not, what happens if it is not manually added -- is rocm test skipped?

@akashveramd
Copy link
Collaborator Author

with the ciflow/8gpu label added to PR, the workflow runs for both CUDA & ROCm.

Can this label be automatically added? If not, what happens if it is not manually added -- is rocm test skipped?

If the label is not added manually then the ROCm tests will not run. CUDA tests can still run normally.
@huydhn: Please correct me if I am wrong, but we wanted to add label manually by code owners due to security risk, similar to how it works in PyTorch?

@huydhn
Copy link
Contributor

huydhn commented Dec 6, 2025

If the label is not added manually then the ROCm tests will not run. CUDA tests can still run normally. @huydhn: Please correct me if I am wrong, but we wanted to add label manually by code owners due to security risk, similar to how it works in PyTorch?

For the context, ciflow bot only works if the PR author has write permission to the repo or if the CI has already been approved to run https://github.com/pytorch/test-infra/blob/main/torchci/lib/bot/ciflowPushTrigger.ts#L176-L183. This policy applies to every repos under PyTorch.

  • In the first case, it's perfectly ok to automatically add ciflow label.
  • In the second case, even if you have the mechanism to add ciflow automatically, the bot will remove the label from the PR if the author is an unknown contributor with only read permission. When this happens, the reviewer needs to review the PR, approve CI run, then add the label manually anyway. This is the manual process that I'm referring to.

So, let's be clear that adding the label automatically is doable, but only in the first case for privilege users. The mechanism to automatically add the label is independent from the change in this PR, so you can do it in a separate PR later. We need:

  1. A labeler.yml rule like https://github.com/pytorch/pytorch/blob/main/.github/labeler.yml#L11. This works in most cases, e.g. the PR change a file and ciflow label is added
  2. For more complex logic like using regex, we might need to update autoLabelbot at https://github.com/pytorch/test-infra/blob/main/torchci/lib/bot/autoLabelBot.ts#L26

One note is that when auto label rule is in place, having both the ciflow push trigger and pull_request triggers feel redundant, and the latter can be removed.

@@ -0,0 +1,2 @@
ciflow_push_tags:
- ciflow/8gpu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does ciflow_push_tags do?
why it's called 8gpu?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ciflow_push_tags tells the bot which labels should map to git tags that trigger CI workflows.

It's called 8gpu following Huy's suggestion on slack https://pytorch.slack.com/archives/C0A0E8ARXPZ/p1764843858514179

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems adding ciflow/8gpu label in a PR will automatically create a tag ciflow/8gpu/[PR_number]. I say so because I observed https://github.com/pytorch/torchtitan/releases/tag/ciflow%2F8gpu%2F2125

What is this ciflow_push_tags for? Is it for letting the bot know that the above happened and which tag to look at?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. As per my understanding, when we add PR label, it'll automatically create & push tag. Because the ciflow_push_tags config tells the bot to create & push the matching tags so that tag-based CI workflows can run.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shat is the bot doing? Is it doing the auto labeling?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .github/pytorch-probot.yml connects labels to tags. When it sees ciflow/8gpu label in PR, it creates the tag. The automatic labeling is handled by .github/labeler.yml. However, currently I don't see the automatic label getting added when I make changes to PR. @wwwjn Do you know if the totrchtitan repo is setup correctly to put this label automatically or do I need any additional changes?

Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true that

  • for each PR a codeowner needs to explicitly add the ciflow/8gpu label, so that rocm test could be run?
  • if a codeowner doesn't do this, the PR can be merged without rocm test passing?
  • after CUDA test is run / finished, if we add the label, we can run rocm test without rerun the CUDA test?

@@ -0,0 +1,2 @@
ciflow_push_tags:
- ciflow/8gpu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems adding ciflow/8gpu label in a PR will automatically create a tag ciflow/8gpu/[PR_number]. I say so because I observed https://github.com/pytorch/torchtitan/releases/tag/ciflow%2F8gpu%2F2125

What is this ciflow_push_tags for? Is it for letting the bot know that the above happened and which tag to look at?

Comment on lines +6 to +7
tags:
- ciflow/8gpu/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for -- why do we need both pull_request and tags?

Copy link
Collaborator Author

@akashveramd akashveramd Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my understanding the PR workflows and tag workflows are totally independent. Tags provides CI flow, meaning tags can be pushed to trigger CI runs on specific commits even after the PR is closed. They can also be used for versioning releases.
pull_request are used to run workflows when the PR is open.

@akashveramd
Copy link
Collaborator Author

akashveramd commented Dec 12, 2025

Is it true that

  • for each PR a codeowner needs to explicitly add the ciflow/8gpu label, so that rocm test could be run?
  • if a codeowner doesn't do this, the PR can be merged without rocm test passing?
  • after CUDA test is run / finished, if we add the label, we can run rocm test without rerun the CUDA test?

@tianyu-l: Here are my responses-

for each PR a codeowner needs to explicitly add the ciflow/8gpu label, so that rocm test could be run?

  • No, for the trusted authors who has write permissions the ciflow/8gpu label will be added automatically which will run ROCm tests. This is WIP. For unknown authors with only read permission, the reviewer needs to review the PR, approve CI run, then add the label manually. Which then will run the ROCm tests. However, on cron schedule and when push to main the ROCm tests will run irrespective of PR label.

if a codeowner doesn't do this, the PR can be merged without rocm test passing?

  • Yes, for unknown authors. However, isn't it the same for PyTorch, i.e. ROCm workflow needs PR labels to run?
    cc: @huydhn

after CUDA test is run / finished, if we add the label, we can run rocm test without rerun the CUDA test?

  • Yes, CUDA tests won't re-run when the label is added. But something I have noticed is when I manually add the label after the CUDA workflow has completed running, the CUDA workflow will disappear from the checks. I believe it's happening because CUDA & ROCm shares the same workflow. I'll have to see how it works when automatic label is in place.

Copy link
Contributor

@wwwjn wwwjn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @akashveramd , trying to catch up what will happen after this PR:

As a person with write access to torchtitan:

  • My PR should be automatically taged with ciflow/8gpu (I saw we have labeler.yml and a bot, is it helping our doing the auto labeling?)
  • My PR needs to pass both rocm tests and CUDA tests to get merged, otherwise it will be blocked

As an unauthorized user:

  • If my PR didn't get taged with ciflow/8gpu, I just need to pass CUDA tests to get my PR merged
  • If my PR get labeled with ciflow/8gpu, I have to pass both rocm and CUDA tests before get merged

As a repo maintainer:

  • We should review and add ciflow/8gpu tag for unauthorized user's PR that we think it's safe to run CI tests, and approve CI workflow to run

Is this a correct understanding / summary?

@@ -0,0 +1,2 @@
ciflow_push_tags:
- ciflow/8gpu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shat is the bot doing? Is it doing the auto labeling?

Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work.

I didn't understanding every mechanism here, but I'm OK with landing. @wwwjn please stamp when you are comfortable.

Copy link
Contributor

@tianyu-l tianyu-l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's give it a try

@tianyu-l tianyu-l merged commit 9bc50ea into main Dec 13, 2025
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/rocm-mi300 ciflow/8gpu CLA Signed This label is managed by the Meta Open Source bot. module: rocm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants