Conversation


@TXacs TXacs commented Dec 5, 2025

Auto-Partition in torchtitan

Overview

This PR provides an automatic partitioning method that considers the computation cost of embedding layers.
This method calculates the floating-point operations (FLOPs) of the embedding layers and constructs an array containing the FLOPs of both the transformer layers and the embedding layers. A heuristic algorithm then uses this array to find a balanced pipeline partition.
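For a rough sense of the per-layer costs going into that array, here is a hedged sketch with closed-form estimates (the formulas and function names are my assumptions for illustration; the PR itself measures FLOPs with a profiler rather than computing them analytically):

```python
def embedding_flops(batch, seq_len, hidden, vocab):
    # The input embedding is a table lookup, so its FLOPs are negligible;
    # the dominant embedding-related cost is the output projection (logits),
    # one matmul of shape (batch * seq_len, hidden) x (hidden, vocab).
    return 2 * batch * seq_len * hidden * vocab

def transformer_layer_flops(batch, seq_len, hidden):
    # Rough dense-layer estimate: attention projections plus the MLP come to
    # roughly 12 * hidden^2 multiply-adds per token (ignoring attention scores).
    return 2 * batch * seq_len * 12 * hidden * hidden
```

With estimates like these, the embedding/output layers can dominate a stage when the vocabulary is large, which is why treating them as zero-cost leads to unbalanced default splits.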

Solution Architecture

  1. Dynamic Cost Analysis
  2. Adaptive Partitioning Algorithm
  3. Workload Balancing
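Step 2 could be sketched roughly as follows: a minimal Python illustration of balanced contiguous partitioning via binary search on the bottleneck (maximum per-stage) cost. The function name and the exact algorithm are assumptions for illustration, not the PR's autopipe.cpp implementation:

```python
def pipeline_partition(costs, num_stages):
    """Split `costs` into at most `num_stages` contiguous stages,
    minimizing the maximum per-stage cost (the pipeline bottleneck)."""
    def stages_needed(limit):
        # Greedily pack consecutive layers into stages of cost <= limit.
        count, current = 1, 0
        for c in costs:
            if current + c > limit:
                count += 1
                current = c
            else:
                current += c
        return count

    # Binary search the smallest feasible bottleneck cost.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid + 1

    # Reconstruct the stage boundaries for the optimal bottleneck `lo`.
    parts, current = [[]], 0
    for c in costs:
        if current + c > lo and parts[-1]:
            parts.append([])
            current = 0
        parts[-1].append(c)
        current += c
    return parts
```

For example, with per-layer costs `[10, 1, 1, 10]` and 2 stages, a naive even split by layer count gives stages of cost 11 and 11 here too, but with costs `[10, 10, 1, 1]` it would give 20 vs 2, while the bottleneck-minimizing split keeps stages balanced.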

Performance

Hardware configuration: 4x RTX 3090 24GB; pipeline-parallel degree is 4.

llama3 configuration comparison

| hidden size | layers | autopipe TPS | default TPS | Speedup |
|-------------|--------|--------------|-------------|---------|
| dim=256     | 6      | 31,094       | 29,549      | +5.2%   |
| dim=256     | 12     | 21,803       | 21,923      | -0.5%   |
| dim=2048    | 12     | 3,348        | 2,616       | +28.0%  |
| dim=4096    | 12     | 981          | 761         | +28.9%  |

deepseekv3 (without MoE) configuration comparison

| hidden size | layers | autopipe TPS | default TPS | Speedup |
|-------------|--------|--------------|-------------|---------|
| dim=256     | 6      | 13,373       | 13,059      | +2.4%   |
| dim=256     | 12     | 7,714        | 6,859       | +12.5%  |
| dim=2048    | 12     | 4,331        | 3,810       | +13.7%  |
| dim=4096    | 12     | 2,888        | 2,561       | +12.8%  |
| dim=4096    | 16     | 2,207        | 2,008       | +9.9%   |
| dim=8192    | 16     | 4,331        | 3,935       | +10.1%  |

1. Improve pipeline performance
2. Auto partition modules

meta-cla bot commented Dec 5, 2025

Hi @TXacs!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Contributor

@tianyu-l tianyu-l left a comment


Thanks. Is it true that the only "real" deltas are

  • autopipe.cpp
  • pipeline_parallel.py
  • profiler.py

Contributor


This looks interesting -- how much benefit do you get from the C++ implementation, compared with a Python one?

Author

TXacs commented Dec 5, 2025

Thanks. Is it true that the only "real" deltas are

  • autopipe.cpp
  • pipeline_parallel.py
  • profiler.py

Yes. Actually, profiler.py also reuses the file from DeepSpeed. It would be even better if TorchTitan could provide a more authoritative FLOPs calculation method in the future, so that we could also adapt it for MoE models.

@tianyu-l tianyu-l requested a review from H-Huang December 5, 2025 01:59

meta-cla bot commented Dec 5, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 5, 2025
@tianyu-l tianyu-l added the enhancement New feature or request label Dec 9, 2025

parts = pipeline(
    mflops_list,
    [i * 3 for i in mflops_list],  # Assume backward is 3x forward
Member


why is it assumed to be 3x?


@McmillanTAC McmillanTAC Dec 12, 2025


Following the common convention for computation time, where the backward pass takes roughly twice as long as the forward pass, we assume by default that the backward pass also requires twice the FLOPs of the forward pass. The cost model additionally accounts for activation recomputation, which inserts an extra forward pass before the backward pass. Consequently, we set the default backward FLOPs to three times the forward FLOPs.
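The rule above amounts to a small cost model. The sketch below is my restatement of the comment for illustration; `layer_costs` is a hypothetical helper, not code from the PR:

```python
def layer_costs(forward_flops, recompute=True):
    """Per-layer (forward, backward) FLOPs under the convention that the
    backward pass costs ~2x the forward pass; activation recomputation
    re-runs the forward pass once more before backward, bringing the
    backward-pass cost to ~3x forward."""
    backward = 2 * forward_flops      # standard 2x backward convention
    if recompute:
        backward += forward_flops     # extra forward pass for recomputation
    return forward_flops, backward
```

This also explains the `[i * 3 for i in mflops_list]` expression in the snippet under discussion: the backward-cost list is simply three times the forward-cost list when recomputation is on.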

# Profile each layer's FLOPs
mflops_list = []
for layer in model:
    prof = FlopsProfiler(layer)
Member


I guess the FlopsProfiler does not estimate the backward flops?


You are correct: the FlopsProfiler does not estimate backward FLOPs. Instead, we use a heuristic rule by default: since recomputation is used during the backward pass, backward FLOPs are set to three times the forward FLOPs.

Contributor


curious why three times? IIUC the convention was two times.
