Conversation


@TXacs TXacs commented Dec 5, 2025

Auto-Partition in torchtitan

Overview

This PR provides an automatic partitioning method that considers the computation cost of embedding layers.
This method calculates the floating-point operations (FLOPs) of the embedding layers and constructs an array containing the FLOPs of both the transformer layers and the embedding layers. A heuristic algorithm then uses this array to find a balanced pipeline partition.
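For a rough sense of the per-layer costs going into that array, here is a hedged sketch with closed-form estimates (the formulas and function names are my assumptions for illustration; the PR itself measures FLOPs with a profiler rather than computing them analytically):

```python
def embedding_flops(batch, seq_len, hidden, vocab):
    # The input embedding is a table lookup, so its FLOPs are negligible;
    # the dominant embedding-related cost is the output projection (logits),
    # one matmul of shape (batch * seq_len, hidden) x (hidden, vocab).
    return 2 * batch * seq_len * hidden * vocab

def transformer_layer_flops(batch, seq_len, hidden):
    # Rough dense-layer estimate: attention projections plus the MLP come to
    # roughly 12 * hidden^2 multiply-adds per token (ignoring attention scores).
    return 2 * batch * seq_len * 12 * hidden * hidden
```

With estimates like these, the embedding/output layers can dominate a stage when the vocabulary is large, which is why treating them as zero-cost leads to unbalanced default splits.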

Solution Architecture

  1. Dynamic Cost Analysis
  2. Adaptive Partitioning Algorithm
  3. Workload Balancing
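Step 2 could be sketched roughly as follows: a minimal Python illustration of balanced contiguous partitioning via binary search on the bottleneck (maximum per-stage) cost. The function name and the exact algorithm are assumptions for illustration, not the PR's autopipe.cpp implementation:

```python
def pipeline_partition(costs, num_stages):
    """Split `costs` into at most `num_stages` contiguous stages,
    minimizing the maximum per-stage cost (the pipeline bottleneck)."""
    def stages_needed(limit):
        # Greedily pack consecutive layers into stages of cost <= limit.
        count, current = 1, 0
        for c in costs:
            if current + c > limit:
                count += 1
                current = c
            else:
                current += c
        return count

    # Binary search the smallest feasible bottleneck cost.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid + 1

    # Reconstruct the stage boundaries for the optimal bottleneck `lo`.
    parts, current = [[]], 0
    for c in costs:
        if current + c > lo and parts[-1]:
            parts.append([])
            current = 0
        parts[-1].append(c)
        current += c
    return parts
```

For example, with per-layer costs `[10, 1, 1, 10]` and 2 stages, a naive even split by layer count gives stages of cost 11 and 11 here too, but with costs `[10, 10, 1, 1]` it would give 20 vs 2, while the bottleneck-minimizing split keeps stages balanced.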

Performance

Hardware configuration: 4x RTX 3090 24GB; pipeline-parallel degree is 4.

llama3 configuration comparison

| hidden size | layers | autopipe TPS | default TPS | Speedup |
|-------------|--------|--------------|-------------|---------|
| dim=256     | 6      | 31,094       | 29,549      | +5.2%   |
| dim=256     | 12     | 21,803       | 21,923      | -0.5%   |
| dim=2048    | 12     | 3,348        | 2,616       | +28.0%  |
| dim=4096    | 12     | 981          | 761         | +28.9%  |

deepseekv3 (without MoE) configuration comparison

| hidden size | layers | autopipe TPS | default TPS | Speedup |
|-------------|--------|--------------|-------------|---------|
| dim=256     | 6      | 13,373       | 13,059      | +2.4%   |
| dim=256     | 12     | 7,714        | 6,859       | +12.5%  |
| dim=2048    | 12     | 4,331        | 3,810       | +13.7%  |
| dim=4096    | 12     | 2,888        | 2,561       | +12.8%  |
| dim=4096    | 16     | 2,207        | 2,008       | +9.9%   |
| dim=8192    | 16     | 4,331        | 3,935       | +10.1%  |

1. Improve pipeline performance
2. Auto partition modules

meta-cla bot commented Dec 5, 2025

Hi @TXacs!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Contributor

@tianyu-l tianyu-l left a comment


Thanks. Is it true that the only "real" deltas are

  • autopipe.cpp
  • pipeline_parallel.py
  • profiler.py

Contributor


This looks interesting -- how much benefit do you get from the C++ implementation, compared with a Python one?

Author

TXacs commented Dec 5, 2025

Thanks. Is it true that the only "real" deltas are

  • autopipe.cpp
  • pipeline_parallel.py
  • profiler.py

Yes. Actually, profiler.py also reuses the file from DeepSpeed. It would be even better if TorchTitan could provide a more authoritative FLOPs calculation method in the future, so that we could also adapt it for MoE models.

@tianyu-l tianyu-l requested a review from H-Huang December 5, 2025 01:59

meta-cla bot commented Dec 5, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 5, 2025
@tianyu-l tianyu-l added the enhancement New feature or request label Dec 9, 2025

parts = pipeline(
    mflops_list,
    [i * 3 for i in mflops_list],  # Assume backward is 3x forward
Member


why is it assumed to be 3x?


@McmillanTAC McmillanTAC Dec 12, 2025


Following the common convention for computation time, where the backward pass takes roughly twice as long as the forward pass, we assume by default that the backward pass also requires twice the FLOPs of the forward pass. The cost model additionally accounts for activation recomputation, which inserts an extra forward pass before the backward pass. Consequently, we set the default backward FLOPs to three times the forward FLOPs.
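The rule above amounts to a small cost model. The sketch below is my restatement of the comment for illustration; `layer_costs` is a hypothetical helper, not code from the PR:

```python
def layer_costs(forward_flops, recompute=True):
    """Per-layer (forward, backward) FLOPs under the convention that the
    backward pass costs ~2x the forward pass; activation recomputation
    re-runs the forward pass once more before backward, bringing the
    backward-pass cost to ~3x forward."""
    backward = 2 * forward_flops      # standard 2x backward convention
    if recompute:
        backward += forward_flops     # extra forward pass for recomputation
    return forward_flops, backward
```

This also explains the `[i * 3 for i in mflops_list]` expression in the snippet under discussion: the backward-cost list is simply three times the forward-cost list when recomputation is on.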

# Profile each layer's FLOPs
mflops_list = []
for layer in model:
    prof = FlopsProfiler(layer)
Member


I guess the FlopsProfiler does not estimate the backward flops?


You are correct: the FlopsProfiler does not estimate backward FLOPs. Instead, we use a heuristic rule by default: since recomputation is used during the backward pass, backward FLOPs are set to three times the forward FLOPs.

Contributor


curious why three times? IIUC the convention was two times.
