THROTL_GRP_QUANTUM
TL;DR: The second while loop (https://elixir.bootlin.com/linux/v6.14/source/block/blk-throttle.c#L960) is compiled with loop-unrolling optimizations applied, so small changes to its bound constant can reshape the generated code.
Experiments: directly mutating VALUE (written as the placeholder FINDME in the template file) in the condition below:
if (nr_writes >= VALUE)
        break;
for value in `seq 1 10`; do
    sed "s|if (nr_writes >= FINDME)|if (nr_writes >= $value)|" \
        < block/blk-throttle.template.c \
        > block/blk-throttle.c
    make block/blk-throttle.o
    objdump -d block/blk-throttle.o > disas/origin.$value.txt
done
Running x=N; diff origin.$x.txt origin.$((x+1)).txt
Only the 4→5 pair produces a reasonably small change; all other adjacent pairs differ drastically.
4083c4083
< 3648: 41 83 fd 04 cmp $0x4,%r13d
---
> 3648: 41 83 fd 05 cmp $0x5,%r13d
Adding __attribute__((optimize("O0"))) to the throtl_dispatch_tg function
1 vs. 2:
73,74c73,74
< c3: 83 7d e4 00 cmpl $0x0,-0x1c(%rbp)
< c7: 75 39 jne 102 <throtl_dispatch_tg+0xf2>
---
> c3: 83 7d e4 01 cmpl $0x1,-0x1c(%rbp)
> c7: 77 39 ja 102 <throtl_dispatch_tg+0xf2>
All other adjacent pairs look like:
73c73
< c3: 83 7d e4 01 cmpl $0x1,-0x1c(%rbp)
---
> c3: 83 7d e4 02 cmpl $0x2,-0x1c(%rbp)
With -O0, the code structure is retained across values; only the immediate operand changes.
Adding __attribute__((optimize("O2","-fno-unroll-loops","-fno-peel-loops","-fno-unswitch-loops")))
$ x=4 ; diff no-unroll.$x.txt no-unroll.$((x+1)).txt
4011c4011
< 350e: 41 83 fc 04 cmp $0x4,%r12d
---
> 350e: 41 83 fc 05 cmp $0x5,%r12d
$ x=5 ; diff no-unroll.$x.txt no-unroll.$((x+1)).txt
4011c4011
< 350e: 41 83 fc 05 cmp $0x5,%r12d
---
> 350e: 41 83 fc 06 cmp $0x6,%r12d
$ x=7 ; diff no-unroll.$x.txt no-unroll.$((x+1)).txt
4011c4011
< 3512: 41 83 fc 07 cmp $0x7,%r12d
---
> 3512: 41 83 fc 08 cmp $0x8,%r12d
$ x=8 ; diff no-unroll.$x.txt no-unroll.$((x+1)).txt
4011c4011
< 3512: 41 83 fc 08 cmp $0x8,%r12d
---
> 3512: 41 83 fc 09 cmp $0x9,%r12d
$ x=9 ; diff no-unroll.$x.txt no-unroll.$((x+1)).txt
4011c4011
< 3512: 41 83 fc 09 cmp $0x9,%r12d
---
> 3512: 41 83 fc 0a cmp $0xa,%r12d
All other adjacent pairs exhibit huge differences, e.g. from 6 to 7 the cmp moves to a different address (350e → 3512).
$ for x in `seq 1 9`; do echo "x=$x"; diff no-unroll.$x.txt no-unroll.$((x+1)).txt | wc -l; done
x=1
3349
x=2
3322
x=3
89
x=4
4
x=5
4
x=6
99
x=7
4
x=8
4
x=9
4
Disabling more optimizations makes things better, but still not as stable as -O0:
__attribute__((optimize(
"O2",
"-fno-unroll-loops",
"-fno-peel-loops",
"-fno-unswitch-loops",
"-fno-reorder-blocks",
"-fno-guess-branch-probability",
"-fno-reorder-functions"
)))
$ for x in `seq 1 9`; do echo "x=$x"; diff no-reorder.$x.txt no-reorder.$((x+1)).txt | wc -l; done
x=1
3304
x=2
3344
x=3
4
x=4
4
x=5
4
x=6
4
x=7
4
x=8
4
x=9
4
Dumping the optimization record:
gcc -Wp,-MMD,block/.blk-throttle.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -I./ubuntu/include -include ./include/linux/compiler_types.h -D__KERNEL__ -std=gnu11 -fshort-wchar -funsigned-char -fno-common -fno-PIE -fno-strict-aliasing -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables -mindirect-branch=thunk-extern -mindirect-branch-register -mindirect-branch-cs-prefix -mfunction-return=thunk-extern -fno-jump-tables -mharden-sls=all -fpatchable-function-entry=16,16 -fno-delete-null-pointer-checks -O2 -fno-allow-store-data-races -fstack-protector-strong -fno-omit-frame-pointer -fno-optimize-sibling-calls -ftrivial-auto-var-init=zero -fno-stack-clash-protection -fzero-call-used-regs=used-gpr -pg -mrecord-mcount -mfentry -DCC_USING_FENTRY -fmin-function-alignment=16 -fstrict-flex-arrays=3 -fno-strict-overflow -fno-stack-check -fconserve-stack -Wall -Wundef -Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type -Werror=strict-prototypes -Wno-format-security -Wno-trigraphs -Wno-frame-address -Wno-address-of-packed-member -Wmissing-declarations -Wmissing-prototypes -Wframe-larger-than=1024 -Wno-main -Wno-dangling-pointer -Wvla -Wno-pointer-sign -Wcast-function-type -Wno-stringop-overflow -Wno-array-bounds -Wno-alloc-size-larger-than -Wimplicit-fallthrough=5 -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -Wenum-conversion -Wextra -Wunused -Wno-unused-but-set-variable -Wno-unused-const-variable -Wno-packed-not-aligned -Wno-format-overflow -Wno-format-truncation 
-Wno-stringop-truncation -Wno-override-init -Wno-missing-field-initializers -Wno-type-limits -Wno-shift-negative-value -Wno-maybe-uninitialized -Wno-sign-compare -Wno-unused-parameter -g -gdwarf-5 -fsanitize=bounds-strict -fsanitize=shift -fsanitize=bool -fsanitize=enum -fsanitize=signed-integer-overflow -DKBUILD_MODFILE='"block/blk-throttle"' -DKBUILD_BASENAME='"blk_throttle"' -DKBUILD_MODNAME='"blk_throttle"' -D__KBUILD_MODNAME=kmod_blk_throttle \
-fopt-info-all-optall \
-fsave-optimization-record \
-c -o block/blk-throttle.o block/blk-throttle.c
block/blk-throttle.c:954:56: note: considering unrolling loop 4 at BB 39
considering unrolling loop with constant number of iterations
considering unrolling loop with runtime-computable number of iterations
which confirms the theory again.
Feasibility of #16
See chat history https://chatgpt.com/share/68b53b44-6f98-8013-9258-dff728a75001
Especially the last question --
- Forget about compilers for now. If I want to achieve something semantically equivalent to "changing a constant in source from A to B", is it always possible to do so by only modifying assembly operands (i.e. not changing assembly code structure or instruction boundaries), even via heroic manual binary rewrite? If not, give counter examples
ChatGPT answer
3) Can you *always* emulate “A→B” by **only** changing existing operands (no instruction-length/control-flow/layout changes)?
No. Many real cases require changing instruction forms, duplicating/removing blocks, or editing data sections. Counterexamples:
- Immediate size overflows
Original fits imm8, new needs imm32:
cmp $127,%eax (3 bytes) → cmp $300,%eax needs the imm32 form (6 bytes). You can’t keep instruction boundaries unchanged.
- Algorithm selection changes
Division/mod by constant:
x/8 → sar $3,%eax (1 instruction).
x/3 → sequence with a 64-bit multiply by magic + shifts/adds.
Going to or from a power-of-two divisor crosses between one instruction and a sequence. No operand-only patch can do that.
- Loop unrolling/peeling/if-conversion driven by the constant
A bound moving from 2→3 may flip heuristic thresholds: the compiler might (un)roll or (un)peel. The number of replicated basic blocks changes; you would have to add or remove blocks, not just immediates.
- Switch lowering & jump-table sizing
Changing case 0..A to case 0..B generally requires resizing the jump table, adjusting the range check and bias, and possibly emitting more case bodies. Simply bumping a compare immediate will index past the last table entry — wrong semantics.
- Compile-time only constants
Anything in #if / static_assert / template parameters / if constexpr affects what code exists. The “other side” of the branch was never generated; you can’t conjure it by operand edits.
- Layout-affecting constants
Array/struct sizes or alignment constants change field offsets and stack frame layout throughout a function (and sometimes across TUs). Achieving semantic equivalence would require editing dozens of displacements, and may still miss ABI effects. Some changes also alter section sizes → cascading address shifts; you said “no structure change.”
- Call vs inline cutoff
A size threshold might flip an inline into a library call (or vice versa). Operand-only edits can’t insert/remove a call sequence.
- Architectural encoding limits
Some instructions simply lack an immediate form for the needed width (e.g., cmp r/m64, imm sign-extends imm32; if you needed a distinct 64-bit constant outside the sign-extendable range, there is no encoding to keep length identical).
- Constant folded away
If A makes a branch statically dead (DCE removes the whole block) and B would keep it alive, the “missing” code isn’t present to re-enable with an operand tweak.