I'm trying to reproduce the results of the FPTQuant paper, but the results from my own end-to-end training are significantly worse.
My questions are:
- Is there an official or reference implementation (or AIMET configuration) of FPTQuant's end-to-end training (not only the local optimization) that reproduces the results reported in the paper?
- AIMET's MultiHeadValueTransformOp is not explicitly constrained to be invertible, even though the transform should be. Is a properly configured end-to-end training expected to never produce a singular transform, or can specific models or hyperparameters cause one? (In my end-to-end runs, the MultiHeadValue transform does become non-invertible during training with the current implementation.)
- Could the degradation and non-invertible transform I’m seeing be due to applying this method to a smaller model (e.g., LLaMA 160M), or should the approach still work in this regime?
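For reference, here is a minimal sketch of the diagnostic I use to detect when the transform drifts toward singularity: logging the per-head condition number of each transform matrix during training. The function name and the `(num_heads, d, d)` layout are my own assumptions, not AIMET's API; inside a PyTorch training loop, `torch.linalg.cond` would be the analogous call.

```python
import numpy as np

def head_condition_numbers(transform: np.ndarray, threshold: float = 1e6) -> np.ndarray:
    """Return the 2-norm condition number of each head's transform matrix.

    `transform` is assumed to be shaped (num_heads, d, d), one square matrix
    per attention head (a hypothetical layout; adapt to the actual
    MultiHeadValueTransformOp parameterization).
    """
    # np.linalg.cond broadcasts over the leading axis: one value per head,
    # computed as sigma_max / sigma_min. Very large values mean the matrix
    # is numerically close to singular.
    conds = np.linalg.cond(transform)
    for head, c in enumerate(conds):
        if c > threshold:
            print(f"head {head}: cond = {c:.3e} -- numerically near-singular")
    return conds
```

Calling this every N training steps makes it easy to see whether singularity appears gradually (ill-conditioning that grows over training) or abruptly (e.g., after a learning-rate spike).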
Any guidance on these points would be very helpful for understanding whether this is a usage issue or a limitation in the current implementation.