Hello,
First of all, great work on this project! I have a couple of questions regarding the implementation details of Reinforced Fine-Tuning (ReFT):
- Code Availability: Are there any plans to release the code for ReFT? It would be incredibly helpful for reproducibility and further research.
- Implementation Details: Could you clarify how $\log p_{\theta}(x \mid a)$ is calculated in your approach? Specifically, is the methodology similar to the implementation in DDPO?
Thank you for your time and contributions! Looking forward to your response.