I need to improve the device efficiency of the code in this repo so that it runs on a single GPU, ideally a fairly limited one (e.g., a card with 16 GB of VRAM), but starting with a single A6000. In addition, motion augmentation needs to become significantly faster.