--num-gpus is implemented by sharding each expert layer across GPUs, i.e. expert parallelism (EP).
This is probably not advisable for local experimentation, especially at batch size 1 -- there, EP only adds communication overhead, with no speed benefit over naive model/pipeline parallelism.
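To make the overhead concrete, here is a minimal sketch (not the repo's actual implementation; all names are illustrative) of how EP shards experts across devices: each token routed to an expert owned by another device must be shipped there, and at batch size 1 that transfer is pure overhead.

```python
import numpy as np

# Hypothetical EP sketch: shard the experts of one MoE layer across
# "GPUs" (here just indices) and count cross-device token transfers.
num_experts, num_gpus, d_model = 8, 2, 4
rng = np.random.default_rng(0)

experts_per_gpu = num_experts // num_gpus
def owner(e):
    """Device that holds expert e under contiguous sharding."""
    return e // experts_per_gpu

# One weight matrix per expert, stored only on its owning device.
weights = {e: rng.standard_normal((d_model, d_model)) for e in range(num_experts)}

def moe_layer(tokens, chosen_experts, src_gpu=0):
    """Apply each token's chosen expert; count tokens that must be
    sent to another device (the all-to-all cost of EP)."""
    transfers = 0
    out = np.empty_like(tokens)
    for i, (tok, e) in enumerate(zip(tokens, chosen_experts)):
        if owner(e) != src_gpu:
            transfers += 1  # token shipped to the expert's device and back
        out[i] = weights[e] @ tok
    return out, transfers

tokens = rng.standard_normal((1, d_model))        # batch size 1
out, transfers = moe_layer(tokens, chosen_experts=[5])
# Expert 5 lives on GPU 1, the token on GPU 0 -> one round trip,
# and no other token's compute overlaps with it.
```

With a large batch, many tokens share each exchange and each device computes its experts in parallel; with a single token, the exchange happens but nothing is parallelized, which is the point made above.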