DeepSpeed-MoE
TODO
Parallelism in DeepSpeed-MoE
TODO: figure out which types of parallelism DeepSpeed-MoE uses: data, model, pipeline, or tensor parallelism?
Synchronization/communication collectives
TODO: figure out which communication collectives are used (e.g. all-to-all, all-reduce, all-gather) and which execution model applies (SPMD, MPMD, gang scheduling, etc.).
Pretrained MoE model
TODO: figure out whether we can reuse the Fairseq pre-trained MoE model or whether we need to obtain another one.
Benchmarking results
TODO