DeepSpeed-MoE

TODO

Parallelism in DeepSpeed-MoE

TODO: figure out which types of parallelism DeepSpeed-MoE uses: data, model, pipeline, or tensor parallelism?

Synchronization/communication collectives

TODO: figure out which communication collectives are used (e.g., all-reduce, all-to-all) and which execution model applies (e.g., SPMD, MPMD, gang scheduling).

Pretrained MoE model

TODO: figure out whether we can reuse the Fairseq pretrained MoE model or need to obtain another one.

Benchmarking results

TODO