cuBLASLt: Grouped GEMM

// For Batched/Grouped: strides define the step to the next matrix in the group
int64_t strideA = M * K;
int64_t strideB = K * N;
int64_t strideC = M * N;
int batchCount = 100; // Number of GEMMs in the group
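To make the role of the strides concrete, here is a minimal host-side sketch of how they locate matrix `i` inside each packed buffer. The `BatchOffsets` struct and the `stridedOffsets` helper are hypothetical names used only for illustration; they are not part of cuBLASLt, and the sketch assumes every matrix in the batch is stored contiguously with the same M, N, and K.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type (not a cuBLASLt type): element offsets of one
// GEMM's operands inside the packed A, B, and C buffers.
struct BatchOffsets {
    int64_t a;
    int64_t b;
    int64_t c;
};

// Assumes all GEMMs in the batch share the same M, N, K and that each
// matrix is stored contiguously, so the stride equals the matrix size.
BatchOffsets stridedOffsets(int64_t M, int64_t N, int64_t K,
                            int i, int batchCount) {
    assert(i >= 0 && i < batchCount);
    const int64_t strideA = M * K;  // elements in one A matrix
    const int64_t strideB = K * N;  // elements in one B matrix
    const int64_t strideC = M * N;  // elements in one C matrix
    return {i * strideA, i * strideB, i * strideC};
}
```

Because the offset of matrix `i` is just `i` times a constant, a single stride per operand can describe the whole batch, but only when every problem has identical dimensions.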

If your group consists of matrices with different $M, N, K$ dimensions (common in MoE), standard Batched GEMM (Strided) won't work, because a single fixed stride cannot describe matrices of varying size.
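The difference shows up as soon as you compute where each problem's data starts. The sketch below is plain host-side C++ with hypothetical names (`GemmShape`, `GroupOffsets`, `packedOffsets`), none of which are cuBLASLt APIs. It assumes the matrices of a group are packed back to back, and shows that with varying shapes the offsets become a running sum rather than a multiple of one stride.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one GEMM in the group: C (m x n) = A (m x k) * B (k x n).
struct GemmShape {
    int64_t m;
    int64_t n;
    int64_t k;
};

// Hypothetical container: per-problem element offsets into packed A, B, C buffers.
struct GroupOffsets {
    std::vector<int64_t> a;
    std::vector<int64_t> b;
    std::vector<int64_t> c;
};

// Assumes matrices are packed back to back with no padding.
// Offsets are exclusive prefix sums of the individual matrix sizes.
GroupOffsets packedOffsets(const std::vector<GemmShape>& shapes) {
    GroupOffsets off;
    int64_t a = 0, b = 0, c = 0;
    for (const GemmShape& s : shapes) {
        assert(s.m > 0 && s.n > 0 && s.k > 0);
        off.a.push_back(a);
        off.b.push_back(b);
        off.c.push_back(c);
        a += s.m * s.k;  // size of this problem's A
        b += s.k * s.n;  // size of this problem's B
        c += s.m * s.n;  // size of this problem's C
    }
    return off;
}
```

A grouped interface typically consumes this kind of per-problem information (shapes plus pointers or offsets) instead of one stride per operand; how that information is passed depends on the library version you are using.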

#include <cublasLt.h>

State vector updates involve many small, independent matrix multiplications, each with different dimensions based on qubit connectivity.

Grouped GEMM is not a magic bullet. To get the best performance: