// For Batched/Grouped: strides define the step to the next matrix in the group
int64_t strideA = M * K;
int64_t strideB = K * N;
int64_t strideC = M * N;
int batchCount = 100; // Number of GEMMs in the group
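To make the role of the stride concrete, here is a minimal host-side sketch of how the i-th matrix is located in a strided batch. The helper names `batch_matrix` and `batch_elements` are hypothetical and not part of cuBLASLt; the sketch assumes the matrices are packed densely, so the stride equals the element count of one matrix.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pointer to the i-th matrix of a strided batch.
// Matrix i starts exactly i * stride elements after the base pointer.
inline const float* batch_matrix(const float* base, int64_t stride, int i) {
    assert(i >= 0 && stride > 0);
    return base + static_cast<int64_t>(i) * stride;
}

// Hypothetical helper: total number of elements the whole batch occupies,
// assuming matrices are stored back to back with no padding.
inline int64_t batch_elements(int64_t stride, int batchCount) {
    assert(stride > 0 && batchCount >= 0);
    return stride * batchCount;
}
```

With `strideA = M * K`, `batch_matrix(A, strideA, i)` is where the i-th A matrix begins, which is only valid if every matrix in the batch has the same shape.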
If your group consists of matrices with different $M, N, K$ dimensions (common in MoE), standard strided Batched GEMM won't work, because it assumes a single shape and a single fixed stride for every matrix in the batch.
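The following sketch illustrates why a single stride cannot describe a variable-size group. It packs the A matrices back to back and checks whether their start offsets are evenly spaced. `GemmShape`, `packed_offsets_A`, and `has_uniform_stride` are hypothetical names introduced for illustration, not library API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-problem shape: A is M x K, B is K x N, C is M x N.
struct GemmShape { int64_t M, N, K; };

// Start offset (in elements) of each A matrix when packed back to back.
std::vector<int64_t> packed_offsets_A(const std::vector<GemmShape>& group) {
    std::vector<int64_t> offsets;
    offsets.reserve(group.size());
    int64_t cursor = 0;
    for (const GemmShape& g : group) {
        assert(g.M > 0 && g.K > 0);
        offsets.push_back(cursor);
        cursor += g.M * g.K;  // next matrix begins after this one
    }
    return offsets;
}

// True only if one constant stride reproduces every offset.
bool has_uniform_stride(const std::vector<int64_t>& offsets) {
    for (std::size_t i = 2; i < offsets.size(); ++i) {
        if (offsets[i] - offsets[i - 1] != offsets[1] - offsets[0]) return false;
    }
    return true;
}
```

For example, three experts receiving 4, 2, and 7 tokens with `K = 16` give offsets 0, 64, and 96, so no single `strideA` fits. A grouped interface instead takes per-problem dimensions and pointers.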
#include <cublasLt.h>
State vector updates involve many small, independent matrix multiplications, each with different dimensions based on qubit connectivity.
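As a rough picture of what such a workload asks for, the sketch below is a plain CPU reference for a group of independent multiplications, each with its own shape and pointers. It is not the cuBLASLt implementation; `GemmProblem` and `grouped_gemm_reference` are hypothetical names, and row-major storage is an assumption made for readability.

```cpp
#include <cassert>
#include <vector>

// Hypothetical descriptor for one problem in the group (row-major storage).
struct GemmProblem {
    int M, N, K;
    const float* A;  // M x K
    const float* B;  // K x N
    float* C;        // M x N, overwritten with A * B
};

// Reference semantics: every problem is computed independently,
// and each one may have different M, N, K.
void grouped_gemm_reference(const std::vector<GemmProblem>& group) {
    for (const GemmProblem& p : group) {
        assert(p.A && p.B && p.C);
        for (int i = 0; i < p.M; ++i) {
            for (int j = 0; j < p.N; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < p.K; ++k) {
                    acc += p.A[i * p.K + k] * p.B[k * p.N + j];
                }
                p.C[i * p.N + j] = acc;
            }
        }
    }
}
```

A GPU grouped GEMM aims to produce the same results while covering the whole group in one launch instead of one kernel per small problem.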
Grouped GEMM is not a magic bullet. To get the best performance: