cuBLASLt Grouped GEMM
Enter cuBLASLt Grouped GEMM: a modern solution designed to handle the messy, heterogeneous reality of advanced computing.

The Problem with Traditional Batched GEMM

Imagine training a recommendation system with embedding tables of varying sizes, or running inference on a transformer model with variable sequence lengths. In these scenarios, you might have 1,024 independent GEMM operations, each with different M, N, or K dimensions.
Traditional cuBLAS offers batched GEMM (e.g., cublas<t>gemmBatched), which runs a list of independent matrix multiplications. However, it comes with a major limitation: every problem in the batch must share the same dimensions (M, N, K) and data types.

cuBLASLt Grouped GEMM
```cpp
// Allocate and fill matrices...
float alpha = 1.0f, beta = 0.0f;
cublasLtMatmulGrouped(handle, nullptr, matmulDesc,
                      &alpha, &beta,
                      (void**)A_ptrs, (void**)B_ptrs,
                      (void**)C_ptrs, (void**)C_ptrs,
                      groupCount, groupPlans);
```

cuBLASLt Grouped GEMM represents a paradigm shift for batched linear algebra on GPUs. It acknowledges that real-world workloads are irregular, heterogeneous, and dynamic. By moving the complexity of scheduling and fusing into the library, it allows developers to write clean, expressive code that still achieves near-peak hardware performance.