Mirage
Superoptimizing Multi-Latent Attention
Introduction