Mirage


Tutorials

Below is a set of tutorials for superoptimizing DNNs with Mirage.

  • Superoptimizing RMSNorm and Linear
  • Superoptimizing Low-Rank Adaptation
  • Superoptimizing Gated MLP
  • Superoptimizing Group-Query Attention
  • Superoptimizing Attention with QK Normalization
  • Superoptimizing Multi-Latent Attention

© Copyright 2024 Mirage team.
