Making INT8 actually fast: a fused kernel for Ideogram 4 on a 3090

A custom fused INT8 GEMM that turns our Ideogram 4.0 INT8 build from the slowest variant into the fastest on consumer Ampere, and lets 1024px generation run on a single GPU.

Quick summary: Last week we shipped an INT8 build of Ideogram 4.0 that matched FP8 on quality but was, embarrassingly, the slowest of the three variants on a 3090. The reason was that the "INT8" path never actually used the GPU's INT8 hardware. We wrote one fused kernel that fixes that. INT8 goes from slowest to fastest, and a 1024px image now generates on a single RTX 3090.
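
For readers who have not met the term, here is a rough sketch of what a "fused INT8 GEMM" usually computes. It is plain Python for readability and is not our kernel: the symmetric per-tensor scales and the names `quantize` and `int8_gemm_fused` are assumptions made up for this illustration. The idea is that the multiply-accumulate stays in integers (INT32 accumulators on real hardware), and the floating-point rescale is applied once per output element in the same pass.

```python
def quantize(values, scale):
    """Symmetric per-tensor quantization, clamped to the int8 range [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def int8_gemm_fused(a_q, b_q, a_scale, b_scale):
    """Compute (A_q @ B_q) * a_scale * b_scale with integer accumulation.

    a_q is M x K and b_q is K x N, both lists of rows holding int8 values.
    Hypothetical helper for illustration only.
    """
    rows, inner, cols = len(a_q), len(b_q), len(b_q[0])
    out_scale = a_scale * b_scale  # one combined scale for the epilogue
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0  # integer accumulator (INT32 on a GPU)
            for k in range(inner):
                acc += a_q[i][k] * b_q[k][j]
            # The "fused" step: rescale to float right here instead of
            # writing the integer result out and converting it later.
            c[i][j] = acc * out_scale
    return c


# Tiny example with values that quantize exactly.
a_scale, b_scale = 0.5, 0.25
a = [[1.0, -2.0], [0.5, 3.0]]
b = [[0.25, 1.0], [-0.5, 0.75]]
a_q = [quantize(row, a_scale) for row in a]
b_q = [quantize(row, b_scale) for row in b]
result = int8_gemm_fused(a_q, b_q, a_scale, b_scale)
```

In this toy case `result` equals the exact floating-point product of `a` and `b`, because every input is a whole multiple of its scale. With real weights and activations the quantization step introduces rounding error.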