Discussion about this post

Rainbow Roxy:

Thanks for writing this, it clarifies a lot. It’s truly amazing how focusing on just matrix multiplication with a systolic array can unlock such immense power. What a clever design!

Neural Foundry:

Exceptional breakdown of the systolic array tradeoff. The data reuse insight is critical because memory bandwidth, not FLOPS, is the real bottleneck in modern ML. I remember benchmarking a sparse attention layer that ran slower on TPUs than on V100s precisely because the systolic array processed all those zeros. The bit about XLA limiting escape hatches is underrated - when your model doesn't fit the compiler's optimization patterns, you're kinda stuck.
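The point about zeros can be illustrated with a minimal sketch (this is my own toy example, not code from the post): a naive dense matmul, like a systolic array, performs the same number of multiply-accumulates regardless of how sparse the operands are.

```python
import numpy as np

def dense_matmul_macs(a, b):
    """Naive dense matmul that counts every multiply-accumulate,
    mirroring how a systolic array streams all operands, zero or not."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    out = np.zeros((n, m))
    macs = 0
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i, j] += a[i, p] * b[p, j]  # a zero still costs one MAC
                macs += 1
    return out, macs

rng = np.random.default_rng(0)
dense = rng.standard_normal((8, 8))
sparse = dense * (rng.random((8, 8)) < 0.1)  # roughly 90% zeros

_, macs_dense = dense_matmul_macs(dense, dense)
_, macs_sparse = dense_matmul_macs(sparse, sparse)
assert macs_dense == macs_sparse == 8 * 8 * 8  # identical work despite sparsity
```

A sparsity-aware kernel could skip the zero products, but the fixed dataflow of a systolic array cannot, which is why the sparse layer saw no speedup there.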
