Discussion about this post

Neural Foundry

The breakdown of how dictionary encoding limits actually trip up production systems is something I wish I'd understood earlier. I spent way too long debugging slow queries on high-cardinality event data before realizing the dictionaries were overflowing and the writer was falling back to plain encoding. The part about row group statistics being most effective on sorted data is spot-on but rarely gets discussed: with randomly ordered data, nearly every row group's min/max range covers most of the value domain, which basically negates the whole min/max optimization. One thing that often bites teams is that the file size sweet spot changes with the storage backend. What works great on local SSD can perform totally differently on S3, where connection overhead dominates.
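
To make the sorted-versus-random point concrete, here is a minimal sketch in plain Python. It does not use Parquet at all; it only mimics the idea of per-row-group min/max statistics. The helper names, the group size, and the predicate range are all hypothetical, chosen for illustration.

```python
import random


def row_group_stats(values, group_size):
    """Split values into fixed-size row groups and return (min, max) per group."""
    stats = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        stats.append((min(group), max(group)))
    return stats


def groups_to_scan(stats, lo, hi):
    """Count row groups whose [min, max] range overlaps the filter lo <= x <= hi."""
    return sum(1 for g_min, g_max in stats if g_max >= lo and g_min <= hi)


random.seed(0)  # fixed seed so repeated runs behave the same

values = list(range(100_000))  # hypothetical event IDs, already sorted
shuffled = values[:]
random.shuffle(shuffled)       # same data, random order

GROUP_SIZE = 10_000            # illustrative only, not a recommended setting
LO, HI = 42_000, 43_000        # hypothetical range filter

sorted_scan = groups_to_scan(row_group_stats(values, GROUP_SIZE), LO, HI)
random_scan = groups_to_scan(row_group_stats(shuffled, GROUP_SIZE), LO, HI)

total = len(values) // GROUP_SIZE
print(f"sorted:   scan {sorted_scan} of {total} row groups")
print(f"shuffled: scan {random_scan} of {total} row groups")
```

On the sorted copy only the one row group containing the range has to be read. On the shuffled copy each group's min and max sit near the ends of the full value range, so the statistics rule out almost nothing.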
