GeistHaus
log in · sign up

Compress-a-Palooza: Unpacking 5 Billion Varints in only 4 Billion CPU Cycles

bazhenov.me

Introduction Link to heading Varint is a widely recognized technique used for compressing integer streams. Essentially, it suggests that it can be more efficient to encode a number using a variable-length representation instead of a fixed-size binary representation. By removing leading zeros from the binary number, the overall representation size can be reduced. This technique works particularly well for encoding smaller numbers. In this article, I provide a brief introduction and rationale for varint encoding. Additionally, I describe the Stream VByte format, which enables fully vectorized decoding through SSSE3 instructions. I also share my findings from implementing this algorithm in Rust, which includes both encoding and decoding primitives and the ability to read data from both RAM and disk.

2 pages link to this URL
Phrase Matching in Marginalia Search

Marginalia Search now properly supports phrase matching. This not only permits a more robust implementation of quoted search queries, but also helps promote results where the search terms occur in the document exactly in the same order as they do in the query. This is a write-up about implementing this change. This is going to be a relatively long post, as it represents about 4 months of work. I’m also happy and grateful to announce that the nlnet people reached out after the run of the grant was over and asked me if I had more work in the pipe, and agreed to fund this change as well!

0 inbound links article en log nlnetsearch-engineProgramming