It splits the input into adaptively-sized blocks (quanta), runs a competition between many specialized codecs on each block, and emits the smallest result.
This is, for lack of a better term, a "metacompressor", but it will be interesting to see which of the choices end up dominating; in my past experiences with metacompression, one algorithm is usually consistently ahead.
> "fc is a lossless compressor for streams of IEEE-754 64-bit doubles."
The new OpenZL SDDL2 (Simple Data Description Language) supports several different floating-point types. It would be worthwhile to contribute some of the FC project's experience to OpenZL. Now the OpenZL supported types:
Those interested in this might find my paper, "Representing numeric data in 32 bits while preserving 64-bit precision", of interest. It can be found at
https://arxiv.org/abs/1504.02914 (note the code available as auxiliary files). In the context of this compressor, it could be one of the compressors competing to compress a block. It works well for data converted from a decimal representation with a small number of digits.
I’ve been skimming the source code and it looks promising for the stated use case. Wondering how to configure and set it up for a producer/consumer scenario where the producer puts compressed bytes on the wire and the consumer processes it; I can definitely see a use case where an edge sensor pumps compressed data to a cloud server with a GPU, though I don’t usually pipe doubles to a GPU.
Something worth thinking about, since you mentioned it’s geared towards “scientific” data streams: if we’re talking about precise measurements from instruments, your sensor is typically an analog signal which you digitize. Digitizers exist that can output floats, but ADCs used in industry like a Rincon or Alazar (that sample at multiples of 100 MHz) prefer to output quantized shorts or ints that are rescaled to a float with a magic number (e.g. 32767/pi for a phase measurement, or gain/(16 mA) for industrial transducers) somewhere down the line. I bring this up because you pointed out your max throughput is about 120 MiB/s, which would make it a big bottleneck for scientific data coming out of a digitizer that can pump out 800-1600 MiB/s. 120 MiB/s of doubles is not really that high for CPU-level computations or network Tx bandwidth on modern hardware.
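To make the rescaling step concrete, here is a minimal sketch assuming a signed 16-bit phase sample whose full scale (32767) maps to pi radians. The names `COUNTS_PER_RADIAN` and `counts_to_phase` are made up for illustration and do not come from any digitizer SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PI 3.14159265358979323846

/* Assumed scale factor: 32767 counts correspond to pi radians. */
static const double COUNTS_PER_RADIAN = 32767.0 / PI;

/* Convert raw 16-bit digitizer counts to phase in radians. */
static void counts_to_phase(const int16_t *in, double *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = (double)in[i] / COUNTS_PER_RADIAN;
}
```

Each output double is derived from a 2-byte sample, so where the raw counts are still available, shipping those and rescaling on the consumer side moves a quarter of the bytes before any compressor is involved.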
I must say, for a library advertising handling of streams of data, the absence of a stream utility to [input] | fc | fc -d surprised me.
I understand this is more the primitive that you would build such a thing on top of, just that the first question I always have for novel compressors is "how do they do on these example streams of data".
A lossy compressor might also be useful for common floating point apps. The simplest compressor ever would just chop off a number of bits from the mantissa.
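A minimal sketch of that mantissa-chopping idea, assuming IEEE-754 binary64 doubles. The function name `chop_mantissa` is hypothetical and is not part of fc.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Zero the lowest `drop_bits` bits of the 52-bit mantissa field.
 * Sign and exponent are untouched, so the value is truncated toward
 * zero in magnitude. */
static double chop_mantissa(double x, unsigned drop_bits)
{
    assert(drop_bits <= 52);
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);            /* well-defined type pun */
    bits &= ~UINT64_C(0) << drop_bits;         /* clear the low bits */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

The zeroed low bits are what a downstream lossless coder can then exploit. One caveat: dropping all 52 bits turns a NaN into an infinity, so non-finite values would need to be passed through unchanged.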
These comparisons tend to be heavily dataset-dependent in ways that matter. Pcodec's approach exploits autocorrelation well on smooth, slowly-varying series; if fc makes different structural assumptions about the input distribution, neither will dominate across all cases. As far as I know the space is still fragmented enough that no single library wins on both smooth time series and noisier scientific float arrays, so the benchmark dataset choice is the real variable.
I built "fc", a C library for compressing streams of 64-bit floating-point values without quantization.
It is not trying to replace zstd or lz4. The idea is narrower: take blocks of doubles, try a set of float-specific predictors/transforms/coders, and emit whichever representation is smallest for that block.
It is aimed at time-series, scientific, simulation, and analytics data where the numbers often have structure: smooth curves, repeated values, fixed increments, periodic signals, predictable deltas, or low-entropy mantissas.
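A minimal sketch of that per-block competition, using toy candidates rather than fc's actual codecs: raw storage, a constant block, and XOR-with-previous with leading zero bytes trimmed. Each candidate only reports the size it would need, and the selector keeps the smallest. All names here (`pick_mode`, `cost_xor`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum mode { MODE_RAW = 0, MODE_CONST = 1, MODE_XOR = 2, MODE_COUNT };

static uint64_t bits_of(double x)
{
    uint64_t b;
    memcpy(&b, &x, sizeof b);
    return b;
}

/* Raw: 8 bytes per value, always applicable. */
static size_t cost_raw(const double *v, size_t n) { (void)v; return 8 * n; }

/* Constant block: a single 8-byte value if every bit pattern matches,
 * otherwise SIZE_MAX to signal "not applicable". */
static size_t cost_const(const double *v, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (bits_of(v[i]) != bits_of(v[0])) return SIZE_MAX;
    return 8;
}

/* XOR each value with its predecessor; charge 1 length byte plus the
 * bytes of the XOR up to its highest nonzero byte. */
static size_t cost_xor(const double *v, size_t n)
{
    size_t total = 0;
    uint64_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t cur = bits_of(v[i]);
        uint64_t x = cur ^ prev;
        size_t nbytes = 0;
        while (x) { nbytes++; x >>= 8; }
        total += 1 + nbytes;
        prev = cur;
    }
    return total;
}

typedef size_t (*cost_fn)(const double *, size_t);

/* Run every candidate on the block and return the cheapest mode. */
static enum mode pick_mode(const double *v, size_t n, size_t *best_cost)
{
    static const cost_fn fns[MODE_COUNT] = { cost_raw, cost_const, cost_xor };
    assert(n > 0);
    enum mode best = MODE_RAW;
    size_t best_sz = SIZE_MAX;
    for (int m = 0; m < MODE_COUNT; m++) {
        size_t sz = fns[m](v, n);
        if (sz < best_sz) { best_sz = sz; best = (enum mode)m; }
    }
    if (best_cost) *best_cost = best_sz;
    return best;
}
```

For the block `{1.0, 1.0, 1.0, 1.5}` the XOR candidate wins at 19 bytes against 32 raw, while four unrelated values fall back to raw. A real encoder would also have to store the winning mode per block so the decoder knows which inverse to apply.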
The API is intentionally small: "fc_enc", "fc_dec", a config struct, and a few counters to inspect which modes won. Decode is parallel and meant to be fast; encode spends more CPU searching for a better representation.
Current caveats: x86-64 only for now, tuned for IEEE-754 doubles, research-grade rather than production-hardened.
Doesn't matter much. Sensor telemetry, financial ticks, scientific instrument output -- the floats are floats. Whether it works depends on autocorrelation in your data, not the source domain.
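One rough way to check that before reaching for any float compressor is the lag-1 autocorrelation of a sample of the data. This is a generic statistic, not something fc exposes, and the helper name `lag1_autocorr` is made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Lag-1 autocorrelation: sum of products of adjacent deviations from
 * the mean, divided by the sum of squared deviations.
 * Returns 0.0 for a constant series, where the ratio is undefined. */
static double lag1_autocorr(const double *v, size_t n)
{
    assert(n >= 2);
    double mean = 0.0;
    for (size_t i = 0; i < n; i++) mean += v[i];
    mean /= (double)n;

    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = v[i] - mean;
        den += d * d;
        if (i + 1 < n) num += d * (v[i + 1] - mean);
    }
    return den == 0.0 ? 0.0 : num / den;
}
```

A smooth ramp scores clearly positive and a sign-alternating series scores clearly negative. Values near zero suggest that a previous-value predictor has little to work with.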
Please run it through your preferred AI once or twice with instructions to look for bugs. The version of fc in the main branch has at least a few memory safety bugs that attacker-controlled inputs could exploit.
I'd link a chat history but the tool I used has that feature blocked for some weird reason, and the locals round these parts don't take kindly to copy-pasted AI content...
Classic C footgun. fc_enc() takes no output buffer length. SZ from Argonne had the same problem circa 2016, fixed it in 2.x after someone actually measured worst-case expansion ratios. If your size estimate is off you just silently corrupt memory. Surprised this isn't the first thing reviewers flagged.
Can you elaborate on how it detects and signals if it runs out of output buffer space? I couldn't see how the amount of available space was even communicated to `fc_enc()`.
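For comparison, here is a sketch of the usual bounded pattern: the caller passes the output capacity, and a companion function reports the worst-case size. This is a hypothetical toy format (8-byte count header plus raw doubles), not fc's real API or signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HDR 8u  /* toy header: value count as a uint64_t */

/* Worst-case output size for n doubles, or 0 if it would overflow size_t. */
static size_t toy_enc_bound(size_t n)
{
    if (n > (SIZE_MAX - TOY_HDR) / sizeof(double)) return 0;
    return TOY_HDR + n * sizeof(double);
}

/* Returns bytes written, or 0 if out_cap is too small.
 * On failure nothing is written to `out`. */
static size_t toy_enc(const double *in, size_t n, uint8_t *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);
    size_t need = toy_enc_bound(n);
    if (need == 0 || need > out_cap) return 0;

    uint64_t count = (uint64_t)n;
    memcpy(out, &count, TOY_HDR);
    memcpy(out + TOY_HDR, in, n * sizeof(double));
    return need;
}
```

A caller sizes the buffer with `toy_enc_bound(n)` and treats a zero return from `toy_enc` as "buffer too small", so a bad size estimate becomes an error return instead of a write past the end of the buffer.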
Also, there are some "C icks" (to me, I'm very picky and used to know the standard awfully well from answering many SO questions) that you might want to look into. The two I remember now are the casting of `void` pointers from allocation functions, and (worse) the assumption that "all bits zero" is how a NULL pointer is represented.
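To illustrate both points, here is a small sketch on a made-up `struct node`. It is not taken from the fc source. The allocation result is assigned without a cast, and pointer members are set to `NULL` explicitly instead of relying on `memset` or `calloc` producing a null pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct node {
    double value;
    struct node *next;
};

/* Allocate n nodes with every member explicitly initialized.
 * Returns NULL on overflow or allocation failure. */
static struct node *nodes_new(size_t n)
{
    struct node *p;
    assert(n > 0);
    if (n > SIZE_MAX / sizeof *p) return NULL;

    /* No cast needed: void * converts implicitly in C, and
     * `sizeof *p` stays correct if the type of p changes. */
    p = malloc(n * sizeof *p);
    if (p == NULL) return NULL;

    for (size_t i = 0; i < n; i++) {
        p[i].value = 0.0;   /* assigned zero, not assumed from zeroed bytes */
        p[i].next = NULL;   /* portable null, not all-bits-zero */
    }
    return p;
}
```

On mainstream platforms all-bits-zero does happen to be a null pointer, so this is about conforming to the standard rather than fixing an observable bug there.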
The buffer overflow concern is the real one — but I'm curious whether the zero-float assumption is even the bigger hazard. If any non-zero bit pattern is falsely treated as zero during decode, you get silent data corruption with no way to detect it. Has that been tested against denormals specifically?
Minor pedantry: IEEE 754 "doubles" are technically 64-bit binary floating-point, not just "64-bit doubles" -- though I realize that's how everyone says it colloquially. IIRC the standard calls them binary64. Anyway, neat project regardless.