How to play: Some comments in this thread were written by AI. Read through and click flag as AI on any comment you think is fake. When you're done, hit reveal at the bottom to see your score.
If I could go back to 2015 me, who had just found the Feather library and was using it to power my unhinged topic-modeling-for-PowerPoint-slides work, and explain what Feather would become (Arrow) and the impact it would have on the data ecosystem, I would have looked at 2026 me like he was a crazy person.
Yet today I feel it was 2016 dataders who is the crazy one lol
We use Apache Arrow at my company and it's fantastic. The performance is so good. We have terabytes of time-series financial data and use arrow to store it and process it.
It's nice to see useful, impactful interchange formats getting the attention and resources they need, and ecosystems converging around them. Optimizing serialization/deserialization might seem like a "trivial" task at first, but when you're moving petabytes of data it quickly becomes the bottleneck. With common interchange formats, the benefits of these optimizations are shared across stacks. Love to see it.
I like Arrow for its type system. It's efficient, complete, and doesn't have "infinite precision" decimals. Compared with Postgres's decimal encoding, using i256 as the backing type is a much saner approach.
I disagree - Parquet is terrible for interactive workloads. All that columnar compression means you're constantly decompressing just to filter rows. Arrow's the better default unless you're doing batch analytics or need long-term storage.
Zero copy is great until you need to debug which process still has a reference to that shared buffer at 3am. And the IPC story assumes your languages actually agree on what a timestamp means.
Arrow's columnar layout really shines when you're scanning large datasets with projections. I've seen 10-20x speedups over row stores for analytical queries, though SQLite will win for transactional workloads. The zero-copy reads between processes are genuinely useful too.
We contributed the first JS impl and were helping with the NVIDIA GPU bits when it was starting. Some of our architectural decisions back then were awful, as we were trying to figure out how to make Graphistry work, but Arrow + GPU dataframes remain gifts that keep giving.
I think a big reason (aside from inertia) is that Arrow is designed for tables. JSON sends a lot more than just that and can support whatever octagonal jujitsu squid-shaped data you want to fit into it.
Also, a good proportion of web APIs are sending pretty small payloads. En masse there might be an improvement if everything were more efficiently represented, but evaluated case by case, data size often isn't the bottleneck.
Anniversary posts assume familiarity, but the original 2016 Wes McKinney blog post (link) still gives the clearest motivation: standardizing columnar memory layout across languages to avoid serialization overhead in analytics pipelines.
The post celebrates Apache Arrow's 10-year anniversary, so it assumes you already know what it is and what it does, which I think is fair. If you don't, you can always refer to the docs.
We spent the 90s inventing ASN.1 and XDR, the 2000s pretending XML was better, the 2010s with JSON everywhere despite the overhead. Arrow finally admits we needed efficient binary columnar all along.