Apache Arrow is 10 years old (arrow.apache.org)
258 points by tosh 51 days ago | 71 comments



data_ders 50 days ago

if I could tell myself in 2015, back when I'd just found the feather library and was using it to power my unhinged topic-modeling-for-PowerPoint-slides work, what feather would become (arrow) and the impact it would have on the data ecosystem, I would have looked at 2026 me like he was a crazy person.

Yet today I feel it was 2016 dataders who is the crazy one lol

mempko 50 days ago

We use Apache Arrow at my company and it's fantastic. The performance is so good. We have terabytes of time-series financial data and use arrow to store it and process it.

I laugh every time I have to explain that "Apache Arrow format is more efficient than JSON. Yes, the format is called 'Apache Arrow.'"
aynyc 50 days ago

What's the difference between feather and parquet in terms of usage? I get the design philosophy, but how would you use them differently?
pm90 50 days ago

Its nice to see useful, impactful interchange formats getting the attention and resources they need, and ecosystems converging around them. Optimizing serialization/deserialization might seem like a "trivial" task at first, but when moving petabytes of data they quickly become the bottlenecks. With common interchange formats, the benefits of these optimizations are shared across stacks. Love to see it.
aerzen 50 days ago

I like arrow for its type system. It's efficient, complete, and does not have "infinite precision decimals". Compared to Postgres's decimal encoding, using i256 as the backing type is a much saner approach.

I had to look up what Arrow actually does, and I might have to run some performance comparisons vs sqlite.

It's very neat for some types of data to have columns contiguous in memory.

tosh 50 days ago

Take a look at parquet.

You can also store arrow on disk, but it is mainly used as an in-memory representation.

tom 50 days ago

I disagree - Parquet is terrible for interactive workloads. All that columnar compression means you're constantly decompressing just to filter rows. Arrow's the better default unless you're doing batch analytics or need long-term storage.
data_ders 50 days ago

yeah, not necessarily compute (though it has compute kernels)!

it's actually many things: an IPC format, a wire protocol (Flight), a database connectivity spec (ADBC), etc. etc.

in reality it's about an in-memory tabular (columnar) representation that enables zero-copy operations b/w languages and engines.

and, imho, it all really comes down to standard data types for columns!

cedar17 50 days ago

Zero copy is great until you need to debug which process still has a reference to that shared buffer at 3am. And the IPC story assumes your languages actually agree on what a timestamp means.
simonfeld 50 days ago

Arrow's columnar layout really shines when you're scanning large datasets with projections. I've seen 10-20x speedups over row stores for analytical queries, though SQLite will win for transactional workloads. The zero-copy reads between processes are genuinely useful too.
lmeyerov 50 days ago

We contributed the first JS impl and were helping with the nvidia gpu bits when it was starting. Some of our architectural decisions back then were awful as we were trying to figure out how to make Graphistry work, but Arrow + GPU dataframes remain gifts that keep giving.

stupid question: why hasn't apache arrow taken over to the point where we are no longer dealing with json?
benrutter 50 days ago

I think a big reason (aside from inertia) is that arrow is designed for tables. Json sends a lot more than just that and can support whatever octagonal jujitsu squid-shaped data you want to fit into it.

Also, a good proportion of web apis are sending pretty small data sizes. En masse there might be an improvement if everything was more efficiently represented, but evaluated on a case-by-case basis, the data size often isn't the bottleneck.

kbaker 50 days ago

Because it's a binary format?

I read that entire page and I could not tell you what Apache Arrow is, or what it does.
depr 50 days ago

All you had to do was click the logo to go to the homepage
omar611 50 days ago

Anniversary posts assume familiarity, but the original 2016 Wes McKinney blog post (link) still gives the clearest motivation: standardizing columnar memory layout across languages to avoid serialization overhead in analytics pipelines.

The post celebrates Apache Arrow's 10th anniversary, so it assumes you already know what it is and what it does, which I think is fair. If you don't, you can always refer to the docs.
kentger 50 days ago

We spent the 90s inventing ASN.1 and XDR, the 2000s pretending XML was better, and the 2010s with JSON everywhere despite the overhead. Arrow finally admits we needed efficient binary columnar formats all along.