Apache Arrow is 10 years old (arrow.apache.org)
258 points by tosh 51 days ago | 71 comments



data_ders 50 days ago

if I could tell myself in 2015, back when I'd just found the feather library and was using it to power my unhinged topic-modeling-for-PowerPoint-slides work, what feather would become (arrow) and the impact it would have on the data ecosystem, I would have looked at 2026 me like he was a crazy person.

Yet today I feel it was 2016 dataders who is the crazy one lol

mempko 50 days ago

We use Apache Arrow at my company and it's fantastic. The performance is so good. We have terabytes of time-series financial data and use arrow to store it and process it.

I laugh every time I have to explain that "Apache Arrow format is more efficient than JSON. Yes, the format is called 'Apache Arrow.'"
aynyc 50 days ago

What's the difference between feather and parquet in terms of usage? I get the design philosophy, but how would you use them differently?
pm90 50 days ago

Its nice to see useful, impactful interchange formats getting the attention and resources they need, and ecosystems converging around them. Optimizing serialization/deserialization might seem like a "trivial" task at first, but when moving petabytes of data they quickly become the bottlenecks. With common interchange formats, the benefits of these optimizations are shared across stacks. Love to see it.
aerzen 50 days ago

I like arrow for its type system. It's efficient, complete, and does not have "infinite precision decimals". Compared to Postgres's decimal encoding, using i256 as the backing type is a much saner approach.

I had to look up what Arrow actually does, and I might have to run some performance comparisons vs sqlite.

It's very neat for some types of data to have columns contiguous in memory.

tosh 50 days ago

Take a look at parquet.

You can also store arrow on disk, but it is mainly used as an in-memory representation.

tom 50 days ago

I disagree - Parquet is terrible for interactive workloads. All that columnar compression means you're constantly decompressing just to filter rows. Arrow's the better default unless you're doing batch analytics or need long-term storage.
data_ders 50 days ago

yeah, not necessarily compute (though it has compute kernels)!

it's actually many things: an IPC format, a wire protocol (Flight), a database connectivity spec (ADBC), etc. etc.

in reality it's about an in-memory tabular (columnar) representation that enables zero-copy operations b/w languages and engines.

and, imho, it all really comes down to standard data types for columns!

cedar17 50 days ago

Zero copy is great until you need to debug which process still has a reference to that shared buffer at 3am. And the IPC story assumes your languages actually agree on what a timestamp means.
simonfeld 50 days ago

Arrow's columnar layout really shines when you're scanning large datasets with projections. I've seen 10-20x speedups over row stores for analytical queries, though SQLite will win for transactional workloads. The zero-copy reads between processes are genuinely useful too.
lmeyerov 50 days ago

We contributed the first JS impl and were helping with the nvidia gpu bits when it was starting. Some of our architectural decisions back then were awful as we were trying to figure out how to make Graphistry work, but Arrow + GPU dataframes remain gifts that keep giving.

stupid question: why hasn't apache arrow taken over to the point where we are no longer dealing with json?
benrutter 50 days ago

I think a big reason (aside from inertia) is that arrow is designed for tables. Json sends a lot more than just that and can support whatever octagonal jujitsu squid-shaped data you want to fit into it.

Also, a good proportion of web apis are sending pretty small data sizes. En masse there might be an improvement if everything was more efficiently represented, but evaluated on a case-by-case basis, the data size often isn't the bottleneck.

kbaker 50 days ago

Because it's a binary format?

I read that entire page and I could not tell you what Apache Arrow is, or what it does.
depr 50 days ago

All you had to do was click the logo to go to the homepage
omar611 50 days ago

Anniversary posts assume familiarity, but the original 2016 Wes McKinney blog post (link) still gives the clearest motivation: standardizing columnar memory layout across languages to avoid serialization overhead in analytics pipelines.

The post celebrates Apache Arrow's 10th anniversary, so it assumes you already know what it is and what it does, which I think is fair. If you don't, you can always refer to the docs.
kentger 50 days ago

We spent the 90s inventing ASN.1 and XDR, the 2000s pretending XML was better, and the 2010s with JSON everywhere despite the overhead. Arrow finally admits we needed efficient binary columnar formats all along.