Allocating on the Stack (go.dev)
161 points by spacey 35 days ago | 52 comments

It's kind of like the small string optimization you see in C++[1], where all the string metadata (heap pointer, size, and capacity) is union'ed with the char data. Getting the stack allocation doesn't cost extra memory, but it does cost a bit check. Not sure if slices in Go use the same method. 32 bytes is a lot, so maybe they fattened the slice representation a bit to get more bang for your buck?

[1] https://github.com/elliotgoodrich/SSO-23


This article is about Go, but I wonder how many C/C++ developers realize that they've always had the ability to allocate on the stack using alloca() rather than malloc().

Of course, use cases are limited (variable-length buffers/strings, etc.), since the lifetime of anything on the stack has to match the lifetime of the stack frame (i.e. the calling function), but it's super fast since it's just bumping the stack pointer.


It is a good thing many people do not know about it: if you need it to squeeze out that little performance window, you'd better know what you are doing.
lstodd 35 days ago

It becomes super slow when you bump that pointer into a page that's missing from the TLB.
harbor 35 days ago

We hit this in prod when a spike in concurrent requests forced Go to allocate a bunch of large string buffers. TLB misses killed throughput until we capped goroutines and pooled those buffers. Turns out the escape analysis was putting more on heap than we expected anyway.
fpark 35 days ago

I disagree that alloca() is "super fast" in practice. The stack pointer bump is trivial compared to the cache misses you'll cause by fragmenting your stack frame. In real workloads, the unpredictable frame sizes make the performance worse than just using a reasonable malloc implementation with a good allocator.

Nice to see common and natural patterns have their performance improved. Theoretically, appending to a slice could be handled with just stack growth, but that would require large gaps between goroutine stacks, mapped lazily on access, instead of moving goroutines to new contiguous blocks as it's implemented right now. But given how many questionable changes to the runtime that would require, it's certainly not going to happen :)

Having big stack frames is bad for cache locality. The stack is not something magical: it's mapped to the same physical memory as the heap and needs to be loaded all the same. Pretty sure such an optimization would reduce performance in most cases.
powen 35 days ago

Wait until your profiler tells you that goroutine stack growth is burning 15% of CPU because of all this clever optimization. I've seen worse decisions ship, though.
anematode 35 days ago

Awesome stuff! Does Go have profile-guided optimization? I'm wondering whether a profile could hint to the compiler how large to make the pre-reserved stack space.
tptacek 35 days ago

Yep. `go build -pgo=foo.pprof`

https://go.dev/doc/pgo

csjh 35 days ago

Optimizations like these are so cool. I love seeing higher-level languages take advantage of their high-level-ness.

Agreed. There's quite a bit of room for optimization if your language design allows for it. Plus you have flexibility to make different tradeoffs as computer architectures and the cost of various operations change over time.
fsckboy 35 days ago

you can do this in C, you just need to let its low-level-ness be at the same level as everything else you do: just setjmp/longjmp

I want to like this, and it's directionally good work...

But it's hard to see this as very useful unless we also start to see some increases in legibility, and ways to make sure these optimizations are being used (and that textually minor changes don't cause non-obvious performance regressions).

I've written a lot of golang code that was benchmarked to shreds, and in which we absolutely cared about stack-vs-heap allocations because they were crucial to overall performance. I've spent a lot of time poring over assembler dumps, because grepping those for indications of new object creation was sometimes clearer (and certainly more definitive) than trying to infer it from the source code level. The one thing I've learned from this?

It's very, very easy for all those efforts to come to naught if the rules change slightly.

And it's very, very, VERY easy for a co-maintainer on a project to stroll in and make seemingly textually trivial changes that have outsized impacts. (I'm looking at inliner thresholds, specifically. Hoo boy.)

The best balm we have for this right now is writing benchmarks and making sure they report zero allocs. (Or unit tests using the runtime memstats harness; potato potato.) But that is a very fragile balm, and relatively complex to maintain, and (if DX is considered) is not textually local to the code in question -- which means someone changing the code can easily miss the criticality of a section (until the tests yell at them, at least).

I really yearn for some markup that can say "I expect this code to contain zero heap allocations; please flunk the compile if that is not the case".

OptionOfT 35 days ago

> [...] On the third loop iteration, the backing store of size 2 is full. append again has to allocate a new backing store, this time of size 4. The old backing store of size 2 is now garbage.

Correct me if I'm wrong, but isn't this a worst-case scenario? realloc can, iirc, extend in place. Your original pointer is technically still invalid then, but no copy is needed.

Unless I'm missing something?

Equally, what happens to the ordering of variables on the stack? Is this new one pushed as the last one? Or is there space kept open?

E.g.:

    var tasks []task
    var other_var int
kbolino 35 days ago

The ability to grow without copying is already part of how slices work. Every slice is really a 3-word tuple of pointer, length, and capacity. If not explicitly set with make, the capacity property defaults to a value that fills out the size class of the allocation. It just so happens that, in this case, the size of the Task type doesn't allow for more than 1 value to fit in the smallest allocation. If you were to do this with a []byte or []int32 etc., you would see that the capacity doesn't necessarily start at 1: https://go.dev/play/p/G5cifdChGIZ

[]task is a pointer to a range of elements. TFA says if you initialize it to point to a new array of 10 elements, that array of 10 elements may be stack-allocated. If you allocate another array dynamically, that one won't be.
derekwin 35 days ago

realloc can extend in place if there's room after the allocation, but on the stack you've got other frames above you. The whole point of escape analysis is that you don't need realloc: the compiler sees it fits and stack-allocates upfront. This isn't new; we did similar tricks in C++ 20 years ago.

Nice! That's (seems) so simple yet also so very effective. Shouldn't other memory-managed languages be able to profit from this as well?

It’s a very well known pattern; as someone else mentioned, it’s used in C++'s small-string optimization, Rust's smallvec, usually hand-rolled in C, etc.
lionkor 35 days ago

C# has `stackalloc`
lstodd 35 days ago

I read that as "Allocating on the Slack" and immediately came up with three ways how to do that.

It is actually bad design when a compiler goes this far into micro-optimization while assuming it understands the context well enough to make decisions for you.
zabzonk 35 days ago

alloca() is not part of the C++ standard, and I can't imagine how it could be used safely in a C++ environment.
mwkaufma 35 days ago

If I had a nickel for every article about avoiding implicit boxing in gc-heap languages...
adonovan 35 days ago

...you would have the same balance as before, because this is not an article about implicit boxing. ;-)
avi 35 days ago

There's been work on escape analysis for decades, but I find it interesting that Go's approach seems to rely heavily on inlining to expose allocation patterns. As far as I know, most static analysis in this space trades precision for compilation speed; Go appears to lean harder into precision, at least for these specific cases.