It's kind of like the small string optimization you see in C++[1], where the string metadata (heap pointer, size, and capacity) is unioned with an inline character buffer. Getting the stack allocation doesn't cost extra memory, but it does cost a bit check. Not sure whether slices in Go use the same method. 32 bytes is a lot, so maybe they fattened the slice representation a bit to get more bang for your buck?
This article is about Go, but I wonder how many C/C++ developers realize that you've always had the ability to allocate on the stack using alloca() rather than malloc().
Of course, use cases are limited (variable-length buffers/strings, etc.) since the lifetime of anything on the stack has to match the lifetime of the stack frame (i.e., the calling function), but it's super fast since it's just bumping the stack pointer.
We hit this in prod when a spike in concurrent requests forced Go to allocate a bunch of large string buffers. TLB misses killed throughput until we capped goroutines and pooled those buffers. Turns out escape analysis was putting more on the heap than we expected anyway.
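A minimal sketch of the buffer pooling described above, using sync.Pool (names like handle are illustrative, not actual production code):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so a request spike doesn't
// translate into a spike of large heap allocations.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// handle is a hypothetical request handler that borrows a buffer
// from the pool and returns it when done.
func handle(req string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	buf.WriteString("resp:")
	buf.WriteString(req)
	return buf.String()
}

func main() {
	fmt.Println(handle("42"))
}
```

Pooling doesn't eliminate the allocations, but it amortizes them across requests instead of paying for a fresh large buffer on every spike.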
I disagree that alloca() is "super fast" in practice. The stack pointer bump is trivial compared to the cache misses you'll cause by fragmenting your stack frame. In real workloads, the unpredictable frame sizes make the performance worse than just using a reasonable malloc implementation with a good allocator.
Nice to see common and natural patterns have their performance improved. Theoretically, appending to a slice could be handled with just stack growth, but that would require large gaps between goroutine stacks, mapped lazily upon access, instead of moving goroutines to new contiguous blocks as implemented right now. Given how many questionable changes that requires from the runtime, it's certainly not going to happen :)
Having big stack frames is bad for cache locality. The stack is not something magical; it's mapped to the same physical memory as the heap and needs to be loaded. Pretty sure such an optimization would reduce performance in most cases.
Wait until your profiler tells you that goroutine stack growth is burning 15% of CPU because of all this clever optimization. I've seen worse decisions ship, though.
Awesome stuff! Does Go have profile-guided optimization? I'm wondering whether a profile could hint to the compiler how large to make the pre-reserved stack space.
Agreed. There's quite a bit of room for optimization if your language design allows for it. Plus you have flexibility to make different tradeoffs as computer architectures and the cost of various operations change over time.
I want to like this, and it's directionally good work...
But it's hard to see this as very useful unless we also start to see some increases in legibility, and ways to make sure these optimizations are being used (and that textually minor changes don't cause non-obvious performance regressions).
I've written a lot of golang code that was benchmarked to shreds, and in which we absolutely cared about stack-vs-heap allocations because they were crucial to overall performance. I've spent a lot of time poring over assembler dumps, because grepping those for indications of new object creation was sometimes clearer (and certainly more definitive) than trying to infer it from the source code level. The one thing I've learned from this?
It's very, very easy for all those efforts to come to naught if the rules change slightly.
And it's very, very, VERY easy for a co-maintainer on a project to stroll in and make seemingly textually trivial changes that have outsized impacts. (I'm looking at inliner thresholds, specifically. Hoo boy.)
The best balm we have for this right now is writing benchmarks and making sure they report zero allocs. (Or unit tests using the runtime memstats harness; potato potato.) But that is a very fragile balm, and relatively complex to maintain, and (if DX is considered) is not textually local to the code in question -- which means someone changing the code can easily miss the criticality of a section (until the tests yell at them, at least).
I really yearn for some markup that can say "I expect this code to contain zero heap allocations; please flunk the compile if that is not the case".
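For concreteness, the zero-alloc benchmark guard mentioned above can be sketched with testing.AllocsPerRun (sumInts is a stand-in for whatever hot function is being protected):

```go
package main

import (
	"fmt"
	"testing"
)

// sumInts is a stand-in for a hot path we expect to be alloc-free.
func sumInts(xs []int) int {
	total := 0
	for _, x := range xs {
		total += x
	}
	return total
}

func main() {
	data := []int{1, 2, 3, 4}
	// AllocsPerRun reports the average number of heap allocations
	// per call; a test can fail when this is nonzero, which is the
	// "fragile balm" the comment describes.
	allocs := testing.AllocsPerRun(100, func() {
		_ = sumInts(data)
	})
	fmt.Println(allocs)
}
```

In a real suite this would live in a `_test.go` file (or a benchmark with `b.ReportAllocs()`), but the check itself is just this one comparison against zero.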
> ...
> On the third loop iteration, the backing store of size 2 is full. append again has to allocate a new backing store, this time of size 4. The old backing store of size 2 is now garbage.
Correct me if I'm wrong, but isn't this a worst-case scenario? realloc can, iirc, extend in place. Your original pointer is formally invalidated either way, but no copy is needed in that case.
Unless I'm missing something?
Equally, what happens to the ordering of variables on the stack? Is this new one pushed as the last one? Or is there space kept open?
The ability to grow without copying is already part of how slices work. Every slice is really a 3-word tuple of pointer, length, and capacity. If not explicitly set with make, the capacity property defaults to a value that fills out the size class of the allocation. It just so happens that, in this case, the size of the Task type doesn't allow for more than 1 value to fit in the smallest allocation. If you were to do this with a []byte or []int32 etc., you would see that the capacity doesn't necessarily start at 1: https://go.dev/play/p/G5cifdChGIZ
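A quick way to see the size-class rounding described above (the exact capacities are a runtime implementation detail, assumed here to reflect the current smallest 8-byte size class):

```go
package main

import "fmt"

func main() {
	// Appending one element to a nil slice lets the runtime pick a
	// capacity that fills out the allocation's size class rather
	// than just fitting the single element.
	bs := append([]byte(nil), 1)  // 1 byte requested, rounds up
	is := append([]int32(nil), 1) // 4 bytes requested, rounds up
	fmt.Println(len(bs), cap(bs))
	fmt.Println(len(is), cap(is))
}
```

With an 8-byte smallest size class, the []byte slice gets capacity 8 and the []int32 slice gets capacity 2 from the same-sized allocation, which is exactly why a large element type like the Task in the article starts at capacity 1.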
[]task is a pointer to a range of elements. TFA says if you initialize it to point to a new array of 10 elements, that array of 10 elements may be stack-allocated. If you allocate another array dynamically, that one won't be.
realloc can extend in place if there's room after the allocation, but on the stack you've got other frames above you. The whole point of escape analysis is you don't need realloc—compiler sees it fits, stack-allocates upfront. This isn't new; we did similar tricks in C++ 20 years ago.
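To put the "compiler sees it fits, stack-allocates upfront" point in Go terms: a make with a constant, small capacity that never escapes the function can live entirely on the stack, which shows up as zero measured allocations (useTasks is a hypothetical example):

```go
package main

import (
	"fmt"
	"testing"
)

// useTasks builds a small, fixed-capacity slice that never leaves
// the function, so escape analysis can keep the backing array on
// the stack instead of calling into the allocator.
func useTasks() int {
	tasks := make([]int, 0, 8) // constant, small capacity
	for i := 0; i < 8; i++ {
		tasks = append(tasks, i)
	}
	total := 0
	for _, t := range tasks {
		total += t
	}
	return total
}

func main() {
	fmt.Println(useTasks())
	fmt.Println(testing.AllocsPerRun(100, func() { useTasks() }))
}
```

Running `go build -gcflags=-m` on code like this reports whether the make escapes; returning the slice (or growing it past the declared capacity) would force it onto the heap.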
It is actually a bad design when the compiler goes this far into micro-optimization while assuming it understands the context well enough to make decisions for you.
There's been work on escape analysis for decades, but I find it interesting that Go's approach seems to rely heavily on inlining to expose allocation patterns. As far as I know, most static analysis in this space trades precision for compilation speed—Go appears to lean harder into the former, at least for these specific cases.
[1] https://github.com/elliotgoodrich/SSO-23