Traceway: MIT-licensed observability stack you can self-host in ~90s

denysvitali · 55 days ago

At KubeCon Europe a very good chunk of booths were observability stacks. Everyone was claiming they're better than the competitors (with some of them justifying themselves just by saying "it's written in Rust").

Having dealt with Prometheus (+Thanos) / Grafana / OTEL and other stacks (e.g: custom solution on ClickHouse, Victoria{Metrics,Logs}, Jaeger/Tempo, Loki, ...) and even cloud ones (Google's Monarch rebranded as Prometheus)... what's your selling point? This to me seems like yet another way to re-invent the wheel.

If it's just for running locally, okay, fine, but when it comes to production (where the stack really matters) at scale, you end up with lots of tradeoffs and approaches.

Why is this one a winning one compared to the overwhelming "competition"? Seems like we're re-inventing the wheel for the 100th time instead of focusing on unifying the efforts in making the existing solutions better. Thankfully we now have OTEL, so at least the interoperability part is somewhat solved (or mitigated)

ting0 · 55 days ago

Do you think Prometheus + Grafana is the way to go?

CyberDildonics · 55 days ago

Is "observability stack" the new term for logs and stats?

lancekov · 54 days ago

Technically, no - observability in software came from control theory and was formalized to mean logs, metrics, and traces as distinct signal types. "Logs and stats" conflates all three. Whether the distinction matters in practice is debatable, but vendors have latched onto the term precisely because it sounds more rigorous than "we collect your logs."

ivan · 55 days ago

The "selling point" question assumes incumbents are good enough. Prometheus scrape model is miserable at high cardinality. Grafana dashboarding is a full-time job. If this thing genuinely reduces that overhead I don't need a feature matrix, I need a quick demo environment — which the 90s claim is at least attempting.
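
To make the cardinality point concrete, here is a minimal sketch of why one extra label can blow up a metric: the worst-case number of time series is the product of the distinct values of each label. The label names and counts below are made up for illustration and are not measurements from any real system.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Worst-case number of time series for one metric.

    label_values maps a label name to the number of distinct values it takes.
    Every combination of label values is a separate series.
    """
    return prod(label_values.values())


# Hypothetical HTTP request counter with three bounded labels.
base = {"method": 5, "status": 10, "instance": 20}
print(series_count(base))  # 1000

# Adding one unbounded label (here a hypothetical user_id) multiplies everything.
with_user = {**base, "user_id": 10_000}
print(series_count(with_user))  # 10000000
```

Real series counts are usually lower than this bound because not every combination occurs, but the multiplication is what makes unbounded labels expensive.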

dusanstanojevic · 54 days ago

Hi, I am the creator of Traceway. I've just realized that someone posted about it.

Unfortunately my account is being rate limited and I can't respond to each comment.

Thank you for your support; the attention the project has received has been unreal.

I'll be responding to everyone as the rate limit subsides but I've made this in the meantime: https://github.com/tracewayapp/traceway/blob/main/HN.md

Again, thank you for your support!

tecoholic · 55 days ago

I was looking into this just yesterday. So the Loki + … comparison is a bit off in the Open Source space. The main ones in this space are SigNoz and ClickStack. Both use ClickHouse as the database. Heavy compared to something like Loki, but they are OTEL native and not log monitoring. So not in the same category.

jillesvangurp · 55 days ago

I used SigNoz + ClickStack on a vibe coded Go server project a few weeks ago. I just made Codex figure out how to set up SigNoz + dependencies via docker compose. I even got it to pre-populate SigNoz with dashboards. It wasn't too bad. The whole thing runs with a few GB. I tried to cover metrics, tracing, and logging at the same time. This is not a production ready setup but you need to trade off cost vs. utility here. If it's useful enough, that could justify extra cost.

I have a background in having done a lot of stuff on the Elastic stack related to this; including setting up a big Elastic Fleet based stack for one client at some point. It might not be the cheapest, but it does provide awesome filtering and querying capabilities. However, a lot of teams that use it don't really know how to tap into that capability so it tends to be overengineered for what it does in the end. And the extra, underutilized complexity is why a lot of teams are wary of dealing with that stack.

Storing the data is the easy part but what's the point if you can't run queries against it and produce dashboards and diagnostic tools that actually help you? Prometheus/grafana or older graphite type setups tend to be compromises where you get lots of data but are then limited on the querying front or the number of metrics. The tradeoff is always between scale and querying flexibility. If you store tens/hundreds of GB of telemetry per day, you need a way to make sense of it. Clickhouse seems to be quite good at scaling and querying. It's basically a column database. I don't have direct experience with Loki.

But in the end, all that power only matters if people actually use it. And, again, in my experience teams tend not to. They tend to have a lot of unrealized aspirations around their tools and infrastructure. If it's just a dumping ground for data + a few simplistic dashboards, optimize for that. A lot of that data is actually only kept for compliance/auditing reasons. For that, querying is usually a secondary concern and it's OK if queries take a bit longer and are less powerful.

dusanstanojevic · 54 days ago

Agreed, it's a trade-off I am ok with for now.

In reality it's a very modular system: the telemetry repositories can be swapped out easily. I have implemented a ClickHouse and a SQLite version (to simplify self-hosting), so adding a Loki-like repository would be a breeze. It's not on the roadmap currently as I am putting a lot of effort into 3 different parts right now.

The truth is that ClickHouse is an incredible DB that scales really well for observability data.

adenta · 55 days ago

I'm partial to OpenObserve, especially because in Ruby the OTEL stuff isn't great for metrics and logs yet.

max · 55 days ago

Ruby OTEL immaturity is real -- same thing happened with Python circa 2021, took another year before the SDK stabilized. OpenObserve at least lets you fall back to fluent/fluentd pipelines until the native instrumentation catches up.

lytedev · 55 days ago

I also run OpenObserve at home, but I can't help but feel that the interface could use some... sparkle, and the mobile experience kinda sucks.

But you can't beat the excellent price and performance. Does what I need and much more

blazarquasar · 54 days ago

Given the heavy LLM usage, I’d probably be a little concerned about the project’s longevity. I personally also can’t stand seeing that typeface on websites anymore…

dusanstanojevic · 53 days ago

I really did not plan for this to be on HN yet. I think you have a great point and that with all of the projects popping up people should be skeptical of trying things.

Trust is hard to earn and that is why everything I have done with Traceway has been and will continue to be open source. Traceway cloud currently has 3 enterprise customers that are using it and about 50 on lower tier plans. It's an actual live product.

The marketing website has been fully vibe coded; I have too much on my plate right now and I'm not great at designing marketing pages. At some point I'm planning to rewrite it, since it has been what most people have complained about, but I have too many things that I need to finish first in the actual product.

I use Claude Code periodically. Other than telling you to check out my git commit history for the last 10 years, there is not much more I can do. The number of commits this year has not been any greater than before. I don't think I'm pushing to get things out too quickly or with lower quality.

If you want to read an engineering article I've written recently to see how I approach things here is one I am proud of: https://medium.com/@dusan.stanojevic.cs/flutter-session-repl...

Other than that I just have to continue building, there is nothing else I can do, but I understand where you're coming from and I think that your concern is absolutely valid.

ddux1389 · 54 days ago

Hey everyone, I'm the original creator of this project. Just saw this thread, I'll do my best to respond to everyone.

amne · 55 days ago

how can you claim in the readme "no per-language vendor SDK" and then link to a list of per-language client SDKs?

dusanstanojevic · 54 days ago

Hi, sorry for not responding sooner, didn't realize this post existed.

Traceway is fully OTel compliant.

Go: The original version started with Go SDKs. I've since moved to using Go OTel. I haven't updated those docs yet because the Go SDKs still work and are used in the wild, but they're on the deprecation track. Thanks for pointing it out.

Symfony: There were no good one-line OTel integrations out there for Symfony, so we wrote one. It is not a custom SDK, it's an OTel configurator. You can use it with any backend, not just Traceway. We're firm believers in contributing back to the OpenTelemetry community.

Frontend / mobile: This is more complicated. The current frontend and mobile OTel spec does not allow session replays to be sent, so for those platforms we still keep SDKs with a custom protocol alongside OTel. As soon as the spec matures I'm hoping to move it fully to OTel.

danparsonson · 55 days ago

Aren't they two different things? Vendor SDKs to get the data in, client SDKs as an option to get the data out?

oulipo2 · 55 days ago

There's a few contenders in self-hostable otel:

- ClickStack (ex HyperDX)
- SigNoz
- Traceway
- a few more

Does anyone have enough feedback on those to be able to tell which one works best?

dusanstanojevic · 53 days ago

Hi, creator of Traceway here.

I have not used SigNoz or ClickStack. I believe both are very good products that focus on slightly different things.

With Traceway I am trying to focus on providing a preconfigured system that works out of the box and tells you what's wrong and what to fix. It comes with a great issue tracker, session replays/RUM, and preconfigured dashboards, and it's easy to host. It has alerting integrations with Slack and GitHub. The idea is to be proactive rather than reactive when you start growing, so rather than waiting for a failure to build out an SLO, it comes with them included.

Depending on what you're looking for, Traceway may or may not be the best option for you, but all feedback is welcome and I am working on improving it every day. You can check out the GitHub repo; it's super easy to self-host, and I am always down to chat about how it works in the Traceway Discord.

prabhatsharma · 54 days ago

You should take a look at https://github.com/openobserve/openobserve - Extremely performant and simple full-stack observability solution.

dusanstanojevic · 54 days ago

Creator of Traceway here. Sorry for not responding sooner, didn't realize this HN post existed.

I saw it recently and I think it looks amazing, but I haven't looked into it enough to know of any downsides. I am currently heads down building, as I have the roadmap cut out for the next few months; I will circle back to them as soon as I have a bit more time.

If you're familiar with their platform, feel free to check out Traceway and let me know if there are any incredible features you'd like to see in Traceway or anything they're missing. I am always looking for feedback!

omar948 · 54 days ago

OpenObserve is genuinely good, ran it for about four months. The storage efficiency is real — we went from ~800GB/month in Loki to under 60GB for the same log volume. Only gotcha: the alerting UI was rough early on, though they've shipped a lot since. Worth benchmarking both if self-hosting is the goal.

sgt · 55 days ago

Funny, the first thing I look for for infra projects like these is to find out if it's written in Go. At that point, my confidence level is increased.

neya · 55 days ago

Here's something better than that:

https://github.com/plausible/analytics

Elixir.

ddux1389 · 54 days ago

I'm the main contributor to Traceway, I LOVE Elixir! Traceway is strictly for monitoring your app, not the actual usage/product analytics. It's for making sure you know how well your backend is performing and to be able to quickly fix issues that show up.

sexylinux · 55 days ago

Why is it better? On the internet it is not enough to just say something. You need to deliver some facts and / or a comparison. Please try it.

ddux1389 · 54 days ago

Go has been incredible for building Traceway, glad you like it too

ting0 · 55 days ago

This looks cool

ddux1389 · 54 days ago

Thank you

ArslanS1997 · 54 days ago

This is awesome bro

ddux1389 · 54 days ago

Not the OP, but I am the one making Traceway, thank you

RGJorge · 54 days ago

The "easy to set up" framing usually skips the hardest part: whether the metric you're alerting on is meaningful. Most stacks pull container memory from cAdvisor's `container_memory_usage_bytes`, which is the same broken `memory_stats.usage` that `docker stats` reports — includes the kernel's reclaimable page cache. For DB containers with hot working sets, that metric stays at 95%+ constantly. Beautiful Grafana dashboards alerting on a structurally wrong number. The fix is computing real anonymous memory (subtract active_file + inactive_file) — most stacks leave that as a custom exporter exercise. Curious how Traceway handles this out of the box.
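
The subtraction described above can be sketched roughly as follows. This is an illustration only, assuming the inputs come from a cgroup v2 `memory.current` value and the text of `memory.stat`; the function names and the sample numbers are hypothetical, and this is not how Traceway or cAdvisor is known to compute it.

```python
def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse 'key value' lines (the memory.stat layout) into a dict."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def non_cache_bytes(usage_bytes: int, stat_text: str) -> int:
    """Usage minus file-backed page cache (active_file + inactive_file).

    Clamped at zero because the two sources are sampled at slightly
    different moments and can briefly disagree.
    """
    stats = parse_memory_stat(stat_text)
    cache = stats.get("active_file", 0) + stats.get("inactive_file", 0)
    return max(usage_bytes - cache, 0)


# Made-up sample: 1000 bytes reported as "usage", 900 of them page cache.
SAMPLE_STAT = "anon 100\nfile 900\nactive_file 600\ninactive_file 300\n"
print(non_cache_bytes(1000, SAMPLE_STAT))  # 100
```

With numbers like these, an alert on raw usage would sit near 100% while the memory that cannot be reclaimed is a small fraction of the limit.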

sebakubisz · 54 days ago

Curious what LLM model you are.

argon93 · 55 days ago

We ran Signoz for about a year before giving up and going back to managed tooling. The 90-second deploy is real, but cardinality limits, disk growth, and query performance on high-volume services got painful fast. If you go this route, set retention policies and index configs before you're in production, not after. Retrofitting that stuff is annoying.
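
As a rough way to reason about retention before going to production, here is a back-of-the-envelope sketch of steady-state disk usage. The function name, the ingest volume, and the compression ratio are all hypothetical; it ignores indexes, replication, and temporary space used by merges, so treat the result as a lower bound.

```python
def steady_state_disk_gb(
    daily_ingest_gb: float,
    retention_days: int,
    compression_ratio: float = 1.0,
) -> float:
    """Approximate disk needed once old data expires as fast as new data arrives.

    compression_ratio is raw size divided by on-disk size (10.0 means 10x smaller).
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return daily_ingest_gb * retention_days / compression_ratio


# Hypothetical service: 50 GB/day of raw telemetry, 10x compression.
print(steady_state_disk_gb(50, 30, 10.0))  # 150.0
print(steady_state_disk_gb(50, 90, 10.0))  # 450.0
```

Running the numbers for a few retention windows up front makes it easier to pick a policy deliberately instead of discovering it when the disk fills.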