At KubeCon Europe a very good chunk of booths were observability stacks. Everyone was claiming they're better than the competitors (with some of them justifying themselves by saying "it's written in Rust").
Having dealt with Prometheus (+Thanos) / Grafana / OTEL and other stacks (e.g: custom solution on ClickHouse, Victoria{Metrics,Logs}, Jaeger/Tempo, Loki, ...) and even cloud ones (Google's Monarch rebranded as Prometheus)... what's your selling point? This to me seems like yet another way to re-invent the wheel.
If it's just for running locally, okay, fine, but when it comes to production (where the stack really matters) at scale, you end up with lots of tradeoffs and approaches.
Why is this one a winning one compared to the overwhelming "competition"? Seems like we're re-inventing the wheel for the 100th time instead of focusing on unifying the efforts in making the existing solutions better. Thankfully we now have OTEL, so at least the interoperability part is somewhat solved (or mitigated)
Technically, no - observability in software came from control theory and was formalized to mean logs, metrics, and traces as distinct signal types. "Logs and stats" conflates all three. Whether the distinction matters in practice is debatable, but vendors have latched onto the term precisely because it sounds more rigorous than "we collect your logs."
The "selling point" question assumes incumbents are good enough. Prometheus scrape model is miserable at high cardinality. Grafana dashboarding is a full-time job. If this thing genuinely reduces that overhead I don't need a feature matrix, I need a quick demo environment — which the 90s claim is at least attempting.
I was looking into this just yesterday. So the Loki + … comparison is a bit off in the Open Source space. The main ones are Signoz and ClickStack in this space. Both using ClickHouse as the database. Heavy compared to something like Loki, but they are OTEL native and not log monitoring. So not in the same category.
I used Signoz + Clickstack on a vibe coded Go server project a few weeks ago. I just made codex figure out how to setup signoz + dependencies via docker compose. I even got it to pre-populate signoz with dashboards. It wasn't too bad. The whole thing runs with a few GB. I tried to cover metrics, tracing, and logging at the same time. This is not a production ready setup but you need to trade off cost vs. utility here. If it's useful enough, that could justify extra cost.
I have a background in having done a lot of stuff on the Elastic stack related to this; including setting up a big Elastic Fleet based stack for one client at some point. It might not be the cheapest, but it does provide awesome filtering and querying capabilities. However, a lot of teams that use it don't really know how to tap into that capability so it tends to be overengineered for what it does in the end. And the extra, underutilized complexity is why a lot of teams are wary of dealing with that stack.
Storing the data is the easy part but what's the point if you can't run queries against it and produce dashboards and diagnostic tools that actually help you? Prometheus/grafana or older graphite type setups tend to be compromises where you get lots of data but are then limited on the querying front or the number of metrics. The tradeoff is always between scale and querying flexibility. If you store tens/hundreds of GB of telemetry per day, you need a way to make sense of it. Clickhouse seems to be quite good at scaling and querying. It's basically a column database. I don't have direct experience with Loki.
But in the end, all that power only matters if people actually use it. And, again, in my experience teams tend not to. They tend to have a lot of unrealized aspirations around their tools and infrastructure. If it's just a dumping ground for data + a few simplistic dashboards, optimize for that. A lot of that data is actually only kept for compliance/auditing reasons. For that, querying is usually a secondary concern and it's OK if queries take a bit longer and are less powerful.
In reality it's a very modular system, the telemetry repositories can be swapped out easily, I have implemented a clickhouse and a sqlite version (to simplify self hosting) so adding a loki like repository would be a breeze. It's not on the roadmap currently as I am putting a lot of effort into 3 diff parts rn.
The truth is that Clickhouse is an incredible DB that scales really well for observability data.
Ruby OTEL immaturity is real -- same thing happened with Python circa 2021, took another year before the SDK stabilized. OpenObserve at least lets you fall back to fluent/fluentd pipelines until the native instrumentation catches up.
Given the heavy LLM usage, i’d probably be a little concerned about the project’s longevity. I personally also can’t stand seeing that typeface on websites anymore…
I really did not plan on this to be on HN yet. I think you have a great point and that with all of the projects popping up people should be skeptical of trying things.
Trust is hard to earn and that is why everything I have done with Traceway has been and will continue to be open source. Traceway cloud currently has 3 enterprise customers that are using it and about 50 on lower tier plans. It's an actual live product.
The marketing website has been fully vibe coded. I have too much on my plate right now and I'm not great at designing marketing pages. At some point I'm planning to rewrite it, since it has been what most people have complained about; I just have too many things that I need to finish first in the actual product.
I use Claude Code periodically. Other than telling you to check out my git commit history for the last 10 years, there is not much more I can do. The number of commits this year has not been any greater than before. I don't think I'm pushing on getting things out too quickly or with lower quality.
Other than that I just have to continue building, there is nothing else I can do, but I understand where you're coming from and I think that your concern is absolutely valid.
Hi, sorry for not responding sooner, didn't realize this post existed.
Traceway is fully OTel compliant.
Go: The original version started with Go SDKs. I've since moved to using Go OTel. I haven't updated those docs yet because the Go SDKs still work and are used in the wild, but they're on the deprecation track. Thanks for pointing it out.
Symfony: There were no good one-line OTel integrations out there for Symfony, so we wrote one. It is not a custom SDK, it's an OTel configurator. You can use it with any backend, not just Traceway. We're firm believers in contributing back to the OpenTelemetry community.
Frontend / mobile: This is more complicated. The current frontend and mobile OTel spec does not allow session replays to be sent, so for those platforms we still keep SDKs with a custom protocol alongside OTel. As soon as the spec matures I'm hoping to move it fully to OTel.
I have not used SigNoz or ClickStack. I believe both are very good products that focus on slightly different things.
With Traceway I am trying to focus on providing a preconfigured system that works out of the box and tells you what's wrong and what to fix. It comes with a great issue tracker, session replays/RUM, preconfigured dashboards, and it's easy to host. It has alerting integrations with Slack and GitHub. The idea is to be proactive rather than reactive when you start growing, so rather than waiting for a failure to build out an SLO, it comes with them included.
Based on what you're looking for, Traceway may or may not be the best option for you, but all feedback is welcome and I am working on improving it every day. You can check out the GitHub repo, it's super easy to self-host, and I am always down to chat about how it works in the Traceway Discord.
Creator of Traceway here. Sorry for not responding sooner, didn't realize this HN post existed.
I saw it recently, I think it looks amazing, I haven't looked into it enough to know of any downsides. I am currently heads down in building as I have the roadmap cut out for the next few months, I will circle back to them as soon as I have a bit more time.
If you're familiar with their platform, feel free to check out Traceway and let me know if there are any incredible features you'd like to see in Traceway or anything they're missing. I am always looking for feedback!
OpenObserve is genuinely good, ran it for about four months. The storage efficiency is real — we went from ~800GB/month in Loki to under 60GB for the same log volume. Only gotcha: the alerting UI was rough early on, though they've shipped a lot since. Worth benchmarking both if self-hosting is the goal.
I'm the main contributor to Traceway, I LOVE Elixir! Traceway is strictly for monitoring your app, not the actual usage/product analytics. It's for making sure you know how well your backend is performing and to be able to quickly fix issues that show up.
The "easy to set up" framing usually skips the hardest part: whether the metric you're alerting on is meaningful. Most stacks pull container memory from cAdvisor's `container_memory_usage_bytes`, which is the same broken `memory_stats.usage` that `docker stats` reports, and it includes the kernel's reclaimable page cache. For DB containers with hot working sets, that metric stays at 95%+ constantly. Beautiful Grafana dashboards alerting on a structurally wrong number. The fix is computing real anonymous memory (subtract active_file + inactive_file); most stacks leave that as a custom exporter exercise. Curious how Traceway handles this out of the box.
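For anyone wondering what that subtraction looks like, here is a minimal sketch. It assumes you already have the container's raw usage in bytes (on cgroup v2 that would typically come from `memory.current`) and the text of its `memory.stat` file. The function names `parse_memory_stat` and `non_file_memory_bytes` are made up for illustration and are not part of cAdvisor, Docker, or Traceway.

```python
def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def non_file_memory_bytes(usage_bytes: int, memory_stat_text: str) -> int:
    """Return usage minus file-backed page cache (active_file + inactive_file).

    Hypothetical helper: it mirrors the subtraction described in the comment
    above, not the exact formula of any particular exporter.
    """
    stats = parse_memory_stat(memory_stat_text)
    file_cache = stats.get("active_file", 0) + stats.get("inactive_file", 0)
    # Clamp at zero in case the two readings were taken at different moments.
    return max(usage_bytes - file_cache, 0)


# Illustrative numbers only: a DB container that looks 95%+ "full" by raw usage.
sample_stat = "anon 100000000\nactive_file 600000000\ninactive_file 300000000\n"
adjusted = non_file_memory_bytes(1_000_000_000, sample_stat)
```

With those made-up numbers, raw usage is 1 GB but only 100 MB is left after removing file cache, which is the gap that makes alerts on the raw number so noisy.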
We ran Signoz for about a year before giving up and going back to managed tooling. The 90-second deploy is real, but cardinality limits, disk growth, and query performance on high-volume services got painful fast. If you go this route, set retention policies and index configs before you're in production, not after. Retrofitting that stuff is annoying.