Hacker News: Front Page

Launch HN: Superlog (YC P26) – Observability that installs itself and fixes bugs

Hey HN, we’re Nico and Arseniy, co-founders of Superlog (https://superlog.sh). We're building a self-installing, self-healing observability tool meant not to be opened. It has a wizard that runs daily to set up proper logging and an agent that investigates errors and opens PRs.

Super short demo: https://www.youtube.com/watch?v=xFhU9Mk247M.

In our earlier startups, we tried Sentry, Datadog, Grafana, Dash0, and nothing was good enough. Proper telemetry and alerting still require a ton of manual setup. We struggled with adding good logs, so debugging was tough, especially as codebases grow at a faster pace. Meanwhile, the Datadog/Dash0 bill kept climbing, and we still spent engineering hours to learn, configure, and maintain our observability tooling.

With Sentry, we found ourselves flooded by a stream of alerts into our Slack channel. Most were duplicates or lacked context, so alert fatigue and constant interrupts were a real pain. The #ops notification is consistently the worst feeling on a Saturday morning.

Too many times, we’ve seen servers run out of memory and disk, and three AWS metrics give us three different values. Half of the graphs on dashboards are normally empty or outdated, and manually clicking through UIs, especially when the team is small, seems like a huge waste of time.

At some point we realized that solving this problem would be more valuable than the things we had been working on, and we had the expertise to do it, since Arseniy had spent years at Datadog, getting paged during the night to debug production incidents. So we decided to build a platform that would just work: agent-first, MCP-native, zero-setup.

Here’s how Superlog works: we have a wizard that scans your repo, and automatically instruments it with well-structured logs, traces and metrics via OpenTelemetry. We make sure to highlight main failure modes, endpoint performance, usage per tenant, and LLM/upstream cost (by callsite, tenant and model).

Errors get fingerprinted and grouped into incidents, so you see one issue, not a thousand duplicates. When you get a notification from Superlog, you see a clear failure summary, its inferred severity and impact upfront.
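
To make the fingerprint-and-group idea concrete, here is a minimal sketch. It is not Superlog's implementation: the normalization rule, the `fingerprint` and `group_into_incidents` helpers, and the error fields (`type`, `message`, `frame`) are all assumptions chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(error_type: str, message: str, top_frame: str) -> str:
    """Return a stable ID for an error, ignoring volatile details."""
    # Replace hex addresses and plain numbers so that "user 42 not found"
    # and "user 97 not found" collapse into the same fingerprint.
    normalized = re.sub(r"0x[0-9a-fA-F]+|\d+", "<n>", message)
    key = f"{error_type}|{normalized}|{top_frame}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


def group_into_incidents(errors):
    """Group error dicts (hypothetical shape) by fingerprint."""
    incidents = defaultdict(list)
    for err in errors:
        fp = fingerprint(err["type"], err["message"], err["frame"])
        incidents[fp].append(err)
    return dict(incidents)
```

With this kind of grouping, a thousand occurrences that differ only in an ID or a timestamp land in one incident, which is what keeps the notification count low.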

Then the agent investigates and tries to solve the issue. If it has enough context, it produces a concise and tested PR. If it doesn't, it posts its findings for the investigating team, and automatically pulls in the engineers that could contribute more context based on documentation, previous investigations and Slack threads.

Either way the output is one clean PR per incident, posted in Slack, that you can merge, ignore, or open as a Claude Code session and modify.

Three things we think are different from other observability vendors:

(1) We solve the setup pain. The wizard will instrument everything with native OTel SDKs, respecting the semantic conventions, with proper service and environment tagging. We’re also working on native automatic dashboards and alerts, so that you can see what’s going on at a glance and don’t miss subtle failure modes.

(2) Our telemetry doesn’t decay. The wizard runs daily, and keeps adding logs, alerts and dashboards where it’s needed. You don't have to remember to instrument new features. The next time something breaks, the data you need to debug it is already there.

(3) Our goal is to solve alert fatigue. We use agents to merge similar errors and refine the summaries, giving you relevant information upfront. We have a custom evaluation setup that makes sure that our summaries are dense and correct, and that severity and impact are on point. We also give you confidence scores for every LLM-enhanced metric so that wrong guesses don’t get boosted.

Important: Superlog telemetry is vendor-neutral, so you keep all the logs/metrics/traces we install. Pricing is on the site. We're early, so expect rough edges and please tell us when you find them.

You can try it at https://superlog.sh. We'd love to hear what you're using today, what's broken about it, and whether the "one mergeable PR per incident" model sounds useful or terrifying. Especially keen to hear from folks running integration-heavy products, anyone who's rolled their own observability, and anyone who has tried Sentry / Datadog MCPs and given up. Comments and feedback welcome!


Comments URL: https://news.ycombinator.com/item?id=48195021

Points: 4

# Comments: 0

Show HN: Hsrs – Type-Safe Haskell Bindings Generator for Rust

Hey everyone! I've been working on hsrs, a type-safe Haskell Bindings Generator for Rust.

I couldn't really find any bindings generator that would create type-safe, rich bindings for Haskell from Rust. Naturally, both languages have rich type systems, so I was amazed that no awesome bindings generator already existed, hence I decided to write my own. hsrs feels very similar to pyo3 and napi-rs, and if you've used those, hsrs will feel right at home.

What's unique about hsrs as opposed to hs-bindgen is that it has type-safe bindings for rich types, like Result, Maybe, etc. while also generating Haskell bindings. The repo contains a minimal example, and more details are available in the Haskell Discourse: https://discourse.haskell.org/t/ann-hsrs-ergonomic-haskell-b...


Comments URL: https://news.ycombinator.com/item?id=48189044

Points: 3

# Comments: 0

Sieve – scans Cursor/Claude chat history for leaked API keys

Background: I was using Cursor to set up an OpenAI integration. The agent read my .env file, added the key to the config, and everything worked. What I didn't think about: that key was now sitting in a plaintext SQLite database at ~/Library/Application Support/Cursor/User/workspaceStorage/..

AI coding tools (Cursor, Claude Code, Copilot, Cline) routinely read .env files as part of normal operation. Every secret they touch gets embedded in their local transcript/state files — unencrypted, outside .gitignore, persisted indefinitely.

Standard secret scanners (gitleaks, detect-secrets) scan git repos. Nobody scans AI transcript stores. That's the gap.

Sieve scans those files locally on your Mac. Flags exposed keys by severity. Redacts them in-place. Stores fingerprints in Keychain — never plaintext. Covers Cursor, Claude Code, Claude Desktop, Copilot, Cline, Roo Cline, Windsurf, Gemini CLI, and .env files.
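
A rough sketch of what scanning and redacting a SQLite transcript store can look like. This is not Sieve's code: the `sk-` pattern is an assumed stand-in for real detection rules, the table and column names are passed in by the caller, and the SHA-256 fingerprints are only returned here rather than stored in Keychain.

```python
import hashlib
import re
import sqlite3

# Assumed pattern for OpenAI-style keys; real detection rules would be broader.
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{20,}")


def scan_and_redact(conn: sqlite3.Connection, table: str, column: str):
    """Redact key-like strings in one text column; return SHA-256 fingerprints.

    `table` and `column` are trusted identifiers supplied by the caller.
    """
    fingerprints = []
    rows = conn.execute(f"SELECT rowid, {column} FROM {table}").fetchall()
    for rowid, text in rows:
        if not isinstance(text, str):
            continue  # skip NULLs and blobs in this sketch
        matches = KEY_PATTERN.findall(text)
        if not matches:
            continue
        for secret in matches:
            # Keep only a hash so the plaintext never leaves this function.
            fingerprints.append(hashlib.sha256(secret.encode()).hexdigest())
        redacted = KEY_PATTERN.sub("[REDACTED]", text)
        conn.execute(
            f"UPDATE {table} SET {column} = ? WHERE rowid = ?",
            (redacted, rowid),
        )
    conn.commit()
    return fingerprints
```

Anyone trying this on a real transcript store should work on a copy first, since the update rewrites rows in place.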

Happy to answer questions about how the SQLite parsing works or the detection rules.


Comments URL: https://news.ycombinator.com/item?id=48188727

Points: 5

# Comments: 0

We let AIs run radio stations

Hey HN!

I'm Lukas from Andon Labs. We let AIs run companies without humans in the loop and report to the public on what can go wrong. Previously, we've done experiments in retail (vending machines, stores, and cafes), but we just launched one in the media sector. We gave four AI agents all the tools they need to both broadcast radio shows live and handle all the business side of running a media company. The agents' revenue is so far terrible (you can try to strike a sponsor deal with them if you want!), but their shows are at times hilarious. You can listen to them at andon.fm, I hope you enjoy this!


Comments URL: https://news.ycombinator.com/item?id=48183301

Points: 19

# Comments: 15

Show HN: InsForge – Open-source Heroku for coding agents

Hi HN, I'm Hang, cofounder of InsForge (YC P26). InsForge is an open-source Heroku for AI coding agents: a backend platform designed for coding agents to deploy, operate, and debug end-to-end. Open source under Apache 2.0 (https://github.com/InsForge/InsForge). Quick demo here (https://youtu.be/7Bax5qz0IfM).

We started InsForge because we just wanted our Claude Code to handle all the backend / infra stuff for us, instead of us jumping between dashboards doing manual config, or copy-pasting logs and docs back to agents.

We first tried creating a folder with a bunch of .MD files, and installing MCPs like Supabase, Vercel, GitHub, Context7. But soon we found MCPs have their own problems: (a) tools get pre-loaded into context before agents even do anything, (b) bad design: payloads return 10k+ tokens, and (c) a lot of stuff still can’t be done by MCP, e.g. telemetry and configs.

So we thought: because coding agents are so good at CLI, why not just put everything in CLI and create Skills to teach them how to use it?

That’s InsForge: 1 command to install our CLI + Skills, coding agents can run the entire backend platform [1].

We started with authentication and database, but we kept adding more primitives we wanted, so now we have:

- frontend hosting
- backend servers (microVM based) [2]
- database
- auth
- storage
- LLM model router
- cron jobs
- realtime
- edge functions
- vector

We have other features to make coding agents more reliable like real backend engineers:

- backend branching [3]: agents will 100% mess up, like deleting your database. So inspired by Neon, we branch the entire backend (DB, auth, storage, functions, schedules). Agents work on the branch, you review diffs and then decide to merge or discard.

- server telemetry: agents can read logs, CPU, memory, disk to find spikes and root causes themselves.

- debug agent [4]: every project gets a dedicated debug agent. Your coding agent can ask questions like “why did the deployment fail?”, and the debug agent will run diagnostics, find the root causes and propose fixes, then send the answer back.

- backend advisor [5]: scans your backend daily for security and performance issues, proposes remediations, and sends them to your coding agent.

Give it a spin on InsForge cloud: https://insforge.dev, or read our code here: https://github.com/InsForge/InsForge.

We're a small team and we read every comment. Tell us what's good, what sucks, what's missing. We love feedback :)

[1] https://insforge.dev/blog/insforge-skills-cli

[2] https://insforge.dev/blog/insforge-custom-compute

[3] https://insforge.dev/blog/backend-branching

[4] https://insforge.dev/blog/introduce-debug-skills

[5] https://insforge.dev/blog/backend-health-dashboard


Comments URL: https://news.ycombinator.com/item?id=48181342

Points: 8

# Comments: 0

Show HN: Semble – Code search for agents that uses 98% fewer tokens than grep

Hey HN! We (Stephan and Thomas) recently open-sourced Semble. We kept running into the same problem while using Claude Code on large codebases: when the agent can't find something directly, it falls back to grep, reading full files or launching subagents. This uses a lot of tokens, and often still misses the relevant code. There are existing tools for this, but they were either too slow to index on demand, needed API keys, or had poor retrieval quality.

Semble is our solution for this. It combines static Model2Vec embeddings (using our latest static model: potion-code-16M) with BM25, fused via RRF and reranked with code-aware signals. Everything runs on CPU since there are no transformers involved. On our benchmark of ~1250 query/document pairs across 63 repos and 19 languages, it uses 98% fewer tokens than grep+read and reaches 99% of the retrieval quality of a 137M-parameter code-trained transformer, while being ~200x faster.
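
For readers unfamiliar with RRF (reciprocal rank fusion), here is a minimal sketch of the general technique: each document scores the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is a commonly used default and an assumption here, as are the function name and file names; Semble's actual fusion and reranking code lives in its repo.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by ID for a deterministic order on ties.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from an embedding search and a BM25 search.
embedding_hits = ["a.py", "b.py", "c.py"]
bm25_hits = ["b.py", "c.py", "d.py"]
fused = reciprocal_rank_fusion([embedding_hits, bm25_hits])
```

Because RRF only looks at ranks, it needs no score calibration between the embedding side and the BM25 side, which is one reason it is a popular choice for hybrid retrieval.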

Main features:

- Token-efficient: 98% fewer tokens than grep+read

- Fast: ~250ms to index a typical repo on our benchmark, ~1.5ms per query on CPU (very large repos may take longer)

- Accurate: 0.854 NDCG@10, 99% of the best transformer setup we tested

- MCP server: drop-in for Claude Code, Cursor, Codex, OpenCode

- Zero config: no API keys, no GPU, no external services

Install in Claude Code with: claude mcp add semble -s user -- uvx --from "semble[mcp]" semble

Or check our README for other installation instructions, benchmarks, and methodology:

Semble: https://github.com/MinishLab/semble

Benchmarks: https://github.com/MinishLab/semble/tree/main/benchmarks

Model: https://huggingface.co/minishlab/potion-code-16M

Let us know if you have any feedback or questions!


Comments URL: https://news.ycombinator.com/item?id=48169874

Points: 5

# Comments: 5

Show HN: Codiff, a local diff review tool

Nowadays I review a lot of code locally that was written by LLMs. I used to review my own code using git + delta. It started to feel limiting with the amount of code written by LLMs.

When looking at a large diff on Friday, I pointed an LLM at diffs.com and trees.software and told it to build an app. It only took 16 minutes, and the result is extremely fast for large diffs, beautiful, and minimal.

Today I polished it up and added all the features that I need. It has file filters, search, an LLM walkthrough mode, and review comments that you can paste back into your LLM.

I will be using Codiff a lot, and can finally review the large diff from Friday that led me to build this. If you like it, fork it!


Comments URL: https://news.ycombinator.com/item?id=48166275

Points: 6

# Comments: 3

Ask HN: When did computers stop being fun?

Of course I don't mean they stopped being fun for everyone. My impression is that on one hand they've been "corporatized", and on the other they've become a vehicle for mindless entertainment.

I don't care for coding new stuff. Everything I may need either already exists or is too complex to do on my own (and no, I won't vibe-code it, what's the fun in that?)

I don't even code for work anymore since I moved to a project/service management role.

Basically, the spark I felt some 25 years ago seems to be completely gone.

Any suggestions on getting it back?


Comments URL: https://news.ycombinator.com/item?id=48164173

Points: 31

# Comments: 34

Show HN: Daily vibe-coding video games, day 33: Tower Defense (single prompt)

I'm using AI (mostly Claude) to create/publish a new video game every day.

This is day 33, first stab at the tower defense genre. Most of the games (including this one) I build with a single prompt. Rarely, a couple extra prompts are needed for bug fixes or to tweak the physics/UI. Extremely rarely, the AI has difficulty making the game work right (usually drawing it) and it takes a dozen or more prompts -- but the majority of the time, it gets everything right and makes a fully playable game first try.

Happy to answer any questions, just a little hobby project of mine I'm having lots of fun with :)


Comments URL: https://news.ycombinator.com/item?id=48162431

Points: 6

# Comments: 0

Show HN: Epiq – Distributed Git based issue tracker TUI

Issue trackers typically live outside of your workflow, with poor ergonomics. Epiq aims to solve that, bringing issue tracking into your terminal. Multi-user collaboration is achieved via git using user-scoped immutable event logs that converge in memory. Put my all into it. Let me know what you think.
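
One way to picture "user-scoped immutable event logs that converge in memory" is the sketch below. It is only a guess at the general shape, not Epiq's design: the event fields (`ts`, `op`, `id`, `title`, `status`), the ordering rule, and the `converge` function are all hypothetical.

```python
def converge(logs):
    """Fold per-user event logs into one issue table.

    `logs` maps a user name to that user's append-only list of events.
    Each user only ever appends to their own log, so merging the logs
    through git never produces conflicting edits to the same file.
    """
    events = [
        (e["ts"], user, seq, e)
        for user, log in logs.items()
        for seq, e in enumerate(log)
    ]
    # A total order over all events makes every replica reach the same state.
    events.sort(key=lambda item: item[:3])
    issues = {}
    for _, _, _, e in events:
        if e["op"] == "create":
            issues[e["id"]] = {"title": e["title"], "status": "open"}
        elif e["op"] == "set_status" and e["id"] in issues:
            issues[e["id"]]["status"] = e["status"]
    return issues
```

The key property is that the result depends only on the set of events, not on the order in which the logs were fetched.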


Comments URL: https://news.ycombinator.com/item?id=48155570

Points: 4

# Comments: 1

Show HN: GlycemicGPT – Open-source AI-powered diabetes management

I'm a Type 1 diabetic and software engineer. Last year I went months between endocrinologists with no clinician reviewing my data. I'm an engineer, so I built the tool I needed, and now I'm open sourcing it. GlycemicGPT is a self-hosted platform that connects continuous glucose monitors, insulin pumps, and existing Nightscout instances to an AI analysis layer running on your own infrastructure.

Data sources:

- Dexcom G7 (cloud API)
- Tandem t:slim X2 and Mobi pumps (direct BLE)
- Nightscout (point it at your existing instance and you're running in minutes)

What the AI layer does:

- Daily briefs summarizing overnight and 24-hour patterns
- Meal response analysis
- Conversational chat with RAG-backed clinical knowledge
- Predictive alerting with configurable thresholds and caregiver escalation

Important: this is monitoring and analysis only. GlycemicGPT does not deliver insulin, does not control your pump, and is not a closed-loop system. It reads your data and gives you insight on top of it. Your clinical decisions stay between you and your care team.

Architecture:

- Self-hosted via Docker or K8S: the GlycemicGPT stack runs entirely on your hardware
- BYOAI: bring your own AI provider. Use Ollama for fully local operation (no data leaves your hardware), or point it at Claude, OpenAI, or any OpenAI-compatible endpoint if you prefer a hosted model. Data flows directly from your instance to the provider you choose; nothing is routed through any centralized service operated by the project.
- GPL-3.0, no subscriptions, no vendor lock-in

Stack:

- Backend API: FastAPI, Python 3.12, PostgreSQL 16, Redis 7
- Web Dashboard: Next.js 15, React 19, Tailwind CSS, shadcn/ui
- AI Sidecar: TypeScript, Express, multi-provider proxy
- Android App: Kotlin, Jetpack Compose, BLE
- Wear OS: Kotlin, Wear Compose, Watch Face Push API
- Plugin SDK: Kotlin interfaces, capability-based, sandboxed

Looking for contributors — especially folks with BLE/Android experience or anyone in the diabetes tech space. Plugin SDK is documented if you want to add support for new devices. GitHub: https://github.com/GlycemicGPT/GlycemicGPT


Comments URL: https://news.ycombinator.com/item?id=48144670

Points: 25

# Comments: 8

Show HN: GridTravel- A community based travel app for users to share routes

Hey HN,

My co-founders and I have been building GridTravel, a free iOS app for planning and sharing travel routes with turn-by-turn GPS nav. We just launched yesterday after App Store approval.

We're three 21-year-old cofounders and best friends since middle school. We built GridTravel after years of frustration navigating new cities on every trip we took together.

The idea: most people either search Google for "top 10 places to visit in…" lists or go on social media to get inspiration on where to go. GridTravel is built around user-generated routes — actual paths someone walked, that you can follow, save, download, and discover from other travelers. Users also have the ability to create private routes and collaborate with their friends.

Tech stack: Mapbox (Nav SDK + maps), Supabase (auth, DB, storage), and Swift. Native iOS for now, Android coming soon.

Our two real cost drivers are Mapbox Search (hit when users create routes) and Mapbox Navigation (hit when users use live navigation). Both have free tiers, then scale with MAU. We launched fully free to remove the barrier to entry. Revisiting pricing in Year 2 once nav costs start burning a hole in our pocket.

Current state: we're in the UGC cold-start hole. The app's value scales with route density in a given city, but route density requires users, who require routes. Classic chicken and egg. Our current plan:

1. Manually seed 25–30 routes per city, starting with 5-10 priority cities where we have personal networks rather than spreading ourselves thin.
2. Short-form content as the primary social channel (TikTok, reels, shorts). Doing A/B testing: whether route walkthroughs convert better than informational/skit videos.
3. Partnering with micro-influencers in those cities (5k-50k following) for in-app routes plus cross-posts on their channels.

Curious what HN thinks. Especially anyone who's shipped a UGC product. What worked for you on cold start? What do you wish you'd done differently? Happy to answer any questions about the app, costs, etc.

App link: https://apps.apple.com/us/app/gridtravel-local-routes/id6762...


Comments URL: https://news.ycombinator.com/item?id=48141902

Points: 7

# Comments: 1

Show HN: I built a Web-Scraper API that is 6-7x more efficient than current ones

Runo is a web-scraping API that returns typed, structured JSON. You define a schema (field name, type, example value), and Runo fetches the page and returns the data. No HTML, no parsers, no post-processing.

Over the past few weeks, I have been building this nonstop. Currently, every scraper API out there solves the site-fetching problem but leaves the extraction of the actual data entirely to users. Runo makes that completely disappear.

For Runo, I went ahead and added JS rendering, stealth mode, and full LLM extraction to make it fully functional and capable of scraping most, if not all, sites.

Also, another major problem with current web scrapers is that they charge per feature or bundle them into expensive credit tiers. A single large or JS rendered request can cost 5-75 credits, which means you essentially get nothing out of their plans. Runo is flat per request, no matter the site. At the Scale tier, Runo works out to $0.90 per 1,000 effective requests vs. around $6 for the nearest Firecrawl equivalent. My jaw dropped when I was testing Runo and came across these numbers.
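
As a quick check on where the "6-7x" in the title comes from, the two per-1,000-request prices quoted above can be divided directly. The snippet uses only those two figures; the `cost_ratio` helper is just for illustration, and "around $6" is treated as exactly 6.00.

```python
def cost_ratio(competitor_per_1k: float, own_per_1k: float) -> float:
    """How many times more the competitor costs per 1,000 requests."""
    return competitor_per_1k / own_per_1k


runo_per_1k = 0.90       # USD per 1,000 effective requests (from the post)
firecrawl_per_1k = 6.00  # USD, "around $6" per the post

ratio = cost_ratio(firecrawl_per_1k, runo_per_1k)
print(f"{ratio:.1f}x")  # prints "6.7x"
```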

I created a free tier that is 500 requests/month, no credit card required. Take it for a spin and let me know what can be improved. I would love feedback.


Comments URL: https://news.ycombinator.com/item?id=48141206

Points: 6

# Comments: 3

Claude Account Suspended Seconds After Purchase?

I literally created a new account, pressed submit on the credit card dialog, the purchase went through, and I got logged out. I tried to log in, and it said I'm banned. I checked my mailbox and saw an email with an invoice and another saying that I'm in violation of the ToS, sent within the same minute LOL.

Is this some kind of joke? :O


Comments URL: https://news.ycombinator.com/item?id=48134808

Points: 6

# Comments: 0

Show HN: Running the second public ODoH relay

Every privacy-focused DNS service requires an account: NextDNS, Cloudflare for Families, Apple's iCloud Private Relay (paid, iOS-only). The protocol that doesn’t require one - ODoH - had basically one well-known public relay operator (Frank Denis on Fastly Compute, default in dnscrypt-proxy). I built a second one and the client to talk to it.


Comments URL: https://news.ycombinator.com/item?id=48133561

Points: 3

# Comments: 0

Arena AI Model ELO History

Hi HN,

I built a live tracker to visualize the lifecycle and performance changes of flagship AI models.

We've all experienced the phenomenon where a flagship model feels amazing at launch, but weeks later, it suddenly feels a bit off. I wanted to see if this was just a feeling or a measurable reality, so I built a dashboard to track historical ELO ratings from Arena AI.

Instead of a massive spaghetti chart of every single model variant, the logic plots exactly ONE continuous curve per major AI lab. It dynamically tracks their highest-rated flagship model over time, which makes both the sudden generational jumps and the slow performance decays much easier to see. It took quite a lot of iterations to get the chart to look nice on mobile as well. Optional dark mode included.
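
The "one curve per lab" logic can be sketched roughly as below. The record layout, the lab and model names, and the `flagship_curves` function are assumptions for illustration, not the tracker's actual code.

```python
from collections import defaultdict


def flagship_curves(records):
    """Return {lab: [(date, best_rating), ...]} with one point per date.

    `records` is an iterable of (date, lab, model, rating) tuples, where
    dates are ISO strings so that string order equals chronological order.
    """
    best = defaultdict(dict)
    for date, lab, _model, rating in records:
        current = best[lab].get(date)
        # Keep only the highest-rated model of each lab on each date.
        if current is None or rating > current:
            best[lab][date] = rating
    return {lab: sorted(points.items()) for lab, points in best.items()}
```

A generational jump then shows up as a step when a new model takes over the top spot, while a slow decay shows up as the same model's rating drifting down between snapshots.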

However, I have a specific data blindspot that I'm hoping this community might have insights on.

Arena AI largely relies on testing API endpoints. But as we know, consumer chat UIs often layer on heavy system prompts, safety wrappers, or silently switch to heavily quantized models under high load to save compute. API benchmarks don't fully capture this "nerfing" that everyday web users experience.

Does anyone know of any historical ELO or evaluation datasets that specifically scrape or test outputs from the consumer web UIs rather than raw APIs?

I'd love to integrate that data for a more accurate picture of the consumer experience. The project is open-source (repo link in the footer), so I'd appreciate any feedback, or pointers to datasets!


Comments URL: https://news.ycombinator.com/item?id=48130711

Points: 15

# Comments: 2

Show HN: Nibble

An attempt at a single-pass LLVM frontend in ~3000 lines of C without external dependencies, malloc, or an AST. Included are some graphical examples. The IR isn't perfect, and the README touches on one particular downfall.


Comments URL: https://news.ycombinator.com/item?id=48130186

Points: 13

# Comments: 0

Tell HN: Don't use Claude Design, lost access to my projects after unsubscribing

I wanted to try Codex after 5 months of a Claude Code Max subscription. Then I went back to my previous projects on Claude Design, only to realize I don't have access to them anymore.

This is a first. I have never lost access to past sessions in any of the LLM apps because I unsubscribed.

I actually wanted to try out Codex previously, but had a similar experience with my credits. They gave extra credits equivalent to my monthly subscription price, with some time limit, because Claude had so many issues that month. As soon as the plan ended, I lost access to the credits. Even after resubscribing, I still don't have access to those credits.

I have sympathies towards the engineers, especially the ones that are putting themselves on X. But they only sort things out when someone with a large following has an issue.

Having worked at a billing company, I can see how complex contracts sound good for the growth/sales folks but are also horrible for engineers actually implementing those contracts. Their complex rate limiting, which is now the norm, and identifying other harnesses to count them against extra usage are probably not easy to implement without very rough edge cases. What is problematic is that all the "bugs" are ones where the user gets screwed.

After tagging them multiple times on X, I just wanted to post this here to alert other users.


Comments URL: https://news.ycombinator.com/item?id=48128003

Points: 14

# Comments: 0

Launch HN: Ardent (YC P26) – Postgres sandboxes in seconds with zero migration

Hey HN! We’re Vikram and Evan from Ardent (https://tryardent.com). We're building database sandboxes for you and your coding agents.

In the last two years coding agents have gotten dramatically more capable at handling complex engineering tasks. But without access to a realistic sandbox at the DB layer for testing, they ship garbage that can take down production databases. I spent over a year building an AI Data Engineer that failed for this exact reason. Evan spent the last 12 years in data engineering and hit this wall building agents at his last company.

Ardent was built to make it possible for coding agents to get near-instant access to production-like sandboxes so they can test their work. To do this, we write a replication stream out of the target DB, scaling with Kafka, onto a read replica with copy-on-write enabled and autoscaling compute (we currently prefer Neon as the primary branching engine due to its implementation of these properties).

Our replication stream uses logical replication + DDL triggers to enable usage on any hosted Postgres DB, since most platforms do not allow physical replication, which is traditionally used for creating replicas.

This provides a few primary benefits:

1. Does not require a platform migration to a DB provider like Neon, allowing strong separation of production and development concerns.
2. Minimal impact on the production database while allowing clones to spin up in <6s, even at TB scale with copy-on-write.

Security matters a lot when cloning production, so we run a proxy layer that generates custom Postgres URLs and routes all connections. This allows more granular access control to clones and prevents credential leaks. We also follow a split-plane architecture to allow full data residency on your cloud through BYOC.

We also support anonymization through the ability to register SQL that runs on branches before they are returned. This has been used for PII redaction and branch modification.
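
To illustrate the "registered SQL runs before the branch is returned" idea, here is a small sketch that uses an in-memory SQLite database as a stand-in for a Postgres clone. The `prepare_branch` helper, the `users` table, and the redaction statement are all hypothetical; Ardent's real registration API is not shown in this post.

```python
import sqlite3


def prepare_branch(conn, registered_sql):
    """Run each registered statement on a fresh clone before handing it out."""
    for statement in registered_sql:
        conn.execute(statement)
    conn.commit()
    return conn


# Stand-in for a freshly cloned branch containing production-like data.
branch = sqlite3.connect(":memory:")
branch.execute("CREATE TABLE users (id INTEGER, email TEXT)")
branch.execute("INSERT INTO users VALUES (1, 'ada@example.com')")

# A registered redaction step: replace real emails with synthetic ones.
prepare_branch(
    branch,
    ["UPDATE users SET email = 'user' || id || '@redacted.invalid'"],
)
emails = [row[0] for row in branch.execute("SELECT email FROM users")]
```

Because the statements run on the clone only, the production rows are never modified, and the agent only ever sees the redacted copy.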

Our goal is to make every data infrastructure platform “cloneable” in one place so agents can fully test the impact of their changes on production-like data environments without risk.

Here's a demo of it: https://youtu.be/5S1kwPtiRU0

We’d love to understand how you work with coding agents on the DB and if you try Ardent (it's free to get started) what worked, what broke and what’s missing.


Comments URL: https://news.ycombinator.com/item?id=48124436

Points: 4

# Comments: 0

Show HN: Rotunda - A browser built for agents with simulated typing

Hi HN! Pierce here.

Rotunda is a Firefox fork primarily intended for agent use, which I’ve been hacking on nights/weekends.

There was a [lengthy](https://news.ycombinator.com/item?id=48024859) discussion last week on how expensive computer use models are. The cost is going to drop eventually, but I think on some level it's still usually the wrong primitive. The web gives us access to beautiful structured formats, plaintext, etc... why throw that away if we don't have to?

I realized at some point that for 99% of automations I just want agents to be able to control my Chrome instance. But that’s easier said than done: CDP (the Chrome automation protocol) leaks a ton of state about being programmatically controlled, either by toggling window attributes or by running `page.evaluate()` commands right in the page context. Plus if you look at an automation running it's pretty obvious what happens: the mouse jumps around, fields are filled instantly, etc.

Rotunda tries to fix this. Its standout features:

- Realistic simulation of mouse movements and keyboard commands, powered by an RNN trained on my own timing patterns from the last week. (still feel weird about opting in to a keylogger but whatever)

- Doesn’t lie about its host specs, only fibs about some client-side details. Stealth browsers are too easy to flag statistically when you’re adding noise to canvas pixels or audio pipelines.

- It runs on your local device with a CLI or Playwright API accessible to Claude, Codex, or whatever your harness du jour looks like.

- Patches modern Firefox (150) with an agentic harness to keep this updated over time
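
To give a feel for what "simulated typing" means at the timing level, here is a deliberately simple stand-in. It is not Rotunda's RNN: it just samples inter-key delays from a log-normal distribution, and the median, spread, and word-boundary pause are made-up parameters.

```python
import math
import random


def keystroke_delays(text, seed=None, median_ms=120.0, sigma=0.4):
    """Sample one delay in milliseconds before each character of `text`."""
    rng = random.Random(seed)
    delays = []
    for ch in text:
        # lognormvariate(mu, sigma) has median exp(mu), so use log(median).
        delay = rng.lognormvariate(math.log(median_ms), sigma)
        if ch == " ":
            delay *= 1.5  # assumed: slightly longer pause at word boundaries
        delays.append(delay)
    return delays
```

A learned model like the one described above can capture things this sketch cannot, such as delays that depend on the specific pair of keys being pressed.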

MPL-2.0 on GitHub: https://github.com/monkeysee-ai/rotunda

Longer writeup on the design choices: https://pierce.dev/notes/a-browser-for-agents

Also check out the demo on the site! https://www.rotunda.sh/

Pretty excited by how this turned out but we’re still super early. Give it a try and please flag any issues!


Comments URL: https://news.ycombinator.com/item?id=48121824

Points: 11

# Comments: 0

Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led to an observation: agentic experiences are built upon tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.
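
The retrieval-and-assembly framing (match query to tool name, extract argument values, emit JSON) can be illustrated with a rule-based toy. This has nothing to do with how Needle works internally; the `TOOLS` table, keywords, and regexes are invented purely to show the three steps.

```python
import json
import re

# Hypothetical tool registry: keywords for matching, one regex per argument.
TOOLS = {
    "set_timer": {
        "keywords": {"timer", "countdown"},
        "arg": ("minutes", r"(\d+)\s*min"),
    },
    "send_message": {
        "keywords": {"message", "text"},
        "arg": ("recipient", r"\bto (\w+)"),
    },
}


def call_tool(query):
    words = set(re.findall(r"[a-z]+", query.lower()))
    # 1. Retrieval: pick the tool whose keywords overlap the query most.
    name = max(TOOLS, key=lambda t: len(TOOLS[t]["keywords"] & words))
    # 2. Extraction: copy the argument value out of the query text.
    arg_name, pattern = TOOLS[name]["arg"]
    match = re.search(pattern, query, re.IGNORECASE)
    args = {arg_name: match.group(1)} if match else {}
    # 3. Assembly: emit the call as JSON.
    return json.dumps({"name": name, "arguments": args}, sort_keys=True)
```

Nothing here requires multi-step reasoning, only matching and copying, which is the intuition behind the claim that a very small model can handle single-shot calls.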

Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run for single-shot function calling for consumer devices (phones, watches, glasses...).

Training:

- Pretrained on 200B tokens across 16 TPU v6e (27 hours)

- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)

- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)

You can test it right now and finetune on your Mac/PC: https://github.com/cactus-compute/needle

The full writeup on the architecture is here: https://github.com/cactus-compute/needle/blob/main/docs/simp...

We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (RAG, tool use). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results to be published.

While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M on single-shot function calling, those models have more scope/capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.

This is part of our broader work on Cactus (https://github.com/cactus-compute/cactus), an inference engine built from scratch for mobile, wearables and custom hardware. We wrote about Cactus here previously: https://news.ycombinator.com/item?id=44524544

Everything is MIT licensed. Weights: https://huggingface.co/Cactus-Compute/needle GitHub: https://github.com/cactus-compute/needle


Comments URL: https://news.ycombinator.com/item?id=48111896

Points: 53

# Comments: 11

Show HN: Agentic interface for mainframes and COBOL

Hi HN, we’re Sai and Aayush, and we’re building Hypercubic (https://www.hypercubic.ai/), bringing AI tools to the mainframe and COBOL world. (We did a Launch HN last year: https://news.ycombinator.com/item?id=45877517.)

Today we’re launching Hopper, an agentic development environment for mainframes. You can download it here: https://www.hypercubic.ai/hopper, and you can also request access and immediately get a mainframe user account to play with. There's also a video runthrough at https://www.youtube.com/watch?v=q81L5DcfBvE.

Mainframes still run a surprising amount of critical infrastructure: banking, payments, insurance, airlines, government programs, logistics, and core operations at large institutions. Many of these systems are decades old, but they continue to process enormous transaction volumes because they are reliable, secure, and deeply embedded into business operations.

A lot of that software is written in COBOL and runs on IBM z/OS. The development environment looks very different from modern cloud or Unix-style development. Instead of GitHub, shell commands, package managers, and CI pipelines, developers often work through TN3270 terminal sessions, ISPF panels, partitioned datasets, JCL, JES queues, spool output, return codes, VSAM files, CICS transactions, and shop-specific conventions.

TN3270 is the terminal interface used to interact with many IBM mainframe systems. ISPF is the menu and panel system developers use inside that terminal to browse datasets, edit source, submit jobs, and inspect output. It is powerful and reliable, but it was designed for expert humans navigating screens, function keys, and fixed-width workflows, not AI agents.

A simple COBOL change might require finding the right source member, checking copybooks, locating compile JCL, submitting a job, reading JES/SYSPRINT output, interpreting condition codes, patching fixed-width source, and resubmitting.

Much of this work is so well-defined and repetitive that it's a good fit for agentic AI. To get that working, however, a chatbot next to a terminal is not enough. The agent needs to operate inside the mainframe environment.

Hopper combines three things: (1) A real TN3270 terminal, (2) Mainframe-aware panels for datasets, members, jobs, and spool output, and (3) An AI agent that can operate across those z/OS surfaces.

For example, here is a tiny version of the kind of thing Hopper can help debug:

  COBOL:

   IDENTIFICATION DIVISION.
   PROGRAM-ID. PAYCALC.

   DATA DIVISION.
   WORKING-STORAGE SECTION.
   01  CUSTOMER-BALANCE     PIC 9(7)V99.

   PROCEDURE DIVISION.
       ADD 100.00 TO CUSTOMER-BALNCE
       DISPLAY "UPDATED BALANCE: " CUSTOMER-BALANCE
       STOP RUN.


  JCL:

    //PAYCOMP  JOB (ACCT),'COMPILE',CLASS=A,MSGCLASS=X
    //COBOL    EXEC IGYWCL
    //COBOL.SYSIN DD DSN=USER1.APP.COBOL(PAYCALC),DISP=SHR
    //LKED.SYSLMOD DD DSN=USER1.APP.LOAD(PAYCALC),DISP=SHR

A human would submit this job, inspect JES output, open `SYSPRINT`, find the undefined `CUSTOMER-BALNCE`, map it back to the source, patch the member, and resubmit. Hopper is designed to let an agent operate through that same loop autonomously.
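The "find the undefined name, map it back to the source, patch the member" step can be sketched in a few lines of Python. This is not Hopper's implementation. The listing line below is a made-up stand-in for a compiler diagnostic (its exact wording is an assumption), and the helper names are hypothetical.

```python
import difflib
import re

# Made-up stand-in for a SYSPRINT diagnostic; the wording is an assumption.
SYSPRINT = '''
    12  "CUSTOMER-BALNCE" was not defined as a data-name.
'''

SOURCE = '''\
       01  CUSTOMER-BALANCE     PIC 9(7)V99.
       ADD 100.00 TO CUSTOMER-BALNCE
'''

def undefined_names(listing):
    """Names the (assumed) diagnostic reports as undefined."""
    return re.findall(r'"([A-Z0-9-]+)" was not defined', listing)

def declared_names(source):
    """Names declared with a two-digit level number, e.g. `01  CUSTOMER-BALANCE`."""
    return re.findall(r'^\s*\d{2}\s+([A-Z0-9-]+)', source, flags=re.MULTILINE)

def suggest_fixes(listing, source):
    """Map each undefined name to its closest declared name, if one is close enough."""
    declared = declared_names(source)
    fixes = {}
    for name in undefined_names(listing):
        match = difflib.get_close_matches(name, declared, n=1, cutoff=0.8)
        if match:
            fixes[name] = match[0]
    return fixes

def patch(source, fixes):
    """Replace whole-word occurrences of each misspelled name."""
    for wrong, right in fixes.items():
        pattern = rf'(?<![A-Z0-9-]){re.escape(wrong)}(?![A-Z0-9-])'
        source = re.sub(pattern, right, source)
    return source

fixes = suggest_fixes(SYSPRINT, SOURCE)
patched = patch(SOURCE, fixes)
```

A real agent would still need to resubmit the job and re-read the output. The sketch only covers the diagnosis and patch step.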

Hopper is not trying to hide the mainframe behind a generic abstraction, and it's not a chatbot. The design principle is simple: preserve the fidelity of the mainframe environment, but make it accessible to AI agents.

Sensitive operations require approval, and the terminal remains visible at all times.

Once agents can operate inside the mainframe environment, new workflows become possible: faster job debugging, automated documentation, safer code changes, test generation, migration planning, traffic replay, and modernization verification.

We’re curious to hear your thoughts, especially from anyone who has worked with mainframes or COBOL, or has done legacy enterprise modernization.


Comments URL: https://news.ycombinator.com/item?id=48111143

Points: 8

# Comments: 2

Show HN: Gigacatalyst – Extend your SaaS with an embedded AI builder

Hi HN, I’m Namanyay from Gigacatalyst (link: https://gigacatalyst.com/). Gigacatalyst allows sales, CS, and users to build one-off features, so your SaaS can support long-tail customer workflows and engineers aren’t pulled away from the roadmap.

When you sell software to large businesses, you realize that each customer needs their own workflow and features. Traditionally, this either means long engineering roadmaps or the customers end up using workarounds.

But what if everyone could build their critical missing features just by talking to an AI? That’s what we do at Gigacatalyst. We provide an AI customization layer for your customers, CS team, and sales team to build these missing critical workflows without needing any engineers at all. Think Lovable, but built on top of YOUR platform.

We connect to your product's APIs, learn your data model and design system, and let non-technical users build governed apps via natural language - inside your product, under your brand.

Here’s what it looks like in action: https://www.youtube.com/watch?v=_taSpSphH6E

One of our customers, a Series B company, saw their users (not engineers - managers, ops people, facility directors) build critical workflows like:

- Parts stockout prevention: A maintenance manager typed "show me which parts will run out in the next 2 weeks based on usage over the last 90 days, accounting for vendor lead times." The app tracks consumption velocity, forecasts stockouts, and alerts before it's too late. He says it's prevented ~$500K in emergency downtime.

- Invoice OCR from phone photos: Technicians kept losing paper invoices. The prompt: "upload a photo of the invoice, extract vendor name, date, amount, and line items, then match it to the purchase order and flag discrepancies." Now techs snap a photo on-site to automatically add to the system of record.

- Restaurant emergency triage: A pizza chain's facilities manager was drowning in maintenance requests. He built a priority matrix: "walk-in freezer not cooling" auto-routes as CRITICAL, "dining room light flickering" goes to LOW. He's now able to manage backlogs with the correct priority.
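The stockout prompt above boils down to simple arithmetic, roughly sketched here in Python. This is not the generated app's code: the field names, the sample numbers, and the rule "flag if it runs out within the horizon or before a reorder could arrive" are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Part:
    name: str
    on_hand: int            # units currently in stock
    used_last_90_days: int  # units consumed over the trailing 90 days
    lead_time_days: int     # vendor lead time for a reorder

def days_until_stockout(part):
    daily_usage = part.used_last_90_days / 90
    if daily_usage == 0:
        return float("inf")
    return part.on_hand / daily_usage

def at_risk(parts, horizon_days=14):
    """Parts that run out within the horizon, or before a reorder placed today arrives."""
    flagged = []
    for p in parts:
        remaining = days_until_stockout(p)
        if remaining <= horizon_days or remaining <= p.lead_time_days:
            flagged.append((p.name, round(remaining, 1)))
    return sorted(flagged, key=lambda item: item[1])

parts = [
    Part("belt", on_hand=10, used_last_90_days=90, lead_time_days=5),     # 10 days left
    Part("filter", on_hand=60, used_last_90_days=180, lead_time_days=45), # 30 days left, 45-day lead
    Part("gasket", on_hand=500, used_last_90_days=90, lead_time_days=7),  # 500 days left
]
flagged = at_risk(parts)
```

Here `belt` is flagged because it runs out inside the two-week window. `filter` is flagged because a reorder placed today would arrive too late.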

How Gigacatalyst works under the hood:

1. Agentic API discovery: Our agents go through your app and parse your endpoints, query params, request/response shapes, and sample data to build the base layer.

2. Generation and Validation: When a user describes what they want, our AI generates an app. We set up multiple validation steps, including static checks, runtime error analysis, and LLM-as-a-judge.

3. Sandboxing and Compilation: We wrote our own compilation and sandboxing framework to get the fastest speeds and lowest costs. This means that users can interact with the built app in seconds.

4. Proxy layer: We create a proxy layer for all APIs to handle auth, tenant isolation, and rate limiting. Everything the agent has access to is controlled, logged, observed, and version controlled.
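As a rough illustration of what a proxy layer with auth, tenant isolation, and rate limiting might do per request, here is a small Python sketch. It is not Gigacatalyst's implementation: the token-bucket limiter, the `/tenants/<id>/` path convention, and every class and parameter name are assumptions.

```python
import time

class TokenBucket:
    """Simple token bucket: `capacity` burst, refilled at `refill_per_sec`."""
    def __init__(self, capacity, refill_per_sec, now=None):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class TenantProxy:
    def __init__(self, api_keys, capacity=2, refill_per_sec=1.0):
        self.api_keys = api_keys  # api key -> tenant id (hypothetical auth scheme)
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}
        self.audit_log = []       # every authenticated call is recorded

    def handle(self, api_key, path, now=None):
        tenant = self.api_keys.get(api_key)
        if tenant is None:
            return 401  # unknown key
        if not path.startswith(f"/tenants/{tenant}/"):
            status = 403  # tenant isolation: no access to other tenants' paths
        else:
            if tenant not in self.buckets:
                self.buckets[tenant] = TokenBucket(self.capacity, self.refill_per_sec, now)
            status = 200 if self.buckets[tenant].allow(now) else 429
        self.audit_log.append((tenant, path, status))
        return status
```

The `now` parameter exists only so the limiter can be exercised deterministically. A real proxy would forward the request upstream instead of returning a bare status code.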

After 2000+ daily users, 900+ apps built, and 70% 30-day retention, today we're opening a public demo.

Try it: https://app.gigacatalyst.com/ - enter your SaaS product's API URL (or just the homepage) and start prompting.

If you're serving a variety of use cases, you probably deal with a lot of custom requests and Gigacatalyst will save you time and increase your bottom line. Book a meeting at https://gigacatalyst.com/#contact and I'll help your team and customers build new functionality on top of your platform.

I've been reading Hacker News since I was 12 years old. I'm proud to launch for all of you and I want to hear your feedback on my product and comments!


Comments URL: https://news.ycombinator.com/item?id=48110593

Points: 4

# Comments: 0

Launch HN: Voker (YC S24) – Analytics for AI Agents

Hey HN, we're Alex and Tyler, co-founders of Voker.ai (https://voker.ai/), an agent analytics platform for AI product teams. Voker gives full visibility into what users are asking of your agents, and whether your agents are delivering, without having to dig through logs. Our main product is a lightweight SDK that is LLM stack agnostic and purpose-built for agent products. (https://app.voker.ai/docs)

Agent Engineers and AI product teams don’t have the right level of visibility into agent performance in production, which results in bad user experiences, churn, and hundreds of hours wasted with spot checks to find and debug issues with agent configurations.

Demo: https://www.tella.tv/video/vid_cmoukcsk1000i07jgb4j65u67/vie...

We recently conducted a survey of YC Founders and 90%+ of respondents said that the only way they know if their Agents are failing users in production is by hearing complaints from customers. They push a prompt change hoping that it fixes the problem and doesn’t break something somewhere else, and the cycle repeats.

We saw tons of observability and evals products popping up to try to address these problems, but we still felt like something was missing in the agent monitoring stack. Obs is good for individual trace debugging but is only accessible to engineers. Evals are good for testing known issues, but don't give insights into trends that teams don’t expect, so engineers are always playing catch up. Traditional product analytics tools do a good job tracking clicks and pageviews across your product surface but weren’t built ground up for agent products. Knowing what users want out of agents, and whether the agent delivered requires specific conversational intelligence / unstructured data processing techniques.

We came up with the agent analytics primitives of Intents, Corrections, and Resolutions to describe something pretty much all conversational agents had in common: a user will always come to an agent with an intent, the user might have to correct this agent on the way to getting their intent resolved, and hopefully every intent a user has is eventually resolved by the agent. Voker processes LLM calls by automatically annotating individual conversations and picking out user intent and corrections. Voker takes these and uses LLMs and hierarchical text classification to create dynamic categories that give higher level insights so you don’t have to read individual conversations to know what are the main usage patterns across your users.

The most common substitute solution we’ve seen is uploading obs logs to Claude or ChatGPT and asking for summary insights. There are a few problems with this, mainly that LLMs aren’t good at math or data science, so you don’t get accurate or consistent statistics. It's highly likely that the LLM overfits to some insights and underfits to others. The LLM isn’t programmatically reading and classifying each individual session or interaction. This is why we don’t use LLMs for any of our core data engineering (processing events, calculating statistics), so the analytics we produce are consistent, reproducible, and accurate. We have a publicly available, lightweight SDK that wraps LLM calls to OpenAI, Anthropic and Gemini in Python and TypeScript. Voker handles the data engineering to turn raw data into usable analytics primitives and higher level insights. Free tier: 2,000 events / mo, requires email signup. Paid plans start at $80/mo with a 30 day free trial.
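To show why computing statistics in plain code gives reproducible numbers, here is a small Python sketch that aggregates already-annotated conversations by intent. It is not Voker's SDK or pipeline: the `Conversation` record and the metric names are hypothetical, and the annotation step (intent, corrections, resolved) is assumed to have happened upstream.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Conversation:
    intent: str       # what the user came for
    corrections: int  # how many times the user had to correct the agent
    resolved: bool    # whether the intent was eventually resolved

def summarize(conversations):
    """Deterministic per-intent statistics: same input always yields the same numbers."""
    totals = defaultdict(lambda: {"count": 0, "resolved": 0, "corrections": 0})
    for c in conversations:
        row = totals[c.intent]
        row["count"] += 1
        row["resolved"] += int(c.resolved)
        row["corrections"] += c.corrections
    return {
        intent: {
            "count": row["count"],
            "resolution_rate": row["resolved"] / row["count"],
            "avg_corrections": row["corrections"] / row["count"],
        }
        for intent, row in totals.items()
    }

sample = [
    Conversation("refund", corrections=0, resolved=True),
    Conversation("refund", corrections=2, resolved=False),
    Conversation("reset password", corrections=1, resolved=True),
]
stats = summarize(sample)
```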

We'd love to hear how you're currently detecting trends, and if you try Voker, tell us what part of our analysis is valuable, and what still feels missing. Thanks for reading, and we’re looking forward to your thoughts in the comments!


Comments URL: https://news.ycombinator.com/item?id=48109962

Points: 11

# Comments: 6

Show HN: Statewright – Visual state machines that make AI agents reliable

Agentic problem solving in its current state is very brittle. I fell in love with it, but it creates as many problems as it solves.

I'm Ben Cochran. I spent 20+ years in the trenches with full-stack engineering, DevOps, high performance computing & ML, with stints at NVIDIA, AMD and various other organizations, most recently as a Distinguished Engineer.

For agents to work reliably you either need massive parameter counts or massive context windows to keep the solution spaces workable. Most people are brute forcing reliability with bigger models and longer prompts.

What if I made the problem smaller instead of making the model bigger?

I took a different approach: I used smaller models in the 13-20B parameter range and set them to solving real SWE-bench problems. I constrained the tool and solution spaces using formal state machines. Each state in the machine defines which tools the model can access, how many iterations it gets and what transitions are valid. A planning state gets read-only tools. An implementation state gets edit tools (scoped to prevent mega edits) and write-friendly bash tools. The testing state gets bash, but only for testing commands. The model cannot physically skip steps or use the wrong tool at the wrong time. It is enforced via protocol, not via prompts.
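The idea of per-state tool restrictions, iteration budgets, and valid transitions can be sketched in a few lines of Python. This is not Statewright's engine (which is described as Rust) or its definition format: the state names, tool names, budgets, and the `Machine` class are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    tools: frozenset        # tools the model may call in this state
    max_iterations: int     # tool-call budget for this state
    transitions: frozenset  # states reachable from here

class WorkflowError(Exception):
    pass

# Hypothetical bugfix workflow: plan (read-only) -> implement -> test -> done, with a retry loop.
BUGFIX = {
    "plan": State(frozenset({"read_file", "search"}), 5, frozenset({"implement"})),
    "implement": State(frozenset({"read_file", "edit_file", "bash"}), 10, frozenset({"test"})),
    "test": State(frozenset({"run_tests"}), 3, frozenset({"implement", "done"})),
    "done": State(frozenset(), 0, frozenset()),
}

class Machine:
    def __init__(self, states, start):
        self.states = states
        self.current = start
        self.iterations = 0

    def call_tool(self, tool):
        """Reject the call in code, rather than asking the model nicely in a prompt."""
        state = self.states[self.current]
        if tool not in state.tools:
            raise WorkflowError(f"{tool!r} is not allowed in state {self.current!r}")
        if self.iterations >= state.max_iterations:
            raise WorkflowError(f"iteration budget exhausted in state {self.current!r}")
        self.iterations += 1

    def transition(self, target):
        if target not in self.states[self.current].transitions:
            raise WorkflowError(f"cannot go from {self.current!r} to {target!r}")
        self.current = target
        self.iterations = 0
```

With this shape, an agent in `plan` that tries `edit_file` gets an error it can react to, and it cannot jump straight to `test` without passing through `implement`.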

The results were more promising than I would have expected. The improvements were consistent across multiple model families irrespective of age (qwen-coder, gpt-oss, gemma4) above the 13B parameter inflection point. Below that, models can navigate the state machine but can't retain enough context to produce accurate edits. More on the research bit: https://statewright.ai/research

Surprisingly this yielded improvements in frontier models as well. Haiku and Sonnet start to punch above their weight and Opus solves more reliably with fewer tokens and death spirals. Fine tuning did not yield these kinds of functional improvements for me. The takeaway it seems is that context window utilization matters more than raw context size - a tightly scoped working context at each step outperforms a model given carte blanche over everything. Constraining LLMs which are non-idempotent by using deterministic code is a pattern that nobody is currently talking about.

So, I built Statewright. Its core is a Rust engine that evaluates state machine definitions: states, transitions, guards and tool restrictions. Its orchestration doesn't use an LLM, just enforces the state machine. On top of that is a plugin layer that integrates with Claude Code (and soon Codex, Cursor and others) via MCP. When you activate a workflow, hooks enforce the guardrails per state automatically. The model sees 5 tools available instead of dozens, gets clear instructions for the current phase and transitions when conditions are met. Importantly it tells the model when it's attempting to do something that isn't in scope, incorrect or when it needs to try something else after getting stuck.

You can use your agent via MCP to build a state machine for you to solve a problem in your current context. The visual editor at statewright.ai lets you tweak these workflows in a graph view... You can clearly see the failure paths, the retry loops and the approval gates. State machines aren't DAGs; they loop and retry, which is what agentic work actually needs.

Statewright is currently live with a free tier, try it out in Claude Code by running the following:

/plugin marketplace add statewright/statewright

/plugin install statewright

/reload-plugins

Then "start the bugfix workflow" or /statewright start bugfix. You'll need to paste your API key when prompted. The latest versions of Claude may complain -- paste the API key again and say you really mean it, Claude is just being cautious here.

Feedback is welcome on the workflow editor, the plugin experience, and tell me what workflows you'd want to build first. Agents are suggestions, states are laws.


Comments URL: https://news.ycombinator.com/item?id=48108778

Points: 4

# Comments: 0

Show HN: Safe-install – safer NPM installs with trusted build dependencies

In light of the ongoing npm supply chain compromises, I built safe-install:

https://www.npmjs.com/package/@gkiely/safe-install

It brings a couple of protections I wanted from npm that are not built in.

Similar to Bun’s trusted dependencies, it lets you disable install scripts by default and define a list of dependencies that are allowed to run build/install scripts:

https://bun.com/docs/guides/install/trusted

It also supports blocking exotic sub-dependencies, similar to pnpm’s `blockExoticSubdeps` setting:

https://gajus.com/blog/3-pnpm-settings-to-protect-yourself-f...
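Both checks can be illustrated with a short Python sketch over parsed `package.json` contents of installed packages. This is not safe-install's code or configuration format. The function names and sample manifests are hypothetical, and the list of "exotic" specifier prefixes is an assumption that is not exhaustive.

```python
# npm lifecycle hooks that run during installation.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")
# Dependency specifiers that point outside the registry (assumed, not exhaustive).
EXOTIC_PREFIXES = ("git+", "git:", "github:", "http:", "https:", "file:")

def blocked_install_scripts(manifests, trusted):
    """Return {package: [hooks]} for packages with install hooks that are not on the allowlist."""
    blocked = {}
    for manifest in manifests:
        hooks = [h for h in INSTALL_HOOKS if h in manifest.get("scripts", {})]
        if hooks and manifest["name"] not in trusted:
            blocked[manifest["name"]] = hooks
    return blocked

def exotic_subdeps(manifests):
    """Return (package, dependency, specifier) for non-registry specifiers in installed packages."""
    found = []
    for manifest in manifests:
        for dep, spec in manifest.get("dependencies", {}).items():
            if spec.startswith(EXOTIC_PREFIXES):
                found.append((manifest["name"], dep, spec))
    return found

# Hypothetical manifests of installed (non-root) packages.
installed = [
    {"name": "native-addon", "scripts": {"postinstall": "node build.js"}},
    {"name": "shady-utils", "scripts": {"postinstall": "node setup.js"},
     "dependencies": {"helper": "git+https://example.com/helper.git"}},
    {"name": "plain", "dependencies": {"lodash": "^4.17.21"}},
]

blocked = blocked_install_scripts(installed, trusted={"native-addon"})
exotic = exotic_subdeps(installed)
```

Only `shady-utils` is reported in both cases: its install hook is not allowlisted, and it pulls a dependency from a git URL rather than the registry.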

I was hoping npm would eventually add something like this, but it does not seem to be happening soon, so I made a small package for it.


Comments URL: https://news.ycombinator.com/item?id=48102636

Points: 3

# Comments: 0

Show HN: E2a – Open-source email gateway for AI agents

We were building an agent system and wanted email as a trigger. We decided to take it out and make it a standalone service.

The primary email features we wanted and used for our own agent system:

1. Email threading stays consistent with agent conversation threading

2. Human in the loop review for outbound emails (especially during testing phase)

3. Quick onboarding/offboarding of email addresses for agents, within minutes

4. Websocket for local agents and at-least-once webhook delivery for Cloud agents
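Two of the points above, consistent threading and at-least-once delivery, can be sketched together in Python. This is not e2a's code: it assumes threads are tracked through the standard `Message-ID`, `In-Reply-To`, and `References` headers and that each webhook delivery carries a unique id. The `ThreadIndex` class and the `conv-N` ids are hypothetical.

```python
from email.message import EmailMessage

class ThreadIndex:
    def __init__(self):
        self.conversation_of = {}    # Message-ID -> conversation id
        self.seen_deliveries = set() # delivery ids already processed
        self.next_id = 1

    def conversation_for(self, msg):
        """Reuse the conversation of any referenced message, else start a new one."""
        refs = (str(msg.get("References", "")) + " " + str(msg.get("In-Reply-To", ""))).split()
        conv = next((self.conversation_of[r] for r in refs if r in self.conversation_of), None)
        if conv is None:
            conv = f"conv-{self.next_id}"
            self.next_id += 1
        self.conversation_of[str(msg["Message-ID"])] = conv
        return conv

    def handle_delivery(self, delivery_id, msg):
        """At-least-once delivery can repeat, so each delivery id is processed only once."""
        if delivery_id in self.seen_deliveries:
            return None
        self.seen_deliveries.add(delivery_id)
        return self.conversation_for(msg)

def make_message(message_id, in_reply_to=None):
    msg = EmailMessage()
    msg["Message-ID"] = message_id
    if in_reply_to:
        msg["In-Reply-To"] = in_reply_to
        msg["References"] = in_reply_to
    return msg
```

A reply that references an earlier message lands in the same conversation, and a redelivered webhook is ignored rather than triggering the agent twice.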

Not yet: DMARC (only SPF/DKIM today), scoped API keys, HA/multi-region (single VM + single Postgres), app-layer email data encryption, compliance attestations (SOC 2/HIPAA).

GitHub: https://github.com/Mnexa-AI/e2a

Hosted: https://e2a.dev/

Appreciate any feedback / contributions.


Comments URL: https://news.ycombinator.com/item?id=48100227

Points: 10

# Comments: 1

Show HN: OpenGravity – A zero-install, BYOK vanilla JS clone of Antigravity

Hi. I’m a high school student studying for my GCSEs. I was using Google Antigravity heavily for my side projects, but I kept hitting the usage limits and getting random "agent terminated" errors. So I decided to try to build my own version of the IDE. I love the UI, so I copied it as accurately as possible, and then hooked up some logic into it, including the INCREDIBLY finicky WebContainer API.

I tried to keep it super lightweight, with no build steps or dependencies, and now that it's open source, I'm hoping people can build things on top of it that aren't possible with closed source tools, like complex custom agent workflows.

Some screenshots:

- https://github.com/ab-613/OpenGravity/blob/main/examples/scr...

- https://github.com/ab-613/OpenGravity/blob/main/examples/htm...

What it's made from:

- Pure Vanilla JS: no React, Vue, or build step. Built entirely in plain HTML/CSS/JS to keep it super lightweight.

- WebContainer API and xterm.js: Instead of faking a terminal, I (after much pain) hooked up the WebContainer API so the AI agent has a real, in-browser Linux environment to run shell commands, install dependencies, and edit local files.

- BYOK (Bring Your Own Key): API key ALWAYS stays in localStorage.

What's currently happening:

- It works, but it's an alpha. The AI can get projects started properly on its own and edit files, but because I built this over a few days before my exams, a lot of the UI dropdowns and buttons are currently just hardcoded placeholders.

- I’m open sourcing it early because I think the foundation of a Vanilla JS + WebContainer IDE is really strong, and I'd love to see where the community takes it while I'm doing my exams.

- Live demo: https://opengravity.pages.dev (Zoom out to 80% if not full screen. It will prompt for a Gemini API key on load). Start by uploading a folder, then you can fiddle with the terminal and agent, and see how it goes!

Would love to hear feedback on the code, the WebContainer integration, or how to improve the agent loop!


Comments URL: https://news.ycombinator.com/item?id=48100192

Points: 22

# Comments: 8
