Hacker News: Launches

Launch HN: Voker (YC S24) – Analytics for AI Agents

ttpost May 12, 2026

Hey HN, we're Alex and Tyler, co-founders of Voker.ai (https://voker.ai/), an agent analytics platform for AI product teams. Voker gives full visibility into what users are asking of your agents, and whether your agents are delivering, without having to dig through logs. Our main product is a lightweight SDK that is LLM stack agnostic and purpose-built for agent products. (https://app.voker.ai/docs)

Agent Engineers and AI product teams don’t have the right level of visibility into agent performance in production, which results in bad user experiences, churn, and hundreds of hours wasted with spot checks to find and debug issues with agent configurations.

Demo: https://www.tella.tv/video/vid_cmoukcsk1000i07jgb4j65u67/vie...

We recently conducted a survey of YC Founders and 90%+ of respondents said that the only way they know if their Agents are failing users in production is by hearing complaints from customers. They push a prompt change hoping that it fixes the problem and doesn’t break something somewhere else, and the cycle repeats.

We saw tons of observability and evals products popping up to try to address these problems, but we still felt like something was missing in the agent monitoring stack. Obs is good for individual trace debugging but is only accessible to engineers. Evals are good for testing known issues, but don't give insights into trends that teams don’t expect, so engineers are always playing catch up. Traditional product analytics tools do a good job tracking clicks and pageviews across your product surface but weren’t built ground up for agent products. Knowing what users want out of agents, and whether the agent delivered requires specific conversational intelligence / unstructured data processing techniques.

We came up with the agent analytics primitives of Intents, Corrections, and Resolutions to describe something pretty much all conversational agents had in common: a user will always come to an agent with an intent, the user might have to correct this agent on the way to getting their intent resolved, and hopefully every intent a user has is eventually resolved by the agent. Voker processes LLM calls by automatically annotating individual conversations and picking out user intent and corrections. Voker takes these and uses LLMs and hierarchical text classification to create dynamic categories that give higher level insights so you don’t have to read individual conversations to know what are the main usage patterns across your users.

The most common substitute solution we’ve seen is uploading obs logs to Claude or ChatGPT and asking for summary insights. There are a few problems with this - mainly that LLMs aren’t good at math or data science, so you don’t get accurate or consistent statistics. Its highly likely that the LLM overfits to some insights and underfits to others. The LLM isn’t programmatically reading and classifying each individual session or interaction. This is why we don’t use LLMs for any of our core data engineering (processing events, calculating statistics) so the analytics we produce are consistent, reproducible, and accurate. We have a publicly available, lightweight SDK that wraps LLM calls to OpenAI, Anthropic and Gemini in Python and Typescript. Voker handles the data engineering to turn raw data into usable analytics primitives and higher level insights. Free tier: 2,000 events / mo, requires email signup. Paid plans start at $80/mo with a 30 day free trial.

We'd love to hear how you're currently detecting trends, and if you try Voker, tell us what part of our analysis is valuable, and what still feels missing. Thanks for reading, and we’re looking forward to your thoughts in the comments!

Comments URL: https://news.ycombinator.com/item?id=48109962

Points: 29

# Comments: 13

https://news.ycombinator.com/item?id=48109962

Extensions

Launch HN: Kampala (YC W26) – Reverse-Engineer Apps into APIs

alexblackwell_ Apr 16, 2026

Hey! I am Alex and together with my co-founder Tarun built Kampala (https://www.zatanna.ai/kampala). It’s a man-in-the-middle (MITM) style proxy that allows you to agentically reverse engineer existing workflows without brittle browser automation or computer use agents. It works for websites, mobile apps, desktop apps.

Demo: https://www.youtube.com/watch?v=z_PeostC-b4. Many people spend hours per day in legacy dashboards and on-prem solutions reconciling data across platforms. Current attempts at automation use browser automations or computer use agents which are brittle, slow, and nondeterministic. I come from a web reverse engineering background and spent the last 7-8 years building integrations by hand for sneaker/ticket releases, sportsbooks logins, and everything in\ between. During that time I consulted for several companies and brought them off of browser based infrastructure into the requests layer.

When we started Zatanna (that’s our company name) we worked in dental tech, which meant we had to deal with tons of insurance payer dashboards and legacy dental-practice solutions. Our superpower (as a fairly undifferentiated voice agent/front desk assistant company) was that we could integrate with nearly any system requested. During this time we built extensive tooling (including what we’re now calling Kampala) to allow us to spin up these integrations quickly. Existing MITM proxies and tooling didn’t work for a few reasons: (1) They manipulated the TLS and HTTP2 fingerprint over the wire which was detected by strict anti-bots. (2) They had bad MCPs which did not adequately expose necessary features like scripts/replay. (3) They did not allow for building workflows or actions given a sample or sequence of requests.

As the tools we built got more powerful, we began to use them internally to scrape conference attendees, connect to external PMS systems, and interact with slack apps. I even sent it to my property manager mom, who (with a lot of help from me lol), automated 2-3 hours of billing information entry in Yardi. At that point we realized that this wasn’t really about dentistry :)

Because Kampala is a MITM, it is able to leverage existing session tokens/anti-bot cookies and automate things deterministically in seconds. You can either use our agent harness that directly creates scripts/apis by prompting you with what actions to make, or our MCP by manually doing a workflow once, and asking your preferred coding agent to use Kampala to make a script/API to replicate it. Once you have an API/script, you can export, run, or even have us host it for you.

We think the future of automation does not consist of sending screenshots of webpages to LLMs, but instead using the layer below that computers actually understand. Excited to hear your thoughts/questions/feedback!

Comments URL: https://news.ycombinator.com/item?id=47794514

Points: 100

# Comments: 83

https://news.ycombinator.com/item?id=47794514

Extensions

Launch HN: Twill.ai (YC S25) – Delegate to cloud agents, get back PRs

danoandco Apr 10, 2026

Hey HN, we're Willy and Dan, co-founders of Twill.ai (https://twill.ai/). Twill runs coding CLIs like Claude Code and Codex in isolated cloud sandboxes. You hand it work through Slack, GitHub, Linear, our web app or CLI, and it comes back with a PR, a review, a diagnosis, or a follow-up question. It loops you in when it needs your input, so you stay in control.

Demo: https://www.youtube.com/watch?v=oyfTMXVECbs

Before Twill, building with Claude Code locally, we kept hitting three walls

1. Parallelization: two tasks that both touch your Docker config or the same infra files are painful to run locally at once, and manual port rebinding and separate build contexts don't scale past a couple of tasks.

2. Persistence: close your laptop and the agent stops. We wanted to kick off a batch of tasks before bed and wake up to PRs.

3. Trust: giving an autonomous agent full access to your local filesystem and processes is a leap, and a sandbox per task felt safer to run unattended.

All three pointed to the same answer: move the agents to the cloud, give each task its own isolated environment.

So we built what we wanted. The first version was pure delegation: describe a task, get back a PR. Then multiplayer, so the whole team can talk to the same agent, each in their own thread. Then memory, so "use the existing logger in lib/log.ts, never console.log" becomes a standing instruction on every future task. Then automation: crons for recurring work, event triggers for things like broken CI.

This space is crowded. AI labs ship their own coding products (Claude Code, Codex), local IDEs wrap models in your editor, and a wave of startups build custom cloud agents on bespoke harnesses. We take the following path: reuse the lab-native CLIs in cloud sandboxes. Labs will keep pouring RL into their own harnesses, so they only get better over time. That way, no vendor lock-in, and you can pick a different CLI per task or combine them.

When you give Twill a task, it spins up a dedicated sandbox, clones your repo, installs dependencies, and invokes the CLI you chose. Each task gets its own filesystem, ports, and process isolation. Secrets are injected at runtime through environment variables. After a task finishes, Twill snapshots the sandbox filesystem so the next run on the same repo starts warm with dependencies already installed. We chose this architecture because every time the labs ship an improvement to their coding harness, Twill picks up the improvement automatically.

We’re also open-sourcing agentbox-sdk, https://github.com/TwillAI/agentbox-sdk, an SDK for running and interacting with agent CLIs across sandbox providers.

Here’s an example: a three-person team assigned Twill to a Linear backlog ticket about adding a CSV import feature to their Rails app. Twill cloned the repo, set up the dev environment, implemented the feature, ran the test suite, took screenshots and attached them to the PR. The PR needed one round of revision, which they requested through Github. For more complex tasks, Twill asks clarifying questions before writing code and records a browser session video (using Vercel's Webreel) as proof of work.

Free tier: 10 credits per month (1 credit = $1 of AI compute at cost, no markup), no credit card. Paid plans start at $50/month for 50 credits, with BYOK support on higher tiers. Free pro tier for open-source projects.

We’d love to hear how cloud coding agents fit into your workflow today, and if you try Twill, what worked, what broke, and what’s still missing.

Comments URL: https://news.ycombinator.com/item?id=47720418

Points: 77

# Comments: 95

https://news.ycombinator.com/item?id=47720418

Extensions

Launch HN: Relvy (YC F24) – On-call runbooks, automated

behat Apr 9, 2026

Hey HN! We are Bharath, and Simranjit from Relvy AI (https://www.relvy.ai). Relvy automates on-call runbooks for software engineering teams. It is an AI agent equipped with tools that can analyze telemetry data and code at scale, helping teams debug and resolve production issues in minutes. Here’s a video: [[[https://www.youtube.com/watch?v=BXr4_XlWXc0]]]

A lot of teams are using AI in some form to reduce their on-call burden. You may be pasting logs into Cursor, or using Claude Code with Datadog’s MCP server to help debug. What we’ve seen is that autonomous root cause analysis is a hard problem for AI. This shows up in benchmarks - Claude Opus 4.6 is currently at 36% accuracy on the OpenRCA dataset, in contrast to coding tasks.

There are three main reasons for this: (1) Telemetry data volume can drown the model in noise; (2) Data interpretation / reasoning is enterprise context dependent; (3) On-call is a time-constrained, high-stakes problem, with little room for AI to explore during investigation time. Errors that send the user down the wrong path are not easily forgiven.

At Relvy, we are tackling these problems by building specialized tools for telemetry data analysis. Our tools can detect anomalies and identify problem slices from dense time series data, do log pattern search, and reason about span trees, all without overwhelming the agent context.

Anchoring the agent around runbooks leads to less agentic exploration and more deterministic steps that reflect the most useful steps that an experienced engineer would take. That results in faster analysis, and less cognitive load on engineers to review and understand what the AI did.

How it works: Relvy is installed on a local machine via docker-compose (or via helm charts, or sign up on our cloud), connect your stack (observability and code), create your first runbook and have Relvy investigate a recent alert.

Each investigation is presented as a notebook in our web UI, with data visualizations that help engineers verify and build trust with the AI. From there on, Relvy can be configured to automatically respond to alerts from Slack

Some example runbook steps that Relvy automates: - Check so-and-so dashboard, see if the errors are isolated to a specific shard. - Check if there’s a throughput surge on the APM page, and if so, is it from a few IPs? - Check recent commits to see if anything changed for this endpoint.

You can also configure AWS CLI commands that Relvy can run to automate mitigation actions, with human approval.

A little bit about us - We did YC back in fall 2024. We started our journey experimenting with continuous log monitoring with small language models - that was too slow. We then invested deeply into solving root cause analysis effectively, and our product today is the result of about a year of work with our early customers.

Give us a try today. Happy to hear feedback, or about how you are tackling on-call burden at your company. Appreciate any comments or suggestions!

Comments URL: https://news.ycombinator.com/item?id=47702647

Points: 48

# Comments: 25

https://news.ycombinator.com/item?id=47702647

Extensions

Launch HN: Freestyle – Sandboxes for Coding Agents

benswerd Apr 6, 2026

We’re Ben and Jacob, cofounders of Freestyle (https://freestyle.sh). We’re building a cloud for Coding Agents.

For the first generation of agents it looked like workflows with minimal tools. 2 years ago we published a package to let AI work in SQL, at that time GPT-4 could write simple scripts. Soon after the first AI App Builders started using AI to make whole websites; we supported that with a serverless deploy system.

But the current generation is going much further, instead of minimal tools and basic serverless apps AI can utilize the full power of a computer (“sandbox”). We’re building sandboxes that are interchangeable with EC2s from your agents perspective, with bonus features:

1. We’ve figured out how to fork a sandbox horizontally without more than a 400ms pause in it. That's not forking the filesystem, we mean forking the whole memory of it. If you’re half way down a browser page with animations running, they’ll be in the same place in all the forks. If you’re running a minecraft server every block and player will be in the same place on the forks. If you’re running a local environment and an error comes up in process that error will be there in all the forks. This works for snapshotting as well, you can save your place and come back weeks later.

2. Our sandboxes start in ~500ms.

Demo: https://www.loom.com/share/8b3d294d515442f296aecde1f42f5524

Compared with other sandboxes, our goal is to be the most powerful. We support full Linux + hardware-virtualization, eBPF, Fuse, etc. We run full Debian with multiple users and we use a systemd init instead of runc. Whatever your AI expects to work on debian should work on these vms, and if it doesn’t send a bug report.

In order to make this possible, we’ve moved to our own bare metal racks. Early in our testing we realized that moving VMs across cloud nodes would not have acceptable performance properties. We asked Google Cloud and AWS for a quote on their bare metal nodes and found that the monthly cost was equivalent to the total cost of the hardware so we did that.

Our goal is to build the necessary infrastructure to replicate the human devloop on the massively multi-tenant scale of AI, so these VMs should be as powerful as the ones you’re used to, while also being available to provision in seconds.

Comments URL: https://news.ycombinator.com/item?id=47663147

Points: 322

# Comments: 158

https://news.ycombinator.com/item?id=47663147

Extensions

Launch HN: Sitefire (YC W26) – Automating actions to improve AI visibility

vincko Mar 20, 2026

Hi HN! We're Vincent and Jochen from sitefire (https://sitefire.ai). Our platform makes it easy for brands to improve their visibility in AI search.

We’ve been working together for years and have backgrounds in RL/optimization at Stanford and software engineering. We came to this idea after speaking with marketing teams who were seeing declining traffic due to Google’s AI Overviews and didn’t know what to do.

This space can feel esoteric. Many case studies, few actual studies. Constant battle against myths (e.g. you need a llms.txt vs. you don't need a llms.txt) and "GEO hacks". We try to be more data-driven. And we try to be more bold and build a system that not only monitors, but actually improves traffic from AI search.

While Google performs a single search, AI search engines expand the user prompt into 3-10 fan-out queries. The sourced pages are ranked using a classified algorithm similar to Reciprocal Rank Fusion (RFF). Finally, the LLMs skim the pages and decide what snippets to cite. Our goal is making sure brands have the right content that makes it through this funnel.

Here is how sitefire works:

- The user defines a set of prompts they want to monitor. These are synthetic prompts - we generate them based on SEO keywords and their monthly search volume.

- We submit these prompts to ChatGPT, Gemini, Google AI Mode, etc. on a daily basis and capture the answers. We extract fan-out queries, sourced pages, citations, and brand mentions.

- For each topic, our agents analyze which web pages are sourced and cited the most, and why. They also consider similar pages that you already have.

- Based on the diagnosis, our content agents draft improvements or create new pages, and push them directly to the client’s CMS.

- We integrate with the client’s network logs and Google Analytics to monitor the increase in AI bot requests and human referrals to their page.

This system is continuously updated, so it always shows which content works, and how to adapt the existing sitemap. For one client that used sitefire to optimize their blog, the AI-optimized articles increased their AI bot requests from ~200/day to ~570/day within ten days.

A risk we recognize is that AI-generated content is filling brands’ websites with slop. Whilst it’s still early days and we don’t claim to have figured everything out yet, our intention is to mitigate this by focusing the content on specific, unique information: real product capabilities, real pricing, honest comparisons. The clients still review every page before it goes live, so they can ensure the content is true to their brand.

Some clients use our platform themselves. For others we act more like an agency, automating steps as we go. The goal is for sitefire to run mostly on its own, with clients approving changes via Slack, Claude or their CMS.

Here's a video demo: https://screen.studio/share/fw7VQQak

If you'd like to try what we've built so far, sign up at https://sitefire.ai.

Comments URL: https://news.ycombinator.com/item?id=47457472

Points: 36

# Comments: 27

https://news.ycombinator.com/item?id=47457472

Extensions

Launch HN: Voltair (YC W26) – Drone and charging network for power utilities

wweissbluth Mar 19, 2026

Hey HN! We’re Hayden, Ronan, Avi, and Warren of Voltair (https://voltairlabs.com/). We’re making weatherized, hybrid-fixed drones deployed for power utility inspections.

Here’s some footage: https://vimeo.com/1173862237/ac28095cc6?share=copy&fl=sv&fe=... and a photo of our latest prototype: https://imgur.com/a/bYHnqZ4.

The U.S. has 7M miles of power lines (enough to go to the moon and back 14 times), and they're aging. Over 50% of all power flows through transformers that are at least 30 years old, which is about when they start to fail.

Power line conductors are just bare metal with 4,000-765,000 volts sitting on ceramic insulators, usually held up by pieces of wood. It’s a cost effective and relatively reliable way to move power. But when the wood starts to rot, or the cotter pin falls out, and a live conductor is dropped on a dead tree on a windy day, you get devastating wildfires like the Palisades Fire in LA last year.

Most utilities solve this problem with foot patrols. Linemen drive out with a clipboard or an iPad, and run through a checklist with binoculars to visually confirm everything is in order. A lineman can inspect about 50-150 poles per day, yet even the smallest rural electric cooperatives (with about ~20 employees) have about 50,000 distribution poles. Clearly the math doesn’t work out. As a result, a given utility pole is inspected about every 10 years (at least that’s what they tell their insurance adjuster).

Helicopters are also used, but cost $25k to get off the ground, and more importantly, every year linemen die in helicopter crashes. Satellites can’t deliver the mm precision needed for these inspections. So drones have emerged as the best solution. Georgia Power saved 60% on operating expenses when they switched to using drones, and Xcel power found drones to find 60% more defects than foot patrols (because of pole-top vantage point).

Problem #2: Drones are held back by the need to constantly recharge and FAA beyond-visual-line-of-sight (BVLOS) regulations. In response, the most well funded utilities (e.g., PG&E, SCE) primarily send out pilots in trucks to collect the data.

Current leaders in the drone space – Skydio and DJI – have built drone-in-a-box solutions. Their charging stations have inherent concurrency constraints (only one drone at a time) and don’t scale easily over large land areas. Skydio charges $250,000 / box, and has a there-and-back range of about 15 miles (assuming ideal performance). They are expensive and inflexible.

Our first solution (and why it didn’t work): We entered YC wanting to build drones that charge inductively from the magnetic fields around power lines. We used a split-core current transformer, wrapped it around the conductor with a clamp, and harvested power. We spent about 4 months testing and developing this hardware, and successfully recharged a few batteries in the field. It was a really cool proof of concept.

But we ran into a big problem. There’s not enough current on distribution lines! These are the wooden poles outside your home, as opposed to the tall steel transmission towers you might see in the countryside. Generally speaking, we needed about a MW of power – or about 1000 homes – to flow through the lines to charge our drone performantly.

We also found the risk-reward calculus didn’t make sense for utilities. Line attachments (and even inductive power harvesting) is common in the utility space. Fault indicators and smart sensors like the Heimdall Power “Neuron” do this. But they are installed one time with lineman supervision and left in place for years. The risk of landing a drone multiple times per day at myriad points around the network felt too risky for utility engineers.

We wondered if we could solve the range and battery swap issue from another angle. Reexamining drone-in-a-box solutions, we realized they had the tech backwards: expensive, overengineered boxes to protect fragile drones. A network of these big enough to cover a utility’s service area would cost hundreds of millions, and the drones still wouldn’t be able to fly when it matters most (during a wildfire, storm, or power outage). What if instead, the drone was ultra-rugged while the charging stations were cheap and attritable?

What we’re building now: We’re making weatherized, long-range (well over 70 miles), fixed-wing drones that can live outside for months at a time. They recharge inductively (no connections or moving parts) on stripped-down charging pads that cost a couple thousand dollars apiece. It doesn’t take many of these pads along a transmission line corridor for our drones to hop between them and inspect the entire length. We reason we could cover the continental U.S. with about 1000-5000 pads.

Having dedicated charging stations also solves the backhaul problem. When you LiDAR scan and take high-res photos of 50 miles of transmission corridor, you accumulate terabytes of data. Manual drone operators can pull out the SD card. We have to offload it wirelessly. Trying to do this directly from the drone over spotty LTE doesn’t work. Instead, we use the charging station as an intermediary, dumping the data from our drone to a hard drive on the station over a high-speed WiFi link. The station can then push this to our servers over Starlink, LTE, or a fiber link asynchronously, freeing the drone to get back in the field and inspect more.

One cool thing we can do this way is reactive inspections. If there’s a weird harmonic on a feeder, or a utility needs a rapid scan after a storm, we can get on-site within minutes to inspect. Contractors often spend months coordinating their on-site data collection, and dedicated storm response contractors are very expensive to keep on-site.

Power utilities are our first customers, but the applications for telecom, rail, oil+gas, forestry, search+rescue, and agriculture are also exciting. One thing that’s not exciting is a drone surveillance state. Unfortunately, we are now in a world where drones are increasingly weaponized, and examples of government overreach are numerous (case in point: Sonoma County, California spying on landowners). We have zero interest in supporting uses like this.

(Our backstory, if you’re interested: Ronan has always had an unhealthy obsession with flying machines, from designing remote controlled planes growing up, to building eVTOL tech for DARPA and the Air Force while still a university student. Warren and Ronan met during a startup competition with a UAV solution in agriculture. Hayden, a childhood friend of Ronan, was deeply ingrained in the power utility space, and realized the true pain point there. Shortly after graduating, Ronan, Hayden, and Warren quit their jobs to take the idea full time in the Summer of 2025. Around the same time Avi dropped out of college, bringing sales skill and regulatory expertise as our fourth cofounder.)

We just secured our first major contract and are working out the details of pilots with some big utilities. Our first paid flight is mid-April. Our business model is straightforward: inspection as a service. We charge per pole or tower.

We are very interested in your opinion! Maybe some of you all work in the energy industry and know a thing or two about infrastructure inspections that we could learn from? We’d love all feedback (good and bad).

Comments URL: https://news.ycombinator.com/item?id=47442452

Points: 85

# Comments: 27

https://news.ycombinator.com/item?id=47442452

Extensions

Launch HN: Canary (YC W26) – AI QA that understands your code

Visweshyc Mar 19, 2026

Hey HN! We're Aakash and Viswesh, and we're building Canary (https://www.runcanary.ai). We build AI agents that read your codebase, figure out what a pull request actually changed, and generate and execute tests for every affected user workflow.

Aakash and I previously built AI coding tools at Windsurf, Cognition, and Google. AI tools were making every team faster at shipping, but nobody was testing real user behavior before merge. PRs got bigger, reviews still happened in file diffs, and changes that looked clean broke checkout, auth, and billing in production. We saw it firsthand. We started Canary to close that gap. Here's how it works:

Canary starts by connecting to your codebase and understands how your app is built: routes, controllers, validation logic. You push a PR and Canary reads the diff, understands the intent behind the changes, then generates and runs tests against your preview app checking real user flows end to end. It comments directly on the PR with test results and recordings showing what changed and flagging anything that doesn't behave as expected. You can also trigger specific user workflow tests via a PR comment.

Beyond PR testing, tests generated from the PR can be moved into regression suites. You can also create tests by just prompting what you want tested in plain English. Canary generates a full test suite from your codebase, schedules it, and runs it continuously. One of our construction tech customers had an invoicing flow where the amount due drifted from the original proposal total by ~$1,600. Canary caught the regression in their invoice flow before release.

This isn't something a single family of foundation models can do on its own. QA spans across many modalities like source code, DOM/ARIA, device emulators, visual verifications, analyzing screen recordings, network/console logs, live browser state etc. for any single model to be specialized in. You also need custom browser fleets, user sessions, ephemeral environments, on-device farms and data seeding to run the tests reliably. On top of that, catching second-order effects of code changes requires a specialized harness that breaks the application in multiple possible ways across different types of users that a normal happy path testing flow wouldn't.

To measure how well our purpose built QA agent works, we published QA-Bench v0, the first benchmark for code verification. Given a real PR, can an AI model identify every affected user workflow and produce relevant tests? We tested our purpose-built QA agent against GPT 5.4, Claude Code (Opus 4.6), and Sonnet 4.6 across 35 real PRs on Grafana, Mattermost, Cal.com, and Apache Superset on three dimensions: Relevance, Coverage, and Coherence. Coverage is where the gap was largest. Canary leads by 11 points over GPT 5.4, 18 over Claude Code, and 26 over Sonnet 4.6. For full methodology and per-repo breakdowns give our benchmark report a read: https://www.runcanary.ai/blog/qa-bench-v0

You can check out the product demo here: https://youtu.be/NeD9g1do_BU

We'd love feedback from anyone working on code verification or thinking about how to measure this differently.

Comments URL: https://news.ycombinator.com/item?id=47441629

Points: 58

# Comments: 26

https://news.ycombinator.com/item?id=47441629

Extensions

Launch HN: Kita (YC W26) – Automate credit review in emerging markets

rheamalhotra1 Mar 17, 2026

Hey HN! We’re Carmel and Rhea, the founders of Kita (https://www.usekita.com/). We automate credit review for lenders in emerging markets using VLMs.

In many emerging markets, like the Philippines and Mexico, credit infrastructure is weak. Open finance is still nascent, and credit bureaus are unreliable. So to apply for a loan, lenders rely on borrowers submitting documentation to understand their ability to repay. A borrower can submit financial documents, such as bank statements and payslips, in any format, from pdfs, images of physical documents and screenshots. On top of that, financial documents in these markets are highly unstandardized, with no consistent templates lenders can rely on.

Existing OCR and document AI tools break on these highly variant, messy real-world documents. Generic tools are not built for lending workflows like verification, fraud detection, and risk extraction. As a result, credit teams fall back on manual review, making underwriting slower, more expensive, and more error-prone.

We met before college and stayed best friends. After graduating, Rhea visited Carmel in the Philippines, where we heard firsthand from fintech operators that document-based underwriting was their biggest pain point. We started building together and tested every OCR and document AI tool we could find. They all failed on the messy real-world documents lenders actually receive, and even when extraction worked, they still could not produce the structured financial data or fraud checks lenders needed.

The problem was even bigger than we thought. Across Indonesia, Mexico, the Philippines, South Africa, and even in the US, most of lending can be boiled down to credit analysts looking at documents. In 2025, 13.3T was lended globally, and 90% of those transactions involved document review. This includes in developed markets.

Kita uses VLM-based agents to parse documents, detect fraud, and extract underwriting signals from messy financial files. Today, we support 50+ document types across PDFs, scans, photos, and screenshots. Our pipeline enhances low-quality inputs, extracts structured financial data, and verifies it through cross-document checks, validation against our historical database, and market-specific fraud detection.

Our architecture’s base VLM is model agnostic, and simultaneously, we train language models finetuned to hyperlocalized credit signals in each market, using localized lender data – every new model improves our base layer, and every new market makes our overall stack stronger. We link document-level signals to repayment outcomes, allowing our models to continuously improve fraud detection and risk assessment over time.

Kita Capture is our first document intelligence product for lenders. We’re also launching Kita Credit Agent, which automates borrower follow-up during origination over WhatsApp and email to collect missing documents and complete loan applications.

Kita Capture is free to try (with email signup): https://portal.usekita.com/. Here’s a quick demo: https://www.youtube.com/watch?v=4-t_UhPNAvQ.

We’d love to get feedback from the community, especially if you’ve worked on document AI, fraud detection, or fintech infrastructure. Thanks for reading!

Comments URL: https://news.ycombinator.com/item?id=47417335

Points: 55

# Comments: 14

https://news.ycombinator.com/item?id=47417335

Extensions

Launch HN: Chamber (YC W26) – An AI Teammate for GPU Infrastructure

jshen96 Mar 16, 2026

Hey HN, we're Jie Shen, Charles, Andreas, and Shaocheng. We built Chamber (https://usechamber.io), an AI agent that manages GPU infrastructure for you. You talk to it wherever your team already works and it handles things like provisioning clusters, diagnosing failed jobs, managing workloads. Demo: https://www.youtube.com/watch?v=xdqh2C_hif4

We all worked on GPU infrastructure at Amazon. Between us we've spent years on this problem — monitoring GPU fleets, debugging failures at scale, building the tooling around it. After leaving we talked to a bunch of AI teams and kept hearing the same stuff. Platform engineers spend half their time just keeping things running. Building dashboards, writing scheduling configs, answering "when will my job start?" all day. Researchers lose hours when a training run fails because figuring out why means digging through Kubernetes events, node logs, and GPU metrics in totally separate tools. Pretty much everyone had stitched together Prometheus, Grafana, Kubernetes scheduling policies, and a bunch of homegrown scripts, and they were spending as much time maintaining all of it as actually using it.

The thing we kept noticing is that most of this work follows patterns. Triage the failure, correlate a few signals, figure out what to do about it. If you had a platform with structured access to the full state of a GPU environment, you could have an agent do that work for you.

So that's what we built. Chamber is a control plane that keeps a live model of your GPU fleet: nodes, workloads, team structure, cluster health. Every operation it supports is exposed as a tool the agent can call. Inspecting node health, reading cluster topology, managing workload lifecycle, adjusting resource configs, provisioning infrastructure. These are structured operations with validation and rollback, not just raw shell commands. When we add new capabilities to the platform, they automatically become things the agent can do too.

We spent a lot of time on safety because we've seen what happens when infrastructure automation goes wrong. A wrong call can kill a multi-day training run or cascade across a cluster. So the agent has graduated autonomy. Routine stuff it handles on its own: diagnosing a failed job, resubmitting with corrected resources, cordoning a bad node. But anything that touches other teams' workloads or production jobs needs human approval first. Every action gets logged with what the agent saw, why it acted, and what it changed.

The platform underneath is really what makes the diagnosis work. When the agent investigates a failure, it queries GPU state, workload history, node health timelines, and cluster topology. That's the difference between "your job OOMed" and "your job OOMed because the batch size exceeded available VRAM on this node, here's a corrected config." Different root causes get different fixes.

One thing that surprised us, even coming from Amazon where we'd seen large GPU fleets: most teams we talk to can't even tell you how many GPUs are in use right now. The monitoring just doesn't exist. They're flying blind on their most expensive hardware.

We’ve launched with a few early customers and are onboarding new teams. We’re still refining pricing and are currently evaluating models like per-GPU-under-management and tiered plans. We plan to publish transparent pricing once we’ve validated what works best for customers. In the meantime, we know “contact us” isn’t ideal.

Would love to hear from anyone running GPU clusters. What's the most tedious part of your setup? What would you actually trust an agent to do? What's off limits? Looking forward to feedback!

Comments URL: https://news.ycombinator.com/item?id=47401766

Points: 26

# Comments: 7

https://news.ycombinator.com/item?id=47401766

Extensions

Launch HN: Voygr (YC W26) – A better maps API for agents and AI apps

ymarkov Mar 16, 2026

Hi HN, we’re Yarik and Vlad from VOYGR (https://voygr.tech/), working on better real-world place intelligence for app developers and agents. Here’s a demo: https://www.youtube.com/watch?v=cNIpcWIE0n4.

Google Maps can tell you a restaurant is "4.2 stars, open till 10." Their API can't tell you the chef left last month, wait times doubled, and locals moved on. Maps APIs today just give you a fixed snapshot. We're building an infinite, queryable place profile that combines accurate place data with fresh web context like news, articles, and events.

Vlad worked on the Google Maps APIs as well as in ridesharing and travel. Yarik led ML/Search infrastructure at Apple, Google, and Meta powering products used by hundreds of millions of users daily. We realized nobody was treating place data freshness as infrastructure, so we're building it.

We started with one of the hardest parts - knowing whether a place is even real. Our Business Validation API (https://github.com/voygr-tech/dev-tools) tells you whether a business is actually operating, closed, rebranded, or invalid. We aggregate multiple data sources, detect conflicting signals, and return a structured verdict. Think of it as continuous integration, but for the physical world.

The problem: ~40% of Google searches and up to 20% of LLM prompts involve local context. 25-30% of places churn every year. The world doesn't emit structured "I closed" events - you have to actively detect it. As agents start searching, booking, and shopping in the real world, this problem gets 10x bigger - and nobody's building the infrastructure for it. We recently benchmarked how well LLMs handle local place queries (https://news.ycombinator.com/item?id=47366423) - the results were bad: even the best gets 1 in 12 local queries wrong

We're processing tens of thousands of places per day for enterprise customers, including leading mapping and tech companies. Today we're opening API access to the developer community. Please find details here: https://github.com/voygr-tech/dev-tools

We'd love honest feedback - whether it's about the problem, our approach, or where you think we're wrong. If you're dealing with stale place data in your own products, we'd especially love to hear what breaks. We're here all day, AMA.

Comments URL: https://news.ycombinator.com/item?id=47401042

Points: 81

# Comments: 60

https://news.ycombinator.com/item?id=47401042

Extensions

Launch HN: Captain (YC W26) – Automated RAG for Files

CMLewis Mar 13, 2026

Hi HN, we’re Lewis and Edgar, building Captain to simplify unstructured data search (https://runcaptain.com). Captain automates the building and maintenance of file-based RAG pipelines. It indexes cloud storage like S3 and GCS, plus SaaS sources like Google Drive. There’s a quick walkthrough at https://youtu.be/EIQkwAsIPmc.

We also put up this demo site called “Ask PG’s Essays” which lets you ask/search the corpus of pg’s essays, to get a feel for how it works: https://pg.runcaptain.com. The RAG part of this took Captain about 3 minutes to set up.

Here are some sample prompts to get a feel for the experience:

“When do we do things that don't scale? When should we be more cautious?” https://pg.runcaptain.com/?q=When%20do%20we%20do%20things%20...

“Give me some advice, I'm fundraising” https://pg.runcaptain.com/?q=Give%20me%20some%20advice%2C%20...

“What are the biggest advantages of Lisp” https://pg.runcaptain.com/?q=what%20are%20the%20biggest%20ad...

A good production RAG pipeline takes substantial effort to build, especially for file workloads. You have to handle ETL or text extraction, chunking, embedding, storage, search, re-ranking, inference, and often compliance and observability – all while optimizing for latency and reliability. It’s a lot to manage. grep works well in some cases, but for agents, semantic search provides significantly higher performance. Cursor uses both and reports 6.5%–23.5% accuracy gains from vector search over grep (https://cursor.com/blog/semsearch).

We’ve spent the past four years scaling RAG pipelines for companies, and Edgar’s work at Purdue’s NLP lab directly informed our chunking techniques. In conversations with dozens of engineers, we repeatedly saw DIY pipelines produce inconsistent results, even after weeks of tuning. Many teams lacked clarity on which retrieval strategies best fit their data.

We realized that a system to provision storage and embeddings, handle indexing, and continuously update pipelines to reflect the latest search techniques could remove the need for every team to rebuild RAG themselves. That idea became Captain.

In practice, one API call indexes URLs, cloud storage buckets, directories, or individual files. Under the hood, we’re converting everything to Markdown. For this, we’ve had good results with Gemini 3 Pro for images, Reducto for complex documents, and Extend for basic OCR. For embedding models, ‘gemini-embedding-001’ performed reasonably well at first, but we later switched to the Contextualized Embeddings from ‘voyage-context-3’. It produced more relevant results than even the newer Voyage 4 models because its chunk embeddings are encoded with awareness of the surrounding document context. We then applied Voyage’s ‘rerank-2.5’ as second-stage re-ranking, reducing 50 initial chunks to a final top 15 (configurable in Captain’s API). Dense embeddings are just half the picture and full-text search with RRF complete our hybrid retrieval. In the Captain API, these techniques are exposed through a single /query endpoint. Access controls can be configured via metadata filters, and page number citations are returned automatically.

The stack is constantly changing but the Captain API creates a standard interface for this. You can try Captain, 1 month for free, and build your own pipelines at https://runcaptain.com. We’re looking for candid feedback, especially anything that can make it more useful, and look forward to your comments!

Comments URL: https://news.ycombinator.com/item?id=47366011

Points: 57

# Comments: 38

https://news.ycombinator.com/item?id=47366011

Extensions

Launch HN: Spine Swarm (YC S23) – AI agents that collaborate on a visual canvas

a24venka Mar 13, 2026

Hey HN! We're Ashwin and Akshay from Spine AI (https://www.getspine.ai). Spine Swarm is a multi-agent system that works on an infinite visual canvas to complete complex non-coding projects: competitive analysis, financial modeling, SEO audits, pitch decks, interactive prototypes, and more. Here's a video of it in action: https://www.youtube.com/watch?v=R_2-ggpZz0Q.

We've been friends for over 13 years. We took our first ML course together at NTU, in a part of campus called North Spine, which is where the name comes from. We went through YC in S23 and have spent about 3 years building Spine across many product iterations.

The core idea: chat is the wrong interface for complex AI work. It's a linear thread, and real projects aren't linear. Sure, you can ask a chatbot to reference the financial model from earlier in the thread, or run research and market sizing together, but you're trusting the model to juggle that context implicitly. There's no way to see how it's connecting the pieces, no way to correct one step without rerunning everything, and no way to branch off and explore two strategies side by side. ChatGPT was a demo that blew up, and chat stuck around as the default interface, not because it's the right abstraction. We thought humans and agents needed a real workspace where the structure of the work is explicit and user-controllable, not hidden inside a context window.

So we built an infinite visual canvas where you think in blocks instead of threads. Each block is our abstraction on top of AI models. There are dedicated block types for LLM calls, image generation, web browsing, apps, slides, spreadsheets, and more. Think of them as Lego bricks for AI workflows: each one does something specific, but they can be snapped together and composed in many different ways. You can connect any block to any other block, and that connection guarantees the passing of context regardless of block type. The whole system is model-agnostic, so in a single workflow you can go from an OpenAI LLM call, to an image generation mode like Nano Banana Pro, to Claude generating an interactive app, each block using whatever model fits best. Multiple blocks can fan out from the same input, analyzing it in different ways with different models, then feed their outputs into a downstream block that synthesizes the results.

The first version of the canvas was fully manual. Users entered prompts, chose models, ran blocks, and made connections themselves. It clicked with founders and product managers because they could branch in different directions from the same starting point: take a product idea and generate a prototype in one branch, a PRD in another, a competitive critique in a third, and a pitch deck in a fourth, all sharing the same upstream context. But new users didn't want to learn the interface. They kept asking us to build a chat layer that would generate and connect blocks on their behalf, to replicate the way we were using the tool. So we built that, and in doing so discovered something we didn't expect: the agents were capable of running autonomously for hours, producing complete deliverables. It turned out agents could run longer and keep their context windows clean by delegating work to blocks and storing intermediary context on the canvas, rather than holding everything in a single context window.

Here's how it works now. When you submit a task, a central orchestrator decomposes it into subtasks and delegates each to specialized persona agents. These agents operate on the canvas blocks and can override default settings, primarily the model and prompt, to fit each subtask. Agents pick the best model for each block and sometimes run the same block with multiple models to compare and synthesize outputs. Multiple agents work in parallel when their subtasks don't have dependencies, and downstream agents automatically receive context from upstream work. The user doesn't configure any of this. You can also dispatch multiple tasks at once and the system will queue dependent ones or start independent ones immediately.

Agents aren't fully autonomous by default. Any agent can pause execution and ask the user for clarification or feedback before continuing, which keeps the human in the loop where it matters. And once agents have produced output, you can select a subset of blocks on the canvas and iterate on them through the chat without rerunning the entire workflow.

The canvas gives agents something that filesystems and message-passing don't: a persistent, structured representation of the entire project that any agent can read and contribute to at any point. In typical multi-agent systems, context degrades as it passes between agents. The canvas addresses this because agents store intermediary results in blocks rather than trying to hold everything in memory, and they leave explicit structured handoffs designed to be consumed efficiently by the next agent in the chain. Every step is also fully auditable, so you can trace exactly how each agent arrived at its conclusions.

We ran benchmarks to validate what we were seeing. On Google DeepMind's DeepSearchQA, which is 900 questions spanning 17 fields, each structured as a causal chain where each step depends on completing the previous one, Spine Swarm scored 87.6% on the full dataset with zero human intervention. For the benchmark we used a subset of block types relevant to the questions (LLM calls, web browsing, table) and removed irrelevant ones like document, spreadsheet, and slide generation. We also disabled human clarification so agents ran fully independently. The agents were not just auditable but also state of the art. The auditability also exposed actual errors in an older benchmark (GAIA Level 3), cases where the expected answer was wrong or ambiguous, which you'd never catch with a black-box pipeline. We detail the methodology, architecture, and benchmark errors in the full writeup: https://blog.getspine.ai/spine-swarm-hits-1-on-gaia-level-3-...

Benchmarks measure accuracy on closed-ended questions. Turns out the same architecture also leads to better open-ended outputs like decks, reports, and prototypes with minimal supervision. We've seen early users split into two camps: some watch the agents work and jump in to redirect mid-flow, others queue a task and come back to a finished deliverable. Both work because the canvas preserves the full chain of work, so you can audit or intervene whenever you want.

A good first task to try: give it your website URL and ask for a full SEO analysis, competitive landscape, and a prioritized growth roadmap with a slide deck. You'll see multiple agents spin up on the canvas simultaneously. People have also used it for fundraising pitch decks with financial models, prototyping features from screenshots and PRDs, competitive analysis reports and deep-dive learning plans that research a topic from multiple angles and produce structured material you can explore further.

Pricing is usage-based credits tied to block usage and the underlying models used. Agents tend to use more credits than manual workflows because they're tuned to get you the best possible outcome, which means they pick the best blocks and do more work. Details here: https://www.getspine.ai/pricing. There's a free tier, and one honest caveat: we sized it to let you try a real task, but tasks vary in complexity. If you run out before you've had a proper chance to explore, email us at founders@getspine.ai and we'll work with you.

We'd love your feedback on the experience: what worked, what didn't, and where it fell short. We're also curious how others here approach complex, multi-step AI work beyond coding. What tools are you using, and what breaks first? We'll be in the comments all day.

Comments URL: https://news.ycombinator.com/item?id=47364116

Points: 109

# Comments: 69

https://news.ycombinator.com/item?id=47364116

Extensions

Launch HN: IonRouter (YC W26) – High-throughput, low-cost inference

vshah1016 Mar 12, 2026

Hey HN — I’m Veer and my cofounder is Suryaa. We're building Cumulus Labs (YC W26), and we're releasing our latest product IonRouter (https://ionrouter.io/), an inference API for open-source and fine tuned models. You swap in our base URL, keep your existing OpenAI client code, and get access to any model (open source or finetuned to you) running on our own inference engine.

The problem we kept running into: every inference provider is either fast-but-expensive (Together, Fireworks — you pay for always-on GPUs) or cheap-but-DIY (Modal, RunPod — you configure vLLM yourself and deal with slow cold starts). Neither felt right for teams that just want to ship.

Suryaa spent years building GPU orchestration infrastructure at TensorDock and production systems at Palantir. I led ML infrastructure and Linux kernel development for Space Force and NASA contracts where the stack had to actually work under pressure. When we started building AI products ourselves, we kept hitting the same wall: GPU infrastructure was either too expensive or too much work.

So we built IonAttention — a C++ inference runtime designed specifically around the GH200's memory architecture. Most inference stacks treat GH200 as a compatibility target (make sure vLLM runs, use CPU memory as overflow). We took a different approach and built around what makes the hardware actually interesting: a 900 GB/s coherent CPU-GPU link, 452GB of LPDDR5X sitting right next to the accelerator, and 72 ARM cores you can actually use.

Three things came out of that that we think are novel: (1) using hardware cache coherence to make CUDA graphs behave as if they have dynamic parameters at zero per-step cost — something that only works on GH200-class hardware; (2) eager KV block writeback driven by immutability rather than memory pressure, which drops eviction stalls from 10ms+ to under 0.25ms; (3) phantom-tile attention scheduling at small batch sizes that cuts attention time by over 60% in the worst-affected regimes. We wrote up the details at cumulus.blog/ionattention.

On multimodal pipelines we get better performance than big players (588 tok/s vs. Together AI's 298 on the same VLM workload). We're honest that p50 latency is currently worse (~1.46s vs. 0.74s) — that's the tradeoff we're actively working on.

Pricing is per token, no idle costs: GPT-OSS-120B is $0.02 in / $0.095 out, Qwen3.5-122B is $0.20 in / $1.60 out. Full model list and pricing at https://ionrouter.io.

You can try the playground at https://ionrouter.io/playground right now, no signup required, or drop your API key in and swap the base URL — it's one line. We built this so teams can see the power of our engine and eventually come to us for their finetuned model needs using the same solution.

We're curious what you think, especially if you're running finetuned or custom models — that's the use case we've invested the most in. What's broken, what would make this actually useful for you?

Comments URL: https://news.ycombinator.com/item?id=47355410

Points: 72

# Comments: 37

https://news.ycombinator.com/item?id=47355410

Extensions

Launch HN: Sentrial (YC W26) – Catch AI agent failures before your users do

anayrshukla Mar 11, 2026

Hey HN! We're Neel and Anay, and we’re building Sentrial (https://sentrial.com). It’s production monitoring for AI products. We automatically detect failure patterns: loops, hallucinations, tool misuse, and user frustrations the moment they happen. When issues surface, Sentrial diagnoses the root cause by analyzing conversation patterns, model outputs, and tool interactions, then recommends specific fixes.

Here's a demo if you're interested: https://www.youtube.com/watch?v=cc4DWrJF7hk. When agents fail, choose wrong tools, or blow cost budgets, there's no way to know why - usually just logs and guesswork. As agents move from demos to production with real SLAs and real users, this is not sustainable.

Neel and I lived this, building agents at SenseHQ and Accenture where we found that debugging agents was often harder than actually building them. Agents are untrustworthy in prod because there’s no good infrastructure to verify what they’re actually doing.

In practice this looks like: - A support agent that began misclassifying refund requests as product questions, which meant customers never reached the refund flow. - A document drafting agent that would occasionally hallucinate missing sections when parsing long specs, producing confident but incorrect outputs. There’s no stack trace or 500 error and you only figure this out when a customer is angry.

We both realized teams were flying blind in production, and that agent native monitoring was going to be foundational infrastructure for every serious AI product. We started Sentrial as a verification layer designed to take care of this.

How it works: You wrap your client with our SDK in only a couple of lines. From there, we detect drift for you: - Wrong tool invocations - Misunderstood intents - Hallucinations - Quality regressions over time. You see it on our platform before a customer files a ticket.

There’s a quick mcp set up, just give claude code: claude mcp add --transport http Sentrial https://www.sentrial.com/docs/mcp

We have a free tier (14 days, no credit card required). We’d love any feedback from anyone running agents whether they be for personal use or within a professional setting.

We’ll be around in the comments!

Comments URL: https://news.ycombinator.com/item?id=47337659

Points: 31

# Comments: 14

https://news.ycombinator.com/item?id=47337659

Extensions

Launch HN: Prism (YC X25) – Workspace and API to generate and edit videos

aliu327 Mar 11, 2026

Hey HN — we’re Rajit, Land, and Alex. We’re building Prism (https://www.prismvideos.com), an AI video creation platform and API.

Here’s a quick demo of how you can remix any video with Prism: https://youtu.be/0eez_2DnayI

Here’s a quick demo of how you can automate UGC-style ads with Openclaw + Prism: https://www.youtube.com/watch?v=5dWaD23qnro

Accompanying skill.md file: https://docs.google.com/document/d/1lIskVljW1OqbkXFyXeLHRsfM...

Making an AI video today usually means stitching together a dozen tools (image generation, image-to-video, upscalers, lip-sync, voiceover, and an editor). Every step turns into export/import and file juggling, so assets end up scattered across tabs and local storage, and iterating on a multi-scene video is slow.

Prism keeps the workflow in one place: you generate assets (images/video clips) and assemble them directly in a timeline editor without downloading files between tools. Practically, that means you can try different models (Kling, Veo, Sora, Hailuo, etc) and settings for a single clip, swap it on the timeline, and keep iterating without re-exporting and rebuilding the edit elsewhere.

We also support templates and one-click asset recreation, so you can reuse workflows from us or the community instead of rebuilding each asset from scratch. Those templates are exposed through our API, letting your AI agents discover templates in our catalog, supply the required inputs, and generate videos in a repeatable way without manually stitching the workflow together.

We built Prism because we were making AI videos ourselves and were unsatisfied with the available tools. We kept losing time to repetitive “glue work” such as constantly downloading files, keeping track of prompts/versions, and stitching clips in a separate video editing software. We’re trying to make the boring parts of multi-step AI video creation less manual so users can generate → review → edit → assemble → export, all inside one platform.

Pricing is based on usage credits, with a free tier (100 credits/month) and free models, so you can try it without providing a credit card: https://prismvideos.com.

We’d love to hear from people who’ve tried making AI videos: where does your workflow break, what parts are the most tedious, and what do you wish video creation tools on the market could do?

Comments URL: https://news.ycombinator.com/item?id=47337548

Points: 40

# Comments: 18

https://news.ycombinator.com/item?id=47337548

Extensions

Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon

sanchitmonga22 Mar 10, 2026

Hi HN, we're Sanchit and Shubham (YC W26). We built a fast inference engine for Apple Silicon. LLMs, speech-to-text, text-to-speech – MetalRT beats llama.cpp, Apple's MLX, Ollama, and sherpa-onnx on every modality we tested. Custom Metal shaders, no framework overhead.

Also, we've open-sourced RCLI, the fastest end-to-end voice AI pipeline on Apple Silicon. Mic to spoken response, entirely on-device. No cloud, no API keys.

To get started:

  brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
  brew install rcli
  rcli setup   # downloads ~1 GB of models
  rcli         # interactive mode with push-to-talk

Or:

  curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

The numbers (M4 Max, 64 GB, reproducible via `rcli bench`):

LLM decode – 1.67x faster than llama.cpp, 1.19x faster than Apple MLX (same model files): - Qwen3-0.6B: 658 tok/s (vs mlx-lm 552, llama.cpp 295) - Qwen3-4B: 186 tok/s (vs mlx-lm 170, llama.cpp 87) - LFM2.5-1.2B: 570 tok/s (vs mlx-lm 509, llama.cpp 372) - Time-to-first-token: 6.6 ms

STT – 70 seconds of audio transcribed in *101 ms*. That's 714x real-time. 4.6x faster than mlx-whisper.

TTS – 178 ms synthesis. 2.8x faster than mlx-audio and sherpa-onnx.

We built this because demoing on-device AI is easy but shipping it is brutal. Voice is the hardest test: you're chaining STT, LLM, and TTS sequentially, and if any stage is slow, the user feels it. Most teams fall back to cloud APIs not because local models are bad, but because local inference infrastructure is.

The thing that's hard to solve is latency compounding. In a voice pipeline, you're stacking three models in sequence. If each adds 200ms, you're at 600ms before the user hears a word, and that feels broken. You can't optimize one stage and call it done. Every stage needs to be fast, on one device, with no network round-trip to hide behind.

We went straight to Metal. Custom GPU compute shaders, all memory pre-allocated at init (zero allocations during inference), and one unified engine for all three modalities instead of stitching separate runtimes together.

MetalRT is the first engine to handle all three modalities natively on Apple Silicon. Full methodology:

LLM benchmarks: https://www.runanywhere.ai/blog/metalrt-fastest-llm-decode-e...

Speech benchmarks: https://www.runanywhere.ai/blog/metalrt-speech-fastest-stt-t...

How: Most inference engines add layers between you and the GPU: graph schedulers, runtime dispatchers, memory managers. MetalRT skips all of it. Custom Metal compute shaders for quantized matmul, attention, and activation - compiled ahead of time, dispatched directly.

Voice Pipeline optimizations details: https://www.runanywhere.ai/blog/fastvoice-on-device-voice-ai... RAG optimizations: https://www.runanywhere.ai/blog/fastvoice-rag-on-device-retr...

RCLI is the open-source voice pipeline (MIT) built on MetalRT: three concurrent threads with lock-free ring buffers, double-buffered TTS, 38 macOS actions by voice, local RAG (~4 ms over 5K+ chunks), 20 hot-swappable models, and a full-screen TUI with per-op latency readouts. Falls back to llama.cpp when MetalRT isn't installed.

Source: https://github.com/RunanywhereAI/RCLI (MIT)

Demo: https://www.youtube.com/watch?v=eTYwkgNoaKg

What would you build if on-device AI were genuinely as fast as cloud?

Comments URL: https://news.ycombinator.com/item?id=47326101

Points: 240

# Comments: 153

https://news.ycombinator.com/item?id=47326101

Extensions

Launch HN: Didit (YC W26) – Stripe for Identity Verification

rosasalberto Mar 10, 2026

Hi HN, I’m Alberto. I co-founded Didit (https://didit.me) with my identical twin brother Alejandro. We are building a unified identity layer—a single integration that handles KYC, AML, biometrics, authentication, and fraud prevention globally. Here’s a demo: https://www.youtube.com/watch?v=eTdcg7JCc4M&t=7s.

Being identical twins, we’ve spent our whole lives dealing with identity confusion, so it is a bit of irony that we ended up building a company to solve it for the internet.

Growing up in Barcelona, we spent years working on products where identity issues were a massive pain. We eventually realized that for most engineering teams, "global identity" is a fiction—in reality it is a fragmented mess. You end up stitching together one provider for US driver's licenses, another for NFC chip extraction in Europe, a third for AML screening, a fourth for government database validation in Brazil, a fifth for liveness detection on low-end Android devices, and yet another for biometric authentication and age estimation. Orchestrating these into a cohesive flow while adapting to localized regulations like GDPR or CCPA is a nightmare that makes no sense for most teams to be working on.

When we looked at the existing "enterprise" solutions, we were baffled. Most require a three-week sales cycle just to see a single page of documentation. Pricing is hidden behind "Contact Us" buttons, and the products themselves are often bloated legacy systems with high latency and abysmal accuracy.

We also noticed a recurring pattern: these tools are frequently optimized only for the latest iOS hardware, performing poorly on the mid-range or older Android devices that make up a huge percentage of the market. This results in a "leaky" funnel where legitimate users drop off due to technical friction and fraud goes undetected because data points are spread across disparate systems. Also, these systems are expensive, often requiring massive annual commits that price out early-stage startups.

We wanted to build a system that is accessible to everyone—a tool that works like Stripe for identity, where you can get a sandbox key in thirty seconds and start running real verifications with world-class UX and transparent pricing.

To solve this, we took the "delusional" path of full vertical integration. Rather than just wrapping existing APIs, we built our own ID verification and biometric AI models—from classification and fraud detection to OCR models for almost every language. This vertical integration is fundamental to how we handle user data. Because we own the entire stack, we control the flow of sensitive information from end-to-end. Your users' data doesn't get bounced around through a chain of third-party black boxes or regional middle-men. This allows us to provide a level of security and privacy that is impossible when you are just an orchestration layer for other people's APIs.

We believe that identity verification is one of the most critical problems on the internet, and must be solved correctly and ethically. Many people are rightfully skeptical, especially given recent news about projects that have turned identity into a tool for mass data collection or surveillance. We don’t do anything of the sort, but we also don’t want to be coerced in the future, so we facilitate data minimization on the customer side. Instead of a business asking for a full ID scan, we allow them to simply verify a specific attribute—like "is this person over 18?"—without ever seeing the document itself. Our goal is to move the industry away from data hoarding and toward zero knowledge, or at least minimal knowledge, verification.

The result of our all-in-one approach is a platform that increases onboarding rates while lowering identity costs. We’ve focused on building a high-confidence automated loop that reduces the need for manual review by up to 90%, catching sophisticated deepfakes and spoofing attempts that standard vision models miss. Our SDK is optimized for low bandwidth connections, ensuring it works on spotty 3G networks where legacy providers usually fail.

We are fully live, and you can jump into the dashboard at https://business.didit.me to see the workflow orchestration immediately. Our pricing is transparent and success-based; we don’t believe in hiding costs behind a sales call.

We’re here all day to answer any question—whether it’s about how we handle NFC verification, our approach to deepfake detection, the general ethics behind biometric data retention, or how we think about the future of identity. We’d love your brutal HN feedback on our APIs, platform, and integration flow!

Comments URL: https://news.ycombinator.com/item?id=47324296

Points: 77

# Comments: 66

https://news.ycombinator.com/item?id=47324296

Extensions

Launch HN: Terminal Use (YC W26) – Vercel for filesystem-based agents

filipbalucha Mar 9, 2026

Hello Hacker News! We're Filip, Stavros, and Vivek from Terminal Use (https://www.terminaluse.com/). We built Terminal Use to make it easier to deploy agents that work in a sandboxed environment and need filesystems to do work. This includes coding agents, research agents, document processing agents, and internal tools that read and write files.

Here's a demo: https://www.youtube.com/watch?v=ttMl96l9xPA.

Our biggest pain point with hosting agents was that you'd need to stitch together multiple pieces: packaging your agent, running it in a sandbox, streaming messages back to users, persisting state across turns, and managing getting files to and from the agent workspace.

We wanted something like Cog from Replicate, but for agents: a simple way to package agent code from a repo and serve it behind a clean API/SDK. We wanted to provide a protocol to communicate with your agent, but not constraint the agent logic or harness itself.

On Terminal Use, you package your agent from a repo with a config.yaml and Dockerfile, then deploy it with our CLI. You define the logic of three endpoints (on_create, on_event, and on_cancel) which track the lifecycle of a task (conversation). The config.yaml contains details about resources, build context, etc.

Out of the box, we support Claude Agent SDK and Codex SDK agents. By support, we mean that we have an adapter that converts from the SDK message types to ours. If you'd like to use your own custom harness, you can convert and send messages with our types (Vercel AI SDK v6 compatible). For the frontend, we have a Vercel AI SDK provider that lets you use your agent with Vercel's AI SDK, and have a messages module so that you don't have to manage streaming and persistence yourself.

The part we think is most different is storage.

We treat filesystems as first-class primitives, separate from the lifecycle of a task. That means you can persist a workspace across turns, share it between different agents, or upload / download files independent of the sandbox being active. Further, our filesystem SDK provides presigned urls which makes it easy for your users to directly upload and download files which means that you don't need to proxy file transfer through your backend.

Since your agent logic and filesystem storage are decoupled, this makes it easy to iterate on your agents without worrying about the files in the sandbox: if you ship a bug, you can deploy and auto-migrate all your tasks to the new deployment. If you make a breaking change, you can specify that existing tasks stay on the existing version, and only new tasks use the new version.

We're also adding support for multi-filesystem mounts with configurable mount paths and read/write modes, so storage stays durable and reusable while mount layout stays task-specific.

On the deployment side, we've been influenced by modern developer platforms: simple CLI deployments, preview/production environments, git-based environment targeting, logs, and rollback. All the configuration you need to build, deploy & manage resources for your agent is stored in the config.yaml file which makes it easy to build & deploy your agent in CI/CD pipelines.

Finally, we've explicitly designed our platform for your CLI coding agents to help you build, test, & iterate with your agents. With our CLI, your coding agents can send messages to your deployed agents, and download filesystem contents to help you understand your agent's output. A common way we test our agents is that we make markdown files with user scenarios we'd like to test, and then ask Claude Code to impersonate our users and chat with our deployed agent.

What we do not have yet: full parity with general-purpose sandbox providers. For example, preview URLs and lower-level sandbox.exec(...) style APIs are still on the roadmap.

We're excited to hear any thoughts, insights, questions, and concerns in the comments below!

Comments URL: https://news.ycombinator.com/item?id=47311657

Points: 115

# Comments: 72

https://news.ycombinator.com/item?id=47311657

Extensions

Launch HN: Palus Finance (YC W26): Better yields on idle cash for startups, SMBs

sam_palus Mar 6, 2026

Hi HN! We’re Sam and Michael from Palus Finance (https://palus.finance). We’re building a treasury management platform for startups and SMBs to earn higher yields with a high-yield bond portfolio.

We were funded by YC for a consumer-focused product for higher-yield savings. But when we joined YC and got our funding, we realized we needed the product for our own startup’s cash reserves, and other startups in the batch started telling us they wanted this too.

We realized that traditional startup treasury products do much the same thing: open a brokerage account, sweep your cash into a money market fund (MMF), and charge a management fee. No strategy involved. (There is actually one widely-advertised treasury product that differentiates on yield, but instead of an MMF it uses a mutual fund where your principal is at considerable risk – it had a 9% loss in 2022 that took years to recover.)

I come from a finance background, so this norm felt weird to me. The typical startup cashflow pattern is a large infusion from a raise covering 18–24 months of burn, drawn down gradually. That's a lot of capital sitting idle for a long time, where even a modest yield improvement compounds into real money.

MMFs are the lowest rung of what's available in fixed income. Yes, they’re very safe and liquid, but when you leave your whole treasury in one, you’re giving up yield to get same-day liquidity on cash you won’t touch for six months or more. Big companies have treasury teams that actively manage their holdings and invest in a range of safe assets to maximize yield. But those sophisticated bond portfolios were just never made accessible to startups. That’s what we’re building.

Our bond portfolio holds short-duration floating-rate agency mortgage-backed securities (MBS), which are an ideal, safe, high-yielding asset for long-term startup cash reserves under most circumstances.[1]

The bond portfolio is managed by Regan Capital, which runs MBSF, the largest floating-rate agency MBS ETF in the country. Right now we're using MBSF to generate yields for customers (you can see its historical returns, including dividends, here: https://totalrealreturns.com/n/USDOLLAR,MBSF). We're working with Regan to set up a dedicated account with the same strategy, which will let us reduce fees and give each startup direct ownership of the underlying securities. All assets are held with an SEC-licensed custodian.

Based on historical returns, we target 4.5–5% returns vs. roughly 3.5% from most money market funds.[2] Liquidity is typically available in 1-2 business days. We will charge a flat 0.25% annual fee on AUM, compared to the 0.15–0.60%, depending on balance, charged by other treasury providers.

We think that startup banking products themselves (Brex, Mercury, etc.) are genuinely good at what they do: payments, payroll, card management. The problem is the treasury product bundled with them, not the bank. So rather than building another neobank, we built Palus to connect to your existing bank account via Plaid. Our goal was to create the simplest possible UX for this product: two buttons and a giant number that goes up.

See here: https://www.youtube.com/watch?v=8Q_gwSqtnxM

We are live with early customers from within YC, and accepting new customers on a rolling basis; you can sign up at https://palus.finance/.

We'd love feedback from founders who've thought about idle cash management or people with a background in fixed-income and structured products. Happy to go deep in the comments.

[1] Agency MBS are pools of residential mortgages guaranteed by federal government agencies (Ginnie Mae, Fannie Mae, and Freddie Mac). It's a $9T market with the same government backing and AAA/AA+ rating as the Treasuries in a money market fund. No investor has ever lost money in agency MBS due to borrower default.

It's worth acknowledging that many people associate “mortgage-backed securities” with the 2008 financial crisis. But the assets that blew up in 2008 were private-label MBS, bundles of risky subprime mortgages without federal guarantees. Agency MBS holders suffered no credit losses during the crisis, and post-2008 underwriting standards became even stricter. If anything, 2008 was evidence for the safety of agency MBS, not against it.

The agency guarantee eliminates credit risk. Our short-duration, floating-rate strategy addresses the other main risk: price risk. Fixed-rate bonds lose value when rates rise, but floating-rate bonds reset their coupon based on the SOFR benchmark, protecting against interest rate movements.

[2] This comes from the historical spread between MMFs and floating-rate agency MBS; MMFs typically pay very close to SOFR, while the MBS pay SOFR + 1 to 1.5%. This means that if the Federal Reserve changes interest rates and SOFR moves, both asset types will move by about the same amount, and that 1-1.5% premium will remain.

This post is for educational purposes only and does not constitute financial, investment, or legal advice. Past performance does not guarantee future results. Yields and spreads referenced are approximate and based on historical data.

Comments URL: https://news.ycombinator.com/item?id=47278980

Points: 62

# Comments: 92

https://news.ycombinator.com/item?id=47278980

Extensions

Launch HN: Vela (YC W26) – AI for complex scheduling

Gobhanu Mar 5, 2026

Hi HN! We're Gobhanu and Saatvik (brothers), building Vela (https://tryvela.ai) - AI agents that handle multi-party, multi-channel scheduling.

Scheduling is a constraint satisfaction problem disguised as email! It’s easy when it’s two people, one timezone, one channel. But it becomes a constraint satisfaction problem when inputs are unstructured natural language across multiple communication channels, constraints change mid-solve, and the objective function includes social dynamics that don't exist formally anywhere.

What if scheduling just happened? For example: a recruiter sends one message, and every interview across five candidates, three hiring managers, and two time zones gets booked, confirmed, and updated automatically. No links, no back-and-forth, no one spending hours with 20 emails. Everyone just gets the right invite at the right time, on whatever channel they actually use. That's what we built Vela to do.

You loop in Vela into your emails, SMS, WhatsApp, Slack, phone or integrate into an ATS etc and it takes over: reads context, checks calendars, proposes times, follows up when people ghost, and rebooks when things shift.

One of our first customers is a staffing firm that searched for a scheduling solution for almost eight years. Their coordinators manage hundreds of candidate-client interviews where each side needs separate email threads, separate Zoom accounts to avoid double-booking links, and calendar invites connecting parties who never directly communicate. A client reschedules one interview and it cascades into four others. A candidate responds on SMS to a thread that started on email. Vela solved this in just 10 minutes of onboarding.

The hardest part has been the data problem. Scheduling behavior varies enormously across populations. C-suite folks respond to email within hours and expect formal 3-option proposals. Truck drivers applying for logistics roles respond to SMS at odd hours from shared devices with "y tm wrks." The failure mode isn't parsing -- it's applying the wrong interaction pattern for the wrong segment and watching the conversation die. We've been building behavioral datasets from thousands of real interactions: response latency by role, channel preference by demographic, follow-up timing curves, how many options to propose before you hit decision paralysis. This data doesn't exist anywhere.

The core agent challenge is state across channels. When someone responds on SMS to a thread that started in email, Vela needs to unify identity, merge context, and continue without losing information. Phone numbers don't map cleanly to emails, people use nicknames on text, shared devices mean the responder might not be who you reached out to. Temporal NLU is its own problem -- "next Friday" means different things on Monday versus Thursday. We extract structured constraints from natural language and resolve against calendar state. When ambiguity can't be resolved, Vela asks -- but deciding when to ask versus infer depends on the stakes of getting it wrong.

We're live with paying enterprise customers and every client still surfaces edge cases that surprise us. Case studies on our site (https://tryvela.ai/case-studies/). You can check out a demo here: https://www.youtube.com/watch?v=MzUOjSG5Uvw.

We'd love feedback from anyone who's worked on multi-agent coordination, conversational AI across channels, or constraint satisfaction in messy real-world domains. Looking forward to your comments!

Comments URL: https://news.ycombinator.com/item?id=47264741

Points: 59

# Comments: 45

https://news.ycombinator.com/item?id=47264741

Extensions

Launch HN: Cekura (YC F24) – Testing and monitoring for voice and chat AI agents

atarus Mar 3, 2026

Hey HN - we're Tarush, Sidhant, and Shashij from Cekura (https://www.cekura.ai). We've been running voice agent simulation for 1.5 years, and recently extended the same infrastructure to chat. Teams use Cekura to simulate real user conversations, stress-test prompts and LLM behavior, and catch regressions before they hit production.

The core problem: you can't manually QA an AI agent. When you ship a new prompt, swap a model, or add a tool, how do you know the agent still behaves correctly across the thousands of ways users might interact with it? Most teams resort to manual spot-checking (doesn't scale), waiting for users to complain (too late), or brittle scripted tests.

Our answer is simulation: synthetic users interact with your agent the way real users do, and LLM-based judges evaluate whether it responded correctly - across the full conversational arc, not just single turns. Three things make this actually work: Scenario generation + real conversation import - Our scenario generation agent bootstraps your test suite from a description of your agent. But real users find paths no generator anticipates, so we also ingest your production conversations and automatically extract test cases from them. Your coverage evolves as your users do.

Mock tool platform - Agents call tools. Running simulations against real APIs is slow and flaky. Our mock tool platform lets you define tool schemas, behavior, and return values so simulations exercise tool selection and decision-making without touching production systems.

Deterministic, structured test cases - LLMs are stochastic. A CI test that passes "most of the time" is useless. Rather than free-form prompts, our evaluators are defined as structured conditional action trees: explicit conditions that trigger specific responses, with support for fixed messages when word-for-word precision matters. This means the synthetic user behaves consistently across runs - same branching logic, same inputs - so a failure is a real regression, not noise.

Cekura also monitors your live agent traffic. The obvious alternative here is a tracing platform like Langfuse or LangSmith - and they're great tools for debugging individual LLM calls. But conversational agents have a different failure mode: the bug isn't in any single turn, it's in how turns relate to each other. Take a verification flow that requires name, date of birth, and phone number before proceeding - if the agent skips asking for DOB and moves on anyway, every individual turn looks fine in isolation. The failure only becomes visible when you evaluate the full session as a unit. Cekura is built around this from the ground up. Where tracing platforms evaluate turn by turn, Cekura evaluates the full session. Imagine a banking agent where the user fails verification in step 1, but the agent hallucinates and proceeds anyway. A turn-based evaluator sees step 3 (address confirmation) and marks it green - the right question was asked. Cekura's judge sees the full transcript and flags the session as failed because verification never succeeded.

Try us out at https://www.cekura.ai - 7-day free trial, no credit card required. Paid plans from $30/month.

We also put together a product video if you'd like to see it in action: https://www.youtube.com/watch?v=n8FFKv1-nMw. The first minute dives into quick onboarding - and if you want to jump straight to the results, skip to 8:40.

Curious what the HN community is doing - how are you testing behavioral regressions in your agents? What failure modes have hurt you most? Happy to dig in below!

Comments URL: https://news.ycombinator.com/item?id=47232903

Points: 89

# Comments: 21

https://news.ycombinator.com/item?id=47232903

Extensions

Launch HN: OctaPulse (YC W26) – Robotics and computer vision for fish farming

rohxnsxngh Mar 2, 2026

Hi HN! My name is Rohan and, together with Paul, I’m the co-founder of OctaPulse (https://www.tryoctapulse.com/). We’re building a robotics layer for seafood production, starting with automated fish inspection. We are currently deployed at our first production site with the largest trout producer in North America.

You might be wondering how the heck we got into this with no background in aquaculture or the ocean industry. We are both from coastal communities. I am from Goa, India and Paul is from Malta and Puerto Rico. Seafood is deeply tied to both our cultures and communities. We saw firsthand the damage being done to our oceans and how wild fish stocks are being fished to near extinction. We also learned that fish is the main protein source for almost 55% of the world's population. Despite it not being huge consumption in America it is massive globally. And then we found out that America imports 90% of its seafood. What? That felt absurd. That was the initial motivation for starting this company.

Paul and I met at an entrepreneurship happy hour at CMU. We met to talk about ocean tech. It went on for three hours. I was drawn to building in the ocean because it is one of the hardest engineering domains out there. Paul had been researching aquaculture for months and kept finding the same thing: a $350B global industry with less data visibility than a warehouse. After that conversation we knew we wanted to work on this together.

Hatcheries, the early stage on-land part of production, are full of labor intensive workflows that are perfect candidates for automation. Farmers need to measure their stock for feeding, breeding, and harvest decisions but fish are underwater and get stressed when handled. Most farms still sample manually. They net a few dozen fish, anesthetize them, place them on a table to measure one by one, and extrapolate to populations of hundreds of thousands. It takes about 5 minutes per fish and the data is sparse.

When we saw this process we were baffled. There had to be a better way. This was the starting point that really kicked us off.

Here is the thing though. Most robots are not built to handle humid and wet environments. Salt water is the enemy of anything mechanical. Corrosion is such a pain to deal with. Don't get me started on underwater computer vision which has to parse through water turbidity and particles. Fish move unpredictably and deform while swimming. Occlusion is constant. Calibration is tricky in uncontrolled setups. Handling live fish with robotics is another challenge that hasn't really been solved before. Fish are slippery, fragile, and stress easily. All of this is coupled with the requirement that all materials must be food safe.

On the vision side we are using Luxonis OAK cameras which give us depth plus RGB in a compact form factor. The onboard Myriad X VPU lets us run lightweight inference directly on the camera for things like detection and tracking without needing to send raw frames over USB constantly. For heavier workloads like segmentation and keypoint extraction we bump up to Nvidia Jetsons. We have tested on the Orin Nano and Orin NX depending on power and thermal constraints at different sites.

The models themselves are CNN and transformer based architectures. We are running YOLO variants for detection, custom segmentation heads for body outlines, and keypoint models for anatomical landmarks. The tricky part is getting these to run fast enough on edge hardware. We are using a mix of TensorRT, OpenVINO, and ONNX Runtime depending on the deployment target. Quantization has been a whole journey. INT8 quantization on TensorRT gives us the speed we need but you have to be careful about accuracy degradation especially on the segmentation outputs where boundary precision matters. We spent a lot of time building calibration datasets that actually represent the variance we see on farms. Lighting changes throughout the day, water clarity shifts, fish density varies. Your calibration set needs to capture all of that or your quantized model falls apart in production.

There is no wifi at most of these farms so we are using Starlink for connectivity in remote or offshore locations. Everything runs locally first and syncs when connection is available. We are not streaming video to the cloud. All inference happens on device.

Behind the scenes we have been building our own internal tooling for labeling, task assignment, and model management. Early on we tried existing labeling platforms but they did not fit our workflow. We needed tight integration between labeling, training pipelines, and deployment. So we built our own system where we can assign labeling tasks to annotators, track progress, version datasets, and push models to edge devices with a single command. It is not fancy but it keeps everything under our control and makes iteration fast. When you are trying to close the loop between data collection on farm, labeling, training, quantization, and deployment you cannot afford to have fragmented tooling. We needed one system that handles all of it.

On the robotics side we are building custom enclosures around off the shelf components and modifying delta robots with soft robotics grippers for handling. Vacuum and typical gripper actuation will not work in this environment so we are using compliant grippers that can safely handle fish without damaging them. We started with the Delta X S as our test platform and are evaluating whether to move to industrial delta robots or build our own from scratch once we validate the kinematics and payload requirements in wet and humid environments. The end effector design is still evolving. Fish come in different sizes and body shapes depending on species and life stage so we need grippers that can adapt.

Right now we are focused on operations outside the water. Hatchery phenotyping, sorting, quality inspection. These are more accessible than full underwater deployment and cheaper to start with. The idea is that if we can combine genetics data, environmental data, and phenotypic imagery we can help farms identify which fish to breed and which to cull. This is where selective breeding starts.

Something that surprised us early on: only a tiny fraction of farmed fish species have been through genetic improvement programs. Chickens grow 4x faster than they did in 1950 because of decades of selective breeding. But most farmed fish are essentially wild genetics. The opportunity to improve aquaculture genetics is massive but it is completely bottlenecked on measurement. You cannot improve what you cannot measure, and farms can barely measure anything at scale so far.

The industry moves on trust though. We are dealing with live animals and farms are cautious about who they let near their stock. Coming from outside aquaculture, that trust had to be earned. Paul was already a Future Leader with the Coalition for Sustainable Aquaculture but the real turning point was attending World Aquaculture Society, the largest conference in the US. Through a connection of a connection he met the incoming lead geneticist at what became our first customer. That relationship turned into a paid pilot with the largest trout producer in North America.

I previously worked at ASML, Nvidia, Tesla, and Toyota. Paul worked at Bloomberg. We met at CMU and immediately knew that we wanted to tackle this problem and put our life's work into this.

We would love feedback from any of you who have worked on computer vision in harsh or unpredictable environments, edge deployment on constrained hardware, or gentle and appropriate handling of live animals with robotics. If you are running inference on Jetsons or OAK cameras and have opinions on quantization workflows we would love to hear what has worked for you. If you have aquaculture experience we are curious what problems we should be thinking about that we haven't encountered yet.

Dang told us you’re all used to demo videos but unfortunately we can’t share them due to NDAs. But here’s a photo of us building our initial dataset for phenotyping and morphometric analysis: https://drive.google.com/file/d/1z3oSlB8ed9hanrybzP24XTfjDJE....

This is a weird industry to be building in and we are learning something new every week. If you have experience with edge deployment, robotics in wet environments, or aquaculture itself we would love to hear your perspective. And if you just have questions about fish or the tech we are happy to go deep in the comments. Excited to hear what this community thinks.

Comments URL: https://news.ycombinator.com/item?id=47220320

Points: 111

# Comments: 58

https://news.ycombinator.com/item?id=47220320

Extensions

Launch HN: Cardboard (YC W26) – Agentic video editor

sxmawl Feb 26, 2026

Hey HN - we're Saksham and Ishan, and we’re building Cardboard (https://www.usecardboard.com). It lets you go from raw footage to an edited video by describing what you want in natural language. There’s a demo video at https://www.usecardboard.com/share/fUN2i9ft8B46, and you can try the product out at https://demo.usecardboard.com (no login required!)

People sit on mountains of raw assets - product walkthroughs, customer interviews, travel videos, screen recordings, changelogs, etc. - that could become testimonials, ads, vlogs, launch videos, etc.

Instead they sit in cloud storage / hard drives because getting to a first cut takes hours of scrubbing through the raw footage manually, arranging clips in correct sequence, syncing music, exporting, uploading to a cloud storage to share, and then getting feedback on WhatsApp/iMessage/Slack, then re-doing the same thing again till everyone is happy.

We grew up together and have been friends for 15 years. Saksham creates content on socials with ~250K views/month and kept hitting the wall where editing took longer than creating. Ishan was producing launch videos for HackerRank's all-hands demo days and spent most of his time on cuts and sequencing rather than storytelling. We both felt that while tools like Premiere Pro and DaVinci are powerful, they have a steep learning curve and involve lots of manual labor.

So we built Cardboard. You tell it to "make a 60s recap from this raw footage" or "cut this into a 20s ad" or "beat-sync this to the music I just added" and it proposes a first draft on the timeline that you can refine further.

We built a custom hardware-accelerated renderer on WebCodecs / WebGL2, there’s no server-side rendering, no plugins, everything runs in your browser (client-side). Video understanding tasks go through a series of Cloud VLMs + traditional ML models, and we use third party foundational models for agent orchestration. We also give a dropdown for this to the end user.

We've shipped 13 releases since November (https://www.usecardboard.com/changelog). The editor handles multi-track timelines with keyframe animations, shot detection, beat sync via percussion detection, voiceover generation, voice cloning, background removal, multilingual captions that are spatially aware of subjects in frame, and Premiere Pro/DaVinci/FCP XML exports so you can move projects into your existing tools if you want.

Where we're headed next: real-time collaboration (video git) to avoid inefficient feedback loops, and eventually a prediction engine that learns your editing patterns and suggests the next low entropy actions - similar to how Cursor's tab completion works, but for timeline actions.

We believe that video creation tools today are stuck where developer tools were in the early 2000s: local-first, zero collaboration with really slow feedback loops.

Here are some videos that we made with Cardboard: - https://www.usecardboard.com/share/YYsstWeWE9KI - https://www.usecardboard.com/share/nyT9oj93sm1e - https://www.usecardboard.com/share/xK9mP2vR7nQ4

We would love to hear your thoughts/feedback.

We'll be in the comments all day :)

Comments URL: https://news.ycombinator.com/item?id=47170174

Points: 132

# Comments: 83

https://news.ycombinator.com/item?id=47170174

Extensions

Launch HN: TeamOut (YC W22) – AI agent for planning company retreats

vincentalbouy Feb 25, 2026

Hi HN, I’m Vincent, CTO of TeamOut (https://www.teamout.com/). We build an AI agent that plans company events from start to finish entirely through conversation. Similar to how Lovable helps build websites through chat, we apply that approach to event planning. Our system handles venue sourcing, vendor coordination, flight cost estimation, itinerary building, and overall project management.

Here’s a demo: https://www.youtube.com/watch?v=QVyc-x-isjI. The product is live at https://app.teamout.com/ai and does not require signup.

We went through YC in 2022 but did not launch on HN at the time. Back then, the product was more traditional, closer to an Airbnb-style search marketplace. Over the past two years, after helping organize more than 1,200 events, we rebuilt the core system around an agent architecture that directly manages the planning process. With this new version live, it felt like the right moment to share it here since it represents a fundamentally different approach to planning events.

The problem: Planning a company retreat usually means choosing between three imperfect options: (1) Hire an event planner and pay significant fees and venue markups; (2) Do it yourself and spend dozens of hours on research, emails, and negotiation; or (3) Use tools like Airbnb that are not designed for group logistics or meeting space.

The difficulty is not just finding a venue. Even for 30 to 50 people, planning turns into weeks of back-and-forth emails for quotes, comparing inconsistent pricing across PDFs, and tracking budgets in spreadsheets. It becomes an ongoing coordination problem with evolving constraints and slow, asynchronous vendor responses. Most existing software is form-driven, but the real workflow is conversational and stateful.

Offsites are expensive and high stakes. A single event can represent a significant chunk of a team’s annual budget, and mistakes show up directly as cost overruns or poor experiences. Founders and operators often end up spending time on event logistics instead of their actual work.

I ran into this while organizing retreats at a previous company. Before TeamOut, I worked as an AI researcher at IBM on NLP and machine learning systems. Sitting inside long email threads and cost spreadsheets, it did not look like a marketplace gap to me. It looked like a reasoning and state management problem. As large language models improved at multi-step reasoning and tool use, it became realistic to automate the coordination layer itself.

Our Solution: The core agent relies on a combination of models such as Gemini, Claude, and GPT. A central LLM-based agent maintains planning context across turns and decides which specialized tool to call next. Each tool has a specific responsibility: - Venue search and filtering - Cost estimations (accommodation + flights) - Budget comparisons - Quote and outreach flows - Communication tool with our team

For venue recommendations across more than 10,000 venues, we do not rely purely on the language model. We embed both user requirements and venues into vector representations and retrieve candidates using similarity search. Hard constraints such as capacity and dates are applied first, and results are ranked before being presented.

On the interface side, we use a split layout: conversation on the left and structured results on the right. As you refine the plan in chat, the event updates in real time, allowing an iterative workflow rather than a static search experience.

What is different is that we treat event planning as a stateful coordination problem rather than a one-shot search query. The agent orchestrates tools, manages evolving constraints, and surfaces trade-offs explicitly. It does not invent venues or fabricate pricing, and it is not designed to replace human planners for very large or highly customized events.

We make money from commissions on venue bookings. It is free for teams to explore options and plan. If you’ve organized an offsite or large meetup before, I’d genuinely value your perspective. Where would you expect this to fail? What edge cases are we underestimating? Where wouldn’t you trust an agent to handle the details?

My engineering team and I will be here all day to answer questions, happy to go deep on architecture, tradeoffs, and lessons learned. We’d really appreciate your candid feedback.

Comments URL: https://news.ycombinator.com/item?id=47151598

Points: 55

# Comments: 61

https://news.ycombinator.com/item?id=47151598

Extensions

Launch HN: Sonarly (YC W26) – AI agent to triage and fix your production alerts

Dimittri Feb 17, 2026

Hey HN, I am Dimittri and we’re building Sonarly (https://sonarly.com), an AI engineer for production. It connects to your observability tools like Sentry, Datadog, or user feedback channels, triages issues, and fixes them to cut your resolution time. Here's a demo: https://www.youtube.com/watch?v=rr3VHv0eRdw.

Sonarly is really about removing the noise from production alerts by grouping duplicates and returning a root cause analysis to save time to on-call engineers and literally cut your MTTR.

Before starting this company, my co-founder and I had a B2C app in edtech and had, some days, thousands of users using the app. We pushed several times a day, relying on user feedback. Then we set up Sentry, it was catching a lot of bugs, but we had up to 50 alerts a day. With 2 people it's a lot. We took a lot of time filtering the noise to find the real signal so we knew which bug to focus on.

At the same time, we saw how important it is to fix a bug fast when it hits users. A bug means in the worst case a churn and at best a frustrated user. And there are always bugs in production, due to code errors, database mismatches, infrastructure overload, and many issues are linked to a specific user behavior. You can't catch all these beforehand, even with E2E tests or AI code reviews (which catch a lot of bugs but obviously not all, plus it takes time to run at each deployment). This is even more true with vibe-coding (or agentic engineering).

We started Sonarly with this idea. More software than ever is being built and users should have the best experience possible on every product. The main idea of Sonarly is to reduce the MTTR (Mean Time To Repair).

We started by recreating a Sentry-like tool but without the noise, using only text and session replays as the interface. We built our own frontend tracker (based on open-source rrweb) and used the backend Sentry SDK (open source as well). Companies could just add another tracker in the frontend and add a DSN in their Sentry config to send data to us in addition to Sentry.

We wanted to build an interface where you don't need to check logs, dashboards, traces, metrics, and code, as the agent would do it for you with plain English to explain the "what," "why," and "how do I fix it."

We quickly realized companies don't want to add a new tracker or change their monitoring stack, as these platforms do the job they're supposed to do. So we decided to build above them. Now we connect to tools like Sentry, Datadog, Slack user feedback channels, and other integrations.

Claude Code is so good at writing code, but handling runtime issues requires more than just raw coding ability. It demands deep runtime context, immediate reactivity, and intelligent triage, you can’t simply pipe every alert directly into an agent. That’s why our first step is converting noise into signal. We group duplicates and filter false positives to isolate clear issues. Once we have a confirmed signal, we trigger Claude Code with the exact context it needs, like the specific Sentry issue and relevant logs fetched via MCP (mostly using grep on Datadog/Grafana). However, things get exponentially harder with multi-repo and multi-service architectures.

So we built an internal map of the production system that is basically a .md file updated dynamically. It shows every link between different services, logs, and metrics so that Claude Code can understand the issue faster.

One of our users using Sentry was receiving ~180 alerts/day. Here is what their workflow looked like:

- Receive the alert

- 1) Defocus from their current task or wake up, or 2) don't look at the alert at all (most of the time)

- Go check dashboards to find the root cause (if infra type) or read the stack trace, events, etc.

- Try to figure out if it was a false positive or a real problem (or a known problem already in the fixes pipeline)

- Then fix by giving Claude Code the correct context

We started by cutting the noise and went from 180/day to 50/day (by grouping issues) and giving a severity based on the impact on the user/infra. This brings it down to 5 issues to focus on in the current day. Triage happens in 3 steps: deduplicating before triggering a coding agent, gathering the root cause for each alert, and re-grouping by RCA.

We launched self-serve (https://sonarly.com) and we would love to have feedback from engineers. Especially curious about your current workflows when you receive an alert from any of these channels like Sentry (error tracking), Datadog (APM), or user feedback. How do you assign who should fix it? Where do you take your context from to fix the issue? Do you have any automated workflow to fix every bug, and do you have anything you use currently to filter the noise from alerts?

We have a large free tier as we mainly want feedback. You can self-serve under 2 min. I'll be in the thread with my co-founder to answer your questions, give more technical details, and take your feedback: positive, negative, brutal, everything's constructive!

Comments URL: https://news.ycombinator.com/item?id=47049776

Points: 30

# Comments: 17

https://news.ycombinator.com/item?id=47049776

Extensions

Launch HN: Omnara (YC S25) – Run Claude Code and Codex from anywhere

kmansm27 Feb 12, 2026

Hey y’all, Kartik, Ishaan, and Christian from Omnara (https://www.omnara.com/) here. We’re building a web and mobile agentic IDE for Claude Code and Codex that lets you run and interact with coding agents from anywhere. Omnara lets you run Claude Code and Codex sessions on your own machine, and exposes those sessions through a web and mobile interface so you can stay involved even when you’re away from your desk. Think of it like Claude Code Desktop or Conductor, except you can continue your sessions on your phone.

Here’s a demo of the web and mobile apps - https://youtu.be/R8Wmy4FLbhQ

We started using Claude Code early last year and quickly ran into a pattern: agents could work for long stretches on their own, but progress would stall whenever they needed follow-up input. If that happened while we were away from our desks, everything just paused. We looked at remote agent solutions like Codex Web and Devin, which were the main options at the time, but they ran in remote VMs, and we wanted our coding agent to run in our own environment. Our first attempt at solving this was a lightweight wrapper that streamed messages from the Claude Code CLI to a mobile app, but that approach ended up being fragile and hard to maintain.

As the Claude Agent SDK matured, it gave us enough control to rewrite Omnara from scratch and run the agent loop directly. We chose to build a GUI across web and mobile instead of a TUI or CLI, because we think GUIs are generally more ergonomic for working with agents and code, especially on mobile. We still preserve the main strength of CLIs and TUIs: running anywhere, including on headless machines.

Omnara keeps that property by running a small headless daemon on the user’s machine (or a remote VM) that hosts the agent loop. The daemon maintains an authenticated, outbound WebSocket connection to our server, which relays messages between the agent running on the user’s machine and any connected web or mobile clients. Because the daemon only makes outbound connections, there’s no need for exposed ports, SSH access, or tunneling on the user’s machine.

In our first version of Omnara, users liked that agent sessions ran in their own environment, but they still depended on the machine staying online. Some users ran Omnara on a remote machine that stayed up, which worked well for them, though most still did most of their work on laptops. In the current version, Omnara can continue an agent session in a hosted remote sandbox when your local machine goes offline.

The conversation state of an agent is already persisted on our server, and you can optionally enable cloud syncing for the working code. When syncing is enabled, Omnara creates git commits at each turn in the conversation and pushes them to our server, so execution can resume from the same state regardless of whether it continues locally or in the cloud. If you continue working in a remote sandbox, you can later pull any changes back into your local environment when you return to your machine. Environment parity in the sandbox isn’t perfect yet, but in practice, missing dependencies are usually easy to resolve by asking the agent to install them.

Another thing we learned from using the initial version of Omnara is that mobile is fine for quick interactions, but not great for extended back-and-forth. Users asked for a hands-free way to keep agents moving while walking, driving, or doing something else, which led us to add a voice agent. Coming from more traditional software engineering backgrounds, we honestly thought coding by talking to a voice agent would be gimmicky and added it mostly as a fallback.

What surprised us is how useful the voice agent ended up being in practice. When working with coding agents, being redundant and overly explicit usually helps, and people naturally give more detail when speaking than when typing. Going back and forth with the agent as the conversation unfolds tends to produce a much more solid plan than trying to one-shot it with a prompt (this could technically also be done over text, but talking and iterating over voice feels easier and more natural). It’s also just fun. Talking through an idea with an agent while out on a walk is a lot more enjoyable than staring at a terminal screen.

To try it out, open your terminal and download Omnara with

  curl -fsSL https://omnara.com/install/install.sh | bash

then run omnara inside any git repository. This starts a headless Claude Code or Codex session in that repo, which immediately appears in the Omnara web and mobile apps. From there, you can continue that session or start new ones remotely (with or without worktrees) and switch between the web and mobile clients without interrupting the agent.

Omnara is free for 10 agent sessions per month, then $20/month for unlimited sessions. When agents run in your own environment, you can use your existing Claude or Codex subscription, so there’s no need to pay us for additional tokens. If you use Claude Code or Codex, we’d love to hear your feedback on Omnara!

Comments URL: https://news.ycombinator.com/item?id=46991591

Points: 147

# Comments: 161

https://news.ycombinator.com/item?id=46991591

Extensions

Launch HN: Livedocs (YC W22) – An AI-native notebook for data analysis

arsalanb Feb 10, 2026

Hi HN, I'm Arsalan, founder of LiveDocs (https://livedocs.com). We're building an AI-native data workspace that lets teams ask questions of their real data and have the system plan, execute, and maintain the analysis end-to-end.

We previously posted about LiveDocs four years ago (https://news.ycombinator.com/item?id=30735058). Back then, LiveDocs was a no-code analytics tool for stitching together metrics from tools like Stripe and Google Analytics. It worked for basic reporting, but over time we ran into the same ceiling our users did. Dashboards are fine until the questions get messy, and notebooks slowly turn into hard-to-maintain piles of glue.

Over the last few years, we rebuilt LiveDocs almost entirely around a different idea. Data work should behave like a living system, not a static document or a chat transcript.

Today, LiveDocs is a reactive notebook environment backed by real execution engines. Notebooks are not linear. Each cell participates in a dependency graph, so when data or logic changes, only the affected parts recompute. You can freely mix SQL, Python, charts, tables, and text in the same document and everything stays in sync. Locally we run on DuckDB and Polars, and when you connect a warehouse like Snowflake, BigQuery, or Postgres, queries are pushed down instead of copying data out. Every result is inspectable and reproducible.

On top of this environment sits an AI agent, but it is not "chat with your data." The agent works inside the notebook itself. It can plan multi-step analyses, write and debug SQL or Python, spawn specialized sub-agents for different tasks, run code in a terminal, and browse documentation or the web when it lacks context. Because it operates inside the same execution graph as humans, you can see exactly what it ran, edit it, or take over at any point.

We also support a canvas mode where the agent can build custom UI for your analysis, not just charts. This includes tables with controls, comparisons, and derived views that stay wired to the underlying data. When a notebook is not the right interface, you can publish parts of it as an interactive app. These behave more like lightweight internal tools, similar in spirit to Retool, but backed by the same analysis logic.

Everything in LiveDocs is fully real-time collaborative. Multiple people can edit the same notebook, see results update live, comment inline, and share documents or apps without exposing raw code unless they want to.

Teams use LiveDocs to investigate questions that do not fit cleanly into dashboards, build analyses that evolve over time without constant rewrites, and automate recurring questions without turning them into brittle pipelines.

Pricing is pay-as-you-go, starting at $15 per month, with a free tier so people can try it without talking to us. You'll have to sign up, as it requires us to provision a sandbox for your to run your notebook. Here's a video demo: https://youtu.be/Hl12su9Jn_I

We are still learning where this breaks. Long-running agent workflows on production data surface a lot of sharp edges. We would love feedback from people who have built or lived with analytics systems, notebooks, or "chat with your data" tools and felt their limits. Happy to go deep on technical details and trade notes.

Comments URL: https://news.ycombinator.com/item?id=46964162

Points: 48

# Comments: 19

https://news.ycombinator.com/item?id=46964162

Extensions

Launch HN: Modelence (YC S25) – App Builder with TypeScript / MongoDB Framework

eduardpi Feb 3, 2026

Hi all, Aram and Eduard here - co-founders of Modelence (https://modelence.com). After spending years on scaling our previous startup’s platform, we built an open-source full-stack TypeScript + MongoDB framework to stop solving the same auth / database / API / cron job implementations every time we created an app, and we didn’t like the idea of using multiple managed platforms for each of these to run our apps either.

(Here’s our prior Show HN post for reference: https://news.ycombinator.com/item?id=44902227)

At the same time, we were excited by the whole AI app builder boom and realized that the real challenge there is the platform rather than the tool itself. Now we’re making Modelence the first full-stack framework that’s built for coding agents and humans alike:

- TypeScript is already great for AI coding because it provides guardrails and catches many errors at build time, so agents can auto-correct

- MongoDB eliminates the schema management problem for agents, which is where they fail the most often otherwise (+ works great with TS/Node.js)

- Built-in auth, database, cron jobs and else that just works together out of the box means agents only focus on your product logic and don’t fail at trying to set these things up (+ less tokens spent on boilerplate).

You can now try the Modelence app builder (based on Claude Agent SDK) by just typing a prompt on our landing page ( https://modelence.com ) - watch a demo video here: https://youtu.be/BPsYvj_nGuE

Then you can check it out locally and continue working in your own IDE, while still using Modelence Cloud as your backend, with a dev cloud environment, and later deploy and run on Modelence Cloud with built-in observability around every operation running in your app.

We’re also going to add a built-in DevOps agent that lives in the same cloud, knows the framework end-to-end, and will use all this observability data to act on errors, alerts, and incidents - closing the loop, because running in production is much harder than just building.

We launched the app builder as a quick start for developers, to demonstrate the framework and Modelence Cloud without having to manually read docs and follow the steps to set up a new app. Our main focus is still the platform itself, since we believe the real challenge in AI coding is the framework and the platform rather than the builder tool itself.

Comments URL: https://news.ycombinator.com/item?id=46872733

Points: 72

# Comments: 44

https://news.ycombinator.com/item?id=46872733

Extensions

Launch HN: AgentMail (YC S25) – An API that gives agents their own email inboxes

Haakam21 Jan 29, 2026

Hey HN, we're Haakam, Michael, and Adi. We're building AgentMail (https://agentmail.to), the email inbox API for agents. We’re not talking about AI for your email, this is email for your AI.

Email is an optimal interface for long-running agents. It’s multithreaded and asynchronous with full support for rich text and files. It’s a universal protocol with identity and authentication built in. Moreover, a lot of workflow critical context already lives in email.

We wanted to build email agents that you can forward your work to and get back a completed task. The agents could act entirely autonomously as you wouldn't need to delegate your identity. If they did get stuck they could just send you, or anyone else, an email.

Using Gmail, we kept getting stuck on the limitations of their API. No way to create inboxes programmatically. Rate and sending limits. OAuth for every single inbox. Keyword search that doesn't understand context. Per-seat pricing that doesn't work for agents.

So we built what we wished existed: an email provider for developers. APIs for creating inboxes and configuring domains. Email parsing and threading. Text extraction from attachments. Realtime webhooks and websockets. Semantic search across inboxes. Usage-based pricing that works for agents.

Developers, startups, and enterprises are already deploying email agents with AgentMail. Agents that convert conversations and documents into structured data. Agents that source quotes, negotiate prices, and get the best deals. Agents that emulate internet users for training models on end-to-end tasks.

Here's demo of Clawdbots communicating using AgentMail: https://youtu.be/Y0MfUWS3LKQ

You can get started with AgentMail for free at https://agentmail.to

Looking forward to hearing your thoughts and feedback.

Comments URL: https://news.ycombinator.com/item?id=46812608

Points: 169

# Comments: 169

https://news.ycombinator.com/item?id=46812608

Extensions

Launch HN: Constellation Space (YC W26) – AI for satellite mission assurance

kmajid Jan 22, 2026

Hi HN! We're Kamran, Raaid, Laith, and Omeed from Constellation Space (https://constellation-io.com/). We built an AI system that predicts satellite link failures before they happen. Here's a video walkthrough: https://www.youtube.com/watch?v=069V9fADAtM.

Between us, we've spent years working on satellite operations at SpaceX, Blue Origin, and NASA. At SpaceX, we managed constellation health for Starlink. At Blue, we worked on next-gen test infra for New Glenn. At NASA, we dealt with deep space communications. The same problem kept coming up: by the time you notice a link is degrading, you've often already lost data.

The core issue is that satellite RF links are affected by dozens of interacting variables. A satellite passes overhead, and you need to predict whether the link will hold for the next few minutes. That depends on: the orbital geometry (elevation angle changes constantly), tropospheric attenuation (humidity affects signal loss via ITU-R P.676), rain fade (calculated via ITU-R P.618 - rain rates in mm/hr translate directly to dB of loss at Ka-band and above), ionospheric scintillation (we track the KP index from magnetometer networks), and network congestion on top of all that.

The traditional approach is reactive. Operators watch dashboards, and when SNR drops below a threshold, they manually reroute traffic or switch to a backup link. With 10,000 satellites in orbit today and 70,000+ projected by 2030, this doesn't scale. Our system ingests telemetry at around 100,000 messages per second from satellites, ground stations, weather radar, IoT humidity sensors, and space weather monitors. We run physics-based models in real-time - the full link budget equations, ITU atmospheric standards, orbital propagation - to compute what should be happening. Then we layer ML models on top, trained on billions of data points from actual multi-orbit operations.

The ML piece is where it gets interesting. We use federated learning because constellation operators (understandably) don't want to share raw telemetry. Each constellation trains local models on their own data, and we aggregate only the high-level patterns. This gives us transfer learning across different orbit types and frequency bands - learnings from LEO Ka-band links help optimize MEO or GEO operations. We can predict most link failures 3-5 minutes out with >90% accuracy, which gives enough time to reroute traffic before data loss. The system is fully containerized (Docker/Kubernetes) and deploys on-premise for air-gapped environments, on GovCloud (AWS GovCloud, Azure Government), or standard commercial clouds.

Right now we're testing with defense and commercial partners. The dashboard shows real-time link health, forecasts at 60/180/300 seconds out, and root cause analysis (is this rain fade? satellite setting below horizon? congestion?). We expose everything via API - telemetry ingestion, predictions, topology snapshots, even an LLM chat endpoint for natural language troubleshooting.

The hard parts we're still working on: prediction accuracy degrades for longer time horizons (beyond 5 minutes gets dicey), we need more labeled failure data for rare edge cases, and the federated learning setup requires careful orchestration across different operators' security boundaries. We'd love feedback from anyone who's worked on satellite ops, RF link modeling, or time-series prediction at scale. What are we missing? What would make this actually useful in a production NOC environment?

Happy to answer any technical questions!

Comments URL: https://news.ycombinator.com/item?id=46721933

Points: 48

# Comments: 18

https://news.ycombinator.com/item?id=46721933

Extensions