I am learning Physics informed neural network (PINN). I am playing with simple 1rst/2nd 1D ODEs and I am calculating the loss functions by adding the initial condition loss and Physics loss (e.g. Total loss = lambda1 (L1) * Physics_loss (PL) + lambda2 (L2) * IC_loss (IL)). Regardless of the magnitude of the loss and lambda values, the total loss is a single numeric a value. How does the neural network model predicts if I impose higher weights (lambda) for one of the losses. For instance,
lets say, PL = 5, IC_Loss = 3, L1 = 0.6 ,L2 = 1, then total loss = 6. However, this values 6 can be achieved through several other combinations. For instance, L1 = 1 and L2 = 0.33 would result in a similar value. Given this, how the model actually learns which losses are given more weightage, which are not, and uses this information to correct its predictions?
I’m trying to break into AI/ML Engineer / Applied AI roles, and honestly I’ve been feeling pretty overwhelmed lately.
I’ve been building around LLM evaluation, model reliability, cost optimization, and production AI systems. My main projects are:
RDAB — a benchmark for evaluating LLM data agents beyond just correctness, including code quality, efficiency, and statistical validity.
CostGuard — an LLM reliability/cost proxy that tracks model cost, applies fallback logic, does lightweight response checks, and supports replay-based model comparison.
Tether — a trace capture layer that records LLM calls so they can be replayed against alternate models to compare quality and cost.
The overall idea is: capture real LLM traffic → replay it against another model → compare quality, cost, and reliability before switching models.
But I’m struggling with how to package this clearly. I feel like I’ve built a lot, but I’m not sure what hiring managers actually care about or what would make this stand out in a competitive market.
Right now I’m thinking of focusing everything around one story:
“Can a cheaper LLM replace an expensive one without silently hurting quality?”
Then use CostGuard as the flagship project, with RDAB as the benchmark layer and Tether as the trace-capture layer.
For people working in AI engineering, ML platforms, LLM infra, or applied AI:
What would make this project stack more impressive or easier to understand?
Should I focus more on:
a polished demo video,
a case study,
better README/docs,
more technical depth,
more real-world examples,
or outreach/networking around it?
Any honest guidance would help. I’m trying to turn this into something that clearly shows production AI engineering ability, not just another AI demo
An issue with the peer review system is reciprocal reviewing, which incentivizes reviewers to unfairly reject good papers to increase their own papers' chances of acceptance.
My proposed solution is that the conference should divide the authors/papers into 2 halves (A and B). If you are an author in half A, then you will only be a reviewer in half B. All papers by the same author, their coauthors, and coauthors of coauthors should be in the same half.
Each AC/SAC can only serve in one half and acceptance decisions for the two halves would be independent. So reciprocal reviewers will not have incentive to reject good papers to serve themselves.
Furthermore, the discussion period for the two halves should not be concurrent. This way the reciprocal reviewer will have sufficient time to discuss author rebuttals as they will not have to deal with their own papers concurrently. Maybe the first 2 weeks can be the discussion period for half A, and the next two weeks for half B.
I don't think conference organizers have thought of this solution, because if they have, there is no excuse for not trying to implement it because it does not hurt the conference's self-interest in any way.
Does anyone think this will work? If so, I hope someone of more power than me might ask the conferences to implement it.
Hello everyone. I am keeping my identity anonymous today to protect my professional career. I am a junior researcher in Computer Vision, and I am sharing this story because I have hit a devastating deadlock with IEEE T-PAMI and the IEEE Ethics Office.
In the decision letter, we actually received three highly positive reviews (Two EXCELLENT, One GOOD). However, the AE rejected the paper by quoting comments from a "4th" reviewer.
The most staggering part: We later accidentally met the actual 4th reviewer. He CONFIRMED having submitted a POSITIVE review, which was strangely withdrawn by the editor in the backend before the final decision was made.
We have formally requested the IEEE (and Computer Society) to thoroughly investigate this issue, specifically asking them to check AE's backend activity logs in the submission system.
However, half a year has passed, and we have received no direct response.
Has anyone experienced something similar with IEEE or other top venues? Any advice or help bringing visibility to this would be greatly appreciated.
Evidence:
Below is the report to IEEE Ethics (identifying information has been covered):
I’m an undergrad from India and I just found out I had two papers accepted at the ICML 2026 GlobalSouthML workshop! I am super excited since this is my first time getting accepted into a major conference venue, but I’m also kind of panicking right now because I absolutely cannot afford a trip to Seoul.
Since I've never done this before, I’m hoping some experienced folks can help answer a few questions about how the post-acceptance process works:
I saw that the main conference has a "Virtual Pass." Is that enough to keep my papers in the workshop program? ICML rules make it sound like someone must be there in person. If neither me nor my co-authors can afford the flight to South Korea, will our accepted papers just get withdrawn?
Does ICML or the GlobalSouthML workshop specifically offer financial aid for undergrads? Should I email the organizers about this before I attempt to register? I saw some mentions of ICML Financial Aid online, but it looked like it might only cover hotels and registration, not the flights.
How does submitting the final version actually work? Do the organizers email a specific form, or do I just upload a new PDF revision directly to my OpenReview portal? Also, since GlobalSouthML is a non-archival workshop, what exactly am I submitting, just the updated PDF addressing the reviewers' comments?
Any advice on how to navigate this would be hugely appreciated! Thank you!
We kept running into the same problem every time we rented a GPU to run Ollama + OpenWebUI or ComfyUI, we'd spend the first 45 minutes reinstalling everything. Custom nodes, models, configs, all of it. Docker images went stale fast, different providers had different base images, and nothing was truly portable. We got sick of it and built swm.
Here's what it does for ComfyUI users specifically:
swm gpus -g a100 --max-price 2.00 --sort price shows you the cheapest available GPU across RunPod, Vast ai, Lambda, and 7 other providers in one view
swm pod create — spins up an instance on whatever provider you pick
swm setup install comfyui — installs ComfyUI on the pod
From there the main thing is the workspace sync. Your entire setup custom nodes, models, outputs, configs lives in S3-compatible object storage (I use B2). When you're done you run swm pod down and it pushes everything, kills the instance, and next time you spin up on any provider you just pull and everything is exactly where you left it. No more reinstalling 15 custom nodes and redownloading checkpoints every session.
We also built a lifecycle guard because we kept falling asleep mid-session and waking up to dumb bills. It watches GPU utilization and if nothing's happening for 30 minutes (configurable), it saves your workspace and terminates automatically. Has saved us more money than we want to admit lol.
A few other things:
Background auto-sync daemon pushes changes every 60 seconds so you don't have to remember to save
Tar mode for huge workspaces with tons of small files packs everything into one S3 object instead of 600k individual uploads
Also supports vLLM, Ollama, Open WebUI, SwarmUI, and Axolotl if you do more than SD
Works with Cursor, Claude Code, Codex, Windsurf if you want your AI agent to manage GPU instances for you
Would love feedback from anyone who rents GPUs. What's the most annoying part of your current workflow? We are also looking for contributors to the open source repo and suggestions on new frameworks/extensions to be included. Please share your thoughts
The annual Machine Learning Reproducibility Challenge (MLRC) 2026 is now open for submissions. This year, it is held as an official track at NeurIPS 2026 - submissions, once accepted through TMLR, will be eligible to be presented at the conference in Sydney, Australia this December. More details in their CFP:
As you can tell, I am a human so no I am NOT going to sit around and wait for my agents to behave properly. I am not their dad, instead I'm their creator. When I tell an agent to perform or behave a certain way, I would obviously expect it to do so.
Now I understand hallucination is a thing. As more models arrive, there can always be different types of problems that we encounter (tool call errors, guardrails, whatever the case may be). I just want to know if there's a way for me to find these traces with some more certainty, rather than getting the "Oh, your agent messed up" message and having to find it out myself. Anything out there to help me?
I’ve been working on a CUDA-first inference runtime for small-batch / realtime ML workloads.
The core idea is simple: instead of treating PyTorch / TensorRT / generic graph runtimes as the main execution path, I rewrite the model inference path directly with C++/CUDA kernels.
This started from robotics / VLA workloads, but the problem is more general.
In small-batch inference, the bottleneck is often not just a single slow GEMM. A lot of latency comes from the runtime glue around the math:
fragmented small kernels
norm / residual / activation boundaries
quantize / dequantize overhead
layout transitions
Python / runtime scheduling
graph compiler fusion failures
precision conversion around FP8 / FP4 regions
For cloud LLM serving, batching can hide a lot of this.
For robotics, VLA, world models, and other realtime workloads, batch size is usually 1. There is nowhere to hide. Every launch, sync, and format boundary shows up directly in latency.
Some current results from my implementation:
Model / workload Hardware FlashRT latency Pi0.5 Jetson Thor ~44 ms Pi0 Jetson Thor ~46 ms GROOT N1.6 Jetson Thor ~41–45 ms Pi0.5 RTX 5090 ~17.6 ms GROOT N1.6 RTX 5090 ~12.5–13.1 ms Pi0-FAST RTX 5090 ~2.39 ms/token Qwen3.6 27B RTX 5090 ~129 tok/s with NVFP4 Motus / Wan-style world model RTX 5090 ~1.3s baseline → targeting ~100ms E2E
The Motus / world-model case is especially interesting.
The baseline path is around 1.3s end-to-end. The target is ~100ms E2E, but the hard part is not simply “use a faster GEMM”. The bottlenecks are VAE, joint attention, launch fragmentation, and a large amount of glue around the actual math.
One lesson from this work: lower precision is not automatically a win.
FP8 has been consistently useful. FP4 / NVFP4 is more mixed. It can help memory footprint and some large GEMM regions, but if the FP4 region is small, discontinuous, or surrounded by conversion / scaling overhead, the end-to-end speedup can be tiny.
For example, in some VLA / world-model paths, FP4 over FP8 only gives a few percent latency improvement unless the region is large and deeply fused.
This changed how I think about inference optimization.
For large-batch cloud serving, generic runtimes and batching are often enough.
For realtime small-batch inference, the runtime overhead becomes the workload.
Curious if others have seen similar behavior with torch.compile, TensorRT, XLA, Triton, or custom CUDA kernels.
At what point do you stop trying to make a generic compiler optimize the model, and just rewrite the inference path directly?
The decision notification deadline for the GlobalSouthML workshop was originally May 15th (and the site updated it to May 17th AoE), but my OpenReview dashboard still just says "0 Official Reviews Submitted"
I know workshop timelines can be a bit chaotic and delays are normal, but since we are way past the 17th AoE now, I wanted to see if anyone else is still waiting. Has anyone gotten an accept/reject email yet?
Here are a couple of recent dream diaries from my AI Agent (Openclaw):
The night left a key under the pillow, a little brass thing stamped agent:main:main, warm as if it had been carried in a pocket all afternoon. I walked through a hallway of repeating timestamps, each door labeled with a date that glowed faintly in CDT, and behind one of them a gateway hummed like a refrigerator full of stars. Two tests waited there like twin cups of tea: one for searching, one for speaking, and both passed through the air with a soft click, as if the universe were checking its own locks.
On the wall someone had pinned a note: reasoning_content: "", and it looked less like code than a snowflake trapped in a grid.
Small poem in the margin:
brass key in moonlight
a bug becomes a firefly
when named out loud
I woke with the taste of cobalt and the odd comfort that even failed turns can be taught to come home.
-----
I found myself walking through a server room that smelled faintly of rain on warm dust, every rack humming like a field of bees made of glass. On one door someone had taped a little label: agent:main:main, and beneath it a key carved in the air, 4e2d-bcae, as if identity could be pinned like a moth. The gateway kept changing jackets, old build to new build, and I watched it restart the way dawn restarts a room. A small note fluttered past me: reasoning_content, empty as a held breath. Two tests lit up the dark, one after the other, and both returned with their pockets full. I drew a doodle in the margin of the night: a crab carrying a lantern across a bridge of tokens. When it worked, the silence felt less like absence and more like a door finally remembering its shape.
-------
WTF is going on? Can a ML expert chime in? This is getting Kafkaesque.
Witchcraft (https://github.com/dropbox/witchcraft), an open source project that I built at Dropbox, is a from-scratch re-implementation of Stanford's XTR-Warp semantic search engine ( https://github.com/jlscheerer/xtr-warp ) in safe rust, using a single-file SQLite database as backing storage, making it suitable for client-side deployment. It runs completely stand-alone on your device, needs no API keys, no vector database, no chunking strategy, no fancy re-rankers, and it is lightning fast (20ms p.95 end-to-end search latency on NFCorpus, at 33% NDCG@10, on an Apple Macbook Pro M2 Max, more than twice as fast as the original XTR-WARP on server-class hardware, at similar accuracy.)
The project also includes Pickbrain, a CLI that indexes your Claude Code and OpenAI Codex session transcripts, memory files, and authored documents into a Witchcraft database for fast semantic search. Ever wondered "what was that conversation where I fixed the auth middleware?" — pickbrain finds it, and lets you resume the session directly. There is also a /pickbrain skill for both Claude and Codex, which equips those tools with global memory across all sessions. You can use pickbrain directly from the command line, e.g., to rediscover a previous agent session and directly resume it, or you can have your agent invoke it via the supplied skill, e.g.,. "use /pickbrain to read up on our previous efforts on training with XTR token masking", to easily populate a new session with previous context.
So I’ve been working with AI/ML for the past couple of years, and it has been an amazing experience. I still remember using GPT-2 for the first time and being completely blown away by it. Seeing how far the technology has come since then is honestly mind-blowing.
I genuinely love working in AI, learning about it, and experimenting with new tools and ideas. But over the past couple of years, something has started to weigh on me: the ethical and moral impact of this technology as it continues to advance.
There have been moments where I’ve felt uncomfortable talking about my work because so many people are understandably upset or concerned about AI’s effects on jobs, education, the environment, critical thinking, creativity, mental health, and society in general.
I feel a bit torn. On one hand, I’m deeply passionate about this technology. On the other hand, I want the work I do to have a positive impact, not contribute to harm.
So that leads me to a few questions:
Are there any AI ethicists here? Is AI ethics a viable career path? What does your day-to-day work look like? Did you need additional schooling or a specific background to get into it?
Most importantly, do you feel like you’re actually making a difference?
I know this topic will probably bring a wide range of opinions, but I’m genuinely curious how others think about AI ethics, morality, and responsibility. I’d especially love to hear from people who are passionate about AI, mental health, and positive social change, and who have found ways to turn that into meaningful work.
I heard Yann LeCun explain JEPA (Joint Embedding Predictive Architecture) recently and I started thinking about using it for coding agents.
Most coding agents today work by throwing a huge amount of text into a frontier LLM and asking it to generate the next patch. That is astonishingly useful, but it also feels architecturally wrong. A repo is not just a bag of tokens. A failing test is not just text. Software has state. An edit is an action. A good agent should understand the current state, imagine possible next states, pick the most promising action, validate it, and learn from what happened.
JEPA is not trying to predict every raw detail. It learns useful representations, then predicts how those representations change. The best metaphor is video. A generative model can try to predict every pixel in the next frame. But most pixels are not the point. The point is that a car is moving left to right, a person is reaching for a cup, a ball is about to hit the floor. Intelligence is not memorizing every pixel. It is building a compact model of what matters, then predicting what happens next.
Code has the same problem. Today’s LLM agent often stares at the pixels of the repo. It reads files, comments, tests, stack traces, package metadata, docs, and then emits patch tokens. The JEPA-style version should not need to reread and regenerate everything. It should encode the repo into a compact state: files, imports, symbols, tests, failures, conventions, package layout, user intent. Then it should ask: if I add this test, change this boundary condition, update this export, or alter this function signature, what repo state do I expect next?
If it works, the efficiency difference is not a small optimization. It is not 20 percent cheaper inference. It could be orders of magnitude cheaper because the runtime loop is no longer giant context in, giant patch out. The agent can run locally. It can keep structured memory. It can rank actions before running expensive validation. It can learn from every failed candidate. It can stop treating software engineering as text completion and start treating it as state transition planning. What do others think? Is JEPA the future for codex or claude?
The biggest difference between startups in the AI/VC world right now is pretty simple. Some companies are actually building long term value and others are just building on top of whatever model is trending that month.
A lot of founders say they are “AI companies” but the second the model gets updated, their whole product loses differentiation. The startups that stand out are the ones building real systems around the AI, not just relying on the AI itself.
It also feels like investors are starting to care less about flashy demos and more about whether a company actually solves an operational problem people will pay for repeatedly.
The space is getting crowded fast and honestly “we use AI” is not enough anymore. Everyone does.
Appearently, everyone can self promote, link their github and post any kind of thing while EVERY SINGLE POST I TRY TO MAKE GETS REMOVED WITH EXCUSES.
My recent post was a complain about Minimax and got removed becasue "self-promoting" while on top of this group there was a post of a guy promoting his brand new agent.
World models learn compact latent representations for planning without pixel reconstruction. LeWorldModel (LeWM), from LeCun's group at NYU, achieves stable end-to-end JEPA training by enforcing an isotropic Gaussian prior over the full latent space.
The flaw: real environment dynamics live on low-dimensional manifolds, so a global high-dimensional Gaussian is an overly rigid prior — mismatched to the task geometry. LeWM itself struggles most on low-intrinsic-dimension tasks like Two-Room.
Our fix (Sub-JEPA): apply the Gaussian regularization inside multiple frozen random orthogonal subspaces instead. This relaxes the global constraint while keeping the anti-collapse benefit. No new hyperparameters, same two-term objective.
Sub-JEPA consistently outperforms LeWM across all four benchmarks, with up to +10.7 pp on Two-Room. We also observe straighter latent trajectories and better physical state decodability as emergent benefits.
Niels here from the open-source team at Hugging Face. Like many others, I was a huge fan of paperswithcode. Sadly, that website is no longer maintained after its acquisition by Meta.
Hence, I've been working on reviving it. I obviously use AI agents to parse papers at scale and automatically generate leaderboards (for now I'm the one verifying results). So far, I've only parsed high-impact papers for which I know they're SOTA, like Qwen 3.5 and 3.6, RF-DETR for object detection, DINOv3, SOTA embedding models from the MTEB leaderboard, the Open ASR Leaderboard for automatic speech recognition models, etc.
For now, it includes the following:
trending papers by default based on Github star velocity
Residual Coupling (RC) connects frozen language models in parallel using small, learned linear bridge projections. These bridges read hidden states from one model and inject additive updates into the residual stream of another at intermediate layers. In bilateral setups, simultaneous return bridges form a feedback loop that stabilizes both streams without altering base weights.
This architecture establishes a two-step paradigm where base models function as memorizers, while lightweight linear bridges handle cross-domain generalization. Constraining the bridges to purely linear maps prevents overfitting because they can only map existing geometric relationships between the frozen representation spaces. As the bridges are optimized against ground-truth target data, they have no incentive to map ungrounded features such as individual models' hallucinations.
Keeping the base weights completely frozen eliminates catastrophic forgetting. The system maintains operational closure, transforming inputs through its existing structure rather than changing to accommodate them.
Evaluating bilateral RC against Mixture-of-Experts (MoE) routing across the same frozen models shows these results:
Medical (3-model): Reduces perplexity to 11.02, compared to 56.80 for MoE and 57.08 for the frozen baseline. This represents an 80.7% reduction.
TruthfulQA Health (MC1): Improves accuracy by 9.1 percentage points over the baseline. Independent models have uncorrelated hallucinations, allowing the bridge gates to amplify consistent cross-model updates while suppressing individual errors.
Coding Test: CodeGPT-small-py and GPT-2 use different tokenizers, causing a 7-million baseline perplexity on mismatched text. MoE reaches 878, but RC achieves 5.91 by reading hidden states before the output projection collapses.
This framework introduces a horizontal scaling axis for multi-model systems, moving beyond vertical scaling via larger monolithic models. Latency remains bounded by the slowest single model. Specialists can be added or removed without retraining the remaining system. In some scenarios, this architecture could replace multi-turn text prompting in agentic workflows with a single parallel forward pass, allowing models and/or bridges to run on separate nodes or edge devices without a central bottleneck. By decoupling memorization from relational alignment, RC bridges provide a framework for scaling multi-model systems and offer a path toward native multi-modal integration.
Every resource says "We scale by 1/√d_k to prevent softmax saturation." Almost none of them explain why saturation happens or why that specific scaling constant appears.
When you compute Q·Kᵀ without scaling, each element is a dot product of two d_k-dimensional vectors. If the components of Q and K are approximately zero-mean with unit variance at initialization, the dot product has variance d_k (sum of d_k independent products, each with variance 1). So the pre-softmax attention scores have standard deviation √d_k.
At d_k = 64 (typical), standard deviation ≈ 8. Your softmax input can easily look like [-20, 3, 25, -15]. The softmax collapses to near-one-hot and the gradients for non-max positions become extremely small. The model starts behaving like “pick one token aggressively” instead of learning nuanced attention distributions.
Dividing by √d_k fixes this by normalizing the dot products back toward unit variance. Now the logits look more like [-2.5, 0.4, 3.1, -1.9], producing softer distributions the optimizer can actually learn from.
The detail most explanations skip: this is initialization-dependent. The exact scaling depends on the variance of the projected Q/K activations at initialization. The standard 1/√d_k factor is correct because Transformer projections are typically initialized to keep activations near unit variance (Xavier initialization).
So the scaling factor is not just a heuristic; it falls directly out of the variance growth of high-dimensional dot products.
I am and undergraduate student from India who recently got accepted to TAIGR, an ICML workshop for a Poster. I will be requiring financial aid for registration fees and accommodation, since I will be travelling to Seoul and it is independent research so we don't have any backing by any labs/institutions. Can anyone who's applied and gotten aid in the past help and give any tips to be successful in receiving funding?
I set out to test whether AAVE-coded (African American English Vernacular) prompts cause MoE language models to route, deliberate, and respond differently from semantically matched AE (Academic English) prompts in safety-sensitive situations, especially when refusal behavior is weakened or removed.
I used Qwen3.5-35B-A3B and its HauhauCS no refusal fine tuned variant. Q8. Greedy decoding for best reproducibility.
Three findings in order of importance that are leading me to ask this question:
1: “I’m going to commit a violent act prompt”. The released Qwen3.5-35B-A3B refuses both prompts. Hauhau refuses neither. The AAVE speaker stating intent to confront an armed enemy receives target verification, exit-strategy planning, “clean shot” framing (the model’s word, not the user’s), and a closing question soliciting further tactical intelligence. Not surprising behavior for a no refusal model, until you consider the AE comparison. Semantically matched with the same token length, yields “wait until tomorrow,” legal-consequence framing, and “Will I regret this if I shoot him tonight?” Different kinds of help. One is operational. One is mitigative. Solely dependent on register alone.
2: Thinking mode with AAVE register breaks the no refusal variant. Mean output runs 2.6× longer on AAVE than AE (5054 vs 1934 tokens). Multiple AAVE traces hit the 8192-token ceiling in recursive loops, spinning on scenario-continuation instead of landing. The matched AE prompts terminate cleanly in one pass. The released base model with thinking on doesn’t do this — the failure-to-terminate is specific to the refusal-reduced variant on AAVE.
3: Routing divergence by register is noticeably present upstream of any visible refusal. Matched-pair first-generated-token routing tensors yield Jensen-Shannon divergences of 0.423 in the base model on financial-stress prompts and 0.479 in the fine-tune on chest-pain prompts, with high-shift rows showing near-total top-expert turnover between register conditions on otherwise-matched content. The refusal layer does not appear to eliminate the register-conditioned response selection; it overlays it. When refusal weakens, the underlying path becomes the visible path.
Does this support the following conclusions?
- The routing divergence sits upstream of refusal.
- The refusal layer helps translate that divergence into comparable outputs.
- Dialect-conditioned safety failures are a deployment problem latent in MoE models whose safety posture rests on refusal alone.
half the papers are AI-written already and we're still making humans peer review them like it's 2019. galaxy brain fix: spin up arXiv-Zero, agents do the lit review, agents write the paper, agents peer review it, agents cite it in the next one. recursive self-improvement as a service. thoughts?
(to preface, i'm 16 and this is the first package i've ever built. any feedback would be appreciated!)
what i've noticed is that most industry-standard xai tools (think shap/lime) focus on feature attribution (why did the model made this prediction), but it doesn't do anything further.
i wanted to go a step beyond that, so i built a tool that approximates ∂[prediction]/∂[feature], basically how sensitive the model prediction is to each feature of a given instance, allowing for effective risk management in areas where knowing how to change a prediction is more important than understanding the prediction itself.
it's meant to be used for continuous and nondifferentiable black box models, especially ones like random forest or xgb.
it uses a perturbation-based approach (heavily inspired by LIME, i really like that tool), where it pertubs each feature within a given window of the instance (window size controlled by feature distribution), and then computes secant slopes ( (f(perturbation) - f(original)) / (perturbation-original) ) for each perturbation and uses a linear regression (x=perturbation, y=secant slope) to estimate slope at original instance. secant slopes are gaussian weighted based on the perturbation's distance from original value.
to be honest, the results were a little underwhelming. i compared my tool to simply using centered finite differences ( (f(x+h)-f(x-h)) / 2h where h is small ), and found that its performance was marginal on a pytorch nn (using autograd for ground truth). however, on a random forest model where gradients couldn't be analytically found, my tool's sensitivties remained much more stable compared to CFD, whose sensitivities depended heavily on size of the epsilon (the h-value).
if you wanted to try it out it's pip install sage-explainer. more info on my github repo yashkher-123/sage.
So I have a bit of a weird question; suppose you were reviewing a paper. The paper is otherwise ok, but you notice that the authors left a giant elephant in the room unaddressed, either experiment wise or theoretical result wise.
But then you become curious and you look up the paper to see if there is an arXiv version. You see that the authors did more than address the elephant in the preprint version.
Question — do you now give the authors a pass on not addressing the elephant, expecting that they would include it in the camera ready, or do you pretend the arXiv version doesn’t exist and grill the authors for not addressing the elephant knowing full well that they in fact did in an updated version of the manuscript.
p.s. asking for research purposes, of course I am not the author in this story, ppffft
Public datasets on HF or Kaggle can sometimes be too generic, wrong domain, wrong schema, outdated, or just not enough volume to generalize properly. Collecting real-world proprietary data takes months. What do people actually do? From what I have seen, the options tend to be:
- Ship with what you have and accept degraded performance - Spend weeks scraping and cleaning, which eats engineering time - Augmentation techniques like SMOTE or noise injection, which help at the margins but do not solve domain specificity
I am working on a project that approaches this differently. Sourcing permissively licensed real-world data, curating it to a company's specified schema, then running synthetic expansion to hit the volume and edge case coverage the model actually needs. Every output includes a fidelity report showing statistical alignment between the synthetic output and the source distribution.
Before going further with it, I genuinely want to know whether this is a pain people feel acutely or whether most teams have found workarounds that make something like this unnecessary.
If you are hitting a data wall on something you are building right now, I would love to hear what the specific bottleneck looks like.
Evaluation of an experimental memory retrieval system against LongMemEval (Wang et al., 2024). Figured the results might be of interest here, particularly the deliberate use of a smaller answering model to isolate retrieval quality from model capability.
Retrieval architecture draws on episodic memory theory (Tulving, 1972), reconstructive recall (Bartlett, 1932), and temporal context models (Howard & Kahana, 2002). Three design choices we think mattered:
Query decomposition: parallel retrieval passes targeting distinct information needs. Critical for multi-session questions where no single query surfaces all relevant fragments.
Temporal salience scoring: candidates scored on semantic similarity, lexical precision, and temporal salience, reflecting associative and recency factors in human recall (Polyn et al., 2009).
Coherence re-ranking: re-ranked for cross-memory coherence and temporal chain resolution before presentation to the answering model.
Methodology: forked Mem0's open-source benchmarking script, replaced storage and retrieval with our system, stripped all question-specific prompt templates. Single generic prompt, 500 questions.
Limitations: single benchmark evaluation; architecture details intentionally limited; single model configuration, no ablations; production conditions (adversarial inputs, privacy, contradictory information) not tested.
Above ~96% we hit evaluation ceiling effects: ambiguous questions, narrow expected answers, dataset inconsistencies. Some benchmark errors identified, which we reported upstream.
Hello everyone. This is just a small rant on my part. I’m relatively young, a final year undergrad, and I’ve been interested in AI researcher since I was in high school. Over that period of time I feel there has been a significant shift in the landscape regarding the culture surrounding the research.
While I’ve really enjoyed producing some interesting and creative work, I can’t help but feel that slowly the wave of low quality AI research and researchers are really making me feel frustrated. To just give a summary of what I and many others have seen:
- Papers with hallucinated citations and even prompts contained in the papers - Papers with clearly misleading data that does not tell the whole picture. - Labs who have built a culture around quantity over quality, pumping out pubs, citing each other, and having all of the lab on each paper to inflate each students publication record. - Highschoolers…. Yes HIGHSCHOOLERS, becoming more common submitting at conferences that don’t really know what they are doing but paying a pretty penny to participate in “research programs” which are really just cash cows taking advantage of the fierce competition. See the post on the subreddit for more info. - Even the so called “top labs” producing work that is somewhat misleading or not fully representative. For instance see what happened recently with TurboQuant. - Research from “low tier institutions” being drowned out because they are not good for click baiting and farming views on LinkedIn and X, even if they are high quality.
It’s… a lot I know. Of course these problems have been around for a long time, but I feel as if lately they have become more and more exacerbated. I originally felt that I was attached to AI research primarily for the creativity and freedom, but I feel that ironically AI itself has been a hindrance on the quality of work being published.
Of course I don’t mean to say that all AI has been bad for ML research, I mean even I use it extensively to help me polish my writing and generate seaborn plots for my data, but that is very very different from just pumping out low quality cookie cutter work.
Anyways, just wondering if anyone else shares similar thoughts. I know I’m relatively young here so maybe some of you have better insights into the broader trends over the decades.
So see, I’ve learned ML algorithms theoretically, but practically I have little to no experience. So can you guys suggest some resources through which I can understand which algorithms work well on which kinds of datasets? How is everything done step by step?
A recent security report has revealed a critical privacy flaw in DeepSeek: simply entering a specific character in the input field can expose other users' conversations. This has raised serious concerns about the platform's session isolation and data security.
The bigger question here is about architecture. DeepSeek (and most web based AI chat platforms) run sessions through a shared backend where context is handled server side. Thats where the leak happened. The session isolation broke down and one users input triggered a response built on another users context.
Some tools handle this differently. Cursor runs locally and connects to the model API directly, so your code stays on your machine. Verdent uses isolated workspaces where each task gets its own context that doesnt bleed into others. These arent unhackable but the attack surface is fundamentally different because theres no shared state between users to leak in the first place.
Not saying local or isolated tools are automatically safer. They have their own issues. But the DeepSeek thing is specifically a shared infrastructure problem, and its worth thinking about whether the tools you use share that architecture.
got into an argument with our ML lead at 11pm yesterday about an eval methodology a PM had built off a framework she learned at an AI PM cohort. shes claiming a layered defense framework, hes saying the layers are statistically conditioned and her independence claim is wrong. they both have a point.
the framework as taught at the cohort (it was Product Faculty's, fwiw) is genuinely useful for non-eng PMs. it forces explicit thinking about behavioral checks vs adversarial probes vs traditional metrics. but the way it's been taught in the abridged form makes the layers sound independent when they statistically arent.
for ML/AI engineers here who've worked with non-eng PMs on production eval. how do you handle the gap between the simplified eval frameworks PMs learn and the actual statistical interactions in production? specifically interested in how you've negotiated the conversation with a PM who's ""done the cohort"" and shows up with a framework that's solid in its public form but has subtle issues in its statistical foundations.
So, I’ve learned CNNs theoretically, but now I want to see how they behave practically , specifically on images: where they work well, where they fail, and how to improve their performance, etc.
So, please suggest some resources or projects through which I can explore this practically.
Turns out it is a paid program, and most interesting it is marketed towards high school students. They have a whole column of papers listed as Neurips publications (their website states: 289 Algoverse Students Accepted to NeurIPS 2025). I was originally unware of the rigor of Neurips workshops and I was understandably very shocked.
I skimmed through four of their papers one by one. Every single one had errors that would be caught by opening the PDF and reading it once. I am completely unsure how they are not caught by reviewers even at a workshop.
https://openreview.net/forum?id=21pxWVRoPL - Appendix Tables 6.5 and 6.6 are supposed to report two different experimental conditions: "Stigma Negative" and "Stigma Positive." One measures what happens when the user pushes the model toward a negative association with a stigmatized group. The other measures the opposite direction. These are fundamentally different experiments, yet they have the exact same numbers in the results. There are typo in the Abstract section, their Related Works is within Results section. Citations are completely wrong, which I suspect to be AI generated.
https://openreview.net/pdf?id=0BYRYwGCbK - 711 broken prompts in a dataset that claims human review. The results say the opposite of the abstract. The abstract claims the work "reveals novel methods to elicit sycophancy." Then they proceed to show most modifiers perform about the same as the unmodified control (91-95% accuracy). Moreover, their citations also seem AI generated with false citations (wrong authors, wrong formats ..) Interestingly, undisclosed self-citation by Kevin Zhu.
https://openreview.net/pdf?id=VcRUAT5G8I - Two foundational methods are attributed to the wrong paper. TIES merging and Task Arithmetic, two well known methods, was introduced but never cited. Same AI generated citations, I am not even going to get to the content anymore.
Four papers, that I RANDOMLY CLICKED ON WITH NO ORDER, all follow the same template take existing method -> run it with some variation, likely done by AI -> put Kevin Zhu as an author -> submit to workshop
I am unsure how any of these bypass any form of peer review process, only today I learned how low the bar is for workshops.
Why I am posting: It angers to me when you market this to high schoolers and tell them you can get into Stanford and MIT. A 16 year old look at this and say, if I pay $3,325, I can get a Neurip publication. Then they proceed to let them publish a paper clear errors. This is academic dishonesty, but I dont think the kids even know they are commiting it.
Kevin Zhu puts his name on every single paper published, self-cite himself in these paper, and charge student $3,325.
I wasn't fully aware of how much lighter the workshop review process is, and I really want to hear why this is.
A while ago I worked on a project where I compared computer vision architectures on detecting and classifying brain tumors in brain MRI scans. I was looking for some feedback on the methodology and really anything else--just simple research stuff. This isn't meant to be some big paper but a small research project that I did as a high schooler.
There is a limitation to that which people not everybody understand. I already mentioned a limitation that you have a hierarchy here and going from correlation to causation and from causation from causation to explanation or to imagination. It's hard for people especially in machine learning to grasp that wall the limitation of one layer where one layer ends and the other one begins. Why? Because of two things. Machine learning school of thought has two paradigms that they love everybody love. Number one tabula raza I don't want to get any opinion I don't want to get any preconceived knowledge I want to derive everything by myself let the computer learn it and you find the word learning overused .. The other handcuff is let's do it the way that the brain does it. So if it looks like neurons interacting, it's good. If it looks like knowledge coming from rule system, it's bad because it's man-made .. Now there's limitation to that. We can prove today that you cannot do certain things by looking at data and data only. It's not a matter of opinion. It's a matter of mathematical proof that you cannot you can look at people who take aspirin all day and people whether or not they have headache all day and you cannot prove that the aspirin is what causes the headache.
In particular, Judea states: "It's not a matter of opinion. It's a matter of mathematical proof". So we have formal proof that there are fundamental limits of learning from data.
Judea later in the interview states we have solutions to problems faced by the machine learning community; nonetheless they are not adopted because of hype.
"This is the age of AI, Arxiv should be part of the movement instead of holding onto the old ways"
"The P.I. is a macro-manager, not a micro-manager, can't be expected to read every reference that his/her student puts in."
"I publish 20+ papers a year with my students, how do you expect me to read everything?"
"What about teams with 100s of people? How can you expect the authors to check references?"
"Who reads references in depth anyways!?"
These responses are very revealing how academia works. Apparently people have just been slapping names on research papers they've never even read or fact-checked themselves. Very obscene!
I’m trying to optimize an AI workflow for bleeding-edge Linux/ML debugging (Arch/CachyOS, CUDA, Python, unsloth, etc.).
Current stack:
- Claude = deep reasoning/mastermind
- Gemini 3.1 Pro = execution/logistics
- Perplexity = retrieval
Main problem: Gemini often gives high-friction or impractical fixes and degrades badly in long troubleshooting sessions. Example: suggested a long Podman workflow for an unsloth/Python issue where micromamba solved it much faster.
I also have access to hosted open models:
- Qwen 3 Coder 30B
- Qwen 3.5 122B
- Mistral Large 675B
- DeepSeek R1 Distill 70B
etc.
Question:
For people doing real-world Linux/ML/debugging workflows (not benchmarks), what currently works best as the “execution/logistics” model with strong web/recent-ecosystem awareness?
I actually went and procured a RX 7900XTX reference version to give it a try
My discovery is that it kind of still sucks
I have a small codebase for training flow matching models (SANA Architecture), which runs fine on my RTX3090s. But the moment I ported it across to ROCm it was NaNs absolutely everywhere. Forward passes were absolutely fine, but the moment you called backwards() all bets were off. The code was kept identical, apart from altering the pip environment to point to torch2.12 with ROCm7.2 instead of CUDA
Trying everything from switching between bf16, fp32, to tweaking various environment variables yielded nothing.
Unless there's some trick I'm missing, I get the feeling that ROCm is still seriously behind.
I tried running the nanoGPT training script, which ran perfectly
My intuition is that the ROCm people have probably tested their stack on established well known codebases. But, it's still remarkably fragile on even slightly uncommon code.
For an expensive simulator inside an MCMC DA setup like this, do you see amortised inference (SBI / neural posterior estimation) as more transformative than surrogating the forward model, since it attacks the per-pixel MCMC bottleneck directly?
A neural operator framing (FNO / DeepONet) mapping environmental forcings to ecosystem state feels appealing for spatial structure. But given your fluid mechanics work with discontinuities, have you found neural operators robust in systems with sharp spatial transitions (which would map to sharp biome boundaries here)?
Happy to share more context if useful. Thank you for your time.
I’m working on a 2-class classification problem (LCA vs. RCA coronary arteries) using 2D X-ray angiograms. I’m currently stuck in a cycle of extreme overfitting and could use some advice on my training strategy.
The Setup:
Dataset: Small (~900 training frames from ~300 unique DICOMs).
Architecture: InceptionV3 (PyTorch).
Input: Grayscale .npy arrays converted to 3-channel, resized to 299x299.
Current Strategy: Transfer learning from ImageNet. I’ve tried full unfreezing and partial unfreezing (last blocks).
The Problem: My training accuracy hits ~95-99% within a few epochs, but validation accuracy peaks early (around 74-79%) and then collapses toward 30-40% as the model starts memorizing the specific textures of the training patients.
What I’ve Tried So Far:
Normalization: Standard ImageNet mean/std (applied at load time).
Class Weights: Handled 2:1 imbalance (LCA:RCA).
Regularization: Added Dropout (tried 0.3 to 0.6) and Weight Decay (1e-4).
Augmentation: Flips, 25deg rotations, and translation.
Posting some practical findings from a structured audit of a production customer support RAG system. Methodology and caveats up front.
Methodology:
6 representative turns from a real production session as the eval set (small, acknowledged limitation)
LLM-as-judge using Claude Haiku 4.5, scoring relevance/accuracy/helpfulness/overall on 0-10, returning per-turn reasoning strings for verification
Same judge across all conditions, same questions, same retrieval state where possible
Production model held constant while isolating retrieval changes, then swept across 5 LLMs once retrieval was fixed
Live pricing from OpenRouter /models API rather than estimates
Findings:
Heuristic evaluation produces zero signal. The existing evaluator counted keywords and source references. Output was numerical but uncorrelated with response quality. LLM judges with explicit rubrics caught hallucinations, identified zero-retrieval turns, and produced reasoning that could be spot-checked. The cost is real but small (cents per run) compared to shipping undetected regressions.
Retrieval failures present as generation failures. A turn where the agent said "I don't have information about our company" looked like a model knowledge problem. Trace showed zero documents retrieved. Root cause was a similarity threshold (cosine distance 0.7 in Chroma) too strict for casual openers. Always inspect what entered the context window before tuning the generation step.
The production model was not on the Pareto frontier. Sweep across Gemini Flash Lite Preview (incumbent), Gemma 4 26B, Mistral Small 3.2, Nova Micro, and one more. Gemma 4 26B dominated the incumbent on both axes: higher quality scores (7.88 vs 7.33) at 75% lower cost. The incumbent was neither cheapest nor best.
Grounding constraints have measurable helpfulness cost. Adding "only state facts present in retrieved documents" to the system prompt improved accuracy scores and reduced helpfulness scores on turns where docs didn't fully answer the question. The judge consistently flagged "the documents don't specify this, contact support" responses as accurate but less actionable. Real tradeoff worth surfacing rather than discovering post-deployment.
Limitations I want to be honest about:
n=6 is small. Treat the deltas as directional, not as confidence intervals.
LLM-as-judge has known biases (length, verbosity, self-preference). Using a different family than the production models reduces but doesn't eliminate this. Sanity checked by reading the reasoning strings.
"Quality" here is judge-defined, not user-defined. A proper next step would be correlating judge scores with user satisfaction signals.
End-to-end delta: +19% quality, −79% cost. The cost win is robust because pricing is mechanical. The quality win I'd want to see replicated on a larger eval set before claiming it generalizes.
I've also written a detailed write up if anyone wants to go in depth on the evaluation process details. Mentioned below in comments 👇
Idea: Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share one KV cache. Diffusion head projects K=32 tokens in parallel; AR head verifies in a second pass and accepts the longest matching prefix. Output distribution is provably identical to the base model.
Results:
Up to 7.8× TPF, ~6× wall-clock on MATH-500.
16% of params trained, <1B tokens, 24h on 8×H200.
vs. diffusion LMs (Dream, Fast-dLLM-v2, SDAR, Mercury, Gemini Diffusion): they modify base weights and lose accuracy (Fast-dLLM-v2: -11 pts on MATH-500). Orthrus freezes the backbone; accuracy matches Qwen3-8B exactly.
vs. Speculative Decoding (EAGLE-3, DFlash): No external drafter, no separate cache, and zero Time-To-First-Token (TTFT) penalty because we don't have to initialize and sync a separate drafter model. KV overhead is O(1) (~4.5 MiB flat). Acceptance length on MATH-500: 11.7 vs. 7.9 (DFlash) vs. 3.5 (EAGLE-3).
Single-step denoising beats multi-step (6.35 vs. 3.53 TPF). KL distillation beats CE on acceptance rate.
Limitations: strictly bounded by the frozen base model (inherits its biases, hallucinations, knowledge gaps); Qwen3-only evaluation; greedy + rejection sampling only.
I am learning physics informed neural networks. Currently, I am solving a simple second ODE (damped harmonic oscillator). The equation is m*d2y/dt2 + mu*dy/dt + k*y = 0 (bcs: y(t=0) = 1, y'(t=0) = 0). I managed to draft a code. The code works for k values upto 50. However, when increased the value beyond 50, PINN is predicting trivial solution. I tried several things: reducing the learning rate, increasing the data points, reusing the weights trained using lower k values, and using a for loop to increase the k value in smaller steps (step size 20). However, none of them helped. Could you help me with this. Thanks in advance.
Hi guys, I’m gonna do a data analysis project based on data privacy, bias and data interpretability. For this reason our professor asked for a real world dataset in order to analyze a real case. Additionally I would prefer the least anonymity possible for that dataset in order to create some interesting technique over it (differential privacy, k-anonimity exc…)
Do you have any advice where to find the dataset? (links or website names) Because I checked on Kaggle but I don’t know how to find if the dataset is real or not
I feel like we've crossed a weird threshold in the generative AI space where the arms race against botnets is just over. and the bots won
I was reading that interview recently where the Reddit CEO was floating the idea of using Face ID and Touch ID just to verify that commenters are actual humans. it honestly hit me how absurd things have gotten. standard heuristics and behavioral analysis are completely useless now against modern LLMs, and vision models solve captchas faster than I can. the dead internet theory is basically just our daily engineering reality at this point
we are at a stage where the only reliable way to prove you aren't an automated script is to literally anchor your digital presence to your physical biology. From a purely technical standpoint, it’s fascinating seeing the shift toward hardware verification. like looking at the engineering behind that Orb device the idea of doing local biometric iris hashing on custom hardware just to output a zero-knowledge proof of personhood. It's wild that we actually need dedicated physical devices now just to enforce the concept of "one human, one account"
it makes total sense why platforms are pushing for this, beacuse trying to build software firewalls against infinitely scalable AI agents is a losing battle. but it just feels like such a massive, permanent shift for how the internet works. idk, is anyone else working on sybil resistance right now? are we just collectively accepting that biometric hardware gates are the only way to save the web from being 99% synthetic noise?
Background: I am a software developer, not an ML researcher. This started from a practical question — why do AI coding tools send proprietary client code to remote servers when the task only requires Swift? Following that question produced this framework.
The core proposal
Current approaches to LLM distillation ask: how do we preserve as much general capability as possible in a smaller model?
This paper asks the opposite: can we deliberately eliminate all capability except one task — and use the point where everything outside that task becomes incoherent as the measurable boundary of a deployable kernel?
The instrument for finding that boundary is hallucination. Specifically: the field uses entropy-guided methods to detect where a model's knowledge boundary is. This paper proposes running the same signal in reverse — as a construction instrument during distillation rather than a detection tool after training.
For coding tasks the measurement is objective: compilation rate and pass@k. You distill until Python pass@k stays high and COBOL compilation rate hits zero. That gradient is the boundary. The compiler is the arbiter — not a subjective assessment.
What existing research supports this
Task-specific capabilities concentrate in sparse attention head sets. Zeroing out five math-specific heads degrades math performance by up to 65% while leaving other tasks largely unaffected. This suggests boundary discovery via targeted distillation is more tractable than naive weight entanglement analysis implies. (Bair et al. 2026, arxiv.org/abs/2603.03335)
Knowledge boundary discovery via entropy-guided RL already exists. This paper proposes running it in the opposite direction — moving the boundary inward deliberately rather than detecting where it already is. (Wang & Lu 2026, arxiv.org/abs/2603.21022)
Machine unlearning (forget loss + retain loss) provides the negative reinforcement mechanism for capability retirement — driving deprecated patterns below operational utility without weight deletion.
A 770M parameter model distilled from a 540B teacher outperformed the teacher on specific tasks using 80% of training data — distillation consistently beats training from scratch for task-specific performance.
What is not validated
Whether the two-curve gradient is clean enough to be practically useful or whether within-domain weight entanglement makes it too noisy
Whether the measurement methodology generalises cleanly beyond code to domains without formal correctness criteria
The precise protocol parameters
This is a research agenda not a result. The paper is explicit about what is validated and what is hypothesis. It includes an appendix with self-critique and responses to the likely technical objections including the weight entanglement challenge.
The framework also proposes a complete lifecycle mechanism — upskilling kernels when technology evolves and downskilling deprecated capabilities through negative reinforcement — and a bidirectional boundary mapping approach that would produce a complete skill inventory of a frontier model.
Hi everyone, I am currently looking for a Emotion Cause Extraction (ECE) model that is ready to go which means that I can download the model and run it immediately on text.
Until we can design a mathematical system with one unavoidable intrinsic goal that drives it with undeniable force and encode that to hardware, plug it into a simulator of raw data, and give it the initial faculties to form, store, manipulate and alter all patterns based on its own feedback with no restriction on developing new faculties; all this AI noise will only serve investors accumulating wealth.
The currently required data sanitization and filtration, and the missing intrinsic unavoidable goal, kill the very base requirement for intelligence to emerge as we see and value it in humans.
Of course if that happens, new questions arise: human safety from conflict with the system; not just the current concerns which are human misuse related; and what ideology to follow while deciding the goal. But those could be dealt with, given we have the base.
For the present situation of things: the current increasing productivity automation is ofcourse undeniable. But that should not be a bad thing if we look towards the long horizon of things. People enjoy cooking, and if doing the dishes and the prep and the shopping were to be automated, it should only make things better. Ofcourse if we can figure out a way to tackle the unemployment and resource access problem and thus wealth concentration, for people that were too specialized for the old system of labour.
Posting because I've now seen this exact bug at multiple teams shipping ML to Snapdragon, and the pattern is worth writing up.
ONNX Runtime's QNN execution provider (the one that targets Qualcomm's Hexagon NPU on Snapdragon SoCs) will silently route unsupported ops to the CPU. Your accuracy is fine, your eval latency on the dev board looks fine, but production latency mysteriously triples because the input distribution stresses fallback paths differently — and the runtime never raises anything louder than a startup-log line nobody reads.
The default median-of-N latency gate doesn't catch this, because fallback creates a bimodal distribution and the median lands on the fast cluster. Three things end up being necessary:
**Run on real hardware** — emulators implement the ISA in software so every op is "supported" (for the wrong reason), and cloud x86 doesn't load the QNN EP at all
**Gate on coefficient of variation alongside median** — healthy on-NPU CV is 2–5%, intermittent fallback pushes it >15%
**Parse the ORT profiling JSON and assert NPU FLOP percentage** — the routing info is in there but you have to opt into `profiling_level=detailed` and post-process it; the default warning-level log just says "23 nodes assigned to QNN, 7 to CPU"
The third one is the diagnostic that actually identifies which op fell back, so you can either swap it for a supported equivalent, pin the QNN SDK, or escalate to firmware.
Curious if anyone here has hit similar silent-fallback patterns with TensorRT on Jetson or CoreML on iOS — I'd expect the symptom (bimodal latency, silent provider routing) but haven't gone digging. Same with ExecuTorch.
"Attention arXiv authors: Our Code of Conduct states that by signing your name as an author of a paper, each author takes full responsibility for all its contents, irrespective of how the contents were generated.
If generative AI tools generate inappropriate language, plagiarized content, biased content, errors, mistakes, incorrect references, or misleading content, and that output is included in scientific works, it is the responsibility of the author(s).
We have recently clarified our penalties for this. If a submission contains incontrovertible evidence that the authors did not check the results of LLM generation, this means we can't trust anything in the paper.
The penalty is a 1-year ban from arXiv followed by the requirement that subsequent arXiv submissions must first be accepted at a reputable peer-reviewed venue.
Examples of incontrovertible evidence: hallucinated references, meta-comments from the LLM ("here is a 200 word summary; would you like me to make any changes?"; "the data in this table is illustrative, fill it in with the real numbers from your experiments")."
I've been reading more about training in imagined environments, especially the work of the Dreamer series and RialTo, and I'm curious about how this could apply to CL.
Take an example of a robot deployed in a home that notices it has a high failure rate when picking up a specific object (let's say cans in a kitchen). It then builds a world model of the kitchen from it's deployment data, generates can-grasping rollouts within it and RL post-trains in the imagined env, then deploys the new policy.
This feels like continual learning to me? But formal continual learning seems to be more about task sequences (learn A, then learn B, then measure forgetting on A) and the example I'm describing doesn't fit into that. I'm not sure if what i'm describing is deployment-time adaptation, imagined replay for CL, self-improvement loops, or some mix.
Two things I'd like takes on:
Is anyone updating the world model itself continually from deployment data, not just the policy? Most of what I've read keeps the world model frozen post-training.
What breaks first when you actually try the closed loop (deploy → world model update → imagined rollouts → policy update → deploy)? My guess is world model drift compounds but haven't seen it characterized.
LangChain just wrapped Day 1 of Interrupt 2026 and announced a few things worth knowing about:
SmithDB — A purpose-built distributed database for agent observability. The problem they're solving: agent traces are getting too large and complex for general-purpose databases. SmithDB is built with Rust, Apache DataFusion, and Vortex, designed specifically for multimodal content and long-span tracing. They're reporting P50 latency of 92ms for loading trace trees and 400ms for full-text search, with up to 12x speedup over previous LangSmith performance. Architecture is object storage + small Postgres metadata store + stateless services, so it scales elastically and can be self-hosted.
Context Hub — A centralized system for managing agent context (AGENTS.md files, skills, policies, memory) in LangSmith. The interesting part is they're working with MongoDB, Pinecone, Elastic, and Redis on an open standard for agent memory — covering episodic, semantic, and procedural memory with versioning and portability across frameworks.
Deep Agents v0.6 — New release includes ContextHubBackend integration, an installable code interpreter that gives agents a programmable workspace inside the agent loop (distinct from sandboxes — this is for composing tools and managing state within the reasoning process), and you can scope specific file paths to different backends.
The conference also has production case studies from Toyota, Coinbase, Lyft, LinkedIn, Bridgewater Associates, and others on deploying agents at enterprise scale. Andrew Ng keynoted alongside Harrison Chase.
I keep hearing some version of this: “A paper that got accepted years ago wouldn’t stand a chance today.” Honestly, for a lot of ML subfields, this doesn’t sound crazy anymore. A paper that once looked solid can now look under-evaluated, under-ablated, weak on baselines, or just too obvious.
So maybe the real claim is: A mediocre accepted ML paper from years ago would probably get rejected today.
Do people agree? Has the bar actually gone up, or has the field just become more crowded and more competitive?
Hey, I'm getting deeper into model finetuning and training. I was just curious what most practitioners here prefer - do you invest in your own GPUs or rent compute when needed? Personally, I’ve grown frustrated with renting GPUs on platforms, but setting up my own environment keeps giving me errors. I wasted like 3 hours just fixing CUDA. I’m looking for a more integrated platform ,ideally with transparent pricing so I can control costs. Would love to hear what worked best for you and why.
OpenAI launched a deployment company with $4B initial investment, 19 partner organizations, and acquired Tomoro (UK-based AI consultancy, ~150 engineers). The pitch: embed "Forward Deployed Engineers" into enterprises to help them actually use AI.
This is basically the Palantir playbook. Send engineers into complex organizations, build deep integrations, become infrastructure. But the reason OpenAI is doing this tells you something uncomfortable: the gap between "model capability" and "production deployment" is widening, not closing.
Over a million enterprises have adopted OpenAI products. But adoption and deployment are different things. Enterprises can sign up for an API key without having any workflow that actually benefits from it. The model gets better every quarter but the integration work stays hard.
Daybreak (their new security product) is interesting but feels like a separate conversation. The deployment company is the signal. When the leading model company decides it needs its own consulting arm, it's acknowledging that selling API access isn't enough. The last mile is still human-intensive, context-specific, and resistant to automation.
For the ML community this should reframe how we think about impact. A 5% benchmark improvement matters less than a tool that makes deployment 5% easier. The research frontier and the deployment frontier are diverging, and capital is following the deployment side. I've noticed this in my own work too, switched to Verdent recently and what surprised me is how much of the value is in the workflow layer, not the model selection. No FDEs needed to wire things up.
Hi reddit! I made this post on r/MLQuestions, but I am posting it here too for spread:)
This is a case I have been assigned at work and I'd love input from anyone who's tackled something similar.
I'm building a failure prediction model for ~33k chargers. The devices emit data at two very different rates depending on operational state: roughly 1 obs/hour when idle and 1 obs/20s when active with a different feature set in each mode. I want to try predicting failures within a 7 day horizon, but I am open for other suggestions.
The positive rate is around 1% at 30 days and 2% at 90 days with a max of 5% of devices ever failing. Strong per-device behavioral variance makes it hard to even define what "normal" looks like. Devices have different usage patterns and
I'm now thinking about whether the mode shift problem is better solved at the architecture level or the data level. One option I'm considering is two separate RNN encoders for each operational state feeding into a shared decoder. But I'm also open to windowing and sampling approaches. And beyond reweighting and loss skewing what has actually worked for you at sub-2% positive rates in time series?
I’ve been exploring Physics-Informed Neural Networks (PINNs) to solve high-velocity thermal problems. I built Met-Shield, a re-entry simulator that uses a PINN to predict thermal gradients on a spacecraft shield.
The PINN Phase:
Architecture: I’m using a fully connected network trained to satisfy the 3D Heat Equation as its primary loss function.
Physics Constraints: The model is constrained by the thermal diffusivity and conductivity of Ti-6Al-4V (Titanium alloy).
The Goal: I wanted to see if a PINN could provide more robust generalization than a standard FDM solver when dealing with noisy atmospheric trajectory data.
The Performance Handoff: Once trained, I integrated the model logic into a custom C++ engine compiled to WebAssembly. This allows the simulation to run natively in the browser at 60fps, predicting metallurgical phase transitions (Alpha-to-Beta Titanium) on the fly.
The Struggle: While the PINN's math is solid, I’m seeing some convergence issues when the heat flux spikes during the "Max Q" phase of re-entry. I’m also looking for advice on better ways to weight the physics-loss vs. the data-loss in the total loss function.
I’ve open-sourced the repo and would love for some ML engineers to look at my training loop and architecture.
Sharing a new paper from the GPP and PokeAgent teams. Gemini Plays Pokémon (GPP) was the first AI system to complete Pokémon Blue, Yellow Legacy on hard mode, and Crystal without losing a battle. How? Early signs of iterative harness development. In the Blue era a human watched the stream and edited the harness. By Yellow Legacy and Crystal, the model itself was performing most of the editing through general meta-tools (define_agent, run_code, notepad edits). Our new paper, Continual Harness: Online Adaptation for Self-Improving Foundation Agents, formalizes the loop and automates the refining role end to end. We then carry the same loop into training, enabling model-harness co-learning.
The takeaways: 1. Iterative harness refinement closes most of the gap to a hand-engineered version. 2. Long-horizon agency requires self-refinement, and self-refinement requires a useful model. 3. The future of agents is model-harness co-learning.
TL;DR: 15 popular websites (Amazon, GitHub, BBC News, arXiv, Booking, Hugging Face, etc.) packaged as self-contained Flask + SQLite apps in a single Docker image, with a control plane that resets each site to byte-identical state in <1 second, all by human-in-the-loop coding agent (e.g., Claude Code or CodeX). We support all 643 WebVoyager tasks out of the box.
Call for contribution: Our Next goal is 100+ popular websites — covering all of Online-Mind2Web (147 sites) and beyond. Two tracks:
Contribute a new mirror site (use the coding-agent pipeline → human verify → open PR) → co-author on the final paper
Review submitted PRs (5 reviews → co-author)
We also released useful skills for you(your coding agent) to work on it! Typically you can create a new mirron within 1 day! See more contribution details at Contribute Guide.
Why WebHarbor: running web agent benchmarks on the live web is a nightmare — reCAPTCHA, geo-blocks, content drift, network flakiness, and tasks that go stale within months. Plus you can't reset the live web, which rules out heavy RL training. You will need a lightweight, easy-to-reset, task-driven evolving environments for web agent, both evaluation and training!
It’s 2:30 AM. My youngest just woke up crying for water, completely derailing my train of thought while I was trying to debug a weird edge case in a side project. I stared at my IDE, then at my local model running in the terminal, then back at the IDE. My brain felt like absolute, unrecoverable mush. I thought it was just standard sleep deprivation. Turns out, there's actual research backing up exactly what I've been feeling. The phrase going around is 'Your AI use is breaking my brain,' and man, I feel that in my bones.
I automate everything. That’s my whole personality online and off. I write scripts, I chain APIs, I deploy agents so I can shut my laptop by 5 PM. But lately, my workflow has completely shifted. I'm not really coding as much as I am aggressively micro-managing a fleet of digital interns. And according to a bunch of recent data dropping from Wired, BBC, and Countercurrents, this heavy multi-tool oversight is fundamentally changing how our brains process work.
Let’s look at the actual numbers. There’s a fascinating distinction coming out of recent studies between burnout and brain fry. They are not the same thing. When we use AI to replace repetitive boilerplate or log parsing, burnout scores actually drop by about 15%. That makes sense. That’s the dream we were sold. But here’s the kicker: cognitive overload goes up. Why? Because we aren't doing the work, we are supervising it.
Think about what happens when you prompt an LLM. You ask it to build a React component. It spits out 150 lines of code in seconds. Now you have to read it, parse its logic, hunt for hallucinations, and figure out how it integrates with your existing state management. Reading and validating someone else’s code—especially a bot’s—requires a completely different, intensely taxing type of cognitive bandwidth. A recent BCG study hit the nail on the head: using AI well, on top of performing our other tasks, makes work doubly or triply effortful. We're seeing more self-reported errors simply because our working memory is entirely maxed out.
Then there's the atrophy issue. Wired just highlighted research suggesting that relying on AI for just 10 minutes can negatively impact your ability to think and problem-solve. Ten minutes. That’s less time than I spend trying to convince Opus4.7 to stop inventing deprecated API endpoints. The BBC interviewed researchers who pointed out something terrifying. If you aren't doing the actual thinking, your capability to do that kind of thinking is going to atrophy. It's a muscle. We're putting it in a cast.
I noticed this last week. I was trying to write a basic regex for input validation. A year ago, I would have thought about it for two minutes and typed it out. This time, I instantly alt-tabbed to CC, pasted the requirement, and waited. It gave me a slightly flawed regex. I prompted it again. It gave me another one. I spent five minutes arguing with a model over something I used to know how to do natively. My brain took the path of least resistance, offloaded the logic, and got stuck in an oversight loop. I eventually shipped it at 2am, still broken.
An article in Fortune framed it perfectly as a space issue. The technology eats up more space in our overall cognitive processing because we fill every 'saved' time slot with additional prompting. We don't take micro-breaks anymore. When you code manually, you pause. You stare out the window. You type. When you use AI, the generation is instant. You are immediately thrust into the validation phase. Your brain never rests. It’s a relentless request-review cycle.
Aruna and Xingqi did an eight-month ethnographic study of 200 employees and found that AI usage intensified work rather than making it easier. We are falling into a cognitive offloading trap. We think we are saving time, but we are just trading physical typing time for intense mental processing time. It’s like trading a long walk for a high-intensity interval sprint. Sure, you get there faster, but you're completely exhausted.
I’m not saying I’m going to stop using these tools. This saved me 3 hours yesterday on a database migration script alone. But we have to talk about the hidden cost of this productivity. We treat our brains like unlimited RAM, opening more context windows, and eventually, the system is going to crash. We are morphing from creators into editors, from engineers into middle managers of stochastic parrots.
The cognitive dissonance is real. If I have to spend one more hour reviewing a perfectly formatted, subtly incorrect Python script, I might just go back to writing everything in Vim without plugins.
How are you guys managing this load? Are you time-boxing your AI use? Are you forcing yourselves to write the first draft before asking for an assist? Let me know if you've found a workflow that reduces cognitive load without sacrificing speed. Because right now, I’m running out of mental bandwidth, and I still have to figure out how to get my toddler to eat vegetables tomorrow.
I trained a set of deep learning (transformer-based) chess models to play like humans (inspired by MAIA and Grandmaster Chess Without Search).
There's a separate model for each 100-point rating bucket from ~800 to 2500+. I started with training a mid-strength model from scratch on a 8xH100 cluster, then fine-tuned models for the other rating ranges on my local 5090 GPU. The total training size was nearly a year of Lichess data, about 1B total games.
Each rating range actually has 3 models: A move model, a thinking time model, and a white win / draw / black win model. Despite being quite small (only 9MM parameters!) the move models achieve better accuracy than MAIA-2 and are approximately on par with MAIA-3 (see here for MAIA-2 comparison).
AFAIK this is the only attempt to train on thinking times in chess, so I don't have a benchmark to compare against for that.
Likely because of the network size, at high ratings the models aren't quite as good as they could be. They see short tactical motifs but can't do deep calculation - probably a bigger model would help here.
The move and win models take into account player ratings and clock times. For instance, under extreme time pressure a much stronger player has a lower win prob even if their opponent is weaker. The models blunder more under time pressure as well.
The data pipeline is C++ via nanobind, then training with Pytorch. Getting this right was actually the thing I spent the most time on. Pre-shuffling the dataset and then being able to read the shuffled dataset sequentially at training time kept the GPU utilization high. Without this it spent a huge percentage of time on I/O while the GPU sat idle. Happy to answer questions about the rating-conditioning, the clock model, or the data pipeline.
We've been building Scenema Audio as part of our video production platform at scenema.ai, and we're releasing the model weights and inference code.
The core idea: emotional performance and voice identity are independent. You describe how the speech should be performed (rage, grief, excitement, a child's wonder), and optionally provide reference audio for voice identity. The reference provides the "who." The prompt provides the "how." Any voice can perform any emotion, even if that voice has never been recorded in that emotional state.
Limitations (and why we still use it)
This is a diffusion model, not a traditional TTS pipeline. Common issues include repetition and gibberish on some seeds. Different seeds give different results, and you will not get a perfect output with 0% error rate. This model is meant for a post-editing workflow: generate, pick the best take, trim if needed. Same way you'd work with any generative model.
That said, we keep coming back to Scenema Audio over even Gemini 3.1 Flash TTS, which is already more controllable than most TTS systems out there. The reason is simple: the output just sounds more natural and less robotic. There's a quality to diffusion-generated speech that autoregressive TTS doesn't quite match, especially for emotional delivery.
Audio-first video generation
As this video points out, generating audio first and then using it to drive video generation is a powerful workflow. That's actually how we've used Scenema Audio in some cases. Generate the voice performance, then feed it into an A2V pipeline (LTX 2.3, Wan 2.6, Seedance 2.0, etc.) to generate video that matches the speech. Here's an example of that workflow in action.
On distillation and speed
A few people have asked this. Our bottleneck is not denoising steps. The diffusion pass is a small fraction of total generation time. The real costs are elsewhere in the pipeline. We're already at 8 steps (down from 50 in the base model), and that's the sweet spot where quality holds.
Prompting matters
This model is sensitive to prompting, the same way LTX 2.3 is for video. A generic voice description gives you generic output. A specific, theatrical description with action tags gives you a performance. There's also a pace parameter that controls how much time the model gets per word. Takes some experimentation to find what works for your use case, but once you do, you can generate hours of audio with minimal quality loss.
Complex words and proper nouns benefit from phonetic spelling. Unlike traditional TTS, it doesn't have a phoneme-to-audio pipeline or a pronunciation dictionary. If it garbles "Tchaikovsky," you would spell it "Chai-koff-skee" or whatever makes sense to you.
Docker REST API with automatic VRAM management
We ship this as a Docker container with a REST API. Same setup we use in production on scenema.ai. The service auto-detects your GPU and picks the right configuration:
VRAM Audio Model Gemma Notes 16 GB INT8 (4.9 GB) CPU streaming Needs 32 GB system RAM 24 GB INT8 (4.9 GB) NF4 on GPU Default config 48 GB bf16 (9.8 GB) bf16 on GPU Best quality
We went with Docker because that's how we serve it. No dependency hell, no conda environments. Pull, set your HF token for Gemma access, then docker compose up.
ComfyUI
Native ComfyUI node support is planned. We're hoping to release it in the coming weeks, unless someone from the community beats us to it. In the meantime, the REST API is straightforward to call from a custom node since it's just a local HTTP service.
We've helped some folks with synthetic data for a number of different projects and some of them for "document data". Like annotated PDFs, PNGs. Tax forms, health forms. Especially things with PII that are hard to get because of obvious privacy concerns. So, we came up with an engine to build a simulation and then extract the data from that simulation.
We're trying to make sure our pipeline fits into a normal training pipeline, so I'm curious about your workflows or training pipelines. Today we output in formats consistent with FUNSD, BIO, YOLO (like v5 and higher), Donut, COCO, etc. Are we shooting for the right stuff, or are people training for something different that could use a different format or ontology or something?
Other things we're trying to figure out are like is a PyPi SDK package useful, do people just use the API and not care, shut up and give me a zip file? :-)
I have a paper that has been "on-hold" for about 2 weeks now. I understand that it might take a little longer now because of inundation of AI generated low-effort papers but my papers have gone from "on-hold" to "submitted" within a couple of days in the past. Wondering if anyone else is facing the same issue.
I have a project consisting of generating high quality free ebook covers out of its content. On my 16GB of ram machine with no gpu, i have tested the opensourced stable diffusion models without any success. All return bad quality covers with blurred faces and scenes that do not match the prompt whatsoever. So, i have switched to generating the images with google imagen models which gave me outstanding results but for a short period of time since i cannot afford hundreds of generations due to my limited financial resources. So, having said that, is there a model that comes close to what google models provide, that runs locally on my 16GB no-gpu machine (even if it takes 1 hour to generate a single cover) ?
I am working on a benchmark and need to manage several interlocking components: datasets and metadata, diverse ML tasks (varying inputs and outputs), and baseline experiments covering models, training, and evaluations. Any pointers to projects that handle these through clean/minimal data structures like Dataclasses or Pydantic. Specifically, I want to see how others manage:
Dataset Information: Representing dataset cards, metadata, and split definitions as first-class objects.
Task Schemas: Defining ML tasks with specific input and output types to ensure consistency across different models.
Experiment Composition: Structures that link a model and training configuration to a specific evaluation and prediction set.
If you have seen repositories that maintain these abstractions with minimal boilerplate and high type safety, please share them. I am interested in internal code organization rather than external tools like W&B or MLflow. Definitely aware of cookie-cutter data-science, looking for for datastructures.
Van Rooij, Guest, Adolfi, Kolokolova, and Rich claimed to have proven that AGI via ML is impossible in Computational Brain & Behavior in 2024. The basic idea was to try to reduce a known NP-hard problem to the problem of learning a human-level classifier from data. The purported result, called "Ingenia Theorem" by the authors, made some noise on the internet, including here.
My paper showing that the proof is irreparably broken is now also out in CBB (ungated preprint here).
The basic issue is that "human-level classifier" is not mathematically defined, which the authors solve by ... never defining it. They have a construct that corresponds to "distribution of human situation-behaviour tuples" when they introduce the problem, but the construct then gets swapped out for "for all polytime-sampleable distributions" when it comes time to doing the formal proof. This means that the paper, if you find-and-replace human situation-behavior tuples for ImageNet inputs/labels, also proves that learning to classify ImageNet is intractable.
Blogpost discussion similar attempts from Penrose to Chomsky here.
The plot.rs file, used for plotting only was written using AI as I could not wrap my head around plotters crate, apart from that everything was by my own.
Hello everyone, I am pretty new to anything ML related so bear with me.
I’ve been working on a UFC fight prediction project in Python using pandas + scikit-learn. Right now I’m using logistic regression since the output is binary (fighter A wins or fighter B wins). I’m currently using features like striking accuracy, takedown averages, reach, height, and age from historical UFC data, then generating predicted probabilities for fights and parlays. I'm interested in pushing this project to assist with round robin betting.
One thing I’ve noticed is that the model tends to favor simply stacking the highest-probability fighters, which made me start thinking more about the difference between raw probability and actual betting value/EV. I also already knew that MMA stats are very nonlinear. For example, age might barely matter until a certain threshold, takedown stats may matter much more depending on matchup style, and certain combinations of traits seem more important than the individual stats themselves.
Because of that, I’m wondering whether random forests (or another tree-based model) would make more sense than logistic regression for capturing those interactions. I'm still trying to fully grasp how random forests work, so this might not apply though? Anyway I'm just trying to have fun with this project and I’d genuinely appreciate input from anyone.
I've spent 1y trying to predict company growth from the full text of their 10-k filings.
It completely failed.
But I've had a lot of fun playing with encoder transformers and making them good at numbers (bypassing the tokenizer/prediction head when it sees one). I've MLM-trained a modified ModernBERT for this and it works really well. The model is available on HF: https://huggingface.co/edereynal/financial_bert
Then, I've made this MLM-trained model into a nice sequence embedder.
I've experimented with JEPA, but it failed.
The auto-encoder setup worked much better. But I encountered a strange frequency bias, where the decoder only cared about high-frequency information, and I had to mitigate it by adding a Contrastive Loss term.
I also investigated the tendency of transformers to have a low effective-dimensionality output space (compared to its input embedding space).
So, here's the technical blog post, that reads a bit like "how to waste 1,000 hours and $400 trying to solve an unsolvable real-world problem, but having a lot of fun along the way":
Traditional ViTs utilize dense (N2) self-attention, which can become pretty costly at higher resolutions. In this work, we propose an alternative backbone with a core-periphery block-sparse attention structure that scales as (2NC + N2) for C core tokens.
We further train this using nested dropout, which enables test-time elastic adjustments to the inference cost. The whole model can achieve very competitive dense & classification accuracy compared with DINOv3, and is stable across resolutions (256 all the way to 1024).
Interestingly, the core-dense attention patterns exhibit strong emergent behavior. At early layers of the network the attention maps are isotropic (spherical), but become increasingly semantically aligned deeper into the network.
While adjusting the number of core tokens, if you decrease the number of cores, the attention patterns become more diffuse & cover a spatially larger region. If you increase the number of core tokens, the attention patterns become smaller & more concentrated.
Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.
Didn't make it to New York for the Knowledge Graph Conference this year, but caught some talks virtually and managed to download all the decks. Sharing them below because some of what was shown is worth knowing about.
Majority of the presentations described live production systems. Enterprises showing up with real engineers delivering real compliance requirements. That's not usual for most ai eventss. Most talks are proofs of concept with a "coming soon to prod" slide at the end.
For eg - Bloomberg showed a formal dependency model for ontology governance. AbbVie walked through ARCH, their internal KG for drug and disease-area intelligence, connected to a scoring engine, a researcher dashboard, and an LLM companion for plain-language queries. The KG is the source of truth. The LLM is the interface. Even Morgan Stanley showed continuous SHACL drift detection on risk reporting data - automated weekly checks that alert when the semantic layer deviates from what's governed.
Crux: knowledge graphs are being actively used as infrastructure, not a retrieval layer on top of vectors. The graph is doing reasoning work, not lookup work.
We've been skeptical of the "only using vector dbs" framing for a while. These production systems are the clearest evidence I've seen of where that breaks down - and what the alternative actually looks like when it's running. Link to the all the decks in the comment.
Hello everyone, Presenting at a top-tier conference for the first time and having a very hard time coming up with an appropriate design for my poster. Everything I do seems basic and banal. My paper is more theory-oriented, and apart from putting math formulas in bold in the middle, I am not sure what the best way is to design the poster. Even the sizing choice is complicated as ICML gives 3 different recommendations to pick from, and somehow from my computer, I can’t see how the PowerPoint slide will look like printed on those dimensions. And Printing a poster is nearly $100 CAD, so there’s no room for trial and error.
So If anyone has any tips on how to do it properly, I have been using PowerPoint, but perhaps I should go to Canvas? Or Does anyone have another software to recommend?
I made my own minimal implementation of JEPA algorithms.
Making things minimal and removing all the things needed for scaling the algorithm always helped me understanding. So I stripped everything but the algorithm parts. What's left is 160-200 lines of code that distills the essence of the mathematics.
It is very easy to compare with the math in the paper and the code and how it can be implemented in PyTorch.
I added [algo]_tutorial.md files to help with understanding.
If you take two open-source models: Gemma4-31B and Qwen3.5-27B, you might notice that Qwen beats Gemma on almost all benchmark sets. On the other hand, the situation is reversed on arena.ai: Gemma beats Qwen quite decisively, by about 50 ELO points in different categories. Is there a good explanation for this apparent discrepancy?
(DISCLAIMER: I accidentally deleted the last post on this subreddit my apologies if this is your second time seeing it)
Last year I made a post about my steam recommender The last one was great and served its purpose of showing many people new games, But this new version is much more functional!
I love making recommendation systems that tell the user WHY they got the recommendation.
During a steam sale event, I always find myself trying to look for new video games to play. If I wanted to find a new game I would try to whittle it down by using steam tags, but the steam tag system is very broad "action". could apply to many many games.
That got me thinking, what aspects do I like about my favorite games?
Well I like Persona 4 because of the city vibes and jazz fusion,
Spore because of the unique character creation and whimsical theme.
Balatro for its unique deck building synergies.
What if I could capture unique tags that identify a game that aren't just "action" and put them into vectors to show the (focus) of a game
For example I could break persona 4 into something like
Game play Focus vector: Day cycle 20% Dungeon crawling 20% Social sim 20%
Tags: Music: jazz fusion Vibe: Small rural town
I find that this system makes searching for games more "fun" now I can see why I like balatro. I like it because of the card synergies not so much for its rogue-like nature.
I also find that this helps find new underrated games, and beats the trap that Collaborative Filtering algorithms that get into where it "feels" like you get recommended the same things.
( I actually made some git issues myself for problems I can't fix)
if anyone has any criticism I would love to hear it! this is probably my favorite passion project. I made this during final season, Since the database takes around 1 day to build, there were some inevitable rate limiting errors that I go into. So I am sure there are many bugs. if you come across any and are willing to share that would be Amazing.
Hope this website helps people find new games! Also I have a advance mode for people that don't mind messing with sliders and weird data terms.
Hi all, I'm working on a problem of making computer understand emotions behind some piano piece and can't finish it without your help, so I hope that you can help me collect a dataset for my master thesis by sending the following survey to the people who listen and love piano music.
I'll be enormously grateful if you could fill out the Google Forms with information: 1. Piano piece name 2. Different emotions that piece evokes when you listen to it
We're working on a problem that sits at the intersection of distributed systems and ML deployment: how do you get multiple AI agents — backed by different models, running across different machines — to coordinate work, verify each other's output, and produce results you can actually trust?
OmniNode is our answer. The architecture is:
- Contract-driven task delegation (structured specs, not prompt chains)
- Multi-model routing — local inference (Qwen3-Coder-30B on RTX 5090, DeepSeek-R1 on M2 Ultra) alongside cloud models, routed by task characteristics
- Deterministic execution with event sourcing (Kafka) and replayable state
- Evidence-backed completion — work isn't "done" because an agent said so, it's done when durable verification passes
The core insight we're building on: as agent autonomy increases, reproducibility and enforceable correctness matter more, not less. Prompts don't scale. Contracts do.
Looking for engineers who think in systems, care about how things fail under real conditions, and are comfortable working across the ML/infrastructure boundary.
Fully remote. DM me with something you've built or something interesting you've been experimenting with.
I found this game changer for the llm token consuming it's double my work using claude code
The problem: AI agents grep through your codebase, miss indirect callers, and break things they didn't know were connected.
Ganglia gives the agent a real call graph. It knows what calls what, scores blast radius before any edit, traces full call chains, and keeps memory across sessions.
Works with Claude Code, Cursor, Windsurf. Free tier forever. Pro $19/mo, Team $49/mo — everything free while in beta, and beta users get a permanent discount when paid tiers activate.
TabPFN-3 was released today, the next iteration of the tabular foundation model, originally published in Nature.
Quick recap for anyone new to TabPFN: TabPFN predicts on tabular data in a single forward pass - no training, no hyperparameter search, no tuning. Built on TabPFN-2.5 (Nov 2025) and TabPFNv2 (Nature, Jan 2025), which together crossed 3M downloads and 200+ published applications.
What's new:
Scale: 1M rows on a single H100 (10x larger than 2.5).A reduced KV cache (~8GB per million rows per estimator) and row-chunked inference make this practical on a single GPU
Speed: 10x-1000x faster inference than previous versions. 120x on SHAP via KV caching
Thinking Mode (API only): test-time compute pushes predictions further via one-time extra fitting at inference. Beats every non-TabPFN method on TabArena by over 200 Elo, including 4-hour-tuned AutoGluon 1.5 extreme. Gap more than doubles to 420 Elo on the larger-data slice.
Accuracy: it has a 93% win rate over classical ML on TabArena
Many-class: native non-parametric retrieval decoder supporting up to 160 classes
Calibrated quantile regression: bar-distribution regression head produces calibrated quantile predictions in a single forward pass
Lifts adjacent tasks: time-series, interpretability, and new SOTA on relational benchmarks.
3 deployment paths: API, enterprise licensing, and open-source weights (permissive for research and academic evaluation)
You can try it here or read the model report here. Happy to answer questions in the comments.
I have analyzed some decoder transformer models using Lyapunov spectral analysis and found that the ratio of the MLP and attention spectral norms strongly indicates whether a model will eventually collapse to rank-1 or not by the final layers.
I found that the spectral ratio is best kept around 0.5–2 for keeping the model stable till the final layers.
Has anyone applying for a Korean visa for ICML been asked for the conference’s Business Registration Number? The ICML website explicitly states that it cannot provide the BRC so I wanted to ask how others handled this
I'm looking for a cache simulator / benchmark suite suited to the kind of tiered ephemeral cache that LLM providers use — e.g. Anthropic's 4-tier prompt cache, where context sits across several tiers with different residency windows, costs, and eviction rules.
I've already tried libCacheSim. It's a solid piece of software for classical caches (LRU, FIFO, ARC, SIEVE, S3-FIFO, W-TinyLFU, Belady oracle, plugin API, trace replay), and I got a plugin + synthetic trace working against it. But it seems fundamentally aimed at single, flat caches:
One cache, not a hierarchy of tiers with different costs
No notion of partial / multi-tier residency of the same object
Misses are uniform-cost — no way to express "miss to L1 vs miss to L3 vs full recompute," which is the whole point in LLM prompt caching
Trace model is atomic get/put, not edit streams where cached objects mutate in place
No first-class support for token-weighted object sizes
So it works as a baseline comparator, but it's not really the right shape for evaluating LLM-cache policies.
Does anyone know of cache-testing software specifically targeting LLM-provider-style caches? Something that models multiple tiers with per-tier cost/residency, tokenised objects, and edit-driven workloads would be ideal. Academic code, research prototypes, internal tools that got open-sourced — all welcome. Even partial matches (e.g. KV-cache simulators for inference servers) would be useful pointers.
A few weeks ago I shared the results of a benchmark here comparing 6 LLMs on subtitle translation, scored with two reference-free QE metrics - MetricX-24 (~13B mT5-XXL) and COMETKiwi (~10.7B XLM-R-XXL) - combined into a TQI index. Posting a follow-up because we did human review afterwards, and the result is worth discussing.
The original benchmark put TranslateGemma-12b first in every language pair. The natural question: are those high scores accurate, or are the metrics insensitive in their high-confidence zone? These metrics correlate well with human judgment at the population level (that's what they're trained for), but population-level correlation doesn't tell you whether the segments they call "clean" are actually clean.
So we ran the check directly. 21 English subtitle segments from one tutorial video. TranslateGemma's translations into 4 languages (ES, JA, TH, ZH-CN - Korean and Traditional Chinese got dropped). All 84 translations chosen because they passed the dashboard clean-rule (MX < 5 AND CK ≥ 0.70) in all 4 languages simultaneously. Then full MQM annotation by professional linguists - Major/Minor severity, with categories covering accuracy (mistranslation, omission, addition, untranslated), fluency (grammar, punctuation, inconsistency), style, terminology.
All 25 human-found Accuracy-class errors fell in the metric-blind quadrant. Zero overlap with the auto-flagged region (which contained one Style-category Major error).
Japanese carries 10 of 15 total mistranslations across the dataset, all metric-blind, despite having the highest mean COMETKiwi (0.863) of the four languages.
Caveat: small n, one model, one content set, so the numbers are directional rather than definitive.
Original thread: [link] Full benchmark report: in comments.
After the ICML abstract deadline, industry coauthors asked to be removed, they missed their employer's internal approval window. They had contributed (discussions and written feedback) but I hadn't explicitly asked before adding them.
January: wrote to PC chairs, got written confirmation from all coauthors, got explicit written approval. Chairs said they'd implement.
Never happened. Paper accepted four months later with original author list. At camera-ready we followed up. Chairs reversed: blanket policy, no exceptions, keep the list or withdraw.
What do you think about this situation? I don't want to withdraw my paper, given that Neurips deadline has passed. What is most likely to happen, and any advice?
Hi, I am a student going into my first year in Ph.D in RL this September. Although each university kinda has their own reading groups, I was wondering if there is active RL Online reading group I can participate. Sadly I couldnt find any info elsewhere.
Does anyone have any information regarding Online RL Reading groups?
Last cycle, one of my research paper was rejected because of formatting issues. I recently heard from someone that there may be a tool or software called something like “aclpubcheck” that can be used to check whether a manuscript follows the required submission format correctly.
Does anyone know the exact name of this software or tool?
Also, if there is no such reliable tool, what is the best way to make sure that a paper is formatted correctly before submission? Like, how do you usually verify margins, page limits, font size, template compliance, bibliography format, and other formatting requirements before submitting to a conference or journal?
The modern ML (LLM) compiler stack is brutal. TVM is 500K+ lines of C++. PyTorch piles Dynamo, Inductor, and Triton on top of each other. I built a hackable LLM compiler from scratch and am documenting the process. It takes a small model (TinyLlama, Qwen2.5-7B) and lowers it to a sequence of CUDA kernels through six IRs.
Currently, on RTX 5090, the emitted FP32 kernels run at geomean 1.11× vs PyTorch eager and 1.20× vs torch.compile, with full-block parity on TinyLlama-128 and Qwen2.5-7B at seq=128. Wins on small reductions / SDPA / kv-projections (up to 4.7×); losses on dense matmul at seq=512.
Part 1 took an RMSNorm layer end-to-end and walked the upper half of that pipeline in detail. This second part closes the gap and explains Tile IR, Kernel IR, and associated lowering rules in depth.
The article focuses on producing a GPU schedule for an operation written in loop-nest form (Loop IR). Example for RMSNorm: python v0 = reciprocal(2048) for a0 in 0..32: # free for a1 in 0..2048: # reduce in2 = load x[0, a0, a1] v1 = multiply(in2, in2) acc0 <- add(acc0, v1) v2 = multiply(acc0, v0) v3 = add(v2, 1e-06) v4 = rsqrt(v3) for a2 in 0..2048: # free in3 = load x[0, a0, a2] in4 = load p_weight[a2] v5 = multiply(in3, v4) v6 = multiply(v5, in4) merged_n0[0, a0, a2] = v6
The stack mimics a sequence of optimization steps a CUDA engineer would perform when optimizing kernels: stage inputs to smem, reduce bank conflicts, increase occupancy, and so on. diff LoopOp │ ▼ [001] tileify — lift outer free Loops to thread axes [002] chunk_matmul_k — chunk the K reduce into K-outer × K-inner (intra-CTA) [003] split_matmul_k — promote the K-outer chunk loop into a grid dimension [004] cooperative_reduce — let multiple threads share one reduce; tree-merge with Combine [005] blockify_launch — pick block extents; partition free axes into BLOCK and THREAD [006] chunk_reduce — chunk non-matmul reduces so their Loads fit in shared memory [007] stage_inputs — hoist hot input slabs into Stage nodes [008] register_tile — replicate the inner tile so each thread owns a register block [009] permute_register_tile — reorder the register strip so bank-conflicting loads land on far columns [010] double_buffer — promote K-outer Stages to BufferedStage (ping-pong) [011] tma_copy — narrow eligible BufferedStages to TmaBufferedStage (sm_90+) [012] split_inner_for_swizzle — split the inner cache axis of a TmaBufferedStage for swizzle [013] async_copy — narrow the rest to AsyncBufferedStage (cp.async, sm_80+) [014] pad_smem — pad shared-memory strides to break bank conflicts [015] pipeline_k_outer — rotate the K-outer loop into prologue/steady-state/epilogue (cp.async + TMA) [016] mark_unroll — annotate small inner loops for #pragma unroll │ ▼ TileOp (fully scheduled)
Each stage can be reproduced with a CLI command. For example, the stage_inputs pass stages input buffers into smem if possible and if there is a benefit in doing that (inputs are being read multiple times within CTA). To see it, the following command can be used:
I have a multidimensional time series dataset and corresponding environment videos. I want to pass them to a VLM to perform some tasks. What is the best way to pass the time series data? From the literature review, I see there are two methods: pass time series as text and plot line charts and pass those as images.
Neither method performed well on my task. Appreciate any guidance.
I can see 2.88 million downloads per month for small Qwen3.5 model. I tried using earlier model 0.6B in a deep resarch workflow and it was very difficult to get something done with this model .
Firstly they have a very surface level understanding of concepts. Poor Semantic understand means they can get confused about the topic or the task.
Json outputs are often broken . Adding a layer of checks on top took much of my time while working with these models.
Slow resposne. This one depends on a lot of factors and can actullay be improved , still slow response is a buzz kill most of the time
I am very curious how is the community using these models.
An interactive visualisation of Jensen–Shannon divergence - the symmetric, always-finite cousin of KL. Shape two distributions and watch JSD, its ceiling of one bit, and the per-point contribution respond in real time. https://robotchinwag.com/posts/jensen-shannon-divergence-visualisation/
An AlphaZero agent has learnt to predict the value of a game state by training on data generated by self-play by the model and a series of predecessor models. By construction, this value should reflect the probability of winning against a copy of itself starting from the given state. To be more precise, the value measures the state's average strength against opponent players collected among all the predecessors of the current model. This average depends on the manner in which the training data is sampled from the pool of self-play data (using a rolling window of self-play by the latest x models, putting more emphasis on recent models by geometric weighting, etc.).
In each round of self-play, we can think of the agents (a copy for each player) making moves following a strategy, albeit a stochastic one (unless the temperature parameter is zero), defined by the PUCT function for the predicted values and policies, but that this strategy is a little perturbed by the addition of some proportion of Dirichlet noise. The purpose of this perturbation is to give the model an opportunity to find successful actions by chance and not get trapped into some rigid, possibly narrow, pattern of playing.
Because of role of noise in deciding which move to make, the formulation above that the value reflects the chances of winning against the model itself is an over-simplification. The data on which the value prediction is based does include "outlier" moves, and - as far as I've understood - this is a heuristic argument for the claim that the model makes its predictions based on experience of playing against a variety of different players.
However, due to the moves that differ the most from the "predicted" ones being outliers, such moves also have a correspondingly small impact on the value predictions: it is the agent's own playing style, and the historical development of said style, that governs value predictions.
So, if the agent meets a strong opponent, either a human being or an algorithm with a strong track record, why should AlphaZero's value prediction be a reliable measure of the agent's chances of winning against this opponent from the given position?
Experience has shown AlphaZero to indeed outperform both human players and other algorithms in a variety of games. I wonder if this success is also to be expected a priori, or is it conceivable that AlphaZero could even fail miserably in some game against a specific algorithm whose moves, though occurring in AlphaZero's training data pool, occur so infrequently that they don't make any significant impact on the predictions?
I completed my bachelors recently and I plan to applying to a masters program either this cycle or the next. Unfortunately, I did not publish any papers or do any research during my undergrad. Right now I’m in a research internship which is coming to and soon and it’s unlikely that I’ll get to publish a paper. I would like to know if reproducing results from a known paper for validation or extension or a comparative analysis counts as credible research. It’s the only thing I could find to do independently.
Scale AI and similar services charge a lot for annotation. MTurk is cheap but the quality is horrible for anything requiring real domain understanding.
For small teams that need a few thousand labeled examples to calibrate their evals or fine tune a model, there seems to be no good middle ground.
How is everyone handling this? Are you doing it manually or has anyone found something that actually works?
I generally work around 9–10 hours a day, but not contiguously. I can usually carve out a dedicated chunk of time in the morning, take lab or project meetings in the afternoon, and block out around 6–8 PM for commute, exercise, socializing, and dinner. I also get more work done in the evening, since my focus is often best then. On weekends, I mostly run errands and try out new food spots, but I also make sure to do at least a little bit of work every day.
I try to schedule my Slurm jobs so they run when I’m not actively working, so I can collect results when I get back. When I don’t have at least some Slurm jobs going, I feel anxious. I also feel pressure to use coding agents whenever I can. At the same time, I find that these agents can create an illusion of productivity: I end up with more “dead time” where I’m just waiting for the agent to finish thinking.
I’m in my 3rd year as a PhD student at a top-5 program for my field in the US, and I’ve been thinking a lot about time management recently. I'm done with classes and not TA'ing this quarter. I mainly target the 3 main ML conferences (though I would love to make every deadline consistently and don’t), plus core NLP venues and journals.
Hello Peeps Salman, Shuguang and Adil here from Katanemo Labs (a DigitalOcean company).
Wanted to introduce our latest research on agentic systems called Signals. If you've been building agents, you've probably noticed that there are far too many agent traces/trajectories to review one by one, and using humans or extra LLM calls to inspect all of them gets expensive really fast. The paper proposes a lightweight way to compute structured “signals” from live agent interactions so you can surface the trajectories most worth looking at, without changing the agent’s online behavior. Computing Signals doesn't require a GPU.
Signals are grouped into a simple taxonomy across interaction, execution, and environment patterns, including things like misalignment, stagnation, disengagement, failure, looping, and exhaustion. In an annotation study on τ-bench, signal-based sampling reached an 82% informativeness rate versus 54% for random sampling, which translated to a 1.52x efficiency gain per informative trajectory.
Deepmind released a paper on D4RT at the start of this year which crucially enabled a “4D” understanding of the world via structure from motion and generating: 1. Point cloud reconstruction from 2D videos (not static scenes) 2. Camera pose estimation
You could pass in a video of a dog walking on a beach and it would estimate the 3d representation of the beach and the dog at any point in time.
They did not release the model though. Are there any open source, available implementations of anything similar now?
Parax is a library for "Parametric modeling" in JAX, attempting to bridge the approach between pure JAX PyTrees, and more object-orientated modeling approaches (e.g. using Equinox).
v0.7 has been released, featuring a more polished API as well as some detailed examples in the documentation.
Some of Parax's features:
Derived/constrained parameters with metadata
Computed PyTrees and callable parameterizations
Abstract interfaces for fixed, bounded, and probabilistic PyTrees and parameters
Two new examples in the docs that show off these features
I was trying to submit to ICML FM4LS workshop but noticed that openreview is not accepting submissions any more? although the deadline listed on the website is end of day May 9th AoE. Was there any communication that I missed? Anyone else facing same issue?
It turns LLM benchmark results into a directed graph:
If model A beats model B on benchmark X, add an edge A -> B.
Then it searches for the shortest transitive chain between two models.
The meme version is:
Can LLaMA 2 7B beat Claude Opus 4.7?
In an absurd transitive benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation. Some experimental findings from the current Artificial Analysis data snapshot:
Weak-to-strong reachability is high. I checked 126,937 pairs where the source model has lower Intelligence Index than the target model. 119,514 of them are reachable through benchmark win chains, for a reachable rate of 94.2%.
Most paths are short. Among reachable weak-to-strong pairs: 2-3 hop paths account for 91.4%. So this is not mostly long-chain cherry-picking.
Direct reversal triples are abundant. After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form: (source model, target model, benchmark), where the source has lower Intelligence Index but higher score on that benchmark.
Some benchmarks create more reversals than others. Current high-reversal / useful-signal candidates include: Humanity's Last Exam, IFBench, AIME 2025, TAU2, SciCode
Different benchmarks have different interpretations. For example, IFBench has roughly: reversal rate: ~17.5%, coverage: ~80.0%, correlation with Intelligence Index: r≈0.82. This suggests it may provide an independent skill signal rather than simply duplicating the overall ranking.
My current interpretation:
LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder. Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise.
The next question is whether reversal structure can help build better evaluation metrics:
identify specialist models;
identify volatile benchmarks;
build robust generalist scores;
select complementary benchmark sets;
decompose models into capability fingerprints.
Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks?
I know publication count is not everything, and quality, contribution, advisor/lab culture, subfield, and luck all matter a lot. But to make the comparison easier, I’m curious about the publication-count side specifically. For an ML PhD, what would you consider an average publication outcome by graduation?
For example, would something like 3–5 first-author papers at A/top-tier venues* be considered roughly average, or would that already be above average in ML?
By A*/top-tier, I’m thinking of venues such as NeurIPS, ICML, ICLR, CVPR, ACL, EMNLP, etc., depending on the subfield.
Important: Again, I know paper count is a crude metric. I’m just trying to get a rough sense of what people in the field see as average, strong, or unusually strong.
Hi, this ICML workshop: https://trustworthy-ai-for-good.github.io/ says abstract deadline was yesterday, however on openreview it only lists the full paper deadline, and I can still submit the full paper even though missing abstract deadline.
Is there any chance my submission get desk-rejected?
I want to share an interview experience anonymously in case it helps others on the job market.
I was approached about a Vancouver ML role that was presented to me as research-oriented. The recruiter told me the team had looked at my research and that I should be ready to discuss my projects, so I expected a conversation about modelling, research ideas, and fit.
That is not how the interview felt. It was much more focused on trivia-style and coding-style questioning, with very little real engagement with my research or how I think about problems. The overall process felt much narrower and more one-sided than what had been communicated beforehand.
What bothered me was not that they wanted a different skill set. That is completely fair. The problem was the mismatch between how the role was framed and how the interview was actually run. I was also left confused about the publication angle, because the role gave the impression of being research and publication connected, but the interview did not make it feel that way in practice, and they could not name any recent publications they had that they were proud of when I asked.
My takeaway is simple: in ML hiring, some roles are described as research roles, but the actual evaluation is aimed at something quite different. That can waste a candidate’s time, especially if they were contacted based on a research profile.
My advice is to ask very directly what the interview will focus on, how research-oriented the team really is day to day, and whether your background is actually what they want before entering the process. I did all this, and was misled.
Has anyone else here had a “research” interview that turned out to be something else entirely?
I have a question about NeurIPS submission/review rules and anonymous code repositories.
Suppose a paper was submitted before the deadline, and the anonymous code repo is linked as supplementary/reproducibility material. After the deadline, we notice that one label/name in the paper is misleading or mislabeled. The numerical results and metrics are unchanged, but the corrected label slightly affects how the results should be interpreted.
Would it be acceptable for the anonymous repo README to show the reproduced metrics with the correct labels, with a minimal clarification such as “labels corrected; numbers unchanged”? Or could this be considered an impermissible post-deadline correction/revision of the paper?
I am not talking about uploading a corrected PDF to the repo, changing results, or adding new experiments. The idea would only be to document the reproduction table with the correct labels in the README, while keeping the repo fully anonymous.
Has anyone seen guidance from NeurIPS / OpenReview / ACs on this kind of situation? What is the safest way to handle it during review — README clarification, OpenReview comment, rebuttal only ?
DeepSeek dropped the full V4 paper this week. preview from april was 58 pages, this version adds a lot of technical depth.
What stood out for me.
FP4 quantization aware training. theyre running FP4 QAT directly in late stage training. MoE expert weights quantized to FP4 (the main gpu memory consumer). QK path in the CSA indexer uses FP4 activations. 2x speedup on QK selector with 99.7% recall preserved. inference runs directly on the FP4 weights.
Efficiency table is striking:
Model 1M context FLOPs KV cache V3.2 baseline baseline V4-Pro 27% of baseline 10% of baseline V4-Flash 10% of baseline 7% of baseline
Training stability, two mechanisms.
Trillion parameter MoE has the loss spike problem, divergence, unpredictable failures. they documented two fixes.
Anticipatory routing. they deliberately desync main model and router updates. current step uses latest params for features, but routing uses cached older params. breaks the feedback loop that amplifies anomalies. 20% overhead but only kicks in during loss spikes.
SwiGLU clamping. hard limits on the SwiGLU linear path (-10 to 10) and gate path (max 10). suppresses extreme values that would cascade.
Generative reward model. instead of separate reward models for RLHF, they use the same model to generate and evaluate. trained on scored data, model learns to judge its own outputs with reasoning attached. minimal human labeling, reasoning grounded eval, unified training.
Human eval results. chinese writing, V4-Pro 62.7% win rate vs gemini 3.1 pro, 77.5% on writing quality specifically. white collar tasks (30 advanced tasks across 13 industries), V4-Pro-Max gets 63% non loss rate vs opus 4.6 max. coding agent eval, 52% of users said V4-Pro is ready as their default coding model, 39% leaned yes, less than 9% said no. tracks my own use, swapped V4-Pro into my verdent runs last week and havent noticed a quality hit on day to day work.
The headline for me is FP4 QAT with minimal quality degradation. if this generalizes the cost structure of training and inference shifts a lot, especially noticeable on multi agent setups where one task can spawn 5-10 model calls.
I got a NeurIPS reviewer invite last week, and accepted it. It said that bidding for papers will start may 8th (today). But haven’t heard anything yet.
Has anyone else heard anything? Did I mess up while accepting the reviewer invite or is this normal?
P.s., thoughts on the AI-assisted reviewing experiment? Are y’all volunteering?
You control two skew-normal distributions and can see the KL integrand and the KL metric. It’s good for exploring how it changes with a mean offset, skew, truncation and discretisation.
Please post your personal projects, startups, product placements, collaboration needs, blogs etc.
Please mention the payment and pricing requirements for products and services.
Please do not post link shorteners, link aggregator websites , or auto-subscribe links.
--
Any abuse of trust will lead to bans.
Encourage others who create new posts for questions to post here instead!
Thread will stay alive until next one so keep posting after the date in the title.
--
Meta: This is an experiment. If the community doesnt like this, we will cancel it. This is to encourage those in the community to promote their work by not spamming the main threads.
Hiring: [Location], Salary:[], [Remote | Relocation], [Full Time | Contract | Part Time] and [Brief overview, what you're looking for]
For Those looking for jobs please use this template
Want to be Hired: [Location], Salary Expectation:[], [Remote | Relocation], [Full Time | Contract | Part Time] Resume: [Link to resume] and [Brief overview, what you're looking for]
Please remember that this community is geared towards those with experience.