
Machine Learning Blog | ML@CMU | Carnegie Mellon University

Teaching Vision-Language Models to Speak Cinema
machine learning
Authors: Zhiqiu Lin (1), Chancharik Mitra (1), Siyuan Cen (1), Isaac Li (1), Yuhan Huang (1), Yu Tong Tiffany Ling (1), Hewei Wang (1), Irene Pi (1), Shihang Zhu (1), Ryan Rao (1), George Liu (1), Jiaxi Li (1), Ruojin Li (1), Yili Han (1), Yilun Du (2), Deva Ramanan (1)

Affiliations: (1) Carnegie Mellon University; (2) Harvard University

A year of building a video caption pipeline with 100+ professional creators, and what it taught us about scaling supervision instead of models.

By Zhiqiu Lin and Chancharik Mitra. Based on our CVPR 2026 work, Building a Precise Video Language with Human-AI Oversight (Highlight, Top 3%).

How close is today's video generator to a Hollywood cinematographer?

Hollywood directors reach for certain shots because they make a scene land. They cue a specific feeling in the viewer that flat coverage cannot. Open your favorite video generator (Veo 3.1, Seedance 2, or any of the latest open-source models) and ask it for a dolly zoom of a man standing in the middle of a bustling street, the way Hitchcock used the shot to make the world feel like it is collapsing inward. Or a rack focus pulling from a coffee cup to the woman behind it, the kind of focus pull that quietly tells the audience where to look. Or a Dutch-angle shot of a nervous person staring into the void, a tilted frame that puts the viewer on edge.

Most generators will hand back something close to a generic dolly-in, or a slow-motion clip with the wrong focal subject. The output is usually visually competent, but it does not do the thing. The model has clearly seen videos that contain these techniques. It just does not know how to act on the words.

We think this is symptomatic of a broader gap. Filmmakers communicate with a shared, precise vocabulary: shot size, frame position, focus type, lens distortion, camera height, video speed. Today's vision-language models (VLMs), and the captioning datasets that feed them, mostly do not.

In this post we describe CHAI (Critique-based Human-AI Oversight), a captioning pipeline that we built over the past year with 100+ professional video creators. In our usage, a caption is a long, structured paragraph describing a video's content, motion, and camera work, not a subtitle track. Existing video caption datasets are typically written either by crowdworkers, who lack the cinematic vocabulary to describe a shot precisely, or by large vision-language models, whose captions are fluent (free of grammatical and stylistic errors) but routinely hallucinate, describing objects and motions that are not in the video. The central idea behind CHAI is to combine human and model strengths: a captioner model (e.g., a large video-language model such as Gemini-2.5-Pro) writes the draft, a trained human critiques it, and the model revises against that critique.

This post works through four questions:

1. Why do VLMs struggle with cinematic prompts?

2. How should humans and models divide the captioning work?

3. Does the quality of human critique change what the model can learn?

4. Do better captions in the training data give us a better video generator?

Figure 1. Three failure modes of current video captioning pipelines (top, red), and the choices we make in response (bottom, blue): a precise specification, a human-AI oversight loop, and post-training on explicit preferences plus critiques rather than output-only comparisons.

Question 1: Why do VLMs struggle with cinematic prompts?

A natural first hypothesis is that this is a capacity problem — that the current generation of vision-language models is simply too small, has too little context, or has not been pretrained on enough video to handle cinematic prompts, and that the next generation will solve it. But after auditing eight popular video-text datasets from 2016 to 2025 (ActivityNet Captions, MSR-VTT, DREAM-1K, ShareGPT4Video, PerceptionLM, and others), we think the bottleneck is somewhere else. The visual content is in the videos these models train on, and modern VLMs perceive it well. What is missing is the language: the captions paired with those videos do not contain the precise vocabulary needed to describe cinematic technique. In our experiments, training larger models on more of the same data only marginally improved these issues. They appear to be problems of annotation policy, not of capacity.

Three patterns showed up over and over:

• Imprecise terminology. Captions conflate dolly-in (the camera physically moves forward) with zoom-in (the focal length changes), or describe a fisheye distortion as "circular building."

• Missing information. Captions describe what is in the frame and skip everything else: motion, camera shake, focus changes, shot size. Anything temporal, anything about the camera, gets dropped.

• Subjective descriptions. "An atmospheric shot full of tension" tells a model nothing it can ground in pixels.

A natural next thought: just hire crowdworkers to write more careful captions. We tried that. Crowdworkers still confused dolly-in with zoom-in, called wide shots "close-ups," and described fisheye distortion as "a round building." Seeing is not the same as knowing how to describe.

Figure 2. Crowdworker vs. expert descriptions for the same clips. Crowdworkers see the aerial-view shot, the fisheye lens, and the dolly zoom. They just reach for everyday language ("bird's-eye view," "circular building," "warping effect") instead of the technical vocabulary the model would need to act on the description.

What worked, eventually, was bringing in people whose job requires this vocabulary: cinematographers, directors of photography, motion graphics designers, VFX artists, game designers, camera operators. Over the past year, we built a structured caption specification with 100+ such collaborators. The specification has five aspects:

• Subject (type, attribute, relations)

• Scene (composition, dynamics, overlays, point of view)

• Motion (subject actions, interactions, group activity)

• Spatial (shot size, frame position, depth, spatial movement)

• Camera (focus type, depth of field, steadiness, movement, video speed, lens distortion, height, angle)

All five aspects together involve roughly 200 low-level visual primitives, every one with a definition and a decision rule for when it applies. This prevents annotators from freelancing terminology, as all they have to do is tag against the spec.
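To make the idea of "tagging against the spec" concrete, here is a minimal sketch of how such a specification could be represented in code. Only the five aspect names come from the post; the `Primitive` class, the two example entries, and their decision-rule wording are hypothetical illustrations, not the released specification.

```python
from dataclasses import dataclass

# The five aspect names are from the post; everything else is illustrative.
ASPECTS = ("Subject", "Scene", "Motion", "Spatial", "Camera")


@dataclass(frozen=True)
class Primitive:
    """One low-level visual or motion primitive in a caption spec."""
    name: str
    aspect: str
    sub_aspect: str
    definition: str
    decision_rule: str  # when an annotator should apply this label

    def __post_init__(self):
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect!r}")


# Hypothetical entries based on the dolly-in vs. zoom-in distinction above.
SPEC = {
    p.name: p
    for p in [
        Primitive(
            "dolly_in", "Camera", "movement",
            "The camera physically moves forward.",
            "Apply when foreground and background shift relative to each "
            "other (parallax) as the view tightens.",
        ),
        Primitive(
            "zoom_in", "Camera", "movement",
            "The focal length increases while the camera stays in place.",
            "Apply when the view tightens with no parallax.",
        ),
    ]
}


def validate_tags(tags):
    """Return the tags that are not defined in the spec."""
    return [t for t in tags if t not in SPEC]


print(validate_tags(["dolly_in", "push_in"]))
```

A closed vocabulary like this is what lets an annotation tool reject free-form terms at entry time instead of cleaning them up later.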

Figure 3. Typical issues with prior captioning work (left, red) and what we converged on (right, blue). The structured taxonomy was built collaboratively with cinematographers, directors of photography, VFX artists, motion graphics designers, and game designers, and is paired with an annotation policy and training tutorials so the vocabulary stays consistent across annotators.
Figure 4. The full taxonomy. Five aspects, each decomposed into sub-aspects, each grounded in a set of visual or motion primitives.

Takeaway: VLMs struggle with cinematic prompts because the captions they were trained on do not contain the precise vocabulary professionals use. In our experiments, scaling models or data alone gave only marginal gains; specifying the language carefully made a much bigger difference.

Question 2: How should humans and models divide the captioning work?

Once we made the spec, we still had to decide who would write the long captions. The two obvious choices, humans or models, each come with well-known limitations.

Humans alone produce captions with typos, grammatical errors, and inconsistent event ordering. They also fatigue: 200 to 400 words of careful prose per video, while looking up the spec, is exhausting and expensive.

Models alone produce captions that read beautifully but that, on a depressing fraction of clips, confidently describe objects and motions that are not there. They also frequently mix up left and right.

What we noticed in pilot studies is that the failure modes are asymmetric in a useful way. Today's LLMs write better prose than most humans. But humans, especially trained ones, are much better than LLMs at noticing visual or motion errors in a draft, the kind where the caption says "moving left" but the subject is moving right. So we built the pipeline around that asymmetry. The model drafts, the human critiques, the model revises. This is conceptually similar to Saunders et al. (2022)'s self-critiquing models for summarization, but applied to long-form video captioning where the human still does the hard part: catching grounded errors against the actual video.

Concretely, the loop:

1. Primitives. A trained annotator labels which visual and motion primitives are present in the clip.

2. Pre-caption. The model generates a long caption from those primitives, following the spec.

3. Critique. An annotator reads the pre-caption against the video and writes a critique pointing out what is wrong and what should change. The critique has to be accurate (the things it flags are wrong), complete (it does not miss errors), and constructive (it tells the model what to do, not just that something is bad).

4. Post-caption. The model revises its draft using the critique.

5. Refinement. If the post-caption is still off, the human refines the critique rather than rewriting the caption.
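The five steps above can be sketched as a small control loop. This is an illustration under our own assumptions, not the released pipeline: `draft`, `critique`, and `revise` are hypothetical callables standing in for the captioner model and the human annotator, and we assume the model always revises from the original pre-caption using the latest (refined) critique.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class Triple(NamedTuple):
    pre_caption: str
    critique: Optional[str]  # None if the draft was accepted as-is
    post_caption: str


def run_chai_loop(
    primitives: Sequence[str],
    draft: Callable[[Sequence[str]], str],
    critique: Callable[[str, Optional[str]], Optional[str]],
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Triple:
    pre = draft(primitives)             # step 2: pre-caption from primitives
    post, note = pre, None
    for _ in range(max_rounds):
        refined = critique(post, note)  # steps 3 and 5: write or refine critique
        if refined is None:             # the annotator accepts the caption
            break
        note = refined
        post = revise(pre, note)        # step 4: post-caption
    return Triple(pre, note, post)


# Toy stand-ins so the sketch runs end to end.
def toy_draft(primitives):
    return "A man walks left while the camera performs a " + primitives[0] + "."


def toy_critique(caption, previous):
    if "walks left" in caption:
        return "The man walks right, not left; replace 'left' with 'right'."
    return None


def toy_revise(pre_caption, note):
    return pre_caption.replace("walks left", "walks right")


triple = run_chai_loop(["dolly zoom"], toy_draft, toy_critique, toy_revise)
print(triple.post_caption)
```

Note that the human never edits the caption text directly in this sketch; the only human output is the critique, which is what makes it reusable as training data later.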

We tasked reviewers (top-performing annotators promoted to a quality-control role) with checking every critique and post-caption against the video. Annotators were scored on the accuracy of their critiques, while reviewers were rewarded for each annotator mistake they caught. As a result, both precision (do not flag things that are not wrong) and recall (do not miss things that are wrong) were incentivized at the data level, before any modeling happened.
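As a rough illustration of what precision and recall mean for a critique, the sketch below scores the set of issues an annotator flagged against the set a reviewer confirmed. The post does not say how annotator scores were actually computed, so the function, the issue labels, and the convention for empty sets are all assumptions.

```python
def critique_precision_recall(flagged, confirmed):
    """Precision and recall of flagged issues against reviewer-confirmed errors.

    Assumed convention: an empty denominator yields 1.0 (nothing to get wrong).
    """
    flagged, confirmed = set(flagged), set(confirmed)
    hits = flagged & confirmed
    precision = len(hits) / len(flagged) if flagged else 1.0
    recall = len(hits) / len(confirmed) if confirmed else 1.0
    return precision, recall


# Hypothetical issue labels for one pre-caption.
flagged = {"wrong_direction", "wrong_shot_size", "missing_rack_focus", "wrong_color"}
confirmed = {"wrong_direction", "wrong_shot_size", "hallucinated_object"}
precision, recall = critique_precision_recall(flagged, confirmed)
print(precision, recall)
```

Here two of four flags are real errors (precision 0.5) and two of three real errors are caught (recall about 0.67).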

Shifting the human's job from writing to proofreading has a side benefit we underestimated: each video takes far less cognitive effort, and the resulting 200 to 400 word captions end up more accurate than what either humans or models produce alone.

Takeaway: LLMs and humans have asymmetric strengths in long-form video captioning. Designing the pipeline around that asymmetry, rather than trying to replace one with the other, gives both better captions and a more sustainable annotation process.

Question 3: Does the quality of human critique change what the model can learn?

The pipeline produces a triple for every video: (pre-caption, critique, post-caption). That triple is more than just an annotated caption. It is supervision for three different post-training tasks at once:

• Captioning. Train the model to produce long, faithful captions.

• Reward modeling. Treat (pre-caption, post-caption) as a (rejected, preferred) pair.

• Critique generation. Train the model to write the critique itself, given the video and the draft.
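To show how one triple yields supervision for all three tasks above, here is a minimal sketch that unpacks it into three training records. The record layout and field names are hypothetical; the released data format may differ.

```python
def build_examples(video_id, pre_caption, critique, post_caption):
    """Turn one (pre-caption, critique, post-caption) triple into three records."""
    captioning = {
        "task": "caption",
        "video": video_id,
        "target": post_caption,    # the revised caption is the training target
    }
    reward = {
        "task": "reward",
        "video": video_id,
        "rejected": pre_caption,   # draft before critique
        "preferred": post_caption, # revision after critique
    }
    critique_generation = {
        "task": "critique",
        "video": video_id,
        "draft": pre_caption,
        "target": critique,        # the model learns to write the critique
    }
    return [captioning, reward, critique_generation]


examples = build_examples(
    "clip_0001",
    "The camera zooms in on the man.",
    "The camera physically moves forward; say 'dolly-in', not 'zoom in'.",
    "The camera dollies in toward the man.",
)
print([e["task"] for e in examples])
```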

We post-trained Qwen3-VL-8B on all three formats jointly using standard supervised fine-tuning (SFT). We also tried preference-optimization methods such as Direct Preference Optimization (DPO), which are often grouped with reinforcement learning (RL), but found that simple SFT on the full triple data was the strongest. The detailed numbers are in the paper; the headline is that adding explicit preference and critique signals improves every method we tested.

We were curious whether the quality of the critique mattered to downstream performance, or whether any "this is wrong" signal would do. So we ran an ablation: take a clean CHAI critique, deliberately degrade one property at a time (accuracy, completeness, or constructiveness), and measure how the post-trained model performs on each task.

Figure 6. A useful critique has to be accurate (the things it flags are actually wrong), complete (it catches the errors that are there), and constructive (it says what should change, not just that something is bad). All three are needed; degrading any one hurts the downstream model.

Results for an 8B Qwen3-VL post-trained on each variant are presented in Table 1. Caption and Critique are BLEU-4 scores (a standard text-generation metric measuring n-gram overlap with reference text on a 0–100 scale; higher means closer to the human reference) against held-out reference captions and critiques. For the Reward task, we report binary accuracy on whether the captioner scores the post-caption higher than the pre-caption (chance = 50). Higher is better on all three.

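Critique variant | Acc. | Rec. | Constr. | Caption | Reward | Critique
--- | --- | --- | --- | --- | --- | ---
Blind Gemini-2.5 | n/a | n/a | n/a | 10.9 | 44.5 | 21.1
Gemini-2.5 | n/a | n/a | n/a | 12.7 | 62.0 | 26.2
Inaccurate critique | ✗ | ✓ | ✓ | 12.1 | 47.1 | 21.9
Incomplete critique | ✓ | ✗ | ✓ | 12.5 | 56.6 | 28.7
Non-constructive critique | ✓ | ✓ | ✗ | 13.4 | 67.2 | 32.9
CHAI (with QC) | ✓ | ✓ | ✓ | 18.2 | 89.8 | 41.7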
Table 1. Post-training results when the critique is artificially degraded along one property at a time. Higher is better. As additional reference points, we also tried having off-the-shelf models generate the critiques in place of our human-AI pipeline: (1) Blind Gemini-2.5 uses Gemini-2.5 to critique with the caption text only and no video access (a language-prior baseline); (2) Gemini-2.5 uses the same model with full video input. CHAI (with QC) is our full pipeline including the peer-review quality-control step from Question 2 — i.e., the critiques are accurate, complete, and constructive.
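For readers who want the Reward column spelled out, the sketch below computes pairwise accuracy: the percentage of (pre-caption, post-caption) pairs where the scorer prefers the post-caption. The `score` callable is a hypothetical stand-in for the post-trained model (here a toy length-based scorer), and counting ties as errors is our assumption.

```python
def pairwise_reward_accuracy(score, pairs):
    """Percentage of pairs where score(post_caption) > score(pre_caption)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (pre, post) pair")
    # Ties count as incorrect (an assumption of this sketch).
    correct = sum(score(post) > score(pre) for pre, post in pairs)
    return 100.0 * correct / len(pairs)


# Toy scorer: longer captions score higher. A real scorer would also see the video.
toy_pairs = [
    ("zoom", "dolly-in"),
    ("a long wrong draft", "short fix"),
    ("pan", "pan left"),
    ("tilt", "tilt up slowly"),
]
accuracy = pairwise_reward_accuracy(len, toy_pairs)
print(accuracy)
```

A scorer that picks at random lands near 50 on this metric, which is the chance level quoted above.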

Three things stand out:

1. Quality is not optional. Dropping any one of the three properties materially hurts every downstream task. Non-constructive critiques (the cheapest to collect, since the annotator only has to flag the error, not say what should change) hurt the least but still leave a large gap.

2. Existing data is mostly non-constructive. We checked the critiques in publicly released datasets like Saunders et al.'s GDC release and MM-RLHF. More than half are non-constructive in our sense ("this is wrong" with no suggested fix). That helps explain why training on those datasets leaves performance on the table.

3. An 8B model can be competitive with much larger closed models when the data is right. On the same captioning, reward, and critique benchmarks, the post-trained 8B Qwen3-VL matches or exceeds GPT-5 and Gemini-3.1-Pro on the metrics we report. The model size has not changed; the supervision signal has.

A small bonus: the same reward model also helps at inference time. Best-of-N decoding with the trained reward model continues to improve performance with no additional human labels.
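Best-of-N decoding itself is simple: sample N candidate captions and keep the one the reward model scores highest. The sketch below assumes hypothetical `generate` and `score` callables; in a real setup both would be conditioned on the video, and the choice of N is a free parameter.

```python
def best_of_n(generate, score, n=8):
    """Sample n candidates and return the one with the highest reward score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)


# Toy stand-ins: a fixed pool of "samples" and a length-based scorer.
pool = iter(["zoom in", "dolly in toward the subject", "pan"])
best = best_of_n(lambda: next(pool), len, n=3)
print(best)
```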

Takeaway: The form of the critique is not a stylistic detail. A model jointly post-trained on captions, preferences, and critiques performs materially better on all three tasks when the critiques it is trained on are accurate, complete, and constructive — and materially worse when any one of those properties is missing.

Question 4: Do better captions in the training data give us a better video generator?

A skeptical reader might say: this is all very nice, but captioning is upstream of what most people actually want, which is generation. So we tested whether the improved captioner moves the needle on a downstream video generator. We took a large corpus of professional video (films, ads, music videos, gameplay), re-captioned it with the post-trained 8B model, and used those new captions to fine-tune Wan2.2.

The fine-tuned model can act on detailed prompts (up to roughly 400 words) for techniques that off-the-shelf generators reliably get wrong:

Figure 7. Two long generation prompts (the text fed to the video generator at inference time) that originated as captions produced by our post-trained captioner on similar held-out clips. Right: zero-shot Wan2.2 follows the prompt loosely, with a dolly zoom becoming a normal dolly-back and an isometric (2.5D) game scene becoming a generic 3D arc. Left: after Wan2.2 is fine-tuned on training videos re-captioned by our model, it follows the same prompt faithfully.

We did not change the generator architecture or training objective. The only thing that changed was the language used to describe the videos in the training set. That was enough to teach an existing generator a class of techniques it previously could not articulate.

Takeaway: A more precise caption vocabulary upstream translates into more controllable generation downstream, with the same model architecture and training recipe. The bottleneck for cinematic control was in the supervision, not the model.

Discussion

We started this project assuming we were going to train a captioner model. We ended up spending most of the year on the pipeline around it: what to write captions about, who should write them, who should check them, and what the checks should look like. The model contributions feel almost downstream of those choices.

Three things we wish we had appreciated earlier:

• Specification before scale. Training larger models on noisier data gave only marginal gains. Once the spec was in place, smaller models started looking very competitive.

• "Crowdsource it" is not a baseline; it is a different problem. Annotating cinematic technique correctly requires the same vocabulary the field already uses. Asking untrained workers to invent that vocabulary on the fly is not the cheap version of asking trained workers to apply it.

• Critiques are training data. The form of the critique we collect today decides how effectively models can be trained tomorrow. Datasets that record only thumbs-up / thumbs-down are leaving a lot of post-training signal on the table.

CHAI is one piece of a longer effort on precise video language. The closest companion is CameraBench (NeurIPS’25 Spotlight), our earlier benchmark on camera motion, which seeded the camera-side primitives in the spec.

Resources

We are releasing the specification, training tutorials, annotation platform, quality-control flow, data, code, and models. If you are working on video understanding or generation and want to use any of these, please do.

Project page: https://linzhiqiu.github.io/papers/chai/

Paper: https://arxiv.org/abs/2604.21718

Code: https://github.com/chancharikmitra/CHAI

References

Krishna et al., 2017. Dense-Captioning Events in Videos (ActivityNet Captions). ICCV. arXiv:1705.00754.

Xu et al., 2016. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR.

Wang et al., 2024. Tarsier: Recipes for Training and Evaluating Large Video Description Models (DREAM-1K). arXiv:2407.00634.

Chen et al., 2024. ShareGPT4Video: Improving Video Understanding and Generation with Better Captions. NeurIPS. arXiv:2406.04325.

Cho et al., 2025. PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding. arXiv:2504.13180.

Saunders et al., 2022. Self-critiquing Models for Assisting Human Evaluators. arXiv:2206.05802.

Zhang et al., 2025. MM-RLHF: The Next Step Forward in Multimodal LLM Alignment. arXiv:2502.10391.

Lin et al., 2025. Towards Understanding Camera Motions in Any Video (CameraBench). NeurIPS Spotlight. arXiv:2504.15376.

Wan Team, 2025. Wan: Open and Advanced Large-Scale Video Generative Models (Wan2.2). arXiv:2503.20314.

Bai et al., 2025. Qwen3-VL Technical Report. arXiv:2511.21631.

All opinions expressed in this post are those of the authors and do not represent the views of CMU.

https://blog.ml.cmu.edu/?p=22449
Introducing ARFBench: A time series question-answering benchmark based on real incidents
computer vision · machine learning · reinforcement learning · Research
Authors: Stephan Xie (1, 2), Ben Cohen (2), Mononito Goswami (3), Junhong Shen (1), Emaad Khwaja (2), Chenghao Liu (2), David Asker (2), Othmane Abou-Amal (2), Ameet Talwalkar (1, 2)

Affiliations: (1) Machine Learning Department, Carnegie Mellon University; (2) Datadog AI Research; (3) Amazon AI Research

More than a trillion dollars are lost every year due to system failures. To resolve them, engineers must troubleshoot outages quickly.

An important task in incident response involves analyzing observability metrics, or time series data that snapshot the health of software systems. For example, an engineer for a service may use Datadog to answer questions like “When did latency start increasing?” and “What metrics outside of latency are also behaving abnormally?” to localize the root cause of the anomalous behavior. These time series question-answering (TSQA) tasks are essential for engineers, and they are challenging but necessary tasks for site reliability engineering (SRE) models and agents to perform. In this work, we explore the degree to which AI models can perform TSQA tasks.

To this end, we’re excited to introduce the Anomaly Reasoning Framework Benchmark (ARFBench), a TSQA benchmark derived from real internal incidents at Datadog, using Datadog’s own internal telemetry (Figure 1). In this blog post, we’ll present three key takeaways from our benchmarking experiments:

  1. Existing models struggle: Leading LLMs, vision-language models (VLMs), and time series foundation models (TSFMs) have substantial room for improvement on ARFBench.
  2. Hybrid models help: We introduce a new hybrid TSFM-VLM model that yields comparable overall performance to top frontier models on ARFBench, demonstrating promising new approaches to TSQA modeling.
  3. Human–AI complementarity: We observe markedly different error profiles between our top TSFM-VLM model and human experts on ARFBench. These results suggest that their strengths are complementary. We introduce a model–expert oracle that establishes a new superhuman frontier for LLMs, VLMs, and TSFMs.
Figure 1: A. Workflow of ARFBench question-answer generation. Engineers use commercial messaging platforms to respond to incidents, where they typically send time series widgets that visualize relevant metrics. Time series and incident timelines from internally monitored incidents are used as input to an LLM pipeline and fit to eight different question templates testing various aspects of anomalies. The resulting multiple choice question-answer pairs can be used to evaluate various predictive models.
ARFBench: Using real-world incident data to create a TSQA benchmark

ARFBench is a TSQA benchmark based on real incidents internal to Datadog, using our own internal telemetry. Compared to existing benchmarks, ARFBench differs in three key aspects. First, it uses real time series data from production systems. Second, each question-answer (QA) example is grounded in expert annotations and additional context. And third, tasks are designed to test compositional reasoning: questions are organized into three tiers of increasing difficulty, with higher-tier tasks depending on correct reasoning on lower tiers (Figure 2).

Figure 2: Example questions from each tier of ARFBench. ARFBench questions are designed in three tiers of increasing difficulty, with higher tier tasks depending on correct reasoning on lower tiers.

ARFBench consists of 750 QA pairs drawn from 142 time series and 63 incidents. Time series in ARFBench have a maximum of 2283 variates (or dimensions) and 40k time steps, which present a challenging setting for context-limited models.

To create ARFBench, we built a VLM pipeline for extracting the time series widgets from internal incident discussion threads to help generate and filter question-answer pairs. We then manually verified every generated question for correctness and privacy concerns, and threw away questions that we found unsuitable.

Reasoning about time series and anomalies requires usage of meaningful context across data modalities. ARFBench enriches time series with two types of context: time series captions, which describe what the time series represent, and multivariate groupings, which contextualize each channel relative to a larger relevant collection of time series channels. For instance, while it may not always matter that a single pod fails and restarts in a service, the combination of many pods failing and restarting simultaneously could indicate a significant anomaly. This level of complexity reflects real-world conditions that many existing unimodal, synthetic datasets fail to capture (Figure 3).
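The pod-restart example can be made concrete with a toy sketch: a restart in a single pod is ignored, but a time step where a large share of pods restart together is flagged. The function name, the binary restart encoding, and the 50% threshold are our own assumptions for illustration, not part of ARFBench.

```python
def group_anomaly_steps(restarts, min_fraction=0.5):
    """Return time steps where at least min_fraction of pods restart at once.

    restarts: one list per pod, with 1 if the pod restarted at that time step.
    """
    n_pods = len(restarts)
    steps = []
    for t, column in enumerate(zip(*restarts)):
        if sum(column) / n_pods >= min_fraction:
            steps.append(t)
    return steps


# Four pods over five time steps (toy data).
restarts = [
    [0, 1, 0, 0, 0],  # isolated restart at t=1
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],  # isolated restart at t=4
]
flagged_steps = group_anomaly_steps(restarts)
print(flagged_steps)
```

Looked at one variate at a time, every pod here has at most two restarts; only the grouped view shows that three of four pods failed together at t=3.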

Figure 3: When analyzed alone, variates of a time series may not be anomalous. However, in the context of a grouping of variates, the same variate may be considered anomalous. The multivariate time series in this figure is based on the average remaining TLS certificate lifetime across different clusters and IDs of a particular service.
Leading LLMs, VLMs, and TSFMs have substantial room for improvement

We evaluated three categories of existing models on ARFBench: 

  • LLMs, which take time series as text input
  • VLMs, which take time series plots as image input
  • Time series LLMs, which use a time series encoder with an LLM backbone
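As an illustration of the first input mode (time series as text), here is one possible way to serialize a multivariate series into a prompt-friendly CSV string. The post does not describe the serialization actually used, so the format, the `stride` downsampling, and the rounding are assumptions.

```python
def serialize_series(timestamps, variates, stride=1, precision=2):
    """Render a multivariate time series as CSV text for an LLM prompt.

    variates: dict mapping a variate name to a list of values.
    stride:   keep every stride-th time step to save context length.
    """
    names = list(variates)
    lines = ["time," + ",".join(names)]
    for i in range(0, len(timestamps), stride):
        values = [f"{variates[name][i]:.{precision}f}" for name in names]
        lines.append(",".join([str(timestamps[i])] + values))
    return "\n".join(lines)


text = serialize_series(
    [0, 60, 120, 180],
    {"latency_ms": [10.0, 12.5, 80.25, 79.0]},
    stride=2,
)
print(text)
```

With up to 2283 variates and 40k time steps, a full-resolution dump of this kind grows far beyond typical context windows, which is one reason text input is a hard setting for LLMs on this benchmark.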

We compared the models to two human baselines: observability experts, and time series researchers without extensive observability experience. The human experts were evaluated on a randomly sampled 25% subset of ARFBench.

Figure 4: Overall accuracy and F1 of various baselines and foundation models on ARFBench. Models are sorted by decreasing accuracy. Toto-1.0-QA-Experimental achieves the top accuracy on ARFBench and an F1 comparable to top frontier models.

Among existing models, GPT-5 (VLM) yielded the top performance at 62.7% accuracy and 51.9% F1 (Figure 4). This is much higher than the random choice baseline at 22.5%, but still underperforms domain experts and is far below a model-expert oracle at 87.2% accuracy / 82.8% F1 (see below for further discussion). As expected, model performance tends to worsen as the tier difficulty increases.

We also observe several trends in our evaluations on ARFBench. Corroborating previous work in time series classification and QA such as Daswani et al. 2024, we find that VLMs outperform LLMs. There is also a substantial performance gap between the top proprietary models and open-source models. However, we find that some open-source models perform better than many older proprietary models or models from the Claude family.

Hybrid TSFM-VLM models show promise for specialized TSQA modeling

Figure 5: Architecture diagram of the Toto-1.0-QA-Experimental (Toto-Qwen3-VL) model. Frozen weights are denoted with a snowflake, while trainable weights are marked with a flame. With a small number of trainable parameters, we can align TSFMs and VLMs and yield novel abilities.

Though VLMs yielded the highest accuracy and F1 score among existing models, we found that plotting and input representation were a challenge for both VLMs and LLMs. For example, due to the high number of variates, we often could not plot the time series without reusing colors across variates or occluding some of them. This motivated a native time series approach alongside the VLM, in which we could use time series, plots, and text as joint input.

To test this, we trained a hybrid model (Figure 5) by combining Toto, a state-of-the-art observability TSFM, with Qwen3-VL 32B, a leading open-source VLM. We used both synthetic (Figure 6) and real multimodal data in a multi-stage post-training pipeline incorporating both supervised fine-tuning (SFT) and reinforcement learning (RL). The resulting model, Toto-1.0-QA-Experimental, yielded the top accuracy score of 63.9% and an F1 comparable to top frontier models (48.9%). In the Anomaly Identification task category, where a model selects anomalous variates in the time series, Toto-1.0-QA-Experimental outperforms all other models by at least 8.8 percentage points in F1 and achieves the best per-category accuracy, suggesting that TSFM-VLM modeling can substantially improve performance on particular tasks. Furthermore, Toto-1.0-QA-Experimental’s parameter count is several orders of magnitude lower than that of frontier models, providing potential efficiency gains at inference time.

Figure 6: Synthetic data generation flow for post-training hybrid TSFM-VLM and TSFM-LLM models. Time series are generated by first sampling different lengths and scales and second by sampling each datapoint from a normal distribution. To add variation, we add seasonality and drift components into the time series, yielding different base time series (top right). For each base time series, we apply question templates and inject different anomalies (e.g. level shift, change in seasonality) at various points of the time series (bottom right). Finally, we generate time series captions and reasoning for the question-answer pair using a VLM.
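The first steps of the flow in Figure 6 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' generator: the candidate lengths and scales, the sinusoidal seasonality, the linear drift, the fixed period, and the single question template are all invented here, and the final VLM captioning and reasoning step is omitted.

```python
import math
import random


def base_series(length, scale, period, drift, rng):
    """Gaussian noise plus a sinusoidal seasonal term and a linear drift."""
    return [
        rng.gauss(0.0, scale)
        + scale * math.sin(2 * math.pi * t / period)
        + drift * t
        for t in range(length)
    ]


def inject_level_shift(series, start, magnitude):
    """Return a copy with every point from index start onward shifted by magnitude."""
    return [v + magnitude if t >= start else v for t, v in enumerate(series)]


def make_example(seed):
    """Build one hypothetical question-answer pair with an injected level shift."""
    rng = random.Random(seed)
    length = rng.choice([128, 256, 512])   # assumed candidate lengths
    scale = rng.choice([0.1, 1.0, 10.0])   # assumed candidate scales
    series = base_series(length, scale, period=24, drift=0.01 * scale, rng=rng)
    start = rng.randrange(length // 4, 3 * length // 4)
    shifted = inject_level_shift(series, start, magnitude=5 * scale)
    return {
        "series": shifted,
        "question": "Is there a level shift in this time series?",
        "answer": "yes",
        "anomaly_start": start,
    }


example = make_example(seed=0)
```

Seeding a local `random.Random` keeps each example reproducible, which makes it easier to regenerate a dataset or trace an individual question back to its parameters.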

We refer interested readers to the paper for more experimental details, error analysis, and case studies.

Domain experts complemented with models set a new superhuman frontier

The current aggregate gap on ARFBench between the best models (Toto-1.0-QA-Experimental and GPT-5) and the two human domain experts is only 8.8 percentage points in accuracy and 12.7 percentage points in F1. However, at the individual question level, we observe noticeably different behavior between GPT-5 and the human experts. GPT-5 correctly answers 48% of the questions that both experts get wrong; on these questions, the human experts tend to make errors in instruction-following or fine-grained perception. Meanwhile, at least one expert correctly answers 79% of the questions that GPT-5 gets wrong. On these questions, model errors tend to involve hallucination and incorrect domain knowledge. We provide examples of both groups of errors in the paper.
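Both percentages above are conditional rates over per-question correctness. The sketch below shows how such rates could be computed; the correctness flags are made-up toy data, and `conditional_rate` is a hypothetical helper rather than code from the benchmark.

```python
def conditional_rate(condition, outcome):
    """Fraction of questions satisfying `condition` for which `outcome` also holds."""
    selected = [o for c, o in zip(condition, outcome) if c]
    return sum(selected) / len(selected) if selected else float("nan")


# Hypothetical per-question correctness flags (True = correct).
model = [True, False, True, False, False, False]
expert_a = [False, True, True, False, False, True]
expert_b = [False, False, True, True, False, False]

both_experts_wrong = [not a and not b for a, b in zip(expert_a, expert_b)]
any_expert_right = [a or b for a, b in zip(expert_a, expert_b)]
model_wrong = [not m for m in model]

# Share of questions both experts miss that the model gets right.
model_rescues = conditional_rate(both_experts_wrong, model)
# Share of questions the model misses that at least one expert gets right.
expert_rescues = conditional_rate(model_wrong, any_expert_right)
```

Note that the two rates use different denominators, so they need not sum to anything meaningful; each describes how often one side covers the other's mistakes.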

Due to the large difference in error distribution, we hypothesize that when experts are complemented with models, their joint capability becomes much higher than any single expert or model alone. To establish this, we compute a model-expert oracle, a best-of-2 metric where an oracle perfectly chooses the best answer between the model and the expert, which yields 87.2% accuracy and 82.8% F1 on our data. This is far above existing model capabilities and sets a new superhuman frontier for LLMs, VLMs, and TSFMs to achieve.
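As a minimal sketch of the best-of-2 idea, the accuracy of such an oracle can be computed from per-question correctness flags: a question counts as solved if either the model or the expert answered it correctly. The flags below are toy data, and how the paper pairs the model with the two experts and computes the oracle's F1 is not reproduced here.

```python
def oracle_accuracy(model_correct, expert_correct):
    """Best-of-2 accuracy: a question counts as correct if either source is correct."""
    if len(model_correct) != len(expert_correct):
        raise ValueError("inputs must cover the same questions")
    hits = [m or e for m, e in zip(model_correct, expert_correct)]
    return sum(hits) / len(hits)


# Hypothetical per-question correctness flags (True = correct).
model_correct = [True, False, True, False, False]
expert_correct = [False, True, True, False, True]

model_acc = sum(model_correct) / len(model_correct)
expert_acc = sum(expert_correct) / len(expert_correct)
oracle_acc = oracle_accuracy(model_correct, expert_correct)
```

The oracle is an upper bound on what a model-expert team could reach, since it assumes a perfect selector; its gain over each individual is largest when their errors overlap least.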

What’s next: time series reasoning as a core component of agents

In the broader scope of incident response, ARFBench only contains questions targeting diagnosis and reasoning. However, we envision that strong diagnosis and reasoning abilities will play a large part in end-to-end agentic systems (e.g., SRE or incident response agents) that require time series reasoning as a subroutine in understanding the incident. While ARFBench can be used to evaluate time series agents, it is not currently a multi-turn benchmark. However, we believe that future agents and models that perform well on the single-turn ARFBench will ultimately perform better on end-to-end tasks.

Getting started with ARFBench

If you are interested in testing your model on ARFBench, you can find the benchmark, leaderboard, and model weights on Hugging Face, and the code on GitHub. To learn more, read our paper.
