Train a hypernetwork to convert video context into dynamic LoRA weights, enabling zero-visual-token video QA and captioning.
†University of Maryland, College Park
manans@umd.edu baskarsarvesh@gmail.com
Figure 1: VIDEO2LORA overview. Training (left): A frozen VLM encodes the input video into hidden states. The trainable VIDEO2LORA hypernetwork reads these states and generates LoRA adapter weights in a single forward pass. The adapter-augmented frozen VLM is trained against teacher-generated targets. Inference (right): Given a new video, VIDEO2LORA generates the LoRA adapter once. The frozen VLM, augmented with this adapter, answers arbitrary text queries without visual tokens. Per-query cost is independent of video length.
Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce VIDEO2LORA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, VIDEO2LORA predicts these weights directly from the video.
Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, VIDEO2LORA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. VIDEO2LORA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500× and query TTFT by 6–80×, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.
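One way to read "compose in rank space" is concatenation of LoRA factors along the rank dimension, which is algebraically the same as summing the per-segment weight updates. The NumPy sketch below illustrates only that identity. Whether the adapters are composed exactly this way, and with what scaling, is an assumption; all sizes and names are hypothetical.

```python
import numpy as np

def compose_in_rank_space(adapters):
    """Concatenate LoRA factors (B_k, A_k) along the rank axis.

    Each B_k has shape (d_out, r_k) and each A_k has shape (r_k, d_in).
    Assumption: composition means rank concatenation; the source does not
    spell out the exact operation.
    """
    B = np.concatenate([B_k for B_k, _ in adapters], axis=1)  # (d_out, sum r_k)
    A = np.concatenate([A_k for _, A_k in adapters], axis=0)  # (sum r_k, d_in)
    return B, A

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2  # hypothetical sizes
segment_adapters = [
    (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in)))
    for _ in range(3)  # three non-overlapping video segments
]

B, A = compose_in_rank_space(segment_adapters)
delta_composed = B @ A                                    # update from the composed adapter
delta_summed = sum(B_k @ A_k for B_k, A_k in segment_adapters)  # sum of segment updates
```

Under this reading, the composed adapter has rank at most the sum of the segment ranks, and its weight update matches the sum of the individual updates.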
Let v denote a video, i an internalization instruction, p a downstream text prompt, and y the target response. We assume a frozen vision-language encoder E, a frozen answer model F, and a trainable hypernetwork Hφ with parameters φ. The hypernetwork maps the hidden states that E produces for (v, i) to a video-specific LoRA adapter θ(v). The answer model receives the text prompt p and the adapter θ(v), but not the video tokens. During training, only φ is updated; both E and F remain frozen.
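The setup above can be written compactly as follows. The symbol h for the encoder's hidden states and the conditional notation for F are notational assumptions added for readability; they are not taken from the source.

```latex
\begin{aligned}
h &= E(v, i) && \text{(frozen encoder: layer-wise hidden states)} \\
\theta(v) &= H_\phi(h) && \text{(generated LoRA adapter, one forward pass)} \\
\hat{y} &\sim F\bigl(\,\cdot \mid p;\ \theta(v)\bigr) && \text{(frozen answer model, no video tokens)}
\end{aligned}
```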
We use a frozen SmolVLM2 model as the video encoder. Given a sampled video and the internalization instruction, we collect the text-side hidden states from each transformer layer. Keeping the layer dimension allows the hypernetwork to generate layer-indexed adapters instead of using a single pooled video vector for all layers.
The hypernetwork maps the video-conditioned hidden states into LoRA weights for selected linear modules of the frozen answer model. We use a Perceiver-style resampler architecture. A shared projection head maps each rank latent to the two LoRA factors, which are scaled by learned multipliers initialized so that the generated adapter starts as a zero perturbation of the frozen model.
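As a rough illustration of this design, the NumPy sketch below generates LoRA factors for one layer and one target module: a set of rank latents cross-attends to that layer's hidden states, and a shared head projects each latent to one row of A and one column of B. The single attention step, the dimensions, the parameter names, and the choice to zero the multiplier on B are all assumptions made for the sketch, not details confirmed by the source.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(latents, hidden, W_q, W_k, W_v):
    """One cross-attention step: rank latents (queries) read hidden states."""
    q = latents @ W_q            # (r, d_lat)
    k = hidden @ W_k             # (T, d_lat)
    v = hidden @ W_v             # (T, d_lat)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (r, T)
    return latents + attn @ v    # residual update, (r, d_lat)

def generate_lora(hidden, params):
    """Map one layer's hidden states (T, d_model) to LoRA factors (A, B)."""
    z = resample(params["latents"], hidden,
                 params["W_q"], params["W_k"], params["W_v"])
    # Shared head: every rank latent goes through the same projections.
    A = params["scale_A"] * (z @ params["head_A"])       # (r, d_in)
    B = params["scale_B"] * (z @ params["head_B"]).T     # (d_out, r)
    return A, B

rng = np.random.default_rng(0)
T, d_model, d_lat, r, d_in, d_out = 10, 16, 8, 4, 16, 16  # hypothetical sizes
params = {
    "latents": rng.normal(size=(r, d_lat)),
    "W_q": rng.normal(size=(d_lat, d_lat)) * 0.1,
    "W_k": rng.normal(size=(d_model, d_lat)) * 0.1,
    "W_v": rng.normal(size=(d_model, d_lat)) * 0.1,
    "head_A": rng.normal(size=(d_lat, d_in)) * 0.1,
    "head_B": rng.normal(size=(d_lat, d_out)) * 0.1,
    "scale_A": 1.0,
    "scale_B": 0.0,  # assumed zero init so that B @ A starts at zero
}

hidden = rng.normal(size=(T, d_model))  # stand-in for one layer's hidden states
A, B = generate_lora(hidden, params)
```

With the multiplier on B at zero, the product B @ A is exactly zero at initialization, so the adapted model starts out identical to the frozen one.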
For a frozen linear layer, we use the standard LoRA factorization. Each example receives its own generated adapter, so the LoRA weights are conditioned on the input video rather than shared across all videos.
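In the usual LoRA formulation, a frozen weight W is applied together with a low-rank update, so the layer computes W x + (α/r) B A x. The NumPy sketch below shows that form with a different adapter per video. The α/r scaling follows the common LoRA convention and may differ from the learned multipliers used here; the sizes and variable names are hypothetical.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=1.0):
    """Frozen linear layer plus a low-rank update: x W^T + (alpha/r) x A^T B^T."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2  # hypothetical sizes
W = rng.normal(size=(d_out, d_in))       # frozen weight, shared by all videos
x = rng.normal(size=(3, d_in))           # three token activations

# One adapter per video: same W, different (A, B).
adapter_video_1 = (rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r)))
adapter_video_2 = (rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r)))

y1 = lora_linear(x, W, *adapter_video_1)
y2 = lora_linear(x, W, *adapter_video_2)

# The low-rank path equals applying the merged weight W + (alpha/r) B A.
A, B = adapter_video_1
y1_merged = x @ (W + (1.0 / r) * B @ A).T
```

The same input and frozen weight give different outputs under the two adapters, which is the sense in which the weights are conditioned on the video.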
We train the hypernetwork with teacher-forced cross-entropy over response tokens. The answer model receives only the downstream text prompt and the generated adapter during this loss computation, ensuring zero visual tokens in the context window.
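A minimal sketch of a cross-entropy restricted to response tokens is given below in plain Python. It assumes per-position logits are already available from the adapter-augmented answer model; the function name, the mask convention, and the toy values are hypothetical.

```python
import math

def response_cross_entropy(logits, targets, is_response):
    """Mean negative log-likelihood over response positions only.

    logits:      list of per-position score lists (one score per vocab item)
    targets:     list of gold next-token ids (teacher forcing)
    is_response: list of bools; prompt positions are False and are skipped
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, is_response):
        if not keep:
            continue
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / max(count, 1)

# Toy example: vocabulary of 3 tokens, two prompt positions, two response positions.
logits = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
targets = [0, 0, 2, 1]
is_response = [False, False, True, True]
loss = response_cross_entropy(logits, targets, is_response)
```

Because prompt positions are masked out, only the response tokens contribute to the loss and hence to the gradient that reaches the hypernetwork.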
Video2LoRA is statistically equivalent and non-inferior to direct video-in-context inference on all five captioning benchmarks and on seven of eight video QA benchmark-scale pairings, despite placing zero visual tokens in the VLM's context at query time.
Scaling behavior on VDC background captioning across frame counts (up to 1,024 frames) and spatial resolutions (up to 1,024px). Despite zero visual tokens, Video2LoRA maintains quality while reducing answer-time visual-token load by up to 1,500× and query TTFT by 6–80×. Panels: (a) change in mean Token-F1 (V2L − Base); (b) query-time TTFT speedup of Video2LoRA; (c) input-token reduction achieved during answering.
Inference efficiency comparison between the direct video-in-context baseline and Video2LoRA on VidCapBench, showing speedup and amortization profiles. Panels: (a) single-question average TTFT (including video internalization); (b) amortized TTFT per question vs. number of questions per video.
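The amortization idea can be sketched with simple arithmetic: the adapter is generated once per video, so its cost is spread over all questions asked about that video. The timings below are hypothetical placeholders, not measurements from the paper, and the helper names are illustrative.

```python
def amortized_ttft(internalize_s, query_s, num_questions):
    """Average time-to-first-token per question when one adapter is reused."""
    if num_questions < 1:
        raise ValueError("num_questions must be at least 1")
    return (internalize_s + num_questions * query_s) / num_questions

def break_even_questions(internalize_s, query_s, baseline_s):
    """Smallest question count at which adapter reuse beats the baseline, or None."""
    if query_s >= baseline_s:
        return None  # per-query cost alone is already no better
    n = 1
    while amortized_ttft(internalize_s, query_s, n) >= baseline_s:
        n += 1
    return n

# Hypothetical timings in seconds, for illustration only.
internalize_s, query_s, baseline_s = 2.0, 0.05, 1.0
per_question = [amortized_ttft(internalize_s, query_s, n) for n in (1, 2, 10)]
n_star = break_even_questions(internalize_s, query_s, baseline_s)
```

With these made-up numbers, a single question is slower than the baseline because it pays the full internalization cost, while the amortized cost falls toward the per-query cost as more questions are asked.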
Comparison of the base model with the video in context against VIDEO2LORA-generated adapters on captioning benchmarks, using LLM Judge scores and Token F1. We report mean scores, the paired difference Δ (V2L − Base), its 95% confidence interval, and whether the statistical equivalence (Eq) and non-inferiority (NI) criteria are met (Y = met, – = not met).
| Benchmark | 500M Base | 500M V2L | 500M Δ | 500M 95% CI | 500M Eq | 500M NI | 2.2B Base | 2.2B V2L | 2.2B Δ | 2.2B 95% CI | 2.2B Eq | 2.2B NI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ActivityNet Captions | 0.428 | 0.356 | -0.072 | [-0.104, -0.041] | Y | Y | 0.576 | 0.492 | -0.084 | [-0.113, -0.057] | Y | Y |
| PLM-RDCap | 0.308 | 0.263 | -0.045 | [-0.069, -0.021] | Y | Y | 0.326 | 0.316 | -0.010 | [-0.032, +0.012] | Y | Y |
| PLM-RCap | 0.252 | 0.242 | -0.011 | [-0.031, +0.009] | Y | Y | 0.270 | 0.287 | +0.017 | [+0.001, +0.034] | Y | Y |
| VDC (aggregate) | 0.515 | 0.406 | -0.108 | [-0.118, -0.098] | Y | Y | 0.539 | 0.511 | -0.028 | [-0.037, -0.019] | Y | Y |
| CaReBench | 0.334 | 0.278 | -0.056 | [-0.067, -0.045] | Y | Y | 0.437 | 0.369 | -0.068 | [-0.078, -0.058] | Y | Y |
| Average | 0.367 | 0.309 | -0.058 | [-0.078, -0.039] | Y | Y | 0.430 | 0.395 | -0.035 | [-0.052, -0.018] | Y | Y |
| Benchmark | 500M Base | 500M V2L | 500M Δ | 500M 95% CI | 500M Eq | 500M NI | 2.2B Base | 2.2B V2L | 2.2B Δ | 2.2B 95% CI | 2.2B Eq | 2.2B NI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ActivityNet Captions | 0.236 | 0.243 | +0.007 | [+0.002, +0.012] | Y | Y | 0.263 | 0.256 | -0.007 | [-0.012, -0.002] | Y | Y |
| PLM-RDCap | 0.189 | 0.198 | +0.009 | [+0.005, +0.013] | Y | Y | 0.198 | 0.207 | +0.009 | [+0.005, +0.013] | Y | Y |
| PLM-RCap | 0.177 | 0.203 | +0.026 | [+0.021, +0.031] | Y | Y | 0.199 | 0.204 | +0.005 | [+0.001, +0.010] | Y | Y |
| VDC (aggregate) | 0.315 | 0.288 | -0.027 | [-0.030, -0.025] | Y | Y | 0.297 | 0.304 | +0.007 | [+0.003, +0.010] | Y | Y |
| CaReBench | 0.295 | 0.275 | -0.020 | [-0.023, -0.017] | Y | Y | 0.292 | 0.279 | -0.013 | [-0.015, -0.010] | Y | Y |
| Average | 0.243 | 0.242 | -0.001 | [-0.005, +0.003] | Y | Y | 0.250 | 0.250 | 0.000 | [-0.004, +0.004] | Y | Y |
Comparison of the base model with the video in context against VIDEO2LORA-generated adapters on video question answering benchmarks, using LLM Judge scores and Token F1. We report mean scores, the paired difference Δ (V2L − Base), its 95% confidence interval, and whether the statistical equivalence (Eq) and non-inferiority (NI) criteria are met (Y = met, – = not met).
| Benchmark | 500M Base | 500M V2L | 500M Δ | 500M 95% CI | 500M Eq | 500M NI | 2.2B Base | 2.2B V2L | 2.2B Δ | 2.2B 95% CI | 2.2B Eq | 2.2B NI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NExT-QA (open) | 0.501 | 0.547 | +0.046 | [+0.007, +0.084] | Y | Y | 0.597 | 0.610 | +0.013 | [-0.022, +0.048] | Y | Y |
| ActivityNet-QA | 0.524 | 0.541 | +0.016 | [-0.031, +0.064] | Y | Y | 0.627 | 0.531 | -0.096 | [-0.144, -0.049] | Y | Y |
| PLM-SGQA | 0.390 | 0.317 | -0.074 | [-0.113, -0.034] | Y | Y | 0.493 | 0.295 | -0.198 | [-0.236, -0.161] | – | – |
| VidCapBench | 0.502 | 0.451 | -0.050 | [-0.071, -0.030] | Y | Y | 0.551 | 0.475 | -0.076 | [-0.096, -0.055] | Y | Y |
| Average | 0.487 | 0.460 | -0.027 | [-0.043, -0.011] | Y | Y | 0.562 | 0.477 | -0.085 | [-0.101, -0.069] | Y | Y |
| Benchmark | 500M Base | 500M V2L | 500M Δ | 500M 95% CI | 500M Eq | 500M NI | 2.2B Base | 2.2B V2L | 2.2B Δ | 2.2B 95% CI | 2.2B Eq | 2.2B NI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NExT-QA (open) | 0.129 | 0.068 | -0.061 | [-0.076, -0.046] | – | – | 0.140 | 0.076 | -0.063 | [-0.079, -0.048] | – | – |
| ActivityNet-QA | 0.197 | 0.023 | -0.174 | [-0.199, -0.149] | – | – | 0.149 | 0.013 | -0.136 | [-0.156, -0.117] | – | – |
| PLM-SGQA | 0.081 | 0.225 | +0.145 | [+0.131, +0.158] | – | Y | 0.092 | 0.203 | +0.111 | [+0.098, +0.124] | – | Y |
| VidCapBench | 0.216 | 0.209 | -0.007 | [-0.019, +0.004] | Y | Y | 0.196 | 0.218 | +0.022 | [+0.010, +0.033] | Y | Y |
| Average | 0.156 | 0.131 | -0.024 | [-0.041, -0.008] | Y | Y | 0.144 | 0.128 | -0.017 | [-0.032, -0.002] | Y | Y |
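The Eq and NI columns in the tables above can be read through a confidence-interval containment rule: a paired difference is non-inferior if its interval lies entirely above a negative margin, and equivalent if the interval lies entirely within the symmetric margin. The sketch below implements that generic rule. The source does not state the margins or the exact test, so the margins used here are hypothetical values chosen only for illustration.

```python
def equivalence_flags(ci_low, ci_high, margin):
    """Classify a paired difference (V2L - Base) from its confidence interval.

    Non-inferior: the whole interval lies above -margin.
    Equivalent:   the whole interval lies inside (-margin, +margin).
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    non_inferior = ci_low > -margin
    equivalent = non_inferior and ci_high < margin
    return equivalent, non_inferior

# Hypothetical margins, chosen only for illustration; not the paper's values.
MARGIN_A = 0.15
MARGIN_B = 0.04

# 95% CIs copied from the first QA table above (SmolVLM 2.2B columns).
activitynet_qa = equivalence_flags(-0.144, -0.049, MARGIN_A)
plm_sgqa = equivalence_flags(-0.236, -0.161, MARGIN_A)

# 95% CI copied from the second QA table above (PLM-SGQA, SmolVLM 500M columns).
plm_sgqa_gain = equivalence_flags(0.131, 0.158, MARGIN_B)
```

The last case shows how a large improvement can be non-inferior without being equivalent: the interval clears the lower margin but exceeds the upper one.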
Fine-grained evaluation on the Video Description Corpus (VDC) across various captioning styles, comparing the performance of base models and Video2LoRA generated adapters.
| Subset / Caption Style | 500M Base | 500M V2L (Δ) | 2.2B Base | 2.2B V2L (Δ) |
|---|---|---|---|---|
| Short caption | 0.629 | 0.535 (-0.094) | 0.556 | 0.579 (+0.022) |
| Detailed caption | 0.476 | 0.401 (-0.074) | 0.526 | 0.463 (-0.063) |
| Camera | 0.310 | 0.131 (-0.178) | 0.478 | 0.392 (-0.085) |
| Background | 0.642 | 0.523 (-0.117) | 0.588 | 0.606 (+0.018) |
| Main object | 0.517 | 0.442 (-0.075) | 0.546 | 0.514 (-0.032) |
Fine-grained evaluation across CaReBench subsets (semantic dimensions), comparing base models and Video2LoRA generated adapters.
| Subset / Semantic Dimension | 500M Base | 500M V2L (Δ) | 2.2B Base | 2.2B V2L (Δ) |
|---|---|---|---|---|
| Caption | 0.418 | 0.324 (-0.094) | 0.465 | 0.400 (-0.065) |
| Events | 0.201 | 0.169 (-0.032) | 0.340 | 0.267 (-0.073) |
| Objects | 0.368 | 0.327 (-0.043) | 0.457 | 0.392 (-0.065) |
| Spatial caption | 0.424 | 0.329 (-0.095) | 0.519 | 0.426 (-0.094) |
| Temporal caption | 0.260 | 0.242 (-0.018) | 0.404 | 0.360 (-0.045) |
If you find Video2LoRA useful for your research, please cite our submission:
@misc{suri2026video2loraparametricvideointernalization,
title={Video2LoRA: Parametric Video Internalization for Vision-Language Models},
author={Manan Suri and Sarvesh Baskar and Dinesh Manocha},
year={2026},
eprint={2606.04351},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.04351},
}