Video2LoRA: Parametric Video Internalization
for Vision-Language Models

Train a hypernetwork to convert video context into dynamic LoRA weights, enabling zero-visual-token video QA and captioning.

University of Maryland, College Park

manans@umd.edu      baskarsarvesh@gmail.com


Figure 1: VIDEO2LORA overview. Training (left): A frozen VLM encodes the input video into hidden states. The trainable VIDEO2LORA hypernetwork reads these states and generates LoRA adapter weights in a single forward pass. The adapter-augmented frozen VLM is trained against teacher-generated targets. Inference (right): Given a new video, VIDEO2LORA generates the LoRA adapter once. The frozen VLM, augmented with this adapter, answers arbitrary text queries without visual tokens. Per-query cost is independent of video length.

Abstract

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce VIDEO2LORA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, VIDEO2LORA predicts these weights directly from the video.

Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, VIDEO2LORA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. VIDEO2LORA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500× and query TTFT by 6–80×, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.
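One way to read "compose in rank space" is concatenation of LoRA factors along the rank axis, which is algebraically the same as summing the individual low-rank updates. The sketch below illustrates only that identity. The function name `compose_in_rank_space`, the shapes, and the random factors are illustrative assumptions, not the authors' composition procedure.

```python
import numpy as np

def compose_in_rank_space(adapters):
    """Concatenate LoRA factors along the rank axis.

    adapters: list of (A, B) pairs, A of shape (r_i, d_in), B of shape (d_out, r_i).
    Returns (A_cat, B_cat) of total rank sum(r_i), where
    B_cat @ A_cat == sum_i B_i @ A_i.
    """
    A_cat = np.concatenate([A for A, _ in adapters], axis=0)
    B_cat = np.concatenate([B for _, B in adapters], axis=1)
    return A_cat, B_cat

# Hypothetical adapters for three non-overlapping video segments.
rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 12, 4
segments = [
    (rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank)))
    for _ in range(3)
]

A_cat, B_cat = compose_in_rank_space(segments)
delta_sum = sum(B @ A for A, B in segments)  # sum of per-segment updates
```

Under this reading, the composed adapter has rank equal to the sum of the segment ranks, so storage and per-query cost grow with the number of chunks rather than with the number of frames.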

1,500× Token Reduction: Answer-time visual-token load drops by up to 1,500×, with zero visual tokens in the VLM prompt context.
6–80× TTFT Speedup: Query time-to-first-token is 6–80× lower than with direct video-in-context inference.
0 Visual Tokens: Query-time inference operates on text only; video information is carried by the generated adapter.
1,024px Robust Scaling: Extrapolates stably up to 1,024 frames and 1,024px resolution.

Methodology

1. Problem Formulation

Let v denote a video, i an internalization instruction, p a downstream text prompt, and y the target response. We assume a frozen vision-language encoder E, a frozen answer model F, and a trainable hypernetwork Hφ with parameters φ. The hypernetwork produces an adapter θ(v) from E's encoding of v and i. The answer model receives the text prompt p and the adapter θ(v), but no video tokens. During training, only φ is updated; both E and F remain frozen.
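In symbols, the setup above can be written as follows. This is a reading of the prose rather than the paper's own notation; the composition order H_φ(E(v, i)) and the use of ŷ for the generated response are assumptions.

```latex
\theta(v) = H_\phi\bigl(E(v, i)\bigr), \qquad
\hat{y} \sim F\bigl(\,\cdot \mid p;\ \theta(v)\bigr)
```

The video enters F only through θ(v); the conditioning set of F contains the text prompt p and no visual tokens.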

2. Video Encoder

We use a frozen SmolVLM2 model as the video encoder. Given a sampled video and the internalization instruction, we collect the text-side hidden states from each transformer layer. Keeping the layer dimension allows the hypernetwork to generate layer-indexed adapters instead of using a single pooled video vector for all layers.
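The layer-preserving collection step can be sketched with plain arrays. This is a toy illustration, not the SmolVLM2 API: the helper `text_side_states`, the boolean `is_text_token` mask, and all shapes are hypothetical stand-ins for whatever per-layer outputs the encoder exposes.

```python
import numpy as np

def text_side_states(hidden_states, is_text_token):
    """Keep text-position hidden states from every layer.

    hidden_states: list of per-layer arrays, each (seq_len, d_model).
    is_text_token: boolean array (seq_len,), True at text positions.
    Returns an array of shape (num_layers, num_text_tokens, d_model).
    """
    stacked = np.stack(hidden_states, axis=0)   # (num_layers, seq_len, d_model)
    return stacked[:, is_text_token, :]          # layer axis is preserved

# Toy sequence: 7 visual tokens followed by 3 instruction (text) tokens.
rng = np.random.default_rng(0)
num_layers, seq_len, d_model = 4, 10, 8
per_layer = [rng.normal(size=(seq_len, d_model)) for _ in range(num_layers)]
is_text = np.array([False] * 7 + [True] * 3)

states = text_side_states(per_layer, is_text)
```

Because the first axis still indexes layers, a downstream hypernetwork can emit a different adapter for each layer instead of broadcasting one pooled vector.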

3. Perceiver Hypernetwork

The hypernetwork maps the video-conditioned hidden states to LoRA weights for selected linear modules of the frozen answer model. It uses a Perceiver-style resampler: a set of rank latents attends to the hidden states, and a shared projection head maps each rank latent to the two LoRA factors. The factors are scaled by learned multipliers, initialized so that the generated adapter starts as a zero perturbation of the frozen model.
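A minimal, forward-only sketch of this idea follows. It uses a single cross-attention block, scalar multipliers, and one target module, all of which are simplifying assumptions; the class name `TinyResamplerHypernet`, its dimensions, and the choice to zero only the B-side multiplier are hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TinyResamplerHypernet:
    """Toy resampler: one latent per LoRA rank cross-attends to hidden states."""

    def __init__(self, d_model, d_latent, rank, d_in, d_out, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(scale=0.02, size=shape)
        self.latents = init(rank, d_latent)   # one learned latent per rank
        self.w_q = init(d_latent, d_latent)
        self.w_k = init(d_model, d_latent)
        self.w_v = init(d_model, d_latent)
        self.head_a = init(d_latent, d_in)    # shared head: latent -> row of A
        self.head_b = init(d_latent, d_out)   # shared head: latent -> column of B
        self.scale_a = 1.0                    # learned multipliers (scalars here)
        self.scale_b = 0.0                    # zero at init, so B @ A = 0

    def __call__(self, layer_states):
        """layer_states: (num_tokens, d_model) for one layer -> (A, B)."""
        q = self.latents @ self.w_q
        k = layer_states @ self.w_k
        v = layer_states @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (rank, num_tokens)
        z = self.latents + attn @ v                      # residual latent update
        A = self.scale_a * (z @ self.head_a)             # (rank, d_in)
        B = self.scale_b * (z @ self.head_b).T           # (d_out, rank)
        return A, B

# Hypothetical per-layer text-side states: (layers, tokens, d_model).
states = np.random.default_rng(1).normal(size=(4, 3, 8))
hyper = TinyResamplerHypernet(d_model=8, d_latent=16, rank=2, d_in=8, d_out=8)
adapters = [hyper(layer) for layer in states]  # one (A, B) pair per layer
```

With `scale_b` at zero, every generated update B @ A is exactly zero, so the adapter-augmented model initially behaves like the frozen model; training would then move the multipliers away from zero.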

4. Dynamic LoRA Injection

For a frozen linear layer with weight W, we use the standard LoRA factorization, which adds a low-rank update BA to W. Each example receives its own generated adapter, so the LoRA weights are conditioned on the input video rather than shared across all videos.
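Per-example injection can be sketched as a batched low-rank update on top of a shared frozen weight. The function `lora_linear` and its shapes are illustrative; the α/r scaling follows the original LoRA formulation, and whether Video2LoRA uses the same scaling is an assumption.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=1.0):
    """Frozen linear layer with a different LoRA adapter per batch element.

    x: (batch, seq, d_in)   inputs
    W: (d_out, d_in)        frozen weight, shared by all examples
    A: (batch, r, d_in)     per-example down-projection
    B: (batch, d_out, r)    per-example up-projection
    Returns (batch, seq, d_out): x W^T + (alpha / r) * x A^T B^T.
    """
    r = A.shape[1]
    base = x @ W.T
    low = np.einsum("bsi,bri->bsr", x, A)       # project into rank space
    update = np.einsum("bsr,bor->bso", low, B)  # project back to d_out
    return base + (alpha / r) * update

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))
W = rng.normal(size=(6, 8))
A = rng.normal(size=(2, 4, 8))   # two videos, two different adapters
B = rng.normal(size=(2, 6, 4))
y = lora_linear(x, W, A, B, alpha=8.0)
```

Keeping the update in factored form avoids materializing a separate merged weight per video, which matters when a batch mixes adapters from different videos.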

5. Training Objective

We train the hypernetwork with teacher-forced cross-entropy over response tokens. The answer model receives only the downstream text prompt and the generated adapter during this loss computation, ensuring zero visual tokens in the context window.
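The loss restricted to response tokens can be sketched as a masked cross-entropy. The function `response_cross_entropy`, the mean reduction over response tokens, and the toy shapes are assumptions; the logits here stand in for the output of the adapter-augmented answer model on the text prompt and the teacher-forced response.

```python
import numpy as np

def response_cross_entropy(logits, targets, response_mask):
    """Mean negative log-likelihood over response positions only.

    logits:        (seq_len, vocab) next-token logits
    targets:       (seq_len,) next-token ids
    response_mask: (seq_len,) bool, True where the target is a response token
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float(token_nll[response_mask].mean())

# Toy sequence: 3 prompt positions followed by 4 response positions.
vocab = 11
logits = np.zeros((7, vocab))                 # uniform predictions
targets = np.array([1, 2, 3, 4, 5, 6, 7])
mask = np.array([False] * 3 + [True] * 4)
loss = response_cross_entropy(logits, targets, mask)
```

Prompt positions contribute nothing to the loss, so gradients reach the hypernetwork only through how well the generated adapter supports the response.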

Benchmark Performance

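By LLM judge score, Video2LoRA is statistically equivalent and non-inferior to direct video-in-context inference on all five captioning benchmarks at both model scales and on seven of eight video question answering benchmark-scale pairings, despite placing zero visual tokens in the VLM's context at query time.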

Scaling & Quality Trade-offs

Scaling behavior on VDC background captioning across frame counts (up to 1,024 frames) and spatial resolutions (up to 1,024px). With zero visual tokens at answer time, Video2LoRA maintains quality while reducing answer-time visual-token load by up to 1,500× and query TTFT by 6–80×.

(a) Change in mean Token-F1 (V2L − Base)
(b) Query-time TTFT speedup of Video2LoRA
(c) Input-token reduction achieved during answering

Inference Efficiency on VidCapBench

Inference efficiency comparison between the direct video-in-context baseline and Video2LoRA on VidCapBench, showing speedup and amortization profiles.

(a) Single-question average TTFT (including video internalization)
(b) Amortized TTFT per question vs. number of questions per video
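The amortization in panel (b) follows from simple arithmetic: the one-time internalization cost is spread over all questions asked about the same video. The sketch below assumes the baseline pays its full video-in-context TTFT on every question (no caching); the function names and all timings are hypothetical and are not measurements from the paper.

```python
import math

def amortized_ttft(internalize_s, query_ttft_s, num_questions):
    """Average TTFT per question when one adapter serves num_questions queries."""
    return internalize_s / num_questions + query_ttft_s

def break_even_questions(internalize_s, query_ttft_s, baseline_ttft_s):
    """Smallest number of questions at which amortized TTFT <= baseline TTFT.

    Returns None if the per-query TTFT alone is not below the baseline.
    """
    if query_ttft_s >= baseline_ttft_s:
        return None
    return max(1, math.ceil(internalize_s / (baseline_ttft_s - query_ttft_s)))

# Hypothetical timings in seconds.
internalize, per_query, baseline = 2.0, 0.05, 1.0
n_star = break_even_questions(internalize, per_query, baseline)
```

With these made-up numbers, a single question is slower than the baseline because it carries the whole internalization cost, while from the break-even count onward every additional question widens the gap.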

Table 1: Comparison across Captioning Benchmarks

Comparison of the base model with video and VIDEO2LORA generated adapters, across captioning benchmarks using LLM Judge scores and Token F1. We report mean scores, the paired difference Δ (V2L − Base), 95% confidence intervals, and the statistical equivalence (Eq) and non-inferiority (NI) criteria.

1. LLM Judge Score (Rescaled [0, 1])

| Benchmark | Base (500M) | V2L (500M) | Δ (500M) | 95% CI (500M) | Eq (500M) | NI (500M) | Base (2.2B) | V2L (2.2B) | Δ (2.2B) | 95% CI (2.2B) | Eq (2.2B) | NI (2.2B) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ActivityNet Captions | 0.428 | 0.356 | -0.072 | [-0.104, -0.041] | Y | Y | 0.576 | 0.492 | -0.084 | [-0.113, -0.057] | Y | Y |
| PLM-RDCap | 0.308 | 0.263 | -0.045 | [-0.069, -0.021] | Y | Y | 0.326 | 0.316 | -0.010 | [-0.032, +0.012] | Y | Y |
| PLM-RCap | 0.252 | 0.242 | -0.011 | [-0.031, +0.009] | Y | Y | 0.270 | 0.287 | +0.017 | [+0.001, +0.034] | Y | Y |
| VDC (aggregate) | 0.515 | 0.406 | -0.108 | [-0.118, -0.098] | Y | Y | 0.539 | 0.511 | -0.028 | [-0.037, -0.019] | Y | Y |
| CaReBench | 0.334 | 0.278 | -0.056 | [-0.067, -0.045] | Y | Y | 0.437 | 0.369 | -0.068 | [-0.078, -0.058] | Y | Y |
| Average | 0.367 | 0.309 | -0.058 | [-0.078, -0.039] | Y | Y | 0.430 | 0.395 | -0.035 | [-0.052, -0.018] | Y | Y |

2. Token F1 Score

| Benchmark | Base (500M) | V2L (500M) | Δ (500M) | 95% CI (500M) | Eq (500M) | NI (500M) | Base (2.2B) | V2L (2.2B) | Δ (2.2B) | 95% CI (2.2B) | Eq (2.2B) | NI (2.2B) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ActivityNet Captions | 0.236 | 0.243 | +0.007 | [+0.002, +0.012] | Y | Y | 0.263 | 0.256 | -0.007 | [-0.012, -0.002] | Y | Y |
| PLM-RDCap | 0.189 | 0.198 | +0.009 | [+0.005, +0.013] | Y | Y | 0.198 | 0.207 | +0.009 | [+0.005, +0.013] | Y | Y |
| PLM-RCap | 0.177 | 0.203 | +0.026 | [+0.021, +0.031] | Y | Y | 0.199 | 0.204 | +0.005 | [+0.001, +0.010] | Y | Y |
| VDC (aggregate) | 0.315 | 0.288 | -0.027 | [-0.030, -0.025] | Y | Y | 0.297 | 0.304 | +0.007 | [+0.003, +0.010] | Y | Y |
| CaReBench | 0.295 | 0.275 | -0.020 | [-0.023, -0.017] | Y | Y | 0.292 | 0.279 | -0.013 | [-0.015, -0.010] | Y | Y |
| Average | 0.243 | 0.242 | -0.001 | [-0.005, +0.003] | Y | Y | 0.250 | 0.250 | 0.000 | [-0.004, +0.004] | Y | Y |
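The Eq and NI columns depend on how the confidence interval of the paired difference sits relative to a tolerance margin. The page does not state how the intervals are computed or which margin is used, so the sketch below assumes a paired percentile bootstrap and a hypothetical margin; `paired_bootstrap_ci`, `decide`, and the value 0.05 are illustrative only.

```python
import random
import statistics

def paired_bootstrap_ci(base, v2l, n_boot=2000, level=0.95, seed=0):
    """Mean paired difference (V2L - Base) with a percentile bootstrap CI."""
    diffs = [b - a for a, b in zip(base, v2l)]
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_boot)
    )
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return statistics.fmean(diffs), (lo, hi)

def decide(ci, margin):
    """Non-inferior: CI lower bound above -margin.
    Equivalent: whole CI inside (-margin, +margin)."""
    lo, hi = ci
    non_inferior = lo > -margin
    return {"NI": non_inferior, "Eq": non_inferior and hi < margin}

# Toy per-example scores with a constant gap of about 0.1.
delta, ci = paired_bootstrap_ci([0.5] * 20, [0.4] * 20)
margin = 0.05  # hypothetical tolerance, not taken from the paper
```

Under this rule, a large positive difference is non-inferior but not equivalent, since its interval lies above the upper margin.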

Table 4: Comparison across Video Question Answering Benchmarks

Comparison of the base model with video and VIDEO2LORA generated adapters, across video question answering benchmarks using LLM Judge scores and Token F1. We report mean scores, the paired difference Δ (V2L − Base), 95% confidence intervals, and the statistical equivalence (Eq) and non-inferiority (NI) criteria.

1. LLM Judge Score (Rescaled [0, 1])

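| Benchmark | Base (500M) | V2L (500M) | Δ (500M) | 95% CI (500M) | Eq (500M) | NI (500M) | Base (2.2B) | V2L (2.2B) | Δ (2.2B) | 95% CI (2.2B) | Eq (2.2B) | NI (2.2B) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NExT-QA (open) | 0.501 | 0.547 | +0.046 | [+0.007, +0.084] | Y | Y | 0.597 | 0.610 | +0.013 | [-0.022, +0.048] | Y | Y |
| ActivityNet-QA | 0.524 | 0.541 | +0.016 | [-0.031, +0.064] | Y | Y | 0.627 | 0.531 | -0.096 | [-0.144, -0.049] | Y | Y |
| PLM-SGQA | 0.390 | 0.317 | -0.074 | [-0.113, -0.034] | Y | Y | 0.493 | 0.295 | -0.198 | [-0.236, -0.161] | | |
| VidCapBench | 0.502 | 0.451 | -0.050 | [-0.071, -0.030] | Y | Y | 0.551 | 0.475 | -0.076 | [-0.096, -0.055] | Y | Y |
| Average | 0.487 | 0.460 | -0.027 | [-0.043, -0.011] | Y | Y | 0.562 | 0.477 | -0.085 | [-0.101, -0.069] | Y | Y |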

2. Token F1 Score

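| Benchmark | Base (500M) | V2L (500M) | Δ (500M) | 95% CI (500M) | Eq (500M) | NI (500M) | Base (2.2B) | V2L (2.2B) | Δ (2.2B) | 95% CI (2.2B) | Eq (2.2B) | NI (2.2B) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NExT-QA (open) | 0.129 | 0.068 | -0.061 | [-0.076, -0.046] | | | 0.140 | 0.076 | -0.063 | [-0.079, -0.048] | | |
| ActivityNet-QA | 0.197 | 0.023 | -0.174 | [-0.199, -0.149] | | | 0.149 | 0.013 | -0.136 | [-0.156, -0.117] | | |
| PLM-SGQA | 0.081 | 0.225 | +0.145 | [+0.131, +0.158] | | Y | 0.092 | 0.203 | +0.111 | [+0.098, +0.124] | | Y |
| VidCapBench | 0.216 | 0.209 | -0.007 | [-0.019, +0.004] | Y | Y | 0.196 | 0.218 | +0.022 | [+0.010, +0.033] | Y | Y |
| Average | 0.156 | 0.131 | -0.024 | [-0.041, -0.008] | Y | Y | 0.144 | 0.128 | -0.017 | [-0.032, -0.002] | Y | Y |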
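The page does not define Token F1. A common definition is the harmonic mean of token-level precision and recall over the bag-of-words overlap between prediction and reference, sketched below. The lowercase whitespace tokenization in `token_f1` is an assumption and may differ from the evaluation code behind the tables.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 from multiset overlap of whitespace tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)  # 1.0 only if both are empty
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1("a man cooking", "a man is cooking in a kitchen")
```

Because the metric is sensitive to answer length and exact wording, it need not move together with LLM judge scores on the same outputs.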

Table 2: VDC Results Broken Down by Caption Style

Fine-grained evaluation on the Video Description Corpus (VDC) across various captioning styles, comparing the performance of base models and Video2LoRA generated adapters.

| Subset / Caption Style | Base (500M) | V2L (Δ) (500M) | Base (2.2B) | V2L (Δ) (2.2B) |
|---|---|---|---|---|
| Short caption | 0.629 | 0.535 (-0.094) | 0.556 | 0.579 (+0.022) |
| Detailed caption | 0.476 | 0.401 (-0.074) | 0.526 | 0.463 (-0.063) |
| Camera | 0.310 | 0.131 (-0.178) | 0.478 | 0.392 (-0.085) |
| Background | 0.642 | 0.523 (-0.117) | 0.588 | 0.606 (+0.018) |
| Main object | 0.517 | 0.442 (-0.075) | 0.546 | 0.514 (-0.032) |

Table 3: CaReBench Results Broken Down by Subset

Fine-grained evaluation across CaReBench subsets (semantic dimensions), comparing base models and Video2LoRA generated adapters.

| Subset / Semantic Dimension | Base (500M) | V2L (Δ) (500M) | Base (2.2B) | V2L (Δ) (2.2B) |
|---|---|---|---|---|
| Caption | 0.418 | 0.324 (-0.094) | 0.465 | 0.400 (-0.065) |
| Events | 0.201 | 0.169 (-0.032) | 0.340 | 0.267 (-0.073) |
| Objects | 0.368 | 0.327 (-0.043) | 0.457 | 0.392 (-0.065) |
| Spatial caption | 0.424 | 0.329 (-0.095) | 0.519 | 0.426 (-0.094) |
| Temporal caption | 0.260 | 0.242 (-0.018) | 0.404 | 0.360 (-0.045) |

Qualitative Examples

Browse real evaluation outputs comparing the Baseline (prompted with direct video-in-context) versus Video2LoRA (frozen VLM with no video tokens, answering from parameters).


BibTeX Citation

If you find Video2LoRA useful for your research, please cite our submission:

@misc{suri2026video2loraparametricvideointernalization,
  title={Video2LoRA: Parametric Video Internalization for Vision-Language Models},
  author={Manan Suri and Sarvesh Baskar and Dinesh Manocha},
  year={2026},
  eprint={2606.04351},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.04351},
}