Preference-Optimized Video Question Answering with Rationales for Data Efficiency
We introduce POVQA, a data-efficient pipeline for video question answering that addresses the challenge of long-video understanding. Our method compresses each second of video into a single temporally pooled image using motion-blur and weighted-averaging variants, then aligns Large Vision-Language Models with lightweight supervision. This yields a 23× reduction in context tokens while preserving full temporal coverage. Using our ReasonVQA dataset of only 239 human-annotated QA pairs, we obtain large gains over the baseline: F1 rises from 0.212 to 0.543, BLEU-4 from 0.031 to 0.291, and ROUGE-L from 0.196 to 0.528. Zero-shot evaluation on TVQA reaches 64.7% accuracy, surpassing prior zero-shot methods.
Representative examples showing POVQA's reasoning capabilities and temporal understanding across different video contexts.
Multi-character dialogue scene with temporal pooling visualization
Motion-heavy scene with blend blur pooling effectiveness
Long sequence requiring understanding of temporal relationships
Short scene with both textual and visual cues
Example from TVQA (movies)
Example from ReasonVQA (movies)
Example from ReasonVQA (movies)
POVQA tackles the challenge of processing long videos (up to 5 minutes) within LLM context limits through temporal pooling. Our pipeline consists of the following key innovations:
Temporal Pooling: Four novel operators (Blend Blur, Weighted Average, Exponential, Ramp) compress 24-60 frames into single representative images.
Subtitle Alignment: Interleaved text-image sequences maintain temporal coherence while maximizing information density.
Rationale Supervision: Two-stage fine-tuning: supervised fine-tuning on rationale-annotated answers, followed by direct preference optimization (DPO).
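The pooling operators above can be sketched as per-frame weight schemes applied to one second of frames. The exact weight formulas (decay rate, ramp slope) are assumptions for illustration, not the paper's reported parameters:

```python
import numpy as np

def pooling_weights(n, scheme="uniform"):
    """Per-frame weights for collapsing n frames into one image.

    Loose interpretations of the paper's operators (exact formulas
    are assumptions):
      - "uniform": equal weights (plain average)
      - "exponential": later frames weighted exponentially more
      - "ramp": weights rise linearly toward the last frame
    """
    if scheme == "uniform":
        w = np.ones(n)
    elif scheme == "exponential":
        w = np.exp(np.linspace(-2.0, 0.0, n))  # decay rate is a guess
    elif scheme == "ramp":
        w = np.linspace(1.0, float(n), n)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return w / w.sum()

def pool_second(frames, scheme="uniform"):
    """Collapse a (n, H, W, C) stack of frames from one second into (H, W, C)."""
    w = pooling_weights(len(frames), scheme)
    # Weighted sum over the frame axis; weights sum to 1, so the
    # result stays in the original intensity range.
    return np.tensordot(w, frames.astype(np.float64), axes=1)
```

At 24–60 fps this collapses each second to one image, which is where the reduction in visual context tokens comes from.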
Raw Video → Temporal Pooling → Subtitle Alignment → QLoRA SFT → DPO → Enhanced VQA
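The final DPO stage of the pipeline optimizes the model toward preferred rationales. A minimal sketch of the standard DPO objective for one preference pair follows; `beta` and the log-probability inputs are generic placeholders, not values from the paper:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l: summed token log-probs of the chosen and rejected
    responses under the policy model; ref_logp_* are the same quantities
    under the frozen reference (SFT) model. beta controls deviation from
    the reference (0.1 is a common default, assumed here).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): small when the policy prefers the chosen
    # response more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; raising the chosen response's log-probability relative to the reference drives the loss toward zero.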
| Method | F1 Score | BLEU-4 | ROUGE-L | Embed. Cosine |
|---|---|---|---|---|
| Baseline (no fine-tuning) | 0.212 | 0.031 | 0.196 | 0.383 |
| POVQA (SFT only) | 0.545 | 0.278 | 0.520 | 0.632 |
| POVQA (SFT + DPO) | 0.543 | 0.291 | 0.528 | 0.631 |
| Method | Zero-shot | Accuracy (%) |
|---|---|---|
| POVQA (Ours) | ✓ | 64.7 |
| FrozenBiLM (w/ speech) | ✓ | 59.7 |
| GPT-4V | ✓ | 57.8 |
| IG-VLM (LLaVA-1.6 34B) | ✓ | 51.1 |
| Goldfish-7B (vision+subs) | ✓ | 46.9 |
| Q-ViD | ✓ | 41.0 |