Image-to-video (I2V) generation has made rapid progress, yet scaling to high resolution (e.g., 2K) is bottlenecked by the efficiency–fidelity dilemma: end-to-end high-resolution generators deliver strong quality but require tens of thousands of GPU-seconds per clip, while low-resolution generation followed by video super-resolution (VSR) loses the input-image condition and hallucinates details inconsistent with the reference.
We present SwiftI2V, an efficient two-stage framework that resolves this dilemma. Stage I produces a low-resolution motion reference with a large backbone under few-step sampling; Stage II refines it to 2K by taking both the input image and the Stage I output as strong conditions. To handle long-duration 2K refinement within limited GPU memory, we introduce Conditional Segment-wise Generation (CSG) with bidirectional contextual interaction, which divides the temporal axis into bounded segments augmented by neighboring contexts, mitigating error accumulation while retaining high-frequency fidelity. A stage-transition training strategy further closes the train–test gap by simulating the artifacts of Stage I outputs during Stage II training.
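To make the CSG scheme concrete, below is a minimal sketch of segment-wise refinement with bidirectional context under our stated assumptions; the names `csg_refine`, `refiner`, `image_cond`, `seg_len`, and `ctx_len` are illustrative placeholders, not the actual SwiftI2V API.

```python
import torch

def csg_refine(lr_video, input_image, refiner, seg_len=16, ctx_len=4):
    """Illustrative sketch of Conditional Segment-wise Generation.

    Splits the temporal axis into bounded segments and refines each one
    conditioned on the input image and the Stage I low-resolution clip,
    augmented with a few context frames from both temporal neighbors.

    lr_video:    (T, C, H, W) Stage I low-resolution output
    input_image: (C, H2, W2) reference image at the target resolution
    refiner:     Stage II model; this call signature is hypothetical
    """
    T = lr_video.shape[0]
    refined = []
    for start in range(0, T, seg_len):
        end = min(start + seg_len, T)
        # Bidirectional context: frames before and after the segment.
        lo = max(0, start - ctx_len)
        hi = min(T, end + ctx_len)
        segment = lr_video[lo:hi]                       # segment + neighbors
        out = refiner(segment, image_cond=input_image)  # hypothetical call
        # Keep only the frames belonging to the current segment.
        refined.append(out[start - lo : end - lo])
    return torch.cat(refined, dim=0)
```

Bounding `seg_len` caps peak activation memory regardless of clip length, while the overlapping context frames on both sides supply each segment with bidirectional temporal information from its neighbors.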
On VBench-I2V at 2K, SwiftI2V matches strong end-to-end baselines on key I2V metrics while reducing total GPU time by 202×. Notably, it enables practical 2K I2V generation on a single consumer RTX 4090 within 24 GB of memory.