Controlling video and audio generation requires diverse modalities — from depth and pose to camera trajectories and audio transformations — yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality.
We introduce AVControl, a lightweight, extensible framework built on LTX-2, a joint audio-visual foundation model. Each control modality is trained as a separate LoRA on a parallel canvas that supplies the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that naively extending image-based in-context methods to video fails for structural control, and that our parallel-canvas approach resolves this failure.
On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives.
AVControl places the reference control signal on a parallel canvas — as additional tokens processed alongside the generation target in the model's self-attention layers. The only trainable component is a lightweight LoRA adapter; the audio-visual backbone remains entirely frozen. Each control modality is trained as its own independent LoRA, keeping individual runs small and focused.
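The parallel-canvas idea can be sketched as ordinary self-attention over the concatenation of control and target tokens. This is an illustrative NumPy sketch under our own simplifying assumptions (single head, no LoRA weights, no positional encodings), not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallel_canvas_attention(target, control, Wq, Wk, Wv):
    """Self-attention over the concatenation [control; target].

    target:  (T, d) latent tokens being generated
    control: (C, d) reference-canvas tokens (e.g. depth or pose latents)

    The control tokens are plain extra context in the token sequence,
    so the backbone needs no architectural change; only the returned
    target rows feed the rest of the network.
    """
    x = np.concatenate([control, target], axis=0)            # (C+T, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    out = attn @ v
    return out[control.shape[0]:]                            # (T, d)
```

In the full model, the LoRA adapter is the only place where the frozen projection weights are perturbed; the concatenation itself adds no parameters.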
Each LoRA trains in between a few hundred and 15,000 steps, on small datasets of a few hundred to a few thousand samples. The total budget across all modalities is roughly 55,000 steps, less than one third of a single VACE training run (200K steps).
Depth, pose, and edge maps each guide the generation through their own independently trained LoRA. The outputs faithfully follow the spatial structure while maintaining natural motion and visual quality.
Because reference and target interact through self-attention, we can continuously modulate control strength at inference time. This is impossible with channel-concatenation methods.
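One simple way to realize such continuous modulation (an illustrative sketch, not necessarily the exact mechanism used in the paper) is to rescale the post-softmax attention mass assigned to the control-token keys and renormalize each row:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modulated_attention(q, k, v, n_control, strength=1.0):
    """Attention where the first `n_control` keys are reference tokens.

    strength=1.0 reproduces plain attention, strength=0.0 ignores the
    reference entirely, and intermediate values interpolate continuously.
    """
    w = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    w[:, :n_control] *= strength            # scale mass on control keys
    w /= w.sum(axis=-1, keepdims=True)      # renormalize each query row
    return w @ v
```

With channel concatenation the control signal is fused into the input before any attention happens, so there is no comparable per-step knob to turn down.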
Input with Mask → Inpainted Result
Input with Mask → Inpainted Result
Cropped Input → Outpainted Result
Cropped Input → Outpainted Result
Original → Edited Output
Original → Edited Output
Low-res Input → Detailed Output
Crop comparison
From a single image, we generate video with a specified camera motion. The desired trajectory is encoded as a warped grid on the reference canvas.
Camera Grid → Generated Output
Camera Grid → Generated Output
Trajectory 1
Trajectory 2
Trajectory 3
Trajectory 4
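The warped-grid encoding above amounts to projecting a fixed set of 3-D grid vertices through each frame's camera. A minimal NumPy sketch of that projection step, assuming a standard pinhole model (our own simplification, not the paper's renderer):

```python
import numpy as np

def project_grid(K, R, t, pts_world):
    """Project 3-D grid points into the image for one camera pose.

    K: (3, 3) intrinsics, R: (3, 3) rotation, t: (3,) translation
    pts_world: (N, 3) grid vertices in world coordinates
    Returns (N, 2) pixel coordinates; drawing these per frame yields
    the warped grid placed on the reference canvas.
    """
    cam = pts_world @ R.T + t          # world -> camera coordinates
    uv = cam @ K.T                     # camera -> homogeneous pixels
    return uv[:, :2] / uv[:, 2:3]      # perspective divide
```

Repeating this for every frame of the desired trajectory, and rasterizing the projected vertices, produces the camera-grid videos shown above.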
Given an existing video, we estimate its full camera parameters and re-render the scene at a new trajectory while preserving the original scene dynamics.
Source
Trajectory 1
Trajectory 2
Trajectory 3
Point tracks are rendered as colored dots on a black canvas. The LoRA generates a video that follows these trajectories, enabling fine-grained control over object motion.
Track Dots → Generated Output
Track Dots → Generated Output
Tracks 1
Tracks 2
Tracks 3
Tracks 4
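The track-dot canvases above can be produced by simple rasterization: for each frame, paint a small colored square at every track's current position on a black image. A sketch under our own assumptions about the dot size and color assignment:

```python
import numpy as np

def render_track_canvas(tracks, colors, height, width, radius=2):
    """Rasterize point tracks as colored dots on a black canvas.

    tracks: (F, N, 2) per-frame (x, y) pixel positions of N points
    colors: (N, 3) uint8 RGB color per track
    Returns (F, height, width, 3) uint8 frames for the reference canvas.
    """
    frames = np.zeros((tracks.shape[0], height, width, 3), dtype=np.uint8)
    for f in range(tracks.shape[0]):
        for n in range(tracks.shape[1]):
            x = int(round(tracks[f, n, 0]))
            y = int(round(tracks[f, n, 1]))
            if 0 <= x < width and 0 <= y < height:
                y0, y1 = max(y - radius, 0), min(y + radius + 1, height)
                x0, x1 = max(x - radius, 0), min(x + radius + 1, width)
                frames[f, y0:y1, x0:x1] = colors[n]   # solid colored dot
    return frames
```

A consistent color per track lets the model associate each dot with a single moving object across frames.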
Re-render a scene from a substantially different viewpoint while preserving the action and timing.
Source Angle
Generated New Angle
Source Angle
New Angle 1
New Angle 2
New Angle 3
The following examples contain audio — please turn on your speakers.
Audio intensity control generates audio whose temporal dynamics follow the visual content. The waveform below shows how the generated audio energy tracks the on-screen action.
RMS Envelope Over Time
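The RMS envelope plotted above is a standard frame-wise energy measure. A minimal sketch of how such an envelope can be computed from a mono waveform (our own illustrative parameters, not the paper's analysis code):

```python
import numpy as np

def rms_envelope(audio, frame_len=1024, hop=256):
    """Frame-wise RMS energy of a mono waveform.

    Comparing this envelope with per-frame visual motion magnitude is
    one way to check that generated audio tracks the on-screen action.
    """
    n = 1 + max(0, len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop: i * hop + frame_len]
                       for i in range(n)])
    return np.sqrt((frames ** 2).mean(axis=1))     # (n,) energy curve
```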
Embed clean speech within ambient sounds matching a described scene.
Condition — Clean Speech
Output — Ambient Audio
Generate multi-person talking video with synchronized lip motion from an abstract layout of colored bounding boxes indicating who speaks when.
Bounding Boxes → Generated Output
Bounding Boxes → Generated Output
| Benchmark | Task | Key Metric | Δ vs. Best Baseline |
|---|---|---|---|
| VACE Benchmark | Depth / Pose / Inpaint / Outpaint | VBench Avg | +2.3 to +3.8 vs. VACE — best on all 4 tasks |
| ReCamMaster | Camera trajectory | CLIP-F 99.13% | +0.39 vs. ReCamMaster (98.74%) |
| VGGSound | Audio intensity | IS 34.51 | Best among all methods |
| HDTF | Who-is-talking | E-FID 0.18 | Best E-FID and FID (12.31) |
@article{benyosef2026avcontrol,
title = {AVControl: Efficient Framework for Training Audio-Visual Controls},
author = {Ben-Yosef, Matan and Halperin, Tavi and Ken Korem, Naomi and Salama, Mohammad and Cain, Harel and Joseph, Asaf and Chen, Anthony and Jelercic, Urska and Bibi, Ofir},
year = {2026},
}