Fabric 1.0 — Image to Video

Image to Video with Fabric 1.0 in AISVIT. Animate a single portrait image into a lip-synced talking-head video driven by your own voice track.

About this model

Fabric 1.0 is a specialized VEED audio-to-video model that generates lip-synced talking-head videos from one portrait image and a voice track. Output runs up to 60 seconds, with simple per-second pricing.

When is this model useful?

Fabric 1.0 works best when the goal is a speaking person on screen, not a broad cinematic scene or complex character action.

Best fit tasks

Talking-head videos from one portrait photo and a recorded voice track for explainers, onboarding, FAQ answers, product updates, and course clips.
AI presenter or spokesperson videos for landing pages, social posts, internal communications, and lightweight marketing content.
Turning an existing voiceover, podcast excerpt, or announcement into a simple face-on video without filming a real presenter.
Localized or multilingual avatar content when you want to keep the same portrait but swap the audio for different languages or messages.

Main advantages

It is purpose-built for talking portraits, so lip-sync and facial speaking motion are usually better aligned than in general video generators.
You can use your own audio file, which gives you direct control over tone, pacing, emotion, and the exact spoken words.
It supports longer outputs than many short-form cinematic models in AISVIT, with videos up to 60 seconds when the input audio allows it.
The control set is simple for non-technical users: portrait image, audio file, and output resolution.

Limitations to know

This is not the right model for action scenes, multiple characters, wide camera moves, or full-body animation. It is specialized for talking-head output.
Results depend heavily on the source portrait. Clear, front-facing images with one visible face usually work best, while side angles, occlusions, or busy crops can reduce quality.
In this integration, output is limited to 480p or 720p, so it is not suited to high-resolution cinematic production.
Video length mostly follows the uploaded audio, and you do not get deep manual control over gestures, camera behavior, or scene staging.

How to use this model

The simplest workflow is to start with a clean portrait and a clear voice track, then choose the resolution based on whether you are testing an idea or preparing a more polished asset.

Simple workflow

Upload a clear portrait image where one face is easy to see. Head-and-shoulders framing with good lighting usually gives the most stable talking-head result.
Upload the voice audio that should drive the animation. The model follows that recording, so speaking speed, pauses, and emotion come from your audio file.
Choose 480p if you want a lighter, cheaper first pass, or 720p if you want a cleaner result for publishing or review.
Keep the message focused. This model works best for one speaker delivering one clear message, not for scene changes or several people talking.
If you need several language versions, reuse the same image and swap the audio track for each language. That is often faster than rebuilding the whole video from scratch.
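The multi-language step above can be planned as a simple job list before anything is uploaded. The sketch below is only an illustration: `RenderJob` and `plan_localized_jobs` are hypothetical names, the file names are placeholders, and nothing here calls an AISVIT API. It only shows the idea of keeping one portrait fixed while swapping the audio per language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderJob:
    """One planned generation: same portrait, one audio track (hypothetical structure)."""
    language: str
    image: str
    audio: str
    resolution: str


def plan_localized_jobs(image, audio_by_language, resolution="480p"):
    """Build one job per language, reusing the same portrait image for all of them."""
    return [
        RenderJob(language, image, audio, resolution)
        for language, audio in sorted(audio_by_language.items())
    ]


jobs = plan_localized_jobs(
    "presenter.png",
    {"es": "welcome_es.mp3", "en": "welcome_en.mp3", "de": "welcome_de.mp3"},
)
for job in jobs:
    print(job.language, job.image, job.audio, job.resolution)
```

Each job would then be submitted by hand in the AISVIT interface; the list just keeps the image and resolution consistent across language versions.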

Supported inputs

Required: one portrait image in JPG, JPEG, or PNG format.
Required: one audio file in MP3, WAV, M4A, or AAC format.
Best results usually come from one clear, front-facing portrait with good lighting and minimal face obstruction.
Output resolution can be set to 480p or 720p in this AISVIT integration.
Output can run up to 60 seconds when the uploaded audio length and model limits allow it.
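The input rules listed above can be checked locally before uploading. This is a minimal sketch, not part of AISVIT: `check_inputs` is a hypothetical helper that looks only at file extensions and the stated limits, and the caller is assumed to know the audio duration already. The page does not say how audio longer than 60 seconds is handled, so the sketch simply flags it.

```python
from pathlib import Path

# Values taken from the "Supported inputs" list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac"}
RESOLUTIONS = {"480p", "720p"}
MAX_SECONDS = 60.0


def check_inputs(image_path, audio_path, resolution, audio_seconds):
    """Return a list of problems; an empty list means the inputs look acceptable.

    Hypothetical helper: checks extensions and stated limits only,
    not the actual file contents or the portrait quality.
    """
    problems = []
    if Path(image_path).suffix.lower() not in IMAGE_EXTENSIONS:
        problems.append("image must be JPG, JPEG, or PNG")
    if Path(audio_path).suffix.lower() not in AUDIO_EXTENSIONS:
        problems.append("audio must be MP3, WAV, M4A, or AAC")
    if resolution not in RESOLUTIONS:
        problems.append("resolution must be 480p or 720p")
    if audio_seconds <= 0:
        problems.append("audio must have a positive duration")
    elif audio_seconds > MAX_SECONDS:
        problems.append("audio is longer than the 60-second output limit")
    return problems


print(check_inputs("presenter.png", "welcome.mp3", "720p", 42.5))
print(check_inputs("presenter.gif", "welcome.ogg", "1080p", 75))
```

A check like this catches format mistakes early, but it cannot judge whether the portrait is clear and front-facing, which still has to be reviewed by eye.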

What you get

A generated MP4 video file.
A talking-head style clip where the portrait is animated to match the uploaded speech audio.
Available output resolutions: 480p or 720p.
Video length that usually follows the source audio, up to the 60-second maximum.

Other workflows for this model

Audio to Video

AISVIT pricing details

480p: 8 credits per second
720p: 15 credits per second
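Because pricing is per second, the cost of a clip can be estimated from the audio length and the chosen resolution. The sketch below assumes cost scales linearly with duration; how AISVIT rounds fractional seconds is not stated here, so treat the result as an estimate. `estimate_credits` is a hypothetical helper name.

```python
# Rates from the pricing list above.
CREDITS_PER_SECOND = {"480p": 8, "720p": 15}
MAX_SECONDS = 60


def estimate_credits(seconds, resolution):
    """Estimate credits as duration times the per-second rate.

    Assumption: billing is linear in duration; the actual rounding
    of fractional seconds is not documented on this page.
    """
    if resolution not in CREDITS_PER_SECOND:
        raise ValueError("resolution must be 480p or 720p")
    if not 0 < seconds <= MAX_SECONDS:
        raise ValueError("duration must be above 0 and at most 60 seconds")
    return seconds * CREDITS_PER_SECOND[resolution]


print(estimate_credits(30, "480p"))  # 240
print(estimate_credits(30, "720p"))  # 450
print(estimate_credits(60, "720p"))  # 900
```

For a 30-second clip, a 480p test pass costs about 240 credits and a 720p version about 450, which is why drafting at 480p and finalizing at 720p can save credits.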