Kling O1 — Image to Video

About this model

A multimodal Kling model for short clips (typically 5 to 10 seconds) and context-aware editing. Use it when basic generation is not enough and you want to steer style, character design, camera language, or source footage with references.

When is this model useful?

Kling O1 works best when basic text-to-video is not enough and you need stronger control: preserve motion, borrow style from references, carefully transform existing footage, or hold subject appearance more consistently across the clip.

Best fit tasks

Text-to-video generation for short clips where a plain prompt is not enough and you want to add reference images so the style, subject, or product stays closer to the brief.
Image-to-video workflows where the opening frame matters and you want the scene to start from a specific still while optionally landing on a guided final composition.
Reference-driven generation with a video reference in feature mode, where you want to borrow camera rhythm, motion language, or scene energy without directly editing the source footage itself.
Video-to-video editing in base mode when you need to change the character, environment, or styling of a shot while preserving the original motion, timing, and optionally the original audio track.

Main advantages

One model covers text-to-video, image-guided generation, and reference-based video editing, so related workflows stay in one tool instead of being split across multiple models.
Kling O1 is especially useful when motion integrity matters: in base editing, camera movement, action timing, and overall shot structure usually stay more stable than in broad remix-style tools.
You can add several reference images, up to 7 without video or up to 4 when a video reference is included, which helps with characters, products, wardrobe, locations, and visual style.
Standard and Pro modes make budgeting easier: Standard is practical for iteration, while Pro is better for cleaner and more presentation-ready output.

Limitations to know

This model is built for short clips. Text-to-video and image-to-video usually run at 5 or 10 seconds, while video-reference workflows stay in the short 3 to 10 second range.
Kling O1 does not generate brand-new audio. It can only preserve the existing audio from a source video during base editing, and only when the "keep original sound" option is enabled.
Aspect ratio control is mainly for pure text-to-video. Once you start from an uploaded image or edit a base video, the model usually follows the framing of that input media.
Too many references, overly long prompts, or several dense events inside one short render can reduce control. This is not the best model for long narrative sequences or for small on-screen text that must render accurately in every frame.

How to use this model

The best workflow for Kling O1 is to begin with one clear scene goal, then add references only when each one solves a specific control problem: subject appearance, visual style, camera behavior, or editing an existing clip.

Simple workflow

Write the prompt in plain language: who or what is in the shot, what happens, where it happens, what style you want, how the camera behaves, and what should remain consistent.
Choose Standard for cheaper, faster tests or Pro when you are closer to a final render and want cleaner output.
For text-to-video, pick 5 or 10 seconds and choose 16:9, 9:16, or 1:1 depending on where the clip will be used.
Upload a start image if the first frame needs to match a specific portrait, product shot, illustration, or composition. Add an end image only when you want the scene to arrive at a very specific final frame.
Add reference images when subject identity, product details, styling, or location cues matter. You can use up to 7 image references without video or up to 4 when a video reference is also attached.
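
The text-to-video choices in the steps above can be sketched as a small settings check. This is only an illustration: the class and function names are hypothetical and do not come from any AISVIT or Kling API. The allowed values (Standard or Pro, 5 or 10 seconds, 16:9, 9:16, or 1:1) are the ones listed in this workflow.

```python
from dataclasses import dataclass
from typing import List

# Allowed values taken from the workflow steps above.
# All names here are hypothetical; this is not an official API.
ALLOWED_MODES = {"standard", "pro"}
ALLOWED_DURATIONS_S = {5, 10}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16", "1:1"}


@dataclass(frozen=True)
class TextToVideoSettings:
    prompt: str
    mode: str = "standard"       # Standard for cheap iteration, Pro for final renders
    duration_s: int = 5
    aspect_ratio: str = "16:9"


def settings_problems(settings: TextToVideoSettings) -> List[str]:
    """Return a list of human-readable problems; an empty list means the settings look fine."""
    problems: List[str] = []
    if not settings.prompt.strip():
        problems.append("prompt is required")
    if settings.mode.lower() not in ALLOWED_MODES:
        problems.append("mode must be Standard or Pro")
    if settings.duration_s not in ALLOWED_DURATIONS_S:
        problems.append("duration must be 5 or 10 seconds")
    if settings.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append("aspect ratio must be 16:9, 9:16, or 1:1")
    return problems


if __name__ == "__main__":
    draft = TextToVideoSettings(
        prompt="A ceramic mug rotating slowly on a wooden table, soft daylight, static camera",
        mode="standard",
        duration_s=5,
        aspect_ratio="9:16",
    )
    print(settings_problems(draft))
```

Running the check before submitting a job is a cheap way to catch an unsupported duration or aspect ratio while iterating in Standard mode.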

Supported inputs

Required: a text prompt.
Optional: one start image in JPG, JPEG, or PNG format, up to 10 MB.
Optional: one end image in JPG, JPEG, or PNG format, up to 10 MB, but only together with a start image.
Optional: reference images in JPG, JPEG, or PNG; up to 7 images without video or up to 4 when a video reference is included.
Optional: one reference video in MP4 or MOV format, roughly 3 to 10 seconds long and up to 200 MB.
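
The input rules above can be checked before upload with a short pre-flight sketch. The types and the function name are hypothetical, and the code assumes 1 MB means 1024 * 1024 bytes, which this page does not specify. The 10 MB cap is applied only to the start and end images, because the list states no size limit for reference images.

```python
from dataclasses import dataclass, field
from pathlib import PurePath
from typing import List, Optional

# Limits copied from the "Supported inputs" list.
# Assumption: 1 MB = 1024 * 1024 bytes (not stated on the page).
MB = 1024 * 1024
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
VIDEO_EXTS = {".mp4", ".mov"}


@dataclass
class MediaFile:
    name: str
    size_bytes: int
    duration_s: Optional[float] = None  # only meaningful for video


@dataclass
class KlingInputs:
    prompt: str
    start_image: Optional[MediaFile] = None
    end_image: Optional[MediaFile] = None
    reference_images: List[MediaFile] = field(default_factory=list)
    reference_video: Optional[MediaFile] = None


def _ext(media: MediaFile) -> str:
    return PurePath(media.name).suffix.lower()


def validate_inputs(inputs: KlingInputs) -> List[str]:
    """Return a list of rule violations; an empty list means the inputs pass."""
    errors: List[str] = []
    if not inputs.prompt.strip():
        errors.append("a text prompt is required")
    if inputs.end_image is not None and inputs.start_image is None:
        errors.append("an end image requires a start image")

    for label, img in (("start image", inputs.start_image),
                       ("end image", inputs.end_image)):
        if img is None:
            continue
        if _ext(img) not in IMAGE_EXTS:
            errors.append(f"{label} must be JPG, JPEG, or PNG")
        if img.size_bytes > 10 * MB:
            errors.append(f"{label} is larger than 10 MB")

    # Up to 7 reference images without video, up to 4 with a video reference.
    max_refs = 4 if inputs.reference_video is not None else 7
    if len(inputs.reference_images) > max_refs:
        errors.append(f"at most {max_refs} reference images are allowed here")
    for img in inputs.reference_images:
        if _ext(img) not in IMAGE_EXTS:
            errors.append(f"reference image {img.name} must be JPG, JPEG, or PNG")

    video = inputs.reference_video
    if video is not None:
        if _ext(video) not in VIDEO_EXTS:
            errors.append("reference video must be MP4 or MOV")
        if video.size_bytes > 200 * MB:
            errors.append("reference video is larger than 200 MB")
        # The page says "roughly 3 to 10 seconds", so treat this as a soft warning.
        if video.duration_s is not None and not 3 <= video.duration_s <= 10:
            errors.append("reference video should be roughly 3 to 10 seconds long")
    return errors
```

For example, five reference images pass on their own, but the same five images plus a reference video produce the message "at most 4 reference images are allowed here".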

What you get

A generated MP4 video file.
In AISVIT text-to-video and image-to-video, the result is usually silent, because the model does not generate new audio.
In base editing, you can preserve the original source audio when the "keep original sound" option is enabled.
Clip length depends on the workflow: 5 or 10 seconds for text-to-video and image-to-video without a video input, roughly 3 to 10 seconds when a reference video guides the clip in feature mode, or the duration of the source clip in base editing.
Quality is controlled through Standard and Pro modes rather than a separate manual resolution selector.

AISVIT pricing details

Standard without video input: 8.4 credits per second
Standard with reference or base video: 12.6 credits per second
Pro without video input: 11.2 credits per second
Pro with reference or base video: 16.8 credits per second
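
As a rough budgeting aid, the per-second rates above can be turned into a small estimator. This sketch assumes the cost is simply the rate multiplied by the clip length in seconds, with no rounding or minimum charge; the page does not confirm how AISVIT bills partial seconds. The function name is hypothetical.

```python
from decimal import Decimal

# Rates copied from the pricing list above: (mode, has_video_input) -> credits per second.
CREDITS_PER_SECOND = {
    ("standard", False): Decimal("8.4"),
    ("standard", True): Decimal("12.6"),
    ("pro", False): Decimal("11.2"),
    ("pro", True): Decimal("16.8"),
}


def estimate_credits(mode: str, seconds: float, has_video_input: bool = False) -> Decimal:
    """Estimate credits as rate * seconds (assumption: linear billing, no rounding)."""
    rate = CREDITS_PER_SECOND[(mode.lower(), has_video_input)]
    return rate * Decimal(str(seconds))


if __name__ == "__main__":
    print(estimate_credits("standard", 5))                      # 42.0
    print(estimate_credits("pro", 10, has_video_input=True))    # 168.0
```

Under this assumption, a 5-second Standard test without video costs 42 credits, while a 10-second Pro render with a reference or base video costs 168 credits, which is why iterating in Standard first tends to be the cheaper path.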