ByteDance open-sources Bernini: AI video editing finally begins to 'understand human language'

2026-06-18

When video generation shifts from "one-shot output" to "repeated revision", the real challenge is no longer just image quality but controllability.

Over the past year, AI video models have progressed rapidly: sharper images, smoother shots, and increasingly diverse styles.

But once you put these models into a real creative workflow, you quickly run into another problem: generating is easy, modifying is hard.

Suppose you want to turn a sunny aerial shot of a city into a snowy day. Many models simply add snowflakes to the image without also changing the sky, the road reflections, and the lighting on the buildings. Or suppose you want to place a poster on the LED screen of a shopping mall: the edges come out blurred, the perspective is off, and the composite falls apart as soon as the camera moves.

On the surface this looks like "unstable results", but fundamentally the model does not truly understand what the user wants to change, which regions must be preserved, and which changes need to stay consistent over time.

The ByteDance commercialization technology team recently open-sourced Bernini, a unified framework for video generation and editing that targets this pain point. According to public information, Bernini's core idea is to first use a multimodal large language model (MLLM) for semantic understanding and planning, and then generate high-quality video with a Diffusion/DiT renderer. Currently, the inference code and weights of Bernini-R are available, and a complete version that includes the MLLM Planner is still being prepared.

The hardest part of AI video is not "generation", but "accurate modification"

Editing a single image is already far from simple, and editing video is even harder.

A video is not a single frame but a continuous spatiotemporal relationship. When you change the weather, you cannot change only the weather; when you change an action, the identity of the subject must not change; when you change a material, the texture must not drift as the camera moves; when you implant an image, its boundaries, lighting, and perspective must follow the original video.

So, a truly usable video editing model needs to handle at least four things simultaneously:

Understanding instructions: knowing whether the user wants to change the weather, style, subject, material, or camera focus;

Preserving structure: keeping the subject, background, and camera relationships that should not move as stable as possible;

Maintaining temporal consistency: avoiding flicker, drift, and deformation from one frame to the next;

Supporting reference materials: image and video references should not be treated as a rough hint, but should genuinely constrain the result.
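Two of these requirements, preserving structure and temporal consistency, can at least be checked roughly on an edited clip. The sketch below is only an illustration and is not part of Bernini: the function names, the list-of-lists frame format, and the idea of a fixed per-pixel edit mask are all assumptions made for the example.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame: rows of pixel intensities in [0, 1]
Mask = List[List[bool]]    # True where the edit is allowed to change pixels


def preserved_region_error(source: List[Frame], edited: List[Frame], edit_mask: Mask) -> float:
    """Mean absolute difference between source and edited video outside the edit mask."""
    total, count = 0.0, 0
    for src_frame, out_frame in zip(source, edited):
        for y, row in enumerate(src_frame):
            for x, value in enumerate(row):
                if not edit_mask[y][x]:  # pixel was supposed to stay untouched
                    total += abs(value - out_frame[y][x])
                    count += 1
    return total / count if count else 0.0


def flicker_score(video: List[Frame]) -> float:
    """Mean absolute change between consecutive frames."""
    total, count = 0.0, 0
    for prev, curr in zip(video, video[1:]):
        for prev_row, curr_row in zip(prev, curr):
            for a, b in zip(prev_row, curr_row):
                total += abs(a - b)
                count += 1
    return total / count if count else 0.0
```

A low preserved-region error suggests the non-edited areas survived the edit. The flicker score also rises with genuine motion, so it is more meaningful when the edited clip is compared against the source clip than when read as an absolute number.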

This is also why Bernini is attracting attention. It does not just emphasize "generating a good-looking video", but moves the problem one step closer to the production process: how do you keep making changes after generation?

Bernini's solution: Understand first, then take action

Bernini's architecture can be understood simply as a combination of "director + post-production".

The first half is like a director: the MLLM-based planner is responsible for understanding text instructions, source videos, reference images, and reference videos, and for deciding what the target footage should look like. It does not draw pixels directly. Instead, it first forms a target semantic representation, which is equivalent to drawing a "semantic sketch" for the subsequent generation process.

The second half is like post-production: the DiT-based renderer is responsible for converting the semantic plan into continuous video frames. For video editing tasks, it also incorporates the VAE features of the source video, trying to preserve the details and non-edited regions of the original video as much as possible, so that a single change does not throw off the entire clip. [ref_1]
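The division of labor described above can be sketched as a two-stage pipeline. This is a toy illustration under stated assumptions, not Bernini's code or API: `EditRequest`, `SemanticPlan`, `toy_planner`, `toy_source_encoder`, `toy_renderer`, and `run_pipeline` are hypothetical names, and the "plan" here is a plain string rather than a learned semantic representation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EditRequest:
    instruction: str
    source_video: Optional[str] = None  # set only for editing tasks
    reference_images: List[str] = field(default_factory=list)
    reference_videos: List[str] = field(default_factory=list)


@dataclass
class SemanticPlan:
    # Stand-in for the planner's target semantic representation.
    # A real planner would emit learned features, not a text summary.
    summary: str
    num_references: int


def toy_planner(request: EditRequest) -> SemanticPlan:
    """Stage 1 (the "director"): interpret all inputs, draw no pixels."""
    refs = len(request.reference_images) + len(request.reference_videos)
    return SemanticPlan(summary=request.instruction.strip().lower(), num_references=refs)


def toy_source_encoder(path: str) -> List[float]:
    """Placeholder for VAE-encoding the source video into latents."""
    return [float(len(path))]


def toy_renderer(plan: SemanticPlan, source_latents: Optional[List[float]]) -> str:
    """Stage 2 ("post-production"): turn the plan into output (here, a description)."""
    mode = "edit" if source_latents is not None else "generate"
    return f"{mode}: {plan.summary} (refs={plan.num_references})"


def run_pipeline(
    request: EditRequest,
    planner: Callable[[EditRequest], SemanticPlan] = toy_planner,
    encoder: Callable[[str], List[float]] = toy_source_encoder,
    renderer: Callable[[SemanticPlan, Optional[List[float]]], str] = toy_renderer,
) -> str:
    plan = planner(request)
    # Only editing tasks carry a source video whose details must be preserved.
    latents = encoder(request.source_video) if request.source_video else None
    return renderer(plan, latents)


print(run_pipeline(EditRequest("Make it snow", source_video="city.mp4",
                               reference_images=["snow.png"])))
```

The point of the structure is that the renderer never sees the raw instruction. It receives only the plan plus, for edits, the encoded source, which mirrors the "understand first, then render" split the article describes.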

This division of labor may seem simple, but it is crucial.

In the past, many video generation models behaved more like "start drawing as soon as you see a prompt". If the prompt was not precise enough, the model would improvise freely; if the user only wanted to modify one part, the model might regenerate the entire video. Bernini adds a layer of "understanding and planning" in the middle, clarifying the creative intent before the rendering stage begins.

In other words, the problem it sets out to solve is not whether a video can be generated, but whether it can be generated reliably according to what people intend.

Controllable editing: from weather and materials to actions and camera focus

In the published examples, Bernini covers multiple types of video editing tasks, including weather changes, style transfer, material replacement, subject action adjustment, and focus and perspective control. [ref_1]

These abilities have direct value when applied in creative settings.

Weather editing, for example, is not just about adding a snow filter to a sunny scene. The sky, roads, buildings, and lighting have to change together, so that the footage looks as if the weather really changed.

Material replacement is not just about "applying a layer of texture" to an object. Materials such as fabric, metal, and marble have to follow the object's movement and stay stable, without shifting or drifting after a few frames.

Action editing is even more difficult. Once the subject moves, the model needs to change the action while maintaining the subject's identity, body shape, environment, and relationship to the camera. If the background shakes or the subject deforms once the action is changed, the creator can hardly use the result directly.

This is also a hurdle that AI video must clear to move from "demo effects" to "production tools": users do not just need one stunning sample, they need something they can modify, reuse, and deliver.

Reference materials will become increasingly important

It is difficult to describe complex visual requirements with just one prompt.

Advertisements need to feature specific products, short dramas need consistent characters, film and television previsualization needs to match scenes, and the artistic style may itself come from a reference image. For creators, the most natural way to express this is often not to write a long piece of text, but to show the model directly: this material, this character, this composition, this visual atmosphere.

Bernini supports images and videos as reference inputs, which is very practical. According to public information, it can be used for reference-based subject addition, material reference, style reference, and image/video implantation. It also supports generating new videos from reference images, including single-image reference, multi-element combined reference, multi-angle reference, and keyframe-to-continuous-shot scenarios. [ref_1]

Behind this is a larger trend: AI video creation is moving from "text-to-video" to "multimodal, controllable creation".

The truly common workflow in the future may not be a creator typing one sentence and waiting for the result, but rather:

Provide an original video;

Provide several brand, character, or material references;

Explain in natural language where to make changes;

The model modifies only the targeted region and keeps the whole piece consistent.

This is more like an AI version of post-production software than a "blind box" video generator.

A technical detail: why are multiple references easy to "mix up"?

When the model simultaneously receives a source video, a target video, reference images, and reference videos, it runs into a very practical problem: different inputs may occupy similar temporal and spatial coordinates.

If not distinguished, the model can easily mix "reference materials" and "videos that need to be edited" together. The result is: what should have been retained was not retained, what should have been migrated was migrated incorrectly, and even the reference image was treated as part of the target image.

Bernini introduced Segment Aware 3D Rotary Positional Embedding (SA-3D RoPE) to address this issue. Simply put, it means adding segment markers to different visual segments to let the model know which segment is the reference, which segment is the source video, and which segment is the target to be generated, while preserving the temporal and spatial positional relationships. [ref_1]
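To make the idea concrete, here is a minimal sketch of one plausible way to tag positions with a segment marker. It is not taken from Bernini's code. It assumes the segment is treated as a fourth axis alongside time, height, and width, and that each axis rotates an equal share of the feature pairs, in the style of standard multi-axis rotary embeddings. The function names and the segment ids 0 and 1 are made up for the example.

```python
import math
from typing import List, Tuple

Position = Tuple[int, int, int, int]  # (segment, t, h, w); the segment axis is an assumption


def segment_positions(segment_id: int, frames: int, height: int, width: int) -> List[Position]:
    """Give every token of one visual segment a (segment, t, h, w) position."""
    return [
        (segment_id, t, h, w)
        for t in range(frames)
        for h in range(height)
        for w in range(width)
    ]


def rotate_pair(x: float, y: float, angle: float) -> Tuple[float, float]:
    """Standard 2D rotation used by rotary embeddings."""
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, x * s + y * c


def apply_rope(vec: List[float], pos: Position, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of `vec`; each axis of `pos` owns an equal share of the pairs."""
    num_pairs = len(vec) // 2
    assert len(vec) % 2 == 0 and num_pairs % len(pos) == 0
    pairs_per_axis = num_pairs // len(pos)
    out: List[float] = []
    for i in range(num_pairs):
        axis = i // pairs_per_axis
        freq = base ** (-(i % pairs_per_axis) / pairs_per_axis)
        out.extend(rotate_pair(vec[2 * i], vec[2 * i + 1], pos[axis] * freq))
    return out


# Hypothetical segment ids: 0 = source video, 1 = target video.
source = segment_positions(0, frames=2, height=2, width=2)
target = segment_positions(1, frames=2, height=2, width=2)

# Same (t, h, w) grid, but no position is shared once the segment is included.
print([p[1:] for p in source] == [p[1:] for p in target])
print(set(source).isdisjoint(target))
```

In this sketch, a source token and a target token at the same time and location keep identical rotations on the t, h, and w axes, so their spatiotemporal correspondence is preserved, while the segment axis rotates them differently so they can still be told apart.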

This detail illustrates that controllable video editing is not simply a matter of "bigger models". It requires specialized design in data organization, spatiotemporal representation, and multimodal alignment.

What does it mean for the industry?

The significance of Bernini is not just that ByteDance has open-sourced a video model framework.

More importantly, it pushes the competitive focus of AI video one step forward: from "who can generate the most dazzling demos" to "who can better fit into the real creative process".

In a real workflow, users will inevitably make repeated changes. The client needs to change the product packaging, the director needs to adjust the camera focus, the brand needs a unified color tone, and post-production needs to implant materials accurately. What is most valuable here is not one-shot output, but controllability, interpretability, and iteration.

From this perspective, Bernini represents a direction:

AI video models must first understand the creative intent before executing visual generation, and they must be able to accept multiple reference materials and keep randomness to a minimum.

Of course, some caution is warranted. Currently available public information shows that Bernini-R was the first part to be released, corresponding to the second-stage model in the three-stage training process; the version that includes the complete MLLM Planner is still being prepared. [ref_1] This means Bernini is still one step away from making its full capability available.

But the direction is already clear: AI video will not stop at "type a sentence, get a clip". The core of the next stage is to turn video generation into a more reliable creative infrastructure.