Source: Metaverse Daily Explosion
Although not yet open for public testing, OpenAI has stunned the tech industry, the internet, and social media with the "trailer" produced by its text-to-video model, Sora.
According to the official videos OpenAI released, Sora can generate one-minute videos of complex scenes from a user's text prompt. Not only are the visual details realistic, the model can also simulate camera movement.
Judging from the released clips, the industry is excited about Sora's ability to understand the real world. Compared with other large text-to-video models, Sora shows advantages in semantic understanding, image quality, visual coherence, and duration.
OpenAI goes so far as to call it a "world simulator," announcing that it can simulate the characteristics of people, animals, and environments in the physical world. But the company also admits that Sora is not perfect: its understanding is still flawed and there are potential safety issues.
For now, Sora is open for testing to only a very small group of people. OpenAI has not announced when it will be available to the public, but the shock it has caused is enough for companies developing similar models to see the gap.
01 Sora "trailer" shocked everyone
As soon as OpenAI's text-to-video model Sora was unveiled, "shocked" reactions spread across China.
Self-media accounts exclaimed that "reality no longer exists," and internet executives likewise praised Sora's capabilities. Zhou Hongyi, founder of 360, said that Sora's debut could shorten the timeline for AGI from ten years to about two. Within days, Sora's Google search volume climbed rapidly, approaching the popularity of ChatGPT.
Sora's popularity stems from the 48 videos OpenAI released, the longest of which runs one minute. That not only breaks the duration limits of earlier text-to-video models such as Runway's Gen-2, but the footage is also crisp and even shows a grasp of cinematic language.
In the one-minute video, a woman in a red dress walks down a neon-lit street. The style is photorealistic and the motion is smooth. Most striking is the close-up of her face: pores, spots, and acne marks are all simulated, an effect comparable to turning off the beauty filter in a livestream. Even the lines on her neck "give away" her age and are perfectly consistent with her face.
Beyond realistic characters, Sora can also simulate lifelike animals and environments. One video shows multi-angle close-ups of a Victoria crowned pigeon: the ultra-high-definition footage captures the blue feathers from the bird's body to its crest, and even the movement of its red eyes and its breathing rhythm, making it hard to tell whether the clip was generated by AI or shot by a person.
For stylized, non-realistic animation, Sora's output approaches the look of Disney animated films, leaving netizens worried about animators' jobs.
Sora's improvements over earlier text-to-video models lie not only in duration and image quality, but also in simulating camera movement and shooting trajectories, first-person game perspectives, aerial views, and even continuous "one-take" shots.
After watching the impressive videos OpenAI released, it is easy to understand why the internet and social media are shocked by Sora, and these are only trailers.
02 OpenAI proposes a "visual patch" data representation
So how does Sora achieve these simulation capabilities?
According to the Sora technical report released by OpenAI, the model goes beyond the limitations of previous visual generation models.
Previous research on generating visual content from text has used various methods, including recurrent networks, generative adversarial networks (GANs), autoregressive Transformers, and diffusion models, but they share a focus on narrower categories of visual data, shorter videos, or videos of fixed size.
Sora adopts a Transformer-based diffusion model. Generation with a diffusion model can be divided into two stages, a forward process and a reverse process, and Sora can also extend a video forward or backward along the timeline.
The forward process simulates diffusion from a real image to pure noise: the model gradually adds noise until the image is completely noisy. The reverse process is the inverse: the model gradually recovers the original image from the noisy one. Going back and forth between the real and the noisy in this way is how OpenAI lets Sora learn how visual content is formed.
The process from full noise to clear images
Of course, this requires repeated training: the model learns how to progressively remove noise and restore image detail. By iterating these two stages, Sora's diffusion model can generate high-quality images. Diffusion models of this kind have performed strongly in image generation, image editing, super-resolution, and other areas.
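As a rough illustration of the idea (OpenAI has not released Sora's code, so none of the names below come from it), the sketch shows a standard DDPM-style forward noising step and one reverse denoising step under an assumed linear noise schedule:

```python
import numpy as np

# Assumed linear noise schedule: beta_t grows from small to large over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # cumulative product, shape (T,)

def forward_noise(x0, t, rng=np.random.default_rng()):
    """Forward process: jump from a clean image x0 straight to noise level t.
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    """
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps                        # eps is what the model learns to predict

def reverse_step(xt, t, predicted_eps, rng=np.random.default_rng()):
    """One reverse step: use the model's noise prediction to move x_t toward x_{t-1}."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * predicted_eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
```

During training, the model sees the noised image x_t and is asked to predict the noise that was added; at generation time the reverse step is applied repeatedly, starting from pure noise and ending at a clean image.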
This process explains how Sora achieves high definition and fine detail. But going from static images to dynamic video requires still more data and training.
On top of the diffusion model, OpenAI converts all types of visual data, videos and images alike, into a unified representation for large-scale generative training of Sora. OpenAI calls this representation "visual patches": collections of small data units that play a role similar to text tokens in GPT.
The researchers first compress a video into a low-dimensional latent space, then decompose that representation into spacetime patches. This is a highly scalable form of representation: it makes the conversion from video to patches straightforward and suits training generative models on many kinds of videos and images.
Convert visual data into patches
To train Sora with less data and computation, OpenAI built a video compression network that first reduces a raw pixel-space video to a low-dimensional latent space and then generates patches from the compressed video, cutting down the input size and the computational load. OpenAI also trained a corresponding decoder model that maps the compressed representation back to pixel space. The patching step is sketched below.
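A minimal sketch of the patching idea, using a toy latent video tensor; the patch sizes and shapes are assumptions for illustration, not Sora's actual configuration, and the compression network itself is not shown:

```python
import numpy as np

def patchify(latent, pt=2, ph=4, pw=4):
    """Split a latent video of shape (T, H, W, C) into spacetime patches.
    Each patch covers pt frames and a ph x pw spatial block, flattened to a vector,
    analogous to a token in GPT's text sequence."""
    T, H, W, C = latent.shape
    latent = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    latent = latent.transpose(0, 2, 4, 1, 3, 5, 6)      # group patch dims together
    return latent.reshape(-1, pt * ph * pw * C)          # (num_patches, patch_dim)

# Toy latent video: 16 frames of a 32x32 latent grid with 8 channels.
latent_video = np.random.randn(16, 32, 32, 8)
patches = patchify(latent_video)
print(patches.shape)   # (512, 256)
```

The Transformer then operates on this sequence of patch tokens regardless of the original resolution, duration, or aspect ratio.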
With the visual-patch representation, researchers can train Sora on videos and images of different resolutions, durations, and aspect ratios. At inference time, Sora can control the size of the generated video by arranging randomly initialized patches in a grid of the appropriate size, as the sketch below suggests.
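Continuing the toy setup above, the snippet illustrates how the chosen grid of initial noise patches fixes the duration and resolution of the output; again, the sizes are hypothetical, not Sora's real parameters:

```python
import numpy as np

# Pick a desired latent video shape (frames, height, width) and allocate the
# matching grid of noise patches that the diffusion Transformer will denoise.
out_frames, out_h, out_w = 40, 36, 64
pt, ph, pw, channels = 2, 4, 4, 8

grid = (out_frames // pt, out_h // ph, out_w // pw)
num_patches = grid[0] * grid[1] * grid[2]
noise_patches = np.random.randn(num_patches, pt * ph * pw * channels)

# Denoising this sequence, then un-patchifying and decoding it, yields a video
# of exactly the requested size and length.
print(grid, noise_patches.shape)   # (20, 9, 16) (2880, 256)
```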
OpenAI reports that, trained at scale, the video model shows exciting capabilities: Sora can convincingly simulate people, animals, and environments in the real world, generate high-fidelity video, and maintain 3D and temporal consistency, approaching a true simulation of the physical world.
03 Altman takes prompts from netizens for testing
From its results to its R&D process, Sora shows powerful capabilities, but ordinary users cannot yet try it for themselves. For now, they can only write prompts and @ OpenAI founder Sam Altman on X; Altman runs the prompts through Sora and posts the resulting videos for the public to judge.
This also makes people wonder whether Sora is really as impressive as OpenAI's official showcase suggests.
On this point, OpenAI has stated plainly that the current model still has problems. Like early GPT, today's Sora also "hallucinates," and in video the errors take a much more concrete, visual form.
For example, it cannot accurately simulate many basic physical interactions, such as the relationship between a treadmill belt and a runner's motion, or the temporal logic of a glass breaking and liquid spilling out.
In the clip "archaeologists excavate a plastic chair," the plastic chair simply "floats" up out of the sand.
There are also wolf cubs that appear out of thin air, which netizens jokingly call "wolf mitosis."
Sometimes it can’t tell the difference between front, back, left and right.
These flaws in motion suggest that Sora still needs more training in, and understanding of, the logic of movement in the physical world. Moreover, compared with the risks posed by ChatGPT, the ethical and safety risks of Sora, which delivers an immediate visual experience, are even greater.
The text-to-image model Midjourney has already taught people that "a picture is not necessarily the truth": AI-generated images that look real have begun to feed rumors. Dr. Newell, chief scientific officer of identity-verification company iProov, said that Sora makes it "easier for malicious actors to generate high-quality fake videos."
It is easy to imagine what would happen if Sora-generated videos were maliciously abused for fraud, defamation, or spreading violence and pornography; the consequences would be incalculable. This is why Sora leaves people both shocked and scared.
OpenAI has clearly considered the safety issues Sora may bring, which is probably why it is open only to a small number of invited testers. As for when it will open to the public, OpenAI has given no timetable, and judging from the official videos released, other companies do not have much time left to catch up with Sora.