Tencent's "Sora" Enters the AI Video Arena

At the beginning of this year, the launch of the "text-to-video" model Sora sparked a wave of global competition in AI-generated video. Nearly ten months later, Sora has not yet opened its doors to the public, while Tencent's Hunyuan, a newcomer in the field, has swiftly joined the fray.

On December 3rd, Tencent's Hunyuan model officially unveiled its video generation capabilities. Individual users can apply for a trial through the Tencent Yuanbao app, while corporate clients can access the service through Tencent Cloud, where applications for the API beta test are now open.

Bringing text-to-video generation to the forefront is yet another milestone for the Hunyuan model, following its advancements in text generation, image generation, and 3D rendering. In a notable move, Tencent has also open-sourced the video generation model, which boasts 13 billion parameters, making it the largest open-source video generation model currently available.

As reported by the Wall Street Journal, the entry barrier for using Tencent's video generation capabilities is almost nonexistent: users simply input a text description, and the model generates a five-second video.

This five-second length stands in stark contrast to Sora's minute-long video generation and to some Sora-like products that generate videos of approximately ten seconds.

During a recent media briefing, a technical leader from Tencent's multimodal generation team revealed that the video length is not a technical issue but rather a matter of computational power and data availability. Doubling the video length requires exponentially more computational power, making it less feasible. Therefore, the first version of Hunyuan video generation is limited to five seconds, designed to meet the primary needs of users. "If there's strong demand for longer, uninterrupted shots in the future, we will upgrade," he explained.
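As a rough illustration of why longer clips become disproportionately expensive (assuming full attention over all spatiotemporal tokens, as in the DiT architecture described below, and using generic symbols rather than Tencent's published figures), the token count grows with clip duration while the attention cost grows with the square of the token count:

```latex
% Illustrative cost model only, not Tencent's published numbers.
% A clip of duration T at spatial resolution H x W yields roughly
N_{\text{tokens}} \propto T \cdot H \cdot W
% tokens, and full spatio-temporal self-attention costs on the order of
\mathrm{FLOPs}_{\text{attn}} \propto N_{\text{tokens}}^{2},
% so doubling T at fixed resolution roughly quadruples the attention cost
% of every training and sampling step, before counting the extra long-video
% training data that would also be required.
```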

The current iteration of Tencent Hunyuan video generation highlights four main characteristics: realistic quality, semantic adherence, dynamic fluidity, and native transitions.

Technologically, Tencent's video generation model adopts a DiT architecture similar to that of Sora but incorporates several crucial upgrades. These include the integration of a multimodal large language model as a text encoder, a self-developed Scaling Law-based full-attention DiT, and a self-developed 3D Variational Autoencoder (VAE).

The leader noted that Tencent Hunyuan stands out as one of the few video generation models employing a multimodal large language model as a text encoder, whereas the industry commonly relies on T5 and CLIP models for this purpose. This strategic choice stems from three significant benefits of the approach: enhanced comprehension of complex text, inherent alignment of images and text, and support for system prompts.
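To make the pipeline concrete, here is a minimal PyTorch sketch of how these three pieces (a multimodal LLM text encoder, a full-attention DiT, and a 3D VAE) could fit together in a latent-diffusion layout. All module names, sizes, and shapes are illustrative placeholders of my own, not Tencent Hunyuan's released code.

```python
import torch
import torch.nn as nn


class TextEncoderMLLM(nn.Module):
    """Stand-in for a multimodal LLM used as the text encoder."""
    def __init__(self, vocab=32000, dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):                    # (B, L) token ids
        return self.encoder(self.embed(token_ids))   # (B, L, dim) text features


class VAE3D(nn.Module):
    """Stand-in for a 3D VAE that compresses video into spatio-temporal latents."""
    def __init__(self, latent_ch=8):
        super().__init__()
        self.enc = nn.Conv3d(3, latent_ch, kernel_size=4, stride=4)          # downsample T, H, W by 4x
        self.dec = nn.ConvTranspose3d(latent_ch, 3, kernel_size=4, stride=4)

    def encode(self, video):                          # video: (B, 3, T, H, W)
        return self.enc(video)

    def decode(self, latents):
        return self.dec(latents)


class FullAttentionDiT(nn.Module):
    """Stand-in for a diffusion transformer that denoises video latents while
    attending jointly over all spatio-temporal tokens and the text tokens."""
    def __init__(self, latent_ch=8, dim=1024):
        super().__init__()
        self.proj_in = nn.Linear(latent_ch, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.proj_out = nn.Linear(dim, latent_ch)

    def forward(self, noisy_latents, text_feats):
        B, C, T, H, W = noisy_latents.shape
        video_tokens = self.proj_in(noisy_latents.flatten(2).transpose(1, 2))  # (B, T*H*W, dim)
        x = torch.cat([video_tokens, text_feats], dim=1)   # one sequence, full attention
        x = self.blocks(x)[:, : T * H * W]                 # keep only the video positions
        return self.proj_out(x).transpose(1, 2).reshape(B, C, T, H, W)


# One (untrained) denoising step, just to show how the pieces connect.
text_feats = TextEncoderMLLM()(torch.randint(0, 32000, (1, 16)))   # a 16-token prompt
vae, dit = VAE3D(), FullAttentionDiT()
latents = vae.encode(torch.randn(1, 3, 16, 64, 64))                # 16 frames at 64x64
denoised = dit(latents + torch.randn_like(latents), text_feats)    # add noise, predict clean latents
print(vae.decode(denoised).shape)                                  # torch.Size([1, 3, 16, 64, 64])
```

In a layout like this, swapping a stronger multimodal LLM in as the text encoder changes only the conditioning features the DiT attends to, which is where the claimed gains in complex-prompt comprehension and image-text alignment would show up.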

Furthermore, before undertaking the GPT project, OpenAI invested considerable effort into validating the effectiveness of the Scaling Law (which posits that training larger models with more data yields better performance) in language models.

However, this validation has yet to be publicly established in the domain of video generation, in either academia or industry.

Against this backdrop, the Tencent Hunyuan team conducted its own verification of the Scaling Law for image and video generation, ultimately concluding that image DiT supports it, and that video generation can also be trained effectively through a two-stage process built on top of an image DiT model.
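The Scaling Law referenced here is usually written as a power law relating validation loss to model size and data volume. A commonly cited form is the Chinchilla-style parameterization below, shown as a generic illustration rather than Tencent's fitted coefficients:

```latex
% Generic scaling-law form: L is validation loss, N is parameter count,
% D is the amount of training data (tokens or clips); E, A, B, alpha and beta
% are constants fitted empirically from smaller training runs.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Fitting such curves on smaller runs is what lets a team extrapolate how large a model its compute and data budget can justify, which is the kind of inference the 13-billion-parameter figure is attributed to here.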

"Therefore, our initial Tencent Hunyuan video generation model is built upon fairly rigorous inferences from the Scaling Law, resulting in a 13 billion parameter model," stated the technical leader of Tencent Hunyuan’s multimodal generation team.

At the same time, Tencent Hunyuan is actively exploring the ecosystem of video generation models, including models for generating videos from images, voiceovers for videos, and the creation of digital avatars from 2D photos.

The technical leader mentioned that in contrast to text-to-video, generating videos from images is progressing more swiftly, with the possibility of announcing new developments within the next month.

Since the AI large model boom ignited by ChatGPT two years ago, the technological pathway for large language models has converged, while video generation models remain in a phase of exploration.

Analysts at Orient Securities noted that under the technological direction set by OpenAI, the technical trajectory of language models has become predominantly centered on the GPT style. However, in the realm of multimodal technologies, no single company currently holds an absolutely leading position, leaving various exploratory paths open for companies to navigate.

The technical leader from Tencent Hunyuan reiterated that the overall state of text-to-video generation is still relatively immature, with a low success rate across the board.

As one of the most challenging avenues in multimodal generation, video generation demands substantial resources in terms of computational power and data, making it less advanced than text or image generation. This challenge is compounded by slow progress in commercialization and product development.

OpenAI has further announced delays to Sora's updates, attributing the setback to a shortage of computational resources, which has so far prevented the model from being opened to the public.

Despite these hurdles, the race to dominate the market has led to a flurry of developments in the video generation space since last November.

To date, numerous domestic and international large model vendors have launched products similar to Sora, including major players such as MiniMax, Zhiyu, ByteDance, Kuaishou, and Ai Shi Technology in China, as well as overseas firms like Runway, Pika, and Luma.
