Partnering with Shutterstock to accelerate our AI research efforts

Written by Alexandru Voica
Published on April 8, 2025

Today, we are announcing that Synthesia will leverage Shutterstock's extensive content library to research new ways of pre-training our EXPRESS-2 model, which will power a new generation of AI avatars later this year.

As part of the agreement, our R&D team will have access to Shutterstock's content library through a research license, with the option of extending it for commercial use. Our AI researchers will review the high-quality HD and 4K video data made available by Shutterstock and select relevant content that enhances our existing datasets, including cinematic videos that capture individuals actively engaged in professional tasks and interactions within workplace settings.
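
As an illustration of what this kind of selection step could look like, here is a minimal sketch of metadata-based filtering; every field name, tag and threshold in it is hypothetical rather than a description of our actual curation pipeline:

```python
# Hypothetical metadata-based filtering for selecting relevant clips from a
# licensed library. Field names, tags and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class ClipMetadata:
    clip_id: str
    resolution: str        # e.g. "HD" or "4K"
    duration_s: float
    keywords: set[str]     # tags supplied with the clip

WORKPLACE_TAGS = {"office", "presentation", "meeting", "professional"}

def is_relevant(clip: ClipMetadata) -> bool:
    """Keep high-resolution clips showing people in workplace settings."""
    return (
        clip.resolution in {"HD", "4K"}
        and clip.duration_s >= 5.0
        and bool(clip.keywords & WORKPLACE_TAGS)
    )

clips = [
    ClipMetadata("a1", "4K", 12.0, {"office", "presentation"}),
    ClipMetadata("b2", "SD", 30.0, {"office"}),
]
selected = [c for c in clips if is_relevant(c)]  # keeps only clip "a1"
```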

Building a large AI model involves a lot of research-focused experimentation before it can be put into production. Since last year, we’ve been testing various approaches and architectures for pre-training EXPRESS-2, our second foray into building a large video and audio model for our platform. 

During the pre-training phase, models require access to a wide variety of data so that they learn general aspects of the world. In our case, we need to show EXPRESS-2 enough human performances that it can reproduce natural, realistic behaviors, movements and expressions, such as delivering a script in the appropriate tone and with the correct facial expressions and body language.

To do so, we started a research project in 2023 during which we paid thousands of actors and studied their performances in our studios around the world. Thanks to this partnership with Shutterstock, we hope to try out new approaches that improve the performance of EXPRESS-2 and increase the realism and expressiveness of our AI-generated avatars, bringing them closer to human-like performances.

Below is a quick comparison between the current generation of Expressive Avatars powered by EXPRESS-1 in production and what could soon be possible thanks to EXPRESS-2.

While these avatars are still research iterations (and therefore not representative of the final commercial product), we're very excited about what we've been able to achieve so far.


Once large models are pre-trained, they usually require an element of fine-tuning so they can perform specific tasks. Recently, there have been some misconceptions and unhelpful generalizations about the outputs of large models in relation to how they’re built, so it’s worth stating a few facts about the relationship between our models and the software platform we operate.

First of all, Synthesia is not a research lab developing foundational models; we're an applied AI company offering businesses access to a software platform. We've chosen to create an R&D team and pursue fundamental research into video and audio models because we want to be the best in the world at replicating humans sharing information in a business context.

Therefore, we do not sell direct access to our models; instead, our models are wrapped in layers of software, presented to our users as a graphical interface that resembles the experience of building a PowerPoint presentation. As a result, the outputs of our platform are more deterministic than those of a foundational model because our users don't write prompts like they would with a chatbot or a general-purpose video generator. Instead, they provide a script which is then performed by an AI avatar with the appropriate tone, facial expressions and body language generated with the help of a model, as seen in the video below.
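
To make the contrast concrete, here is a hedged sketch of what a structured, script-driven request could look like compared to free-form prompting; the class and function names are invented for this example and are not our real API:

```python
# Illustrative contrast between script-driven generation and free-form
# prompting. These class and function names are hypothetical, not a real API.
from dataclasses import dataclass

@dataclass
class SceneRequest:
    script: str            # the exact words the avatar will speak
    avatar_id: str         # a pre-approved avatar, not an arbitrary subject
    tone: str = "neutral"  # high-level direction; the model fills in detail

def render_scene(request: SceneRequest) -> str:
    """Hypothetical software layer: the model never receives a raw user
    prompt, only a validated script plus structured settings, which makes
    outputs far more predictable than open-ended text-to-video prompting."""
    # ... validation, moderation and model inference would happen here ...
    return f"render job queued for {request.avatar_id} ({len(request.script)} chars)"

job = render_scene(SceneRequest(
    script="Welcome to this quarter's security awareness training.",
    avatar_id="avatar-042",
))
```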

We also fine-tune our models to provide a specific feature to our users: an avatar with their likeness. To do this, users provide a short video of themselves, which is then fed to the model to create an AI avatar based on their likeness. This step is done only with the explicit consent of the individual involved, using a biometric-based verification flow.
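
Conceptually, that consent check gates the fine-tuning step, roughly along the lines of the sketch below; both helper functions are placeholder stubs rather than our actual verification logic, which is not shown here:

```python
# Hedged sketch of a consent-gated likeness flow. Both helper functions are
# placeholder stubs; the real biometric verification is not public.
def verify_consent(enrollment_video: bytes) -> bool:
    """Stub for a biometric verification step: a real system might match the
    person in the footage against a live, spoken consent statement."""
    return len(enrollment_video) > 0  # stand-in check so the sketch runs

def fine_tune_avatar_model(video: bytes, owner: str) -> str:
    """Stub for the fine-tuning job that produces a personal avatar."""
    return f"avatar-for-{owner}"

def create_personal_avatar(enrollment_video: bytes, user_id: str) -> str:
    if not verify_consent(enrollment_video):
        raise PermissionError("Consent verification failed; no avatar created.")
    # Only after explicit, verified consent is the footage used for fine-tuning.
    return fine_tune_avatar_model(enrollment_video, owner=user_id)
```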

In other words, users don't come to our platform to create deepfakes of celebrities or 30-second cinematic shots for entertainment purposes; instead, they use Synthesia to create a digital twin of themselves, which is then placed in an engaging and dynamic corporate video about cybersecurity best practices, calculating water bills or ways to communicate better at work.

Finally, we add a third layer of control to the outputs of our models by applying content moderation before videos are generated. Once a user submits a video draft to our platform for generation, their content passes through a verification process based on a mix of automated and human moderation to ensure compliance with our content policies. 
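
In rough pseudocode, such a layered check could look like the following sketch, in which automated screening runs first and ambiguous drafts are escalated to a human moderator; the verdicts, rules and thresholds are invented for illustration and do not reflect our actual policy engine:

```python
# Invented illustration of layered, pre-generation moderation: automated
# screening first, ambiguous drafts escalated to a human, violations blocked.
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"  # route to a human moderator
    BLOCK = "block"

BANNED_TERMS = {"example-banned-term"}  # placeholder policy list

def automated_check(script: str) -> Verdict:
    lowered = script.lower()
    if any(term in lowered for term in BANNED_TERMS):
        return Verdict.BLOCK
    # A real system would also run classifiers; low-confidence scores would
    # return Verdict.ESCALATE rather than auto-approving.
    return Verdict.ALLOW

def human_review(script: str) -> Verdict:
    """Stub: in production a trained moderator makes this call."""
    return Verdict.BLOCK  # conservative default for the sketch

def moderate(script: str) -> bool:
    """Return True only if the draft may proceed to video generation."""
    verdict = automated_check(script)
    if verdict is Verdict.ESCALATE:
        verdict = human_review(script)
    return verdict is Verdict.ALLOW
```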

We are very clear about the types of content that cannot be created on our platform, and our detailed terms are made available to users when they sign up. In a recent red-team test run by NIST in partnership with Humane Intelligence, our moderation systems performed significantly better at preventing the generation of harmful content than the guardrails available with foundational models.

The new EXPRESS-2 model will represent a leap forward in AI-driven video and audio synthesis. With the enriched training data from Shutterstock, we hope to accelerate our research and produce more capable models that deliver performances with nuanced expressions and gestures that closely mimic human behavior. 

These advancements will in turn empower businesses to create more engaging and authentic corporate videos, effectively conveying their narratives and connecting with their audiences on a deeper level.

Stay tuned for more updates and demonstrations as we prepare to launch the EXPRESS-2 model later this year!

About the author


Alexandru Voica is the Head of Corporate Affairs and Policy at Synthesia.
