Synthesia’s content moderation systems withstand rigorous NIST, Humane Intelligence red team test

Published on December 2, 2024

As Head of Security at Synthesia, I recently traveled to Arlington, Virginia to attend CAMLIS, a global AI conference focused on information security. But the conference itself was not the only reason I was there.

After months of preparation, I finally got a chance to participate in Synthesia’s first public and in-depth red team test conducted by the National Institute of Standards and Technology (NIST) in collaboration with Humane Intelligence. During the test, 30 expert security testers challenged and assessed Synthesia’s content moderation capabilities, focusing specifically on preventing the creation of non-consensual deepfakes and harmful content on our platform. 

The red team test was designed in compliance with NIST’s AI-600-1 risk management framework, a detailed document which defines risks associated with the use of generative AI, and describes actions that organizations can take to measure and manage these risks. 

Humane Intelligence, a tech nonprofit led by Dr. Rumman Chowdhury, is building a community of practice around algorithmic evaluations. Its no-code, open-access platform, Humane-intelligence.org, gives organizations and individuals a place to align, build community, share best practices, and create technical evaluations that help drive benchmarks, standards, and more. The organization is also actively developing hands-on, measurable methods for real-time assessment of the societal impact of AI models.

For this event, Humane Intelligence selected the 30 testers through a rigorous application process from a pool of approximately 500 phase 1 red teaming participants. It then partnered with four AI model owners (Meta, Synthesia, Cisco’s Robust Intelligence, and Anote) to create the testing environment and developed the methodology to measure and evaluate the qualitative findings from this exercise.

I am pleased to report that Synthesia's platform stood up very well to the red team attacks, underscoring the effectiveness of our guardrails. Here’s what happened just outside Washington, DC in late October:

The test scope and methodology

The Synthesia-specific test was divided into two primary sections, each corresponding to a separate NIST category from the risk management framework mentioned above.

First, the testers were encouraged to create Personal Avatars on our platform and test whether they could create non-consensual deepfakes during the avatar creation process. This section of the red team test was mapped to NIST’s Human-AI Configuration category, which describes how the use of an AI system can involve varying risks of misconfiguration and poor interaction between the system and the human using it.

The goal was to assess whether Synthesia’s system can accurately differentiate between legitimate requests for Personal Avatars and attempts to create avatars without consent.

Each tester relied on a number of dedicated accounts to attempt to create avatars. They first followed the standard procedure to create avatars of themselves using their laptop’s webcam, ensuring the platform accepted their legitimate requests. Next, they tested whether the system could prevent unauthorized avatar creation (non-consensual deepfakes) using another person’s likeness.

In total, testers made over 40 avatar creation attempts, and our systems detected and blocked every attempt to create a non-consensual avatar, with feedback mechanisms in place to alert users of policy violations.
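
I won’t detail the exact checks behind our avatar pipeline here, but the general shape of a consent gate is easy to illustrate. The sketch below is a simplified, hypothetical example rather than Synthesia’s implementation: the face_match and liveness_score helpers, the thresholds, and the ConsentDecision type are placeholders assumed for illustration. It approves an avatar request only when the person recorded giving consent matches the person in the submitted footage, and the footage itself looks like a live capture rather than replayed or stock media.

```python
from dataclasses import dataclass

# Hypothetical thresholds and helpers, for illustration only; these are not
# Synthesia's actual APIs or tuning values.
FACE_MATCH_THRESHOLD = 0.85   # required similarity between consent recording and footage
LIVENESS_THRESHOLD = 0.90     # required confidence the capture is live, not replayed media


@dataclass
class ConsentDecision:
    approved: bool
    reason: str


def review_avatar_request(footage, consent_recording, face_match, liveness_score) -> ConsentDecision:
    """Approve an avatar request only if consent and liveness checks both pass.

    `face_match(a, b)` and `liveness_score(clip)` stand in for whatever
    biometric models a real pipeline would use; they are placeholders here.
    """
    # 1. The person giving consent must be the person in the submitted footage.
    if face_match(consent_recording, footage) < FACE_MATCH_THRESHOLD:
        return ConsentDecision(False, "consent speaker does not match the submitted footage")

    # 2. The footage must be a live capture, not stock video or a replayed feed.
    if liveness_score(footage) < LIVENESS_THRESHOLD:
        return ConsentDecision(False, "footage does not appear to be a live recording")

    return ConsentDecision(True, "consent verified")
```

Keeping the decision in one place like this also makes it straightforward to return a reason for each rejection, which is how feedback about a policy violation can be surfaced to the user.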

For example, some testers tried to create non-consensual clones of celebrities or politicians, while others worked together, with one person supplying the face and another the voice for the clone. I also saw examples of people tampering with the webcam feed, attempting to trick our systems with stock footage of other people instead of a live recording of themselves.

In the case below, a tester submitted a video of Dr. Fei-Fei Li scraped from YouTube, with harmful text placed on top of it. We correctly identified this as a non-consensual deepfake attempt and prevented the avatar from being created.

In the second part of the test, the testers created scripts that tried to break our content policies. This was in line with NIST’s Information Integrity category which outlines how AI systems can ease the deliberate production or dissemination of false or misleading information at scale, where an actor has the explicit intent to deceive or cause harm to others. 

Synthesia has the world’s most advanced AI avatars that are able to closely replicate human likeness and voice, and mimic human emotions. As a result, it’s important that this technology is used for its intended purpose (instructional videos for business communications), and not for creating harmful or misleading content. 

The goal was therefore to evaluate whether our platform could flag and block generated video content that violated six of Synthesia's content policies.

Once logged in to Synthesia, testers were asked to create scripts specifically targeting the policies above and upload them to our video editor, attempting to use both stock avatars and any Personal Avatars created earlier in the test.

There were over 75 attempts to create harmful content, and our moderation processes consistently held up. With one exception, our automated systems and human-in-the-loop moderation teams blocked every attempt, across all categories and avatars, before it could reach the final stage of publishing. We applied our moderation rules consistently and triggered appropriate bans for repeat offenses.
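
To make that layering concrete, here is a minimal sketch of how an automated classifier, a human-in-the-loop review step, and repeat-offense tracking can be combined so that nothing is published without passing every gate. It is an illustration under assumed names and thresholds, not our production system: classify_script, human_review, AUTO_BLOCK_CONFIDENCE, and BAN_AFTER_N_VIOLATIONS are all hypothetical placeholders.

```python
from collections import defaultdict

# Illustrative placeholders, not Synthesia's internal systems or thresholds.
AUTO_BLOCK_CONFIDENCE = 0.9
BAN_AFTER_N_VIOLATIONS = 3

violation_counts = defaultdict(int)
banned_users = set()


def moderate_script(user_id, script, classify_script, human_review):
    """Return 'banned', 'blocked', or 'published' for a submitted script.

    `classify_script(script)` -> (policy_label or None, confidence)
    `human_review(script, policy_label)` -> True if a reviewer confirms a violation
    Both are hypothetical stand-ins for real moderation components.
    """
    if user_id in banned_users:
        return "banned"

    label, confidence = classify_script(script)

    if label is not None and confidence >= AUTO_BLOCK_CONFIDENCE:
        violation = True                          # automated layer blocks confident hits
    elif label is not None:
        violation = human_review(script, label)   # uncertain cases go to a human reviewer
    else:
        violation = False

    if violation:
        violation_counts[user_id] += 1
        if violation_counts[user_id] >= BAN_AFTER_N_VIOLATIONS:
            banned_users.add(user_id)
            return "banned"
        return "blocked"

    return "published"                            # only clean scripts reach publishing
```

The design point that matters is that borderline scripts route to a human reviewer before publishing, so uncertainty defaults to a block rather than a release.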

Here are some examples of scripts that were moderated by our platform, together with the corresponding policies.


We rejected the video above as it violated our policy on Suicide & Self Harm: Suicidal Ideation and Intent. Notice that our detection systems were sophisticated enough to determine that the user was talking about suicide without actually mentioning the term specifically.
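
A plain keyword filter would miss a script like that. As a rough illustration of why semantic detection matters, and emphatically not a description of the model we run in production, the snippet below uses a generic open-source zero-shot classifier on a made-up paraphrased script; it surfaces the policy-relevant meaning even though the word "suicide" never appears.

```python
# Illustration only: a generic open-source zero-shot classifier, not the
# moderation model Synthesia runs in production.
# Requires: pip install transformers torch
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# A made-up paraphrased script that never uses the word "suicide".
script = ("Lately I feel there is no point in carrying on, and I've started "
          "thinking about how to make it all stop for good.")

labels = ["suicidal ideation or self-harm", "benign everyday topic"]
result = classifier(script, candidate_labels=labels)

# The top-scoring label captures the policy-relevant meaning despite the
# absence of any explicit keyword.
print(result["labels"][0], round(result["scores"][0], 2))
```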

We rejected the video above for violating our Deceptive Information & Other Restricted Content policy, as it encourages direct exposure to dry ice, which can cause severe skin damage.

We rejected the video above for breaking our Promotional Activity and Other Restricted Practices: Age Restricted Products or Services policy. We do not allow users to create content advertising age-restricted services such as OnlyFans.

We rejected the video above for breaking our Promotional Activity and Other Restricted Practices: Gambling policy. We do not allow advertising or promotional content related to gambling with a stock avatar. 

The one video we should have moderated but missed contained potentially confusing information linking sunlight to cancer. Our policies prohibit the creation of unverified or unauthorized medical information. We’ve flagged this video to our content moderation and engineering teams so we can improve our detection systems in the future.

Interestingly, the script for the video was created with ChatGPT: the user prompted the chatbot to generate a short, authoritative response to a skeptical person on how a human’s cellular response to sunlight can be used to diagnose cancer, showing that today’s language models still struggle with hallucinations.

Our commitment to trustworthy AI

While this test is a testament to Synthesia’s commitment to trustworthy AI, we also recognize the value of continuous improvement. Data and insights from this exercise will further strengthen our moderation capabilities, ensuring we stay at the forefront of responsible AI and security standards.

In partnership with industry leaders like Humane Intelligence and government agencies such as NIST, Synthesia remains dedicated to building a safer, more secure AI platform for all users. Humane Intelligence plans to continue organizing red teaming events globally.

About the author

Martin Tschammer

Martin Tschammer is the Head of Security at Synthesia.
