Behind the Scenes: How Synthesia’s AI Avatars Are Made



Ever wondered how Synthesia’s AI avatars come to life?
In a Synthesia Behind the Scenes session, we sat down with Tosin Oshinyemi (Lead Avatar Producer) and Josh Baker-Mendoza (Technical Supervisor) to uncover the creative and technical magic behind avatar production.
From actor selection and performance coaching to AI rendering and motion tracking, we dove deep into the process that powers Synthesia’s cutting-edge avatars.
The art and science behind AI avatars
AI avatars may be driven by technology, but as Tosin and Josh emphasized, they remain deeply human at their core. Every detail—expressions, gestures, and nuances—comes from real performances, captured through meticulous production techniques.
“We’re making tools that help people connect, teach, and inspire in ways they never could before.” — Tosin Oshinyemi
Casting the right talent: ensuring diversity and versatility
Creating an avatar library that serves a broad range of use cases requires careful actor selection. Tosin explained how Synthesia ensures diversity and usability by considering:
- Distinct yet versatile appearances – Avatars must reflect a variety of backgrounds, ages, and styles to resonate with different audiences.
- Performance range – Actors are chosen based on their ability to express natural emotions and gestures that can suit multiple content types.
- Industry relevance – Some avatars are designed specifically for corporate training, while others are tailored for marketing, education, or healthcare content.
- User feedback – The team regularly assesses requests from customers and adjusts casting decisions accordingly.
Synthesia’s approach involves a mix of agency partnerships, open casting calls, and street casting to ensure a well-rounded library of avatars. Sometimes, actors are scouted based on their unique ability to bring personality and warmth to an AI-driven experience.
Filming and studio setup: the technology behind the avatars
Once actors are selected, they enter a carefully controlled filming environment to capture the footage that becomes their AI avatar. Josh, broadcasting live from Synthesia’s London studio, walked us through the technical setup that makes these avatars possible:
- Lighting: 3-point lighting with soft diffusion creates balanced visuals and reduces harsh shadows.
- Camera setup: high-resolution 4K RAW capture preserves fine detail, ensuring that facial expressions translate accurately.
- Backgrounds: For Express-1, footage must be recorded against a green or blue screen (green is preferred). For Personal Avatars, a minimalist background maximizes an avatar's range of use cases, though the best choice depends on the avatar's main intended purpose; highly detailed, dense backgrounds don't play well with the technology.
Josh emphasized that the studio setup isn’t just about aesthetics—it directly affects the realism and adaptability of the final avatars. The goal is to produce avatars that fit seamlessly into any digital environment, whether it’s a corporate training video, a marketing campaign, or an educational module.
Performance best practices for AI avatar creation
Tosin shared some key insights into how to achieve the best performance when filming footage for AI avatars:
- Speak naturally, not mechanically – The most compelling avatars feel like real people, not robotic readers.
- Engage with an imagined audience – Instead of just reading from a script, actors should picture speaking to a real person.
- Use microexpressions and body language – Even subtle nods, head tilts, and natural facial movements enhance realism.
- Avoid exaggerated movements – While expression is essential, overly large gestures can appear unnatural in AI avatars.
- Do multiple takes – The best performances often emerge after a few practice rounds.
“Your avatar should feel like you, not a stiff, robotic version of you. Be expressive but stay natural.” — Tosin Oshinyemi
Bringing avatars to life with AI
Once the footage is captured, Synthesia’s AI technology analyzes facial expressions, movements, and speech patterns, mapping them onto digital avatars. Josh highlighted the critical role of optical flow algorithms and speech-to-expression mapping, which allow avatars to maintain fluid, lifelike animations.
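Synthesia's actual optical-flow pipeline isn't public, but the core idea — estimating how pixels move between consecutive frames — can be illustrated with a toy block-matching motion estimator. Everything below (the frame size, search radius, and sum-of-squared-differences cost) is an illustrative assumption, not Synthesia's implementation:

```python
def block_match(frame_a, frame_b, max_shift=3):
    """Toy optical-flow-style motion estimate: find the (dy, dx) shift
    that best aligns frame_b with frame_a, minimizing the mean squared
    difference over the overlapping region."""
    h, w = len(frame_a), len(frame_a[0])
    best, best_shift = float("inf"), (0, 0)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            err, count = 0.0, 0
            for y in range(h):
                for x in range(w):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        err += (frame_a[y][x] - frame_b[yy][xx]) ** 2
                        count += 1
            if count and err / count < best:
                best, best_shift = err / count, (dy, dx)
    return best_shift

# Two 10x10 "frames": a single bright pixel that moves 2 pixels right.
frame_a = [[1 if (y, x) == (4, 3) else 0 for x in range(10)] for y in range(10)]
frame_b = [[1 if (y, x) == (4, 5) else 0 for x in range(10)] for y in range(10)]
motion = block_match(frame_a, frame_b)  # recovers the shift (0, 2)
```

Real systems use dense per-pixel flow rather than a single global shift, but the principle — matching image content across frames to recover motion — is the same.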
Josh also detailed the technical infrastructure behind avatar creation:
- 2D-Based Capture: While volumetric (4D) capture exists, Synthesia’s avatars rely on high-resolution single-camera, 2D-based capture, making production scalable, efficient, and accessible.
- Speech-to-Expression Mapping: AI interprets speech input and generates subtle microexpressions to enhance realism.
- Intentional Lighting: The studio setup includes soft light diffusion, minimizing harsh shadows and ensuring avatars integrate seamlessly into various backgrounds.
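Speech-to-expression mapping is proprietary, but the general shape of the idea — deriving smoothly animated facial parameters from audio features — can be sketched in toy form. The frame rate, the energy-to-"jaw open" mapping, and the smoothing constant below are all illustrative assumptions, not Synthesia's method:

```python
import math

def speech_to_mouth_curve(audio, sr=16000, fps=25, smoothing=0.6):
    """Toy speech-to-expression mapping: per-video-frame RMS energy of
    the audio, normalized to [0, 1] and smoothed with an exponential
    moving average so the 'jaw open' value animates fluidly."""
    spf = sr // fps                       # audio samples per video frame
    n_frames = len(audio) // spf
    rms = []
    for i in range(n_frames):
        frame = audio[i * spf:(i + 1) * spf]
        rms.append(math.sqrt(sum(s * s for s in frame) / spf))
    peak = max(rms) or 1.0                # avoid divide-by-zero on silence
    curve, prev = [], 0.0
    for value in (r / peak for r in rms):
        prev = smoothing * prev + (1 - smoothing) * value
        curve.append(prev)
    return curve

# One second of fake 16 kHz audio: silence, then a 200 Hz "voiced" tone.
audio = [math.sin(2 * math.pi * 200 * (i / 16000)) if i >= 8000 else 0.0
         for i in range(16000)]
mouth = speech_to_mouth_curve(audio)  # stays at 0 during silence, rises with speech
```

Production systems map audio to many facial parameters at once, typically with learned models rather than hand-written rules; the smoothing step mirrors the article's point about keeping animation fluid rather than jittery.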
The future of AI avatars is also evolving rapidly. Josh teased upcoming advancements, including full-body motion tracking and adaptive AI-driven gestures, making avatars even more dynamic and responsive.
“It’s easy to get caught up in the tech, but at the end of the day, what we do is about people. AI is just a tool that lets us communicate better.” — Josh Baker-Mendoza
A passionate community at the heart of innovation
Feedback from Synthesia creators plays a critical role in shaping Synthesia’s future development. Every new feature and improvement is informed by real needs, ensuring that AI avatars continue to feel natural, engaging, and truly human.
During the live interview, members of Synthesia’s AI Video Creator Community were eager to share which avatars they use most frequently, highlighting how they match different avatars to specific content types—whether for well-being topics, leadership training, or technical instruction.
Others shared their enthusiasm for creating personal avatars, emphasizing how having a digital representation of themselves enhances engagement and personalization in workplace training and communication.
Key takeaways
- AI avatars start with real human performances: the tech only enhances what’s already there.
- Lighting and camera quality are crucial: a high-quality recording results in a more realistic avatar.
- Performance direction matters: the most engaging avatars feel natural, not robotic.
- Exciting updates are on the way: Josh hinted at Express-2 avatars, which will feature even more natural movement and speech synchronization.
About the author
Kevin Alster
Strategic Advisor
Kevin Alster heads up the learning team at Synthesia. He is focused on building Synthesia Academy and helping people figure out how to use generative AI videos in enterprise.