OpenAI's voice cloning AI model requires just a 15-second audio sample to function

Calvin D

OpenAI has introduced Voice Engine, a text-to-voice model that can generate a lifelike synthetic voice from just a 15-second audio clip of a person speaking. The model can then read text prompts aloud in that voice, either in the language of the original clip or in a number of other languages. Describing its limited early rollout of the technology, OpenAI wrote in a blog post, “These smaller-scale deployments serve as critical learning opportunities, shaping our safety protocols and strategies, and enhancing our understanding of the potential positive impacts of Voice Engine across different fields.”

A select group of businesses has been granted access to the technology: Age of Learning, an educational technology company; HeyGen, a visual storytelling platform; Dimagi, a developer of software for frontline health workers; Livox, a maker of AI-driven communication apps; and Lifespan, a healthcare organization.

OpenAI shared examples of how Age of Learning has used the technology to produce scripted voice-over content and to deliver “real-time, personalized responses” to students, with the responses themselves powered by GPT-4. The audio clips below showcase this application:

Reference audio in English:

AI-generated audio samples based on the reference:

OpenAI began developing Voice Engine in late 2022, and the model already powers the voices in its text-to-speech API and ChatGPT’s Read Aloud feature. In an interview with TechCrunch, OpenAI’s Jeff Harris said the model was trained on a mix of licensed and publicly available data. According to OpenAI, access to the model is currently limited to about 10 developers.

AI-generated audio, and voice generation in particular, is evolving steadily. Much of the attention so far has gone to instrumental or ambient sound, but voice synthesis is a growing area, with companies such as Podcastle and ElevenLabs already active in it. These developments come as the US government works to curb unethical uses of AI voice technology following incidents of AI voice impersonation. OpenAI requires its partners to follow strict ethical guidelines on voice generation, including obtaining consent from the people whose voices are used and disclosing to listeners that the voice is AI-generated. OpenAI has also proposed several measures to reduce the risks of such technology, including raising public awareness of AI deepfakes and developing effective systems for tracking AI-generated content.