The Top 3 Challenges of AI-Powered Voice Technology
The promise of artificial intelligence (AI) that can generate human-like data has been talked about for decades, yet data scientists have tackled the problem with little success. Pinpointing effective strategies for building such systems has presented challenges that run the gamut from technical to ethical and everywhere in between. Generative AI, however, has emerged as a bright spot to watch.
At its most basic, generative AI enables machines to use elements such as audio files, text, and images to produce new content, from speech to writing to artistry. As tech investor Sequoia Capital put it in a recent Tech Monitor interview, “Generative AI is well on the way to becoming not just faster and cheaper, but better in some cases than what humans create by hand.”
Generative speech-based machine learning, in particular, has made massive strides recently, but we still have a long way to go. In fact, voice compression, which happens in apps we rely on heavily like Zoom and Teams, is still based on technology from the eighties and nineties. While speech-to-speech technology has endless potential, it is vitally important to assess the challenges and shortcomings that create barriers to generative AI thriving.
Here are three common speed bumps AI practitioners face when it comes to speech-to-speech technologies.
1. Sound Quality
Arguably the most important quality of a good conversation is that it is understandable. For speech-to-speech technology, the goal is to sound human; the robotic tone of Siri and Alexa, for example, is machine-like and not always clear. There are a few reasons this is hard to achieve with AI, and the nuances of human language play a big part.
Mehrabian’s rule helps explain why. It breaks human communication down into three parts: 55% facial expression, 38% tone of voice, and a mere 7% words. Machine understanding has traditionally relied on the words, the content, to operate. Only recent strides in natural language processing (NLP) have made it possible to train AI models on factors like sentiment, emotion, timbre, and other important but not necessarily spoken aspects of language. And if you are working with audio alone, no visuals, the task gets even harder: more than half of the signal, the part carried by facial expression, is simply unavailable.
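To make the non-lexical part concrete, here is a minimal sketch of pulling paralinguistic cues such as pitch and energy out of a recording, the kind of signal beyond the words themselves that newer models learn from. It assumes the librosa library is installed and uses a hypothetical file name, utterance.wav.

```python
import librosa
import numpy as np

def prosodic_features(path: str) -> dict:
    """Extract simple non-lexical cues (pitch, energy) from an audio file."""
    y, sr = librosa.load(path, sr=16000)  # mono audio at 16 kHz

    # Fundamental-frequency (pitch) contour via probabilistic YIN;
    # f0 is NaN on unvoiced frames, so use nan-aware statistics below.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )

    # Short-term RMS energy as a rough loudness proxy
    rms = librosa.feature.rms(y=y)[0]

    return {
        "mean_pitch_hz": float(np.nanmean(f0)),
        "pitch_variability": float(np.nanstd(f0)),  # monotone vs. expressive
        "mean_energy": float(rms.mean()),
        "voiced_ratio": float(np.mean(voiced_flag)),
    }

print(prosodic_features("utterance.wav"))  # hypothetical local file
```

Features like these are what let a model distinguish a flat reading from an animated one, even when the transcript is identical.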
2. Latency
Analysis and synthesis take time, but with speech-to-speech communication, real time is the only time that matters. Voice conversion must happen as the speaking takes place, and it must be accurate, too, which you can imagine is no easy feat for a machine.
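To illustrate the constraint, here is a minimal sketch of a streaming loop that only keeps up if each audio frame is converted faster than the frame itself lasts. The convert_frame() function is a hypothetical stand-in for a real voice-conversion model, and the 20 ms frame size is an assumption, not a standard.

```python
import time
import numpy as np

SAMPLE_RATE = 16000
FRAME_MS = 20  # assumed streaming frame size
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

def convert_frame(frame: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for per-frame voice-conversion inference."""
    return frame  # identity pass-through for the sketch

def stream(frames) -> None:
    budget_s = FRAME_MS / 1000.0  # must finish before the next frame arrives
    for i, frame in enumerate(frames):
        start = time.perf_counter()
        _ = convert_frame(frame)
        elapsed = time.perf_counter() - start
        if elapsed > budget_s:
            print(f"frame {i}: {elapsed * 1000:.1f} ms > {FRAME_MS} ms budget")

# Simulate one second of silence split into 20 ms frames
audio = np.zeros(SAMPLE_RATE, dtype=np.float32)
stream(np.split(audio, len(audio) // FRAME_SAMPLES))
```

Anything the loop prints represents audio falling behind real time, which a listener hears as lag or dropouts.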
How strict the real-time requirement is varies by industry. A content creator producing a podcast, for example, may care more about sound quality than about real-time conversion. But in industries like customer service, time is of the essence: a call center agent using voice-assistive AI to respond to a caller may be able to sacrifice a little on quality, but speed is crucial to a positive experience.
3. Scale
For speech-to-speech technology to live up to its potential, it must support a wide range of accents, languages, and dialects and be usable by everyone, not just specific geographies or markets. That takes mastery of the specific application and a great deal of tuning and training to scale effectively.
Emerging tech solutions are not one-size-fits-all; adopters will need to support this AI infrastructure, with thousands of candidate architectures for a given solution, and should expect to test models consistently. This is nothing new: all the classical challenges of machine learning also apply to the generative AI space.
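As one concrete way to test consistently, here is a minimal sketch that scores the same model's output across several accent and language cohorts and flags any that fall below a quality bar. The cohort data and model outputs are invented placeholders, the 15% threshold is illustrative, and the only real dependency is the jiwer library's word-error-rate function.

```python
from jiwer import wer  # word error rate between reference and hypothesis

# (reference transcript, hypothetical model output) pairs per cohort;
# in practice the outputs would come from running the model on real audio.
COHORTS = {
    "en-US": [("turn the lights off", "turn the lights off")],
    "en-IN": [("turn the lights off", "turn the light off")],
    "es-MX": [("apaga las luces", "apaga la luces")],
}

MAX_WER = 0.15  # illustrative quality bar; tune per use case

for name, pairs in COHORTS.items():
    scores = [wer(ref, hyp) for ref, hyp in pairs]
    avg = sum(scores) / len(scores)
    status = "OK" if avg <= MAX_WER else "NEEDS WORK"
    print(f"{name}: WER={avg:.2%} [{status}]")
```

Running the same harness on every release is what surfaces a model that quietly regresses for one dialect while improving on another.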
So, how do we start tackling these issues and realizing the value of speech-to-speech technology? Fortunately, it is less daunting when you break it down step by step. First, master the problem. Earlier I gave the example of a call center versus a content creator: take the use case and the desired outcome into account, and go from there.
Second, ensure your organization has the right architecture and algorithms, and even before that, make sure you have the right data. Data quality matters, especially for something as sensitive as human language and speech. Lastly, if real-time voice conversion is necessary for your application, make sure that functionality is supported. Ultimately, no one wants to talk to a robot.
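On the data-quality point, here is a minimal sketch of the kind of automated gates worth running over a speech corpus before any training or tuning. It assumes the soundfile library, a hypothetical file clip.wav, and thresholds chosen purely for illustration.

```python
import numpy as np
import soundfile as sf

def audio_quality_report(path: str) -> dict:
    """Flag common corpus problems: low sample rate, clipping, dead air."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # collapse multichannel to mono

    duration_s = len(audio) / sr
    clipping_ratio = float(np.mean(np.abs(audio) >= 0.999))  # clipped samples
    silence_ratio = float(np.mean(np.abs(audio) < 1e-3))     # near-silence

    return {
        "sample_rate": sr,
        "duration_s": round(duration_s, 2),
        "clipping_ratio": clipping_ratio,
        "silence_ratio": silence_ratio,
        "usable": sr >= 16000 and duration_s >= 1.0
                  and clipping_ratio < 0.01 and silence_ratio < 0.5,
    }

print(audio_quality_report("clip.wav"))  # hypothetical recording
```

Cheap checks like these catch the files that would otherwise teach a voice model to sound clipped, muffled, or silent.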
While ethical concerns around generative AI deepfakes, consent, and proper disclosures are now coming to light, it is important to understand and address the basics first. Speech-to-speech technology has the potential to revolutionize the way we understand one another, opening opportunities for innovations that unite people. But to get there, we must first face these core challenges.