What Is TTS and How Is It Implemented in Apps?
A guide to developing a text-to-speech converter.
Does the following routine sound familiar? In the morning, your voice assistant gives you today's weather forecast. Then, on your way to work, a navigation app gives you real-time traffic updates, and in the evening, a cooking app talks you through dinner preparation with audible steps.
In such a routine, machine-generated voice plays an integral part, creating an engaging, personalized experience. The technology that powers this is called text-to-speech, or TTS for short. It is a form of assistive technology that reads digital text aloud, which is why it is also known as read-aloud technology.
With a single tap or click on a button, TTS can convert characters into audio, which is invaluable to people like me, who are readers on the go. I'm a huge fan of both reading and running, so with the help of the TTS function, my phone transforms my e-books into audio books, and I can listen to them while I'm on a run.
There are two things, however, that I'm not satisfied with about the TTS function on my phone. First, when the text contains both Chinese and English, the function fails to distinguish one from the other and consequently says something incomprehensible. Second, the speech speed cannot be adjusted, meaning I cannot listen to things slowly and carefully when necessary.
I made up my mind to develop a TTS function that overcomes such disadvantages. After some research, I was disappointed to find out that creating a speech synthesizer from scratch meant that I had to study linguistics (which enables TTS to recognize how text is pronounced by a human), audio signal processing (which paves the way for TTS to be able to generate new speech), and deep learning (which enables TTS to handle a large amount of data for generating high-quality speech).
That sounded intimidating. Therefore, instead of creating a TTS function from nothing, I decided to turn to solutions that are already available on the market. One such solution I found is the TTS capability from HMS Core ML Kit. Let's now dive deeper into it.
Capability Introduction
The TTS capability adopts the deep neural network (DNN) synthesis mode and can be quickly integrated through the on-device SDK to generate audio data in real time. Thanks to the DNN, the generated speech sounds natural and expressive.
The capability comes with many timbres to choose from and supports as many as 12 languages (Arabic, English, French, German, Italian, Malay, Mandarin Chinese, Polish, Russian, Spanish, Thai, and Turkish). When the text contains both Chinese and English, the capability can properly distinguish one from the other.
On top of this, the speech speed, pitch, and volume can be adjusted, making the capability customizable enough to meet the requirements of different scenarios.
Developing the TTS Function
Making Preparations
1. Prepare the development environment, which has requirements on both software and hardware:
Software requirements:
- JDK version: 1.8.211 or later
- Android Studio version: 3.X or later
- minSdkVersion: 19 or later (mandatory)
- targetSdkVersion: 31 (recommended)
- compileSdkVersion: 31 (recommended)
- Gradle version: 4.6 or later (recommended)
Hardware requirements: A mobile phone running Android 4.4 or later or EMUI 5.0 or later.
2. Create a developer account.
3. Configure the app information in AppGallery Connect, including project and app creation, as well as configuration of the data processing location.
4. Enable ML Kit in AppGallery Connect.
5. Integrate the SDK of the kit. This step involves several tasks. The one I want to mention specifically is adding the build dependencies, because each capability of the kit has its own build dependencies; those for the TTS capability are as follows:
dependencies {
    implementation 'com.huawei.hms:ml-computer-voice-tts:3.11.0.301'
}
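For this dependency to resolve, the Huawei Maven repository also has to be declared. The snippet below is a sketch of a typical project-level build.gradle for an HMS setup; the exact placement depends on your Gradle version, so check the kit's integration guide:
buildscript {
    repositories {
        google()
        mavenCentral()
        // The Huawei Maven repository hosts the HMS Core artifacts.
        maven { url 'https://developer.huawei.com/repo/' }
    }
}
allprojects {
    repositories {
        google()
        mavenCentral()
        maven { url 'https://developer.huawei.com/repo/' }
    }
}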
6. Configure obfuscation scripts.
7. Apply for the INTERNET permission in the AndroidManifest.xml file, as shown below. (This is because TTS is an on-cloud capability that requires a network connection. The kit also provides an on-device version of the capability, which, once its models are downloaded, can be used without network connectivity.)
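The permission declaration itself is a single line inside AndroidManifest.xml:
<uses-permission android:name="android.permission.INTERNET" />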
Implementing the TTS Capability Using Kotlin
1. Set the authentication information for the app.
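Here is a minimal sketch of this step, assuming the API key approach is used (the key comes from your app's details in AppGallery Connect, and MLApplication is the kit's entry class for authentication):
// Set the API key obtained from AppGallery Connect. "your_api_key" is a placeholder.
MLApplication.getInstance().setApiKey("your_api_key")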
2. Create a TTS engine by using the MLTtsConfig class for engine parameter configuration.
// Use custom parameter settings to create a TTS engine.
val mlTtsConfig = MLTtsConfig()
    // Set the language of the text to be converted to Chinese.
    .setLanguage(MLTtsConstants.TTS_ZH_HANS)
    // Set a Chinese timbre.
    .setPerson(MLTtsConstants.TTS_SPEAKER_FEMALE_ZH)
    // Set the speech speed. The range is (0, 5.0]. 1.0 indicates a normal speed.
    .setSpeed(1.0f)
    // Set the volume. The range is (0, 2). 1.0 indicates a normal volume.
    .setVolume(1.0f)
val mlTtsEngine = MLTtsEngine(mlTtsConfig)
// Set the volume of the built-in player, in dBs. The value range is [0, 100].
mlTtsEngine.setPlayerVolume(20)
// Update the configuration while the engine is running.
mlTtsEngine.updateConfig(mlTtsConfig)
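Because updateConfig works on a running engine, the speed limitation I complained about earlier can be addressed at runtime. Here is a minimal sketch, assuming a rebuilt MLTtsConfig is passed to updateConfig (the 1.5f value is just an example):
// Rebuild the configuration with a different speech speed and apply it to the running engine.
val fasterConfig = MLTtsConfig()
    .setLanguage(MLTtsConstants.TTS_ZH_HANS)
    .setPerson(MLTtsConstants.TTS_SPEAKER_FEMALE_ZH)
    .setSpeed(1.5f) // For example, 1.5 times the normal speed.
    .setVolume(1.0f)
mlTtsEngine.updateConfig(fasterConfig)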
3. Create a callback to process the text-to-speech conversion result.
val callback: MLTtsCallback = object : MLTtsCallback {
    override fun onError(taskId: String, err: MLTtsError) {
        // Processing logic for TTS failure.
    }

    override fun onWarn(taskId: String, warn: MLTtsWarn) {
        // Alarm handling without affecting the service logic.
    }

    // Return the mapping between the currently played segment and the text.
    // start: start position of the audio segment in the input text.
    // end (excluded): end position of the audio segment in the input text.
    override fun onRangeStart(taskId: String, start: Int, end: Int) {
        // Process the mapping between the currently played segment and the text.
    }

    // taskId: ID of an audio synthesis task.
    // audioFragment: audio data.
    // offset: offset of the audio segment to be transmitted in the queue. One audio synthesis task corresponds to one audio synthesis queue.
    // range: text area where the audio segment to be transmitted is located; range.first (included): start position; range.second (excluded): end position.
    override fun onAudioAvailable(taskId: String, audioFragment: MLTtsAudioFragment, offset: Int,
                                  range: Pair<Int, Int>, bundle: Bundle) {
        // Audio stream callback API, which is used to return the synthesized audio data to the app.
    }

    override fun onEvent(taskId: String, eventId: Int, bundle: Bundle) {
        // Callback method of a TTS event. eventId indicates the event ID.
        when (eventId) {
            MLTtsConstants.EVENT_PLAY_START -> {
                // Called when playback starts.
            }
            MLTtsConstants.EVENT_PLAY_STOP -> {
                // Called when playback stops.
                val isInterrupted: Boolean = bundle.getBoolean(MLTtsConstants.EVENT_PLAY_STOP_INTERRUPTED)
            }
            MLTtsConstants.EVENT_PLAY_RESUME -> {
                // Called when playback resumes.
            }
            MLTtsConstants.EVENT_PLAY_PAUSE -> {
                // Called when playback pauses.
            }
            MLTtsConstants.EVENT_SYNTHESIS_START -> {
                // Called when audio synthesis starts.
            }
            MLTtsConstants.EVENT_SYNTHESIS_END -> {
                // Called when audio synthesis ends.
            }
            MLTtsConstants.EVENT_SYNTHESIS_COMPLETE -> {
                // Audio synthesis is complete. All synthesized audio streams have been passed to the app.
                val isInterrupted: Boolean = bundle.getBoolean(MLTtsConstants.EVENT_SYNTHESIS_INTERRUPTED)
            }
            else -> {
            }
        }
    }
}
4. Pass the callback just created to the TTS engine created in step 2 to convert text to speech.
mlTtsEngine.setTtsCallback(callback)
/**
 * The first parameter sourceText indicates the text to be synthesized. The value can contain a maximum of 500 characters.
 * The second parameter indicates the synthesis mode. The format is configA | configB | configC.
 * configA:
 * MLTtsEngine.QUEUE_APPEND: After a TTS task is generated, it is processed as follows: if playback is going on, the task is added to the queue for execution in sequence; if playback is paused, playback is resumed and the task is added to the queue for execution in sequence; if there is no playback, the TTS task is executed immediately.
 * MLTtsEngine.QUEUE_FLUSH: The ongoing TTS task and playback are stopped immediately, and all TTS tasks in the queue are cleared. The new TTS task is then executed immediately, and the generated speech is played.
 * configB:
 * MLTtsEngine.OPEN_STREAM: The synthesized audio data is output through onAudioAvailable.
 * configC:
 * MLTtsEngine.EXTERNAL_PLAYBACK: External playback mode. The player provided by the SDK is not used, and you need to process the audio output by the onAudioAvailable callback yourself. In this case, the playback-related callback APIs become invalid, and only the callbacks related to audio synthesis can be listened to.
 */
// Use the built-in player of the SDK to play speech in queuing mode.
val sourceText = "Text to be synthesized" // Replace with the text to be converted, up to 500 characters.
val id = mlTtsEngine.speak(sourceText, MLTtsEngine.QUEUE_APPEND)
// In queuing mode, the synthesized audio stream is output through onAudioAvailable, and the built-in player of the SDK plays the speech.
// val id = mlTtsEngine.speak(sourceText, MLTtsEngine.QUEUE_APPEND or MLTtsEngine.OPEN_STREAM)
// In queuing mode, the synthesized audio stream is output through onAudioAvailable and is not played automatically; playback is controlled by you.
// val id = mlTtsEngine.speak(sourceText, MLTtsEngine.QUEUE_APPEND or MLTtsEngine.OPEN_STREAM or MLTtsEngine.EXTERNAL_PLAYBACK)
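If you pass MLTtsEngine.EXTERNAL_PLAYBACK, the PCM data delivered to onAudioAvailable must be played by your own code. Below is a minimal sketch using Android's AudioTrack. The 16 kHz, 16-bit mono PCM format and the audioFragment.audioData accessor are assumptions about the kit's output; verify both against the official documentation.
// AudioTrack, AudioAttributes, AudioFormat, and AudioManager come from the android.media package.
val sampleRate = 16000 // Assumed output sample rate; confirm in the kit's documentation.
val bufferSize = AudioTrack.getMinBufferSize(
    sampleRate, AudioFormat.CHANNEL_OUT_MONO, AudioFormat.ENCODING_PCM_16BIT
)
val audioTrack = AudioTrack(
    AudioAttributes.Builder()
        .setUsage(AudioAttributes.USAGE_MEDIA)
        .setContentType(AudioAttributes.CONTENT_TYPE_SPEECH)
        .build(),
    AudioFormat.Builder()
        .setSampleRate(sampleRate)
        .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
        .setChannelMask(AudioFormat.CHANNEL_OUT_MONO)
        .build(),
    bufferSize,
    AudioTrack.MODE_STREAM,
    AudioManager.AUDIO_SESSION_ID_GENERATE
)
audioTrack.play()
// Then, inside onAudioAvailable, write each synthesized fragment to the track:
// val pcm = audioFragment.audioData // Assumed accessor; check MLTtsAudioFragment's API.
// audioTrack.write(pcm, 0, pcm.size)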
5. Pause or resume speech playback.
// Pause speech playback.
mlTtsEngine.pause()
// Resume speech playback.
mlTtsEngine.resume()
6. Stop the ongoing TTS task and clear all TTS tasks to be processed.
mlTtsEngine.stop()
7. Release the resources occupied by the TTS engine when the TTS task ends.
mlTtsEngine.shutdown()
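In an Activity, a natural place for this cleanup is onDestroy. A minimal sketch, assuming mlTtsEngine is a property of the Activity:
override fun onDestroy() {
    super.onDestroy()
    // Stop any ongoing task and release the engine's resources.
    mlTtsEngine.stop()
    mlTtsEngine.shutdown()
}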
These steps explain how the TTS capability can be used to develop a TTS function in Kotlin. The capability also supports Java, and the resulting function is the same in either language, so just choose the one you are more familiar with or want to try out.
Besides audiobooks, the TTS function is helpful in a bunch of other scenarios. For example, when someone has been staring at a screen for too long, they can turn to TTS for relief. When a parent is too tired to finish a bedtime story, they can use the TTS function to read the rest of the story to their children. And voice content creators can turn to TTS for dubbing videos and providing voiceovers.
The list goes on. I look forward to hearing how you use the TTS function for other cases in the comments section below.
Takeaway
Machine-generated voice brings an even greater level of convenience to ordinary, day-to-day tasks, allowing us to absorb content while doing other things at the same time.
The technology that powers voice generation is known as TTS, and it is relatively simple to use. A worthy solution for implementing this technology in mobile apps is the capability of the same name from HMS Core ML Kit. It supports multiple languages and works well with bilingual Chinese-English text. The capability provides a range of timbres that all sound surprisingly natural, thanks to its adoption of DNN technology. It is also customizable through configurable parameters, including the speech speed, volume, and pitch. With this capability, building a mobile text reader is a breeze.