Voices, an open-source text-to-speech project, is designed for applications running on Java 17 or newer. The library requires no external APIs or manually installed software. Audio files can be generated in various languages based on pronunciation dictionaries or OpenVoice.
Henry Coles, creator of Voices and Pitest and head of mutation testing at Arcmutate, introduced Voices on Bluesky in September 2025. The latest version, 0.0.8, was released in late October 2025.
Voices uses ONNX Runtime, a cross-platform AI engine that speeds up training and inference, supporting models from various deep learning frameworks such as TensorFlow and PyTorch. The runtime leverages hardware accelerators whenever possible and supports various hardware and operating system configurations.
Several libraries are required for the examples demonstrated here. The following POM file configuration can be used with Maven:
<!-- The main dependency -->
<dependency>
    <groupId>org.pitest.voices</groupId>
    <artifactId>chorus</artifactId>
    <version>0.0.8</version>
</dependency>
<!-- A prepackaged model -->
<dependency>
    <groupId>org.pitest.voices</groupId>
    <artifactId>alba</artifactId>
    <version>0.0.8</version>
</dependency>
<!-- A dictionary of pronunciations -->
<dependency>
    <groupId>org.pitest.voices</groupId>
    <artifactId>en_uk</artifactId>
    <version>0.0.8</version>
</dependency>
<!-- The runtime for ONNX models -->
<dependency>
    <groupId>com.microsoft.onnxruntime</groupId>
    <artifactId>onnxruntime</artifactId>
    <version>1.22.0</version>
</dependency>
Alternatively, other build tools, such as Gradle, may be used for all examples demonstrated here.
An en_us dictionary may be used instead of the en_uk dictionary by replacing the above dependency, and the onnxruntime dependency may be replaced by onnxruntime_gpu for GPU acceleration.

The Chorus class is used to manage voice models and their resources. It’s advised to use a single Chorus instance per application due to the expense of loading models. The following example demonstrates the conversion of English text to an InfoQ_English sound file:
ChorusConfig config = chorusConfig(EnUkDictionary.en_uk());
try (Chorus chorus = new Chorus(config)) {
    Voice alba = chorus.voice(Alba.albaMedium());
    Audio audio = alba.say("This is the InfoQ article about the Voices library");
    Path path = Paths.get("InfoQ_English");
    audio.save(path);
}
The previous example used a model retrieved at build time via the Maven dependency. Alternatively, other models can be retrieved at runtime by adding the following Maven dependency:
<dependency>
    <groupId>org.pitest.voices</groupId>
    <artifactId>model-downloader</artifactId>
    <version>0.0.8</version>
</dependency>
Now the models can be used via factory methods on the following classes:
org.pitest.voices.download.Models
org.pitest.voices.download.UsModels
org.pitest.voices.download.NonEnglishModels
The following example uses the Dutch nlNLRonnie model from the NonEnglishModels class to convert Dutch text to a Dutch sound file:
Model nlModel = NonEnglishModels.nlNLRonnie();
ChorusConfig config = chorusConfig(EnUkDictionary.en_uk());
try (Chorus chorus = new Chorus(config)) {
    Voice ronnie = chorus.voice(nlModel);
    Audio audio = ronnie.say("Dit is een Nederlandse tekst Scheveningen");
    Path path = Paths.get("Dutch");
    audio.save(path);
}
Alternatively, OpenVoice may be used to improve the resulting speech without requiring a dictionary. However, it requires more computational power and has a significantly larger model of 50 MB compared to the 3 MB dictionary file. The following dependency enables support for OpenVoice with Maven:
<dependency>
    <groupId>org.pitest.voices</groupId>
    <artifactId>openvoice-phonemizer</artifactId>
    <version>0.0.8</version>
</dependency>
After declaring the dependency, the OpenVoiceSupplier model can be used:
ChorusConfig config = chorusConfig(Dictionaries.empty())
    .withModel(new OpenVoiceSupplier());
try (Chorus chorus = new Chorus(config)) {
    Voice alba = chorus.voice(Alba.albaMedium());
    Audio audio = alba.say("This is the InfoQ article about the Voices library");
    Path path = Paths.get("InfoQ_English_OpenVoice");
    audio.save(path);
}
OpenVoice also supports other voices in UK or US English and languages such as Dutch, as shown in the following example:
Model nlModel = NonEnglishModels.nlNLRonnie();
ChorusConfig config = chorusConfig(Dictionaries.empty())
    .withModel(new OpenVoiceSupplier());
try (Chorus chorus = new Chorus(config)) {
    Voice ronnie = chorus.voice(nlModel);
    Audio audio = ronnie.say("Dit is een Nederlandse tekst Scheveningen");
    Path path = Paths.get("Dutch_OpenVoice");
    audio.save(path);
}
The library also allows running the models on the GPU instead of the CPU by removing the onnxruntime dependency from the classpath and adding the onnxruntime_gpu dependency, for example, via the Maven pom.xml. Next, gpuChorusConfig should be used instead of the ‘normal’ chorusConfig:
ChorusConfig config = gpuChorusConfig(EnUkDictionary.en_uk());
By default, GPU 0 is used without further options; alternatively, the withCudaOptions() method, defined in the ChorusConfig class, may be used to customize the GPU configuration.
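As a minimal sketch, the GPU-backed configuration drops into the same workflow as the earlier Alba example; only the configuration call changes (the InfoQ_English_GPU file name below is illustrative):

// Assumes onnxruntime_gpu is on the classpath instead of onnxruntime.
ChorusConfig config = gpuChorusConfig(EnUkDictionary.en_uk());
try (Chorus chorus = new Chorus(config)) {
    Voice alba = chorus.voice(Alba.albaMedium());
    Audio audio = alba.say("This sentence is synthesised on the GPU");
    audio.save(Paths.get("InfoQ_English_GPU"));
}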
Pauses are added when the library encounters Markdown symbols, such as # and em or en dashes, in the text.
As with other configurations, the ChorusConfig class may be used to change the defaults for pauses.
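For instance, with the alba voice from the earlier examples, both the heading marker and the dash in the text below would introduce pauses in the generated audio (a small illustration using only the say() call shown above):

// The # heading marker and the em dash each trigger a pause in the output.
Audio audio = alba.say("# Chapter One\nIt was a quiet evening — and then the phone rang.");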
Various alternative text-to-speech solutions, such as Sherpa Onnx and MaryTTS, are either harder to consume from build tools like Maven or produce lower-quality voices.
InfoQ reached out to Henry Coles to learn more about Voices.
InfoQ: What types of use cases do you envision as the most common for Voices? Can you share some examples where this library really shines?
Henry: The code was originally part of a tool for editing fiction. I can only guess where else it might be useful, but it shines when you need to generate reasonably natural-sounding speech quickly and don’t want to rely on external services.
InfoQ: What inspired you to create the Voices library?
Henry: I needed to generate speech from Java, and most modern Text to Speech (TTS) libraries are written in Python. Initially, I ran piper as an HTTP service, but this was inconvenient, so I began to search for ways to run the piper models from Java.
InfoQ: What made you decide to create a new library instead of collaborating on an existing solution?
Henry: The existing Java TTS solutions were established a long time ago and sound robotic by modern standards. They’d be difficult to improve. By contrast, running the piper ONNX models is trivial except for one missing piece: Java code to convert text to phonemes. I couldn’t find a Java phonemiser anywhere, so I had to hack one together.
InfoQ: What challenges did you face while building Voices, and how did you overcome them? Were there any key design decisions that you debated?
Henry: The main challenge was knowing nothing about linguistics. The development process was also completely different from how I usually work. It was largely a porting project, translating TypeScript logic to Java. Testing what was essentially someone else’s logic was further complicated by the fact that there are no clear “correct” answers. English can’t really be phonemised by simple rules (special case handling is required via a dictionary), so for some inputs the output is always going to be wrong; it’s a question of deciding which ones. I ended up with a very manual development loop, generating audio and scoring it by ear, then adding test cases to catch regressions for that particular input.
InfoQ: Which part of the library would you like to improve?
Henry: I’d like to clean up the API. The current one was created to meet a single use case quickly; something nicer could likely be created with a bit of upfront thought.
InfoQ: Are there any plans to add functionalities in the future?
Henry: If I get a chance, I’ll take a look at improving how it handles pauses and setting the rhythm of the speech.
InfoQ: What automated testing approach do you recommend for applications using the library? Maybe using a speech-to-text solution to be able to compare the input and the output?
Henry: I’d recommend testing the output sparingly. A couple of tests that check that audio is produced and everything is wired up correctly make sense, but the library’s functionality is not something client apps control, so they should mainly concern themselves with checking on the input to the boundary.
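As a rough sketch of that advice, a single smoke test might check only that a non-empty audio file is produced. The sketch below assumes JUnit 5 and that save() writes to exactly the path given, as the earlier examples suggest; the test and file names are illustrative:

import static org.junit.jupiter.api.Assertions.assertTrue;

import java.nio.file.Files;
import java.nio.file.Path;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

class SpeechSmokeTest {

    // Static import of chorusConfig and the Voices classes omitted, as in the snippets above.
    @Test
    void producesNonEmptyAudioFile(@TempDir Path tempDir) throws Exception {
        ChorusConfig config = chorusConfig(EnUkDictionary.en_uk());
        try (Chorus chorus = new Chorus(config)) {
            Voice alba = chorus.voice(Alba.albaMedium());
            Audio audio = alba.say("Smoke test");
            Path out = tempDir.resolve("smoke");
            audio.save(out);

            // Verify only that audio was produced and everything is wired up;
            // the quality of the speech itself is the library's concern, not the client's.
            assertTrue(Files.exists(out));
            assertTrue(Files.size(out) > 0);
        }
    }
}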
More information and examples can be found on GitHub.
