Spring AI SpeechModel: Text to Speech Example

In Spring AI, the SpeechModel interface lets us interact with the Text-to-Speech (TTS) APIs of supported providers, such as the tts-1 and tts-1-hd models by OpenAI. It enables us to convert text messages into lifelike spoken audio.

1. SpeechModel and StreamingSpeechModel API

Spring AI provides two primary interfaces to interact with TTS models:

SpeechModel: provides methods to send a speech request to the TTS API and returns the resulting speech response in the requested audio format.
StreamingSpeechModel: generates a stream of audio bytes from the provided text message in real time. The response is a Flux of speech responses, each containing a portion of the generated audio.

At the moment, a single class implements both interfaces:

OpenAiAudioSpeechModel

public class OpenAiAudioSpeechModel implements SpeechModel, StreamingSpeechModel {

	//implementation methods
}

2. Getting Started with SpeechModel

To work with the speech model, we must include the spring-ai-openai-spring-boot-starter dependency in the project. Read Getting Started with Spring AI to learn how to set up a new project.

<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>

Spring Boot’s autoconfiguration, in addition to the beans for chat and image models, creates an instance of the OpenAiAudioSpeechModel class with the following default configuration. We can change these properties as required.

# API key is mandatory
spring.ai.openai.api-key=${OPENAI_API_KEY}

# Speech options (comments sit on their own lines because .properties files
# treat a trailing '#' as part of the value)
spring.ai.openai.audio.speech.options.model=tts-1
# alloy, echo, fable, onyx, nova or shimmer
spring.ai.openai.audio.speech.options.voice=alloy
# mp3, opus, aac, flac, wav or pcm
spring.ai.openai.audio.speech.options.response-format=mp3
# from 0.25 (slowest) to 4.0 (fastest); 1.0 is normal speed
spring.ai.openai.audio.speech.options.speed=1.0
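
A misspelled voice or an out-of-range speed is typically reported only when OpenAI rejects the request. As a minimal sketch, a plain-Java check like the following could run at startup. It assumes the voice and format values listed in the property comments and the speed range that OpenAI documents for its speech endpoint (0.25 to 4.0). The class SpeechOptionsCheck and its method are hypothetical helpers and not part of Spring AI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not a Spring AI class): checks speech option values
// before they are sent to the API.
final class SpeechOptionsCheck {

  // Value sets taken from the property comments in this article.
  static final Set<String> VOICES =
      Set.of("alloy", "echo", "fable", "onyx", "nova", "shimmer");
  static final Set<String> FORMATS =
      Set.of("mp3", "opus", "aac", "flac", "wav", "pcm");

  // Assumption: speed limits as documented by OpenAI for the speech endpoint.
  static final float MIN_SPEED = 0.25f;
  static final float MAX_SPEED = 4.0f;

  // Returns a list of human-readable problems; an empty list means
  // all three values look valid.
  static List<String> validate(String voice, String format, float speed) {
    List<String> problems = new ArrayList<>();
    if (voice == null || !VOICES.contains(voice)) {
      problems.add("unknown voice: " + voice);
    }
    if (format == null || !FORMATS.contains(format)) {
      problems.add("unknown response format: " + format);
    }
    if (speed < MIN_SPEED || speed > MAX_SPEED) {
      problems.add("speed out of range: " + speed);
    }
    return problems;
  }
}
```

Calling SpeechOptionsCheck.validate("alloy", "mp3", 1.0f) returns an empty list, while an unsupported voice or format adds one entry per problem.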

We can inject the SpeechModel and StreamingSpeechModel type beans in any Spring-managed bean.

@RestController
public class SpeechController {

  private final SpeechModel speechModel;

  SpeechController(SpeechModel speechModel) {
    this.speechModel = speechModel;
  }

  // use speechModel
}

3. Create SpeechModel Programmatically

To create the SpeechModel bean with default configurations, we only need the API key:

var openAiAudioApi = new OpenAiAudioApi(System.getenv("OPENAI_API_KEY"));
var speechModel = new OpenAiAudioSpeechModel(openAiAudioApi);

If we wish to override the defaults with custom parameters, we can use the OpenAiAudioSpeechOptions class as follows:

OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
  .withModel(OpenAiAudioApi.TtsModel.TTS_1.getValue())
  .withResponseFormat(AudioResponseFormat.MP3)
  .withVoice(OpenAiAudioApi.SpeechRequest.Voice.ALLOY)
  .withSpeed(1.0f)
  .build();

var openAiAudioApi = new OpenAiAudioApi(System.getenv("OPENAI_API_KEY"));
SpeechModel speechModel = new OpenAiAudioSpeechModel(openAiAudioApi, speechOptions);

4. Converting Text to Speech Example

Once the SpeechModel bean is initialized, we can use its call() method to supply the text that needs to be converted to speech.

SpeechResponse speechResponse = speechModel.call(new SpeechPrompt(message));
byte[] audio = speechResponse.getResult().getOutput();

For example, take the following controller method:

@RestController
class SpeechController {

  private final SpeechModel speechModel;

  SpeechController(SpeechModel speechModel) {
    this.speechModel = speechModel;
  }

  @PostMapping("/speech")
  byte[] speech(@RequestBody String message) {

    return speechModel.call(new SpeechPrompt(message))
      .getResult()
      .getOutput();
  }
}

We can test this method from Postman or any other tool.

The API response consists of raw bytes that we can store as an mp3 file and verify by listening to it.
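
Instead of Postman, the endpoint can also be exercised from plain Java. The sketch below builds the POST request with the JDK's java.net.http client and writes the returned bytes to a file. It assumes the application runs locally on Spring Boot's default port 8080 and that the response format is left at mp3. SpeechClientSketch, its method names, and the output file name are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client for the /speech endpoint shown above.
final class SpeechClientSketch {

  // Builds a POST request whose body is the text to be spoken.
  static HttpRequest buildRequest(URI endpoint, String text) {
    return HttpRequest.newBuilder(endpoint)
        .header("Content-Type", "text/plain")
        .POST(HttpRequest.BodyPublishers.ofString(text))
        .build();
  }

  // Writes the raw audio bytes to the target file and returns the file size.
  static long saveAudio(byte[] audio, Path target) throws IOException {
    Files.write(target, audio);
    return Files.size(target);
  }

  // Sends the request and stores the response body.
  // Requires the Spring Boot application to be running.
  static Path fetchAndSave(URI endpoint, String text, Path target)
      throws IOException, InterruptedException {
    HttpClient client = HttpClient.newHttpClient();
    HttpResponse<byte[]> response =
        client.send(buildRequest(endpoint, text), HttpResponse.BodyHandlers.ofByteArray());
    if (response.statusCode() != 200) {
      throw new IOException("Unexpected status: " + response.statusCode());
    }
    saveAudio(response.body(), target);
    return target;
  }
}
```

With the application running, a call such as SpeechClientSketch.fetchAndSave(URI.create("http://localhost:8080/speech"), "Hello from Spring AI", Path.of("speech.mp3")) would leave a playable file in the working directory.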

5. Streaming Audio in Real-time

Just like podcasts, we can ask the model to deliver the audio in a streaming fashion using chunked transfer encoding. This helps in building audio-player-like features in our application, for example, a player that narrates a written blog post.

To use the streaming model, most things remain the same, except that we call the stream() method in place of the call() method used in the previous example. Notice that we are using AudioResponseFormat.OPUS, a format intended for internet streaming and low-latency communication.

@GetMapping("/stream-speech")
Flux<byte[]> streamingSpeech(@RequestParam String message) {

  OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
    .withResponseFormat(AudioResponseFormat.OPUS)
    .build();

  OpenAiAudioApi openAiAudioApi = new OpenAiAudioApi(System.getenv("OPENAI_API_KEY"));
  StreamingSpeechModel streamingSpeechModel = new OpenAiAudioSpeechModel(openAiAudioApi, speechOptions);

  return streamingSpeechModel.stream(message);
}

Now when we invoke this API in the browser, we can see that the response is received in chunks.
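
To observe the chunks outside the browser, a client can read the response body as a stream and handle each portion as it arrives. The following sketch shows only the reading loop, using JDK streams. With java.net.http.HttpClient, the input stream could come from HttpResponse.BodyHandlers.ofInputStream(). ChunkedAudioReader and the buffer size are hypothetical choices, and over a real connection the number of reads depends on how the bytes arrive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: forwards audio bytes piece by piece instead of
// waiting for the complete response.
final class ChunkedAudioReader {

  // Copies bytes from in to out as they become available and returns
  // the number of non-empty reads that were needed.
  static int drain(InputStream in, OutputStream out, int bufferSize) throws IOException {
    if (bufferSize <= 0) {
      throw new IllegalArgumentException("bufferSize must be positive");
    }
    byte[] buffer = new byte[bufferSize];
    int reads = 0;
    int n;
    // read() blocks until some data is available and returns -1 at end of stream.
    while ((n = in.read(buffer)) != -1) {
      if (n > 0) {
        out.write(buffer, 0, n);
        reads++;
      }
    }
    out.flush();
    return reads;
  }
}
```

In a player-like feature, the output stream would be whatever feeds the audio decoder; for a quick check, a FileOutputStream pointing at a file with the .opus extension is enough.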

6. Summary

In this Spring AI Text-to-Speech tutorial, we learned the following:

Currently, the module only supports OpenAI’s tts-1 and tts-1-hd models.
The primary interfaces for generating audio files are SpeechModel and StreamingSpeechModel.
The OpenAiAudioSpeechModel class implements both interfaces, so it supports audio generation in a single response as well as in a streaming fashion.
The OpenAiAudioSpeechModel bean (with default configuration) is available to use as soon as we add the spring-ai-openai-spring-boot-starter dependency to the project.
We can customize the audio speed and voice either through configuration properties or in Java code using the OpenAiAudioSpeechOptions class.

Happy Learning !!
