
VoiceML: <Say>

The <Say> action synthesizes text into speech and plays it back on the call.


The <Say> action supports the following attributes.

Attribute    Allowed Values    Default
Voice        Voice name        Google-en-AU-Wavenet-B


The Voice attribute specifies which of the available voices to use for text-to-speech.


The body of an action is the content nested within the action. The following is supported for <Say>.

Type          Description
plain text    The text to synthesize into speech.

The maximum size of the input text is 5000 characters.
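If the VoiceML is generated by a program, the length limit can be checked before the document is returned. The following Python sketch shows one way to do that. The helper name `build_say` is hypothetical, and it assumes the 5000-character limit applies to the raw text before XML escaping, which is not stated above.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

# Limit taken from the text above. Assumption: it is counted on the raw
# text, before XML escaping, and in Python characters (code points).
MAX_SAY_CHARS = 5000


def build_say(text: str, voice: Optional[str] = None) -> str:
    """Return a <Say> element as a string (hypothetical helper)."""
    if len(text) > MAX_SAY_CHARS:
        raise ValueError(
            f"text is {len(text)} characters; the limit is {MAX_SAY_CHARS}"
        )
    # quoteattr() adds the surrounding quotes and escapes the attribute value.
    attrs = f" Voice={quoteattr(voice)}" if voice else ""
    # escape() replaces &, < and > so the text is safe as element content.
    return f"<Say{attrs}>{escape(text)}</Say>"


if __name__ == "__main__":
    print(build_say("Hello world"))
    print(build_say("Hello world", voice="Google-en-GB-Neural2-C"))
```

Escaping the text matters because characters such as `&` and `<` would otherwise produce invalid XML.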


Enfonica uses high-quality speech synthesis provided by Google Cloud Text-to-Speech.

Tier                    Pricing
First 10M characters    $0.0019 per 100 characters
10M+ characters         Talk to sales

Text-to-speech is billed at the end of each call on the total number of characters synthesized, in blocks of 100 characters. For example, if 180 characters were synthesized, the call is billed for two blocks, so the text-to-speech cost for that call will be $0.0038.
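The billing rule can be sketched in a few lines of Python. This is an illustration only: the function name `tts_cost` is hypothetical, it covers only the first pricing tier, and it assumes a partial block is rounded up to a full block, which matches the 180-character example.

```python
from decimal import Decimal

BLOCK_SIZE = 100                      # characters per billing block
PRICE_PER_BLOCK = Decimal("0.0019")   # first-tier price from the table


def tts_cost(total_chars: int) -> Decimal:
    """Estimate the text-to-speech cost of one call (first tier only)."""
    if total_chars < 0:
        raise ValueError("total_chars must not be negative")
    # Ceiling division: a partial block is assumed to be billed in full.
    blocks = -(-total_chars // BLOCK_SIZE)
    return blocks * PRICE_PER_BLOCK


if __name__ == "__main__":
    print(tts_cost(180))  # two blocks of 100 characters
```

`Decimal` is used instead of `float` so that small per-block prices add up without binary rounding error.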

Using text-to-speech in place of audio URIs

Anywhere that VoiceML accepts a URI to an audio file for playback, you can use text-to-speech instead. Use the tts scheme with the URL-encoded text that you want to synthesize. This allows you to use text-to-speech for attributes like WhisperAudioUri and ScreenAudioUri.

"To say this with text-to-speech, use the following audio URI:"


To specify the voice, you can use the voice query parameter. For example:
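One way to build such a URI programmatically is sketched below in Python. The exact URI shape used here (`tts:` followed directly by the percent-encoded text, with `voice` appended as a query parameter) is an assumption inferred from the description above, not a confirmed format, and the helper name `tts_uri` is hypothetical. Check the result against a working URI before relying on it.

```python
from typing import Optional
from urllib.parse import quote


def tts_uri(text: str, voice: Optional[str] = None) -> str:
    """Build a text-to-speech audio URI (assumed shape, see above)."""
    # safe="" percent-encodes every reserved character, including "/" and "?",
    # so punctuation in the text cannot be mistaken for URI syntax.
    uri = "tts:" + quote(text, safe="")
    if voice:
        # Assumption: the voice is passed as a "voice" query parameter.
        uri += "?voice=" + quote(voice, safe="")
    return uri


if __name__ == "__main__":
    print(tts_uri("Hello world"))
    print(tts_uri("Hello world", voice="Google-en-GB-Neural2-C"))
```

Encoding with `safe=""` is the conservative choice here, because a literal `?` in the spoken text would otherwise start the query string.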



Example 1: Play some speech

The following example says "Hello world" on the call.

<?xml version="1.0" encoding="UTF-8"?>
    <Say>Hello world</Say>

Example 2: Play some speech with a British accent

The following example says "Hello world" on the call in a British accent.

<?xml version="1.0" encoding="UTF-8"?>
    <Say Voice="Google-en-GB-Neural2-C">Hello world</Say>