Implementing Text-to-Speech with Amazon Polly - Neural Voices and SSML Speech Control
Generate natural-sounding speech with the Neural TTS engine and control speech rate, pitch, and pauses with SSML tags. Learn how to build diverse audio content using real-time streaming and asynchronous synthesis to S3.
Overview of Polly
Amazon Polly is a text-to-speech (TTS) service that converts text into natural-sounding speech. The Neural TTS engine uses deep learning models to produce significantly more natural speech than the traditional Standard TTS engine. Polly supports over 30 languages including Japanese, with more than 60 voices offering male, female, and child options. Japanese neural voices include Kazuha and Tomoko. The Generative engine uses the latest foundation models for the highest quality speech, currently available in English. The Long-Form engine is optimized for long-form content such as books and news articles, automatically adjusting natural pauses and intonation between paragraphs.
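To check which voices an engine offers for a given language, the DescribeVoices API can be queried. The following is a minimal sketch, assuming a boto3 Polly client is passed in by the caller; the helper name list_voice_ids is illustrative and not part of any SDK.

```python
def list_voice_ids(polly_client, language_code="ja-JP", engine="neural"):
    """Return the sorted voice IDs Polly reports for a language and engine.

    polly_client is assumed to be a boto3 Polly client, for example
    boto3.client("polly"). Results are paged with NextToken.
    """
    voice_ids = []
    kwargs = {"Engine": engine, "LanguageCode": language_code}
    while True:
        response = polly_client.describe_voices(**kwargs)
        voice_ids.extend(voice["Id"] for voice in response["Voices"])
        token = response.get("NextToken")
        if not token:
            break
        # Request the next page of voices.
        kwargs["NextToken"] = token
    return sorted(voice_ids)
```

Passing the client as an argument keeps the helper free of credentials and region settings, which stay with the caller.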
SSML and Speech Control
SSML (Speech Synthesis Markup Language) tags provide fine-grained control over how text is read aloud. The <prosody> tag adjusts speech rate, pitch, and volume, while the <break> tag inserts pauses at any position. The <phoneme> tag specifies pronunciation for specific words using IPA (International Phonetic Alphabet), preventing mispronunciation of proper nouns and technical terms. The <say-as> tag specifies how numbers are read (phone numbers, dates, currency), and the <emphasis> tag adds stress. Registering a lexicon lets you globally override pronunciation for specific words and phrases, eliminating the need to write SSML each time. With the Neural engine, the NTTS-specific <amazon:domain> tag applies styles such as newscaster or conversational.
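As an illustration of how these tags combine, the sketch below assembles a small SSML document that slows the speech rate, inserts a pause, and reads a number as a telephone number. The function name build_ssml and its parameters are hypothetical; only the SSML elements themselves (speak, prosody, break, say-as) come from SSML as Polly supports it.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(sentence, rate="medium", pause_ms=500, phone_number=None):
    """Build an SSML string using <prosody>, <break>, and <say-as>.

    Text is XML-escaped so characters such as & and < do not break the markup.
    """
    parts = [f"<prosody rate={quoteattr(rate)}>{escape(sentence)}</prosody>"]
    if phone_number is not None:
        # Pause before the number, then have it read out as a telephone number.
        parts.append(f'<break time="{int(pause_ms)}ms"/>')
        parts.append(
            f'<say-as interpret-as="telephone">{escape(phone_number)}</say-as>'
        )
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(build_ssml("Please call us at", rate="slow", phone_number="2025551212"))
```

When the result is sent to Polly, the request needs TextType set to "ssml" so the markup is interpreted instead of being read aloud.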
Synthesis Methods and Integration
Polly offers two synthesis methods. The SynthesizeSpeech API converts text to speech in real time and returns an audio stream. You can play the response directly or save it to a file. It is suited to short text, up to 3,000 billed characters per request. The StartSpeechSynthesisTask API performs asynchronous synthesis, writing the audio for long text to an S3 bucket in a format such as MP3 or OGG. It accepts up to 200,000 characters in total per task, of which 100,000 can be billed characters, making it ideal for audiobook narration or batch generation of announcement audio. The Speech Marks feature provides timing metadata (word-level, sentence-level, and viseme) that aligns text with audio, useful for automatic subtitle synchronization and lip-syncing. Integration with Amazon Connect enables dynamic IVR voice prompts, and combining Polly with Amazon Lex builds voice-interactive bots.
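The choice between the two methods can be sketched as a simple length check. The example below assumes a boto3 Polly client is supplied by the caller and that the input is plain text, so len(text) approximates the billed character count. The function name synthesize, the default voice, and the returned tuple shape are illustrative choices, not part of the Polly API.

```python
SYNC_CHAR_LIMIT = 3000  # billed-character limit for SynthesizeSpeech


def synthesize(polly_client, text, bucket=None, voice_id="Kazuha", engine="neural"):
    """Synthesize short text in real time and long text asynchronously to S3.

    Returns ("audio", mp3_bytes) for the real-time path, or
    ("s3_uri", output_uri) once an asynchronous task has been started.
    """
    if len(text) <= SYNC_CHAR_LIMIT:
        response = polly_client.synthesize_speech(
            Text=text, OutputFormat="mp3", VoiceId=voice_id, Engine=engine
        )
        # AudioStream is a file-like object; read() returns the encoded audio.
        return ("audio", response["AudioStream"].read())

    if bucket is None:
        raise ValueError("long text needs an S3 bucket for asynchronous synthesis")

    response = polly_client.start_speech_synthesis_task(
        Text=text,
        OutputFormat="mp3",
        OutputS3BucketName=bucket,
        VoiceId=voice_id,
        Engine=engine,
    )
    # The task runs in the background; the URI is where the audio will appear.
    return ("s3_uri", response["SynthesisTask"]["OutputUri"])
```

A caller would typically create the client with boto3.client("polly") and write the returned bytes to an .mp3 file. For the asynchronous path, the task status can be polled with get_speech_synthesis_task before the S3 object is read.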
Polly Pricing
Polly pricing is pay-per-use based on the number of characters processed. The Neural engine costs approximately $16.00 per 1 million characters, and the Standard engine costs approximately $4.00 per 1 million characters. The Generative engine costs approximately $30.00 per 1 million characters. The Long-Form engine costs approximately $100.00 per 1 million characters, which is expensive but specialized for high-quality long-form audio such as audiobooks. The free tier includes 1 million Neural engine characters per month and 5 million Standard engine characters per month for the first 12 months. SSML tags are not counted toward character usage, so leveraging SSML does not increase costs.
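The arithmetic behind these figures can be sketched as follows. The rates and free-tier allowances are the approximate values quoted above, hard-coded as assumptions; the function name estimate_monthly_cost is illustrative. Character counts are expected to exclude SSML tags, since those are not billed.

```python
from decimal import Decimal

# Approximate USD rates per 1 million characters, as quoted in the text.
USD_PER_MILLION_CHARS = {
    "standard": Decimal("4.00"),
    "neural": Decimal("16.00"),
    "generative": Decimal("30.00"),
    "long-form": Decimal("100.00"),
}

# Free-tier characters per month (first 12 months), as quoted in the text.
FREE_CHARS_PER_MONTH = {"standard": 5_000_000, "neural": 1_000_000}


def estimate_monthly_cost(engine, characters, free_tier=False):
    """Estimate the monthly cost in USD for a number of billed characters."""
    billable = characters
    if free_tier:
        # Subtract the allowance, never going below zero.
        billable = max(0, characters - FREE_CHARS_PER_MONTH.get(engine, 0))
    return USD_PER_MILLION_CHARS[engine] * billable / 1_000_000


if __name__ == "__main__":
    print(estimate_monthly_cost("neural", 2_500_000))
    print(estimate_monthly_cost("neural", 2_500_000, free_tier=True))
```

For example, 2.5 million Neural characters come to about $40, or about $24 while the free tier still covers the first 1 million.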
Summary
Amazon Polly is a service that generates natural-sounding speech using the Neural TTS engine. It supports diverse audio content creation through fine-grained SSML speech control, two synthesis methods (real-time streaming and asynchronous synthesis), and subtitle synchronization via Speech Marks. It can also be used to build voice interaction systems through integration with Amazon Connect and Amazon Lex.