Amazon Polly (Specialized, since 2016)
A text-to-speech service that converts text into natural-sounding audio
What It Does
Amazon Polly is a text-to-speech (TTS) service that converts text into realistic speech. It offers dozens of voices in over 30 languages, with natural-sounding output powered by a neural TTS engine. SSML (Speech Synthesis Markup Language) lets you adjust speech rate, pitch, and pauses.
Use Cases
Improving website and app accessibility (screen reader support), generating e-learning narration, audio delivery of news articles, voice generation for IVR (interactive voice response) systems, and audio output for IoT devices.
Everyday Analogy
Think of a professional narrator. Hand over a script (text), and they read it naturally in the voice and language you specify. You can even give direction (SSML) like "slow down here" or "emphasize this part."
What Is Polly?
Amazon Polly is an AI service that converts text to speech. It offers two engines - Standard and Neural - with the Neural engine producing more natural, human-like speech. For Japanese, voices like Mizuki (female) and Takumi (male) are available. Generated audio can be downloaded or streamed in MP3, Ogg Vorbis, or PCM format.
SSML and Voice Customization
SSML tags give you fine-grained control over speech. Use <break> to insert pauses, <prosody> to change speed or pitch, <emphasis> to stress words, and <phoneme> to specify pronunciation. You can also choose speaking styles like newscaster or conversational for voices that support them, depending on the use case. Long texts can be processed with asynchronous synthesis tasks, with results saved to S3.
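The pause, rate, and emphasis controls described above can be sketched in Python as follows. This is a minimal illustration, not an official recipe: build_ssml and start_long_synthesis are hypothetical helper names, the bucket name is a placeholder, and the client is assumed to be a boto3 Polly client passed in by the caller. As far as the Polly documentation indicates, the emphasis tag is only honored by Standard voices, so the sketch defaults to the Standard engine.

```python
from xml.sax.saxutils import escape


def build_ssml(intro: str, key_point: str, pause_ms: int = 500, rate: str = "slow") -> str:
    """Return an SSML document: intro, a pause, then a slower, stressed phrase.

    Hypothetical helper for illustration. Text is XML-escaped so that
    characters such as '&' do not break the markup.
    """
    return (
        "<speak>"
        f"{escape(intro)}"
        f'<break time="{pause_ms}ms"/>'
        f'<prosody rate="{rate}">'
        f'<emphasis level="strong">{escape(key_point)}</emphasis>'
        "</prosody>"
        "</speak>"
    )


def start_long_synthesis(polly_client, ssml: str, bucket: str,
                         voice_id: str = "Mizuki", engine: str = "standard") -> str:
    """Start an asynchronous synthesis task and return its task ID.

    polly_client is assumed to be boto3.client("polly"); the audio file is
    written to the given S3 bucket when the task completes.
    """
    response = polly_client.start_speech_synthesis_task(
        Text=ssml,
        TextType="ssml",          # tell Polly the input is SSML, not plain text
        VoiceId=voice_id,
        Engine=engine,
        OutputFormat="mp3",
        OutputS3BucketName=bucket,
    )
    return response["SynthesisTask"]["TaskId"]
```

With real credentials, a call might look like start_long_synthesis(boto3.client("polly"), build_ssml("Welcome.", "Please listen carefully."), "my-audio-bucket"), where the bucket name is an assumption. The task status can then be polled with get_speech_synthesis_task.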
Getting Started
In the Polly console, go to the "Text-to-Speech" tab, enter your text, select a voice, and click "Listen." To use the API, pass the text, a voice ID, and an output format to the SynthesizeSpeech API. The free tier includes 5 million characters (Standard) / 1 million characters (Neural) per month for the first 12 months.
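As a rough sketch of the API route, the SynthesizeSpeech call can be wrapped in a small Python function. The function name synthesize_to_file and the output path are hypothetical, and the client is assumed to be created by the caller with boto3.client("polly") under valid AWS credentials. The defaults (Takumi, Neural, MP3) are illustrative choices, not requirements.

```python
from pathlib import Path


def synthesize_to_file(polly_client, text: str, out_path: str,
                       voice_id: str = "Takumi", engine: str = "neural",
                       output_format: str = "mp3") -> int:
    """Synthesize text with Polly and write the audio to out_path.

    polly_client is assumed to be a boto3 Polly client.
    Returns the number of audio bytes written.
    """
    response = polly_client.synthesize_speech(
        Text=text,
        VoiceId=voice_id,
        Engine=engine,
        OutputFormat=output_format,
    )
    # AudioStream is a file-like streaming body; read it fully into memory.
    audio = response["AudioStream"].read()
    Path(out_path).write_bytes(audio)
    return len(audio)


# Example usage (requires boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   polly = boto3.client("polly")
#   synthesize_to_file(polly, "こんにちは、Amazon Polly です。", "hello.mp3")
```

Passing the client in as an argument keeps the function easy to test with a stand-in object and leaves region and credential configuration to the caller.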
Things to Watch Out For
- The Neural engine is higher quality but costs roughly 4x more per character than Standard - choose based on your use case
- Redistributing generated audio is allowed within the terms of service, but presenting Polly-generated speech as a human voice is prohibited
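To make the Standard versus Neural cost trade-off concrete, here is a small estimator. The prices are assumptions (USD 4 and USD 16 per million characters, consistent with the roughly 4x ratio mentioned above) and should be checked against the current Polly pricing page; estimate_monthly_cost is a hypothetical helper name.

```python
# Assumed list prices in USD per 1 million characters (verify before relying on them).
PRICE_PER_MILLION = {"standard": 4.00, "neural": 16.00}


def estimate_monthly_cost(characters: int, engine: str, free_characters: int = 0) -> float:
    """Estimate the monthly charge for a given character volume.

    free_characters is the free-tier allowance to subtract, if any.
    """
    billable = max(characters - free_characters, 0)
    return billable / 1_000_000 * PRICE_PER_MILLION[engine]


# 10 million characters per month, with no free tier applied
standard_cost = estimate_monthly_cost(10_000_000, "standard")  # 40.0
neural_cost = estimate_monthly_cost(10_000_000, "neural")      # 160.0
```

Under these assumed prices, the same 10 million characters cost 40 USD on Standard and 160 USD on Neural, so high-volume workloads that do not need the extra naturalness may be better served by Standard.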