Speech-to-Text - Building a High-Accuracy Automatic Transcription Platform with Amazon Transcribe
Learn how to convert speech to text (STT) with Amazon Transcribe and build a bidirectional voice processing pipeline by combining it with Amazon Polly. Covers real-time transcription, speaker identification, and accuracy improvements with custom vocabularies.
Growing Demand for Speech-to-Text and Key Features of Amazon Transcribe
From meeting minutes and call center analytics to video subtitles and medical voice dictation, the need to convert speech to text is expanding rapidly. Amazon Transcribe is a deep learning-based automatic speech recognition (ASR) service that converts audio files and real-time audio streams into text with high accuracy. It supports over 100 languages and dialects, with strong recognition accuracy for Japanese, and provides built-in post-processing features such as automatic punctuation, number formatting, and profanity filtering.

Below is an example of the endpoint used for Transcribe's real-time streaming over WebSocket.

```javascript
// Base endpoint for Transcribe streaming over WebSocket (Tokyo region).
// A real connection must also carry SigV4 presigned query parameters.
const url =
  "wss://transcribestreaming.ap-northeast-1.amazonaws.com:8443" +
  "/stream-transcription-websocket" +
  "?language-code=ja-JP&media-encoding=pcm&sample-rate=16000";
```
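The streaming endpoint shown above can also be assembled from its parts, which makes it easier to switch region, language, or sample rate. The following is a minimal sketch: `buildStreamingEndpoint` is a hypothetical helper name, and the URL it returns is unsigned, so it would still need SigV4 presigning before a WebSocket connection could be opened.

```javascript
// Hypothetical helper: assembles the unsigned Transcribe streaming endpoint.
// Presigning (X-Amz-* query parameters) is intentionally out of scope here.
function buildStreamingEndpoint({
  region,
  languageCode,
  mediaEncoding = "pcm",
  sampleRate = 16000,
}) {
  // URLSearchParams keeps insertion order and handles escaping.
  const query = new URLSearchParams({
    "language-code": languageCode,
    "media-encoding": mediaEncoding,
    "sample-rate": String(sampleRate),
  });
  const host = `transcribestreaming.${region}.amazonaws.com:8443`;
  return `wss://${host}/stream-transcription-websocket?${query}`;
}

const endpoint = buildStreamingEndpoint({
  region: "ap-northeast-1",
  languageCode: "ja-JP",
});
console.log(endpoint);
```

Keeping the parameters in one object avoids the stray whitespace that creeps in when a long URL is split across lines inside a template literal.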
Real-Time Transcription and Batch Processing
Transcribe offers two modes: real-time streaming and batch processing. In real-time streaming, audio is sent over a WebSocket connection and text results are received within seconds. This is ideal for live meeting captions, real-time call center assistants, and automatic subtitles for live broadcasts. Partial Results allow displaying in-progress text while someone is still speaking, then updating to the final text once the utterance is complete.

In batch processing, audio files stored in S3 are processed asynchronously, with results output as JSON to S3. This is useful for bulk transcription of recorded files and making archived audio searchable.

Both modes can separate speakers. Speaker Diarization automatically distinguishes multiple speakers and records who said what and when. Channel Identification transcribes the left and right channels of a stereo recording separately, so each channel can be treated as its own speaker.
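The Partial Results behavior described above can be sketched as a small caption buffer: partial text overwrites the in-progress line, and a final result commits it. This is an illustrative sketch, not the only way to do it. `createCaptionBuffer` is a hypothetical name, and the event shape (`Transcript.Results[].IsPartial` and `Alternatives[].Transcript`) is assumed to match the streaming transcript events after they have been decoded to JSON.

```javascript
// Minimal caption buffer for streaming results.
// Assumed event shape:
//   { Transcript: { Results: [{ IsPartial, Alternatives: [{ Transcript }] }] } }
function createCaptionBuffer() {
  const finalized = []; // utterances marked complete by Transcribe
  let inProgress = "";  // latest partial text for the current utterance

  return {
    apply(event) {
      const results = event?.Transcript?.Results ?? [];
      for (const result of results) {
        const text = result.Alternatives?.[0]?.Transcript ?? "";
        if (result.IsPartial) {
          inProgress = text; // each partial supersedes the previous one
        } else {
          finalized.push(text); // commit the final text
          inProgress = "";
        }
      }
    },
    // Finalized lines followed by the in-progress line, if any.
    render() {
      return [...finalized, inProgress].filter(Boolean).join("\n");
    },
  };
}

const captions = createCaptionBuffer();
captions.apply({
  Transcript: {
    Results: [{ IsPartial: true, Alternatives: [{ Transcript: "Hello every" }] }],
  },
});
console.log(captions.render());
```

A UI would call `render()` after every event, so viewers see text appear while the speaker is still talking and see it settle once the final result arrives.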
Custom Vocabularies and Approaches to Improving Accuracy
Transcribe's custom vocabulary feature improves recognition accuracy for industry-specific terminology, product names, and personal names. By registering such terms in a custom vocabulary, along with pronunciation hints (for example, in IPA notation) where supported, words that the standard model struggles with can be transcribed accurately. Custom Language Models (CLM) fine-tune the model with domain-specific text data, achieving recognition accuracy optimized for a particular industry or organization's context. Transcribe Medical is a model specialized for the healthcare domain, accurately recognizing medical terminology, drug names, and anatomical terms. It is a HIPAA-eligible service and can be used for medical record voice input and automated clinical note generation. Transcribe Call Analytics specializes in call center analysis, providing sentiment detection, call categorization, and automatic issue detection.
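A custom vocabulary is applied by naming it when a transcription job is started. The sketch below builds the request for a batch job that opts into a custom vocabulary and speaker labels. The field names follow the `StartTranscriptionJob` API, while `buildTranscriptionJob`, the bucket names, and the vocabulary name are hypothetical placeholders, and the vocabulary is assumed to have been created beforehand.

```javascript
// Builds the input for a batch transcription job.
// Bucket, job, and vocabulary names used below are made-up examples.
function buildTranscriptionJob({
  jobName,
  mediaUri,
  outputBucket,
  languageCode = "ja-JP",
  vocabularyName, // optional: name of an existing custom vocabulary
  maxSpeakers,    // optional: enables speaker diarization when set
}) {
  const settings = {};
  if (vocabularyName) settings.VocabularyName = vocabularyName;
  if (maxSpeakers) {
    settings.ShowSpeakerLabels = true;
    settings.MaxSpeakerLabels = maxSpeakers;
  }
  return {
    TranscriptionJobName: jobName,
    LanguageCode: languageCode,
    Media: { MediaFileUri: mediaUri },
    OutputBucketName: outputBucket,
    Settings: settings,
  };
}

const jobInput = buildTranscriptionJob({
  jobName: "weekly-meeting-001",
  mediaUri: "s3://example-input-bucket/meeting.wav",
  outputBucket: "example-output-bucket",
  vocabularyName: "example-product-terms",
  maxSpeakers: 4,
});
console.log(JSON.stringify(jobInput, null, 2));
```

With the AWS SDK for JavaScript v3, an object like `jobInput` would typically be passed to `StartTranscriptionJobCommand` from `@aws-sdk/client-transcribe`. Separating request construction from the SDK call keeps the settings easy to review and unit test.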
Bidirectional Voice Processing with Polly
Combining Transcribe and Polly enables building a bidirectional voice processing pipeline from speech input through text processing to speech output. The workflow converts user speech to text with Transcribe, runs natural language processing or business logic in Lambda, then converts the response to speech with Polly. Integrating with Amazon Lex adds intent recognition and slot extraction for a complete voice dialog system. With Amazon Connect, you can embed high-accuracy speech recognition and natural speech synthesis into contact center IVR (Interactive Voice Response) systems. For multilingual needs, you can build a real-time interpretation pipeline that recognizes speech with Transcribe, translates text with Amazon Translate, and generates audio in the target language with Polly. Integration with Kinesis Video Streams also enables real-time transcription of audio tracks from live video.
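Much of the glue between Transcribe and Polly is data shaping, which can be sketched without calling either service. The example below assumes the batch output layout (`results.transcripts[0].transcript`) on the Transcribe side and the `SynthesizeSpeech` input fields on the Polly side. `extractTranscript`, `decideReply`, and `buildPollyRequest` are hypothetical names, `decideReply` is a stand-in for whatever logic would run in Lambda, and the voice is an arbitrary choice.

```javascript
// Step 1: pull the full transcript out of Transcribe's batch output JSON.
function extractTranscript(output) {
  return output?.results?.transcripts?.[0]?.transcript ?? "";
}

// Step 2: hypothetical business logic standing in for the Lambda step.
function decideReply(text) {
  return text.trim() === ""
    ? "Sorry, I could not hear you."
    : `You said: ${text}`;
}

// Step 3: build the input for Polly's SynthesizeSpeech operation.
// The default voice is an assumption; pick one that matches the language.
function buildPollyRequest(replyText, voiceId = "Joanna") {
  return { Text: replyText, OutputFormat: "mp3", VoiceId: voiceId };
}

// Sample Transcribe output, trimmed to the field used here.
const sampleOutput = {
  results: { transcripts: [{ transcript: "What time do you open?" }] },
};

const pollyInput = buildPollyRequest(
  decideReply(extractTranscript(sampleOutput))
);
console.log(pollyInput);
```

In a deployed pipeline, the resulting object would be handed to the Polly client, and a real-time flow would read text from streaming events rather than from a batch output file.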
Transcribe Pricing
Transcribe pricing is based on the number of seconds of audio processed. Batch transcription costs approximately $0.00024 per second (about $0.864 per hour), and the free tier includes 60 minutes per month for the first 12 months. Streaming transcription has a similar per-second rate. Call Analytics adds an analysis fee of approximately $0.02 per minute on top of the standard transcription rate. There is no additional charge for using custom vocabularies, but training Custom Language Models incurs separate costs. Transcribe Medical costs approximately $0.000575 per second, about 2.4 times the standard rate.
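The arithmetic behind these figures is simple enough to script. The sketch below uses only the approximate rates quoted above, which may differ from current or regional pricing. `estimateCost` and its options are hypothetical names, and Call Analytics and Custom Language Model charges are left out.

```javascript
// Approximate rates quoted in this article, in USD per second of audio.
const RATES = { standard: 0.00024, medical: 0.000575 };
const FREE_TIER_SECONDS = 60 * 60; // 60 free minutes per month

// Estimates the transcription cost for one month of audio.
// freeTier: subtract the monthly free allowance before billing.
function estimateCost(audioSeconds, { tier = "standard", freeTier = false } = {}) {
  const billable = freeTier
    ? Math.max(0, audioSeconds - FREE_TIER_SECONDS)
    : audioSeconds;
  return billable * RATES[tier];
}

// One hour of standard audio, rounded for display.
console.log(estimateCost(3600).toFixed(3));
// Ten hours with the free tier applied (nine billable hours).
console.log(estimateCost(36000, { freeTier: true }).toFixed(3));
// Ratio of the medical rate to the standard rate.
console.log((RATES.medical / RATES.standard).toFixed(1));
```

The rounded results line up with the text: one hour at the standard rate comes to about $0.864, and the medical rate is roughly 2.4 times the standard one.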
Summary - Building a Speech-to-Text Platform
Amazon Transcribe is a fully managed service that provides high-accuracy, deep learning-based speech-to-text conversion. With both real-time streaming and batch processing modes, speaker diarization, custom vocabulary support, and coverage of over 100 languages, it serves as a versatile foundation for voice applications. Combined with Polly for bidirectional voice processing, Lex for voice dialog systems, and Translate for real-time interpretation, it supports a wide range of use cases. Integration with S3, Lambda, and Kinesis Video Streams enables building serverless voice processing pipelines.