Amazon Transcribe

An automatic speech recognition service that converts audio data to text, supporting both real-time streaming and batch processing with speaker identification and custom vocabulary

Overview

Amazon Transcribe is an automatic speech recognition (ASR) service that converts audio files and real-time audio streams to text. It supports over 100 languages and includes speaker diarization, custom vocabulary, automatic punctuation, profanity filtering, and automatic PII (Personally Identifiable Information) detection and masking. It also offers Transcribe Call Analytics for call center conversation analysis and Transcribe Medical for the healthcare domain.

Batch Processing and Streaming Processing Design Patterns

Transcribe's batch mode asynchronously transcribes audio files (MP3, MP4, WAV, FLAC, OGG, AMR, WebM) stored in S3. You start a job with the StartTranscriptionJob API and receive completion notifications through EventBridge, which can forward them to targets such as SNS. A job accepts audio up to 4 hours long or 2 GB in size and writes its results to S3 as JSON. This suits use cases that don't need real-time output, such as meeting recordings, podcast archives, and lecture transcriptions.

Streaming mode sends a live audio stream over WebSocket or HTTP/2 and returns transcription results within seconds. It's used for live captions, real-time meeting minutes, and real-time call center analysis. Partial results are returned incrementally and revised as recognition progresses; each result is finalized at an utterance boundary.

Recognition accuracy for Japanese tends to be lower than for English, particularly for technical terms and proper nouns. Registering a custom vocabulary can significantly improve accuracy for industry-specific and internal terminology.
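The batch and streaming flows described above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the helper names, job name, and bucket names are hypothetical, the request keys follow the StartTranscriptionJob parameters as exposed by boto3, and the streaming results are assumed to be plain dicts carrying ResultId, IsPartial, and Alternatives fields. The function that actually calls AWS is defined but never invoked here, since it needs boto3 and valid credentials.

```python
from typing import Iterable, Optional, Tuple


def build_batch_job_request(job_name: str, media_uri: str, output_bucket: str,
                            language_code: str = "en-US",
                            max_speakers: Optional[int] = None) -> dict:
    """Build keyword arguments for a StartTranscriptionJob call."""
    request = {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "Media": {"MediaFileUri": media_uri},  # audio file already stored in S3
        "OutputBucketName": output_bucket,     # the JSON result is written here
    }
    if max_speakers is not None:
        # Speaker diarization is opt-in and needs an upper bound on speakers.
        request["Settings"] = {"ShowSpeakerLabels": True,
                               "MaxSpeakerLabels": max_speakers}
    return request


def start_batch_job(request: dict) -> dict:
    """Submit the job. Requires boto3 and AWS credentials; not called below."""
    import boto3  # imported lazily so the rest of the sketch runs without it
    return boto3.client("transcribe").start_transcription_job(**request)


def fold_streaming_results(results: Iterable[dict]) -> Tuple[str, str]:
    """Collapse streaming results into (finalized text, still-pending text).

    Later results with the same ResultId replace earlier ones, which models
    partial results being revised until the utterance is finalized.
    """
    latest = {}  # ResultId -> (text, is_final)
    for result in results:
        alternatives = result.get("Alternatives") or []
        if not alternatives:
            continue
        latest[result["ResultId"]] = (alternatives[0]["Transcript"],
                                      not result["IsPartial"])
    final = " ".join(text for text, done in latest.values() if done)
    pending = " ".join(text for text, done in latest.values() if not done)
    return final, pending


# Hypothetical names, for illustration only.
request = build_batch_job_request(
    "meeting-recording-001",
    "s3://example-input-bucket/meeting.mp3",
    "example-output-bucket",
    max_speakers=4,
)
```

Keeping the request construction separate from the network call makes the parameters easy to inspect and test. The folding function shows why a live-caption display should overwrite text for a given result rather than append it.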

Custom Vocabulary and Language Model Tuning

Custom Vocabulary lets you register words and phrases that Transcribe's standard model struggles to recognize. You specify entries in table format with Phrase, SoundsLike (pronunciation hint), and DisplayAs (display format) columns. For example, to register the product name "CloudHSM," set Phrase to "CloudHSM," SoundsLike to "cloud-H-S-M," and DisplayAs to "CloudHSM." A custom vocabulary can be up to 50 KB in size, and to use one you simply name it when you start a job.

Custom Language Models are a more advanced tuning mechanism: you provide domain-specific text (manuals, meeting minutes, emails, and so on) as training data to build a domain-specialized language model. Custom vocabulary improves recognition of individual words, whereas a custom language model also improves accuracy that depends on context. For domains rich in specialized terminology, such as healthcare, legal, and finance, combining the two is most effective.

Vocabulary Filters mask or remove specific words, automatically stripping inappropriate expressions or sensitive terms from transcription results.
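As a rough sketch of the table format described above, the following Python builds a tab-separated vocabulary table from the CloudHSM example and the job settings that reference a vocabulary and a vocabulary filter by name. The helper names and the vocabulary and filter names are hypothetical. The column list defaults to the three columns named in the text; the exact header layout the service expects (for instance, whether further columns are required) is an assumption to check against the current documentation.

```python
from typing import Iterable, Mapping, Optional, Sequence

# Columns as named in the text above; adjust if the service expects others.
DEFAULT_COLUMNS = ("Phrase", "SoundsLike", "DisplayAs")


def build_vocabulary_table(entries: Iterable[Mapping[str, str]],
                           columns: Sequence[str] = DEFAULT_COLUMNS) -> str:
    """Render vocabulary entries as tab-separated text with a header row."""
    lines = ["\t".join(columns)]
    for entry in entries:
        # Missing fields become empty cells so every row has the same width.
        lines.append("\t".join(entry.get(column, "") for column in columns))
    return "\n".join(lines) + "\n"


def vocabulary_settings(vocabulary_name: str,
                        filter_name: Optional[str] = None,
                        filter_method: str = "mask") -> dict:
    """Build the Settings portion of a transcription job request."""
    settings = {"VocabularyName": vocabulary_name}
    if filter_name is not None:
        # "mask" replaces filtered words; "remove" deletes them from the text.
        settings["VocabularyFilterName"] = filter_name
        settings["VocabularyFilterMethod"] = filter_method
    return settings


table = build_vocabulary_table([
    {"Phrase": "CloudHSM", "SoundsLike": "cloud-H-S-M", "DisplayAs": "CloudHSM"},
])
settings = vocabulary_settings("product-terms", filter_name="blocked-words")
```

The table text would be uploaded to S3 and registered as a custom vocabulary before any job refers to it by name. The settings dictionary is what a job request would carry under its Settings key.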

Call Analytics and Transcribe Medical

Transcribe Call Analytics is specialized for call center conversation analysis: in addition to transcription, it automatically performs sentiment analysis, issue detection, and call summarization. Sentiment analysis classifies each utterance as positive, negative, neutral, or mixed, so the emotional trajectory of the entire call can be visualized on a timeline. Issue detection automatically identifies customer complaints and requests such as "I want to return this," "I want to cancel," or "Let me speak to a supervisor." Call summarization extracts key points from the conversation, automating the post-call notes that agents previously wrote by hand.

Transcribe Medical uses a healthcare-specialized model with significantly higher recognition accuracy for medical terminology, drug names, and anatomical terms than the standard model. It is HIPAA-eligible and is used for physician dictation and for transcribing conversations with patients.

Pricing is approximately 0.00024 USD per second of audio (roughly 0.864 USD per hour) for batch processing and approximately 0.00036 USD per second (roughly 1.296 USD per hour) for streaming. Call Analytics incurs additional charges of approximately 0.02 USD per minute.
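The pricing arithmetic above can be checked with a short calculation. This is only a sketch: it uses the approximate per-second and per-minute rates quoted in the text as placeholders, treats the Call Analytics charge as an add-on to the transcription rate (an assumption based on the wording "additional charges"), and ignores regional differences, usage tiers, and any minimum billing increments. The function name is hypothetical.

```python
from decimal import Decimal

# Approximate rates quoted in the text; actual pricing may differ.
BATCH_USD_PER_SECOND = Decimal("0.00024")
STREAMING_USD_PER_SECOND = Decimal("0.00036")
CALL_ANALYTICS_USD_PER_MINUTE = Decimal("0.02")


def estimate_cost_usd(audio_seconds: int, streaming: bool = False,
                      call_analytics: bool = False) -> Decimal:
    """Estimate the cost of transcribing audio of the given length."""
    rate = STREAMING_USD_PER_SECOND if streaming else BATCH_USD_PER_SECOND
    cost = rate * audio_seconds
    if call_analytics:
        # Assumed to be charged per minute on top of the transcription rate.
        cost += CALL_ANALYTICS_USD_PER_MINUTE * Decimal(audio_seconds) / 60
    return cost.quantize(Decimal("0.0001"))


one_hour = 60 * 60
print(estimate_cost_usd(one_hour))                       # 0.8640
print(estimate_cost_usd(one_hour, streaming=True))       # 1.2960
print(estimate_cost_usd(one_hour, call_analytics=True))  # 2.0640
```

Decimal is used instead of float so that the small per-second rates multiply out exactly, matching the 0.864 USD per hour figure for batch processing.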
