Amazon Rekognition
A machine learning-based visual analysis service that performs object detection, facial analysis, text detection, content moderation, and custom label detection on images and videos
Overview
Amazon Rekognition is a machine learning service that provides visual analysis of images and videos via API. Using pre-trained models, it can detect objects and scenes, detect/compare/search faces, detect text in images, detect inappropriate content, recognize celebrities, and detect PPE (personal protective equipment). The Custom Labels feature lets you train custom image classification and object detection models from a small number of training images (as few as dozens), enabling applications in manufacturing visual inspection and retail product recognition.
Image Analysis API Selection and Accuracy Characteristics
Rekognition's image analysis APIs are organized by purpose, and selecting the right API matters for both accuracy and cost. DetectLabels detects objects, scenes, and activities in an image, returning labels such as 'Dog', 'Beach', and 'Surfing' with confidence scores; it supports thousands of label categories and is useful for automatic image tagging and building search indexes. DetectFaces returns face positions, landmarks (eye, nose, and mouth coordinates), and attributes (age range, gender, emotions, presence of glasses, whether the eyes are open). Emotion analysis detects eight emotions, each with a confidence score: HAPPY, SAD, ANGRY, CONFUSED, DISGUSTED, SURPRISED, CALM, and FEAR. DetectModerationLabels detects inappropriate content (violence, nudity, drugs, and so on) and is used for content moderation on UGC (user-generated content) platforms. DetectText detects and recognizes text within images, reading characters from signs, license plates, and screenshots; it can detect up to 100 words per image and handles tilted and curved text.
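As a concrete sketch of the DetectLabels flow, the helpers below call the API on an S3-hosted image and filter the response by confidence. The client is passed in as a parameter (in practice, `boto3.client("rekognition")`); the bucket/key names and thresholds are illustrative assumptions, not fixed values.

```python
# Sketch of DetectLabels usage; pass in boto3.client("rekognition") as `client`.
# Bucket names, keys, and thresholds are illustrative assumptions.

def detect_labels_s3(client, bucket, key, max_labels=20, min_confidence=70):
    """Run DetectLabels on an image stored in S3 and return the raw response."""
    return client.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=max_labels,
        MinConfidence=min_confidence,  # server-side confidence floor
    )

def confident_labels(response, min_confidence=80.0):
    """Keep only (name, confidence) pairs above a stricter client-side floor."""
    return [
        (label["Name"], label["Confidence"])
        for label in response.get("Labels", [])
        if label["Confidence"] >= min_confidence
    ]
```

Applying a stricter client-side threshold on top of `MinConfidence` lets a strong label like 'Dog' at 97% pass while a weak guess at 55% is discarded, which keeps tagging indexes clean without re-calling the API.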
Face Collection and Face Search Design
Rekognition's face collection feature indexes facial feature vectors for high-speed face search. When you register face images to a collection with the IndexFaces API, facial features are extracted and stored as vectors. The SearchFacesByImage API takes a new face image as input and searches the collection for similar faces in milliseconds. A single collection can store up to 20 million faces, with search accuracy exceeding 99.9%. Applications include access control, event attendee verification, and person grouping in photo apps. An important design consideration is that face collections store only the feature vectors, not the original images; the standard design is to store the originals separately in S3 and link them via ExternalImageId. Face recognition accuracy depends heavily on lighting conditions, face angle, and image resolution, so frontal faces under adequate lighting, with a face region of at least 80x80 pixels, are recommended. From a privacy perspective, regulations on facial recognition technology vary by country, so you need to design processes for disclosing the purpose of use and obtaining consent in advance.
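The S3-linked design described above can be sketched as two helpers: one that indexes a face with an ExternalImageId pointing back to the original image, and one that searches the collection with a probe image. The collection ID, bucket names, and the 95% similarity threshold are illustrative assumptions; the client would be `boto3.client("rekognition")`.

```python
# Sketch of the face collection pattern; `client` is a boto3 Rekognition client.
# Collection IDs, keys, and thresholds below are illustrative assumptions.

def index_face(client, collection_id, bucket, key, external_id):
    """Register one face; ExternalImageId links back to the S3 original."""
    resp = client.index_faces(
        CollectionId=collection_id,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        ExternalImageId=external_id,
        MaxFaces=1,            # index only the most prominent face
        QualityFilter="AUTO",  # reject blurry or too-small faces
    )
    return [r["Face"]["FaceId"] for r in resp.get("FaceRecords", [])]

def search_face(client, collection_id, image_bytes, threshold=95.0):
    """Return (ExternalImageId, similarity) matches for a probe image."""
    resp = client.search_faces_by_image(
        CollectionId=collection_id,
        Image={"Bytes": image_bytes},
        FaceMatchThreshold=threshold,
        MaxFaces=5,
    )
    return [
        (m["Face"].get("ExternalImageId"), m["Similarity"])
        for m in resp.get("FaceMatches", [])
    ]
```

Because the collection stores only vectors, a match result carries the ExternalImageId, which the application then resolves to the original image or user record stored in S3 or a database.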
Video Analysis and Streaming Processing
Rekognition Video provides two modes: asynchronous analysis of video files stored in S3, and real-time streaming analysis from Kinesis Video Streams. For asynchronous analysis, you start a job with an API such as StartLabelDetection, StartFaceDetection, or StartPersonTracking, detect completion via an SNS notification, and then retrieve the results. Label and face detection run on frames sampled throughout the video, and results are returned with timestamps. For streaming analysis, real-time face searches can be executed against live video sent to Kinesis Video Streams; a typical use case is detecting specific individuals in surveillance camera footage. Custom Labels models are trained on images, but they can also be applied to video by analyzing sampled frames, enabling industrial applications such as near-real-time defect detection on manufacturing lines. Cost-wise, image analysis starts at 1 USD per 1,000 images (DetectLabels), and video analysis starts at 0.10 USD per minute of processing time. When processing large volumes of images, selectively analyzing only the images you need is more cost-efficient than batch-analyzing everything in S3.
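The asynchronous start-notify-retrieve flow can be sketched as follows: one helper starts a label-detection job on a video in S3 with an SNS completion channel, and another pages through GetLabelDetection results once the job is done. The topic and role ARNs are illustrative assumptions; in practice the SNS notification (or polling) tells you when to call the second helper.

```python
# Sketch of async video analysis; `client` is a boto3 Rekognition client.
# Bucket, topic, and role ARNs are illustrative assumptions.

def start_label_detection(client, bucket, key, sns_topic_arn, role_arn):
    """Kick off an async label-detection job on a video stored in S3."""
    resp = client.start_label_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        NotificationChannel={
            "SNSTopicArn": sns_topic_arn,  # where the completion event is published
            "RoleArn": role_arn,           # role Rekognition assumes to publish it
        },
    )
    return resp["JobId"]

def collect_labels(client, job_id):
    """Page through GetLabelDetection; yield (timestamp_ms, label, confidence)."""
    token = None
    while True:
        kwargs = {"JobId": job_id, "MaxResults": 1000}
        if token:
            kwargs["NextToken"] = token
        resp = client.get_label_detection(**kwargs)
        for item in resp.get("Labels", []):
            yield (item["Timestamp"],
                   item["Label"]["Name"],
                   item["Label"]["Confidence"])
        token = resp.get("NextToken")
        if not token:
            return
```

The millisecond timestamps in each result are what let you jump to the moment a label appears, for example to review only the segments of surveillance footage where a person was detected.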