Building a Real-Time Data Pipeline with Amazon Kinesis - Choosing Between Data Streams and Data Firehose
Ingest real-time data with Data Streams and automatically deliver it to S3, Redshift, and OpenSearch with Data Firehose. This article explains how to build a streaming pipeline with shard design and on-demand mode selection.
Overview of Kinesis
Kinesis is a family of services for collecting, processing, and analyzing real-time streaming data. Data Streams provides real-time processing with custom consumers, Data Firehose provides automatic delivery to destinations such as S3 and Redshift, and Amazon Managed Service for Apache Flink runs SQL and Flink applications against stream data.
Choosing Between Data Streams and Firehose
Data Streams manages throughput at the shard level, with custom processing implemented via Lambda or KCL (Kinesis Client Library). It is ideal for real-time alerts requiring sub-second latency or when multiple consumers need to read from the same stream. Data Firehose automatically buffers data from producers and delivers it to S3, Redshift, OpenSearch, and Splunk. You can apply data transformation (Lambda) and format conversion (Parquet) before delivery, eliminating the need for consumer implementation. Firehose is the best choice for log aggregation and delivery to analytics platforms.
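The Lambda transformation step mentioned above can be sketched as follows. This is a minimal illustration, not a production handler: it assumes the incoming records are JSON objects, and the `processed` field is a hypothetical enrichment added only for the example. The record contract (a base64-encoded `data` field, a `recordId` echoed back, and a `result` of `Ok` or `ProcessingFailed`) follows the Firehose data transformation format.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transformation sketch: enrich each JSON record before delivery."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Undecodable or non-object input: return it unchanged and mark it
            # failed so Firehose can route it to its error output.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue

        payload["processed"] = True  # hypothetical enrichment field
        # Newline-delimit records so delivered objects contain one JSON document per line.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}
```

Because the handler only reshapes the event it receives, it can be exercised locally with a hand-built event before being attached to a delivery stream.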
Data Streams Design Patterns
The number of shards in a data stream is determined by throughput requirements. Each shard accepts writes of up to 1 MB/sec or 1,000 records/sec, whichever limit is reached first, and serves reads of up to 2 MB/sec, shared by all standard consumers. On-demand mode adjusts capacity automatically, while provisioned mode requires you to set and change the shard count yourself. Partition key design controls how records are distributed across shards: a key with many distinct values spreads load evenly, whereas a low-cardinality key concentrates traffic on a few shards and creates hot shards. Enabling enhanced fan-out gives each registered consumer a dedicated 2 MB/sec of read throughput per shard, so multiple consumers can process the same stream without competing for read capacity. KCL (Kinesis Client Library) automatically manages shard rebalancing and checkpointing, simplifying consumer application development.
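The sizing rule and the hot-shard effect can be illustrated with a short sketch. It uses the per-shard limits quoted above and relies on the fact that Kinesis maps a partition key to a 128-bit value with MD5. The function names are hypothetical, and the assumption that every shard owns an equal slice of the hash range holds only for a freshly created stream that has not been split or merged unevenly.

```python
import hashlib
import math
from collections import Counter

# Per-shard limits quoted in the text.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0  # shared by standard (non fan-out) consumers


def required_shards(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_per_sec / READ_MB_PER_SHARD),
    )


def shard_for_key(partition_key, shard_count):
    """Shard index for a key, assuming shards split the 128-bit hash range evenly."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // 2**128


def shard_load(partition_keys, shard_count):
    """Count how many records land on each shard index."""
    return Counter(shard_for_key(key, shard_count) for key in partition_keys)


if __name__ == "__main__":
    # 4.5 MB/sec in, 3,000 records/sec, 12 MB/sec total shared reads.
    print(required_shards(4.5, 3000, 12))
    # One constant key sends every record to the same shard (a hot shard).
    print(shard_load(["tenant-a"] * 1000, 4))
    # A high-cardinality key spreads records across the shards.
    print(shard_load([f"device-{i}" for i in range(1000)], 4))
```

In the first call the read requirement, not the write requirement, drives the shard count; moving readers to enhanced fan-out would remove that constraint from the calculation.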
Kinesis Cost Optimization
Data Streams provisioned mode charges per shard hour (approximately $0.015/hour) and per PUT payload unit (counted in 25 KB increments, approximately $0.014 per million units). On-demand mode charges by data ingestion volume (approximately $0.08 per GB) and read volume, which suits workloads with variable traffic. Data Firehose charges only for ingested data volume (approximately $0.029 per GB), with no shard management required. Because each record is billed as at least one PUT payload unit, record aggregation (combining multiple small records into a single PUT) reduces the unit count and lowers Data Streams costs. Extending the data retention period beyond the default 24 hours incurs additional charges, so set it to the minimum required for your replay needs. Prices vary by Region, so confirm current rates before estimating.
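The effect of these pricing dimensions can be estimated with simple arithmetic. The sketch below uses the approximate prices quoted above; the 730-hour month, the 25 KB = 25 × 1024 bytes convention, and the function names are assumptions made for the example. The on-demand figure covers ingestion only and omits read volume and any other charges.

```python
import math

HOURS_PER_MONTH = 730                 # assumption: average month length
SHARD_HOUR_USD = 0.015                # approximate, from the text
PUT_UNIT_BYTES = 25 * 1024            # assumption: 25 KB counted as 25 * 1024 bytes
USD_PER_MILLION_PUT_UNITS = 0.014     # approximate, from the text
ON_DEMAND_INGEST_USD_PER_GB = 0.08    # approximate, from the text


def put_payload_units(record_bytes):
    """Each record is billed in 25 KB increments, with a minimum of one unit."""
    return max(1, math.ceil(record_bytes / PUT_UNIT_BYTES))


def provisioned_monthly_usd(shards, records_per_sec, record_bytes):
    """Shard-hour cost plus PUT payload unit cost for a steady workload."""
    shard_cost = shards * SHARD_HOUR_USD * HOURS_PER_MONTH
    units = records_per_sec * 3600 * HOURS_PER_MONTH * put_payload_units(record_bytes)
    return shard_cost + units / 1_000_000 * USD_PER_MILLION_PUT_UNITS


def on_demand_ingest_monthly_usd(records_per_sec, record_bytes):
    """Ingestion-only on-demand cost; read charges are not modelled."""
    gigabytes = records_per_sec * record_bytes * 3600 * HOURS_PER_MONTH / 1024**3
    return gigabytes * ON_DEMAND_INGEST_USD_PER_GB


if __name__ == "__main__":
    # 1,000 records/sec of 1 KB each, sent individually on one shard.
    print(round(provisioned_monthly_usd(1, 1000, 1024), 2))
    # Same data volume, aggregated into 40 records/sec of 25 KB each.
    print(round(provisioned_monthly_usd(1, 40, 25 * 1024), 2))
    # Same data volume in on-demand mode (ingestion only).
    print(round(on_demand_ingest_monthly_usd(1000, 1024), 2))
```

The two provisioned calls carry the same bytes per second. Only the number of PUT payload units differs, which is why aggregation lowers the bill without changing the shard count.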
Summary
Kinesis is a real-time streaming data processing platform. Data Streams enables custom real-time processing, with enhanced fan-out for parallel processing across multiple consumers, while Data Firehose automates delivery to destinations such as S3 and Redshift. In Data Streams, record aggregation reduces the PUT payload unit count and therefore cost, and on-demand mode eliminates the operational overhead of shard management.