Designing Streaming Data Processing - Building Real-Time Data Pipelines with Kinesis

Learn streaming data processing design techniques with Amazon Kinesis, including building real-time data pipelines with Data Streams, Data Firehose, and Lambda integration.

Growing Demand for Real-Time Data Processing and the Role of Kinesis

The demand for instantly processing and analyzing large volumes of data generated in real time, such as IoT sensor data, web application clickstreams, financial transaction logs, and social media feeds, is expanding rapidly. While batch processing introduces latency of hours to days, streaming processing delivers analysis results within seconds of data generation. Amazon Kinesis is a family of fully managed services for collecting, processing, and analyzing streaming data in real time. Kinesis Data Streams provides the data stream foundation, Kinesis Data Firehose handles data delivery, and Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink) enables SQL and Apache Flink analysis of streaming data.

Data Collection with Kinesis Data Streams

Kinesis Data Streams is a service for collecting and retaining large volumes of streaming data in real time. A data stream is composed of shards; each shard supports 1 MB/s (or 1,000 records/s) of writes and 2 MB/s of reads shared across all consumers. On-demand mode automatically scales shard capacity with traffic, eliminating the need for capacity planning, while provisioned mode lets you explicitly specify the shard count to optimize costs. The default data retention period is 24 hours, extendable up to 365 days. The Kinesis Producer Library (KPL) maximizes producer-side throughput through record aggregation and buffering. The enhanced fan-out feature provides each registered consumer with its own dedicated read throughput of 2 MB/s per shard, allowing multiple consumers to process the same stream in parallel without contending for the shared limit.
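On the producer side, writes can also be batched with the plain AWS SDK; a minimal boto3 sketch, assuming a hypothetical sensor-stream stream and device_id field (the KPL layers aggregation and buffering on top of this same API):

```python
import json


def build_records(events):
    """Convert event dicts into PutRecords entries. The partition key
    determines the target shard, so records sharing a device_id keep
    their ordering within a single shard."""
    return [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e["device_id"])}
        for e in events
    ]


def send(events, stream_name="sensor-stream"):
    """Write up to 500 records in one PutRecords call (the API limit)."""
    import boto3  # AWS SDK for Python

    client = boto3.client("kinesis")
    resp = client.put_records(StreamName=stream_name, Records=build_records(events))
    # Throttled entries come back with an ErrorCode in the response;
    # production code should retry only those, not the whole batch.
    return resp["FailedRecordCount"]
```

Keying the partition on a device ID is a design choice: it preserves per-device ordering at the cost of uneven shard load if a few devices dominate traffic.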

Serverless Stream Processing with Lambda Integration

The integration of Kinesis Data Streams with Lambda is widely adopted as a serverless stream processing pattern. Lambda automatically polls records from the Kinesis stream via event source mappings and passes them to the Lambda function in batches. You can optimize the balance between throughput and latency by adjusting batch size, batch window, and parallelization factor. Setting the parallelization factor enables multiple Lambda instances to process a single shard in parallel, increasing processing capacity. For error handling, the bisect-on-function-error feature automatically splits failed batches in half for retry, helping identify problematic records, and failed records can be sent to an SQS dead-letter queue for later investigation and reprocessing. The event filtering feature drops records that do not match a condition before they ever reach the Lambda function, avoiding unnecessary invocations and optimizing costs. For example, the following CLI command creates an event source mapping that passes only ORDER_PLACED events to Lambda:

aws lambda create-event-source-mapping \
  --function-name process-orders \
  --event-source-arn arn:aws:kinesis:ap-northeast-1:123456789012:stream/orders \
  --starting-position LATEST \
  --batch-size 100 \
  --maximum-batching-window-in-seconds 5 \
  --filter-criteria '{"Filters":[{"Pattern":"{\"data\":{\"event_type\":[\"ORDER_PLACED\"]}}"}]}'
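The function side of this pattern can be sketched as follows. The `process` business logic is hypothetical, and returning `batchItemFailures` takes effect only when ReportBatchItemFailures is enabled on the event source mapping, in which case Lambda retries from the first failed record instead of the whole batch:

```python
import base64
import json


def handler(event, context):
    """Kinesis-triggered Lambda handler with partial batch responses.
    Record payloads arrive base64-encoded inside event["Records"]."""
    failures = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        try:
            process(payload)
        except Exception:
            # Report the failing record's sequence number so only it
            # (and later records in the batch) are retried.
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}


def process(payload):
    """Hypothetical business logic, e.g. persisting the order event."""
    if payload.get("event_type") != "ORDER_PLACED":
        raise ValueError("unexpected event type")
```

Returning an empty `batchItemFailures` list tells Lambda the whole batch succeeded, so the checkpoint advances past it.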

Delivery with Data Firehose and S3 Integration

Kinesis Data Firehose is a service that automatically delivers streaming data to destinations such as S3, Redshift, OpenSearch, and Splunk. It fully automates the process from data reception through buffering, transformation, compression, and delivery, eliminating the need to develop consumer applications. Buffer size (1-128 MB) and buffer interval (60-900 seconds) control delivery frequency and batch size. The data transformation feature uses Lambda functions to perform format conversion, filtering, and enrichment on records before delivery. For S3 delivery, automatic conversion to Parquet or ORC format is available, enabling construction of data lakes optimized for analysis with Athena or Redshift Spectrum. Dynamic partitioning determines S3 prefixes dynamically based on record content, enabling efficient data organization.
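The Lambda transformation contract mentioned above can be sketched as follows: each incoming record must be echoed back with its recordId and a result of Ok, Dropped, or ProcessingFailed, with transformed data re-encoded as base64. The drop condition and enrichment field here are hypothetical examples:

```python
import base64
import json


def handler(event, context):
    """Firehose data-transformation Lambda (sketch)."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if "user_id" not in payload:
            # Hypothetical filter: silently discard malformed records.
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        payload["source"] = "firehose"  # hypothetical enrichment field
        # Newline-delimit so objects in S3 are queryable line-by-line.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": base64.b64encode(data).decode("ascii")})
    return {"records": output}
```

Records marked ProcessingFailed (not shown) are delivered to an error prefix in the destination bucket rather than lost.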

Kinesis Pricing

Kinesis Data Streams on-demand mode costs approximately $0.08 per GB written and $0.04 per GB read, plus an hourly per-stream charge. Provisioned mode costs approximately $0.015 per shard-hour, with PUT payload units billed separately. Kinesis Data Firehose costs approximately $0.029 per GB ingested. Prices vary by region, so check the current AWS pricing pages before estimating. Compared to MSK (Kafka), Kinesis offers richer native integration with AWS services and is more cost-effective for small to medium-scale streaming; for large-scale workloads that require the Kafka ecosystem, MSK is the better choice.
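A back-of-envelope comparison using the approximate rates above makes the mode trade-off concrete. This is illustrative only: real prices vary by region and include additional charge dimensions (per-stream hours, PUT payload units, retention):

```python
# Approximate rates quoted in this article (USD).
ONDEMAND_WRITE_PER_GB = 0.08
ONDEMAND_READ_PER_GB = 0.04
SHARD_HOUR = 0.015
HOURS_PER_MONTH = 730


def ondemand_monthly(write_gb, read_gb):
    """On-demand data charges only (excludes the per-stream hourly fee)."""
    return write_gb * ONDEMAND_WRITE_PER_GB + read_gb * ONDEMAND_READ_PER_GB


def provisioned_monthly(shards):
    """Shard-hour charge only (excludes PUT payload unit billing)."""
    return shards * SHARD_HOUR * HOURS_PER_MONTH


# A steady 1 MB/s write load is roughly 2,500 GB/month and fits in one
# shard: provisioned comes to about $11/month in shard-hours, while the
# on-demand write charge alone is around $200/month. Steady, predictable
# traffic therefore favors provisioned mode; spiky or unknown traffic
# favors on-demand.
```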

Summary

Amazon Kinesis is a family of fully managed services that comprehensively covers streaming data collection, processing, and delivery, supporting the construction of real-time data pipelines. Data Streams' on-demand mode enables scalable data collection without capacity planning. Serverless stream processing through Lambda integration allows real-time data transformation and analysis without infrastructure management. Data Firehose automates the construction of analysis-optimized data lakes through automatic delivery to S3 with Parquet conversion. For organizations building real-time data processing infrastructure, Kinesis is a core service.