Designing Streaming Data Processing - Building Real-Time Data Pipelines with Kinesis

Learn streaming data processing design techniques with Amazon Kinesis, including building real-time data pipelines with Data Streams, Data Firehose, and Lambda integration.

Growing Demand for Real-Time Data Processing and the Role of Kinesis

The demand for instantly processing and analyzing large volumes of data generated in real time, such as IoT sensor data, web application clickstreams, financial transaction logs, and social media feeds, is expanding rapidly. While batch processing introduces latency of hours to days, streaming processing delivers analysis results within seconds of data generation. Amazon Kinesis is a family of fully managed services for collecting, processing, and analyzing streaming data in real time. Kinesis Data Streams provides the data stream foundation, Kinesis Data Firehose handles data delivery, and Kinesis Data Analytics (since renamed Amazon Managed Service for Apache Flink) enables SQL and Apache Flink analysis of streaming data.

Data Collection with Kinesis Data Streams

Kinesis Data Streams is a service for collecting and retaining large volumes of streaming data in real time. A data stream is composed of shards; each shard supports 1 MB/s (or 1,000 records/s) of writes and 2 MB/s of reads shared across all consumers. On-demand mode automatically scales shard capacity with traffic, eliminating the need for capacity planning, while provisioned mode lets you explicitly specify the shard count to optimize costs. The default data retention period is 24 hours, extendable up to 365 days. The Kinesis Producer Library (KPL) maximizes producer-side throughput through record aggregation and buffering. The enhanced fan-out feature provides each registered consumer with its own dedicated read throughput of 2 MB/s per shard, allowing multiple consumers to process the same stream in parallel without contending for the shared limit.
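On the producer side, writes can also be batched with the plain AWS SDK; a minimal boto3 sketch, assuming a hypothetical sensor-stream stream and device_id field (the KPL layers aggregation and buffering on top of this same API):

```python
import json


def build_records(events):
    """Convert event dicts into PutRecords entries. The partition key
    determines the target shard, so records sharing a device_id keep
    their ordering within a single shard."""
    return [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e["device_id"])}
        for e in events
    ]


def send(events, stream_name="sensor-stream"):
    """Write up to 500 records in one PutRecords call (the API limit)."""
    import boto3  # AWS SDK for Python

    client = boto3.client("kinesis")
    resp = client.put_records(StreamName=stream_name, Records=build_records(events))
    # Throttled entries come back with an ErrorCode in the response;
    # production code should retry only those, not the whole batch.
    return resp["FailedRecordCount"]
```

Keying the partition on a device ID is a design choice: it preserves per-device ordering at the cost of uneven shard load if a few devices dominate traffic.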

Serverless Stream Processing with Lambda Integration

The integration of Kinesis Data Streams with Lambda is widely adopted as a serverless stream processing pattern. Lambda automatically polls records from the Kinesis stream via event source mappings and passes them to the Lambda function in batches. You can optimize the balance between throughput and latency by adjusting batch size, batch window, and parallelization factor. Setting the parallelization factor enables multiple Lambda instances to process a single shard in parallel, increasing processing capacity. For error handling, the bisect-on-function-error feature automatically splits failed batches in half for retry, helping identify problematic records, and failed records can be sent to an SQS dead-letter queue for later investigation and reprocessing. The event filtering feature drops records that do not match a condition before they ever reach the Lambda function, avoiding unnecessary invocations and optimizing costs. For example, the following CLI command creates an event source mapping that passes only ORDER_PLACED events to Lambda:

aws lambda create-event-source-mapping \
  --function-name process-orders \
  --event-source-arn arn:aws:kinesis:ap-northeast-1:123456789012:stream/orders \
  --starting-position LATEST \
  --batch-size 100 \
  --maximum-batching-window-in-seconds 5 \
  --filter-criteria '{"Filters":[{"Pattern":"{\"data\":{\"event_type\":[\"ORDER_PLACED\"]}}"}]}'
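The function side of this pattern can be sketched as follows. The `process` business logic is hypothetical, and returning `batchItemFailures` takes effect only when ReportBatchItemFailures is enabled on the event source mapping, in which case Lambda retries from the first failed record instead of the whole batch:

```python
import base64
import json


def handler(event, context):
    """Kinesis-triggered Lambda handler with partial batch responses.
    Record payloads arrive base64-encoded inside event["Records"]."""
    failures = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        try:
            process(payload)
        except Exception:
            # Report the failing record's sequence number so only it
            # (and later records in the batch) are retried.
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}


def process(payload):
    """Hypothetical business logic, e.g. persisting the order event."""
    if payload.get("event_type") != "ORDER_PLACED":
        raise ValueError("unexpected event type")
```

Returning an empty `batchItemFailures` list tells Lambda the whole batch succeeded, so the checkpoint advances past it.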

Delivery with Data Firehose and S3 Integration

Kinesis Data Firehose is a service that automatically delivers streaming data to destinations such as S3, Redshift, OpenSearch, and Splunk. It fully automates the process from data reception through buffering, transformation, compression, and delivery, eliminating the need to develop consumer applications. Buffer size (1-128 MB) and buffer interval (60-900 seconds) control delivery frequency and batch size. The data transformation feature uses Lambda functions to perform format conversion, filtering, and enrichment on records before delivery. For S3 delivery, automatic conversion to Parquet or ORC format is available, enabling construction of data lakes optimized for analysis with Athena or Redshift Spectrum. Dynamic partitioning determines S3 prefixes dynamically based on record content, enabling efficient data organization.
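The Lambda transformation contract mentioned above can be sketched as follows: each incoming record must be echoed back with its recordId and a result of Ok, Dropped, or ProcessingFailed, with transformed data re-encoded as base64. The drop condition and enrichment field here are hypothetical examples:

```python
import base64
import json


def handler(event, context):
    """Firehose data-transformation Lambda (sketch)."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if "user_id" not in payload:
            # Hypothetical filter: silently discard malformed records.
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        payload["source"] = "firehose"  # hypothetical enrichment field
        # Newline-delimit so objects in S3 are queryable line-by-line.
        data = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": base64.b64encode(data).decode("ascii")})
    return {"records": output}
```

Records marked ProcessingFailed (not shown) are delivered to an error prefix in the destination bucket rather than lost.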

Kinesis Pricing

Kinesis Data Streams on-demand mode costs approximately $0.08 per GB written and $0.04 per GB read, plus an hourly per-stream charge. Provisioned mode costs approximately $0.015 per shard-hour, with PUT payload units billed separately. Kinesis Data Firehose costs approximately $0.029 per GB ingested. Prices vary by region, so check the current AWS pricing pages before estimating. Compared to MSK (Kafka), Kinesis offers richer native integration with AWS services and is more cost-effective for small to medium-scale streaming; for large-scale workloads that require the Kafka ecosystem, MSK is the better choice.
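A back-of-envelope comparison using the approximate rates above makes the mode trade-off concrete. This is illustrative only: real prices vary by region and include additional charge dimensions (per-stream hours, PUT payload units, retention):

```python
# Approximate rates quoted in this article (USD).
ONDEMAND_WRITE_PER_GB = 0.08
ONDEMAND_READ_PER_GB = 0.04
SHARD_HOUR = 0.015
HOURS_PER_MONTH = 730


def ondemand_monthly(write_gb, read_gb):
    """On-demand data charges only (excludes the per-stream hourly fee)."""
    return write_gb * ONDEMAND_WRITE_PER_GB + read_gb * ONDEMAND_READ_PER_GB


def provisioned_monthly(shards):
    """Shard-hour charge only (excludes PUT payload unit billing)."""
    return shards * SHARD_HOUR * HOURS_PER_MONTH


# A steady 1 MB/s write load is roughly 2,500 GB/month and fits in one
# shard: provisioned comes to about $11/month in shard-hours, while the
# on-demand write charge alone is around $200/month. Steady, predictable
# traffic therefore favors provisioned mode; spiky or unknown traffic
# favors on-demand.
```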

Summary

Amazon Kinesis is a family of fully managed services that comprehensively covers streaming data collection, processing, and delivery, supporting the construction of real-time data pipelines. Data Streams' on-demand mode enables scalable data collection without capacity planning. Serverless stream processing through Lambda integration allows real-time data transformation and analysis without infrastructure management. Data Firehose automates the construction of analysis-optimized data lakes through automatic delivery to S3 with Parquet conversion. For organizations building real-time data processing infrastructure, Kinesis is a core service.