Building an Apache Kafka Streaming Platform with Amazon MSK - Cluster Design and Operations
Design a managed Apache Kafka cluster, learn when to choose MSK Serverless, and explore data integration patterns with MSK Connect.
MSK Features and When to Choose Over Kinesis
MSK is a managed service for Apache Kafka in which AWS handles broker provisioning, patching, and failure recovery. It is fully compatible with the Kafka protocol, so existing Kafka producers, consumers, Kafka Streams applications, and Connect connectors work without modification. When deciding between MSK and Kinesis Data Streams, choose MSK when you need compatibility with the existing Kafka ecosystem, and Kinesis when you prioritize AWS-native integration and simplicity. Kinesis uses per-shard pay-as-you-go pricing, which suits starting small, while MSK uses per-broker-instance hourly pricing that becomes more cost-efficient at large streaming volumes.
Cluster Design and MSK Serverless
With provisioned clusters, you specify the broker instance type (e.g., kafka.m5.large), number of brokers, and storage capacity. For production environments, a minimum of 3 brokers (one per AZ) is recommended, with a replication factor of 3 and min.insync.replicas of 2 to ensure data durability. MSK Serverless, which became generally available in 2022, eliminates cluster provisioning entirely. You can focus solely on creating topics and sending/receiving data, with throughput auto-scaling at the partition level. It is ideal when traffic patterns are unpredictable or when operations team resources are limited. However, provisioned clusters offer more flexibility for customizing Kafka configuration parameters.
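The durability settings above can be checked with a little arithmetic. The sketch below is illustrative only and assumes producers use acks=all, since min.insync.replicas is only enforced for such writes. The function names are hypothetical and not part of any Kafka or AWS API.

```python
def tolerated_replica_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas that can be lost while acks=all writes are still accepted."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


def can_accept_writes(replication_factor: int, min_insync_replicas: int,
                      failed_replicas: int) -> bool:
    """True if enough in-sync replicas remain to satisfy min.insync.replicas."""
    in_sync = replication_factor - failed_replicas
    return in_sync >= min_insync_replicas


if __name__ == "__main__":
    # Recommended production settings: replication factor 3, min.insync.replicas 2.
    print(tolerated_replica_failures(3, 2))
    print(can_accept_writes(3, 2, failed_replicas=1))
    print(can_accept_writes(3, 2, failed_replicas=2))
```

With a replication factor of 3 and min.insync.replicas of 2, a partition keeps accepting writes after losing one replica (for example, during an AZ outage or a rolling patch) and rejects them after losing two, which is why one broker per AZ across three AZs is the usual baseline.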
Data Integration with MSK Connect
MSK Connect is a service that runs Kafka Connect connectors in a managed environment. The S3 Sink Connector automatically writes topic data to S3 for analysis with Redshift or Athena. The DynamoDB Sink Connector updates tables in real time, and the Debezium Source Connector captures RDS table changes and streams them to Kafka topics. Connector scaling can be configured automatically or manually, adjusting worker counts to tune throughput. Connector plugins are created from JAR files uploaded to S3, so you can freely use connectors from Confluent Hub or the open-source community.
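As a rough illustration of what an S3 sink configuration might look like, the sketch below builds the key/value map that a Kafka Connect connector takes. It assumes the Confluent S3 Sink Connector is the plugin in use. The helper name, topic, bucket, and region values are placeholders, and the flush size and task count are arbitrary starting points rather than recommendations.

```python
def s3_sink_config(topics, bucket, region, flush_size=1000, tasks_max=2):
    """Build a connector configuration map for the Confluent S3 sink plugin.

    Kafka Connect configurations are string-to-string maps, so numeric
    settings are converted to strings.
    """
    if not topics:
        raise ValueError("at least one topic is required")
    return {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": str(tasks_max),
        "topics": ",".join(topics),
        "s3.region": region,
        "s3.bucket.name": bucket,
        # Number of records written per S3 object before a new one is started.
        "flush.size": str(flush_size),
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    }


if __name__ == "__main__":
    # Placeholder names; substitute your own topic, bucket, and region.
    config = s3_sink_config(["orders"], "example-analytics-bucket", "us-east-1")
    for key, value in sorted(config.items()):
        print(f"{key}={value}")
```

A map like this would be supplied as the connector configuration when creating the connector in MSK Connect, alongside the custom plugin built from the connector's JAR files in S3.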
MSK Pricing
Provisioned cluster pricing consists of per-broker-instance hourly charges plus storage. A kafka.m5.large costs approximately $0.21 per hour (about $151 per month), so the minimum 3-broker configuration costs approximately $453 per month. Storage costs approximately $0.10 per GB per month. MSK Serverless is charged by cluster hour (approximately $0.75/hour) and partition hour, which makes it advantageous for workloads with intermittent traffic. For comparison, Kinesis Data Streams costs approximately $0.015 per shard hour. Price is only part of the decision: choose MSK when Kafka ecosystem compatibility is needed, and Kinesis when AWS-native integration is the priority.
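The monthly figures above follow from simple arithmetic, sketched below. The rates are the approximate ones quoted in this article, and a 30-day (720-hour) month is assumed. The function names are hypothetical, and the Kinesis figure covers shard hours only.

```python
HOURS_PER_MONTH = 720  # assumes a 30-day month


def provisioned_monthly_cost(brokers, broker_hourly=0.21,
                             storage_gb_per_broker=0, storage_gb_month=0.10):
    """Approximate monthly cost of a provisioned cluster in USD."""
    compute = brokers * broker_hourly * HOURS_PER_MONTH
    storage = brokers * storage_gb_per_broker * storage_gb_month
    return compute + storage


def kinesis_shard_monthly_cost(shards, shard_hourly=0.015):
    """Approximate monthly shard-hour cost for Kinesis Data Streams in USD."""
    return shards * shard_hourly * HOURS_PER_MONTH


if __name__ == "__main__":
    # Minimum production layout: three kafka.m5.large brokers, compute only.
    print(round(provisioned_monthly_cost(3), 2))
    # Same cluster with 100 GB of storage per broker.
    print(round(provisioned_monthly_cost(3, storage_gb_per_broker=100), 2))
    # A single Kinesis shard for comparison.
    print(round(kinesis_shard_monthly_cost(1), 2))
```

Three brokers at $0.21 per hour come to about $453.60 per month before storage, while one Kinesis shard comes to about $10.80, which illustrates why Kinesis is cheaper for small starts and MSK becomes competitive only as volume grows.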
Summary
MSK delegates Apache Kafka operational management to AWS while maintaining Kafka ecosystem compatibility. MSK Serverless eliminates cluster management entirely, and MSK Connect runs Kafka Connect connectors in a managed environment. On provisioned clusters, Tiered Storage automatically moves older data to a lower-cost storage tier, reducing broker storage costs.