AWS DataSync
A service that automates and accelerates data transfers between on-premises and AWS or between AWS services, supporting scheduled transfers across NFS/SMB/HDFS/S3/EFS/FSx
Overview
AWS DataSync is a service that automates data transfers between on-premises storage systems and AWS storage services, or between AWS storage services. It supports transfers from NFS, SMB, HDFS, and object storage to S3, EFS, FSx for Windows File Server, FSx for Lustre, FSx for OpenZFS, and FSx for NetApp ONTAP. A purpose-built network protocol delivers transfer speeds up to 10x faster than open-source tools. Data verification during transfer, encryption, bandwidth throttling, and scheduled execution are included as standard features.
Agent Placement and Network Design
For data transfers from on-premises to AWS, you need to deploy a DataSync agent in your on-premises environment. The agent can run as a virtual machine on VMware ESXi, Microsoft Hyper-V, or KVM, or as an Amazon EC2 instance. Minimum agent requirements are 4 vCPUs and 32 GB RAM, with 16 vCPUs and 64 GB RAM or more recommended when transferring tens of millions of files. You can choose from three network paths: over the internet, via AWS Direct Connect, or through a VPC endpoint (PrivateLink). For large initial data transfers, using a Direct Connect dedicated connection is common, while internet-based transfers are typically sufficient for daily incremental syncs. Bandwidth throttling lets you limit transfer bandwidth during business hours and use full bandwidth overnight. For transfers between AWS services (e.g., S3 to EFS), no agent is required; DataSync handles the transfer in a fully managed fashion.
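As a sketch, registering an on-premises NFS share and an S3 bucket as DataSync locations, and throttling a task during business hours, might look like the following AWS CLI calls. All ARNs, the hostname, and the bucket name are hypothetical placeholders; the commands assume an already-activated agent and an existing IAM role for bucket access.

```shell
# Register an on-premises NFS export as a source location
# (agent ARN and server hostname are placeholders).
aws datasync create-location-nfs \
  --server-hostname nfs.example.internal \
  --subdirectory /export/shared \
  --on-prem-config AgentArns=arn:aws:datasync:us-east-1:123456789012:agent/agent-0123456789abcdef0

# Register an S3 bucket as the destination location.
aws datasync create-location-s3 \
  --s3-bucket-arn arn:aws:s3:::example-migration-bucket \
  --s3-config BucketAccessRoleArn=arn:aws:iam::123456789012:role/DataSyncS3Role

# Throttle an existing task to ~100 MB/s for business hours;
# re-run with BytesPerSecond=-1 to restore unlimited bandwidth.
aws datasync update-task \
  --task-arn arn:aws:datasync:us-east-1:123456789012:task/task-0123456789abcdef0 \
  --options BytesPerSecond=104857600
```

In practice the two `update-task` calls (throttle on, throttle off) are often driven by scheduled jobs so the limit applies only during working hours.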
Transfer Task Design and Data Integrity Guarantees
A DataSync task consists of three elements: a source location, a destination location, and transfer settings. Transfer settings let you configure data filtering (include/exclude patterns), file metadata preservation (POSIX permissions, timestamps, ownership), and overwrite policies (transfer only changed files vs. all files). DataSync performs integrity checks on data in transit, and post-transfer verification is configurable: checking the entire destination against the source gives point-in-time consistency, while verifying only the files transferred in that run shortens execution time for large datasets. For large-scale transfers of hundreds of millions of files, task execution may take hours to days, but if a run is interrupted, the next run transfers only the delta; there is no need to start over. With scheduled execution, you can automate daily or weekly periodic syncs, keeping a regularly updated replica of on-premises file servers in S3.
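The three task elements above can be expressed in a single CLI call. This is a sketch with placeholder ARNs and an illustrative exclude pattern; it assumes the source and destination locations already exist.

```shell
# Create a task with an exclude filter, metadata preservation,
# verification limited to transferred files, and a daily
# 02:00 UTC schedule (all ARNs are placeholders).
aws datasync create-task \
  --name nightly-fileserver-sync \
  --source-location-arn arn:aws:datasync:us-east-1:123456789012:location/loc-src0123456789abc \
  --destination-location-arn arn:aws:datasync:us-east-1:123456789012:location/loc-dst0123456789abc \
  --excludes FilterType=SIMPLE_PATTERN,Value="*/tmp" \
  --options VerifyMode=ONLY_FILES_TRANSFERRED,OverwriteMode=ALWAYS,PosixPermissions=PRESERVE,PreserveDeletedFiles=PRESERVE \
  --schedule ScheduleExpression="cron(0 2 * * ? *)"
```

`VerifyMode=POINT_IN_TIME_CONSISTENT` would instead verify the whole destination after each run, trading execution time for a stronger end-to-end check.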
Design Patterns by Migration Scenario
DataSync has three typical use cases. First, migrating on-premises NFS/SMB file servers to S3 or EFS. After the initial full copy, incremental syncs continue until cutover, at which point the application's connection target is switched to the new storage. When file counts exceed tens of millions, splitting the work by directory and running tasks in parallel can significantly reduce transfer time. Second, continuous data synchronization in hybrid cloud environments. A representative pattern is periodically transferring logs or sensor data generated on-premises to S3 for analysis with Athena or EMR. Third, data transfers across AWS regions or accounts. For replicating data to an S3 bucket in another region for disaster recovery, DataSync is effective for file system data (EFS, FSx) that S3 Cross-Region Replication, which applies only to S3 buckets, cannot handle. Compared to Transfer Family (SFTP/FTPS), which specializes in receiving files from external partners, DataSync specializes in high-speed bulk data transfer and automation.
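The directory-split pattern from the first use case can be sketched as follows. Because a DataSync task allows only one running execution at a time, parallelism requires one task per subtree, each scoped with an include filter; the ARNs and directory names below are hypothetical.

```shell
# Create one task per top-level directory so the subtrees can
# transfer in parallel (placeholder ARNs and paths).
SRC=arn:aws:datasync:us-east-1:123456789012:location/loc-src0123456789abc
DST=arn:aws:datasync:us-east-1:123456789012:location/loc-dst0123456789abc

for dir in /projects /media /archive; do
  aws datasync create-task \
    --name "migrate${dir//\//-}" \
    --source-location-arn "$SRC" \
    --destination-location-arn "$DST" \
    --includes FilterType=SIMPLE_PATTERN,Value="${dir}/*"
done
```

Each resulting task ARN is then started with `aws datasync start-task-execution`, and the executions proceed independently.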