Show HN: S3HyperSync – Faster S3 sync tool – iterating with up to 100k files/s
An alternative S3 sync tool to sync S3 buckets extremely fast.
Feedback and contributions are welcome!
How does it compare with s5cmd [1]? s5cmd is my go-to tool for fast S3 sync, and they have the following at the top of their GitHub page:
> For uploads, s5cmd is 32x faster than s3cmd and 12x faster than aws-cli. For downloads, s5cmd can saturate a 40Gbps link (~4.3 GB/s), whereas s3cmd and aws-cli can only reach 85 MB/s and 375 MB/s respectively.
[1] https://github.com/peak/s5cmd
I have not yet compared it against this tool, but the numbers given are for downloads, not for syncing files.
For an S3->S3 sync using a c6gn.8xlarge instance, I got up to 800 MB/s using 64 workers, but the files were on average only around 50 MB. And the bigger the file, the higher the MB/s.
Also, from my short look into it, s5cmd does not support syncing between S3 providers (S3->Cloudflare).
For large buckets, key-space enumeration is a significant portion of most bulk operations, especially on a potentially non-optimized key space (i.e. one with hotspots). There are a few heuristics that can be used, but doing an S3 Inventory allows skipping enumeration and focusing on the transfer with significantly fewer API calls, albeit it requires preparing the bucket in advance.
For a big sync this is definitely the way to go. But for a simple daily sync we found the tool more useful, especially when syncing to a non-AWS S3 provider that does not offer an inventory feature.
How can I be confident that everything was synced correctly? Is there a way to compare the SHA or whatever key S3 provides?
Also, would this work well when there is not a lot of room on the disk it is syncing from? I have had serious issues with the S3 CLI in such a scenario.
Also, how would this compare to something like rclone?
File size is easy to compare, so you at least know that the full file got synced. Hashes are a bigger issue with AWS in general: you only get an ETag from S3, which is the MD5 of the file for single PUTs but not for multipart uploads, where it also depends on the part size. So you can't easily check files for hash equality.
The good news is that with S3 over HTTP you should not really run into byte flip issues.
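To make the ETag issue concrete, here is a small sketch of how a multipart ETag is commonly observed to be derived: the MD5 of the concatenated per-part MD5 digests, suffixed with the part count. This is community-documented behavior, not an official AWS guarantee, so treat it as an assumption:

```python
import hashlib

def multipart_etag(data: bytes, part_size: int) -> str:
    """ETag as commonly observed for multipart uploads:
    MD5 over the concatenated binary MD5 digests of each part,
    plus "-<number of parts>". Not an official AWS contract."""
    digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"

def singlepart_etag(data: bytes) -> str:
    """A single-part PUT yields the plain MD5 of the object."""
    return hashlib.md5(data).hexdigest()
```

Note that the same bytes uploaded with different part sizes produce different ETags, which is exactly why equality checks against an ETag only work if you also know (or guess) the uploader's part size.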
The sync server does not need any file system storage, it processes all uploads in memory and only ever buffers 5MB per worker for multipart uploads.
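The "5 MB per worker" buffering can be pictured as re-chunking the incoming byte stream into fixed-size multipart parts, so a worker never holds more than one part in memory. A minimal sketch (the function name and shape are illustrative, not the tool's actual code):

```python
from typing import Iterable, Iterator

PART_SIZE = 5 * 1024 * 1024  # S3's minimum multipart part size

def chunk_stream(source: Iterable[bytes],
                 part_size: int = PART_SIZE) -> Iterator[bytes]:
    """Re-chunk an incoming byte stream into fixed-size parts.
    Only one part is ever buffered at a time; the final part may
    be smaller, which S3 allows for the last part of an upload."""
    buf = bytearray()
    for piece in source:
        buf.extend(piece)
        while len(buf) >= part_size:
            yield bytes(buf[:part_size])
            del buf[:part_size]
    if buf:
        yield bytes(buf)
```

Each yielded part would then be handed to an UploadPart call, keeping per-worker memory bounded regardless of object size.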
rclone looks like a good alternative, but without the focus on fast iteration for e.g. daily backups of huge buckets.
Can you target buckets at different providers, such as syncing from AWS to Backblaze (assuming S3 compatible target)?
Yes! That was one use case we had that the AWS CLI was not able to handle. You can sync any two S3-compatible services with each other, and it even supports legacy path-style addressing if you e.g. want to sync to a local MinIO installation.
Very nice!
Seemingly not the intended use case, and I might be overlooking something, but nice-to-have features which the s3 sync tool has and I'd personally miss:
- profiles
- local sync
Profile support might also make things less easy to use, as the tool supports cross-provider syncing. That's why it needs the key and secret of source and target separately.
Local sync is on the idea list, but it's not that simple, as local folders do not have the same "paginate all items in lexicographic order as it would look on S3" feature ^^
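The ordering mismatch mentioned above is subtle: S3 lists keys in raw byte order of the full key, which a plain directory walk does not reproduce (e.g. `a-1.txt` sorts before `a.txt`, which sorts before `a/b.txt`). A naive sketch of producing S3-style ordering from a local tree, under the assumption that collecting all keys up front is acceptable:

```python
import os
from typing import Iterator

def s3_ordered_keys(root: str) -> Iterator[str]:
    """Yield files under `root` as S3-style keys ('/' separators)
    in the byte-lexicographic order a ListObjectsV2 would use.
    Naive: collects everything, then sorts; a real sync tool would
    need an incremental ordered walk to stay memory-bounded."""
    keys = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            keys.append(rel.replace(os.sep, "/"))
    # S3 compares raw UTF-8 bytes, not locale-aware strings
    return iter(sorted(keys, key=lambda k: k.encode("utf-8")))
```

With both sides iterating in the same order, a sync can merge the two streams without buffering either listing in full.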
Anyone have time to dig in and say what tricks it's using most?
The code is really extremely simple, just these few files: https://github.com/Starofall/S3HyperSync/tree/main/src/main/...
1) The underlying S3 Framework is already super fast https://pekko.apache.org/docs/pekko-connectors/current/s3.ht...
2) Lots of multithreading, stream buffering and pipelining
3) For the fast iteration speed, the "read, parse, ask for next" loop is the main bottleneck. So if you e.g. know that your sync source prefix contains UUIDs, the tool creates a file iterator for each known subfolder prefix. And with 16 iterators, it's mainly the CPU that bottlenecks on the XML parsing :)