perf/storage: start streaming zip/index uploads, parallel directory upload #3287

GuillaumeGomez merged 2 commits into rust-lang:main from
Conversation
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed; this note is just to help reviewers.

Thanks!
You may want to talk to @Kobzol, he's played around with various S3 upload strategies in rustc-perf recently. I think we settled on https://docs.rs/object_store/ for it. That seems like it has some support for multi-part uploads, though I don't know if we're using them in rustc-perf. Their docs do seem to confirm some of the multi-part limitations you're suggesting:
https://docs.rs/object_store/latest/object_store/trait.MultipartUpload.html#tymethod.put_part
In rustc-perf we pre-compress the files before uploading them to S3, so no direct streaming is involved there (though I try to hide latency by compressing multiple files concurrently on a background tokio blocking worker thread pool). The object_store crate seems to work fine for the uploads, and it doesn't have as many dependencies as the official AWS crates. We are not using multi-part uploads in rustc-perf at the moment, or at least not explicitly.
I wanted to do streaming uploads for quite some time, and hoped they could be as cool as the streaming downloads.
But sadly no :) all S3 APIs need a known content-length before you start the upload. That makes it hard, for example, to keep a local buffer and compress into it while uploading.
The only way around that is multipart uploads, but those are complex, and a part has (I think) a minimum size of 5 MiB anyway, which we would have to buffer. Since most files are smaller than that, we can just as well buffer the whole compressed file and then upload it.
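The trade-off described above could be sketched roughly like this (a hypothetical helper, not code from this PR; the 5 MiB figure is S3's documented minimum part size for all parts except the last):

```rust
// Hypothetical sketch of the strategy decision discussed above:
// since multipart parts must be at least 5 MiB (except the last one),
// files smaller than that are simpler to buffer whole and upload with
// a single PUT and a known Content-Length.
const MIN_PART_SIZE: usize = 5 * 1024 * 1024;

#[derive(Debug, PartialEq)]
enum UploadStrategy {
    /// Buffer the whole compressed file in memory, then one PUT.
    BufferedPut,
    /// Stream in >= 5 MiB parts via the multipart API.
    Multipart,
}

fn choose_strategy(compressed_len: usize) -> UploadStrategy {
    if compressed_len < MIN_PART_SIZE {
        UploadStrategy::BufferedPut
    } else {
        UploadStrategy::Multipart
    }
}
```

Since most of the stored files fall under the threshold, the buffered path is the common case.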
Where streaming works well is when you have a local file: I updated our zip & index method to use tempfiles and stream those to S3.
With the new API it was also easy to optimize store_all (upload all files from a directory) to compress & upload directly, and in parallel, where before we first loaded all files into memory, compressed them, and then uploaded them. Let's see if new places come up.
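A minimal std-only sketch of the "compress & upload in parallel" shape described for store_all (the compress and upload functions here are placeholders, not the PR's actual helpers, and the real code presumably uses async tasks rather than OS threads):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-ins for the real compression and S3 client calls.
fn compress(data: &[u8]) -> Vec<u8> {
    data.to_vec() // placeholder: imagine gzip/zstd here
}

fn upload(_key: &str, _body: Vec<u8>) {
    // placeholder: imagine a single PUT with a known Content-Length
}

/// Compress and upload each file on its own worker, instead of first
/// loading everything into memory, compressing it all, then uploading.
/// Returns the number of files uploaded.
fn store_all_parallel(files: Vec<(String, Vec<u8>)>) -> usize {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = files
        .into_iter()
        .map(|(key, data)| {
            let tx = tx.clone();
            thread::spawn(move || {
                let body = compress(&data);
                upload(&key, body);
                tx.send(key).unwrap();
            })
        })
        .collect();
    drop(tx); // close our sender so the receiver ends when workers finish
    let uploaded = rx.iter().count();
    for h in handles {
        h.join().unwrap();
    }
    uploaded
}
```

The per-file pipeline keeps peak memory bounded to the files currently in flight rather than the whole directory.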