After writing the archive comparisons, I stumbled across a Reddit post where someone argues that Deep Archive is actually pretty useful as a proper archive as long as you’re not downloading regularly.

I’m inclined to agree, except I think you can do an end run around the egress cost because CloudFront allows 1 TB outbound to the Internet as part of the free tier.

Setup

I ran all this in us-east-2 so that the test's charges would show up isolated from my normal usage on the bill, without having to create a new account.

I used a single S3 bucket and a CloudFront distribution pointing to the bucket. I initially set up more infrastructure, but some experimentation showed that I could simplify it.

Once the bucket was ready, I used the console to upload my 968.3 MB test file directly to the bucket. I could have used the CLI to upload the file either directly to the Deep Archive storage class or to the Standard storage class, relying on a Lifecycle rule to move it to Deep Archive.
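For reference, the direct-to-Deep-Archive upload from a script boils down to passing the storage class as an extra argument. This is a hedged sketch: the bucket name and filename are placeholders, and the actual transfer call is commented out since it needs credentials and a real bucket.

```python
# The ExtraArgs dict is the part that selects the storage class;
# everything else is a standard boto3 managed upload.
extra_args = {"StorageClass": "DEEP_ARCHIVE"}

# boto3.client("s3").upload_file(
#     "Destiny 2 on GeForce NOW 2021-07-23 16-14-37.mp4",  # local file
#     "my-archive-bucket",                                  # hypothetical bucket
#     "Destiny 2 on GeForce NOW 2021-07-23 16-14-37.mp4",  # object key
#     ExtraArgs=extra_args,
# )
```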

Changing My Understanding of Restored Files

Before triggering the restore on the file I constructed the URL from the CloudFront distribution name and the object key: https://d24a7cr1iyx5xd.cloudfront.net/Destiny 2 on GeForce NOW 2021-07-23 16-14-37.mp4
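A browser will quietly percent-encode the spaces in that key, but a script has to do it explicitly. A minimal sketch with the stdlib, using my distribution's domain and the test file's key (the helper function name is mine, not an AWS API):

```python
from urllib.parse import quote

def cloudfront_url(domain: str, key: str) -> str:
    # quote() percent-encodes the spaces in the key but leaves "/" alone
    # by default, so keys with "directory" prefixes still work.
    return f"https://{domain}/{quote(key)}"

url = cloudfront_url(
    "d24a7cr1iyx5xd.cloudfront.net",
    "Destiny 2 on GeForce NOW 2021-07-23 16-14-37.mp4",
)
```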

When I tried accessing the URL I got an InvalidObjectState error: "The operation is not valid for the object's storage class DEEP_ARCHIVE". According to the GetObject API docs, this error is expected if the object hasn't been restored yet.

Based on this I changed my understanding: Instead of needing to copy the file to a new bucket to do anything with it, once the restore is complete the file behaves like a standard S3 object. The restored file is essentially a shadow copy of the underlying archived file that S3 can use for various operations until it expires.

The documentation I found doesn’t explicitly state that the temporary copy is usable as a standard S3 object though. Working with archived objects states: “Amazon S3 restores a temporary copy of the object only for the specified duration.” The S3 pricing page (!) comes closest, stating:

When you restore an archive, you are paying for both the archive (charged at the S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive rate) and a copy, accessible with GET using the same object key, that you restored temporarily (charged at the S3 Standard storage rate for a duration of time you choose).

The key part of that is the “accessible with GET using the same object key”.

Restore

I triggered the restore for the single file using the Bulk retrieval tier through the AWS console. Since I was only testing a single file, I skipped setting up an automated notification handler and just checked the file's status in the console.
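The console restore maps to the S3 RestoreObject API. A sketch of the equivalent call: the payload-builder function is my own convenience, and the boto3 call itself is left as a comment since running it needs credentials and a real bucket.

```python
def restore_request(days: int, tier: str = "Bulk") -> dict:
    # Tier can be "Bulk" or "Standard" for Deep Archive objects
    # (the Expedited tier isn't supported for Deep Archive).
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}

# boto3.client("s3").restore_object(
#     Bucket="my-archive-bucket",   # hypothetical bucket
#     Key="Destiny 2 on GeForce NOW 2021-07-23 16-14-37.mp4",
#     RestoreRequest=restore_request(days=1),
# )
```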

My test file was restored in ~16 hours. Once it was restored I tried downloading the file using the same URL I constructed earlier, and this time the download succeeded. Before downloading the file repeatedly to test the limits of the CloudFront free tier, I waited for the first charge to appear on my bill to make sure it was actually free. I got the result I was hoping for a few days later:

$0.000 per GB - data transfer out under the global monthly free tier: 100.488 GB

So the egress cost should be free up to 1 TB! Thanks, CloudFront!

Unexpected Costs

A few days after the experiment I checked the AWS bill again since that’s the only way to know what the smaller charges actually are.

The first line item I saw was “Amazon S3 Glacier Deep Archive CompleteMultipartUpload - $0.05 per 1,000 CompleteMultipartUpload requests”. I expected this since the S3 pricing page does list “PUT, COPY, POST, LIST requests (per 1,000 requests): $0.05”, and this should have been the PUT request.

What I didn’t expect were these two line items:

  • Amazon S3 Glacier Deep Archive InitiateMultipartUpload - $0.005 per 1,000 InitiateMultipartUpload requests
  • Amazon S3 Glacier Deep Archive UploadPart - $0.005 per 1,000 UploadPart requests

Buried in the multipart upload API docs is this:

in-progress multipart parts for a PUT to the S3 Glacier Deep Archive storage class are billed as S3 Glacier Flexible Retrieval Staging Storage at S3 Standard storage rates until the upload completes, with only the CompleteMultipartUpload request charged at S3 Glacier Deep Archive rates.

I did some quick calculations for 900 files of ~1,000 MB each (uploaded with the default 8 MB part size), which show that the UploadPart requests dominate the Initiate and Complete calls:

  • Initiate: 900 × $0.005 / 1,000 = $0.0045
  • UploadPart: 900 × (1,000 MB / 8 MB) × $0.005 / 1,000 = $0.5625
  • Complete: 900 × $0.05 / 1,000 = $0.045
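The same back-of-the-envelope numbers as a snippet, so the assumptions (900 files, ~1,000 MB each, 8 MB parts) are explicit:

```python
files = 900
parts_per_file = 1000 // 8  # ~125 UploadPart requests per ~1,000 MB file

initiate = files * 0.005 / 1000
upload_part = files * parts_per_file * 0.005 / 1000
complete = files * 0.05 / 1000

print(f"Initiate:   ${initiate:.4f}")
print(f"UploadPart: ${upload_part:.4f}")
print(f"Complete:   ${complete:.4f}")
```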

My takeaway is that uploading files would benefit from avoiding multipart uploads, but given it's a one-time cost I'm not going to hyper-optimize this. I'm not sure how reliably ~1 GB files would upload in a single PUT. I would try fiddling with the AWS SDK to use a larger multipart part size, though.
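To get a feel for how much a bigger part size helps: boto3's TransferConfig takes a multipart_chunksize, and the UploadPart request count falls in proportion. The helper below is just my arithmetic, not an AWS API:

```python
import math

def upload_parts(file_mb: int, chunk_mb: int) -> int:
    # Number of UploadPart requests needed for a file of file_mb
    # megabytes split into chunk_mb-sized parts.
    return math.ceil(file_mb / chunk_mb)

print(upload_parts(1000, 8))    # default-ish 8 MB parts: 125 requests
print(upload_parts(1000, 100))  # 100 MB parts: 10 requests

# With boto3 this would look something like:
# from boto3.s3.transfer import TransferConfig
# config = TransferConfig(multipart_chunksize=100 * 1024 * 1024)
# s3.upload_file(filename, bucket, key, Config=config)
```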

It also means there's no benefit to uploading files to S3 Standard and then using a Lifecycle rule to convert them (the transition costs $0.05/1,000 requests). Going via S3 Standard would cost $0.005/1,000 requests for the multipart complete plus $0.05/1,000 for the conversion, versus $0.05/1,000 for the multipart complete direct to Deep Archive. Going direct saves $0.005/1,000 requests.
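The same comparison in a few lines, per 1,000 objects:

```python
# Request cost in dollars per 1,000 objects for each upload path.
direct = 0.05                # CompleteMultipartUpload at the Deep Archive rate
via_standard = 0.005 + 0.05  # Complete at the Standard rate + Lifecycle transition
savings = via_standard - direct
print(f"direct saves ${savings:.3f} per 1,000 objects")
```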

In terms of restore costs, I got my two expected charges: "Amazon Simple Storage Service DeepArchiveRestoreObject" ($0.0025 per GB Bulk retrieval fee) and "Amazon S3 Glacier Deep Archive USE2-Requests-GDA-Tier5" ($0.025 per 1,000 Bulk requests). This matches the pricing docs, so I wasn't surprised there.

Open Questions

These tests were a useful experiment, but actually using Glacier Deep Archive for my backups will require more automation.

Firstly, I need a way to trigger restore requests in bulk for the downloads. Batch Operations looks like the cleanest solution, though I might be able to script this and issue a request per file.

To download the files I’d need a script that polls an SQS queue for the s3:ObjectRestore:Completed notification. I haven’t implemented anything yet, but I found that Python’s urllib.parse.unquote_plus will be needed to parse the URL-encoded object key name. The alternative is polling for job completion, but using a queue is ~free and I can start & stop a script on my own schedule.
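The key decoding is the one piece I've already verified with the stdlib. In S3 event notifications the object key arrives URL-encoded with "+" for spaces, so unquote_plus recovers the original key (the encoded string below is hand-built from my test file's name, not a captured notification):

```python
from urllib.parse import unquote_plus

encoded = "Destiny+2+on+GeForce+NOW+2021-07-23+16-14-37.mp4"
key = unquote_plus(encoded)  # "+" -> " ", %XX escapes decoded
```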

Additionally, once I download a file, I'd like to expire the temporary copy early to stop paying the S3 Standard charge. The docs state that reissuing the restore request will result in "S3 updat[ing] the expiration period relative to the current time". I'm unclear exactly how this works; I assume that if I make a restore request with Days set to 0, the file will expire ~immediately, or at the next expiration processing run (apparently "the next day at midnight Universal Coordinated Time (UTC)").