28 August, 2018

Shrinking Go Binaries

As part of the efforts to build a new artifact system, I wrote a CLI program to handle Taskcluster Artifact upload and download.  It's written in Go, and as a result the binaries are quite large.  Since I'd like this utility to be used broadly within Mozilla CI, it needs to be a reasonably sized binary, so I was curious what the various size-reduction methods are and what the trade-offs of each would be.

A bit of background is that Go binaries are static binaries which have the Go runtime and standard library built into them.  This is great if you don't care about binary size but not great if you do.

This graph shows the binary size on the left Y axis (linear scale, in blue) and the number of nanoseconds each byte of reduction takes to compute on the right Y axis (logarithmic scale).

To reproduce my results, you can do the following:

go get -u -t -v github.com/taskcluster/taskcluster-lib-artifact-go
cd $GOPATH/src/github.com/taskcluster/taskcluster-lib-artifact-go
git checkout 6f133d8eb9ebc02cececa2af3d664c71a974e833
time (go build) && wc -c ./artifact
time (go build && strip ./artifact) && wc -c ./artifact
time (go build -ldflags="-s") && wc -c ./artifact
time (go build -ldflags="-w") && wc -c ./artifact
time (go build -ldflags="-s -w") && wc -c ./artifact
time (go build && upx -1 ./artifact) && wc -c ./artifact
time (go build && upx -9 ./artifact) && wc -c ./artifact
time (go build && strip ./artifact && upx -1 ./artifact) && wc -c ./artifact
time (go build && strip ./artifact && upx --brute ./artifact) && wc -c ./artifact
time (go build && strip ./artifact && upx --ultra-brute ./artifact) && wc -c ./artifact
time (go build && strip ./artifact && upx -9 ./artifact) && wc -c ./artifact

Since I was removing a lot of debugging information, I figured it'd be worthwhile to check that stack traces still work.  To make sure I could definitely crash, I decided to panic with an error immediately on program startup.
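
The crash itself was nothing fancier than a panic at the top of main; a minimal stand-in for what I ran (not the actual artifact code) looks like this:

package main

import "errors"

func main() {
    // Panic immediately so we can check whether stack traces survive
    // stripping and compression of the binary.
    panic(errors.New("intentional crash to verify stack traces"))
}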


Even with binary stripping and the maximum compression, I'm still able to get valid stack traces.  A reduction from 9mb to 2mb is definitely significant.  The binaries are still large, but they're much smaller than what we started out with.  I'm curious if we can apply this same configuration to other areas of the Taskcluster Go codebase with similar success, and if the reduction in size is worthwhile there.

I think that using strip and upx -9 is probably the best path forward.  This combination provides enough of a benefit over the non-upx options that the time tradeoff is likely worth the effort.

Taskcluster Artifact API extended to support content verification and improve error detection

Background

At Mozilla, we're developing the Taskcluster environment for doing Continuous Integration, or CI.  One of the fundamental concerns in a CI environment is being able to upload and download the files created by each task execution.  We call them artifacts.  For Mozilla's Firefox project, an example of how we use artifacts is that each build of Firefox generates a product archive containing a build of Firefox, an archive containing the test files we run against the browser, and an archive containing the compiler's debug symbols, which can be used to generate stacks when unit tests hit an error.

The problem

In the old Artifact API, we had an endpoint which generated a signed S3 URL that was given to the worker which created the artifact.  This worker could upload anything it wanted at that location.  This is not to suggest malicious usage, but rather that any errors or early termination of an upload could result in a corrupted artifact being stored in S3 as if it were a correct upload.

If you created an artifact with the local contents "hello-world\n", but your internet connection dropped midway through, the S3 object might only contain "hello-w".  This went silently uncaught until something much further down the pipeline (hopefully!) complained that the file it got was corrupted.  This corruption is the cause of many orange-factor bugs, but we had no way to figure out exactly where the corruption was happening.

Our old API also made artifact handling in tasks very challenging.  It would often require a task writer to use one of our client libraries to generate a Taskcluster signed URL and then use Curl to do the upload.  For a lot of cases, this is fraught with hazards: Curl doesn't fail on HTTP errors by default (!!!), and Curl doesn't automatically handle "Content-Encoding: gzip" responses without "Accept-Encoding: gzip", which we sometimes need to serve.  It requires each user to figure all of this out for themselves, each time they want to use artifacts.

We also had a "Completed Artifact" pulse message which didn't actually convey anything useful.  It was sent when the artifact was allocated in our metadata tables, not when the artifact was actually complete.  We could also mark a task as completed before all of its artifacts had finished uploading.  In practice, this was avoided by not calling the task-completion endpoint before the uploads were done, but that was only a convention.

Our solution

We wanted to address a lot of issues with Taskcluster Artifacts.  Specifically, we've tackled the following issues:
  1. Corruption during upload should be detected
  2. Corruption during download should be detected
  3. Corruption of artifacts should be attributable
  4. S3 Eventual Consistency error detection
  5. Caches should be able to verify whether they are caching valid items
  6. Completed Artifact messages should only be sent when the artifact is actually complete
  7. Tasks should be unresolvable until all uploads are finished
  8. Artifacts should be really easy to use
  9. Artifacts should be able to be uploaded with browser-viewable gzip encoding

Code

Here's the code we wrote for this project:
  1. https://github.com/taskcluster/remotely-signed-s3 -- A library which wraps the S3 APIs using the lower level S3 REST API and uses the aws4 request signing library
  2. https://github.com/taskcluster/taskcluster-lib-artifact -- A light wrapper around remotely-signed-s3 to enable JS based uploads and downloads
  3. https://github.com/taskcluster/taskcluster-lib-artifact-go -- A library and CLI written in Go
  4. https://github.com/taskcluster/taskcluster-queue/commit/6cba02804aeb05b6a5c44134dca1df1b018f1860 -- The final Queue patch to enable the new Artifact API

Upload Corruption

If an artifact is uploaded with a different set of bytes from those which were expected, the upload should fail.  The S3 V4 signature system allows us to sign a request's headers, including the X-Amz-Content-Sha256 and Content-Length headers.  This means that the request headers we get back from signing can only be used for a request which sets X-Amz-Content-Sha256 and Content-Length to the values provided at signing.  S3 checks that the Sha256 checksum of each request's body matches the value in that header, and that the body's length matches Content-Length.

The requests we get from the Taskcluster Queue can only be used to upload the exact file we asked permission to upload.  This means that the only set of bytes that will allow the request(s) to S3 to complete successfully will be the ones we initially told the Taskcluster Queue about.

The two main cases we're protecting against here are disk and network corruption.  The file ends up being read twice, once to hash and once to upload.  Since we have the hash calculated, we can be sure to catch corruption if the two hashes or sizes don't match.  Likewise, the possibility of network interruption or corruption is handled because the S3 server will report an error if the connection is interrupted or corrupted before data exactly matching the Sha256 hash has been uploaded.
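
To make the shape of this concrete, here's a rough sketch of that hash-then-upload flow.  In practice the signed URL and exact headers come back from the Queue; the helper names here are hypothetical, not the library's API:

package example

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "net/http"
    "os"
)

// hashFile is the first read of the file: learn its Sha256 and size so the
// Queue can sign a request that only those exact bytes will satisfy.
func hashFile(name string) (string, int64, error) {
    f, err := os.Open(name)
    if err != nil {
        return "", 0, err
    }
    defer f.Close()
    h := sha256.New()
    n, err := io.Copy(h, f)
    if err != nil {
        return "", 0, err
    }
    return hex.EncodeToString(h.Sum(nil)), n, nil
}

// uploadSigned is the second read: stream the file to the signed URL with
// the headers S3 will verify against the signature.
func uploadSigned(signedURL, name, sha256Hex string, size int64) error {
    f, err := os.Open(name)
    if err != nil {
        return err
    }
    defer f.Close()
    req, err := http.NewRequest("PUT", signedURL, f)
    if err != nil {
        return err
    }
    req.ContentLength = size
    req.Header.Set("X-Amz-Content-Sha256", sha256Hex)
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("upload failed: %s", resp.Status)
    }
    return nil
}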

This does not prevent all broken files from being uploaded, which is an important distinction to make.  If you upload an invalid zip file, but no corruption occurs once you pass responsibility to taskcluster-lib-artifact, we're going to happily store this defective file, but we're going to ensure that every step down the pipeline gets an exact copy of that defective file.

Download Corruption

Just as with uploads, we could experience corruption or interruptions during downloads.  In order to combat this, we set some extra metadata headers on the artifacts in S3 during uploading:
  1. x-amz-meta-taskcluster-content-sha256 -- The Sha256 of the artifact passed into a library -- i.e. without our automatic gzip encoding
  2. x-amz-meta-taskcluster-content-length -- The number of bytes of the artifact passed into a library -- i.e. without our automatic gzip encoding
  3. x-amz-meta-taskcluster-transfer-sha256 -- The Sha256 of the artifact as passed over the wire to S3 servers.  In the case of identity encoding, this is the same value as x-amz-meta-taskcluster-content-sha256.  In the case of Gzip encoding, it is almost certainly not identical.
  4. x-amz-meta-taskcluster-transfer-length -- The number of bytes of the artifact as passed over the wire to S3 servers.  In the case of identity encoding, this is the same value as x-amz-meta-taskcluster-content-length.  In the case of Gzip encoding, it is almost certainly not identical.

You would be right to question whether we can trust these values once created.  The good news is that headers on S3 objects cannot be changed after upload.  These headers are also part of the S3 request signing we do on the Queue.  This means that the only values which can be set are those which the Queue expects, and that they are immutable.

Important to note is that because these are non-standard headers, verification requires explicit action on the part of the artifact downloader.  That's a big part of why we've written supported artifact downloading tools.
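
For illustration, that explicit check boils down to comparing what came over the wire with those headers.  A minimal sketch (not the actual library code) might be:

package example

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "net/http"
    "strconv"
)

// verifyTransfer checks the bytes as received over the wire against the
// metadata headers that were fixed at upload time.
func verifyTransfer(resp *http.Response, body []byte) error {
    sum := sha256.Sum256(body)
    got := hex.EncodeToString(sum[:])
    if want := resp.Header.Get("x-amz-meta-taskcluster-transfer-sha256"); want != "" && got != want {
        return fmt.Errorf("transfer sha256 mismatch: got %s, want %s", got, want)
    }
    if want := resp.Header.Get("x-amz-meta-taskcluster-transfer-length"); want != "" && want != strconv.Itoa(len(body)) {
        return fmt.Errorf("transfer length mismatch: got %d, want %s", len(body), want)
    }
    return nil
}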

Attribution of Corruption

Corruption is inevitable in a massive system like Taskcluster.  What's really important is that when corruption happens we detect it and we know where to focus our remediation efforts.  In the new Artifact API, we can zero in on the culprit for corruption.

With the old Artifact API, we didn't have any way to figure out whether an artifact was corrupted or where that happened.  We never knew what the artifact was on the build machine, we couldn't verify corruption in caching systems, and when an invalid artifact was downloaded by a downstream task, we didn't know whether it was invalid because the file was defective from the start or because of a bad transfer.

Now, we know that if the Sha256 checksum of the downloaded artifact matches the one recorded at upload time but the contents are still broken, the original file was broken before it was uploaded.  We can build caching systems which ensure that the value that they're caching is valid and alert us to corruption.  We can track corruption to detect issues in our underlying infrastructure.

Completed Artifact Messages and Task Resolution

Previously, as soon as the Taskcluster Queue stored the metadata about the artifact in its internal tables and generated a signed URL for the S3 object, the artifact was marked as completed.  This behaviour resulted in a slightly deceptive message being sent.  Nobody cares when this allocation occurs, but someone might care about an artifact becoming available.

On a related theme, we also allowed tasks to be resolved before the artifacts were uploaded.  This meant that a task could be marked as "Completed -- Success" without actually uploading any of its artifacts.  Obviously, we would always be writing workers with the intention of avoiding this error, but having it built into the Queue gives us a stronger guarantee.

We achieved this result by adding a new method to the flow of creating and uploading an artifact and adding a 'present' field in the Taskcluster Queue's internal Artifact table.  For those artifacts which are created atomically, and the legacy S3 artifacts, we just set the value to true.  For the new artifacts, we set it to false.  When you finish your upload, you have to run a complete artifact method.  This is sort of like a commit.

In the complete artifact method, we verify that S3 sees the artifact as present, and only once that check passes do we send the artifact-completed message.  Likewise, in the complete task method, we ensure that all artifacts have a present value of true before allowing the task to complete.
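
In rough outline, the flow a worker follows looks like this.  The interface and method names here are hypothetical stand-ins, not the Queue's real API:

package example

// queueLike is a hypothetical stand-in for whatever client talks to the
// Taskcluster Queue; the real method names and signatures differ.
type queueLike interface {
    CreateArtifact(taskID, runID, name string) (signedRequests []string, err error)
    CompleteArtifact(taskID, runID, name string) error
}

// publishArtifact sketches the create -> upload -> complete flow.
func publishArtifact(q queueLike, upload func(signedRequest string) error, taskID, runID, name string) error {
    // 1. The Queue records the artifact with present=false and hands back
    //    signed requests for the exact bytes we described to it.
    reqs, err := q.CreateArtifact(taskID, runID, name)
    if err != nil {
        return err
    }
    // 2. Perform the signed uploads to S3.
    for _, r := range reqs {
        if err := upload(r); err != nil {
            return err
        }
    }
    // 3. The "commit": the Queue confirms S3 sees the object, flips present
    //    to true, and only then sends the artifact-completed message.
    return q.CompleteArtifact(taskID, runID, name)
}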

S3 Eventual Consistency and Caching Error Detection

S3 works on an eventual consistency model for some operations in some regions.  Caching systems also have a certain level of tolerance for corruption.  We're now able to determine whether the bytes we're downloading are those which we expect.  We can now rely on more than the HTTP status code to know whether a request worked.

In both of these cases we can programmatically check whether the download is corrupt and try again as appropriate.  In the future, we could even build smarts into our download libraries and tools to ask the caches involved to drop their data, or to try bypassing caches as a last resort.
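
A sketch of what that programmatic retry could look like, where the fetch and verify functions are placeholders for whatever transport and check you use:

package example

// downloadWithRetry re-fetches an artifact a bounded number of times when
// verification reports a corrupted or incomplete transfer.
func downloadWithRetry(fetch func() ([]byte, error), verify func([]byte) error, attempts int) ([]byte, error) {
    var lastErr error
    for i := 0; i < attempts; i++ {
        body, err := fetch()
        if err == nil {
            if err = verify(body); err == nil {
                return body, nil
            }
        }
        lastErr = err // transfer failed or bytes were corrupt; try again
    }
    return nil, lastErr
}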

Artifacts should be easy to use

Right now, if you're working with artifacts directly, you're probably having a hard time.  You have to use something like Curl and build URLs or signed URLs yourself.  You've probably hit pitfalls like Curl not exiting with an error on a non-200 HTTP status.  You're not getting any content verification.  Basically, it's hard.

Taskcluster is about enabling developers to do their job effectively.  Something so critical to CI usage as artifacts should be simple to use.  To that end, we've implemented libraries for interacting with artifacts in Javascript and Go.  We've also implemented a Go based CLI for interacting with artifacts in the build system or shell scripts.

Javascript

The Javascript client uses the same remotely-signed-s3 library that the Taskcluster Queue uses internally.  It's a really simple wrapper which provides a put() and get() interface.  All of the verification of requests is handled internally, as is decompression of Gzip resources.  This was primarily written to enable integration in Docker-Worker directly.

Go

We also provide a Go library for downloading and uploading artifacts.  This is intended to be used in the Generic-Worker, which is written in Go.  The Go library uses the smallest useful interfaces from the standard io package for inputs and outputs.  We also do type assertions so we can do even more intelligent things with those inputs and outputs which support it.
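
The kind of type assertion involved looks roughly like this: if the caller's io.Reader also happens to be an io.Seeker, we can rewind it between the hashing pass and the upload pass instead of buffering everything.  This is an illustrative sketch, not the library's code:

package example

import (
    "errors"
    "io"
)

// rewind returns the reader to its start between passes when the concrete
// type supports it; otherwise the caller has to supply a fresh reader.
func rewind(r io.Reader) error {
    if s, ok := r.(io.Seeker); ok {
        _, err := s.Seek(0, io.SeekStart)
        return err
    }
    return errors.New("input does not support seeking")
}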

CLI

For all other users of Artifacts, we provide a CLI tool.  This provides a simple interface to interact with artifacts.  The intention is to make it available in the path of the task execution environment, so that users can simply call "artifact download --latest $taskId $name --output browser.zip".

Artifacts should allow serving to the browser in Gzip

We want to enable large text files which compress extremely well with Gzip to be rendered by web browsers.  An example is displaying and transmitting logs.  Because of limitations in S3 around Content-Encoding and its complete lack of content negotiation, we have to decide when we upload an artifact whether or not it should be Gzip compressed.

There's an option in the libraries to support automatic Gzip compression of things we're going to upload.  We chose Gzip over possibly-better encoding schemes because this is a one time choice at upload time, so we wanted to make sure that the scheme we used would be broadly implemented.
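
Conceptually, the optional compression just produces a gzip-encoded copy of the input and records its hash and size alongside the original's, which is where the transfer-* values described earlier come from.  A rough sketch, not the library's actual implementation:

package example

import (
    "compress/gzip"
    "crypto/sha256"
    "encoding/hex"
    "io"
    "os"
)

// countingWriter tracks how many bytes pass through it.
type countingWriter struct {
    w io.Writer
    n int64
}

func (c *countingWriter) Write(p []byte) (int, error) {
    n, err := c.w.Write(p)
    c.n += int64(n)
    return n, err
}

// gzipForUpload writes a gzip-encoded copy of src to dst and returns the
// Sha256 and length of the encoded bytes (the "transfer" values); the
// "content" values come from hashing src directly.
func gzipForUpload(src io.Reader, dst string) (transferSha256 string, transferLength int64, err error) {
    out, err := os.Create(dst)
    if err != nil {
        return "", 0, err
    }
    defer out.Close()

    h := sha256.New()
    counted := &countingWriter{w: io.MultiWriter(out, h)}
    gz := gzip.NewWriter(counted)
    if _, err = io.Copy(gz, src); err != nil {
        return "", 0, err
    }
    if err = gz.Close(); err != nil {
        return "", 0, err
    }
    return hex.EncodeToString(h.Sum(nil)), counted.n, nil
}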

Further Improvements

As always, there are still some things around artifact handling that we'd like to improve.  For starters, we should work on splitting artifact handling out of our Queue.  We've already agreed on a design for how we should store artifacts: it involves splitting all of the artifact handling out of the Queue into a different service and having the Queue track only which artifacts belong to each task run.

We're also investigating an idea to store each artifact in the region it is created in.  Right now, all artifacts are stored in EC2's US West 2 region.  We could have a situation where a build VM and a test VM are running on the same hypervisor in US East 1, but each artifact has to be uploaded and downloaded via US West 2.

Another area we'd like to work on is supporting other clouds.  Taskcluster ideally supports whichever cloud provider you'd like to use.  We want to support storage providers other than S3, and splitting out the low-level artifact handling gives us a huge maintainability win.

Possible Contributions

We're always open to contributions!  A great one that we'd love to see is allowing concurrency of multipart uploads in Go.  It turns out that this is a lot more complicated than I'd like it to be, because we support passing in the low-level io.Reader interface.  We'd want to do some type assertions to see if the input supports io.ReaderAt, and if not, use a per-goroutine offset and a file mutex to guard seeking on the file; see the sketch below.  I'm happy to mentor this project, so get in touch if that's something you'd like to work on.
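
Here's one shape this could take, purely as an illustration rather than a committed design: each part reads via io.ReaderAt when the input supports it, and otherwise falls back to a mutex around Seek and Read so concurrent goroutines don't race on the file offset.

package example

import (
    "errors"
    "io"
    "sync"
)

// partReader returns a function each goroutine can use to read its part of
// the input, choosing a strategy based on what the reader supports.
func partReader(r io.Reader) func(buf []byte, offset int64) (int, error) {
    // Preferred: io.ReaderAt allows concurrent reads at arbitrary offsets
    // with no shared state at all.
    if ra, ok := r.(io.ReaderAt); ok {
        return func(buf []byte, offset int64) (int, error) {
            return ra.ReadAt(buf, offset)
        }
    }
    // Fallback: serialize Seek+Read behind a mutex.
    rs, seekable := r.(io.ReadSeeker)
    var mu sync.Mutex
    return func(buf []byte, offset int64) (int, error) {
        if !seekable {
            return 0, errors.New("input supports neither io.ReaderAt nor io.Seeker")
        }
        mu.Lock()
        defer mu.Unlock()
        if _, err := rs.Seek(offset, io.SeekStart); err != nil {
            return 0, err
        }
        return io.ReadFull(rs, buf)
    }
}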

Conclusion

This project has been a really interesting one for me.  It gave me an opportunity to learn the Go programming language and work with the underlying AWS Rest API.  It's been an interesting experience after being heads down in Node.js code and has been a great reminder of how to use static, strongly typed languages.  I'd forgotten how nice a real type system was to work with!

Integration into our workers is still ongoing, but I wanted to give an overview of this project to keep everyone in the loop.  I'm really excited to see a reduction in the number of artifact corruptions.