24 February, 2016

cloud-mirror – Platform Engineering Operations Project of the Month

Hello from Platform Engineering Operations! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity. This month's project is our cloud-mirror.

Cloud-mirror is something we've written to reduce the cost and time of inter-region S3 transfers. It was designed for use in the Taskcluster system, but it can be run independently. Taskcluster, Mozilla's new automation environment, supports passing artifacts between dependent tasks. For example, when we do a build, we want to make the binaries available to the test machines. We originally hosted all of our artifacts in a single AWS region, which meant that every test run in a region outside the main one incurred an inter-region transfer. That is expensive and slow compared to in-region transfers.

We decided that a better approach would be to transfer the data from the main region to another region the first time it was requested there, and to serve all subsequent requests from inside that region. For the small overhead of an extra in-region copy of the file, we avoid the cost and time of doing an inter-region transfer every single time.

Here's an example. We use us-west-2 as our main region for storing artifacts. A test machine in eu-central-1 requires "firefox-50.tar.bz2" for a test, so it asks cloud-mirror for the file. Since this is the first request for this artifact in eu-central-1, cloud-mirror first copies "firefox-50.tar.bz2" into eu-central-1 and then redirects to that regional copy. When a second test machine in eu-central-1 asks for "firefox-50.tar.bz2", the file is already in the region, so cloud-mirror immediately redirects to the eu-central-1 copy.
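
To make that flow concrete, here is a minimal sketch of the "copy on first request, then redirect" logic. It is not the actual cloud-mirror code: the in-memory map stands in for the Redis state, and the bucket URL and helper names are made up for illustration.

```typescript
// Sketch of the "copy on first request, then redirect" flow; not the
// actual cloud-mirror code. Real state lives in Redis, not in memory.
const mirrored = new Map<string, string>(); // `${region}/${key}` -> regional URL

// Stand-in for the inter-region S3 copy (see the aws-sdk sketch below).
async function copyIntoRegion(key: string, region: string): Promise<string> {
  const url = `https://cloud-mirror-${region}.s3.amazonaws.com/${key}`; // made-up bucket
  mirrored.set(`${region}/${key}`, url);
  return url;
}

// Returns the URL the service would answer with as an HTTP 302 redirect.
export async function resolveArtifact(key: string, region: string): Promise<string> {
  const inRegion = mirrored.get(`${region}/${key}`);
  if (inRegion) {
    return inRegion; // already mirrored: redirect straight to the regional copy
  }
  // First request from this region: mirror the artifact, then redirect.
  return copyIntoRegion(key, region);
}

// A test machine in eu-central-1 asking for the example artifact above.
resolveArtifact('firefox-50.tar.bz2', 'eu-central-1').then(url => console.log(url));
```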

We expire artifacts from the destination regions so that storage costs don't grow too high, and the Redis cache that tracks them is configured to evict the least recently used keys first. Cloud-mirror is written with Node 5, uses Redis for its state, and uses the upstream aws-sdk library for our S3 operations.
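
For the S3 side, an inter-region copy like the one above can be done with the upstream aws-sdk (v2) roughly as follows. This is only a sketch under assumptions: the bucket names are invented, and the real service may stream objects or handle large files differently. (The least-recently-used expiry described above is the behaviour Redis's `allkeys-lru` eviction policy provides.)

```typescript
// One possible shape for the inter-region copy using the upstream
// aws-sdk (v2). Bucket names are invented; this is a sketch, not the
// code cloud-mirror ships.
import * as AWS from 'aws-sdk';

const SOURCE_BUCKET = 'cloud-mirror-us-west-2';   // hypothetical
const DEST_BUCKET = 'cloud-mirror-eu-central-1';  // hypothetical

export async function mirrorObject(key: string): Promise<string> {
  // A client in the destination region pulls the object from the source bucket.
  const s3 = new AWS.S3({ region: 'eu-central-1' });
  await s3.copyObject({
    Bucket: DEST_BUCKET,
    Key: key,
    CopySource: `${SOURCE_BUCKET}/${key}`,
  }).promise();

  // Hand back a time-limited URL for the in-region copy to redirect to.
  return s3.getSignedUrl('getObject', { Bucket: DEST_BUCKET, Key: key, Expires: 1800 });
}
```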

We're in the process of deploying this system to replace our original implementation, 's3-copy-proxy', a much simpler version of the same idea that we've been running in production. The main reasons for the rewrite were to abstract the core concepts so that anyone can write a backend for their storage type, to support more AWS regions, and to move towards a completely HTTPS-based chain.
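
As a rough illustration of what that abstraction could enable, a pluggable backend might look something like the interface below. The interface name and methods are hypothetical, not cloud-mirror's actual API.

```typescript
// Hypothetical sketch of a pluggable storage backend interface; the
// names here are illustrative, not cloud-mirror's actual API.
export interface StorageBackend {
  // Does the artifact already exist in the given storage pool/region?
  exists(key: string, region: string): Promise<boolean>;
  // Copy the artifact from the canonical store into that pool/region.
  put(key: string, region: string): Promise<void>;
  // URL clients should be redirected to for the in-region copy.
  urlFor(key: string, region: string): Promise<string>;
  // Remove an expired copy so storage costs stay bounded.
  expire(key: string, region: string): Promise<void>;
}
```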

If this is a project that's interesting to you, we have lots of ways that you could contribute! Here are some:
  • Switch polling for pending copy operations to use Redis's pub/sub features
  • Write an Azure or GCE storage backend
  • Modify the API to determine which cloud storage pool a request should be redirected to, instead of encoding that into the route
  • Write a localhost storage backend for testing that serves content on 127.0.0.1 (a rough sketch of this idea follows this list)
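
As a sketch of that last idea, a localhost testing backend could be as simple as serving a local cache directory on 127.0.0.1 so the non-S3 parts of the service can be exercised without AWS credentials. The directory, port, and URL mapping below are arbitrary choices for illustration.

```typescript
// Sketch of a localhost testing backend: serve a local cache directory
// on 127.0.0.1. Directory, port, and URL mapping are arbitrary.
import * as fs from 'fs';
import * as http from 'http';
import * as path from 'path';

const ROOT = path.join(__dirname, 'mirror-cache'); // hypothetical local store
const PORT = 8080;

http.createServer((req, res) => {
  // Map the request path onto a file inside the local cache directory.
  const file = path.join(ROOT, path.normalize(req.url || '/'));
  fs.createReadStream(file)
    .on('error', () => { res.writeHead(404); res.end(); })
    .pipe(res);
}).listen(PORT, '127.0.0.1');
```
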
If you have any ideas or find some bugs in this system, please open an issue at https://github.com/taskcluster/cloud-mirror/issues. For the time being, you will need an AWS account to run our integration tests (`npm test`). We would love to have a storage backend that allows running the non-service-specific portions of the system without any extra permissions.
If you're interested in contributing, please ping me (jhford) in #taskcluster on irc.mozilla.org.

For more information about all Platform Ops projects, visit our wiki. If you're interested in helping out, http://ateam-bootcamp.readthedocs.org/en/latest/guide/index.html has resources for getting started.
