27 August, 2018

Taskcluster Credential Derivation in EC2 using S/MIME, OpenSSL's C API and Node.js's N-API


Taskcluster uses EC2 instances to run tasks.  These instances are created in response to real-time demand from the Queue by a combination of services, Aws-Provisioner and EC2-Manager.  This post is about how we get Taskcluster credentials onto these instances.

In the beginning, we took a simple approach and set credentials in the AMI.  The AMI is a snapshot of a machine.  This snapshot gets copied over to the disk before booting each machine.  This had the disadvantage that the AMI used would need to be kept private.

In this approach, we need to know exactly which worker type an AMI is designed for.  This would result in having hundreds of AMIs for each update of the worker program, since each worker type would need its own credentials.  We wouldn't be able to have credentials which are time-limited to the desired lifespan of the instance, since they'd need to be generated before doing imaging.  The last major limit in this design was that each worker shared the same credentials.  There was no way to securely distinguish between two workers.

Our second approach was to build this logic into the Aws-Provisioner in a system called "Provisioner Secrets".  This was our first pass at removing credentials from our AMIs.  With this system, each worker type definition can have a set of secrets in a JSON field.  These secrets would then be inserted into an Azure-Entities entity identified by a UUID, called the security token.  This security token was passed to the instance through the EC2 Meta-Data service's UserData field.

The UserData field is readable by any process on the machine which can make a request to the service's IP address.  That means we cannot rely on it being secret.  To work around this, we required workers to call an API on the Aws-Provisioner to delete the secrets and security token.
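As a rough illustration of that read-then-delete flow, here is a minimal sketch.  It is not the Aws-Provisioner code: the in-memory Map stands in for the Azure-Entities table, and the function names are hypothetical.

```javascript
// Hypothetical in-memory stand-in for the table of secrets.  The real system
// kept these in an Azure-Entities entity keyed by the security token.
const secretsByToken = new Map();

// Provisioner side: remember the secrets under a fresh security token before
// the instance is requested.
function storeSecrets(securityToken, secrets) {
  secretsByToken.set(securityToken, secrets);
}

// Worker side: read the secrets using the token found in UserData.
function getSecrets(securityToken) {
  if (!secretsByToken.has(securityToken)) {
    throw new Error('unknown or already removed security token');
  }
  return secretsByToken.get(securityToken);
}

// Worker side: delete the secrets so that the token, which any local process
// can read from UserData, no longer leads anywhere.
function removeSecrets(securityToken) {
  return secretsByToken.delete(securityToken);
}
```

After removeSecrets has been called, a later getSecrets with the same token throws, which is the property the workaround relied on.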

This approach had a lot of benefits, but still wasn't ideal.  We were limited in how we could request instances from the EC2 API.  Each API request for an instance had to have a unique security token.  Since there is no support for batching independent instance requests, we could only ever have one instance per API request.  This caused us many problems around API throttling from EC2 and precluded us from using more advanced EC2 systems like Spot Fleet.

Additionally, since the security token is created before the instance id is known, we weren't able to generate instance-isolated Taskcluster credentials.  We want to be able to issue scopes like "worker-type:builder" for things which are specific to the worker type.  These let each worker access anything which is limited to its worker type, such as claiming work from the Queue.

We would also like to be able to securely distinguish between two instances.  To do this, we want to grant scopes like "worker-group:ec2_us-east-1:worker-id:worker1" and "worker-group:ec2_us-east-1:worker-id:worker2" so that worker1 and worker2 can only call their relevant APIs.
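To make the shape of those scopes concrete, here is a small sketch that builds them from a worker type, a region and a worker id.  The function name and the "ec2_" plus region convention for the worker group are assumptions taken from the example strings above, not from the actual service code.

```javascript
// Build the two kinds of scope described above for a single worker.
// scopesForWorker is a hypothetical helper, not part of Taskcluster.
function scopesForWorker({ workerType, region, workerId }) {
  // Assumed convention, mirroring the "ec2_us-east-1" example.
  const workerGroup = `ec2_${region}`;
  return [
    // Shared by every worker of this type, e.g. for claiming work.
    `worker-type:${workerType}`,
    // Specific to this one worker, so two workers can be told apart.
    `worker-group:${workerGroup}:worker-id:${workerId}`,
  ];
}
```

Calling it for worker1 and worker2 gives the same worker-type scope but different worker-id scopes, which is the distinction we were after.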

The last concern with this approach is that it's not compatible with the new Worker Manager system which we're building to consolidate management of our instances.

We need a new approach.

The New Approach to Credentials

Through conversations with our wonderful EC2 support team, we were made aware of the signed Instance Identity Documents in the metadata service.  These documents are small bits of JSON which are signed by an AWS provided private key.

Here's an example of the document:

$ cat test-files/document
{
  "accountId" : "692406183521",
  "architecture" : "x86_64",
  "availabilityZone" : "us-west-2a",
  "billingProducts" : null,
  "devpayProductCodes" : null,
  "imageId" : "ami-6b8cef13",
  "instanceId" : "i-0a30e04d85e6f8793",
  "instanceType" : "t2.nano",
  "kernelId" : null,
  "marketplaceProductCodes" : null,
  "pendingTime" : "2018-05-09T12:30:58Z",
  "privateIp" : "",
  "ramdiskId" : null,
  "region" : "us-west-2",
  "version" : "2017-09-30"
}

This document is accompanied by two signatures, one based on a SHA-1 digest and one on a SHA-256 digest.  After some back and forth on the specifics with EC2, we decided to implement the SHA-256 based digest version.  The signatures were provided in S/MIME (PKCS#7) format.

This document, once validated, contains enough information for us to know the region and instance id without having known about the instance before it requested credentials.  Since AWS has the private key and the public key is tracked in our server, we can validate that this document was generated by AWS.  AWS does not reuse instance ids, so we can generate Taskcluster scopes for that specific instance id.
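Once the signature has been checked, pulling the identity out of the document is plain JSON handling.  The sketch below assumes the text passed in is exactly the bytes that were verified; identityFromDocument is a hypothetical helper name, and the signature check itself is not shown here.

```javascript
// Extract the fields we need from an ALREADY VERIFIED identity document.
// Parsing an unverified document would let an instance claim any identity.
function identityFromDocument(documentText) {
  const doc = JSON.parse(documentText);
  for (const field of ['region', 'instanceId']) {
    if (typeof doc[field] !== 'string' || doc[field] === '') {
      throw new Error(`identity document is missing ${field}`);
    }
  }
  return { region: doc.region, instanceId: doc.instanceId };
}
```

For the example document above, this would return the region "us-west-2" and the instance id "i-0a30e04d85e6f8793", which is everything needed to derive instance-specific scopes.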

This is great!  We have the ability to validate which instance id an instance is without ever having known about it before.  We don't need to store much in our state about credentials because we can generate everything when the credentials are requested.  We also can now request 10 identical instances in a single API request without any worries about security token reuse.

A quick search of the NPM module library to find something to validate S/MIME signatures pointed us to... absolutely nothing.  There wasn't a single library out there which would do S/MIME signature validation out of the box.  Since OpenSSL was what EC2 provided documentation for, I looked for full OpenSSL bindings to Node and didn't find any.  It was strange, considering Node.js is built with a full copy of OpenSSL.

We tried a couple of ways to work around this issue.  The first was using the Node.js child_process module and temporary files: write out the documents, then call the OpenSSL command line tool on those files.  This didn't work out because of the performance overhead as well as the difficulty of managing temporary files at runtime.
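For a sense of what that first attempt involves, here is a rough sketch, not the code we actually wrote.  It assumes an openssl binary on the PATH, a signature that has already been wrapped in PEM headers, and a certificate stored at a known path; the function names are hypothetical.  The flags are standard options of the openssl smime command.

```javascript
const { execFile } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Arguments for `openssl smime -verify` with a detached signature.
function buildVerifyArgs({ signaturePath, documentPath, certificatePath }) {
  return [
    'smime', '-verify',
    '-in', signaturePath, '-inform', 'PEM',
    '-content', documentPath,
    '-certfile', certificatePath,
    '-noverify', // check the signature only, not the certificate chain
  ];
}

// Write the inputs to a temporary directory, run the CLI, then clean up.
// Calls back with (error) if openssl could not be run, else (null, boolean).
function verifyWithCli(document, signature, certificatePath, callback) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'iid-'));
  const documentPath = path.join(dir, 'document');
  const signaturePath = path.join(dir, 'signature');
  fs.writeFileSync(documentPath, document);
  fs.writeFileSync(signaturePath, signature);
  const args = buildVerifyArgs({ signaturePath, documentPath, certificatePath });
  execFile('openssl', args, (err) => {
    fs.rmSync(dir, { recursive: true, force: true });
    // A numeric code is openssl's exit status; anything else (for example
    // 'ENOENT') means the process itself could not be started.
    if (err && typeof err.code !== 'number') return callback(err);
    callback(null, !err);
  });
}
```

Every call pays for a process spawn plus two file writes and a directory removal, and any crash between the writes and the cleanup leaves files behind.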

Our second attempt was to use the Forge library.  This approach worked, but required us to manually parse the ASN.1 data of the S/MIME signature to find the values we needed.  It's always a bad sign when your solution starts with "... and then we parse an esoteric crypto file format manually ...".

Knowing that OpenSSL has a binary which can do S/MIME verification and a C API, and that Node.js supports building C/C++ based modules, it became clear that it was time for me to dust off my C knowledge and write my first C/C++ module for doing our validation.

My First Native Node.js Module

Having never written a native module for Node.js, I needed to take a step back and do some research.  My first step was to figure out how to write a native module.  I found a really great guide to using N-API.  The module described was really simple, but gave enough of a picture of how the API worked to be able to build on it.

N-API is one of the nicest APIs that I've learned in a long time.  It's been designed in a very coherent way and the documentation is excellent!  Almost all of the functions work the same way and the patterns for sharing information between C memory and JS types are clear.

Once I figured out the basics of calling C code from Node.js, I needed to figure out how to validate S/MIME signatures using OpenSSL's C API.

Figuring out OpenSSL's C API

I've never built an application using the C interface to OpenSSL.  I found that the OpenSSL documentation was really great, but that it wasn't as comprehensive as I would've hoped for.  I ended up using a couple different sources.  Not being a C developer, I figured the only way to figure this out was to dive in head first.

My first step was to clone and build OpenSSL.  I knew that the command line program was able to do S/MIME validation, so I used grep to try to figure out where the CLI lived.  I found that the code for the S/MIME portions of the OpenSSL binary lived in apps/smime.c.  After spending a lot of time tracing through the program, I found that the function that I needed was PKCS7_verify.

Figuring out the C portion of the API required a lot of reading about how I/O is handled in OpenSSL as well as learning the OpenSSL stack macros.  My first task was to figure out how to get data in from a Node.js Buffer instance into a C buffer.  I used the napi_get_buffer_info method to extract the Buffer into a raw memory buffer in C.

Once I had the bytes, I needed to figure out how to create the data structures which OpenSSL expected.  This was a PKCS#7 envelope for the signature, some X.509 public keys, an X.509 certificate store and an OpenSSL BIO (Basic I/O) file stream of the document.  For each type of data structure, there are functions for taking a BIO and creating an instance.  Since all of our input data is already in memory and relatively small I could use the BIO_new_mem_buf function.  This BIO is a wrapper around raw memory.

Now that I had each of the required data structures and was calling the verification function correctly, I needed to figure out error handling.  Like many C libraries, some errors were signified by returning a NULL pointer or some other falsy value.  Unlike any C API I've used before, OpenSSL has a full error queue system.  Each time something goes wrong, a new error is put into an error queue.

I wanted to make sure that the errors which the Node.js library threw were useful for debugging.  The error queue system in OpenSSL was definitely difficult to work with at first, coming from a dynamic language background.  The documentation did a great job of explaining how things worked.  In the end, I was able to get all of the error information from OpenSSL in order to throw really nice errors in Node.js code.

Here's an example of an error thrown by the JavaScript code.  It is caused by a malformed public key being passed in:

{ Error: asn1 encoding routines ../deps/openssl/openssl/crypto/asn1/asn1_lib.c:101 ASN1_get_object header too long
    at verify (/home/jhford/taskcluster/iid-verify/index.js:70:23)
   [ 'PEM routines ../deps/openssl/openssl/crypto/pem/pem_oth.c:33 PEM_ASN1_read_bio ASN1 lib',
     'asn1 encoding routines ../deps/openssl/openssl/crypto/asn1/tasn_dec.c:289 asn1_item_embed_d2i nested asn1 error',
     'asn1 encoding routines ../deps/openssl/openssl/crypto/asn1/tasn_dec.c:1117 asn1_check_tlen bad object header',
     'asn1 encoding routines ../deps/openssl/openssl/crypto/asn1/asn1_lib.c:101 ASN1_get_object header too long' ] }

The error thrown in JavaScript not only says what the root cause is, it traces through OpenSSL to show where each error in the error queue was generated.  In this case, we know from the first error message that a header is too long in one of our ASN.1 files.  Because we can see that there is an error in reading a PEM file, we know that a certificate in PEM format is the likely culprit.  This should make debugging problems much easier than just throwing an error like "OpenSSL Error".
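The JavaScript side of that can be pictured roughly as follows.  This is a sketch rather than the module's real code: the entry field names and the "errors" property are assumptions, the queue is assumed to be ordered oldest first, and only the layout of each message is copied from the example output above.

```javascript
// Turn one error queue entry into a line such as
// "PEM routines ../pem_oth.c:33 PEM_ASN1_read_bio ASN1 lib".
function formatQueueEntry({ library, file, line, func, reason }) {
  return `${library} ${file}:${line} ${func} ${reason}`;
}

// Build a single Error from the whole queue.  `queue` is assumed to be
// ordered oldest first, so its first entry is the root cause.
function errorFromQueue(queue) {
  if (queue.length === 0) {
    return new Error('unknown OpenSSL error');
  }
  const messages = queue.map(formatQueueEntry);
  const err = new Error(messages[0]);
  // Outermost failure first and root cause last, as in the example output.
  err.errors = messages.slice().reverse();
  return err;
}
```

Keeping the full list on the error object means nothing from the queue is lost, while the message itself stays short enough to read in a log line.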

I also determined that the PKCS7_verify function adds to the error queue on an invalid signature.  For this library, that's not considered an error condition.  I needed to make sure that a failed signature validation did not result in throwing an exception from JavaScript code.  Each error in the OpenSSL error queue has a number.  With that number, there are macros which can be used to figure out exactly which error is being looked at.  With that, we can tell whether the error in the queue is a validation failure and return a validation failure instead of throwing an error.
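The decision itself is simple once the queue entries can be identified.  The sketch below shows the idea in JavaScript with hypothetical names; it matches on reason strings for readability, whereas the C code compares numeric codes using OpenSSL macros such as ERR_GET_LIB and ERR_GET_REASON.  Which reasons count as a validation failure is an assumption here.

```javascript
// Reasons assumed to mean "the signature did not match" rather than
// "something went wrong".  This set is illustrative, not the module's list.
const VALIDATION_FAILURE_REASONS = new Set(['signature failure', 'digest failure']);

// `verified` is the boolean outcome of the verification call and `queue`
// holds the error queue entries collected afterwards.
function interpretVerifyResult(verified, queue) {
  if (verified) {
    return true;
  }
  if (queue.some((entry) => VALIDATION_FAILURE_REASONS.has(entry.reason))) {
    // A bad signature is an expected outcome, so report it as a value.
    return false;
  }
  // Anything else (malformed input, allocation failure, ...) is a real error.
  const reasons = queue.map((entry) => entry.reason).join('; ');
  throw new Error(`OpenSSL error: ${reasons || 'unknown'}`);
}
```

This keeps the calling code straightforward: false means "do not issue credentials", while an exception means "something is broken and needs attention".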

Results and Next Steps

After all of the code was completed, it was time to measure the performance.  I expected that the C version would be faster, but even I was surprised at how much faster it was.  I'm only going to compare bash using the OpenSSL binary against the node module.  This is an analogue to using child_process to start the binary, without the temporary file creation overhead.  I don't have an analogue to compare the performance of the Forge implementation.

Using bash to call the OpenSSL binary took 78 seconds to run 10,000 validations.  Using the C code directly can run 10,000 validations in 0.84 seconds.  The Node.js module can run 10,000 validations in 0.85 seconds.  That is roughly two orders of magnitude faster, and we don't need to worry about managing temporary files.  I think that's a successful outcome.
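For anyone wanting to repeat a measurement like this, a timing loop along the following lines is enough.  It is a sketch with hypothetical helper names, not the harness used for the numbers above; the function being timed is whatever performs one validation.

```javascript
// Run `fn` sequentially `runs` times and return the elapsed time in seconds.
function timeRuns(fn, runs) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < runs; i++) {
    fn();
  }
  const elapsedNanoseconds = process.hrtime.bigint() - start;
  return Number(elapsedNanoseconds) / 1e9;
}

// How many times faster the fast variant is than the slow one.
function speedup(slowSeconds, fastSeconds) {
  return slowSeconds / fastSeconds;
}
```

Plugging in the figures above, speedup(78, 0.85) comes out at a little under 92, which is where the "two orders of magnitude" comes from.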

Before this work can go live, we need to finish integrating the new library into our EC2-Manager patch.  We're tracking that work in PR#54 and Bug 1485986.

I want to thank Wander Costa for his help on the overall project, and Franziskus Kiefer, Greg Guthe and Dustin Mitchel for the really great reviews.
