One of my last projects at SecondMarket was to automate and rebuild the Jenkins infrastructure. We’d previously had a static setup in the NYC office with a build master and three slaves that ran all the time, but this handled developer check-in storms very poorly: when developers were trying to make code cutoff for a feature, many builds would queue up for lack of available executors, yet at other times the slaves sat completely idle. It made more sense to move the entire setup to the cloud and implement some kind of auto-scaling for Jenkins.
The Jenkins Amazon EC2 Plugin
We decided to use the Jenkins Amazon EC2 plugin on the master to spin up new slaves when jobs are waiting; after a configurable period of inactivity, the plugin suspends or terminates the instances. The main problem we had to solve was that SecondMarket’s build process and test suite require the presence of many packages representing the key parts of our stack: a JDK, PostgreSQL, MongoDB, and so on. We’d spent a lot of time automating the provisioning of these services with Chef, but we didn’t want to bootstrap all these transient nodes off our Chef server and leave lots of dead client and node objects lying around. How could we leverage the work we’d done with Chef in a world where a “golden image” AMI is required?
Instance Metadata to the Rescue
Amazon EC2 allows you to pass arbitrary “user data” (often just called “instance data”) to an instance at boot, and the Jenkins plugin supports this as well. By convention, instance data is a shell script run once by the target system after the machine comes up. (Of course, the target system’s AMI must include some way to run it: most EC2 AMIs include cloud-init for exactly this reason.) Initially, I’d hoped to call Chef Solo directly from the instance-data script, so that I could just use a bare, generic CentOS 6 AMI in Jenkins. That proved infeasible because of the provisioning time: on top of the instance’s startup time (about five minutes), the first run of Chef Solo took another five to seven minutes, so it would be ten minutes or more before a new slave could come online.
Instead, I agreed on a compromise with our build master: we’d use Chef Solo to build the golden image, and we’d automate that process. If changes are required to the AMI, we commit them to our Chef repository, create a new Chef Solo bundle, and build a new AMI. (Down the road, SecondMarket might even consider having Jenkins perform the whole image build itself, which would be a strange Malkovichian twist to this whole scheme.)
The Code
First we need a script to generate the Chef Solo tarball. I wrote a very simple one that checks out the master branch of every relevant cookbook and rolls them into a tarball, which is then published to our internal repository server, accessible over HTTP.
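Something along these lines does the job; the cookbook names and the Git and repo-server URLs below are placeholders, not our real ones:

#!/bin/bash
# Bundle the master branch of each relevant cookbook for Chef Solo.
# Cookbook names and the Git base URL are illustrative placeholders.
set -e

WORKDIR=$(mktemp -d)
mkdir -p "$WORKDIR/cookbooks"

for cb in java postgresql mongodb; do
  git clone --depth 1 --branch master \
    "git://git.example.com/cookbooks/${cb}.git" "$WORKDIR/cookbooks/$cb"
done

tar -C "$WORKDIR" -czf chef-solo.tar.gz cookbooks

# Publish the bundle so instances can fetch it over HTTP:
scp chef-solo.tar.gz repo.example.com:/var/www/chef/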
Then we pass an instance-data script to the EC2 bootstrap that installs Chef, sets up a dummy role called jenkins-node-solo-role, and runs Chef Solo to provision the machine.
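Here’s a sketch of what instance-data.sh looks like; the tarball URL is the placeholder from above, and the run list is an assumption standing in for our real one:

#!/bin/bash
# instance-data.sh -- run once at first boot by cloud-init.

# Install Chef via the omnibus installer
curl -L https://www.opscode.com/chef/install.sh | bash

# Minimal Chef Solo configuration
mkdir -p /etc/chef /var/chef-solo/roles
cat > /etc/chef/solo.rb <<'EOF'
file_cache_path "/var/chef-solo"
cookbook_path   "/var/chef-solo/cookbooks"
role_path       "/var/chef-solo/roles"
EOF

# The dummy role. Note the "json_class" line -- more on that below.
cat > /var/chef-solo/roles/jenkins-node-solo-role.json <<'EOF'
{
  "name": "jenkins-node-solo-role",
  "json_class": "Chef::Role",
  "chef_type": "role",
  "run_list": [ "recipe[java]", "recipe[postgresql]", "recipe[mongodb]" ]
}
EOF

# Node attributes: just point at the dummy role
cat > /etc/chef/node.json <<'EOF'
{ "run_list": [ "role[jenkins-node-solo-role]" ] }
EOF

# Fetch and unpack the cookbook tarball, then converge
curl -o /tmp/chef-solo.tar.gz http://repo.example.com/chef/chef-solo.tar.gz
tar -C /var/chef-solo -xzf /tmp/chef-solo.tar.gz
chef-solo -c /etc/chef/solo.rb -j /etc/chef/node.json

With that script in hand, we launch a temporary builder instance using the Amazon EC2 tools: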
% ec2-run-instances -f instance-data.sh -g whatever -k some-key -t m1.small ami-XXXXXXXX
Once satisfied, we can snapshot the instance and create an AMI from it:
% ec2-create-image -n "jenkins-node-image-20130226" i-XXXXXXXX
Now just configure Jenkins to use the new AMI ID as the slave image and we’re done. (Don’t forget to kill off the temporary instance.)
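That cleanup is one more invocation of the same tools, with the same instance ID as before:

% ec2-terminate-instances i-XXXXXXXX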
Conclusions and Future Work
I learned a couple of lessons from doing this, chief among them that if your role’s JSON doesn’t have "json_class": "Chef::Role", you get some truly bizarre error messages. That one took me half a day to track down.
Ideally I’d love to have some BDD-style tests describing the target state of the slave machine (e.g. “must have MongoDB installed”) that could be run right after the Chef Solo run. I went through perhaps ten iterations of building AMIs with this process, only for the build master to complain that something was missing from the image; it took a lot of tweaking to get the run list right.
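Until proper BDD tests exist, even a crude smoke test run immediately after Chef Solo would have caught most of those gaps. A minimal sketch, assuming the stack described above (the list of commands to check is illustrative):

#!/bin/bash
# Crude post-provisioning check: verify that the binaries the build needs
# actually exist on the image before we snapshot it.
for cmd in java psql mongod; do
  if ! command -v "$cmd" >/dev/null 2>&1; then
    echo "FAIL: $cmd not found on image" >&2
    exit 1
  fi
done
echo "OK: all required commands present"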