Borg Priorities

The priority of a job helps define how the scheduler treats it. Ranges of priorities that share similar properties are referred to as tiers:
• Free tier: jobs running at these lowest priorities incur no internal charges, and have no Service Level Objectives (SLOs). 2019 trace priority <= 99; 2011 trace priority bands 0 and 1.
• Best-effort Batch (beb) tier: jobs running at these priorities are managed by the batch scheduler and incur low internal charges; they have no associated SLOs. 2019 trace priority 110–115; 2011 trace priority bands 2–8.
• Mid-tier: jobs in this category offer SLOs weaker than those offered to production tier workloads, as well as lower internal charges. 2019 trace priority 116–119; not present in the 2011 trace.
• Production tier: jobs in this category require high availability (e.g., user-facing service jobs, or daemon jobs providing storage and networking primitives); internally charged for at “full price”. Borg will evict lower-tier jobs in order to ensure production tier jobs receive their expected level of service. 2019 trace priority 120–359; 2011 trace priority bands 9–10.
• Monitoring tier: jobs we deem critical to our infrastructure, including ones that monitor other jobs for problems. 2019 trace priority >= 360; 2011 trace priority band 11. (We merged the small number of monitoring jobs into the Production tier for this paper.)

2.5 Priority, quota, and admission control
What happens when more work shows up than can be accommodated? Our solutions for this are priority and quota.
Every job has a priority, a small positive integer. A high-priority task can obtain resources at the expense of a lower-priority one, even if that involves preempting (killing) the latter. Borg defines non-overlapping priority bands for different uses, including (in decreasing-priority order): monitoring, production, batch, and best effort (also known as testing or free). For this paper, prod jobs are the ones in the monitoring and production bands.

Upgrading PHP on Ubuntu

One of the weirdities that I have on my personal server is that my public-facing site is served from my personal `~/public-html/` folder. PHP is disabled in these folders by default, for good reason, but that reason is to keep PHP out of the hands of randos, and I’m careful about who’s on my machine.

Anyway – There’s a stanza in /etc/apache2/mods-enabled/php-[7].conf that begins with `Running PHP scripts in user directories is disabled by default` – Do as it says and comment that section out.

Delete keys in redis non-atomically

There’s a lot of information out there about how to atomically delete a sequence of keys in Redis. That’s great, if you want to cause your production cluster to block for minutes at a time while you do so. If you want to delete a bunch of keys with a scan, though, there’s less info.

redis-cli does support a --scan flag, which combined with a --pattern flag allows you to asynchronously list a set of prefixed keys – Like the keys command, except without causing your redis server to block. You can then use this output to feed an xargs command.

For example:
redis-cli --scan -h "${REDISHOST}" --pattern "PATTERN" | tee keys | xargs redis-cli -h "${REDISHOST}" del | tee deletions
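If you’d rather keep each del call small, xargs can cap how many keys it passes per invocation with -n. Here’s a dry-run sketch of that idea: echo stands in for redis-cli so nothing is actually deleted, and the key names and the batch size of 2 are made up for illustration.

```shell
# Dry run: `echo` stands in for `redis-cli -h "${REDISHOST}"`, so nothing is deleted.
# The key names and the batch size of 2 are hypothetical, chosen only for illustration.
batched="$(printf '%s\n' user:1 user:2 user:3 | xargs -n 2 echo del)"

# Each output line is one command that xargs would have run.
printf '%s\n' "${batched}"
```

In a real run you would swap the echo back out for the redis-cli invocation above and pick a batch size that suits your cluster.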

Prometheus alerting and questions

I’ve been switching my company over to Prometheus, and I’ve come across a few things that need discussion and opinions.

First, concrete advice:
Don’t just write an alert like
alert: foo
expr: sum(rate(bar[5m])) > 5
Write it so you record the rate, and then alert on that metric:
record: bar:rate
expr: sum(rate(bar[5m]))
alert: foo
expr: bar:rate > 5

From my Google days, I can say I should probably specify what the time is on that rate.

1) How long should the rate window be? [5m]? [2m]? 3? 10?
* I’ve adopted 5m as standard across my company, being a compromise between being fast-moving and not overly smoothed
2) How long should alert `for`s be?
3) Metric naming
* I’m using `A_Metric_Name`; Not sure if this is right
4) Recorded rule naming
* I like `product:metric[:submetric]:unit`; e.g. houseparty:websockets_open:byDeviceType:sum

Kubernetes Build best practices

1) Squash your builds
This is now part of default docker, but it was well worth it even before. Docker will create a new tarball for each `stage` – Each ADD, RUN, etc creates a new layer that, by default, you upload. This means if you add secret material and then delete it – you haven’t really deleted it. More commonly, it bloats your image sizes. A couple intermediate files can be a huge pain, and waste your time and bandwidth uploading.

Don’t squash down to a single, monolithic image – Pick a good base point. Having a fully-featured image as a base layer is not a sin – So long as you reuse it, it doesn’t take up any more space or download time, so your lightweight squashed build can build on top of it.

2) Use Multistage builds
Your build environment should be every bit as much a container as your output. Don’t build your artifacts in your local machine and then add them to your images – You’re likely polluting your output with local state more than you know. Deterministic builds require you to understand the state of the build machine and make sure it doesn’t leak, and containers are a wonderful tool for that.

Just use Bazel. Bazel is pretty simple to use, powerful, and generates docker-compatible images without actually running docker.

Migrating a SBT project to Bazel.

I’ve been working today on migrating a SBT project to Bazel. I’ve taken a few wrong turns, and I’ll document them later, but this will be my working doc and I’ll add some failures to the end.

Two major components – Bazel’s generate_workspace tool, and SBT’s make-pom command. You’ll create a POM file with the dependencies and repos.

ted:growth$ sbt make-pom
[warn] Executing in batch mode.
[warn] For better performance, hit [ENTER] to switch to interactive mode, or
[warn] consider launching sbt without any commands, or explicitly passing 'shell'
[info] Loading project definition from /Users/ted/dev/growth/project
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See for further details.
[info] Set current project to growth (in build file:/Users/ted/dev/growth/)
[warn] Multiple resolvers having different access mechanism configured with same name 'Artifactory-lib'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Wrote /Users/ted/dev/growth/target/scala-2.11/growth_2.11-resurrector-9449dfb1de3b816c5fd74c4948f16496b38952ab.pom
[success] Total time: 5 s, completed Jun 14, 2017 4:00:17 PM

This generates a pom file, but not exactly as generate_workspace wants it. It requires a directory containing a pom.xml, so go ahead and make a tempdir and copy the file into it:
TMPDIR="$(mktemp -d)"; cp /Users/ted/dev/growth/target/scala-2.11/growth_2.11-resurrector-9449dfb1de3b816c5fd74c4948f16496b38952ab.pom "${TMPDIR}/pom.xml"

Next, build

So, on to the failures:
I initially tried to do my own workspace code generation. I took the output of sbt libraryDependencies and turned it into maven_jar stanzas via script. This didn’t work, for the simple reason that I wasn’t doing it transitively; they mention that in the generate_workspace docs. I also tried specifying that list of deps as a big list of –archive stanzas; that turned out to be a mistake, mostly because of alternate repos. I also had to clean out a broken SBT set of repos; bazel does not play well with repeated repo definitions, while SBT is happy to ignore them.


The big companies I’ve worked at have all had security policies. The small companies haven’t. Frequently, all access to production machines has been controlled by a single shared ssh key. This sucks, but is inevitable, given the lack of time to spend on tooling. However, there are some low-cost tools to make this better.

The basic developer workflow has been – Type in a command, which will generate a SSH certificate, then ask you for your password and u2f auth, and it’ll talk to the central signing server and get that cert signed. This is surprisingly doable for a small org – BLESS and CURSE are two alternatives.

For myself, though, the right thing to do is run ssh-agent. ssh-agent allows you to keep your keys in memory, and can support several keys. It also allows for forwarding the auth socket to a remote host – so if you need to ssh through a bastion host, you don’t have to copy your SSH key to the bastion machine; it can live on your local drive and all authentication requests can go through it. ssh -A enables this forwarding.

The other problem I’ve encountered a few times is that I want to share my ssh-agent across several terminals. This can be a blessing or a curse, but on most of my machines I only have one or two keys, and while I want them encrypted at-rest I don’t care if they’re loaded in memory a bunch. I’ve written the shell script that does this a bunch of times, and today I asked myself why it’s not in the default ssh toolkit (like ssh-copy-id). Well, it’s not, but there is a tool that does what I’m looking for: Keychain, not to be confused with the OSX tool of the same name. Though, to my surprise, OSX *already has this functionality*; my default terminal opens up with an SSH_AUTH_SOCK already populated, and it’s managed by the system. That’s pretty cool.

Annotated git config.

[push]
# Much saner than the old behavior, and new default.
default = simple
[user]
# Duh.
email =
name = Ted Hahn
# Corresponds to my signing key.
signingkey = 1CA0948A
[pull]
# When pulling, rebase my feature branches on top of what they’ve just pulled.
rebase = true
[commit]
# Sign all commits
gpgsign = true

Bash tips.

Here’s some things you should start most bash scripts with:


#!/bin/bash
set -e
set -x
set -o pipefail
set -u

TMPDIR="$(mktemp -d)"
trap 'rm -rf "$TMPDIR"' EXIT

Explanations of the lines:


#!/bin/bash

The shebang line is a Unix convention that allows scripts to specify their interpreter. Since this is a bash script, we tell it to run this file with bash.

set -e

Exit immediately if any command fails. Makes it easy to spot when a script did not complete, and prevents things further down the line from doing the wrong thing because they were only partially set up.
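If you want to see that for yourself, here’s a minimal sketch. It runs a throwaway one-liner in a child bash so the early exit doesn’t take your own shell with it; `false` stands in for any failing command, and the variable names are just for illustration.

```shell
# Run a throwaway script in a child bash so its early exit does not end this shell.
# `false` stands in for any failing command; `|| true` keeps the demo itself going.
out="$(bash -c 'set -e; echo before; false; echo after' || true)"

# Run it again just to capture the exit status of the aborted script.
rc=0
bash -c 'set -e; echo before; false; echo after' >/dev/null || rc=$?

echo "output: ${out}"
echo "exit status: ${rc}"
```

The second echo inside the child script never runs, and the child exits non-zero, which is exactly what you want a caller (cron, CI, another script) to notice.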

set -x

Print each command as it’s run. It’s fantastically useful debug output, though some production scripts should have this disabled.

set -o pipefail

Make a pipeline report failure if any stage of it fails, not just the last one. This is about commands chained together with a pipe; e.g. if your grep command fails, the pipeline as a whole fails (and, with set -e, so does the script), rather than simply passing nothing to the next stage of the pipeline.
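A quick way to convince yourself, sketched with `false | true` as a stand-in for a pipeline whose first stage fails. Each pipeline runs in a child bash so the option doesn’t leak into your current shell; the variable names are made up.

```shell
# `false | true` is a pipeline whose first stage fails and whose last stage succeeds.
# Each variant runs in a child bash so the option does not leak into this shell.
plain="$(bash -c 'false | true; echo $?')"
strict="$(bash -c 'set -o pipefail; false | true; echo $?')"

echo "exit status without pipefail: ${plain}"
echo "exit status with pipefail: ${strict}"
```

Without the option the pipeline’s status is just that of its last stage, so the failure is silently swallowed.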

set -u

Makes referencing unset variables an error.
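Here’s a small sketch of what that looks like in practice, again using a child bash so nothing here affects your own shell. NOT_SET_ANYWHERE is a made-up variable name that is assumed not to exist in your environment.

```shell
# NOT_SET_ANYWHERE is a hypothetical name, assumed to be unset in the environment.
# With set -u, expanding it is an error and the child script exits non-zero.
rc=0
bash -c 'set -u; echo "value: ${NOT_SET_ANYWHERE}"' 2>/dev/null || rc=$?

# The ${VAR:-default} form is still allowed, which is handy for optional settings.
fallback="$(bash -c 'set -u; echo "${NOT_SET_ANYWHERE:-default-value}"')"

echo "exit status of the unset reference: ${rc}"
echo "with a default: ${fallback}"
```

This is the option that catches the typo’d variable name before it expands to an empty string in something like an rm path.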

Further explanation of the above options can be found in the Bash Reference Manual entry on Set.

TMPDIR="$(mktemp -d)"
trap 'rm -rf "$TMPDIR"' EXIT

Create a scratch dir, automatically delete it when you’re done. It’s often useful to comment out the trap line during debugging.
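To check that the cleanup really happens, here’s a minimal sketch. The function body runs in a subshell, which plays the role of a whole script: the trap fires when that subshell exits. run_job, scratch, and work.tmp are made-up names for illustration.

```shell
# The ( ... ) body makes run_job execute in a subshell, standing in for a whole script.
# run_job, scratch, and work.tmp are hypothetical names used only for this demo.
run_job() (
  scratch="$(mktemp -d)"
  trap 'rm -rf "$scratch"' EXIT   # fires when the subshell exits
  touch "$scratch/work.tmp"       # pretend to do some work in the scratch dir
  echo "$scratch"                 # report which directory was used
)

used_dir="$(run_job)"

if [ -e "${used_dir}" ]; then
  echo "scratch dir still exists: ${used_dir}"
else
  echo "scratch dir was cleaned up: ${used_dir}"
fi
```

Quoting the variable inside the trap matters if mktemp ever hands you a path with spaces in it.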

See also Pixelbeat’s blog on Common shell script mistakes

Symlinks are (not) hard.

I’ve got two amusing anecdotes related to symlinks. By amusing anecdotes, I of course mean incredibly frustrating weird behaviors that took hours to debug. One Java, one Chef.


Chef handles environments very well… except when it comes to databags. From my perspective, this is a critical flaw, since the things I want to keep out of the main chef repo (API keys and passwords) are also the things most likely to be affected by the environment. So, when building, we specify the path to the chef databags, separating out the prod, canary, and dev environments.

For the parts that are common between the databags, I figured I’d use symlinks. Our databags are stored in a git repo, and git interprets symlinks correctly. The full set of databags was copied everywhere, so I could simply include a relative symlink to ../../prod/foo/bar.json for each databag I wanted consistent. I got the following error:

syntax error, unexpected end-of-input

pointing to a character in the middle of the first line in the file. This made no sense.

It took me several tries with different files to figure out what was going on. The position of the character being pointed to, x, was the same as the number of characters in the symlink path. A symlink is sorta just a text file with a pathname and a special flag on it. If you stat the symlink itself, you’ll get the length of that pathname, not the size of the file it points to. What Chef seems to be doing is stat-ing that file, then taking that length as gospel – it doesn’t process it as a stream, but as a block of the stat’d size.
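The stat behavior itself is easy to reproduce. This is only a sketch of the size mismatch, not of what Chef does internally, and it assumes GNU coreutils stat (the -c flag; BSD and macOS stat spell this differently). The file names and JSON contents are made up.

```shell
# Assumes GNU coreutils `stat` (-c FORMAT). File names and contents are hypothetical.
work="$(mktemp -d)"
printf '{"id": "bar", "key": "some longer secret value"}\n' > "${work}/real.json"
ln -s real.json "${work}/link.json"

# Without -L, stat reports on the link itself: its size is the length of "real.json".
link_size="$(stat -c %s "${work}/link.json")"
# With -L, stat follows the link and reports the size of the real file.
file_size="$(stat -L -c %s "${work}/link.json")"

echo "size of the symlink: ${link_size}"
echo "size of the target:  ${file_size}"
rm -rf "${work}"
```

Anything that reads only the first number’s worth of bytes from the real file gets a truncated document, which would line up with an “unexpected end-of-input” partway through the first line.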

I should probably get around to testing that with the latest version and writing a bug.


Java has a really simple package deployment mechanism: JARs. You can put a bunch of classes into a jar, and deploy them as one. If you have a project with a bunch of dependencies, you can ‘shade’ your jar and wrap all your classes into a single mono-jar.

However, for some use cases it’s not that simple. Java up to 1.7 simply won’t accept more than 65,535 files in a jar, the 16-bit entry limit of the classic zip format (and remember that anonymous classes are a separate file). Further, signatures can’t be retained; a jar has a signing key attached, and all files must be signed using that same signing key, so a ‘shaded’ jar can’t include the original signatures of dependencies.

So, since monolithic jars don’t work in some cases, what do you do instead? You ship several jars. It’s well documented but not well understood that when you specify a jar with java -jar, your classpath is ignored. How do you load multiple jars, then?

Inside the jar is a META-INF folder containing a MANIFEST.MF file. This manifest file contains a bunch of key-value pairs, and one of those keys can be Class-Path. This class-path key can specify additional jars or directories, and it usually will. However, because of deployment concerns, it will generally list them as relative paths or just as filenames. How does java find those files?

In about the worst way possible. Java will dereference any symlinks in the jar it is loading, then search the base directory of the final file it reads for the class-path includes. So, if you have a bunch of projects with common includes, you cannot simply symlink in all your dependency jars; you need hard copies of every jar you include. This also means you can’t simply update a dependency jar in one place; you have to hard-link it into the working directory of every app you want to deploy.

I guess an option is to simply have a big folder full of all the jars for all the apps you want to run, but that folder can get very cluttered, and it becomes unclear what’s there and why – is one of your dependencies shared? Do you have a garbage-collection mechanism for older jars in that folder?

Ted's Excellent Adventure.