In Defense of Over-Engineering

David R Bayer
8 min read · Feb 1, 2019

Over the course of my career, I have (so far) been a desktop repair technician, desktop support technician, system administrator, DevOps engineer (automation engineer if you prefer), and technical manager. As I look back, one of the through lines across all of these roles was my tendency to over-engineer solutions. I was not always aware of this.

As I have matured, I have come to the opinion that this is not a bad thing. Sure, I spend more effort developing solutions than I might otherwise, but I also reap the rewards of that effort. It doesn’t matter if I’m working on infrastructure design, administration scripts, or what-have-you. Some of the benefits that I see:

  • Pass it on: Make it easier to hand off to someone else. This particularly applies to administrative scripts. Comment the code. Add help/usage. Make it easy for someone new to use without them having to decipher some arcane idiomatic syntax that they may not be familiar with. “What flavor of regex does this script use anyway?”
  • Make it observable: If you can’t monitor it, you can’t own it and it will bite you in the ass. That backup script may have worked great when you first deployed it, but if you can’t consistently monitor execution one day you may find that it’s pining for the fjords and in fact hasn’t been successful in weeks. Invariably you discover this only when you really need that backup. (I know, I know. You must do restore tests on a regular basis. This is only an example.)
  • Fire and forget: When my solution is deployed, I don’t want to look at it again. Once the solution is fully tested and stable, I don’t want to have to mess with it again until we change something. For infrastructure, that might be changing the footprint by adding a new application cluster. For scripts it might be software upgrades forcing changes to the scripts. Whatever it is, I want to go months without looking at it again.

Today I found myself writing a quick little bash script to run database backups. Yes, I absolutely could have run the backup process as a cron job without any kind of bash script at all, but where’s the fun in that? In this case, the database in question is Neo4j, a graph database that not many people in my organization are familiar with.

Pass It On

First things first: I will do what I can to make it easy to hand the script off to someone else. I don’t want to be the sole owner of this for the rest of my days. If and when someone comes along behind me and needs to know how to use the script, I want to make it really easy for them. Except for one-off, throw-away solutions, I almost always include named parameters and usage in some form or fashion. In bash I use getopts; Ruby has OptionParser, and Python has argparse. Whatever language I’m using, I try to include something along these lines:
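Something like this (a trimmed-down sketch; the option names, defaults, and paths here are illustrative rather than the real script’s):

#!/usr/bin/env bash

usage() {
  cat <<EOF
Usage: $(basename "$0") [-h] [-v] [-d backup_dir]
  -h  Show this help and exit
  -v  Verbose output (include DEBUG messages in the log)
  -d  Directory to write backups to (default: /var/backups/neo4j)
EOF
}

# Defaults; overridden by the command line options below
backup_dir="/var/backups/neo4j"
verbose=false

while getopts ":hvd:" opt; do
  case "$opt" in
    h) usage; exit 0 ;;
    v) verbose=true ;;
    d) backup_dir="$OPTARG" ;;
    *) usage; exit 1 ;;
  esac
done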

I do this often enough that I set up keyboard shortcuts to add these blocks to my code in various languages. With the above in place, anyone coming along later can get help just like they would with most *nix commands:
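For example, using the sketch above (the script name here is illustrative):

$ ./neo4j_backup.sh -h
Usage: neo4j_backup.sh [-h] [-v] [-d backup_dir]
  -h  Show this help and exit
  -v  Verbose output (include DEBUG messages in the log)
  -d  Directory to write backups to (default: /var/backups/neo4j)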

Now the next person coming along doesn’t have to go through all 130 lines of the script to find out how to use it. Yes, I over-engineered what could have been a one-liner into a 130 line bash script. You’ll notice 40+ lines of that are just the command line options and usage.

In my case, I create a dedicated backup host in AWS that joins the Neo4j cluster but does not receive any traffic. So my backup script shuts the system down when complete. For someone who is not me, seeing the script shut down the host might be concerning, so adding comments is appropriate:

# Let's save some money!
# Shut down the system when the backup is complete.
# Let sleep scheduler start me back up to backup again tomorrow.
shutdown -h now

Make It Observable

Like any good backup script, this one will run on a schedule without a human on hand to monitor the output. So how do I know that all is right with the world? Again, my one-liner could pipe all output through logger, but that doesn’t make me smile. The default version of logger varies widely across Linux distributions, and those version differences come with some significant differences in options. In addition, unless you commit to managing syslog configurations (which can quickly get out of hand), you’re stuck with sending all logger messages to /var/log/messages and nowhere else. If you think that’s OK, read on.

As an organization starts to scale, logging into individual hosts to read logs (and, more to the point, knowing which hosts to log into) becomes less and less practical. In today’s world of serverless functions and dynamic containerized environments, you can’t even guarantee that the host that generated the logs you need still exists by the time you go looking. Enter the world of log aggregation. You might use the Elastic Stack, Splunk, Datadog, or any of a myriad of other solutions, but you will need something like this if you want to survive as you scale.

Now, think about sending your backup job output to logger and using the most minimal set of options so that you can accommodate whatever version of logger you end up saddled with. So your backup job is logging to /var/log/messages with not much in the way of identifiers, and then getting shipped to your log aggregator. Now you must search the system logs for hundreds or thousands of computers to find out if the backup A) ran, and B) succeeded. While you’re at it, I seem to have lost my needle somewhere…

A better solution: add your own logging. Python and Ruby both have Logger classes that are quite configurable. In my bash script, I roll my own with echo commands, which lets me test by echoing to the console (without polluting the logs) or redirect the script’s output to the log of my choice (in this case /var/log/neo4j/backup.log; clever, huh?). My backup script is littered with lines similar to this:

echo "$(date +"$date_format") INFO Starting backup script"

Because I’m lazy and didn’t feel like typing %Y-%m-%d %T.%3N%z all over the place, I saved that format string to the date_format variable.
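In other words, a single assignment near the top of the script (note that %3N, millisecond precision, assumes GNU coreutils date):

date_format="%Y-%m-%d %T.%3N%z"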

Now I’m creating log messages that match the format expected by my log shipper. You’ll notice in the usage block above that I even include a verbose option that lets me emit DEBUG messages when desired. By redirecting my script’s output to a file in a path watched by my log shipper, I can easily search my log aggregator for output from the backup script. I can even use the aggregator to alert both on errors and on unusual absences of activity (indicating the job didn’t trigger at all for some reason). I now have a reasonable expectation of being notified when the script has a problem, without having to watch it with my personal peepers all the time.
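The script itself just repeats the echo pattern, but the same idea wrapped in a small helper looks roughly like this (a sketch, assuming the verbose flag and date_format variable from earlier; the real script doesn’t necessarily use a function):

log() {
  # $1 is the level (INFO, DEBUG, ERROR); the rest is the message
  local level="$1"
  shift
  # Only emit DEBUG messages when the verbose flag was set
  if [[ "$level" == "DEBUG" && "$verbose" != "true" ]]; then
    return 0
  fi
  echo "$(date +"$date_format") $level $*"
}

log INFO "Starting backup script"
log DEBUG "backup_dir is $backup_dir"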

Fire and Forget

Now that I can A) hand it off to someone else, and B) monitor how it behaves, I want to set everything up and walk away. So I try to think of the most likely ways things could go wrong and take care of those. And I don’t just try to avoid them. Getting back to being observable, I also log things when they are not happy.

First off, let’s capture errors so I can do some last minute housekeeping before exiting the script:

# Halt on errors, even in functions
set -eE

# Things to do in the event of an error
fail() {
  for file in $(ls $backup_dir/inconsistencies*); do
    s3_move $file "inconsistency_reports"
  done
  start_neo4j
  echo "$(date +"$date_format") ERROR Backup script failed!"
}

# If something errors, run fail before exiting
trap "fail" ERR

The basic logic of the script (see the sketch after this list) is:

  • Wait for Neo4j to sync up and become available
  • Stop Neo4j in preparation for an offline backup
  • Run a consistency check
  • Run a backup
  • Move backup to AWS S3 for safekeeping
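In script form, that flow boils down to something like this. s3_move is the real helper used in the trap handler above; the other function names and $backup_file are placeholders I’m using for illustration:

wait_for_neo4j          # block until this node has joined the cluster and caught up
stop_neo4j              # the backup is taken offline, so stop the service first
run_consistency_check   # writes inconsistency reports to $backup_dir on failure
run_backup              # produce the backup archive in $backup_dir
s3_move "$backup_file" "backups"   # ship the finished archive to S3 for safekeeping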

If the consistency check fails, it generates a report. In that case the trap captures the failure, copies any inconsistency reports it finds up to S3, starts Neo4j back up (remember, this node does not receive traffic), and then writes a log message before exiting. This last bit is very important. “Absence of evidence is not evidence of absence.” That log message tells me in no uncertain terms that the script failed, so I don’t have to rely solely on “is there a success message?” A missing success message could simply mean the backup took longer than anticipated, or the log pipeline to the aggregator is backed up, or something else entirely.

By moving backups to S3 instead of just copying them, I avoid the risk of two backup files with the same name if I happen to re-run the backup script manually. And if I ever do hit a naming collision, that means something went wrong moving the file to S3, and I needed to take care of that anyway.
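s3_move itself can be a thin wrapper around the AWS CLI. One plausible shape (the bucket name and key layout here are illustrative, not necessarily what the real script uses):

s3_move() {
  # $1 is the local file, $2 is the S3 prefix to file it under
  local file="$1"
  local prefix="$2"
  # 'aws s3 mv' uploads and then removes the local copy, so a manual re-run
  # can never collide with a file that has already made it to S3
  aws s3 mv "$file" "s3://my-backup-bucket/neo4j/$prefix/$(basename "$file")"
}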

I also included other safeguards in the script, like making sure Neo4j is running, synced up, and happy before trying to back up.
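That check can be as simple as polling until Neo4j answers. A sketch (7474 is Neo4j’s default HTTP port; the retry budget is arbitrary, and a real cluster-sync check needs more than a simple reachability test):

wait_for_neo4j() {
  local retries=60
  # Poll the HTTP endpoint until Neo4j responds, giving up after ~10 minutes
  until curl -sf -o /dev/null "http://localhost:7474"; do
    retries=$((retries - 1))
    if [[ "$retries" -le 0 ]]; then
      echo "$(date +"$date_format") ERROR Neo4j never became available"
      exit 1
    fi
    sleep 10
  done
  echo "$(date +"$date_format") INFO Neo4j is up and reachable"
}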

Yes, a lot more effort went into my 130-line script than into the one-liner that could have done the work. In the code above I have already shared almost half of it. Was all of it absolutely necessary? Probably not. But to make it truly production-ready, these were not just nice-to-have features. They are gotta-haves if you really want to step up your game.

I focused on over-engineering scripts in this post. But similar thought and care can go into any solution you design. If you’re designing cloud infrastructure, think of ways that things might go wrong. Assume they will (anyone remember AWS losing all of the us-east-1 region for a bit?), and think about how to mitigate the risks. Use queues or pub/sub facilities or data streams to buffer application requests so that an outage affecting Service B doesn’t cause grief for Service A. Find ways to fail over to alternate environments gracefully (CloudFront origin groups FTW).

In the bad old days before DevOps, when deployments consisted of development “throwing it over the wall” for deployment by the ops team, one of my favorite quips about the difference between Dev and Ops was:

Dev: It works!
Ops: Do it again!

These days ops teams are moving farther left into development, and are beginning to understand (and in some cases adopt) the developer mindset behind “It works!” Over-engineering solutions is a way of bridging the gap between “It works!” and “Do it again!”

The final script (as it exists today). You’ll notice there aren’t a huge number of comments; in most cases I felt the log messages were sufficient.
