Sysadmin Still Surviving: April 2017

Not long ago I started a job with a company whose primary product is a highly customized application composed of many smaller interoperating applications. Without getting into too much detail, the applications communicate through various APIs, many of which are not well documented.

(What follows are thoughts that are not focused solely on the new employer, but rather a set of experiences I've gathered over the years from several jobs and interactions with others in the technology field. In other words, this isn't about the current employer. It's a conglomeration of experiences, and it's my own opinion. Just figured I'd have to clarify that...)

As a company focuses on growth, there comes a time when maintenance and monitoring are moved to staff dedicated to those tasks so the developers no longer have to do triple duty. For the new hire tasked with pioneering that position, the work means gathering statistics to get a feel for how the systems behave over time, handling regular maintenance, and doing basic troubleshooting. That is very daunting when there is little (or no) documentation outlining how to get the metrics needed to gauge the health of the system.

And it isn't just a lack of documentation that acts as an obstacle. When a software-based company is first conceived and grows, it's natural for the programmers to work on getting the product into a usable, testable state. This means overcoming problems as they arise and focusing on results, not laying a framework for delegating future operations.

That fosters institutional knowledge. The more of your system that is developed in-house, the more information future maintainers must glean about your system without the help of outside references. Sites like Serverfault can help when you're trying to figure out why a new deployment of Nginx won't work, but they won't be useful when a log contains output from a Java application that Bob, three desks away, wrote while debugging a particular reply from another subsystem's API.

Small companies with a small number of developers may feel it is inconvenient to be interrupted by the new person's constant questions about why application A is dependent on application B, or how application C discovers a service status on server 3. As a new hire, I feel a little hesitant to approach others with these types of questions, preferring to look for answers through other means before taking someone else's time.

(In my opinion, if the answer is to check the source code from the repo and read that to get the answers, you may as well have hired a new programmer; recognizing a need for someone dedicated to operating and maintaining your system outside the coterie of coders is a sign that there may be a need to dedicate time to documenting and tooling the application for non-programmer use.)

How can a new hire get a grasp on this situation?

In this case, I've been writing a series of Nagios plugins specifically configured to pull metrics from the various subsystems in the company application. There were cases where a task I thought was simple turned out to be more nuanced than it first appeared; each time, I ended up discovering something more about the operation of the system, and I made sure it was documented for later reference.

Each time there was a failure case, I would make a note and start work on a new monitor so we'd know about it in the future. These monitors didn't just collect a snapshot of the current state of a service; each one would gather some metric that was then sent to a database and from there plotted in a graphing application for performance monitoring.
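To make the "snapshot plus metric" idea concrete, here is a minimal sketch of how a plugin can report both at once. It follows the standard Nagios plugin convention (exit codes 0 to 3, and performance data after a pipe on the output line). The service name, label, and thresholds are made up for illustration and are not taken from the plugins described here.

```python
# Nagios plugin convention: the exit code carries the state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Return a Nagios state for a metric where higher is worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def plugin_line(service, label, value, warn, crit, uom=""):
    """Build the single output line: human-readable text, a pipe, then perfdata."""
    state = classify(value, warn, crit)
    text = f"{service} {STATE_NAMES[state]} - {label} is {value}{uom}"
    # Perfdata format: label=value[UOM];warn;crit
    perfdata = f"{label}={value}{uom};{warn};{crit}"
    return state, f"{text} | {perfdata}"
```

A real plugin would print the returned line and pass the state to `sys.exit()`. The part after the pipe is what a graphing backend can pick up and store over time.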

The current product relies on database performance. Queries behave differently from one another: some are straightforward and others require processing of filters. Some of my checks measure response times.
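A response-time check can be sketched roughly like this. It is only an illustration under assumptions: `operation` stands in for whatever runs the query (any database driver call wrapped in a function), and the thresholds are hypothetical.

```python
import time


def timed_call(operation):
    """Run operation() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = operation()
    return result, time.perf_counter() - start


def check_response_time(operation, warn_s, crit_s):
    """Return (exit_code, output_line) for a timed operation."""
    try:
        _, elapsed = timed_call(operation)
    except Exception as exc:  # treating any failure to run the query as CRITICAL
        return 2, f"QUERY CRITICAL - {exc}"
    if elapsed >= crit_s:
        code, name = 2, "CRITICAL"
    elif elapsed >= warn_s:
        code, name = 1, "WARNING"
    else:
        code, name = 0, "OK"
    perfdata = f"time={elapsed:.3f}s;{warn_s};{crit_s}"
    return code, f"QUERY {name} - took {elapsed:.3f}s | {perfdata}"
```

Keeping the timing separate from the query itself means the same wrapper can be pointed at a simple lookup and at a filter-heavy query, each with its own thresholds.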

Others are querying API endpoints for replies of what the services believe are their current health states.
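As a rough sketch of that kind of check, assume a service exposes a URL that returns JSON with a `status` field. Both the JSON shape and the status words ("ok", "degraded", "down") are assumptions for illustration; a real service will have its own vocabulary.

```python
import json
import urllib.request

# Hypothetical mapping from a service's self-reported status to a Nagios exit code.
STATUS_TO_EXIT = {"ok": 0, "degraded": 1, "down": 2}


def interpret_health(body):
    """Turn a JSON health reply into (exit_code, output_line)."""
    try:
        status = json.loads(body)["status"]
    except (ValueError, KeyError, TypeError):
        return 3, "HEALTH UNKNOWN - reply was not the expected JSON"
    code = STATUS_TO_EXIT.get(str(status).lower(), 3)
    name = ("OK", "WARNING", "CRITICAL", "UNKNOWN")[code]
    return code, f"HEALTH {name} - service reports '{status}'"


def check_health(url, timeout=10):
    """Fetch the health endpoint and interpret its reply."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return interpret_health(response.read().decode("utf-8"))
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return 2, f"HEALTH CRITICAL - {exc}"
```

Anything the check does not recognize maps to UNKNOWN rather than OK, so an unexpected reply becomes a prompt to go learn (and document) what the service was trying to say.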

Some queries are pulling the status of database indexing.

In cases where the application is exposing information through Java beans, my plugins are pulling numbers from JMX and checking for values within established expectations.

In other cases, plugins are checking for the existence of files that are supposed to be regularly updated and when certain records are updated in the database.
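The file half of that idea can be sketched as a freshness check on the modification time. The path and the thresholds are whatever fits the job that is supposed to update the file; nothing here is specific to the application described.

```python
import os
import time


def check_file_age(path, warn_s, crit_s, now=None):
    """Return (exit_code, output_line) based on seconds since last modification."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing file is treated the same as a badly stale one.
        return 2, f"FILE CRITICAL - {path} is missing or unreadable"
    if age >= crit_s:
        code, name = 2, "CRITICAL"
    elif age >= warn_s:
        code, name = 1, "WARNING"
    else:
        code, name = 0, "OK"
    perfdata = f"age={age:.0f}s;{warn_s};{crit_s}"
    return code, f"FILE {name} - {path} last updated {age:.0f}s ago | {perfdata}"
```

The optional `now` argument is only there to make the function easy to exercise without waiting for a file to age.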

Each of these plugins, once finished and deployed, is documented so that new hires can easily find a list of how the checks work and gather indirect information on some aspects of the in-house application's operation without programmer-level institutional knowledge.

In the case of my new position, I've gained a higher respect for the value of meta-applications in gaining insight into how a complicated system works. Having information written out or explained to you is enlightening, and I never feel that documenting how something works is a waste of time. But until you find yourself acting on that knowledge, I'm not sure you really understand the subject. Creating support applications that meaningfully interact with the system pushes knowledge into the realm of wisdom, the way reading about the science of flight comes alive after building your first remote control plane.

When confronted with the task of comprehending the colossal, try learning about the limited first with applications that monitor and interact with small aspects of the system. Not only will others benefit from the support applications, but you'll benefit from the mental exercise and in the end have a better model of how everything works!

Saturday, April 29, 2017

Learning By Creating Support Applications