Friday, March 2, 2018

On The Importance of Planning A Program

I'm not a professional programmer.

I'm not sure I could even qualify as a junior programmer.

What I have been doing is programming at a level that is above basic scripting, but below creating full applications. I've been churning out command line utilities for system activities (status checking and manipulating my employer's proprietary system, mostly, along with a bevy of Nagios plugins) with the occasional dabbling into more advanced capabilities to slowly stretch what I can accomplish with my utilities.

That said, I've been trying to reflect on my applications after they've been deemed "good enough" to be useful. In a way, I try running a self-post-mortem in hopes of figuring out what I think works well and what can be improved.

I was recently in a position where I had to create a utility and then, months later, got permission to rewrite it. That gave me a rare opportunity: take an application with a specific set of expectations for its output and refactor its workflow, in hopes of improving both its performance and the information it gathered along the way.

For reference, the 10,000-foot view is this: we have a large database, and we wanted to dump its contents, using an intermediate service that exposes REST API endpoints, saving each record as a text file that can be stored and later uploaded into another database. A vendor-neutral backup, if you will...all you need is an interpreter familiar with the text file format and you could feed the contents back into another service or archive the files offsite.

It seems like a small job. You have a database. You have an API. The utility gets a set of record names, then iterates over them, pulling each record and saving it to disk.

Only...things are never that simple.

First, there are a lot of records. I realize "a lot" is relative, so I'll just say it's in the nine-digit range. If that's not a lot of records to you, then...good on you. But when you reach that many files, most filesystems will begin to choke, so I think that qualifies as "a lot."

That means I have to break the files up into subdirectories, especially if the utility gets interrupted and needs to restart; otherwise filesystem lookups would kill performance. Fortunately there's a kind of built-in encoding in the record name that can be translated, so I can break the files down into a sane system of self-organizing subdirectories.
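I obviously can't show the real encoding, but as a minimal sketch of the idea in Go (the two-levels-of-two-characters scheme below is just a placeholder, not the actual format):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// subdirFor derives a nested subdirectory from a record name so the backup
// files spread across the filesystem instead of piling into one directory.
// The real encoding is proprietary; this sketch just assumes the first four
// characters of the name split cleanly into two directory levels.
func subdirFor(recordName string) (string, error) {
	if len(recordName) < 4 {
		return "", fmt.Errorf("record name %q is too short to decode", recordName)
	}
	return filepath.Join(recordName[:2], recordName[2:4]), nil
}

func main() {
	dir, err := subdirFor("AB12CDEF")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(dir) // e.g. "AB/12"
}
```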

Great! Straightforward workflow. Get the record names. Iterate to get the record contents. Decode the record name to get a proper subdirectory. Check if it exists. If not, save it.

Oh, there are some records that are a kind of cache...they are referred to for a few days, then drop out of the database. No need to save them.

Not a problem, just add a small step. Get the record names. Iterate to get the record contents. Check whether it's a record we're supposed to archive. If it is, decode the record name to get the proper subdirectory. Check if the file already exists. If not, save it.
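Pulled together, and with hypothetical stand-ins for the proprietary pieces (isCacheRecord, fetchRecord, and the directory encoding are made up for illustration), the loop I had in mind looks roughly like this:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"strings"
)

// isCacheRecord and fetchRecord are hypothetical stand-ins for the real
// proprietary calls: one flags the short-lived cache records, the other
// pulls a record's contents through the REST service.
func isCacheRecord(name string) bool { return strings.HasPrefix(name, "TMP") }

func fetchRecord(name string) ([]byte, error) {
	return []byte("record body for " + name), nil
}

// archiveRecords walks the list of record names and saves each record as a
// text file under outputRoot, skipping cache records and anything already
// saved by a previous (possibly interrupted) run.
func archiveRecords(outputRoot string, recordNames []string) {
	for _, name := range recordNames {
		if isCacheRecord(name) {
			continue // transient cache entries aren't worth archiving
		}
		if len(name) < 4 {
			log.Printf("could not decode %q: name too short", name)
			continue
		}
		// Same hypothetical two-level encoding as the earlier sketch.
		path := filepath.Join(outputRoot, name[:2], name[2:4], name+".txt")
		if _, err := os.Stat(path); err == nil {
			continue // already saved on an earlier run
		}
		contents, err := fetchRecord(name)
		if err != nil {
			log.Printf("fetch failed for %q: %v", name, err)
			continue
		}
		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
			log.Printf("mkdir %q: %v", filepath.Dir(path), err)
			continue
		}
		if err := os.WriteFile(path, contents, 0o644); err != nil {
			log.Printf("write %q: %v", path, err)
		}
	}
}

func main() {
	archiveRecords("backup", []string{"AB12CDEF", "TMP00001", "CD34GHIJ"})
}
```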

During testing, I discover there are record names whose records cannot be pulled. The database will give me a record name, but when I try to pull the record, nothing comes back. That's odd, so I add a tally of these odd names and insert a check for non-200 responses from the API calls.
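The check itself is nothing exotic; as a hedged sketch, assuming a simple GET per record against the intermediate service (the endpoint path and the missingRecords tally are made up for illustration):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// missingRecords tallies names the database knows about but that come back
// empty-handed from the API.
var missingRecords int

// fetchRecord pulls one record through the intermediate REST service and
// treats anything other than a 200 as a miss to tally rather than a fatal
// error. The endpoint path is a placeholder; the real API is proprietary.
func fetchRecord(baseURL, name string) ([]byte, error) {
	resp, err := http.Get(baseURL + "/records/" + name)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		missingRecords++
		return nil, fmt.Errorf("record %q: unexpected status %d", name, resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	if _, err := fetchRecord("http://backup-api.internal:8080", "AB12CDEF"); err != nil {
		fmt.Println(err)
	}
	fmt.Println("records that could not be pulled:", missingRecords)
}
```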

Then there are records that I can't readily decode. Their names are too short and are missing parts needed for the decoding process. At first I write them off as something to tally as odd records in the logs, but I discover that when I try pulling them, the API call returns an actual record. I take this to the person with institutional knowledge of the database contents, and after examining a sample of the records, they say it looks like the records are from an early time in the company's history.

Basically, there's a set of specs that current records should follow, but there are records from days of yore that are valid but don't follow the current specs.

So there are records that should be backed up...but don't fit the workflow, which has functions that check each record's validity through a few tests before going through the steps of making network calls and adding load to the servers acting as intermediaries for the transfer. To fix this, I insert a new pathway for processing those "odd" records when they're encountered: they get queried and translated and, if they turn out to be full records, saved to an alternative location. The backups are now split into the set of "spec" records and another "alternative" set.
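Roughly, the routing decision ends up looking like the sketch below; decodeSpec, fetchRecord, saveTo, and the "alternative" directory name are placeholders for the real pieces, not the actual layout:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// decodeSpec fails on names that don't follow the current naming spec (the
// old, days-of-yore records); fetchRecord pulls the record body over the API.
// Both are hypothetical stand-ins for the real calls.
func decodeSpec(name string) (string, error) {
	if len(name) < 4 {
		return "", fmt.Errorf("%q does not follow the current naming spec", name)
	}
	return filepath.Join(name[:2], name[2:4]), nil
}

func fetchRecord(name string) ([]byte, error) {
	return []byte("record body for " + name), nil
}

func saveTo(path string, contents []byte) error {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	return os.WriteFile(path, contents, 0o644)
}

// processRecord routes spec-conformant records into the normal tree and
// off-spec records into an alternative location, instead of discarding
// anything that fails the decode.
func processRecord(outputRoot, name string) error {
	dir, err := decodeSpec(name)
	if err != nil {
		// Off-spec name: still try to pull it, and keep whatever comes back.
		contents, ferr := fetchRecord(name)
		if ferr != nil {
			return fmt.Errorf("off-spec record %q could not be pulled: %w", name, ferr)
		}
		return saveTo(filepath.Join(outputRoot, "alternative", name+".txt"), contents)
	}
	contents, err := fetchRecord(name)
	if err != nil {
		return err
	}
	return saveTo(filepath.Join(outputRoot, "spec", dir, name+".txt"), contents)
}

func main() {
	for _, name := range []string{"AB12CDEF", "XY"} {
		if err := processRecord("backup", name); err != nil {
			fmt.Println(err)
		}
	}
}
```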

The problem is that this organic change cascades into a number of other parts of the utility. My tally counts for statistics are thrown off. The running list of queued records to process has to take into account records flowing into the alternative path. And error logging, which also handled some tallying duties since it marked the end of the line for some records, wasn't always recording actual errors; sometimes an entry was just a notification that something had happened along the way, which was helpful for tracing and debugging but a problem when it marked certain stats off before the alternative record had been processed.

That one organic change in the database contents over the company's history had implications that derailed parts of my utility's design, which had only taken the current expected behavior into account.

In the end, I lost several days to debugging and testing as I introduced fixes for these one-offs and variations. What were my takeaways?

It would be simple to say that I should have spent some days sketching out workflows and creating a full spec before trying to write the software. The trouble is that I didn't know the full extent of the hidden variations in the database; institutional knowledge isn't readily available for perusal when it resides in other people's heads, and those people are often too busy to come up with a list of gotchas I could watch out for while making this utility.

What I really needed to do was create a workflow that anticipated nothing going quite right, and made it easy to break down the steps for processing in a way that could elegantly handle unexpected changes in that workflow.

After thinking about this some more, I realized that it was just experience applied to actively trying to modularize the application. The new version did have some noticeable improvements; the biggest involved changing how channels and goroutines were used to process records, which dramatically cut the number of open network sockets and thus reduced the load on the load balancers and servers. Another was changing the way the queue of tasks was handled; it was far simpler to add or subtract worker routines in this version than in the previous iteration.
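The shape I'm describing is more or less the standard worker-pool pattern with channels and a WaitGroup; here's a minimal sketch, with a made-up processRecord standing in for the real fetch-and-save step:

```go
package main

import (
	"fmt"
	"sync"
)

// processRecord stands in for the real fetch-and-save step.
func processRecord(name string) {
	fmt.Println("processed", name)
}

func main() {
	const numWorkers = 8 // caps concurrent API calls, and therefore open sockets
	names := make(chan string)
	var wg sync.WaitGroup

	// A fixed pool of workers drains the channel of record names.
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range names {
				processRecord(name)
			}
		}()
	}

	// Feed the queue; in the real utility this list is in the nine-digit range.
	for _, name := range []string{"AB12CDEF", "CD34GHIJ", "EF56KLMN"} {
		names <- name
	}
	close(names)
	wg.Wait()
}
```

Capping the pool at numWorkers caps the number of in-flight API calls (and open sockets), and scaling the pool up or down is just a matter of changing that one constant, which is roughly what made adding or subtracting worker routines so much easier in the rewrite.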

I'd also learned more about how to break tasks down into functions and disentangle what each did, which simplified tracing and debugging. Granted, there are places where this could still be improved. But the curveballs that came up as I found exceptions to the system's expected output mostly just ate time as I reworked the workflow; they weren't showstoppers.

I think I could have definitely benefited from creating a spec that broke tasks down and figured out the workflow a bit better, along with considering "what-ifs" when things would go off-spec. But the experience I've been growing in my time making other utilities and mini-applications still imparted improvements. Maybe they're small steps forward, but steps forward are steps forward.
