Monday, March 12, 2018

Golang: Is The Mutex Locked, And Finding The Line Number That Did It

Quick summary of the situation, giving enough details to highlight the problem but not giving proprietary information away...

I have a program that queries a service which in turn talks to a database. The database holds records identified by unique rowkeys. I want to read all of the records the database knows about, and I can get the rowkeys through an API call that iterates over them in discrete steps.

The utility I created pulls a batch of these keys, then iterates over them one by one to determine if I want to make a call to the service to pull the whole record (I don't need to if the key was already pulled or analyzed on a previous run of the program.)

Seems relatively simple, but this is a big database and I'm going to be running this for a long time. Also, these servers are in the same network, so the connections are pretty fast and furious...if I overestimate some capacity, I'm going to really hammer the servers and the last thing I want to do is create an internal DDoS.

To that end, this utility keeps a number of running stats using structs that are updated by the various goroutines. To keep things synced up, I use an embedded lock on the structs.

(Yeah, that's a neat feature...works like this:)

type stctMyStruct struct {
    sync.Mutex
    intCounter int
}

After that, it's a simple matter of instantiating a struct and using it.

var strctMyStruct stctMyStruct

strctMyStruct.Lock()
strctMyStruct.intCounter = strctMyStruct.intCounter + 1
strctMyStruct.Unlock()

Because the utility is long-running and I wanted to keep tabs on different aspects of performance, I had several structs with embedded mutexes being updated by various goroutines. Using timers in separate routines, the stats were aggregated and turned into a string of text that could be printed to the console, redirected to a file or sent to a web page (I wanted a lot of options for monitoring, obviously.)
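As a rough sketch of that arrangement (the struct, the counters, and the two-second interval here are placeholders, not the real stats code), a ticker goroutine can lock each stats struct just long enough to copy out its counters, then do the formatting and printing outside the lock:

package main

import (
    "fmt"
    "sync"
    "time"
)

type stctStats struct {
    sync.Mutex
    intProcessed int
    intSkipped   int
}

var strctStats stctStats

// reportStats wakes up on a timer, copies the counters while holding the
// lock, and does the formatting/printing after the lock is released.
func reportStats(durInterval time.Duration) {
    tckr := time.NewTicker(durInterval)
    defer tckr.Stop()
    for range tckr.C {
        strctStats.Lock()
        intProcessed := strctStats.intProcessed
        intSkipped := strctStats.intSkipped
        strctStats.Unlock()
        fmt.Printf("processed=%d skipped=%d\n", intProcessed, intSkipped)
    }
}

func main() {
    go reportStats(2 * time.Second)
    for i := 0; i < 10; i++ {
        strctStats.Lock()
        strctStats.intProcessed++
        strctStats.Unlock()
        time.Sleep(500 * time.Millisecond)
    }
}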

At some point I introduced a new bug in the program. My local system was relatively slow when it came to processing the keys (it's not just iterating over them...it evaluates them, sorts some things, picks through information in the full record...), and when I transferred the utility to the internal network, the jump in speed quickly exposed a timing issue. The program output...and processing...and the web page displaying the status of the utility...all froze. But the program was still running, according to process monitoring.

I first thought it was a locking problem...something is taking a lock and never releasing it. But how can I tell if a routine is blocked by a locked struct? Golang does not have a call that will tell you if a mutex is locked, because that would itself invite a race condition: in the time it takes to make the call and get the reply, that lock could have changed status.

Okay...polling the state of mutexes is out of the question. But what isn't out of the question is tracking when a request for a lock is granted.

First I changed the struct so the mutex was a named field rather than an embedded one, which let me wrap it with my own locking methods.

type stctMyStruct struct {
    lock       sync.Mutex
    intCounter int
}

Next I created some methods for the struct to handle the locking and unlocking.

func (strctMyStruct *stctMyStruct) LockIt(intLine int) {
    // Announce that this caller is about to ask for the lock.
    chnLocksTracking <- "Requesting lock to strctMyStruct by line " + strconv.Itoa(intLine)

    tmElapsed := time.Now()
    strctMyStruct.lock.Lock()

    // The deferred function runs as LockIt returns, i.e. once the lock has
    // been acquired, and reports how long the caller had to wait for it.
    defer func() {
        chnLocksTracking <- "Lock granted to strctMyStruct by line " + strconv.Itoa(intLine) + " in " + time.Since(tmElapsed).String()
    }()
}

func (strctMyStruct *stctMyStruct) UnlockIt(intLine int) {
    // Announce that this caller is about to release the lock.
    chnLocksTracking <- "Requesting unlock to strctMyStruct by line " + strconv.Itoa(intLine)

    tmElapsed := time.Now()
    strctMyStruct.lock.Unlock()

    defer func() {
        chnLocksTracking <- "Unlock granted to strctMyStruct by line " + strconv.Itoa(intLine) + " in " + time.Since(tmElapsed).String()
    }()
}

LockIt() and UnlockIt() methods are now attached to stctMyStruct. When called, each method first sends a string into a channel, with a dedicated goroutine on the other end digesting and logging the messages; that first message acts as a notification that the caller is about to ask for a change in the mutex.

If the struct is locked, the operation will block. Once it is available, the function returns, and in the process runs the defer function which sends the granted message down the channel along with the elapsed time to get the request granted.
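For completeness, a minimal sketch of that dedicated goroutine on the other end of chnLocksTracking might look like the following; the buffer size and the use of the standard log package are my assumptions here, not the original code:

// Buffered so the instrumented methods don't block on the tracker itself
// under normal load (the size here is a guess).
var chnLocksTracking = make(chan string, 1024)

// trackLocks drains the tracking channel and timestamps each message via the
// standard log package; the real utility could just as easily write to a file
// or feed the status web page. Start it once, early on: go trackLocks()
func trackLocks() {
    for strMsg := range chnLocksTracking {
        log.Println(strMsg)
    }
}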

How does it know about the line number?

There's actually a library function that can help with that; my only problem is that it returns more information than I need, which makes it a little unwieldy to use directly. To get around that, I created a small wrapper function.

func GetLine() int {
    _, _, intLine, _ := runtime.Caller(1)
    return intLine
}

If you look at the documentation you can get the specifics of what is returned, but Caller() can unwind the stack from the call site by the number of frames you pass as an argument and return the program counter, file name, and line number...in my particular case I'm using one source file, so I only needed the line number.

Using this, you can insert function calls to lock and unlock the structs as needed. I added the methods to each struct that had a Mutex or RWMutex. Using them is as simple as:

strctMyStruct.LockIt(GetLine())
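The matching UnlockIt() call works the same way, so a typical critical section ends up looking something like this:

strctMyStruct.LockIt(GetLine())
strctMyStruct.intCounter = strctMyStruct.intCounter + 1
strctMyStruct.UnlockIt(GetLine())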

This solution provided a way to trace what was happening, but there is a performance cost. The deferred function, the channel sends, and the string building all add a little overhead each time a lock is taken, and I use a lot of locks throughout the program, so it added up to a noticeable performance hit. Using this technique is good for debugging, but you have to decide whether you want to incur the overhead or find a way to compensate for it.

So what was my lock issue?

I set the goroutine monitoring the locks to dump its information to a file and traced the lock requests against the granted mutex changes. There was a lock-ordering problem in a function that summarized aggregated information: a lock near the beginning of the summary was granted, and while pulling other information, the function requested a second lock. That second lock was on a struct held by another goroutine, which was itself waiting for the lock the summarize function had taken at the start.

It was circular resource contention. Function A held a resource that Function B wanted, and Function B held a resource that Function A wanted. The solution was to add more granular locking, which added more calls but in the end meant (hopefully) there would be only one struct locked at a time within a given function.
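Boiled down to a toy example (these structs and functions are stand-ins, not the actual summarize code), the pattern looked like this:

package main

import "sync"

type stctStatsA struct {
    sync.Mutex
    intCounter int
}

type stctStatsB struct {
    sync.Mutex
    intCounter int
}

var strctA stctStatsA
var strctB stctStatsB

func summarize(wg *sync.WaitGroup) {
    defer wg.Done()
    strctA.Lock() // takes A first...
    defer strctA.Unlock()
    strctB.Lock() // ...then waits for B
    defer strctB.Unlock()
    _ = strctA.intCounter + strctB.intCounter
}

func update(wg *sync.WaitGroup) {
    defer wg.Done()
    strctB.Lock() // takes B first...
    defer strctB.Unlock()
    strctA.Lock() // ...then waits for A: circular wait, both goroutines stall
    defer strctA.Unlock()
    strctB.intCounter = strctA.intCounter
}

func main() {
    var wg sync.WaitGroup
    wg.Add(2)
    go summarize(&wg)
    go update(&wg)
    wg.Wait() // if the two goroutines interleave after their first Lock, this never returns
}

In this stripped-down version every goroutine ends up blocked, so the runtime does report "all goroutines are asleep - deadlock!"; in the real utility other goroutines kept running, which is why it just looked like a hang.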

Lesson learned: when using locks, keep the critical sections as tight and granular as possible, and avoid holding overlapping locks as much as you can, or you may end up with a deadlock that Go's runtime won't detect (it only reports a deadlock when every goroutine is blocked, and here the rest of the program kept running).

Friday, March 2, 2018

On The Importance of Planning A Program

I'm not a professional programmer.

I'm not sure I could even qualify as a junior programmer.

What I have been doing is programming at a level that is above basic scripting, but below creating full applications. I've been churning out command line utilities for system activities (status checking and manipulating my employer's proprietary system, mostly, along with a bevy of Nagios plugins) with the occasional dabbling into more advanced capabilities to slowly stretch what I can accomplish with my utilities.

That said, I've been trying to reflect on my applications after they've been deemed "good enough" to be useful. In a way, I try running a self-post-mortem in hopes of figuring out what I think works well and what can be improved.

I was recently in a position where I had to create a utility and then, months later, got permission to rewrite it. That gave me a unique opportunity to take an application with a specific set of expectations for its output and refactor its workflow in hopes of improving both its performance and the information it gathered in the process.

For reference, the 10,000-foot view is that we have a large database, and we wanted to dump its contents through an intermediate service providing REST API endpoints, saving each record as a text file that can be stored and later loaded into another database. A vendor-neutral backup, if you will...all you need is an interpreter familiar with the text file format and you could feed the contents back into another service or archive the files offsite.

It seems like this would be a simple job. You have a database. You have an API. The utility gets a set of record keys, then iterates over them and pulls the records to save to disk.

Only...things are never that simple.

First, there are a lot of records. I realize "a lot" is relative, so I'll just say the count is in the nine-digit range. If that's not a lot of records to you, then...good on you. But when you reach that many files, most filesystems will begin to choke, so I think it qualifies as "a lot."

That means I have to break up the files into subdirectories, especially if the utility gets interrupted and needs to restart. Otherwise filesystem lookups would kill performance. Fortunately there's a kind of built-in encoding to the record name that can be translated so I can break it down into a sane system of self-organizing subdirectories.

Great! Straightforward workflow. Get the record names. Iterate to get the record contents. Decode the record name to get a proper subdirectory. Check if it exists. If not, save it.

Oh, there are some records that are a kind of cache...they are referred to for a few days, then drop out of the database. No need to save them.

Not a problem, just add a small step. Get the record names. Iterate to get the record contents. Check if it's a record we're not supposed to archive. If it's one we should keep, decode the record name to get a proper subdirectory. Check if it exists. If not, save it.
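As a sketch of that per-key step (the decode rules, the cache test, and the fetch call are proprietary, so every helper below is a hypothetical stand-in rather than the real implementation):

package main

import (
    "fmt"
    "io/ioutil"
    "os"
    "path/filepath"
    "strings"
)

// The three helpers below are hypothetical stand-ins for proprietary logic.

func isCacheRecord(strKey string) bool { return strings.HasPrefix(strKey, "cache-") }

func decodeToSubdir(strKey string) string {
    // Pretend the first two characters of the key make a usable shard directory.
    return filepath.Join("backup", strKey[:2])
}

func fetchRecord(strKey string) ([]byte, error) {
    // The real version calls the REST service; here we fake a record body.
    return []byte("record body for " + strKey), nil
}

// processKey sketches the per-record workflow described above. The existence
// check comes before the fetch so we don't make service calls we don't need.
func processKey(strKey string) error {
    if isCacheRecord(strKey) {
        return nil // short-lived cache records are not archived
    }
    strDir := decodeToSubdir(strKey)
    strPath := filepath.Join(strDir, strKey+".txt")
    if _, err := os.Stat(strPath); err == nil {
        return nil // already saved on a previous run
    }
    bytRecord, err := fetchRecord(strKey)
    if err != nil {
        return err
    }
    if err := os.MkdirAll(strDir, 0755); err != nil {
        return err
    }
    return ioutil.WriteFile(strPath, bytRecord, 0644)
}

func main() {
    fmt.Println(processKey("ab12345"))
}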

During testing, I discover there are records whose contents cannot be pulled. The database will give me a record name, but when I try to pull the record, nothing comes back. That's odd, but I add a tally of these odd names, and a check is inserted for non-200 responses from the API calls.

Then there are records that I can't readily decode. They're too short and end up missing parts needed for the decoding process. At first I write them off as something I have to tally as an odd record in the logs, but I discover that when I try pulling them, the API call returns an actual record. I take this to the person who has institutional knowledge of the database contents, and after examining the sample of records, they say it looks like the records are from an early time in the company's history.

Basically, there's a set of specs that current records should follow, but there are records from days of yore that are valid but don't follow the current specs.

So there are records that should be backed up...but they don't follow the workflow, which has functions that check record validity through a few tests before going through the steps of making network calls and adding load to the servers acting as intermediaries for the transfer. To fix this, I insert a new pathway for processing those "odd" records when they're encountered, so they still get queried and translated and, if they turn out to be full records, saved to an alternative location. The backups are now separated into the set of "spec" records and another "alternative" path.

The problem is that this organic change cascades into a number of other parts of the utility. My tally counts for statistics are thrown off. The running list of queued records has to take into account records flowing into this alternative path. And the error logging, which also handled some tallying duties since it was the end of the line for some of the records being processed, wasn't always recording actual errors; sometimes an entry was just a notification that something had happened during processing, which was helpful for tracing and debugging but a problem when it marked certain stats off before the alternative record was processed.

That one organic change in the database contents over the company's history had implications that derailed parts of my utility's design, which had taken into account only the current expected behavior.

In the end, I lost several days of debugging and testing when I introduced fixes that took into account these one-offs and variations. What were my takeaways?

It would be simple to say that I should have spent some days just sketching out workflows and creating a full spec before trying to write the software. The trouble is that I didn't know the full extent of the hidden variations in the database; the institutional knowledge wasn't readily available for perusal when it resides in other people's heads, and those people are often too busy to come up with a list of gotchas I could watch out for while making this utility.

What I really needed to do was create a workflow that anticipated nothing going quite right, and made it easy to break down the steps for processing in a way that could elegantly handle unexpected changes in that workflow.

After thinking about this some more, I realized that it was just experience applied to actively trying to modularize the application. The new version did have some noticeable improvements; the biggest involved changing how channels and goroutines were used to process records, in a way that dramatically cut the number of open network sockets and thus reduced the load on the load balancers and servers. Another was changing the way the queue of tasks was handled; as far as the program was concerned, it was far simpler to add or subtract worker routines in this version than in the previous iteration.
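A minimal sketch of that worker-pool arrangement (the worker count and the per-key "work" are placeholders, not the production code) looks something like this:

package main

import (
    "fmt"
    "sync"
)

// A fixed pool of workers pulling keys from one channel: scaling up or down
// is just a matter of changing intWorkers, and only the workers hold network
// connections open.
func main() {
    const intWorkers = 4 // placeholder; tune to what the servers can tolerate

    chnKeys := make(chan string)
    var wg sync.WaitGroup

    for i := 0; i < intWorkers; i++ {
        wg.Add(1)
        go func(intID int) {
            defer wg.Done()
            for strKey := range chnKeys {
                // The real utility would pull and save the record here.
                fmt.Printf("worker %d processed %s\n", intID, strKey)
            }
        }(i)
    }

    for i := 0; i < 10; i++ {
        chnKeys <- fmt.Sprintf("key-%04d", i)
    }
    close(chnKeys)
    wg.Wait()
}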

I'd also learned more about how to break down tasks into functions and disentangle what each did, which simplified tracing and debugging. Granted, there are places where this could still have been improved. But the curveballs introduced as I found exceptions to the expected output from the system, for the most part, just ate time as I reworked the workflow and weren't showstoppers.

I think I could definitely have benefited from creating a spec that broke tasks down and figured out the workflow a bit better, along with considering "what-ifs" for when things go off-spec. But the experience I've accumulated making other utilities and mini-applications still imparted improvements. Maybe they're small steps forward, but steps forward are steps forward.