Sysadmin Still Surviving: Golang: Is The Mutex Is Locked, And Finding The Line Number That Did It

Quick summary of the situation, giving enough details to highlight the problem but not giving proprietary information away...

I have a program that queries a service which in turn talks to a database. The database holds records identified by unique rowkeys. I want to read all of the records, as far as the database knows of their existence, which I can get through an API call, iterating in discrete steps.

The utility I created pulls a batch of these keys, then iterates over them one by one to determine if I want to make a call to the service to pull the whole record (I don't need to if the key was already called before or previously analyzed on a previous run of the program.)

Seems relatively simple, but this is a big database and I'm going to be running this for a long time. Also, these servers are in the same network, so the connections are pretty fast and furious...if I overestimate some capacity, I'm going to really hammer the servers and the last thing I want to do is create an internal DDoS.

To that end, this utility keeps a number of running stats using structs that are updated by the various goroutines. To keep things synced up, I use an embedded lock on the structs.

(Yeah, that's a neat feature...works like this:)

type stctMyStruct struct {
sync.Mutex
intCounter int
}

After that, it's a simple matter of instantiating a struct and using it.

var strctMyStruct stctMyStruct

strctMyStruct.Lock()
strctMyStruct.intCounter = strctMyCounter.intCounter + 1
strctMyStruct.Unlock()

Because the utility is long-running and I wanted to keep tabs on different aspects of performance, I had several structs with embedded mutexes being updated by various goroutines. Using timers in separate routines, the stats were aggregated and turned into a string of text that could be printed to the console, redirected to a file or sent to a web page (I wanted a lot of options for monitoring, obviously.)

At some point I introduced a new bug in the program. My local system was relatively slow when it came to processing the keys (it's not just iterating over them...it evaluates them, sorts some things, picks through information in the full record...) and when I transferred it to the internal network, the jump in speed accelerated exposure of a timing issue. The program output...and processing...and web page displaying the status of the utility...all froze. But the program was still running, according to process monitoring.

I first thought it was a race condition...something is getting locked and not releasing it. But how can I tell if a routine is blocked by a locked struct? Golang does not have a call that will tell you if a mutex is locked, because that would lead to a race condition. In the time it takes to make the call and get the reply, that lock could have changed status.

Okay...polling the state of mutexes is out of the question. But what isn't out of the question is tracking when a request for a lock is granted.

First I changed the struct to have an addressable member for setting the state of the lock.

type stctMyStruct struct {
lock sync.Mutex
intCounter int
}

Next I created some methods for the struct to handle the locking and unlocking.

func (strctMyStruct *stctMyStruct) LockIt(intLine int) {

chnLocksTracking <- "Requesting lock to strctMyStruct by line " +strconv.Itoa(intLine)

tmElapsed := time.Now()
strctMyStruct.lock.Lock()

defer func() {

chnLocksTracking <- "Lock granted to strctMyStruct by line " + strconv.Itoa(intLine) + " in " + time.Since(tmElapsed).String()
}()

return
}

func (strctMyStruct *stctMyStruct) UnlockIt(intLine int) {

chnLocksTracking <- "Requesting unlock to strctMyStruct by line " +strconv.Itoa(intLine)

tmElapsed := time.Now()
strctMyStruct.lock.Unlock()

defer func() {

chnLocksTracking <- "Unlock granted to strctMyStruct by line " + strconv.Itoa(intLine) + " in " + time.Since(tmElapsed).String()
}()

return
}

LockIt() and UnlockIt() methods are now added to instances of stctMyStruct. When called, the function first sends a string into a channel with a dedicated goroutine on the other end digesting and logging messages; the first acts as a notification that the caller is "going to ask for a change in the mutex."

If the struct is locked, the operation will block. Once it is available, the function returns, and in the process runs the defer function which sends the granted message down the channel along with the elapsed time to get the request granted.

How does it know about the line number?

There's actually a library function that can help with that; my problem is that it returns too much information to not be a little unwieldy. To get around that, I created a small wrapper function.

func GetLine() int {
_,_,intLine, _ := runtime.Caller(1)
return intLine
}

If you look at the documentation you can get the specifics of what is returned, but Caller() can unwind the stack a from a call by the number of steps you use as an argument and return the line number, package/module, etc...in my particular case I'm using one source file so I only needed the line number.

Using this, you can insert function calls to lock and unlock the structs as needed. I added the methods to each struct that had a mutex or rwmutex. Using them is as simple as:

strctMyStruct.LockIt(GetLine())

This solution provided a way to trace what was happening, but there is a performance cost. Defer() adds a few fractions of a second each time it's called and I used a lot of locks throughout the program which added up to a significant performance hit. Using this technique is good for debugging, but you have to decide if you want to incur the overhead or find a way to compensate for it.

So what was my lock issue?

I set the goroutine monitoring the locks to dump information to a file and traced the requests vs. granted mutex changes. There was a race condition in a function I used that summarized aggregated information; A lock near the beginning of the summary was granted, and while pulling other information, it requested another lock. The second one was an operation on a struct that was held by a process waiting to get a lock on what was being held by the beginning of the summarize function.

It was a circular resource contention. Function A held a resource that Function B wanted, and Function B had a resource function A wanted. The solution was to add more granular locking, which added more calls but in the end meant (hopefully) there would be only one struct locked at a time within a given function.

Lesson learned: when using locks, keep the calls as tight and granular as possible, and avoid overlapping locks as much as possible or you may end up with a deadlock that Go's runtime wouldn't detect!

Sysadmin Still Surviving

Monday, March 12, 2018

Golang: Is The Mutex Is Locked, And Finding The Line Number That Did It

No comments:

Post a Comment