Thursday, July 20, 2017

Golang: HTTP Client Opens Too Many Sockets ("Too many open files")

This relates to a work project, so I have to fuzz some of the details. On the other hand, some details get fuzzed on their own because I'm working from memory, and my memory is naturally fuzzy...

I'm working on a utility that is, on the surface, simple. It makes a call to an API endpoint using the http.Client, compares some quick results, and if certain conditions are met it makes a series of API calls to save the JSON responses.

The processing/checking process is carried out in a set of goroutines because the comparisons are easy to do in parallel. If the routine needs to pull a JSON reply, it calls a function that is laid out pretty much like the standard examples from the "here's how you get a web page in Go!" sites.


func GetReply(strAPI string, strServer string) string {

    // The URL to request
    strURL := strServer + "/service/" + strAPI

    // Also add timeouts for connections
    tr := &http.Transport{
        Dial: (&net.Dialer{
            Timeout: 5 * time.Second,
        }).Dial,
        TLSHandshakeTimeout: 5 * time.Second,
    }
    client := &http.Client{
        Transport: tr,
        Timeout:   time.Second * 10,
    }

    // Turn it into a request
    req, err := http.NewRequest("GET", strURL, nil)
    if err != nil {
        fmt.Println("\nError forming request: " + err.Error())
        return ""
    }
    req.Header.Set("Content-Type", "application/json")
    req.Header.Set("Accept", "application/json")

    // Get the URL
    res, err := client.Do(req)
    if err != nil {
        fmt.Println("\nError performing request: " + err.Error())
        if res != nil {
            res.Body.Close()
        }
        return ""
    }

    // What was the response status from the server?
    if res.StatusCode != 200 {
        fmt.Println("\nError, unexpected status code: " + res.Status)
        res.Body.Close()
        return ""
    }

    // Read the reply
    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        fmt.Println("\nError reading response body: " + err.Error())
        res.Body.Close()
        return ""
    }
    res.Body.Close()

    // Cut down on calls to convert this
    strResult := string(body)

    // Done
    return strResult

}

This is actually a modified version of what I pulled from various tutorials and examples, adding more calls to Close() and checking whether res is nil before making that call in an error path.

I also added a timeout to the client, because by default it is 0, meaning no timeout at all. As you can probably guess, this version was modified while troubleshooting.

After a few hours of the application running, we had alerts come in about failing functions on the production servers. When I opened the logs I discovered a number of "too many open files" errors, and a developer on the call said there were over 18,000 socket connections on each of the load-balanced servers.

The only difference was my use of this test program, so I killed it. The socket count fell.

Welp...guess we found the cause. But why?

There are a few basics for beginners making requests with http.Client.
1) Close the response body after reading.
2) Clients (and their Transports) are meant to be reused across requests.
3) If you defer the call to Close() (as this function originally did, and as most tutorials show), the body is closed when the function returns. The modified sample I posted simply closes it explicitly after reading the Body and checking for errors. There's a minimal sketch of the usual defer pattern right after this list.
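
The shape most of those tutorials use looks something like this minimal sketch. The fetchBody helper and the example URL are made up for illustration; they aren't from the real utility.

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "time"
)

func fetchBody(client *http.Client, url string) (string, error) {
    res, err := client.Get(url)
    if err != nil {
        // On error the response is nil, so there is nothing to close.
        return "", err
    }
    // Deferring Close() releases the connection back to the pool when
    // the function returns, which is what lets it be re-used.
    defer res.Body.Close()

    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}

func main() {
    client := &http.Client{Timeout: 10 * time.Second}
    body, err := fetchBody(client, "https://example.com/")
    if err != nil {
        fmt.Println("fetch failed:", err)
        return
    }
    fmt.Println(len(body), "bytes read")
}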

At first I thought it was due to response bodies not being closed; they must be closed for the underlying connections to be re-used. I traced the execution path a dozen ways and added more explicit Close()'s in error checks...but those errors never printed anything during the run, so the error paths shouldn't have been leaking sockets.

I added timeouts to the client and dialer. While that didn't hurt and probably made things a little cleaner, it still didn't help the too many open files/sockets error.

Another lead came from a close reading of a Stack Overflow answer. The function creates a new Transport, tr, on every call. The Transport is what holds the pool of idle connections that makes re-use possible. See where I'm going with this?

Another answer on that page talked about creating a single global client for their functions to reuse.

The theme was that variable scope matters for what actually gets re-used. Because I was hitting the same server repeatedly while the function kept re-instantiating the very mechanism that governs connection re-use, the number of new connections and left-open sockets ballooned.
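
In code, that global-client idea looks roughly like the sketch below: the Transport and Client are built once at package scope and shared by every call. The apiClient name and the trimmed-down getReply are my own hypothetical shorthand, not code from that answer.

package apiclient

import (
    "fmt"
    "io/ioutil"
    "net"
    "net/http"
    "time"
)

// Built once, at package scope: the Transport owns the idle-connection
// pool, so every request made through apiClient shares it.
var apiClient = &http.Client{
    Transport: &http.Transport{
        Dial: (&net.Dialer{
            Timeout: 5 * time.Second,
        }).Dial,
        TLSHandshakeTimeout: 5 * time.Second,
    },
    Timeout: 10 * time.Second,
}

func getReply(strAPI string, strServer string) (string, error) {
    res, err := apiClient.Get(strServer + "/service/" + strAPI)
    if err != nil {
        return "", err
    }
    defer res.Body.Close()

    if res.StatusCode != http.StatusOK {
        return "", fmt.Errorf("unexpected status: %s", res.Status)
    }

    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}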

My next move was to go to the goroutines in charge of processing the replies from the API endpoints, have them create the Transport instances, and then pass the Transport as a parameter when they call the function.
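
Roughly, the reworked code had this shape. It's a sketch from memory: the worker and channel names are hypothetical, and the JSON headers and some error handling are omitted. The part that matters is the Transport coming in as a parameter instead of being rebuilt per call.

package fetcher

import (
    "fmt"
    "io/ioutil"
    "net"
    "net/http"
    "time"
)

// Each worker goroutine builds one Transport and hands it to every
// GetReply call it makes, so all of that worker's requests share the
// same pool of connections.
func worker(apis <-chan string, strServer string, results chan<- string) {
    tr := &http.Transport{
        Dial: (&net.Dialer{
            Timeout: 5 * time.Second,
        }).Dial,
        TLSHandshakeTimeout: 5 * time.Second,
    }

    for strAPI := range apis {
        results <- GetReply(tr, strAPI, strServer)
    }
}

// GetReply now takes the Transport as a parameter. The Client itself is
// cheap to construct; the connection pool lives in the Transport.
func GetReply(tr *http.Transport, strAPI string, strServer string) string {
    client := &http.Client{
        Transport: tr,
        Timeout:   time.Second * 10,
    }

    res, err := client.Get(strServer + "/service/" + strAPI)
    if err != nil {
        fmt.Println("\nError performing request: " + err.Error())
        return ""
    }
    defer res.Body.Close()

    if res.StatusCode != 200 {
        fmt.Println("\nError, unexpected status code: " + res.Status)
        return ""
    }

    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        fmt.Println("\nError reading response body: " + err.Error())
        return ""
    }

    return string(body)
}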

I uploaded the program to a remote system instance and re-ran it while watching netstat on the server and the client systems. After initially ballooning to about 4,000 connections, it soon settled down to well under 100 connections (counted with netstat | wc -l).

Takeaways:
1) Don't use the default client as-is; add sane timeouts to the client (and maybe to the transport's dialer).
2) If you're hitting the same server repeatedly, either do it all within the same scope where the Transport is instantiated, or create one Transport and pass it as a parameter to your functions, so the connection pool actually gets re-used.
3) Check that you properly close the response body so the connection can be re-used, and check that the error paths can close it without panicking.

What about hoisting not just the Transport but the Client as well, and passing the Client around as a parameter? I didn't test that, because I wasn't sure how "goroutine-safe" that would be against race conditions, despite the one answer on that Stack Overflow question demonstrating a global Client instance. It's possible it works fine. At this point, though, passing the Transport worked fine.
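
For completeness, that untested variant would look something like this sketch (the workerWithClient name and channel shape are hypothetical; I never ran this version):

package fetcher

import (
    "io/ioutil"
    "net/http"
)

// Untested variant: one Client (with its own Transport) is created up
// front and handed to every worker as a parameter.
func workerWithClient(client *http.Client, apis <-chan string, strServer string, results chan<- string) {
    for strAPI := range apis {
        res, err := client.Get(strServer + "/service/" + strAPI)
        if err != nil {
            results <- ""
            continue
        }
        body, err := ioutil.ReadAll(res.Body)
        res.Body.Close()
        if err != nil {
            results <- ""
            continue
        }
        results <- string(body)
    }
}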

I'll also note that my usual self-loathing and insecurity isn't getting the better of me this time, because the top answer on the question that inspired this solution is the same advice I found repeatedly on other sites and blogs (and in other SO answers): check that you close your response body properly. It's the top answer by a significant margin. It was almost an afterthought to realize that maybe what I was really doing was pummeling one particular server with repeated instantiations of the connection pool, so there was barely any re-use at all.

Happy HTTP Client-ing!
