Tuesday, August 1, 2017

More Tuning Golang Apps for High Concurrency Tasks on Linux

I have a project that is fairly straightforward. Again, it's work related, so I have to fuzz some details, and again, my memory is naturally fuzzy anyway, so I doubt that's an issue.

Background


This program I've been working on makes calls to a service (a REST endpoint) that in turn pulls data from a database. My application parses that information into components and checks the disk to see if the corresponding file already exists. If it doesn't, the program calls the API endpoint again, asking for that specific record's information, and writes it to disk. In the end I get a huge set of files sorted into a structure resembling ./files/year/month/day/datatype/subtype/filename.txt.

There are literally millions of records to sort through. A single thread handling this would probably take weeks. Therefore, the program uses several (configurable!) goroutines to pull records simultaneously.
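The worker pattern itself is nothing exotic. Below is a minimal sketch of what that fan-out might look like; the names Record, numWorkers, listing, and processRecord are illustrations of mine, not the utility's actual identifiers, and processRecord stands in for the parse/check-disk/fetch/write logic described above.

 // Hypothetical worker pool: numWorkers goroutines drain a channel of records.
 var wg sync.WaitGroup
 records := make(chan Record, 1000)

 for i := 0; i < numWorkers; i++ {
  wg.Add(1)
  go func() {
   defer wg.Done()
   for rec := range records {
    processRecord(rec) // parse, check the disk, fetch if missing, write
   }
  }()
 }

 for _, rec := range listing {
  records <- rec // feed work pulled from the REST endpoint's listing
 }
 close(records) // lets the workers drain the channel and exit
 wg.Wait()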

First Problem: Too Many Open Files


I wrote about this fix earlier in the blog, but I'll give a quick recap.

At first everything seemed fine. I have a simple bit of math output periodically to the console, and it was chugging along at around 20,000 records/minute. No errors were showing up. All was right with the world.

Then a few hours later, a few alerts arrived in the email. At this point the utility was running on its own instance with its own storage, making calls to a load balancer with a few endpoint servers behind it. The only change that could plausibly be keeping the system from making connections was the utility I was running, so I killed it; when the API servers were checked, there were still 18,000+ network connections in TIME_WAIT on each system.

Linux treats both files on disk and network sockets as "open files," since each consumes a file descriptor. Too Many Open Files can mean literally too many files are open, or too many network connections are open, but it's usually a combination of the two.

Research time. This problem is usually related to "you didn't close the connections." That wasn't the cause here. The calls were straightforward: I had a function that created a transport, created a client, made the connection, called GET, then read the data to return to the caller. It was a textbook example fragment of Go adapted to my purposes, and it included a defer Close() call so that everything would be closed properly when the function exited.

So "did you close the connection" was checked off the list, and since I also read the data from the socket before closing it, that could be checked off too. I had a hacked-together bit of logic to retry connections on error, but it printed a status line to the console whenever that happened, and nothing appeared as the too many open files errors popped up; even if that retry logic caused a socket leak, it wasn't the likely cause.
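For what it's worth, the original function was shaped roughly like this (a simplified sketch, not the actual code; fetchBody is a name I'm using for illustration, and it assumes the usual net/http and io/ioutil imports):

 // Roughly the original, problematic shape: a brand new transport and client
 // are built on every single call.
 func fetchBody(url string) ([]byte, error) {
  tr := &http.Transport{}
  client := &http.Client{Transport: tr}

  resp, err := client.Get(url)
  if err != nil {
   return nil, err
  }
  defer resp.Body.Close()          // close the body when the function exits
  return ioutil.ReadAll(resp.Body) // read everything before returning
 }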

The issue was instantiating a transport each time the function was called. The transport holds the pool of client connections; it's the piece that lets the system re-use connections. Because the transport was thrown away each time the function returned, every call created a new pool of connections, which meant new sets of client connections to the server instead of recycling previous ones, and that led to thousands of "open files".

The solution was to create the transport once and pass it as a parameter to the function that GETs the web endpoint. This let the transport continue to manage the client pool outside the scope of the function call, so the system could keep a managed pool of connections for re-use.
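A sketch of that fix, under the same assumptions as the snippet above:

 // Created once, near main; http.Client is safe for concurrent use, and its
 // transport keeps finished connections idle so they can be re-used.
 sharedClient := &http.Client{Transport: &http.Transport{}}

 // The fetch function now borrows the shared client instead of building its own.
 func fetchBody(client *http.Client, url string) ([]byte, error) {
  resp, err := client.Get(url)
  if err != nil {
   return nil, err
  }
  defer resp.Body.Close()
  return ioutil.ReadAll(resp.Body)
 }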

This wouldn't have shown up if I were making periodic, occasional calls to different websites every few minutes. The problem would still be there, but chances are the connections would eventually close and time out before piling up and becoming a problem.

Too Many Files Leads to Terrible Times



There are a few obvious factors affecting how fast information gets processed, and at the risk of sounding immodest, I've been told I'm pretty good at spotting the obvious.

Warning: I'm not a Go expert. I'm citing information here that is just my current understanding, so if I'm wrong, please correct me in the comments.

Because I'm writing files, the drive can definitely affect performance. I have multiple goroutines that may be attempting disk operations in parallel at any given moment, so disk seek times, write times, and cache all directly impact the utility's speed.

I'm dealing with millions of files. During the initial testing and design of the utility, I had to handle a file that would unzip into a directory holding around 100,000 files, and then several of those 100K-file directories at once for processing. If you haven't tried that on a Macintosh using the HFS+ filesystem, it's not fun. EXT4 doesn't really handle it well either. Even on an SSD, getting a directory listing is downright painful; too many files in one directory is difficult for some filesystems to handle.

One solution is to split the directory into more subdirectories, reducing the number of entries the system has to track per directory. This is in fact the solution I used, splitting information into logical subsets.
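The splitting itself is cheap to do at write time in Go. A rough sketch, with record field names of my own invention standing in for the parsed data:

 // Build the nested output path and create any missing directories before writing.
 dir := filepath.Join("files", rec.Year, rec.Month, rec.Day, rec.DataType, rec.SubType)
 if err := os.MkdirAll(dir, 0755); err != nil {
  return err
 }
 // Each record then lands in its own file within that subtree.
 if err := ioutil.WriteFile(filepath.Join(dir, rec.Filename), data, 0644); err != nil {
  return err
 }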

Timing Out Connections


Another fun fact I learned during this project: by default, Go's http.Client doesn't have any timeouts set. That leads to some fun havoc with stray connections left in weird states; if you're using them to hit random sites in a semi-random fashion, you'd probably never notice, but hammer the same site with hundreds of requests per second and you can bet there will be ramifications.

I read about this in a blog post warning against using the default settings in http.Client. After reviewing that information, I went back to the source code and added some timeouts, like so:


 tr := &http.Transport{
  Dial: (&net.Dialer{
   Timeout: 30 * time.Second, // cap on establishing the TCP connection
  }).Dial,
  TLSHandshakeTimeout: 30 * time.Second, // cap on the TLS handshake
 }

 client := &http.Client{
  Transport: tr,
  Timeout:   time.Second * 10, // overall cap on the request, including reading the body
 }

This is a modification I made to the most intensively used connection path; I didn't bother moving the transport's scope for a far less-used connection in another function, figuring that yes, those connections would pile up to a degree, but they should close properly and age out. This path, by contrast, hammers the server with thousands of connections in parallel.

This basically added some sane timeouts to functions that previously had none, and it noticeably reduced the number of ghost connections.

Remove a Hindrance, Create a New One


The initial run finished a few days later, at which point I realized there was a bug in my loop logic. Some bad words were uttered and an updated version was compiled.

At this point we also moved the utility, and the volume to which data was being saved, to the same system that held the API endpoint server. Basically the server being queried for information was now also hosting the client requesting and processing results from the API queries.

This eliminated what had been a kind of natural bottleneck throttling performance: hundreds of connections per second hitting the server, but each spaced out by network transit time. Sure, that was on the scale of tens of milliseconds (when things were working well), but it really added up.

Now the client was making its requests over localhost. *Bam*. Within a few moments, the number of open connections (counted with netstat | wc -l, since I only needed a rough estimate) ballooned to 40,000 before this appeared on the console:

dial tcp <ip address redacted>: can't assign requested address

Because dial was in the error, it was most likely the client causing the issue. After some poking around, I ended up making two more changes.

First, I changed the number of idle connections the client keeps open per host. The default is two; anything beyond that gets closed instead of being returned to the idle pool for re-use. Again, this isn't so bad when you're connecting to assorted hosts, but hammering the same IP highlights the need to alter it (and you probably don't want to change it if you're not making a large number of frequent calls to the same host):


 tr := &http.Transport{
  Dial: (&net.Dialer{
   Timeout: 30 * time.Second, // cap on establishing the TCP connection
  }).Dial,
  TLSHandshakeTimeout: 30 * time.Second, // cap on the TLS handshake
  MaxIdleConnsPerHost: intIdleConns,     // keep more idle connections around per host for re-use
 }

The changed setting is MaxIdleConnsPerHost in the transport. Here it's set to a variable that in turn comes from the command line so I could tune it at runtime; instead of the default of 2, I ran it closer to 400.
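The command-line plumbing for that is just the standard flag package; something along these lines (the flag name and default here are my own illustration, not necessarily what the utility uses):

 // Expose the idle-pool size as a flag so it can be tuned per run without recompiling.
 idleConnsFlag := flag.Int("idleconns", 400, "max idle connections to keep per host")
 flag.Parse()
 intIdleConns := *idleConnsFlag // feeds MaxIdleConnsPerHost in the transport above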

The next change was an alteration on the host server. There is some guidance on a SO question explaining some tuning tweaks, but the gist of the change I made is this...

When a TCP connection is made, the client side of the connection binds to an ephemeral port. With a ton of TCP connections hitting the server, the supply of available ephemeral ports was being starved. The next step was to increase the number of ports available; then the host could support more connections per second, hopefully at a level where connections would close and age out properly before overloading the system.

In this case, I changed net.ipv4.ip_local_port_range from "32768 61000" to "9000 64500". From the SO question, that means the sustainable rate went from (61000-32768)/60 ≈ 470 sockets/second to (64500-9000)/60 = 925 sockets/second, the 60 being roughly the number of seconds a finished socket holds its port before it can be re-used.

There was another change suggested on that page involving the net.ipv4.tcp_fin_timeout setting, along with a couple of others. I avoided those, opting instead to test these changes first, because the tuning advice was framed as "change this on the client" or "change this on the server", not really geared to a situation where the server and client were eating resources on the same host. For this project, making the minimal changes that kept it working would be fine.

I ran netstat in a loop while the application ran again. This time the open connections quickly climbed to 70,000 before leveling out, and it held steady there. After 15 hours of elapsed runtime, only 3 connection errors had shown up; otherwise it kept up with the load just fine.

I should also mention that I ran 4 parallel processing tasks, one for each core. Boosting that number actually seemed to hinder processing speed; keeping it at 4, the estimated rate was over 100K records/minute, easily sustaining 5 or 6 times the speed seen when the client was on a separate machine.

This Was a Minimal Set of Changes


There were a number of lessons learned so far. Beyond the basic novice check that a response body is fully read before calling Close(), be aware that the transport is what controls the pool of client connections for efficient re-use.

Next, be aware that by default timeouts are missing from the transport and client. Add them. 

Also, if you're hitting a particular server or set of servers with a lot of requests, raise MaxIdleConnsPerHost; otherwise you're throwing away connections that could have been re-used.

Last, an easy way to boost connection rates is to increase the number of ephemeral ports available. There are limits to this...and you don't want to starve other resources by taking away those ports from other clients or servers on the host.

There are plenty of other changes that can be made to increase horsepower of your servers. Some additional changes are in the SO question I linked to; another good blog post discusses how MigratoryData scaled servers to 12 million concurrent connections. I'd only caution that not every task requires this kind of engineering and you might want to exercise restraint in changing things when a few tweaks can accomplish decent performance for your use case. 

Performance is a sliding scale. Some problems can be overcome by throwing lots of hardware at them. Sometimes a few tweaks will make your app run 5 or 6 times faster.

Happy tuning!