Sunday, January 20, 2013

How to Move Datacenters

(Disclaimer: as always, this is my blog, not my employer's. I'm not their spokesperson. They'll have their own blog entry about the datacenter move, so some details here will be a little vague...after all, I've grown rather attached to being employed.)

I haven't moved many times in my life.

I half-moved when I went to college, which normally entailed hauling a lot of my crap from home to a small shared bedroom space and back again every two semesters. I remember the "big move" when my wife and I bought a house, an adventure that involved a lot of storage totes and a hole in the wall covered by a strategically placed doorknob-guard that happened to match the paint on the wall. And then there was what was possibly the most bittersweet move: hauling as many of my belongings as possible in our Toyota Corolla to New York City in 100 degree weather.

That's about as much fun as it sounds. And seeing as it ended with two of us entering the city and one leaving...yeah, about as much fun as it sounds.

Thankfully, the company's data center move was not quite like that.

The company I work for needed to move data centers. Not for any scandalous reason or story of excess drama...it was simply a question of space and resources. Scaling predictions showed we would need more than the current facility could provide; we would either move or hit a scaling wall.

Our company happens to run a fairly popular website with over one and a half million registered users and a large number of anonymous visitors using our content.

So how do you manage a move of a website like that?

The first step is to become disaster resistant in case your primary data center is hit by a hurricane.

In case you forgot, New York City and New Jersey were recently hit by Hurricane Sandy, which took a number of tech sites offline as data centers around the island were systematically flooded and, in several cases, rarely-used generators suffered pump failures. Our company lucked out; we have a second data center on the other side of the country that dutifully replicated our data until it needed to step up to the plate.

As the storm intensified, the call was made to fail over to our secondary site...and we didn't fail back.

Months later, we were still serving data from the backup site, while our New York location acted as a non-production backup.

This ended up taking a bit of pressure off the team; now that the data center we were moving was no longer the "production" site, there was more flexibility in when things could be moved around.

Second, plan, plan, then plan some more.

There were a number of meetings and pow-wows to discuss the minutiae of the move: the type of racks to purchase, the power runs, the expected loads on each circuit, even the color coding to be used so it would be easier to identify what you accidentally unplugged while trying to reach something on a server.

Charts and checklists were made and cross-verified, and I even threw the occasional curveball by saying something like, "Okay, but try not to mix brown and green cables, or purples and blues, too much unless you don't want me to touch it. I'm partially colorblind." That elicited some surprised curses...invisible handicaps don't normally get considered by people who don't have them...and meant more revisions.

Got those checklists and charts all made and ready to go? Good. You'll have a Plan(tm) to follow until something goes kerplooey.

Um...where does that stuff go again?...

Third step: hire a good moving company that specializes in moving computer equipment. In our case, Morgen Industries in Secaucus, New Jersey. Yeah, I named them. Because they were that fucking awesome.

These guys gathered information about our servers...names, placement, etc...along with diagrams mapping where they would physically be placed in the new datacenter's racks. And they provided documentation that they were properly insured for moving our equipment, which is kind of important for moving what is essentially...you know, our entire business...through New York City traffic.

Migration measures were taken in the old data center: DNS names on remaining external services were changed along with the TTL values for the entire site, database clustering was taken offline between geographical locations, and then cables were disconnected. Arrangements were made for access to both data centers' freight elevators, and security was told to let the movers in, along with members of our own team flying in to lend a hand with the move.
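
If you're wondering what the TTL piece looks like in practice, here's a minimal sketch...not our actual tooling...of the kind of check you might run before a cutover to confirm the records really have dropped to a short TTL. It assumes the third-party dnspython package and uses made-up hostnames.

import dns.resolver  # third-party: pip install dnspython

HOSTS = ["www.example.com", "api.example.com"]  # hypothetical names
MAX_TTL = 300  # an arbitrary 5-minute target for the cutover window

for host in HOSTS:
    answer = dns.resolver.resolve(host, "A")
    ttl = answer.rrset.ttl  # remaining TTL if you hit a caching resolver, configured TTL if authoritative
    status = "OK" if ttl <= MAX_TTL else "STILL TOO HIGH"
    print(f"{host}: TTL={ttl}s [{status}]")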

The movers came in with boxes for the servers. They un-racked the systems and tucked them into their little foam-padded boxes along with scannable tags inventorying where the servers were at all times. They were fast. They were professional. After the cables were pulled, our team was mostly supervising.

Yeah...put it in that box there...good job, bro.

The moving team hauled everything to the new location ahead of schedule. In fact, that threw a kink into the plan, as the new building needed to adjust when the freight elevator could be used.

That's right. We were delayed because we needed someone with proper contractual rights to flip a switch on the elevator ahead of what was originally scheduled. Because the movers were too awesome to let something like schedules keep them from doing the job fast.

Step four is the fun one: getting things working again.

The moving team unboxed the servers and racked them according to specifications, and our team moved in to re-cable things.

The power cords got their own piles...

Management parts...now imagine boxes of patch cables. Many boxes of patch cables.

Systems were whipped out and cable management plans were pulled up so labeling could begin in earnest.

A labeler ordered just for the move, capable of making self-laminating labels. So. Many. Labels.

Cables were labeled and shuffled to the servers.

Realign the dilithium crystal and reroute power to the flux capacitor, then reboot...easy peasy.

As systems were plugged in, tests were run to verify connectivity to the new switching equipment, and firewall rules were adjusted accordingly. There were some occasional...um...challenges?

Dammit, the SQL Server's eating Craver again...

In the end, though, the crack coding commandos managed to iron most of the wrinkles out.
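
Connectivity testing at this stage doesn't have to be fancy. Here's a rough sketch of a port-level check...hypothetical hostnames and ports, standard library only, and not necessarily what we actually ran...that's enough to spot a missing firewall rule or a bad patch cable quickly:

import socket

# Hypothetical server/port pairs to verify from the new racks.
CHECKS = [
    ("web-01.internal", 80),
    ("sql-01.internal", 1433),
    ("lb-01.internal", 443),
]

for host, port in CHECKS:
    try:
        # A quick TCP connect is enough to prove the path through the new
        # switches and firewall rules actually works.
        with socket.create_connection((host, port), timeout=3):
            print(f"{host}:{port} reachable")
    except OSError as exc:
        print(f"{host}:{port} FAILED ({exc})")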

Dude, it works. I can crash Reddit twice as fast from this data center!...Where's Craver?...Craver?

All in all, we were pretty happy with the results.

Color coded, management arms, labeled, blinkied, semi-sentient...

That is the 10,000 foot view of a datacenter move. It's not completely finished as of this writing; our data is still being served from the backup site while testing is performed at the new site. Some DNS has not been migrated. Testing is still proceeding on the firewall rules for our site-to-site interconnects.

Some software upgrades are being implemented; then SQL Server has to be told that our New York site is back online so the data can begin re-syncing. Each day, several gigs of data accumulate at the backup site, waiting to pour back into our primary site. The physical move and cabling took the better part of a week to complete...that's a lot of time for data to pile up. We're also taking this opportunity to upgrade some of the servers to take advantage of less buggy clustering code, a decision made for reasons outside the scope of this blog posting.
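
To give a flavor of what "telling SQL Server we're back" can involve, here's a hedged sketch of watching the replication queues drain once a replica comes back online. It assumes an AlwaysOn-style availability group and the third-party pyodbc package; the DMV is real, but the connection string, server name, and thresholds are placeholders, and this isn't necessarily how our environment is configured:

import time
import pyodbc  # third-party ODBC bridge

# Placeholder connection string -- point it at the availability group primary.
CONN_STR = "DRIVER={SQL Server};SERVER=sql-primary;Trusted_Connection=yes"

# Run against the primary, this DMV reports one row per database per replica.
QUERY = """
SELECT DB_NAME(database_id) AS db,
       synchronization_state_desc,
       log_send_queue_size,   -- KB of log not yet sent to a replica
       redo_queue_size        -- KB of log received but not yet replayed
FROM sys.dm_hadr_database_replica_states;
"""

conn = pyodbc.connect(CONN_STR)
cursor = conn.cursor()
while True:
    rows = cursor.execute(QUERY).fetchall()
    for db, state, send_kb, redo_kb in rows:
        print(f"{db}: {state}, send queue {send_kb} KB, redo queue {redo_kb} KB")
    # Call it caught up once the queues have (nearly) drained everywhere.
    if all((r.log_send_queue_size or 0) + (r.redo_queue_size or 0) < 1024 for r in rows):
        break
    time.sleep(60)  # check again in a minute
conn.close()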

New shiny data center. Servers are fully patched and updated. Some of the servers even have new parts, upgraded while they were offline anyway. Now we just fight the occasional Chaos Monkey glitch in a switch or field a call about a firewall rule.

Step five is the big one: fail-back.

We're getting the infrastructure back up. Site-to-site VPNs. DNS. External services visible to the Internet again. Documenting connections. Testing new PDUs, and monitoring servers for reliability with their new cabling and possible bits shaken loose in transit.
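
For the "monitoring for reliability" part, even a dumb burn-in loop gives useful signal while real monitoring comes back online. A sketch, with hypothetical hostnames, an arbitrary management port, and only the standard library...real monitoring (including ours) does far more than this:

import socket
import time
from collections import Counter

SERVERS = ["web-01.internal", "sql-01.internal", "lb-01.internal"]  # hypothetical
PORT = 3389    # some management port you expect to always answer (RDP here)
ROUNDS = 48    # e.g. every 5 minutes for 4 hours
PAUSE = 300

failures = Counter()
for _ in range(ROUNDS):
    for host in SERVERS:
        try:
            with socket.create_connection((host, PORT), timeout=3):
                pass
        except OSError:
            # A flaky cable or a part jostled in transit tends to show up as
            # intermittent failures here rather than a clean outage.
            failures[host] += 1
    time.sleep(PAUSE)

for host in SERVERS:
    print(f"{host}: {failures[host]} failed checks out of {ROUNDS}")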

Soon we'll have the meetings to coordinate the fail-back procedure, wherein everything in the remote site is shifted back to our new primary site with as little downtime as possible. This includes web servers, SQL servers, load balancers and internal services.

There you have it: the 10,000 foot view of a major website with lots of jiggly wiggly parts being moved to another data center. This is meant for people with a passing interest in how one company achieves such a move. I didn't get into the excruciating details of SQL cluster reconfigurations, the internal services being migrated, or the VM migrations.

In this particular instance, it really boiled down to a few steps.
1) Have a secondary site to run your business from.
2) Disconnect dependencies between your secondary and primary sites.
3) Physically move the servers.
4) Test the new connections at the new site.
5) Plan the migration of your backup site to the primary site.

A few notes to keep in mind:

1) We happened to have the resources for a second data center, which was in place for historical reasons. Not every business has this, and it's not a "right" or "wrong" thing. It's how things worked out for our particular business and it gave us a big advantage in making our transition.

2) The new data center restricts what can and cannot be shown for security reasons. The pictures I posted above were taken with the understanding that we can show our own equipment and only our own equipment, so I tried to be careful not to get other equipment housed at the new site into the shots. If images are pulled, you know why.

3) There were rumors of a plan to keep our site online during the move using new racks on wheels, big UPSes, and MiFis. We'll pretend those were just rumors.

4) I work with a team of highly intelligent and capable people. While this blog posting was glib and probably made the move sound simple, the truth is there were numerous points where things could have gone south in a really bad way, and the advance planning performed by the team kept everything running relatively smoothly. For most of the team it was a week of late nights spent neck-deep in reconfiguring firewalls and switches and soothing database burps, while I spent much of the migration handling office issues and helping our sales and remote groups connect to our internal systems as they were brought back online. It takes a lot of hard work to make something like this look easy...those guys deserve a lot of credit, from the guy who coordinated the movers and scheduled elevators to the admin who plugged in the last cable and the devops folks who altered the last firewall rule. These people were awesome...credit where credit is due.

5) This case was just how we ended up moving to a new datacenter. Depending on how a business grew its infrastructure, the behind-the-scenes methodology and drama could unfold in a very different manner. Our drama was limited to toe shoes, eventually fixing the flat tires on a hand truck, and the occasional aggressive negotiation over certain logistics of the move. Your mileage may vary.
