I haven't moved many times in my life, but the one big move I did make was about as much fun as it sounds. And seeing as it ended with two of us entering the city and one leaving...yeah, about as much fun as it sounds.
Thankfully, the company's move between data centers was not quite like that.
The company I work for needed to move data centers. Not for any scandalous reason or story of excess drama...it was simply a question of space and resources. Scaling predictions showed we would need more than our current facility would be able to provide; we would need to move or hit a scaling wall.
Our company happens to run a fairly popular website with over one and a half million registered users and a large number of anonymous visitors using our content.
So how do you manage a move of a website like that?
The first step is to become disaster resistant in case your primary data center is hit by a hurricane.
In case you forgot, New York City and New Jersey were recently hit by Hurricane Sandy, which took a number of tech sites offline as data centers around the island were systematically flooded and, in several cases, rarely used generators suffered pump failures. Our company lucked out; we have a second data center on the other side of the country that dutifully replicated our data until it needed to step up to the plate.
As the storm intensified, the call was made to fail over to our secondary site...and we didn't fail back.
Months later, we were still serving data from the backup site, while our New York location acted as a non-production backup.
This ended up taking a bit of pressure off the team; now that the data center we were moving was no longer the "production" site, there was more flexibility in when things could be moved around.
Second, plan, plan, then plan some more.
There were a number of meetings and pow-wows to discuss the minutiae of the move: the type of racks to purchase, the power runs, the expected loads placed on circuits, even the color coding to be used so it would be easier to identify what you accidentally unplugged while trying to reach something on a server.
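Just to make the circuit-load part of that concrete, here's the kind of back-of-the-envelope math those meetings produced. This is only a rough sketch; the wattages, feed voltage, and breaker rating below are made-up placeholders, not our actual hardware or power.

```python
# Rough power-budget sketch: does a rack's planned load fit on one circuit?
# All numbers are illustrative placeholders, not our real gear.
CIRCUIT_VOLTS = 208      # assumed feed voltage
CIRCUIT_AMPS = 30        # assumed breaker rating
DERATE = 0.80            # keep continuous load at or below 80% of the rating

servers_watts = {        # hypothetical rack contents: name -> expected watts
    "web-01": 350,
    "web-02": 350,
    "sql-01": 750,
    "sql-02": 750,
}

usable_watts = CIRCUIT_VOLTS * CIRCUIT_AMPS * DERATE
planned_watts = sum(servers_watts.values())

print(f"Planned load: {planned_watts} W of {usable_watts:.0f} W usable")
if planned_watts > usable_watts:
    print("Over budget -- spread this rack across another circuit.")
```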
Got those checklists and charts all made and ready to go? Good. You'll have a Plan(tm) to follow until something goes kerplooey.
[Photo: Um...where does that stuff go again?]
Third step: hire a good moving company that specializes in moving computer equipment. In our case, Morgen Industries in Secaucus, New Jersey. Yeah, I named them. Because they were that fucking awesome.
These guys gathered information about our servers...names, placement, etc...along with diagrams mapping where they would physically be placed in the new datacenter's racks. And they provided documentation that they were properly insured for moving our equipment, which is kind of important for moving what is essentially...you know, our entire business...through New York City traffic.
Migration measures were taken in the old data center: DNS names on the remaining external services were changed along with the TTL values for the entire site, database clustering between geographic locations was taken offline, and then cables were disconnected. Arrangements were made for access to both data centers' freight elevators, and security was told to let the new guys in, along with the members of our own team flying in to lend a hand with the move.
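The TTL piece, by the way, is the sort of thing you can sanity-check with a few lines of script before anyone touches a cable. Here's a minimal sketch of that idea, assuming the third-party dnspython package and using placeholder hostnames; this isn't our actual tooling.

```python
# Verify that DNS TTLs have actually dropped before the cutover window.
# Requires: pip install dnspython. Hostnames below are placeholders.
# (Querying through a caching resolver shows the remaining TTL, which is
# good enough for a quick sanity check.)
import dns.resolver

HOSTS = ["www.example.com", "api.example.com"]   # hypothetical records
MAX_TTL = 300                                    # what we want to see, in seconds

for host in HOSTS:
    answer = dns.resolver.resolve(host, "A")
    ttl = answer.rrset.ttl
    status = "ok" if ttl <= MAX_TTL else "STILL TOO HIGH"
    print(f"{host}: TTL {ttl}s ({status})")
```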
The movers came in with boxes for the servers. They un-racked the systems and tucked them into their little foam-padded boxes along with scannable tags inventorying where the servers were at all times. They were fast. They were professional. After the cables were pulled, our team was mostly supervising.
[Photo: Yeah...put it in that box there...good job, bro.]
The one holdup? The freight elevator. That's right: we were delayed because we needed someone with the proper contractual rights to flip a switch on the elevator ahead of what was originally scheduled, because the movers were too awesome to let something like a schedule keep them from doing the job fast.
Step Four is the fun step. Getting things working again.
The moving team unboxed the servers and racked them according to specifications, and our team moved in to re-cable things.
[Photo: The power cords got their own piles...]
[Photo: Management parts...now imagine boxes of patch cables. Many boxes of patch cables.]
[Photo: A labeler ordered just for the move, capable of making self-laminating labels. So. Many. Labels.]
[Photo: Realign the dilithium crystal and reroute power to the flux capacitor, then reboot...easy peasy.]
[Photo: Dammit, the SQL Server's eating Craver again...]
[Photo: Dude, it works. I can crash Reddit twice as fast from this data center!...Where's Craver?...Craver?]
[Photo: Color coded, management arms, labeled, blinkied, semi-sentient...]
Some software upgrades are being implemented; then SQL Server has to be told that our New York site is back online so the data can begin re-syncing. Each day, several gigs of data accumulate at the backup site, waiting to pour back into our primary site, and the physical move and cabling took the better part of a week to complete...that's a lot of time for data to pile up. We're also taking this opportunity to upgrade some of the servers to take advantage of less buggy clustering code, a decision made for reasons outside the scope of this blog posting.
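For a feel of why the backlog matters, here's a purely illustrative back-of-the-envelope estimate of the catch-up time. The change rate and link speed are assumptions for the sake of the example, not measurements from our environment.

```python
# Illustrative re-sync estimate: how long to drain a week's worth of changes?
# All numbers are assumptions, not real figures.
days_offline = 7                 # roughly how long the moved site was dark
change_gb_per_day = 5            # "several gigs" of new data accumulating daily
link_mbps = 100                  # assumed usable cross-country bandwidth

backlog_gb = days_offline * change_gb_per_day
link_gb_per_day = link_mbps / 8 * 86400 / 1024   # Mbps -> GB per day
catchup_days = backlog_gb / (link_gb_per_day - change_gb_per_day)

print(f"Backlog: {backlog_gb} GB; link drains ~{link_gb_per_day:.0f} GB/day")
print(f"Estimated catch-up time: {catchup_days * 24:.1f} hours")
```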
Shiny new data center. Servers fully patched and updated. Some of the servers even have new parts, upgraded while they were offline. Now we just fight the occasional Chaos Monkey glitch in a switch or field a call about a rule in the firewall.
Step five is the big one: fail-back.
We're getting the infrastructure back up. Site-to-site VPNs. DNS. External services visible to the Internet again. Documenting connections. Testing new PDUs and monitoring servers for reliability with their new cabling and any bits shaken loose in transit.
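"Monitoring servers for reliability" mostly means boring, repetitive checks, which is exactly what scripts are for. Here's a bare-bones sketch of a reachability sweep using nothing but the standard library; the hostnames and ports are placeholders, not our real inventory.

```python
# Simple TCP reachability sweep for freshly re-cabled servers.
# Hostnames and ports below are placeholders.
import socket

CHECKS = [
    ("web-01.internal", 80),
    ("sql-01.internal", 1433),
    ("lb-01.internal", 443),
]

def reachable(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in CHECKS:
    print(f"{host}:{port} is {'up' if reachable(host, port) else 'DOWN'}")
```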
Soon we'll have the meetings to coordinate the fail-back procedure, wherein everything in the remote site is shifted back to our new primary site with as little downtime as possible. This includes web servers, SQL servers, load balancers and internal services.
In this particular instance, it really boiled down to a few steps.
1) Have a secondary site to run your business from.
2) Disconnect dependencies between your secondary and primary sites.
3) Physically move the servers.
4) Test the new connections at the new site.
5) Plan the migration of your backup site to the primary site.
A few notes to keep in mind:
1) We happened to have the resources for a second data center, which was in place for historical reasons. Not every business has this, and it's not a "right" or "wrong" thing. It's how things worked out for our particular business and it gave us a big advantage in making our transition.
2) The new data center is restrictive about what can and cannot be shown for security reasons. The pictures I posted above were taken with the understanding that we can show our own equipment and only our own equipment, so I tried to be careful not to catch anyone else's gear housed at the new site in the shots. If images are pulled, you know why.
3) There were rumors of a plan to keep our site online during the move using new racks on wheels, big UPS's and MiFi's. We'll pretend those were just rumors.
4) I work with a team of highly intelligent and capable people. While this blog posting was glib and probably made the move sound simple, the truth is there were numerous points where things could have gone south in a really bad way, and the advance planning performed by the team kept everything running relatively smoothly. It was a week of late nights neck-deep in reconfiguring firewalls and switches and riding out database burps for most of the team, while I spent much of the migration handling office issues and helping our sales and remote groups connect to our internal systems as they were brought back online. It takes a lot of hard work to make something like this look easy...those guys deserve a lot of credit, from the guy that handled coordinating the movers and scheduling elevators to the admin that plugged in the last cable and the devops that altered the last firewall rule. These people were awesome...credit where credit is due.
5) This case was just how we ended up moving to a new data center. Depending on how a business grew its infrastructure, the behind-the-scenes methodology and drama could unfold in a very different manner. Our drama was limited to toe shoes, eventually fixing the flat tires on a hand truck, and the occasional aggressive negotiation when discussing certain logistics of the move. Your mileage may vary.