Monday, April 28, 2014

Testing File Transfers over Network

Let's Set Stuff Up

(Note - thanks to Tableizer for making these tables a little easier to create...)

In my previous post I outlined a procedure for transferring disk images over the network using a file image. Along the way I found some interesting things happened when I tried to save time by compressing the file, both in-flight and pre-compressed.

I thought it might be best to try some testing of transfers, just to see what conclusions could be drawn from the results.

First things first. Let's create a few test files. First, a 1 gigabyte file of near-nothings. This would be like a drive image of a new hard disk that is mostly unallocated sectors, therefore highly compressible. On my OS X system, I run:

time dd if=/dev/zero of=./1_gig_zeroed_blocks_fw.img bs=1g count=1

...and let it go. time will time the file creation; dd reads from the zero (nothing?) device and writes a single one-gigabyte block (bs = block size, count = number of blocks) to my external FireWire disk drive. Note that I'm using OS X; on Linux, dd's size suffixes are a little different, such as bs=1G instead of the lowercase g.
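For reference, here's a minimal sketch of the same command on a Linux box, assuming GNU dd (which wants the uppercase G suffix); the output name is just a throwaway:

# same one-gig file of zeroes, GNU/Linux flavor of dd
time dd if=/dev/zero of=./1_gig_zeroed_blocks.img bs=1G count=1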

...looks good. Let's create one on my machine's internal SSD drive:

time dd if=/dev/zero of=./1_gig_zeroed_blocks_ssd.img bs=1g count=1

Let's create a file of random things. This would be a hard disk image full of software and other non-zero content on the filesystem.

On my FireWire drive:

time dd if=/dev/random of=./1_gig_random_blocks_fw.img bs=1g count=1

On the SSD drive:

time dd if=/dev/random of=./1_gig_random_blocks_ssd.img bs=1g count=1

So far so good.

File                          | FireWire, zeroed                 | FireWire, random
Time to create (or compress)  | 0m16s                            | 2m11s
Size (ls -al)                 | 1073741824                       | 1073741824
Size (ls -alh)                | 1.0G                             | 1.0G
MD5                           | cd573cfaace07e7949bc0c46028904ff | 9e58b0cbae41c2ad7a91cb6b8f2cd6a0

File                          | SSD, zeroed                      | SSD, random
Time to create (or compress)  | 0m4s                             | 1m24s
Size (ls -al)                 | 1073741824                       | 1073741824
Size (ls -alh)                | 1.0G                             | 1.0G
MD5                           | cd573cfaace07e7949bc0c46028904ff | 0aa62e9b1922cb8a34623afff6648981

There's something to note here: the use of /dev/random created really random files. The md5 hashes show that the two random files are indeed different. But the two separately created files consisting of output from /dev/zero have the same hash. That's because zero gives the same data every time: nothing. These represent the two extremes, from unused space on a drive to a drive totally filled with a mishmash of information. Your drive contents will fall somewhere in between, of course, and as the drive fills and has data "deleted" and overwritten it will gradually move towards the mishmash side.
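For reference, the hashes in these tables come from the stock tools on each platform; presumably something along these lines (md5 ships with OS X, md5sum with most Linux distributions):

# on the OS X side:
md5 1_gig_zeroed_blocks_fw.img 1_gig_random_blocks_fw.img
# on the Linux side (used later on the received copies):
md5sum test.img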

This also illustrates how much faster the SSD is than the external FireWire drive, although I can't say I'm too surprised at that.

Second, we create some compressed files. I expect the compressed zeroed-blocks files to be significantly smaller than the random-blocks versions. On the FireWire drive:

time bzip2 -c 1_gig_random_blocks_fw.img > 1_gig_random_blocks_fw.img.bz2

time gzip -c 1_gig_random_blocks_fw.img > 1_gig_random_blocks_fw.img.gz


File                          | FireWire, random, bzip2          | FireWire, random, gzip
Time to create (or compress)  | 3m46s                            | 0m39s
Size (ls -al)                 | 1078480063                       | 1074069404
Size (ls -alh)                | 1.0G                             | 1.0G
MD5                           | 1584ac641ae4989ef6439000e9a591b9 | 7e41488b6f4ffbb1b87f4d49dbc7ea02

Holy cow!

Compression works, at the highest level of abstraction, by looking for patterns in the data and substituting a shorthand to represent them. Without forcing any "ultra" compression levels, the files barely shrank. In fact, they grew, with the overhead of the compression metadata tacked on. The random data just couldn't be compressed, much like trying to compress already-compressed data.
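If you wanted to force the maximum compression level anyway, the switch is -9 for both tools (a hedged aside: bzip2 already defaults to its largest block size, so this mostly matters for gzip, whose default level is 6). It still won't do much for random data. The output names here are throwaways so the earlier files don't get clobbered:

# crank both compressors to their maximum level, just to see
time bzip2 -9 -c 1_gig_random_blocks_fw.img > 1_gig_random_blocks_fw_max.img.bz2
time gzip -9 -c 1_gig_random_blocks_fw.img > 1_gig_random_blocks_fw_max.img.gz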

time bzip2 -c 1_gig_zeroed_blocks_fw.img > 1_gig_zeroed_blocks_fw.img.bz2

time gzip -c 1_gig_zeroed_blocks_fw.img > 1_gig_zeroed_blocks_fw.img.gz

File                          | FireWire, zeroed, bzip2          | FireWire, zeroed, gzip
Time to create (or compress)  | 0m33s                            | 0m6s
Size (ls -al)                 | 785                              | 1043683
Size (ls -alh)                | 785B                             | 1.0M
MD5                           | 9192d766e556ac3c470bff28a0af7b04 | eaee7c163450c9f739eb16599f1633ea

And another wow. Remember it was looking for shorthand to substitute for patterns? It would appear that "nothing" compresses quite a bit.

(If you want some fun, you can really freak someone out emailing them a 1 megabyte file with instructions to unzip it...no, don't do that. It's not nice.)

Again, these are extreme examples. You can't tell what the real-world performance is for the two compressors from these files.

On the SSD drive:

time bzip2 -c 1_gig_random_blocks_ssd.img > 1_gig_random_blocks_ssd.img.bz2

time gzip -c 1_gig_random_blocks_ssd.img > 1_gig_random_blocks_ssd.img.gz


File                          | SSD, random, bzip2               | SSD, random, gzip
Time to create (or compress)  | 3m41s                            | 0m39s
Size (ls -al)                 | 1078489250                       | 1074069405
Size (ls -alh)                | 1.0G                             | 1.0G
MD5                           | ccfca3a4aafd0bf32b1a0643097af1b4 | e9d84661c44c00828610b4908ec33dd1

Well, this is kind of interesting...the times to compress the files are close to what it took to compress the random files on the FireWire drive. That would mean that the compressors, not the drives, were the bottleneck. 

But how could the compression take less time than the initial creation of the files? I'd guess it's filesystem caching; the data hadn't all been flushed to the drive yet, but as far as the applications were concerned, it was done. Don't turn off the computer without flushing buffers first or you'll get a whoopsie.
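As an aside, if you want to take that caching out of the picture before timing a run, flushing first is cheap; a minimal sketch:

# write out dirty buffers (works on both Linux and OS X)
sync
# OS X only: also drop the disk cache (the purge command shows up again later in this post)
sudo purge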

time bzip2 -c 1_gig_zeroed_blocks_ssd.img > 1_gig_zeroed_blocks_ssd.img.bz2

time gzip -c 1_gig_zeroed_blocks_ssd.img > 1_gig_zeroed_blocks_ssd.img.gz


File                          | SSD, zeroed, bzip2               | SSD, zeroed, gzip
Time to create (or compress)  | 0m12s                            | 0m6s
Size (ls -al)                 | 785                              | 1043684
Size (ls -alh)                | 785B                             | 1.0M
MD5                           | 9192d766e556ac3c470bff28a0af7b04 | dd03953bfcd54f3c5a8457978079ec36

Still extremely small for the zeroed files. Also notice that the MD5s for the zeroed bzip2 files match between the SSD and the FireWire drive, but the gzip files do not. That's kind of interesting...something in the metadata must be different. My best guess is gzip's header, which stores the original file name and a timestamp (that would also explain the one-byte size difference, since "ssd" is one character longer than "fw"), while bzip2 stores neither. (Still speculation.)
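If that guess is right, gzip's -n switch (don't store the original name and timestamp) should make the two gzipped zero files come out byte-for-byte identical. A quick way to check; the output names are throwaways and the paths are simplified:

# recompress both zeroed images without the name/timestamp metadata, then compare hashes
gzip -n -c 1_gig_zeroed_blocks_fw.img > zeroed_fw_noname.gz
gzip -n -c 1_gig_zeroed_blocks_ssd.img > zeroed_ssd_noname.gz
md5 zeroed_fw_noname.gz zeroed_ssd_noname.gz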

Does It Matter If I Cat Instead Of DD?

Let's test a 1-gig file transfer and see what time it takes.

On the Linux machine, I tell it to listen for the incoming file. Straight copy with dd, and my working directory is a mounted internal 500GB drive.

nc -l 19000 | dd of=./test.img

On my sending machine, I send the file from my internal SSD drive:

time cat 1_gig_random_blocks_ssd.img | nc <target machine ip> 19000

The result: 1m37s.
MD5Sum: 0aa62e9b1922cb8a34623afff6648981  test.img

I had to use time on the Mac (sending) side because timing the open, "listening" side would also include however long I took to actually start sending data. Timing from the sending side measures from the start of the send to the closing of the connection.

Now I'll try again, using cat on the receiving side. On the target, after deleting the image:

nc -l 19000 | cat > ./test.img

On the sending machine, same as before:

time cat 1_gig_random_blocks_ssd.img | nc <target machine ip> 19000

The result:  0m13s
MD5Sum: 0aa62e9b1922cb8a34623afff6648981  test.img

Conclusion: Yes, the implementations behind dd vs. cat are different and have a dramatic effect on the speed with which the file is transferred.
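My hunch, untested in this run, is that dd's default 512-byte block size is the culprit: it chops the pipe into a huge number of tiny reads and writes, while cat uses much larger buffers. If that's right, giving the receiving dd an explicit block size should close most of the gap; a sketch:

# same listener, but dd reads the pipe in 1 MB chunks instead of 512-byte ones
nc -l 19000 | dd bs=1M of=./test.img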

Does It Matter If The File Is Mostly "Empty" With Cat?

Let's try transferring the zeroed file versus the random file. Several runs will give a good idea of how much variation there is in the times.

On the target machine:

nc -l 19000 | cat > ./test.img

While on the source machine:

time cat 1_gig_random_blocks_ssd.img | nc <target machine ip> 19000

The result: 0m11s, 0m12s, 0m12s
MD5Sum: 0aa62e9b1922cb8a34623afff6648981  test.img

Now, I transfer the mostly empty file. On the target:

nc -l 19000 | cat > ./test.img

On the source:

time cat 1_gig_zeroed_blocks_ssd.img | nc <target machine ip> 19000

The result: 0m13s, 0m12s, 0m11s
MD5Sum: cd573cfaace07e7949bc0c46028904ff  test.img

Conclusion: Not a significant difference. A big file is a big file. Period.

What If I Repeat It With DD? Is a "Zeroed" File Faster?

Let's find out. On the target system:

nc -l 19000 | dd of=./test.img

On the source:

time cat 1_gig_random_blocks_ssd.img | nc <target machine ip> 19000

The result: 1m38s, 1m36s, 1m37s
MD5Sum: 0aa62e9b1922cb8a34623afff6648981  test.img

Now with the zeroed file. On the target:

nc -l 19000 | dd of=./test.img

And the source:

time cat 1_gig_zeroed_blocks_ssd.img | nc <target machine ip> 19000

The result: 1m37s, 1m37s, 1m37s
MD5Sum: cd573cfaace07e7949bc0c46028904ff  test.img

Conclusion: No, the zeroed file doesn't make any difference with dd compared to writing the random-bits file.

Does Transferring, Using Cat, From My External FireWire Drive Affect The Transfer Time?

Maybe? This would test if the FireWire drive is a bigger bottleneck than the network. 

When I was doing the previously-blogged drive image test, I had a 500 gig file at the time, and the only drive with space to hold an image that size was the external FireWire drive. I wondered whether the internal SSD would have sped up the process significantly, and hoped my time hadn't been badly wasted, if the network really was a bigger bottleneck than the drive's transfer rate.

So to test this I once again set up the target to listen for the transfer:

nc -l 19000 | cat > test.img

And on my source, I cd to the FireWire drive holding the set of FireWire-hosted test images and send one over the network:

time cat 1_gig_random_blocks_fw.img | nc <target machine ip> 19000

The result: 0m43s, 0m12s, 0m11s
MD5Sum: 9e58b0cbae41c2ad7a91cb6b8f2cd6a0  test.img

...What? It fell from 43 seconds down to 12 seconds? The only explanation I have is file caching...

Let's first compare to a transfer from the SSD; I use the same command on the target:

nc -l 19000 | cat > test.img

Then from the SSD drive on the source:

time cat 1_gig_random_blocks_ssd.img | nc <target machine ip> 19000

Result: 0m12s, 0m11s, 0m12s
MD5Sum: 0aa62e9b1922cb8a34623afff6648981  test.img

Now let's try that FireWire again...

Target:

nc -l 19000 | cat > test.img

Source:

time cat 1_gig_random_blocks_fw.img | nc <target machine ip> 19000

Results: 0m22s, 0m12s, 0m12s

And if I try a 1 gig file from the FireWire drive that wasn't transferred before?

On the source:

time cat 1_gig_zeroed_blocks_fw.img | nc <target machine ip> 19000

Results: 0m43s, 0m12s, 0m12s

The same giant time drop after the first iteration. I wonder if running

sudo purge

...between tests would affect it? The purge command is supposed to flush disk cache.

Let's see; repeating the above, with the random block img file as before, but running purge between runs...

The results: 0m22s, 0m44s, 0m43s

Yup, it's a cache. The first run has a kind of partial-flush going on from the change from the previous tests (I didn't purge the cache before the first run.) 

Conclusion: Yes, there is a difference, but when dealing with a 1 gig file the memory caching in OS X will offset the difference after the first run, as long as you're repeating the process. The cache levels the field. On the first run, though, the SSD spanked the FireWire transfer quite soundly, and if you're dealing with files larger than can be cached, subsequent runs probably won't get that caching boost either.

Transferring the Compressed Files

When imaging systems, I would think that storing an image of the drive as a compressed file on the source system would help speed up the process compared to transferring an image that was 1:1 in size. But does it really? If it does have an effect, how much does the compression affect the transfer time? 

Here I created a couple of files that are tiny relative to their decompressed size. If I do the transfer, will the tiny file dump across the network and the connection close, leaving the remote system to decompress and write the file to disk? Or is there some mechanism that keeps the network connection open, making the savings in imaging time dependent on how fast the decompression algorithm can chew through the data?

Let's find out. On the target system:

nc -l 19000 | bzip2 -d | cat > test.img

On the source:

time cat 1_gig_zeroed_blocks_ssd.img.bz2 | nc <target machine ip> 19000

The bzip2 version of the zeroed file is 785 bytes and expands to a gig, making the ratio of compressed to decompressed sizes absolutely huge. Dumping the file over the network should literally take only a few moments; I'm fairly sure the decompression and writing of the file will take longer than sending the 785 bytes over the wire. So how long did the above commands take? (And I ran purge between attempts...)

Results: 0.007s, 0.007s, 0.007s
MD5Sum of the decompressed file: cd573cfaace07e7949bc0c46028904ff  test.img

The MD5 of the decompressed file shows that it did transform back into the original file. And the speed with which the source side completed implies that the transfer does indeed dump the compressed file and close the network connection as soon as the data is sent, leaving the target system to decompress and write the file out from a buffer. So what is the time difference between the two actions (the sender sending the file, and the target receiving, decompressing, and writing the file to disk)?

There are a few ways to get a rough idea, but I'm going to utilize iTerm's ability to mirror commands to multiple sessions so that I can set up the two commands on the two systems, both set with "time", and send the enter-key keystroke at the same time to both systems (iTerm calls it "broadcast input"). The result?

Source: 0.032s
Target: 9.6s

Definite discrepancy.
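As an aside, if you don't have iTerm's broadcast input handy, a rougher way to time both ends from one seat is to launch the listener over ssh; a sketch, assuming sshd is running on the target, the remote login shell is bash (so the time keyword works), and the user name is a placeholder:

# start the timed listener remotely in the background, then run the timed sender locally
ssh user@target 'time (nc -l 19000 | bzip2 -d | cat > test.img)' &
sleep 2        # give the remote listener a moment to start
time cat 1_gig_zeroed_blocks_ssd.img.bz2 | nc <target machine ip> 19000
wait           # collect the remote timing output when it finishes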

What if I do the same thing with the gzipped version? On the target:

time nc -l 19000 | gzip -d | cat > test.img

On the source:

time cat 1_gig_zeroed_blocks_ssd.img.gz | nc <target machine ip> 19000

And the results:
Target: 7.9s
Source: 7.4s

How long does it take to transfer a random 1 megabyte file, similar in size to the gzipped file? First I create a file to transfer:

dd if=/dev/random of=./1_meg_random_blocks_ssd.img bs=1m count=1

Just for comparison:

1043684 Mar 25 16:55 1_gig_zeroed_blocks_ssd.img.gz

1048576 Apr  6 17:11 1_meg_random_blocks_ssd.img

Then transfer; on the target:

time nc -l 19000 | cat  > test.img

And send from the source:

time cat 1_meg_random_blocks_ssd.img | nc <target machine ip> 19000

And the results:
Target: 0.019s
Source: 0.018s

Strange. I don't know how to explain it for certain, at least not without digging in with Wireshark, but my best guess is buffering and TCP flow control: the 785-byte bzip2 file fits entirely inside the kernel's pipe and socket buffers, so the sender can dump it and close immediately, while the roughly one-megabyte gzip file overflows those buffers, so the sender gets throttled to the pace at which the target can decompress and write the data out. It could also be something in how gzip consumes its input compared to bzip2, but at 785 bytes the bzip2 file would fit in a single chunk no matter what.

Conclusion: buffering clearly has an influence; the sending side will close the socket as soon as it can dump the whole file into the target side's buffers. Depending on the compressed size and how quickly the target can decompress and write, sending a compressed file can help.

My Takeaways: After running this series of rough benchmarks, several of my preconceived notions turned out to be inaccurate. You shouldn't simply count on suppositions that should make sense to actually make sense.

There are several questions brought up by the results, and I don't have definitive answers to all of them. Ironically I have speculation that makes sense...but that's what I was testing in the first place.

Monday, April 21, 2014

Performing an Old School DD Over Netcat Clone With Speed Mysteries on the Way

This Post Evolved

I initially wrote this as a tutorial (okay, yet another tutorial) on using dd and netcat to clone systems because I was having trouble getting a successful clone with our existing toolset.

What happened was that my modified attempts at speeding up the process, which should have worked, didn't. The tutorial, written as notes while following along with the process, became part of a head-scratching mystery.

I'm sure there are other admins out there that will know that "of course" this happened and "he should have done this" to find the problem. But I'm also sure there are others who can relate to my discoveries and troubleshooting process.

What follows is material you can read and distill into your own notes for a dd-over-netcat solution, along with a bit of a narrative ride investigating a what-the-hell puzzle. There are just enough variables thrown into the mix that you may not run into these issues. But if you do, especially if you run a mixed platform environment, this scenario may sound all too familiar.

The Cloning

I previously ranted about the difficulty in cloning Windows and how it is seemingly architected to make the cloning process more difficult. But hey...if it were simple, we wouldn't have to find creative solutions to seemingly simple tasks, right?

So how do you perform an old-school clone?

If you have 2 drives that are the same in size, or you need to clone out from a source disk that is smaller than the target drive, this method should work.

Here's how I did it, and from the description you should be able to adapt it to your needs.

I have one Windows 7 laptop with our custom settings and software installs; it needs to be duplicated to three other laptops of the same make and model (and hard disk sizes.)

First Windows has to be prepped. I have software installed for the laptop to act as a pseudo-kiosk and lock settings down. I had to tell this software to temporarily disable and allow for an "imaging mode" which, if the Internet didn't lie to me, basically tells this software to not only disable but to not load its custom drivers. Otherwise there could be some fun times ahead of you.

The next step in prepping Windows is running Sysprep. This strips out system-specific information and puts the machine in OOBE (out of box experience) mode...that's what causes it to ask questions like how to connect to a network and what name you want for the machine when it first boots up.

Just running %WINDIR%\system32\sysprep\sysprep.exe will bring up the little GUI interface for it; I select the "generalize" and "OOBE" options and set it to shut down. Otherwise, run the executable with the "/generalize /oobe /shutdown" options. All turned off? Good. Leave it off. If you turn it on, the monster will wake up, run through its out-of-box setup, and you'll get to Sysprep it all over again.

Second, I will need a bootable Linux distribution. There are instructions galore on the webbertubes for creating a bootable USB image. I used UNetbootin, an application that automates the creation of a USB distribution. And I mean automates: right down to downloading a distribution from the source website for you, so you don't even need an ISO ready to go first. UNetbootin has downloads for Windows, Linux and OS X. Create the USB flash drive boot install and plug it into your source computer.

Third step is to boot the original laptop, but catch it before Windows boots. You want to go into the BIOS (or "setup" as it's becoming more popularly labeled) and change the boot order to boot first from the USB device. I put this step here because it was easier with the USB drive already plugged in; in my particular case the machine saw the drive and setup specifically listed it as a boot device by name. Save settings, reboot. In some variations you can use a one-time boot menu to boot from the USB drive, or set it up earlier to boot from "USB Device." This entirely depends on your BIOS and your ability to puzzle through its settings.

Fourth step is to actually boot to Linux. I used Ubuntu. Boot it LIVE. Do NOT run the installer. That would be tremendously kick-yourself-stupid at this point. Also, don't let it boot from the hard disk. That'll ruin your Sysprep state. Boot the Live CD version. Once it's open, navigate to a terminal. The beautiful, beautiful command line.

Fifth step is to prep my computer to get the image. I want to image this out to multiple systems, so I'm going to create an image file that is then served out to the target laptops, rather than keep running the copy from the source laptop to the target laptops.

On my Mac (yes, cross platform!) I open a prompt and switch to my external drive, where I can spare 500 gig.

Note that yes, you can make the image smaller. Dd will create an exact image of the hard disk being sourced; you can compress it and decompress on the receiving end. And if the drive is mostly empty this will be significantly faster. However, this is also another place where you can make a mistake. Get the procedure working the simpler way first. Then worry about complicating your life a little more.

That's my advice, anyway. If you're a "Screw it! We'll do it LIVE!" risk taker, feel free to use a compression variant of the commands.

The Mac is the RECEIVER in this case. It's getting data FROM the source (reference) laptop. So I can open a connection to LISTEN for data from it.

nc -l 19000 | dd of=./DriveImageName.img

This runs Netcat, tells it to listen to port 19000, then pipes that data to dd, with the output file of workingdirectory/imagename. On the Mac it popped up a warning asking if I wanted to allow listening to the network because it otherwise needs a firewall rule. Say yes, and it allowed it. Also, I was on the drive I wanted to save to. That's why there's a period for the current directory; otherwise give a full path name.

Do you know your computer's IP address? If not, grab it from another terminal session or your network configuration utility. You'll need it for the...

Sixth step which is sending your image! On the source computer, search through dmesg to find the device pointing to the hard drive. Usually it'll be something like /dev/sda or /dev/sdb, but this is entirely dependent on your configuration and the Linux detection mechanism. Search the messages for the device matching your hard disk size.

Write it down. Verify that it's correct using fdisk to print the partition table. If you want to be extra careful, mount the partition with Windows and verify it's the correct stuff. Be. Careful.
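A sketch of that hunt, with placeholder device names (yours may well differ):

# look for the disk in the boot messages, then confirm with the partition table
dmesg | grep -i -e 'sd[a-z]' -e 'logical blocks'
sudo fdisk -l /dev/sda        # check that the size and partition layout match the Windows disk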

And if you mounted the drive to check it, UNMOUNT IT. For the copy to properly work you want it to be completely unmounted. Write down the drive device. And triple check your command before hitting enter.

What command?

dd if=/dev/device | nc TargetIP 19000

Of course you replace the device with sda or sdb or whatever you found was the device, and the TargetIP is where you're sending the image (in my case the Mac.)

The command will look like it's doing nothing. I went back to my Mac and in another console told it to do a directory listing and lo and behold, DriveImageName.img was created and rapidly growing.

This can take a couple of hours. When completed, the dd command returns to the prompt and your DriveImageName.img file will be approximately the size of your source system's hard drive.

The Seventh step is to take the image and overlay it into a target laptop. This is destructive. That image file is the whole hard drive. That means boot sector, multiple partitions (including the recovery on the original laptop), the whole shebang. I didn't copy a particular partition...this was the lock, the stock, and the barrel. Just so you know.

Shut down the LiveCD Linux on the reference (source) laptop. Close it up. Set it aside. Stick the USB drive into the sacrificial target laptop. Configure the BIOS to boot from USB drive as you did with the source laptop (or use the one-time boot). Go to the command prompt once Linux is up and running on the network. 

Now it's time to reverse the flow of the data stream. On the TARGET laptop, scour the dmesg logs for the name of the hard drive. THIS COULD BE DIFFERENT FROM THE SOURCE LAPTOP.

I know, it shouldn't be. But it was for me. Don't assume the detection will be the same. Find that proper device name.

All set? Took notes? Good.

Then we'll move to the eighth step, which is telling the laptop to accept the image. From a command prompt:

nc -l 19000 | dd of=/dev/device

Look familiar? It resembles the command I used on the Mac for listening for the incoming data stream. Mainly because that's what it's doing. Only this time the output file (of) is the device name pointing to the laptop's hard drive.

Now we need the drive data. Do you have the laptop's IP address? If not, grab it from another terminal session.

Step nine is on the Mac. Let's send the file.

dd if=./DriveImageName.img | nc LaptopIP 19000

Double check your information before hitting enter, especially on the laptop, and let it rip. The Mac will read the image file and stream it to Netcat, sending it to the IP address on port 19000. Again nothing will seem to happen until, several hours later, the command prompt abruptly returns control after telling you how many records were sent out.

The tenth step is to tell the laptop to reboot. When it starts the boot cycle (or it powers down, if you told it to shut down) you yank the drive, boot it to setup, tell it to boot from the hard drive now, and do a boot into Windows. It should run the "out of box experience" setup and once you enter a name and user and blah blah...WINDOWS.

Then repeat steps seven through ten for each of the other target systems.

How Do I Know If It Is Doing Anything?

The process is relatively quiet. That is annoying, to say the least...how do you know you're not wasting a few hours of becoming antsy?

Method 1: tcpdump. I opened another terminal and ran tcpdump, where I saw a helluva lot of packets heading to the target machine's IP. That at least tells me it's working.

Method 2: Activity Monitor. On the Mac, I can run Activity Monitor, which gives some surprisingly useful information. I open the "network" tab and look for nc, and Activity Monitor will tell me how much data has transferred.

Note, though, that if you're compressing data, you won't know when it's done transferring the image. You don't get guaranteed compression ratios and if the drive was mostly empty, you've got a lot of nothing that compresses into negative quantum data or something like that. I saw it on Star Trek.

If you aren't compressing the transfer, you still get an idea of the remaining time, though not an exact one, since block sizes and packet sizes affect the reading. But approximate is better than nothing, right?

Method 3: pv. Pipe Viewer was one of my favorite methods of checking the progress of piped transfers back in the day. The setup I used to model and check my instructions here didn't have pv installed by default on either the live-boot Linux or the Mac, and I was too lazy to install it for my one-off purposes. If you're going to do this frequently (which I did back in the days of getting a #$%^ lab to work) or stream to multiple systems, this thing is fantastic. Just make sure you insert it into the pipeline at the right point to get an accurate read on data flow.
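For the curious, here's roughly where pv would slot in on the sending side; the device name is a placeholder, and the size hint is optional but it's what gives you a percentage and an ETA:

# pv sits between dd and netcat and reports throughput, total transferred, and ETA
dd if=/dev/sda | pv -s 500G | nc <target machine ip> 19000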

Method 4: Sending the right signal to dd should make it throw out a status. The command

kill -USR1 dd_pid

...where "dd_pid" is dd's process ID should throw the status to standard error. The link above combines it with the watch command so you can periodically throw out the status update to standard error every X seconds.

Note none of these will tell you if the data is successfully being sent or read or written properly. Dd is pretty stupid when it comes to errors. I haven't tested if you can dump a file to a machine that isn't actually reading them, so all the bits are banging away like a legion of Goa'uld hitting the iris on the stargate. That would be kind of awesome if the bits made thumpy noises like that as they hit the firewall, though.

Let's Make It Faster: The Compressioning

The initial copy of the configured laptop to my Mac's FireWire drive took about 2.5 hours (max speed on the interface from my Mac is 800 Mb/sec). The copy from the Mac FireWire drive to the laptop to be overwritten took roughly 5.6 hours. 

Ouch?

Why? It could have been a difference in block size. A FireWire bottleneck. Cache. Something not quite right in the network card driver detected at boot. I could experiment more and find another way to tune this, but let's first try a simpler optimization.

We'll play with compression.

I have an uncompressed image on the FireWire drive. I don't particularly feel like potentially corrupting or otherwise screwing up an image that I now know is working. Therefore, I'm not going to compress the image on the drive. I'll compress it in-flight.

Now, the first few times I tried this, the clone failed; the process hung, and after Activity Monitor said around 8GB had transferred, data would simply stop moving. No indication why. I thought maybe something was "off" in the network configuration, so I tuned a few settings (OS X seems to have some awful defaults). That didn't help, and it consistently seemed to fail around the 8GB mark. I finally made some progress by changing a few things around in the copy. I'm not quite sure what happened, but it was strange that the non-compressing copies worked while the bzip2-compressing method kept dying on me.

First, I boot another laptop, set it to boot from the USB Linux distro, then open a terminal and tell it to listen for a network connection after grabbing the IP with ifconfig.

nc -l 19000 | bzip2 -d | dd bs=16M of=/dev/drivedevice

It's pretty close to what I used before for getting the image, only this time instead of piping from Netcat to the dd command, it first pipes the output to bzip2 with the "decompress" switch. The stream of gibbledibble goes from Netcat to the decompressor to dd.

I'm also adding a block-size of 16 meg. This seemed to be a key in getting the copy to work when streaming from bzip2.

And remember, this is the target laptop, so dd uses the "of" switch.

Second, it's time to send the file from my Mac. In a terminal, I use:

dd bs=16m if=./DriveImageName.img | bzip2 -c -z | nc TargetDeviceIP 19000

Hit enter, and we're off to the races.

This command, again, is similar to my previous commands, but this time I first use dd to read in the image file, pipe that as a stream to bzip2 which compresses the data stream, then dump the results to the target laptop. Also notice I used a lowercase "m" on the Mac, and a capital "M" on the Linux system. Oh, the joy of minor differences in utilities. Linux is Linux, OS X is BSD-ish. Muss up the m vs. M and you'll get an error message.

Well That Was Bad...

The copy took approximately 15 hours.

Seriously? How does the addition of compression triple the transfer time?

My first guess is that it wasn't the compression, but rather something odd in the disk caching or data transfer for the FireWire drive and the block size was throwing it out of whack. Blackmagicdesign's free Disk Speed Test  told me the drive was giving a 67 MB/s write and 41 MB/s read time (which in itself is a little strange...I would have expected faster read than write speeds. Perhaps that's write caching magic.)

Let's retry the test, but this time on the target laptop I'll use:

nc -l 19000 | bzip2 -d | dd bs=1M of=/dev/drivedevice

...and on the Mac side, I'll use:

dd bs=1m if=./DriveImageName.img | bzip2 -c -z | nc TargetDeviceIP 19000

My working theory here is that something in the point where the data is pulled from the FireWire drive (OS X's I/O scheduler? The physical construction of the drive and its cache?) is mismatched with the way bzip2 is reading in chunks to compress, and that mismatch is enough to throw the pipeline out of whack. Did the change to the smaller block size have an effect?


My iostat numbers (sudo iostat -w 1 disk_device) seemed fairly consistent, pushing 3 MB/s with the 1M block size, although there were periods where it shot up to 6 MB/s. Did that translate into any time saved versus the 16M block size?

This took 15 hours.

What the $%^, Lana?!

Okay, let's see what we have so far. If I dd from the FireWire drive through netcat, across the network, into the target's dd and onto the target drive, it takes roughly 5 hours. If I insert compression, the time triples. And the compressed version seems to need an explicit block size or else it hangs.

I still suspect something is problematic with the I/O scheduler in OS X, but I described the problem to some other intelligent people who suggested it may be related to bzip2 not being multithreaded. It was speculative, but while checking top during the transfer, we could see that on my multi-core, hyperthreaded Intel-based Mac the bzip2 process was consistently holding near the 100% mark, dipping into the 60% realm once in a while. When a process holds at 100%, it can indicate that the process has a single processor core gripped around the throat and isn't multithreaded, or at least not in a way that lets the scheduler dole the workload out over multiple cores (otherwise you would see a steady over-100% utilization).
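If a single-threaded compressor really is the choke point, a parallel drop-in such as pigz (a gzip work-alike) or pbzip2 would be one way around it; neither ships by default and I didn't try this here, so treat it as a sketch:

# same pipeline as before, but pigz spreads the compression across cores
dd bs=1m if=./DriveImageName.img | pigz -c | nc TargetDeviceIP 19000
# and on the receiving laptop, decompress the gzip-compatible stream:
nc -l 19000 | pigz -d | dd bs=1M of=/dev/drivedevice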

Fortunately there is a (relatively) simple way to test if the compressing stage is the issue.

The Pre-Compressioning

Up to now the Mac is pulling the image file data from the FireWire drive over the bus, compressing it into memory, then pouring it into Netcat and then into the network stack.

If we pre-compress the image on the drive, the Mac will pull the (compressed) image file data from the FireWire drive over the I/O bus and pour it directly into Netcat and the network. If the slowness lay in bzip2, or in bzip2's single-threaded core-hogging, this bypasses that issue altogether. It also means there's less to read from the drive, since the image is smaller, plus whatever advantage comes from the Mac not tying up a processor crunching the data down.

First, we compress the file.

time bzip2 -c DiskImageName.img > DiskImageName.img.bz2

Time isn't really needed; I am using it because I wondered how long it would take to compress the image on the FireWire drive. In case you're not familiar and didn't divine the use of the time command, it tells you how much time the subsequent command took to complete.

The important part is the bzip2 command. The -c is telling it to stream to standard output; the greater-than is redirecting that output to a new file with a .bz2 suffix. If I ran bzip2 alone it would compress the file, but in the process destroy the original and replace it with the compressed version. I'd like to keep both since I know that the original file is not corrupted.



I'll note that this time the iostat output was pretty consistent, bouncing between 7 and 10 MB/s as the bzip2 process read from and wrote to the FireWire drive, slightly more than twice the I/O of the earlier dd read. I also noted that the KB/t was one meg, just as it was with the dd process that had a one-meg block size specified. I had assumed it was my specifying a one-meg block size that caused that KB/t figure; this hints instead that I/O transfers over the FireWire interface top out at one-meg chunks.

Later on in the compression...as in several hours later...iostat was showing noticeably different numbers.


I suspect the image file was in a portion of the drive that didn't really have data written to the sectors in the image. Just an interesting note and bit of speculation.

When it was finished, time reported:
real 796m39.899s
user 649m0.946s
sys 11m8.899s

I'm getting quite an education on expectations here.

Second I set up the Linux machine to receive the file. 

nc -l 19000 | bzip2 -d | dd bs=1M of=/dev/device

I kept the same block size this time around to maintain the conditions of the previous 15-hour attempt.

Third I tell the Mac to read the file and dump it into the network funnel.

cat ./DiskImageName.img.bz2 | nc TargetDeviceIP 19000

The file is...well, it's a file, so I can cat it directly to standard output. Dd is useful because it can read raw devices. This command will cat the compressed file (remember to use the .bz2 version...) directly to Netcat and over the network. 

The target system in step 2 above will decompress the stream of data and write it directly to the device.

Did It Improve This Time?


The image file shrank considerably, which makes sense, since the majority of the imaged disk is "blank" sectors and thus highly compressible. The image file on the FireWire drive went from 466GB to 108GB.

So the file to be read was reduced to nearly a quarter of the original size, but despite this, the transfer took approximately 12 hours. An improvement over the 15 hours, but still much more than the original 5.

The compressed file is smaller, so there's less to read and stream. It's already compressed, so there's no overhead on the sending machine due to running bzip2.

This could mean that the Netcat communication is acknowledging when it can take more data, and not buffering much of what is sent (or has a small/limited buffer space allocated.) Then it's possible the target computer is a bottleneck as it decompresses and writes the data.

Maybe The Compressor Sucks?

Let's try a quick experiment with a different compressor/decompressor. I compressed the image with gzip (gzip -c DiskImageName.img > DiskImageName.img.gz) to prep the transfer rather than perform an on-the-fly compress. Notice that, like bzip2, I used -c to output to standard output and redirect it to another file in order to preserve the original image file.

Right off the bat I see the gzip process is hovering between 50% and 80%, whereas bzip2 took up roughly 100% of a core. Like bzip2, gzip is not multithreaded, but it apparently isn't pegging the core either.

Gzip, without specifying a greater degree of compression, shrunk the 466GB file down to 106GB (the bzip2 version is 108GB, if you're keeping score.)

First the target is set to listen for the compressed data stream.


nc -l 19000 | zcat | dd of=/dev/<DriveDevice>

Note that zcat is a form of gunzip/gzip that will decompress from standard input and output to standard output.

Second I send the file from my Mac.

cat ./DriveImageName.img.gz | nc <TargetIP> 19000

After a few hours I checked on the process. Strange...the channel was still open, so the cat process was still dumping data, but it seemed to be quiet, according to iostat.



There was one little burst of data in the iostat output, but otherwise not much. What was the laptop doing?


Hmm...on the laptop, iostat's KB_wrtn column goes through periods of 0, then a burst, then back to zero. I'm guessing that zcat works on a chunk of data at a time, and as each chunk (or cache) is finished it dumps that portion out to dd, which then writes it. There is a lot of reading from the drive, though...I don't know why that is happening.

The system monitor on Ubuntu lists gzip using around 700KiB resident and 9MiB virtual memory. Dd is listed with 800KiB and 15MiB virtual in use. So if there's a large portion of memory in use, it's not coming from there.

The command (on Ubuntu) free -hm says that I have 133M free, and +/- the buffers/cache says 6.9G free. So somewhere there is a lot of caching going on...I'm guessing it's related to the data being fed to zcat.

So how long did it take?

Dd on the Ubuntu system said 17932 seconds. That's 299 minutes, or 4.9 hours. Slightly faster than the plain copy.

What Are the Takeaways From This?

There are a few conclusions I've reached from this adventure.

  1. The network isn't the limiting factor in the copy. This should be somewhat obvious. The transfer is the transfer is the transfer. Yes, this imaging would be faster if I did it over a direct connection to the target machine from a portable USB drive, but that defeats the purpose of "dd over netcat." Also, the transfer from a USB source to an internal drive target increases the chances of screwing up the transfer if you haven't done this before.
  2. You'd think compression would make the process go faster. Apparently not always. 
  3. The choice of compressor really makes a difference.
  4. These numbers are seriously harshing my calm. I can't help but think there is something "off" in them; I think I'll have to test some more in a separate writeup.

I Tried the Copy, Why Is My Attempt Failing?

I don't know. But some common guesses:
  • You don't have access. I ran most of these through elevated privileges (sudo is your friend) so I can access /dev files without worrying about that. Which also makes this process a little more dangerous.
  • You typo'd something, accessing the wrong device or using the wrong port. I told you to triple check your commands before hitting Enter. I warned you. Hopefully you didn't break something or reverse the target/source configurations. That last one will be a huge problem.
  • Corruption. A copy corrupted in the process of dd, the target file is saved on a drive with a problem sector, the datastream was interrupted, the sun farted a flare at just the wrong moment, who knows?
  • Drive mismatch. Different manufacturer, sizes, some other random gorbblot that shouldn't fail in fact is causing a fail.
  • You are writing a file to a specific partition (which won't work in this case) or have a severe misunderstanding of device files versus plain files. That's a favorite I still run into. It leads to people writing a disk image file to something like a CD and burning it, so instead of overlaying a filesystem they get a CD with MyDisc.ISO burned onto it as one big file, and then they ask, "Why doesn't this work?"
That's just what occurred to me off the top of my head. There are probably other reasons. Most of my readers are imaginary to begin with, so lay off.

Variations

There are a few variations that I can think of. You can experiment with other compression utilities. Or you can use this method to directly write from one system to another. On the source you dd with the input from the source drive, pipe it to netcat, which sends it to netcat on the target system, piping that into dd with the output file being the target drive. Direct one-on-one imaging with no in-between storage or a giant file.
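A sketch of that direct, no-intermediate-file variation, with placeholder device names and IP; get these backwards and you destroy the wrong disk, so triple-check:

# on the machine whose disk will be OVERWRITTEN (the clone target):
nc -l 19000 | dd bs=1M of=/dev/sdX
# on the machine being cloned (the source):
dd bs=1M if=/dev/sdY | nc <target machine ip> 19000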

If you really want to experiment, there are some great things that happen when you throw different variables into the mix. From hard disk firmware abstracting the physical drive specs from what is exposed to the software to the size of your packets shuffling around the network, you can discover how to finely tune your transfers so they are optimized for use in your specific network and hardware scenario at a cost of several days of your life. The best way to get started on this is using different block sizes in dd, so it reads in bigger or smaller chunks than the default. You can sometimes get faster speeds in your transfers doing this. Or you might fragment the hell out of your packets and slow things down. Me, I needed this to actually work, so I didn't keep playing with it like it was a lab. But my imaginary readers out there are free to experiment with this until they can get a perfect throughput scenario.

Secure Dat Data

One of the sites I found while Googling for a reminder of how to properly use dd over netcat turned out to be testing ssh as the tunnel for the stream to the remote computer. Maybe it's because the author likes benchmark porn. I'm not sure. But it's true that ssh encrypts the transfer, so if someone is intercepting your traffic they can't get the contents of the copied workstation.
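The ssh flavor drops netcat entirely and pipes dd through the encrypted channel; a minimal sketch, with the user and device names as placeholders (and remember the lowercase-m/uppercase-M block size quirk between OS X and Linux mentioned earlier):

# run from the source machine; ssh carries the stream straight into dd on the target
dd bs=1m if=./DriveImageName.img | ssh user@target 'dd bs=1M of=/dev/sdY'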

Personally, I think that if someone is sniffing that level of your traffic on the internal network, you might have a bigger problem. 

And if you're cloning a drive over the Internet, you're probably strange or doing something desperate in the first place. 

"But the target machine is in a remote office!"

Yeah, I classify that as desperate. Why aren't you using a dedicated VPN? Ugh.

Anyway, if you really really wanted to do that, keep in mind ssh does encrypt things and encryption adds overhead. You thought the dd was slow before? Add a few percent of CPU power to the encryption part. It adds up. But at least no one is eavesdropping.

But really. Get a VPN configured.


Monday, April 14, 2014

Cloning Windows Sucks

The Opening Rant

Should cloning a Windows workstation be easy by now?

I would think so.

Let me think back to a recent example of cloning a competing platform and see how that went. Here's one that is a little fuzzy; the details are just hazy enough that I'm pretty sure this is what I did, but the specifics are not there anymore...

I had to clone several Mac Minis back when we moved to a new office and the person charged with our audio-visual system wanted a set of Macs running our conference/gathering area televisions.

I built one system to the desired configuration. I then needed to copy it out to several other Mac Minis with the same hardware spec. Rather than screw around with other utilities, I booted the configured Mac to target disk mode, mounted it, and used dd to create a sector-perfect copy of the configured drive to the drive I wanted reconfigured. When dd is used on the entire drive, you create an image of the whole device, including unused space, to a file or target device.

Slow, reliable, and potentially completely destructive. Works like a charm. The drives were the same, the OS didn't care that it was on a new machine and wasn't keyed to a particular device, it booted and was happy (as long as I changed the name of the system.)
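For the curious, that whole-device copy boils down to something like the following; the disk identifiers are placeholders you'd confirm first, and everything on the destination disk is destroyed:

# figure out which disk is which; don't guess
diskutil list
# make sure neither disk is mounted, then copy sector for sector
diskutil unmountDisk /dev/disk2
diskutil unmountDisk /dev/disk3
sudo dd if=/dev/disk2 of=/dev/disk3 bs=1m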

Recently I had to complete a re-image of these same Macs when management decided they wanted to boost speed through the use of SSD drives and a memory upgrade. This time around I ended up removing the hard drives, attaching them to a USB adapter and running SuperDuper! from Shirt Pocket software. Mainly because I needed a file-level copy, since the spin-drives in the Macs were far larger than the SSD drives replacing them; dd would create images that just wouldn't be happy with that. And SuperDuper! worked just fine.

Easy peasy.

In the past, I know I've used Linux to perform similar copies. That's how I started playing with dd in the first place. Linux, like Darwin (OS X's UNIX-like core under the shiny graphical hood), now has plenty of built-in support for hardware detected at boot-time. Static files are copied at image time, device files and network configuration are generated at boot time. As long as you don't do something stupid with hard coding IP's or network names (yeah, you really need to change that host name at first boot) then you should be fine.

Easy peasy.

Then Along Came Windows

I was reminded of my cloning Windows adventures because I had to clone three laptops from an "okayed" configuration on a master laptop. All same hardware. Should be simple.

But it never is.

Back in my previous life I remembered trying a few different methods for copying a Windows install. Windows itself didn't have great built-in support for imaging. You couldn't even do a straightforward copy.

Well, you could. Chances are you'd have a problem, though.

Sometimes it was simple beginner error. You had to remember silly things like unjoining the machine from a domain first. Skipping that was a great way to have bad things(tm) happen: machines that looked like they would work but started acting all hinky, then made you feel stupid because you really should have seen it coming.

Or Windows activation would go wonky. Because Microsoft is really, really protective of your (read: their) rights as owners of the software (you do realize you don't own Windows, right? You license the right to use it. There's a big difference) they like to put stuff in for Windows to properly identify hardware and the installed version of Windows and such. It doesn't necessarily like having the hardware yanked from under it and being transplanted. Sometimes it works, sometimes it mostly works, sometimes it never quite works correctly.

That's when you learn about the next utility every Windows admin gets familiar with: Sysprep. It basically strips the Windows installation down to an "out of box experience" for the next machine, removing unique IDs and special driver configurations from Windows so you can image it to another machine. It also resets the activation count for Windows licensing. There are whole articles published on how to properly create a Sysprep answer file for proper license activation and configuration if you want to try an unattended imaging.

Oh, yeah. An answer file. Not required, but available if you want an extra dose of "dammit" in your workflow.

In my case, these laptops run software that was already acting a little...squirrelly on me to begin with. One of the things this software does is lock a computer's configuration while also altering it to be more or less a kiosk, locking out the ability to write to certain folders and files while presenting a simple view of the desktop with a limited selection of icons to play with.

In other words...I needed to make this image work as easily as possible without spending weeks on troubleshooting if something interfered with Sysprep's file alterations. No answer file for me, thank you. Only need to do this on four laptops.

So configured laptop? Done. Sysprepped? Done. How to clone?

We run a system for deploying images called KACE. Naturally, let's try that first.

I boot from PXE. I tell it to create an image, which KACE uploaded to the server. Then I deploy the image of that partition to the test laptop.

It boots...it's setting up files...it fails. Reboot cycle city. Dammit.

I try again. This time I image all three partitions to the KACE...there's a system partition, the visible partition, and the restore partition. All in one big image. KACE reports it's done with the image. Deploy to the test laptop...setting up files...farther this time!...fails. Permanent reboot cycle with an error message about setup not being able to run.

What caused it? Some hints online say a network printer driver can screw it up. Certain software, none of which seemed to match what I had installed, can also break Sysprep. More to the point, the initially configured laptop would boot from its Sysprep-induced coma and configure itself just fine (was it resetting my Windows activations? Was it going to suddenly fail because I activated it too many times? Ugh...it's making me PARANOID to keep experimenting...), which hinted to me that it wasn't necessarily Sysprep failing but rather something in the imaging process that wasn't capturing things properly.

After wasting a chunk of time on this, I decided to go old school on it.

I downloaded an application that automated the process of grabbing a Linux distro and converting it into a bootable USB drive image. I booted the original, Sysprepped, configured system from that USB drive into a "live Ubuntu" Linux. I then used dd over netcat to create a 500 gig image file on my system from the Windows computer.

The classic dd-over-Netcat examples floating around are a bit outdated, but a quick Google will turn up plenty of writeups with updated caveats.

See, dd is platform agnostic since it's copying the drive sector-for-sector. It doesn't care what's on the drive. It just copies it. Warts (bad sectors, if there are any, and they're readable) and all.

Now I have a really big image file on my spare hard disk. I boot the target laptop with the USB drive and dd my image to the laptop. Although, thanks to what is probably a combination of a poor block size choice and reading a really large file from an external drive into a network socket, the transfer to the target laptop is slow as hell.

Mental note: if this works, I'm going to use gzip to compress the image.

Finally the process completes. I tell Ubuntu to reboot, remove the USB drive, and Windows setup starts running. It churns through some BS setup crap, I give it a new name, new password and user I'll need to delete later, and TA-DA! I'm at the windows desktop!

I don't know why the KACE had trouble imaging it. I'm sure with more persistence I could find the right variation of the image instructions that will make it like the target system. Or I could buy a disk duplicator and rip apart the laptops for a speedy hardware copy.

But somehow I seem to keep falling back on some variation of booting Linux to get the job done. I end up with dd over Netcat, or Partimage, or dd straight between two drives (DON'T MIX THEM UP!) to get a clone to work. There are utilities that are supposed to simplify things, but it seems like something often goes wrong. It's quite annoying.

Someday I'll figure out how to get KACE to reliably image and deploy. Until then, I use dd. Old school. But old school works.

Monday, April 7, 2014

Expectations and Systems Administrators Asking Questions

Opening warning...this is a bit of a ramble. But like the blog tagline states, every draft is a first draft. I also work for Stack Exchange, but this is not speaking for my employer. These opinions come from my own experiences in a particular community...namely the systems administration community.

ServerFault is a question and answer website operating in the Stack Exchange network. Over the years it has gained traction as a resource for people looking for answers to various network and server-related issues, and it is a wonderful reference source. There are a large number of very talented and smart people donating their time and expertise to help people, and I am proud to have been an active member of the ServerFault community.

I'm not quite as active anymore, but I still lurk and poke my head in on occasion to see what’s going on. It is a trip down memory lane to see a common core of active members still trading memes, jokes that push boundaries a little too far on occasion, and what have become the typical responses given to repetitive situations encountered by the regulars on the site proper.

The visits are nostalgic, but they also remind me of some of the problems within the tech culture. I don’t know if it’s an attitude specific to ServerFault, but judging from discussions with others I’ve met in the tech world, I suspect it's part of the profession.

It’s a problem of expectations and how expectations are managed.

It’s not uncommon for complaints to periodically emerge in the chat regarding new user behaviors in seeking answers. It’s common enough that some people even write blog posts about the “right way” to ask for help as system administrators (in the context of the profession, usually, with ServerFault being the top of the chain of where to finish your quest). I’ll refrain from posting an example link here because, frankly, it does no good to single someone out for having an opinion when the attitude is quite prevalent among those who tend to be outspoken…if you’re interested, lurk in the chat areas and within a few weeks you’ll probably encounter a rant about someone’s question. If you’re an active participant on the site, you’ll probably see comments emerge that betray the undercurrent of disdain for someone’s mistake in asking a particular question. Sometimes people simply snap and there’s no attempt to disguise the contempt.

This isn’t a criticism of people being corrected on posting questions that don’t belong on that particular site. Shopping questions? No. How do I format a disk? No. How do I install a new font on Windows? That question goes over to Super User.

Nor is it a criticism of managing the site to prevent discussions. Stack Exchange wants to become a canonical resource for answers to questions, not a forum of debate for best practices or personal opinion. That means that sometimes questions will have to be closed. Feelings will get hurt. New users will be angry, especially if they’re unfamiliar with how the site works. We accept that. And we invite people to lurk and learn and participate when they understand how to be good neighbors in our various communities. No hard feelings as far as the company is concerned...indeed, the company tries very hard to provide resources on how to ask questions within the context of the Stack Exchange network.

But there is a set of expectations from ServerFault systems administrators that can foster a hostile and off-putting environment. People with more than greenhorn reputation can catch a lashing from more experienced community members. Somewhere along the career path for system administrators, it seems, a set of rules about the "proper" way to do things solidifies in the mind. You follow those rules, or you're stupid. Or misguided. But you're definitely in the wrong if you don't do something the right way. On ServerFault one key behavior the established users have come to see as the "right way" is to search for answers on your own, in a sort of hierarchy, before seeking help from others.

Before you dare post a question to ServerFault, you must make sure that:
You read the man pages
You googled the answer
You read the documentation
You read related RFC's
You tried variations on what you thought would work
You consulted other people's documentation
You tried some random shit to see what would happen
You called tech support
You consulted an astrologer
You slept on the question
Try a fucking fortune cookie. Never know, it might actually work.

New users are expected to know this before participating. But I don’t think it always works that way for people. 

I can certainly understand the feeling. We have a resource that people should turn to only after they try to help themselves. If you don’t help yourself first, you’re not worthy of our collective wisdom. This is our creed. Why can’t people understand that helping themselves is more fulfilling? Why can’t they understand that if they understand the problem, and understand how to think it through, they won’t need to burden us with questions they can answer themselves? And most of all: I’m only being harsh with you so you’ll understand, and thank me later, for making you a better learner.

But sending people off on a quest to find an answer, especially when they’re in the middle of trying to solve a problem, is not how people actually respond to an issue. And it’s detrimental to the community to tell someone they should be ashamed of having asked a question of a group of people without working through the prerequisite checklist first.

In the past I often had to stifle the knee-jerk reaction to insult someone when they asked a question whose answer was already written up in a wiki page or handbook in the organization. “Why are you asking me that?” I thought. “It’s right in the company handbook. A quick search would have given you that answer.” “How did <insert simple fix I’ve repeated to coworkers at least three times in a short span of time> not occur to you to try before asking me this?”

Eventually I realized it’s probably because I’m an available resource. 

People don’t like doing the work of translating documentation into their internal model of how things work. They like it even less when they’re in the middle of a problem and have to figure out not only how the documentation pertains to their problem, but how to interpret and translate that understanding into a fix for their specific, current situation. It seems natural, and indeed intelligent, to go to someone who has had this problem before and ask how they dealt with it. That solves the problem much faster than a lesson on the proper progression of how to think and puzzle through it.

Also, if you think about how people kind of “work” psychologically, and you’ve read my diatribe this far, you probably realize you’re a rare person, willing to read this much before deciding, “Well, I’m bored, time to check Facebook.” Words, words, words…get to the point!

But that’s exactly what’s being asked of these people before they can just ask someone…here’s my problem, what do I do?

It’s not my job to school people on how to be self-sufficient in finding answers. I can facilitate it, but I can’t force it, and it’s certainly not my place to make people feel bad for thinking of asking me for assistance. If the information being asked for is already available, I reply with a link to the resource along with a note asking them to let me know if that helped. They might make a note of it for future reference. They might forget about it almost immediately. Doesn’t matter. I passive-aggressively provided a hint to help them help themselves in the future while still (hopefully) providing an answer tailored to their question.

What is being asked for is actually a mentoring situation. Expecting people to follow a procedure, despite the fact that they don’t work at your company and you don’t know how their life path has gotten them where they are now, means you expect them to follow a professional code of what to do before sighing, throwing up the white flag, and approaching you, the guru volunteer. The questioner is a burden. The guru is volunteering time to answer what the questioner couldn’t divine for themselves. You must follow the hierarchy.

Which is rather ironic, given that most of the systems administrators I talk to say they never had formal training. They were thrown into a situation where they may have learned how things are done from others, if they were fortunate enough to have a team of more experienced people; otherwise they were expected to figure it out on their own.

Have you paid attention to what happens when you leave someone to figure out how to do something on their own? Especially something for which there may be more than one way to handle a problem or a workflow?

I’ll tell you what you get. You get emails where someone took a screenshot of a PDF document and pasted the image into a Word document to email as an attachment to me. Yes. This happened.

Or you get people who merge a set of PDF documents by printing them and re-scanning the printed sheets from multiple PDFs into one scanned-in document. Yes. This has happened.

So you end up with systems administrators who never get formal training yet are ridiculed if they don’t know the expectations laid before them. How did they deal with problems? Probably with Google. Google drops you into pages offering solutions to problems that are maybe similar to the one you’re having. But not quite. Maybe that brought them to a question and answer site like ServerFault. And…oh, you can ask questions. Here’s what’s happening. What do I do?

It’s no longer a world where people who aren’t “trained” will go directly to man pages. (Did you know man pages are published on the Internet now, too?) The manuals for new devices are often severely lacking, in part because manufacturers have come to rely on forums and knowledge base articles on their sites for people to refer to when they have questions. I now get new hard drives with nothing in the box but a warranty packet.

And yet older, experienced systems administrators still live in a bubble where they expect people to have gone through the same hazing…er, self-training?…they went through. Google is the new manual. And communities of users are the best resources for finding help with a specific problem.

I think it may be more productive for people to start viewing themselves as a resource. I in no way mean to belittle mentoring; I’ve complained about the lack of a mentoring attitude in various companies. But ServerFault is not a mentoring website. It is a community of people volunteering their time to help others. It is a place where people with problems come to find solutions from like-minded people in the same field.

It’s painful to talk to someone and have them tell me they’ve heard of ServerFault but steer clear of it because it’s so unfriendly, or because it makes them feel stupid for not knowing something. Yes, that has happened, despite my knowing how many people have received help from the ServerFault site. Too often people seem to need a thick skin just to get the help they’re seeking.

As a contributor to the website, before you give in to the urge to ask someone if they’re huffing powdered plaster, or if they were always so ignorant about “Googling” for a solution before they imposed on your time, try to step back and empathize with new people: newer systems administrators who “fell into” the job role and haven’t spent the time you spent navigating a pre-ServerFault world. Heck, you probably remember when equipment came with half-decent manuals. Or just think of the last time you felt like a mental defective because you dared ask someone something, only to have it dawn on you two seconds after it left your sound hole that you could have looked up the information on your own. The very sin that makes you bristle.

Realize that the methods by which people naturally learn to deal with issues are changing. Googling a problem drops them right into ServerFault, and to the end user, ServerFault is just as valid a resource as a software project’s website…maybe even better, because it’s a community of people they can connect with for help. And it’s not your place to tell them that the methodology they learned is necessarily wrong.

And realize that if you want to see things change to better suit what you think works best, even if you find a group of like-minded individuals to validate your beliefs, you need to work on reducing the number of times I interview people who tell me they had no formal training in systems administration. They had no mentors. They had no guidance, and were self-taught. Because the path of self-teaching often means they’re not going to know, let alone care, about your list of prerequisites for being worthy of your help.

And most of all, realize that when it comes to sharing your knowledge and experience, it should be a rare burden and a frequent honor. Put something good back into the world. And try to leave the place a little better than you found it.