Two weeks ago, one of our long-term customers returned a late 2008 Xserve to us, stating that it was not recognizing PCI cards in either slot. I jumped at the chance to take a look at it, since we don’t see many broken Xserves coming back in. Xserves are traditionally easy machines to work on. Many of the components are user-installable and the whole thing can be stripped down in about five minutes.
Thinking that I was going to find either a failed PCI slot or a failed Main Logic Board (more likely, since both slots were supposedly non-functional), I grabbed a PCI card for testing and powered on the Xserve. It booted to a Kernel Panic while loading the kernel (the part of the boot process where the grey Apple logo is on screen). This was not what I was anticipating. Still suspecting an issue with the PCI slots, I removed both PCI cards and rebooted the machine; Kernel Panic (KP) again. OK, time to go back to basic troubleshooting.
First, I attempted to boot to the 10.5 Server Install DVD; it KP’d on that, and also on an external hard drive with a known-good boot volume. Then I swapped the RAM, which yielded no change. I then manually ran the EFI Firmware Update for that Xserve, but it wouldn’t accept it. Traditionally, with desktop Macs and Xserves, if the machine is experiencing Kernel Panics while loading the kernel and both the operating system and the RAM have been ruled out, the issue is with the processor. Luckily, we had an identical Xserve in the shop that I was able to borrow some parts from. I swapped out the processor, but still no change. I was then able to successfully run Apple’s Service Diagnostics in EFI, which told me everything passed. Logically speaking, the issue should be the Main Logic Board at this point, so I ordered one up and let it go for the day.
The next day, Jon, another great SDE tech, installed the replacement logic board and, to his chagrin, was greeted with a lovely Kernel Panic on boot. Ugh. He let it sit, and when I was back in the office the following day I started scouring the service manual for tips. All status lights were displaying their normal state, with the exception of the System Identifier Light, which blinked to let me know that I had the top cover removed. Next step, minimal system! I disconnected everything except for the MLB, processor/heat sink, power supply and distribution board, RAM, fan array and video card. I attempted to boot to my known-good external hard drive and still received a KP in return. For my next trick, I replaced all of the minimal system components with the parts from the identical Xserve that we had, with the exception of the replacement logic board and processor; still nada!
Just to be thorough (read: stubborn), I then proceeded to replace every component aside from the replacement logic board with the parts from the identical Xserve. My thought was to then work backwards, reinstalling the original components one at a time until I found the piece of hardware that was causing the issue. I never got that far. Even with all of the good components in place, the same issue still occurred. At this point it was just about comical, and, having been in situations like this before, I felt it had to be something really simple that I was missing. But what?!
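For anyone who likes to see the logic written down, the swap-and-work-backwards approach can be sketched roughly as follows. This is only an illustration in Python: the function name `isolate_faulty_part` and the `boots_with_suspects` test hook are hypothetical stand-ins for a tech physically swapping parts and pressing the power button, not anything from Apple's service tools.

```python
from typing import Callable, Iterable, Optional, Set


def isolate_faulty_part(
    parts: Iterable[str],
    boots_with_suspects: Callable[[Set[str]], bool],
) -> Optional[str]:
    """Return the first original part that breaks an otherwise known-good build.

    `boots_with_suspects(suspects)` is a hypothetical hook: it reports whether
    the machine boots when only the named parts are the suspect originals and
    every other swappable part comes from the known-good donor machine.
    """
    # Baseline: every swappable part is a donor part.
    if not boots_with_suspects(set()):
        # The fault is not in any swapped part: think configuration,
        # seating, or something that was never swapped at all.
        raise RuntimeError("baseline of known-good parts still fails")

    for part in parts:
        # Reinstall one suspect original at a time and retest.
        if not boots_with_suspects({part}):
            return part
    return None  # every original part tested fine on its own
```

The early `RuntimeError` is the interesting branch here: if the machine still fails with nothing but good parts installed, swapping more hardware will not help, and the problem has to be in how the machine is put together.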
I called in two other techs and talked them through my process. We all stared at the machine for a bit and scratched our heads, but no ideas were generated. Then an even more bizarre issue occurred. The external hard drive that I was using for testing has three partitions: two 10.5 boot volumes and one 10.4 boot volume. During one last attempt at booting the machine, the power button was pressed, but none of us bothered holding down the Option key to get to the EFI boot manager. I turned around and realized the machine had successfully booted to the 10.4 partition and was functioning. This should not be possible; a late 2008 Xserve should not be able to boot into Tiger! At least from here I was able to verify that the firmware was up to date, but now I was even more confused.
It was time to call in the big guns. Feeling a little defeated, I picked up the phone and dialed Apple Enterprise Support, Apple’s tech line for help with servers and enterprise software. I explained my process and issue to the tech, who also seemed stumped. I’ll admit that my first call wasn’t terribly productive. The tech seemed to have trouble following my triage process, and he ended up telling me to reinstall 10.5 Server on the internal hard drive and/or to try the firmware update again. Despite knowing neither should resolve the issue, I did both and called back when they didn’t work. The second time I called, I got a tech who seemed really interested in the case. He ended up putting me on hold while he “asked the room” for advice. The one unanimous answer was that Tiger should not boot on that model Xserve, and they suggested that I order yet another logic board, thinking the one I had received was defective.
OK, one day of waiting for another board. It arrived, and I did the replacement this time. I was not surprised at all when I had yet another Kernel Panic staring back at me on boot. At this point I had the broken Xserve right across from the known-good Xserve that I was using as a parts donor, and after stepping back for a moment, I saw the problem. At first, I didn’t believe it. Even while I was “fixing” the broken Xserve I was grumbling about how stupid it was. When I booted the Xserve and it happily started up from its internal hard drive without a hitch, I was relieved, annoyed and a little embarrassed all at the same time. So, what did I notice?
Well, there are two processor slots, since the Xserve can be configured with one or two processors. The good Xserve properly had its processor in CPU A. The defective Xserve had its processor in CPU B. Of course it was panicking on boot! I suppose the only silver lining is that it is interesting to know that a late 2008 Xserve is able to boot into Tiger if its processor is in the wrong slot, but I can’t say that’s very useful information. After speaking with the customer, it was confirmed that they had a tech there who had upgraded the Xserve himself to two processors, and he accidentally removed the wrong one before shipping the machine back to us. Since it’s incredibly uncommon for a customer to rearrange the processor configuration, it hadn’t dawned on me (or the three other techs looking over my shoulder) that the processor was in the wrong place.
The good news is that the original issue, the two non-working PCI slots, was resolved by replacing the logic board. The machine is once again a happy, functioning Xserve, and I have been re-taught the lesson that if a problem seems that convoluted, there’s probably a simple solution that’s being overlooked.