Troubleshooting a PXElinux boot loop

This week I was troubleshooting a PXElinux boot loop on a customer's IBM hardware.

The system was added as a bare-metal host in Foreman and configured to run CentOS 7. Everything necessary to provision the hardware was prepared:

  • (U)EFI configuration (boot order: 1. PXE, 2. local RAID)
  • RAID configuration
  • DHCP configuration
  • Server-side PXElinux configuration
  • Software repository access

After the system reboots, the network-based OS installation starts. Once the installation has finished, the system reboots again to boot into the new operating system. Normally the boot process looks like this:

  1. Power on
  2. POST / UEFI magic / hardware tests
  3. UEFI loads hardware drivers (NIC, RAID controller, etc.)
  4. Enterprise lifecycle tool collects hardware information (optional)
  5. Network DHCP discover, offer, request, ack
  6. Download boot managers and other files via TFTP from the DHCP/TFTP server and load them
  7. If PXE booting or obtaining an IP address via DHCP fails, boot from local disk(s)/RAID

In my particular case, step 6 failed after downloading the network boot manager/menu. When the OS installation finishes, the host reports back to Foreman with an HTTP GET (wget http://foreman/unattended/built). Once that request succeeds, Foreman switches the PXElinux configuration for this specific host (identified by its MAC address) to the default PXElinux provision template. That template contains just one instruction, boot from local media: LOCALBOOT 0
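As background on how a per-host configuration gets picked up: PXELINUX looks in pxelinux.cfg/ on the TFTP server for a file named after the client's MAC address (prefixed with the ARP hardware type 01 for Ethernet, lowercase hex, dash-separated), then for the client's IPv4 address in uppercase hex with one digit dropped per attempt, and finally for a file called default. The Python sketch below reproduces that search order. The function name and the sample addresses are made up for illustration, and the optional client UUID lookup that PXELINUX tries first is left out.

```python
import ipaddress


def pxelinux_config_candidates(mac, ipv4):
    """Return the config file names PXELINUX tries, in order (UUID lookup omitted)."""
    # ARP hardware type 01 (Ethernet) + MAC in lowercase hex, dash-separated.
    octets = mac.replace("-", ":").lower().split(":")
    candidates = ["01-" + "-".join(octets)]

    # IPv4 address as 8 uppercase hex digits, then one digit shorter per attempt.
    hex_ip = "{:08X}".format(int(ipaddress.IPv4Address(ipv4)))
    candidates += [hex_ip[:n] for n in range(8, 0, -1)]

    # Last resort: the shared default configuration.
    candidates.append("default")
    return candidates


if __name__ == "__main__":
    # Hypothetical host; prints one candidate file name per line.
    for name in pxelinux_config_candidates("88:99:AA:BB:CC:DD", "192.168.2.91"):
        print("pxelinux.cfg/" + name)
```

Because the MAC-based file is tried first, changing that one file changes what this host does on its next PXE boot without touching any other machine.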

The system is unable to execute this command and falls back to PXE boot again. The files are downloaded via TFTP once more, and the local boot fails again. This repeats until a certain number of retries has been reached, and then the system stops doing anything. It won't try to boot from the local disk(s)/RAID; I think this is because the system considers the PXE boot to have been successful.

The solution was very simple. A post by Pascal Legrand on the syslinux mailing list pointed me in the right direction.

I changed the Foreman provision template to the following (notice the last three lines):

DEFAULT menu
PROMPT 0
MENU TITLE PXE Menu
TIMEOUT 200
TOTALTIMEOUT 6000
ONTIMEOUT local

LABEL local
    MENU LABEL (local)
    MENU DEFAULT

    #LOCALBOOT 0
    COM32 chain.c32
    APPEND hd0

Now chainloading GRUB and booting from the local RAID works! Using chain.c32 also works for chainloading Windows boot loaders and in a VMware vSphere environment, so I've set it as the global default.
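If other templates or already-deployed configuration files may still rely on LOCALBOOT, a quick scan can point them out. The following is a minimal Python sketch, not part of Foreman or Syslinux. The function name is hypothetical, and the parsing is deliberately simple: it only assumes that keywords are case-insensitive and that lines starting with # are comments.

```python
def labels_using_localboot(config_text):
    """Return the LABEL names whose stanza has an uncommented LOCALBOOT line."""
    hits = []
    current_label = None
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out directives
        parts = line.split(None, 1)
        keyword = parts[0].upper()  # compare keywords case-insensitively
        if keyword == "LABEL" and len(parts) > 1:
            current_label = parts[1].strip()
        elif keyword == "LOCALBOOT" and current_label is not None:
            if current_label not in hits:
                hits.append(current_label)
    return hits


if __name__ == "__main__":
    sample = "LABEL local\n    MENU LABEL (local)\n    LOCALBOOT 0\n"
    print(labels_using_localboot(sample))
```

Run against the template above, where LOCALBOOT 0 is commented out, the function returns an empty list; a stanza that still contains an active LOCALBOOT line is reported by its label.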
