Troubleshooting guide

Connecting

Choosing the interface

Most of the servers have two separate interfaces, which can be looked up via DNS like this:

  • *.pvt.confirm.ch: Private network interface w/ a private IP address

  • *.pub.confirm.ch: Public network interface w/ a public IP address
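
For example, to see both addresses of a server, a quick DNS lookup is enough (gw2 is just taken from the VPN note further below; any other server name should follow the same pattern):

# Private and public address of the same host
dig +short gw2.pvt.confirm.ch
dig +short gw2.pub.confirm.ch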

Passing the firewall

First of all, we need to get you connected to a machine via SSH.

  • The good news: We’re securing & hardening our servers

  • The bad news: SSH connections are rejected by default on the public interface

However, there are several ways to connect:

Hint

Private interfaces are usually only weakly firewalled, as all private CIDR networks are allowed.
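
So if you're already inside one of the private networks (e.g. via VPN), a plain SSH connection to the private interface usually just works (user and host names below are placeholders):

ssh admin@someserver.pvt.confirm.ch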

Important

If knockd doesn’t open the SSH port, it might be because the TCP SYN packets don’t arrive in the correct sequence. Try knock’s delay feature, knock -d 250 {host} {port…}, instead.
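
A minimal knocking example, assuming you already know the knock sequence for the host (the user, host and ports below are placeholders, not the real sequence):

# Knock with a 250 ms delay between the SYN packets, then connect
knock -d 250 someserver.pub.confirm.ch 7000 8000 9000
ssh admin@someserver.pub.confirm.ch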

Important

In case the site-to-site VPN dies, you most likely want to ssh to the public interface of gw2 to get connectivity to the datacenter servers.
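
A sketch of such a connection, assuming OpenSSH with ProxyJump support and placeholder user/host names:

# Hop over gw2's public interface to reach a datacenter server on its private interface
ssh -J admin@gw2.pub.confirm.ch admin@someserver.pvt.confirm.ch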

Virtual console

Instead of SSH’ing to the server in question, you can also use the virtual system console provided by Proxmox. Just open the Proxmox WebUI, log in, locate your VM in the sidebar on the left and finally click on the Console action in the VM panel.

Monitoring

Activating the maintenance window

You might want to activate a maintenance window by following these steps:

  • Go to Configuration, Maintenance and click on generic maintenance

  • Ignore the first Maintenance tab

  • Go to the Periods tab
    • Click Edit on the One time only period

    • Update the Date and Maintenance period length

    • Click the Update link directly in the square box (don’t use the Update button below yet)

  • Go to the Hosts & Groups tab
    • Select your hosts and groups which should be in maintenance

  • Now click the Update button

Deactivating alerts

To deactivate alerts, you have to deactivate the specific media types:

  • Go to Administration, Media types

  • Click on the Enabled link next to the media you want to deactivate
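
If you prefer the shell over the GUI, the same can be done via the Zabbix API. This is only a sketch: URL, API token and mediatypeid are placeholders, status 1 means disabled, and newer Zabbix versions may want the token in an Authorization header instead of the auth field:

curl -s -X POST https://zabbix.example/api_jsonrpc.php \
  -H 'Content-Type: application/json-rpc' \
  -d '{"jsonrpc": "2.0", "method": "mediatype.update",
       "params": {"mediatypeid": "1", "status": "1"},
       "auth": "API-TOKEN", "id": 1}'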

Deactivating lightalert

While maintaining the servers, it’s recommended to deactivate the Lightalert service. Even if maintenance mode is set in Zabbix, the Raspberry Pi will still trigger error messages and play an error sound.

Deactivating and reactivating the light can be done via an SSH connection. To connect to the Lightalert, use:

ssh pi@lightalert.confirm.ch

Switch to the root user:

sudo su

Then deactivate the Lightalert service:

systemctl stop lightalert

After you’ve finished the maintenance, activate the Lightalert service again via:

systemctl start lightalert
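
To double-check the state afterwards (assuming the unit really is called lightalert):

systemctl is-active lightalert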

Troubleshooting infos

On the systems

If you’re connected to a system, these commands might help:

  • confirm who: Who’s responsible for the server

  • confirm status: Status of all required services on this server

  • confirm notes: Important notes for this server (helpful for debugging)

Of course, next to:

  • systemctl status <service>: Display the status of a service

  • journalctl -u <service> [-f]: Display the logs of a service
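
For example, a quick look at a misbehaving service could look like this (nginx is just a placeholder for whatever confirm status reports as failing):

# Current state and last log lines
systemctl status nginx
# Follow the log live while reproducing the problem
journalctl -u nginx -f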

Docs

The docs also contain important information, such as:

Issues

We might already have had a problem with a specific server or service. Thus, there might be an issue which describes the problem and, hopefully, a solution or workaround.

Just head over to Infrastructure Issues and see if you can find something.

Ansible repository

It might also be helpful to check out the Infrastructure Ansible Repository.

Hardware issues

Grub on UEFI systems

If you have a UEFI-compatible mainboard and the BIOS complains that it can’t find a bootable device, you might have a problem with your GRUB installation. You need to ensure GRUB is installed properly for UEFI (instead of legacy x86/PC boot).

Boot a Debian rescue system from a USB stick in UEFI mode and reinstall GRUB.

Important

You need to boot the USB stick in UEFI mode! You can check the boot mode by looking for the /sys/firmware/efi directory. If it’s missing, your USB stick was booted in legacy mode instead of UEFI.
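
A one-liner for that check, once you have a shell on the booted rescue system:

# Directory exists => booted via UEFI, otherwise legacy BIOS mode
[ -d /sys/firmware/efi ] && echo "UEFI mode" || echo "legacy mode"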

Hint

Debian has a good wiki page for GRUB EFI reinstall.

Office barebone disk replace

  • First identify the disk and extract the serial number:

hdparm -i /dev/sd[ab] | grep -i serial
  • Deactivate the /boot/efi(2) filesystem in /etc/fstab.

  • Shutdown the host.

  • Replace the disk and boot the host again.

Hint

In case the host doesn’t boot because no operating system / bootloader is found, do the following steps:

  • Insert a Debian USB stick

  • Press F7 to display the boot media selection

  • Start Debian from the USB stick in UEFI mode (UEFI: prefix in bootloader)

  • Select the rescue mode (Advanced → Rescue)

  • Network drivers are not required

  • Load the disk layout automatically

  • Choose rootlv as root FS

  • Select a separate /boot partition

  • Run a shell in the target FS

  • Check the EFI boot mode with efibootmgr

  • Mount all filesystems with mount -a

  • Verify that /dev/sdX1 is mounted on /boot/efi (and not /boot/efi2)

  • Reinstall grub via grub-install --recheck /dev/sdX

  • Reboot

  • Clone the partition table with sgdisk:

sgdisk -R TARGET-DEVICE SOURCE-DEVICE

Warning

TARGET-DEVICE will be overwritten, don’t mix these up!

  • Randomize the disk and partition GUIDs on the new disk:

sgdisk -G /dev/sd[ab]
  • Clone the EFI boot partition (use the partition device nodes, e.g. sda1 → sdb1, not the whole disks):

dd if=SOURCE-PARTITION of=TARGET-PARTITION bs=1M
  • (Optional) change the UUID of the EFI vfat partition:

# Lookup the existing UUID
ls -l /dev/disk/by-uuid/

# Backup the Superblock and edit it with hexer
dd if=/dev/sd[ab]1 of=/tmp/uuid bs=512 count=1
hexer /tmp/uuid

# Find the UUID of the existing disk and change it.
# The UUID is written in REVERSE chunks (line 00000040).

# Write the superblock with the new UUID to the vfat partition
dd if=/tmp/uuid of=/dev/sdb1 bs=512 count=1

# Check the new UUID
ls -l /dev/disk/by-uuid/
  • Enable and mount the EFI filesystem again:

vi /etc/fstab
mount -a
  • Add disk to RAID again:

# Add spare disk
mdadm --manage /dev/md0 -a /dev/sd[ab]2

# Check if raid is really rebuilding
mdadm --detail /dev/md0
# or
cat /proc/mdstat
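
To follow the rebuild progress without re-running the command by hand (the refresh interval is arbitrary):

watch -n 5 cat /proc/mdstat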

HDD replacement after failure

  • Identify in which RAID array the faulty disc belongs

cat /proc/mdstat
  • Identify the faulty HDD with mdadm

mdadm --detail /dev/md127
  • Extract the faulty HDD’s serial number with SMART

smartctl -a /dev/sdh | grep "Serial Number"
  • Shut down freaks1

  • Open the rack on the left (a screw needs to be unscrewed in the rack itself)

  • Disconnect all cables on the server

  • Get the server out of the rack (watch out, these are HDDs, so be gentle)

  • Open both sides (left & right) of the server

  • Disconnect the first disc at the back, pull it out from the front and check the serial number with the SMART report

  • Repeat this until you’ve found the disc in question

  • Connect all cables again and put the server back in the rack

  • Check the status of the RAID arrays and to which array the disc belongs

cat /proc/mdstat
  • Add disc to correct RAID array

mdadm --manage /dev/md127 --add /dev/sdh
  • Check status of disc (it should be rebuilding)

mdadm --detail /dev/md127
  • After the replacement, go to the manufacturer’s website and return the faulty HDD under warranty