Troubleshooting guide¶
Connecting¶
Choosing the interface¶
Most of the servers have two separate interfaces, which can be looked up via DNS like this:
*.pvt.confirm.ch
: Private network interface w/ a private IP address

*.pub.confirm.ch
: Public network interface w/ a public IP address
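To check which addresses a host resolves to, a quick DNS lookup does the trick (the hostname below is just a placeholder):

```
# Hypothetical host name, substitute the real one
dig +short myserver.pvt.confirm.ch
dig +short myserver.pub.confirm.ch
```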
Passing the firewall¶
First of all, we need to get you connected to a machine via SSH.
The good news: We’re securing & hardening our servers
The bad news: SSH connections are rejected by default on the public interface
However, there are several ways to connect:
- You're connected to the office network in St. Gallen
- You're connected to the roadwarrior VPN (see VPN connectivity)
- You're connected to the site-to-site VPN
- Your public IP address is trusted
- You use the knockd sequence to open SSH on gw2.pub.confirm.ch or open SSH on a Proxmox.
Hint
Private interfaces usually have a very permissive firewall, as all private CIDR networks are allowed.
Important
If knockd doesn't open the SSH port, it might be because the TCP SYN packets don't arrive in the correct sequence. Try knock's delay feature instead: knock -d 250 {host} {port…}
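For illustration, a complete connection attempt might look like this; the port sequence below is purely a placeholder, the real one is defined in the knockd configuration:

```
# Placeholder port sequence, look up the real one in the knockd config
knock -d 250 gw2.pub.confirm.ch 7000 8000 9000
ssh gw2.pub.confirm.ch
```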
Important
In case the site-to-site VPN dies, you most likely want to ssh to the public interface of gw2 to get connectivity to the datacenter servers.
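A minimal sketch of how that could look, assuming your SSH user is allowed on gw2 and the datacenter host is reached via its private interface (the server name is a placeholder):

```
# Jump via the public interface of gw2, then on to a datacenter server
ssh -J gw2.pub.confirm.ch someserver.pvt.confirm.ch
```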
Virtual console¶
Instead of SSH'ing to the server in question, you can also use the virtual system console provided by Proxmox.
Just open the Proxmox WebUI, log in, locate your VM in the sidebar on the left and finally click on the Console action of the VM panel.
Monitoring¶
Activating the maintenance window¶
You might want to activate a maintenance window by following these steps:
1. Go to Configuration, Maintenance and click on generic maintenance
2. Ignore the first Maintenance tab
3. Go to the Periods tab
   - Click Edit on the One time only period
   - Update the Date and Maintenance period length
   - Click the Update link directly in the square box (don't use the Update button below yet)
4. Go to the Hosts & Groups tab
   - Select your hosts and groups which should be in maintenance
5. Now click the Update button
Deactivating alerts¶
To deactivate alerts, you have to deactivate specific media types:

1. Go to Administration, Media types
2. Click on the Enabled link next to the media type you want to deactivate
Deactivating lightalert¶
While maintaining the servers it's recommended to deactivate the Lightalert service. Even if maintenance mode is set in Zabbix, the Raspberry Pi will still trigger error messages and play an error sound.
Deactivating and reactivating the light can be done via SSH. To connect to the Lightalert, use:
ssh pi@lightalert.confirm.ch
Switch to the root user:
sudo su
Then deactivate the Lightalert service:
systemctl stop lightalert
After you’re finished maintaining activate the Lightalert service via:
systemctl start lightalert
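To double-check, query the unit state afterwards (unit name as used in the commands above):

```
# Should report "active (running)" once the service is back up
systemctl status lightalert
```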
Troubleshooting info¶
On the systems¶
If you’re connected to a system, these commands might help:
confirm who
: Who's responsible for the server

confirm status
: Status of all required services on this server

confirm notes
: Important notes for this server (helpful for debugging)
Of course, in addition to:

systemctl status <service>
: Display the status of a service

journalctl -u <service> [-f]
: Display the logs of a service
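For example, to inspect a single service (the service name below is just an illustration):

```
confirm status            # overview of all required services
systemctl status nginx    # state of one specific service
journalctl -u nginx -f    # follow its log live
```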
Docs¶
The docs also contain important information, such as:
Issues¶
We might already have had a problem with a specific server or service. Thus, there might be an issue which describes the problem and, hopefully, a solution or workaround.
Just head over to Infrastructure Issues and see if you can find something.
Ansible repository¶
It might also be helpful to check out the Infrastructure Ansible Repository. You can use it to:

- Read the documentation in the Ansible roles
- Have a look at the Ansible group variables
- Or even re-run an Ansible playbook (see the sketch below)
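A minimal sketch of re-running a playbook; the inventory, playbook and host names are placeholders, so check the repository's README for the real ones:

```
# Dry-run first to see what would change, then apply for a single host
ansible-playbook -i inventory site.yml --limit myserver --check --diff
ansible-playbook -i inventory site.yml --limit myserver
```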
Hardware issues¶
Grub on UEFI systems¶
In case you have a UEFI-compatible mainboard and your BIOS complains that it hasn't found a bootable device, you might have a problem with your GRUB installation. You need to ensure GRUB is installed properly for UEFI (instead of legacy x86/PC boot).
Boot a Debian Rescue system from an USB stick in UEFI mode and reinstall GRUB.
Important
You need to boot the USB stick in UEFI mode! You can check the boot mode by looking for /sys/firmware/efi. If it's missing, your USB stick was booted in legacy mode instead of UEFI.
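A quick way to check from a shell:

```
# /sys/firmware/efi only exists when the system was booted via UEFI
[ -d /sys/firmware/efi ] && echo "UEFI mode" || echo "legacy BIOS mode"
```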
Hint
Debian has a good wiki page for GRUB EFI reinstall.
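As a rough sketch of what a reinstall from the rescue system looks like; the device and volume names below are examples and will differ on the actual host:

```
# Mount the target system and chroot into it (device names are examples)
mount /dev/mapper/vg0-rootlv /mnt
mount /dev/sda1 /mnt/boot/efi
for fs in dev proc sys; do mount --bind /$fs /mnt/$fs; done
chroot /mnt grub-install --target=x86_64-efi --efi-directory=/boot/efi --recheck
chroot /mnt update-grub
```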
Office barebone disk replace¶
1. First identify the disk and extract the serial number:

   hdparm -i /dev/sd[ab] | grep -i serial

2. Deactivate the /boot/efi(2) filesystem in /etc/fstab (see the sketch after this list).
3. Shutdown the host.
4. Replace the disk and boot the host again.
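A minimal sketch of the temporarily disabled entry in /etc/fstab; the UUID and mount point below are placeholders:

```
# Commented out while the disk is being replaced, re-enable afterwards
#UUID=ABCD-1234  /boot/efi2  vfat  umask=0077  0  1
```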
Hint
In case the host doesn't boot because there's no operating system / bootloader found, do the following steps:

1. Insert a Debian USB stick
2. Press F7 to display the boot media selection
3. Start Debian from the USB stick in UEFI mode (UEFI: prefix in the bootloader)
4. Select the rescue mode (Advanced → Rescue)
5. Network drivers are not required
6. Load the disk layout automatically
7. Choose rootlv as root FS
8. Select a separate /boot partition
9. Run a shell in the target FS
10. Check the EFI boot mode with efibootmgr
11. Mount all filesystems with mount -a
12. Verify that /dev/sdX1 is mounted on /boot/efi (and not /boot/efi2)
13. Reinstall GRUB via grub-install --recheck /dev/sdX
14. Reboot
Clone the partition table with sgdisk:

sgdisk -R TARGET-DEVICE SOURCE-DEVICE
Warning
TARGET-DEVICE will be overwritten, don’t mix these up!
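For example, assuming /dev/sda is the healthy source disk and /dev/sdb is the freshly installed replacement:

```
# Copy the partition table of /dev/sda onto /dev/sdb (target comes first!)
sgdisk -R /dev/sdb /dev/sda
```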
Randomize the GUIDs on the new disk:
sgdisk -G /dev/sd[ab]
Clone EFI boot partition:
dd if=SOURCE-DEVICE of=TARGET-DEVICE bs=1M
(Optional) change the UUID of the EFI vfat partition:
# Lookup the existing UUID
ls -l /dev/disk/by-uuid/
# Backup the Superblock and edit it with hexer
dd if=/dev/sd[ab]1 of=/tmp/uuid bs=512 count=1
hexer /tmp/uuid
# Find the UUID of the existing disk and change it.
# The UUID is written in REVERSE chunks (line 00000040).
# Write the superblock with the new UUID to the vfat partition
dd if=/tmp/uuid of=/dev/sdb1 bs=512 count=1
# Check the new UUID
ls -l /dev/disk/by-uuid/
Enable and mount the EFI filesystem again:
vi /etc/fstab
mount -a
Add disk to RAID again:
# Add spare disk
mdadm --manage /dev/md0 -a /dev/sd[ab]2
# Check if raid is really rebuilding
mdadm --detail /dev/md0
# or
cat /proc/mdstat
HDD replacement after failure¶
1. Identify in which RAID array the faulty disc belongs:

   cat /proc/mdstat

2. Identify the faulty HDD with mdadm:

   mdadm --detail /dev/md127

3. Extract the faulty HDD's serial number with SMART:

   smartctl -a /dev/sdh | grep "Serial Number"

4. Shutdown freaks1
5. Open the rack on the left (a screw needs to be unscrewed in the rack itself)
6. Disconnect all cables on the server
7. Get the server out of the rack (watch out, these are HDDs, so be gentle)
8. Open both sides (left & right) of the server
9. Disconnect the first disc at the back, pull it out from the front and check the serial number against the SMART report
10. Repeat until you've found the disc in question
11. Connect all cables again and put the server back in the rack
12. Check the status of the RAID array and in which RAID array the disc belongs:

    cat /proc/mdstat

13. Add the disc to the correct RAID array:

    mdadm --manage /dev/md127 --add /dev/sdh

14. Check the status of the disc (it should be rebuilding):

    mdadm --detail /dev/md127

15. After the replacement, go to the manufacturer's website and start a warranty return for the HDD
Hint
WD warranty: https://support-en.wd.com/app/warrantystatus
Seagate warranty: https://www.seagate.com/de/de/support/warranty-and-replacements/