Troubleshooting guide¶
Connecting¶
Choosing the interface¶
Most of the servers have two separate interfaces, which can be looked up via DNS like this:
*.pvt.confirm.ch
: Private network interface w/ a private IP address

*.pub.confirm.ch
: Public network interface w/ a public IP address
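To check which addresses a host resolves to, a quick DNS lookup does the trick (the hostname below is just a placeholder):

```
# Hypothetical host name, substitute the real one
dig +short myserver.pvt.confirm.ch
dig +short myserver.pub.confirm.ch
```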
Passing the firewall¶
First of all, we need to get you connected to a machine via SSH.
The good news: We’re securing & hardening our servers
The bad news: SSH connections are rejected by default on the public interface
However, there are several ways to connect:
- You're connected to the office network in St. Gallen
- You're connected to the roadwarrior VPN (see VPN connectivity)
- You're connected to the site-to-site VPN
- Your public IP address is trusted
- You use the knockd sequence to open SSH on gw2.pub.confirm.ch or open SSH on a Proxmox.
Hint
Private interfaces usually have a very permissive firewall, as all private CIDR networks are allowed.
Important
If knockd doesn't open the SSH port, it might be because the TCP SYN packets don't arrive in the correct sequence. Try knock's delay feature instead: knock -d 250 {host} {port…}
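For illustration, a complete connection attempt might look like this; the port sequence below is purely a placeholder, the real one is defined in the knockd configuration:

```
# Placeholder port sequence, look up the real one in the knockd config
knock -d 250 gw2.pub.confirm.ch 7000 8000 9000
ssh gw2.pub.confirm.ch
```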
Important
In case the site-to-site VPN dies, you most likely want to ssh to the public interface of gw2 to get connectivity to the datacenter servers.
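A minimal sketch of how that could look, assuming your SSH user is allowed on gw2 and the datacenter host is reached via its private interface (the server name is a placeholder):

```
# Jump via the public interface of gw2, then on to a datacenter server
ssh -J gw2.pub.confirm.ch someserver.pvt.confirm.ch
```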
Virtual console¶
Instead of SSH'ing to the server in question, you can also use the virtual system console provided by Proxmox.
Just open the Proxmox WebUI, log in, locate your VM in the sidebar on the left and finally click on the Console action of the VM panel.
Monitoring¶
Activating the maintenance window¶
You might want to activate a maintenance window by following these steps:
1. Go to Configuration, Maintenance and click on generic maintenance
2. Ignore the first Maintenance tab
3. Go to the Periods tab
   - Click Edit on the One time only period
   - Update the Date and Maintenance period length
   - Click the Update link directly in the square box (don't use the Update button below yet)
4. Go to the Hosts & Groups tab
   - Select your hosts and groups which should be in maintenance
5. Now click the Update button
Deactivating alerts¶
To deactivate alerts, you have to deactivate specific media types:

1. Go to Administration, Media types
2. Click on the Enabled link next to the media type you want to deactivate
Deactivating lightalert¶
While maintaining the servers it's recommended to deactivate the Lightalert service. Even if maintenance mode is set in Zabbix, the Raspberry Pi will still trigger error messages and play an error sound.
Deactivating and reactivating the light can be done via SSH. To connect to the Lightalert, use:
ssh pi@lightalert.confirm.ch
Switch to the root user:
sudo su
Then deactivate the Lightalert service:
systemctl stop lightalert
After you’re finished maintaining activate the Lightalert service via:
systemctl start lightalert
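To double-check, query the unit state afterwards (unit name as used in the commands above):

```
# Should report "active (running)" once the service is back up
systemctl status lightalert
```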
Troubleshooting info¶
On the systems¶
If you’re connected to a system, these commands might help:
confirm who
: Who's responsible for the server

confirm status
: Status of all required services on this server

confirm notes
: Important notes for this server (helpful for debugging)
Of course, in addition to:

systemctl status <service>
: Display the status of a service

journalctl -u <service> [-f]
: Display the logs of a service
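For example, to inspect a single service (the service name below is just an illustration):

```
confirm status            # overview of all required services
systemctl status nginx    # state of one specific service
journalctl -u nginx -f    # follow its log live
```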
Docs¶
The docs also contain important information, such as:
Issues¶
We might already have had a problem with a specific server or service. Thus, there might be an issue which describes the problem and, hopefully, a solution or workaround.
Just head over to Infrastructure Issues and see if you can find something.
Ansible repository¶
It might also be helpful to check out the Infrastructure Ansible Repository. You can use it to:

- Read the documentation in the Ansible roles
- Have a look at the Ansible group variables
- Or even re-run an Ansible playbook (see the sketch below)
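A minimal sketch of re-running a playbook; the inventory, playbook and host names are placeholders, so check the repository's README for the real ones:

```
# Dry-run first to see what would change, then apply for a single host
ansible-playbook -i inventory site.yml --limit myserver --check --diff
ansible-playbook -i inventory site.yml --limit myserver
```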
Hardware issues¶
Grub on UEFI systems¶
In case you have a UEFI-compatible mainboard and your BIOS complains that it hasn't found a bootable device, you might have a problem with your GRUB installation. You need to ensure GRUB is installed properly for UEFI (instead of legacy x86/PC boot).
Boot a Debian Rescue system from an USB stick in UEFI mode and reinstall GRUB.
Important
You need to boot the USB stick in UEFI mode! You can check the boot mode by looking for /sys/firmware/efi. If it's missing, your USB stick was booted in legacy mode instead of UEFI.
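A quick way to check from a shell:

```
# /sys/firmware/efi only exists when the system was booted via UEFI
[ -d /sys/firmware/efi ] && echo "UEFI mode" || echo "legacy BIOS mode"
```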
Hint
Debian has a good wiki page for GRUB EFI reinstall.
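As a rough sketch of what a reinstall from the rescue system looks like; the device and volume names below are examples and will differ on the actual host:

```
# Mount the target system and chroot into it (device names are examples)
mount /dev/mapper/vg0-rootlv /mnt
mount /dev/sda1 /mnt/boot/efi
for fs in dev proc sys; do mount --bind /$fs /mnt/$fs; done
chroot /mnt grub-install --target=x86_64-efi --efi-directory=/boot/efi --recheck
chroot /mnt update-grub
```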
Office barebone disk replace¶
1. First identify the disk and extract the serial number:

   hdparm -i /dev/sd[ab] | grep -i serial

2. Deactivate the /boot/efi(2) filesystem in /etc/fstab (see the sketch after this list).
3. Shutdown the host.
4. Replace the disk and boot the host again.
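A minimal sketch of the temporarily disabled entry in /etc/fstab; the UUID and mount point below are placeholders:

```
# Commented out while the disk is being replaced, re-enable afterwards
#UUID=ABCD-1234  /boot/efi2  vfat  umask=0077  0  1
```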
Hint
In case the host doesn't boot because there's no operating system / bootloader found, do the following steps:

1. Insert a Debian USB stick
2. Press F7 to display the boot media selection
3. Start Debian from the USB stick in UEFI mode (UEFI: prefix in the bootloader)
4. Select the rescue mode (Advanced → Rescue)
5. Network drivers are not required
6. Load the disk layout automatically
7. Choose rootlv as root FS
8. Select a separate /boot partition
9. Run a shell in the target FS
10. Check the EFI boot mode with efibootmgr
11. Mount all filesystems with mount -a
12. Verify that /dev/sdX1 is mounted on /boot/efi (and not /boot/efi2)
13. Reinstall GRUB via grub-install --recheck /dev/sdX
14. Reboot
Clone the partition table with sgdisk:

sgdisk -R TARGET-DEVICE SOURCE-DEVICE
Warning
TARGET-DEVICE will be overwritten, don’t mix these up!
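For example, assuming /dev/sda is the healthy source disk and /dev/sdb is the freshly installed replacement:

```
# Copy the partition table of /dev/sda onto /dev/sdb (target comes first!)
sgdisk -R /dev/sdb /dev/sda
```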
Randomize the GUIDs on the new disk:
sgdisk -G /dev/sd[ab]
Clone EFI boot partition:
dd if=SOURCE-DEVICE of=TARGET-DEVICE bs=1M
(Optional) change the UUID of the EFI vfat partition:
# Lookup the existing UUID
ls -l /dev/disk/by-uuid/
# Backup the Superblock and edit it with hexer
dd if=/dev/sd[ab]1 of=/tmp/uuid bs=512 count=1
hexer /tmp/uuid
# Find the UUID of the existing disk and change it.
# The UUID is written in REVERSE chunks (line 00000040).
# Write the superblock with the new UUID to the vfat partition
dd if=/tmp/uuid of=/dev/sdb1 bs=512 count=1
# Check the new UUID
ls -l /dev/disk/by-uuid/
Enable and mount the EFI filesystem again:
vi /etc/fstab
mount -a
Add disk to RAID again:
# Add spare disk
mdadm --manage /dev/md0 -a /dev/sd[ab]2
# Check if raid is really rebuilding
mdadm --detail /dev/md0
# or
cat /proc/mdstat
HDD replacement after failure¶
1. Identify in which RAID array the faulty disc belongs:

   cat /proc/mdstat

2. Identify the faulty HDD with mdadm:

   mdadm --detail /dev/md127

3. Extract the faulty HDD's serial number with SMART:

   smartctl -a /dev/sdh | grep "Serial Number"

4. Shutdown freaks1
5. Open the rack on the left (a screw needs to be unscrewed in the rack itself)
6. Disconnect all cables on the server
7. Get the server out of the rack (watch out, these are HDDs, so be gentle)
8. Open both sides (left & right) of the server
9. Disconnect the first disc at the back, pull it out from the front and check the serial number against the SMART report
10. Repeat until you've found the disc in question
11. Connect all cables again and put the server back in the rack
12. Check the status of the RAID array and in which RAID array the disc belongs:

    cat /proc/mdstat

13. Add the disc to the correct RAID array:

    mdadm --manage /dev/md127 --add /dev/sdh

14. Check the status of the disc (it should be rebuilding):

    mdadm --detail /dev/md127

15. After the replacement, go to the manufacturer's website and start a warranty return for the HDD
Hint
WD warranty: https://support-en.wd.com/app/warrantystatus
Seagate warranty: https://www.seagate.com/de/de/support/warranty-and-replacements/