Troubleshooting guide
=====================

Connecting
----------

Choosing the interface
~~~~~~~~~~~~~~~~~~~~~~

Most of the servers have two separate interfaces, which can be looked up via DNS like this:

- ``*.pvt.confirm.ch``: Private network interface w/ a private IP address
- ``*.pub.confirm.ch``: Public network interface w/ a public IP address

Passing the firewall
~~~~~~~~~~~~~~~~~~~~

First of all, we need to get you connected to a machine via SSH.

- The good news: We're securing & hardening our servers
- The bad news: SSH connections are rejected by default on the public interface

However, there are several ways you can connect:

- You're connected to the office network in St. Gallen
- You're connected to the roadwarrior VPN (see :ref:`VPN connectivity`)
- You're connected to the site-to-site VPN
- Your public IP address is `trusted `_
- You use the `knockd `_ sequence to `open SSH on gw2.pub.confirm.ch `_ or `open SSH on a Proxmox `_.

.. hint::

   Private interfaces are usually only weakly firewalled, as all private CIDR networks are allowed.

.. important::

   If knockd doesn't open the SSH port, it might be because the ``TCP SYN`` packets don't arrive in the correct sequence. Try knock's delay feature ``knock -d 250 {host} {port…}`` instead.

.. important::

   In case the site-to-site VPN dies, you most likely want to `ssh to the public interface of gw2 `_ to get connectivity to the datacenter servers.

Virtual console
~~~~~~~~~~~~~~~

Instead of SSH'ing to the server in question, you can also use the virtual system console provided by :ref:`Proxmox `. Just open the Proxmox WebUI, log in, locate your VM in the sidebar on the left and finally click on the ``Console`` action of the VM panel.

Monitoring
----------

Activating the maintenance window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You might want to activate a `maintenance window `_ by following these steps:

- Go to ``Configuration``, ``Maintenance`` and click on ``generic maintenance``
- Ignore the first ``Maintenance`` tab
- Go to the ``Periods`` tab
- Click ``Edit`` on the ``One time only`` period
- Update the ``Date`` and ``Maintenance period length``
- Click the ``Update`` link directly in the square box (don't use the ``Update`` button below yet)
- Go to the ``Hosts & Groups`` tab
- Select your hosts and groups which should be in maintenance
- Now click the ``Update`` button

Deactivating alerts
~~~~~~~~~~~~~~~~~~~

To deactivate alerts, you have to deactivate specific `media types `_:

- Go to ``Administration``, ``Media types``
- Click on the ``Enabled`` link next to the media type you want to deactivate

Deactivating lightalert
^^^^^^^^^^^^^^^^^^^^^^^

While maintaining the servers, it's recommended to deactivate the Lightalert service. Even if maintenance mode is set in Zabbix, the Raspberry Pi will still trigger error messages and sound an error beep. Deactivating and reactivating the light can be done via an SSH connection.

To connect to the Lightalert host, use:

.. code-block:: bash

   ssh pi@lightalert.confirm.ch

Switch to the root user:

.. code-block:: bash

   sudo su

Then deactivate the Lightalert service:

.. code-block:: bash

   systemctl stop lightalert

After you're finished maintaining, activate the Lightalert service again via:

.. code-block:: bash

   systemctl start lightalert
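If you only need to toggle the service, the steps above can also be collapsed into single SSH invocations. This is a minimal sketch and assumes the ``pi`` user is allowed to run ``sudo`` non-interactively:

.. code-block:: bash

   # Stop the lightalert service before maintenance
   # (assumes passwordless sudo for the pi user)
   ssh pi@lightalert.confirm.ch 'sudo systemctl stop lightalert'

   # Start it again once maintenance is finished
   ssh pi@lightalert.confirm.ch 'sudo systemctl start lightalert'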
Troubleshooting infos
---------------------

On the systems
~~~~~~~~~~~~~~

If you're connected to a system, these commands might help:

- ``confirm who``: Who's responsible for the server
- ``confirm status``: Status of all required services on this server
- ``confirm notes``: Important notes for this server (helpful for debugging)

Of course, next to:

- ``systemctl status <service>``: Display the status of a service
- ``journalctl -u <service> [-f]``: Display the logs of a service

.. _Troubleshooting Docs:

Docs
~~~~

The docs also contain important information, such as:

- :ref:`Network informations `
- :ref:`Proxmox informations `

Issues
~~~~~~

We might already have had a problem with a specific server or service. Thus, there might be an issue which describes the problem and hopefully a solution or workaround. Just head over to `Infrastructure Issues `_ and see if you can find something.

Ansible repository
~~~~~~~~~~~~~~~~~~

It might also be helpful to check out the `Infrastructure Ansible Repository `_. You can use it to:

- Read the documentation in the `Ansible roles `_
- Have a look at the `Ansible group variables `_
- Or even re-run an `Ansible playbook `_

Hardware issues
---------------

Grub on UEFI systems
~~~~~~~~~~~~~~~~~~~~

In case you have a UEFI-compatible mainboard and your BIOS complains that it can't find a bootable device, you might have a problem with your GRUB installation. You need to ensure GRUB is installed properly for UEFI (instead of legacy x86/PC boot). Boot a Debian rescue system from a USB stick in UEFI mode and reinstall GRUB.

.. important::

   You need to boot the USB stick in UEFI mode! You can check the boot mode by looking for the ``/sys/firmware/efi`` directory. If it is missing, your USB stick was booted in legacy mode instead of UEFI.

.. hint::

   Debian has a good wiki page for `GRUB EFI reinstall `_.

Office barebone disk replace
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* First identify the disk and extract the serial number:

  .. code-block:: bash

     hdparm -i /dev/sd[ab] | grep -i serial

* Deactivate the ``/boot/efi(2)`` filesystem in ``/etc/fstab``.
* Shutdown the host.
* Replace the disk and boot the host again.

  .. hint::

     In case the host doesn't boot because no operating system / bootloader is found, do the following steps:

     - Insert a Debian USB stick
     - Press ``F7`` to display the boot media selection
     - Start Debian from the USB stick in UEFI mode (``UEFI:`` prefix in bootloader)
     - Select the rescue mode (``Advanced`` → ``Rescue``)
     - Network drivers are not required
     - Load the disk layout automatically
     - Choose ``rootlv`` as root FS
     - Select a separate ``/boot`` partition
     - Run a shell in the target FS
     - Check the EFI boot mode with ``efibootmgr``
     - Mount all filesystems with ``mount -a``
     - Verify that ``/dev/sdX1`` is mounted on ``/boot/efi`` (and not ``/boot/efi2``)
     - Reinstall GRUB via ``grub-install --recheck /dev/sdX``
     - Reboot

* Clone the partition table with ``sgdisk``:

  .. code-block:: bash

     sgdisk -R TARGET-DEVICE SOURCE-DEVICE

  .. warning::

     *TARGET-DEVICE* will be overwritten, don't mix these up!

* Randomize the GUIDs on the new disk:

  .. code-block:: bash

     sgdisk -G /dev/sd[ab]

* Clone the EFI boot partition:

  .. code-block:: bash

     dd if=SOURCE-DEVICE of=TARGET-DEVICE bs=1M

* (Optional) change the UUID of the EFI vfat partition:

  .. code-block:: bash

     # Lookup the existing UUID
     ls -l /dev/disk/by-uuid/

     # Backup the superblock and edit it with hexer
     dd if=/dev/sd[ab]1 of=/tmp/uuid bs=512 count=1
     hexer /tmp/uuid
     # Find the UUID of the existing disk and change it.
     # The UUID is written in REVERSE chunks (line 00000040).

     # Write the superblock with the new UUID to the vfat partition
     dd if=/tmp/uuid of=/dev/sdb1 bs=512 count=1

     # Check the new UUID
     ls -l /dev/disk/by-uuid/

* Enable and mount the EFI filesystem again:

  .. code-block:: bash

     vi /etc/fstab
     mount -a

* Add the disk to the RAID again:

  .. code-block:: bash

     # Add spare disk
     mdadm --manage /dev/md0 -a /dev/sd[ab]2

     # Check if the RAID is really rebuilding
     mdadm --detail /dev/md0
     # or
     cat /proc/mdstat
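When several discs are installed, it can help to print all device-to-serial mappings at once before pulling a drive. This is a minimal sketch; the ``/dev/sd?`` glob is an assumption and may need adjusting to the actual devices:

.. code-block:: bash

   # Print the serial number of every SATA/SAS disk so the physical drive
   # can be matched to its device name (requires smartmontools)
   for dev in /dev/sd?; do
       printf '%s: ' "$dev"
       smartctl -i "$dev" | grep -i 'serial number'
   done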
HDD replacement after failure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Identify to which RAID array the faulty disc belongs:

  .. code-block:: bash

     cat /proc/mdstat

* Identify the faulty HDD with mdadm:

  .. code-block:: bash

     mdadm --detail /dev/md127

* Extract the faulty HDD's serial number with SMART:

  .. code-block:: bash

     smartctl -a /dev/sdh | grep "Serial Number"

* Shutdown freaks1
* Open the rack on the left (a screw needs to be unscrewed in the rack itself)
* Disconnect all cables on the server
* Get the server out of the rack (watch out, these are HDDs, so be gentle)
* Open both sides (left & right) of the server
* Disconnect the first disc at the back, pull it out from the front and check the serial number against the SMART report
* Do this again until you've found the disc in question
* Connect all cables again and put the server back in the rack
* Check the status of the RAID array and to which RAID array the disc belongs:

  .. code-block:: bash

     cat /proc/mdstat

* Add the disc to the correct RAID array:

  .. code-block:: bash

     mdadm --manage /dev/md127 --add /dev/sdh

* Check the status of the disc (it should be rebuilding):

  .. code-block:: bash

     mdadm --detail /dev/md127

* After the replacement, go to the manufacturer's website and return the HDD under warranty.

.. hint::

   - WD warranty: https://support-en.wd.com/app/warrantystatus
   - Seagate warranty: https://www.seagate.com/de/de/support/warranty-and-replacements/
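The rebuild can take several hours on large discs. While waiting (and before sending the old disc in), you can keep an eye on the progress; a minimal, generic sketch:

.. code-block:: bash

   # Refresh the RAID status every 60 seconds; the rebuild shows up as a
   # "recovery = xx.x%" line below the affected array
   watch -n 60 cat /proc/mdstat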