Bytemark Healthcheck
====================

* https://gitlab.bytemark.co.uk/operations/healthcheck/

This project tests the health of the Linux hosts on which it runs.

The health-check schedules several kinds of tests at different frequencies, and each test can raise alerts via the in-house MauveAlert system.


## Checks

The checks shipped with this package are listed below, along with their frequency (the number in square brackets).

In the future it is intended that Puppet will enable or disable the various checks on a per-host basis. Right now, however, __every__ check is run upon __every__ host with the package installed, and care has been taken to ensure that checks which don't apply exit cleanly.

> For example, the test for the MegaRAID-based RAID controller will silently terminate if that hardware isn't present.
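
The "exit cleanly when not applicable" behaviour could be sketched roughly as follows. This is only an illustration, not the project's actual code: it assumes that a check decides whether it applies by looking for a vendor command-line tool on the `PATH`, and the names `tool_available` and `run_check` are hypothetical.

```python
import shutil


def tool_available(tool_name, which=shutil.which):
    """Return True if the tool a check depends on can be found on the PATH.

    Assumption: presence of the tool stands in for presence of the hardware.
    """
    return which(tool_name) is not None


def run_check(tool_name, check, which=shutil.which):
    """Run `check` only when it applies to this host.

    Returns the exit status the check script should finish with:
    0 when the check does not apply (silent, clean exit), otherwise
    whatever the check itself returns.
    """
    if not tool_available(tool_name, which):
        # Not applicable here: raise no alert and report success.
        return 0
    return check()
```

A check script would pass the result of `run_check(...)` to `sys.exit()`, so that a host without the relevant hardware never raises an alert.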


* `apt_legacy_check` - [30]
  * Raise an alert if any pending security updates are present on this system.
  * Only used if `unattended-upgrades` is not installed.
* `apt_upgrade_check` - [30]
  * Raise an alert if any pending security updates are present on this system.
  * Only used if `unattended-upgrades` is present.
* `bad_ipv6` - [15]
  * Raise an alert if any IPv6 address is marked as "dadfailed".
* `bond_check` - [30]
  * Raise an alert if any bonded interface has the slower interface of a pair as the active one.
* `disk_space_checker` - [15]
  * Raise an alert if any filesystem is >95% full.
  * Raise an alert if any filesystem has >95% of its inodes used.
* `drbd_checker`
  * Raise an alert if any DRBD device is in a disconnected/standalone state.
* `exim4_mailq` - [15]
  * Raise an alert if there are "too many" mails in the queue.
  * Designed to recognize a compromise which has resulted in the host sending spam.
* `hp_hardware_raid` - [15]
  * Raise an alert if the RAID controller reports a problem with any drive(s).
* `machine_check_exceptions` - [15]
  * Raise an alert if `dmesg` shows any logged exceptions.
* `megaraid_hardware_raid` - [15]
  * Raise an alert if the RAID controller reports a problem with any drive(s).
* `ntp` - [30]
  * Raise an alert if the local time is >5 seconds out of sync with our NTP server(s).
* `perm_check` - [30]
  * Raise an alert if any system directory has bogus permissions.
  * This includes `/bin`, `/usr/sbin`, etc.
* `postfix_mailq` - [15]
  * Raise an alert if there are "too many" mails in the queue.
  * Designed to recognize a compromise which has resulted in the host sending spam.
* `software_raid` - [15]
  * Raise an alert if there is a problem with any RAID array or drive(s).
* `syslog` - [30]
  * Raise an alert if there is no (r)syslog process running.
* `tw_hardware_raid` - [15]
  * Raise an alert if the RAID controller reports a problem with any drive(s).
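
To make the threshold-style checks above more concrete, here is a minimal sketch of the logic `disk_space_checker` is described as performing (space and inode usage above 95%). It is not the shipped implementation: the function names are hypothetical, usage is computed from `f_bfree` and `f_ffree` as a simplifying assumption, and raising the alert via MauveAlert is left out entirely.

```python
import os

THRESHOLD_PERCENT = 95  # from the list above: alert when usage is >95%


def percent_used(total, free):
    """Percentage of a resource in use; 0.0 when the total is zero."""
    if total == 0:
        # Some filesystems report no inode count at all.
        return 0.0
    return 100.0 * (total - free) / total


def filesystem_problems(path, statvfs=os.statvfs):
    """Return a list of problem descriptions for the filesystem at `path`.

    An empty list means the filesystem is below both thresholds.
    """
    st = statvfs(path)
    problems = []
    space = percent_used(st.f_blocks, st.f_bfree)
    inodes = percent_used(st.f_files, st.f_ffree)
    if space > THRESHOLD_PERCENT:
        problems.append(f"{path}: {space:.1f}% of space used")
    if inodes > THRESHOLD_PERCENT:
        problems.append(f"{path}: {inodes:.1f}% of inodes used")
    return problems
```

Calling `filesystem_problems("/")` on a host returns an empty list when the root filesystem is healthy. A real check would iterate over every mounted filesystem and turn each returned string into an alert.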
