I use Munin to monitor a few machines, and bubble up alerts when issues show up. It’s pretty good, easy to set up, and has a large number of contributed plugins to monitor pretty much everything. If you are still out of luck, it’s easy enough to write your own.
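As a rough illustration of how little a plugin needs, here is a minimal sketch in Python. It relies only on the basic plugin convention: when run with the `config` argument, a plugin prints its graph description, and otherwise it prints one `<field>.value <number>` line per data source. The graph title, the field name `load`, and the helpers `config_lines` and `fetch_lines` are made up for this example, and Munin already ships its own load plugin, so take this as a template rather than something to deploy.

```python
#!/usr/bin/env python3
"""Hypothetical Munin plugin reporting the 1-minute load average."""
import os
import sys


def config_lines():
    # Graph description, printed when the plugin is run with "config".
    # Title and field name are illustrative choices, not Munin defaults.
    return [
        "graph_title Load average (example)",
        "graph_vlabel load",
        "graph_category system",
        "load.label load",
    ]


def fetch_lines(load):
    # One "<field>.value <number>" line per data source.
    return ["load.value {:.2f}".format(load)]


def main(argv):
    if len(argv) > 1 and argv[1] == "config":
        lines = config_lines()
    else:
        # os.getloadavg() is only available on Unix-like systems.
        lines = fetch_lines(os.getloadavg()[0])
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the output in two small functions makes the plugin easy to check by hand: run it once with `config` and once without, and compare what it prints with what the graph should show.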
To ease the task of viewing the data, each machine runs munin-node, but only a couple of masters do the data collection with munin-update. This works reasonably well, except that machines monitored by more than one master have to do extra work to provide the same data to each of them.
Setting this up is relatively easy, and the benefits show quickly, in the form of a reduced collection time, and fewer gaps in the data.
Surprisingly, it also showed as a substantially reduced load on low-power machines. But beware of the --fork parameter to