Introduction
Beginning with release 0.0.6, NetSaint can optionally be configured to support distributed monitoring of network services and resources. I'll try to briefly explain how this can be accomplished...
Goals
The goal in the distributed monitoring environment that I will describe is to offload the overhead (CPU usage, etc.) of performing service checks from a "central" server onto one or more "distributed" servers. Most small to medium sized shops will not have a real need for setting up such an environment. However, when you want to start monitoring hundreds or even thousands of hosts (and several times that many services) using NetSaint, this becomes quite important.
Reference Diagram
The diagram below should help give you a general idea of how distributed monitoring works with NetSaint. I'll be referring to the items shown in the diagram as I explain things...
Central Server vs. Distributed Servers
When setting up a distributed monitoring environment with NetSaint, there are differences in the way the central and distributed servers are configured. I'll show you how to configure both types of servers and explain what effects the changes being made have on the overall monitoring. For starters, let's describe the purpose of the different types of servers...
The function of a distributed server is to actively perform checks of all the services you define for a "cluster" of hosts. I use the term "cluster" loosely - it basically just means an arbitrary group of hosts on your network. Depending on your network layout, you may have several clusters at one physical location, or each cluster may be separated by a WAN, its own firewall, etc. The important thing to remember is that for each cluster of hosts (however you define that), there is one distributed server that runs NetSaint and monitors the services on the hosts in the cluster. A distributed server is usually a bare-bones installation of NetSaint. It doesn't have to have the web interface installed, send out notifications, run event handler scripts, or do anything other than execute service checks if you don't want it to. More detailed information on configuring a distributed server comes later...
The purpose of the central server is to simply listen for service check results from one or more distributed servers. Even though services are actively checked from the central server, the active checks are only performed at long intervals (as will be described later), so let's just say that the central server only accepts passive checks for now. Since the central server is obtaining passive service check results from one or more distributed servers, it serves as the focal point for all monitoring logic (i.e. it sends out notifications, runs event handler scripts, determines host states, has the web interface installed, etc).
Obtaining Service Check Information From Distributed Monitors
Okay, before we go jumping into configuration detail we need to know how to send the service check results from the distributed servers to the central server. I've already discussed how to submit passive check results to NetSaint from the same host that NetSaint is running on (as described in the documentation on passive checks), but I haven't given any info on how to submit passive check results from other hosts.
In order to facilitate the submission of passive check results to a remote host, I've written the nsca addon. The addon consists of two pieces. The first is a client program (send_nsca) which is run from a remote host and is used to send the service check results to another server. The second piece is the nsca daemon (nsca) which either runs as a standalone daemon or under inetd and listens for connections from client programs. Upon receiving service check information from a client, the daemon will submit the check information to NetSaint (on the central server) by inserting a PROCESS_SVC_CHECK_RESULT command into the external command file, along with the check results. The next time NetSaint checks for external commands, it will find the passive service check information that was sent from the distributed server and process it. Easy, huh?
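To make the last step more concrete, here is a minimal sketch of what a passive check entry in the external command file might look like. The line layout (a bracketed timestamp followed by semicolon-separated fields) is an assumption that is not spelled out in the text above, and the function name, host name and service name are made up for illustration. The sketch only prints the line and does not write to a real command file.

```shell
#!/bin/sh
# Sketch: format a passive service check result as an external command.
# Assumed layout (not confirmed by the text above):
#   [timestamp] PROCESS_SVC_CHECK_RESULT;host;service;return_code;output
# format_passive_result is a hypothetical helper, not part of nsca.

format_passive_result() {
    # $1 = timestamp (epoch seconds), $2 = host name,
    # $3 = service description, $4 = return code, $5 = plugin output
    printf '[%s] PROCESS_SVC_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example: print one entry for a made-up host and service.
format_passive_result "$(date +%s)" "web1" "HTTP" 0 "HTTP OK"
```

The nsca daemon does the equivalent of this on the central server for every result it receives, so nothing like it needs to be run by hand.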
Distributed Server Configuration
So how exactly is NetSaint configured on a distributed server? Basically, it's just a bare-bones installation. You don't need to install the web interface or have notifications sent out from the server, as this will all be handled by the central server.
Key configuration changes:
In order to make everything come together and work properly, we want the distributed server to report the results of all service checks to NetSaint. We could use event handlers to report changes in the state of a service, but that just doesn't cut it, since event handlers only run when a service changes state. In order to force the distributed server to report all service check results, you must enable the obsess_over_services option in the main configuration file and provide an ocsp_command to be run after every service check. We will use the ocsp command to send the results of all service checks to the central server, making use of the send_nsca client and nsca daemon (as described above) to handle the transmission.
In order to accomplish this, you'll need to define an ocsp command like this:
ocsp_command=submit_check_result
The command definition for the submit_check_result command looks something like this:
command[submit_check_result]=/usr/local/netsaint/libexec/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATE$ '$OUTPUT$'
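Before restarting NetSaint on the distributed server, it may be worth checking that both options are actually set. The following is a rough sketch, assuming the main configuration file uses plain name=value lines and that a value of 1 means "enabled" for obsess_over_services. The function name is hypothetical, and the path in the usage comment is only a guess at a typical install location.

```shell
#!/bin/sh
# Sketch: verify that a main config file sets the two options discussed above.
# Assumptions: "name=value" lines, and obsess_over_services=1 means enabled.
# check_distributed_cfg is a hypothetical helper name.

check_distributed_cfg() {
    cfg="$1"
    grep -q '^obsess_over_services=1[[:space:]]*$' "$cfg" || {
        echo "obsess_over_services is not enabled"; return 1; }
    grep -q '^ocsp_command=..*' "$cfg" || {
        echo "ocsp_command is not defined"; return 1; }
    echo "ok"
}

# Usage (the path is an assumption; adjust it to your installation):
#   check_distributed_cfg /usr/local/netsaint/etc/netsaint.cfg
```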
The submit_check_result shell script looks something like this (replace central_server with the IP address of the central server):
#!/bin/sh

# Arguments:
#  $1 = host_name (Short name of host that the service is
#       associated with)
#  $2 = svc_description (Description of the service)
#  $3 = state_string (A string representing the status of
#       the given service - "OK", "WARNING", "CRITICAL"
#       or "UNKNOWN")
#  $4 = plugin_output (A text string that should be used
#       as the plugin output for the service checks)
#

# Convert the state string to the corresponding return code
return_code=-1

case "$3" in
    OK)
        return_code=0
        ;;
    WARNING)
        return_code=1
        ;;
    CRITICAL)
        return_code=2
        ;;
    UNKNOWN)
        return_code=-1
        ;;
esac

# pipe the service check info into the send_nsca program, which
# in turn transmits the data to the nsca daemon on the central
# monitoring server

/bin/echo -e "$1\t$2\t$return_code\t$4\n" | /usr/local/netsaint/bin/send_nsca central_server -c /usr/local/netsaint/var/send_nsca.cfg
The script above assumes that you have the send_nsca program and its configuration file (send_nsca.cfg) located in the /usr/local/netsaint/bin/ and /usr/local/netsaint/var/ directories, respectively.
That's it! We've successfully configured a remote host running NetSaint to act as a distributed monitoring server. Let's go over exactly what happens with the distributed server and how it sends service check results to NetSaint (the steps outlined below correspond to the numbers in the reference diagram above):
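If you want to see what the script would hand to send_nsca without actually contacting the central server, the state mapping and the tab-delimited line can be tried out on their own. This is just a sketch: the two function names are hypothetical, printf is used in place of /bin/echo -e purely for portability, and the host and service names are made up.

```shell
#!/bin/sh
# Sketch: build the tab-delimited line that the script pipes to send_nsca,
# without sending anything. state_to_code and build_nsca_line are
# hypothetical helper names used only for this illustration.

state_to_code() {
    case "$1" in
        OK)       printf '%s\n' 0 ;;
        WARNING)  printf '%s\n' 1 ;;
        CRITICAL) printf '%s\n' 2 ;;
        *)        printf '%s\n' -1 ;;   # UNKNOWN and anything unrecognized
    esac
}

build_nsca_line() {
    # $1 = host name, $2 = service description,
    # $3 = state string, $4 = plugin output
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$(state_to_code "$3")" "$4"
}

# Example with a made-up host and service:
build_nsca_line "web1" "HTTP" "CRITICAL" "Connection refused"
```

Quoting the service description and plugin output matters here for the same reason it does in the command definition: both may contain spaces.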
Central Server Configuration
We've looked at how distributed monitoring servers should be configured, so let's turn to the central server. For all intents and purposes, the central server is configured as you would normally configure a standalone server. It is set up with:
There are two other very important things that you need to keep in mind when configuring the central server:
It is important that you set the check_interval argument for each service definition to a long interval. This will ensure that active service checks account for only a minimal load on the central server. We don't want to disable service checks, as it will be necessary to sometimes force NetSaint to actively check services (as discussed below).
That's it! Easy, huh?
Problems With Passive Checks
For all intents and purposes we can say that the central server is relying solely on passive checks for monitoring. While it does perform active checks of all services, it only does so at very long intervals, so let's disregard that fact. The main problem with relying completely on passive checks for monitoring is the fact that NetSaint must rely on something else to provide the monitoring data. What if the remote host that is sending in passive check results goes down or becomes unreachable? If NetSaint isn't actively checking the services on the host, how will it know that there is a problem?
We can protect against this type of problem by using another addon to monitor incoming passive check results...
Watchdog Daemon
In order to protect against situations where remote hosts may stop sending passive service checks into the central monitoring server, I've developed the pscwatch daemon. The daemon's sole purpose in life is to ensure that service checks are being either performed actively by the central server or being provided passively by distributed servers on a regular basis.
If the pscwatch daemon detects that a service check has not been performed within a given threshold of time, it will send a command to NetSaint via the external command file telling it to schedule an immediate active check of the service. When NetSaint performs an active check of the service, it will be able to tell if there is a real problem or not. Problem solved.
Note: If service checks are disabled, NetSaint will refuse to actively perform a service check. This is the reason why we don't want to disable active checks on the central server. Instead, we just set the normal check interval for all services to a very long time period.
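The core of the watchdog idea is a simple freshness test. The sketch below is not pscwatch's actual code. It only illustrates the comparison described above, with a hypothetical function name and an arbitrary 10-minute threshold, and it prints a message rather than writing anything to the external command file.

```shell
#!/bin/sh
# Sketch of the staleness test a watchdog could apply. This is NOT the
# pscwatch implementation; needs_active_check and the 600-second
# threshold are assumptions made for illustration.

needs_active_check() {
    # $1 = time of last check result (epoch seconds)
    # $2 = current time (epoch seconds)
    # $3 = freshness threshold in seconds
    [ $(( $2 - $1 )) -gt "$3" ]
}

now=$(date +%s)
last=$(( now - 900 ))     # pretend the last result arrived 15 minutes ago

if needs_active_check "$last" "$now" 600; then
    echo "stale: an immediate active check should be requested"
else
    echo "fresh: nothing to do"
fi
```

Whatever threshold is chosen should comfortably exceed the interval at which the distributed servers normally report, otherwise the central server ends up doing the active checks it was meant to avoid.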
Combining Distributed Monitoring With Redundancy
Nothing here yet...