Once your network grows beyond a few systems, it's time to think about automated monitoring. There are tons of things you can look at, ranging from bandwidth and usage to application latency, but a good place to start is right in the middle: reachability and uptime.
The goal of reachability and uptime monitoring is to ensure that the network and its applications are available when users need them, and that remediation occurs as quickly as possible when there is a problem.
You'll immediately discover some good news when you decide to implement reachability monitoring: you're not the first person to do this. In fact, a slew of outstanding open source and commercial tools are available to help you build the right solution for your network. I can't offer specific advice about which is best, because all excel in different ways. However, I can say that you'll have no problem finding a product that meets your needs because there are so many good choices out there.
No matter which product you choose, though, you'll want to keep in mind some important guidelines in designing your monitoring.
- Mixing network monitoring and application monitoring is a good idea. Many of the tools, concepts, reports and alerting strategies overlap when you're looking at network monitoring versus application and server monitoring. Thus, plan to do all your reachability monitoring at the same time. A single monitoring system is generally capable of handling thousands of systems without breaking much of a sweat, so you can amortize the hardware, software and human resources required to put monitoring in place across multiple functional areas. However, don't insist on doing everything on one platform if that platform doesn't do a good job. For example, we use both a commercial network monitoring package and an open source bandwidth monitoring package for our network monitoring. The commercial package does bandwidth monitoring, but it's clumsy and doesn't give us all the detail we want. Thus, we added a second tool even though there's overlap.
- Device uptime is not the same as application uptime. Using network reachability tools such as "ping" to determine if a system is up isn't a good idea, because systems can respond to pings yet be entirely dysfunctional. As you identify key systems in your network, focus on the applications running on those systems and make sure that each one is running properly. For example, pinging a Web server tells you very little. Even connecting to port 80 of the Web server doesn't help much, because an open port shows only that something is listening. What you want to do is connect to port 80 and retrieve a known document, validating that you're actually getting the document you want. That doesn't tell you that every part of the Web server is running properly, but it tells you a lot more than ping does. Email servers are the same way: ideally, you want to generate a message and send it in using SMTP, then use a protocol such as POP or IMAP to validate that the message was received. This gives you much more end-to-end assurance that things are working properly. For network devices, check metrics within the devices themselves. For example, we look at CPU usage, memory usage, and fan and power supply status in our switches.
- Build a multi-tier alerting strategy. Your network monitoring system should generate reams of reports and display pretty Web pages, but one of its most important functions is alerting in the event of a problem. Step back and build a simple alerting strategy, then be rigorous in your use of templates (or some equivalent feature) to apply these alerts to your devices. Alerting occurs across two axes: first, whom you are going to alert and how; second, what the escalation points for alerting are.
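To make the Web server guideline above concrete, here is a minimal Python sketch of a "retrieve a known document" check. It is an illustration under stated assumptions, not the method of any particular monitoring product: the function name `check_document`, the marker string and the injectable `opener` parameter are all hypothetical, and a real monitoring tool will have its own way of configuring a content check.

```python
import urllib.error
import urllib.request


def check_document(url, expected, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch url and confirm the body contains the expected marker bytes.

    Returns an (ok, detail) tuple. The opener argument is a hypothetical
    hook so the check can be exercised without a live server.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
            body = response.read()
    except (urllib.error.URLError, OSError) as exc:
        # No usable answer at all: DNS failure, refused connection,
        # timeout, or an HTTP error status raised by urlopen.
        return False, f"unreachable: {exc}"
    if status != 200:
        return False, f"unexpected status {status}"
    if expected not in body:
        # The port answered, but not with the document we know should be there.
        return False, "server answered, but the known document did not match"
    return True, "ok"
```

A call such as `check_document("http://www.example.com/status.html", b"monitor-canary")` (both the URL and the marker are placeholders) separates three cases that ping cannot tell apart: the server is unreachable, the server answers with the wrong status, and the server answers with the wrong content.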
The "who and how" of alerting will probably initially use SMTP email, but that won't work if your email server is down, or the network doesn't let the message through. Thus, you should set up an out-of-band alerting method, such as dialing a phone line to touch-tone someone's pager or, preferably, using a wireless modem to send SMS messages to cell phones -- something that completely bypasses the Internet. Not every alert has to go to a person. For example, if a system or application is down for only 30 seconds or so, you may want to send the alert to your central logging server rather than spamming yourself. At our company, we have set up four levels of "who" to alert, ranging from the lowest (a syslog message to a server), through two different types of email messages, all the way up to pager and SMS broadcasts.
Use different escalation policies in alerting as you decide how important an application or device is. For example, you will have some critical devices and applications that should escalate up through your various levels quite quickly. Other devices, such as printers, might fall into the "interesting but not urgent" category, with emailed alerts only after extended downtimes. And you might even have a third category of still less important devices. When I design monitoring systems, I sometimes even have a final category of devices that are being monitored, but which never send alerts -- they simply show up on end-of-month reports.
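One way to picture the two axes together is as a small table of escalation policies, one per device category, that maps how long something has been down to an alert level. The Python sketch below is only an illustration: the category names, the level names (`email_team` and `email_oncall` stand in for the two kinds of email) and every time threshold are assumptions made up for the example, not values from any product or from our own configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    name: str
    # (minimum seconds of downtime, alert level) pairs, in ascending order.
    steps: tuple


# Hypothetical categories and thresholds, for illustration only.
POLICIES = {
    # Critical systems climb through the levels quickly.
    "critical": EscalationPolicy(
        "critical",
        ((0, "syslog"), (60, "email_oncall"), (300, "pager_sms")),
    ),
    # "Interesting but not urgent": email only after extended downtime.
    "routine": EscalationPolicy(
        "routine",
        ((0, "syslog"), (3600, "email_team")),
    ),
    # Monitored for end-of-month reports, but never alerts anyone.
    "report_only": EscalationPolicy("report_only", ()),
}


def alert_level(policy, downtime_seconds):
    """Return the highest level reached after downtime_seconds, or None."""
    reached = None
    for threshold, level in policy.steps:
        if downtime_seconds >= threshold:
            reached = level
    return reached
```

With these made-up numbers, a critical system that has been down for 30 seconds only produces a log entry, while the same outage at 10 minutes has reached the pager and SMS level; a printer in the routine category stays at the log level until it has been down for an hour.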
Joel Snyder is a senior partner at Opus One, an IT consulting firm specializing in security and messaging.