Tuesday, November 29, 2011

Making sure you know when a server goes offline


Rumor has it that servers are somewhat important, and bad things may occasionally happen when they go down for an extended period of time.

All joking aside, servers are crucial. Maintaining and monitoring them is a big part of why you are in business. So nothing feels worse than having your customer call to tell you their server has been offline for 30 minutes, when you supposedly have a system in place to alert you in the event of downtime.

Let's make sure it doesn't happen to you.


Plugging the holes is easy: there is one thing to do in the dashboard, and one thing not to do.

To Do:

First, double-click your server in the dashboard to bring up the Edit Server dialog, then make two changes: disable the Offline (Maintenance Mode) option and increase the monitoring frequency to 5 minutes.


Getting rid of the Offline (Maintenance Mode) option will give you an occasional alert when you have intentionally turned a server off... but isn't that worlds better than some unforeseen event powering down the server in a manner seen as intentional and therefore never being reported?

Note: If you are below the capped pricing for the server in question, increasing the frequency to 5-minute monitoring will incur an additional cost. You can leave it at 15-minute monitoring, follow the rest of the steps, and still have a reasonably airtight (albeit slower) notification system.
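
To put rough numbers on the trade-off between the two frequencies, here is a minimal sketch of the worst-case detection delay. It assumes, purely for illustration, that a server is flagged once it misses a set number of consecutive check-ins. The function name and the `missed_checkins_before_alert` parameter are hypothetical and do not describe how the dashboard actually decides that a server is overdue.

```python
def worst_case_detection_minutes(check_interval_min: int,
                                 missed_checkins_before_alert: int = 1) -> int:
    """Upper bound, in minutes, between a failure and the server being flagged.

    Worst case: the server fails immediately after a successful check-in,
    so a full interval passes before the first check-in is missed.
    The alerting rule modeled here is an assumption, not the product's logic.
    """
    if check_interval_min <= 0 or missed_checkins_before_alert < 1:
        raise ValueError("interval must be positive and at least one "
                         "missed check-in is required")
    return check_interval_min * missed_checkins_before_alert


for interval in (5, 15):
    print(f"{interval}-minute monitoring: up to "
          f"{worst_case_detection_minutes(interval)} minutes before detection")
```

Under that assumption, staying on 15-minute monitoring triples the worst-case delay compared with 5-minute monitoring, and any email delivery time comes on top of it.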

Not to Do:

You may have seen this screen when editing a site in the dashboard, and the completist in you may have said "well, here are some fields, so I guess I should go ahead and fill this out..."

In this case, you shouldn't. The rationale of this option is that if the network goes down, which would be confirmed by the inability to ping your site's router, it's an Internet connectivity issue and therefore nothing you can address. Unless you live in certain parts of the world where steady Internet is not available, this mindset probably does not reflect the type of service you hope to provide your customers. Just leave it blank.
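
The effect of that option can be sketched as a small decision function. This is only an illustration of the behavior as described above, not the dashboard's actual implementation; `SiteCheck` and `should_send_offline_alert` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteCheck:
    server_overdue: bool            # server has stopped reporting in
    router_address: Optional[str]   # None means the field was left blank
    router_reachable: bool          # result of pinging the router, if configured


def should_send_offline_alert(check: SiteCheck) -> bool:
    """Return True if an offline alert should go out for this server."""
    if not check.server_overdue:
        return False
    # With a router address filled in, an unreachable router is treated as
    # an Internet connectivity problem and the alert is suppressed.
    if check.router_address is not None and not check.router_reachable:
        return False
    return True


# Whole site offline, router address filled in: the alert is suppressed.
print(should_send_offline_alert(SiteCheck(True, "203.0.113.1", False)))
# Same outage, field left blank: the alert goes out.
print(should_send_offline_alert(SiteCheck(True, None, False)))
```

The second case is the one you want: with the field blank, a site-wide outage still produces an alert.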

While this covers the areas to address regarding configuration, you should also verify the signal flow for the alert messages themselves:
  1. Have you configured your mail templates in the dashboard?
  2. If you are having the alerts addressed from your domain, have you updated your SPF record (if you have one) to include the RemoteManagement servers?
  3. Have you taken steps to ensure your spam filters allow these messages to pass through?
Once you've gone through this list, you are far less likely to get that dreaded call, restoring the natural order of things, where you are the one calling your customers first.
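
For item 2 in the checklist above, a quick sanity check is to look for the relevant include mechanism in your SPF record text. The sketch below works on a record string you have already looked up. The domain `alerts.example.invalid` is a placeholder, not the real RemoteManagement value, which you would need to take from the vendor's documentation.

```python
def spf_includes(record: str, domain: str) -> bool:
    """Return True if an SPF record contains an include: term for `domain`.

    Only inspects the text of the record passed in; it performs no DNS
    lookups and does not follow nested includes or redirects.
    """
    terms = record.strip().split()
    if not terms or terms[0].lower() != "v=spf1":
        return False  # not an SPF record
    target = "include:" + domain.lower()
    for term in terms[1:]:
        # A mechanism may carry a qualifier prefix: +, -, ~ or ?
        if term.lower().lstrip("+-~?") == target:
            return True
    return False


# Placeholder domain for illustration only.
ALERT_SENDER = "alerts.example.invalid"

print(spf_includes("v=spf1 mx include:alerts.example.invalid ~all", ALERT_SENDER))
print(spf_includes("v=spf1 mx ~all", ALERT_SENDER))
```

If the check comes back negative for your real record, the alerts sent from your domain may fail SPF at the receiving end and be filtered before you ever see them.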








7 comments:

  1. If you would like an up-to-the-minute test against your customer's external gateway, Monitis would be great for this!

  2. Is there a way to change these settings for all or a group of servers at once? Is there a way to set the default for new servers so they automatically use these settings?

  3. You can make the higher frequency the default via Installation Templates, but you would need to make the Maintenance Mode change individually at this time.

  4. Is there any way to keep this alert going? If we miss the original "server is offline" alert, when is the notice resent?

    Thanks

  5. Casey - most people want less email, not more. While I understand where you're coming from and added flexibility in Alerting is coming in the product, might I suggest increasing the BREADTH of your message/alert recipients to increase coverage? Add SMS in the console for server alerts ONLY, or change your mail handler to a distribution group for more "eyes" on the issue...

  6. Does anyone know of a way to have the server offline alert sent to a separate email address?

    1. It already does! You can see what gets sent via email under Mail Templates -> Server Monitoring -> Data Overdue Alert.
      (If you have SMS enabled, it sends a "fixed" message via our provider.)
      Some of this has changed this month, so an updated article will be published shortly.
