Updated: see the notes at the bottom of this post.
To be clear, this is not an application to monitor App Engine itself, but rather one that monitors other servers, web servers specifically, from the (hopefully) reliable App Engine infrastructure. If you aren’t familiar with it, Google’s App Engine is a true cloud computing platform that hosts applications directly, rather than hosting a Linux instance on which you then run your applications. While I’m sure the foundation is probably Linux, you won’t ever see it because it is abstracted away. App Engine is completely free to use for apps like this with very low resource requirements, and if an application does require more resources than Google’s fairly generous free allocation, they will let you pay for more.
The driving force for writing this bit of code was twofold:
First, there are plenty of monitoring services available, but most of them require payment if you want anything finer than one-hour granularity. Most of them will also only tell you that a server was unavailable; they won’t tell you why, or give you much in the way of notification choices. As I love the Prowl push notification system on my iPod, support for it was essential, and that required writing something myself.
Second, writing this code was a great exercise that will help when I write other, larger Python web applications using App Engine, Django or other web application frameworks.
“Major” features to note:
- Monitors an arbitrary list of web servers entered by the administrator
- Supports both SSL (port 443) and non-SSL (port 80)
- Keeps track of current uptime, and displays it to the administrator or to everyone
- Stores the last HTTP response code returned by each server
- Notifies the administrator of an event via email or Prowl
- Reports error code 500 to the administrator
- Reports unreachable servers
In the future:
- Notifications via Facebook, SMS, and Twitter
- Integration of libcloud (already in the git repo) to automate actions based on events
- AJAX interface
- More stuff :)
If you require monitoring and notification of this sort, I assume you can probably figure out how to get App Engine going and upload this code to your own App Engine account; tutorials for this are available all over the web. I will, however, note that the URL “aeservmon.appspot.com” is already taken ;) You will need to change the app name in app.yaml in order to serve it from your own App Engine account.
I would provide this as a free service on my App Engine account, but the CPU requirements shoot up pretty fast with each additional server added to the monitor, so it would quickly be shut down by Google for using too many CPU resources during each server check interval. I advise adding only 10-12 servers to each installation; adding more may require changing the code to reduce resource usage. The intended audience is system administrators with only a few servers to watch, so it may not be a problem for most of you.
Now on to the code……….
While development on this little app is not finished and I have much more to add to it, it is functional and reliable in my testing, so I have decided to open source it in its current state and publish it on GitHub.
Notification methods that have been implemented and tested include Email and Prowl. Both work out of the box, but Prowl requires you to enter your Prowl API key in the admin interface. If you enter an invalid API key, the interface will show an error icon next to the key. Twitter support is in the codebase but currently disabled, as the Twitter module I was using appears to require temporary files, which Google does not allow. I may hack around in the module to remove that requirement or just implement Twitter notifications myself. Facebook and SMS notifications are also being worked on.
When you log in to the admin interface and add a server, the email account you are logged in with is recorded in the database for that server entry, and this address is used to email you if you select the email notification method. Google will only allow email to be sent from an address that belongs to an administrator of the app or to the logged-in user, so setting the address automatically avoids any problems. It may be possible to use an external Python module to send email, and this MIGHT remove the limitation.
Due to limitations of the App Engine system, checking and uptime recording can only be done at intervals of 1 minute or more, but this is perfectly acceptable for most situations (certainly better than 1 hour). The checking is done by requesting a specific URL once per minute using the App Engine cron system. The code behind that URL takes care of updating the database in which status and uptime are recorded for each server, and of notifying you of any events if necessary. If you wish to change the checking interval, change the schedule in cron.yaml.
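To make the per-minute check a little more concrete, here is a minimal sketch of what the code behind that URL might do for a single server. It is not the code from the repository: it uses the Python 3 standard library instead of App Engine's urlfetch so it can run anywhere, and the names ServerRecord, fetch_status and check_server are made up for illustration. Treating only "no response" and HTTP 500 as events, and resetting uptime on both, are assumptions too.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Optional


@dataclass
class ServerRecord:
    """Hypothetical stand-in for one stored server entry."""
    host: str
    use_ssl: bool = False
    last_status: Optional[int] = None     # last HTTP code, None if unreachable
    up_since: Optional[datetime] = None   # start of the current uptime streak


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if nothing answered."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code   # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None       # DNS failure, refused connection, timeout, ...


def check_server(
    server: ServerRecord,
    now: datetime,
    fetch: Callable[[str], Optional[int]] = fetch_status,
) -> Optional[str]:
    """Update one record; return an event message if one should be reported."""
    scheme = "https" if server.use_ssl else "http"
    status = fetch("%s://%s/" % (scheme, server.host))
    server.last_status = status
    if status is None:
        server.up_since = None   # assumption: an outage resets uptime
        return "%s is unreachable" % server.host
    if status == 500:
        server.up_since = None   # assumption: a 500 also resets uptime
        return "%s returned HTTP 500" % server.host
    if server.up_since is None:
        server.up_since = now    # a new uptime streak starts here
    return None
```

The fetch argument exists so the check can be exercised with a fake fetcher, for example check_server(record, datetime.now(), fetch=lambda url: 500), without touching the network. A generous timeout matters here, since a slow but healthy server would otherwise be reported as down.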
Each time a notification is sent out, a hold flag is set, so that you don’t receive a flood of notifications every minute. A maintenance script runs every 20 minutes to release the hold, after which time you will receive another notification if the server is still down.
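The hold logic can be sketched roughly like this. Again, this is an illustration rather than the code in the repository: MonitoredServer, notify_if_needed and release_holds are hypothetical names, and send stands in for whichever notification method (email or Prowl) is configured.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class MonitoredServer:
    """Hypothetical server entry carrying the notification hold flag."""
    host: str
    is_down: bool = False
    notify_hold: bool = False


def notify_if_needed(server: MonitoredServer, send: Callable[[str], None]) -> bool:
    """Called from the per-minute check; sends at most one message per hold."""
    if not server.is_down or server.notify_hold:
        return False
    send("%s is down" % server.host)
    server.notify_hold = True   # suppress repeats until the hold is released
    return True


def release_holds(servers: Iterable[MonitoredServer]) -> None:
    """Called by the maintenance job (every 20 minutes in the post)."""
    for server in servers:
        server.notify_hold = False
```

With this arrangement a server that stays down produces one notification, then silence until the maintenance job clears the flag, then one more notification on the next check.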
The admin interface is simple, and was built with the templating system on App Engine (which is derived from Django). There is a separate CSS file, some forms, and the rest is dynamically generated using variables. In the future I may build out an AJAX interface on top of the basic forms, but it is not a high priority since in normal use you will never see the interface unless you are adding or removing servers. The interface does display uptime for each server, and in the future I may add support for graphing uptime records.
By default only the administration panel is restricted by a login. If you wish to also restrict the main page (the one you can see in the first screenshot), you can add “login: admin” to the main page entry in app.yaml (it should be the last one); use the other entries as an example. If you wish to restrict access only to authenticated users (meaning anyone with a Google account, or potentially anyone who is part of a Google Apps domain), use “login: required” instead and consult the App Engine Python documentation for details.
As stated above, I intend to integrate libcloud (it’s already included in the codebase on GitHub), which is a Python module allowing remote control of various hosting services such as Linode, Slicehost and others. Primarily, I plan to implement remote automated reboot, for instance if one of our Linodes goes down (meaning NO response) for more than 30 minutes. Normally I would get a notification via Prowl when there is an outage and take care of the problem, but sometimes that isn’t possible, so it would be nice to know that at least some action is being taken automatically.
As this is one of the first Python applications I have written, there are probably a few bugs I have not found yet, particularly in handling unusual situations and exceptions (though there is some exception handling included to solve the most obvious problems).
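The reboot rule itself is simple enough to sketch, leaving the actual libcloud call out since none of that is written yet. The function name should_reboot and the idea of storing a down_since timestamp are assumptions made for this example only.

```python
from datetime import datetime, timedelta
from typing import Optional

# Threshold from the post: no response at all for more than 30 minutes.
REBOOT_AFTER = timedelta(minutes=30)


def should_reboot(
    down_since: Optional[datetime],
    last_status: Optional[int],
    now: datetime,
) -> bool:
    """Decide whether an automated reboot would be triggered."""
    if down_since is None:
        return False   # the server is not currently considered down
    if last_status is not None:
        return False   # it answered with some HTTP code, so it is not dead
    return now - down_since > REBOOT_AFTER
```

A server that returns HTTP 500 is deliberately left alone here, because it is still responding and a reboot may not be the right fix.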
We’re actually using this in production right now to monitor servers, and while I can’t speak for everyone, I am certainly glad to have a monitoring system that works the way I want it to work, essentially for free. Hopefully you will find it useful too :)
December 9th, 2009 – Updated the codebase to support multiple notification methods per server, so you can get notifications via email AND Prowl to ensure you see one of them. Also updated the templates a bit, and added lines in the model to support eventual Facebook, Twitter and SMS notifications. I did fix some of the space/tab mixing in a few files, so if you’ve forked the project or altered it locally in your git repo, it may not merge cleanly.
January 3rd, 2010 – Updated the code to hopefully fix a cache issue with urlfetch. Credit goes to Greg Sheremeta, thanks! :)
January 21st, 2010 – Updated checkservers.py to add a longer deadline for urlfetch; it was timing out for servers that were still online and sending false downtime notifications.