Since I started working at DERI in January of this year, one of my jobs has been to improve the Sindice infrastructure to make it more robust and fault-tolerant. The Sindice infrastructure consists of a total of 15 servers (including two small Hadoop clusters). We’ve taken various actions to improve availability, including adopting a basic release process, production-hardening the code, monitoring individual subsystems and tuning systems which exhibited higher failure rates. We’ve been using Zabbix as our monitoring framework since April. It indicates that the overall availability of Sindice’s search functionality since 20-Apr has been 94.2%. This means that Sindice has been unavailable for a total of about 12 days. For a system built on bleeding-edge components and developed as a set of research projects first, and a coherent production infrastructure second, this isn’t bad, but we are continuously working on improving it (as to whether we’ll ever achieve the mythical five nines, that’s a discussion for another day). About half of that downtime is accounted for by planned outages for system upgrades and infrastructure maintenance.
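
For anyone who wants to check the arithmetic behind those figures, here is a minimal sketch in Python. The function names are made up for illustration, and the 210-day measurement window is my own assumption (roughly late April to late November); the exact window Zabbix reports on may differ slightly.

```python
def downtime_days(availability_pct: float, window_days: float) -> float:
    """Days of unavailability implied by an availability percentage
    measured over a window of the given length (in days)."""
    return window_days * (1 - availability_pct / 100)


def availability_for_nines(nines: int) -> float:
    """Availability percentage for a given number of nines,
    e.g. 2 nines is 99%, 3 nines is 99.9%."""
    return 100 * (1 - 10 ** -nines)


if __name__ == "__main__":
    window = 210  # assumed window length in days, not a measured value

    # Downtime implied by the 94.2% availability figure
    print(f"94.2%: {downtime_days(94.2, window):.1f} days down")

    # Downtime budget for two, three and five nines over the same window
    for n in (2, 3, 5):
        target = availability_for_nines(n)
        hours = downtime_days(target, window) * 24
        print(f"{n} nines ({target:.3f}%): {hours:.2f} hours down")
```

With the assumed 210-day window, 94.2% works out to roughly 12 days of downtime, which matches the figure above. The same function shows how little room the higher targets leave: two nines allows about two days of downtime over that window, and three nines only about five hours.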

The most recent outage we suffered (for about 2.5 days) started on 18-Nov, and its cause was somewhat more serious than a component or server failure. Those of you from Ireland will be aware that after very heavy rain, the country experienced severe flooding in various areas. Unfortunately, this excessive rainfall caused part of the ground floor of the DERI building to flood. Thanks to the quick response from DERI staff members and NUI Galway facilities people, all systems in our data centre were shut down before any damage was caused (water and electricity don’t mix all that well). The Sindice infrastructure is entirely located in the DERI building at this time (while we share some facilities with the main NUI Galway data centre, we don’t currently have a fully replicated infrastructure for the Sindice project). Once the cause of the flooding had been addressed and the data centre had been fully dried out, we took some time to verify that all of the electrical infrastructure was intact before we proceeded to restart the Sindice systems on Friday morning. I’m happy to report that all systems came back up without problems, and Sindice and its related projects resumed operations.

Obviously, we’ve learned some lessons from this outage. We’re currently identifying various measures we can put in place to avoid similar problems in the future, and we’re also taking the opportunity to review the overall disaster recovery and business continuity measures we have in place for DERI and the services it provides, such as Sindice. As a result of this incident, we believe we’ll be able to deliver a more robust and reliable service and maybe even reach two or three nines availability for Sindice!

