High Availability and Monitoring

We monitor the application at several levels to ensure that our services meet acceptable performance standards.

For our platform availability, please see our System Status page for the latest updates. A member of our support staff can respond to your questions and concerns around availability if you submit a support ticket.

Service Level Agreement (SLA)

Our goal for system uptime is 100% each month, outside of scheduled downtime. We normally try to keep scheduled downtime to less than an hour each month.

If we fail to achieve 99.9% uptime, measured monthly, we will issue a pro-rata credit against your monthly subscription fees. A 99.9% target allows roughly 40 to 45 minutes of unscheduled downtime in a month, depending on its length (about 43 minutes in a 30-day month).
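
The downtime implied by an uptime target is simple arithmetic, and it varies with the length of the month. The Python sketch below works it out. The function names are illustrative and are not part of any Administrate tooling, and the sketch does not model how credits are calculated.

```python
from calendar import monthrange


def minutes_in_month(year: int, month: int) -> int:
    """Total number of minutes in the given calendar month."""
    days = monthrange(year, month)[1]
    return days * 24 * 60


def downtime_budget_minutes(year: int, month: int, target: float = 0.999) -> float:
    """Unscheduled downtime (in minutes) permitted at the given uptime target."""
    return minutes_in_month(year, month) * (1 - target)


def uptime_met(year: int, month: int, downtime_minutes: float, target: float = 0.999) -> bool:
    """True if the measured uptime for the month reaches the target."""
    total = minutes_in_month(year, month)
    return (total - downtime_minutes) / total >= target


# A 28-day month allows about 40.3 minutes; a 30-day month about 43.2 minutes.
print(round(downtime_budget_minutes(2023, 2), 1))
print(round(downtime_budget_minutes(2023, 4), 1))
```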

Notifications

We provide a public operational service status page that documents our historical uptime and provides information in the event of a service disruption. Unusual or degraded operations are also reported via our operations Twitter account: @Adm1nistrateOPS

Redundancy

Administrate’s infrastructure is built on Amazon Web Services (AWS). We operate our servers in Ireland, in the EU. We have architected our system so that it is deployed across multiple availability zones, which are geographically separate data centers managed by AWS.

Usage of our platform is automatically load-balanced across the separate availability zones, allowing us to offer a very high level of availability. Data is replicated across availability zones in near real-time, with updates to one database appearing in the others in just a few milliseconds.

If Amazon were to experience a physical problem in one of the availability zones, our platform would continue to operate in the other zones, and our self-healing architecture would automatically bring additional resources online within a few seconds to compensate. If an incident were to affect the primary database, a replica would be promoted to primary; this is automatic and transparent to our customers.

Disaster Recovery

We regularly test our disaster recovery responses to ensure that a situation beyond our ability to self-heal can be resolved quickly and safely by our team. Each test is treated as though it were a real incident, and we use the outputs to improve and tune our processes.

Recovery Time Objective: 36 hours
Recovery Point Objective: 1 hour
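
As an illustration of how these two objectives are read, the sketch below checks a recovery exercise against them. The 36-hour and 1-hour values come from the figures above; the function, its parameter names and the timestamps are hypothetical, and this is not a description of Administrate's actual drill tooling.

```python
from datetime import datetime, timedelta

RECOVERY_TIME_OBJECTIVE = timedelta(hours=36)  # maximum time to restore service
RECOVERY_POINT_OBJECTIVE = timedelta(hours=1)  # maximum window of lost data


def evaluate_recovery(last_recoverable_point: datetime,
                      incident_start: datetime,
                      service_restored: datetime) -> dict:
    """Compare one recovery (real or drill) with the stated objectives."""
    data_loss_window = incident_start - last_recoverable_point
    recovery_duration = service_restored - incident_start
    return {
        "rpo_met": data_loss_window <= RECOVERY_POINT_OBJECTIVE,
        "rto_met": recovery_duration <= RECOVERY_TIME_OBJECTIVE,
    }


# Hypothetical drill: data recoverable up to 20 minutes before the incident,
# service restored 5 hours after it began.
drill = evaluate_recovery(
    last_recoverable_point=datetime(2024, 1, 10, 8, 40),
    incident_start=datetime(2024, 1, 10, 9, 0),
    service_restored=datetime(2024, 1, 10, 14, 0),
)
print(drill)
```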

Also see information on Backups.

Service Level Objectives

Administrate delivers five services within the broader solution offering. Each of these is critical to our customers.

  • External website (www.getadministrate.com)
  • Web application ([client-shortcode].administrateapp.com)
  • Student Portal ([client-shortcode].administratelms.com)
  • Public API (described at https://developer.getadministrate.com)
  • Communications Infrastructure (SMS, Emails, Job Queues, etc.)

All five services are part of the Administrate solution and fall within our SLA (with the understanding that our website integrations rely on customer infrastructure to operate correctly).

Monitoring and Response

Administrate employs a defense-in-depth approach to monitoring of its platform. We monitor the health of each running process, the components, the services and the entire stack. We monitor the performance, the throughput, the CPU load, the bandwidth and the error rates.

If our automated monitoring sees something operating outside of the normal bounds, we will either automatically repair the problem, or automatically alert an on-call engineer who is responsible for getting things back on track, following our incident management process.
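
The decision described above can be sketched roughly as follows. The metric names, bounds and the split between repairable and non-repairable problems are all assumptions made for illustration; they are not Administrate's real thresholds or tooling.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Bound:
    low: float
    high: float
    auto_repairable: bool  # assumption: some faults can be fixed without a human


# Hypothetical bounds, for illustration only.
BOUNDS: Dict[str, Bound] = {
    "cpu_load_percent": Bound(low=0.0, high=85.0, auto_repairable=True),
    "error_rate_percent": Bound(low=0.0, high=1.0, auto_repairable=False),
}


def triage(metric: str, value: float, bounds: Dict[str, Bound] = BOUNDS) -> str:
    """Return 'ok', 'auto_repair' or 'alert_on_call' for one measurement."""
    bound = bounds[metric]
    if bound.low <= value <= bound.high:
        return "ok"
    return "auto_repair" if bound.auto_repairable else "alert_on_call"
```

In this sketch, a CPU load outside its bounds is handled automatically, while an elevated error rate is routed to the on-call engineer.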

Self Healing

Each component of Administrate’s platform comprises a collection of identical running processes, load balanced across geographically remote locations. Each process answers a “health check” call multiple times per minute. If a process does not appear healthy, or does not respond quickly enough, it is automatically removed from the load balancer, and the incoming work is shared amongst the remaining processes. The failed process is automatically replaced with a new one.
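
A minimal sketch of this replace-on-failure loop is shown below. It assumes each process is represented by a callable health check and that a hypothetical `spawn` function starts a replacement. The timeout value is invented for the example, and a real load balancer would enforce the timeout itself rather than timing the call after the fact.

```python
import time
from typing import Callable, List

HealthCheck = Callable[[], bool]

HEALTH_TIMEOUT_SECONDS = 0.5  # hypothetical limit for a health check response


def is_healthy(check: HealthCheck, timeout: float = HEALTH_TIMEOUT_SECONDS) -> bool:
    """Healthy means the check passes, does not raise, and answers in time."""
    started = time.monotonic()
    try:
        passed = bool(check())
    except Exception:
        return False
    elapsed = time.monotonic() - started
    return passed and elapsed <= timeout


def heal(pool: List[HealthCheck],
         spawn: Callable[[], HealthCheck],
         timeout: float = HEALTH_TIMEOUT_SECONDS) -> List[HealthCheck]:
    """Drop unhealthy processes and start replacements so the pool size is unchanged."""
    healthy = [check for check in pool if is_healthy(check, timeout)]
    replacements = [spawn() for _ in range(len(pool) - len(healthy))]
    return healthy + replacements
```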

Geographically Remote Monitoring

We monitor the responsiveness of the Administrate platform using a third-party service that tests the components of our system every minute from over one hundred sites across the globe. By using a geographically dispersed monitoring service, we are able to measure how the system behaves for all of our worldwide customers.
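
To show how results from many probe sites might be combined into a single view, here is a small sketch. The `ProbeResult` shape, the site names and the summary fields are hypothetical; the third-party service's real data format is not described here.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass(frozen=True)
class ProbeResult:
    site: str          # hypothetical name of the probe location
    ok: bool           # whether the check succeeded from that site
    latency_ms: float  # response time observed from that site


def summarize(results: List[ProbeResult]) -> dict:
    """Summarize one round of probes taken from many locations."""
    if not results:
        raise ValueError("at least one probe result is required")
    failing = sorted(r.site for r in results if not r.ok)
    latencies = [r.latency_ms for r in results if r.ok]
    return {
        "success_ratio": (len(results) - len(failing)) / len(results),
        "median_latency_ms": median(latencies) if latencies else None,
        "failing_sites": failing,
    }


round_of_probes = [
    ProbeResult("dublin", True, 100.0),
    ProbeResult("new-york", True, 200.0),
    ProbeResult("singapore", True, 300.0),
    ProbeResult("sydney", False, 0.0),
]
summary = summarize(round_of_probes)
```

A failure seen from a single site suggests a regional network problem, whereas failures from most sites at once point to the platform itself.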

Alerting

Administrate has several engineers who are trained to deal with unexpected behavior in the system. They work on a rotational basis, so that someone is always on call 24/7, and they frequently test our incident response playbooks in “fire drills”. Our incident response process includes a route for the on-call engineer to escalate the problem and call on others to aid in the resolution.

If one of our monitoring tools were to detect a fault, it would automatically alert the on-call engineer and a member of the Customer Support team.

Instrumentation

Administrate deploys instrumentation technology inside each component of the platform. This allows us to understand how the code is operating in production, and to pinpoint areas that can be optimized to give a better experience for our customers. Our instrumentation measures throughput, response times and server loads, allowing us to build a holistic picture of our system as it responds to external usage.
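
As a rough illustration of in-process instrumentation, the sketch below wraps a function to record how often it is called, how often it fails and how long each call takes. The `Metrics` class and its method names are invented for this example and do not represent the instrumentation product Administrate uses.

```python
import time
from collections import defaultdict
from functools import wraps


class Metrics:
    """Collects call counts, error counts and durations per operation name."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)
        self.durations = defaultdict(list)

    def instrument(self, name):
        """Decorator that records every call to the wrapped function under `name`."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                started = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                except Exception:
                    self.errors[name] += 1
                    raise  # the caller still sees the original error
                finally:
                    self.calls[name] += 1
                    self.durations[name].append(time.perf_counter() - started)
            return wrapper
        return decorator

    def mean_duration(self, name):
        """Average response time in seconds, or None if nothing was recorded."""
        samples = self.durations[name]
        return sum(samples) / len(samples) if samples else None
```

Dividing the call count by the length of the observation window gives throughput, and the recorded durations give the response-time distribution.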