How Google implements disaster recovery for Apps

With a well-crafted message, it’s easy to persuade a potential customer of the benefits of running their IT infrastructure in a public cloud rather than on-premises.
A typical argument, for example, is the cost of disaster recovery.

In March, Google published a post on the subject that explains two metrics: the Recovery Point Objective (RPO), or how much data you’re willing to lose when things go wrong, and the Recovery Time Objective (RTO), or how long you’re willing to go without service after a disaster.
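To make the two definitions concrete, here is a minimal Python sketch that measures both quantities for a single incident and checks them against targets. The function names, the timestamps and the one-hour targets are illustrative assumptions, not something taken from Google’s post.

```python
from datetime import datetime, timedelta


def measure_recovery(last_replicated, failure, service_restored):
    """Return (data_loss_window, downtime) for one incident.

    data_loss_window: work done after the last replicated copy is lost,
                      so this is the quantity an RPO puts a ceiling on.
    downtime:         time without service, the quantity an RTO bounds.
    """
    data_loss_window = failure - last_replicated
    downtime = service_restored - failure
    return data_loss_window, downtime


def meets_objectives(data_loss_window, downtime, rpo, rto):
    """True only if both the RPO and the RTO targets were met."""
    return data_loss_window <= rpo and downtime <= rto


# Hypothetical incident: the date and the intervals are arbitrary.
failure = datetime(2000, 1, 1, 12, 0)
loss, down = measure_recovery(
    last_replicated=failure - timedelta(minutes=45),
    failure=failure,
    service_restored=failure + timedelta(minutes=90),
)

one_hour = timedelta(hours=1)
print(loss, down)                                              # 0:45:00 1:30:00
print(meets_objectives(loss, down, rpo=one_hour, rto=one_hour))  # False
```

In this made-up incident the RPO is met (45 minutes of data lost against a one-hour target) but the RTO is not (90 minutes of downtime), so the check fails overall.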

Google compared the RPO, the RTO and the cost of disaster recovery for a traditional on-premises data center with those of its white-label SaaS platform, Google Apps:

In larger businesses, companies will add a storage area network (SAN), which is a consolidated place for all storage. SANs are expensive, and even then, you’re out of luck if your data center goes down. So the largest enterprises will build an entirely new data center somewhere else, with another set of identical mail servers, another SAN and more people to staff them…

…big companies will often build the second data center far away, in a different ‘threat zone’, which creates even more management headaches. Next they need to ensure the primary SAN talks to the backup SAN, so they have to implement robust bandwidth to handle terabytes of data flying back and forth without crippling their network.

For a large enterprise running SANs, the RTO and RPO targets are an hour or less: the more you pay, the lower the numbers. That can mean a large company spending the big bucks is willing to lose all the email sent to them for up to an hour after the system goes down, and go without access to email for an hour as well. Enterprises without SANs may be literally trucking tapes back and forth between data centers, so as you can imagine their RPOs and RTOs can stretch into days. As for small businesses, often they just have to start over.

Some companies have adopted synchronous replication as well, but it is even more expensive than everything else we’ve mentioned. To back up 25GB of data with synchronous replication a business may easily pay from $150 to $500+ in storage and maintenance costs, and that’s per employee. That doesn’t even include the cost of the applications. The exact price depends on a number of factors such as the number of times the data is replicated and the choice of service provider.

At the low end a company might tier the number of times they replicate data, and at the high end they’ll make several copies of the data for everyone.
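The per-employee figures quoted above scale linearly with headcount. The sketch below does that arithmetic, assuming a hypothetical company of 1,000 employees; the function name and the headcount are illustrative. The quoted post does not state a billing period, so none is assumed here.

```python
def replication_cost_range(employees, low_per_employee=150, high_per_employee=500):
    """Scale the quoted $150 to $500+ per-employee range to a whole company.

    The upper figure is a floor on the high end ("$500+"), not a cap.
    """
    return employees * low_per_employee, employees * high_per_employee


# Hypothetical headcount, chosen only for illustration.
low, high = replication_cost_range(1000)
print(f"${low:,} to ${high:,}+")  # $150,000 to $500,000+
```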

For Google Apps, by contrast:

…our RPO design target is zero, and our RTO design target is instant failover. We do this through live or synchronous replication: every action you take in Gmail is simultaneously replicated in two data centers at once, so that if one data center fails, we nearly instantly transfer your data over to the other one that’s also been reflecting your actions.

We also replicate all the data multiple times, and the 25GB per employee for Gmail is backed up for free. Plus you get even more disk space for storage-intensive applications like Google Docs, Google Sites and Google Video for business.

…we have very high speed connections between data centers, so that we can transfer data very quickly from one set of servers to another. This lets us replicate large amounts of data simultaneously.
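The synchronous replication described above can be illustrated with a toy model: a write is acknowledged only once both data centers hold it, so a failover finds nothing missing (an RPO of zero). This is a simplified sketch, not Google’s implementation; the class names, the `online` flag and the in-memory logs are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    """Hypothetical stand-in for one data center: a name and a log of actions."""
    name: str
    online: bool = True
    log: list = field(default_factory=list)


class SyncReplicatedStore:
    """Toy synchronous replication across exactly two data centers."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, action):
        # Refuse the write unless both replicas can take it, so the two
        # logs never diverge. A real system needs far more than this
        # (timeouts, retries, re-syncing a replica that comes back).
        replicas = (self.primary, self.secondary)
        if not all(dc.online for dc in replicas):
            raise ConnectionError("synchronous write needs both replicas")
        for dc in replicas:
            dc.log.append(action)

    def fail_over(self):
        # Promote the secondary. No data is copied at this point because
        # the secondary already holds every acknowledged write.
        self.primary, self.secondary = self.secondary, self.primary

    def read_all(self):
        return list(self.primary.log)


a, b = DataCenter("dc-a"), DataCenter("dc-b")
store = SyncReplicatedStore(a, b)
store.write("send: hello")
store.write("archive: msg-1")

a.online = False   # simulate losing the primary data center
store.fail_over()
print(store.read_all())  # ['send: hello', 'archive: msg-1']
```

The trade-off this model makes visible is that every write waits on both sites, which is why the quoted passage stresses very fast links between data centers.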