Release Management/Backup and availability strategy
Introduction
The Release Management team primarily uses Amazon EC2/S3 for its servers and storage. This document describes a plan to ensure no data is lost and to improve the recovery procedures for disasters of various severities.
Objectives
Backups:
- Data is never lost. We can tolerate some minutes of downtime, but we cannot accept any data loss.
- Do not depend on a single AWS availability zone, so that if one has a problem we can quickly switch to another.
- Do not depend on a single AWS region, so that if one has serious problems we can switch to another.
- Do not rely entirely on AWS, so that we also have our data in another data center.
- Do not rely only on AWS plus the backup data center, so that in case of a total disaster affecting both we do not lose everything.
- Isolate the two data centers from each other. That is, we should follow a pull strategy in which the second data center has read-only access to the S3 backups, and the AWS side never pushes anything to the second data center (a policy sketch follows this list).
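As an illustration of this pull-only isolation, here is a minimal sketch using boto3 that grants a second data center's credentials read-only access to the backup bucket. The bucket name and IAM user ARN are hypothetical, not our real resources.

    import json
    import boto3

    s3 = boto3.client("s3")

    BUCKET = "rm-backups"  # hypothetical backup bucket
    PULLER_ARN = "arn:aws:iam::123456789012:user/backup-puller"  # hypothetical

    # Read-only policy: the second data center can list and fetch the
    # backups, but can never write to or delete from the bucket, so a
    # compromise on either side cannot corrupt the other's copy.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "ReadOnlyPull",
            "Effect": "Allow",
            "Principal": {"AWS": PULLER_ARN},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{BUCKET}",
                f"arn:aws:s3:::{BUCKET}/*",
            ],
        }],
    }

    s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))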
Availability. We must be able to recover any instance in the following cases:
- The hardware in which the instance is running is degraded. AWS notifies us about this.
- We have lost access to the machine because its network or SSH service is down and a reboot does not solve it.
- We have data loss in an EBS volume.
- Our usual availability zone has problems.
- Our usual region has problems.
Persistent storage
We can divide the files of our Linux systems into 2 groups:
- Dynamic: critical data for the correct operation of the service provided in a machine. Examples: the database in issues.openbravo.com, the Mercurial repositories in code.openbravo.com, configuration files, etc.
- Static: the rest of the operating system. This includes binaries, temporary directories, etc.
The static part usually lives in the root partition. Previously this root partition was ephemeral (instance store); now, with EBS boot, we have persistence for the root partition as well. Taking regular snapshots is the best way to back it up.
The integration between the root partition and the EBS volumes should be seamless. That is, a reboot should not affect it at all, and even starting a new instance of the AMI should mount all the EBS volumes that have previously been attached to the AMI.
Goals: move all the dynamic, critical data to EBS, and make new instance creation and volume attachment automatic (a sketch follows).
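A minimal sketch of that automation using boto3 is shown below; the AMI ID, volume IDs and device names are hypothetical placeholders for our real machines.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Hypothetical identifiers for one of our machines.
    AMI_ID = "ami-0123456789abcdef0"
    DATA_VOLUMES = {"vol-0123456789abcdef0": "/dev/sdf"}  # dynamic data on EBS

    # Launch a fresh instance from the machine's AMI (the static part).
    res = ec2.run_instances(ImageId=AMI_ID, InstanceType="m1.large",
                            MinCount=1, MaxCount=1)
    instance_id = res["Instances"][0]["InstanceId"]

    # Wait until the instance is running before attaching volumes.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # Attach every dynamic-data volume at its usual device name.
    for volume_id, device in DATA_VOLUMES.items():
        ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id,
                          Device=device)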
Continuous monitoring
The first thing you need to recover a machine is to know that it needs to be recovered, so we need to continuously monitor all the critical services of our machines.
We first need to define what to monitor on each machine. Then we can select the most appropriate tool for the task (Nagios, Monit).
Ideally the monitoring machine should be placed in a data center other than Amazon EC2, to provide more realistic monitoring.
Goals: monitor all the services, define what needs to be monitored per machine, and set it up in a different data center (a probe sketch follows).
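Whichever tool we choose, the core of external monitoring is a probe that checks our services from outside EC2. The sketch below, with an illustrative host/port map, shows the idea in Python; a real setup would delegate this to Nagios or Monit.

    import socket

    # Illustrative host/service map; the probe would run from the
    # monitoring machine outside EC2.
    CHECKS = {
        "issues.openbravo.com": [22, 443],
        "code.openbravo.com": [22, 80],
    }

    def port_open(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for host, ports in CHECKS.items():
        for port in ports:
            status = "OK" if port_open(host, port) else "DOWN"
            print(f"{host}:{port} {status}")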
Instance recovery
We must be able to recover a machine when a disaster happens. There are different levels of problems, namely:
- If an instance is degraded, the recovery procedure should be automated; we should be able to trigger it through a single command.
- If our availability zone goes down, we must also be able to recover the instance in another zone automatically, with a command.
- If our EC2 region goes down, the same applies.
- If EC2 goes down entirely (unlikely), we should be able to recover everything within 3 days in another data center.
Requirements for this to happen:
- Use Elastic IPs (a re-association sketch follows this list).
- Register these IPs by creating CNAMEs to the EC2 reverse DNS name.
- Use EBS for all the dynamic parts.
- Automate the process of instance startup and EBS attachment.
- Make backups to another availability zone, region and data center.
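The Elastic IP is what makes a recovered instance transparent to users: once a replacement instance is up (as in the startup sketch above), the address is re-pointed and the CNAMEs keep resolving. A minimal sketch with hypothetical identifiers:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Hypothetical values: the machine's Elastic IP allocation and the
    # replacement instance launched from its AMI.
    ALLOCATION_ID = "eipalloc-0123456789abcdef0"
    NEW_INSTANCE_ID = "i-0123456789abcdef0"

    # Re-point the Elastic IP at the replacement instance; the CNAMEs
    # registered against the EC2 DNS name keep working unchanged.
    ec2.associate_address(AllocationId=ALLOCATION_ID,
                          InstanceId=NEW_INSTANCE_ID)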
Off-site backups
As we do not want to depend entirely on EC2/S3, we will do backups into another data center. The candidate options are:
- Diomede Storage: 3 stages of backups, a nice command line interface, low power consumption, reasonable prices.
- Rackspace Cloud Files: a nice command line interface and a good offering.
Both offer nearly unlimited storage at reasonable prices (Diomede Storage is cheaper) and open source APIs. The advantage of Rackspace is that it is also a cloud hosting solution, so we could host our monitoring machine with this provider. It is also known for its excellent customer support.
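In line with the pull strategy from the objectives, the transfer should be initiated from the second data center using the read-only credentials. Below is a minimal sketch of the pull side with boto3 (bucket name and staging directory are hypothetical); the upload into Rackspace Cloud Files would then use that provider's own API or CLI.

    import os
    import boto3

    # Runs in the second data center with the read-only credentials
    # from the pull policy above; AWS never initiates this transfer.
    s3 = boto3.client("s3")  # credentials taken from environment/config

    BUCKET = "rm-backups"     # hypothetical backup bucket
    DEST_DIR = "/backups/s3"  # local staging area

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip folder placeholder objects
                continue
            target = os.path.join(DEST_DIR, key)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(BUCKET, key, target)
            # From here the file can be pushed into Rackspace Cloud
            # Files with the provider's tooling.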
Physical backups
We do not want to depend entirely on any data center, so we prefer to have physical backups available locally at the Openbravo headquarters from time to time. Amazon AWS offers this option through its Import/Export service.
Deliverables
These are the proposed steps to be taken:
- Make sure all the dynamic and critical data is stored in an EBS volume. Not most of it, but 100%. DONE.
- Automate the taking of snapshots of those EBS volumes, and define the frequency for each of them. DONE.
- Make it possible to automatically start a new instance of any machine, with the EBS volumes automatically attached (ideally selectable at launch time). ???
- Set up monitoring: define which services we want to monitor, select the right tool and set it up. The monitoring will be placed in a different data center, in principle in the Rackspace Cloud.
- Do very frequent backups of the critical data into S3. PARTIALLY DONE: A ONE-TIME BACKUP FOR THE PHYSICAL BACKUP IS READY.
- Use Elastic IPs on all the critical machines, and register the DNS names as CNAME aliases.
- Replicate the following data outside its usual location:
- EBS volumes
- Frequent S3 backups
- Any other backups in S3
- Transfer the data described above weekly into another Availability Zone.
- Transfer the data described above bi-weekly into another EC2/S3 region (see the cross-region copy sketch after this list).
- Transfer the EC2/S3 data bi-weekly into Rackspace Cloud Files. STARTED.
- Every 3 months, use AWS Import/Export to send us physical backups.
- Share our strategy with the internal IT team in order to unify efforts and benefit from each other's findings.
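For the bi-weekly off-region transfer, EBS snapshots can be copied between regions with the EC2 CopySnapshot API. A minimal sketch follows (region names are illustrative; it naively copies every snapshot we own, whereas a real job would filter to the latest ones):

    import boto3

    SOURCE_REGION = "us-east-1"  # illustrative region choices
    DEST_REGION = "eu-west-1"

    src = boto3.client("ec2", region_name=SOURCE_REGION)
    dst = boto3.client("ec2", region_name=DEST_REGION)

    # CopySnapshot is called on the destination region's endpoint.
    snapshots = src.describe_snapshots(OwnerIds=["self"])["Snapshots"]
    for snap in snapshots:
        dst.copy_snapshot(SourceRegion=SOURCE_REGION,
                          SourceSnapshotId=snap["SnapshotId"],
                          Description="bi-weekly off-region copy")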
What next in Amazon AWS
- Delete unwanted snapshots.
- Delete unwanted volumes.
- Stop unused instances and free the associated resources.
- Run a daily cron job to create snapshots and remove the old ones (see the sketch after this list).
- In addition to the above, back up the critical data to another data center/backup service.
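A minimal sketch of such a daily cron job in Python with boto3, assuming hypothetical volume IDs and a 14-day retention window:

    import datetime
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    VOLUMES = ["vol-0123456789abcdef0"]  # hypothetical dynamic-data volumes
    RETENTION_DAYS = 14

    now = datetime.datetime.now(datetime.timezone.utc)

    for volume_id in VOLUMES:
        # Create today's snapshot of the volume.
        ec2.create_snapshot(VolumeId=volume_id,
                            Description=f"daily backup {now:%Y-%m-%d}")

        # Remove this volume's snapshots older than the retention window.
        snaps = ec2.describe_snapshots(
            OwnerIds=["self"],
            Filters=[{"Name": "volume-id", "Values": [volume_id]}],
        )["Snapshots"]
        for snap in snaps:
            if (now - snap["StartTime"]).days > RETENTION_DAYS:
                ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])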