SDB:Disaster Recovery

tagline: From openSUSE


Situation

You want to be prepared so that, if your system is destroyed, you can recreate it as closely as possible to what it was before, regardless of what exactly was destroyed, from messed-up software or configuration up to broken hardware.

Basics

In this article "disaster recovery" means recreating the basic operating system (i.e. what you initially installed from an openSUSE or SUSE Linux Enterprise install medium).

The basic operating system can be recreated on the same hardware or on compatible replacement hardware so that "bare metal recovery" is possible.

In particular, special third-party applications (e.g. a third-party database system, which often requires special actions to get it installed and set up) must usually be recreated in an additional, separate step.

While your system is up and running

  1. Create a backup of all your files
  2. Create a bootable recovery medium for your system
  3. Have replacement hardware available
  4. Verify that it works to recreate your system on your replacement hardware

After your system was destroyed

  1. If needed: Replace broken hardware with your replacement hardware
  2. Recreate your system with your recovery medium plus your backup

Inappropriate expectations

Words like "just", "simple", "easy" are inappropriate for disaster recovery.

  • Disaster recovery is not "easy".
  • Disaster recovery is not "simple".
  • There is no such thing as a disaster recovery solution that "just works".

Disaster recovery does not just work

Even if you created the recovery medium without an error or warning, there is no guarantee that recreating your system with your recovery medium will work in your particular case.

The basic reason why there is no disaster recovery solution that "just works" is that it is practically impossible to reliably autodetect all information that is needed to recreate a particular system:

  • Information regarding hardware like required kernel modules, kernel parameters,...
  • Information regarding storage like partitioning, filesystems, mount points,...
  • Information regarding bootloader
  • Information regarding network

For example, there is the general problem that it is practically impossible to determine reliably how a running system was actually booted. Imagine that during the initial system installation GRUB was installed in the boot sector of the active partition (like /dev/sda1) and that afterwards LILO was installed manually in the master boot record of the /dev/sda hard disk. Then LILO is actually used to boot the system, but the GRUB installation is still there. Or the bootloader installation on the hard disk may not work at all and the system was actually booted from a removable medium (like a CD or USB stick).

In "sufficiently simple" cases disaster recovery might even "just work".

When it does not work, you can either simplify your system configuration or manually adapt and enhance the disaster recovery framework to make it work for your particular case.

No disaster recovery without testing

You must test in advance that recreating your particular system with your particular recovery medium works in your particular case, that the recreated system can boot on its own, and that the recreated system with all its system services still works as you need it in your particular case.

You must have replacement hardware available on which your system can be recreated and you must try out if it works to recreate your system with your recovery medium on your replacement hardware.

Recommendations

Deployment via recovery installation

After the initial installation from an openSUSE or SUSE Linux Enterprise install medium, set up your system recovery and then reinstall your system via your system recovery for the actual productive deployment.

This way you know that your system recovery works at least on the exact hardware which you use for your production system.

Be prepared for the worst case

Be prepared that your system recovery fails to recreate your system. Be prepared for a manual recreation from scratch. Always have all information available that you need to recreate your particular system manually. Manually recreate your system on your replacement hardware as an exercise.

Details

There are two RPM packages which provide frameworks to recreate the basic operating system:

  • rear
  • rear-SUSE

Both packages are intended for experienced users and system admins. There is no easy user-frontend and in particular there is no GUI.

ReaR

Relax and Recover (ReaR) is a disaster recovery framework. It is written entirely in bash so that experienced users and system admins can adapt or extend it to make it work for their particular cases.

Specify its configuration in /etc/rear/local.conf and then run "rear mkbackup" to create a backup.tar.gz on an NFS server and a bootable recovery ISO image for your system.
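As an illustration, a minimal /etc/rear/local.conf for this setup might look like the following sketch. The server name and export path are placeholders, and the variable names should be checked against /usr/share/rear/conf/default.conf of your installed ReaR version (older versions may use NETFS_URL instead of BACKUP_URL).

```shell
# /etc/rear/local.conf (sketch; server name and export path are placeholders)

# Create a bootable ISO image as the recovery medium.
OUTPUT=ISO

# Store the backup as a tar archive on a network share.
BACKUP=NETFS

# Hypothetical NFS server and exported directory; replace with your own.
BACKUP_URL="nfs://nfs-server.example.com/export/rear"
```

With such a configuration in place, "rear mkbackup" is run as root on the system that is to be protected.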

A recovery medium which is made from the ISO image boots a special ReaR recovery system. Log in as root and run "rear recover" which does the following:

  1. Recreate the basic system, in particular the partitioning with filesystems and mount points.
  2. Restore the backup from the NFS server.
  3. Install the boot loader.

Finally remove the recovery medium and reboot the recreated system.

In "sufficiently simple" cases it "just works" (provided you specified the right configuration in /etc/rear/local.conf for your particular case). But remember: There is no such thing as a disaster recovery solution that "just works". Therefore, when it does not work, you can either simplify your system configuration or manually adapt and enhance the various bash scripts of ReaR to make it work for your particular case.

Limitations

The limitation is what the special ReaR recovery system can do.

The ReaR recovery system is totally different from the installation system on an openSUSE or SUSE Linux Enterprise install medium. Therefore, even when the initial installation of the basic operating system from an openSUSE or SUSE Linux Enterprise install medium worked, the special ReaR recovery system may still not work in your particular case.

For example:

In current SUSE systems disks are referenced by persistent storage device names like /dev/disk/by-id/ata-ACME1234_567-part1 instead of traditional device nodes like /dev/sda1 (see /etc/fstab, /boot/grub/menu.lst and /boot/grub/device.map).

If "rear recover" is run on a system with a new hard drive (e.g. after the disk had failed and was replaced) the reboot may fail because the persistent storage device names are different.

In this case ReaR shows a warning like "Your system contains a reference to a disk by UUID, which does not work".
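To see which entries would be affected, the files named above can be searched for persistent names. The following is a minimal sketch; the function name list_persistent_refs is made up for illustration, and the pattern only covers /dev/disk/by-*, UUID= and LABEL= references.

```shell
# List the lines of a configuration file that reference disks by
# persistent names below /dev/disk/ or by UUID= / LABEL= specifications.
# Comment lines are skipped. Prints "LINE_NUMBER:content" for each match.
list_persistent_refs() {
    # $1: file to inspect, e.g. /etc/fstab
    grep -n -E '(/dev/disk/by-[a-z]+/|UUID=|LABEL=)' "$1" \
        | grep -v -E '^[0-9]+:[[:space:]]*#'
}
```

For example, "list_persistent_refs /etc/fstab" prints nothing if /etc/fstab uses only traditional device nodes.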

The fix in the running ReaR recovery system is to switch to the recovered system via "chroot /mnt/local" and therein check in particular the files /etc/fstab, /boot/grub/menu.lst and /boot/grub/device.map and adapt their content (e.g. by replacing names like /dev/disk/by-id/ata-ACME1234_567-part1 with the matching device node like /dev/sda1). After changes in /boot/grub/menu.lst and /boot/grub/device.map the GRUB boot loader should be re-installed via "/usr/sbin/grub-install".

See https://github.com/rear/rear/issues/22
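The replacement of a persistent name with a device node, as described above, could be done with a small helper like the following. This is only a sketch: the function name replace_device_name is hypothetical, it is not part of ReaR, and you must determine the matching device node yourself.

```shell
# Replace one persistent device name with a traditional device node in a
# configuration file. A copy of the original is kept as FILE.orig.
# Caution: a name that is a prefix of another name (part1 vs. part10)
# would also match; check the result before rebooting.
replace_device_name() {
    # $1: file, $2: persistent name, $3: device node
    file=$1 old=$2 new=$3
    cp -p "$file" "$file.orig" || return 1
    # Escape characters that are special in a basic regular expression.
    old_escaped=$(printf '%s\n' "$old" | sed 's/[][\.*^$|]/\\&/g')
    # Use '|' as the sed delimiter because device paths contain '/'.
    sed "s|$old_escaped|$new|g" "$file.orig" > "$file"
}
```

Inside the chroot environment a call could look like "replace_device_name /etc/fstab /dev/disk/by-id/ata-ACME1234_567-part1 /dev/sda1".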

Alternatively: If your hard disk layout is sufficiently simple so that you do not need disks referenced by persistent storage device names, you could simplify your system configuration by using traditional device nodes (in particular in /etc/fstab, /boot/grub/menu.lst and /boot/grub/device.map).

The same kind of issue (with different symptoms) can also happen with rear-SUSE.

rear-SUSE / RecoveryImage

The rear-SUSE package provides the bash script RecoveryImage which creates a bootable ISO image to recover your system.

Experienced users and system admins can adapt or extend the RecoveryImage script to match even special needs.

To create the bootable ISO image RecoveryImage does the following:

  1. Run "rear mkbackuponly" to store a backup.tar.gz on an NFS server.
  2. Run AutoYaST clone_system.ycp to make an autoinst.xml file.
  3. Make a bootable system recovery ISO image which is based on an install medium (for example a SUSE Linux Enterprise install DVD) plus autoinst.xml so that AutoYaST can recover this particular system. In particular a so-called 'chroot script' is added to autoinst.xml, which AutoYaST runs to restore the backup from the NFS server.

A recovery medium which is made from the ISO image runs AutoYaST with autoinst.xml to recreate the basic system, in particular the partitioning with filesystems and mount points.

Then AutoYaST runs the 'chroot script' to fill in the backup data into the recreated basic system.
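In essence, such a 'chroot script' has to unpack backup.tar.gz into the root directory of the recreated system. The following sketch shows only that core step. It is not the script that rear-SUSE actually generates; the function name and paths are made up for illustration, and it assumes that the NFS share is already mounted.

```shell
# Unpack a gzip-compressed tar backup into the directory where the
# recreated system is mounted.
restore_backup() {
    # $1: backup archive, e.g. /mnt/nfs/backup.tar.gz (placeholder path)
    # $2: directory where the recreated system is mounted
    archive=$1 target=$2
    [ -r "$archive" ] || { echo "cannot read $archive" >&2; return 1; }
    [ -d "$target" ] || { echo "$target is not a directory" >&2; return 1; }
    # -p keeps file permissions; -C unpacks relative to the target.
    tar -xzpf "$archive" -C "$target"
}
```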

After the backup was restored, AutoYaST installs the boot loader.

Then the recreated system boots for its very first time and AutoYaST does the system configuration, in particular the network configuration. Finally the configured system moves forward to its final runlevel so that all system services should then be up and running again.

rear-SUSE uses the backup method of ReaR (via "rear mkbackuponly") but the recovery image is made in a totally different way.

In other words: same backup, but a totally different way of system recovery.

With rear-SUSE the recovery of the basic system (i.e. partitioning, filesystems, mount points, boot loader, network configuration,...) is delegated to AutoYaST and AutoYaST delegates the particular tasks to the matching YaST modules.

The crucial point is that autoinst.xml controls what AutoYaST does so that via autoinst.xml experienced users and system admins can control how their particular systems are recreated.

In "sufficiently simple" cases it "just works", but remember: There is no such thing as a disaster recovery solution that "just works". Therefore, when it does not work, you can either simplify your system configuration or manually adapt and enhance your autoinst.xml to make it work for your particular case.

Basic reasoning behind

  • The recovery medium is based on a pristine openSUSE or SUSE Linux Enterprise install medium. Therefore, when the initial installation of the basic operating system from an openSUSE or SUSE Linux Enterprise install medium worked in your particular case, it should also be possible to recreate your particular basic operating system from the recovery medium.
  • AutoYaST can be used for an automated installation of various kinds of systems. Therefore, with an appropriate autoinst.xml it should also be possible to recreate (almost) any system, in particular your basic operating system.

Limitations

The limitation is what AutoYaST via the matching YaST modules can do.

For example:

It depends on the particular openSUSE or SUSE Linux Enterprise product which filesystems are supported by YaST. Basically the filesystems which are supported by YaST are ext2, ext3, ReiserFS, XFS, and btrfs.

If an unsupported filesystem is used on your system, AutoYaST (via the matching YaST module) cannot recreate this filesystem.

Some examples of filesystems which are not supported by YaST (to only name some more known ones): AFS, GFS, GPFS, JFFS, Lustre, OCFS2, StegFS, TrueCrypt, UnionFS,...

In many cases this means removing the sections regarding unsupported filesystems from autoinst.xml, so that whatever on your system relates to such filesystems is not recreated. In this case you need to manually recreate such filesystems and what depends on them (at least all files on unsupported filesystems).
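To find out in advance whether your system is affected, the filesystem types in /etc/fstab can be compared with the supported ones. The following is a rough sketch: the function name and the skip list are made up for illustration, and the supported list simply repeats the filesystems named above, so adjust it to your particular product.

```shell
# Filesystem types that the text above names as supported by YaST.
SUPPORTED_FS="ext2 ext3 reiserfs xfs btrfs"

# Print "MOUNTPOINT FSTYPE" for every fstab entry whose filesystem type
# is not in SUPPORTED_FS. Swap, pseudo and NFS entries are skipped.
list_unsupported_fs() {
    # $1: fstab-format file, e.g. /etc/fstab
    while read -r device mountpoint fstype rest; do
        # Skip empty lines and comments.
        case $device in ''|'#'*) continue ;; esac
        case $fstype in
            swap|proc|sysfs|tmpfs|devpts|devtmpfs|debugfs|usbfs|nfs|nfs4) continue ;;
        esac
        case " $SUPPORTED_FS " in
            *" $fstype "*) ;;
            *) echo "$mountpoint $fstype" ;;
        esac
    done < "$1"
}
```

Every line printed by "list_unsupported_fs /etc/fstab" is a candidate that needs a closer look before you rely on AutoYaST to recreate it.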

The same kind of issue can also happen with ReaR.

Virtual machines

Usually the virtualization host software provides a snapshot functionality so that a whole virtual machine (guest) can be saved and restored. A snapshot saves the virtual machine in files which are specific to the virtualization host software in use, and those files are usually stored on the virtualization host. Therefore those files must be saved in an additional step (usually the complete virtualization host must be saved) to protect the virtual machine against failure of the virtualization host.

In contrast, when using ReaR or rear-SUSE the virtual machine is saved as a backup and an ISO image, which are independent of the virtualization host.

Full/hardware virtualization

With ReaR and rear-SUSE it is possible to save a fully virtual machine which runs in a particular full/hardware virtualization software environment on one physical machine and restore it in the same full/hardware virtualization software environment on another physical machine. This way it should be possible to restore a fully virtual machine on different replacement hardware, which mitigates the requirement to have the same or compatible replacement hardware available. Nevertheless you must test whether this works in your particular case with your particular replacement hardware.

Usually it is not possible to save a fully virtual machine which runs in one full/hardware virtualization software environment and restore it in a different full/hardware virtualization software environment because different full/hardware virtualization software environments emulate different machines which are usually not compatible.

Paravirtualization

Paravirtualized virtual machines are a special case, in particular paravirtualized XEN guests.

A paravirtualized XEN guest needs a special XEN kernel (vmlinux-xen) and also a special XEN initrd (initrd-xen). The XEN host software which launches a paravirtualized XEN guest expects the XEN kernel and the XEN initrd in specific file names "/boot/ARCH/vmlinuz-xen" and "/boot/ARCH/initrd-xen" where ARCH is usually i386 or i586 or x86_64.

Furthermore a paravirtualized XEN guest needs in particular the special kernel modules xennet and xenblk to be loaded. This can be specified in /etc/rear/local.conf with a line "MODULES_LOAD=( xennet xenblk )" which lets the ReaR recovery system autoload these modules in the given order (see /usr/share/rear/conf/default.conf).
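Before relying on such a MODULES_LOAD setting, it can help to confirm which of the named modules are actually loaded in the running guest. The following is a minimal sketch; the function name missing_modules is made up for illustration, and drivers that are built into the kernel do not appear in /proc/modules at all.

```shell
# Corresponding line in /etc/rear/local.conf (from the text above):
#   MODULES_LOAD=( xennet xenblk )

# Print every named module that is not listed in a /proc/modules-format
# file. The first field of each line in /proc/modules is the module name.
missing_modules() {
    # $1: module list file, normally /proc/modules; $2...: module names
    list=$1
    shift
    for module in "$@"; do
        awk -v m="$module" '$1 == m { found = 1 } END { exit !found }' "$list" \
            || echo "$module"
    done
}
```

For example, "missing_modules /proc/modules xennet xenblk" prints nothing when both modules are loaded.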

Neither ReaR nor rear-SUSE provides functionality to create a special "medium" that can be used directly to launch a paravirtualized XEN guest. Both ReaR and rear-SUSE create a usual bootable ISO image which boots on usual PC hardware. In particular ReaR creates a bootable ISO image where kernel and initrd are located in the root directory of the ISO image.

To use ReaR or rear-SUSE to recreate a paravirtualized XEN guest, the configuration of the XEN host must be adapted so that it can launch the ReaR or rear-SUSE recovery system on a paravirtualized XEN guest. Basically this means launching a paravirtualized XEN guest from a usual bootable ISO image.

Remember: There is no such thing as a disaster recovery solution that "just works". Therefore, when it does not work, you can either simplify your system (e.g. use full/hardware virtualization instead of paravirtualization) or manually adapt and enhance the disaster recovery framework to make it work for your particular case.

Non PC compatible architectures

Non PC compatible architectures are those that are neither x86/i386/i486/i586 (32-bit) nor x86_64 (64-bit) like ppc, ppc64, ia64, s390, s390x.
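Whether a machine falls into this category can be checked from the machine name that "uname -m" reports. The following is a small sketch; the function name is_pc_compatible is made up for illustration, and i686 is added here as a further 32-bit x86 name that "uname -m" commonly reports.

```shell
# Return success if the given machine name (default: output of "uname -m")
# denotes a PC compatible architecture in the sense used above.
is_pc_compatible() {
    case ${1:-$(uname -m)} in
        i386|i486|i586|i686|x86_64) return 0 ;;
        *) return 1 ;;
    esac
}
```

A call without argument, as in "is_pc_compatible || echo 'special boot medium needed'", tests the machine it runs on.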

Recovery medium compatibility

Both ReaR and RecoveryImage create a usual El Torito bootable ISO image which boots on usual PC hardware.

Neither ReaR nor RecoveryImage provides special functionality to create any kind of special bootable "media" that can be used to boot on non PC compatible architectures.

Therefore ReaR and RecoveryImage cannot be used without appropriate enhancements and/or modifications on hardware architectures that cannot boot an El Torito bootable medium.

Bootloader compatibility

Basically GRUB as used on usual PC hardware is the only supported bootloader.

There might be some kind of limited support for special bootloader configurations but one cannot rely on it.

Therefore it is recommended to use GRUB with a standard configuration.

If GRUB with a standard configuration cannot be used on non PC compatible architectures, appropriate enhancements are needed to add support for special bootloader configurations.

It is crucial to check in advance whether or not it is possible to recreate your particular non PC compatible systems with ReaR or RecoveryImage/AutoYaST.

Help and Support

Feasible in advance

Help and support is feasible only "in advance", while your system is still up and running. That is the case when something does not work while you are testing on your replacement hardware that your system can be recreated with your recovery medium, that the recreated system can boot on its own, and that all its system services still work as you need them in your particular case.

Hopeless in retrospect

Help and support is usually hopeless "in retrospect", when recreating your system on replacement hardware fails after your system was destroyed.

The special ReaR recovery system provides a minimal set of tools which could help in some cases to fix issues while it recreates your system (see the "ReaR" section above) but a precondition is that the ReaR recovery system at least boots correctly on your replacement hardware. If the ReaR recovery system fails to boot, it is usually a dead end.

Be prepared for manual recreation

If recreating your system finally fails, you have to manually recreate your basic system, in particular your partitioning with filesystems and mount points, and afterwards you have to manually restore your backup from your NFS server. Therefore you must have at least your partitioning, filesystem, mount point, and networking information available so that you can manually recreate your system. It is recommended to manually recreate your system on your replacement hardware as an exercise to be prepared.

It is crucial to have replacement hardware available in advance to verify that you can recreate your system because there is no such thing as a disaster recovery solution that "just works".
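One way to keep that information at hand is to collect it into a single text file while the system is still running and to store a copy away from the machine. The following is a minimal sketch, not a complete inventory; the function name and the selection of sources are made up for illustration.

```shell
# Write basic partitioning, mount and network information into one text
# file. A source that cannot be read is noted in the file, not fatal.
collect_system_info() {
    # $1: output file; keep a copy off the machine, e.g. next to the backup
    out=$1
    : > "$out" || return 1
    for f in /etc/fstab /proc/partitions /proc/mounts; do
        echo "== $f ==" >> "$out"
        cat "$f" >> "$out" 2>/dev/null || echo "(not readable)" >> "$out"
    done
    echo "== network ==" >> "$out"
    { ip addr show && ip route show; } >> "$out" 2>/dev/null \
        || echo "(ip command not available)" >> "$out"
}
```

When run as root, the output of "fdisk -l" could be appended in the same way to record partition types and exact partition boundaries.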
