Kubic:MicroOS

openSUSE MicroOS is a modern Linux Operating System, designed for containers and optimized for large deployments. It is the operating system part of openSUSE Kubic, a Container as a Service platform.

What is openSUSE MicroOS?

openSUSE MicroOS inherits the openSUSE Tumbleweed and SUSE Linux Enterprise knowledge while redefining the operating system into a small, efficient and reliable distribution.

openSUSE MicroOS is not a separate distribution, but is bundled as part of openSUSE Kubic, a Container as a Service platform. However, there is a system role "openSUSE MicroOS" which can be selected during installation to get a standalone openSUSE MicroOS system installed. openSUSE Kubic itself is an openSUSE Tumbleweed variant that shares technology with SUSE CaaS Platform.

In a Nutshell

  • OS focused only on containers
    • Minimal images designed for one special Use Case
  • Focused on large deployments
    • Reduced end-user interactions
  • An always up-to-date Operating System
    • Safe way to update the system

Highlights

  • Btrfs with snapshots and rollback for transactional updates
  • Read-only root filesystem
  • cloud-init for initial system configuration during first boot
  • Rolling Release: Every time we release a new openSUSE Tumbleweed snapshot, we will also release a new openSUSE Kubic snapshot
  • Designed to fit perfectly into existing openSUSE or SUSE Linux Enterprise environments

Architecture

Package format

openSUSE MicroOS is using RPM packages.

There is absolutely no reason to switch to another package format for a minimal system like openSUSE MicroOS or for transactional updates. RPM as a package format is well known, since it is used by several major distributions. There are proven, working toolchains covering everything from building RPMs to delivering the resulting packages to users. Additionally, a lot of users already have policies and toolchains for RPM updates.

Other RPM advantages are:

  • Signed, easy to verify
  • Verification of installed system possible
  • Delta-RPM to save bandwidth

Installed Packages

The openSUSE MicroOS installation media contains all packages that are:

  • necessary to boot the system.
  • necessary to run containers.
  • necessary to configure and run the "Container as a Service" stack.

The package list is similar to the SUSE Linux Enterprise Server minimal system.

There is no guarantee for a stable ABI: packages will be introduced if needed and removed if no longer needed. This is not considered to be a disadvantage, as the customer workload runs in a container. On the contrary, the advantage is that only the minimal set of software necessary to do the requested job is installed.

Additional RPM packages for hardware enablement, logging, monitoring and similar tasks are available on the installation media. The online repository is identical to the openSUSE Tumbleweed repository.

Job scheduler

openSUSE MicroOS uses systemd timers (see man systemd.timer) for job execution. Compared to cron, systemd timers provide better control and debug options and avoid problems with cronjobs and systemd session management. cron is not installed by default, so regular cronjobs will not be executed.

Init system

openSUSE MicroOS uses systemd as its init system. Support for legacy SysV init scripts (which includes LSB compatible init scripts) is not included by default; SysV init scripts should be converted to systemd services instead.

Filesystem

The only available and supported filesystem for the root filesystem is btrfs. Other filesystems like ext4 and xfs are available and supported for data partitions. The root filesystem is read-only; some subvolumes, such as /var, /home and /root, are available to store data. To store modified configuration files, overlayfs is used for /etc. The work directory for /etc/ is /var/lib/overlay/etc.

Filesystem Layout

Subvolumes

  /@/<subvolumes>            - Default subvolumes (see the list of default subvolumes on the BTRFS support page)
  -> /root                   - root user home directory
  -> /cloud-init-config      - Configuration files for cloud-init stored in the image
  -> /.snapshots/1/snapshot  - Initial installation of Base OS
  -> /.snapshots/2/snapshot  - Base OS after first update
  -> /.snapshots/3/snapshot  - Base OS after second update
  -> /.snapshots/X/snapshot  - Base OS after (X-1) updates

With the exception of the .snapshots subvolumes, the non-default subvolumes listed above are added by default in openSUSE Kubic in order to ensure it's possible to write to those locations when the rest of the root filesystem is read-only.

Important Folders

   /var/lib/docker

is used to store containers and should be on its own btrfs partition. Snapshots and rollback require qgroups to be enabled on the root filesystem, and qgroups would be a massive performance bottleneck for containers stored there.

   /var/lib/overlayfs

is used to provide the overlay store mounted for /etc. It doesn't have to be on a separate partition.

Configuration

The system is pre-configured as far as possible during installation, so usually no additional configuration is needed by the system administrator. openSUSE MicroOS uses cloud-init to adjust the system during the boot phase. The primary configuration items are the network and the ssh keys that allow the admin to log in to the machine.
Note that by default a root password will only be set if the installation is done with YaST2. Otherwise a local login is not possible, so either an account needs to be configured or ssh keys have to be installed.

System configuration (cloud-init)

Cloud-init is a flexible and popular framework for customizing cloud instances; it is used to customize the openSUSE MicroOS installations. The cloud-init configuration was modified to support caasp roles setup (master and cluster nodes).

Some enhancements were necessary to configure repositories (e.g. the update repositories) and to be able to read the configuration from a local directory. Otherwise a USB disk would always be necessary.

The default search order for configuration files is:

  • Local directory
  • USB flash drive or ISO image
  • Configuration server (No advanced Network configuration possible):
    • NoCloud
    • OpenStack

There is some documentation on how to set up cloud-init.

Health Check

Several checks for errors are done during boot phase. If an error was detected, the following rules will be used:

  • Error with new snapshot:
    • Roll back to the last known working snapshot if one exists
  • Error with already successfully booted snapshot
    • Try a reboot first
    • If the error persists, shut down services and inform the system administrator

This process needs access to the hard disk. If the boot process fails in or before initrd, the system administrator has to fix this manually.
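The rules above can be read as a small decision function. The following is only a sketch of that logic, not the actual health check code: the function name, the parameter names and the `Action` values are hypothetical, and the reaction to a failing new snapshot without any working snapshot to fall back to is an assumption, since the text does not cover that case.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue booting"
    ROLLBACK = "roll back to the last known working snapshot"
    REBOOT = "reboot and try again"
    STOP_AND_NOTIFY = "shut down services and inform the administrator"


def health_check_action(error: bool, booted_ok_before: bool,
                        has_working_snapshot: bool,
                        already_rebooted: bool) -> Action:
    """Pick a reaction to the boot-time checks (hypothetical names)."""
    if not error:
        return Action.CONTINUE
    if not booted_ok_before:
        # Error with a new snapshot: roll back if there is a known good one.
        if has_working_snapshot:
            return Action.ROLLBACK
        # Assumption: the text does not say what happens without one.
        return Action.STOP_AND_NOTIFY
    # Error with a snapshot that has already booted successfully before.
    if not already_rebooted:
        return Action.REBOOT
    return Action.STOP_AND_NOTIFY


print(health_check_action(True, False, True, False).value)
```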

Security and Immutability

  • Apparmor
    • Fully supported
  • SELinux
    • Under evaluation, the framework is there, but:
      • There are problems with overlayfs (/etc)
      • A policy is missing
  • IMA & EVM
    • "Secureboot down to the filesystem"
    • All files are signed cryptographically or with hashes
    • Implementation is work-in-progress for SUSE Linux Enterprise Server, afterwards we will evaluate whether this will work with openSUSE MicroOS, too

Installation

openSUSE Kubic and thus openSUSE MicroOS are RPM based distributions and can be installed from media or with PXE/tftpboot with YaST2. For mass-deployment, an autoyast profile can be created. For openSUSE Kubic this is done by velum, the administration dashboard for the cluster. For openSUSE MicroOS, there is a script create_autoyast_profile, with which an autoyast profile can be created.

PXE/tftpboot

openSUSE Kubic comes with an RPM containing a tftpboot installer: tftpboot-installation-openSUSE-Tumbleweed-Kubic-<architecture>. Install or unpack this RPM on your tftpboot server and follow the steps in the README to configure PXE boot for it. There is no longer any need to download the full ISO image and set up your own install server with it.

Hardware requirements

openSUSE MicroOS needs at least 1 GB RAM and 16 GB disk space for installation. At runtime, additional memory and disk space are needed for the containers, depending on your workload. If you want to bootstrap a full Kubernetes cluster with openSUSE Kubic, 8 GB RAM and 40 GB disk space are required.

Update and Reboot Strategy

For security and stability reasons, the Operating System and the application stack should always be up-to-date. While this is not a problem with single machines, where you can apply all updates by running the commands manually, this can become a real burden in a big cluster. For this reason we believe that automatic updates are the right thing to do.

This section is about the update strategy in general and the reboot strategy with rebootmgr for openSUSE MicroOS particularly. openSUSE Kubic will use salt to trigger the reboot.

To update the system fully automatically, 'Transactional Updates' are used. The automatic update process can be disabled, or a maintenance window can be configured in which the update and, if necessary, the reboot of the server will be done. Standard RPMs are used for updates, and they will be delivered in the same way as for openSUSE Tumbleweed or SUSE Linux Enterprise. If needed, SMT can be used as a local proxy.

How does this work?

To limit the risk for your machines, updates are applied as transactional updates. This means:

  • They are atomic
  • They don't influence the running system
  • They can be rolled back
  • The system needs to be rebooted to activate the changes

The script transactional-update is responsible for this part. It is called by a systemd timer once a day. This is configurable by creating a file /etc/systemd/system/transactional-update.timer.d/local.conf containing:

 [Timer]
 OnCalendar=

For more information about which options can be configured and possible values, please see systemd.unit(5) and systemd.time(7).

Make sure that not all machines start the update at the same time. Depending on the network infrastructure and the number of machines, this could create a load that is too high.
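One way to spread the start times is to derive a stable per-machine offset from the machine ID and add it to the timer start. The following is a minimal sketch of that idea and not something openSUSE MicroOS ships: the function name and the `spread_minutes` parameter are hypothetical. systemd timers also offer the RandomizedDelaySec= option, which may be the simpler choice.

```python
import hashlib


def update_offset_minutes(machine_id: str, spread_minutes: int) -> int:
    """Map a machine ID to a stable offset in [0, spread_minutes)."""
    if spread_minutes <= 0:
        raise ValueError("spread_minutes must be positive")
    # Hash the ID so that similar IDs still land on unrelated offsets.
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % spread_minutes


# The same machine always gets the same offset within a two hour spread.
offset = update_offset_minutes("3cb8c701b4d3474d99a7e88b31dd3439", 120)
print(f"start the update {offset} minutes after the base time")
```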

The script first checks whether updates are available. If so, a new snapshot of the root filesystem is created and updated with zypper dup, so all RPMs which are released at that point in time and not yet installed will be applied. Afterwards, the snapshot is marked as active, but it is only used after the next reboot. For this reason, the script can reboot the machine itself afterwards or tell rebootmgr to schedule a reboot according to the configured policies.

rebootmgr is a daemon, which can be configured to reboot the machine according to special policies. It can be controlled by rebootmgrctl.

After the next reboot, the system verifies itself and if mandatory daemons were not started correctly, a rollback to the last known working snapshot is done automatically.

Reboot Strategy Options

rebootmgr supports different strategies, when a reboot should be done:

  • instantly - when the signal arrives, other services are informed that a reboot is planned, and the reboot is done without getting any locks or waiting for a maintenance window.
  • maint-window - reboot only during a specified maintenance window. If no window is specified, reboot immediately.
  • etcd-lock - acquire a lock at etcd for the specified lock-group before reboot. If a maintenance window is specified, acquire the lock only during this window.
  • best-effort - this is the default. If etcd is running, use etcd-lock. If no etcd is running, but a maintenance window is specified, use maint-window. If no maintenance window is specified, reboot immediately (instantly).
  • off - rebootmgr continues to run, but ignores all signals to reboot. Setting the strategy to `off` does not clear the maintenance window. If rebootmgr is enabled again, it will continue to use the old specified maintenance window.
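The fallback rules in this list, in particular for best-effort, can be summarised in a few lines. This is only a sketch of the behaviour described above, not code from rebootmgr; the function and parameter names are hypothetical.

```python
KNOWN_STRATEGIES = {"instantly", "maint-window", "etcd-lock", "best-effort", "off"}


def effective_strategy(strategy: str, etcd_running: bool,
                       window_configured: bool) -> str:
    """Resolve the configured strategy to the one that is actually applied."""
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    if strategy == "best-effort":
        # Prefer etcd locking, then the maintenance window, then reboot at once.
        if etcd_running:
            return "etcd-lock"
        if window_configured:
            return "maint-window"
        return "instantly"
    if strategy == "maint-window" and not window_configured:
        # Without a window, maint-window behaves like instantly.
        return "instantly"
    return strategy


print(effective_strategy("best-effort", etcd_running=False, window_configured=True))
```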

The reboot strategy can be configured via /etc/rebootmgr.conf and adjusted at runtime via rebootmgrctl. These changes will be written to the configuration file and survive the next reboot. A default configuration file would be:

 [rebootmgr]
 window-start=03:30
 window-duration=1h30m
 strategy=best-effort
 lock-group=default

This means the machine is only allowed to reboot at night between 3:30 and 5:00. If etcd is running, it tries to get a lock during that time and reboots only afterwards. If no lock could be obtained during this timeframe, no reboot is done. The format of window-start is the same as described in systemd.time(7). The format of window-duration is [XXh][YYm].
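To make the window arithmetic concrete, here is a small sketch that parses a window-duration value in the [XXh][YYm] format and checks whether a point in time falls inside the window. It is an illustration only, not rebootmgr code: the function names are hypothetical, and window-start is simplified to a plain HH:MM value, while the real setting accepts the richer systemd.time(7) syntax.

```python
import re
from datetime import datetime, timedelta


def parse_duration(text: str) -> timedelta:
    """Parse a duration in the [XXh][YYm] format, e.g. "1h30m" or "45m"."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text)
    if not text or match is None:
        raise ValueError(f"invalid window-duration: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return timedelta(hours=hours, minutes=minutes)


def in_window(now: datetime, window_start: str, window_duration: str) -> bool:
    """Return True if now lies inside the daily maintenance window."""
    start_time = datetime.strptime(window_start, "%H:%M").time()
    duration = parse_duration(window_duration)
    # Check today's window and yesterday's, which may run past midnight.
    for day_offset in (0, -1):
        start = datetime.combine(now.date(), start_time) + timedelta(days=day_offset)
        if start <= now < start + duration:
            return True
    return False


print(in_window(datetime(2024, 1, 1, 4, 0), "03:30", "1h30m"))
```

With the default configuration shown above, 04:00 is inside the window, while 05:00 is treated as already outside it because the end of the window is exclusive in this sketch.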

Locking via etcd

To make sure that not all machines reboot at the same time, the machines can be sorted into groups, and the number of machines in a group which are allowed to reboot at the same time can be configured and controlled via etcd. For example, you can create a group etcd, which contains all machines running etcd, and specify that only one etcd server is allowed to reboot at a time. You can then create a second group worker, in which a higher number of machines are allowed to reboot at the same time.

The etcd path to the directory containing data for a group is: /opensuse.org/rebootmgr/locks/<group>/

This directory contains two variables: mutex, which is by default 0 and can be set via atomic_compare_and_swap to 1 to make sure that only one machine has write access, and a variable `data` containing the following json structure:

 {
   "max":1,
   "holders":[]
 }

holders will contain a unique ID of the machine, in this case the one from /etc/machine-id.

So a record containing two locks out of 10 possible ones would look like:

 {
   "max":10,
   "holders":[
     "3cb8c701b4d3474d99a7e88b31dd3439",
     "71c8efe539b280af2fe09b3b5771345e"
   ]
 }


A typical workflow of a client which tries to reboot would look like:

  • check whether there are free locks, else watch the data variable until it changes
  • get the mutex
  • add yourself to the holders list
  • release the mutex
  • reboot
  • on boot, check if we hold a lock. If yes:
    • get the mutex
    • remove the ID from the holders list
    • release the mutex
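The workflow above can be simulated without an etcd server. In this sketch a plain Python dict stands in for the etcd directory of one lock group, with the `mutex` and `data` keys described earlier. It does not use any real etcd client API, the function names are hypothetical, and the compare-and-swap on the mutex is only imitated, so it shows the bookkeeping rather than real concurrency safety.

```python
import json


def _with_mutex(store: dict, update) -> bool:
    """Imitate the compare-and-swap mutex around one update of the data record."""
    if store["mutex"] != 0:
        return False  # somebody else holds the mutex, try again later
    store["mutex"] = 1
    try:
        data = json.loads(store["data"])
        ok = update(data)
        if ok:
            store["data"] = json.dumps(data)
        return ok
    finally:
        store["mutex"] = 0  # always release the mutex


def try_acquire(store: dict, machine_id: str) -> bool:
    """Add machine_id to the holders list if a lock is free."""
    def update(data: dict) -> bool:
        if machine_id in data["holders"]:
            return True
        if len(data["holders"]) >= data["max"]:
            return False  # no free lock, the caller should watch for changes
        data["holders"].append(machine_id)
        return True
    return _with_mutex(store, update)


def release(store: dict, machine_id: str) -> bool:
    """Remove machine_id from the holders list, as done on boot."""
    def update(data: dict) -> bool:
        if machine_id not in data["holders"]:
            return False
        data["holders"].remove(machine_id)
        return True
    return _with_mutex(store, update)


group = {"mutex": 0, "data": json.dumps({"max": 1, "holders": []})}
print(try_acquire(group, "3cb8c701b4d3474d99a7e88b31dd3439"))
```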

Disable Automatic Updates

Automatic updates can be disabled with:

 systemctl --now disable transactional-update.timer

Automatic reboot can be disabled with:

 systemctl --now disable rebootmgr

or

 rebootmgrctl off

Manual triggers

Old snapshots can be marked for removal by calling transactional-update cleanup after a reboot. Updates can be run at any time with transactional-update. A reboot according to the currently active policy can be triggered with rebootmgrctl reboot.



Transactional Updates

Definition

A transactional update is a kind of update that:

  • Is atomic
    • Either fully applied or not at all. This means that at any point during the update, you can switch off the machine, and at the next boot either the old, unmodified installation boots or the new one, never a mix. This also means that old snapshots will not be destroyed (as would happen if you used two partitions and switched between them) and you can still do a rollback if needed.
    • The update does not influence your running system. This means that the running processes don't see that an update is happening and they will not be restarted. Restarting them would be pretty useless anyway, as a restart of the running daemons would only start the old binary again; the new one is only available after a reboot.
  • Can be rolled back
    • If the update fails or if the update is not compatible, you can quickly restore the situation as it was before the update.

How it works

XXX insert link to transactional-update documentation

Cleanup of old snapshots

Snapper cleanup policies and rules are used:

  • Configured number of snapshots stay, rest will be removed
  • Unimportant snapshots will be deleted first

transactional-update cleanup

  • Run regularly by a systemd timer
  • Snapshots used in the past:
    • Marked as important and as deletable
  • Snapshots not used in the past:
    • Marked as deletable, but not as important
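The effect of these rules can be pictured with a simplified model. This sketch is not snapper's actual cleanup algorithm: it assumes a single `keep` limit and hypothetical `Snapshot` fields, and only shows the ordering described above, where deletable snapshots that are not important are removed before important ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    number: int
    important: bool  # the system has booted from this snapshot in the past
    deletable: bool  # the snapshot has been marked for cleanup


def snapshots_to_delete(snapshots: list[Snapshot], keep: int) -> list[int]:
    """Return the numbers of the snapshots to remove so that keep remain."""
    excess = len(snapshots) - keep
    if excess <= 0:
        return []
    candidates = [s for s in snapshots if s.deletable]
    # Unimportant snapshots go first, and older ones before newer ones.
    candidates.sort(key=lambda s: (s.important, s.number))
    return [s.number for s in candidates[:excess]]


snaps = [
    Snapshot(1, important=True, deletable=True),
    Snapshot(2, important=False, deletable=True),
    Snapshot(3, important=True, deletable=True),
    Snapshot(4, important=False, deletable=False),
]
print(snapshots_to_delete(snaps, keep=2))
```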

Important commands

XXX Document how to update, install and rollback packages. Document repository handling. Document automatic update with third party RPMs.

Download

ISO images for installation

An ISO image for manual installation can be downloaded from here: http://download.opensuse.org/tumbleweed/iso/openSUSE-Tumbleweed-Kubic-DVD-x86_64-Current.iso

For openSUSE MicroOS, select the corresponding system role during installation.

Images for virtualisation environments

Untested images for different virtualisation environments are currently only available from the devel project: https://download.opensuse.org/repositories/devel:/CaaSP:/images/images/

During the first boot of these images, you need to set a password or add an ssh key for remote login with cloud-init. There is some openSUSE Kubic and openSUSE MicroOS specific documentation available.

Communication

Mailing list

IRC/Chat

#kubic on the freenode IRC network is the channel that the Kubic project uses for live chat.
