openSUSE:Ceph



What is Ceph

Ceph is a unified, distributed storage system designed for excellent performance, reliability and scalability.

Documentation

If you've got 20 minutes spare, you might enjoy Tim Serong's Gentle Introduction to Ceph talk. Slides and some more links can be found here.

Upstream Ceph documentation can be found here.

The SUSE Enterprise Storage documentation should also be generally applicable to running Ceph on openSUSE.

Installing Ceph on openSUSE

Available Versions

Ceph is included in openSUSE Leap and Tumbleweed, but some packages needed to deploy it are still missing from the distros, so, while no extra repositories are needed to *install* Ceph, one extra repository is needed to *deploy* it. As of 2018-12-13, installation and deployment have been tested on:

In the past, we have used other combinations:

openSUSE Tumbleweed is currently tracking the Ceph master branch. At some point, the nautilus branch will be split off from the master branch: when that happens, it will track nautilus.

The OBS projects will shift as upstream releases occur; filesystems:ceph is the devel project for Ceph in Tumbleweed, and will generally track the latest release. LTS Ceph releases are from subprojects as mentioned above, and will go out with particular Leap releases. Similarly, upcoming releases will be staged in subprojects (e.g. Luminous is currently in filesystems:ceph:luminous).

Deploying Ceph

Ordinarily a Ceph cluster would consist of at least several physical hosts, each containing many disks (OSDs). But if you just want to play, you can create a toy Ceph cluster on a few VMs. If you're doing this on a laptop or small desktop system, and the VMs are backed by qcow2 volumes on the same disk, you really want to be using an SSD, not spinning rust.

Manual Setup

  • Install openSUSE Leap 15.x (or Tumbleweed, if you prefer to live on the edge) on at least three VMs.
  • You probably want to give each VM 2GB RAM and 2 CPU cores, and enable KSM on the VM host.
  • Give each of your test VMs at least one additional disk, at least 20GB in size. These will be your OSDs.
  • Make absolutely sure that hostname resolution works, i.e. $(hostname) and $(hostname -f) need to show something sensible when you're logged into your VMs. Your VM host will also want to be able to resolve these hostnames.
  • Make sure the time is sync'd nicely.
  • Add the filesystems:ceph:nautilus OBS repo to all the VMs (see the example commands just after this list).
  • Proceed to set up a Salt cluster and deploy Ceph using DeepSea (read on).
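
As a rough sketch of the last few steps on each VM (the IP addresses are the ones from the example output further down, the hostnames are placeholders, and the repository URL is an assumption based on the standard OBS download layout, so adjust all of them for your environment and openSUSE release):

# echo "192.168.122.67  leap1.example.com leap1" >> /etc/hosts
# echo "192.168.122.18  leap2.example.com leap2" >> /etc/hosts
# echo "192.168.122.134 leap3.example.com leap3" >> /etc/hosts
# chronyc tracking
# zypper ar -f https://download.opensuse.org/repositories/filesystems:/ceph:/nautilus/openSUSE_Leap_15.1/ filesystems-ceph-nautilus
# zypper ref

chronyc tracking should report a sensible time source and a small offset if time synchronisation is working; if your VMs use a different time service, check that instead.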

Using Salt/DeepSea

For any reasonable sized Ceph cluster, you'll want to automate deployment somehow. For that matter, automation is a win even if you're only setting up a toy test cluster. On openSUSE (and on SLES with SUSE Enterprise Storage), we've got DeepSea, which is a collection of Salt state files, runners and modules for deploying Ceph using Salt.

For an introduction to DeepSea, including a walkthrough of setting up a small test cluster, see this blog post.

The latest DeepSea packages for openSUSE can be found in filesystems:ceph:nautilus/deepsea on OBS.
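
Assuming the filesystems:ceph:nautilus repository from the manual setup steps above is already added, installing DeepSea is just a package install. A minimal sketch, using the standard openSUSE package and service names: on the node you intend to use as the Salt master, run

# zypper in deepsea salt-master
# systemctl enable --now salt-master

and on every node, including the master:

# zypper in salt-minion
# systemctl enable --now salt-minion

The blog post linked above walks through pointing the minions at the master, accepting their keys and running the DeepSea deployment stages.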

For more information about DeepSea, check out the wiki. For assistance, please join the mailing list. We'd love to get your feedback.

Using Rook

Rook is an open source cloud-native storage orchestrator, providing the platform, framework, and support for a diverse set of storage solutions to natively integrate with cloud-native environments.

Please refer to the Rook documentation on the Rook web page.

To use Rook for Ceph you need a Kubernetes cluster. One way to set one up is described in the section "Using Rook in Vagrant cluster" below.

Deploying Ceph in Vagrant cluster

If you don't care to deal with setting up VMs yourself, you can use https://github.com/openSUSE/vagrant-ceph to automate the process.

# curl https://raw.githubusercontent.com/openSUSE/vagrant-ceph/master/openSUSE_vagrant_setup.sh -o openSUSE_vagrant_setup.sh
# chmod +x openSUSE_vagrant_setup.sh
# sudo ./openSUSE_vagrant_setup.sh

# vagrant up
# vagrant provision

This will deploy a 3-node development cluster on Leap 42.3.

# vagrant ssh admin
# su

The password is the standard one: vagrant

# deepsea stage run ceph.stage.0
# deepsea stage run ceph.stage.1
# deepsea stage run ceph.stage.2
# deepsea stage run ceph.stage.3
# deepsea stage run ceph.stage.4

# ceph -s

Your small development cluster is now ready.

For different boxes and more useful configurations please refer to: https://github.com/openSUSE/vagrant-ceph

Using Rook in Vagrant cluster

To set up the Vagrant cluster, you can use https://github.com/openSUSE/vagrant-ceph to automate the process.

# curl https://raw.githubusercontent.com/openSUSE/vagrant-ceph/master/openSUSE_vagrant_setup.sh -o openSUSE_vagrant_setup.sh
# chmod +x openSUSE_vagrant_setup.sh
# sudo ./openSUSE_vagrant_setup.sh

Now you need to add the Kubic box. Check this link for the latest one, download it, then add the box you downloaded:

# vagrant box add --provider libvirt --name opensuse/Kubic-kubeadm-cri-o /your/local/dir/openSUSE-Tumbleweed-Kubic.x86_64-15.0-kubeadm-cri-o-Vagrant-x86_64-Build8.1.vagrant.libvirt.box

Start Kubernetes cluster on top of Kubic:

# cd vagrant-ceph
# BOX="opensuse/Kubic-kubeadm-cri-o" vagrant up
# BOX="opensuse/Kubic-kubeadm-cri-o" vagrant provision

This will bootstrap a Kubernetes cluster with a predefined, insecure token, so make sure it is only used for local development purposes. You might see some ssh connection errors during the "vagrant up" phase; these can be ignored, and you can proceed with the "provision" step.

Those steps bring up a 3-node cluster; check the vagrant-ceph README to learn how to bring up a more complex environment.

Now let's deploy Ceph with Rook.

# vagrant ssh admin
# sudo su
# kubectl get nodes

You should see 3 nodes in the cluster. If there are any issues, please report them on the vagrant-ceph GitHub.

vagrant-ceph clones the "suse-master" branch of the "SUSE/Rook" repository onto the admin node, so you can start deploying right away.

This will create the Rook operator and RBAC:

# cd SUSE-rook*/cluster/examples/kubernetes/ceph/
# kubectl create -f common.yaml -f psp.yaml -f operator.yaml

Check that "cephclusters" is present among the CRDs:

# kubectl get crd
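
If you just want a quick yes/no answer, you can filter the output, for example:

# kubectl get crd | grep -i cephcluster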

Now we can create the Ceph cluster:

# kubectl create -f cluster.yaml -f toolbox.yaml

You can check the containers that were created in the rook-ceph namespace:

# kubectl -n rook-ceph get pod
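
The pods take a little while to appear; if you want to watch them as they come up, kubectl's -w (watch) flag is handy:

# kubectl -n rook-ceph get pod -w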

It will take some time to create everything that is needed; you can check the Ceph status once the toolbox container has been created:

# kubectl -n rook-ceph exec $(kubectl -n rook-ceph get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[0].metadata.name}') -- ceph -s

Follow the Rook documentation for other commands.

Deploying Ceph with Containers

https://github.com/ceph/ceph-docker provides Docker files and images to run Ceph in containers. This can be used to deploy Ceph Jewel on openSUSE Leap 42.2. We've got a test environment that will let you deploy a Ceph cluster running all Ceph daemons, each in its own container. This test environment uses Vagrant to spin up a cluster with 3 VMs, provisions the VMs with Docker, and builds and installs the Ceph Docker images in those VMs. Then there are a number of bash scripts that can deploy an entire Ceph cluster running MONs, OSDs, 1 RGW, and 1 MDS. The Vagrant setup can be found at https://github.com/rjfd/vagrant-ceph-docker

To use this:
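
As a rough sketch (the exact steps live in that repository's README; the commands below just assume the usual Vagrant workflow with the default Vagrantfile):

# git clone https://github.com/rjfd/vagrant-ceph-docker
# cd vagrant-ceph-docker
# vagrant up

Once the VMs are up, run the deployment scripts described in the repository.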

Deploying Ceph with terraform

Use Terraform for deployment, together with SaltStack and DeepSea: https://github.com/MalloZup/ceph-open-terrarium
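
As a rough sketch of the usual Terraform workflow with that repository (the exact variables, provider setup and any subdirectory to work from are described in its README; the commands below are just the standard Terraform steps):

# git clone https://github.com/MalloZup/ceph-open-terrarium
# cd ceph-open-terrarium
# terraform init
# terraform plan
# terraform apply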

Deploying Ceph with ceph-deploy

NOTE: ceph-deploy is still relevant for Ceph "Jewel" (openSUSE Leap 42.2), but not for anything newer than that. Use DeepSea instead.

ceph-deploy can be used to deploy small clusters, but it rapidly becomes cumbersome for any real-world deployment. Please seriously consider using Salt/DeepSea instead as mentioned above (or, indeed, any other serious configuration management tool or automation framework). ceph-deploy will be deprecated and eventually disappear from openSUSE Tumbleweed and SUSE Enterprise Storage.

That said, here's how to use ceph-deploy:

  • Make sure passwordless ssh login works from a regular user on the VM host to root on the VM guests.
  • For example if you've named your VM guests "leap1", "leap2" and "leap3", your ~/.ssh/config could include:
   Host leap*
           User root
  • And you'll want to run:
# ssh-copy-id leap1
# ssh-copy-id leap2
# ssh-copy-id leap3
  • On your VM host (as the user who can do passwordless ssh to the VM guests) run the following commands. Replace leap1, leap2 and leap3 with your actual hostnames:
# zypper in ceph-deploy
# mkdir leap-test
# cd leap-test/
# ceph-deploy install leap1 leap2 leap3
# ceph-deploy new leap1 leap2 leap3
# ceph-deploy mon create-initial
# ceph-deploy admin leap1 leap2 leap3
# ceph-deploy mgr create leap1 leap2 leap3
  • Now, if you ssh in to one of your VMs and run ceph -s, you should see something like this:
# ceph -s
    cluster ad773abb-4063-416c-ad42-ecba87880b6a
     health HEALTH_ERR
            64 pgs stuck inactive
            64 pgs stuck unclean
            no osds
     monmap e1: 3 mons at {leap1=192.168.122.67:6789/0,leap2=192.168.122.18:6789/0,leap3=192.168.122.134:6789/0}
            election epoch 6, quorum 0,1,2 leap2,leap1,leap3
     osdmap e1: 0 osds: 0 up, 0 in
            flags sortbitwise
      pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating
  • Back on your VM host, create some OSDs. You have to run ceph-deploy osd prepare and ceph-deploy osd activate for each of the OSD disks for each of the VMs. For example:
# ceph-deploy osd prepare leap1:sdb
# ceph-deploy osd activate leap1:sdb1
# ceph-deploy osd prepare leap2:sdb
# ceph-deploy osd activate leap2:sdb1
# ceph-deploy osd prepare leap3:sdb
# ceph-deploy osd activate leap3:sdb1
  • Now, ceph -s on the VMs should give something like:
# ceph -s
    cluster ad773abb-4063-416c-ad42-ecba87880b6a
     health HEALTH_OK
     monmap e1: 3 mons at {leap1=192.168.122.67:6789/0,leap2=192.168.122.18:6789/0,leap3=192.168.122.134:6789/0}
            election epoch 6, quorum 0,1,2 leap2,leap1,leap3
     osdmap e13: 3 osds: 3 up, 3 in
            flags sortbitwise
      pgmap v26: 64 pgs, 1 pools, 0 bytes data, 0 objects
            100 MB used, 58234 MB / 58334 MB avail
                  64 active+clean

And you're done.

Kinks

Daemon Startup

If your guest VMs are configured via DHCP, the default timeouts may not be sufficient for the network to be configured correctly before the various Ceph daemons start. If this happens, the Ceph MONs and OSDs will not start correctly (systemctl status ceph\* will show "unable to bind" errors). This can be avoided by increasing the DHCP client timeout to at least 30 seconds on each node in your storage cluster. This can be done by changing the following settings on each node:

  • In /etc/sysconfig/network/dhcp set DHCLIENT_WAIT_AT_BOOT="30"
  • In /etc/sysconfig/network/config set WAIT_FOR_INTERFACES="60"
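
If you want to script this, something like the following should do it on each node (a sketch; check the files first, as these sed commands only change lines where the variables are already set uncommented):

# sed -i 's/^DHCLIENT_WAIT_AT_BOOT=.*/DHCLIENT_WAIT_AT_BOOT="30"/' /etc/sysconfig/network/dhcp
# sed -i 's/^WAIT_FOR_INTERFACES=.*/WAIT_FOR_INTERFACES="60"/' /etc/sysconfig/network/config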

Old ceph-deploy Versions

Prior to ceph-deploy 1.5.31, there was a bug which may result in ceph-deploy mon create-initial stalling while trying to create some keys. The workaround is to SSH to each of your Ceph nodes and run chown -R ceph.ceph /var/lib/ceph/mon as root, then re-run ceph-deploy mon create-initial.
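
If you hit this, a quick way to apply the workaround from the VM host (assuming the leap1/leap2/leap3 hostnames and the passwordless root ssh set up earlier) is:

# for node in leap1 leap2 leap3; do ssh root@$node chown -R ceph.ceph /var/lib/ceph/mon; done
# ceph-deploy mon create-initial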

Replication / Stuck PGs

By default, pools will have min_size=2 and size=3. This means that your data will be replicated across a minimum of two OSDs (on two separate hosts, with the default CRUSH map), but preferably across three OSDs on three hosts. So, if you've got three hosts with one OSD each (i.e. a total of three OSDs), at least two of the OSDs must be up in order for everything to work.
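
You can inspect (and, on a toy cluster, adjust) these values per pool with the ceph CLI; "rbd" below is just an example pool name, so substitute one from ceph osd pool ls:

# ceph osd pool ls
# ceph osd pool get rbd size
# ceph osd pool get rbd min_size
# ceph osd pool set rbd size 2

Only reduce size or min_size on throwaway test clusters; on anything you care about, keep the defaults and add OSDs instead.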

If ceph status perpetually shows stuck PGs, make sure all the OSDs are up, and make sure your CRUSH map actually looks like a tree, for example:

# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05548 root default
-2 0.01849     host leap1
 0 0.01849         osd.0       up  1.00000          1.00000
-3 0.01849     host leap2
 1 0.01849         osd.1       up  1.00000          1.00000
-4 0.01849     host leap3
 2 0.01849         osd.2       up  1.00000          1.00000

In the above, "root" contains three hosts (leap1, leap2 and leap3), and each host contains one OSD (osd.0, osd.1, osd.2), all of which are up.

If you've got a CRUSH map that doesn't look like a tree, something is wrong. For example:

# ceph osd tree
ID WEIGHT TYPE NAME    UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1      0 root default
 0      0 osd.0             up  1.00000          1.00000
 1      0 osd.1             up  1.00000          1.00000
 2      0 osd.2             up  1.00000          1.00000 

Here, there are no hosts and the tree is completely flat. This happened on a test system where hostname resolution wasn't working, so none of the hosts knew what their names were, and thus couldn't be added to the CRUSH map.
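
Fixing hostname resolution and then restarting the OSDs is the proper fix. If you would rather repair the map by hand, a sketch using the host names and the 0.01849 weight from the healthy example above (repeat for each host and OSD) would be:

# ceph osd crush add-bucket leap1 host
# ceph osd crush move leap1 root=default
# ceph osd crush set osd.0 0.01849 root=default host=leap1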

Communication

Mailing Lists, IRC, etc.

Team members