Pluggable CPU schedulers

The CPU scheduler is the kernel component responsible for managing threads, which are the fundamental unit of execution in application software. Time on the CPU is a precious resource, and the scheduler allocates it to threads according to a scheduling algorithm: it decides when a thread should start executing, on which of the available CPUs it should run, and for how long. Until recently, the scheduler had to be part of the core kernel; to adopt a scheduler other than the one provided by the distribution's kernel, one had to install a patched kernel. This changes with version 6.12 of the Linux kernel, thanks to a new feature called "sched_ext", or the "extensible scheduler class": developers can now write schedulers as loadable eBPF programs, and Linux distributions such as openSUSE Tumbleweed can ship alternative schedulers as rpm packages.

Installing the scx package

The scx package contains a set of alternative CPU schedulers that can be used in Tumbleweed. First, make sure you're running kernel version 6.12 or later. At the moment such a kernel is available in the Kernel:HEAD repository:

# zypper addrepo https://download.opensuse.org/repositories/Kernel:/HEAD/standard Kernel-HEAD
# zypper refresh --repo Kernel-HEAD
# zypper search --details --repo Kernel-HEAD --type package /^kernel-default$/

S  | Name           | Type    | Version               | Arch   | Repository
---+----------------+---------+-----------------------+--------+------------
v  | kernel-default | package | 6.12~rc4-2.1.ga082c88 | x86_64 | Kernel-HEAD
# zypper install --repo Kernel-HEAD kernel-default
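
After rebooting into the newly installed kernel, it is worth confirming that the running kernel is recent enough before moving on. The following is a minimal sketch: kernel_at_least is a hypothetical helper name introduced here for illustration, and it only compares the major and minor numbers reported by uname -r.

```shell
# kernel_at_least RELEASE MAJOR MINOR
# Succeeds if RELEASE (in the format printed by `uname -r`) is at least
# MAJOR.MINOR. Only the first two numeric components are compared.
kernel_at_least() {
    kal_release=$1
    kal_major=${kal_release%%.*}        # text before the first dot
    kal_rest=${kal_release#*.}          # text after the first dot
    kal_minor=${kal_rest%%[!0-9]*}      # leading digits of the remainder
    [ "$kal_major" -gt "$2" ] ||
        { [ "$kal_major" -eq "$2" ] && [ "$kal_minor" -ge "$3" ]; }
}

if kernel_at_least "$(uname -r)" 6 12; then
    echo "kernel $(uname -r): 6.12 or later"
else
    echo "kernel $(uname -r): older than 6.12"
fi
```

The check can be run as a regular user; it only reads the kernel release string.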

Now add the sched-ext repository from home:flonnegren and install the scx package from it.

# zypper addrepo https://download.opensuse.org/repositories/home:/flonnegren:/sched-ext/standard sched-ext
# zypper search --details --repo sched-ext --type package /^scx$/

S  | Name | Type    | Version   | Arch   | Repository
---+------+---------+-----------+--------+-----------
   | scx  | package | 1.0.5-3.1 | x86_64 | sched-ext
# zypper install --repo sched-ext scx

Alternative schedulers from the scx package

Some of the schedulers from the scx package are more mature and field-tested than others, and some have been written exclusively to demonstrate the usage of the sched_ext API, as a form of documentation. They're all included in the Tumbleweed package, but it is up to you to try them and select the one that best suits your needs. Assuming a desktop-like usage (interactive apps, productivity, multimedia, gaming, etc.) your best bets are likely scx_bpfland, scx_lavd and scx_rusty, but feel free to experiment for yourself; the point of the sched_ext project is precisely to offer users a wide range of choices and to encourage tinkering. The complete list of packaged schedulers follows.

More mature and tested:

  • scx_bpfland: threads that block frequently (i.e. perform many voluntary context switches per second) are assumed to be interactive, and thus prioritized. Upstream readme file.
  • scx_lavd: developed by Igalia for the Steam Deck, Valve's handheld gaming console, and optimized for video gaming. Upstream readme file.
  • scx_rusty: designed with CPU load balancing in mind. CPUs are partitioned into "scheduling domains", one per Last-Level Cache (LLC), and threads are kept in their original domain whenever possible to leverage data locality. Upstream readme file.
  • scx_rustland: the precursor of scx_bpfland, by the same author. The algorithm was first implemented here as a user-space scheduler. Upstream readme file.
  • scx_layered: allows the user to classify threads into multiple "layers" (cgroups) and applies a different scheduling policy to each layer. Upstream readme file.
  • scx_nest: threads are placed on cores that are likely to be running at higher frequency, based upon recent usage. Description in the source code, research paper.

Demonstrative schedulers:

  • scx_simple: threads with the least past runtime are selected for execution. Can be switched to "FIFO scheduling", which means threads are left on the CPU indefinitely until they perform a voluntary context switch, e.g. by issuing a blocking system call. The selection of the thread to run happens in First-In First-Out order. Description in the source code.
  • scx_rlfifo: user-space FIFO scheduler implemented with the same Rust framework used in scx_rustland. "rlfifo" stands for "rustland FIFO". Upstream readme file.
  • scx_qmap: implements five degrees of priority among threads. Description in the source code.
  • scx_pair: organizes CPUs into pairs, and makes sure each such pair always executes threads from the same cgroup. This demonstrates how sched_ext could be used to mitigate hardware security vulnerabilities where a thread could read information from its hyperthreading sibling. Description in the source code.
  • scx_central: a single CPU is devoted to running all scheduling logic. No other CPU needs to run the timer interrupt (tick); they instead send an inter-processor interrupt (IPI) to the CPU in charge of scheduling when they need work. Description in the source code.
  • scx_flatcg: demonstrates flattening the cgroup hierarchy into a single layer, where weights from each level are compounded. This removes cgroup hierarchy traversals to retrieve the weight values, and gives faster scheduling in case of deeply nested cgroup hierarchies. Description in the source code.
  • scx_userland: scheduler implemented in user space, which provides a mix of FIFO and runtime-based scheduling. Description in the source code.
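
To see which of the schedulers listed above are actually installed on a given machine, one can look for executables whose name starts with scx_ in the directories on PATH. This is a sketch under assumptions: list_scx_binaries is a hypothetical helper introduced here, and the name pattern also matches any helper binary that begins with scx_ (such as the scx_loader daemon discussed later), not only schedulers.

```shell
# list_scx_binaries [PATH_STRING]
# Prints the names of executables called scx_* found in the
# colon-separated directory list (defaults to $PATH), one per line.
list_scx_binaries() {
    lsb_path=${1:-$PATH}
    lsb_old_ifs=$IFS
    IFS=:
    for lsb_dir in $lsb_path; do
        # Ignore empty PATH components.
        [ -n "$lsb_dir" ] || continue
        for lsb_file in "$lsb_dir"/scx_*; do
            # Skip unmatched globs and non-executable files.
            if [ -f "$lsb_file" ] && [ -x "$lsb_file" ]; then
                echo "${lsb_file##*/}"
            fi
        done
    done | sort -u
    IFS=$lsb_old_ifs
}

list_scx_binaries
```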

How to start a sched_ext scheduler

Schedulers are started by invoking the corresponding binary, for example from a root terminal session like so:

# scx_bpfland 
17:45:15 [INFO] scx_bpfland 1.0.5 x86_64-unknown-linux-gnu SMT off
17:45:15 [INFO] primary CPU domain = 0x3ff
17:45:15 [INFO] cpufreq performance level: auto
17:45:15 [INFO] L2 cache ID 0: sibling CPUs: [0]
17:45:15 [INFO] L2 cache ID 4: sibling CPUs: [4]
17:45:15 [INFO] L2 cache ID 5: sibling CPUs: [5]
17:45:15 [INFO] L2 cache ID 9: sibling CPUs: [9]
17:45:15 [INFO] L2 cache ID 6: sibling CPUs: [6]
17:45:15 [INFO] L2 cache ID 8: sibling CPUs: [8]
...

In the above example, scx_bpfland is now scheduling threads in the system, instead of the default built-in kernel scheduler. To return to the default kernel scheduler, terminate the program with CTRL+C:

^C
EXIT: unregistered from user space
17:45:30 [INFO] Unregister scx_bpfland scheduler
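
To double-check from another terminal whether a sched_ext scheduler is currently in charge, one can read the files the kernel exposes under /sys/kernel/sched_ext: state reports whether sched_ext is enabled, and root/ops holds the name of the loaded scheduler. The sketch below wraps the two reads in a hypothetical helper, scx_status, introduced here for illustration; it treats a missing root/ops file as "no scheduler loaded".

```shell
# scx_status [SYSFS_DIR]
# Prints the sched_ext state and, if a scheduler is loaded, its name.
# SYSFS_DIR defaults to /sys/kernel/sched_ext.
scx_status() {
    ss_dir=${1:-/sys/kernel/sched_ext}
    if [ ! -r "$ss_dir/state" ]; then
        echo "sched_ext not available"
        return 1
    fi
    ss_state=$(cat "$ss_dir/state")
    if [ -r "$ss_dir/root/ops" ]; then
        # A scheduler is loaded: print the state and the scheduler name.
        echo "$ss_state: $(cat "$ss_dir/root/ops")"
    else
        echo "$ss_state"
    fi
}

scx_status || true
```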

Each sched_ext scheduler has a help text to explain its command line options. -h gives a help summary, while --help prints the full text:

# scx_bpfland -h
scx_bpfland: a vruntime-based sched_ext scheduler
that prioritizes interactive workloads

Usage: scx_bpfland [OPTIONS]

Options:
      --exit-dump-len <EXIT_DUMP_LEN>  Exit debug dump buffer length. 
                                       0 indicates default [default: 0]
  -s, --slice-us <SLICE_US>            Maximum scheduling slice duration
                                       in microseconds [default: 5000] 
  -S, --slice-us-min <SLICE_US_MIN>    Minimum scheduling slice duration
                                       in microseconds [default: 500]
  ...
  -h, --help                           Print help (see more with '--help')
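
When trying out different option values, it can be handy to run a scheduler for a fixed amount of time and have it interrupted automatically, rather than pressing CTRL+C by hand. The following sketch relies on timeout from GNU coreutils to deliver SIGINT, the same signal CTRL+C sends; trial_run is a hypothetical helper name, and the assumption is that the scheduler reacts to that signal as it does to CTRL+C in the terminal.

```shell
# trial_run SECONDS COMMAND [ARGS...]
# Runs COMMAND and, if it is still running after SECONDS, interrupts it
# with SIGINT. Returns 124 on timeout, following the convention of
# timeout(1); otherwise returns the exit status of COMMAND.
trial_run() {
    tr_seconds=$1
    shift
    timeout --signal=INT "$tr_seconds" "$@"
}

# Example (as root): evaluate scx_bpfland with a 10 ms slice for one minute.
# trial_run 60 scx_bpfland --slice-us 10000
```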

Systemd services to start/stop sched_ext schedulers

While invoking a scheduler from the command line is practical for a quick evaluation, it is inconvenient to do every time: we would like our favorite scheduler to start at boot. To do so, the scx package provides two alternative methods: the scx.service and scx_loader.service unit files. They get the job done, but neither of them is truly ideal: scx.service uses two separate environment variables holding the scheduler name and command line options, and requires several user commands to switch from one scheduler to another; scx_loader.service is in principle a more powerful approach, as it relies on a daemon running in the background (scx_loader), but needs the user to issue commands with dbus-send manually, since there is not yet a utility to do that in a more concise way.

scx.service: simple and essential

Environment variables

scx.service contains the following settings (see systemctl cat scx.service):

EnvironmentFile=/etc/default/scx
ExecStart=/bin/bash -c 'exec ${SCX_SCHEDULER_OVERRIDE:-$SCX_SCHEDULER} ${SCX_FLAGS_OVERRIDE:-$SCX_FLAGS}'

The environment variables in the snippet above determine the behavior of the service. They are:

  • SCX_SCHEDULER: name of the scheduler started at boot, e.g. scx_bpfland or scx_lavd. The variable is read from /etc/default/scx (see EnvironmentFile= in the snippet above); set its value there to make a sched_ext scheduler the permanent default.
  • SCX_SCHEDULER_OVERRIDE: set this variable with systemctl set-environment SCX_SCHEDULER_OVERRIDE=... to switch to a scheduler other than the default sched_ext scheduler. This change will not persist across reboots.
  • SCX_FLAGS: command line arguments for the sched_ext scheduler. Its value is set in /etc/default/scx, so change it there for permanent effect.
  • SCX_FLAGS_OVERRIDE: set this variable with systemctl set-environment SCX_FLAGS_OVERRIDE=... to temporarily change the command line arguments to the current scheduler. This change will not persist across reboots.
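
The precedence between the four variables comes from the ${VAR_OVERRIDE:-$VAR} shell expansion in the ExecStart= line: the override is used when it is set and non-empty, otherwise the default applies. The sketch below reproduces that expansion in a plain shell so the outcome can be inspected without touching the service; effective_scx_command is a hypothetical helper introduced here, and the comments show what each call prints.

```shell
# effective_scx_command
# Prints the command line that the expansion used in ExecStart= yields
# for the current values of the four variables.
effective_scx_command() {
    # Unquoted on purpose, so that the flags are split into words the
    # same way the shell splits them on the service's command line.
    set -- ${SCX_SCHEDULER_OVERRIDE:-$SCX_SCHEDULER} ${SCX_FLAGS_OVERRIDE:-$SCX_FLAGS}
    echo "$*"
}

# Defaults only, as after a fresh install.
SCX_SCHEDULER=scx_bpfland
SCX_FLAGS=
SCX_SCHEDULER_OVERRIDE=
SCX_FLAGS_OVERRIDE=
effective_scx_command     # scx_bpfland

# Override the flags, keep the default scheduler.
SCX_FLAGS_OVERRIDE='--slice-us 20000'
effective_scx_command     # scx_bpfland --slice-us 20000

# Override both the scheduler and the flags.
SCX_SCHEDULER_OVERRIDE=scx_lavd
SCX_FLAGS_OVERRIDE='--performance'
effective_scx_command     # scx_lavd --performance
```

One consequence of the :- form is that an override set to the empty string behaves as if it were unset, so an empty SCX_FLAGS_OVERRIDE cannot be used to clear flags configured in SCX_FLAGS.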

Default behavior

When the scx package is freshly installed, the /etc/default/scx environment file contains the following:

# cat /etc/default/scx
SCX_SCHEDULER=scx_bpfland

That is, scx_bpfland is the scheduler started by the service (at boot or otherwise), without any command line option, since SCX_FLAGS isn't set. When enabling and starting the service we'll get:

# systemctl enable scx.service
# systemctl start scx.service
# systemctl status scx.service
● scx.service - Start scx_scheduler
     Loaded: loaded (/usr/lib/systemd/system/scx.service; enabled; preset: disabled)
     Active: active (running) since Fri 2024-10-25 14:04:45 CEST; 31min ago
 Invocation: 7634f0a0b005440690a228f672a08b1f
   Main PID: 27224 (scx_bpfland)
      Tasks: 4 (limit: 19088)
        CPU: 385ms
     CGroup: /system.slice/scx.service
             └─27224 scx_bpfland

If any command line arguments were passed to scx_bpfland, they would appear in the "CGroup" entry of the output above.

Temporarily change a scheduler's command line arguments

By setting the value of SCX_FLAGS_OVERRIDE with systemctl set-environment, one can change the command line options of a sched_ext scheduler managed by scx.service. Note that after the change it's necessary to restart the service:

# systemctl set-environment SCX_FLAGS_OVERRIDE='--slice-us 20000'
# systemctl restart scx.service
# systemctl status scx.service
● scx.service - Start scx_scheduler
     Loaded: loaded (/usr/lib/systemd/system/scx.service; enabled; preset: disabled)
     Active: active (running) since Fri 2024-10-25 15:26:50 CEST; 5s ago
 Invocation: e1d83698c0fa44db8214a8c36934c0d1
   Main PID: 33401 (scx_bpfland)
      Tasks: 4 (limit: 19088)
        CPU: 55ms
     CGroup: /system.slice/scx.service
             └─33401 scx_bpfland --slice-us 20000

See how the command line invocation of scx_bpfland has now changed. Note that this setting won't persist across reboots; for that, one has to set SCX_FLAGS in the /etc/default/scx environment file.

Temporarily switch to a different scheduler

This is done by setting the value of SCX_SCHEDULER_OVERRIDE (and possibly SCX_FLAGS_OVERRIDE too) with systemctl set-environment, and then restarting the service:

# systemctl set-environment SCX_SCHEDULER_OVERRIDE='scx_lavd'
# systemctl set-environment SCX_FLAGS_OVERRIDE='--performance'
# systemctl restart scx.service
# systemctl status scx.service 
● scx.service - Start scx_scheduler
     Loaded: loaded (/usr/lib/systemd/system/scx.service; enabled; preset: disabled)
     Active: active (running) since Fri 2024-10-25 15:35:55 CEST; 3s ago
 Invocation: 963a7c1b2a604f99beac4ef3bc321a05
   Main PID: 33421 (scx_lavd)
      Tasks: 4 (limit: 19088)
        CPU: 127ms
     CGroup: /system.slice/scx.service
             └─33421 scx_lavd --performance

Note that this setting won't persist across reboots; for that, one has to set SCX_SCHEDULER (and possibly SCX_FLAGS too) in the /etc/default/scx environment file.
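
Making the change permanent amounts to editing two assignments in /etc/default/scx, which can be done with any text editor. For scripted setups, the sketch below shows one possible way to replace or add an assignment in an environment file; set_env_var is a hypothetical helper introduced here, it writes values in double quotes, and it does not escape double quotes or backslashes inside the value.

```shell
# set_env_var FILE KEY VALUE
# Replaces (or adds) the assignment KEY="VALUE" in FILE, leaving all
# other lines untouched. The new assignment is placed at the end.
set_env_var() {
    sev_file=$1 sev_key=$2 sev_value=$3
    sev_tmp=$(mktemp) || return 1
    # Copy every line except existing assignments of KEY ...
    grep -v "^${sev_key}=" "$sev_file" > "$sev_tmp" || true
    # ... then append the new assignment.
    printf '%s="%s"\n' "$sev_key" "$sev_value" >> "$sev_tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$sev_tmp" > "$sev_file"
    rm -f "$sev_tmp"
}

# Example (as root):
# set_env_var /etc/default/scx SCX_SCHEDULER scx_lavd
# set_env_var /etc/default/scx SCX_FLAGS '--performance'
# systemctl restart scx.service
```

Keep in mind that SCX_SCHEDULER_OVERRIDE and SCX_FLAGS_OVERRIDE, if still set with systemctl set-environment, take precedence over the values in the file until they are unset or the machine is rebooted.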

Restore the scx.service default values

The following commands make systemd forget about whatever change we made to the environment variables, and restart the service with the default values from the /etc/default/scx environment file:

# systemctl unset-environment SCX_SCHEDULER_OVERRIDE
# systemctl unset-environment SCX_FLAGS_OVERRIDE
# systemctl restart scx.service

Restore the built-in kernel scheduler

By terminating the service, the kernel built-in scheduler gets back in charge:

# systemctl stop scx.service

scx_loader.service: powerful but still unrefined

The version of scx_loader currently packaged doesn't allow the user to specify a default sched_ext scheduler to start at boot (future versions will have this feature). Once scx_loader.service is started, the daemon waits in the background for user commands.

# systemctl enable scx_loader.service
# systemctl start scx_loader.service

There are commands for all common operations, such as starting a scheduler, switching to a different one, and querying the daemon for the currently running scheduler. Lacking a better utility, these commands have to be issued manually with dbus-send, and they tend to be rather long. This is, for instance, the command to start scx_bpfland -k -c 0:

dbus-send --system \
          --print-reply \
          --dest=org.scx.Loader \
          /org/scx/Loader \
          org.scx.Loader.StartSchedulerWithArgs \
          string:scx_bpfland \
          array:string:"-k","-c","0"
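
Until a dedicated utility exists, a small shell function can take some of the typing out of the invocation above. This is only a sketch: scx_loader_start and dbus_string_array are hypothetical names introduced here, the D-Bus destination, object path and method are copied from the command above, and scheduler arguments containing commas are not supported because dbus-send uses commas to separate array elements.

```shell
# dbus_string_array ARG...
# Joins its arguments into the array:string:a,b,c form used by dbus-send.
dbus_string_array() {
    # Inside the subshell, "$*" joins the arguments with the first
    # character of IFS, which is set to a comma.
    dsa_joined=$(IFS=,; printf '%s\n' "$*")
    echo "array:string:$dsa_joined"
}

# scx_loader_start SCHEDULER ARG...
# Asks scx_loader to start SCHEDULER with the given arguments.
# At least one argument is required.
scx_loader_start() {
    sls_sched=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: scx_loader_start SCHEDULER ARG..." >&2
        return 2
    fi
    dbus-send --system --print-reply --dest=org.scx.Loader \
        /org/scx/Loader org.scx.Loader.StartSchedulerWithArgs \
        "string:$sls_sched" "$(dbus_string_array "$@")"
}

# Example: equivalent to the dbus-send invocation shown above.
# scx_loader_start scx_bpfland -k -c 0
```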

The full list of available scx_loader commands can be found in the upstream readme file.