SDB:Checking System Stability


Version: 1.0 -

Request

You would like to make sure that Linux runs and is stable on your hardware setup.

Procedure

Provided that you have installed the kernel sources and the necessary developer tools (e.g., the compiler), you can check your system stability quite reliably with the following little script:

 #!/bin/bash
 #
 # Adapted from http://www.bitwizard.nl/sig11
 #

 cd /usr/src/linux || exit 1

 t=1
 while [ -f log.$t ]
   do
   ((t++))
 done

 if [ ! -r .config ]; then
   echo -e
   echo -e "There was no .config file found in /usr/src/linux ."
   echo -e "This means that the kernel sources have not been configured yet."
   echo -e "If you continue, \"make cloneconfig\" will be executed to create"
   echo -e "a kernel configuration based on the currently running kernel."
   echo -e "\n"
   echo -e "Press <Ctrl>-<C> to abort or <ENTER> to continue ..."
   read
   make cloneconfig
 fi

 touch log.$t
 watch "ls -lt log.*" &

 while true
   do
   mv .config .config.stress
   make mrproper &> /dev/null
   mv .config.stress .config
   make V=1 > log.$t 2> /dev/null
   ((t++))
 done
 

The original version of this script can be found at http://www.bitwizard.nl/sig11, along with background information about this test.
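
The prerequisites named at the start of the procedure (kernel sources and build tools) can be verified before the loop is started. The following is a minimal sketch, not part of the original script. It assumes that gcc is the compiler in use and that an installed source tree has a top-level Makefile; the function name preflight is made up for this example.

```bash
#!/bin/bash
# Sketch: check the prerequisites before starting the stress loop.
# Assumptions: gcc is the compiler, and an installed kernel source tree
# contains a top-level Makefile. "preflight" is a hypothetical name.
preflight() {
  local src=${1:-/usr/src/linux}
  local missing=0 tool

  for tool in make gcc; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool"
      missing=1
    fi
  done

  if [ ! -f "$src/Makefile" ]; then
    echo "no kernel sources in $src"
    missing=1
  fi

  # 0 = everything found, 1 = at least one prerequisite is missing
  return "$missing"
}
```

Calling preflight without an argument checks /usr/src/linux; a non-zero return value means the stress script should not be started yet.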

Advantages of this test compared to other methods (e.g., memtest86) are:

  1. The complete system will be tested, not just the RAM
  2. The system can stay operational while this test is running (see note of caution below, though)

The script runs an endless loop of kernel builds (make V=1) and saves the output of make for each run in a separate log file, which will be quite large.

On stable hardware, every run is expected to produce identical output.

While running, the script shows a continuously updated view of the log files in /usr/src/linux, produced by

watch "ls -lt log.*"

It could look like this:

Every 2s: ls -lt log.*               Wed Aug  8 15:22:02 2001
-rw-r--r--    1 root     root         5472 Aug  8 15:22 log.4
-rw-r--r--    1 root     root       127120 Aug  8 15:21 log.3
-rw-r--r--    1 root     root       127120 Aug  8 15:12 log.2
-rw-r--r--    1 root     root       127120 Aug  8 15:04 log.1

In this example, the first three runs have already completed. It is a good sign that their log files all have the same size. To be on the safe side, however, let the test run for about 24 hours. In addition to comparing the file sizes, you can use md5sum to compute checksums for all log files and verify that they are really identical:

linux:/usr/src/linux # md5sum log.*
51e25c01370ce034b2c00d4c71995f02  log.1
51e25c01370ce034b2c00d4c71995f02  log.2
51e25c01370ce034b2c00d4c71995f02  log.3
a014cc76b1fb46a3cc5b84484403a1b7  log.4

It is no surprise that the fourth log file has a different checksum, since that run had not completed yet. All the other (completed) runs should show identical checksums, however.

Note: Under certain circumstances, the first run can produce slightly different output than the following runs. As a general rule, therefore, all runs except the first and the last (not yet completed) one must produce identical log files.
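
The rule above can be automated. The following is a minimal sketch, not part of the original script. It assumes the logs are named log.N, that the highest-numbered log is still being written, and that the stress script is the only thing creating log.* files in the directory. The function name check_logs is made up for this example.

```bash
#!/bin/bash
# Sketch: compare the checksums of all runs except the first and the last.
# Assumptions: logs are named log.N, the highest N is the unfinished run,
# and log.1 may legitimately differ. "check_logs" is a hypothetical name.
check_logs() {
  local dir=${1:-/usr/src/linux}
  local -a nums
  local n sum ref="" bad=0

  # Collect the run numbers in numeric order (log.10 sorts after log.9).
  mapfile -t nums < <(
    for f in "$dir"/log.*; do
      [ -f "$f" ] && echo "${f##*.}"
    done | grep -E '^[0-9]+$' | sort -n
  )

  # Needed: the first run, two comparable runs, and the unfinished run.
  if [ "${#nums[@]}" -lt 4 ]; then
    echo "not enough completed runs to compare"
    return 2
  fi

  # Skip the first and the last entry; the first log that is compared
  # serves as the reference for all later ones.
  for n in "${nums[@]:1:${#nums[@]}-2}"; do
    sum=$(md5sum < "$dir/log.$n" | cut -d' ' -f1)
    if [ -z "$ref" ]; then
      ref=$sum
    elif [ "$sum" != "$ref" ]; then
      echo "MISMATCH: log.$n"
      bad=1
    fi
  done

  [ "$bad" -eq 0 ] && echo "all compared logs are identical"
  return "$bad"
}
```

Calling check_logs without an argument inspects /usr/src/linux. Because the first compared log is simply taken as the reference, a reported mismatch only shows that the logs differ, not which of them is the faulty one.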

Note of Caution: While in general this test can run in parallel to normal operations of the system, be aware that it can heavily impact both I/O and compute performance of the system. One customer even reported that it forced them to stop the VMware server on their production system.
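
If the test has to share the machine with other workloads, one option is to start it at the lowest CPU priority and, where available, in the idle I/O class. The following is a minimal sketch, not part of the original article. It assumes that nice is present and treats ionice (from util-linux) as optional; whether the idle I/O class has any effect depends on the I/O scheduler in use. The function name run_low_prio and the file name stress-kernel.sh are made up for this example.

```bash
#!/bin/bash
# Sketch: run a command at the lowest CPU priority and, if possible,
# in the idle I/O scheduling class. "run_low_prio" is a hypothetical name.
run_low_prio() {
  # Use ionice only if it is installed and permitted in this environment.
  if command -v ionice >/dev/null 2>&1 && ionice -c 3 true 2>/dev/null; then
    nice -n 19 ionice -c 3 "$@"
  else
    nice -n 19 "$@"
  fi
}

# Example (stress-kernel.sh is a hypothetical name for the script above):
#   run_low_prio ./stress-kernel.sh
```

This reduces the impact on other processes, but the build still uses all otherwise idle CPU time and I/O bandwidth, so it continues to load the system.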

SDB:General Hardware Problems

TID 3301593 - Linux system hangs or is unstable