SDB:SCSI debugging hints


SCSI Debugging

Some helpful information for SCSI debugging can be found here.

scsi_logging_level

The SCSI subsystem has a general logging facility which can be enabled by writing to

/proc/sys/dev/scsi/logging_level

This is a _general_ logging facility, i.e. it cannot be restricted to individual devices or HBAs. However, it has several distinct logging areas which can be selected individually. Each of these areas occupies a 3-bit field in the logging_level value, so each area can be set to a level between 0 (off) and 7.

The definitions can be found in

drivers/scsi/scsi_logging.h

The possible areas are:

ERROR
Used by any command which has to be retried/recovered via the SCSI error handling mechanism
TIMEOUT
Used by drivers/scsi/sg.c
SCAN
Used during device scan, i.e. whenever a new device / HBA is initialized
MLQUEUE
Mid-layer queue; requests are being pulled from the block-layer queue and submitted to the HBA
MLCOMPLETE
Mid-layer queue; requests are being completed by the HBA and results are being pushed back to the block-layer
LLQUEUE
Low-layer queue; Not used
LLCOMPLETE
Low-layer queue; Not used
HLQUEUE
High-layer queue; command preparation in drivers/scsi/sd.c
HLCOMPLETE
High-layer queue; command completion in drivers/scsi/sd.c
IOCTL
SCSI IOCTL logging
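
To see which areas are active in a given logging_level value, the value can be split back into its 3-bit fields. The following is a minimal sketch. It assumes the areas are packed in the order listed above, with ERROR in the lowest three bits; check this against drivers/scsi/scsi_logging.h for your kernel. The function name scsi_logging_decode is made up for this example.

```shell
#!/bin/sh
# Sketch: split a logging_level value into per-area levels.
# Assumption: areas occupy consecutive 3-bit fields in the order below,
# starting with ERROR at bit 0.
scsi_logging_decode() {
    value=$(( $1 ))
    for area in ERROR TIMEOUT SCAN MLQUEUE MLCOMPLETE \
                LLQUEUE LLCOMPLETE HLQUEUE HLCOMPLETE IOCTL; do
        # The lowest three bits hold the level of the current area.
        echo "$area=$(( value & 7 ))"
        # Drop those bits so the next area moves into the lowest position.
        value=$(( value >> 3 ))
    done
}

# Example with a fixed value; on a live system one could instead pass
# "$(cat /proc/sys/dev/scsi/logging_level)".
scsi_logging_decode 9411
```

For the value 9411 this reports ERROR=3, SCAN=3, MLQUEUE=2 and MLCOMPLETE=2, with all other areas at 0.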

The levels of the most commonly used areas are described in detail below:

ERROR

Error logging when a command is recovered via the SCSI error handling mechanism. The following levels are used:

  1. Error handler thread statistics
  2. Error handler command statistics
  3. Error handler command logging
  4. Not used
  5. Error handler command details
  6. and higher: not used

SCAN

Logging during HBA / target scanning. The following levels are used:

  1. Logging of unusual devices where LUN 0 has a pqual of 3
  2. Logging of devices with pqual 3
  3. Logging of SCSI commands sent during scanning
  4. and higher: not used.

MLQUEUE

The MLQUEUE area is used when a command is pulled from the block-layer queue and sent to the HBA. It has the following levels:

  1. nothing (match completion)
  2. log opcode + command of all commands
  3. same as 2 plus dump cmd address
  4. same as 3 plus dump extra junk
  5. and higher: not used

MLCOMPLETE

Logging of command completion from the HBA, before the completion is called for the block-layer request. It has the following levels:

  1. log disposition, result, opcode + command, and conditionally sense data for failures or non SUCCESS dispositions.
  2. same as 1 but for all command completions.
  3. same as 2 plus dump cmd address
  4. same as 3 plus dump extra junk
  5. and higher: not used

Starting with SLES10 SP2 there is a command 'scsi_logging_level' which allows you to set the areas and levels without having to calculate the bit offsets by hand. This is very convenient if you want to enable a logging level other than 0xffffffff.

A useful setting which gives meaningful output without burying you in logging details is ERROR=3, SCAN=3, MLQUEUE=2, MLCOMPLETE=2, which evaluates to 9411:

echo 9411 > /proc/sys/dev/scsi/logging_level
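
The value 9411 can be reproduced by shifting each level into the 3-bit field of its area. The following sketch shows the arithmetic. It assumes the shift of each area follows the order in which the areas are listed in this article (ERROR at bit 0, TIMEOUT at bit 3, and so on); verify this against drivers/scsi/scsi_logging.h. The helper names scsi_area_shift and scsi_logging_value are hypothetical.

```shell
#!/bin/sh
# Sketch: compute a logging_level value from AREA=LEVEL pairs.
# Assumption: each area is a 3-bit field at the bit offset given below.
scsi_area_shift() {
    case "$1" in
        ERROR)      echo 0 ;;
        TIMEOUT)    echo 3 ;;
        SCAN)       echo 6 ;;
        MLQUEUE)    echo 9 ;;
        MLCOMPLETE) echo 12 ;;
        LLQUEUE)    echo 15 ;;
        LLCOMPLETE) echo 18 ;;
        HLQUEUE)    echo 21 ;;
        HLCOMPLETE) echo 24 ;;
        IOCTL)      echo 27 ;;
        *)          return 1 ;;
    esac
}

# Usage: scsi_logging_value AREA=LEVEL [AREA=LEVEL ...]
scsi_logging_value() {
    value=0
    for pair in "$@"; do
        area=${pair%%=*}
        level=${pair#*=}
        shift_bits=$(scsi_area_shift "$area") || {
            echo "unknown area: $area" >&2
            return 1
        }
        # A 3-bit field can only hold the levels 0 to 7.
        case "$level" in
            [0-7]) ;;
            *) echo "level out of range (0-7): $level" >&2; return 1 ;;
        esac
        value=$(( value | (level << shift_bits) ))
    done
    echo "$value"
}

# 3 + (3 << 6) + (2 << 9) + (2 << 12) = 3 + 192 + 1024 + 8192
scsi_logging_value ERROR=3 SCAN=3 MLQUEUE=2 MLCOMPLETE=2   # prints 9411
```

The result can then be written to /proc/sys/dev/scsi/logging_level as root.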

SCSI rescan on FibreChannel

Device detection on the SCSI bus works on two levels: on the first level the HBA detects the targets (using HBA / transport specific methods), and after that the SCSI midlayer scans each target for the presented LUNs. On FibreChannel, the (SCSI) target is mapped to an FC port. So the scan for targets is actually a scan for the visible FC ports.

FibreChannel topology

An HBA device connected to a FibreChannel SAN might operate in two different modes: Arbitrated Loop (AL) or Switched Fabric (SW). SW mode is designed for Switch-to-Switch communication, so it knows about name servers etc. A scan in SW mode requires querying the fabric name server, parsing the result, checking each resulting port, and so on. AL mode on the other hand is a simplification as it assumes that all remote ports are in a loop, so by querying each possible loop-id all devices in the loop will be detected. This initialisation routine is known as Loop Initialization Procedure (LIP).

As SW mode has quite some issues with interoperability, all Linux FC drivers (except for the zSeries 'zfcp' driver) run in AL mode.
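
To check how the local FC HBAs are actually attached, the FC transport class reports a port type for each FC host in sysfs. The sketch below prints it for every host found. It assumes that sysfs is mounted on /sys and that the driver registers with the fc_host class; the function name fc_host_port_types and its optional root argument are made up for this example.

```shell
#!/bin/sh
# Sketch: print the port type reported for each FC host.
# Usage: fc_host_port_types [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys)
fc_host_port_types() {
    root=${1:-/sys}
    for host in "$root"/class/fc_host/host*; do
        # Skips the unexpanded pattern when no FC host is present.
        [ -r "$host/port_type" ] || continue
        printf '%s: %s\n' "${host##*/}" "$(cat "$host/port_type")"
    done
}

fc_host_port_types
```

On a system without FC HBAs the function prints nothing.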

Loop Initialisation Procedure (LIP)

A LIP is triggered whenever the existing SAN information needs to be updated. Most obviously this is the case during startup, as then the driver needs to detect the available ports.

During operation a LIP should be triggered whenever the SAN configuration changes. However, depending on the switch configuration this might or might not be the case. The reasoning here is that triggering a LIP is a disruptive operation, causing all remote ports to reconfigure. During this time all I/O on the ports connected to the HBA is suspended. After the LIP has completed, I/O will resume if the ports are still present, or be kept suspended if the remote port is not visible anymore. In the latter case the dev_loss_tmo timer and the fast_io_fail_tmo timer (if present) are started. They are responsible for removing the remote port from the system and for stopping all I/O on the remote port, respectively. As a LIP is a full initialisation, the remote ports will be reset.
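
The current values of both timers can be inspected per remote port in sysfs. The following sketch lists them. It assumes sysfs is mounted on /sys and that the remote ports appear under the fc_remote_ports class; the function name fc_rport_timeouts and its optional root argument are hypothetical. A timer which the kernel does not provide is shown as n/a.

```shell
#!/bin/sh
# Sketch: list dev_loss_tmo and fast_io_fail_tmo for each FC remote port.
# Usage: fc_rport_timeouts [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys)
fc_rport_timeouts() {
    root=${1:-/sys}
    for rport in "$root"/class/fc_remote_ports/rport-*; do
        # Skips the unexpanded pattern when no remote port is present.
        [ -d "$rport" ] || continue
        dev_loss=$(cat "$rport/dev_loss_tmo" 2>/dev/null || echo n/a)
        fast_fail=$(cat "$rport/fast_io_fail_tmo" 2>/dev/null || echo n/a)
        printf '%s dev_loss_tmo=%s fast_io_fail_tmo=%s\n' \
            "${rport##*/}" "$dev_loss" "$fast_fail"
    done
}

fc_rport_timeouts
```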

For this reason most FC switches allow for a configuration where a LIP is not automatically started whenever the SAN configuration is changed.

rescan-scsi-bus.sh

As a LIP is not generally triggered when the SAN configuration changes, rescan-scsi-bus.sh has a switch -i|--issue-lip which causes a LIP to be triggered on the specified HBAs. Be aware of the consequences of this option, most notably the device reset.
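
A LIP can also be requested for a single HBA by hand through sysfs. The sketch below assumes that the FC transport class exposes a write-only issue_lip attribute under /sys/class/fc_host/hostN; the function name fc_issue_lip and its optional root argument are made up for this example. The same warning applies: the LIP interrupts I/O and resets the remote ports.

```shell
#!/bin/sh
# Sketch: request a LIP on one FC host via sysfs.
# Usage: fc_issue_lip HOST [SYSFS_ROOT]   e.g. fc_issue_lip host3
# Warning: disruptive, I/O on the HBA is suspended during the LIP.
fc_issue_lip() {
    host=$1
    root=${2:-/sys}
    attr="$root/class/fc_host/$host/issue_lip"
    if [ ! -w "$attr" ]; then
        echo "cannot write $attr (not an FC host, or not root?)" >&2
        return 1
    fi
    # Writing any value triggers the LIP; 1 is used by convention.
    echo 1 > "$attr"
}

# Deliberately not invoked here. Example call, as root:
#   fc_issue_lip host3
```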