SDB:SSD discard (trim) support

Warning: This article is outdated and needs maintenance.

This article discusses a specific SSD feature and what some may consider a shortcoming. This "shortcoming" in no way means that SSDs are not a good solution with openSUSE today. Basic SSD operation, with full ATA-7 support, works in all released versions of openSUSE, and most SSDs show significant performance improvements over rotating disks for most workloads. It is only the newer ATA-8 TRIM feature that is not yet fully implemented and optimized. The lack of this single feature in openSUSE (or the linux kernel in general) is no reason to forgo the significant performance gains available from modern SSDs.

Terminology

There are three terms often used interchangeably to describe the same basic functionality: discard, UNMAP, and TRIM.

Discard is the linux term for telling a storage device that sectors no longer hold valid data; it applies equally to both ATA and SCSI devices. For example, ext4 filesystems have a discard mount option, not a trim or unmap option. Historically this capability did not exist, but SSD manufacturers have since requested it to increase the performance capability of their designs, and SCSI array manufacturers have requested similar functionality to better support thin provisioning.

Thus it would be typical to discard small sector ranges on an SSD, but only large ranges on a SCSI array. For example, some SCSI arrays may clip ranges to 4 MB boundaries; exactly how ranges are clipped is array-specific.

TRIM is the actual ATA-8 command sent to an SSD to cause a sector range or set of sector ranges to be discarded. Strictly speaking it applies only to ATA devices, but it is often used generically. Given the prevalence of ATA devices, trim is the most commonly used of the three terms.

UNMAP is a SCSI-specific command that is similar in nature but primarily intended to support thin provisioning. UNMAP, too, is often used as a generic term rather than being reserved for SCSI devices.

SCSI also supports WRITE SAME which can also be used to implement discard functionality. WRITE SAME for now appears to only be used when it specifically applies.

Situation

Working with new-generation SSD drives is an evolving situation. Basic operation should be supported with all openSUSE releases, but discard/trim functionality is still being fully implemented. SSDs are soon to become common in laptops. To optimize their performance and extend their life, the draft ATA-8 specification calls for trim support. If a range of sectors is trimmed, those sectors no longer hold meaningful data and the SSD is free to delete their contents and erase them. Erasing is a slow operation for an SSD, so the ability to erase sectors when they are freed, instead of when they are next needed, is a major advantage.

Unfortunately, this advantage is only realized if the userland, kernel and hardware implementations are all optimized. As of early 2010, the only optimized implementation is a pure userland one that works similarly to defragmenting, i.e. it is meant to be called routinely, such as from a cron script.

Current status

Prior to 11.2, there is no support for trim.

As of 11.4, fstrim is part of the util-linux package and is the recommended way to invoke trim for most users.

Kernel support

Kernel realtime discard support

The kernels in openSUSE 11.2 and above support realtime discard, i.e. as files are deleted the underlying data blocks are discarded. These kernels support realtime discard for ext4 and xfs.

To use the kernel's realtime discard feature, you must mount with the "mount -o discard" option. openSUSE will not set this option automatically, so you must either mount the partition manually or add the option to your /etc/fstab file.
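
For illustration, the two ways to enable this might look like the following; the device name /dev/sda2 and the mount point /mnt/ssd are hypothetical and should be adjusted for your system:

    # mount a partition manually with realtime discard enabled
    mount -o discard /dev/sda2 /mnt/ssd

    # or enable it permanently with an /etc/fstab entry like this
    /dev/sda2   /mnt/ssd   ext4   defaults,discard   0 2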

The kernel implementation of realtime trim in 11.2, 11.3, and 11.4 is not optimized. The specification calls for trim to support a vectorized list of trim ranges, but as of kernel 3.0 the kernel only issues trim with a single discard/trim range, and with current (mid-2011) SSDs this has proven to cause a performance degradation instead of a performance increase. There are few reasons to use the kernel's realtime discard support with pre-3.1 kernels. It is not known when the kernel's discard functionality will be optimized to work beneficially with current-generation SSDs.

Kernel swap device discard support

The linux kernel since 2.6.29 has supported discard for pages in swap that are no longer used. The performance impact is variable and can either help or hurt specific SSDs.

In openSUSE 11.2 and 11.3 there is no way to control this feature. It is assumed by the kernel to be beneficial for all devices that support discard.

In the 2.6.36 kernel, an option controlled from userspace was introduced. For 11.4, this userspace control defaults to disabled; util-linux-ng version greater than v2.18 is required to enable the discard capability. (Note that util-linux-ng was renamed back to util-linux in early 2011.)

For details: see http://marc.info/?l=util-linux-ng&m=128796678428160
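
As a rough sketch of that userspace control with a sufficiently new util-linux, swap discard can be requested when activating swap or via /etc/fstab; the device name below is hypothetical and the exact option spelling should be checked against your swapon(8) man page:

    # enable discard when activating a swap device manually
    swapon --discard /dev/sda3

    # or request it via the options field in /etc/fstab
    /dev/sda3   swap   swap   defaults,discard   0 0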

Kernel mkfs discard support

One recent kernel optimization, which will affect new ext4 partitions created with openSUSE 11.4 and newer, is that the ext4 formatting tool mke4fs now discards the entire partition at format time, and the process has been optimized in the kernel to use 64 contiguous ranges per TRIM command. As such, it is well optimized. For performance reasons, it is important that newly created partitions have all of their unused blocks in a TRIMed state at the end of the formatting process. Thus it is highly recommended that mkfs from 11.4 or newer be used when formatting an SSD which previously contained valid data.
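
As an illustration, recent mke2fs / mkfs.ext4 versions let you request or suppress the format-time discard explicitly through the extended options; the device name below is an assumption:

    # discard (TRIM) the whole partition before laying out the filesystem
    mkfs.ext4 -E discard /dev/sda2

    # skip the discard pass, e.g. on a device with broken TRIM support
    mkfs.ext4 -E nodiscard /dev/sda2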

Kernel batched discard support

The 2.6.37 kernel gained a new ioctl, FITRIM, for ext4 only. openSUSE 11.4 uses this kernel. The openSUSE 11.4 userspace tool that invokes this new feature is fstrim. For older releases, fstrim is currently available via git from:

http://www.spinics.net/lists/util-linux-ng/msg03646.html
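
A typical fstrim invocation, run as root, looks like the following; the mount point /home is only an example, and -v makes fstrim report how many bytes were discarded:

    fstrim -v /home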

One feature of fstrim / FITRIM that is not fully supported by wiper.sh (described below) is discard support for device-mapper linear and stripe volumes:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5ae89a8720c28caf35c4e53711d77df2856c404e

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7b76ec11fec40203836b488496d2df082d5b2022

More importantly, FITRIM has been accepted as mainstream kernel functionality and as such receives broader review than wiper.sh below. FITRIM (invoked by fstrim) should therefore be far less likely to cause data loss by being attempted in an unsupported environment.

Userland support

fstrim from the util-linux (or util-linux-ng) package is the recommended tool for trimming SSDs in 11.4. It must be invoked from time to time, just like a defrag tool would be. Setting it up to be called from cron is best if that is feasible, as sketched below.
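
A minimal sketch of such a cron job, assuming the SSD-backed filesystem is mounted at / and that fstrim is installed as /usr/sbin/fstrim (verify the path and mount points on your own system):

    #!/bin/sh
    # /etc/cron.weekly/fstrim -- trim unused blocks on the SSD once a week
    /usr/sbin/fstrim -v / >> /var/log/fstrim.log 2>&1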

Prior to 11.4, both 11.2 and 11.3 have updated versions of hdparm which can bypass the normal block I/O layers and submit optimized trim commands directly to the SSD. hdparm requires a control script to decide which sectors are appropriate for trimming; that script, wiper.sh, is provided in both the 11.2 and 11.3 hdparm packages.

At present neither 11.2 nor 11.3 will invoke wiper.sh automatically; it is up to the user to create a cron script to do so (a sketch follows below). In general wiper.sh should only take a few seconds to run, and it must be run as root.
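
A comparable weekly cron sketch for wiper.sh, assuming the script from the hdparm package is in root's PATH and that / is the SSD-backed filesystem to trim; without --commit wiper.sh normally only performs a dry run, so check its --help output before relying on this:

    #!/bin/sh
    # /etc/cron.weekly/wiper -- tell the SSD which freed blocks it may erase
    wiper.sh --commit / >> /var/log/wiper.log 2>&1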

How often should I run fstrim / wiper.sh?

For most users, once a week should be sufficient. That may seem like a long time between runs, but it should work as long as you are not running a workload that creates and deletes huge numbers of files (with content) at a high rate.

In theory an SSD handles its known free areas like a queue: it only takes a few milliseconds after a block is put into the queue before it is erased and available to be pulled off the top. So as long as there are sufficient erased blocks available, the drive can run at full speed, pushing newly freed blocks in at the bottom and pulling freshly erased blocks off the top.

If the workload were a database, conceptually every write goes to a newly allocated block, while the old block is immediately freed and put into the erase queue. With newer-generation SSDs, a garbage collector appears to run behind the scenes in the SSD's internal firmware to scavenge these overwritten areas and efficiently make the space available for future use.

I.e. SSDs do not allow physical data blocks to be updated in place. Every write requires a new data block to be pulled off the free-block stack, and the previous physical block is freed for erasing. Thus a process like a database that keeps updating previously allocated data blocks just causes a lot of churn, but doesn't really make the stack of free blocks any smaller.

(Note: Newer-generation SSDs no longer work exactly as described above. Instead they appear to track used/unused space at a more granular level than the erase block (EB). A database-style update therefore causes a portion of an EB to be marked free and a new EB to be allocated to hold the new data, so heavy database-style updates can cause a shortage of free EBs due to all the partially used EBs left behind. Modern SSDs run an EB consolidator in the background to address this issue, and having the SSD properly trimmed allows that consolidator to work at optimal performance. See SDB:SSD_Idle_Time_Garbage_Collection_support for more detail.)

It is file deletion that leaves the SSD assuming the blocks still hold valuable data, and thus unavailable for erasing, even though the filesystem itself knows those blocks will never be used again.

So the whole purpose of running wiper.sh or fstrim is to tell the SSD to put the no longer allocated blocks into the erase / free queue.

As long as the SSD doesn't stall trying to pull blocks off the top of that queue, it really doesn't matter how deep the queue is. So if you have 10 GB of free space on your partition, you only need to call wiper.sh / fstrim once for every 10 GB worth of file deletions.

hdparm bugs

An online update for hdparm exists which fixes the bugs described below.

For openSUSE 11.2 and openSUSE 11.3 you can install the online update via:

    zypper in -t patch hdparm-3298


History of bug 635920 fixed via patch hdparm-3298

For OCZ Vertex 2E SSDs, see https://bugzilla.novell.com/show_bug.cgi?id=635920

Also, per Mark Lord, the hdparm maintainer, Intel SSDs do not fully support trim as described in the draft specification and suffer from the same bug. The wiper.sh script and the hdparm binary itself have therefore been updated in an effort to support the Intel SSDs.

The updates have been made in hdparm v9.32 and backported to openSUSE 11.2 and 11.3 for the above online update.

Although wiper.sh is still considered experimental, calling wiper.sh on any SSD is not believed to cause data loss; trim payload incompatibilities should simply cause the trim operation to fail. As of the above online update, there are no known issues.

User feedback

Comment from the EXT4 kernel mailing list

From: Mark Best

I have a new Vertex2 Solid State Drive. When I try to install any distribution using EXT4 (or LUKS with EXT4), my hard drive times out during the 'file copy' process. (OpenSUSE 11.2, for example, would crash after the 2nd file of X11; other distros get a bit further into the install before hitting 'read-only' errors.)

DISTROS ATTEMPTED:

  • Ubuntu 10 - x64
  • Debian 5 - x64
  • PCLinuxOS 9 - x32
  • CentOS 5.4 - x64
  • OpenSUSE 11.2 (Linux 2.6.31) - x64
  • OpenSUSE 11.3 - Build 0625 (Linux 2.6.34) - x64

Attempts:

  • If I use my Seagate 1TB drive I can install just fine to any EXT4 partition. (Same SATA cable and SATA controller port.)
  • If I use the EXT2 filesystem on my Vertex2 SSD I can install any distro.
  • Windows 2003/NTFS also installs fine to the Vertex2 SSD.
  • Writing random 'dd' data to the entire SSD returns no errors.
  • Upgrading the EVGA motherboard BIOS to the latest version (P33) didn't help.
  • Installing the OpenSUSE 11.2 OS to my HDD, and then copying files from the HDD to the Vertex2, causes the same I/O timeout / remount-read-only errors.
