Magic soup: ext4 with SSD, stripes and strides.

My recipe:

stride = Page size / Filesystem block
stripe-width = Erase Block / Filesystem block

For my Kingston HyperX 3K SSD SH103S3 (data provided by the Kingston customer service):

  • Page size = 8 kB
  • Erasure Block = 2 MB

Filesystem block = 4 kB
stride = 8 kB / 4 kB = 2
stripe-width = 2 MB / 4 kB = 512
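
If you prefer to let the computer do the division, here is a minimal sketch of the recipe above. The function names and the device path /dev/sdX1 are mine and purely illustrative, and I am assuming that "kB" and "MB" mean 1024 and 1024 x 1024 bytes (which is what makes 2 MB / 4 kB come out as 512). The command string uses the -b and -E stride=...,stripe_width=... options of mkfs.ext4; check the spelling against the man page of your version before running anything.

```python
KIB = 1024
MIB = 1024 * KIB


def ext4_ssd_params(page_size, erase_block, fs_block=4 * KIB):
    """Return (stride, stripe_width), both expressed in filesystem blocks."""
    if page_size % fs_block or erase_block % fs_block:
        raise ValueError("page and erase block sizes must be multiples of the filesystem block")
    stride = page_size // fs_block          # stride = Page size / Filesystem block
    stripe_width = erase_block // fs_block  # stripe-width = Erase Block / Filesystem block
    return stride, stripe_width


def mkfs_command(device, page_size, erase_block, fs_block=4 * KIB):
    """Build (but do not run) a mkfs.ext4 command line; `device` is a placeholder."""
    stride, stripe_width = ext4_ssd_params(page_size, erase_block, fs_block)
    return f"mkfs.ext4 -b {fs_block} -E stride={stride},stripe_width={stripe_width} {device}"


# Values for the Kingston HyperX 3K quoted above: 8 kB pages, 2 MB erase blocks.
print(ext4_ssd_params(8 * KIB, 2 * MIB))            # (2, 512)
print(mkfs_command("/dev/sdX1", 8 * KIB, 2 * MIB))
```

The sketch only prints the command instead of running it, because formatting the wrong device is a mistake you only make once.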

Now the long story. So you have a new SSD and you want to use ext4. If you are reading these lines you must be confused (and desperate enough to scroll down and down in Google until you reached this post). I was in that position too (and I ended up at my own blog), so let me share my findings with you, my fellow monkeys.

My advice is to start by reading the best SSD guide, or this other one, which is also very good but a bit more technical. If you want to skip that step (you shouldn’t) because you are in a hurry (did you leave the chicken in the oven?), I will present the important concepts here so you can follow along; skip them if you are already familiar with the topic.

Solid state drives are basically flash memory (think of a USB stick or an SD card); the information is kept in cells in the form of an electrical voltage. A cell can store a single bit or multiple bits, and here we have the first terms:

  • Single Level Cell (SLC): stores one bit
  • Multi Level Cell (MLC): stores more than one bit, usually two bits

SLC memories are faster and more reliable, but MLC memories are cheaper and therefore more common.

The cells are grouped into pages; a page is the minimum amount of data an SSD can read or write. Likewise, pages are grouped into blocks; a block is the minimum unit of data which can be erased, and for that reason they are called erase blocks or erasure blocks. They constitute the biggest drawback of SSDs. These days it is normal to have pages of 8 kB and erase blocks of 2 MB.

Let’s use those values and understand what they imply in a simplified manner. So you’ve got your SSD with pages of 8 kB and erase blocks of 2 MB. That means your SSD cannot read or write less than 8 kB at a time: even if you want to read only 1 kB it will read 8 kB, whether or not you care about the other 7 kB. Similarly, but worse, your lovely SSD can only erase blocks of 2 MB, so what happens when you only want to delete 5 kB of data? Well, the SSD has to do extra work: it copies the whole 2 MB block to a temporary (internal, cached) memory, removes the 5 kB, erases the block and writes the result back. This is a read-modify-write cycle (guess why) and overwrites work in much the same way. No doubt the SSD will try to avoid those.
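
To put numbers on that simplified picture, here is a toy model in Python. It is only a sketch under loud assumptions: the function names are mine, the request is assumed to start at the beginning of a page or erase block, and a real SSD controller remaps data internally, so the actual cost will differ.

```python
import math

KIB = 1024
MIB = 1024 * KIB


def bytes_read(requested, page_size=8 * KIB):
    """Bytes actually read: the drive can only read whole pages."""
    return math.ceil(requested / page_size) * page_size


def bytes_rewritten(modified, erase_block=2 * MIB):
    """Bytes rewritten by a read-modify-write cycle: whole erase blocks only.

    Simplification: the modified range is assumed to start at the beginning
    of an erase block, so it never straddles one more block than necessary.
    """
    return math.ceil(modified / erase_block) * erase_block


print(bytes_read(1 * KIB) // KIB)                # 8 (kB read to get 1 kB)
print(bytes_rewritten(5 * KIB) // MIB)           # 2 (MB rewritten to change 5 kB)
print(bytes_rewritten(5 * KIB) / (5 * KIB))      # 409.6 (times more data moved than changed)
```

The last number is the reason everybody cares about erase blocks: in this model, changing 5 kB moves about 400 times more data than you asked for.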

I encourage you again to read the SSD articles (one and two); in this post I only present the concepts I need for the ext4 parameters stride and stripe-width.

Now we move on to those two parameters. They are related to RAID and not to SSDs, so why do we care? Because they look quite similar to pages and erase blocks. Let’s see first what the manual page for mkfs.ext4 says about them.

Configure the filesystem for a RAID array with stride-size filesystem blocks. This is the number of blocks read or written to disk before moving to the next disk, which is sometimes referred to as the chunk size. This mostly affects placement of filesystem metadata like bitmaps at mke2fs time to avoid placing them on a single disk, which can hurt performance. It may also be used by the block allocator.

Configure the filesystem for a RAID array with stripe-width filesystem blocks per stripe. This is typically stride-size * N, where N is the number of data-bearing disks in the RAID (e.g. for RAID 5 there is one parity disk, so N will be the number of disks in the array minus 1). This allows the block allocator to prevent read-modify-write of the parity in a RAID stripe if possible when the data is written.

I have marked the important part in bold (usually I mark the unimportant part in bold, but today I felt like a change). The definitions look a bit far from SSDs, but let’s look closely.

  1. stride: This is the number of blocks read or written to disk before moving to the next disk, which is sometimes referred to as the chunk size.
    Interpretation: this is the smallest amount of data the filesystem will read from or write to one disk at a time. Now this sounds more familiar.
  2. stripe-width: This allows the block allocator to prevent read-modify-write of the parity in a RAID stripe if possible when the data is written.
    Interpretation: the filesystem will try to arrange writes in blocks of this size in order to minimize expensive operations. In RAID configurations using parity (like RAID 5) read-modify-write operations are expensive because the original data and parity are read, the data is modified, the new parity is recalculated and then the new data and the new parity are written back to disk.
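
The parity read-modify-write described in the second interpretation can be sketched in a few lines of Python. This is a toy illustration with made-up two-byte chunks, not how any real RAID implementation is written; it only shows that RAID 5 parity is the XOR of the data chunks and that a small write can update it from the old data and the old parity alone.

```python
def xor_bytes(a, b):
    """XOR two equally sized byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity(chunks):
    """Parity of a stripe: the XOR of all its data chunks."""
    result = bytes(len(chunks[0]))
    for chunk in chunks:
        result = xor_bytes(result, chunk)
    return result


def read_modify_write(old_data, old_parity, new_data):
    """Small-write path: read old data and old parity, compute the new parity."""
    return xor_bytes(old_parity, xor_bytes(old_data, new_data))


# A stripe with three data chunks (illustrative values).
chunks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
old_parity = parity(chunks)

# Overwrite only the second chunk.
new_chunk = b"\xaa\xbb"
new_parity = read_modify_write(chunks[1], old_parity, new_chunk)

# Same result as recomputing the parity from the whole updated stripe.
print(new_parity == parity([chunks[0], new_chunk, chunks[2]]))  # True
```

The small-write path costs two reads and two writes for one chunk of new data. When a whole stripe is written at once the parity is computed from the new data directly and nothing has to be read first, which is what the block allocator is aiming for.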

Note that both stride and stripe-width are expressed in filesystem blocks, which are usually 4 kB.
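
For reference, this is how the two values come out in the RAID case the manual page has in mind. It is a small sketch with hypothetical numbers (a 64 kB chunk and a 4-disk RAID 5 are my assumptions, not values from any particular setup), and the function name is mine.

```python
KIB = 1024


def raid_params(chunk_size, total_disks, parity_disks, fs_block=4 * KIB):
    """Return (stride, stripe_width) in filesystem blocks for a RAID array."""
    data_disks = total_disks - parity_disks
    if data_disks < 1:
        raise ValueError("the array needs at least one data-bearing disk")
    stride = chunk_size // fs_block     # blocks written to one disk before moving on
    stripe_width = stride * data_disks  # stride-size * N, as in the manual page
    return stride, stripe_width


# Hypothetical RAID 5 with 4 disks (3 data + 1 parity) and a 64 kB chunk.
print(raid_params(64 * KIB, total_disks=4, parity_disks=1))  # (16, 48)

# A single disk with no parity: stripe-width collapses to the stride.
print(raid_params(64 * KIB, total_disks=1, parity_disks=0))  # (16, 16)
```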

Now that we have all the concepts, let’s wrap up:

  • Data is stored in cells:
    Single Level Cell (SLC): stores one bit
    Multi Level Cell (MLC): stores more than one bit, usually two bits
  • Page: cells are grouped into pages; a page is the smallest amount of data that an SSD can read/write.
  • Blocks (Erasure Blocks): pages are grouped into blocks; an erase block is the smallest structure that an SSD can erase.
  • stride: minimum unit used to read/write data.
  • stripe-width: targeted block of data for read-modify-write operations; ext4 will attempt to perform such operations using this block size.

What is the common knowledge relating strides and stripes to pages and blocks? Well, there is no common conclusion; there are different opinions, which I have gathered for you (to be honest it was for me, but saying “for you” is more polite). Below you can read the approaches. Each one has a representative link, the formulas it uses, my interpretation of them and, where it applies, why I disagree.

Approach 1: set the stride and stripe-width to erase block size.
This approach seems to be the most widespread in the forums and guides about SSD optimization. It identifies the stride with the erase block, and the stripe-width is computed as if it were a RAID configuration with only one disk. After that identification, the formulas are taken from the RAID wiki.

stride = Erase Block / Filesystem block
stripe-width = stride x (number of data disks in RAID) = stride x 1 = stride

Reasoning: SSD writes are expensive when data is modified or overwritten, so let’s set the minimum data unit to the erase block size to try to minimize the number of expensive operations.

IMO flaw: of the three basic operations (read, write and delete) only deletion is expensive. Keep in mind that what triggers a deletion is modifying or overwriting data (and, of course, actually deleting it). Setting the stride equal to the erase block penalizes reads and simple writes which don’t involve modifications.

Approach 2: set the stride to the page size and the stripe-width related to the erase block.
This approach defines the stride as the page size of the SSD and the stripe-width as the number of strides in an Erase block.

stride = Page size / Filesystem block
stripe-width = Erase Block / (stride x Filesystem block) = Erase Block / Page size

Reasoning: ext4 can write and read units of page size without penalties so that should be the minimum data unit. For the stripe-width we will indicate to ext4 to use the erase block size to minimize read-modify-write operations so hopefully ext4 will try to fill an entire Erase block before moving on to the next.

IMO flaw: the stripe-width formula counts how many strides fit in an erase block, not how many filesystem blocks. However, the mkfs.ext4 manual says “stripe-width filesystem blocks per stripe”, which means that the size of a stripe is ‘stripe-width x filesystem block size’.

Let’s say that we have a filesystem block of 4 kB, a page size of 8 kB and an erase block of 2 MB, then:
stride = 8 kB / 4 kB = 2
stripe-width = 2 MB / (2 x 4 kB) = 2 MB / 8 kB = 256
Therefore, ext4 will try to group modifications in chunks of 256 x 4 kB = 1 MB, which is half the size of our erase block.

Approach 3: let ext4 decide the stride and set the stripe-width to the erase block.

stride = automatic by ext4.
stripe-width = Erase Block / Filesystem block

Reasoning: stride is not related to SSD concepts but stripes are, so let’s leave the stride alone and work out the proper stripe-width. We should tell ext4 to try to group actions which modify data in sets the size of an erase block, because if we have to delete something, the minimum we can delete is an entire erase block.

Approach 4: heuristic, test and set
From the above link:

For most of us SSDs are magic boxes we push data into and pull data out of.

We know the data gets stored on NAND chips and that many (most?) NAND
chips have 128KB Erase Blocks.

But we have no knowledge of how the data itself is organized.
Assuming that an Erase Block contains contiguous sectors is wrong in
most cases. There is sophisticated logic going on that is re-mapping
the data. Those algorithms are NOT public. We definitely don’t know
enough to know what stride etc. is optimal.

Approach 5: mix 2 and 3
This is what I have come to conclude:
stride = Page size / Filesystem block
stripe-width = Erase Block / Filesystem block

And at least one other person agrees: not explicitly, but they use the same formulas.
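
To see the approaches side by side, here is a small Python sketch that applies each set of formulas to the example values used in this post (4 kB filesystem block, 8 kB page, 2 MB erase block). The function name is mine; a stride of None stands for "let ext4 decide" in Approach 3, and Approach 4 is left out because it has no formula.

```python
KIB = 1024
MIB = 1024 * KIB
FS_BLOCK = 4 * KIB


def compare_approaches(page_size, erase_block, fs_block=FS_BLOCK):
    """Map each approach to (stride, stripe_width), both in filesystem blocks."""
    page_blocks = page_size // fs_block
    erase_blocks = erase_block // fs_block
    return {
        "Approach 1": (erase_blocks, erase_blocks),
        "Approach 2": (page_blocks, erase_block // page_size),
        "Approach 3": (None, erase_blocks),  # stride not set, left to ext4
        "Approach 5": (page_blocks, erase_blocks),
    }


for name, (stride, stripe_width) in compare_approaches(8 * KIB, 2 * MIB).items():
    stripe_kib = stripe_width * FS_BLOCK // KIB  # size of one stripe in kB
    print(f"{name}: stride={stride}, stripe-width={stripe_width} ({stripe_kib} kB per stripe)")
```

Only Approach 2 ends up with a stripe of 1024 kB, half of the erase block; the others all describe a stripe of 2048 kB and differ only in what they do with the stride.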

And that’s for today.




2 Responses to Magic soup: ext4 with SSD, stripes and strides.

  1. Søren B says:

    You seem to have missed the article I consider the most authoritative on the matter, given it was written by an actual filesystem guy instead of random laymen like most of the articles out there:

    • thepadawan42 says:

      It’s true Ted Tso is an actual expert in the field (and also true I am one of those laymen) but his article was not included because it did not define the value for “stride” and because I considered that the formula for “stripe-width” was better explained in this other article (which also has a better presentation). But I can add it, along with the follow-up article, in the links section for future reference.

      Thank you for your comment.
