Wednesday, October 7, 2009

I Have a Schedule to Keep - IO Schedulers

It almost goes without saying that the Linux kernel is a very complex piece of software.

It is used in embedded devices that need real-time performance, hand-held devices, laptops, desktops, general servers, database servers, video servers, DNS systems, very large supercomputers, and on and on.

All of these uses for the kernel have very different requirements. Some require the system be responsive to user input so you don’t interrupt streaming music or video or other interactivity.

At the same time there are requirements for good IO performance (throughput, IOPS, etc.) and for some workloads, these requirements are very high.

To make sure that there is balance within the system for all users and processes, there is a concept of schedulers within the kernel.

The schedulers do exactly what the name says: they schedule activities within the kernel.

Since this column is all about IO, the scheduler of interest is, aptly enough, the IO scheduler.

This article discusses the IO scheduler concepts and the various options that are available.

Introduction - IO Scheduler Concepts

Virtually all applications running on Linux do some sort of IO. Even surfing the web produces a great number of small files that are written to disk.

Without an IO scheduler, every time there is an IO request, there is an interrupt to the kernel and the IO operation is performed.

Moreover, you can get a great mix of IO operations that move the disk head around the disk to satisfy read and write operations to different blocks on the drives.

Perhaps more importantly, over time the disparity between the performance of disk drives and the rest of the system has grown very rapidly, meaning that IO has become more important to overall system performance.

As you can imagine, when the kernel has to address the interrupt, any kind of processing or interactive work is paused.

Consequently the system may appear unresponsive, or it may appear that the system has slowed down.

How do you schedule the IO requests to preserve the interactivity while also ensuring good IO performance?

The answer, as with most things, depends upon the workload. In some cases it would be nice to be able to do IO while doing other things.

In other cases, you want IO to happen as fast as possible. To balance these two very different workloads, or to ensure that one workload is not emphasized over others (unless you intend it that way), the concept of the IO scheduler was born (actually it’s a pretty old concept).

Scheduling IO events has many pieces to it that must be addressed. For example, the scheduler may need to store the events for some future execution in some sort of queue.

How it stores the events, whether it reorders them, how long it holds them, whether it executes all stored events when some condition is reached, and whether it executes events at some regular interval are all crucial aspects of the scheduler.

Exactly how these various aspects of the scheduler are implemented can have a huge impact on the overall IO performance of the system and the perception people have when interacting with the system.

Defining the function or role of the system is probably the best place to start when considering scheduler design or tuning existing schedulers.

For example, you should know if the target system is an embedded device, a hand-held device, a laptop, desktop, server, supercomputer, database server, video server, and on and on.

Knowing this allows you to define what your goals are for the scheduler.

For example, the target system might be a desktop that is doing some web surfing as well as perhaps playing a video or music, and maybe even running a game. Seems simple, but this has enormous implications.

If you are watching a video, listening to music, or playing a game, you don’t want it to be interrupted and you don’t want any frames to be dropped.

Nothing like a video that pauses, plays, pauses, plays, to make you sea-sick in a hurry. Or you might be ready to blow the head off a mutant zombie when the system pauses just as you are firing, and by the time the system comes back the zombie has removed your character’s head. And while “stuttery” music may be a genre to some, in general, it’s quite annoying.

So, if your target system is a desktop and you want to have as little interactive interruption as possible, then this has a great influence on the design of the scheduler.

One important advantage that IO scheduling gives the system is that it allows you to store events and even possibly reorder them for faster IO.

Since disk IO can be much slower than other aspects of the system, reordering can produce contiguous IO requests, which can improve performance.

Newer file systems are even incorporating some of these concepts so that they can reorder the operations to make things easier and faster for the storage devices.

You can even extend these concepts to make the system better adapt to the unusual properties of SSDs.

There are some typical techniques that IO schedulers use. These techniques are:

  • Request Merging: In this concept, adjacent requests are merged together to reduce disk seeking and to increase the size of the IO requests sent to the device (usually resulting in higher performance).
  • Elevator: The requests are ordered based on their physical location on the disk so that the seeks are in one direction as much as possible.
  • Prioritization: This allows the requests to be put into some sort of priority order. The details of the ordering are up to the IO scheduler.

In addition almost all IO schedulers take into account resource starvation so that all requests are eventually serviced.
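To make the first two techniques a little more concrete, here is a minimal Python sketch. It is a toy model, not how the kernel implements them: the Request class, the function names, and the sector numbers are all made up for illustration, and it ignores details such as read/write direction and maximum request size.

```python
from dataclasses import dataclass


@dataclass
class Request:
    start: int    # first sector of the request (hypothetical numbering)
    length: int   # number of sectors

    @property
    def end(self) -> int:
        return self.start + self.length


def merge_adjacent(requests):
    """Merge requests whose sector ranges touch or overlap."""
    merged = []
    for req in sorted(requests, key=lambda r: r.start):
        if merged and req.start <= merged[-1].end:
            last = merged[-1]
            new_end = max(last.end, req.end)
            merged[-1] = Request(last.start, new_end - last.start)
        else:
            merged.append(Request(req.start, req.length))
    return merged


def one_way_elevator(requests, head):
    """Order requests so the head sweeps upward from its current position,
    then wraps around to the lowest pending sector."""
    ahead = sorted((r for r in requests if r.start >= head), key=lambda r: r.start)
    behind = sorted((r for r in requests if r.start < head), key=lambda r: r.start)
    return ahead + behind
```

With made-up requests at sectors 40, 100, 108, and 500 (8 sectors each), merge_adjacent combines the 100 and 108 requests into one 16-sector request, and one_way_elevator with the head at sector 90 services 100, then 500, then wraps around to 40.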

Linux IO Schedulers

There are currently four IO schedulers in the Linux kernel:

  • NOOP
  • Anticipatory
  • Deadline
  • Completely Fair Queuing (CFQ)

NOOP IO Scheduler

The NOOP IO scheduler is a fairly simple scheduler. With this scheduler all incoming IO requests are put into a simple First-In, First-Out (FIFO) queue and then executed.

Note that this happens for all processes running on the system regardless of the IO request (read, write, lseek, etc.). It also does something called request merging.

This is a feature that takes adjacent requests and merges them into a single request. This reduces seek time and improves throughput.
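As a rough illustration of "FIFO plus merging", here is a small Python sketch. The NoopQueue class and its methods are hypothetical names of mine, and merging only with the most recently queued request is a simplification, so treat it as a mental model rather than the kernel's code.

```python
from collections import deque


class NoopQueue:
    """Toy FIFO queue that merges a new request into the tail request
    when the new one starts exactly where the tail ends."""

    def __init__(self):
        self._fifo = deque()  # each entry is [start_sector, sector_count]

    def add(self, start, count):
        if self._fifo:
            tail = self._fifo[-1]
            if tail[0] + tail[1] == start:
                tail[1] += count  # back-merge: grow the tail request
                return
        self._fifo.append([start, count])

    def dispatch(self):
        """Return the oldest request as (start, count), or None if empty."""
        if not self._fifo:
            return None
        start, count = self._fifo.popleft()
        return (start, count)
```

Notice that nothing is sorted: requests leave in the order they arrived, and the only optimization is that two back-to-back requests become one larger request.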

According to this article, the NOOP scheduler “… uses the minimal amount of CPU/instructions per I/O to accomplish the basic merging and sorting functionality to complete the I/O.”

The IO scheduler assumes that some other device will optimize the IO performance. For example, an external RAID controller or a SAN controller could perform this optimization.

Potentially, the NOOP scheduler could work well with storage devices that don’t have a mechanical component to read data (i.e. the drive head).

The reason is that the NOOP scheduler does not make any attempts to reduce seek time beyond simple request merging which also helps throughput.

So storage devices such as flash drives, SSD drives, USB sticks, etc. that have very little seek time could benefit from using a NOOP IO scheduler.

Anticipatory IO Scheduler

The Anticipatory IO Scheduler, as the name implies, anticipates subsequent block requests.

It implements request merging, a one-way elevator (an elevator that sweeps in only one direction), and read and write request batching.

After the scheduler services an IO request, it anticipates that the next request will be for the subsequent block by pausing for a small amount of time.

If the request comes, the disk head is in the correct location and the request is very quickly serviced.

This approach does add a little latency to the system because it pauses slightly to see if the next request is for the subsequent block.

However, this can possibly be out-weighed by the increased performance for neighboring requests.
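The trade-off between the added latency and the benefit for neighboring requests can be sketched in a few lines of Python. This is only a toy decision function under my own assumptions: the name anticipate, the 6 ms wait window, and the "near" distance of 64 sectors are illustrative values, not the kernel's tunables.

```python
def anticipate(last_end, pending, arrivals, now, wait_window=0.006, near=64):
    """Decide what to service after finishing a request ending at `last_end`.

    pending  : start sectors that are already queued
    arrivals : (arrival_time, start_sector) pairs that will show up later
    Returns (sector, time_service_starts); sector is None if nothing to do.
    """
    # A nearby request is already queued: no need to wait.
    close = [s for s in pending if abs(s - last_end) <= near]
    if close:
        return min(close, key=lambda s: abs(s - last_end)), now

    # Otherwise idle briefly, hoping a nearby request shows up.
    deadline = now + wait_window
    for t, s in sorted(arrivals):
        if t > deadline:
            break
        if abs(s - last_end) <= near:
            return s, max(t, now)

    # Nothing nearby arrived in time: fall back to the closest queued request.
    if pending:
        return min(pending, key=lambda s: abs(s - last_end)), deadline
    return None, deadline
```

If a request for sector 1008 arrives 2 ms after a request ending at sector 1000 finishes, the sketch services it right away instead of seeking to a queued request at sector 5000. If it arrives after 10 ms instead, the wait was wasted and the far request starts 6 ms late, which is the extra latency mentioned above.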

Putting on your storage expert hat, you can see that the anticipatory scheduler works really well for certain workloads.

For example it has been observed that the Apache web server may achieve up to 71% more throughput using the anticipatory IO scheduler.

On the other hand, it has been observed that the anticipatory scheduler has caused up to a 15% slowdown on a database run.

Deadline IO Scheduler

The Deadline IO Scheduler was written by Jens Axboe, a well-known kernel developer.

The fundamental principle of the scheduler is to guarantee a start time for servicing an IO request.

It combines request merging, a one-way elevator, and imposes a deadline on all requests (hence the name).

It maintains two deadline queues in addition to two sorted queues; in each pair, one queue is for reads and one is for writes.

The deadline queues are sorted by their deadline times (time to expiration) with shorter times moving to the head of the queue.

The sorted queues are sorted based on their sector number (the elevator approach).

The deadline scheduler really helps for the cases of a remote read (remote meaning fairly far out on the disk or with a large sector number).

Reads can sometimes block applications because they have to be actually read from the disk while the application waits.

On the other hand writes can be quickly returned to the application because they are in the cache (unless you turn off the cache or use Direct IO).

Even worse, remote reads get serviced very slowly because they constantly get moved to the back of the queue as requests for closer parts of the disk get serviced first.

So, the deadline scheduler makes sure that all IO requests are serviced, even these distant read requests.

The general process of the scheduler is fairly straightforward. The scheduler decides on the next request by first deciding which queue to use.

It keeps a higher priority to reads because, as mentioned, applications usually block on read requests. Next, it checks the first request to see if it has expired.

If so, it is immediately serviced. Otherwise the scheduler serves a batch of requests from the sorted queue.

For both cases, the scheduler also services a batch of requests following the chosen request in the sorted queue.
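The selection process just described can be sketched in Python as follows. This is a simplified model under my own assumptions: the DeadlineQueues class, the dispatch function, the expiry times, and the batch size are hypothetical, and "always prefer reads" ignores whatever protection a real scheduler has against starving writes.

```python
import bisect


class DeadlineQueues:
    """Toy model of one direction (reads or writes) of a deadline scheduler."""

    def __init__(self, expire):
        self.expire = expire  # seconds until a request's deadline (made up)
        self.fifo = []        # (deadline, sector) pairs in arrival order
        self.sorted = []      # sectors in ascending order (the elevator)

    def add(self, sector, now):
        self.fifo.append((now + self.expire, sector))
        bisect.insort(self.sorted, sector)

    def _remove(self, sector):
        self.sorted.remove(sector)
        for i, (_, s) in enumerate(self.fifo):
            if s == sector:
                del self.fifo[i]
                break

    def next_batch(self, now, head, batch=2):
        if not self.fifo:
            return []
        deadline, oldest = self.fifo[0]
        if deadline <= now:
            # The oldest request has expired: start the batch there.
            start = self.sorted.index(oldest)
        else:
            # Otherwise continue the sweep from the current head position.
            start = bisect.bisect_left(self.sorted, head)
            if start == len(self.sorted):
                start = 0  # wrap around to the lowest sector
        chosen = self.sorted[start:start + batch]
        for s in chosen:
            self._remove(s)
        return chosen


def dispatch(reads, writes, now, head, batch=2):
    """Prefer reads, since applications usually block on them."""
    queue = reads if reads.fifo else writes
    return queue.next_batch(now, head, batch)
```

The point to take away is the two views of the same requests: the sorted list keeps the head moving in one direction, while the arrival-order list lets an old request at a distant sector jump the line once its deadline passes.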

The deadline scheduler is very useful for some applications. In particular, real-time systems use the deadline scheduler because, in most cases, it keeps the latency low (all requests are serviced within a short time frame).

It has also been suggested that it works well for database systems that have TCQ-aware (Tagged Command Queuing) disks.

CFQ IO Scheduler

The Completely Fair Queuing (CFQ) IO scheduler is the current default scheduler in the Linux kernel. It uses both request merging and elevators.

It puts synchronous requests from processes into a number of per-process queues. Then it allocates time slices for each of the queues to access the disk.

The length of the time slice and the number of requests a queue is allowed to submit both depend on the IO priority of the given process.

Asynchronous requests for all processes are batched together in fewer queues, one per priority.
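Here is a toy Python sketch of the per-process queues and priority-dependent time slices. The ToyCFQ class, the slice formula, and the fixed cost per request are all assumptions of mine chosen to keep the example short; it does not model the asynchronous queues or the actual CFQ arithmetic.

```python
from collections import OrderedDict, deque


class ToyCFQ:
    def __init__(self, base_slice=100):
        self.base_slice = base_slice  # hypothetical slice length in ms
        self.queues = OrderedDict()   # pid -> deque of requested sectors
        self.priority = {}            # pid -> IO priority, 0 (highest) to 7

    def add(self, pid, sector, prio=4):
        self.queues.setdefault(pid, deque()).append(sector)
        self.priority[pid] = prio

    def slice_for(self, pid):
        # Made-up rule: a lower priority number earns a longer slice.
        return self.base_slice * (8 - self.priority[pid]) // 4

    def run_round(self, cost_per_request=25):
        """Give each process queue one time slice; return the dispatch order."""
        order = []
        for pid in list(self.queues):
            budget = self.slice_for(pid)
            q = self.queues[pid]
            while q and budget >= cost_per_request:
                order.append((pid, q.popleft()))
                budget -= cost_per_request
            if not q:
                del self.queues[pid]  # nothing left for this process
        return order
```

In this sketch a process with six queued requests at the default priority gets four of them serviced in the first round, then has to wait while the other processes take their turns before its remaining two are serviced in the next round.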

Jens Axboe is the original author of the CFQ IO scheduler, and it incorporates something that Jens called the “elevator linus”.

It builds upon the idea of an elevator by adding features to prevent starvation in worst-case situations, as could happen with distant reads.

The previous link is also a good discussion of the design of the CFQ IO scheduler and the intricacies of scheduler design (also it discusses the design of the deadline scheduler - it’s well worth reading).

What CFQ does is give all users (processes) of a particular device (storage) about the same number of IO requests over a particular time interval.

This can help multi-user systems since all users will see about the same level of responsiveness.

Moreover, CFQ achieves some of the good throughput characteristics of the anticipatory scheduler: it allows a process queue to idle briefly at the end of a synchronous IO request, creating some anticipatory time to wait for IO that might be close to the request that just finished.

Changing the Scheduler

The 2.6 kernel series actually allows you to change the IO scheduler in several ways.

For example, you can change the default scheduler for the entire system by adding the “elevator=” option to the “kernel” line.

This can be done manually during the boot process or in the grub configuration file.

If you change the default IO scheduler by editing grub, be sure to edit the /boot/grub/menu.lst file, adding the option “elevator=” to the end of the “kernel” line.

For example, you could change it from cfq to deadline by adding the option “elevator=deadline” to the line that begins with “kernel”. If you change it, be sure to run the “update-grub” command afterwards.

A second way to change the IO scheduler is to actually change it on the fly for specific devices.

For example, you can determine which IO scheduler is being used by looking at the file “/sys/block/[device]/queue/scheduler”, where [device] is the name of the device. On my laptop:

root@laytonjb-laptop:~# cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]

Notice that the current IO scheduler is cfq (the name in square brackets). You can change the scheduler by just echoing the name of the desired scheduler to “/sys/block/[device]/queue/scheduler”.

For example, I can change the IO scheduler on my laptop to deadline.

root@laytonjb-laptop:~# echo deadline > /sys/block/sdb/queue/scheduler
root@laytonjb-laptop:~# cat /sys/block/sdb/queue/scheduler
noop anticipatory [deadline] cfq

Notice how the IO scheduler has changed to deadline. When the change in scheduler is performed, the “old” scheduler completes all of its requests before control switches over to the new scheduler (ain’t Linux grand?).
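If you want to check the scheduler from a script rather than by eye, the bracket convention is easy to parse. The following Python sketch is just one way to do it; the function names are my own, and read_scheduler assumes the same /sys/block/[device]/queue/scheduler path shown above.

```python
import re


def parse_scheduler_line(line):
    """Return (active, available) from text such as 'noop deadline [cfq]'."""
    active = None
    available = []
    for name in line.split():
        match = re.fullmatch(r"\[(.+)\]", name)
        if match:
            # The bracketed entry is the scheduler currently in use.
            active = match.group(1)
            available.append(active)
        else:
            available.append(name)
    return active, available


def read_scheduler(device):
    """Read and parse the scheduler file for a block device such as 'sdb'."""
    path = "/sys/block/{}/queue/scheduler".format(device)
    with open(path) as handle:
        return parse_scheduler_line(handle.read())
```

Feeding it the output shown above, "noop anticipatory [deadline] cfq", gives "deadline" as the active scheduler along with the list of all four available names.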


This article is just a quick introduction to IO schedulers in Linux. Today’s systems can have a large number of users, very IO intensive workloads, requirements for high levels of interactivity, real-time requirements, plus a large number of disks and/or file systems.

Given the enormous strains that current systems impose on IO subsystems, some way of controlling IO requests is mandatory.

This is where the IO scheduler helps.

The IO scheduler is not a new concept, but it is a very important one. These schedulers can be designed to influence IO and system behavior in whatever manner you desire.

Currently there are four IO schedulers in the Linux kernel: (1) NOOP, (2) Anticipatory, (3) Deadline, and (4) Completely Fair Queuing (CFQ). Various aspects of the schedulers were discussed at a fairly high level in this article.

While not discussed in this article, you can tune the various schedulers for your workload.

Take a look at the documentation that comes with the source for your current kernel.

For example, on my laptop, the documentation is found in the directory, /usr/src/linux-source-2.6.27/Documentation/block. In addition, there are a great number of articles around the web that discuss tuning.

One easy thing you can try is changing the IO scheduler associated with a particular device.

It’s an easy process: you just echo the name of the IO scheduler to the particular file in the /sys file system.

This is fairly easy to do and can give you some interesting results (hint, hint).
