Disk Scheduling: The Elevator Problem

Imagine a high-traffic database server hosted on traditional spinning hard drives (HDDs). User A requests data from a table located on the outer edge of the disk, while User B requests data from the inner edge. If the disk blindly serves requests in the exact order they arrive, the mechanical read/write arm will violently swing back and forth across the disk platters. This mechanical movement—the Seek—is the absolute slowest operation in modern computing, taking thousands of times longer than fetching data from RAM. To prevent this “thrashing,” the Operating System steps in as a smart dispatcher, reordering I/O requests to minimize physical movement.

[!NOTE] The Mechanical Limit In a Hard Disk Drive (HDD), the read/write head must physically move to the correct track (Seek) and wait for the sector to spin under it (Rotation). Minimizing this physical movement is the #1 performance goal for HDD management.

1. Anatomy of Latency

Before we can optimize disk I/O, we must understand the physical components that introduce delay. A magnetic disk consists of spinning platters coated in magnetic material. A mechanical arm swings across the platters, positioning a read/write head over the correct track.

When an application asks for data, the total time taken is: Latency = Seek Time + Rotational Latency + Transfer Time

Seek Time: The time it takes to move the mechanical arm to the correct cylinder/track. This is the slowest component (typically ~3-10ms).
Rotational Latency: The time spent waiting for the spinning platter to rotate the desired sector directly beneath the read/write head (Avg: ~4ms for a 7200 RPM drive).
Transfer Time: The time it actually takes to magnetically read or write the bits as they pass under the head (Extremely fast, heavily dependent on data density and rotation speed).

Analogy: Think of reading a disk like playing a vinyl record. Seek Time is moving the needle to the correct song. Rotational Latency is waiting for the exact lyric to spin under the needle. Transfer Time is actually playing the song.

2. Scheduling Algorithms: The Elevator Dispatcher

Because Seek Time dominates the latency, the OS acts like an Elevator Dispatcher. If an elevator receives requests for floors 10, 50, and 5, it shouldn’t randomly bounce 10 → 50 → 5. It should sweep in one direction to service them efficiently.

Let’s analyze the algorithms using a queue of track requests: [98, 183, 37, 122, 14, 124, 65, 67], with the head initially at track 53.

A. FCFS (First-Come, First-Served)

How it works: Process requests in the exact order they arrive.
The Result: The head swings wildly: 53 → 98 → 183 → 37 → 122 → 14 → 124 → 65 → 67.
Total Seek: 640 tracks.
Verdict: Terrible for performance. Never used in practice for mechanical drives.

B. SSTF (Shortest Seek Time First)

How it works: Always service the request that is closest to the current head position.
The Result: 53 → 65 → 67 → 37 → 14 → 98 → 122 → 124 → 183.
Total Seek: 236 tracks. Much better!
The Pitfall: Starvation. If a heavy stream of requests continually arrives for tracks 50-70, a lone request for track 183 will be indefinitely delayed.

C. SCAN (The Elevator Algorithm)

How it works: The head starts at one end of the disk and moves continuously toward the other end, servicing requests as it reaches each track. Once it hits the physical edge, it reverses direction and sweeps back.
The Result: Assuming we start at 53 moving towards 0: 53 → 37 → 14 → 0 (edge) → 65 → 67 → 98 → 122 → 124 → 183.
The Pitfall: Middle tracks get serviced twice as often (once going up, once going down), while edges wait longer.

D. C-SCAN (Circular SCAN)

How it works: Similar to SCAN, but to provide a more uniform wait time, it only services requests in one direction. It sweeps from 0 to the highest cylinder, then immediately jumps back to 0 without reading anything on the return trip.
Pros: Highly uniform wait times across all tracks.

E. LOOK and C-LOOK

How it works: These are the practical implementations of SCAN and C-SCAN. Instead of blindly sweeping all the way to cylinder 0 or the absolute highest cylinder, the arm only sweeps as far as the last request in that direction, then reverses or jumps.

3. Interactive: The Elevator Visualizer

Act as the OS scheduler. Add random track requests to the queue and watch how a simplified C-LOOK/SCAN algorithm efficiently sorts and services them in a single sweep, minimizing chaotic arm movements.

Track 0

Track 50

Track 100

HEAD

      Add requests to start.
  

4. The Modern Era: SSDs and NVMe

Wait… Solid State Drives (SSDs) don’t have spinning platters or mechanical arms. Do we still need disk scheduling algorithms like SCAN and SSTF?

The short answer is No. In fact, modern Operating Systems explicitly disable complex seek scheduling for SSDs (often reverting to a noop or none scheduler). Here is why:

Seek is Free: In NAND flash memory, reading logical block address 0 or address 1,000,000 takes the exact same amount of time. There is no physical penalty for jumping around.
Internal Parallelism: SSDs contain multiple NAND flash chips connected via parallel channels. We want to send multiple un-ordered requests to the drive simultaneously so the drive’s internal controller (the Flash Translation Layer, or FTL) can fetch them in parallel. Serializing them into a strict “SCAN” line actively hurts performance.

The Evolution: Multi-Queue Block Layer (blk-mq)

Historically, the Linux kernel used a single, global request queue for storage devices. This required a global lock. On massive multi-core servers pushing millions of IOPS to an NVMe drive, threads spent more time fighting for the lock than reading data.

Modern Linux uses blk-mq. Instead of one global queue, there is one software queue per CPU core.

Application threads push requests to their core’s local queue (lock-free).
The kernel batches these and dispatches them directly to the NVMe hardware’s multiple submission queues.

io_uring: Asynchronous I/O Done Right

Standard read() system calls block the thread while waiting for the disk. Legacy asynchronous I/O (aio) in Linux had numerous flaws and often blocked unexpectedly.

io_uring (introduced in Linux 5.1) revolutionized storage I/O by utilizing two ring buffers in shared memory between the application and the kernel:

Submission Queue (SQ): The application pushes requests into this ring.
Completion Queue (CQ): The kernel pushes the results into this ring.

Because the memory is shared, applications can submit I/O requests and reap the results without ever issuing a system call context switch, leading to Zero Syscall Overhead when operating in polling mode.

5. Code Example: Request Merging

Even though we don’t worry about mechanical “seeks” on SSDs, the OS still tries to merge adjacent block requests. Issuing a single large read for sectors 1-4 is significantly more efficient (less overhead, fewer interrupts) than issuing two separate reads for sectors 1-2 and 3-4.

Below is a Go simulation of how the OS block layer merges adjacent I/O requests before sending them to the hardware.

package main

import (
    "fmt"
    "sort"
)

// IORequest represents a block of data requested from storage
type IORequest struct {
    StartBlock int
    Count      int
}

func main() {
    // Incoming concurrent requests for adjacent blocks
    // e.g., Reading blocks 10-11, Block 100, and Blocks 12-13
    queue := []IORequest{
        {StartBlock: 10, Count: 2},
        {StartBlock: 100, Count: 1},
        {StartBlock: 12, Count: 2},
    }

    fmt.Println("Before Merge:", queue)
    merged := mergeRequests(queue)
    fmt.Println("After Merge: ", merged)
}

func mergeRequests(reqs []IORequest) []IORequest {
    if len(reqs) == 0 {
        return nil
    }

    // 1. Sort the requests logically by their starting block address
    sort.Slice(reqs, func(i, j int) bool {
        return reqs[i].StartBlock < reqs[j].StartBlock
    })

    var result []IORequest
    current := reqs[0]

    // 2. Sweep through and merge adjacent blocks
    for i := 1; i < len(reqs); i++ {
        next := reqs[i]

        // Check if the next request starts exactly where the current one ends
        if current.StartBlock+current.Count == next.StartBlock {
            // Merge! We can fetch them in a single operation
            current.Count += next.Count
        } else {
            // Gap detected, push the merged chunk and start a new one
            result = append(result, current)
            current = next
        }
    }
    // Append the final chunk
    result = append(result, current)
    return result
}

Disk Scheduling