The Adventures of OS

RISC-V OS using Rust


This is chapter 9 of a multi-part series on writing a RISC-V OS in Rust.

Table of Contents: Chapter 8 → (Chapter 9) → Chapter 10

Block Driver

6 April 2020: Patreon only

13 April 2020: Public

Video & Reference Material

I have taught operating systems at my university, so I will link my notes from that course on the VirtIO protocol here. This protocol has changed over the years, but the one that QEMU implements is the legacy MMIO interface.

https://www.youtube.com/watch?v=FyPnYxeH5YU

OS Course Notes: Virt I/O

https://docs.oasis-open.org/virtio/virtio/v1.1/virtio-v1.1.html

The notes above are a general overview of the VirtIO protocol as a concept. The OS we're building here will probably do things differently. Most of that is because it's written in Rust--insert jokes here.

Overview

The VirtIO protocol is a way to communicate with virtualized devices, such as a block device (hard drive) or input device (mouse/keyboard). For this post, I will show you how to write a block driver using the VirtIO protocol.

The first thing we must understand is that VirtIO is just a generic I/O communication protocol. Then, we have to look at the block device section to see the communication protocol specifically for block devices.

Volatile Pointers

Using memory-mapped I/O generally requires volatile pointers. In C/C++, volatile is a keyword that tells the compiler that the value at the memory address given by the pointer is subject to change behind the scenes. This means that the compiler cannot cache or optimize away a read or write on the assumption that the value doesn't change.

In C/C++, this is a keyword used when declaring the pointer. However, Rust does not have such a keyword. Instead, Rust uses a member of a raw pointer called read_volatile or write_volatile: https://doc.rust-lang.org/nightly/std/primitive.pointer.html#method.read_volatile.
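Here is a minimal sketch of what that looks like in practice. The helper names reg_read and reg_write are made up for this illustration, and an ordinary local variable stands in for a real device register so the snippet can run anywhere.

```rust
/// Read a 32-bit register without letting the compiler cache or drop the access.
///
/// Safety: `reg` must be valid, aligned, and readable.
pub unsafe fn reg_read(reg: *const u32) -> u32 {
    unsafe { reg.read_volatile() }
}

/// Write a 32-bit register without letting the compiler cache or drop the access.
///
/// Safety: `reg` must be valid, aligned, and writable.
pub unsafe fn reg_write(reg: *mut u32, value: u32) {
    unsafe { reg.write_volatile(value) }
}

/// A local variable plays the role of a device register (an assumption
/// made only so this sketch can run outside of the kernel).
pub fn volatile_demo() -> u32 {
    let mut fake_register: u32 = 0;
    let ptr = &mut fake_register as *mut u32;
    unsafe {
        reg_write(ptr, 0x7472_6976);
        reg_read(ptr)
    }
}
```

On real hardware the pointer would come from casting the MMIO base address instead of borrowing a local.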

This can lead to some issues: not too bad when reading, but a nightmare when writing. There are two different ways to tackle reading and writing MMIO: (1) create a big structure whose fields conveniently align with the register offsets, or (2) calculate the offset for each read and write. I personally like the convenience and readability of #1 better; however, Rust makes #1 much more difficult. After debugging for a long time, I decided to go with #2. I haven't given up on #1, but at this point, it's more stubbornness for not much gain.

To help me do this, I created an enumeration with all of the offsets, which are listed in the VirtIO specification.


#[repr(usize)]
pub enum MmioOffsets {
  MagicValue = 0x000,
  Version = 0x004,
  DeviceId = 0x008,
  VendorId = 0x00c,
  HostFeatures = 0x010,
  HostFeaturesSel = 0x014,
  GuestFeatures = 0x020,
  GuestFeaturesSel = 0x024,
  GuestPageSize = 0x028,
  QueueSel = 0x030,
  QueueNumMax = 0x034,
  QueueNum = 0x038,
  QueueAlign = 0x03c,
  QueuePfn = 0x040,
  QueueNotify = 0x050,
  InterruptStatus = 0x060,
  InterruptAck = 0x064,
  Status = 0x070,
  Config = 0x100,
}

The repr at the start means to represent these as the data type usize, which for a 64-bit system is 64 bits. We don't need this much storage space, but when we add to a pointer, Rust wants a usize, and not a u16.

You will notice, if you read the VirtIO specification, that these are the legacy offsets. That is because QEMU, for the time being, uses the legacy MMIO interface when using VirtIO.

Note About Pointer Offsets

Just like C/C++, when we add a value to a pointer in Rust, it adds it as a scaled offset. This means that if I add 1 to a pointer to a u64, it actually adds 8. The formula is base + offset * size for any pointer.
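To see the scaling for yourself, here is a small sketch. The function names are invented for the example, and it uses in-bounds array pointers rather than MMIO addresses so it is safe to run.

```rust
/// How many bytes does `add(1)` move a pointer to u64?
pub fn u64_step_in_bytes() -> usize {
    let data = [0u64; 2];
    let base = data.as_ptr();
    // Stays inside the array, so the pointer arithmetic is valid.
    let next = unsafe { base.add(1) };
    next as usize - base as usize
}

/// How many bytes does `add(1)` move a pointer to u32?
pub fn u32_step_in_bytes() -> usize {
    let data = [0u32; 2];
    let base = data.as_ptr();
    let next = unsafe { base.add(1) };
    next as usize - base as usize
}
```

Adding 1 moves the u64 pointer by 8 bytes and the u32 pointer by 4 bytes, which is why an absolute byte offset has to be divided by the element size before it is handed to add().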

This will lead to problems because I have the absolute offset numbers in this enumeration. To help connect this with Rust's raw pointers, I added some members to the enumeration.


impl MmioOffsets {
  pub fn val(self) -> usize {
    self as usize
  }

  pub fn scaled(self, scale: usize) -> usize {
    self.val() / scale
  }

  pub fn scale32(self) -> usize {
    self.scaled(4)
  }

}

The first member, fn val, will take the enumeration type and convert it into the equivalent usize. This is because Rust will not automatically convert one to the other.

Then, we have scale32(). I use this as a helper because I'm using a 32-bit pointer (most of the registers are 32 bits wide) to offset the base address for an MMIO read/write.
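As a rough sketch of how the scaled offsets get used, the snippet below cuts the enumeration down to three registers and points it at a plain array instead of a device. The names Reg, read_reg, and write_reg are placeholders for this example, not part of the OS.

```rust
/// A trimmed-down copy of the offsets enumeration (byte offsets).
#[repr(usize)]
#[derive(Clone, Copy)]
pub enum Reg {
    MagicValue = 0x000,
    DeviceId = 0x008,
    Status = 0x070,
}

impl Reg {
    /// Convert the byte offset into an index for a *mut u32.
    pub fn scale32(self) -> usize {
        (self as usize) / 4
    }
}

/// Safety: `base` must point at a register window large enough for `reg`.
pub unsafe fn read_reg(base: *mut u32, reg: Reg) -> u32 {
    unsafe { base.add(reg.scale32()).read_volatile() }
}

/// Safety: `base` must point at a register window large enough for `reg`.
pub unsafe fn write_reg(base: *mut u32, reg: Reg, value: u32) {
    unsafe { base.add(reg.scale32()).write_volatile(value) }
}

/// An array of 64 u32s (0x100 bytes) stands in for the device's registers.
pub fn scaled_offset_demo() -> (u32, usize) {
    let mut window = [0u32; 64];
    let base = window.as_mut_ptr();
    unsafe {
        write_reg(base, Reg::Status, 1);
    }
    // Byte offset 0x070 is the 28th u32 slot in the window.
    (window[0x070 / 4], Reg::Status.scale32())
}
```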

You can see that I don't particularly like this method. The structure method makes it much easier to switch between different data sizes, but for now, this is what I chose to go with.

Scanning the Bus

I will not cover the VirtIO protocol in depth here; for that, I've linked the lecture notes from my OS class above. Instead, let's make our operating system function. I'll truncate some of the instructions, so if you need a more in-depth look at the protocol, please see the lecture notes.

QEMU puts virtio devices (backwards) from 0x1000_1000 to 0x1000_8000. If we only have one device, it should be attached at 0x1000_8000, but as good OS practice, we probe every address to see what's attached.


pub fn probe() {
  // Rust's for loop uses an Iterator object, which now has a step_by
  // modifier to change how much it steps. Also recall that ..= means up
  // to AND including MMIO_VIRTIO_END.
  for addr in (MMIO_VIRTIO_START..=MMIO_VIRTIO_END).step_by(MMIO_VIRTIO_STRIDE) {
    print!("Virtio probing 0x{:08x}...", addr);
    let magicvalue;
    let deviceid;
    let ptr = addr as *mut u32;
    unsafe {
      magicvalue = ptr.read_volatile();
      deviceid = ptr.add(2).read_volatile();
    }
    // 0x74_72_69_76 is "virt" in little endian, so in reality
    // it is triv. All VirtIO devices have this attached to the
    // MagicValue register (offset 0x000)
    if MMIO_VIRTIO_MAGIC != magicvalue {
      println!("not virtio.");
    }
    // If we are a virtio device, we now need to see if anything
    // is actually attached to it. The DeviceID register will
    // contain what type of device this is. If this value is 0,
    // then it is not connected.
    else if 0 == deviceid {
      println!("not connected.");
    }
    // If we get here, we have a connected virtio device. Now we have
    // to figure out what kind it is so we can do device-specific setup.
    else {
      match deviceid {
        // DeviceID 2 is a block device
        2 => {
          print!("block device...");
          if false == setup_block_device(ptr) {
            println!("setup failed.");
          }
          else {
            let idx = (addr - MMIO_VIRTIO_START) >> 12;
            unsafe {
              VIRTIO_DEVICES[idx] =
                Some(VirtioDevice::new_with(DeviceTypes::Block));
            }
            println!("setup succeeded!");
          }
        },
        // DeviceID 4 is a random number generator device
        4 => {
          print!("entropy device...");
          if false == setup_entropy_device(ptr) {
            println!("setup failed.");
          }
          else {
            println!("setup succeeded!");
          }
        },
        _ => println!("unknown device type."),
      }
    }
  }
}  

During a probe, we first have to see if this is a virtio base address. At offset 0, we should read 4 bytes, which will be "triv", which is "virt" stored in little-endian. This is called the magic bytes, and it's used for identification purposes. If we find that this magic doesn't match, then we can be assured that this is not a virtio memory address.
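If the byte ordering feels slippery, this little sketch may help. It reuses the MMIO_VIRTIO_MAGIC name from the probe code; the two helper functions are made up for the example.

```rust
/// The value we expect to read from the MagicValue register.
pub const MMIO_VIRTIO_MAGIC: u32 = 0x74_72_69_76;

/// The bytes as they sit in memory on a little-endian machine.
pub fn magic_in_memory() -> [u8; 4] {
    MMIO_VIRTIO_MAGIC.to_le_bytes()
}

/// The bytes in the order the hex digits are written (most significant first).
pub fn magic_as_written() -> [u8; 4] {
    MMIO_VIRTIO_MAGIC.to_be_bytes()
}
```

The first function yields the ASCII bytes of "virt", and the second yields "triv", which is what you see when you read the hex constant from left to right.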

After we find that this is a virtio bus, we have to see what type of device is actually attached. Recall that virtio is a generic bus, so we can attach GPUs, network devices, block devices, and so forth. We can tell what type of device is attached by reading the DeviceID register. For now, we only care about device number 2, which is reserved for a block device.

If we find "virt" and device id 2, we can configure this device as a block device. This is when we get to the device-specific part of the specification.

Configuring the Device

Before we can use the device, we must configure it. We do this by following a procedure in which the driver (us) negotiates with the device (them).

The procedure for configuring a device is laid out in the specification as:

  1. Reset the device by writing 0 to the status register.
  2. Set the ACKNOWLEDGE status bit to the status register.
  3. Set the DRIVER status bit to the status register.
  4. Read device features from host_features register.
  5. Negotiate the set of features and write what you'll accept to guest_features register.
  6. Set the FEATURES_OK status bit to the status register.
  7. Re-read the status register to confirm that the device accepted your features.
  8. Perform device-specific setup.
  9. Set the DRIVER_OK status bit to the status register. The device is now LIVE.

There seem to be a lot of steps, but it really isn't all that bad. What we're doing is making sure that the driver and device understand each other. One of the "features" could be the read-only bit, which means that we cannot write to the device. We might want to negotiate this OFF if we want to write to the device.
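To make the bit twiddling concrete, here is a sketch of the status values a driver writes on the happy path, plus the read-only negotiation. The bit values come from the VirtIO specification; the constant and function names are my own for this example and are not the StatusField enumeration the OS uses.

```rust
// Device status bits, as defined by the VirtIO specification.
pub const STATUS_ACKNOWLEDGE: u32 = 1;
pub const STATUS_DRIVER: u32 = 2;
pub const STATUS_DRIVER_OK: u32 = 4;
pub const STATUS_FEATURES_OK: u32 = 8;
pub const STATUS_FAILED: u32 = 128;

// Bit position of the block device's read-only feature.
pub const VIRTIO_BLK_F_RO: u32 = 5;

/// The values written to the status register, in order, when nothing fails.
/// Each step ORs a new bit into what was already written.
pub fn status_sequence() -> [u32; 5] {
    let reset = 0;
    let acknowledged = reset | STATUS_ACKNOWLEDGE;
    let driver = acknowledged | STATUS_DRIVER;
    let features_ok = driver | STATUS_FEATURES_OK;
    let live = features_ok | STATUS_DRIVER_OK;
    [reset, acknowledged, driver, features_ok, live]
}

/// Accept everything the device offers except read-only, and remember
/// whether the device offered read-only in the first place.
pub fn negotiate(host_features: u32) -> (u32, bool) {
    let read_only = (host_features & (1 << VIRTIO_BLK_F_RO)) != 0;
    let guest_features = host_features & !(1 << VIRTIO_BLK_F_RO);
    (guest_features, read_only)
}
```

The sequence comes out as 0, 1, 3, 11, 15. If the device clears FEATURES_OK when we read the status back, the driver would write STATUS_FAILED instead of carrying on.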


pub fn setup_block_device(ptr: *mut u32) -> bool {
  unsafe {
    // We can get the index of the device based on its address.
    // 0x1000_1000 is index 0
    // 0x1000_2000 is index 1
    // ...
    // 0x1000_8000 is index 7
    // To get the number that changes over, we shift right 12 places (3 hex digits)
    let idx = (ptr as usize - virtio::MMIO_VIRTIO_START) >> 12;
    // [Driver] Device Initialization
    // 1. Reset the device (write 0 into status)
    ptr.add(MmioOffsets::Status.scale32()).write_volatile(0);
    let mut status_bits = StatusField::Acknowledge.val32();
    // 2. Set ACKNOWLEDGE status bit
    ptr.add(MmioOffsets::Status.scale32()).write_volatile(status_bits);
    // 3. Set the DRIVER status bit
    status_bits |= StatusField::Driver.val32();
    ptr.add(MmioOffsets::Status.scale32()).write_volatile(status_bits);
    // 4. Read device feature bits, write subset of feature
    // bits understood by OS and driver to the device.
    let host_features = ptr.add(MmioOffsets::HostFeatures.scale32()).read_volatile();
    let guest_features = host_features & !(1 << VIRTIO_BLK_F_RO);
    let ro = host_features & (1 << VIRTIO_BLK_F_RO) != 0;
    ptr.add(MmioOffsets::GuestFeatures.scale32()).write_volatile(guest_features);
    // 5. Set the FEATURES_OK status bit
    status_bits |= StatusField::FeaturesOk.val32();
    ptr.add(MmioOffsets::Status.scale32()).write_volatile(status_bits);
    // 6. Re-read status to ensure FEATURES_OK is still set.
    // Otherwise, it doesn't support our features.
    let status_ok = ptr.add(MmioOffsets::Status.scale32()).read_volatile();
    // If the status field no longer has features_ok set,
    // that means that the device couldn't accept
    // the features that we request. Therefore, this is
    // considered a "failed" state.
    if false == StatusField::features_ok(status_ok) {
      print!("features fail...");
      ptr.add(MmioOffsets::Status.scale32()).write_volatile(StatusField::Failed.val32());
      return false;
    }
    // 7. Perform device-specific setup.
    // Set the queue num. We have to make sure that the
    // queue size is valid because the device can only take
    // a certain size, so we check it before writing QueueNum.
    let qnmax = ptr.add(MmioOffsets::QueueNumMax.scale32()).read_volatile();
    if VIRTIO_RING_SIZE as u32 > qnmax {
      print!("queue size fail...");
      return false;
    }
    ptr.add(MmioOffsets::QueueNum.scale32()).write_volatile(VIRTIO_RING_SIZE as u32);
    // Work out how many pages the queue needs. We add
    // 4095 to round this up and then do an integer
    // divide to truncate the remainder. We don't add 4096,
    // because if it is exactly 4096 bytes, we would get two
    // pages, not one.
    let num_pages = (size_of::<Queue>() + PAGE_SIZE - 1) / PAGE_SIZE;
    // println!("np = {}", num_pages);
    // We allocate a page for each device. This will be the
    // descriptor where we can communicate with the block
    // device. We will still use an MMIO register (in
    // particular, QueueNotify) to actually tell the device
    // we put something in memory. We also have to be
    // careful with memory ordering. We don't want to
    // issue a notify before all memory writes have
    // finished. We will look at that later, but we need
    // what is called a memory "fence" or barrier.
    ptr.add(MmioOffsets::QueueSel.scale32()).write_volatile(0);
    // Alignment is very important here. This is the memory address
    // alignment between the available and used rings. If this is wrong,
    // then we and the device will refer to different memory addresses
    // and hence get the wrong data in the used ring.
    // ptr.add(MmioOffsets::QueueAlign.scale32()).write_volatile(2);
    let queue_ptr = zalloc(num_pages) as *mut Queue;
    let queue_pfn = queue_ptr as u32;
    ptr.add(MmioOffsets::GuestPageSize.scale32()).write_volatile(PAGE_SIZE as u32);
    // QueuePFN is a physical page number, so we divide the
    // queue's physical address by the page size we just
    // wrote to GuestPageSize. This is the physical memory
    // that we (the OS) and the block device have in common
    // for making and receiving requests.
    ptr.add(MmioOffsets::QueuePfn.scale32()).write_volatile(queue_pfn / PAGE_SIZE as u32);
    // We need to store all of this data as a "BlockDevice"
    // structure. We will be referring to this structure when
    // making block requests AND when handling responses.
    let bd = BlockDevice { queue:        queue_ptr,
                            dev:          ptr,
                            idx:          0,
                            ack_used_idx: 0,
                            read_only:    ro, };
    BLOCK_DEVICES[idx] = Some(bd);

    // 8. Set the DRIVER_OK status bit. Device is now "live"
    status_bits |= StatusField::DriverOk.val32();
    ptr.add(MmioOffsets::Status.scale32()).write_volatile(status_bits);

    true
  }
}
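The page arithmetic in the setup code is easy to get wrong by one, so here it is pulled out on its own. This is only a sketch: PAGE_SIZE is assumed to be 4096 here, and the function names are invented for the example.

```rust
/// Assumed page size for this sketch.
pub const PAGE_SIZE: usize = 4096;

/// Round a byte count up to whole pages.
pub fn pages_needed(bytes: usize) -> usize {
    (bytes + PAGE_SIZE - 1) / PAGE_SIZE
}

/// Convert a page-aligned physical address into a page frame number,
/// which is the form the legacy QueuePFN register takes.
pub fn page_frame_number(phys_addr: usize) -> u32 {
    (phys_addr / PAGE_SIZE) as u32
}
```

For example, 4096 bytes fit in one page while 4097 bytes need two, and a queue at physical address 0x8020_0000 has page frame number 0x80200.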

Requests

Now that the device is LIVE, we can start making requests by using the virtio rings. The virtio descriptor/ring system is generic; however, we have a protocol when making block requests. We will make a block request using three descriptors: (1) block request header, (2) block request buffer, and (3) block request status.
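For reference, here is a sketch of what the queue structures could look like, based on the legacy split-ring layout in the specification. The field names match the ones used in this chapter's code where they appear (addr, len, flags, next, idx, ring, id). The values of VIRTIO_RING_SIZE and PAGE_SIZE, the UsedElem name, and the way the padding is computed are assumptions for illustration.

```rust
use core::mem::size_of;

// Assumed values for this sketch.
pub const VIRTIO_RING_SIZE: usize = 128;
pub const PAGE_SIZE: usize = 4096;

/// One entry in the descriptor table (16 bytes).
#[repr(C)]
pub struct Descriptor {
    pub addr: u64,  // physical address of the buffer
    pub len: u32,   // length of the buffer in bytes
    pub flags: u16, // NEXT and/or WRITE
    pub next: u16,  // index of the next descriptor when NEXT is set
}

/// Driver to device: heads of the descriptor chains we want handled.
#[repr(C)]
pub struct Available {
    pub flags: u16,
    pub idx: u16,
    pub ring: [u16; VIRTIO_RING_SIZE],
    pub event: u16,
}

/// One response from the device.
#[repr(C)]
pub struct UsedElem {
    pub id: u32,  // head descriptor index of the finished chain
    pub len: u32, // number of bytes the device wrote
}

/// Device to driver: chains the device has finished with.
#[repr(C)]
pub struct Used {
    pub flags: u16,
    pub idx: u16,
    pub ring: [UsedElem; VIRTIO_RING_SIZE],
    pub event: u16,
}

/// The whole queue. The padding pushes the used ring onto the next page
/// boundary, which is where the legacy interface expects to find it.
#[repr(C)]
pub struct Queue {
    pub desc: [Descriptor; VIRTIO_RING_SIZE],
    pub avail: Available,
    pub padding0:
        [u8; PAGE_SIZE - size_of::<Descriptor>() * VIRTIO_RING_SIZE - size_of::<Available>()],
    pub used: Used,
}
```

With these assumed sizes, the descriptor table and available ring fill out exactly one page, and the used ring starts on the second.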

The header tells the block device whether we want to read or write and where. Unfortunately, the where part is in sectors, not bytes. However, the conversion from one to the other is a factor of 512. That is, there are 512 bytes per sector. So, that's quite a simple calculation.

After the header, we store the buffer. For reads, the device will write to this piece of memory, and for writes, the device will read from this piece of memory. It's important to note that these must be physical addresses, since the block device bypasses the MMU.

Finally, we have a status field. The device will write the result of the request to this 8-bit field. There are currently only three responses we can get: 0-success, 1-failure, 2-unsupported operation. That doesn't give us a lot of information, but if we get a 0, we can reasonably assume that our request was properly handled.
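Here is a sketch of the header and status pieces. The field names blktype, reserved, and sector are the ones the request code in this chapter uses, and the numeric values come from the specification. The BlockStatus enumeration, make_header, and decode_status are names I made up for this example.

```rust
// Request types from the VirtIO block device specification.
pub const VIRTIO_BLK_T_IN: u32 = 0; // read: device writes into our buffer
pub const VIRTIO_BLK_T_OUT: u32 = 1; // write: device reads from our buffer
pub const SECTOR_SIZE: u64 = 512;

/// The first descriptor of a block request points at one of these.
#[repr(C)]
pub struct Header {
    pub blktype: u32,
    pub reserved: u32,
    pub sector: u64,
}

/// What the device can write into the one-byte status field.
#[derive(Debug, PartialEq)]
pub enum BlockStatus {
    Ok,
    IoError,
    Unsupported,
    Unknown(u8),
}

/// Build a header from a byte offset, converting bytes to sectors.
pub fn make_header(byte_offset: u64, write: bool) -> Header {
    Header {
        blktype: if write { VIRTIO_BLK_T_OUT } else { VIRTIO_BLK_T_IN },
        reserved: 0,
        sector: byte_offset / SECTOR_SIZE,
    }
}

/// Interpret the status byte after the device responds.
pub fn decode_status(raw: u8) -> BlockStatus {
    match raw {
        0 => BlockStatus::Ok,
        1 => BlockStatus::IoError,
        2 => BlockStatus::Unsupported,
        other => BlockStatus::Unknown(other),
    }
}
```

Note that the division silently drops any remainder, so a byte offset that is not a multiple of 512 lands on the start of the sector that contains it.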

Making a Request

To make a request, we need to allocate heap memory. The memory we create must stay resident/valid until AFTER the device makes its response. Therefore, we cannot use the stack. We will grab three open descriptors from the virtio queue, populate them with the header, buffer, and status, and then write the virtqueue's number (0) into the queue_notify register to tell the device to start working on the request.


pub fn block_op(dev: usize, buffer: *mut u8, size: u32, offset: u64, write: bool) {
  unsafe {
    if let Some(bdev) = BLOCK_DEVICES[dev - 1].as_mut() {
      // Check to see if we are trying to write to a read only device.
      if true == bdev.read_only && true == write {
        println!("Trying to write to read/only!");
        return;
      }
      let sector = offset / 512;
      // TODO: Before we get here, we are NOT allowed to schedule a read or
      // write OUTSIDE of the disk's size. So, we can read capacity from
      // the configuration space to ensure we stay within bounds.
      let blk_request_size = size_of::<Request>();
      let blk_request = kmalloc(blk_request_size) as *mut Request;
      let desc = Descriptor { addr:  &(*blk_request).header as *const Header as u64,
                              len:   size_of::<Header>() as u32,
                              flags: virtio::VIRTIO_DESC_F_NEXT,
                              next:  0, };
      let head_idx = fill_next_descriptor(bdev, desc);
      (*blk_request).header.sector = sector;
      // A write is an "out" direction, whereas a read is an "in" direction.
      (*blk_request).header.blktype = if true == write {
        VIRTIO_BLK_T_OUT
      }
      else {
        VIRTIO_BLK_T_IN
      };
      // We put 111 in the status. Whenever the device finishes, it will write into
      // status. If we read status and it is 111, we know that it wasn't written to by
      // the device.
      (*blk_request).data.data = buffer;
      (*blk_request).header.reserved = 0;
      (*blk_request).status.status = 111;
      let desc = Descriptor { addr:  buffer as u64,
                              len:   size,
                              flags: virtio::VIRTIO_DESC_F_NEXT
                                      | if false == write {
                                        virtio::VIRTIO_DESC_F_WRITE
                                      }
                                      else {
                                        0
                                      },
                              next:  0, };
      let _data_idx = fill_next_descriptor(bdev, desc);
      let desc = Descriptor { addr:  &(*blk_request).status as *const Status as u64,
                              len:   size_of::<Status>() as u32,
                              flags: virtio::VIRTIO_DESC_F_WRITE,
                              next:  0, };
      let _status_idx = fill_next_descriptor(bdev, desc);
      (*bdev.queue).avail.ring[(*bdev.queue).avail.idx as usize % virtio::VIRTIO_RING_SIZE] = head_idx;
      (*bdev.queue).avail.idx = (*bdev.queue).avail.idx.wrapping_add(1);
      // The only queue a block device has is 0, which is the request
      // queue.
      bdev.dev.add(MmioOffsets::QueueNotify.scale32()).write_volatile(0);
    }
  }
}  

The code above allocates the request on the heap (using kmalloc), fills three descriptors that point at its header, buffer, and status, and then puts the head of that descriptor chain into the available ring. When we write 0 to queue_notify, the device starts immediately.
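The request code calls fill_next_descriptor, which isn't shown in this chapter. The sketch below is one plausible way such a helper could work, written against a small stand-alone table so it can run by itself. DescTable, fill_next, chain_read_request, the ring size of 8, and the placeholder addresses are all assumptions for illustration; only the flag values come from the specification.

```rust
pub const VIRTIO_RING_SIZE: usize = 8; // assumed size for this sketch
pub const VIRTIO_DESC_F_NEXT: u16 = 1; // chain continues in `next`
pub const VIRTIO_DESC_F_WRITE: u16 = 2; // device writes to this buffer

#[derive(Clone, Copy, Default)]
pub struct Descriptor {
    pub addr: u64,
    pub len: u32,
    pub flags: u16,
    pub next: u16,
}

/// A hypothetical stand-in for the descriptor table plus its running index.
pub struct DescTable {
    pub desc: [Descriptor; VIRTIO_RING_SIZE],
    pub idx: u16,
}

impl DescTable {
    pub fn new() -> Self {
        DescTable { desc: [Descriptor::default(); VIRTIO_RING_SIZE], idx: 0 }
    }

    /// Advance to the next slot, store the descriptor there, and if it is
    /// chained, point `next` at the slot that will be filled afterwards.
    pub fn fill_next(&mut self, mut d: Descriptor) -> u16 {
        self.idx = (self.idx + 1) % VIRTIO_RING_SIZE as u16;
        if (d.flags & VIRTIO_DESC_F_NEXT) != 0 {
            d.next = (self.idx + 1) % VIRTIO_RING_SIZE as u16;
        }
        self.desc[self.idx as usize] = d;
        self.idx
    }
}

/// Build the header, data, and status chain for a read. The addresses
/// are placeholders, not real buffers.
pub fn chain_read_request(table: &mut DescTable) -> [u16; 3] {
    let head = table.fill_next(Descriptor {
        addr: 0x1000, len: 16, flags: VIRTIO_DESC_F_NEXT, next: 0,
    });
    let data = table.fill_next(Descriptor {
        addr: 0x2000, len: 512, flags: VIRTIO_DESC_F_NEXT | VIRTIO_DESC_F_WRITE, next: 0,
    });
    let status = table.fill_next(Descriptor {
        addr: 0x3000, len: 1, flags: VIRTIO_DESC_F_WRITE, next: 0,
    });
    [head, data, status]
}
```

This sketch never checks whether a slot is still in use by an outstanding request, so it only holds up while fewer descriptors are in flight than the ring can hold.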

Responses

The available ring is used by us to make a request. The used ring is used by the device to send a response back to us. When we write 0 into queue_notify, the device starts working. Then, when it is finished, it will send an interrupt via the PLIC (remember that thing?). Luckily, 0x1000_1000 is PLIC interrupt 1 ... 0x1000_8000 is PLIC interrupt 8. So, that's an easy translation.

A response comes in the form of a used ring element. When we read from this element, we will get the descriptor's identifier (index) that it's responding to. This is because the block device is free to execute requests in any order it desires. SO!!! We cannot assume that we will get responses in the order of our requests.

We will take an external interrupt, ask the PLIC what caused it, and the PLIC will give us 8 for the first block device. When we see that, we can forward the data to the block device's handler, which will then acknowledge the response.
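The translation between addresses, array indices, and PLIC interrupt numbers can be sketched like this. The start and end addresses come from the text above, and the 0x1000 stride is inferred from them. The function names are invented for this example.

```rust
pub const MMIO_VIRTIO_START: usize = 0x1000_1000;
pub const MMIO_VIRTIO_END: usize = 0x1000_8000;
pub const MMIO_VIRTIO_STRIDE: usize = 0x1000;

/// 0x1000_1000 is index 0, 0x1000_2000 is index 1, and so on.
pub fn index_from_addr(addr: usize) -> usize {
    (addr - MMIO_VIRTIO_START) >> 12
}

/// The PLIC interrupt number is always one more than the index.
pub fn irq_from_addr(addr: usize) -> u32 {
    index_from_addr(addr) as u32 + 1
}

/// Going the other way: which slot does this interrupt belong to?
/// Returns None for interrupts that are not virtio devices.
pub fn index_from_irq(irq: u32) -> Option<usize> {
    if (1..=8).contains(&irq) {
        Some(irq as usize - 1)
    } else {
        None
    }
}
```

So interrupt 8 maps to index 7, which is the device at 0x1000_8000.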


pub fn pending(bd: &mut BlockDevice) {
  // Here we need to check the used ring and then free the resources
  // given by the descriptor id.
  unsafe {
    let queue = &*bd.queue;
    while bd.ack_used_idx != queue.used.idx {
      let elem = &queue.used.ring[bd.ack_used_idx as usize % VIRTIO_RING_SIZE];
      bd.ack_used_idx = bd.ack_used_idx.wrapping_add(1);
      let rq = queue.desc[elem.id as usize].addr as *const Request;
      kfree(rq as *mut u8);
      // TODO: Awaken the process that will need this I/O. This is
      // the purpose of the waiting state.
    }
  }
}

We keep ack_used_idx internally, so the device doesn't get to see that. That's the last acknowledged index of the used ring. The queue.used.idx is shared between the device and the driver. Therefore, whenever the device wants to respond to us, it will put something into the used ring and then increment the index. We can detect that our internal index is NOT the same as the common one, telling us that we have an unhandled response.

We have to use != in the while loop above because ALL of these rings are circular, meaning that when we reach the end, we start up from the beginning.
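Here is a small sketch of why != works where < would not. The indices are free-running u16 counters that wrap at 65536, and only the slot number is taken modulo the ring size. The ring size of 8 and the function names are assumptions for the example.

```rust
pub const VIRTIO_RING_SIZE: usize = 8; // assumed; must be a power of two

/// How many responses have we not looked at yet?
pub fn unhandled(ack_used_idx: u16, used_idx: u16) -> u16 {
    used_idx.wrapping_sub(ack_used_idx)
}

/// Which ring slots would the pending() loop visit, in order?
pub fn slots_to_visit(mut ack_used_idx: u16, used_idx: u16) -> Vec<usize> {
    let mut slots = Vec::new();
    while ack_used_idx != used_idx {
        slots.push(ack_used_idx as usize % VIRTIO_RING_SIZE);
        ack_used_idx = ack_used_idx.wrapping_add(1);
    }
    slots
}
```

With our index at 65534 and the device's index already wrapped around to 1, there are three unhandled responses in slots 6, 7, and 0. A loop that tested for less-than would see 65534 < 1 as false and never run.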

Notice that it isn't until we get the response that we free the resources, using kfree.

Testing

We can test reads and writes now that we have read() and write() functions. In the last chapter, we will link the block driver to the user processes so that we can use a system call to read sections of the block device!
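The read() and write() functions aren't listed in this chapter, but given block_op's signature, they can be thin wrappers that only differ in the write flag. The sketch below assumes that shape. The block_op here is a stand-in that just records its arguments in a made-up Op structure so the snippet can run without a device; the real one returns nothing.

```rust
/// Records what a call asked for. This exists only for the sketch.
#[derive(Debug, PartialEq)]
pub struct Op {
    pub dev: usize,
    pub size: u32,
    pub offset: u64,
    pub write: bool,
}

/// Stand-in for the real block_op, which would build and submit the request.
pub fn block_op(dev: usize, _buffer: *mut u8, size: u32, offset: u64, write: bool) -> Op {
    Op { dev, size, offset, write }
}

/// Read `size` bytes starting at byte `offset` into `buffer`.
pub fn read(dev: usize, buffer: *mut u8, size: u32, offset: u64) -> Op {
    block_op(dev, buffer, size, offset, false)
}

/// Write `size` bytes from `buffer` starting at byte `offset`.
pub fn write(dev: usize, buffer: *mut u8, size: u32, offset: u64) -> Op {
    block_op(dev, buffer, size, offset, true)
}
```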

When your block device works properly, you will need to link the hdd.dsk that has been hovering over us for a while.


// Let's test the block driver!
println!("Testing block driver.");
let buffer = kmem::kmalloc(512);
block::read(8, buffer, 512, 0);
for i in 0..48 {
  print!(" {:02x}", unsafe { buffer.add(i).read() });
  if 0 == ((i+1) % 24) {
    println!();
  }
}
kmem::kfree(buffer);
println!("Block driver done");

The code above goes into kinit right after we probe the virtio bus. When I look at what I get (the first 48 bytes), I see the following:

To verify our results, let's take a look at a hex dump of the hdd.dsk file:

Conclusion

There's a lot going on here. When I first decided to tackle the VirtIO spec, I didn't realize the painstaking frustrations that I would encounter. However, I think I now have a firm grasp on what's going on, but I'm open to corrections!


Stephen Marz (c) 2020
