Virtio-drive
A hard drive that uses the virtiofs interface.
Code is in a private repo here: https://github.com/UBC-ECE-Sasha/virtio-drive
Development Environment
The conflict between VFIO and P2P
Assigning a PCIe device to a virtual machine requires VFIO. VFIO requires an IOMMU so that the mappings are safe. P2P, on the other hand, demands that the IOMMU be disabled so as not to interfere with communications between PCIe devices. The result is that I cannot use the same setup for both VMs and P2P. The IOMMU is enabled (or disabled) with a kernel command-line option, set in grub.
Research Questions
Functionality
- Does this even work? Is it going to run on my drive?
- What can I do that I couldn't do before?
- The file system is accessible by any device that can memory-map the SSD. That is, any device that is a PCIe bus master can share the file system. Even devices that are not software programmable (e.g. FPGA) can have the virtio protocol hard-coded to gain access to files.
- What happens to system complexity? (moving logic from the OS kernel to the device)
- What does [some company that produces SSDs] need to do to implement this interface?
Performance
- Will this hurt latency on the host?
- How much CPU time is saved on the host?
- Will this improve performance inside a VM?
- inside a container (docker, LXC, etc.)?
- from a GPU?
- from some other programmable device (smartNIC)?
- Indexing overhead
- End-to-end application improvement
- Eliminating GC reduces tail latency and variability in throughput during intensive writes
Cost
- What is saved by integrating the FS+FTL?
- Host resource use is reduced by offloading the file system to another device. Specifically, host CPU time is replaced by device CPU time. Device CPU time of FS+FTL is not considerably higher than FTL alone because they are tightly integrated.
- How does the fs behave as the drive fills up?
- How much space becomes unusable?
- Implementation cost (hardware, software, etc.)
- How is performance affected as available DRAM decreases?
Future
- How could this be used with cache-coherent fabrics (e.g. CXL, xGMI)?
- What are the implications for composable chiplet designs?
Baseline
The baseline for comparison is an NVMe SSD using the same hardware as the prototype system. It uses a slightly modified version of the OpenSSD firmware with the "Greedy 3.0" FTL.
Since the firmware is not completely stable, the drive is accessed through a virtual machine (using PCI device assignment). The drive is excluded from the host operating system by a VFIO rule on the kernel command line. The host BIOS sees the device but largely ignores it.
After booting the VM using 'buildroot', I load the nvme driver:
modprobe nvme
Then partition the drive and build a file system:
parted /dev/nvme0n1
(parted) mklabel gpt
(parted) mkpart p1 2048s 33550300s
(parted) quit
mke2fs /dev/nvme0n1p1
mount /dev/nvme0n1p1 /mnt
Formatting
After partitioning, I format using the ext2 file system from the VM. It takes slightly less than 10 seconds.
# time mke2fs /dev/nvme0n1p1
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
1048576 inodes, 4193531 blocks
209676 blocks (5%) reserved for the super user
First data block=0
Maximum filesystem blocks=4194304
128 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000

real	0m 9.78s
user	0m 0.11s
sys	0m 1.68s
Instability
The problem with the host interrupt has been solved. I made two major changes:
- Disable interrupts for all completion queue doorbell writes (not just doorbell 0)
- Use MSI-style interrupts (instead of INTx)
Data consistency is much improved by turning off the device's data cache. Some of the improvement is understood (pages were not being flushed before DMA operations), but I still think some of the behaviour is not explained by this alone.
There is still a problem with drive consistency. The drive can be formatted and mounted reliably up to 500K sectors (~300MB). Larger file systems see errors either during formatting or operation.
Measurements
Turn off SSL verification for buildroot
git config --global http.sslVerify false
Clone zlib (10 MB)
git clone https://github.com/madler/zlib.git
Clone Linux Source
git clone https://github.com/torvalds/linux.git
Write amplification factor
After partitioning, formatting and cloning Linux source:
bytes @ if:    4887379968
bytes @ flash: 49260625920
WAF: 1.0
Host Environment
https://github.com/ikwzm/udmabuf
sudo insmod u-dma-buf.ko udmabuf0=1048576
Host Measurements
Measurements of virtio-drive vs. NVMe, taken at the host
GPU Environment
GPU Measurements
Measurements of virtio-drive vs. GPUfs, taken at the GPU
Development
Phase 1: see what we have already
Step 1: virtiofs
Virtiofs is pretty new, but supported in QEMU (v5.0) and Linux kernel v5.4+. It is intended to share a host file system with one or more guests. Let's see if that works.
My host is Ubuntu 20.04.3
I'll use buildroot (https://git.busybox.net/buildroot) 2021.08.1 and QEMU (https://github.com/qemu/qemu.git) v6.1.0 (latest stable release)
QEMU
sudo apt install ninja-build libglib2.0-dev libfdt-dev libpixman-1-dev zlib1g-dev libnfs-dev libiscsi-dev libcap-ng-dev libseccomp-dev
../configure --prefix=/usr/local --target-list=x86_64-softmmu --enable-virtiofsd
Buildroot
make O=$PWD -C ~/projects/buildroot qemu_x86_64_defconfig
sudo qemu-system-x86_64 -m 1G --enable-kvm -kernel images/bzImage -initrd images/rootfs.ext2
Based on some out-of-date-but-couldn't-find-anything-newer instructions here: https://virtio-fs.gitlab.io/howto-qemu.html
$QEMU_BUILD_DIR/tools/virtiofsd/virtiofsd --socket-path=/tmp/vhostqemu -o source=$TESTDIR -o cache=always
Had some issues building the guest kernel with the correct options (because I don't know how to use buildroot very well), but after I fixed that, I can now mount the directory shared by the host!
mount -t virtiofs myfs /mnt
Phase 2: GPU
I need the GPU to become a virtio-fs client to talk to the virtiofsd. One idea is to just run virtiofsd as a standalone process (i.e. no QEMU), but that will have a different interface (I want the GPU to read everything memory mapped). I don't want to start writing work-arounds to get the GPU to share the address range with virtiofsd. The proper way is to assign a GPU to QEMU so it sees virtiofsd as a PCIe device. I used to use pci-stub for passthrough, but that is no longer supported by QEMU. Now I have to use vfio, but that requires an IOMMU - nothing is easy anymore.
Attempt #1
- Enable VT-d in the BIOS
- Add intel_iommu=on to the kernel command line (/etc/default/grub)
- echo "10de 1b06" > /sys/bus/pci/drivers/vfio-pci/new_id
- In QEMU: -device vfio-pci,host=02:00.0
I got stuck using vfio. I can assign the GPU to the guest, but when I try to write a CUDA program to access the PCIe device from the GPU, I run into problems accessing physical memory ranges. Now I have a horrible problem of trying to map physical addresses for the virtiofs PCIe device into the GPU that only handles virtual addresses shared with the host program. What a mess. Looks like going back to the first idea will actually be easier to implement - let's try that.
Attempt #2
Virtiofsd as a standalone program listens on a UNIX socket and waits for a connection from vhost-user in QEMU. Since this is all user space, I should be able to write a CUDA program that talks to virtiofsd over a UNIX socket on one side, and shared memory (maybe unified memory?) with the GPU on the other side.
The problem is in the details of how shared memory works. Virtio exposes a configuration block from the device to the driver, so the driver can configure and eventually use the device. One of the first tasks of the driver is to reset the device by writing a 0 to a shared memory location. In a virtualized environment, the write can be captured by setting the page as read-only. In real hardware, the device will react to a write to a register. On a CPU in user space, we can't mark the memory page as read-only so we can never detect the write. What to do?
How the GPU performs IO accesses
Bypass the elegant solution of write-protecting the page by setting up polled shared memory. The driver writes an "IO command" to a known memory location, and sets a flag when it's ready. The CPU (i.e. emulated device) polls looking for the ready flag (constantly reading the same uncached memory). When the CPU detects an IO command, a handler is called for the particular command type (read or write). This is horribly inefficient, but easy to implement (and actually should be quite fast) - this is basically a PMD (poll-mode driver) like what DPDK uses, only it's a poll-mode device.
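A minimal C sketch of this poll-mode device loop, with an illustrative command-slot layout (the struct fields and function names are my own, not the actual virtio-drive structures):

```c
#include <stdint.h>
#include <stdatomic.h>

enum { IO_READ = 1, IO_WRITE = 2 };

/* Hypothetical IO-command slot shared between the GPU driver and the
 * CPU-side emulated device; field names are illustrative only. */
struct io_cmd {
    uint32_t type;           /* IO_READ or IO_WRITE */
    uint64_t addr;           /* register offset within the emulated BAR */
    uint64_t value;          /* payload for writes, result for reads */
    _Atomic uint32_t ready;  /* set by the driver when the command is valid */
};

static uint64_t regs[64];    /* stand-in for the device register file */

/* Service a single command once the ready flag is observed.  The real
 * device side calls this from an endless polling loop. */
static void service_one(struct io_cmd *slot)
{
    while (!atomic_load_explicit(&slot->ready, memory_order_acquire))
        ;                    /* busy-wait on the (uncached) shared page */

    if (slot->type == IO_READ)
        slot->value = regs[slot->addr / 8];
    else if (slot->type == IO_WRITE)
        regs[slot->addr / 8] = slot->value;

    /* Clearing ready hands the slot back to the driver. */
    atomic_store_explicit(&slot->ready, 0, memory_order_release);
}
```

The acquire/release pairing makes the flag the synchronization point, which is the same discipline a real doorbell register imposes.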
How UNIX sockets work
I had to clear up a misunderstanding. The virtqueues are owned by the virtiofsd device, so very little logic needs to be implemented in the virtiofs-cuda process. The trick is sharing the memory between the GPU and virtiofsd, using virtiofs-cuda as an intermediary. The shared memory is set up by virtiofs-cuda, and a handle is passed to virtiofsd over a UNIX socket using ancillary data. Virtiofsd can then map the region with mmap(), but it will appear at a different virtual address, so each access needs to be translated. On the other side, the GPU uses unified shared memory, so the virtual addresses are identical to those in virtiofs-cuda. The GPU can then send vhost commands to virtiofsd to set up the virtqueues. Once everything is set up correctly, the GPU can use FUSE commands to access the file system. The details of the vhost protocol are not well documented, so some reverse engineering is necessary.
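The fd-passing step can be sketched in plain C. This is a generic SCM_RIGHTS example (the helper names are mine, not virtiofs-cuda code): the sender hands a memory fd over the socket as ancillary data, and the receiver can mmap() it into its own address space.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a UNIX socket as SCM_RIGHTS ancillary
 * data -- the same mechanism vhost-user uses to hand a shared memory
 * region to virtiofsd. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive the fd; the kernel installs a fresh descriptor that refers to
 * the same open file (and therefore the same memory when mmap()ed). */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    int fd = -1;
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (cm && cm->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}
```

Each side gets its own mapping of the same pages, which is why the mapping lands at a different virtual address in virtiofsd and accesses need translating.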
Eventfd
Eventfd is used twice by vhost-user: kick (driver -> device) and call (device -> driver) notifications. Virtiofsd requires both of these to work, even though they are optional in the specification. The kick is easy to implement. When an IO command is picked up by the CPU (from the memory shared with the GPU) that writes to the MMIO_QUEUE_NOTIFY register, we write to the kick eventfd. Virtiofsd polls for it, and handles it. When virtiofsd wants to reply, it writes to the 'call' eventfd. We poll on the fd in the same loop that looks for IO commands. The problem is there is no easy way to send a message back to the GPU, since the GPU can't poll in the background. For now we just eat it, and have the GPU poll synchronously; awaiting a reply for the current command. This will probably need to be fixed once I get some more complicated commands going, but it is enough to complete the initialization process.
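The two notification paths can be sketched with eventfd. Here both ends live in one process just to show the mechanics; in the real setup the fds are handed to virtiofsd over the vhost-user socket, and MMIO_QUEUE_NOTIFY handling is assumed:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Driver -> device ("kick"): fired when the GPU-side driver writes the
 * queue-notify register; each write bumps the eventfd counter. */
static void kick(int kick_fd)
{
    uint64_t one = 1;
    write(kick_fd, &one, sizeof(one));
}

/* Device -> driver ("call"): reading the eventfd returns the
 * accumulated signal count and resets it to zero.  Our loop does this
 * in the same place it polls for new IO commands. */
static uint64_t drain(int fd)
{
    uint64_t n = 0;
    read(fd, &n, sizeof(n));
    return n;
}
```

Note that an eventfd coalesces notifications: two kicks before a read show up as a single read returning 2, so the consumer must drain the queue rather than assume one event per kick.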
FUSE init
The FUSE init command consists of 4 buffers: in, init_in, out and init_out. The 'in' and 'out' are generic headers, while the 'init_in' and 'init_out' are specific to this command. It looks like all of the commands follow this same pattern. They are sent using 4 different virtio descriptors (2 in and 2 out) which are very small. Once I run out of better things to do, I want to try amalgamating them so there is only one in and one out descriptor (less overhead for processing descriptors, more free descriptors in case we want to work asynchronously one day).
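Using the definitions from linux/fuse.h, the four buffers look roughly like this; the combined struct and the helper are my own illustration of the pattern, not the actual code:

```c
#include <stdint.h>
#include <string.h>
#include <linux/fuse.h>

/* The four buffers of a FUSE_INIT transaction, as they are placed into
 * four virtio descriptors today: two driver->device and two
 * device->driver.  Amalgamating them would mean one buffer per
 * direction instead. */
struct fuse_init_req {
    struct fuse_in_header  in;       /* generic request header */
    struct fuse_init_in    init_in;  /* INIT-specific arguments */
    struct fuse_out_header out;      /* generic reply header */
    struct fuse_init_out   init_out; /* INIT-specific reply */
};

static void fuse_init_prepare(struct fuse_init_req *r, uint64_t unique)
{
    memset(r, 0, sizeof(*r));
    r->in.len    = sizeof(r->in) + sizeof(r->init_in);
    r->in.opcode = FUSE_INIT;
    r->in.unique = unique;  /* echoed back in out.unique by the device */
    r->init_in.major = FUSE_KERNEL_VERSION;
    r->init_in.minor = FUSE_KERNEL_MINOR_VERSION;
}
```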
Phase 3: Device Frontend
I want to evaluate the performance on a real device (i.e. not the host CPU). This will give me some idea of the latency of moving data over PCIe, and measurements of resource utilization on the device. The problem is, programmable SSDs are hard to come by. Matias had a great suggestion - borrow the processing power from a smartNIC (Bluefield) to host the file system logic, then connect it to a normal SSD over SATA. There will be more overhead from running lots of unneeded software on the Bluefield (i.e. most of Linux), but it should be relatively straightforward to get up & running. Unfortunately, it is difficult to find a smartNIC with enough storage to make this realistic.
Attempt #2 - Daisy
We are using the 'daisy' FPGA board from CRZ Tech to emulate the SSD. It has 2GB of DDR4 on board, and an UltraScale+ FPGA with 2 ARM A53 processors. It doesn't support flash natively, but has 2 external DDR4 controllers that can support up to 32GB or 64GB (depending on which datasheet you read). It also supports an NVMe attachment (M.2?) for larger, slower storage options.
Flash Emulation
CRZ has validated the design with 8GB on the first controller and 16GB on the second controller, so that is our initial configuration - emulating 24GB of flash in DRAM.
Interrupts to the ARM
The VIRTIO controller can raise an interrupt to the ARM. It has a status register with the reason for the interrupt. There are about 15 different reasons right now, inherited from NVMe - some of these will have to change.
Interrupts to the host
I am using MSI style interrupts (8 vectors). VIRTIO prefers MSI-X but that requires implementing more logic in the user IP to read values from the table and update the PBA. This is not supported in the NVMe design, so I'm falling back to MSI for now. There is room in BAR0 for the MSI-X table (0x800) & PBA (0x900) but those are currently empty.
VIRTIO protocol details
The idea is to use the VIRTIO 1.1 protocol, unmodified. That means we get the host driver for free by using the virtio_fs driver from Linux. All we have to do is make the hardware compliant with the driver. It sounds easy, but some details get in the way. VIRTIO specifies 5 additional PCI capabilities that must appear within the standard PCI configuration region (256 bytes). So far, this is not possible with the Xilinx PCIe 1.3 endpoint IP. I have added a work-around in the Linux PCIe subsystem that emulates the config space for our specific vendor ID and device ID (1AF4:105A). The capabilities don't do much, other than point to a BAR that contains the actual configuration.
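For reference, the capability structure the work-around has to emulate is small; this is the layout from the VIRTIO 1.1 spec (section 4.1.4), where each capability just points at a window inside a BAR:

```c
#include <stdint.h>

/* One VIRTIO PCI capability, per VIRTIO 1.1.  Five of these (common,
 * notify, ISR, device-specific cfg, PCI cfg access) must appear in the
 * 256-byte PCI configuration space. */
struct virtio_pci_cap {
    uint8_t  cap_vndr;   /* PCI_CAP_ID_VNDR (0x09) */
    uint8_t  cap_next;   /* link to the next capability */
    uint8_t  cap_len;    /* length of this capability */
    uint8_t  cfg_type;   /* one of VIRTIO_PCI_CAP_* below */
    uint8_t  bar;        /* which BAR holds the structure */
    uint8_t  padding[3];
    uint32_t offset;     /* offset of the structure within the BAR */
    uint32_t length;     /* length of the structure */
};

enum {
    VIRTIO_PCI_CAP_COMMON_CFG = 1,
    VIRTIO_PCI_CAP_NOTIFY_CFG = 2,
    VIRTIO_PCI_CAP_ISR_CFG    = 3,
    VIRTIO_PCI_CAP_DEVICE_CFG = 4,
    VIRTIO_PCI_CAP_PCI_CFG    = 5,
};
```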
BAR (base address register)
So far, we have only needed a single BAR. It is 64KB, 64-bit, non-prefetchable. I couldn't figure out an easy way of connecting this to a memory block, so I implemented it using registers. In fact, the majority of the addresses don't even respond (there is a default value when reading invalid registers). The rest implement the VIRTIO configuration structs, including the device-specific virtio_fs configuration containing the file system tag.
I plan to connect a 4KB (36Kb) BRAM to 0x1000 for the hi-prio virtqueue and another at 0x2000 for the regular virtqueue. There would still be room for a 2nd regular virtqueue at 0x3000 for futher experimentation. All of that would fit in a 16KB BAR, so I could trim it down later if no more space is needed. The descriptors in the virtqueue would point to DRAM, exposed over the same BAR or another one - not sure yet.
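Written out as an address map (offsets taken from the paragraphs above; the names and the 16KB figure are just my bookkeeping):

```c
/* Planned BAR0 layout for the virtio-drive frontend.  Names are mine;
 * only the offsets come from the design notes. */
#define VD_BAR_REGS        0x0000  /* VIRTIO config structs (registers) */
#define VD_BAR_MSIX_TABLE  0x0800  /* reserved, currently empty */
#define VD_BAR_MSIX_PBA    0x0900  /* reserved, currently empty */
#define VD_BAR_VQ_HIPRIO   0x1000  /* 4KB BRAM: hi-prio virtqueue */
#define VD_BAR_VQ_REQ0     0x2000  /* 4KB BRAM: regular virtqueue */
#define VD_BAR_VQ_REQ1     0x3000  /* room for a 2nd regular queue */
#define VD_BAR_TRIMMED_LEN 0x4000  /* 16KB covers everything above */
```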
VIRTIO Register Block
I use the string 'EMBEDDED' for this file system tag.
ARM Memory
The ARM uses virtual memory through an MMU. Currently all memory is identity mapped. There is a problem accessing external DDR with memcpy - sometimes I get an exception. DWORD access seems to work fine.
Configured using 40-bit physical addresses (PArange) and 48-bit virtual addresses (the largest supported by this ARM implementation). T0SZ=16.
File System Design
Base the design on ctFS + IPLFS, with some modifications; DevFS is also relevant {can't find the source code yet}. CrossFS is written by the same people, and is available here: https://github.com/RutgersCSSystems/CrossFS
DevFS states that integration of the file system with the FTL is not yet done, but would likely provide a benefit. OrcFS has done this. I have a very tightly integrated FS+FTL.
The basis is an inode. An inode is basically a descriptor, pointing to the data contents (contiguously allocated in virtual memory), its size, type and some extended attributes. Inodes live in cache (i.e. RAM) and are flushed to flash from time to time.
Directory entries are a fixed size (currently 72 bytes, but likely to change). Directories are not stored in an inode (as in some other file systems) but rather in the contents of a file (like FAT?). File names are fixed at 63 bytes (the main factor in the directory entry size), but larger names may be supported by hashing the name and storing the hash in the directory entry. The directory entry also contains the inode index, and the inode has a list of attributes. One attribute can be the file name, which can be much larger than 63 bytes (basically up to the 2KB total extended-attribute limit).
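An illustrative C layout that adds up to the stated 72 bytes, assuming a 4-byte inode index and 4 bytes of flags (the real field split beyond the name and inode index is a guess):

```c
#include <stdint.h>
#include <string.h>

#define VD_NAME_MAX 63

/* Hypothetical 72-byte directory entry: 63 bytes of name + NUL,
 * a 4-byte inode index, and 4 bytes of flags (e.g. a "name is hashed"
 * marker for long names stored as an extended attribute). */
struct dirent72 {
    char     name[VD_NAME_MAX + 1];
    uint32_t inode;   /* index into the inode table */
    uint32_t flags;
};

static void dirent_set(struct dirent72 *d, const char *name, uint32_t ino)
{
    memset(d, 0, sizeof(*d));            /* guarantees NUL termination */
    strncpy(d->name, name, VD_NAME_MAX); /* longer names would be hashed */
    d->inode = ino;
}
```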
One modification from ctFS is to extract the data segment headers from the data area and put them inside the metadata area in a standalone data structure. The main reason is to maximize the data region space, and not dirty it with more metadata. Now file contents are aligned to the beginning of the partition, and there are no overlapping regions where updating metadata can invalidate a data page. This is important on flash.
Modifications
- I think a queue per client would have less overhead than a queue per file. I'm not yet convinced of the benefits of a queue per file.
- We need a virtio front-end
Performance
gpu_io_read() takes between 15000 and 200000 GPU cycles with the 1us delay. This comes down to 5000 cycles without the delay.
fuse_readdir() on the source directory takes about 8.6M GPU cycles. Sending the command is about 160000 cycles, waiting for the response is the majority of the time (not a big surprise).
The datapath is ridiculously long:
User -> kernel -> vfs -> fuse -> FUSE -> kernel -> virtio_fs -> device
A shorter, more appropriate path would be:
User -> kernel -> vfs -> virtio_fs -> device
With some modifications to user space (fixing up glibc) we could have:
User -> device
Linux to Drive Benchmarks
Baseline bare-metal host Linux+SSD
VM using QEMU+KVM+Buildroot
iozone -a -i0 -i1 -g2048
mke2fs -b 65536 -i 65536 -I 128 -L joel /dev/nvme0n1
Design Issues
CVE-2022-0358
Setup
On babbage, my PCIe devices are connected like this:
+-[0000:40]-+-00.0
|           +-00.2
|           +-01.0
|           +-01.1-[41]--+-00.0  (VCK5000)
|           |            \-00.1
|           +-03.0
|           +-03.1-[42]----00.0  (Daisy FPGA: virtio or NVMe)
+-[0000:60]-+-00.0
|           +-00.2
|           +-01.0
|           +-01.1-[61]--+-00.0  (P1000 GPU target)
|           |            \-00.1
|           +-02.0
|           +-03.0
|           +-03.1-[62]--+-00.0  (T400 GPU - not used)
|           |            \-00.1
40:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
41:00.0 Memory controller: Xilinx Corporation Device 5048
41:00.1 Memory controller: Xilinx Corporation Device 5049
40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
	IOMMU group: 19
	Capabilities: [2a0 v1] Access Control Services
		ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
		ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
42:00.0 Mass storage controller: Red Hat, Inc. Virtio file system
60:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
60:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Milan IOMMU (rev 01)
60:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
	IOMMU group: 1
	Capabilities: [2a0 v1] Access Control Services
		ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
		ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
61:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro P1000] (rev a1)
62:00.0 VGA compatible controller: NVIDIA Corporation TU117GL [T400 4GB] (rev a1)
Workloads
Image Classification
https://github.com/akrizhevsky/cuda-convnet2
https://github.com/vzhuang/CUDA-convnet
Gene Sequencing
https://github.com/OpenHero/gblastn
- Installing from the README instructions runs into the following challenges:
- Requires an existing NCBI-BLAST installation, built from source, to hook into
configure: error: Do not know how to build MT-safe with compiler /usr/bin/g++
- This should be easily solvable with either of 2 approaches:
- Ignoring the compiler version (https://fabioticconi.wordpress.com/2016/01/08/how-to-compile-ncbi-igblast-with-gcc-5-x/). This does not work; the error remains.
- Downgrading the compiler version (install and point to a lower version). This works, but then the error moves to Boost.Test.
- I believe the problem is that their GPU implementation depends on the specific NCBI-BLAST version they use
Note that newer versions of NCBI-BLAST build fine, so I think the issue is just with the (very old!) NCBI-BLAST version. However, it is unclear how to run their code with a newer NCBI-BLAST.
Others
cuDF https://docs.rapids.ai/api/cudf/stable/
Bioinformatics (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3018811/); running thousands of ~100B queries against a database. Apparently this is done with files (ref: http://ccl.cse.nd.edu/research/papers/small-grid07.pdf), but I am not sure why
Data streaming (e.g. https://developer.nvidia.com/blog/beginners-guide-to-gpu-accelerated-event-stream-processing-in-python/); processing streams of file data from IoT and other sources
Supposedly a wide variety of physics simulation and HPC applications (ref: https://hal.archives-ouvertes.fr/hal-01892691/document)... but I have had negative experiences with HPC and would prefer to avoid it
Datasets
Pictures
MNIST: http://yann.lecun.com/exdb/mnist/
CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html
Genomes
BLAST: https://ftp.ncbi.nlm.nih.gov/blast/db/
Eval Platforms
Ideally, I want to modify the firmware of a real SSD, attached over a memory-mapped bus (PCIe or similar). Since it is difficult to get my hands on the source code of a real SSD, I am considering the following as alternatives:
Platform | Notes | Examples |
---|---|---|
Cosmos+ | Limited to 2GB/s by external PCIe gen 2 interface on Zynq 7000 | RecSSD (2021) |
NVIDIA Bluefield-2 | External PCIe Gen 4. Some models have direct-attached storage (NVMe U.2) - end of life, and not available for purchase! Other models support NVMe-oF and SNAP (i.e. SAN) | |
Enzian | Remote access only. Is it possible to synthesize Cosmos+ FW to run on Enzian? Could be a perfect fit | |
OpenExpress | For FPGA | |
OpenNVM | For Nexys 3 board (Xilinx Spartan-6 LX16 FPGA) | |
Daisy+ | Limited to 256GB flash, but can use NVMe as well. Source: https://github.com/CRZ-Technology/OpenSSD-OpenChannelSSD | |
Mt. Evans (IPU) | Only NVMe. Announced but not available yet! | |
Related Work
Year | Conference | Title & Link | Comments |
---|---|---|---|
2022 | USENIX ATC | IPLFS: Log-Structured File System without Garbage Collection | |
2020 | HotStorage | CompoundFS: Compounding I/O Operations in Firmware File Systems | |
2018 | FAST | Designing a True Direct-Access File System with DevFS | |
2018 | Transactions on Storage | OrcFS: Orchestrated File System for Flash Storage | |
Approaching DRAM performance using µs-latency flash memory https://dl.acm.org/doi/abs/10.14778/3457390.3457397
BaM: A Case for Enabling Fine-grain High Throughput GPU-Orchestrated Access to Storage https://arxiv.org/abs/2203.04910
Serverless computing on heterogeneous computers https://dl.acm.org/doi/abs/10.1145/3503222.3507732
CrossFS https://www.usenix.org/system/files/osdi20-ren.pdf
INSIDER https://www.usenix.org/system/files/atc19-ruan_0.pdf
FusionFS https://www.usenix.org/system/files/fast22-zhang-jian.pdf
https://nona1314.github.io/pubs/gpukv-jung-sac21.pdf
https://www.usenix.org/system/files/hotstorage20_paper_awamoto.pdf
https://dl.acm.org/doi/10.1145/3477132.3483565 (DFS offload to smartNICs)
https://ucare.cs.uchicago.edu/pdf/asplos20-LeapIO.pdf (storage server offload to SoC)
Redefining the role of the cpu in the era of cpu-gpu integration
A 3.5X increase in fixed-function accelerators across the six most recent Apple SoCs
Programming FPGA with OpenSSD
We are using the OpenSSD hardware design based on https://github.com/CRZ-Technology/OpenSSD-OpenChannelSSD
License Server
In order to synthesize the design, which is required to generate a bitstream, a licensed version of Vivado is required. UBC provides access to a license server living on research-lm2.ece.ubc.ca. There are also servers within the ECE network that have a licensed version of Vivado ready to use; however, they do not have enough RAM to build our design.
To use Vivado licensed from the research-lm2.ece.ubc.ca license server:
- Install Vivado Design Edition version 2019.1
- When configuring the install, un-check the section where you configure a license
- Complete install
- Connect to the ECE VPN (you may need to request access from it@ece.ubc.ca)
- Run Vivado with the environment variable
XILINXD_LICENSE_FILE=27017@research-lm2.ece.ubc.ca
Open the project Daisy/NVMe/MIG/Daisy_NVMe_MIG_19.1_20220110/OpenSSD/OpenSSD.xpr
To generate a bitstream in order to program the device, click "Generate Bitstream". Depending on the number of processors you use, this may take up to 50GB of RAM or more.
The generated bitstream is located in: Daisy/NVMe/MIG/Daisy_NVMe_MIG_19.1_20220110/OpenSSD/OpenSSD.runs/impl_1/design_1_wrapper.bit
FPGA Development Flow (Vivado 2019.1)
- Modify design ( .v files)
- Refresh IP Catalog
- Open "IP Status" report
- Upgrade IP blocks as suggested
- Generate output products for block design (.bd file)
- Generate bitstream -> will automatically run Synthesis and Implementation first
Flash Datasheets
Manufacturer | Capacity | Bits per cell | Page Size | Block Size | Link |
---|---|---|---|---|---|
Kioxia | 16Gb | SLC | 4KB + 256 x 8 | 256K + 16K x 8 | https://www.kioxia.com/en-jp/business/memory/slc-nand/detail.TH58NVG4S0HTAK0.html |