Virtio-drive
A hard drive that uses the virtiofs interface.
Code is in a private repo here: https://github.com/UBC-ECE-Sasha/virtio-drive
Development Environment
The conflict between VFIO and P2P
Assigning a PCIe device to a virtual machine requires VFIO. VFIO requires an IOMMU so that the mappings are safe. P2P, on the other hand, demands that the IOMMU be disabled so as not to interfere with communications between PCIe devices. The result is that I cannot use the same setup for both VMs and P2P. The IOMMU is enabled (or disabled) with a kernel command-line option, set in grub.
Research Questions
Functionality
- Does this even work? Is it going to run on my drive?
- What can I do that I couldn't do before?
- The file system is accessible by any device that can memory-map the SSD. That is, any device that is a PCIe bus master can share the file system. Even devices that are not software programmable (e.g. FPGA) can have the virtio protocol hard-coded to gain access to files.
- What happens to system complexity? (moving logic from the OS kernel to the device)
- What does [some company that produces SSDs] need to do to implement this interface?
Performance
- Will this hurt latency on the host?
- How much CPU time is saved on the host?
- Will this improve performance inside a VM?
- inside a container (docker, LXC, etc.)?
- from a GPU?
- from some other programmable device (smartNIC)?
- Indexing overhead
- End-to-end application improvement
- Eliminating GC reduces tail latency and variability in throughput during intensive writes
Cost
- What is saved by integrating the FS+FTL?
- Host resource use is reduced by offloading the file system to another device. Specifically, host CPU time is replaced by device CPU time. Device CPU time of FS+FTL is not considerably higher than FTL alone because they are tightly integrated.
- How does the fs behave as the drive fills up?
- How much space becomes unusable?
- Implementation cost (hardware, software, etc.)
- How is performance affected as available DRAM decreases?
Future
- How could this be used with cache-coherent fabrics (e.g. CXL, xGMI)?
- What are the implications for composable chiplet designs?
Baseline
The baseline for comparison is an NVMe SSD using the same hardware as the prototype system. It uses a slightly modified version of the OpenSSD firmware with the "Greedy 3.0" FTL.
Since the firmware is not completely stable, the drive is accessed through a virtual machine (using PCI device assignment). The drive is excluded from the host operating system by a VFIO rule on the kernel command line. The host BIOS sees the device but largely ignores it.
After booting the VM using 'buildroot', I load the nvme driver:
modprobe nvme
Then partition the drive and build a file system:
parted /dev/nvme0n1
(parted) mklabel gpt
(parted) mkpart p1 2048s 33550300s
(parted) quit
mke2fs /dev/nvme0n1p1
mount /dev/nvme0n1p1 /mnt
Formatting
After partitioning, I format using the ext2 file system from the VM. It takes slightly less than 10 seconds.
# time mke2fs /dev/nvme0n1p1
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
1048576 inodes, 4193531 blocks
209676 blocks (5%) reserved for the super user
First data block=0
Maximum filesystem blocks=4194304
128 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000

real	0m 9.78s
user	0m 0.11s
sys	0m 1.68s
Instability
The problem with the host interrupt has been solved. I made two major changes:
- Disable interrupts for all completion queue doorbell writes (not just doorbell 0)
- Use MSI-style interrupts (instead of INTx)
Data consistency is much improved by turning off the device's data cache. Some of the improvement is understood (pages were not being flushed before DMA operations), but I still think some of the behaviour is not explained by this alone.
There is still a problem with drive consistency. The drive can be formatted and mounted reliably up to 500K sectors (~300MB). Larger file systems see errors either during formatting or operation.
Measurements
Turn off SSL verification for buildroot
git config --global http.sslVerify false
Clone zlib (10 MB)
git clone https://github.com/madler/zlib.git
Clone Linux Source
git clone https://github.com/torvalds/linux.git
Write amplification factor
After partitioning, formatting and cloning Linux source:
bytes @ if:    4887379968
bytes @ flash: 49260625920
WAF: 1.0
Host Environment
https://github.com/ikwzm/udmabuf
sudo insmod u-dma-buf.ko udmabuf0=1048576
Host Measurements
Measurements of virtio-drive vs. NVMe, taken at the host
GPU Environment
GPU Measurements
Measurements of virtio-drive vs. GPUfs, taken at the GPU
Development
Phase 1: see what we have already
Step 1: virtiofs
Virtiofs is pretty new, but supported in QEMU (v5.0) and Linux kernel v5.4+. It is intended to share a host file system with one or more guests. Let's see if that works.
My host is Ubuntu 20.04.3
I'll use buildroot (https://git.busybox.net/buildroot) 2021.08.1 and QEMU (https://github.com/qemu/qemu.git) v6.1.0 (latest stable release)
QEMU
sudo apt install ninja-build libglib2.0-dev libfdt-dev libpixman-1-dev zlib1g-dev libnfs-dev libiscsi-dev libcap-ng-dev libseccomp-dev
../configure --prefix=/usr/local --target-list=x86_64-softmmu --enable-virtiofsd
Buildroot
make O=$PWD -C ~/projects/buildroot qemu_x86_64_defconfig
sudo qemu-system-x86_64 -m 1G --enable-kvm -kernel images/bzImage -initrd images/rootfs.ext2
Based on some out-of-date-but-couldn't-find-anything-newer instructions here: https://virtio-fs.gitlab.io/howto-qemu.html
$QEMU_BUILD_DIR/tools/virtiofsd/virtiofsd --socket-path=/tmp/vhostqemu -o source=$TESTDIR -o cache=always
Had some issues building the guest kernel with the correct options (because I don't know how to use buildroot very well), but after I fixed that, I can now mount the directory shared by the host!
mount -t virtiofs myfs /mnt
Phase 2: GPU
I need the GPU to become a virtio-fs client to talk to the virtiofsd. One idea is to just run virtiofsd as a standalone process (i.e. no QEMU), but that will have a different interface (I want the GPU to read everything memory mapped). I don't want to start writing work-arounds to get the GPU to share the address range with virtiofsd. The proper way is to assign a GPU to QEMU so it sees virtiofsd as a PCIe device. I used to use pci-stub for passthrough, but that is no longer supported by QEMU. Now I have to use vfio, but that requires an IOMMU - nothing is easy anymore.
Attempt #1
- Enable VT-d in the BIOS
- Add intel_iommu=on to the kernel command line (/etc/default/grub)
- echo "10de 1b06" > /sys/bus/pci/drivers/vfio-pci/new_id
- In QEMU: -device vfio-pci,host=02:00.0
I got stuck using vfio. I can assign the GPU to the guest, but when I try to write a CUDA program to access the PCIe device from the GPU, I run into problems accessing physical memory ranges. Now I have a horrible problem of trying to map physical addresses for the virtiofs PCIe device into the GPU that only handles virtual addresses shared with the host program. What a mess. Looks like going back to the first idea will actually be easier to implement - let's try that.
Attempt #2
Virtiofsd as a standalone program listens on a UNIX socket and waits for a connection from vhost-user in QEMU. Since this is all user space, I should be able to write a CUDA program that talks to virtiofsd over a UNIX socket on one side, and shared memory (maybe unified memory?) with the GPU on the other side.
The problem is in the details of how shared memory works. Virtio exposes a configuration block from the device to the driver, so the driver can configure and eventually use the device. One of the first tasks of the driver is to reset the device by writing a 0 to a shared memory location. In a virtualized environment, the write can be captured by setting the page as read-only. In real hardware, the device will react to a write to a register. On a CPU in user space, we can't mark the memory page as read-only so we can never detect the write. What to do?
How the GPU performs IO accesses
Bypass the elegant solution of write-protecting the page by setting up polled shared memory. The driver writes an "IO command" to a known memory location, and sets a flag when it's ready. The CPU (i.e. emulated device) polls looking for the ready flag (constantly reading the same uncached memory). When the CPU detects an IO command, a handler is called for the particular command type (read or write). This is horribly inefficient, but easy to implement (and actually should be quite fast) - this is basically a PMD (poll-mode driver) like what DPDK uses, only it's a poll-mode device.
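A minimal C sketch of this poll-mode device loop, with an illustrative command-slot layout (the struct fields and function names are my own, not the actual virtio-drive structures):

```c
#include <stdint.h>
#include <stdatomic.h>

enum { IO_READ = 1, IO_WRITE = 2 };

/* Hypothetical IO-command slot shared between the GPU driver and the
 * CPU-side emulated device; field names are illustrative only. */
struct io_cmd {
    uint32_t type;           /* IO_READ or IO_WRITE */
    uint64_t addr;           /* register offset within the emulated BAR */
    uint64_t value;          /* payload for writes, result for reads */
    _Atomic uint32_t ready;  /* set by the driver when the command is valid */
};

static uint64_t regs[64];    /* stand-in for the device register file */

/* Service a single command once the ready flag is observed.  The real
 * device side calls this from an endless polling loop. */
static void service_one(struct io_cmd *slot)
{
    while (!atomic_load_explicit(&slot->ready, memory_order_acquire))
        ;                    /* busy-wait on the (uncached) shared page */

    if (slot->type == IO_READ)
        slot->value = regs[slot->addr / 8];
    else if (slot->type == IO_WRITE)
        regs[slot->addr / 8] = slot->value;

    /* Clearing ready hands the slot back to the driver. */
    atomic_store_explicit(&slot->ready, 0, memory_order_release);
}
```

The acquire/release pairing makes the flag the synchronization point, which is the same discipline a real doorbell register imposes.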
How UNIX sockets work
I had to clear up a misunderstanding. The virtqueues are owned by the virtiofsd device, so very little logic needs to be implemented in the virtiofs-cuda process. The trick is sharing the memory between the GPU and virtiofsd, using virtiofs-cuda as an intermediary. The shared memory is set up by virtiofs-cuda, and a handle is passed to virtiofsd over a UNIX socket using ancillary data. Virtiofsd can then map the region with mmap(), but it will appear at a different virtual address, so each access needs to be translated. On the other side, the GPU uses unified shared memory, so the virtual addresses are identical to those in virtiofs-cuda. The GPU can then send vhost commands to virtiofsd to set up the virtqueues. Once everything is set up correctly, the GPU can use FUSE commands to access the file system. The details of the vhost protocol are not well documented, so some reverse engineering is necessary.
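The fd-passing step can be sketched in plain C. This is a generic SCM_RIGHTS example (the helper names are mine, not virtiofs-cuda code): the sender hands a memory fd over the socket as ancillary data, and the receiver can mmap() it into its own address space.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a UNIX socket as SCM_RIGHTS ancillary
 * data -- the same mechanism vhost-user uses to hand a shared memory
 * region to virtiofsd. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive the fd; the kernel installs a fresh descriptor that refers to
 * the same open file (and therefore the same memory when mmap()ed). */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof(ctrl.buf),
    };
    int fd = -1;
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (cm && cm->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}
```

Each side gets its own mapping of the same pages, which is why the mapping lands at a different virtual address in virtiofsd and accesses need translating.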
Eventfd
Eventfd is used twice by vhost-user: kick (driver -> device) and call (device -> driver) notifications. Virtiofsd requires both of these to work, even though they are optional in the specification. The kick is easy to implement. When an IO command is picked up by the CPU (from the memory shared with the GPU) that writes to the MMIO_QUEUE_NOTIFY register, we write to the kick eventfd. Virtiofsd polls for it, and handles it. When virtiofsd wants to reply, it writes to the 'call' eventfd. We poll on the fd in the same loop that looks for IO commands. The problem is there is no easy way to send a message back to the GPU, since the GPU can't poll in the background. For now we just eat it, and have the GPU poll synchronously; awaiting a reply for the current command. This will probably need to be fixed once I get some more complicated commands going, but it is enough to complete the initialization process.
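The two notification paths can be sketched with eventfd. Here both ends live in one process just to show the mechanics; in the real setup the fds are handed to virtiofsd over the vhost-user socket, and MMIO_QUEUE_NOTIFY handling is assumed:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Driver -> device ("kick"): fired when the GPU-side driver writes the
 * queue-notify register; each write bumps the eventfd counter. */
static void kick(int kick_fd)
{
    uint64_t one = 1;
    write(kick_fd, &one, sizeof(one));
}

/* Device -> driver ("call"): reading the eventfd returns the
 * accumulated signal count and resets it to zero.  Our loop does this
 * in the same place it polls for new IO commands. */
static uint64_t drain(int fd)
{
    uint64_t n = 0;
    read(fd, &n, sizeof(n));
    return n;
}
```

Note that an eventfd coalesces notifications: two kicks before a read show up as a single read returning 2, so the consumer must drain the queue rather than assume one event per kick.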
FUSE init
The FUSE init command consists of 4 buffers: in, init_in, out and init_out. The 'in' and 'out' are generic headers, while the 'init_in' and 'init_out' are specific to this command. It looks like all of the commands follow this same pattern. They are sent using 4 different virtio descriptors (2 in and 2 out) which are very small. Once I run out of better things to do, I want to try amalgamating them so there is only one in and one out descriptor (less overhead for processing descriptors, more free descriptors in case we want to work asynchronously one day).
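Using the definitions from linux/fuse.h, the four buffers look roughly like this; the combined struct and the helper are my own illustration of the pattern, not the actual code:

```c
#include <stdint.h>
#include <string.h>
#include <linux/fuse.h>

/* The four buffers of a FUSE_INIT transaction, as they are placed into
 * four virtio descriptors today: two driver->device and two
 * device->driver.  Amalgamating them would mean one buffer per
 * direction instead. */
struct fuse_init_req {
    struct fuse_in_header  in;       /* generic request header */
    struct fuse_init_in    init_in;  /* INIT-specific arguments */
    struct fuse_out_header out;      /* generic reply header */
    struct fuse_init_out   init_out; /* INIT-specific reply */
};

static void fuse_init_prepare(struct fuse_init_req *r, uint64_t unique)
{
    memset(r, 0, sizeof(*r));
    r->in.len    = sizeof(r->in) + sizeof(r->init_in);
    r->in.opcode = FUSE_INIT;
    r->in.unique = unique;  /* echoed back in out.unique by the device */
    r->init_in.major = FUSE_KERNEL_VERSION;
    r->init_in.minor = FUSE_KERNEL_MINOR_VERSION;
}
```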
Phase 3: Device Frontend
I want to evaluate the performance on a real device (i.e. not the host CPU). This will give me some idea of the latency of moving data over PCIe, and measurements of resource utilization on the device. The problem is, programmable SSDs are hard to come by. Matias had a great suggestion - borrow the processing power from a smartNIC (Bluefield) to host the file system logic, then connect it to a normal SSD over SATA. There will be more overhead from running lots of unneeded software on the Bluefield (i.e. most of Linux), but it should be relatively straightforward to get up & running. Unfortunately, it is difficult to find a smartNIC with enough storage to make this realistic.
Attempt #2 - Daisy
We are using the 'daisy' FPGA board from CRZ Tech to emulate the SSD. It has 2GB of DDR4 on board, and an UltraScale+ FPGA with 2 ARM A53 processors. It doesn't support flash natively, but has 2 external DDR4 controllers that can support up to 32GB or 64GB (depending on which datasheet you read). It also supports an NVMe attachment (M.2?) for larger, slower storage options.
Flash Emulation
CRZ has validated the design with 8GB on the first controller and 16GB on the second controller, so that is our initial configuration - emulating 24GB of flash in DRAM.
Interrupts to the ARM
The VIRTIO controller can raise an interrupt to the ARM. It has a status register with the reason for the interrupt. There are about 15 different reasons right now, inherited from NVMe - some of these will have to change.
Interrupts to the host
I am using MSI style interrupts (8 vectors). VIRTIO prefers MSI-X but that requires implementing more logic in the user IP to read values from the table and update the PBA. This is not supported in the NVMe design, so I'm falling back to MSI for now. There is room in BAR0 for the MSI-X table (0x800) & PBA (0x900) but those are currently empty.
VIRTIO protocol details
The idea is to use the VIRTIO 1.1 protocol, unmodified. That means we get the host driver for free by using the virtio_fs driver from Linux. All we have to do is make the hardware compliant with the driver. It sounds easy, but some details get in the way. VIRTIO specifies 5 additional PCI capabilities that must appear within the standard PCI configuration region (256 bytes). So far, this is not possible with the Xilinx PCIe 1.3 endpoint IP. I have added a work-around in the Linux PCIe subsystem that emulates the config space for our specific vendor ID and device ID (1AF4:105A). The capabilities don't do much, other than point to a BAR that contains the actual configuration.
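For reference, the capability structure the work-around has to emulate is small; this is the layout from the VIRTIO 1.1 spec (section 4.1.4), where each capability just points at a window inside a BAR:

```c
#include <stdint.h>

/* One VIRTIO PCI capability, per VIRTIO 1.1.  Five of these (common,
 * notify, ISR, device-specific cfg, PCI cfg access) must appear in the
 * 256-byte PCI configuration space. */
struct virtio_pci_cap {
    uint8_t  cap_vndr;   /* PCI_CAP_ID_VNDR (0x09) */
    uint8_t  cap_next;   /* link to the next capability */
    uint8_t  cap_len;    /* length of this capability */
    uint8_t  cfg_type;   /* one of VIRTIO_PCI_CAP_* below */
    uint8_t  bar;        /* which BAR holds the structure */
    uint8_t  padding[3];
    uint32_t offset;     /* offset of the structure within the BAR */
    uint32_t length;     /* length of the structure */
};

enum {
    VIRTIO_PCI_CAP_COMMON_CFG = 1,
    VIRTIO_PCI_CAP_NOTIFY_CFG = 2,
    VIRTIO_PCI_CAP_ISR_CFG    = 3,
    VIRTIO_PCI_CAP_DEVICE_CFG = 4,
    VIRTIO_PCI_CAP_PCI_CFG    = 5,
};
```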
BAR (base address register)
So far, we have only needed a single BAR. It is 64KB, 64-bit, non-prefetchable. I couldn't figure out an easy way of connecting this to a memory block, so I implemented it using registers. In fact, the majority of the addresses don't even respond (there is a default value when reading invalid registers). The rest implement the VIRTIO configuration structs, including the device-specific virtio_fs configuration containing the file system tag.
I plan to connect a 4KB (36Kb) BRAM to 0x1000 for the hi-prio virtqueue and another at 0x2000 for the regular virtqueue. There would still be room for a 2nd regular virtqueue at 0x3000 for futher experimentation. All of that would fit in a 16KB BAR, so I could trim it down later if no more space is needed. The descriptors in the virtqueue would point to DRAM, exposed over the same BAR or another one - not sure yet.
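Written out as an address map (offsets taken from the paragraphs above; the names and the 16KB figure are just my bookkeeping):

```c
/* Planned BAR0 layout for the virtio-drive frontend.  Names are mine;
 * only the offsets come from the design notes. */
#define VD_BAR_REGS        0x0000  /* VIRTIO config structs (registers) */
#define VD_BAR_MSIX_TABLE  0x0800  /* reserved, currently empty */
#define VD_BAR_MSIX_PBA    0x0900  /* reserved, currently empty */
#define VD_BAR_VQ_HIPRIO   0x1000  /* 4KB BRAM: hi-prio virtqueue */
#define VD_BAR_VQ_REQ0     0x2000  /* 4KB BRAM: regular virtqueue */
#define VD_BAR_VQ_REQ1     0x3000  /* room for a 2nd regular queue */
#define VD_BAR_TRIMMED_LEN 0x4000  /* 16KB covers everything above */
```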
VIRTIO Register Block
I use the string 'EMBEDDED' for this file system tag.
ARM Memory
The ARM uses virtual memory through an MMU. Currently all memory is identity mapped. There is a problem accessing external DDR with memcpy - sometimes I get an exception. DWORD access seems to work fine.
Configured using 40-bit physical addresses (PArange) and 48-bit virtual addresses (the largest supported by this ARM implementation). T0SZ=16.
File System Design
Base the design on ctFS + IPLFS, with some modifications; DevFS is also relevant {can't find the source code yet}. CrossFS is written by the same people, and is available here: https://github.com/RutgersCSSystems/CrossFS
DevFS states that integration of the file system with the FTL is not yet done, but would likely provide a benefit. OrcFS has done this. I have a very tightly integrated FS+FTL.
The basis is an inode. An inode is basically a descriptor, pointing to the data contents (contiguously allocated in virtual memory), its size, type and some extended attributes. Inodes live in cache (i.e. RAM) and are flushed to flash from time to time.
Directory entries are a fixed size (currently 72 bytes, but likely to change). Directories are not stored in an inode (as in some other file systems) but rather in the contents of a file (like FAT?). File names are fixed at 63 bytes (the main factor in the directory entry size), but larger names may be supported by hashing the name and storing the hash in the directory entry. The directory entry also contains the inode index, and the inode has a list of attributes. One attribute can be the file name, which can be much larger than 63 bytes (basically up to the 2KB total extended-attribute limit).
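An illustrative C layout that adds up to the stated 72 bytes, assuming a 4-byte inode index and 4 bytes of flags (the real field split beyond the name and inode index is a guess):

```c
#include <stdint.h>
#include <string.h>

#define VD_NAME_MAX 63

/* Hypothetical 72-byte directory entry: 63 bytes of name + NUL,
 * a 4-byte inode index, and 4 bytes of flags (e.g. a "name is hashed"
 * marker for long names stored as an extended attribute). */
struct dirent72 {
    char     name[VD_NAME_MAX + 1];
    uint32_t inode;   /* index into the inode table */
    uint32_t flags;
};

static void dirent_set(struct dirent72 *d, const char *name, uint32_t ino)
{
    memset(d, 0, sizeof(*d));            /* guarantees NUL termination */
    strncpy(d->name, name, VD_NAME_MAX); /* longer names would be hashed */
    d->inode = ino;
}
```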
One modification from ctFS is to extract the data segment headers from the data area and put them inside the metadata area in a standalone data structure. The main reason is to maximize the data region space, and not dirty it with more metadata. Now file contents are aligned to the beginning of the partition, and there are no overlapping regions where updating metadata can invalidate a data page. This is important on flash.
Modifications
- I think a queue per client would have less overhead than a queue per file. I'm not yet convinced of the benefits of a queue per file.
- We need a virtio front-end
Performance
gpu_io_read() takes between 15000 and 200000 GPU cycles with the 1us delay. This comes down to 5000 cycles without the delay.
fuse_readdir() on the source directory takes about 8.6M GPU cycles. Sending the command is about 160000 cycles, waiting for the response is the majority of the time (not a big surprise).
The datapath is ridiculously long:
User -> kernel -> vfs -> fuse -> FUSE -> kernel -> virtio_fs -> device
A shorter, more appropriate path would be:
User -> kernel -> vfs -> virtio_fs -> device
With some modifications to user space (fixing up glibc) we could have:
User -> device
Linux to Drive Benchmarks
Baseline bare-metal host Linux+SSD
VM using QEMU+KVM+Buildroot
iozone -a -i0 -i1 -g2048
mke2fs -b 65536 -i 65536 -I 128 -L joel /dev/nvme0n1
Design Issues
CVE-2022-0358
Setup
On babbage, my PCIe devices are connected like this:
+-[0000:40]-+-00.0
|           +-00.2
|           +-01.0
|           +-01.1-[41]--+-00.0  (VCK5000)
|           |            \-00.1
|           +-03.0
|           +-03.1-[42]----00.0  (Daisy FPGA: virtio or NVMe)
+-[0000:60]-+-00.0
|           +-00.2
|           +-01.0
|           +-01.1-[61]--+-00.0  (P1000 GPU target)
|           |            \-00.1
|           +-02.0
|           +-03.0
|           +-03.1-[62]--+-00.0  (T400 GPU - not used)
|           |            \-00.1
40:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
41:00.0 Memory controller: Xilinx Corporation Device 5048
41:00.1 Memory controller: Xilinx Corporation Device 5049
40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
	IOMMU group: 19
	Capabilities: [2a0 v1] Access Control Services
		ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
		ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
42:00.0 Mass storage controller: Red Hat, Inc. Virtio file system
60:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
60:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Milan IOMMU (rev 01)
60:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
	IOMMU group: 1
	Capabilities: [2a0 v1] Access Control Services
		ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
		ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
61:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro P1000] (rev a1)
62:00.0 VGA compatible controller: NVIDIA Corporation TU117GL [T400 4GB] (rev a1)
Workloads
Image Classification
https://github.com/akrizhevsky/cuda-convnet2
https://github.com/vzhuang/CUDA-convnet
Gene Sequencing
https://github.com/OpenHero/gblastn
- Installing from the README instructions runs into the following challenges:
- Requires an existing NCBI-BLAST installation, built from source, to hook into
configure: error: Do not know how to build MT-safe with compiler /usr/bin/g++
- This should be easily solvable with either of 2 approaches:
- Ignoring the compiler version (https://fabioticconi.wordpress.com/2016/01/08/how-to-compile-ncbi-igblast-with-gcc-5-x/). This does not work; the error remains.
- Downgrading the compiler version (install and point to a lower version). This works, but then the error moves to Boost.Test.
- I believe the problem is that their GPU implementation depends on the specific NCBI-BLAST version they use
Note that newer versions of NCBI-BLAST build fine, so I think the issue is just with the (very old!) NCBI-BLAST version. However, it is unclear how to run their code with a newer NCBI-BLAST.
Others
cuDF https://docs.rapids.ai/api/cudf/stable/
Bioinformatics (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3018811/); running thousands of ~100B queries against a database. Apparently this is done with files (ref: http://ccl.cse.nd.edu/research/papers/small-grid07.pdf), but I am not sure why
Data streaming (e.g. https://developer.nvidia.com/blog/beginners-guide-to-gpu-accelerated-event-stream-processing-in-python/); processing streams of file data from IoT and other sources
Supposedly a wide variety of physics simulation and HPC applications (ref: https://hal.archives-ouvertes.fr/hal-01892691/document)... but I have had negative experiences with HPC and would prefer to avoid it
Datasets
Pictures
MNIST: http://yann.lecun.com/exdb/mnist/
CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html
Genomes
BLAST: https://ftp.ncbi.nlm.nih.gov/blast/db/
Eval Platforms
Ideally, I want to modify the firmware of a real SSD, attached over a memory-mapped bus (PCIe or similar). Since it is difficult to get my hands on the source code of a real SSD, I am considering the following as alternatives:
Platform | Notes | Examples |
---|---|---|
Cosmos+ | Limited to 2GB/s by external PCIe gen 2 interface on Zynq 7000 | RecSSD (2021) |
NVIDIA Bluefield-2 | External PCIe Gen 4. Some models have direct-attached storage (NVMe U.2) - end of life, and not available for purchase! Other models support NVMe-oF and SNAP (i.e. SAN) | |
Enzian | Remote access only. Is it possible to synthesize Cosmos+ FW to run on Enzian? Could be a perfect fit | |
OpenExpress | For FPGA | |
OpenNVM | For Nexys 3 board (Xilinx Spartan-6 LX16 FPGA) | |
Daisy+ | Limited to 256GB flash, but can use NVMe as well. Source: https://github.com/CRZ-Technology/OpenSSD-OpenChannelSSD | |
Mt. Evans (IPU) | Only NVMe. Announced but not available yet! | |
Related Work
Year | Conference | Title & Link | Comments |
---|---|---|---|
2022 | USENIX ATC | IPLFS: Log-Structured File System without Garbage Collection | |
2020 | HotStorage | CompoundFS: Compounding I/O Operations in Firmware File Systems | |
2018 | FAST | Designing a True Direct-Access File System with DevFS | |
2018 | Transactions on Storage | OrcFS: Orchestrated File System for Flash Storage | |
Approaching DRAM performance using µs-latency flash memory https://dl.acm.org/doi/abs/10.14778/3457390.3457397
BaM: A Case for Enabling Fine-grain High Throughput GPU-Orchestrated Access to Storage https://arxiv.org/abs/2203.04910
Serverless computing on heterogeneous computers https://dl.acm.org/doi/abs/10.1145/3503222.3507732
CrossFS https://www.usenix.org/system/files/osdi20-ren.pdf
INSIDER https://www.usenix.org/system/files/atc19-ruan_0.pdf
FusionFS https://www.usenix.org/system/files/fast22-zhang-jian.pdf
https://nona1314.github.io/pubs/gpukv-jung-sac21.pdf
https://www.usenix.org/system/files/hotstorage20_paper_awamoto.pdf
https://dl.acm.org/doi/10.1145/3477132.3483565 (DFS offload to smartNICs)
https://ucare.cs.uchicago.edu/pdf/asplos20-LeapIO.pdf (storage server offload to SoC)
Redefining the role of the cpu in the era of cpu-gpu integration
A 3.5X increase in fixed-function accelerators across the six most recent Apple SoCs
Programming FPGA with OpenSSD
We are using the OpenSSD hardware design based on https://github.com/CRZ-Technology/OpenSSD-OpenChannelSSD
License Server
In order to synthesize the design, which is required to generate a bitstream, a licensed version of Vivado is required. UBC provides access to a license server living on research-lm2.ece.ubc.ca. There are also servers within the ECE network that have a licensed version of Vivado ready to use; however, they do not have enough RAM to build our design.
To use Vivado licensed from the research-lm2.ece.ubc.ca license server:
- Install Vivado Design Edition version 2019.1
- When configuring the install, un-check the section where you configure a license
- Complete install
- Connect to the ECE VPN (you may need to request access from it@ece.ubc.ca)
- Run Vivado with the environment variable
XILINXD_LICENSE_FILE=27017@research-lm2.ece.ubc.ca
Open the project Daisy/NVMe/MIG/Daisy_NVMe_MIG_19.1_20220110/OpenSSD/OpenSSD.xpr
To generate a bitstream in order to program the device, click "Generate Bitstream". Depending on the number of processors you use, this may take up to 50GB of RAM or more.
The generated bitstream is located in: Daisy/NVMe/MIG/Daisy_NVMe_MIG_19.1_20220110/OpenSSD/OpenSSD.runs/impl_1/design_1_wrapper.bit
FPGA Development Flow (Vivado 2019.1)
- Modify design ( .v files)
- Refresh IP Catalog
- Open "IP Status" report
- Upgrade IP blocks as suggested
- Generate output products for block design (.bd file)
- Generate bitstream -> will automatically run Synthesis and Implementation first
Flash Datasheets
Manufacturer | Capacity | Bits per cell | Page Size | Block Size | Link |
---|---|---|---|---|---|
Kioxia | 16Gb | SLC | 4KB + 256 x 8 | 256K + 16K x 8 | https://www.kioxia.com/en-jp/business/memory/slc-nand/detail.TH58NVG4S0HTAK0.html |