PIM-SWAP

From UBC Wiki

PIM-SWAP is a page swap cache that compresses memory pages swapped out from RAM into UPMEM memory. It is implemented as a Linux kernel module that uses the 'frontswap' API. The linked report (File:Compressed in memory swap.pdf) explains the motivation and the initial experiments regarding block size and overall design.

Setup

Requires QEMU, g++, libelf-dev, libssl-dev and buildroot.

I am using the Ubuntu-supplied QEMU package, which reports its version as: QEMU emulator version 4.2.1 (Debian 1:4.2-3ubuntu6.12). QEMU also appears to work on WSL running Ubuntu.
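Before cloning anything, it can help to confirm that the required tools are on the PATH. The following is a minimal sketch, not part of the PIM-swap repository: the function name check_tools is hypothetical, and the binary names in the usage line are assumptions about how the packages above are exposed. libelf-dev and libssl-dev are libraries, so they are not covered by this check.

```shell
# Report which of the given commands are missing from PATH.
# Succeeds only if every command was found.
check_tools() (
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
)
```

For example: check_tools qemu-system-x86_64 g++ git make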

Clone PIM-SWAP from https://github.com/UBC-ECE-Sasha/PIM-swap and buildroot from https://git.busybox.net/buildroot, then check out buildroot tag 2021.02.2 on a new branch. The two repositories should be sub-directories of the same directory.

Next, run swap_setup.sh in the PIM-swap directory to create a disk image that serves as the swap device. The default size is 1G; a different size in gigabytes can be passed as a positional argument to the script.

Run setup.sh to create the environment (it sets up a custom buildroot in PIM-swap/output) and then compile buildroot.

git clone https://git.busybox.net/buildroot
git clone https://github.com/UBC-ECE-Sasha/PIM-swap
cd buildroot && git checkout 2021.02.2 -b pimswap
cd ../PIM-swap
./swap_setup.sh
./setup.sh
cd output
make all

Virtual build environment with Vagrant

Vagrant is a tool for building and maintaining virtual build environments. Vagrant is configured with a Vagrantfile, which specifies things such as VM parameters, which repositories to clone, and other commands to run upon VM startup.

The Vagrantfile for PIM-swap can be taken from the /vagrant directory in the PIM-swap repository or downloaded with the following command. The packages vagrant and virtualbox-qt must be installed. VirtualBox does not currently support Apple silicon, so newer Macs cannot run this tutorial.

curl -O https://raw.githubusercontent.com/UBC-ECE-Sasha/PIM-swap/vagrant-build-environment/vagrant/Vagrantfile && mkdir src

The VM can then be run with the command vagrant up and accessed with the command vagrant ssh; the logout command will stop the ssh session. The VM can be stopped with the command vagrant halt and destroyed with vagrant destroy.

Once logged into the VM over ssh, the PIM-swap directory should be in the home directory. From there, the regular build steps can be followed (without cloning the repos, as that has already been done). Building inside a VirtualBox shared directory creates linker problems, so build in the home directory and copy PIM-swap to a shared directory afterwards. By default, Vagrant shares the directory containing the Vagrantfile on the host as /vagrant on the guest. Thus, the following command copies the results of the PIM-swap build to the directory containing the Vagrantfile on the host. This will overwrite any PIM-swap files in that shared directory.

cp -L -R -f PIM-swap /vagrant

Running

Locally

modprobe pim_swap

QEMU

Once buildroot has been built, run QEMU with this command:

sudo ./run-qemu.sh

The QEMU monitor will then pop up. Once that happens, the guest can be accessed via ssh with the following command and the password "root".

ssh root@localhost -p 10022

To avoid the password prompt, and the ssh-keygen commands otherwise needed to clear stale host keys when reconnecting to QEMU, the following command can be used.

sshpass -p "root" ssh root@localhost -p 10022 -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no

QEMU is set up with a 1GB swap device (virtual disk). The swap device is set up and enabled automatically (by init) because it is listed in /etc/fstab. The pim_swap module must still be inserted manually. It can be inserted at any time (frontswap allows for that), but it can never be removed (frontswap does not provide any method for removing a swap cache). Use:

modprobe pim_swap

You can see the free and used swap space with the 'free' command. 'cat /proc/meminfo' gives even more detailed information.
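For scripted runs, the used swap can be computed directly from the SwapTotal and SwapFree fields of /proc/meminfo. A minimal sketch; the function name swap_used_kb is hypothetical, and the optional file argument exists only so the function can be tried against a saved copy of meminfo.

```shell
# Print used swap in kB, computed as SwapTotal - SwapFree.
# Reads /proc/meminfo by default; another file can be passed instead.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { print total - free }' "${1:-/proc/meminfo}"
}
```

Calling swap_used_kb before and after a workload gives a rough idea of how much was swapped out.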

QEMU disks

When adding a drive to QEMU, the parameter cache=none should be added to set the cache mode. This (mostly) prevents the host OS from caching the disk file by using O_DIRECT semantics,[1] which is ideal when using QEMU for performance benchmarking. However, Linux does not guarantee that O_DIRECT reads and writes bypass the host cache,[2] so this should not be treated as a hard guarantee.
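As an illustration of where cache=none goes, the sketch below builds the value for a -drive option. The function name drive_arg and the image name in the usage line are hypothetical, and a raw image is assumed, as elsewhere on this page.

```shell
# Build the value of a QEMU -drive option for a raw disk image
# with host caching disabled (cache=none).
drive_arg() {
    printf 'file=%s,format=raw,cache=none' "$1"
}
```

It would be used as: qemu-system-x86_64 ... -drive "$(drive_arg swap.raw)"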

Adding more disk space

Giving the root filesystem more space

Allegedly, adding the following line to your defconfig, where [x] is the desired rootfs size in GB, should give the rootfs more space. However, it still isn't working for me.

BR2_TARGET_ROOTFS_EXT2_SIZE="[x]G"

Adding an additional disk

In some cases, the default root file system on buildroot may be too small to accommodate storage needs. To get more disk space, the following tutorial can be used to add an additional drive.

  1. Run the following command to create a disk image, where [x] can be changed to whatever size in gigabytes you need: qemu-img create disk_[x]G.raw [x]G
  2. Add the following line to the run-qemu.sh script to tell QEMU where to find the disk image made above: -drive id=additional_disk,file=disk_[x]G.raw,if=ide,format=raw
  3. Run QEMU and ssh into the VM.
  4. Run fdisk -l to find the drive. In this tutorial, we will assume that the drive is /dev/sdb. There should be text saying that the drive isn't partitioned.
  5. Run fdisk /dev/sdb (or whatever your drive is called) to create a partition table. When asked for a command, enter 'w' to write a new partition table.
  6. Run mke2fs /dev/sdb (or whatever your drive is called) to create an ext2 filesystem on the disk.
  7. Run mount /dev/sdb /mnt to mount the drive. Use an empty directory such as /mnt as the mount point; mounting over /dev would hide the guest's device nodes.
  8. The additional drive should now be usable. This can be checked by running df -h to see available disk space.
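The host-side and guest-side commands from the steps above can be wrapped in small helpers. This is only a sketch: the function names and the DRY_RUN variable are hypothetical, and the /dev/sdb device and /mnt mount point are assumed defaults that should be adjusted to match what fdisk -l reports. With DRY_RUN set, the commands are printed instead of executed, which is a cheap way to check them first.

```shell
# Run a command, or just print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Host side (step 1): create a raw image of the given size in gigabytes.
create_disk_image() {
    run qemu-img create "disk_${1}G.raw" "${1}G"
}

# Guest side (steps 6 and 7): create an ext2 filesystem and mount it.
# Device and mount point default to /dev/sdb and /mnt (assumptions).
format_and_mount() {
    run mke2fs "${1:-/dev/sdb}"
    run mount "${1:-/dev/sdb}" "${2:-/mnt}"
}
```

For example, DRY_RUN=1 create_disk_image 4 prints the qemu-img command for a 4G image without creating it.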

Testing/benchmarking

Automated testing

An extensive set of scripts for running workloads, logging system data and interpreting the data has been developed; it is documented in bench/README.md.

stress-ng

To stress memory, use 'stress-ng', which is built as an optional package in buildroot.

stress-ng --vm 2 --vm-bytes 1G --timeout 30s

wiredtiger

wiredtiger is the storage engine at the heart of MongoDB, a popular document-oriented database program. wiredtiger comes with a benchmarking tool called wtperf that can be used to evaluate the performance of PIM-Swap. wtperf has two phases, a populate phase and a workload phase. We are specifically interested in performance during the workload phase.

Building

Building wiredtiger requires cmake. ninja and ccache are also recommended for faster build times. wiredtiger can be built with the following instructions or with the script bench/conf/wiredtiger/WT-make.sh. It isn't strictly necessary to use the mongodb-5.3.1 release, but sticking with one release makes comparisons between tests easier.

git clone https://github.com/wiredtiger/wiredtiger.git
cd wiredtiger
git checkout tags/mongodb-5.3.1 -b WT-5.3.1
mkdir build && cd build
cmake ../.
make

Running locally

The basic command for running wtperf is ./wtperf -O [config], where [config] is a wtperf configuration file. Configuration files can be found in bench/conf/wiredtiger/configs and wiredtiger/bench/wtperf/runners. The scripts local-test.sh and wt-test.sh can run wtperf and take care of limiting memory and handling logging, respectively. If you're running on NFS, you should also specify a database directory on an SSD with the -h flag. Otherwise, wiredtiger will run off of files on the NFS mount.

Reusing the database

When using a read-only workload like ycsb-c, it isn't necessary to populate the database each time. Instead, we can create a "populate" and a "workload" version of the configuration file. First, create a populate version of a configuration file in which run_time and run_ops are set to 0. Next, create a workload version of the configuration file with create=false added. Now, you can run the populate version once and then the workload version as many times as you want. For this to work, the same directory must be specified with the -h flag each time.
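The two variants can be derived from one base configuration file mechanically. The sketch below assumes the configuration file holds one key=value setting per line; the function name make_wtperf_variants and the file names in the usage line are hypothetical.

```shell
# Derive a populate and a workload config from a base wtperf config.
# Usage: make_wtperf_variants <base> <populate output> <workload output>
make_wtperf_variants() {
    # Populate version: no workload phase (run_time and run_ops set to 0).
    grep -v -E '^(run_time|run_ops)=' "$1" > "$2"
    printf 'run_time=0\nrun_ops=0\n' >> "$2"
    # Workload version: reuse the existing database instead of creating it.
    grep -v -E '^create=' "$1" > "$3"
    printf 'create=false\n' >> "$3"
}
```

For example, make_wtperf_variants ycsb-c.wtperf ycsb-c-populate.wtperf ycsb-c-workload.wtperf, followed by one wtperf run with the populate file and any number of runs with the workload file, all with the same -h directory.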

Workloads

Configuration files can be found in the bench/conf/wiredtiger/configs directory.

  • ycsb-a (read and update)
  • ycsb-c (read only)

memcached

memcached is a generic memory object caching system. It can be benchmarked with a highly configurable load-generating tool called memaslap. memaslap is closely related to memslap and memcslap but has more functionality. memcached, benchmarked with memaslap, was used to benchmark Infiniswap.

memcached is a package in buildroot and can be enabled by going into the buildroot directory, running make menuconfig, navigating to Target packages -> Networking applications, enabling memcached and saving. This step should be completed before running "make all" as described in the Setup section.

memaslap can then be installed on your machine with the following steps, adapted from an article.

  1. Download libmemcached. (I used 1.0.18)
  2. Edit the configure file to include the lines "CPPFLAGS='-fpermissive'" and "LDFLAGS='-L/lib64 -lpthread'". I inserted them right after the phrase "Main body of script" on line 2830. This edit is currently necessary to suppress some compiler errors. Hopefully these bugs will be fixed and the edit will no longer be needed.
  3. Run the following commands to configure, build and install libmemcached, including memaslap.

./configure --enable-memaslap
sudo make install

After inserting the pim_swap module as described above, memcached can be run with the following command:

memcached -u nobody

From your machine, you can then run memaslap with the server address being the one you ssh'd into at port 11211.

The script qemu-memaslap.sh automates the process of starting a VM, running memcached on it and running memaslap. It runs memaslap twice: the first run pre-populates the server and the second run is the actual test. Its command-line arguments are, in order: the number of cores for QEMU to emulate (default: 4), the size of guest memory for QEMU in MB (default: 1600) and the memcached memory limit in MB (default: 2400).
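Positional arguments with defaults like these are commonly handled with shell parameter expansion. The sketch below is not taken from qemu-memaslap.sh; the function name memaslap_args is hypothetical and it only shows how the three defaults listed above would resolve.

```shell
# Resolve the three positional arguments against their defaults and
# print them as "cores guest_mem_mb memcached_limit_mb".
memaslap_args() {
    echo "${1:-4} ${2:-1600} ${3:-2400}"
}
```

Note that with the defaults the memcached limit is larger than the guest memory, so a fully populated cache cannot fit in RAM and some of it presumably has to be swapped out.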

Troubleshooting

Missing Frontswap Config

If you see an error message like the following, it means the fragment "config/linux_config/pimswap.cfg" was not applied to the Linux .config file.

ERROR: "frontswap_register_ops" [/home/john/Workspace/Research/upmem/PIM-swap/output/build/pimswap-custom/./pim_swap.ko] undefined!
ERROR: "frontswap_writethrough" [/home/john/Workspace/Research/upmem/PIM-swap/output/build/pimswap-custom/./pim_swap.ko] undefined!
make[4]: *** [scripts/Makefile.modpost:94: __modpost] Error 1
make[3]: *** [Makefile:1645: modules] Error 2
make[2]: *** [package/pkg-generic.mk:251: /home/john/Workspace/Research/upmem/PIM-swap/output/build/pimswap-custom/.stamp_built] Error 2
make[1]: *** [Makefile:23: _all] Error 2
make[1]: Leaving directory '/home/john/Workspace/Research/upmem/PIM-swap/output'
make: *** [Makefile:6: pim-swap] Error 2

I am still not sure why this happens. As a workaround, 'touch' the file config/linux_config/pimswap.cfg, re-run ./setup.sh, and then force Linux to rebuild by running make linux-rebuild all from the output directory.

The following should appear at the beginning of the build, when buildroot merges the config files together:

Using /home/joel/projects/upmem/pim-swap/output/build/linux-5.4.95/.config as base
Merging linux_config/linux_e1000.cfg
Merging linux_config/pimswap.cfg

You can also verify that the flags are in the config file:

grep FRONTSWAP /home/joel/projects/upmem/pim-swap/output/build/linux-5.4.95/.config

Can't open file ... Config.in

The following error may be caused by an out-of-date PIM-swap/config/Config.in file.

package/Config.in:[LINE NUMBER]: can't open file "package/[PACKAGE NAME]/Config.in"

Check that you are on tag 2021.02.2. You can also check in buildroot/CHANGES whether the package has indeed been removed from buildroot and, if so, remove the offending line from PIM-swap/config/Config.in. Before rerunning setup.sh, discard all changes in buildroot.

Block node is read-only

This may occur if file permissions do not allow writing to the swap disk image. Adjust the permissions to allow writes.

kex_exchange_identification: read: Connection reset by peer

If this error occurs when trying to ssh right after QEMU is started, the guest may not have finished booting when ssh was started. Trying ssh again generally works. This can also happen if ssh on the guest gets messed up somehow; for example, doing something with /dev will cause this.
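In scripts, retrying by hand can be replaced with a small retry helper. A minimal sketch; the function name retry and the attempt count and delay in the usage line are hypothetical.

```shell
# Retry a command up to $1 times, sleeping $2 seconds between attempts.
# Succeeds as soon as the command succeeds; fails after the last attempt.
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_n=1
    until "$@"; do
        if [ "$retry_n" -ge "$retry_max" ]; then
            return 1
        fi
        retry_n=$((retry_n + 1))
        sleep "$retry_delay"
    done
}
```

For example, retry 30 2 ssh root@localhost -p 10022 true keeps trying for about a minute while the guest boots.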

Debugging

This section covers all debugging techniques we can use for pim-swap.

Using kgdb to debug pim-swap

In principle, the kernel can be debugged by attaching gdb to QEMU as a remote target. This section covers the steps to do so in practice for pim-swap.

Building the kernel

The kernel configuration needs to be changed in order to use kgdb effectively. To change the current kernel config, go to the kernel directory and run make menuconfig. The required config changes are:

  1. Set CONFIG_DEBUG_INFO to y
  2. Set CONFIG_KGDB to y
  3. Disable KASLR either by disabling CONFIG_RANDOMIZE_BASE or by passing the nokaslr boot option to qemu. I used both.
  4. Set CONFIG_KGDB_SERIAL_CONSOLE to y
  5. Leave CONFIG_DEBUG_INFO_REDUCED off
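After running make menuconfig, the resulting .config can be checked against the list above. A minimal sketch; the function name check_kgdb_config is hypothetical. CONFIG_RANDOMIZE_BASE only produces a note rather than a failure, since booting with nokaslr is an accepted alternative.

```shell
# Check a kernel .config for the kgdb-related options listed above.
# Prints each problem found; succeeds only if there are none.
check_kgdb_config() (
    config=$1
    bad=0
    for opt in CONFIG_DEBUG_INFO CONFIG_KGDB CONFIG_KGDB_SERIAL_CONSOLE; do
        if ! grep -q "^${opt}=y\$" "$config"; then
            echo "not enabled: $opt"
            bad=1
        fi
    done
    if grep -q '^CONFIG_DEBUG_INFO_REDUCED=y$' "$config"; then
        echo "should be off: CONFIG_DEBUG_INFO_REDUCED"
        bad=1
    fi
    if grep -q '^CONFIG_RANDOMIZE_BASE=y$' "$config"; then
        echo "note: CONFIG_RANDOMIZE_BASE is on; boot with nokaslr"
    fi
    [ "$bad" -eq 0 ]
)
```

It takes the path to the kernel .config as its only argument.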

After changing the configuration, rebuild the kernel. You should see vmlinux and a vmlinux-gdb.py Python script in the kernel root directory.

Setting up buildroot target with gdb

To debug on the target, gdb must be installed in the buildroot-generated image, which requires changing the config scripts to enable C++ support and gdb. The config changes are pushed to the kernel_module branch.

Using gdb with qemu

  1. Update the run-qemu.sh script (specifically the qemu-system command) with the -s and -S options. -s opens a gdb server on TCP port 1234 and -S pauses the CPU at startup, so QEMU waits for a debugger to be attached.
  2. Run gdb from the kernel root using gdb vmlinux
  3. Attach to qemu target using target remote :1234
  4. Set a hardware breakpoint on start_kernel using hb start_kernel and continue. This breakpoint should hit. To read more about why hardware breakpoints are needed, see the sources. Continue from the breakpoint.
  5. ssh into the guest.
  6. Load pimswap using modprobe pim_swap
  7. Get the virtual address of the text segment of the loaded module using cat /sys/module/pim_swap/sections/.text
  8. Load the symbol file in gdb using add-symbol-file ../pimswap_custom/pim_swap.o <address from previous step>
  9. Set breakpoint on pimswap_frontswap_store using b pimswap_frontswap_store
  10. Continue
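Once the module is loaded and its .text address is known, steps 3 and 8 to 10 can be collected into a gdb command file so they do not have to be retyped in every session. A minimal sketch; the function name write_gdb_script and the output file name are hypothetical, and the address in the usage line is a placeholder for the value read in step 7.

```shell
# Write a gdb command file that attaches to QEMU, loads the module symbols
# at the given .text address and breaks on pimswap_frontswap_store.
# Usage: write_gdb_script <text address> <module object> <output file>
write_gdb_script() {
    cat > "$3" <<EOF
target remote :1234
add-symbol-file $2 $1
break pimswap_frontswap_store
continue
EOF
}
```

For example, write_gdb_script 0xffffffffc0000000 ../pimswap_custom/pim_swap.o pimswap.gdb, followed by gdb -x pimswap.gdb vmlinux from the kernel root.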

Fun adventures

  1. Since our pimswap kernel module was built with the -O2 option (which is recommended for kernel code, as some inlining is required inside the kernel), the compiler placed the function pimswap_frontswap_store (the *only* function with an implementation) in a section called .text.unlikely, which holds code the compiler expects to run rarely. In this case, even if you load the symbol table for pimswap into gdb, it will not find the function when you set a breakpoint. To find out whether the function exists in the ELF file, run objdump -t on the file and check which section the function belongs to. For debugging purposes I added -O0 to the build, and that fixed this issue.
  2. If kernel address randomization (KASLR) is on, the virtual addresses provided in the ELF file do not match the running kernel, so debugging with ordinary breakpoints is basically impossible. In this case you will have to use hardware breakpoints.
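The objdump check from the first item can be scripted by filtering the symbol table. A minimal sketch; the function name symbol_section is hypothetical, and it assumes the usual objdump -t layout in which the symbol name is the last field and the section name is two fields before it.

```shell
# Read `objdump -t` output on stdin and print the section that holds
# the given symbol.
symbol_section() {
    awk -v sym="$1" '$NF == sym { print $(NF - 2) }'
}
```

For example: objdump -t pim_swap.o | symbol_section pimswap_frontswap_store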

Sources

https://www.kernel.org/doc/html/v4.18/dev-tools/kgdb.html

https://www.linuxjournal.com/article/5749

Related Work

| Year | Conference | Title |
|------|------------|-------|
| 2022 | ASPLOS | TMO: Transparent Memory Offloading in Datacenters |
| 2022 | ASPLOS | Clio: A Hardware-Software Co-Designed Disaggregated Memory System |
| 2021 | ASPLOS | Rethinking Software Runtimes for Disaggregated Memory |
| 2021 | SOSP | MIND: In-Network Memory Management for Disaggregated Data Centers |
| 2020 | USENIX ATC | Effectively Prefetching Remote Memory with Leap |
| 2020 | EuroSys | Can far memory improve job throughput? |
| 2019 | ASPLOS | Software-Defined Far Memory in Warehouse-Scale Computers |
| 2018 | IEEE Big Data | Dynamic and Transparent Memory Sharing for Accelerating Big Data Analytics Workloads in Virtualized Cloud |
| 2017 | ASPLOS | ReFlex: Remote Flash ≈ Local Flash |
| 2021 | SOSP | HeMem: Scalable Tiered Memory Management for Big Data Applications and Real NVM |
| 2014 | ATC | Efficient Tracing of Cold Code via Bias-Free Sampling |

[1] "KVM Disk Cache Modes". SUSE. Retrieved 2021-12-01.
[2] "open(2)". Linux manual page. Retrieved 2021-12-01.