Monitoring ECC memory on Linux with rasdaemon

If you have a workstation built around an AMD Ryzen/Threadripper or Intel Xeon processor, chances are you're using ECC memory. ECC memory is a worthy investment to improve the reliability of your machine and, if properly monitored, will allow you to spot memory problems before they become catastrophic.

On recent Linux kernels the rasdaemon tools can be used to monitor ECC memory and report both correctable and uncorrectable memory errors. As we'll see, with a little bit of tweaking it's also possible to know exactly which DIMM is experiencing the errors.

Installing rasdaemon

First of all you'll need to install rasdaemon; it's packaged for most Linux distributions:

  • Debian/Ubuntu

    # apt-get install rasdaemon
  • Fedora

    # dnf install rasdaemon
  • Gentoo

    The package is currently marked as unstable so you'll need to unmask it first:

    # echo "app-admin/rasdaemon ~amd64" >> /etc/portage/package.accept_keywords

    Then I recommend enabling sqlite support; this makes rasdaemon record events to disk and is particularly useful for machines that get rebooted often:

    # echo "app-admin/rasdaemon sqlite" >> /etc/portage/package.use

    Finally install rasdaemon itself:

    # emerge rasdaemon

Configuring rasdaemon

Then we'll set up rasdaemon to launch at startup and to record events to an on-disk sqlite database.

  • Debian/Ubuntu/Fedora and other systemd-based distros

    Note that on Fedora rasdaemon will not work if Secure Boot is enabled because of kernel lockdown. You will have to either disable kernel lockdown or Secure Boot if you want to use rasdaemon.

    # systemctl enable rasdaemon
    # systemctl start rasdaemon
  • Gentoo with OpenRC

    Add the following line to /etc/conf.d/rasdaemon:

    RASDAEMON_ARGS=--record

    Add rasdaemon to the default run-level and start it:

    # rc-config add rasdaemon default
    # /etc/init.d/rasdaemon start

Configuring DIMM labels

At this point rasdaemon should already be running on your system. You can now use the ras-mc-ctl tool to query the errors that have been detected. From now on I will use data from my machine to give an example of the output.

# ras-mc-ctl --error-count
Label                 CE      UE
mc#0csrow#2channel#0  0       0
mc#0csrow#2channel#1  0       0
mc#0csrow#3channel#1  0       0
mc#0csrow#3channel#0  0       0

The CE column shows the number of corrected errors for a given DIMM; the UE column shows the uncorrectable errors that were detected. The label on the left shows the EDAC path under /sys/devices/system/edac/mc/ of every DIMM.

This is not very readable, because the kernel has no idea of the physical layout of your motherboard so it will print the EDAC paths instead of the names of the DIMM slots. We can confirm that the labels are missing with this command:

# ras-mc-ctl --print-labels
ras-mc-ctl: Error: No dimm labels for ASUSTeK COMPUTER INC. model PRIME B450-PLUS

To identify which DIMM slots correspond to which EDAC path you will have to reboot your system with only one DIMM inserted, write down the name of the slot you inserted it in and then print out the paths with ras-mc-ctl --error-count. Repeat this for every slot.

In my case this was the mapping:

mc#0csrow#0channel#0  DIMM_A1
mc#0csrow#0channel#1  DIMM_A2
mc#0csrow#1channel#1  DIMM_A2
mc#0csrow#1channel#0  DIMM_A1
mc#0csrow#2channel#0  DIMM_B1
mc#0csrow#2channel#1  DIMM_B2
mc#0csrow#3channel#1  DIMM_B2
mc#0csrow#3channel#0  DIMM_B1

Note that there's more than one path per DIMM label; that's fine.

With this data at hand create a text file under /etc/ras/dimm_labels.d/. You will need to fill it up with the mapping data in the following format:

Vendor: <motherboard vendor name>
Model: <motherboard model name>
  <label>: <mc>.<row>.<channel>

You can obtain the motherboard vendor and model name with the following command:

# ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: ASUSTeK COMPUTER INC. model PRIME B450-PLUS

The label lines take a string (the name of the physical DIMM slot), then the numbers in the EDAC path corresponding to the physical slot. You can put more than one label entry per line by separating them with a semicolon. If a given label is associated with more than one EDAC path you can add the separate <mc>.<row>.<channel> sequences by separating them with a comma.

In my case the resulting file (/etc/ras/dimm_labels.d/asus) looks like this:

Vendor: ASUSTeK COMPUTER INC.
Model: PRIME B450-PLUS
  DIMM_A1:  0.0.0, 0.1.0;    DIMM_A2:   0.0.1, 0.1.1;
  DIMM_B1:  0.2.0, 0.3.0;    DIMM_B2:   0.2.1, 0.3.1;
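
If you have a lot of paths to map, the label lines can be generated from the mapping instead of being typed by hand. What follows is only a minimal sketch: it assumes the mapping is saved as whitespace-separated "EDAC path, slot name" pairs exactly as listed earlier, and the function name mapping_to_labels is made up for this example (it is not part of rasdaemon). It writes one label per line, which is a layout choice of the sketch; you still need to add the Vendor: and Model: lines yourself.

```shell
# Convert "EDAC path  slot name" pairs read on stdin into label lines
# suitable for a file under /etc/ras/dimm_labels.d/.
# Expected input, one pair per line:  mc#0csrow#0channel#0  DIMM_A1
# mapping_to_labels is a hypothetical helper, not a rasdaemon command.
mapping_to_labels() {
    awk '
    NF < 2 { next }                      # skip blank or incomplete lines
    {
        # Split on runs of non-digits: part[1] is empty,
        # part[2..4] hold the mc, csrow and channel numbers.
        split($1, part, /[^0-9]+/)
        loc = part[2] "." part[3] "." part[4]
        if (!($2 in locs)) { order[++count] = $2; locs[$2] = loc }
        else               { locs[$2] = locs[$2] ", " loc }
    }
    END {
        # One label per line, in the order the labels were first seen.
        for (i = 1; i <= count; i++)
            printf "  %s: %s;\n", order[i], locs[order[i]]
    }'
}
```

Assuming the mapping was saved to a file called mapping.txt, running mapping_to_labels < mapping.txt prints lines that can be pasted below the Model: line.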

You can find another example of this with configuration entries for a bunch of other motherboards in the edac-utils repo.

Once the file is ready it's time to load the labels in the kernel with the following command:

# ras-mc-ctl --register-labels

Printing out labels and error counts will now use the physical DIMM slot names which is much nicer if you need to figure out which of your DIMMs is faulty and needs to be replaced:

# ras-mc-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS
                                    DIMM_A1              0:0:0 missing
                                    DIMM_A2              0:0:1 missing
                                    DIMM_A1              0:1:0 missing
                                    DIMM_A2              0:1:1 missing
mc0 csrow 2 channel 0               DIMM_B1              DIMM_B1
mc0 csrow 2 channel 1               DIMM_B2              DIMM_B2
mc0 csrow 3 channel 0               DIMM_B1              DIMM_B1
mc0 csrow 3 channel 1               DIMM_B2              DIMM_B2

# ras-mc-ctl --error-count
Label         CE      UE
DIMM_B2       0       0
DIMM_B1       0       0
DIMM_B1       0       0
DIMM_B2       0       0

To persist the DIMM names across reboots load the ras-mc-ctl service at startup:

  • Debian/Ubuntu/Fedora and other systemd-based distros

    # systemctl enable ras-mc-ctl
    # systemctl start ras-mc-ctl
  • Gentoo with OpenRC

    # rc-config add ras-mc-ctl default
    # /etc/init.d/ras-mc-ctl start

You're done! After rebooting your system rasdaemon will be continually running and recording errors. You can use ras-mc-ctl to print out a summary of all the errors that have been seen and recorded. Since the counts are stored on disk they will be persisted across reboots. Here's some example output from my machine:

# ras-mc-ctl --summary
Memory controller events summary:
  Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5

PCIe AER events summary:
  1 Uncorrected (Non-Fatal) errors: BIT21

No Extlog errors.

No devlink errors.
Disk errors summary:
  0:0 has 6646 errors
No MCE errors.
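
Since the whole point is to notice errors before they become a problem, it can be handy to turn the summary into a single number that a cron job or a monitoring script can act on. The sketch below assumes the summary lines look like the sample output above ("Corrected on DIMM Label(s): ... errors: N"); the format may differ between rasdaemon versions, and the function name corrected_total is made up for this example.

```shell
# Sum the corrected memory error counts found in `ras-mc-ctl --summary`
# output read on stdin. Prints 0 when no matching line is present.
# Assumption: the count is the last field of each "Corrected on DIMM" line,
# as in the sample output shown in this post.
corrected_total() {
    awk '/Corrected on DIMM Label\(s\):/ { total += $NF }
         END { print total + 0 }'
}

# Example use (run as root, not executed here):
#   if [ "$(ras-mc-ctl --summary | corrected_total)" -gt 0 ]; then
#       echo "corrected ECC errors detected"
#   fi
```

A non-zero result is not necessarily alarming on its own, but a count that keeps growing on the same DIMM is a good reason to look closer.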

Troubleshooting

  • ras-mc-ctl --status prints out ras-mc-ctl: drivers are not loaded

    For rasdaemon to work the EDAC kernel drivers for your particular machine need to be loaded. They are usually loaded automatically at boot. You can check out which ones are loaded with this command:

    # lsmod | grep edac
    amd64_edac_mod         32768  0
    edac_mce_amd           28672  1 amd64_edac_mod

    If the EDAC drivers haven't been loaded automatically either your kernel doesn't provide one for your machine or you need to manually load it. Check the EDAC kernel documentation for more details.
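
To double check that the drivers actually expose something, you can also read the counters straight from sysfs without going through ras-mc-ctl. This is a rough sketch that assumes the csrow-based layout matching the EDAC paths shown earlier (mcN/csrowM/ce_count and ue_count); kernels built without the legacy EDAC sysfs interface may not provide these files. The function name edac_counts is made up for this example.

```shell
# Print corrected (CE) and uncorrected (UE) error counts for every csrow
# found in the EDAC sysfs tree. The tree root can be overridden with the
# first argument, which is mostly useful for testing.
# Assumption: the legacy csrow layout is available on this kernel.
edac_counts() {
    root="${1:-/sys/devices/system/edac/mc}"
    found=0
    for csrow in "$root"/mc*/csrow*; do
        # When the glob matches nothing it stays literal and is skipped here.
        [ -r "$csrow/ce_count" ] || continue
        found=1
        mc=$(basename "$(dirname "$csrow")")
        printf '%s %s CE=%s UE=%s\n' "$mc" "$(basename "$csrow")" \
            "$(cat "$csrow/ce_count")" "$(cat "$csrow/ue_count")"
    done
    if [ "$found" -ne 1 ]; then
        echo "no EDAC memory controllers found under $root" >&2
        return 1
    fi
}
```

If edac_counts reports that no memory controllers were found, the EDAC driver for your platform is most likely not loaded.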

Gentoo 5.4.x generic kernel configuration

While the 5.4.x Linux kernel hasn't been marked as stable in Gentoo yet I've updated my generic kernel configuration file to match it. As with the previous config files for 4.14.x and 4.19.x this configuration is based on the Fedora kernel with some Gentoo-specific tweaks. It supports practically every bit of hardware in existence and enables a lot of bleeding-edge kernel functionality. The downside is that building it will take a while and the modules will occupy quite a bit of storage.

Note that this kernel configuration is for use with OpenRC. If you're using systemd you'll have to remove the CONFIG_GENTOO_LINUX_INIT_SCRIPT=y line from the configuration file and add CONFIG_GENTOO_LINUX_INIT_SYSTEMD=y instead.
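
The swap described above can be scripted. This is a minimal sketch assuming GNU sed (for the -i option) and a configuration file that uses the usual "# CONFIG_... is not set" convention for disabled options; the function name use_systemd_init is made up for this example.

```shell
# Switch a kernel .config from the OpenRC Gentoo option to the systemd one.
# Usage: use_systemd_init /usr/src/linux/.config
# Assumption: GNU sed is available (sed -i edits the file in place).
use_systemd_init() {
    config="$1"
    # Drop the OpenRC line and any existing mention of the systemd option,
    # then append the systemd option exactly once.
    sed -i \
        -e '/^CONFIG_GENTOO_LINUX_INIT_SCRIPT=y$/d' \
        -e '/^CONFIG_GENTOO_LINUX_INIT_SYSTEMD=/d' \
        -e '/^# CONFIG_GENTOO_LINUX_INIT_SYSTEMD is not set$/d' \
        "$config" &&
    echo 'CONFIG_GENTOO_LINUX_INIT_SYSTEMD=y' >> "$config"
}
```

After editing the file this way it's a good idea to run make olddefconfig in the kernel source directory so that any options the new setting depends on get resolved.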

To use it install the appropriate sys-kernel/gentoo-sources package (5.4.x), copy the configuration file under /usr/src/linux/ and rename it to .config then proceed to build and install the kernel as usual.

Gentoo 5.4.x kernel configuration file

Gentoo 4.19.x generic kernel configuration

The 4.19.23 Linux kernel was marked as stable in Gentoo a few days ago and I've just updated my generic kernel configuration file to match it. As with the 4.14.x configuration I posted a while ago this configuration is based on the Fedora kernel with some Gentoo-specific tweaks. It supports practically every bit of hardware in existence and enables a lot of bleeding-edge kernel functionality. The downside is that building it will take a while and the modules will occupy quite a bit of storage.

To use it install the latest stable sys-kernel/gentoo-sources package (4.19.x), copy the configuration file under /usr/src/linux/ and rename it to .config then proceed to build and install the kernel as usual.

Gentoo 4.19.x kernel configuration file

Setting the compose key on Xfce

The compose key is a handy tool to generate characters that aren't available on your keyboard. On Xfce there isn't a readily accessible way to set it, but it can be done rather easily from the Settings Editor:

  1. Launch the Settings Editor from Applications > Settings > Settings Editor or via the terminal by executing the xfce4-settings-editor command

  2. Select the keyboard-layout channel

  3. Look for the Compose property under Default > XkbOptions > Compose

  4. To enable the compose key you have to enter one of the following values in the Compose property:

    Compose key  Value
    Right Win    compose:rwin
    Left Win     compose:lwin
    Right Ctrl   compose:rctrl
    Left Ctrl    compose:lctrl
    Right Alt    compose:ralt

The resulting setting should look like this (I'm using the right Windows key in this example):

/images/xfce-compose-key.png
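
If you prefer the terminal, the same property can be set with xfconf-query. The sketch below assumes the property lives at /Default/XkbOptions/Compose in the keyboard-layout channel, as in the steps above; the two function names are made up for this example, and only the five values from the table are accepted.

```shell
# Map a short key name to the XKB option value from the table above.
# compose_option and set_compose_key are hypothetical helper names.
compose_option() {
    case "$1" in
        rwin|lwin|rctrl|lctrl|ralt) printf 'compose:%s\n' "$1" ;;
        *) echo "unknown compose key: $1" >&2; return 1 ;;
    esac
}

# Write the value into the keyboard-layout channel. The -n and -t options
# create the property as a string if it does not exist yet.
set_compose_key() {
    value=$(compose_option "$1") || return 1
    xfconf-query -c keyboard-layout -p /Default/XkbOptions/Compose \
        -n -t string -s "$value"
}
```

For example, set_compose_key rwin selects the right Windows key, matching the setting shown above.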

Gentoo 4.14.x generic kernel configuration

While the Gentoo Handbook contains almost every step needed to make a working Gentoo installation, the kernel configuration step can be quite confusing for a new user. Enabling proper hardware support and turning on all the useful features can be daunting if you're not a developer or simply haven't encountered the kernel configuration before.

A good way around it is to use a generic kernel. genkernel provides a way to build a default kernel but I often find its default configuration to be either out-of-date or missing some important bit.

So, if you want to get started on Gentoo quickly you might as well use my kernel configuration which is based on the Fedora kernel and as such follows an everything-and-the-kitchen-sink approach. It supports practically every bit of hardware out there, will work on desktop PCs, laptops and servers, and includes important security features such as KPTI. The downside is that it's very large and will take a long time to compile.

To use it install the latest stable sys-kernel/gentoo-sources package (4.14.x), copy the configuration file under /usr/src/linux/ and rename it to .config then proceed to build the kernel as usual.

Gentoo 4.14.x kernel configuration file