Monitoring ECC memory on Linux with rasdaemon

If you have a workstation built around an AMD Ryzen/Threadripper or Intel Xeon processor, chances are you're using ECC memory. ECC memory is a worthwhile investment in the reliability of your machine and, if properly monitored, will allow you to spot memory problems before they become catastrophic.

On recent Linux kernels the rasdaemon tools can be used to monitor ECC memory and report both correctable and uncorrectable memory errors. As we'll see, with a little bit of tweaking it's also possible to know exactly which DIMM is experiencing the errors.

Installing rasdaemon

First of all you'll need to install rasdaemon. It's packaged for most Linux distributions:

  • Debian/Ubuntu

    # apt-get install rasdaemon
  • Fedora

    # dnf install rasdaemon
  • Gentoo

    The package is currently marked as unstable so you'll need to unmask it first:

    # echo "app-admin/rasdaemon ~amd64" >> /etc/portage/package.accept_keywords

    Then I recommend enabling sqlite support. This makes rasdaemon record events to disk and is particularly useful for machines that get rebooted often:

    # echo "app-admin/rasdaemon sqlite" >> /etc/portage/package.use

    Finally install rasdaemon itself:

    # emerge rasdaemon

Configuring rasdaemon

Next we'll set up rasdaemon to launch at startup and to record events to an on-disk sqlite database.

  • Debian/Ubuntu/Fedora and other systemd-based distros

    Note that on Fedora rasdaemon will not work if Secure Boot is enabled because of kernel lockdown. You will have to either disable kernel lockdown or Secure Boot if you want to use rasdaemon.

    # systemctl enable rasdaemon
    # systemctl start rasdaemon
  • Gentoo with OpenRC

    Add the following line to /etc/conf.d/rasdaemon:

    RASDAEMON_ARGS=--record

    Add rasdaemon to the default run-level and start it:

    # rc-config add rasdaemon default
    # /etc/init.d/rasdaemon start
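
If you're not sure whether kernel lockdown is what's getting in the way (see the Fedora note above), you can check the current mode. This is only a minimal sketch: it assumes your kernel exposes the lockdown state at /sys/kernel/security/lockdown as a single line with the active mode in square brackets, and lockdown_mode is a hypothetical helper name, not something rasdaemon ships.

```shell
# Print the active kernel lockdown mode (the word in square brackets).
# The file path is a parameter so the function can be tried on a sample
# file; the default location is an assumption about your kernel.
lockdown_mode() {
    file="${1:-/sys/kernel/security/lockdown}"
    # Expected content is one line, e.g.: none [integrity] confidentiality
    sed -n 's/.*\[\([a-z]*\)].*/\1/p' "$file"
}
```

Calling lockdown_mode without arguments as root should print none, integrity or confidentiality. Anything other than none means lockdown is active.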

Configuring DIMM labels

At this point rasdaemon should already be running on your system. You can now use the ras-mc-ctl tool to query the errors that have been detected. From now on I will use data from my machine to give an example of the output.

# ras-mc-ctl --error-count
Label                 CE      UE
mc#0csrow#2channel#0  0       0
mc#0csrow#2channel#1  0       0
mc#0csrow#3channel#1  0       0
mc#0csrow#3channel#0  0       0

The CE column shows the number of corrected errors for a given DIMM, and the UE column shows the number of uncorrectable errors that were detected. The label on the left shows the EDAC path under /sys/devices/system/edac/mc/ of every DIMM.

This is not very readable, because the kernel has no idea of the physical layout of your motherboard so it will print the EDAC paths instead of the names of the DIMM slots. We can confirm that the labels are missing with this command:

# ras-mc-ctl --print-labels
ras-mc-ctl: Error: No dimm labels for ASUSTeK COMPUTER INC. model PRIME B450-PLUS

To identify which DIMM slots correspond to which EDAC path you will have to reboot your system with only one DIMM inserted, write down the name of the slot you inserted it in and then print out the paths with ras-mc-ctl --error-count. Repeat this for each slot.

In my case this was the mapping:

mc#0csrow#0channel#0  DIMM_A1
mc#0csrow#0channel#1  DIMM_A2
mc#0csrow#1channel#1  DIMM_A2
mc#0csrow#1channel#0  DIMM_A1
mc#0csrow#2channel#0  DIMM_B1
mc#0csrow#2channel#1  DIMM_B2
mc#0csrow#3channel#1  DIMM_B2
mc#0csrow#3channel#0  DIMM_B1

Note that there's more than one path per DIMM label; that's fine.

With this data at hand, create a text file under /etc/ras/dimm_labels.d/. You will need to fill it with the mapping data in the following format:

Vendor: <motherboard vendor name>
Model: <motherboard model name>
  <label>: <mc>.<row>.<channel>

You can obtain the motherboard vendor and model name with the following command:

# ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: ASUSTeK COMPUTER INC. model PRIME B450-PLUS

The label lines take a string (the name of the physical DIMM slot), then the numbers in the EDAC path corresponding to the physical slot. You can put more than one label entry per line by separating them with a semicolon. If a given label is associated with more than one EDAC path you can add the separate <mc>.<row>.<channel> sequences by separating them with a comma.

In my case the resulting file (/etc/ras/dimm_labels.d/asus) looks like this:

Vendor: ASUSTeK COMPUTER INC.
Model: PRIME B450-PLUS
  DIMM_A1:  0.0.0, 0.1.0;    DIMM_A2:   0.0.1, 0.1.1;
  DIMM_B1:  0.2.0, 0.3.0;    DIMM_B2:   0.2.1, 0.3.1;

You can find another example of this with configuration entries for a bunch of other motherboards in the edac-utils repo.
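
If you'd rather not write the label lines by hand, the mapping can be converted mechanically. The following is only a sketch: paths_to_labels is a hypothetical helper, and it assumes the input has one EDAC path followed by one slot name per line, exactly like the mapping listed earlier. It prints one label entry per line instead of two, and you still have to add the Vendor and Model lines yourself.

```shell
# Convert "mc#<mc>csrow#<row>channel#<channel>  <label>" lines read from
# stdin into "  <label>: <mc>.<row>.<channel>, ...;" lines on stdout.
paths_to_labels() {
    awk '
    NF >= 2 {
        # Keep only the three numbers from the EDAC path.
        path = $1
        gsub(/[^0-9]+/, " ", path)
        split(path, num, " ")
        loc = num[1] "." num[2] "." num[3]
        label = $2
        if (label in locs) {
            locs[label] = locs[label] ", " loc
        } else {
            order[++count] = label      # remember first-seen order
            locs[label] = loc
        }
    }
    END {
        for (i = 1; i <= count; i++)
            printf "  %s: %s;\n", order[i], locs[order[i]]
    }'
}
```

For example, if the mapping is saved in a file called mapping.txt (a made-up name), paths_to_labels < mapping.txt prints the label lines ready to be pasted below the Model line.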

Once the file is ready it's time to load the labels in the kernel with the following command:

# ras-mc-ctl --register-labels

Printing out labels and error counts will now use the physical DIMM slot names, which is much nicer if you need to figure out which of your DIMMs is faulty and needs to be replaced:

# ras-mc-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS
                                    DIMM_A1              0:0:0 missing
                                    DIMM_A2              0:0:1 missing
                                    DIMM_A1              0:1:0 missing
                                    DIMM_A2              0:1:1 missing
mc0 csrow 2 channel 0               DIMM_B1              DIMM_B1
mc0 csrow 2 channel 1               DIMM_B2              DIMM_B2
mc0 csrow 3 channel 0               DIMM_B1              DIMM_B1
mc0 csrow 3 channel 1               DIMM_B2              DIMM_B2

# ras-mc-ctl --error-count
Label         CE      UE
DIMM_B2       0       0
DIMM_B1       0       0
DIMM_B1       0       0
DIMM_B2       0       0
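
Since the whole point is to notice errors early, it can be handy to filter this output down to the rows that actually have errors, for example from a cron job. Here's a minimal sketch, assuming the output keeps the three-column layout shown above (a header line, then a label followed by the CE and UE counts). report_errors is a hypothetical helper name.

```shell
# Read `ras-mc-ctl --error-count` output on stdin and print every row whose
# CE or UE counter is non-zero. Returns 1 if any such row was found, 0
# otherwise, so it can drive a notification.
report_errors() {
    awk '
    NR == 1 { next }                      # skip the "Label CE UE" header
    NF >= 3 && ($(NF-1) + 0 > 0 || $NF + 0 > 0) {
        # Everything before the last two columns is the label.
        label = $1
        for (i = 2; i <= NF - 2; i++) label = label " " $i
        printf "%s: CE=%s UE=%s\n", label, $(NF-1), $NF
        found = 1
    }
    END { exit (found ? 1 : 0) }'
}
```

You would run it as ras-mc-ctl --error-count | report_errors. With the all-zero output above it prints nothing and returns 0.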

To persist the DIMM names across reboots, enable the ras-mc-ctl service at startup:

  • Debian/Ubuntu/Fedora and other systemd-based distros

    # systemctl enable ras-mc-ctl
    # systemctl start ras-mc-ctl
  • Gentoo with OpenRC

    # rc-config add ras-mc-ctl default
    # /etc/init.d/ras-mc-ctl start

You're done! After rebooting your system rasdaemon will be continually running and recording errors. You can use ras-mc-ctl to print out a summary of all the errors that have been seen and recorded. Since the counts are stored on disk they will be persisted across reboots. Here's some example output from my machine:

# ras-mc-ctl --summary
Memory controller events summary:
  Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5

PCIe AER events summary:
  1 Uncorrected (Non-Fatal) errors: BIT21

No Extlog errors.

No devlink errors.
Disk errors summary:
  0:0 has 6646 errors
No MCE errors.
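
If you only care about the memory controller lines of the summary, they can be pulled out with a bit of sed. This is a sketch under the assumption that every memory line follows the shape of the one in my output above (error type, then the label in single quotes, then location and count). summary_mc_errors is a hypothetical helper name.

```shell
# Read `ras-mc-ctl --summary` output on stdin and print one line per
# memory controller entry in the form: <label> <error type> <count>
summary_mc_errors() {
    sed -n "s/^[[:space:]]*\([A-Za-z-]*\) on DIMM Label(s): '\([^']*\)' location: [^ ]* errors: \([0-9]*\).*/\2 \1 \3/p"
}
```

Run as ras-mc-ctl --summary | summary_mc_errors, this turns the memory line from my example into DIMM_B1 Corrected 5 and ignores the PCIe, disk and MCE sections.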

Troubleshooting

  • ras-mc-ctl --status prints out ras-mc-ctl: drivers are not loaded

    For rasdaemon to work the EDAC kernel drivers for your particular machine need to be loaded. They are usually loaded automatically at boot. You can check which ones are loaded with this command:

    # lsmod | grep edac
    amd64_edac_mod         32768  0
    edac_mce_amd           28672  1 amd64_edac_mod

    If the EDAC drivers haven't been loaded automatically, either your kernel doesn't provide one for your machine or you need to load it manually. Check the EDAC kernel documentation for more details.
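
Another quick way to check is to look at the EDAC directory in sysfs mentioned earlier. Here's a minimal sketch that counts the memory controllers registered there. It assumes each controller appears as a directory named mc0, mc1 and so on under /sys/devices/system/edac/mc, and edac_mc_count is a hypothetical helper name.

```shell
# Count the memory controllers registered under an EDAC sysfs directory.
# The directory is a parameter so the function can be pointed at a test
# tree; it defaults to the usual sysfs location.
edac_mc_count() {
    dir="${1:-/sys/devices/system/edac/mc}"
    count=0
    for mc in "$dir"/mc[0-9]*; do
        # An unmatched glob stays literal, so check it really is a directory.
        [ -d "$mc" ] && count=$((count + 1))
    done
    echo "$count"
}
```

If edac_mc_count prints 0, no EDAC driver has registered a memory controller, which matches the "drivers are not loaded" message.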