Monitoring ECC memory on Linux with rasdaemon

Gabriele Svelto

2020-02-13 23:31

If you have a workstation built around an AMD Ryzen/Threadripper or Intel Xeon processor, chances are you're using ECC memory. ECC memory is a worthy investment to improve the reliability of your machine and, if properly monitored, will allow you to spot memory problems before they become catastrophic.

On recent Linux kernels the rasdaemon tools can be used to monitor ECC memory and report both correctable and uncorrectable memory errors. As we'll see, with a little bit of tweaking it's also possible to know exactly which DIMM is experiencing the errors.

Installing rasdaemon

First of all you'll need to install rasdaemon; it's packaged for most Linux distributions:

Debian/Ubuntu
```
# apt-get install rasdaemon
```
Fedora
```
# dnf install rasdaemon
```
openSUSE
```
# zypper install rasdaemon
```
Gentoo

The package is currently marked as unstable so you'll need to unmask it first:
```
# echo "app-admin/rasdaemon ~amd64" >> /etc/portage/package.accept_keywords
```
Then I recommend enabling sqlite support; this makes rasdaemon record events to disk and is particularly useful for machines that get rebooted often:
```
# echo "app-admin/rasdaemon sqlite" >> /etc/portage/package.use
```
Finally install rasdaemon itself:
```
# emerge rasdaemon
```

Configuring rasdaemon

Then we'll set up rasdaemon to launch at startup and to record events to an on-disk sqlite database.

Note that when booting with Secure Boot enabled, using the kernel lockdown facility in confidentiality mode will prevent rasdaemon from running. To use rasdaemon you'll have to use a different lockdown mode, disable lockdown entirely or disable Secure Boot. You'll find more information in the Troubleshooting section.

Debian/Ubuntu/Fedora/openSUSE and other systemd-based distros
```
# systemctl enable rasdaemon
# systemctl start rasdaemon
```
Gentoo with OpenRC

Add the following line to /etc/conf.d/rasdaemon:
```
RASDAEMON_ARGS=--record
```
Add rasdaemon to the default runlevel and start it:
```
# rc-config add rasdaemon default
# rc-config start rasdaemon
```

Configuring DIMM labels

At this point rasdaemon should already be running on your system. You can now use the ras-mc-ctl tool to query the errors that have been detected. From now on I will use data from my machine to give an example of the output.

# ras-mc-ctl --error-count
Label                 CE      UE
mc#0csrow#2channel#0  0   0
mc#0csrow#2channel#1  0   0
mc#0csrow#3channel#1  0   0
mc#0csrow#3channel#0  0   0

The CE column shows the number of corrected errors for a given DIMM; the UE column shows the number of uncorrectable errors that were detected. The label on the left shows the EDAC path under /sys/devices/system/edac/mc/ of every DIMM.

This is not very readable. Since the kernel has no idea of the physical layout of your motherboard it will print the EDAC paths instead of the names of the DIMM slots. We can confirm that the labels are missing with this command:

# ras-mc-ctl --print-labels
ras-mc-ctl: Error: No dimm labels for ASUSTeK COMPUTER INC. model PRIME B450-PLUS

To identify which DIMM slot corresponds to which EDAC path you will have to reboot your system with only one DIMM inserted, write down the name of the slot you inserted it in and then print out the paths with ras-mc-ctl --error-count. Repeat this for every slot.

In my case this was the mapping:

mc#0csrow#0channel#0  DIMM_A1
mc#0csrow#0channel#1  DIMM_A2
mc#0csrow#1channel#1  DIMM_A2
mc#0csrow#1channel#0  DIMM_A1
mc#0csrow#2channel#0  DIMM_B1
mc#0csrow#2channel#1  DIMM_B2
mc#0csrow#3channel#1  DIMM_B2
mc#0csrow#3channel#0  DIMM_B1

Note that there's more than one path per DIMM label; that's fine.

With this data at hand create a text file under /etc/ras/dimm_labels.d/. You will need to fill it up with the mapping data in the following format:

Vendor: <motherboard vendor name>
Model: <motherboard model name>
  <label>: <mc>.<row>.<channel>

You can obtain the motherboard vendor and model name with the following command:

# ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: ASUSTeK COMPUTER INC. model PRIME B450-PLUS

The label lines take a string (the name of the physical DIMM slot), then the numbers in the EDAC path corresponding to the physical slot. You can put more than one label entry per line by separating them with a semicolon. If a given label is associated with more than one EDAC path you can add the separate <mc>.<row>.<channel> sequences by separating them with a comma.

In my case the resulting file (/etc/ras/dimm_labels.d/asus) looks like this:

Vendor: ASUSTeK COMPUTER INC.
Model: PRIME B450-PLUS
  DIMM_A1:  0.0.0, 0.1.0;    DIMM_A2:   0.0.1, 0.1.1;
  DIMM_B1:  0.2.0, 0.3.0;    DIMM_B2:   0.2.1, 0.3.1;

You can find another example of this, with configuration entries for a bunch of other motherboards, in the edac-utils repo.
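
If you'd rather not write the label lines by hand, they can be generated from the path-to-slot mapping collected earlier. The following is only a sketch: mapping_to_labels is a name made up for this example, it assumes slot names without spaces, and it emits one entry per line, which the format description above suggests should be accepted.

```shell
#!/bin/sh
# Turn "<EDAC path> <slot name>" lines into label-file entries, grouping
# all the paths that belong to the same slot into one entry.
# mapping_to_labels is a hypothetical helper, not part of rasdaemon.
mapping_to_labels() {
    awk '{
        # "mc#0csrow#2channel#1" -> "0.2.1"
        triple = $1
        sub(/^mc#/, "", triple)
        sub(/csrow#/, ".", triple)
        sub(/channel#/, ".", triple)
        if (!($2 in paths)) { order[++n] = $2; paths[$2] = triple }
        else                { paths[$2] = paths[$2] ", " triple }
    }
    END {
        # Keep the slots in the order in which they first appeared.
        for (i = 1; i <= n; i++)
            printf "  %s: %s;\n", order[i], paths[order[i]]
    }'
}

# Sample input taken from the mapping shown earlier.
mapping_to_labels <<'EOF'
mc#0csrow#0channel#0  DIMM_A1
mc#0csrow#0channel#1  DIMM_A2
mc#0csrow#1channel#1  DIMM_A2
mc#0csrow#1channel#0  DIMM_A1
EOF
```

The Vendor and Model lines still have to be added at the top of the file by hand.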

Once the file is ready it's time to load the labels in the kernel with the following command:

# ras-mc-ctl --register-labels

Printing out labels and error counts will now use the physical DIMM slot names. This is much better if you need to figure out which of your DIMMs is faulty and needs to be replaced:

# ras-mc-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS
                                    DIMM_A1              0:0:0 missing
                                    DIMM_A2              0:0:1 missing
                                    DIMM_A1              0:1:0 missing
                                    DIMM_A2              0:1:1 missing
mc0 csrow 2 channel 0               DIMM_B1              DIMM_B1
mc0 csrow 2 channel 1               DIMM_B2              DIMM_B2
mc0 csrow 3 channel 0               DIMM_B1              DIMM_B1
mc0 csrow 3 channel 1               DIMM_B2              DIMM_B2

# ras-mc-ctl --error-count
Label   CE      UE
DIMM_B2 0       0
DIMM_B1 0       0
DIMM_B1 0       0
DIMM_B2 0       0
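
With readable labels in place, the error counts lend themselves to a simple periodic check. The sketch below filters output in the layout shown above and prints only the DIMMs with a non-zero count. report_failing_dimms is a name made up for this example, and it assumes that labels contain no spaces and that CE and UE are the last two columns.

```shell
#!/bin/sh
# Print only the DIMMs whose corrected (CE) or uncorrectable (UE) count
# is non-zero. Expects `ras-mc-ctl --error-count` style output on stdin:
# a header line followed by "<label> <CE> <UE>" rows.
# report_failing_dimms is a hypothetical helper, not part of rasdaemon.
report_failing_dimms() {
    awk 'NR > 1 && NF >= 3 && ($(NF-1) > 0 || $NF > 0) {
        printf "%s CE=%s UE=%s\n", $1, $(NF-1), $NF
    }'
}

# Made-up sample input in the same layout; on a real system you would
# pipe in the output of `ras-mc-ctl --error-count` instead.
report_failing_dimms <<'EOF'
Label   CE      UE
DIMM_B2 0       0
DIMM_B1 5       0
EOF
```

Run from a cron job or a systemd timer, a check like this could send a notification whenever it prints anything.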

To persist the DIMM names across reboots, enable the ras-mc-ctl service at startup:

Debian/Ubuntu/Fedora and other systemd-based distros

# systemctl enable ras-mc-ctl
# systemctl start ras-mc-ctl

Gentoo with OpenRC

# rc-config add ras-mc-ctl default
# rc-config start ras-mc-ctl

You're done! After rebooting your system rasdaemon will be continually running and recording errors. You can use ras-mc-ctl to print out a summary of all the errors that have been seen and recorded. Since the counts are stored on disk they will be persisted across reboots. Here's some example output from my machine:

# ras-mc-ctl --summary
Memory controller events summary:
  Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5

PCIe AER events summary:
  1 Uncorrected (Non-Fatal) errors: BIT21

No Extlog errors.

No devlink errors.
Disk errors summary:
  0:0 has 6646 errors
No MCE errors.

Troubleshooting

ras-mc-ctl --status prints out ras-mc-ctl: drivers are not loaded

For rasdaemon to work the EDAC kernel drivers for your particular machine need to be loaded. They are usually loaded automatically at boot. You can check out which ones are loaded with this command:
```
# lsmod | grep edac
amd64_edac_mod         32768  0
edac_mce_amd           28672  1 amd64_edac_mod
```
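A working driver can also be confirmed from sysfs: every memory controller it registers shows up as an mcN directory under /sys/devices/system/edac/mc/. The sketch below counts them. count_edac_controllers is a name made up for this example, and its optional directory argument is there only to make the function easy to test.

```shell
#!/bin/sh
# Count the memory controllers registered under an EDAC sysfs directory.
# count_edac_controllers is a hypothetical helper, not part of rasdaemon.
count_edac_controllers() {
    dir=${1:-/sys/devices/system/edac/mc}
    count=0
    for entry in "$dir"/mc[0-9]*; do
        # When nothing matches, the pattern is left as-is, so make sure
        # the entry really is a directory before counting it.
        if [ -d "$entry" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

if [ "$(count_edac_controllers)" -eq 0 ]; then
    echo "no EDAC memory controller registered; check the loaded drivers"
fi
```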
If the EDAC drivers haven't been loaded automatically, either your kernel doesn't provide one for your machine or you need to load it manually. Check the EDAC kernel documentation for more details.

rasdaemon fails to start, complaining it can't access the debugfs filesystem

You're likely using the kernel lockdown module in confidentiality mode. When Secure Boot is enabled this will prevent rasdaemon from reading the files it needs to gather its statistics. rasdaemon can work with kernel lockdown when using the integrity mode. To switch to integrity mode add the lockdown=integrity option to the Linux kernel command line in your boot loader.
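
To see which mode is currently active you can read /sys/kernel/security/lockdown, where the kernel lists the available modes and marks the active one with square brackets. The following sketch extracts it. active_lockdown_mode is a name made up for this example, and its optional file argument is there only to make the function easy to test.

```shell
#!/bin/sh
# Print the active kernel lockdown mode, i.e. the bracketed entry in a
# line such as "none [integrity] confidentiality".
# active_lockdown_mode is a hypothetical helper, not part of rasdaemon.
active_lockdown_mode() {
    file=${1:-/sys/kernel/security/lockdown}
    # The file is missing on kernels built without lockdown support.
    [ -r "$file" ] || return 1
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$file"
}

mode=$(active_lockdown_mode) || mode=unknown
echo "kernel lockdown mode: $mode"
```

If this reports confidentiality, the change described next applies to your system.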

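When using GRUB this can usually be achieved by editing /etc/default/grub and changing the GRUB_CMDLINE_LINUX_DEFAULT variable to include the option, then regenerating the GRUB configuration (for instance with update-grub on Debian/Ubuntu) and rebooting. The variable should end up looking something like this: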
```
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash lockdown=integrity"
```
Keep in mind that integrity mode is less strict than confidentiality mode, as it permits userspace applications to access a fair amount of information that lives in the kernel. This might not be suitable for some deployments, such as those that must run untrusted userspace code.