Monitoring ECC memory on Linux with rasdaemon
If you have a workstation built around an AMD Ryzen/Threadripper or Intel Xeon processor, chances are you're using ECC memory. ECC memory is a worthwhile investment in the reliability of your machine and, if properly monitored, lets you spot memory problems before they become catastrophic.
On recent Linux kernels, the rasdaemon tools can be used to monitor ECC memory and report both correctable and uncorrectable memory errors. As we'll see, with a little bit of tweaking it's also possible to know exactly which DIMM is experiencing the errors.
Installing rasdaemon
First of all you'll need to install rasdaemon. It's packaged for most Linux distributions:
-
Debian/Ubuntu
# apt-get install rasdaemon
-
Fedora
# dnf install rasdaemon
-
openSUSE
# zypper install rasdaemon
-
Gentoo
The package is currently marked as unstable so you'll need to unmask it first:
# echo "app-admin/rasdaemon ~amd64" >> /etc/portage/package.keywords
Then I recommend enabling sqlite support. This makes rasdaemon record events to disk and is particularly useful for machines that get rebooted often:
# echo "app-admin/rasdaemon sqlite" >> /etc/portage/package.use
Finally install rasdaemon itself:
# emerge rasdaemon
Configuring rasdaemon
Next we'll set up rasdaemon to launch at startup and to record events to an on-disk sqlite database.
Note that when booting with Secure Boot enabled, using the kernel lockdown facility in confidentiality mode will prevent rasdaemon from running. To use rasdaemon you'll have to use a different lockdown mode, disable lockdown entirely or disable Secure Boot. You'll find more information in the Troubleshooting section.
-
Debian/Ubuntu/Fedora/openSUSE and other systemd-based distros
# systemctl enable rasdaemon
# systemctl start rasdaemon
-
Gentoo with OpenRC
Add the following line to /etc/conf.d/rasdaemon:
RASDAEMON_ARGS=--record
Add rasdaemon to the default run-level and start it:
# rc-config add rasdaemon default
# rc-config start rasdaemon
Configuring DIMM labels
At this point rasdaemon should already be running on your system. You can now use the ras-mc-ctl tool to query the errors that have been detected. From now on I will use data from my machine to give an example of the output.
# ras-mc-ctl --error-count
Label                 CE  UE
mc#0csrow#2channel#0  0   0
mc#0csrow#2channel#1  0   0
mc#0csrow#3channel#1  0   0
mc#0csrow#3channel#0  0   0
The CE column represents the number of corrected errors for a given DIMM; UE represents uncorrectable errors that were detected. The label on the left shows the EDAC path of every DIMM under /sys/devices/system/edac/mc/.
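If you want to cross-check what ras-mc-ctl reports, the corrected-error counters can also be read straight from sysfs. The following is only a minimal sketch: it assumes your kernel exposes the legacy csrow files (csrowN/chM_ce_count) that match the paths shown above, and print_edac_ce_counts is a hypothetical helper name of my choosing. It takes the EDAC root as an optional argument so it can be pointed at a test directory.

```shell
#!/bin/sh
# Hypothetical helper: print per-channel corrected-error (CE) counts from EDAC sysfs.
# Assumption: the legacy csrow layout (mcN/csrowM/chK_ce_count) is available.
print_edac_ce_counts() {
    root="${1:-/sys/devices/system/edac/mc}"
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        # Skip the unexpanded glob pattern when nothing matches.
        [ -e "$f" ] || continue
        csrow_dir=$(dirname "$f")
        mc=$(basename "$(dirname "$csrow_dir")")
        csrow=$(basename "$csrow_dir")
        chan=$(basename "$f" _ce_count)
        printf '%s %s %s CE=%s\n' "$mc" "$csrow" "$chan" "$(cat "$f")"
    done
}
```

Calling print_edac_ce_counts with no argument reads the live counters; on a machine without loaded EDAC drivers it simply prints nothing.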
This is not very readable. Since the kernel has no idea of the physical layout of your motherboard it will print the EDAC paths instead of the names of the DIMM slots. We can confirm that the labels are missing with this command:
# ras-mc-ctl --print-labels
ras-mc-ctl: Error: No dimm labels for ASUSTeK COMPUTER INC. model PRIME B450-PLUS
To identify which DIMM slot corresponds to which EDAC path, you will have to reboot your system with only one DIMM inserted, write down the name of the slot you inserted it in, and then print out the paths with ras-mc-ctl --error-count. Repeat this for each slot.
In my case this was the mapping:
mc#0csrow#0channel#0  DIMM_A1
mc#0csrow#0channel#1  DIMM_A2
mc#0csrow#1channel#1  DIMM_A2
mc#0csrow#1channel#0  DIMM_A1
mc#0csrow#2channel#0  DIMM_B1
mc#0csrow#2channel#1  DIMM_B2
mc#0csrow#3channel#1  DIMM_B2
mc#0csrow#3channel#0  DIMM_B1
Note that there's more than one path per DIMM label; that's fine.
With this data at hand, create a text file under /etc/ras/dimm_labels.d/. You will need to fill it with the mapping data in the following format:
Vendor: <motherboard vendor name>
  Model: <motherboard model name>
    <label>: <mc>.<row>.<channel>
You can obtain the motherboard vendor and model name with the following command:
# ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: ASUSTeK COMPUTER INC. model PRIME B450-PLUS
The label lines take a string (the name of the physical DIMM slot), then the numbers in the EDAC path corresponding to the physical slot. You can put more than one label entry per line by separating them with a semicolon. If a given label is associated with more than one EDAC path, you can add the separate <mc>.<row>.<channel> sequences by separating them with a comma.
In my case the resulting file (/etc/ras/dimm_labels.d/asus) looks like this:
Vendor: ASUSTeK COMPUTER INC.
  Model: PRIME B450-PLUS
    DIMM_A1: 0.0.0, 0.1.0;
    DIMM_A2: 0.0.1, 0.1.1;
    DIMM_B1: 0.2.0, 0.3.0;
    DIMM_B2: 0.2.1, 0.3.1;
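Writing the label lines by hand is error-prone when there are many paths, so here is a rough sketch of how the conversion could be scripted. It is not part of rasdaemon: edac_path_to_location and build_label_lines are hypothetical helpers, and the input format ("<EDAC path> <slot name>" per line, as in my mapping above) is an assumption. You would still write the Vendor and Model lines yourself.

```shell
#!/bin/sh
# Hypothetical helper: turn "mc#0csrow#2channel#1" into "0.2.1".
# Prints nothing if the argument does not look like an EDAC path.
edac_path_to_location() {
    printf '%s\n' "$1" |
        sed -n -E 's/^mc#([0-9]+)csrow#([0-9]+)channel#([0-9]+)$/\1.\2.\3/p'
}

# Hypothetical helper: read "<EDAC path> <slot name>" pairs on stdin and print
# one "<label>: <mc>.<row>.<channel>, ...;" line per slot, in first-seen order.
build_label_lines() {
    while read -r path label; do
        loc=$(edac_path_to_location "$path")
        # Ignore lines that do not start with a valid EDAC path.
        [ -n "$loc" ] || continue
        printf '%s %s\n' "$label" "$loc"
    done | awk '
        !($1 in locs) { order[++n] = $1; locs[$1] = $2; next }
        { locs[$1] = locs[$1] ", " $2 }
        END { for (i = 1; i <= n; i++) printf "%s: %s;\n", order[i], locs[order[i]] }
    '
}
```

Feeding it the mapping, for example with build_label_lines < mapping.txt (mapping.txt being a file you create yourself), produces label lines ready to paste under the Model line.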
You can find another example of this, with configuration entries for a bunch of other motherboards, in the edac-utils repo.
Once the file is ready it's time to load the labels in the kernel with the following command:
# ras-mc-ctl --register-labels
Printing out labels and error counts will now use the physical DIMM slot names. This is much better if you need to figure out which of your DIMMs is faulty and needs to be replaced:
# ras-mc-ctl --print-labels
LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS
                                    DIMM_A1              0:0:0 missing
                                    DIMM_A2              0:0:1 missing
                                    DIMM_A1              0:1:0 missing
                                    DIMM_A2              0:1:1 missing
mc0 csrow 2 channel 0               DIMM_B1              DIMM_B1
mc0 csrow 2 channel 1               DIMM_B2              DIMM_B2
mc0 csrow 3 channel 0               DIMM_B1              DIMM_B1
mc0 csrow 3 channel 1               DIMM_B2              DIMM_B2
# ras-mc-ctl --error-count
Label    CE  UE
DIMM_B2  0   0
DIMM_B1  0   0
DIMM_B1  0   0
DIMM_B2  0   0
To persist the DIMM names across reboots, load the ras-mc-ctl service at startup:
-
Debian/Ubuntu/Fedora and other systemd-based distros
# systemctl enable ras-mc-ctl
# systemctl start ras-mc-ctl
-
Gentoo with OpenRC
# rc-config add ras-mc-ctl default
# rc-config start ras-mc-ctl
You're done! After rebooting your system, rasdaemon will be continually running and recording errors. You can use ras-mc-ctl to print out a summary of all the errors that have been seen and recorded. Since the counts are stored on disk, they persist across reboots. Here's some example output from my machine:
# ras-mc-ctl --summary
Memory controller events summary:
        Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5
PCIe AER events summary:
        1 Uncorrected (Non-Fatal) errors: BIT21
No Extlog errors.
No devlink errors.
Disk errors summary:
        0:0 has 6646 errors
No MCE errors.
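If you'd rather be told about errors than remember to check for them, the error-count table is easy to filter from a script. This is just a sketch under a few assumptions: report_nonzero_errors is a hypothetical helper, it expects the three-column "Label CE UE" layout shown earlier, and it assumes labels contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: read `ras-mc-ctl --error-count` output on stdin and print
# every row whose CE or UE count is non-zero.
# Returns 1 if at least one such row was found, 0 otherwise.
report_nonzero_errors() {
    awk '
        NR == 1 { next }                      # skip the "Label CE UE" header
        NF == 3 && ($2 > 0 || $3 > 0) {
            printf "%s: CE=%s UE=%s\n", $1, $2, $3
            found = 1
        }
        END { exit found ? 1 : 0 }
    '
}
```

From a cron job you could then run something like ras-mc-ctl --error-count | report_nonzero_errors and act on a non-zero exit status, for instance by sending yourself a mail.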
Troubleshooting
-
ras-mc-ctl --status prints out "ras-mc-ctl: drivers are not loaded"
For rasdaemon to work the EDAC kernel drivers for your particular machine need to be loaded. They are usually loaded automatically at boot. You can check out which ones are loaded with this command:
# lsmod | grep edac
amd64_edac_mod         32768  0
edac_mce_amd           28672  1 amd64_edac_mod
If the EDAC drivers haven't been loaded automatically, either your kernel doesn't provide one for your machine or you need to manually load it. Check the EDAC kernel documentation for more details.
-
rasdaemon fails to start, complaining it can't access the debugfs filesystem
You're likely using the kernel lockdown module in confidentiality mode. When Secure Boot is enabled this will prevent rasdaemon from reading the files it needs to gather its statistics. rasdaemon can work with kernel lockdown when using the integrity mode. To switch to integrity mode, add the lockdown=integrity option to the Linux kernel command line in your boot loader.
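When using GRUB this can usually be achieved by editing /etc/default/grub and changing the GRUB_CMDLINE_LINUX_DEFAULT variable to include the option, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash lockdown=integrity"
Then regenerate the GRUB configuration (for example with update-grub on Debian/Ubuntu, or grub-mkconfig elsewhere) and reboot.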
Keep in mind that integrity mode is less strict than confidentiality mode, as it permits userspace applications to access a fair amount of information that lives in the kernel. This might not be suitable for some deployments - such as those that must run untrusted userspace code.
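After rebooting it's worth confirming which lockdown mode is actually active. The kernel lists the available modes in /sys/kernel/security/lockdown and marks the current one with square brackets. The snippet below is a small sketch that extracts it; current_lockdown_mode is a hypothetical helper name, and the optional file argument exists only so the function can be tried against a sample file.

```shell
#!/bin/sh
# Hypothetical helper: print the active kernel lockdown mode.
# The lockdown file lists all modes and brackets the active one,
# e.g. "none [integrity] confidentiality".
current_lockdown_mode() {
    file="${1:-/sys/kernel/security/lockdown}"
    # Fail if the file is missing or unreadable (e.g. lockdown not built in).
    [ -r "$file" ] || return 1
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$file"
}
```

If this prints confidentiality, rasdaemon will most likely still be unable to start; integrity or none should be fine.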