.. title: Monitoring ECC memory on Linux with rasdaemon
.. slug: monitoring-ecc-memory-on-linux-with-rasdaemon
.. date: 2020-02-13 23:31:44 UTC+01:00
.. tags: ecc, memory, linux, gentoo, debian, ubuntu, fedora, opensuse
.. category: 
.. link: 
.. description: Monitoring ECC memory errors on Linux with rasdaemon
.. type: text

If you have a workstation built around an AMD Ryzen/Threadripper or Intel Xeon
processor, chances are you're using `ECC memory`_. ECC memory is a worthy
investment to improve the reliability of your machine and, if properly
monitored, will allow you to spot memory problems before they become
catastrophic.

On recent Linux kernels the rasdaemon_ tools can be used to monitor ECC memory
and report both correctable and uncorrectable memory errors. As we'll see, with
a little bit of tweaking it's also possible to know exactly which DIMM is
experiencing the errors.

.. contents::

Installing rasdaemon
====================

First of all you'll need to install **rasdaemon**, which is packaged for most
Linux distributions:

* **Debian/Ubuntu**
  
  ::

    # apt-get install rasdaemon

* **Fedora**

  ::

    # dnf install rasdaemon

* **openSUSE**

  ::

    # zypper install rasdaemon

* **Gentoo**

  The package is currently marked as unstable so you'll need to unmask it first:

  ::

    # echo "app-admin/rasdaemon ~amd64" >> /etc/portage/package.accept_keywords

  Then I recommend enabling sqlite support. This makes rasdaemon record events
  to disk, which is particularly useful for machines that get rebooted often:

  ::

    # echo "app-admin/rasdaemon sqlite" >> /etc/portage/package.use

  Finally install rasdaemon itself:

  ::

    # emerge rasdaemon

Configuring rasdaemon
=====================

Then we'll set up **rasdaemon** to launch at startup and to record events to
an on-disk sqlite database.

Note that when booting with Secure Boot enabled, using the kernel lockdown
facility in **confidentiality** mode will prevent rasdaemon from running. To
use **rasdaemon** you'll have to use a different lockdown mode, disable
lockdown entirely or disable Secure Boot. You'll find more information in the
Troubleshooting_ section.

* **Debian/Ubuntu/Fedora/openSUSE and other systemd-based distros**

  ::

    # systemctl enable rasdaemon
    # systemctl start rasdaemon

* **Gentoo with OpenRC**

  Add the following line to ``/etc/conf.d/rasdaemon``:

  ::

     RASDAEMON_ARGS=--record

  Add ``rasdaemon`` to the **default** run-level and start it:

  ::

     # rc-config add rasdaemon default
     # rc-config start rasdaemon

Configuring DIMM labels
=======================

At this point **rasdaemon** should already be running on your system. You can
now use the **ras-mc-ctl** tool to query the errors that have been detected.
From now on I will use data from my machine to give an example of the output.

::

  # ras-mc-ctl --error-count
  Label               	CE	UE
  mc#0csrow#2channel#0	0   0
  mc#0csrow#2channel#1	0   0
  mc#0csrow#3channel#1	0   0
  mc#0csrow#3channel#0	0   0

The CE column shows the number of corrected errors for a given DIMM, while the
UE column shows the number of uncorrectable errors that were detected. The
label on the left shows the EDAC path under ``/sys/devices/system/edac/mc/`` of
every DIMM.

This is not very readable. Since the kernel has no idea of the physical layout
of your motherboard, it will print the EDAC paths instead of the names of the
DIMM slots. We can confirm that the labels are missing with this command:

::

  # ras-mc-ctl --print-labels
  ras-mc-ctl: Error: No dimm labels for ASUSTeK COMPUTER INC. model PRIME B450-PLUS

To identify which DIMM slot corresponds to which EDAC path you will have to
reboot your system with only one DIMM inserted, write down the name of the
slot you inserted it in, and then print out the paths with
``ras-mc-ctl --error-count``. Repeat this for every slot you want to map.

In my case this was the mapping:

::

  mc#0csrow#0channel#0	DIMM_A1
  mc#0csrow#0channel#1	DIMM_A2
  mc#0csrow#1channel#1	DIMM_A2
  mc#0csrow#1channel#0	DIMM_A1
  mc#0csrow#2channel#0	DIMM_B1
  mc#0csrow#2channel#1	DIMM_B2
  mc#0csrow#3channel#1	DIMM_B2
  mc#0csrow#3channel#0	DIMM_B1

Note that there's more than one path per DIMM label; that's fine.

With this data at hand, create a text file under ``/etc/ras/dimm_labels.d/``.
You will need to fill it with the mapping data in the following format:

::

  Vendor: <motherboard vendor name>
  Model: <motherboard model name>
    <label>: <mc>.<row>.<channel>

You can obtain the motherboard vendor and model name with the following
command:

::

  # ras-mc-ctl --mainboard
  ras-mc-ctl: mainboard: ASUSTeK COMPUTER INC. model PRIME B450-PLUS

The label lines take a string (the name of the physical DIMM slot), then the
numbers in the EDAC path corresponding to the physical slot. You can put
more than one label entry per line by separating them with a semicolon. If a
given label is associated with more than one EDAC path you can add the separate
``<mc>.<row>.<channel>`` sequences by separating them with a comma.
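
Converting the paths by hand gets tedious and error-prone when there are many
of them. As a rough sketch, the conversion could be scripted along these lines.
It assumes you saved your mapping as whitespace-separated ``<EDAC path>
<label>`` pairs, one per line, and that the labels contain no spaces; the
``mapping_to_labels`` name is made up for this example and is not part of
rasdaemon.

::

  # Turn "<EDAC path> <label>" pairs read from stdin into label-file lines.
  mapping_to_labels() {
      awk '
      {
          # "mc#0csrow#2channel#0" -> "0.2.0"
          path = $1
          sub(/^mc#/, "", path)
          sub(/csrow#/, ".", path)
          sub(/channel#/, ".", path)
          # Remember labels in the order they are first seen
          if (!($2 in paths)) order[++n] = $2
          # Join several paths for the same label with a comma
          paths[$2] = ($2 in paths) ? paths[$2] ", " path : path
      }
      END {
          for (i = 1; i <= n; i++)
              printf "  %s: %s;\n", order[i], paths[order[i]]
      }'
  }

Feeding it a mapping file (say ``mapping_to_labels < mapping.txt``, where
``mapping.txt`` is a placeholder name) prints one label entry per line, ready
to be pasted under the ``Vendor:`` and ``Model:`` lines.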

In my case the resulting file (``/etc/ras/dimm_labels.d/asus``) looks like this:

::

  Vendor: ASUSTeK COMPUTER INC.
  Model: PRIME B450-PLUS
    DIMM_A1:  0.0.0, 0.1.0;    DIMM_A2:   0.0.1, 0.1.1;
    DIMM_B1:  0.2.0, 0.3.0;    DIMM_B2:   0.2.1, 0.3.1;

You can find another example of this, with configuration entries for a bunch of
other motherboards, in the `edac-utils`_ repo.

Once the file is ready, it's time to load the labels into the kernel with the
following command:

::

  # ras-mc-ctl --register-labels

Printing out labels and error counts will now use the physical DIMM slot names.
This is much better if you need to figure out which of your DIMMs is faulty and
needs to be replaced:

::

  # ras-mc-ctl --print-labels
  LOCATION                            CONFIGURED LABEL     SYSFS CONTENTS      
                                      DIMM_A1              0:0:0 missing       
                                      DIMM_A2              0:0:1 missing       
                                      DIMM_A1              0:1:0 missing       
                                      DIMM_A2              0:1:1 missing       
  mc0 csrow 2 channel 0               DIMM_B1              DIMM_B1             
  mc0 csrow 2 channel 1               DIMM_B2              DIMM_B2             
  mc0 csrow 3 channel 0               DIMM_B1              DIMM_B1             
  mc0 csrow 3 channel 1               DIMM_B2              DIMM_B2             

  # ras-mc-ctl --error-count
  Label   CE      UE
  DIMM_B2 0       0
  DIMM_B1 0       0
  DIMM_B1 0       0
  DIMM_B2 0       0

To persist the DIMM names across reboots, enable the ``ras-mc-ctl`` service at
startup:

* **Debian/Ubuntu/Fedora and other systemd-based distros**

  ::

    # systemctl enable ras-mc-ctl
    # systemctl start ras-mc-ctl

* **Gentoo with OpenRC**

  ::

     # rc-config add ras-mc-ctl default
     # rc-config start ras-mc-ctl

You're done! After rebooting your system rasdaemon will be continually running
and recording errors. You can use ``ras-mc-ctl`` to print out a summary of all
the errors that have been seen and recorded. Since the counts are stored on
disk they will be persisted across reboots. Here's some example output from my
machine:

::

  # ras-mc-ctl --summary
  Memory controller events summary:
    Corrected on DIMM Label(s): 'DIMM_B1' location: 0:2:0:-1 errors: 5
  
  PCIe AER events summary:
    1 Uncorrected (Non-Fatal) errors: BIT21
  
  No Extlog errors.
  
  No devlink errors.
  Disk errors summary:
    0:0 has 6646 errors
  No MCE errors.
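
If you'd rather be told about errors than remember to look for them, a small
wrapper around ``ras-mc-ctl --error-count`` can do the checking. The following
is only a sketch: it assumes the three-column ``Label CE UE`` layout shown
earlier and labels without spaces, and ``check_error_counts`` is a name I made
up for this example.

::

  # Read `ras-mc-ctl --error-count` output on stdin and print every DIMM
  # with a non-zero count. Returns 1 if any errors were found, 0 otherwise.
  check_error_counts() {
      awk '
      NR == 1 { next }    # skip the "Label CE UE" header
      NF >= 3 && ($(NF-1) > 0 || $NF > 0) {
          printf "%s: %d corrected, %d uncorrected\n", $1, $(NF-1), $NF
          bad = 1
      }
      END { exit bad + 0 }'
  }

You could then run ``ras-mc-ctl --error-count | check_error_counts`` from a
cron job or a systemd timer and send yourself a notification whenever it
returns a non-zero status.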


Troubleshooting
===============

* ``ras-mc-ctl --status`` prints out ``ras-mc-ctl: drivers are not loaded``

  For **rasdaemon** to work the EDAC kernel drivers for your particular
  machine need to be loaded. They are usually loaded automatically at boot. You
  can check which ones are loaded with this command:

  ::

    # lsmod | grep edac
    amd64_edac_mod         32768  0
    edac_mce_amd           28672  1 amd64_edac_mod

  If the EDAC drivers haven't been loaded automatically, either your kernel
  doesn't provide one for your machine or you need to load it manually. Check
  the `EDAC kernel documentation`_ for more details.

* ``rasdaemon`` fails to start, complaining it can't access the debugfs
  filesystem

  You're likely using the kernel lockdown module in **confidentiality** mode.
  When Secure Boot is enabled this will prevent **rasdaemon** from reading the
  files it needs to gather its statistics. **rasdaemon** can work with kernel
  lockdown when using the **integrity** mode. To switch to **integrity** mode
  add the ``lockdown=integrity`` option to the Linux kernel command line in your
  boot loader.

  When using **GRUB** this can usually be achieved by editing
  ``/etc/default/grub`` and changing the ``GRUB_CMDLINE_LINUX_DEFAULT``
  variable to include the option, e.g.:

  ::

    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash lockdown=integrity"

  Keep in mind that **integrity** mode is less strict than **confidentiality**
  mode, as it permits userspace applications to access a fair amount of
  information that lives in the kernel. This might not be suitable for some
  deployments - such as those that must run untrusted userspace code.
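
  To check which mode is active, before or after the change, you can look at
  the lockdown file in securityfs, where the kernel lists the available modes
  and puts the active one in square brackets. Here's a minimal sketch, assuming
  securityfs is mounted at ``/sys/kernel/security`` and the file is readable
  (it may need root); the ``lockdown_mode`` helper name is made up for this
  example.

  ::

    # Print the active lockdown mode, i.e. the entry shown in [brackets].
    # The optional argument overrides the file to read (useful for testing).
    lockdown_mode() {
        sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "${1:-/sys/kernel/security/lockdown}"
    }

  If it prints ``confidentiality``, **rasdaemon** will not be able to start
  until you switch to another mode as described above.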

.. _`ECC memory`: https://en.wikipedia.org/wiki/ECC_memory
.. _rasdaemon: https://github.com/mchehab/rasdaemon
.. _`EDAC kernel documentation`: https://www.kernel.org/doc/html/latest/driver-api/edac.html
.. _`edac-utils`: https://github.com/grondo/edac-utils/blob/master/src/etc/labels.db
