Diagnose ECC issues every 314 seconds on Ryzen 5800X with Micron ECC SODIMM modules
This thread is a bit old, but after almost five months I finally remembered to follow up. Moving the two memory modules into slots A1 and B1 eliminated all MCEs over the last 24 hours. I believe this points to a compatibility problem. I plan to try 2999 next even if the MCEs come back, because they don't seem to affect my workload. Thanks for the feedback, ShrimpBrime and @Joe217!
Update after two years: I started this thread in 2021. Back then, the fix came from swapping the two memory modules. But in October, after I replaced the M.2 SSD, the same problem resurfaced. Almost nothing else had changed, and the 314-second interval was the same as before. It was quite strange. With this chassis layout, replacing the SSD doesn't require touching the memory area at all. ESD damage didn't seem to be responsible either, because the MCEs still occurred randomly across banks and addresses. I decided to ignore the matter since it didn't affect daily operation, and moved on to building another compact ITX system.

The new unit is even smaller and takes 12V input from an external connector. The enclosure consists mainly of six aluminum pieces and is very low, leaving little space above the CPU cooler. There is roughly a 3-inch gap for adding a PCIe card (which needs a riser) and a power supply. Since I already have a 12V bus in my rack, the power supply wasn't needed. I made a 12V CPU power cable with an XT60 connector so the connector could be brought out from inside the case. At first I left the case open until everything was working properly. No ECC errors appeared for days. Then I finished the setup and closed the chassis.

The next morning, the log showed an EDAC error. This time the machine was running an Intel CPU, and the error was clearly logged against Slot A1. It happened about 10 hours after the system booted. I initially dismissed it as a random glitch. Ten hours later, the same error appeared in Slot A2, then A1, then A1 again, and it kept going. Clearly this was not a one-off. But how could both memory modules throw CE errors at varying times while the overall interval stayed steady? I opened the case, reseated the memory, and closed it again. The problem still occurred. I opened it once more, reseated the memory, and closed it. The issue vanished. Odd, isn't it? Eventually I realized it wasn't about the memory at all.
Fortunately, I had installed a camera in the rack, so this time everything was captured. After reviewing the footage alongside the logs, I found the real cause: the 12V power cable was routed above both the A1 and A2 memory modules, and its insulation was lightly touching the PCB. Once the case was closed, another section of the insulation also came into contact with the chassis cover. During the second reseat, I had moved the power cable so it sat below the memory slot latches, which is why the errors stopped. That explained the 314-second issue too. I pulled out the old machine, saw that a cable was still touching the memory, rerouted the wiring, and after a month of monitoring no errors have appeared. This wasn't a memory fault; it was a "magic" problem. My conclusion: even with a well-separated PCB, insulated cables, and a non-conductive chassis, certain conditions can still cause glitches. This taught me the value of proper cable routing. At the very least, avoid running cables above your memory modules.
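For anyone who wants to check whether their own MCE/EDAC errors follow a fixed interval like the 314 seconds described above, here is a minimal Python sketch. It only assumes you have already pulled the error timestamps out of your logs as ISO 8601 strings. The function names and the sample timestamps are made up for illustration and are not taken from my actual logs.

```python
from datetime import datetime


def error_intervals(timestamps):
    """Return the gaps, in seconds, between consecutive error timestamps."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def looks_periodic(intervals, period, tolerance=2.0):
    """True if every gap is within `tolerance` seconds of a whole multiple of `period`.

    Whole multiples are accepted because an occasional tick may pass
    without an error being logged.
    """
    if not intervals:
        return False
    for gap in intervals:
        multiple = round(gap / period)
        if multiple < 1 or abs(gap - multiple * period) > tolerance:
            return False
    return True


# Hypothetical timestamps, for illustration only.
sample = [
    "2021-03-01T10:00:00",
    "2021-03-01T10:05:14",
    "2021-03-01T10:10:28",
    "2021-03-01T10:20:56",
]

gaps = error_intervals(sample)
print(gaps)                              # gaps between consecutive errors, in seconds
print(looks_periodic(gaps, period=314))  # whether they fit a 314-second period
```

If the gaps cluster tightly around one value or its multiples, the trigger is probably something periodic in the system rather than a randomly failing memory cell, which is what finally made me look beyond the modules themselves.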
Capacitance might be the root cause. You have conductors, insulators, and a highly sensitive component. A tiny charge could build up and then discharge through the skin effect, from cable to PCB to ground via the memory, causing the error. Interesting, thanks for the update.