Diagnose ECC issues every 314 seconds on Ryzen 5800X with Micron ECC SODIMM modules
Hi everyone. Here is a strange problem that you might find interesting. Comments are welcome, and I will keep this thread updated with my debugging progress (and any communication with vendors, where possible).

TL;DR for problem description

If both of these conditions are met:
- RAM is inserted in DIMM_A1
- The load is stable (idle or 100% CPU usage are both fine, as long as it is stable)

then on my operating system (Linux 5.11.10-hardened) an ECC CE (correctable error) is reported every 5'14'' (314 seconds, or 100 * pi) on a random page/bank at a random offset.

Here is part of the log (in GMT+8):

Apr 04 08:40:42 new_nas_server kernel: EDAC MC0: Giving out device to module amd64_edac controller F19h_M20h: DEV 0000:00:18.3 (INTERRUPT)
Apr 04 08:45:54 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000004f8c93a80
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 08:45:54 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x508c93 offset:0xa80 grain:64 syndrome:0x80)
Apr 04 08:45:54 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 08:51:08 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000018ee2f480
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0xbe0700100a800903
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 08:51:08 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x19ee2f offset:0x480 grain:64 syndrome:0x10)
Apr 04 08:51:08 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:06:50 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000684a02380
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:06:50 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x694a02 offset:0x380 grain:64 syndrome:0x80)
Apr 04 09:06:50 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:12:04 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000001628f49c0
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:12:04 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x1728f4 offset:0x9c0 grain:64 syndrome:0x80)
Apr 04 09:12:04 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:17:18 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000000f73a0e00
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:17:18 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x1073a0 offset:0xe00 grain:64 syndrome:0x80)
Apr 04 09:17:18 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:22:32 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000007c4e2b600
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:22:32 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x7d4e2b offset:0x600 grain:64 syndrome:0x80)
Apr 04 09:22:32 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:23:53 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000003f55f3740
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:23:53 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x4055f3 offset:0x740 grain:64 syndrome:0x80)
Apr 04 09:23:53 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:27:46 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0xdc2041000000011b
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000027a3176c0
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:27:46 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x28a317 offset:0x6c0 grain:64 syndrome:0x80)
Apr 04 09:27:46 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:33:00 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000001c683b880
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:33:00 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x1d683b offset:0x880 grain:64 syndrome:0x80)
Apr 04 09:33:00 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:38:14 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000004275c9e80
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:38:14 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x4375c9 offset:0xe80 grain:64 syndrome:0x80)
Apr 04 09:38:14 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:43:28 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000017a5fcc80
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:43:28 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x18a5fc offset:0xc80 grain:64 syndrome:0x80)
Apr 04 09:43:28 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:48:42 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000002b4798280
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:48:42 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x2c4798 offset:0x280 grain:64 syndrome:0x80)
Apr 04 09:48:42 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:53:56 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000025bfea4c0
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:53:56 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x26bfea offset:0x4c0 grain:64 syndrome:0x80)
Apr 04 09:53:56 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 09:58:56 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 09:58:56 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 09:58:56 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 09:58:57 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000004c7993f80
Apr 04 09:58:57 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 09:58:57 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 09:58:57 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x4d7993 offset:0xf80 grain:64 syndrome:0x80)
Apr 04 09:58:57 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:03:57 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000002c11fa380
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:03:57 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x2d11fa offset:0x380 grain:64 syndrome:0x80)
Apr 04 10:03:57 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:09:11 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000025f6b0700
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:09:11 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x26f6b0 offset:0x700 grain:64 syndrome:0x80)
Apr 04 10:09:11 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:14:25 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000002f620c840
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:14:25 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x30620c offset:0x840 grain:64 syndrome:0x80)
Apr 04 10:14:25 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:19:39 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000014273bb80
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:19:39 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x15273b offset:0xb80 grain:64 syndrome:0x80)
Apr 04 10:19:39 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:24:53 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000786d15580
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:24:53 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x796d15 offset:0x580 grain:64 syndrome:0x80)
Apr 04 10:24:53 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:30:07 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000351090980
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:30:07 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x361090 offset:0x980 grain:64 syndrome:0x80)
Apr 04 10:30:07 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:35:21 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000212453f00
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:35:21 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x222453 offset:0xf00 grain:64 syndrome:0x80)
Apr 04 10:35:21 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:40:35 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000001e2df0e80
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:40:35 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x1f2df0 offset:0xe80 grain:64 syndrome:0x80)
Apr 04 10:40:35 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:45:49 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000721870f00
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:45:49 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x731870 offset:0xf00 grain:64 syndrome:0x80)
Apr 04 10:45:49 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:51:03 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000144e82b00
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:51:03 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x154e82 offset:0xb00 grain:64 syndrome:0x80)
Apr 04 10:51:03 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 10:56:03 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000031770b180
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 10:56:04 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x32770b offset:0x180 grain:64 syndrome:0x80)
Apr 04 10:56:04 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:01:04 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000232e46340
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:01:04 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x242e46 offset:0x340 grain:64 syndrome:0x80)
Apr 04 11:01:04 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:06:18 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000001667a1240
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:06:18 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x1767a1 offset:0x240 grain:64 syndrome:0x80)
Apr 04 11:06:18 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:11:32 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: Error Addr: 0x0000000428b809c0
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:11:32 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x438b80 offset:0x9c0 grain:64 syndrome:0x80)
Apr 04 11:11:32 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:16:46 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000001c1d8ce80
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800902
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:16:46 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x1d1d8c offset:0xe80 grain:64 syndrome:0x80)
Apr 04 11:16:46 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:22:00 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000006459b19c0
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:22:00 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x6559b1 offset:0x9c0 grain:64 syndrome:0x80)
Apr 04 11:22:00 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:27:14 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: Error Addr: 0x000000034e7aa140
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:27:14 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x35e7aa offset:0x140 grain:64 syndrome:0x80)
Apr 04 11:27:14 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Apr 04 11:32:28 new_nas_server kernel: mce: [Hardware Error]: Machine check events logged
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: Corrected error, no action required.
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: CPU:0 (19:21:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: Error Addr: 0x00000005a6933e80
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x305e00800a800903
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Apr 04 11:32:28 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x5b6933 offset:0xe80 grain:64 syndrome:0x80)
Apr 04 11:32:28 new_nas_server kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Background Information

I'm assembling my new NAS with the following configuration (sorted by relevance to this thread):

- CPU: Ryzen 7 5800X
  - Microcode updated, patch level 0x0a201009
  - Stepping 0
- Memory: 1 x Micron MTA18ASF4G72HZ-3G2B1 (datasheet)
  - Running at DDR4-2933 as reported by BIOS
  - CL timings: [...]
- Motherboard: ASRock Rack X570D4I-2T (manual)
  - BIOS 2.1.0 (AGESA ComboV2 1.1.0.0)
  - BMC Firmware 01.40.00
  - All settings in BIOS and BMC are default, except for Boot Filter (UEFI Only), BMC IP address and passwords
- Cooler: Noctua NH-L9i
- Case: The old case of an HP ProLiant Microserver Gen8, with some modifications.
However, those (physical) modifications are not ready yet so here is its workbench now: Measures taken and results Clean golden fingers: CE continue Swap RAM to DIMM_A2 or DIMM_B2: Won't boot (as expected) Swap RAM to DIMM_B1: No future errors but I just don't want to use B1 since the problem is just hidden, and I hope to add more memory later Run memtest86 (both the Freeware version by PassMark, and the FOSS version memtest86+) and: RAM at DIMM_A1: FOSS version would fail at random block in first 1M if SMP is enabled, but if disable SMP or change SMP mode to round-robin, no error is reported Freeware (paid) always pass RAM at DIMM_B1: Same as A1 A2/B2: not bootable so N/A Run stress and memtest for memory pressure test under Linux: Errors would be more/less at first, but after ~10min (which means 2 rounds of error), CE continue When DRAM at A1, push forces to the motherboard itself genteelly by following ways, CE continue . Ways are (DO NOT TRY THIS AT HOME IF YOU DON'T KNOW IR SOLDERING): Loosen/Tighten CPU cooler screws Bend the board at less than 2 degrees Read the whole BIOS settings, related manuals, but found nothing looks reasonable for me Suspects I think at least one of those parts is faulty (sorted by possibility based on current findings): Software problems, including The default BIOS settings for this motherboard will cause false alarm for Linux MCE related subsystem The default BIOS settings cause the data being changed without properly set ECC bits Linux MCE things is just not working on Ryzen ...... Software problems is really possible, since 314 seconds looks like something periodic, not random CPU memory controller for channel A (mc0) is broken (which means, RMA ). I think in Ryzen, memory is directly connected to CPU and if I put memory on channel B (maybe another MC?), runtime error disappears. So mc0 seems suspicious. RAM is not stable DIMM_A1 is much shorter than B1 so due to some magical physical things (reflection, etc.) 
an unstable RAM is possible. Additionally, the vendor of this RAM is also suspicious, and I have never bought from them before.
- Motherboard trace is broken, including traces from A1 to CPU, B1 to CPU, and the DIMM slot to motherboard. This is plausible because when I got the motherboard, the package seemed to have been treated roughly, although the inner box seemed fine.
- CPU socket connectors are not in good contact. Least likely. By the pinout diagram ( source ), MA_* pins are really near MB_* pins, and even located further inside.
- Affected by EMI. Can't explain why B1 works. No high-power devices are working near me.

To-Do List

- Try whether Windows (PE) shows the same system logs (if it can generate any, hehe)
- Re-seat the CPU
- Contact AMD and ASRock for help, since I'm not sure if this is really a hardware problem
- Try a new memory module
- Try a new CPU

The most interesting thing is: why 314 seconds? Is this really something periodic, or am I just misreading it? In my R&D experience, even if I needed a counter for something like drawing a circle, I would use 2**n instead of pi, since this is how binary works. Oscillators generating a sine wave also don't really use pi, although the sine function has some historical relationship with it. This makes me think these CEs may come from a software bug, not a hardware problem. But why would this happen.... I spent about 16 hours straight changing sockets and reading kernel source code... but I still have no idea.

Anyway, thanks for everyone's help, and if you have met something like this before, I hope you can give me some pointers. It would be even better if some engineers from AMD, Micron or ASRock are interested in this.

Totally pi**ed off,
izumi_konata
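One quick sanity check on the "false alarm" suspicion is to decode the raw MC18_STATUS value by hand and compare it with the flags the kernel printed. The sketch below is only an illustration: the bit positions are assumed to follow the MCA_STATUS layout used in the Linux kernel headers, and decode_status is a hypothetical helper name, not a kernel or tool API.

```python
# Assumed MCA_STATUS bit positions (as in the Linux kernel's MCI_STATUS_* masks).
FLAGS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "TCC": 55, "SyndV": 53, "CECC": 46, "UECC": 45,
    "Deferred": 44, "Poison": 43,
}

def decode_status(status):
    """Return the names of the flag bits that are set in a raw status value."""
    return [name for name, bit in FLAGS.items() if (status >> bit) & 1]

# Raw value copied from the log lines in this post.
flags = decode_status(0xdc2040000000011b)
print(flags)
# ['Val', 'Over', 'En', 'MiscV', 'AddrV', 'SyndV', 'CECC']
```

UC is clear and CECC is set, which agrees with the "CE" and "CECC" fields in the kernel's own decoding, so at least the printed flags are consistent with the raw register value. Over being set suggests an earlier error was overwritten before it was read.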
I ran into memory stability problems with my B450 board and R5 2600. Moving the RAM modules from the front slots to the back slots resolved the issue. It seems related to the IMC in Ryzen processors, which may need the farthest slots filled first. I ran Prime95's large FFT test to analyze the error, but each run failed with different rounding errors and at different times. The problem disappeared completely after swapping the slots. It doesn't look like a general glitch, more like a Ryzen-specific quirk.
I believe this makes sense... because the issue isn't appearing at precisely 314 seconds. Between 13:00 and 24:00, the only thing that was switched off was my LED desk lamp... Perhaps I should buy another memory module and put it in B1. If the problem then disappears on A1, this might be a Ryzen puzzle.
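To check how exact the period really is, the gaps between consecutive EDAC CE lines can be computed from the journal. This is a minimal sketch: ce_intervals is a hypothetical helper, the sample lines are copied from the log at the top of the thread, and the year is an assumption because the syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches syslog-style lines such as "Apr 04 08:45:54 host kernel: EDAC MC0: 1 CE on ..."
CE_LINE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) .*EDAC MC\d+: \d+ CE on")

def ce_intervals(lines, year=2021):
    """Seconds between consecutive CE reports. The year is assumed, not logged."""
    stamps = []
    for line in lines:
        m = CE_LINE.match(line)
        if m:
            # %b parsing assumes English month abbreviations (C locale).
            stamps.append(datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

sample = [
    "Apr 04 08:45:54 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x508c93 offset:0xa80 grain:64 syndrome:0x80)",
    "Apr 04 08:51:08 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x19ee2f offset:0x480 grain:64 syndrome:0x10)",
    "Apr 04 09:06:50 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x694a02 offset:0x380 grain:64 syndrome:0x80)",
]
print(ce_intervals(sample))  # [314, 942]
```

In this excerpt the second gap is 942 seconds, which is 3 x 314, so a missing report does not necessarily break the period. Feeding the full output of journalctl -k into ce_intervals would show whether that holds over a longer run.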
Update on the situation: I installed a new RAM module from the same supplier for testing. However, its capacity isn't sufficient for me. New observation: depending on which slot the module is inserted into, the contents of /sys/devices/system/edac/mc/mc0/ change. The only variable is the slot position (B on the left, A on the right), and it has nothing to do with the lamp. I just keep forgetting to terminate certain processes.
I really love lamps. Just keep plugging them in, man—it took me about two weeks to get the hang of my R5 2600 and memory issues.
Hey everyone, I noticed a similar issue today. The logs show a hardware error with the message "Machine check events logged": a DRAM ECC error reported by the Unified Memory Controller on CPU:0 (17:60:1). It happens intermittently, about every 311 seconds, on Debian Linux (Bullseye). The hardware is an ASRock B550 Pro4 with Kingston 16GB ECC memory at 2667 MT/s. The kernel log also references csrow 3 and channel 0, and there is a memory-related syndrome code. We might look for patterns across these logs to resolve it faster.
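One way to look for patterns across logs from different machines is to tally corrected errors per memory controller, csrow and channel. The sketch below assumes the EDAC line format shown in the original post; tally_ce is a hypothetical helper, and the sample lines are copied from that post.

```python
import re
from collections import Counter

# Captures controller index, CE count, csrow and channel from an EDAC CE line.
EDAC_CE = re.compile(r"EDAC MC(\d+): (\d+) CE on .*?csrow:(\d+) channel:(\d+)")

def tally_ce(lines):
    """Count corrected errors per (mc, csrow, channel) tuple."""
    counts = Counter()
    for line in lines:
        m = EDAC_CE.search(line)
        if m:
            mc, n, csrow, channel = map(int, m.groups())
            counts[(mc, csrow, channel)] += n
    return counts

sample = [
    "Apr 04 08:45:54 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x508c93 offset:0xa80 grain:64 syndrome:0x80)",
    "Apr 04 08:51:08 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x19ee2f offset:0x480 grain:64 syndrome:0x10)",
    "Apr 04 09:06:50 new_nas_server kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x694a02 offset:0x380 grain:64 syndrome:0x80)",
]
for key, count in tally_ce(sample).most_common():
    print(key, count)
```

If every report from every affected machine lands on the same channel while csrow, page and offset vary, that would point away from a single bad DRAM cell.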
Run the memory at JEDEC speeds instead of an overclocked setting; you will likely see fewer errors or none at all. ECC memory is built to work reliably. The original poster is running at 2933 MHz and you're at 2667 MHz, while the common baseline is around 2133 MHz. I believe it would be best to start with a test at default speed, then consider overclocking later.
Thank you for your suggestion! I'm not overclocking the ECC memory. The board is certified by ASRock for compatibility, and I simply installed the memory without changing any BIOS settings (assuming default parameters). The JEDEC specification on the module states 16GB 2Rx8 PC4-2666V-EE1-11, so 2666 is within spec; running at 2133 would mean underclocking, which I'm considering if possible. I'll try underclocking and see if it helps.
Memory details: two Kingston KSM26ED8/16ME modules (DPU part number 9965745-002.A00G).
Specs range from around 1600MHz to 2667MHz. If the memory is just 2667MHz, it might still support a lower frequency range. I'll need to check your details later. You're on a mobile device, right?
Decoding DIMM reveals the following details: The guessed DIMM resides in bank 3. The kernel driver utilizes EEPROM. CRC checks on EEPROM bytes 0-125 are valid (0xC434). Writing to SDRAM EEPROM is confirmed with 384 bytes, total EEPROM capacity 512 bytes. It operates as DDR4 SDRAM, revision 1.1, UDIMM type. EEPROM CRC for bytes 128-253 is also valid (0xD6A6). Key specs include a maximum module speed of 2666 MT/s (PC4-21300), size 16384 MB, and a fundamental memory type of DDR4 SDRAM. The device supports 512 bytes in total, with a width of 16 bits per row, 10 columns, and 64 bits per byte. It features a bus width of 64 bits, symmetric primary bus, and supports 19-43 CAS latencies. Timing data matches DDR4 variants at standard speeds, with precise cycles and delays listed. Additional info highlights a package type, maximum activation count, thermal compliance, and physical dimensions. One row per bank group is possible, and the module runs stably after downclocking to 2133. Appreciate your assistance!
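For reference, the speed figures quoted in this thread relate to each other by simple arithmetic. The sketch below assumes the usual convention that the PC4 rating counts only the 64 data bits of the bus (the extra ECC bits are not included) and that the I/O clock is half the transfer rate; both function names are hypothetical.

```python
def peak_bandwidth_mb_s(mt_per_s, data_bus_bits=64):
    """Peak transfer rate in MB/s, counting data bits only (assumed convention)."""
    return mt_per_s * data_bus_bits // 8

def io_clock_mhz(mt_per_s):
    """DDR transfers twice per clock, so the I/O clock is half the MT/s figure."""
    return mt_per_s / 2

for speed in (2133, 2666, 2933):
    print(speed, peak_bandwidth_mb_s(speed), io_clock_mhz(speed))
# 2133 17064 1066.5
# 2666 21328 1333.0
# 2933 23464 1466.5
```

The 21328 MB/s figure for 2666 MT/s is what gets rounded to the PC4-21300 label, and the downclocked 2133 setting corresponds to roughly 17000 MB/s.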