Issue detected with FIVR fault; system shuts down during stress test.
PC specs (ALL BOUGHT USED, DON'T KNOW IF THESE PARTS WORK PROPERLY)
CPU: Xeon W-2235
CPU cooler: Thermalright Macho Rev.B
Motherboard: Gigabyte MW51-HP0 (latest BIOS update, R02)
Memory: Kingston KSM26RS4/16MEI (8 DIMMs) (tested with MemTest)
PSU: Corsair RMx Series RM850x
GPU: EVGA GTX 750ti
Problem:
The system boots and works under low load. A few minutes into a stress test, however, the PC crashes completely and some of the fans stop spinning. The OS has no chance to write an error report; the only indication of the problem is a red LED next to the CPU socket, labeled FIVR_FAULT, that lights up on crash. (more in attachments)
The issue happens semi-randomly, so sometimes under maximum load (using AVX-512 stress test) the system will survive for 5 minutes, other times it'll shut off in less than a minute. As a general rule however, it never gets past 10 minutes under full load.
When I lower the number of threads concurrently running the stress test to 11, there are far fewer crashes, and at 6 threads the system never crashes.
As the CPU doesn't support overclocking of any kind, the only thing that can really be tweaked is Turbo Boost. Turning off Turbo Boost lowers the clock under load from 4200 MHz to 3800 MHz.
It turned out that turning off Turbo Boost makes the system (seemingly) completely stable: even under full load I wasn't able to reproduce the issue (though none of those tests ran longer than 16 minutes).
At the beginning of testing I was still using a stock Intel CPU cooler, which let the CPU overheat, so I installed a much more appropriate cooler from Thermalright, which kept the CPU at ~60 °C under full load with Turbo Boost enabled.
The second thing I did was google what FIVR_FAULT might mean. As it turns out, CPUs of this generation use what's known as a Fully Integrated Voltage Regulator, which allowed the manufacturer to significantly reduce the number of VRMs on the motherboard.
Next, since the descriptions of FIVR are murky on how power is delivered to the memory, and it's possible that too many populated RAM slots may cause issues in some cases, I removed half the RAM, leaving only 4 sticks.
Since the issue seems to be related to the power delivery, I decided that a capacitor modification might fix the issue if the voltage is unstable.
To add to my concerns, the capacitors on the back of the motherboard, right under the VRMs, were not fully populated, so that's where I started.
I ordered a lot of ten EEFSX0D471XE capacitors, which match the specs of those already installed on the board.
Long story short, in addition to the 5 caps already present, 6 more were installed, and as the name of the post might suggest, this didn't really solve the issue.
I tried to figure out if the crashes took longer to occur under load, but those crashes occur so randomly that it seems impossible to establish any progress here.
As already mentioned, a crash can happen in under a minute, or take over 5 minutes before occurrence.
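One way to put numbers on the time-to-crash, since the OS never gets to write a report, is to run a small heartbeat logger next to the stress test. The sketch below is only an illustration: the file name and the function names (`write_heartbeat`, `survival_seconds`, `run`) are made up here, and it assumes the disk keeps whatever was synced before the power cut. With a few runs logged before and after each modification, the survival times can at least be compared instead of guessed.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Append one timestamp and force it to disk so it survives a hard power-off."""
    stamp = time.time() if now is None else now
    with open(path, "a", encoding="ascii") as f:
        f.write(f"{stamp:.3f}\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the drive
    return stamp


def survival_seconds(path):
    """Seconds between the first and the last complete heartbeat in the log."""
    stamps = []
    with open(path, encoding="ascii") as f:
        for line in f:
            try:
                stamps.append(float(line))
            except ValueError:
                # A line cut off by the crash (or an empty line) is skipped.
                continue
    if len(stamps) < 2:
        return 0.0
    return stamps[-1] - stamps[0]


def run(path, interval=1.0, max_beats=None):
    """Write a heartbeat every `interval` seconds; forever if max_beats is None."""
    beats = 0
    while max_beats is None or beats < max_beats:
        write_heartbeat(path)
        beats += 1
        time.sleep(interval)
```

Start `run("heartbeat.log")` right before launching the stress test, and after the reboot call `survival_seconds("heartbeat.log")`. Use a fresh log file for every run, otherwise the time between runs is counted as well.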
But I wasn't done yet. By this point I had only modified the 2 V capacitors, and these don't hold much charge: they filter out only the highest-frequency noise and can't really offset larger dips in voltage.
So I decided that upgrading the 16 V capacitors that feed the VRMs to higher-capacitance ones might help, and replaced the 270 uF caps with 470 uF ones of otherwise similar specs.
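For a rough idea of how much the 270 uF to 470 uF swap can buy, the droop of an ideal capacitor that has to supply a load step on its own is dV = I * dt / C. The sketch below just evaluates that formula. The 10 A step and the 5 microsecond duration are hypothetical numbers picked for illustration, not measurements from this board, and the model ignores ESR, the caps sitting in parallel, and the regulator reacting in the meantime.

```python
def droop_volts(step_amps, duration_s, capacitance_f):
    """Voltage drop of an ideal capacitor supplying a constant current step.

    dV = I * dt / C. ESR, parallel capacitors and the regulator's own
    response are deliberately ignored.
    """
    if capacitance_f <= 0:
        raise ValueError("capacitance must be positive")
    return step_amps * duration_s / capacitance_f


# Hypothetical load step: 10 A for 5 microseconds (assumed, not measured).
STEP_AMPS = 10.0
DURATION_S = 5e-6

old_droop = droop_volts(STEP_AMPS, DURATION_S, 270e-6)
new_droop = droop_volts(STEP_AMPS, DURATION_S, 470e-6)
print(f"270 uF: {old_droop * 1000:.0f} mV, 470 uF: {new_droop * 1000:.0f} mV")
```

Whatever step is assumed, the droop scales with 1/C, so going from 270 uF to 470 uF cuts it to 270/470, or about 57%, of the original value. That helps, but it is not an order-of-magnitude change.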
And again I was out of luck, so I decided to hook up an oscilloscope to the 16 V capacitors just to see what's going on. The results are attached.
With one division on the screen being 100 mV, the picture suggests that under low load the voltage noise level is about 100 mV.
Under full load the noise increases to 200 mV, with jumps up to 300 mV.
And as far as I understand, this is supposed to be a normal voltage drift (AI says that voltage can drift at around 11.5V–12.6V). So I should be safely within limits here.
(CORRECTION: after further googling, might be completely off on the exact acceptable voltage ripple)
(CORRECTION2: as per the ATX spec, the 12 V rail can range between 11.4 V and 12.6 V (±5%), but the ripple cannot exceed 120 mV peak-to-peak, while mine easily exceeds 200 mV)
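The comparison against the 120 mV limit can be written down as a small check. This is only a sketch: the function names are made up, and the sample values at the bottom are placeholders shaped like the scope readings above, not exported scope data. Also keep in mind that, as far as I understand it, the ATX limit is specified at the PSU connector with a limited measurement bandwidth, so a reading taken on the VRM input caps also picks up VRM switching noise and is not strictly the same measurement.

```python
ATX_12V_RIPPLE_LIMIT_MV = 120.0  # peak-to-peak ripple limit for the 12 V rail


def peak_to_peak_mv(samples_v):
    """Peak-to-peak amplitude, in millivolts, of voltage samples given in volts."""
    if not samples_v:
        raise ValueError("need at least one sample")
    return (max(samples_v) - min(samples_v)) * 1000.0


def ripple_ok(samples_v, limit_mv=ATX_12V_RIPPLE_LIMIT_MV):
    """True if the peak-to-peak ripple stays within the limit."""
    return peak_to_peak_mv(samples_v) <= limit_mv


# Placeholder samples (volts), roughly shaped like the readings described above.
low_load = [12.00, 12.05, 11.95, 12.02]
full_load = [12.00, 12.10, 11.90, 12.15, 11.85]

print("low load within limit:", ripple_ok(low_load))
print("full load within limit:", ripple_ok(full_load))
```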
I might try ordering another CPU just to try it out, in case mine is somehow degraded.
The PSU might also be at fault, albeit unlikely as I've already tried another one in place of the aforementioned, but the replacement one is not very powerful. So again I might try yet another PSU.
The BIOS is very bare-bones, so there isn't much to see there.
The board also has an IPMI chip installed on it, which could provide more clues as to what's going on, but I wasn't able to get it to work. The BIOS has no options to activate the IPMI server, and the board doesn't even register its IP on the network when the Ethernet cable is plugged in.
In case someone has any clues as to what else I could do to get this thing fixed besides turning off the Turbo Boost, I'm eager to hear from you.
In case you know what the acceptable voltage ripple on the 12v caps (or the 2v caps) might be, please reply.
(UPD: seems that the voltage ripple on the 12v caps may be well beyond acceptable, am I correct here?)
(voltage on capacitors under load)
https://youtube.com/shorts/lksGHiwm3NI?feature=share
(low load conditions)
https://ibb.co/DPrcfg0Y
(error LED photo)
https://ibb.co/rKRGXfLC
What issues were present that required stress testing? Or is this simply curiosity testing (acceptable) to gain more insight and conduct general experiments? Are there any error codes, warnings, or informational events recorded in Reliability History/Monitor or Event Viewer prior to or during the crashes? Please try another PSU. Make sure to use only the cables included with each PSU. Perform some tests on the PSUs as well—verify that output voltages remain within acceptable ranges.
It was just a standard follow-up after buying some aftermarket hardware, and nothing significant came up. I searched the system logs for any mention of FIVR, but found nothing. I plan to look at other error messages as well, since I haven’t tried this before. So far, the best method to detect low-level issues seems to be IPMI, though it isn’t working for me. Concerning the PSU, I made the cables myself, so checking their voltage might be useful. However, the capacitors are showing the correct voltage, which suggests there’s no need to investigate further upstream.
I've purchased another CPU, a W-2133, which resolved the problem. The stress test now passes properly. It seems the original CPU is failing, which would fit, since the FIVR resides on the CPU itself. The problem is fixed.