Possible overheating issue detected with the M.2 device.
Hello everyone! I'm facing a problem I haven't encountered before: my PC randomly crashes with a blue screen. Sometimes it runs fine under stress and crashes at idle, other times it fails under load; overall it seems to happen more often during heavy usage. I've tested my GPU with Heaven Benchmark and Superposition Benchmark, making sure the VRAM was fully loaded. Running Prime95 appeared to rule out an overheating CPU or RAM, and using the full RAM capacity didn't trigger the problem. As far as I can tell, the culprit is my M.2 SSD, a CT1000MX500SSD4. It's been reliable for several years; CrystalDiskInfo reports 92% health, which is below ideal but still functional (some SSDs keep working at 80%). The drive is only 60% full, and I don't play games from it; I use a separate SSD for that. However, its temperature fluctuates between 50 and 70°C, and 70°C is the maximum according to the specs. I've fitted an AliExpress heatsink, but the drive still sits near 70°C even with the extra cooling. The room temperature is under 20°C, though it may be warmer where the PC sits (around 25°C). CPU and GPU temperatures look normal, and the RAM shouldn't be affected. I've been using this setup for about six months to a year. Malwarebytes scans come back clean. Thanks!
Part of the operating system may have been corrupted, or the NAND may be failing: whenever data has to be read from the affected area, the system crashes. This happened to me once with a Samsung 870 Evo. In my case the problem was the SSD, not the OS itself. My games and programs were installed on it, and they kept failing as the SSD degraded quickly, losing about 1-2% of its health each day. I first noticed it struggling to run games and software, even before the health figure started dropping. The telltale sign was the extended SMART test in Samsung Magician: the drive passed the quick test and even DiskMark, but failed the extended test completely. I hope someone more experienced can help clarify what was really going on.
Those stress tests often push CPU/GPU temperatures near critical levels. On the health figure: that by itself doesn't necessarily mean the disk is failing. Could you post a full screenshot of the SSD's SMART data from CrystalDiskInfo? 70°C should be acceptable, and there are tests showing higher temperatures: https://www.techpowerup.com/review/cruci...tb/15.html This drive tends to throttle and slow down to protect itself ("preventing meltdown"). HWiNFO can help with monitoring: https://www.hwinfo.com/ Let's begin with the basics. When did your PC start randomly shutting down? Is it truly random, or linked to specific tasks, games, or activities? Around the time of the first blue screen, or in the few days before it, did anything change: software updates, driver updates, hardware modifications? While you think about that, another useful step is updating the SSD firmware, since some drives ship with less stable versions that are improved later: https://www.crucial.com/support/ssd-support
Go to C:\Windows\Minidump and check whether any minidump files exist. If they do, go back up to the Windows folder and copy the Minidump folder to your Downloads folder (or the desktop if needed). Zip the copied folder and attach it to a post. Please follow these steps exactly, since Windows doesn't allow the files to be changed in their original location.
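If you'd rather script the copy-and-zip step, here is a minimal Python sketch of it. The function name `zip_minidumps` is made up for illustration, and it assumes the account running it can read the dump folder (that may require an elevated prompt). It only works on a copy and never touches the originals.

```python
import shutil
from pathlib import Path
from typing import Optional


def zip_minidumps(src: Path, dest_dir: Path) -> Optional[Path]:
    """Copy the dump folder into dest_dir and zip the copy.

    Returns the path of the zip file, or None if no .dmp files were found.
    (Hypothetical helper, not part of any Windows tooling.)
    """
    src = Path(src)
    dest_dir = Path(dest_dir).resolve()
    if not src.is_dir() or not any(src.glob("*.dmp")):
        return None  # nothing to attach

    # Work on a copy so the files under the Windows folder stay untouched.
    copy_dir = dest_dir / "Minidump"
    shutil.copytree(src, copy_dir, dirs_exist_ok=True)

    # Creates <dest_dir>/Minidump.zip containing the copied folder.
    archive = shutil.make_archive(
        str(dest_dir / "Minidump"), "zip", root_dir=dest_dir, base_dir="Minidump"
    )
    return Path(archive)
```

Typical use would be `zip_minidumps(Path(r"C:\Windows\Minidump"), Path.home() / "Downloads")`, then attaching the resulting Minidump.zip to a post.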
It might be a RAM problem. You probably have more than one stick installed, so consider testing each stick separately. Watch for clear signs, such as the PC failing to start at all or booting with only one of the sticks. Scanning for damaged system files could also help, though that can take a while.
I doubt the problem is RAM, since Prime95 fills most of it quickly; if something were wrong, I'd expect it to crash immediately. I also ran the built-in Windows memory test, but only the quick version. It might still point to RAM, but I haven't found stronger evidence yet. Would a scan for damaged files be helpful?
Prime95 doesn't necessarily use much RAM (that depends on the test mode); it primarily stresses the CPU, and it does not load the GPU. To check for damaged system files, run sfc /scannow from an elevated Command Prompt.
Judging by the dump files, this looks memory-related. Memory doesn't always mean RAM, though that's what people usually think of. Windows moves low-priority RAM contents into the page file and reads them back when needed, so storage can look like memory. The CPU's memory controller also plays a role; if it is failing, it can mimic faulty RAM. When storage shows up in about half of the dumps, storage or its drivers are the likely culprits; that isn't the case here, so storage is unlikely. If there is any overclocking or undervolting, remove it. To verify the RAM, run the machine normally with one stick at a time. If crashes happen with only one particular stick, that stick is faulty. If it crashes with either stick, the CPU is the likely issue. Memory testers often miss defective RAM, especially DDR4 and newer, so I don't put much faith in their results. I noticed you're using four RAM modules, so you can test two at a time instead of all four; use the second and fourth slots, counting from the CPU. Have you been running four sticks for some time, or did you change them recently? A compatibility problem would show up within a few months, so if you've been using them longer than that, it probably isn't one. The health percentage reflects wear, not the current condition of the drive: it is based on how many of the warrantied write cycles remain. The overall status shown isn't reliable either; it depends on what the manufacturer specifies. You need to read the CDI parameters in the lower section and pick out the key indicators. If you're unsure, take a screenshot of CDI so the metrics that matter can be double-checked. Unless the drive is an NVMe SSD, its self-diagnostic is disabled and provides little useful data; that applies to the CT1000MX500SSD4, which uses SATA.
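The one-stick-at-a-time reasoning above can be written down as a small decision helper. This is only a sketch of the rule of thumb from this post, not a diagnostic tool: the function name and the result strings are made up, and the "no crash with any stick" branch is my own assumption about what to conclude in that case.

```python
from typing import Dict


def suspect_from_stick_tests(crashed: Dict[str, bool]) -> str:
    """Map per-stick results to a suspect.

    `crashed` maps a stick label to True if the PC crashed while
    running normally with only that stick installed.
    """
    if not crashed:
        return "no data"
    bad = [name for name, did_crash in crashed.items() if did_crash]
    if not bad:
        # Assumption: nothing failed alone, so the sticks themselves look fine.
        return "no single stick fails; look at slots, settings, or elsewhere"
    if len(crashed) > 1 and len(bad) == len(crashed):
        # Crashes regardless of which stick is in: points away from the RAM.
        return "crashes with every stick; suspect the CPU / memory controller"
    return "suspect stick(s): " + ", ".join(bad)
```

For example, `suspect_from_stick_tests({"A": False, "B": True})` singles out stick B, while a crash with every stick points at the CPU side instead.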