Switch issues causing frequent restarts


vuur123

08-22-2025, 01:28 PM #1

Hi everyone, I've been running a Brocade ICX7250 48P-2X10G network switch around the clock for several months now. Recently I noticed that my Proxmox cluster nodes frequently restart my VMs, and I suspect the issue stems from the switch itself. It appears to reboot intermittently several times each day without a clear pattern.

I have remote syslog logging set up, so the logs survive reboots, unlike the volatile output of the show command. So far I haven't found anything unusual beyond standard warnings such as:

May 26 19:42:24:A:System: Stack unit 1 Temperature 67.0 C degrees

The temperature steadily climbs to around 80C, then a reboot happens and it drops back to about 60C before rising again. This makes me wonder whether the problem is thermal, even though the temperature never hits the shutdown threshold of 105C.

I also noticed that one of my SFP+ to 10GBASE-T adapters stopped working: no blinking LEDs and no network signal. After I replaced it, the switch entered a persistent boot loop with only amber lights. I powered the switch off overnight, and the next day it booted with the adapter connected, apart from the more erratic reboots described above.

To test thermal performance, I disabled the fans; temperatures reached roughly 90C according to the logs, but the system automatically switched to high fan mode, cooled down, and stayed online. It is still possible that the switch overheats under load and reboots before a remote log entry can be sent.

It seems atypical, but I'd like to hear your thoughts or any similar cases you've encountered. Any guidance would be greatly appreciated. Thanks for your help!

Here's some additional debug info:

show version

Copyright © Ruckus Networks, Inc. All rights reserved.
UNIT 1: compiled on Aug 8 2023 at 23:06:54 labeled as SPR08095m
    (33554432 bytes) from Primary SPR08095m.bin (UFI)
    SW: Version 08.0.95mT213
    Compressed Primary Boot Code size = 786944, Version:10.1.26T215 (spz10126)
    Compiled on Tue Nov 29 23:13:15 2022
HW: Stackable ICX7250-48-HPOE
==========================================================================
UNIT 1: SL 1: ICX7250-48P POE 48-port Management Module
    Serial # UK3845L1DZ
    Software Package: ICX7250_L3_SOFT_PACKAGE (LID: fwmINJKnGfb)
    Current License: l3-prem-8X10G
    P-ASIC 0: type B344, rev 01 Chip BCM56344_A0
==========================================================================
UNIT 1: SL 2: ICX7250-SFP-Plus 8-port 80G Module
==========================================================================
1000 MHz ARM processor ARMv7 88 MHz bus
8 MB boot flash memory
2 GB code flash memory
2 GB DRAM
STACKID 1 system uptime is 3 hour(s) 44 minute(s) 17 second(s)
The system began at 19:38:19 CST Mon May 26 2025
The system : started=cold start

show chassis

The stack unit 1 chassis info:

Power supply 1 (AC - PoE) present, status ok
Power supply 2 not present
Power supply 3 not present

Fan 1 ok, speed (auto): [[1]]<->2
Fan 2 ok, speed (auto): [[1]]<->2
Fan 3 ok, speed (auto): [[1]]<->2

Fan controlled temperature:
    Rule 1/2 (MGMT THERMAL PLANE): 91.3 deg-C
    Rule 2/2 (AIR OUTLET NEAR PSU): 40.5 deg-C

Fan speed switching thresholds:
    Rule 1/2 (MGMT THERMAL PLANE):
        Speed 1: NM<-----> 95 deg-C
        Speed 2: 85<----->105 deg-C (shutdown)
    Rule 2/2 (AIR OUTLET NEAR PSU):
        Speed 1: NM<-----> 41 deg-C
        Speed 2: 34<----->105 deg-C (shutdown)

Fan 1 Air Flow Direction: Front to Back
Fan 2 Air Flow Direction: Front to Back
Fan 3 Air Flow Direction: Front to Back

Slot 1 Current Temperature: 91.3 deg-C (Sensor 1), 40.5 deg-C (Sensor 2)
Slot 2 Current Temperature: NA
    Warning level.......: 85.0 deg-C
    Shutdown level......: 105.0 deg-C
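The pattern described in this post (temperature climbing steadily, then dropping sharply after each reboot) can be picked out of the remote syslog automatically. The following is a minimal Python sketch, assuming the temperature messages look like the one quoted above; the function names, the default year, and the 10-degree drop threshold are illustrative assumptions, not anything the switch or syslog server provides.

```python
import re
from datetime import datetime

# Matches messages shaped like the one quoted in the post:
#   May 26 19:42:24:A:System: Stack unit 1 Temperature 67.0 C degrees
TEMP_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}):\w:System: "
    r"Stack unit \d+ Temperature (?P<temp>\d+(?:\.\d+)?) C degrees"
)


def parse_temperature(line, year=2025):
    """Return (timestamp, temperature) or None if the line is not a match.

    The log line carries no year, so one has to be supplied (assumption).
    """
    m = TEMP_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, float(m["temp"])


def find_drops(readings, min_drop=10.0):
    """Return (timestamp, before, after) for each sharp fall in temperature.

    min_drop is a hypothetical threshold in degrees C; tune it to the data.
    """
    drops = []
    for (_, before), (ts, after) in zip(readings, readings[1:]):
        if before - after >= min_drop:
            drops.append((ts, before, after))
    return drops
```

Feeding every syslog line through `parse_temperature`, discarding the `None` results, and passing the rest to `find_drops` gives a list of candidate reboot times to compare against the Proxmox node restarts.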


CeekaQueen

08-22-2025, 01:28 PM #2

They provide a replacement option if needed.


Space_Triks

08-22-2025, 01:28 PM #3

Consider updating to 08.0.95s or 09.0.10j_cd1. The release notes since 08.0.95m list fixes for random crashes, though nothing temperature-related. Which modules are you using? Their power consumption ratings matter too. The SFP+ area of the switch might be running hotter than the overall temperature reading suggests.


ladymorepork

08-22-2025, 01:28 PM #4


Thanks for the quick replies! I acquired a second-hand switch, so a replacement isn’t an option. However, one choice would be to purchase a new or upgraded model... Would you have any recommendations for a similar switch with quieter fans? I’m generally satisfied with the current setup, but the fans seem quite loud lately... Thanks for the advice on upgrading—I’ll explore options and give it a try. It’s odd that everything was fine for the past few months before suddenly failing!

The module I'm using comes from fs.com: Brocade Compatible 10GBASE-T SFP+ Copper 30m RJ-45 Transceiver Module (LOS). The previous unit worked well for months before it stopped functioning; it turned out to be a FlyproFiber transceiver. Here's the current switch status for reference:


Elia1153

08-22-2025, 01:29 PM #5

Are you only using replication, or is HA actually enabled for the virtual machines? If HA is active, the switch reboots may be triggering it, and part of the HA response is to restart a node that the others have marked as offline. If every node is connected only through that switch, a switch reboot should take them from fully connected to fully isolated all at once. Once connectivity returns, though, one of them may be slow to reconnect and get picked for a reboot.
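The reasoning above can be illustrated with a simplified majority-vote model. This is only a sketch, assuming one vote per node and a strict-majority quorum rule; it is not the actual Proxmox HA or corosync implementation, and the node names `pve1` to `pve3` are made up.

```python
def has_quorum(visible_votes, total_votes):
    """Strict majority: more than half of all votes must be visible."""
    return 2 * visible_votes > total_votes


def quorum_by_node(partitions):
    """Map each node to whether its network partition holds quorum.

    partitions is a list of sets of node names; every node has one vote.
    """
    total = sum(len(p) for p in partitions)
    return {
        node: has_quorum(len(partition), total)
        for partition in partitions
        for node in partition
    }


# Switch down: every node is isolated, so nobody holds quorum.
switch_down = quorum_by_node([{"pve1"}, {"pve2"}, {"pve3"}])

# Switch back, but pve3 is slow to reconnect: only pve3 lacks quorum.
one_lagging = quorum_by_node([{"pve1", "pve2"}, {"pve3"}])
```

In this model the lagging node is the only one without quorum once the other two can see each other again, which matches the scenario where a single node ends up being rebooted.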


Hytac

08-22-2025, 01:29 PM #6

The FS module, P/N SFP-10G-T-30, should be fine, since it is rated at 2.9W. I checked several FlyproFiber modules and they all listed around 2.5W consumption, though confirming the exact spec would help. The transceiver status is showing 42°C, which doesn't raise concerns about power draw. If you want deeper insight, updating the switch firmware and starting CPU/memory logging via SNMP would be useful. Zabbix is a solid choice for monitoring this.


captainevan100

08-22-2025, 01:29 PM #7

Apologies for any lack of clarity in my previous summary. Two of my virtual machines have high availability enabled across the three cluster nodes, and four VMs have replication enabled between those nodes. Around ten other VMs aren't running yet; they are still restarting according to the schedule, which suggests a reboot cycle is happening. It seems logical that the switch reboots are contributing to the cluster instability. I mention this to clarify the overall issue.

The current module is an SFP-10G-T-30. The prior one was an SFP-10GT-BC-30M. It's reassuring to hear it might be resolved. I'm puzzled as to why this seems connected to the problem. My point was that when the new module was installed, the switch went into a boot loop, and removing the module fixed it. After leaving the switch off overnight and reinserting the module, it works better, though the reboots remain intermittent. The module still provides a 10G link to the Proxmox node. I'm concerned about a possible short circuit or power spike when the module was first installed, but it isn't dead now.

Do you have a suitable guide for configuring CPU and memory logging with this switch? This would be helpful since I plan to use Grafana Alloy/Loki/Mimir for logging. I’d also like to remove the module, restart the entire setup, and check if stability improves.


Ninjas_R_OP

08-22-2025, 01:29 PM #8

I'm mainly familiar with SNMP. Metrics such as CPU, memory, temperature, and many more are exposed through standardized OIDs. This is how enterprise monitoring systems have worked for years.
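One standardized OID that is directly useful for this problem is `sysUpTime.0` (`1.3.6.1.2.1.1.3.0` from MIB-II), which reports time since the SNMP agent started in hundredths of a second. Whenever a polled value is lower than the previous one, the device has restarted in between. Below is a minimal sketch of that check in Python; it assumes the samples have already been collected by some SNMP poller, and the function name and sample layout are illustrative.

```python
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"  # sysUpTime.0, unit: 1/100 second


def detect_reboots(samples):
    """Estimate boot times from successive sysUpTime readings.

    samples: list of (poll_time_seconds, sysuptime_ticks) in polling order.
    Returns the estimated boot time (same clock as poll_time_seconds) for
    every poll where the uptime counter went backwards.

    Note: the counter is 32 bits wide and also wraps after roughly 497 days
    of uptime; that case is not handled in this sketch.
    """
    boots = []
    for (_, prev_ticks), (poll_time, ticks) in zip(samples, samples[1:]):
        if ticks < prev_ticks:
            boots.append(poll_time - ticks / 100.0)
    return boots


# Polled every 60 s; the third reading shows only 20 s of uptime.
example = [(0, 100000), (60, 106000), (120, 2000)]
print(detect_reboots(example))
```

Because the boot time is derived from the uptime value itself, the estimate is more precise than the polling interval, which helps when lining reboots up against other logs.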


MadMats100

08-22-2025, 01:29 PM #9

I removed the transceiver and began tracking temperature and CPU metrics through SNMP, thanks for the advice. I'm still figuring out whether I'm interpreting the CPU data correctly and what the actual variable names should be. The temperature graph looks reasonable, showing two distinct outages or restarts with clear drops. The CPU chart might not be recording properly, but it still lines up with these events.

The two incidents seem to have happened at around the same time each morning. That timing is intriguing, and I'm curious whether it is significant. If it is, pinpointing the cause is still challenging, since there are no obvious signs in the sensor data. The power comes from a clean UPS, and its logs indicate a stable supply with no issues. Could network usage spikes around those times, perhaps from backup jobs or similar activity, trigger the crashes? Or could something in my setup (the OPNsense router or a VM) somehow be initiating these reboots?

Any suggestions or guidance would be appreciated. Perhaps upgrading the firmware could help, though I'm unsure why it would be necessary. It feels odd because the device otherwise functions normally.
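Whether the restarts really recur at the same time of day can be checked numerically once a few reboot timestamps are available. The sketch below is one way to do it in Python; the 30-minute tolerance and the example timestamps are made-up values for illustration, not data from this thread.

```python
from datetime import datetime
from itertools import combinations


def minutes_of_day(ts):
    """Minutes elapsed since midnight for a datetime."""
    return ts.hour * 60 + ts.minute


def clock_distance(a, b):
    """Distance in minutes between two times of day, wrapping at midnight."""
    d = abs(minutes_of_day(a) - minutes_of_day(b))
    return min(d, 1440 - d)


def same_time_of_day(events, tolerance_minutes=30):
    """True if every pair of events lies within the tolerance on the clock."""
    return all(
        clock_distance(a, b) <= tolerance_minutes
        for a, b in combinations(events, 2)
    )


# Hypothetical reboot times on two consecutive mornings.
reboots = [datetime(2025, 5, 27, 6, 2), datetime(2025, 5, 28, 6, 10)]
print(same_time_of_day(reboots))
```

If the check keeps returning true as more reboots accumulate, a scheduled job such as a backup becomes a more likely trigger than a random hardware fault.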
