Live backup PC
I run a company that relies heavily on design by simulation, so we constantly run powerful multi-core PCs at high CPU load. Under that kind of load these machines fail with some regularity: three Dell 7820 Xeon Gold workstations in the past two years, which is well above my usual rate of roughly one every two years. We can discuss the individual failures, but the main concern here is downtime.
Our backup regime itself works well: NAS first, then cloud. The problem is that Dell takes 3 to 6 weeks to build and deliver a replacement PC each time one fails, and that creates significant disruption for the business. Even two hours of downtime can mean substantial losses, so I'm considering keeping at least one backup PC ready whenever possible.
This backup doesn’t need to be a perfect permanent solution; in fact, I prefer not to constantly upgrade to the newest models. The goal is simply to have something functional while the replacement is being built and shipped.
I can pick up decent workstations on eBay for this purpose: around $1,500 for a Dell Precision 7820 with dual Xeon Gold 6000-series CPUs, adequate RAM, and an SSD, which should suffice. What I'm struggling with is setting up automatic or hourly replication from my primary PC to the spare, both running Windows 11. My thinking is that at a critical moment it would be quicker to push the data straight from the main PC to the emergency PC than to restore it from the NAS.
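For the replication piece, what I have in mind is roughly the sketch below: an hourly scheduled task on the main PC that mirrors the simulation data to a share on the spare. The paths, share name, and log location are placeholders, and this only covers data, not installed software or licenses.

```python
r"""Hourly mirror of the simulation working set to the spare PC.

Rough sketch only: assumes the data lives in D:\sims on the primary, the
spare exposes a share at \\SPARE-PC\sims_mirror, and C:\logs exists for
the log file (all of these are placeholder names).
"""
import subprocess
import sys

SOURCE = r"D:\sims"
DEST = r"\\SPARE-PC\sims_mirror"

# /MIR mirrors the tree (including deletions), /FFT tolerates timestamp
# granularity differences, /R:2 /W:5 keeps retries short so one locked
# file doesn't stall the run, /NP and /LOG keep the output manageable.
cmd = [
    "robocopy", SOURCE, DEST,
    "/MIR", "/FFT", "/R:2", "/W:5", "/NP",
    r"/LOG:C:\logs\sims_mirror.log",
]

rc = subprocess.run(cmd).returncode
# Robocopy exit codes 0-7 mean success (with or without copies);
# 8 and above mean something failed.
sys.exit(0 if rc < 8 else rc)
```

Registered with Task Scheduler to run hourly, that would at least cap how much in-flight work I can lose; whether that's good enough is the open question.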
Currently, all my Windows PCs use local logins. I strongly dislike being pushed into storing files on Microsoft's cloud services, both for performance reasons (large simulation files need to live on local NVMe drives) and for security reasons: much of our work is export-controlled, and some of it is classified or ITAR-related.
The simulations use both the GPU and the CPU.
That last part significantly limits your options. Acting on random ideas from strangers here, some of whom may be in restricted countries, could put your government contracts at risk, and even posting the details may be against the rules. Please talk to your security team before proceeding.
In general, your workload sounds more like a server workload. Running it as a VM would make it simple to switch between two hosts and restart automatically on the second one if the first dies. You could also replicate the VM to ITAR-compliant AWS instances with something like Veeam for extra resilience.
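If you go that route on Windows, the two-host piece can be handled by the built-in Hyper-V Replica feature. Here's a rough sketch of that setup driven from Python; the VM name and host name are placeholders, and it assumes Hyper-V is installed on both boxes and the spare host has already been configured to accept replication:

```python
"""Point Hyper-V Replica from the primary host at the spare host.

Placeholder names throughout (SimVM, SPARE-HOST); assumes both machines
run Hyper-V and the spare already allows inbound replication.
"""
import subprocess

def ps(command: str) -> None:
    # Run a PowerShell command and raise if it fails.
    subprocess.run(["powershell", "-NoProfile", "-Command", command], check=True)

# Replicate the VM to the spare every 5 minutes over Kerberos/HTTP.
ps("Enable-VMReplication -VMName 'SimVM' "
   "-ReplicaServerName 'SPARE-HOST' -ReplicaServerPort 80 "
   "-AuthenticationType Kerberos -ReplicationFrequencySec 300")

# Seed the first full copy; later passes only send changed blocks.
ps("Start-VMInitialReplication -VMName 'SimVM'")
```

When the primary dies you fail over on the spare (Start-VMFailover, then Start-VM) and you're back up with at most a few minutes of lost changes. The Veeam/AWS leg is separate tooling with its own setup.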
My guess is that by asking at all, they already understand the clearance issues. My assumption is that this person probably can't source a solution outside certified suppliers anyway. It's an interesting thought experiment, but realistically he'd need to go through the official approval process to keep a working backup system for this contract. Given everything else he's said, a spare clearly pays for itself if downtime is that costly.
My top suggestion is to keep an on-site spare. It prevents downtime and doubles as a test host for software and firmware updates; you should never push software or firmware into production without a test platform. That test platform then serves as a temporary production machine while you wait for the replacement.
I'm not sure whether your application runs in a VM, but if it does, that opens up a lot of options, including automatically restarting the application on another physical host.
You can run any program inside a VM. You may lose around 3-5% of performance to hypervisor overhead, though.
Be careful with blanket claims like that. Plenty of software checks its environment and refuses to run if it detects that it's inside a VM.
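To show what that detection often amounts to, here's a rough illustration (not from any particular product) that checks the system strings WMI reports, one of the simpler tells such software looks for:

```python
"""Crude VM check of the kind licensed software sometimes performs.

Illustration only: it just looks for well-known virtualization vendor
strings in the manufacturer/model fields that WMI reports.
"""
import subprocess

VM_MARKERS = ("vmware", "virtualbox", "virtual machine", "kvm", "qemu", "xen")

out = subprocess.run(
    ["wmic", "computersystem", "get", "manufacturer,model"],
    capture_output=True, text=True,
).stdout.lower()

if any(marker in out for marker in VM_MARKERS):
    print("Looks like a VM; some licensed tools would refuse to start here.")
else:
    print("No obvious VM markers found.")
```

Real license checks go further (CPUID hypervisor bit, MAC address ranges, and so on), so test your specific simulation tools in a VM before committing to that architecture.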