Insights: Why This Virtualisation Platform Survived a Catastrophic Shutdown

Summary

This Insight looks at the lifecycle of a small virtualisation platform that began on VMware with no backups, migrated to XenServer/XCP‑ng, and ultimately survived a catastrophic, unplanned shutdown years later. The stability of the system came not from expensive hardware or complex software, but from simple engineering principles: clean migrations, minimal configuration drift, resilient storage, and a disciplined “don’t touch it unless it’s broken” approach.

The Beginning: VMware With No Backups

The environment started life on VMware ESXi Free Edition. For two years, the business ran its core services — Samba Active Directory, Zimbra email, and a file server — with no backups at all. No snapshots, no replication, no export automation. If the datastore had failed, everything would have been lost.

This risk drove the decision to replace the platform entirely. The goal was simple: build something recoverable.

Migration to XenServer / XCP‑ng

The first step was to export the VMware VMs, remove VMware Tools, and import them into XenServer. This process effectively “cleaned” the VMs:

  • proprietary VMware drivers removed
  • virtual hardware reset
  • device trees simplified
  • bootloaders left clean
  • no leftover VMware configuration rot

Once on XenServer (and later XCP‑ng), the VMs ran for the next seven years.

Storage Redesign: QNAP RAID‑6 + Local Disk for Zimbra

To eliminate the original risk, a QNAP TS‑453U‑RP was introduced:

  • 4 × 4TB Seagate enterprise drives
  • RAID‑6 for resilience
  • ext4 + mdadm + LVM
  • exported via NFS/iSCSI to the XCP‑ng hosts

Almost all VDI files lived on this QNAP.

Zimbra was the exception. Heavy I/O during backups caused the QNAP to remount read‑only, so Zimbra was moved to the internal HDD of a DL360 G9. This isolated it from storage stalls and made it far more stable.
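
As a quick check on what the 4 × 4TB RAID‑6 array provides, the capacity arithmetic can be sketched in shell. The function name raid6_usable is made up for this illustration; the formula is the general RAID‑6 rule (two drives' worth of parity), not something read from the QNAP itself.

```shell
#!/bin/sh
# Usable capacity of a RAID-6 array, in the same units as the drive size.
# RAID-6 stores two drives' worth of parity, so usable = (drives - 2) * size.
# raid6_usable is a hypothetical helper name used only for this sketch.
raid6_usable() {
    drives=$1
    size=$2
    # RAID-6 needs at least four member drives.
    if [ "$drives" -lt 4 ]; then
        echo "RAID-6 needs at least 4 drives" >&2
        return 1
    fi
    echo $(( (drives - 2) * size ))
}

# The array described above: 4 drives of 4 TB each.
raid6_usable 4 4
```

For this array that is 8 TB usable out of 16 TB raw, before filesystem overhead, with any two drives able to fail without losing the array.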

The Stability Philosophy: Don’t Fiddle With Working Systems

Once the VMs were running on XCP‑ng, they were left alone. Other than routine package maintenance:

  • apt update
  • apt upgrade
  • do‑release‑upgrade

…nothing was changed unless a package upgrade broke something (Samba config merges being the usual culprit).

This lack of constant tweaking meant:

  • no configuration drift
  • no abandoned experiments
  • no half‑applied fixes
  • no creeping instability

The VMs became extremely stable simply because they were not interfered with.
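
The routine package maintenance described in this section could be wrapped in a small script along these lines. This is a sketch, not the script that was actually used: the DRY_RUN switch and the function names are assumptions added for safety, and apt-get is used as the script-friendly form of the apt commands listed above. Release upgrades are deliberately left out as a manual step.

```shell
#!/bin/sh
# Routine maintenance only: refresh package lists, then apply upgrades.
# DRY_RUN defaults to 1 (an assumption for safety): commands are printed,
# not executed. Set DRY_RUN=0 and run as root to apply them for real.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it unchanged.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

routine_maintenance() {
    # Stop if the package lists cannot be refreshed.
    run apt-get update || return 1
    # No -y here: config-file prompts (such as Samba merges) stay interactive.
    run apt-get upgrade
}

routine_maintenance
```

Keeping the upgrade interactive is a deliberate choice in this sketch, since config merges were the one place where upgrades tended to break things.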

The Catastrophic Shutdown

Years later, the company was forced into liquidation. The infrastructure was not shut down properly. Instead:

  • the QNAP was hard‑powered off
  • the network switch was unplugged
  • fibre uplinks were cut with side cutters
  • some XCP‑ng hosts were left running with no storage
  • no graceful shutdown of any VM
  • no monitoring, no console, no visibility

The environment was left in a half‑alive, half‑dead state.

When the new owners abandoned the equipment, the director obtained written permission to dispose of it. The hardware was recovered and taken home.

The Resurrection

Once reassembled at home:

  • the QNAP powered on
  • the RAID‑6 array came up cleanly
  • the SRs mounted
  • the VDI chains were intact
  • the XCP‑ng hosts rejoined the pool
  • the VMs booted

The only repairs needed were a few manual filesystem checks:

  fsck /

Zimbra, Samba AD (BIND9 + DHCP), and the file server all came back online.

No VDI corruption. No metadata loss. No rebuilds. No data loss.
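
When running manual filesystem checks like the ones above, it helps to know what fsck's exit status means, since it is a bitwise OR of several flags documented in fsck(8). The helper below is a minimal sketch for decoding it; the name describe_fsck_status is hypothetical and the wording of each message is paraphrased from the man page.

```shell
#!/bin/sh
# Decode an fsck exit status, which is a bitwise OR of the flags documented
# in fsck(8). describe_fsck_status is a hypothetical helper for this sketch.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    result=""
    for pair in "1:errors corrected" "2:reboot required" \
                "4:errors left uncorrected" "8:operational error" \
                "16:usage error" "32:cancelled by user" \
                "128:shared library error"; do
        bit=${pair%%:*}
        text=${pair#*:}
        # Append the description for every flag set in the status.
        if [ $(( status & bit )) -ne 0 ]; then
            result="${result:+$result, }$text"
        fi
    done
    echo "$result"
}

# Example: status 3 means errors were corrected and a reboot is needed.
describe_fsck_status 3
```

In practice the function would be called with $? immediately after the fsck run. A status with the 4 bit set is the one to worry about, because it means errors remain on the filesystem.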

Why It Survived

The recovery was not luck. It was the natural outcome of the engineering decisions made years earlier:

  • Linux VMs with ext4 journals tolerate abrupt shutdowns
  • XCP‑ng froze the VMs when storage disappeared instead of corrupting VDI metadata
  • QNAP RAID‑6 with enterprise drives survives power loss well
  • VDI chains were short and healthy
  • Zimbra was isolated on local storage
  • No configuration drift accumulated over the years
  • No proprietary VMware dependencies remained
  • Backups existed even though they weren’t needed

The system survived because it was simple, clean, and consistent.
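
To make "short VDI chains" concrete, the depth of a chain can be counted by walking from a leaf disk back to its base image. The sketch below assumes a hypothetical input format of "child parent" pairs, one per line, with "-" marking a base image; it does not parse the output of any real XCP‑ng tool, and chain_depth is an illustrative name.

```shell
#!/bin/sh
# Count the links in a VDI chain from a leaf back to its base image.
# Input on stdin: one "child parent" pair per line, with "-" as the parent
# of a base image. This input format is hypothetical, chosen for the sketch.
chain_depth() {
    awk -v start="$1" '
        { parent[$1] = $2 }
        END {
            depth = 1
            node = start
            while ((node in parent) && parent[node] != "-") {
                node = parent[node]
                depth++
                # Guard against a malformed listing that loops back on itself.
                if (depth > 1000) { print "loop detected"; exit 1 }
            }
            print depth
        }'
}

# Example: a leaf with one snapshot parent on top of a base image.
printf '%s\n' "leaf snap" "snap base" "base -" | chain_depth leaf
```

Every extra link is another file that must be intact for the disk to be readable, which is why a depth of two or three is easier to trust after an unclean shutdown than a long incremental history.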

Lessons Learned

  1. Stability comes from not constantly tweaking systems that already work.
  2. Linux VMs with an ext4 journal can come back from an abrupt power loss with little more than a manual fsck.
  3. An open hypervisor like XCP‑ng can recover cleanly from an unclean shutdown using only standard tools.
  4. RAID‑6 with enterprise drives is worth the overhead.
  5. Short VDI chains are far more reliable than long incremental histories.
  6. Local storage for I/O‑heavy workloads (like Zimbra) prevents cascading failures.
  7. Backups matter — but clean engineering matters even more.

Conclusion

This environment began in a risky place: VMware with no backups. It evolved into a stable, resilient platform through careful migration, simple design, and disciplined maintenance. Years later, even after a destructive shutdown and physical damage, the system resurrected itself with minimal repair.

Good engineering doesn’t just work on good days. It survives the bad ones.