Summary

This Insight looks at the lifecycle of a small virtualisation platform that began on VMware with no backups, migrated to XenServer/XCP‑ng, and ultimately survived a catastrophic, unplanned shutdown years later. The stability of the system came not from expensive hardware or complex software, but from simple engineering principles: clean migrations, minimal configuration drift, resilient storage, and a disciplined “don’t touch it unless it’s broken” approach.

The Beginning: VMware With No Backups

The environment started life on VMware ESXi Free Edition. For two years, the business ran its core services — Samba Active Directory, Zimbra email, and a file server — with no backups at all. No snapshots, no replication, no export automation. If the datastore had failed, everything would have been lost.

This risk drove the decision to replace the platform entirely. The goal was simple: build something recoverable.

Migration to XenServer / XCP‑ng

The first step was to export the VMware VMs, remove VMware Tools, and import them into XenServer. This process effectively “cleaned” the VMs:

proprietary VMware drivers removed
virtual hardware reset
simplified device trees
clean bootloaders
no leftover VMware configuration rot
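
The source does not say which tools performed the export and import. As a minimal sketch, assuming the disks were converted with qemu-img (an assumption, not the author's stated method), the VMDK-to-VHD step could look like the following. The helper name convert_vmdk_to_vhd, the DRY_RUN variable, and the file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print (or run) the qemu-img command that converts
# a VMware VMDK disk into a VHD file of the kind XenServer/XCP-ng uses.
# Assumption: qemu-img is installed; the source does not name its tooling.
convert_vmdk_to_vhd() {
    src="$1"
    dst="${src%.vmdk}.vhd"          # swap the .vmdk suffix for .vhd
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default to a dry run so nothing is written by accident.
        echo "qemu-img convert -f vmdk -O vpc $src $dst"
    else
        qemu-img convert -f vmdk -O vpc "$src" "$dst"
    fi
}

# Illustrative file name only.
convert_vmdk_to_vhd "fileserver.vmdk"
```

Setting DRY_RUN=0 would run the conversion for real. The resulting VHD would still have to be imported into a storage repository, which is not shown here.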

Once on XenServer (and later XCP‑ng), the VMs ran for the next seven years.

Storage Redesign: QNAP RAID‑6 + Local Disk for Zimbra

To eliminate the original risk, a QNAP TS‑453U‑RP was introduced:

4 × 4TB Seagate enterprise drives
RAID‑6 for resilience
ext4 + mdadm + LVM
exported via NFS/iSCSI to the XCP‑ng hosts
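
The cost of RAID‑6 is easy to work out: two drives' worth of space goes to parity, so raw usable capacity is (drives - 2) × drive size, and any two drives can fail without data loss. The function below is an illustrative sketch (the name raid6_usable_tb is hypothetical) and ignores filesystem and LVM overhead.

```shell
#!/bin/sh
# RAID-6 keeps two drives' worth of parity, so usable = (drives - 2) * size.
# Sizes are whole terabytes; filesystem overhead is not modelled.
raid6_usable_tb() {
    drives="$1"
    size_tb="$2"
    if [ "$drives" -lt 4 ]; then
        echo "RAID-6 needs at least 4 drives" >&2
        return 1
    fi
    echo $(( ($drives - 2) * $size_tb ))
}

# The array described above: 4 drives of 4 TB each.
raid6_usable_tb 4 4   # prints 8
```

For this QNAP that means roughly 8 TB of raw usable space out of 16 TB installed, which is the overhead accepted in exchange for two-drive fault tolerance.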

Almost all of the virtual disk images (VDIs) lived on this QNAP.

Zimbra was the exception. Heavy I/O during backups caused the QNAP to remount read‑only, so Zimbra was moved to the internal HDD of a DL360 G9. This isolated it from storage stalls and made it far more stable.

The Stability Philosophy: Don’t Fiddle With Working Systems

Once the VMs were running on XCP‑ng, they were left alone. The only routine maintenance was:

apt update
apt upgrade
do-release-upgrade

Beyond that, nothing was changed unless a package upgrade broke something (Samba config merges being the usual culprit).
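
Since config merges were the usual source of breakage, one low-effort safeguard is to snapshot a config file before upgrading and compare it afterwards. This is a sketch, not something the source describes doing; the helper names and the .pre-upgrade suffix are hypothetical, and the demonstration uses a throwaway file rather than a real smb.conf.

```shell
#!/bin/sh
# Hypothetical helpers: names and the ".pre-upgrade" suffix are illustrative.
snapshot_config() {
    # Copy a config file next to itself, preserving mode and timestamps.
    cp -p "$1" "$1.pre-upgrade"
}

config_changed() {
    # Exit status 0 if the file now differs from its pre-upgrade snapshot.
    ! cmp -s "$1" "$1.pre-upgrade"
}

# Demonstration on a temporary file standing in for a Samba config.
demo=$(mktemp)
echo "[global]" > "$demo"
snapshot_config "$demo"
echo "    workgroup = EXAMPLE" >> "$demo"   # stands in for a package-merged change
if config_changed "$demo"; then
    # diff exits non-zero when files differ, so ignore its status here.
    diff -u "$demo.pre-upgrade" "$demo" || true
fi
rm -f "$demo" "$demo.pre-upgrade"
```

Run before and after an upgrade, this shows exactly what a merge altered, which keeps any fix small and deliberate.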

This lack of constant tweaking meant:

no configuration drift
no abandoned experiments
no half‑applied fixes
no creeping instability

The VMs became extremely stable simply because they were not interfered with.

The Catastrophic Shutdown

Years later, the company was forced into liquidation. The infrastructure was not shut down properly. Instead:

the QNAP was hard‑powered off
the network switch was unplugged
fibre uplinks were cut with side cutters
some XCP‑ng hosts were left running with no storage
no graceful shutdown of any VM
no monitoring, no console, no visibility

The environment was left in a half‑alive, half‑dead state.

When the new owners abandoned the equipment, the director obtained written permission to dispose of it. The hardware was recovered and taken home.

The Resurrection

Once reassembled at home:

the QNAP powered on
the RAID‑6 array came up cleanly
the SRs mounted
the VDI chains were intact
the XCP‑ng hosts rejoined the pool
the VMs booted

The only repairs needed were a few manual filesystem checks:

fsck /

Zimbra, Samba AD (Bind9 + DHCP), and the file server all came back online.

No VDI corruption. No metadata loss. No rebuilds. No data loss.
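
On an mdadm-based array, "came up cleanly" can be confirmed from /proc/mdstat, where each array reports its members as a string such as [UUUU] and a missing member appears as an underscore. The sketch below assumes that standard format; the function name md_degraded is hypothetical, and the sample text is illustrative rather than captured from the QNAP.

```shell
#!/bin/sh
# Reads /proc/mdstat-style text on stdin.
# Exit status 0 means at least one array shows a missing member ("_").
md_degraded() {
    grep -Eq '\[[U_]*_[U_]*\]'
}

# Illustrative sample in the format the Linux md driver prints.
sample='md0 : active raid6 sdd3[3] sdc3[2] sdb3[1] sda3[0]
      7794127232 blocks super 1.2 level 6, 512k chunk, algorithm 2 [4/4] [UUUU]'

if printf '%s\n' "$sample" | md_degraded; then
    echo "degraded"
else
    echo "healthy"
fi
```

On a live system the same check would read the real file: md_degraded < /proc/mdstat.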

Why It Survived

The recovery was not luck. It was the natural outcome of the engineering decisions made years earlier:

Linux VMs with ext4 journals tolerate abrupt shutdowns
XCP‑ng freezes VMs when storage disappears instead of corrupting metadata
QNAP RAID‑6 with enterprise drives survives power loss well
VDI chains were short and healthy
Zimbra was isolated on local storage
No configuration drift accumulated over the years
No proprietary VMware dependencies remained
Backups existed even though they weren’t needed
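
"Short" VDI chains can be made measurable. The sketch below assumes the parent relationship of each disk has already been listed as "child parent" pairs, with "none" marking a base image; how those pairs are extracted from the storage repository is not shown and would depend on the host tooling. The function name max_chain_depth and the disk names are hypothetical.

```shell
#!/bin/sh
# Input on stdin: one "child parent" pair per line; a base image has parent "none".
# Prints the length of the longest chain, counted in disks from leaf to base.
max_chain_depth() {
    awk '
        { parent[$1] = $2 }
        END {
            max = 0
            for (v in parent) {
                depth = 1
                p = parent[v]
                # Walk towards the base; the depth cap guards against loops.
                while (p != "none" && (p in parent) && depth < 1000) {
                    depth++
                    p = parent[p]
                }
                if (depth > max) max = depth
            }
            print max
        }'
}

# Illustrative chain: active disk -> one snapshot -> base image.
printf '%s\n' "base none" "snap1 base" "active snap1" | max_chain_depth   # prints 3
```

A number that stays small over the years is a reasonable sign that snapshots are being cleaned up rather than accumulating.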

The system survived because it was simple, clean, and consistent.

Lessons Learned

Stability comes from not constantly tweaking systems that already work.
Linux VMs with ext4 are extremely resilient under catastrophic conditions.
An open hypervisor like XCP‑ng can recover cleanly from an unclean shutdown.
RAID‑6 with enterprise drives is worth the overhead.
Short VDI chains are far more reliable than long incremental histories.
Local storage for I/O‑heavy workloads (like Zimbra) prevents cascading failures.
Backups matter — but clean engineering matters even more.

Conclusion

This environment began in a risky place: VMware with no backups. It evolved into a stable, resilient platform through careful migration, simple design, and disciplined maintenance. Years later, even after a destructive shutdown and physical damage, the system resurrected itself with minimal repair.

Good engineering doesn’t just work on good days. It survives the bad ones.

Insights: Why This Virtualisation Platform Survived a Catastrophic Shutdown