NovaCloud-Hosting - XEON01-VHOST-EYG1 outage – Incident details

XEON01-VHOST-EYG1 outage

Resolved
Partial outage 75 %
Started about 2 months ago
Lasted 31 minutes

Affected

SkyLink Data Center

Partial outage from 5:06 PM to 5:29 PM, Operational from 5:29 PM to 5:36 PM

VPS Hypervisor EYG1

Partial outage from 5:06 PM to 5:29 PM, Operational from 5:29 PM to 5:36 PM

XEON-01-VHOST

Partial outage from 5:06 PM to 5:29 PM, Operational from 5:29 PM to 5:36 PM

Updates
  • Resolved

    Around 75% of VPS on XEON01-VHOST-EYG1 were affected by this outage.

    Due to a software bug, a backup task for a large VPS wrote its backup file to the primary NVMe SSD pool instead of the designated backup storage. The backup file grew to approximately 4 TB, while the NVMe pool had only around 3 TB of free capacity. This caused the storage pool to reach 100% utilization.
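
    To illustrate the capacity mismatch, here is a minimal sketch of a pre-flight check that compares an estimated backup size with the free space on a target. It is not our production code: the function names and the optional reserve are assumptions made for illustration, and only `shutil.disk_usage` is a real standard-library call.

```python
import shutil

TB = 1000**4  # decimal terabyte, in bytes


def fits_on_target(estimated_bytes: int, free_bytes: int, reserve_bytes: int = 0) -> bool:
    """Return True if the estimated backup fits into the free space
    while leaving at least reserve_bytes untouched."""
    return estimated_bytes + reserve_bytes <= free_bytes


def free_bytes_at(path: str) -> int:
    """Free space, in bytes, of the filesystem that contains path."""
    return shutil.disk_usage(path).free


# Figures from this incident: a backup of about 4 TB against about 3 TB free.
print(fits_on_target(4 * TB, 3 * TB))  # False
```

    With a check of this kind in front of the job, a backup of roughly 4 TB would be refused before any data is written to a pool with roughly 3 TB free.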

    Once the primary storage was full, affected virtual machines could no longer write to disk. This resulted in widespread I/O errors and VM crashes.

    As soon as the issue was identified, we removed the incorrectly created backup and began force-restarting all affected VPS instances to restore service as quickly as possible.

    Preventive Measures

    We have implemented safeguards to ensure this cannot happen again:

    • Backups can no longer be created on the primary NVMe storage pool under any circumstances.

    • Additional validation has been added to prevent incorrect backup targets.

    • We are rolling out improved disk usage monitoring and alerting across all host systems to ensure our team is notified before storage reaches critical levels.
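
    The safeguards above can be sketched roughly as follows. This is an illustrative example only, not our actual implementation: the mount points `/mnt/nvme-primary` and `/mnt/backup`, the function names, and the 80% / 90% thresholds are all assumptions chosen for the sketch.

```python
from pathlib import PurePosixPath

# Hypothetical mount points; the real paths are not part of this report.
PRIMARY_POOL = PurePosixPath("/mnt/nvme-primary")
BACKUP_STORAGE = PurePosixPath("/mnt/backup")


def _is_inside(path: PurePosixPath, root: PurePosixPath) -> bool:
    """True if path is root itself or lies somewhere below it."""
    return path == root or root in path.parents


def is_valid_backup_target(target: str) -> bool:
    """Accept only absolute paths inside the backup storage,
    and never a path inside the primary pool."""
    path = PurePosixPath(target)
    # Reject relative paths and ".." segments, which could escape the backup root.
    if not path.is_absolute() or ".." in path.parts:
        return False
    return _is_inside(path, BACKUP_STORAGE) and not _is_inside(path, PRIMARY_POOL)


def usage_alert(used_bytes: int, total_bytes: int,
                warn: float = 0.80, critical: float = 0.90) -> str:
    """Classify pool utilization as 'ok', 'warning', or 'critical'.
    The thresholds are assumed values, not the ones used in production."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


print(is_valid_backup_target("/mnt/nvme-primary/vps.img"))  # False
print(usage_alert(85, 100))  # warning
```

    Rejecting the target before the job starts and alerting below 100% utilization are independent layers, so a failure in one still leaves the other in place.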

    We sincerely apologize for this incident.
    If you are still experiencing any issues caused by this outage, please open a support ticket immediately so we can investigate and resolve it as quickly as possible.

  • Monitoring

    All VPS have been started. If your VPS is not working within the next 5 minutes, please open a support ticket.

  • Identified

    We have found the underlying issue and fixed the core problem. All VPS should start up within 5 to 10 minutes.

  • Investigating
    We are currently investigating this incident.