NovaCloud-Hosting - Multiple Servers at EYG1 location unreachable – Incident details

Multiple Servers at EYG1 location unreachable

Resolved
Major outage
Started 23 days ago
Lasted 38 minutes

Affected

SkyLink Data Center

Major outage from 7:33 AM to 7:41 AM, Operational from 7:41 AM to 7:56 AM, Major outage from 7:56 AM to 8:11 AM, Operational from 8:11 AM to 12:00 AM

Game-Cloud Infrastructure

Major outage from 7:33 AM to 7:41 AM, Operational from 7:41 AM to 12:00 AM

Game-DB-01-FFM1

Major outage from 7:33 AM to 7:41 AM, Operational from 7:41 AM to 12:00 AM

VPS Hypervisor EYG1

Major outage from 7:33 AM to 7:41 AM, Operational from 7:41 AM to 7:56 AM, Major outage from 7:56 AM to 8:11 AM, Operational from 8:11 AM to 12:00 AM

RYZEN-01-VHOST Gen-3

Major outage from 7:33 AM to 7:41 AM, Operational from 7:41 AM to 12:00 AM

RYZEN-03-VHOST Gen-3

Major outage from 7:33 AM to 7:41 AM, Operational from 7:41 AM to 12:00 AM

Updates
  • Postmortem

    On the morning of April 23, a brief utility power failure occurred at the EYG1 data center. One of the Uninterruptible Power Supply (UPS) units failed to engage, causing a total loss of power on one of the power feeds. Servers with only a single Power Supply Unit (PSU) connected to this feed lost power immediately. While most systems recovered automatically, the host RYZEN-07-VHOST-EYG1 suffered a secondary hardware issue that required manual intervention.

    Root Cause Analysis

    1. Power Feed Failure: An external power dip was not mitigated by a specific UPS system, leading to a localized blackout on one power feed.

    2. Lack of Redundancy: Impacted Ryzen-based servers utilized non-redundant power configurations. Consequently, the loss of a single feed resulted in an immediate power loss for these machines.

    3. Hardware Damage: The abrupt shutdown triggered a hardware-level fault on host RYZEN-07-VHOST-EYG1, preventing it from rebooting autonomously.
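
    The redundancy gap described in point 2 can be checked with a simple inventory audit. The following is a minimal sketch, not the provider's actual tooling: the inventory layout, the feed labels "A" and "B", the two example host names, and the assignment of RYZEN-07-VHOST-EYG1 to feed "A" are all assumptions made for illustration.

```python
# Hypothetical inventory: host name -> power feeds its PSUs are connected to.
# Only RYZEN-07-VHOST-EYG1 is a real name from the report; the feed labels,
# its feed assignment, and the other hosts are made up for this sketch.
INVENTORY = {
    "RYZEN-07-VHOST-EYG1": ["A"],
    "example-dual-psu-host": ["A", "B"],
    "example-single-psu-host": ["B"],
}


def single_feed_hosts(inventory):
    """Return hosts that depend on fewer than two distinct power feeds."""
    return sorted(h for h, feeds in inventory.items() if len(set(feeds)) < 2)


def hosts_losing_power(inventory, failed_feed):
    """Return hosts whose PSUs are all connected to the failed feed."""
    return sorted(
        h for h, feeds in inventory.items()
        if feeds and set(feeds) <= {failed_feed}
    )
```

    With this sample data, `hosts_losing_power(INVENTORY, "A")` lists only the single-PSU host on feed "A", while the dual-PSU host stays up because it still draws power from feed "B".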

    Incident Timeline

    • 09:33 AM: Monitoring alerts triggered for multiple host systems at EYG1. Investigation initiated.

    • 09:41 AM: Issue identified as a power feed failure. Most affected servers had already restarted successfully and entered monitoring status.

    • 09:56 AM: All systems recovered except for RYZEN-07-VHOST-EYG1. An on-site technician was dispatched to address the hardware failure.

    • 10:11 AM: Hardware repair completed. RYZEN-07-VHOST-EYG1 initiated startup, and all hosted VPS returned to online status shortly thereafter.
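
    The time spent in each phase follows directly from the timeline entries. A short sketch of that arithmetic, using only the timestamps listed above (the phase labels and function names are illustrative):

```python
from datetime import datetime

# Timestamps copied from the incident timeline; the date is omitted because
# only the differences between entries matter. Labels are shortened summaries.
TIMELINE = [
    ("09:33 AM", "alerts triggered"),
    ("09:41 AM", "power feed failure identified"),
    ("09:56 AM", "technician dispatched"),
    ("10:11 AM", "hardware repair completed"),
]


def minutes_between(start, end):
    """Whole minutes elapsed between two same-day 12-hour timestamps."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(timeline):
    """Pair each entry with the next one and compute the minutes between them."""
    return [
        (a[1], b[1], minutes_between(a[0], b[0]))
        for a, b in zip(timeline, timeline[1:])
    ]


total_minutes = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
```

    The phases come to 8, 15, and 15 minutes, and the total from first alert to completed repair is 38 minutes, which matches the duration reported for the incident.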

  • Resolved

    RYZEN-07-VHOST-EYG1 is now starting up. All VPS on it should be online within the next 5 minutes.

  • Identified

    RYZEN-07-VHOST-EYG1 is still unreachable. A technician has already been informed about this incident.

  • Monitoring

    All affected servers have been started. This appears to be a power-related issue on one of the power feeds, which affected half the servers with non-redundant PSUs. Therefore, only AMD Ryzen VPS in the Netherlands were partially affected.

    If your server is still unreachable, please create a support ticket so we can address the issue.

  • Investigating

    Several of our host systems at the EYG1 location are currently unreachable. We are investigating the issue.