NovaCloud-Hosting - Network Problems – Incident details

Network Problems

Resolved
Major outage
Started 5 months ago
Lasted about 6 hours

Affected

Website

Major outage from 10:15 PM to 11:04 PM, Operational from 11:04 PM to 11:26 PM, Partial outage from 11:26 PM to 11:51 PM, Operational from 11:51 PM to 4:28 AM

Dashboard

Major outage from 10:15 PM to 11:04 PM, Operational from 11:04 PM to 11:26 PM, Partial outage from 11:26 PM to 11:51 PM, Operational from 11:51 PM to 4:28 AM

Maincubes-FRA01

Major outage from 10:15 PM to 11:04 PM, Operational from 11:04 PM to 11:26 PM, Partial outage from 11:26 PM to 11:51 PM, Operational from 11:51 PM to 4:28 AM

Root-Server Vhosts FRA1

Major outage from 10:15 PM to 11:04 PM, Operational from 11:04 PM to 11:26 PM, Partial outage from 11:26 PM to 11:51 PM, Operational from 11:51 PM to 4:28 AM

RYZEN-02-VHOST Gen-2

Major outage from 10:15 PM to 11:04 PM, Operational from 11:04 PM to 11:26 PM, Partial outage from 11:26 PM to 11:51 PM, Operational from 11:51 PM to 4:28 AM

RYZEN-03-VHOST Gen-2

Major outage from 10:15 PM to 11:04 PM, Operational from 11:04 PM to 11:26 PM, Partial outage from 11:26 PM to 11:51 PM, Operational from 11:51 PM to 4:28 AM

Updates
  • Resolved

    This incident has been resolved.

    Update from ISP:

    „On 23.11.2024 at 23:12 (CET), our network experienced a sophisticated attack directly targeting our edge routing infrastructure. Because of the attack's complexity, our initial debugging efforts mistakenly concluded that we were dealing with a severe hardware failure in our edge routing equipment and that our redundancy mechanisms were not working as expected. Manually triggering failovers to standby hardware, intended to rule out defective components, did not resolve the problem, so we intentionally disconnected our network from the internet to isolate and restore components one by one. This approach proved effective, and services began to recover at 23:41. IPv6 connectivity was restored quickly, while the last IPv4 prefixes came back online at 23:55. Unfortunately, the attackers quickly adjusted their methods and temporarily bypassed the newly implemented filters. As a result, another brief IPv4 outage occurred between 00:22 and 00:35, while IPv6 connectivity had remained stable since the first fix was implemented.

    In the six hours since the incident began, our team has worked through the night with our upstream providers to implement permanent fixes against these new attack vectors. Although the network has faced further attacks in the meantime, it has remained stable, as our countermeasures have proven effective. Nevertheless, we will remain on high alert over the coming hours and days so that we can respond swiftly to any new attack patterns.

    This incident marks the first outage of our entire edge routing infrastructure since the launch of AS58212 nearly five years ago, and it affected every single dataforest customer, including IP transit services across all datacenters. Such an outage falls far short of the standards we set for ourselves, and we deeply apologize for the disruption caused to our customers. It is particularly frustrating that this attack succeeded even though we mitigate hundreds of similar attacks every day without our customers noticing. In this case, the attack overwhelmed our routers because our filtering was insufficient against this unprecedented level of complexity.

    Please note that this incident was purely a reachability issue. There was no power outage, no hacking attempt, and no data breach.“

  • Monitoring

    Our ISP has updated the incident:

    "Another fix was implemented a few minutes ago, and operation has been stable since then. We are working on a permanent solution to prevent further outages. If any occur, we will report them here."

  • Identified
    We are continuing to work on a fix for this incident.
  • Update

    Our ISP has de-escalated the incident:

    "We have implemented a fix. The network has now been stable for about five minutes. IPv6 was less affected than IPv4 and remained available for most of the issue, while IPv4 was heavily affected, leading to massive packet loss over a longer period of time. A post-mortem will follow tomorrow. Please rest assured that we are monitoring the status of our edge routing devices 24/7 and will take action immediately if needed."

    We are sorry for the inconvenience caused by this rare incident.

  • Monitoring

    The network is online at the moment, but we are waiting for confirmation from our ISP.

  • Identified

    Update from ISP:
    "We have identified the issue and are working on a solution."

  • Update

    Update from ISP:

    "We are currently experiencing an outage in our edge routing and are investigating the situation."

  • Investigating

    We have detected an infrastructure failure at Maincubes FRA01. The problem is already being investigated.