Upon another physical inspection of node ATLRYZEN101, we discovered that one of the drive retention clips on this server has become loose and is not as secure as a drive retention clip should be. We believe this could be one of the reasons this node is experiencing unexpected outages, as a loose drive can cause random I/O errors.
Because this node is continuing to experience symptoms, we are taking this server offline and moving it to the workbench to replace the problematic drive retention clip with a new one. Once this is done, we will update this status incident and continue to monitor the node to ensure it remains healthy.
Thank you for your patience. We will have this node back online as soon as possible.
UPDATE: Facility hands are working on this server. We will share an additional update once available.
UPDATE #2: This node is continuing to experience symptoms, and on-site technicians are actively working on a resolution. So far we have attempted full RAM replacements and re-seated the NVMe drives on this system (and confirmed that the drives are healthy), but the symptoms are persisting. Before we put this machine back in production, we would like to identify the problematic hardware component in order to ensure long-term stability. We are currently running hardware diagnostics and closely monitoring the server on a physical level to identify the component causing these issues. Thank you for your continued patience. We will share another update on this status incident once one is available.
UPDATE #3: The current ETA for service restoration is approximately 9:45 AM EST. It could be sooner, but this estimate is based on the rate at which the current hardware diagnostics are running.
UPDATE: We have restored the services hosted on this node, and VMs should now be accessible (if you notice your VM is offline or you are having any issues, please feel free to open a ticket). We will mark this status incident as resolved for the time being. In the background, we are continuing to actively monitor the status and health of this node to ensure its stability. Thank you for your understanding!