Daily Bulletin

Cheyenne downtime planned August 7-11 to address ongoing network performance issues

June 26, 2023

The Cheyenne InfiniBand high-speed network continues to operate in a degraded state after the failure of two switches. CISL and HPE staff are planning to repair the network during a full system downtime the week of August 7. 

As this repair is a very high priority, we are actively pursuing options to accelerate the work and schedule an earlier maintenance window if possible.

The precise duration of the downtime will not be known until the full extent of the remediation work is identified; some of that work can only be assessed once the downtime begins and the hardware is disassembled. The best estimate is 3-4 days, with the possibility of 5 full days. Progress will be communicated through Notifier emails during the week.

During this time, all Cheyenne compute nodes will be unavailable. Login nodes are expected to remain up. The user cron services associated with Cheyenne will be rebooted briefly during the downtime window. Scheduler reservations will be put in place to ensure that all user jobs have completed before the downtime begins on August 7. Any jobs still queued when the downtime begins will be retained for execution when the systems return to service.
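
For reference, jobs that remain in the queue can be listed with the standard PBS qstat command, for example:

    # Show all queued and running jobs for the current user.
    qstat -u $USER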

In the interim, users may continue to experience a higher rate of job failures than is typical, particularly at large node counts. The following error messages are likely related to the degraded network:

  • ERROR: Extracting flags from IB packet of unknown length
  • Transport retry count exceeded on mlx5_0:1/IB
  • MPT: rank XXX dropping unexpected RC packet from YYY …, presumed failover
  • Hung applications that eventually time out
  • (no error message, but no output, either)

These messages may occur in application logs, and the failure modes can include immediate job termination or application hangs.
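
As a quick way to check whether recent failures match this pattern, a grep over your job logs along the following lines should work; the "*.o*" filename pattern assumes default PBS output naming and may need adjusting for your workflow:

    # List job output files containing any of the error signatures above.
    grep -El 'unknown length|Transport retry count exceeded|presumed failover' *.o*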

Until the network is repaired, remediation options remain limited. Users are encouraged to resubmit failed jobs and to include the PBS directive “#PBS -l place=group=rack” in their batch scripts when requesting 250 nodes or fewer. This directive asks PBS to select nodes from the same rack, which may reduce, though likely not eliminate, the impact of the failed switches. Users who are significantly impacted by these ongoing disruptions are also encouraged to contact the NCAR Research Computing (RC) Helpdesk to request core-hour refunds.
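
For illustration, the sketch below shows where this directive fits in a batch script. The job name, project code, queue, and resource request are placeholder values, and the launch line assumes an application built with HPE MPT:

    #!/bin/bash
    #PBS -N example_job
    #PBS -A PROJECT_CODE
    #PBS -q regular
    #PBS -j oe
    #PBS -l walltime=01:00:00
    #PBS -l select=100:ncpus=36:mpiprocs=36
    # Request that PBS place all 100 nodes within a single rack
    # (applicable, per this bulletin, to jobs of 250 nodes or fewer):
    #PBS -l place=group=rack

    mpiexec_mpt ./my_app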