Daily Bulletin

REMINDER: Cheyenne downtime planned July 31-August 4 to address ongoing network performance issues

July 21, 2023

The Cheyenne InfiniBand high-speed network continues to operate in a degraded state after the failure of two switches. CISL and HPE staff are planning to repair the network during a full system downtime the week of July 31.

The precise duration of the downtime will not be known until the full extent of the remediation work is identified; some of that can be determined only after the downtime begins and the required disassembly is under way. The best estimate is 3-4 days, with an unlikely possibility of 5 full days. Progress will be communicated through Notifier emails during the week.

During this time, all Cheyenne compute nodes will be unavailable. Login nodes are expected to remain up. The user cron services associated with Cheyenne will be restarted briefly during the downtime window. Scheduler reservations will be put in place to ensure that all running user jobs have completed before the downtime begins on July 31. Any jobs still queued when the downtime begins will be retained and executed when the systems return to service.

In the interim, users may continue to experience a higher-than-typical rate of job failures, particularly at large node counts. The following error messages and symptoms are likely related to the degraded network:

  • ERROR: Extracting flags from IB packet of unknown length
  • Transport retry count exceeded on mlx5_0:1/IB
  • MPT: rank XXX dropping unexpected RC packet from YYY …, presumed failover
  • Hung applications that eventually time out
  • (no error message, but no output, either)

These messages may occur in application logs, and the failure modes can include immediate job termination or application hangs.

Until the network is repaired, remediation options remain limited. Users are encouraged to resubmit failed jobs and, for jobs requiring 250 nodes or fewer, to include the PBS directive “#PBS -l place=group=rack” in their batch scripts. This directs PBS to select nodes from the same rack, which may reduce, though likely not eliminate, the impact of the failed switches. Users are also encouraged to contact the NCAR Research Computing (RC) Helpdesk to request core-hour refunds if they are significantly impacted by these ongoing disruptions.
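As a sketch only, a batch script using the placement directive might look like the following. The job name, project code, node/CPU counts, walltime, and queue name are illustrative placeholders, not prescribed values; only the “place=group=rack” line is the workaround described above.

```shell
#!/bin/bash
#PBS -N example_job                        # hypothetical job name
#PBS -A PROJECT_CODE                       # placeholder project/account code
#PBS -l select=200:ncpus=36:mpiprocs=36    # illustrative request, under the 250-node threshold
#PBS -l place=group=rack                   # ask PBS to place all chunks within a single rack
#PBS -l walltime=01:00:00                  # illustrative walltime
#PBS -q regular                            # illustrative queue name

# The application launch command would go here, e.g. an mpiexec invocation.
```

Because the placement request applies to the whole job, it belongs in the directive block (or on the qsub command line) rather than inside the script body.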