The Cheyenne cluster is back online after a regional power incident disrupted operations at the NCAR-Wyoming Supercomputing Center on Saturday, March 18, from approximately 10 a.m. to 7:30 p.m. MDT. As a result of the power disruption, Cheyenne is operating with one switch offline and one switch in a degraded state.
CISL engineers have found that some applications may run with reduced performance because of the offline and degraded switches, while others will perform as expected. The jobs most affected are larger jobs that stress all-to-all network performance. We have observed slowdowns of up to 40% in the worst case for this class of application. Unfortunately, there is no easy fix until the network hardware is repaired.
If your application is running with reduced performance, the best options for the moment are to request a longer wall-clock limit in your PBS job script or to shorten the simulation length of your jobs. CISL and HPE are working to replace the faulty network hardware as their highest priority. In the meantime, please contact CISL support with any questions.
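
As a rough illustration of the wall-clock adjustment suggested above, the Python sketch below pads a job's usual run time by a chosen percentage and formats the result as HH:MM:SS. It assumes the 40% worst-case slowdown means roughly 40% more run time, which is an interpretation rather than something stated in this notice. The function name `padded_walltime` and the 6-hour baseline are hypothetical.

```python
def padded_walltime(baseline_seconds: int, slowdown_percent: int = 40) -> str:
    """Pad a job's usual run time and format it as HH:MM:SS.

    ASSUMPTION: the slowdown is treated as a percentage increase in run time.
    Integer arithmetic is used so the result is rounded up to a whole second
    without floating-point error.
    """
    scaled = baseline_seconds * (100 + slowdown_percent)
    total = -(-scaled // 100)  # ceiling division
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Hypothetical example: a job that normally finishes in 6 hours.
print(padded_walltime(6 * 3600))  # prints 08:24:00
```

The resulting value could be used in a job script on a line such as `#PBS -l walltime=08:24:00`. If the padded value exceeds the wall-clock limit of the queue you are using, shortening the simulation length is the other option.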