Cooling system problems have caused two switches in the Cheyenne InfiniBand high-speed network to fail. The vendor will need a system outage to repair and replace them. The dates for this outage are still being planned. We expect the outage to require 3-5 days of work and to begin no earlier than 3 weeks from now. Users who have the flexibility to defer large node count jobs until after the outage are encouraged to do so when practical.
In the meantime, users will likely see a higher rate of job failures than usual, especially at large node counts. The following symptoms are all likely related to the failed switches:
- ERROR: Extracting flags from IB packet of unknown length
- Transport retry count exceeded on mlx5_0:1/IB
- MPT: rank XXX dropping unexpected RC packet from YYY …, presumed failover
- (no error message, but no output either)
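As a rough aid for spotting affected jobs, the sketch below checks a job's output file for the error signatures listed above. The function name `looks_like_ib_failure` is hypothetical, and treating a missing or empty output file as suspicious is an assumption made here for illustration; adjust it to match how your jobs write output.

```shell
#!/bin/bash
# Sketch only: returns 0 (true) if the given job output file looks like it
# was hit by the network problem, 1 otherwise.
# The function name is hypothetical; the search strings are fixed substrings
# of the error messages quoted in this notice.
looks_like_ib_failure() {
    local logfile="$1"

    # Assumption: a missing or empty output file corresponds to the
    # "no error message, but no output" symptom.
    [ -s "$logfile" ] || return 0

    # -F matches fixed strings (no regex), -q suppresses output.
    grep -q -F \
        -e "Extracting flags from IB packet of unknown length" \
        -e "Transport retry count exceeded" \
        -e "dropping unexpected RC packet" \
        -- "$logfile"
}
```

For example, `looks_like_ib_failure myjob.o12345 && echo "consider resubmitting"` would flag a suspect run, where `myjob.o12345` is a placeholder output file name. A match is only a hint: the same job can fail for unrelated reasons, so review the output before resubmitting.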
Unfortunately, remediation options are limited at the moment. Users are encouraged to resubmit failed jobs and, for jobs requiring 250 nodes or fewer, to optionally include the PBS directive "#PBS -l place=group=rack" in their batch scripts. This directive asks PBS to select nodes from the same rack, which may reduce, but is unlikely to eliminate, the impact of the failed switches.
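A minimal sketch of where the placement directive sits in a batch script is shown below. Only the `place=group=rack` line comes from this notice; the job name, project code, queue, walltime, and node and core counts are placeholders assumed for illustration and should be replaced with the values your job normally uses.

```shell
#!/bin/bash
# Placeholder values throughout; only the place=group=rack line is taken
# from this notice.
#PBS -N example_job
#PBS -A PROJECT_CODE
#PBS -q regular
#PBS -l walltime=01:00:00
#PBS -l select=128:ncpus=36:mpiprocs=36
# Ask PBS to select all nodes from the same rack (jobs of 250 nodes or fewer).
#PBS -l place=group=rack

# Launch the application here exactly as in your existing script.
```

Because the request constrains which nodes PBS can choose, a job using it may wait longer in the queue than one without it.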