The November 5th HPC Systems Maintenance activities are complete. Queues on Casper and Derecho have been restarted, and jobs are running again. A number of jobs were inadvertently started on Derecho early in the outage window, while the system was going down, and had to be killed. The owners of these jobs have been notified and will need to resubmit them.
Planned maintenance activities had to be adjusted because of last-minute changes in the vendor’s software readiness. The upgrade of the Spectrum Scale client software was postponed because the vendor reported technical issues with its latest version. Spectrum Scale, which powers the Campaign Storage, Home, and Work filesystems, still requires an update, but a new timeline has not been set. CISL staff did, however, apply a recent patch to the Lustre client software, which we anticipate will resolve the node instability issues seen lately on Derecho. These issues have particularly affected certain machine learning workflows on GPU nodes.
The SAM user accounting database has been streamlined so that all systems now pull user preferences from the 'HPC' resource. As a result, users have a single default shell setting across all resources. This adjustment affects only the small group of users who previously had different login shells on Casper and Derecho. Users can change their default shell at any time by going to the User->Settings section at https://sam.ucar.edu and updating the shell selection for the 'HPC' resource.
Thank you for your patience and understanding with the changing scope of this outage. CISL is committed to optimizing system stability and performance while minimizing system downtime, particularly at this time of year as we head into the AGU and AMS conference season.