HPC Systems Downtime October 23–27 to Update Casper and Perform File System Migrations

October 13, 2023

All CISL HPC systems will be down for scheduled maintenance the week of October 23–27 to deploy a new operating system on Casper and to perform file system migrations in preparation for Cheyenne's retirement at the end of this year. During this outage, CISL engineers will relocate a number of file system datasets, including /glade/work, to new hardware.

We will update the operating system on Casper nodes to an OpenSUSE installation for better compatibility with Derecho. Users are encouraged to evaluate the new operating system environment on the test Casper deployment by logging in directly via ssh to a demonstration node, casper01.hpc.ucar.edu. For additional information, see the #casper-users channel in our NCAR HPC Users Group Slack workspace and this Daily Bulletin item.

During the outage, all Cheyenne, Casper, and Derecho compute nodes will be unavailable. Login nodes will be unavailable at the beginning of the outage window to allow for file system migrations. Scheduler reservations will be put in place to ensure that all user jobs have completed before the downtime begins on October 23. Any jobs that are queued on Cheyenne and Derecho when the downtime begins will be retained for execution when the systems return to service. Any jobs remaining in the Casper queues at the beginning of the outage will be deleted from the queues, as the system migration would almost certainly preclude successful execution.

Globus transfer services will be paused during the first two days of the outage while the file system migrations are in progress.

Our intention is to return Cheyenne to service as quickly as possible, with a planning date of the morning of Wednesday, October 25. Cheyenne will be followed by Derecho and finally Casper; precise timing will be determined by maintenance events as they unfold.

Progress will be communicated through the Notifier system throughout the course of the outage.