[osuosl-openpower] MAINTENANCE: OpenStack cluster server moves: Mar 8, 9 & 12

Tue Mar 6 22:36:50 UTC 2018

Service(s) affected:

All VMs hosted on the OpenPOWER OpenStack cluster will be offline for
approximately 2-4 hours during each server move. In addition, any VMs which
have block storage attached to the affected nodes will have an outage.

For a list of affected VMs per hypervisor node, please see the following
spreadsheet, which includes the UUID for each instance as it stands today.
You can find your VM's UUID by looking at the
/run/cloud-init/.instance-id file on your VM. In addition, if you're using
a block storage (cinder) volume, I have a sheet which maps each volume
UUID to its host.
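If you'd rather script the lookup, here is a minimal Python sketch. It only
assumes the /run/cloud-init/.instance-id path mentioned above; the function
name and the AFFECTED_UUIDS set are hypothetical placeholders that you would
fill in with the UUIDs listed for a node in the spreadsheet.

```python
from pathlib import Path

# Path taken from the announcement; present on VMs that run cloud-init.
INSTANCE_ID_FILE = "/run/cloud-init/.instance-id"

# Hypothetical placeholder: paste the UUIDs listed for a node in the spreadsheet.
AFFECTED_UUIDS = set()


def read_instance_id(path=INSTANCE_ID_FILE):
    """Return the instance UUID stored in the file, or None if it is missing."""
    try:
        return Path(path).read_text().strip()
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    uuid = read_instance_id()
    if uuid is None:
        print("No instance-id file found; is cloud-init installed on this VM?")
    elif uuid in AFFECTED_UUIDS:
        print(f"{uuid} is in the list of affected UUIDs")
    else:
        print(f"{uuid} is not in the list you pasted in")
```

Run it on the VM itself, since the file lives inside the guest.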

 OpenStack Cluster Server Moves
<https://docs.google.com/a/osuosl.org/spreadsheets/d/15D3VE13chSn0jmGWpf5wsPsin6ex0B3I6FTwS74T5uY/edit?usp=drive_web>

Outage Windows:

openpower1
Start:   Thu, Mar 8, 9:00AM PST (Thu Mar 8 1700 UTC)
End:    Thu, Mar 8, 11:00AM PST (Thu Mar 8 1900 UTC)

openpower2
Start:   Thu, Mar 8, 3:00PM PST (Thu Mar 8 2300 UTC)
End:    Thu, Mar 8, 5:00PM PST (Fri Mar 9 0100 UTC)

openpower3
Start:   Fri, Mar 9, 8:30AM PST (Fri Mar 9 1630 UTC)
End:    Fri, Mar 9, 10:30AM PST (Fri Mar 9 1830 UTC)

openpower5
Start:   Fri, Mar 9, 1:00PM PST (Fri Mar 9 2100 UTC)
End:    Fri, Mar 9, 3:00PM PST (Fri Mar 9 2300 UTC)

openpower6 (note DST change for us)
Start:   Mon, Mar 12, 1:00PM PDT (Mon Mar 12 2000 UTC)
End:    Mon, Mar 12, 3:00PM PDT (Mon Mar 12 2200 UTC)
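
The local-to-UTC conversions for these windows can be double-checked with a
short Python sketch. It assumes fixed offsets (PST = UTC-8, PDT = UTC-7)
rather than a timezone database, so it runs with the standard library alone;
the helper name to_utc and the windows dictionary are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets: US Pacific switched from PST to PDT on Sun, Mar 11, 2018.
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")


def to_utc(month, day, hour, minute, tz, year=2018):
    """Convert a local wall-clock time in the given zone to UTC."""
    local = datetime(year, month, day, hour, minute, tzinfo=tz)
    return local.astimezone(timezone.utc)


# (start, end) of each outage window, local times as announced.
windows = {
    "openpower1": (to_utc(3, 8, 9, 0, PST), to_utc(3, 8, 11, 0, PST)),
    "openpower2": (to_utc(3, 8, 15, 0, PST), to_utc(3, 8, 17, 0, PST)),
    "openpower3": (to_utc(3, 9, 8, 30, PST), to_utc(3, 9, 10, 30, PST)),
    "openpower5": (to_utc(3, 9, 13, 0, PST), to_utc(3, 9, 15, 0, PST)),
    "openpower6": (to_utc(3, 12, 13, 0, PDT), to_utc(3, 12, 15, 0, PDT)),
}

if __name__ == "__main__":
    fmt = "%a %b %d %H%M UTC"
    for node, (start, end) in windows.items():
        print(f"{node}: {start.strftime(fmt)} -> {end.strftime(fmt)}")
```

Because of the DST change, the same 1:00PM local start is 2100 UTC on Friday
but 2000 UTC on Monday.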

Reason for outage:

We are in the process of migrating the storage backend of the cluster
from local storage to Ceph. The migration to Ceph should improve I/O
bandwidth and capacity, and also give us more flexibility when doing
server maintenance, since we can live-migrate VMs. Thanks to a donation
from IBM, we have a new five-node Ceph cluster with 292TB of capacity,
including SSDs for journal caching. In addition, because we're moving to
Ceph, we're going to be upgrading the networking layer from 1Gbps to
40Gbps, thanks to several donations from Mellanox. Since we're going to be
incurring an outage for the server move, we wanted to do a few other items
at the same time to avoid additional outages.

The first phase of this migration includes the following (which this outage
covers):

1. Moving each compute server to a different rack closer to a Mellanox 40G
switch
2. Installing and configuring a Mellanox 40G NIC
3. Upgrading the system firmware (which includes Meltdown/Spectre fixes)
4. Switching over to a 4.14 mainline kernel on the host to provide better
feature support on ppc64le (also provides fixes for Meltdown/Spectre)

We have five compute nodes and we're planning on doing two server moves a
day starting on Thursday of this week. We're going to need to bring the
nodes up and down several times, so we'll be disabling the OpenStack
services on those nodes until the process is complete.

The second phase of the migration will happen in a few weeks and should
only have per-VM impacts while we migrate them over to the new Ceph
cluster. I'll send a separate announcement about that once we're ready.

If you have any questions or concerns, please let me know directly via email
or IRC.

Thanks!

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab