[osuosl-openpower] MAINTENANCE: OpenStack cluster server moves: Mar 8, 9 & 12
Lance Albertson
lance at osuosl.org
Mon Mar 12 21:52:16 UTC 2018
The move for openpower6 has hit a hardware issue, and I need to call
IBM support to try to resolve it. I don't have an ETA for when this server
will come back online.
On Fri, Mar 9, 2018 at 3:18 PM, Lance Albertson <lance at osuosl.org> wrote:
> The move for openpower5 has been completed. Please let me know if any VMs
> are still unreachable.
>
>
> On Fri, Mar 9, 2018 at 10:03 AM, Lance Albertson <lance at osuosl.org> wrote:
>
>> The move for openpower3 has been completed. Please let me know if any VMs
>> are still unreachable.
>>
>>
>> On Thu, Mar 8, 2018 at 5:25 PM, Lance Albertson <lance at osuosl.org> wrote:
>>
>>> The move for openpower2 has been completed. Sorry it took a little
>>> longer than planned. Please let me know if any VMs are still unreachable.
>>>
>>> On Thu, Mar 8, 2018 at 11:15 AM, Lance Albertson <lance at osuosl.org>
>>> wrote:
>>>
>>>> The move for openpower1 has been completed, and all VMs that were on that
>>>> hypervisor should be booting up or already back online.
>>>> Please let us know if you have an issue with one of your VMs. We'll be
>>>> moving openpower2 later this afternoon as planned.
>>>>
>>>> Thanks-
>>>>
>>>> On Tue, Mar 6, 2018 at 2:36 PM, Lance Albertson <lance at osuosl.org>
>>>> wrote:
>>>>
>>>>> Service(s) affected:
>>>>>
>>>>> All VMs hosted on the OpenPOWER OpenStack cluster will be offline for
>>>>> approximately 2-4 hours during each server move. In addition, any VMs which
>>>>> have block storage attached to the affected nodes will have an outage.
>>>>>
>>>>> For a list of affected VMs per hypervisor node, please see the
>>>>> following spreadsheet which includes the UUID for each instance as it
>>>>> stands today. You can see what UUID your VM has by looking at the
>>>>> /run/cloud-init/.instance-id file on your VM. In addition, if you're using
>>>>> a block storage (Cinder) volume, I have a sheet which shows the mappings by
>>>>> UUID to the host.
>>>>>
>>>>> OpenStack Cluster Server Moves
>>>>> <https://docs.google.com/a/osuosl.org/spreadsheets/d/15D3VE13chSn0jmGWpf5wsPsin6ex0B3I6FTwS74T5uY/edit?usp=drive_web>
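The UUID check described above can be scripted. The following is a minimal sketch, assuming Python 3 is available on the VM; the function names and the `mapping` dictionary (UUID to hypervisor, copied by hand from the spreadsheet) are hypothetical and not part of any OSL tooling.

```python
from pathlib import Path

# Path given in the announcement; cloud-init writes the instance UUID here.
INSTANCE_ID_FILE = Path("/run/cloud-init/.instance-id")


def read_instance_id(path=INSTANCE_ID_FILE):
    """Return the instance UUID from the given file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def find_hypervisor(instance_id, mapping):
    """Look up a UUID in a hand-built {uuid: hypervisor} dict (hypothetical)."""
    return mapping.get(instance_id)


uuid = read_instance_id()
print(uuid if uuid else "no cloud-init instance id found on this host")
```

Comparing the printed UUID against the spreadsheet shows which hypervisor move affects the VM.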
>>>>>
>>>>> Outage Windows:
>>>>>
>>>>> openpower1
>>>>> Start: Thu, Mar 8, 9:00AM PST (Thu Mar 8 1700 UTC)
>>>>> End: Thu, Mar 8, 11:00AM PST (Thu Mar 8 1900 UTC)
>>>>>
>>>>> openpower2
>>>>> Start: Thu, Mar 8, 3:00PM PST (Thu Mar 8 2300 UTC)
>>>>> End: Thu, Mar 8, 5:00PM PST (Fri Mar 9 0100 UTC)
>>>>>
>>>>> openpower3
>>>>> Start: Fri, Mar 9, 8:30AM PST (Fri Mar 9 1630 UTC)
>>>>> End: Fri, Mar 9, 10:30AM PST (Fri Mar 9 1830 UTC)
>>>>>
>>>>> openpower5
>>>>> Start: Fri, Mar 9, 1:00PM PST (Fri Mar 9 2100 UTC)
>>>>> End: Fri, Mar 9, 3:00PM PST (Fri Mar 9 2300 UTC)
>>>>>
>>>>> openpower6 (note DST change for us)
>>>>> Start: Mon, Mar 12, 1:00PM PDT (Mon Mar 12 2000 UTC)
>>>>> End: Mon, Mar 12, 3:00PM PDT (Mon Mar 12 2200 UTC)
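The local-to-UTC conversions in the schedule can be checked with a few lines of Python. This is a sketch using fixed offsets (PST is UTC-8, PDT is UTC-7) rather than a time zone database; the helper names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for the two abbreviations used in the schedule.
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")


def to_utc(year, month, day, hour, minute, tz):
    """Return the UTC equivalent of a local wall-clock time."""
    local = datetime(year, month, day, hour, minute, tzinfo=tz)
    return local.astimezone(timezone.utc)


def fmt(dt):
    """Format a datetime in the style used in the schedule."""
    return dt.strftime("%a %b %d %H%M UTC")


# openpower2 ends at 5:00PM PST on Mar 8, which crosses midnight in UTC.
openpower2_end = to_utc(2018, 3, 8, 17, 0, PST)
# openpower6 starts at 1:00PM PDT on Mar 12, after the DST change.
openpower6_start = to_utc(2018, 3, 12, 13, 0, PDT)

print(fmt(openpower2_end))
print(fmt(openpower6_start))
```

Because of the DST change, the same 1:00PM local start time falls one hour earlier in UTC on Mar 12 than it did on Mar 9.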
>>>>>
>>>>> Reason for outage:
>>>>>
>>>>> We are in the process of migrating the storage backend of the
>>>>> cluster from local storage to using Ceph as a backend. The migration to
>>>>> Ceph should improve I/O bandwidth and capacity and also provide more
>>>>> flexibility for server maintenance, since we will be able to live-migrate
>>>>> VMs. Thanks to a donation from IBM, we have a new five-node Ceph cluster
>>>>> with 292TB of capacity, including SSDs for journal caching. In addition,
>>>>> because of the move to Ceph, we're going to be upgrading the networking
>>>>> layer from 1Gbps to 40Gbps, thanks to several donations from Mellanox.
>>>>> Since we're going to be incurring an outage for the server move, we wanted
>>>>> to do a few other items at the same time to reduce additional outage time.
>>>>>
>>>>> The first phase of this migration includes the following (which this
>>>>> outage covers):
>>>>>
>>>>> 1. Moving each compute server to a different rack closer to a Mellanox
>>>>> 40G switch
>>>>> 2. Installing and configuring a Mellanox 40G NIC
>>>>> 3. Upgrading the system firmware (which includes Meltdown/Spectre
>>>>> fixes)
>>>>> 4. Switching over to a 4.14 mainline kernel on the host to provide
>>>>> better feature support on ppc64le (also provides fixes for Meltdown/Spectre)
>>>>>
>>>>> We have five compute nodes and we're planning on doing two server moves
>>>>> a day starting on Thursday of this week. We're going to need to bring the
>>>>> nodes up and down several times, so we'll be disabling the OpenStack
>>>>> services on those nodes until the process is complete.
>>>>>
>>>>> The second phase of the migration will happen in a few weeks and
>>>>> should only have per VM impacts while we migrate them over to the new Ceph
>>>>> cluster. I'll send a separate announcement about that once we're ready.
>>>>>
>>>>> If you have any questions or concerns, please let me know directly via
>>>>> email or IRC.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Lance Albertson
>>>>> Director
>>>>> Oregon State University | Open Source Lab
>>>>>
--
Lance Albertson
Director
Oregon State University | Open Source Lab