[osuosl-openpower] Ongoing VM network connectivity issues since Pike upgrade

Lance Albertson lance at osuosl.org
Wed May 29 18:51:58 UTC 2019


Sending another update on this to just the OpenPOWER cluster users.

Unfortunately, this problem is still happening, even on the nodes that have
been rebooted. To reboot a node, I need to live migrate all of its VMs onto
other nodes. Normally this isn't an issue, but I noticed yesterday that some
instances were failing during the migration. Upon further investigation, I
found that some of these VMs were running on both the old and new nodes at
the same time, which is not a good thing for file systems. I've already
fixed a few VMs that were in this state, but I'm still working through
others that might be in a bad state. I'm forcing a filesystem check in
rescue mode before booting each system. I'll let you know once I'm done
going through all the VMs in case I missed anything. In the meantime, I'm
not going to do any more live migrations until I get this resolved.
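
For anyone curious what "running on both the old and new nodes" looks like
in practice, here is a minimal sketch of the kind of check involved. It is
not the exact procedure used here. It assumes you have already collected
the list of running domains from each hypervisor (for example with
something like `virsh list --name`), and the node and instance names below
are made up for illustration.

```python
from collections import defaultdict


def find_duplicate_instances(running_by_node):
    """Return {instance: sorted list of nodes} for every instance that
    is reported as running on more than one node."""
    nodes_by_instance = defaultdict(set)
    for node, instances in running_by_node.items():
        for instance in instances:
            nodes_by_instance[instance].add(node)
    return {
        instance: sorted(nodes)
        for instance, nodes in nodes_by_instance.items()
        if len(nodes) > 1
    }


if __name__ == "__main__":
    # Hypothetical data: node name -> domains reported as running there.
    running = {
        "node1": ["instance-0001", "instance-0002"],
        "node2": ["instance-0002", "instance-0003"],
    }
    for instance, nodes in find_duplicate_instances(running).items():
        print(f"{instance} is running on: {', '.join(nodes)}")
```

Any instance this reports should be stopped on one side and have its
filesystem checked before it is booted again.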

On the original issue, I unfortunately have not made any progress on
narrowing down the cause. One option is to go ahead with the Queens upgrade
and see whether the problem persists. But I'd feel much better if I got
this fixed before we attempted the upgrade.

I'll continue looking into this over the rest of the week.

Thanks for your patience.

On Tue, May 21, 2019 at 12:16 PM Lance Albertson <lance at osuosl.org> wrote:

> All,
>
> I wanted to send you an update on where we are on this issue. So far I've
> narrowed the problem down to this: when a VM using a private network is
> removed, certain iptables rules on the hypervisor get out of order. It
> only seems to affect inbound connections to the VM, as outbound
> connections still seem to work. Unfortunately, I haven't been able to
> easily reproduce the issue, which makes it difficult to troubleshoot. I've
> looked through the source code and searched online to see if anyone else
> had run into this, without success.
>
> I've rebooted all of the hypervisors on our x86 cluster and two on our ppc
> cluster (which was needed for the MDS updates). So far we haven't seen any
> issues on the nodes that have been rebooted, but I need to let those run
> for a few days to verify that theory. These machines were also due for a
> reboot because of the CentOS 7.5 -> 7.6 upgrade, so perhaps it's related
> to that.
>
> At any rate, I've deployed a temporary cronjob on the nodes that haven't
> been rebooted, which should "fix" the networking issue. I have it set to
> run every minute so that any downtime should be minimal.
>
> I'll send another update when I have one.
>
> Thanks-
>
> On Thu, May 16, 2019 at 8:58 AM Lance Albertson <lance at osuosl.org> wrote:
>
>> All,
>>
>> Since the upgrade to Pike we've noticed virtual machines suddenly losing
>> network connectivity. The issue sometimes seems to fix itself, or goes
>> away when we restart the neutron-linuxbridge-agent service on the
>> hypervisors. We are doing our best to track down why this is happening
>> and how to fix it. Since we're not monitoring every host on the cluster,
>> it's difficult for us to know when it happens, so if you do have a
>> problem with one of your VMs, please let us know either via IRC in
>> #osuosl on Freenode, or via a support email.
>>
>> I'll be sending further updates as we have them.
>>
>> Thanks for your patience!
>>
>> --
>> Lance Albertson
>> Director
>> Oregon State University | Open Source Lab
>>
>
>
> --
> Lance Albertson
> Director
> Oregon State University | Open Source Lab
>


-- 
Lance Albertson
Director
Oregon State University | Open Source Lab