[Intel-wired-lan] intermittent ixgbe transmit queue timeouts in v5.18 kernels

Jeff Layton jlayton at kernel.org
Wed Jun 8 12:44:49 UTC 2022


On Tue, 2022-06-07 at 21:22 +0000, Switzer, David wrote:
> > -----Original Message-----
> > From: Intel-wired-lan <intel-wired-lan-bounces at osuosl.org> On Behalf Of Jeff Layton
> > Sent: Thursday, June 2, 2022 2:38 PM
> > To: intel-wired-lan at lists.osuosl.org; Nguyen, Anthony L <anthony.l.nguyen at intel.com>; Brandeburg, Jesse <jesse.brandeburg at intel.com>
> > Cc: Ilya Dryomov <idryomov at gmail.com>; Xiubo Li <xiubli at redhat.com>; Venky Shankar <vshankar at redhat.com>
> > Subject: [Intel-wired-lan] intermittent ixgbe transmit queue timeouts in v5.18 kernels
> > 
> > The Ceph project test lab has a fairly large cluster of machines
> > with ixgbe adapters:
> > 
> >    03:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
> > 
> We are attempting to reproduce your issue. The output of
> lspci -s 03:00.0 -vv would help us make sure we're looking at the
> exact adapter that the issue is being seen on.
> 
> > Recently, we've started getting intermittent tx queue timeouts with
> > these machines. One of them is reported here:
> > 
> >    https://tracker.ceph.com/issues/55823
> > 
> > Usually this happens when we're trying to do a sync, and there is a
> > flurry of transmission activity. Afterward we see a lot of fallout
> > in ceph culminating in softlockups.
> > 
> > The kernels we're testing have some patches that are not yet in
> > mainline, but mostly they are confined to net/ceph and fs/ceph, and
> > shouldn't really affect hw drivers.
> > 
> > The problem manifested pretty regularly during v5.18 and then I
> > didn't see it for a while. I had figured it was something that had
> > been fixed, but I think it was just "luck".
> > 
> > I attempted a bisect a while back, and ruled out recent ceph changes
> > as the issue. Unfortunately, I wasn't able to get to a conclusive
> > patch that broke it, but I think it likely crept in during the
> > initial merge window for v5.18 (pre-rc1).
> > 
> > One other oddity: the test lab often installs bleeding-edge kernels
> > on old distros (RHEL8 and Ubuntu from a similar era). Is it possible
> > that the firmware that ships with these older distros is not
> > suitable for the more recent driver in v5.18?
> > 
> Thank you for this information, we'll look into it if we're having
> trouble reproducing the issue!
> 
> 
> > Any thoughts or suggestions on things we can do to fix this?
> > 
> Nothing yet, but we'll be sure to let you know when we find it.
> 

Thanks for getting back to us.

Since I emailed you, I've found a bug in ceph that could make the cephfs
client spin in an (essentially) infinite loop if there were delays
getting MDS replies in some situations. We've fixed that and I haven't
seen any tx queue timeouts since, though I've only had the fix in place
for a day or so.

For now, I think we can just consider this to be fallout from the ceph
bug. If the problems return though, I'll let you know!

Thanks again!
-- 
Jeff Layton <jlayton at kernel.org>

