[Intel-wired-lan] igb Detected Tx Unit Hang after upgrade to 4.18-rc6 [was Re: igb Detected Tx Unit Hang after upgrade to 4.17]

Tue Aug 28 17:01:59 UTC 2018

On Tue, Aug 28, 2018 at 8:20 AM Marco Berizzi <pupilla at libero.it> wrote:
>
> > On 6 August 2018 at 13:39, Marco Berizzi <pupilla at libero.it> wrote:
> >
> > > On 30 July 2018 at 15:11, Marco Berizzi <pupilla at libero.it> wrote:
> > >
> > > > On 27 July 2018 at 10:47, Marco Berizzi <pupilla at libero.it> wrote:
> > > >
> > > > Should I disable the TSO? Or should I set all my linux boxes MTU to 1500?
> > >
> > > Hello everyone,
> > >
> > > One hour ago I upgraded to linux 4.18-rc7 and I set the MTU=1500
> > >
> > > I will keep you updated.
> >
> > After 6 days uptime I got the same error message with MTU=1500.
> > Now I will upgrade to 4.18-rc8 with MTU=1500 and I will disable
> > the tso with the following command: ethtool -K eth0 tso off
>
> Hello everyone,
> same error also with tso disabled:
>
> [2226325.797978] igb 0000:08:00.0: Detected Tx Unit Hang
>                    Tx Queue             <1>
>                    TDH                  <cf>
>                    TDT                  <cf>
>                    next_to_use          <d0>
>                    next_to_clean        <cf>
>                  buffer_info[next_to_clean]
>                    time_stamp           <184aee600>
>                    next_to_watch        <000000003e213f5a>
>                    jiffies              <184aeea00>
>                    desc.status          <a8010>

Okay, so this looks like a Tx hang of some sort where writeback is not
being triggered.
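
For anyone who wants to read the dump numerically, here is a rough Python
sketch that pulls the fields out of the hang message quoted above. It assumes
every bracketed value is hexadecimal, and the ring size of 256 is an
assumption on my part (it is not in the report); parse_hang_dump is only an
illustrative helper name, not something from the driver.

```python
import re

# Hang report copied from the message above (quote prefixes removed).
HANG_DUMP = """\
[2226325.797978] igb 0000:08:00.0: Detected Tx Unit Hang
  Tx Queue             <1>
  TDH                  <cf>
  TDT                  <cf>
  next_to_use          <d0>
  next_to_clean        <cf>
buffer_info[next_to_clean]
  time_stamp           <184aee600>
  next_to_watch        <000000003e213f5a>
  jiffies              <184aeea00>
  desc.status          <a8010>
"""

FIELD_RE = re.compile(r"(\S+(?: \S+)?)\s+<([0-9a-fA-F]+)>\s*$")


def parse_hang_dump(text):
    """Return {field name: int} for each 'name <value>' line (values read as hex)."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.search(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


RING_SIZE = 256  # assumption: ring size is not stated in the report

fields = parse_hang_dump(HANG_DUMP)

# Descriptors the driver has queued but not yet cleaned.
pending = (fields["next_to_use"] - fields["next_to_clean"]) % RING_SIZE

# How long the oldest uncleaned buffer has been waiting, in jiffies.
age_jiffies = fields["jiffies"] - fields["time_stamp"]

print(f"TDH={fields['TDH']:#x} TDT={fields['TDT']:#x}")
print(f"pending descriptors: {pending}")
print(f"age of oldest buffer: {age_jiffies} jiffies")
```

With the values from this report, TDH and TDT are both 0xcf, one descriptor
is outstanding, and the oldest buffer has been waiting 0x400 jiffies.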

What we may want to do is turn up the level of information we are getting
out of the error. You can do that by running:
ethtool -s eth0 msglvl hw on tx_done on pktdata on
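
As a side note, the three names in that command correspond to NETIF_MSG_*
bits in the kernel's message-level mask. The small Python sketch below shows
which bits they add up to; the table mirrors the kernel's definitions as I
understand them, and msglvl_mask is just an illustrative helper, not part of
ethtool.

```python
# Message-level bits, mirroring the kernel's NETIF_MSG_* flags.
NETIF_MSG = {
    "drv": 0x0001,
    "probe": 0x0002,
    "link": 0x0004,
    "timer": 0x0008,
    "ifdown": 0x0010,
    "ifup": 0x0020,
    "rx_err": 0x0040,
    "tx_err": 0x0080,
    "tx_queued": 0x0100,
    "intr": 0x0200,
    "tx_done": 0x0400,
    "rx_status": 0x0800,
    "pktdata": 0x1000,
    "hw": 0x2000,
    "wol": 0x4000,
}


def msglvl_mask(names):
    """OR together the bits for the given message-level names."""
    mask = 0
    for name in names:
        mask |= NETIF_MSG[name]
    return mask


# Bits switched on by: ethtool -s eth0 msglvl hw on tx_done on pktdata on
extra_bits = msglvl_mask(["hw", "tx_done", "pktdata"])
print(hex(extra_bits))
```

These bits are added on top of whatever message level the interface already
has, so the existing settings are not cleared.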

That should cause us to dump the Tx descriptor rings and packet data
when we encounter one of these hangs. It is possible we are looking at
some sort of error in the way the data is being formatted on the
descriptor ring that is resulting in the device stopping on whatever
descriptor is at offset <d0>.

Thanks.

- Alex