[Intel-wired-lan] [BUG] ixgbe: Detected Tx Unit Hang (XDP)

Tobias Böhm tobias.boehm at hetzner-cloud.de
Mon May 5 15:23:02 UTC 2025


On 24.04.25 at 12:19, Tobias Böhm wrote:
> On 23.04.25 at 20:39, Maciej Fijalkowski wrote:
>> On Wed, Apr 23, 2025 at 04:20:07PM +0200, Marcus Wichelmann wrote:
>>> On 17.04.25 at 16:47, Maciej Fijalkowski wrote:
>>>> On Fri, Apr 11, 2025 at 10:14:57AM +0200, Michal Kubiak wrote:
>>>>> On Thu, Apr 10, 2025 at 04:54:35PM +0200, Marcus Wichelmann wrote:
>>>>>> On 10.04.25 at 16:30, Michal Kubiak wrote:
>>>>>>> On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> in a setup where I use native XDP to redirect packets to a bonding interface
>>>>>>>> that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly
>>>>>>>> resets the NIC with the following kernel output:
>>>>>>>>
>>>>>>>>    ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
>>>>>>>>      Tx Queue             <4>
>>>>>>>>      TDH, TDT             <17e>, <17e>
>>>>>>>>      next_to_use          <181>
>>>>>>>>      next_to_clean        <17e>
>>>>>>>>    tx_buffer_info[next_to_clean]
>>>>>>>>      time_stamp           <0>
>>>>>>>>      jiffies              <10025c380>
>>>>>>>>    ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
>>>>>>>>    ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
>>>>>>>>    ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
>>>>>>>>
>>>>>>>> This only occurs in combination with a bonding interface and XDP, so I don't
>>>>>>>> know if this is an issue with ixgbe or the bonding driver.
>>>>>>>> I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and 6.15.0-rc1
>>>>>>>> show the same issue.
>>>>>>>>
>>>>>>>>
>>>>>>>> I managed to reproduce this bug in a lab environment. Here are some details
>>>>>>>> about my setup and the steps to reproduce the bug:
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>> Do you have any ideas what may be causing this issue or what I can do to
>>>>>>>> diagnose this further?
>>>>>>>>
>>>>>>>> Please let me know if I should provide any more information.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Marcus
>>>>>>>>
>>>>>>>
>>>>> [...]
>>>>>
>>>>> Hi Marcus,
>>>>>
>>>>>> thank you for looking into it. And not even 24 hours after my report, I'm
>>>>>> very impressed! ;)
>>>>>
>>>>> Thanks! :-)
>>>>>
>>>>>> Interesting. I just tried again but had no luck yet with reproducing it
>>>>>> without a bonding interface. May I ask what your setup looks like?
>>>>>
>>>>> For now, I've just grabbed the first available system with the HW
>>>>> controlled by the "ixgbe" driver. In my case it was:
>>>>>
>>>>>    Ethernet controller: Intel Corporation Ethernet Controller X550
>>>>>
>>>>> Also, for my first attempt, I didn't use the upstream kernel - I just tried
>>>>> the kernel installed on that system. It was the Fedora kernel:
>>>>>
>>>>>    6.12.8-200.fc41.x86_64
>>>>>
>>>>>
>>>>> I think that may be the "beauty" of timing issues - sometimes you can change
>>>>> just one piece in your system and get a completely different reproduction rate.
>>>>> Anyway, the higher the repro probability, the easier it is to debug
>>>>> the timing problem. :-)
>>>>
>>>> Hi Marcus, to break the silence, could you try to apply the diff below on
>>>> your side?
>>>
>>> Hi, thank you for the patch. We've tried it and with your changes we can no
>>> longer trigger the error and the NIC is no longer being reset.
>>>
>>>> We see several issues around XDP queues in ixgbe, but before we
>>>> proceed let's try this small change on your side.
>>>
>>> How confident are you that this patch is sufficient to make things stable enough
>>> for production use? Was it just the Tx hang detection that was misbehaving for
>>> the XDP case, or is there an underlying issue with the XDP queues that is not
>>> solved by disabling the detection for it?
>>
>> I believe that the correct way to approach this is to move the Tx hang
>> detection into ixgbe_tx_timeout(), as that is the place where this logic
>> belongs. By doing so I suppose we would kill two birds with one stone,
>> as the mentioned ndo is called under the netdev watchdog, which does not
>> apply to XDP Tx queues.
>>
>>>
>>> With our current setup we cannot accurately verify that we have no packet loss
>>> or stuck queues. We can do additional tests to verify that.
> 
> 
> Hi Maciej,
> 
> I'm a colleague of Marcus and involved in the testing as well.
>>>> Additional question: do you have pause frames enabled on your setup?
>>>
>>> Pause frames were enabled, but we can also reproduce it after disabling them,
>>> without your patch.
>>
>> Please give your setup a go with pause frames enabled and the patch that I
>> shared previously applied, and let us see the results. As said above, I do
>> not think it is correct to check for hung queues in the Tx descriptor cleaning
>> routine. This is the job of the ndo_tx_timeout callback.
>>
> 
> We have tested with pause frames enabled and the patch applied, and cannot
> trigger the error anymore in our lab setup.
> 
>>>
>>> Thanks!
>>
>> Thanks for the feedback and testing. I'll provide a proper fix tomorrow and CC
>> you so you can take it for a spin.
>>
> 
> That sounds great. We'd be happy to test with the proper fix in our 
> original setup.

Hi,

During further testing with this patch applied, we noticed new warnings
showing up. We've also tested with the new patch that was sent ("[PATCH
iwl-net] ixgbe: fix ndo_xdp_xmit() workloads") and see the same warnings.

I'm sending this observation to this thread because I'm not sure if it 
is related to those patches or if it was already present but hidden by 
the resets of the original issue reported by Marcus.

After processing test traffic (~10 million packets, as described in Marcus'
reproducer setup) and idling for a minute, the following warnings keep
being logged for as long as the NIC idles:

   page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 60 sec
   page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 60 sec
   page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 120 sec
   page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 120 sec
   page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 181 sec
   page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 181 sec
   page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 241 sec
   page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 241 sec

Just sending a single packet makes the warnings stop being logged.

After sending heavy test traffic again, new warnings start to be logged
after a minute of idling:

   page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 60 sec
   page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 60 sec
   page_pool_release_retry() stalled pool shutdown: id 987, 2 inflight 120 sec
   page_pool_release_retry() stalled pool shutdown: id 979, 2 inflight 120 sec

Detaching the XDP program stops the warnings as well.

As before, pause frames were enabled.

Just like with the original issue, we could not always reproduce these
warnings. More traffic seems to increase the chance of triggering them.
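
In case it helps with comparing runs, here is a minimal sketch of how the
warnings could be grouped per pool id. It is only an illustration and not
part of our test setup: the function name summarize_stalls and the sample
text are made up for this mail, and the regular expression assumes the
message format exactly as quoted above (it tolerates a line break before
the seconds value).

```python
import re

# Matches the warning text quoted in this mail. "\s+" before the seconds
# value allows for a wrapped line, as in the mail archive.
STALL_RE = re.compile(
    r"page_pool_release_retry\(\) stalled pool shutdown: "
    r"id (?P<id>\d+), (?P<inflight>\d+) inflight\s+(?P<sec>\d+) sec"
)


def summarize_stalls(log_text):
    """Return {pool_id: (inflight, longest_reported_stall_sec)}."""
    summary = {}
    for match in STALL_RE.finditer(log_text):
        pool_id = int(match.group("id"))
        inflight = int(match.group("inflight"))
        sec = int(match.group("sec"))
        previous = summary.get(pool_id)
        # Keep the entry with the longest stall time seen for this pool.
        if previous is None or sec > previous[1]:
            summary[pool_id] = (inflight, sec)
    return summary


# Sample input: a subset of the warnings from the first test run.
sample = """\
page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 60 sec
page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 60 sec
page_pool_release_retry() stalled pool shutdown: id 968, 2 inflight 241 sec
page_pool_release_retry() stalled pool shutdown: id 963, 2 inflight 241 sec
"""

if __name__ == "__main__":
    for pool_id, (inflight, sec) in sorted(summarize_stalls(sample).items()):
        print(f"pool {pool_id}: {inflight} inflight, stalled for {sec} s")
```

Feeding it the output of dmesg instead of the sample string should show
whether the same pool ids keep stalling or whether new ids appear after
each traffic burst, as we saw with ids 963/968 and later 979/987.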

Please let me know if I should provide any further information.

Thanks,
Tobias

