[Intel-wired-lan] [REGRESSION] Intel ICE Ethernet driver in linux >= 6.6.9 triggers extra memory consumption and causes continuous kswapd* usage and continuous swapping
Paul Menzel
pmenzel at molgen.mpg.de
Wed Jan 24 16:30:37 UTC 2024
Dear Jaroslav,
Am 24.01.24 um 15:29 schrieb Linux regression tracking (Thorsten Leemhuis):
> On 11.01.24 09:26, Jaroslav Pulchart wrote:
>>> On 1/8/2024 2:49 AM, Jaroslav Pulchart wrote:
>>> First, thank you for your work trying to chase this!
>>>> I would like to report a regression triggered by recent change in
>>>> Intel ICE Ethernet driver in the 6.6.9 linux kernel. The problem was
>>>> bisected and the regression is triggered by
>>>> fc4d6d136d42fab207b3ce20a8ebfd61a13f931f "ice: alter feature support
>>>> check for SRIOV and LAG" commit and originally reported as part of
>>>> https://lore.kernel.org/linux-mm/CAK8fFZ4DY+GtBA40Pm7Nn5xCHy+51w3sfxPqkqpqakSXYyX+Wg@mail.gmail.com/T/#m5217c62beb03b3bc75d7dd4b1d9bab64a3e68826
>>>> thread.
>>>
>>> I think that's a bad bisect. There is no reason I can see for that
>>> change to cause a continuous or large leak; it really doesn't make
>>> any sense. Does reverting it consistently help? You're not just
>>> rewinding the tree back to that point, right? Just running 6.6.9
>>> without that patch? (Sorry for being pedantic, just trying to be certain.)
>>
>> Reverting just the single bisected commit consistently helps for >=
>> 6.6.9, as well as for the current 6.7.
>> We cannot use any newer Linux kernel without reverting it, due to this
>> extra memory utilization.
>
> Quick query: what's the status with regard to this regression? It looks
> like nothing has happened in the past week.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
>
> #regzbot poke
According to Linux's "no regressions" rule [1], I recommend sending in a
revert of the bisected commit.
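For anyone needing an interim workaround, a rough sketch (assuming a
locally checked-out kernel tree at or above v6.6.9; adjust the build and
install steps to your distribution) would be:

  # drop the bisected commit on top of the affected kernel
  git revert fc4d6d136d42fab207b3ce20a8ebfd61a13f931f
  # rebuild and install as usual
  make -j"$(nproc)"
  sudo make modules_install install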
Kind regards,
Paul
[1]:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-regressions.html
>>>>> However, with the following patch applied we see that more NUMA nodes
>>>>> are left with such a low amount of free memory, and that causes
>>>>> constant memory reclaim; it looks like something inside the kernel ate
>>>>> all the memory. This is the case right after system start as well.
>>>>
>>>> I'm reporting it here as it is a different problem from the original
>>>> thread. The commit introduces a low-memory problem on each NUMA node
>>>> of the first socket (node0 .. node3 in our case) and causes constant
>>>> kswapd* 100% CPU usage. See the attached 6.6.9-kswapd_usage.png. The
>>>> low memory issue is nicely visible in "numastat -m"; see the attached files:
>>>> * numastat_m-6.6.10_28GB_HP_ice_revert.txt - >= 6.6.9 with the ice commit reverted
>>>> * numastat_m-6.6.10_28GB_HP_no_revert.txt - >= 6.6.9 vanilla
>>>> The server "is fresh" (just after reboot), without any application load running.
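>>>> For reference, the per-node free memory captured in those files can be
>>>> checked on a live system with something like this (numastat comes from
>>>> the numactl package; the node numbers are specific to our box):
>>>>
>>>>   numastat -m | grep MemFree
>>>>   grep MemFree /sys/devices/system/node/node[0-3]/meminfo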
>>>
>>> OK, so the initial allocations on your system are running it out of
>>> memory.
>>>
>>> Are you running jumbo frames on your ethernet interfaces?
>>
>> Yes, we are (MTU 9000).
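>> (For completeness, the MTU can be confirmed per port with something
>> like the following; em1 is one of the bonded ports mentioned below:
>>   ip link show em1 | grep -o 'mtu [0-9]*'
>> which reports 9000 here.)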
>>
>>> Do you have /proc/slabinfo output from working/non-working boot?
>>
>> Yes, I have a complete sos report, so I can pick up the files from
>> there. See attached:
>> slabinfo.vanila (non-working)
>> slabinfo.reverted (working)
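>> To see which caches account for the difference, a rough comparison
>> (assuming the usual /proc/slabinfo 2.1 layout: name, active_objs,
>> num_objs, objsize, ...) could be:
>>
>>   awk 'NR > 2 { printf "%-28s %10.1f MiB\n", $1, $3 * $4 / 1048576 }' slabinfo.vanila   | sort -k2 -rn | head
>>   awk 'NR > 2 { printf "%-28s %10.1f MiB\n", $1, $3 * $4 / 1048576 }' slabinfo.reverted | sort -k2 -rn | head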
>>
>>>> $ grep MemFree numastat_m-6.6.10_28GB_HP_ice_revert.txt numastat_m-6.6.10_28GB_HP_no_revert.txt
>>>> numastat_m-6.6.10_28GB_HP_ice_revert.txt:MemFree    2756.89  2754.86   100.39  2278.43
>>>>     ^ ice fix reverted: ~2 GB free per NUMA node, except one, like before == no issue
>>>> numastat_m-6.6.10_28GB_HP_ice_revert.txt:MemFree    3551.29  1530.52  2212.04  3488.09
>>>> ...
>>>> numastat_m-6.6.10_28GB_HP_no_revert.txt:MemFree      127.52    66.49   120.23   263.47
>>>>     ^ ice fix present: only a few MB free on each node; this will cause the kswapd utilization!
>>>> numastat_m-6.6.10_28GB_HP_no_revert.txt:MemFree     3322.18  3134.47   195.55   879.17
>>>> ...
>>>>
>>>> Any hints on how to debug what is actually occupying all that memory,
>>>> and ideally a fix for the problem, would be much appreciated. We can
>>>> provide testing and more reports if needed to analyze the issue. We
>>>> have reverted commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f as a
>>>> workaround until a proper fix is known.
>>>
>>> My first suspicion is that we're contributing to the problem by running
>>> out of memory for receive descriptors.
>>>
>>> Can we see the ethtool -S stats from the freshly booted system that's
>>> running out of memory or doing OOM? Also, all the standard debugging
>>> info (at least once please), devlink dev info, any other configuration
>>> specifics? What networking config (bonding? anything else?)
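>>> (For example, something along these lines per interface, with the
>>> interface names adjusted to your system, plus the devlink output:
>>>   ethtool -S eth0 > ethtool_-S_eth0
>>>   devlink dev info
>>> would cover it.)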
>>
>> The system is not in OOM; once applications start using it, the four
>> kswapd processes (one per NUMA node of the first CPU socket) run
>> continuously at 100% CPU, all swapping in/out, because of the "low
>> memory".
>>
>> We have two 25G 2P E810-XXV adapters. The first port of each (em1 +
>> p3p1) is connected, and the two are bonded in LACP. The second ports
>> (em2 and p3p2) are unused.
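>> The bond state can be checked with, for example (the bond interface
>> name "bond0" here is just a placeholder for whatever is configured):
>>
>>   cat /proc/net/bonding/bond0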
>>
>> See the attached files for the working (reverted) case:
>> ethtool_-S_em1.reverted
>> ethtool_-S_em2.reverted
>> ethtool_-S_p3p1.reverted
>> ethtool_-S_p3p2.reverted
>>
>> See the attached files for the non-working (vanilla) case:
>> ethtool_-S_em1.vanila
>> ethtool_-S_em2.vanila
>> ethtool_-S_p3p1.vanila
>> ethtool_-S_p3p2.vanila
>>
>>> Do you have a bugzilla.kernel.org bug yet where you can upload larger
>>> files like dmesg and others?
>>
>> I do not have one yet; I will create a new one and ping you then.
>>
>>> Also, I'm curious whether your problem goes away if you change / reduce
>>> the number of queues per port, e.g. "ethtool -L eth0 combined 4"?
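>>> (I.e., something like:
>>>   ethtool -l eth0                # show the current channel counts
>>>   ethtool -L eth0 combined 4     # reduce to 4 combined queues
>>> with eth0 replaced by your interface name.)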
>>
>> I will try and give you feedback soon.
>>
>>> You also said something about reproducing when launching / destroying
>>> virtual machines with VF passthrough?
>>
>> The memory usage is there from boot, without any VMs running. The issue
>> is that the host is left with too little memory for itself, and it
>> starts to use kswapd once we begin using it by starting VMs.
>>
>>>
>>> Can you reproduce the issue without starting qemu (just doing bare-metal
>>> SR-IOV instance creation/destruction via
>>> /sys/class/net/eth0/device/sriov_numvfs ?)
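>>> (I.e., something along the lines of:
>>>   echo 8 > /sys/class/net/eth0/device/sriov_numvfs   # create VFs
>>>   echo 0 > /sys/class/net/eth0/device/sriov_numvfs   # remove them again
>>> where the VF count is just an example.)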
>>>
>>
>> Yes, we can reproduce it without qemu running; the extra memory usage
>> is present right after boot and does not depend on any running VM.
>>
>> We do not use SR-IOV.