[Intel-wired-lan] Linux 4.12+ memory leak on router with i40e NICs

Anders K. Pedersen | Cohaesio akp at cohaesio.com
Sun Oct 22 13:56:40 UTC 2017


On Thu, 2017-10-19 at 08:40 -0700, Alexander Duyck wrote:
> On Thu, Oct 19, 2017 at 5:19 AM, Anders K. Pedersen | Cohaesio
> <akp at cohaesio.com> wrote:
> > Hi Alex,
> > 
> > On Wed, 2017-10-18 at 16:37 -0700, Alexander Duyck wrote:
> > > When we last talked I had asked if you could do a git bisect to
> > > find the memory leak and you said you would look into it. The most
> > > useful way to solve this would be to do a git bisect between your
> > > current kernel and the 4.11 kernel to find the point at which this
> > > started. If we can do that then fixing this becomes much simpler as
> > > we just have to fix the patch that introduced the issue.
> > 
> > We're also seeing a smaller memory leak (about 1 GB per day) than
> > the original one even with the "Fix memory leak related filter
> > programming status" fix applied. So far I've determined that the
> > leak is present on 4.13.7 and was introduced between 4.11 and 4.12,
> > so I'll do another round of bisection to identify the patch that
> > introduced this.
> > 
> > Since the router must run for a couple of hours before I can be sure
> > whether a kernel is good or bad, and I can't reboot it during
> > working hours, it'll probably be about a week before I have a
> > result.
> > 
> > --
> > Venlig hilsen / Best Regards
> > 
> > Anders K. Pedersen
> > Senior Technical Manager
> 
> Anders,
> 
> I'll do some digging on my side to see if I can find any other memory
> leaks that might be floating around in the driver that could have been
> introduced during that time-frame.
> 
> One thing you might try that would help with your testing would be to
> just disable the ATR functionality in i40e. You can do that with the
> ethtool command "ethtool --set-priv-flags <iface> flow-director-atr
> off". That should allow you to bisect this without needing to deal
> with the "programming status" patches since you won't be programming
> ATR filters which is what caused that leak.
> 
> Thanks for looking into this.
> 
> - Alex

Hi Alex,

I began bisecting, applying the known fix patches at each step where
they were applicable (i.e. without changing the flow-director-atr
flag). However, some of the steps had a high number of packet drops,
which caused problems for our network, so I couldn't leave those
kernels running for the several hours needed to determine whether the
leak is present. The part of the bisection I got through had the same
outcome as the last bisection, which led to "i40e: Fix support for flow
director programming status".
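
For anyone repeating this, the good/bad decision at each bisection step
can be sketched roughly as below. This is only an illustration, not the
procedure actually used on the router: it assumes the leak shows up as a
steady drop in the MemAvailable field of /proc/meminfo, and the function
names are hypothetical.

```python
def meminfo_kb(text, field="MemAvailable"):
    """Return the value (in kB) of one field from /proc/meminfo text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name == field:
            return int(rest.split()[0])
    raise KeyError(field)


def loss_gib_per_day(samples):
    """Least-squares trend of (seconds, kB available) samples.

    Returns GiB lost per day; a positive value means memory is shrinking.
    Needs at least two samples taken at different times.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_kb = sum(kb for _, kb in samples) / n
    num = sum((t - mean_t) * (kb - mean_kb) for t, kb in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    kb_per_second = num / den
    return -kb_per_second * 86400 / (1024 * 1024)


# Usage idea (assumption): sample every few minutes for several hours, e.g.
#   with open("/proc/meminfo") as f:
#       samples.append((time.time(), meminfo_kb(f.read())))
# and mark the kernel "bad" if loss_gib_per_day(samples) stays clearly
# above the normal fluctuation of the machine.
```

Normal traffic changes also move MemAvailable, so a short sampling window
gives a noisy trend; that is why each step needs to run for hours.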

After that I experimented a bit with the flow-director-atr flag, and it
turns out that if I disable this flag on all the NICs, then the memory
leak is gone, so I suspected that the smaller memory leak was also
caused by "i40e: Fix support for flow director programming status".
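
As a rough sketch, turning the flag off on several NICs amounts to
running the ethtool command Alex quoted once per interface. The helper
names and the interface names below are placeholders, not the router's
real configuration; by default the commands are only printed.

```python
import subprocess


def atr_off_commands(ifaces):
    """Build one ethtool argv per interface, using the command quoted above."""
    return [
        ["ethtool", "--set-priv-flags", iface, "flow-director-atr", "off"]
        for iface in ifaces
    ]


def disable_atr(ifaces, dry_run=True):
    """Print the commands (dry run) or execute them; executing needs root."""
    for argv in atr_off_commands(ifaces):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


# Example with placeholder interface names:
#   disable_atr(["eth0", "eth1"])                  # only prints the commands
#   disable_atr(["eth0", "eth1"], dry_run=False)   # actually runs ethtool
```

The current state of the flag can be checked afterwards with
"ethtool --show-priv-flags <iface>".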

I tried to revert this patch from 4.13 (with manual fixup for the trace
point that had been added later), but that brought back the packet
drops, so I couldn't let it run.

This morning I saw your "i40e: Add programming descriptors to
cleaned_count" patch, so I tried 4.13.9 with that patch and the
previous "i40e: Fix memory leak related filter programming status"
without turning off the flow-director-atr flag. So far this combination
is running stable without any memory leaks.

Thanks for fixing this.

Regards,
Anders

