[Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit

Ian Kumlien ian.kumlien at gmail.com
Fri Jul 24 22:41:15 UTC 2020


On Fri, Jul 24, 2020 at 11:51 PM Alexander Duyck
<alexander.duyck at gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> >
> > On Fri, Jul 24, 2020 at 10:45 PM Alexander Duyck
> > <alexander.duyck at gmail.com> wrote:
> > >
> > > On Fri, Jul 24, 2020 at 12:23 PM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> > > >
> > > > On Fri, Jul 24, 2020 at 4:57 PM Alexander Duyck
> > > > <alexander.duyck at gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 24, 2020 at 5:33 AM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 24, 2020 at 2:01 PM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jul 17, 2020 at 3:45 PM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> > > > > >
> > > > > > [--8<--]
> > > > > >
> > > > > > > As a side note, would something like this fix it - not even compile tested
> > > > > > >
> > > > > > >
> > > > > > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > > index 8bb3db2cbd41..1a7240aae85c 100644
> > > > > > > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > > > > > > @@ -3396,6 +3396,13 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> > > > > > >                           "Width x2" :
> > > > > > >                           (hw->bus.width == e1000_bus_width_pcie_x1) ?
> > > > > > >                           "Width x1" : "unknown"), netdev->dev_addr);
> > > > > > > +               /* quirk */
> > > > > > > +#ifdef CONFIG_PCIEASPM
> > > > > > > +               if (hw->bus.width == e1000_bus_width_pcie_x1) {
> > > > > > > +                       /* single lane pcie causes problems with ASPM */
> > > > > > > +                       pdev->pcie_link_state->aspm_enabled = 0;
> > > > > > > +               }
> > > > > > > +#endif
> > > > > > >         }
> > > > > > >
> > > > > > >         if ((hw->mac.type >= e1000_i210 ||
> > > > > > >
> > > > > > > I don't know where the right place to put a quirk would be...
> > > > > >
> > > > > > Ok, so that was a real brainfart... it turns out there is no good
> > > > > > way to get at that from a driver, but it was mostly intended to show
> > > > > > where the quirk should go...
> > > > > >
> > > > > > Due to the lack of APIs I started wondering if this will apply to
> > > > > > more devices than just network cards - potentially we could be a
> > > > > > bit more selective and only skip enabling it in one direction, but...
> > > > > >
> > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > index b17e5ffd31b1..96a3c6837124 100644
> > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > @@ -584,15 +584,16 @@ static void pcie_aspm_cap_init(struct pcie_link_state *link, int blacklist)
> > > > > >          * given link unless components on both sides of the link each
> > > > > >          * support L0s.
> > > > > >          */
> > > > > > -       if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
> > > > > > -               link->aspm_support |= ASPM_STATE_L0S;
> > > > > > -       if (dwreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > -               link->aspm_enabled |= ASPM_STATE_L0S_UP;
> > > > > > -       if (upreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > -               link->aspm_enabled |= ASPM_STATE_L0S_DW;
> > > > > > -       link->latency_up.l0s = calc_l0s_latency(upreg.latency_encoding_l0s);
> > > > > > -       link->latency_dw.l0s = calc_l0s_latency(dwreg.latency_encoding_l0s);
> > > > > > -
> > > > > > +       if (pcie_get_width_cap(child) != PCIE_LNK_X1) {
> > > > > > +               if (dwreg.support & upreg.support & PCIE_LINK_STATE_L0S)
> > > > > > +                       link->aspm_support |= ASPM_STATE_L0S;
> > > > > > +               if (dwreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > +                       link->aspm_enabled |= ASPM_STATE_L0S_UP;
> > > > > > +               if (upreg.enabled & PCIE_LINK_STATE_L0S)
> > > > > > +                       link->aspm_enabled |= ASPM_STATE_L0S_DW;
> > > > > > +               link->latency_up.l0s = calc_l0s_latency(upreg.latency_encoding_l0s);
> > > > > > +               link->latency_dw.l0s = calc_l0s_latency(dwreg.latency_encoding_l0s);
> > > > > > +       }
> > > > > >
> > > > > > this time it's compile tested...
> > > > > >
> > > > > > It could also be  if (pcie_get_width_cap(child) > PCIE_LNK_X1) {
> > > > > >
> > > > > > I assume that ASPM is not enabled for: PCIE_LNK_WIDTH_RESRV ;)
> > > > >
> > > > > This is probably a bit too broad of a scope to be used generically
> > > > > since this will disable ASPM for all devices that have a x1 link
> > > > > width.
> > > >
> > > > I agree, but also, the change to enable ASPM on the controllers was
> > > > quite recent, so it could be a general issue that a lot of people
> > > > could be suffering from... I haven't seen any reports though...
> > >
> > > The problem is that your layout is very likely specific. It may affect
> > > others with a similar layout, but for example I have the same
> > > controller in one of my systems and I have not been having any issues.
> > >
> > > > But OTOH the worst case would be a minor regression in power usage ;)
> > > >
> > > > > It might make more sense to look at something such as
> > > > > e1000e_disable_aspm as an example of how to approach this.
> > > >
> > > > Oh, my grepping completely failed to dig up this, thanks!
> > >
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/e1000e/netdev.c#L6743
> > >
> > > > > As far as what triggers it we would need to get more details about the
> > > > > setup. I'd be curious if we have an "lspci -vvv" for the system
> > > > > available. The assumption is that the ASPM exit latency is high on
> > > > > this system and that in turn is causing the bandwidth issues as you
> > > > > start entering L1. If I am not mistaken the device should advertise
> > > > > about 16us for the exit latency. I'd be curious if we have a device
> > > > > somewhere between the NIC and the root port that might be increasing
> > > > > the delay in exiting L1, and then if we could identify that we could
> > > > > add a PCIe quirk for that.
> > > >
> > > > We only disabled L0s, AFAIR, but given the comment in
> > > > e1000e_disable_aspm - "can't have L1 without L0s" - perhaps L1 is
> > > > disabled as well...
> > >
> > > Not sure where you got that from. It looks like with your system the
> > > L0s is disabled and you only have support for L1.
> >
> > First of all, sorry, I accidentally dropped the mailinglist :(
> >
> > And the comment I quoted was from the e1000e_disable_aspm:
> >         switch (state) {
> >         case PCIE_LINK_STATE_L0S:
> >         case PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1:
> >                 aspm_dis_mask |= PCI_EXP_LNKCTL_ASPM_L0S;
> >                 /* fall-through - can't have L1 without L0s */
> >        <====
> >         case PCIE_LINK_STATE_L1:
> >                 aspm_dis_mask |= PCI_EXP_LNKCTL_ASPM_L1;
> >                 break;
> >         default:
> >                 return;
> >         }
> > ----
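
If we do end up going the driver-quirk route you suggested, I guess the
igb side would boil down to something roughly like the sketch below.
Completely untested and the names are made up by me; the x1-width check
is just a guess at the right condition, and as far as I can tell
pci_disable_link_state() quietly does nothing when the firmware keeps
ownership of ASPM:

#include <linux/pci.h>

/* Hypothetical igb-side quirk, loosely modeled on the way e1000e ends
 * up calling into the PCIe core.  Just a sketch, not a real patch;
 * struct igb_adapter comes from igb.h.
 */
static void igb_maybe_disable_aspm(struct igb_adapter *adapter)
{
        struct pci_dev *pdev = adapter->pdev;

        /* Only the narrow x1 link seems to be affected on my board */
        if (adapter->hw.bus.width != e1000_bus_width_pcie_x1)
                return;

        /* Ask the PCIe core to keep L0s and L1 off for this device */
        pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1);
}
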
> >
> > > >
> > > > And:
> > > > lspci -t
> > > > -[0000:00]-+-00.0
> > > >            +-00.2
> > > >            +-01.0
> > > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> > >
> > > I think I now know what patch broke things for you. It is most likely
> > > this one that enabled ASPM on devices behind bridges:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
> >
> > Ah, yes, correct
> >
> > > My advice would be to revert that patch and see if it resolves the
> > > issue for you.
> >
> > Could do that yes =)
> >
> > I'm mainly looking for a more generic solution...
>
> That would be the generic solution. The patch has obviously broken
> things so we need to address the issues. The immediate solution is to
> revert it, but the more correct solution may be to do something like
> add an allowlist for the cases where enabling ASPM will not harm
> system performance.

I was thinking more of a generic solution, like the one you mention
further down, where we get the best of both worlds... =)

> > > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > > a bridge between it and the root complex. This can be problematic as I
> > > believe the main reason for the code that was removed in the patch is
> > > that wakeups can end up being serialized if all of the links are down
> > > or you could end up with one of the other devices on the bridge
> > > utilizing the PCIe link and reducing the total throughput, especially
> > > if you have the link to the root complex also taking part in power
> > > management. Starting at the root complex it looks like you have the
> > > link between the bridge and the PCIe switch. It is running L1 enabled
> > > with a 32us time for it to reestablish link according to the root
> > > complex side (00:01.2->1:00.0). The next link is the switch to the
> > > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > > bridge says it only needs 32us while the NIC is saying it will need
> > > 64us. That upper bound is already a pretty significant value, however
> > > you have two links to contend with so in reality you are looking at
> > > something like 96us to bring up both links if they are brought up
> > > serially.
> >
> > hummm... Interesting... I have never managed to parse that lspci thing
> > properly...
>
> Actually I parsed it a bit incorrectly too.
>
> The i211 lists that it only supports up to 64us maximum delay in L1
> wakeup latency. The switch is advertising 32us delay to come out of L1
> on both the upstream and downstream ports. As such the link would be
> considered marginal with L1 enabled and so it should be disabled.
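
Just to check that I follow the arithmetic: the i211 says it can
tolerate at most 64us of L1 exit latency, and with the root port ->
switch and switch -> NIC links each advertising 32us, waking them one
after the other already adds up to 32 + 32 = 64us - i.e. we are right
at (or past) the limit before anything else adds delay. Did I get that
right?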
>
> > It is also interesting that the Realtek card seems to be on the same link then?
> > With ASPM disabled, I wonder if that is due to the setpci command or
> > if it was disabled before...
> > (playing with setpci makes no difference, but it might require a reboot...)
>
> Are you using the same command you were using for the i211? Did you
> make sure to update the offset since the PCIe configuration block
> starts at a different offset? Also you probably need to make sure to
> only try to update function 0 of the device since I suspect the other
> functions won't have any effect.

Ah, no, I only toggled the i211 to see if that's what caused the ASPM
to be disabled...

But it seems it isn't -- I will have to reboot to verify, though.

> > > Consider that you are using a Gigabit Ethernet connection that is
> > > moving data at roughly 1000 bits per microsecond, or 125 bytes per
> > > microsecond. At that rate we should have roughly 270us worth of
> > > packets we can buffer before we are forced to start dropping packets
> > > assuming the device is configured with a default 34K Rx buffer. As
> > > such I am not entirely sure ASPM is the only issue we have here. I
> > > assume you may also have CPU C states enabled as well? By any chance
> > > do you have C6 or deeper sleep states enabled on the system? If so
> > > that might be what is pushing us into the issues that you were seeing.
> > > Basically we are seeing something that is causing the PCIe to stall
> > > for over 270us. My thought is that it is likely a number of factors
> > > where we have too many devices sleeping and as a result the total
> > > wakeup latency is likely 300us or more resulting in dropped packets.
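
(Checking the math for my own benefit: at 1 Gbit/s we move roughly 125
bytes per microsecond, so a ~34K receive buffer gives about
34000 / 125 = ~272us of headroom before drops - which matches your
~270us figure, assuming the 34K default is what the i211 actually uses
here.)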
> >
> > It seems like I only have C2 as max...
> >
> > grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> > /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> > /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> > /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> >
> > Anyway, we should bring this back to the mailing list
>
> That's fine. I assumed you didn't want to post the lspci to the
> mailing list as it might bounce for being too large.

Good thinking, but it was actually a slip :/

> So a generic solution for this would be to have a function that would
> scan the PCIe bus and determine the total L1 and L0s exit latency. If
> a device advertises an acceptable ASPM power state exit latency and we
> have met or exceeded that we should be disabling that ASPM feature for
> the device.

Yeah, since I'm on vacation I'll actually see if I can look into that -
something along the lines of the rough sketch below!
(I mean, I'm not that used to these kinds of things, but if my messing
around inspires someone, or if no one else is working on it, then...
what the hey ;) )
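
To make sure I'm reading your suggestion right, I'm thinking of
something like this - completely untested, the names are mine, only L1
is handled, and the latency encodings are my reading of the spec
(3 bits, roughly 1/2/4/.../64us, with 111b meaning "more than 64us" /
"no limit"):

#include <linux/kernel.h>
#include <linux/pci.h>

/* Upper bound, in microseconds, of this port's advertised L1 exit
 * latency (Link Capabilities, bits 17:15).
 */
static u32 l1_exit_latency_us(struct pci_dev *dev)
{
        u32 lnkcap = 0, enc;

        pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);
        enc = (lnkcap & PCI_EXP_LNKCAP_L1EL) >> 15;

        return (enc == 0x7) ? 65 : (1 << enc);  /* treat ">64us" as 65 */
}

/* The endpoint's acceptable L1 latency (Device Capabilities, bits 11:9) */
static u32 l1_acceptable_latency_us(struct pci_dev *endpoint)
{
        u32 devcap = 0, enc;

        pcie_capability_read_dword(endpoint, PCI_EXP_DEVCAP, &devcap);
        enc = (devcap & PCI_EXP_DEVCAP_L1) >> 9;

        return (enc == 0x7) ? U32_MAX : (1 << enc);     /* 111b == no limit */
}

/* Walk every hop between the endpoint and the root, assume (worst case)
 * that the links come out of L1 one after the other, and compare the
 * sum against what the endpoint says it can tolerate.  Note that this
 * also counts the internal switch hop as if it were a link, so it
 * slightly over-estimates - good enough to illustrate the idea.
 */
static bool l1_latency_exceeded(struct pci_dev *endpoint)
{
        struct pci_dev *dev = endpoint, *up;
        u32 total = 0;

        while ((up = pci_upstream_bridge(dev))) {
                /* each link is as slow as its slowest end */
                total += max(l1_exit_latency_us(dev), l1_exit_latency_us(up));
                dev = up;
        }

        return total > l1_acceptable_latency_us(endpoint);
}

The real version would presumably live in drivers/pci/pcie/aspm.c and
actually disable the offending ASPM state (and handle L0s too) instead
of just reporting it, but I wanted to check that I have the walk itself
right first.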

