[Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit

Alexander Duyck alexander.duyck at gmail.com
Sat Jul 25 00:45:27 UTC 2020


On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien at gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 12:49 AM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> >
> > On Sat, Jul 25, 2020 at 12:41 AM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> > >
> > > On Fri, Jul 24, 2020 at 11:51 PM Alexander Duyck
> > > <alexander.duyck at gmail.com> wrote:
> > > >
> > > > On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien at gmail.com> wrote:
>
> [--8<--]
>
> > > > > > >
> > > > > > > And:
> > > > > > > lspci -t
> > > > > > > -[0000:00]-+-00.0
> > > > > > >            +-00.2
> > > > > > >            +-01.0
> > > > > > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> > > > > >
> > > > > > I think I now know what patch broke things for you. It is most likely
> > > > > > this one that enabled ASPM on devices behind bridges:
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
> > > > >
> > > > > Ah, yes, correct
> > > > >
> > > > > > My advice would be to revert that patch and see if it resolves the
> > > > > > issue for you.
> > > > >
> > > > > Could do that yes =)
> > > > >
> > > > > I'm mainly looking for a more generic solution...
> > > >
> > > > That would be the generic solution. The patch has obviously broken
> > > > things so we need to address the issues. The immediate solution is to
> > > > revert it, but the more correct solution may be to do something like
> > > > add an allowlist for the cases where enabling ASPM will not harm
> > > > system performance.
> > >
> > > more like a generic solution like the one you mention below where we
> > > get the best of both worlds... =)
> > >
> > > > > > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > > > > > a bridge between it and the root complex. This can be problematic as I
> > > > > > believe the main reason for the code that was removed in the patch is
> > > > > > that wakeups can end up being serialized if all of the links are down
> > > > > > or you could end up with one of the other devices on the bridge
> > > > > > > utilizing the PCIe link and reducing the total throughput, especially
> > > > > > if you have the link to the root complex also taking part in power
> > > > > > management. Starting at the root complex it looks like you have the
> > > > > > link between the bridge and the PCIe switch. It is running L1 enabled
> > > > > > with a 32us time for it to reestablish link according to the root
> > > > > > complex side (00:01.2->1:00.0). The next link is the switch to the
> > > > > > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > > > > > bridge says it only needs 32us while the NIC is saying it will need
> > > > > > 64us. That upper bound is already a pretty significant value, however
> > > > > > you have two links to contend with so in reality you are looking at
> > > > > > something like 96us to bring up both links if they are brought up
> > > > > > serially.
> > > > >
> > > > > hummm... Interesting... I have never managed to parse that lspci thing
> > > > > properly...
> > > >
> > > > Actually I parsed it a bit incorrectly too.
> > > >
> > > > The i211 lists that it only supports up to 64us maximum delay in L1
> > > > wakeup latency. The switch is advertising 32us delay to come out of L1
> > > > on both the upstream and downstream ports. As such the link would be
> > > > considered marginal with L1 enabled and so it should be disabled.
> > > >
> > > > > It is also interesting that the realtek card seems to be on the same link then?
> > > > > With ASPM disabled, I wonder if that is due to the setpci command or
> > > > > if it was disabled before..
> > > > > (playing with setpci makes no difference but it might require a reboot.. )
> > > >
> > > > Are you using the same command you were using for the i211? Did you
> > > > make sure to update the offset since the PCIe configuration block
> > > > starts at a different offset? Also you probably need to make sure to
> > > > only try to update function 0 of the device since I suspect the other
> > > > functions won't have any effect.
> > >
> > > Ah, no, i only toggled the i211 to see if that's what caused the ASPM
> > > to be disabled...
> > >
> > > But it seems it isn't -- will have to reboot to verify though
> > >
> > > > > > > Consider that you are using a Gigabit Ethernet connection, which moves
> > > > > > > data at roughly 1000 bits per microsecond, or 125 bytes per
> > > > > > > microsecond. At that rate we should have roughly 270us worth of
> > > > > > packets we can buffer before we are forced to start dropping packets
> > > > > > assuming the device is configured with a default 34K Rx buffer. As
> > > > > > such I am not entirely sure ASPM is the only issue we have here. I
> > > > > > assume you may also have CPU C states enabled as well? By any chance
> > > > > > do you have C6 or deeper sleep states enabled on the system? If so
> > > > > > that might be what is pushing us into the issues that you were seeing.
> > > > > > Basically we are seeing something that is causing the PCIe to stall
> > > > > > for over 270us. My thought is that it is likely a number of factors
> > > > > > where we have too many devices sleeping and as a result the total
> > > > > > wakeup latency is likely 300us or more resulting in dropped packets.
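
To spell out the arithmetic behind that 270us figure, using the numbers
above:

	34 * 1024 bytes / 125 bytes-per-us ~= 278us of buffering at line rate

So anything that stalls the PCIe side for much longer than that has to
show up as drops.
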
> > > > >
> > > > > It seems like I only have C2 as max...
> > > > >
> > > > > grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> > > > > /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> > > > > /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> > > > > /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> > > > >
> > > > > Anyway, we should bring this back to the mailing list
> > > >
> > > > That's fine. I assumed you didn't want to post the lspci to the
> > > > mailing list as it might bounce for being too large.
> > >
> > > Good thinking, but it was actually a slip :/
> > >
> > > > So a generic solution for this would be to have a function that would
> > > > scan the PCIe bus and determine the total L1 and L0s exit latency. If
> > > > a device advertises an acceptable ASPM power state exit latency and we
> > > > have met or exceeded that we should be disabling that ASPM feature for
> > > > the device.
> > >
> > > Yeah, since I'm on vacation I'll actually see if I can look in to that!
> > > (I mean, I'm not that used to these kinds of things but if my messing
> > > around inspires someone
> > > or if no one else is working on it, then... what the hey ;) )
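
If it helps you get started: the two numbers that check needs are already
sitting in config space for every device -- the acceptable L1 latency in
the Device Capabilities register and the advertised L1 exit latency in
Link Capabilities. Something like this untested sketch would dump them
(the helper name is made up; the masks are the standard ones from
pci_regs.h, both fields being 3-bit encodings that step up in powers of
two to the 64us range):

	#include <linux/pci.h>

	/* Untested sketch: dump the raw L1 latency encodings for one device. */
	static void read_l1_latencies(struct pci_dev *dev)
	{
		u32 devcap, lnkcap;

		pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &devcap);
		pcie_capability_read_dword(dev, PCI_EXP_LNKCAP, &lnkcap);

		/* acceptable L1 latency the endpoint can tolerate (DevCap bits 11:9) */
		pci_info(dev, "acceptable L1 encoding: %u\n",
			 (devcap & PCI_EXP_DEVCAP_L1) >> 9);

		/* L1 exit latency this end of the link advertises (LnkCap bits 17:15) */
		pci_info(dev, "L1 exit latency encoding: %u\n",
			 (lnkcap & PCI_EXP_LNKCAP_L1EL) >> 15);
	}
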
> >
> > Uhm... so, in the function that determines latency they only do MAX
> >
> > Ie:
> > static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > {
> > ...
> >                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > ---
> >
> > I just want to see if I'm understanding you right, is it correct that
> > the latency should be:
> > a.up + b.dw + b.up + c.dw
> >
> > for a (root?) to go through b (bridge/switch?) to c (device)

Actually I think it is max(a.dw, b.up) + max(b.dw, c.up). Basically
what you want is the worst-case time to bring each link back up;
there are only two links in that path, so you just add up the maximum
time to bring up each one.
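
If you want to experiment with that in the function you quoted, the
mechanical change would be to accumulate the per-link worst case while
walking up toward the root instead of checking each link in isolation.
An untested sketch against that loop in pcie_aspm_check_latency()
(latencies there are in ns, and l1_switch_latency is the existing
1us-per-switch term):

	u32 l1_total = 0;

	while (link) {
		/* ... existing L0s checks unchanged ... */

		/* worst-case L1 exit latency for this link is the max of
		 * what the two ends advertise
		 */
		latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);

		/* accumulate over every link between the endpoint and root */
		l1_total += latency;
		if ((link->aspm_capable & ASPM_STATE_L1) &&
		    (l1_total + l1_switch_latency > acceptable->l1))
			link->aspm_capable &= ~ASPM_STATE_L1;
		l1_switch_latency += 1000;

		link = link->parent;
	}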

> Also, we only disabled L0, which isn't counted as a total at all, it
> only checks each side.

Not sure what you mean here. L0 is the link fully powered on. The two
link states we have to worry about are L0s and L1 as those involve
various states of power-down. The L1 latency is the nasty one as that
basically involves fully powering down the link and requires time for
the link to be reestablished.

> Since pcie is serial and judging from your previous statements I
> assume that the max statement is a bug.
> I also assume that l0 counts, and should be dealt with the same way
> and it should also be cumulative...

That latency check looks like it would be for a single link. Not each
link in the chain.

> The question becomes, is this latency from root? or is it "one step"?

Actually the function is doing both. I had to reread the spec.
Basically the switch is supposed to start trying to bring up the other
links as soon as it detects that we are trying to bring up the link.
In theory this is only supposed to add about 1us, so the total
bring-up time should be around 33us.

> Also they add one microsecond but that looks like it should be
> parent.l0.up + link.l0.dw latency values

Yes, the 1us is the value I referenced above. Basically the assumption
is that as soon as one link starts retraining, the switch should start
working on the other ones, so the serialization penalty is only
supposed to be about 1us.

> So are my assumptions correct that the serial nature means that all
> latenies stack?
>
> l0 is done first, so latency is actually l0 + l1 as max latency? (per side)

I don't think the L0s latency needs to be added if that is what you
are asking. Basically you either go from L0s->L0 or L1->L0. There is
no jumping between L0s and L1.

Something still isn't adding up with all this as the latency shouldn't
be enough to trigger buffer overruns. I wonder if we don't have
something that is misreporting the actual L1 wakeup latency. One thing
that I notice is that the link between the root complex and the PCIe
switch seems to have some sort of electrical issue. If you look at the
section from the upstream side of the switch:
LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

One thing that catches my eye is that it is only linked at x4 when
both sides are listing themselves as x8. Do you know if this has ever
linked at x8 or has it always been x4? With this being a Gen 4 x8
connection it would normally take a little while to establish a link,
and having to train down to x4 would add extra time and likely push it
out of the expected exit latency. Also, I notice that there are
mentions of lane errors in the config; however, I suspect those are
Gen 4 features that I am not familiar with, so I don't know if they
are normal. It might be interesting to reboot and see if the link
comes back up at x8 and if the lane errors clear at some point. If it
does, then we might want to disable ASPM on the upstream port of the
switch, since I have seen ASPM cause link downgrades and that may be
what is occurring here.
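
If the downgrade does turn out to be ASPM related, the usual kernel-side
way to pin a specific link out of L1 is a quirk on the device at the
downstream end of that link that calls pci_disable_link_state(). A rough,
untested sketch; the vendor/device IDs are placeholders that would have
to be filled in with whatever 01:00.0 actually is:

	#include <linux/pci.h>

	/* Placeholders only -- substitute the real IDs of the switch's
	 * upstream port (01:00.0).
	 */
	#define SWITCH_UPSTREAM_VENDOR_ID	0x1234
	#define SWITCH_UPSTREAM_DEVICE_ID	0x5678

	/* Untested sketch: disable L1 on the link above the switch
	 * (00:01.2 <-> 01:00.0) by matching the switch's upstream port.
	 */
	static void quirk_switch_upstream_no_l1(struct pci_dev *dev)
	{
		pci_info(dev, "disabling L1 on the upstream link\n");
		pci_disable_link_state(dev, PCIE_LINK_STATE_L1);
	}
	DECLARE_PCI_FIXUP_FINAL(SWITCH_UPSTREAM_VENDOR_ID,
				SWITCH_UPSTREAM_DEVICE_ID,
				quirk_switch_upstream_no_l1);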

