[Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit

Ian Kumlien ian.kumlien at gmail.com
Sat Jul 25 18:52:49 UTC 2020


On Sat, Jul 25, 2020 at 7:30 PM Alexander Duyck
<alexander.duyck at gmail.com> wrote:
>
> On Fri, Jul 24, 2020 at 6:03 PM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> >
> > On Sat, Jul 25, 2020 at 2:45 AM Alexander Duyck
> > <alexander.duyck at gmail.com> wrote:
> > >
> > > On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> > > >
> > > > On Sat, Jul 25, 2020 at 12:49 AM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> > > > >
> > > > > On Sat, Jul 25, 2020 at 12:41 AM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 24, 2020 at 11:51 PM Alexander Duyck
> > > > > > <alexander.duyck at gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jul 24, 2020 at 2:14 PM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> > > >
> > > > [--8<--]
> > > >
> > > > > > > > > >
> > > > > > > > > > And:
> > > > > > > > > > lspci -t
> > > > > > > > > > -[0000:00]-+-00.0
> > > > > > > > > >            +-00.2
> > > > > > > > > >            +-01.0
> > > > > > > > > >            +-01.2-[01-07]----00.0-[02-07]--+-03.0-[03]----00.0
> > > > > > > > >
> > > > > > > > > I think I now know what patch broke things for you. It is most likely
> > > > > > > > > this one that enabled ASPM on devices behind bridges:
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=next&id=66ff14e59e8a30690755b08bc3042359703fb07a
> > > > > > > >
> > > > > > > > Ah, yes, correct
> > > > > > > >
> > > > > > > > > My advice would be to revert that patch and see if it resolves the
> > > > > > > > > issue for you.
> > > > > > > >
> > > > > > > > Could do that yes =)
> > > > > > > >
> > > > > > > > I'm mainly looking for a more generic solution...
> > > > > > >
> > > > > > > That would be the generic solution. The patch has obviously broken
> > > > > > > things so we need to address the issues. The immediate solution is to
> > > > > > > revert it, but the more correct solution may be to do something like
> > > > > > > add an allowlist for the cases where enabling ASPM will not harm
> > > > > > > system performance.
> > > > > >
> > > > > > more like a generic solution like the one you mention below where we
> > > > > > get the best of both worlds... =)
> > > > > >
> > > > > > > > > Device 3:00.0 is your i211 gigabit network controller. Notice you have
> > > > > > > > > a bridge between it and the root complex. This can be problematic as I
> > > > > > > > > believe the main reason for the code that was removed in the patch is
> > > > > > > > > that wakeups can end up being serialized if all of the links are down
> > > > > > > > > or you could end up with one of the other devices on the bridge
> > > > > > > > > utilizing the PCIe link and reducing the total throughput, especially
> > > > > > > > > if you have the link to the root complex also taking part in power
> > > > > > > > > management. Starting at the root complex it looks like you have the
> > > > > > > > > link between the bridge and the PCIe switch. It is running L1 enabled
> > > > > > > > > with a 32us time for it to reestablish link according to the root
> > > > > > > > > complex side (00:01.2->1:00.0). The next link is the switch to the
> > > > > > > > > i211 which is 2:03.0 -> 3:00.0. The interesting bit here is that the
> > > > > > > > > bridge says it only needs 32us while the NIC is saying it will need
> > > > > > > > > 64us. That upper bound is already a pretty significant value, however
> > > > > > > > > you have two links to contend with so in reality you are looking at
> > > > > > > > > something like 96us to bring up both links if they are brought up
> > > > > > > > > serially.
> > > > > > > >
> > > > > > > > hummm... Interesting... I have never managed to parse that lspci thing
> > > > > > > > properly...
> > > > > > >
> > > > > > > Actually I parsed it a bit incorrectly too.
> > > > > > >
> > > > > > > The i211 lists that it only supports up to 64us maximum delay in L1
> > > > > > > wakeup latency. The switch is advertising 32us delay to come out of L1
> > > > > > > on both the upstream and downstream ports. As such the link would be
> > > > > > > considered marginal with L1 enabled and so it should be disabled.
> > > > > > >
> > > > > > > > It is also interesting that the realtek card seems to be on the same link then?
> > > > > > > > With ASPM disabled, I wonder if that is due to the setpci command or
> > > > > > > > if it was disabled before..
> > > > > > > > (playing with setpci makes no difference but it might require a reboot.. )
> > > > > > >
> > > > > > > Are you using the same command you were using for the i211? Did you
> > > > > > > make sure to update the offset since the PCIe configuration block
> > > > > > > starts at a different offset? Also you probably need to make sure to
> > > > > > > only try to update function 0 of the device since I suspect the other
> > > > > > > functions won't have any effect.
> > > > > >
> > > > > > Ah, no, I only toggled the i211 to see if that's what caused the ASPM
> > > > > > to be disabled...
> > > > > >
> > > > > > But it seems it isn't -- will have to reboot to verify though
> > > > > >
> > > > > > > > > Consider that you are using a Gigabit Ethernet connection
> > > > > > > > > that is moving data at roughly 1000 bits per microsecond, or 125 bytes
> > > > > > > > > per microsecond. At that rate we should have roughly 270us worth of
> > > > > > > > > packets we can buffer before we are forced to start dropping packets
> > > > > > > > > assuming the device is configured with a default 34K Rx buffer. As
> > > > > > > > > such I am not entirely sure ASPM is the only issue we have here. I
> > > > > > > > > assume you may also have CPU C states enabled as well? By any chance
> > > > > > > > > do you have C6 or deeper sleep states enabled on the system? If so
> > > > > > > > > that might be what is pushing us into the issues that you were seeing.
> > > > > > > > > Basically we are seeing something that is causing the PCIe to stall
> > > > > > > > > for over 270us. My thought is that it is likely a number of factors
> > > > > > > > > where we have too many devices sleeping and as a result the total
> > > > > > > > > wakeup latency is likely 300us or more resulting in dropped packets.
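
(Sanity-checking that math: 1 Gbit/s is 125 bytes per microsecond, so a
34K Rx buffer is roughly 34000 / 125 = 272 us of buffering before drops,
which matches the ~270 us figure above - assuming the 34K default is the
right number for the i211.)
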
> > > > > > > >
> > > > > > > > It seems like I only have C2 as max...
> > > > > > > >
> > > > > > > > grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
> > > > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
> > > > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
> > > > > > > > /sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2
> > > > > > > >
> > > > > > > > Anyway, we should bring this back to the mailing list
> > > > > > >
> > > > > > > That's fine. I assumed you didn't want to post the lspci to the
> > > > > > > mailing list as it might bounce for being too large.
> > > > > >
> > > > > > Good thinking, but it was actually a slip :/
> > > > > >
> > > > > > > So a generic solution for this would be to have a function that would
> > > > > > > scan the PCIe bus and determine the total L1 and L0s exit latency. If
> > > > > > > a device advertises an acceptable ASPM power state exit latency and we
> > > > > > > have met or exceeded that we should be disabling that ASPM feature for
> > > > > > > the device.
> > > > > >
> > > > > > Yeah, since I'm on vacation I'll actually see if I can look into that!
> > > > > > (I mean, I'm not that used to these kinds of things, but if my messing
> > > > > > around inspires someone, or if no one else is working on it, then...
> > > > > > what the hey ;) )
> > > > >
> > > > > Uhm... so, in the function that determines latency they only do MAX
> > > > >
> > > > > Ie:
> > > > > static void pcie_aspm_check_latency(struct pci_dev *endpoint)
> > > > > {
> > > > > ...
> > > > >                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> > > > > ---
> > > > >
> > > > > I just want to see if I'm understanding you right, is it correct that
> > > > > the latency should be:
> > > > > a.up + b.dw + b.up + c.dw
> > > > >
> > > > > for a (root?) to go through b (bridge/switch?) to c (device)
> > >
> > > Actually I think it is max(a.dw, b.up) + max(b.dw, a.up). Basically
> > > what you want is the maximum time to bring each link up; you only have
> > > two links here, so you just add up the maximum time to bring up each
> > > of them.
> >
> > Ah, so it's not cumulative per link, just the max of the two sides, got it!
> >
> > > > Also, we only disabled L0, which isn't counted as a total at all, it
> > > > only checks each side.
> > >
> > > Not sure what you mean here. L0 is the link fully powered on. The two
> > > link states we have to worry about are L0s and L1 as those involve
> > > various states of power-down. The L1 latency is the nasty one as that
> > > basically involves fully powering down the link and requires time for
> > > the link to be reestablished.
> >
> > we basically did the &= ~ASPM_STATE_L0S - is the S indicative of something?
>
> So the command I gave you was basically clearing both the L1 and L0S
> states. It disabled ASPM entirely. However it looks like only L1 is
> supported on your platform.

Ah ok, perhaps I missed that somewhere, it looked like L0s was off but
L1 was still on.
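
(For reference, what I'm going by is the LnkCtl line in lspci for the NIC,
something like:

  lspci -vvv -s 03:00.0 | grep 'LnkCtl:'
          LnkCtl: ASPM L1 Enabled; Disabled- CommClk+

the exact output obviously varies per device, but that's where the
"L1 still on" above comes from.)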

> > > > Since PCIe is serial, and judging from your previous statements, I
> > > > assume that the max statement is a bug.
> > > > I also assume that L0 counts, and should be dealt with the same way,
> > > > and it should also be cumulative...
> > >
> > > That latency check looks like it would be for a single link. Not each
> > > link in the chain.
> >
> > Yes, it checks each link in the chain, which is incorrect, it's actually the
> > cumulative latency that is important... Well... according to what I have
> > been able to gather from various documents anyway ;)
>
> Right. We would need to determine the latency of the entire chain. So
> that would effectively be the max for any one link plus 1us for every
> switch it has to pass through.

Which is what the patch I did actually does =)

Gonna test it soon to see if it handles my case
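
The rough shape of what I'm trying looks something like this (just a
hand-waved sketch of the approach, not the actual diff - names borrowed
from drivers/pci/pcie/aspm.c, where the latency values are kept in ns):

static void pcie_aspm_check_latency(struct pci_dev *endpoint)
{
        u32 latency, l1_switch_latency = 0, l1_max_latency = 0;
        struct aspm_latency *acceptable;
        struct pcie_link_state *link;

        link = endpoint->bus->self->link_state;
        acceptable = &link->acceptable[PCI_FUNC(endpoint->devfn)];

        /* Walk from the endpoint's link up towards the root complex */
        while (link) {
                /* (D0 check and L0s handling as before, omitted here) */

                /* Worst single-link L1 exit latency seen on the path so far */
                latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
                l1_max_latency = max_t(u32, latency, l1_max_latency);

                /*
                 * Disable L1 if that path maximum, plus 1us for every
                 * switch hop crossed so far, exceeds what the endpoint
                 * says it can accept.
                 */
                if ((link->aspm_capable & ASPM_STATE_L1) &&
                    (l1_max_latency + l1_switch_latency > acceptable->l1))
                        link->aspm_capable &= ~ASPM_STATE_L1;

                l1_switch_latency += 1000;      /* +1us per switch */

                link = link->parent;
        }
}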

> > > > The question becomes, is this latency from root? or is it "one step"?
> > >
> > > Actually the function is doing both. I had to reread the spec.
> > > Basically the switch is supposed to start trying to bring up the other
> > > links as soon as it detects that we are trying to bring up the link.
> > > In theory this is only supposed to add about 1us. So in theory the
> > > total bring-up time should be 33us.
> >
> > Ah ok, thanks, that answers another question in the chain ;)
> >
> > > > Also they add one microsecond but that looks like it should be
> > > > parent.l0.up + link.l0.dw latency values
> > >
> > > Yes, the 1us is the value I reference above. Basically the assumption
> > > is that as soon as one link starts retraining it should start working
> > > on the other ones so the serialization penalty is only supposed to be
> > > 1us.
> >
> > AH!
> >
> > > > So are my assumptions correct that the serial nature means that all
> > > > latencies stack?
> > > >
> > > > l0 is done first, so latency is actually l0 + l1 as max latency? (per side)
> > >
> > > I don't think the L0s latency needs to be added if that is what you
> > > are asking. Basically you either go from L0s->L0 or L1->L0. There is
> > > no jumping between L0s and L1.
> >
> > Ok!
> >
> > > Something still isn't adding up with all this as the latency shouldn't
> > > be enough to trigger buffer overruns. I wonder if we don't have
> > > something that is misreporting the actual L1 wakeup latency. One thing
> > > that I notice is that the link between the root complex and the PCIe
> > > switch seems to have some sort of electrical issue. If you look at the
> > > section from the upstream side of the switch:
> > > LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L1 <32us
> > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> > > LnkCtl: ASPM L1 Enabled; Disabled- CommClk+
> > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > LnkSta: Speed 16GT/s (ok), Width x4 (downgraded)
> > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > >
> > > One thing that catches my eye is that it is only linked at x4 when
> > > both sides are listing themselves as x8. Do you know if this has ever
> > > linked at x8 or has it always been x4? With this being a Gen 4 x8
> > > connection it would normally take a little while to establish a link,
> > > but with it having to fail down to a x4 that would add extra time and
> > > likely push it out of the expected exit latency. Also I notice that
> > > there are mentions of lane errors in the config, however I suspect
> > > those are Gen 4 features that I am not familiar with so I don't know
> > > if those are normal. It might be interesting to reboot and see if the
> > > link goes back to a x8 and if the lane errors clear at some point. If
> > > it does then we might want to disable ASPM on the upstream port of the
> > > switch since I have seen ASPM cause link downgrades and that may be
> > > what is occurring here.
> >
> > Humm... And that would mean disabling ASPM completely to test?
>
> Maybe. Right after boot there is a good likelihood that the link will
> be the most healthy, so it is likely to still be x8 if that is the
> true width of the link. If we are seeing it degrade over time that
> would be a sign that maybe we should disable L1 on the link between
> the switch and the root complex instead of messing with the NICs.

I doubt it, I suspect that the chip can do more but it's just not there...

Just like, for example:
LnkSta: Speed 8GT/s (ok), Width x8 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

A PCIe v3 x16 card in an x8 slot.

