[Intel-wired-lan] NAT performance issue 944mbit -> ~40mbit

Alexander Duyck alexander.duyck at gmail.com
Sat Jul 25 17:43:25 UTC 2020


On Sat, Jul 25, 2020 at 6:53 AM Ian Kumlien <ian.kumlien at gmail.com> wrote:
>
> On Sat, Jul 25, 2020 at 3:03 AM Ian Kumlien <ian.kumlien at gmail.com> wrote:
> > On Sat, Jul 25, 2020 at 2:45 AM Alexander Duyck
> > <alexander.duyck at gmail.com> wrote:
> > > On Fri, Jul 24, 2020 at 4:08 PM Ian Kumlien <ian.kumlien at gmail.com> wrote:
>
> [--8<--]
>
> > > Actually I think it is max(a.dw, b.up) + max(b.dw, a.up). Basically
> > > what you want is the maximum time to bring each link up; since there
> > > are only 2 links, you just add up the maximum bring-up time for each
> > > of them.
> >
> > Ah so it's not cumulative per link, only max value on one, got it!
> >
> > > > Also, we only disabled L0, which isn't counted as a total at all, it
> > > > only checks each side.
> > >
> > > Not sure what you mean here. L0 is the link fully powered on. The two
> > > link states we have to worry about are L0s and L1 as those involve
> > > various states of power-down. The L1 latency is the nasty one as that
> > > basically involves fully powering down the link and requires time for
> > > the link to be reestablished.
> >
> > we basically did the &= ~ASPM_STATE_L0S - is the S indicative of something?
> >
> > > > Since pcie is serial and judging from your previous statements I
> > > > assume that the max statement is a bug.
> > > > I also assume that l0 counts, and should be dealt with the same way
> > > > and it should also be cumulative...
> > >
> > > That latency check looks like it would be for a single link. Not each
> > > link in the chain.
> >
> > Yes, it checks each link in the chain, which is incorrect; it's actually
> > the cumulative latency that matters... Well... according to what I have
> > been able to gather from various documents anyway ;)
> >
> > > > The question becomes, is this latency from root? or is it "one step"?
> > >
> > > Actually the function is doing both. I had to reread the spec.
> > > Basically the switch is supposed to start trying to bring up the other
> > > links as soon as it detects that we are trying to bring up the link.
> > > In theory this is only supposed to add about 1us. So in theory the
> > > total bring-up time should be 33us.
> >
> > Ah ok, thanks, that answers another question in the chain ;)
>
> So, then this is what should be done:
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index b17e5ffd31b1..2b8f7ea7f7bc 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -434,7 +434,7 @@ static void pcie_get_aspm_reg(struct pci_dev *pdev,
>
>  static void pcie_aspm_check_latency(struct pci_dev *endpoint)
>  {
> -       u32 latency, l1_switch_latency = 0;
> +       u32 latency, l1_max_latency = 0, l1_switch_latency = 0;
>         struct aspm_latency *acceptable;
>         struct pcie_link_state *link;
>
> @@ -470,8 +470,9 @@ static void pcie_aspm_check_latency(struct pci_dev *endpoint)
>                  * substate latencies (and hence do not do any check).
>                  */
>                 latency = max_t(u32, link->latency_up.l1, link->latency_dw.l1);
> +               l1_max_latency = max_t(u32, latency, l1_max_latency);
>                 if ((link->aspm_capable & ASPM_STATE_L1) &&
> -                   (latency + l1_switch_latency > acceptable->l1))
> +                   (l1_max_latency + l1_switch_latency > acceptable->l1))
>                         link->aspm_capable &= ~ASPM_STATE_L1;
>                 l1_switch_latency += 1000;
> ---

This makes sense to me. You might want to submit it to the linux-pci
mailing list.
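
Just to sanity check the arithmetic, below is a quick stand-alone model
of what the patched loop ends up computing. The struct and function
names are made up for illustration (they are not the aspm.c types), and
I am assuming the latencies are tracked in nanoseconds, which is why
each switch hop adds 1000:

/* Toy model only, not kernel code: the names below are invented. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct model_link {
        uint32_t latency_up_l1;   /* L1 exit latency, upstream side (ns) */
        uint32_t latency_dw_l1;   /* L1 exit latency, downstream side (ns) */
};

/*
 * links[0] is the link closest to the endpoint, links[nlinks - 1] the
 * one closest to the root.  Returns true if the endpoint's acceptable
 * L1 latency budget is met at every hop on the way up.
 */
static bool l1_latency_ok(const struct model_link *links, int nlinks,
                          uint32_t acceptable_l1)
{
        uint32_t l1_max_latency = 0, l1_switch_latency = 0;
        int i;

        for (i = 0; i < nlinks; i++) {
                uint32_t latency = links[i].latency_up_l1 > links[i].latency_dw_l1 ?
                                   links[i].latency_up_l1 : links[i].latency_dw_l1;

                /* The links wake in parallel, so only the slowest one counts... */
                if (latency > l1_max_latency)
                        l1_max_latency = latency;

                /* ...plus roughly 1us of propagation per switch crossed so far. */
                if (l1_max_latency + l1_switch_latency > acceptable_l1)
                        return false;

                l1_switch_latency += 1000;
        }
        return true;
}

int main(void)
{
        /* A device with 32us exit latency behind one switch, 64us budget. */
        struct model_link chain[] = {
                { .latency_up_l1 = 32000, .latency_dw_l1 = 8000 },
                { .latency_up_l1 = 8000,  .latency_dw_l1 = 8000 },
        };

        printf("L1 allowed: %s\n",
               l1_latency_ok(chain, 2, 64000) ? "yes" : "no");
        return 0;
}

That example works out to 32us plus 1us of switch latency, i.e. the
33us total bring-up time mentioned above, comfortably inside a 64us
budget.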

> for l1 latency... I do however find it odd that you don't disable it
> on the endpoint but on the
> potential bridge/switch/root you're on - shouldn't the disable be on
> endpoint->bus->self->link_state?

I think the idea is that we want to leave the leaves of the PCIe tree
with ASPM enabled and disable it as we approach the trunk, since we
are more likely to see frequent wakeups the closer we get to the root
complex, or at least that would be my theory anyway. Basically, the
closer you get to the root complex, the more devices are sharing the
path, so the more likely the link is to have to stay awake anyway.
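
To make that concrete: the check starts at the link directly upstream
of the endpoint and then walks toward the root, and it is whichever
link in that chain blows the budget that gets its capability bit
cleared, not the endpoint itself. Paraphrasing the loop from memory
(so not the exact aspm.c source):

        link = endpoint->bus->self->link_state; /* link nearest the endpoint */
        while (link) {
                /* ... per-link latency checks ... */
                if ((link->aspm_capable & ASPM_STATE_L1) &&
                    (l1_max_latency + l1_switch_latency > acceptable->l1))
                        link->aspm_capable &= ~ASPM_STATE_L1; /* cleared here */
                l1_switch_latency += 1000;
                link = link->parent;            /* one hop closer to the root */
        }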

> Anyway, this should handle any latency bumps... and could be done
> differently reusing latency and:
> latency = max_t(u32, latency, max_t(u32, link->latency_up.l1,
> link->latency_dw.l1));
>
> but kept it this way for legibility...
>
> But for L0 -- been looking at it and I wonder... from what I can see
> it's cumulative for the link, but L0S seems
> different and is perhaps not quite the same...

You have mentioned L0 several times now and I wonder what you are
referring to. L0 is the fully powered-on state; that is the state we
are trying to get back to. L0s and L1 are the lower-power states that
we have to get out of, with L1 being the much more complicated one:
if I recall correctly, we shut down the clocks and the link, and have
to reestablish both before we can resume operation.
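
And on the earlier ASPM_STATE_L0S question: yes, the "s" matters. L0s
is a separate, shallower low-power standby state, distinct from L0,
which is just the active state. From memory, aspm.c tracks the states
as separate bits, one per direction for L0s, roughly like this (the
exact names/values may differ):

        #define ASPM_STATE_L0S_UP       (1)     /* upstream direction L0s */
        #define ASPM_STATE_L0S_DW       (2)     /* downstream direction L0s */
        #define ASPM_STATE_L1           (4)     /* L1 */
        #define ASPM_STATE_L0S          (ASPM_STATE_L0S_UP | ASPM_STATE_L0S_DW)

So clearing ASPM_STATE_L0S disables both directions of L0s and leaves
L1 alone; there is no bit for L0 because it is the active state, not
something that gets disabled.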

