[Intel-wired-lan] Kernel 4.6.7-rt13: Intel Ethernet driver igb causes huge latencies in cyclictest

Thu Oct 13 16:18:39 UTC 2016

Hey Mathias-

On Thu, Oct 13, 2016 at 10:57:18AM +0000, Koehrer Mathias (ETAS/ESW5) wrote:
[..]

Interesting indeed!

> > Here are some places where I added traces:
> > In file igb_ptp.c:
> > void igb_ptp_rx_hang(struct igb_adapter *adapter) {
> >         struct e1000_hw *hw = &adapter->hw;
> >         unsigned long rx_event;
> >         u32 tsyncrxctl;
> >         trace_igb(700);
> >         tsyncrxctl = rd32(E1000_TSYNCRXCTL);
> >         trace_igb(701);
> >
> >         /* Other hardware uses per-packet timestamps */
> >         if (hw->mac.type != e1000_82576)
> >                 return;
> > ...
> >
> > In file igb_main.c:
> > static void igb_check_lvmmc(struct igb_adapter *adapter) {
> >         struct e1000_hw *hw = &adapter->hw;
> >         u32 lvmmc;
> >
> >         trace_igb(600);
> >         lvmmc = rd32(E1000_LVMMC);
> >         trace_igb(601);
> >         if (lvmmc) {
> > ...
> >
[..]
> > The time between my trace points 700 and 701 is about 30us, the time between my
> > trace points 600 and 601 is even 37us!!
> >
> > The code in between is
> >      tsyncrxctl = rd32(E1000_TSYNCRXCTL); resp.
> >      lvmmc = rd32(E1000_LVMMC);
> >
> > In both cases this is a single read from a register.
> >
> > I have no idea why this single read could take that much time!

Are these the only registers you see this amount of delay when reading?

It's also possible that it's not these registers themselves that cause
problems, but any writes prior to these reads.  That is, given to PCI's
posted write behavior, it could be that these reads are delayed only
because it's flushing previously writes to the device.

> > Is it possible that the igb hardware is in a state that delays the read access and this is
> > why the whole I/O system might be delayed?

One additional hypothesis is that some register accesses trigger
accesses to off-chip resources synchronously; for example, a write to
enable timestamping needs to access an external phy on a slower bus,
etc.  I don't know enough about this device to say whether or not that
happens or not.

> To have a proper comparison, I did the same with kernel 3.18.27-rt27.
> Also here, I instrumented the igb driver to get traces for the rd32 calls.
> However, here everything is generally much faster!

> In the idle system the maximum I got for a read was about 6us, most times it was 1-2us.
> On the 4.8 kernel this is always much slower (see above).
> My question is now: Is there any kernel config option that has been introduced in the meantime
> that may lead to this effect and which is not set in my 4.8 config?

Have you tested on a vanilla (non-RT) kernel?  I doubt there is anything
RT specific about what you are seeing, but it might be nice to get
confirmation.  Also, bisection would probably be easier if you confirm
on a vanilla kernel.

I find it unlikely that it's a kernel config option that changed which
regressed you, but instead was a code change to a driver.  Which driver
is now the question, and the surface area is still big (processor
mapping attributes for this region, PCI root complex configuration, PCI
brige configuration, igb driver itself, etc.).

Big enough that I'd recommend a bisection.  It looks like a bisection
between 3.18 and 4.8 would take you about 18 tries to narrow down,
assuming all goes well.

   Julia