[Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver

Winkler, Tomas tomas.winkler at intel.com
Tue Sep 3 09:28:30 UTC 2019



> On Tue, Sep 03, 2019 at 10:35:30AM +0200, Paul Menzel wrote:
> > Dear Gavin,
> >
> >
> > Thank you for following up on this.
> >
> > On 03.09.19 09:56, Gavin Lambert wrote:
> > > On 2019-08-20 14:15, I wrote:
> > > > Does anyone have any ideas about this?  Either towards further
> > > > investigation or to a possible resolution?
> > > >
> > > > This is at the point of hardware internals now, so I have no idea
> > > > how to proceed in either area.
> > >
> > > To recap (plus some new info):
> > >
> > > 1. I am using a kernel module which uses the code from the e1000e
> > > driver to communicate with the hardware without actually registering
> > > it as a Linux netdev.  (This is partly because it can get used in a
> > > Xenomai context outside of Linux itself, although I'm not doing that
> > > myself.) This historically works fine.
> > >
> > > 2. On certain Linux versions, I encountered an issue where
> > > disconnecting the network cable and reconnecting it almost always
> > > results in not being able to send any packets.  (I cannot determine
> > > if receiving packets works in this case, as the network design will
> > > not receive packets unless some are sent first.)  Restarting the
> > > driver (rmmod+modprobe) does recover from this case (until the next
> > > link loss), but simply replugging the cable never does.
> > >
> > > 3. The problem was observed with both I219-V and I219-LM (on
> > > motherboard), but was *not* observed with 82571EB (PCIE).  The
> > > problem was not observed with a motherboard igb-based I211.  I
> > > suspect the issue is limited to motherboard-based e1000e adapters.
> > > (Or perhaps there's something different about how the IGBs are
> > > internally connected.)
> > >
> > > 4. The problem does not occur when the e1000e driver is registered
> > > "normally" as a Linux netdev.
> > >
> > > > 5. The problem was introduced by "mei: me: allow runtime pm for
> > > > platform with D0i3" (which has been backported to 4.4+, as far as I
> > > > can tell).  Excluding this commit reliably resolves the issue and
> > > > including it reliably breaks it.
> >
> > The commit hash in the master branch is
> > cc365dcf0e56271bedf3de95f88922abe248e951, and it has been there since
> > v4.16-rc1.
> >
> > Strange that it is in 4.4 and 4.9, as it was only tagged for v4.13+.
> >
> > > > 6. Applying the previously suggested patch
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
> > > > has no effect; the E1000_STATUS_PCIM_STATE bit is not set when the
> > > > issue occurs.
> > >
> > > 7. Given the content of the change in #5, I assumed that the problem
> > > was power-management related, perhaps a side effect of the e1000e
> > > driver not being registered as a netdev.  (So perhaps something
> > > thinks that no devices are in use and turns something off?)
> > >
> > > 8. I've previously posted register dumps from an e1000e in both the
> > > "normal" and "link up but not transmitting" states.  They seemed
> > > very similar, but as I'm not familiar with the register meanings I
> > > > may have overlooked something significant.  (Note that the dumps
> > > > were captured inside the watchdog task, when it detects link up but
> > > > before it sets E1000_TCTL_EN.)
> > >
> > > 9. I enabled debug logging in the mei driver; it logs a couple of
> > > runtime_idles and then a runtime_suspend during system startup.  (I
> > > added a log to runtime_resume that is missing in the driver source,
> > > but it appears this does not get called in my scenario.)  Note that
> > > the e1000e driver is still working ok after this.. at least at first.
> > >
> > > 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
> > > => "suspended"
> > > >      "cat /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
> > > => "unsupported"
> > >      "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
> > > => "active"
> > >      "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
> > > => "active" (this is the actual NIC)
> > >      These don't change between the working and non-working states.
> > > (It's possible that some other device does, but I haven't found it
> > > yet.)
> > >
> > > 11. I did try forcing the above to unsuspend, but this did not
> > > recover from the e1000e issue.
> > >
> > > 12. I also tried calling e1000e_reset on link-down.  This produces
> > > different register output on link-up, but doesn't recover from the
> > > issue.
> > >
> > > 13. I also tried recompiling the kernel with CONFIG_PM disabled (no
> > > power management).  This *does* resolve the problem (but is a very
> > > big hammer).
> > >
> > > 14. Possibly also of interest is that if I do *both* #12 and #13,
> > > the problem remains (suggesting #12 was counter-productive).
> > >
> > > FYI the hardware on one of the test machines is as follows:
> > >      00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
> > > >      00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
> > > >      00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
> > > >      00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
> > > >      00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
> > > >      00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
> > > >      00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
> > > >      00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
> > > >      00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
> > > >      00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
> > > >      00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
> > > >      00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
> > > >      00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
> > > >      00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
> > > >      00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
> > > >      00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
> > > >      00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
> > > >      00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
> > > >      00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
> > > >      02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> > > >      03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> > > >      05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> > >
> > > I'm happy to add any code instrumentation or make any other changes
> > > > needed to locate and resolve the problem, and I can readily
> > > > reproduce it -- I'm just at a complete loss as to where to start
> > > > looking, and am still hoping for some suggestions in that regard.
> > >
> > > If there's anywhere (or anyone) else better for me to talk to about
> > > this issue, please let me know that too.
> >
> > It is not clear to me whether this is still reproducible on Linux
> > 5.3-rc7 (or Linus’ master branch).
> >
> > If it is, this is definitely a regression, and the commits need to be
> > reverted due to Linux’ no regression policy.
> 
> So I should revert this from 4.4.y and 4.9.y?

The issue is not in the mei driver; it is in the e1000e driver. To the
best of my knowledge there should be a fix. Vitaly, can it please be
backported to older kernels?
Thanks
Tomas
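
For anyone else trying to narrow this down: item 10 of the quoted report
notes that some other device's runtime PM state might differ between the
working and non-working cases. A minimal sketch for snapshotting the
runtime_status of every PCI device, so that two captures can be diffed,
could look like the following. It assumes the standard
/sys/bus/pci/devices layout; the function names and the SYSFS_PCI
override are made up for illustration and are not part of any driver or
tool.

```shell
#!/bin/sh
# Base directory holding one entry per PCI device.
# Overridable so the functions can be pointed at a test tree (assumption).
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Print the runtime PM status of one device.
# $1 is a full PCI address such as 0000:00:1f.6
runtime_status() {
    status_file="$SYSFS_PCI/$1/power/runtime_status"
    if [ -r "$status_file" ]; then
        cat "$status_file"
    else
        # Attribute missing or unreadable for this device.
        echo "unavailable"
    fi
}

# Print "<address> <status>" for every device under SYSFS_PCI.
snapshot_all() {
    for dev_dir in "$SYSFS_PCI"/*; do
        [ -d "$dev_dir" ] || continue
        addr="${dev_dir##*/}"
        printf '%s %s\n' "$addr" "$(runtime_status "$addr")"
    done
}
```

A possible way to use it: run "snapshot_all > before.txt" while
transmission works, replug the cable, run "snapshot_all > after.txt", and
then "diff before.txt after.txt". An empty diff would match the
observation in item 10 that the listed devices do not change state.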