[Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver

Gavin Lambert intel at mirality.co.nz
Tue Sep 3 07:56:26 UTC 2019


On 2019-08-20 14:15, I wrote:
> Does anyone have any ideas about this?  Either towards further
> investigation or to a possible resolution?
> 
> This is at the point of hardware internals now, so I have no idea how
> to proceed in either area.

To recap (plus some new info):

1. I am using a kernel module which uses the code from the e1000e driver 
to communicate with the hardware without actually registering it as a 
Linux netdev.  (This is partly because it can get used in a Xenomai 
context outside of Linux itself, although I'm not doing that myself.)  
This has historically worked fine.

2. On certain Linux versions, I encountered an issue where disconnecting 
the network cable and reconnecting it almost always results in not being 
able to send any packets.  (I cannot determine if receiving packets 
works in this case, as the network design will not receive packets 
unless some are sent first.)  Restarting the driver (rmmod+modprobe) 
does recover from this case (until the next link loss), but simply 
replugging the cable never does.

3. The problem was observed with both I219-V and I219-LM (on 
motherboard), but was *not* observed with 82571EB (PCIE).  The problem 
was not observed with a motherboard igb-based I211.  I suspect the issue 
is limited to motherboard-based e1000e adapters.  (Or perhaps there's 
something different about how the IGBs are internally connected.)

4. The problem does not occur when the e1000e driver is registered 
"normally" as a Linux netdev.

5. The problem was introduced by "mei: me: allow runtime pm for platform 
with D0i3" (which has been backported to 4.4+, as far as I can tell).  
Excluding this commit reliably resolves the issue and including it 
reliably breaks it.

6. Applying the previously suggested patch 
https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56 
has no effect; the E1000_STATUS_PCIM_STATE bit is not set when the issue 
occurs.

7. Given the content of the change in #5, I assumed that the problem was 
power-management related, perhaps a side effect of the e1000e driver not 
being registered as a netdev.  (So perhaps something thinks that no 
devices are in use and turns something off?)

8. I've previously posted register dumps from an e1000e in both the 
"normal" and "link up but not transmitting" states.  They seemed very 
similar, but as I'm not familiar with the register meanings I may have 
overlooked something significant.  (Note that the dumps were captured 
inside the watchdog task, when it detects link up but before it sets 
E1000_TCTL_EN.)

9. I enabled debug logging in the mei driver; it logs a couple of 
runtime_idles and then a runtime_suspend during system startup.  (I 
added a log to runtime_resume that is missing in the driver source, but 
it appears this does not get called in my scenario.)  Note that the 
e1000e driver is still working OK after this, at least at first.

10. "cat /sys/devices/pci0000:00/0000:00:16.0/power/runtime_status"
       => "suspended"
     "cat /sys/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
       => "unsupported"
     "cat /sys/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
       => "active"
     "cat /sys/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
       => "active" (this is the actual NIC)
     These don't change between the working and non-working states.  
(It's possible that the status of some other device does, but I haven't 
found it yet.)

11. I did try forcing the above devices out of runtime suspend, but this 
did not recover from the e1000e issue.

12. I also tried calling e1000e_reset on link-down.  This produces 
different register output on link-up, but doesn't recover from the 
issue.

13. I also tried recompiling the kernel with CONFIG_PM disabled (no 
power management).  This *does* resolve the problem (but is a very big 
hammer).

14. Possibly also of interest is that if I do *both* #12 and #13, the 
problem remains (suggesting #12 was counter-productive).
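
The check in #10 covers only four devices. As a minimal sketch, the 
following snapshots runtime_status for every PCI device so the working 
and non-working states can be diffed. It assumes the standard sysfs 
layout under /sys/bus/pci/devices; the function name and the optional 
root argument are hypothetical helpers, not part of any existing tool.

```shell
#!/bin/sh
# Print "<pci-address> <runtime_status>" for every PCI device.
# $1: sysfs root (defaults to /sys; overridable so the function can be
#     tried against a copy of the tree).
snapshot_runtime_status() {
    root="${1:-/sys}"
    for f in "$root"/bus/pci/devices/*/power/runtime_status; do
        # Skip devices without a readable attribute (also covers the
        # case where the glob matched nothing).
        [ -r "$f" ] || continue
        dev="${f%/power/runtime_status}"
        printf '%s %s\n' "${dev##*/}" "$(cat "$f")"
    done
}
```

Usage would be along the lines of: run "snapshot_runtime_status > 
working.txt" while packets still flow, replug the cable, run 
"snapshot_runtime_status > broken.txt", then "diff working.txt 
broken.txt".  Only PCI devices are covered; platform or ACPI devices 
would need a similar walk over their own bus directories.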

FYI the hardware on one of the test machines is as follows:
     00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
     00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
     00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
     00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
     00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
     00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
     00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
     00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
     00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
     00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
     00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
     00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
     00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
     00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
     00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
     00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
     00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
     00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
     00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
     02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
     03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
     05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

I'm happy to add any code instrumentation or make any other changes 
needed to locate and resolve the problem, and I can readily reproduce it 
-- I'm just at a complete loss as to where to start looking, and am 
still hoping for some suggestions in that regard.
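
For anyone trying to reproduce #11 and #13, here is a minimal sketch of 
two helpers.  One checks whether a kernel config has CONFIG_PM enabled.  
The other keeps a single device out of runtime suspend by writing "on" 
to its power/control attribute.  That is one possible way to force a 
device active and is not necessarily the method used in #11.  The 
function names are hypothetical, and the config file location in the 
comment is distro-dependent.

```shell
#!/bin/sh
# Succeed if the given plain-text kernel config has CONFIG_PM=y.
# $1: config file, e.g. /boot/config-$(uname -r) (location varies).
kernel_has_pm() {
    grep -q '^CONFIG_PM=y$' "$1"
}

# Keep a device runtime-active by writing "on" to power/control.
# Writing "auto" instead hands control back to runtime PM.
# $1: sysfs device directory, e.g. /sys/bus/pci/devices/0000:00:16.0
disable_runtime_pm() {
    ctl="$1/power/control"
    [ -w "$ctl" ] || return 1
    echo on > "$ctl"
}
```

For example, "disable_runtime_pm /sys/bus/pci/devices/0000:00:16.0" 
(run as root) would target the HECI device from the list above.  This 
is far narrower than rebuilding the kernel without CONFIG_PM, so a 
difference in behaviour between the two would help narrow down which 
device matters.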

If there's anywhere (or anyone) else better for me to talk to about this 
issue, please let me know that too.

