[Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver
Paul Menzel
pmenzel at molgen.mpg.de
Tue Sep 3 08:35:30 UTC 2019
Dear Gavin,
Thank you for following up on this.
On 03.09.19 09:56, Gavin Lambert wrote:
> On 2019-08-20 14:15, I wrote:
>> Does anyone have any ideas about this? Either towards further
>> investigation or to a possible resolution?
>>
>> This is at the point of hardware internals now, so I have no idea how
>> to proceed in either area.
>
> To recap (plus some new info):
>
> 1. I am using a kernel module which uses the code from the e1000e driver
> to communicate with the hardware without actually registering it as a
> Linux netdev. (This is partly because it can get used in a Xenomai
> context outside of Linux itself, although I'm not doing that myself.)
> This historically works fine.
>
> 2. On certain Linux versions, I encountered an issue where disconnecting
> the network cable and reconnecting it almost always results in not being
> able to send any packets. (I cannot determine if receiving packets
> works in this case, as the network design will not receive packets
> unless some are sent first.) Restarting the driver (rmmod+modprobe)
> does recover from this case (until the next link loss), but simply
> replugging the cable never does.
>
> 3. The problem was observed with both I219-V and I219-LM (on
> motherboard), but was *not* observed with 82571EB (PCIE). The problem
> was not observed with a motherboard igb-based I211. I suspect the issue
> is limited to motherboard-based e1000e adapters. (Or perhaps there's
> something different about how the IGBs are internally connected.)
>
> 4. The problem does not occur when the e1000e driver is registered
> "normally" as a Linux netdev.
>
> 5. The problem was introduced by "mei: me: allow runtime pm for platform
> with D0i3" (which has been backported to 4.4+, as far as I can tell).
> Excluding this commit reliably resolves the issue and including it
> reliably breaks it.
The commit hash in the master branch is
cc365dcf0e56271bedf3de95f88922abe248e951, and it has been there since
v4.16-rc1. Strange that it is in 4.4 and 4.9, as it was only tagged for
v4.13+.
> 6. Applying the previously suggested patch
> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
> has no effect; the E1000_STATUS_PCIM_STATE bit is not set when the issue
> occurs.
>
> 7. Given the content of the change in #5, I assumed that the problem was
> power-management related, perhaps a side effect of the e1000e driver not
> being registered as a netdev. (So perhaps something thinks that no
> devices are in use and turns something off?)
>
> 8. I've previously posted register dumps from an e1000e in both the
> "normal" and "link up but not transmitting" states. They seemed very
> similar, but as I'm not familiar with the register meanings I may have
> overlooked something significant. (Note that the dumps were captured
> inside the watchdog task, when it detects link up but before it sets
> E1000_TCTL_EN.)
>
> 9. I enabled debug logging in the mei driver; it logs a couple of
> runtime_idles and then a runtime_suspend during system startup. (I
> added a log to runtime_resume that is missing in the driver source, but
> it appears this does not get called in my scenario.) Note that the
> e1000e driver is still working ok after this.. at least at first.
>
> 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
> => "suspended"
> "cat
> /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
> => "unsupported"
> "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
> => "active"
> "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
> => "active" (this is the actual NIC)
> These don't change between the working and non-working states.
> (It's possible that some other device does, but I haven't found it yet.)
>
> 11. I did try forcing the above to unsuspend, but this did not recover
> from the e1000e issue.
>
> 12. I also tried calling e1000e_reset on link-down. This produces
> different register output on link-up, but doesn't recover from the issue.
>
> 13. I also tried recompiling the kernel with CONFIG_PM disabled (no
> power management). This *does* resolve the problem (but is a very big
> hammer).
>
> 14. Possibly also of interest is that if I do *both* #12 and #13, the
> problem remains (suggesting #12 was counter-productive).
>
> FYI the hardware on one of the test machines is as follows:
> 00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
> 00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
> 00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
> 00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
> 00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
> 00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
> 00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
> 00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
> 00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
> 00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
> 00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
> 00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
> 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
> 00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
> 00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
> 00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
> 00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
> 00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
> 00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
> 02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> 03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
> 05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>
> I'm happy to add any code instrumentation or make any other changes
> needed to locate and resolve the problem, and I can readily reproduce it
> -- I'm just at a complete loss as to where to start looking, and am
> still hoping for some suggestions in that regard.
>
> If there's anywhere (or anyone) else better for me to talk to about this
> issue, please let me know that too.
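Regarding item 8: since the two register dumps look very similar by eye, a small script can list exactly which registers differ and which bits changed. The following is only a minimal sketch. It assumes a hypothetical dump format of one register per line, written as `NAME 0xVALUE` (optionally with `:` or `=` between name and value). The format of the dumps posted earlier may differ, so the regular expression would need adjusting. The names `parse_dump` and `diff_regs` are made up for illustration.

```python
import re

# Assumed (hypothetical) line format: "TCTL 0x3103f0fa", "TCTL: 0x...", "TCTL=0x..."
LINE_RE = re.compile(r"^\s*(\w+)\s*[:=]?\s*(0x[0-9a-fA-F]+)\s*$")


def parse_dump(text):
    """Parse a register dump into {register_name: integer_value}.

    Lines that do not match the assumed format are ignored.
    """
    regs = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            regs[m.group(1)] = int(m.group(2), 16)
    return regs


def diff_regs(good, bad):
    """Return {name: (good_value, bad_value, changed_bits)} for differing registers.

    changed_bits is the XOR of both values, or None if the register is
    missing from one of the dumps.
    """
    result = {}
    for name in sorted(set(good) | set(bad)):
        a, b = good.get(name), bad.get(name)
        if a != b:
            changed = a ^ b if a is not None and b is not None else None
            result[name] = (a, b, changed)
    return result
```

Feeding the "normal" dump as `good` and the "link up but not transmitting" dump as `bad` gives a short list of registers to look up in the datasheet, instead of comparing the full dumps by hand.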
It is not clear to me whether this is still reproducible on Linux
5.3-rc7 (or Linus’ master branch).
If it is, this is definitely a regression, and the commits need to be
reverted due to Linux’ no-regression policy.
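Regarding item 10 and the possibility that some other device changes its runtime PM state: instead of checking devices one by one, the state of all devices could be recorded before the cable is unplugged and again after it is replugged, and the two snapshots compared. Below is a minimal sketch of that idea. It assumes the usual sysfs layout, where each device directory below `/sys/devices` has a `power/runtime_status` attribute. The function names `snapshot` and `diff` are made up for illustration.

```python
import os


def snapshot(root="/sys/devices"):
    """Return {device_directory: runtime_status} for all devices below root.

    os.walk() does not follow symbolic links by default, which avoids
    the symlink loops present in sysfs.
    """
    states = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) == "power" and "runtime_status" in filenames:
            try:
                with open(os.path.join(dirpath, "runtime_status")) as f:
                    states[os.path.dirname(dirpath)] = f.read().strip()
            except OSError:
                # The attribute could not be read; skip this device.
                continue
    return states


def diff(before, after):
    """Return {device: (old_status, new_status)} for every device that changed.

    A device present in only one snapshot is reported with None on the
    other side.
    """
    changed = {}
    for dev in sorted(set(before) | set(after)):
        old, new = before.get(dev), after.get(dev)
        if old != new:
            changed[dev] = (old, new)
    return changed
```

Calling `snapshot()` once in the working state and once in the "link up but not transmitting" state, then passing both results to `diff()`, would show whether any device other than the four listed above changed state. An empty result would suggest that the runtime PM status visible in sysfs is not where the difference lies.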
Kind regards,
Paul