[Intel-wired-lan] [e1000e] Linux 4.9: unable to send packets after link recovery with patched driver

Paul Menzel pmenzel at molgen.mpg.de
Tue Sep 3 09:39:35 UTC 2019


Dear Tomas,


On 2019-09-03 11:28, Winkler, Tomas wrote:

>> On Tue, Sep 03, 2019 at 10:35:30AM +0200, Paul Menzel wrote:

>>> On 03.09.19 09:56, Gavin Lambert wrote:
>>>> On 2019-08-20 14:15, I wrote:
>>>>> Does anyone have any ideas about this?  Either towards further
>>>>> investigation or to a possible resolution?
>>>>>
>>>>> This is at the point of hardware internals now, so I have no idea
>>>>> how to proceed in either area.
>>>>
>>>> To recap (plus some new info):
>>>>
>>>> 1. I am using a kernel module which uses the code from the e1000e
>>>> driver to communicate with the hardware without actually registering
>>>> it as a Linux netdev.  (This is partly because it can get used in a
>>>> Xenomai context outside of Linux itself, although I'm not doing that
>>>> myself.) This historically works fine.
>>>>
>>>> 2. On certain Linux versions, I encountered an issue where
>>>> disconnecting the network cable and reconnecting it almost always
>>>> results in not being able to send any packets.  (I cannot determine
>>>> if receiving packets works in this case, as the network design will
>>>> not receive packets unless some are sent first.)  Restarting the
>>>> driver (rmmod+modprobe) does recover from this case (until the next
>>>> link loss), but simply replugging the cable never does.
>>>>
>>>> 3. The problem was observed with both I219-V and I219-LM (on
>>>> motherboard), but was *not* observed with 82571EB (PCIE).  The
>>>> problem was not observed with a motherboard igb-based I211.  I
>>>> suspect the issue is limited to motherboard-based e1000e adapters.
>>>> (Or perhaps there's something different about how the IGBs are
>>>> internally connected.)
>>>>
>>>> 4. The problem does not occur when the e1000e driver is registered
>>>> "normally" as a Linux netdev.
>>>>
>>>> 5. The problem was introduced by "mei: me: allow runtime pm for
>>>> platform with D0i3" (which has been backported to 4.4+, as far as I
>>>> can tell).
>>>> Excluding this commit reliably resolves the issue and including it
>>>> reliably breaks it.
>>>
>>> The commit hash in the master branch is
>>> cc365dcf0e56271bedf3de95f88922abe248e951 and is there since v4.16-rc1.
>>>
>>> Strange, that it is in 4.4 and 4.9, as it was only tagged for v4.13+.
>>>
>>>> 6. Applying the previously suggested patch
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue.git/commit/drivers/net/ethernet/intel/e1000e?id=def4ec6dce393e2136b62a05712f35a7fa5f5e56
>>>> has no effect; the E1000_STATUS_PCIM_STATE bit is not set when the
>>>> issue occurs.
>>>>
>>>> 7. Given the content of the change in #5, I assumed that the problem
>>>> was power-management related, perhaps a side effect of the e1000e
>>>> driver not being registered as a netdev.  (So perhaps something
>>>> thinks that no devices are in use and turns something off?)
>>>>
>>>> 8. I've previously posted register dumps from an e1000e in both the
>>>> "normal" and "link up but not transmitting" states.  They seemed
>>>> very similar, but as I'm not familiar with the register meanings I
>>>> may have overlooked something significant.  (Note that the dumps
>>>> were captured inside the watchdog task, when it detects link up but
>>>> before it sets E1000_TCTL_EN.)
>>>>
>>>> 9. I enabled debug logging in the mei driver; it logs a couple of
>>>> runtime_idles and then a runtime_suspend during system startup.  (I
>>>> added a log to runtime_resume that is missing in the driver source,
>>>> but it appears this does not get called in my scenario.)  Note that
>>>> the e1000e driver is still working ok after this.. at least at first.
>>>>
>>>> 10. "cat /sys/bus/devices/pci0000:00/0000:00:16.0/power/runtime_status"
>>>> => "suspended"
>>>>      "cat /sys/bus/devices/pci0000:00/0000:00:16.0/mei/mei0/power/runtime_status"
>>>> => "unsupported"
>>>>      "cat /sys/bus/devices/pci0000:00/0000:00:1f.0/power/runtime_status"
>>>> => "active"
>>>>      "cat /sys/bus/devices/pci0000:00/0000:00:1f.6/power/runtime_status"
>>>> => "active" (this is the actual NIC)
>>>>      These don't change between the working and non-working states.
>>>> (It's possible that some other device does, but I haven't found it
>>>> yet.)
>>>>
>>>> 11. I did try forcing the above to unsuspend, but this did not
>>>> recover from the e1000e issue.
>>>>
>>>> 12. I also tried calling e1000e_reset on link-down.  This produces
>>>> different register output on link-up, but doesn't recover from the
>>>> issue.
>>>>
>>>> 13. I also tried recompiling the kernel with CONFIG_PM disabled (no
>>>> power management).  This *does* resolve the problem (but is a very
>>>> big hammer).
>>>>
>>>> 14. Possibly also of interest is that if I do *both* #12 and #13,
>>>> the problem remains (suggesting #12 was counter-productive).
>>>>
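
Regarding item 8: comparing two register dumps by eye is error-prone, so
a bitwise diff may help spot a single flipped bit.  The following is only
a sketch.  The dump format (one "NAME 0xVALUE" pair per line) and the
function names parse_dump/diff_dumps are assumptions for illustration,
not the format of the dumps posted earlier.

```python
def parse_dump(text):
    """Parse lines of the assumed form 'NAME 0xVALUE' into a dict."""
    regs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unrecognized lines
        name, value = parts
        regs[name] = int(value, 16)
    return regs


def diff_dumps(good, bad):
    """Return {name: (good_value, bad_value, changed_bits)} for differing registers.

    A register missing from one dump is reported with None on that side.
    """
    changes = {}
    for name in sorted(set(good) | set(bad)):
        a = good.get(name)
        b = bad.get(name)
        if a == b:
            continue
        changes[name] = (a, b, (a or 0) ^ (b or 0))
    return changes
```

Feeding it the "normal" and "link up but not transmitting" dumps would
list only the registers that differ, together with a mask of the bits
that changed.
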
>>>> FYI the hardware on one of the test machines is as follows:
>>>>      00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
>>>>      00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
>>>>      00:02.0 VGA compatible controller: Intel Corporation Device 5912 (rev 04)
>>>>      00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
>>>>      00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
>>>>      00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
>>>>      00:15.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #0 (rev 31)
>>>>      00:15.1 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO I2C Controller #1 (rev 31)
>>>>      00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
>>>>      00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA controller [AHCI mode] (rev 31)
>>>>      00:1b.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #19 (rev f1)
>>>>      00:1b.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Root Port #20 (rev f1)
>>>>      00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1)
>>>>      00:1d.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #11 (rev f1)
>>>>      00:1e.0 Signal processing controller: Intel Corporation Sunrise Point-H Serial IO UART #0 (rev 31)
>>>>      00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
>>>>      00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
>>>>      00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
>>>>      00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
>>>>      02:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>>>      03:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
>>>>      05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)

(Tomas, your MUA wrapped the lines messing up the formatting.)
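
Regarding item 10 and the remark that some other device might change
state: rather than checking a handful of devices by hand, the runtime PM
status of every device could be snapshotted in the working and
non-working states and then compared.  A minimal sketch, assuming the
device tree is exposed under /sys/devices; the function names
snapshot_runtime_status/changed_devices are made up for illustration.

```python
import os


def snapshot_runtime_status(root="/sys/devices"):
    """Map each device directory under root to its power/runtime_status value."""
    states = {}
    # os.walk does not follow symlinks by default, which avoids sysfs loops.
    for dirpath, dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) != "power":
            continue
        if "runtime_status" not in filenames:
            continue
        device = os.path.relpath(os.path.dirname(dirpath), root)
        try:
            with open(os.path.join(dirpath, "runtime_status")) as f:
                states[device] = f.read().strip()
        except OSError:
            continue  # attribute may be unreadable or vanish mid-walk
    return states


def changed_devices(before, after):
    """Return {device: (old, new)} for devices whose status differs."""
    return {
        dev: (before.get(dev), after.get(dev))
        for dev in sorted(set(before) | set(after))
        if before.get(dev) != after.get(dev)
    }
```

Taking one snapshot while transmitting works and a second one after
replugging the cable, then printing changed_devices(before, after),
would show whether any device outside the four listed in item 10 changes
its runtime PM state.
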

>>>> I'm happy to add any code instrumentation or make any other changes
>>>> needed to locate and resolve the problem, and I can readily
>>>> reproduce it -- I'm just at a complete loss as to where to start
>>>> looking, and am still hoping for some suggestions in that regard.
>>>>
>>>> If there's anywhere (or anyone) else better for me to talk to about
>>>> this issue, please let me know that too.
>>>
>>> It is not clear to me, if this is still reproducible on Linux 5.3-rc7
>>> (or Linus’ master branch).
>>>
>>> If it is, this is definitely a regression, and the commits need to be
>>> reverted due to Linux’ no regression policy.
>>
>> So I should revert this from 4.4.y and 4.9.y?
> 
> The issue is not in the mei driver, it is in the e1000 driver. To the
> best of my knowledge there should be a fix; Vitaly, can it be
> backported to older kernels?

Tomas, backporting the commit supposedly fixing this does *not* help.
Also, it does not matter for the no regression policy.

Let’s wait until Gavin can confirm if it is happening with Linux 5.3-rc7.


Kind regards,

Paul
